JParaCrawl

Overview

JParaCrawl is the largest publicly available English-Japanese parallel corpus, created by NTT.
It was built by crawling the web at large scale and automatically aligning parallel sentences.
For more details, see our paper.

Download

Parallel corpus
NMT Models (based on v3.0)

License

JParaCrawl and the trained models are distributed under the following license.
For commercial use, please contact us.

Terms of Use for Bilingual Data, Monolingual Data and Trained Models

Nippon Telegraph and Telephone Corporation (hereinafter referred to as "our company") will provide bilingual data, monolingual data and trained models (hereinafter referred to as "this data") subject to your acceptance of these Terms of Use. You are deemed to have agreed to these Terms of Use when you start using this data (including downloading it).

Article 1 (Use conditions)
This data may be used only for research purposes involving information analysis, where use includes, but is not limited to, replication and distribution (the same applies hereinafter in this article). The same applies to derived data created based on this data. However, this data is not available for commercial use, including the sale of translators trained using this data.

Article 2 (Disclaimer)
Our company does not warrant the quality, performance or any other aspects of this data. We shall not be liable for any direct or indirect damages caused by the use of this data. Our company shall not be liable for any damage to the system caused by the installation of this data.

Article 3 (Other)
This data may be changed in whole or in part, or provision of this data may be interrupted or stopped at our company’s discretion without prior notice.


==========

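Terms of Use for Bilingual Data, Monolingual Data and Trained Models (English translation of the Japanese text)

Nippon Telegraph and Telephone Corporation (hereinafter, "our company") provides bilingual data, monolingual data and trained models (hereinafter, "this data") on the condition that you agree to these Terms of Use. You are deemed to have agreed to these Terms of Use at the point you begin using this data (including downloading it).

Article 1 (Use conditions)
This data may be used (including, but not limited to, replication and distribution; the same applies hereinafter) only for research and development purposes involving information analysis. The same applies to derived data created from this data. However, this data may not be used for commercial purposes, including the sale of translation devices that incorporate data trained using this data.

Article 2 (Disclaimer)
Our company makes no warranty whatsoever regarding this data, including its quality or performance. Our company bears no liability for any damage, whether direct or indirect, arising from the use of this data. Our company also bears no liability for any damage, such as effects on your system, arising from the installation of this data.

Article 3 (Other)
Our company may, at its own discretion and without prior notice, change this data in whole or in part, or suspend or discontinue the provision of this data.

Citation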

If you find our work useful, please cite the following article.
JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus

@inproceedings{morishita-etal-2020-jparacrawl,
    title = "{JP}ara{C}rawl: A Large Scale Web-Based {E}nglish-{J}apanese Parallel Corpus",
    author = "Morishita, Makoto  and
      Suzuki, Jun  and
      Nagata, Masaaki",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.443",
    pages = "3603--3609",
    ISBN = "979-10-95546-34-4",
}

Acknowledgements

We used Bitextor, created by the ParaCrawl project. We gratefully acknowledge the ParaCrawl project for releasing the software and for fruitful discussions.
We would also like to thank Hisashi Itoh and Takumi Asai for their technical support.

Take down

If JParaCrawl includes your copyrighted work and you would like us to remove it, please contact us with the following information.

Contact

For any inquiries about JParaCrawl, please contact us by email.

NTT Communication Science Laboratories
Makoto Morishita
jparacrawl-ml -a- ntt.com