Exhibition Program

Science of Communication and Computation

Tuning machine translation with small tuning data

Domain adaptation with JParaCrawl, a large parallel corpus

Abstract

Recent machine translation algorithms mainly rely on parallel corpora. However, since the availability of parallel corpora remains limited, only some resource-rich language pairs can benefit from them.
We constructed a parallel corpus for English-Japanese, a language pair for which publicly available parallel data is still limited, by broadly crawling the web and automatically aligning parallel sentences.
The resulting corpus, called JParaCrawl, contains over 10 million sentence pairs and is freely available online for research purposes.
We show that a neural machine translation model trained on JParaCrawl serves as a good pre-trained model for fine-tuning to specific domains, and that it achieves good performance even when the target-domain data is limited.
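
The pre-train-then-fine-tune idea can be illustrated with a deliberately tiny sketch. This is not the authors' system: a one-dimensional linear model stands in for a translation model, the "general" and "in-domain" datasets are synthetic, and every name, slope, step count, and learning rate below is an assumption chosen only for illustration. It compares three models on held-out in-domain data: the pre-trained model as-is, the pre-trained model fine-tuned on three in-domain examples, and a model trained from scratch on those same three examples.

```python
# Toy illustration only: a 1-D linear model y = w*x + b stands in for an
# NMT model. All data and hyperparameters are made up for this sketch.

def mse(params, data):
    """Mean squared error of the linear model on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(params, data, steps, lr=0.1):
    """Plain batch gradient descent on mean squared error."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def grid(n):
    """n evenly spaced points in [-1, 1]."""
    return [-1 + 2 * i / (n - 1) for i in range(n)]

def run_demo():
    # Large "general-domain" set (analogous to a web-crawled corpus).
    general = [(x, 2.0 * x + 1.0) for x in grid(200)]
    # Small "in-domain" set with a slightly different input-output mapping.
    in_domain_train = [(x, 2.2 * x + 1.3) for x in (-0.5, 0.0, 0.5)]
    in_domain_test = [(x, 2.2 * x + 1.3) for x in grid(21)]

    pretrained = train((0.0, 0.0), general, steps=500)
    fine_tuned = train(pretrained, in_domain_train, steps=20)
    scratch = train((0.0, 0.0), in_domain_train, steps=20)

    return {
        "pretrained_only": mse(pretrained, in_domain_test),
        "fine_tuned": mse(fine_tuned, in_domain_test),
        "from_scratch": mse(scratch, in_domain_test),
    }

if __name__ == "__main__":
    for name, loss in run_demo().items():
        print(f"{name}: in-domain test MSE = {loss:.4f}")
```

With the same small budget of in-domain examples and update steps, the fine-tuned model starts much closer to the in-domain solution than the from-scratch model does, which is the intuition behind using a model trained on a large general corpus as the starting point for domain adaptation.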

References

M. Morishita, J. Suzuki, M. Nagata, “JParaCrawl: A large scale web-based Japanese-English parallel corpus,” in Proc. 12th International Conference on Language Resources and Evaluation (LREC), 2020.

Contact

Makoto Morishita / Linguistic Intelligence Research Group, Innovative Communication Laboratory
Email: cs-openhouse-ml at hco.ntt.co.jp
