NTT Communication Science Laboratories, Innovative Communication Laboratory




Statistical Machine Translation

Building Translation Systems at Low Cost

Statistical machine translation (SMT) lets us build machine translation systems automatically from statistical models trained on text data. The statistical models consist of a translation model and a language model. The translation model represents probable word translations and is trained on bilingual data, which consist of sentence pairs in two different languages. The language model encodes sentence fluency and is trained on target-language data. Using these two models, the decoder searches a large number of hypotheses for the most likely target word sequence. If training data are available, SMT enables us to construct robust translation systems at low cost and in short development cycles.
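How the two models combine can be sketched in a few lines of Python. This is a toy illustration, not the laboratory's decoder: the bigram table, the probability values, the `UNSEEN` floor, and the function names `lm_log_prob` and `decode` are all hypothetical, and a real decoder searches a far larger hypothesis space than a hand-written list.

```python
import math

# Hypothetical bigram language model P(word | previous word).
# The numbers are illustrative only, not trained values.
BIGRAM_LM = {
    ("<s>", "he"): 0.4,
    ("he", "runs"): 0.3,
    ("runs", "</s>"): 0.5,
    ("<s>", "runs"): 0.05,
    ("runs", "he"): 0.01,
    ("he", "</s>"): 0.1,
}
UNSEEN = 1e-6  # crude floor for unseen bigrams (an assumption, not real smoothing)


def lm_log_prob(words):
    """Log-probability of a target sentence under the toy bigram model."""
    tokens = ["<s>"] + words + ["</s>"]
    return sum(math.log(BIGRAM_LM.get(pair, UNSEEN))
               for pair in zip(tokens, tokens[1:]))


def decode(hypotheses):
    """Pick the hypothesis maximizing log P_translation + log P_language."""
    return max(hypotheses, key=lambda h: math.log(h[1]) + lm_log_prob(h[0]))


# Each hypothesis pairs target words with a hypothetical translation-model
# probability. Both candidates translate the words equally well here, so the
# language model decides which word order is more fluent.
hypotheses = [
    (["he", "runs"], 0.2),
    (["runs", "he"], 0.2),
]
best_words, _ = decode(hypotheses)
print(" ".join(best_words))
```

In this sketch the translation model cannot distinguish the two candidates, so the language model's preference for the fluent word order determines the result.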



Conventional rule-based translation requires language experts who know both the source and target languages to develop translation rules, so building translation systems across multiple languages is costly. In contrast, SMT algorithms do not depend on the languages involved. Therefore, we can build a translation system for any language pair with SMT technologies, provided that we can prepare the training data. This makes SMT very effective for multi-language services. Furthermore, it enables domain-specific applications because the statistical models automatically adapt to the training data of a specific domain. For example, large amounts of bilingual data are now available for newspapers, manuals, and public documents such as patents, so SMT is also very effective for translation services in these domains.
For practical use, preparing bilingual data remains a critical issue. SMT is very effective in business fields where bilingual data are generated as part of daily routines. One promising example is supporting human translators.

Recently, SMT has been outperforming conventional rule-based translation for language pairs with similar word order, e.g., between Western languages. However, SMT’s accuracy is lower for distant language pairs, e.g., Japanese and English. We are investigating this issue, with a particular focus on methods that use syntactic information.

■Hierarchical Phrase-based Translation

We developed an efficient translation method that utilizes syntax (hierarchical phrases) automatically acquired from the bilingual corpus. We also developed a method that uses a large number of features to enable finely tuned translations.
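To illustrate what a hierarchical phrase looks like, the sketch below applies a single rule with two gaps, X1 and X2, that swaps the order of its sub-phrases. Everything in it is an assumption made for illustration: the rule is written by hand rather than acquired from a corpus, the romanized Japanese words and the `PHRASES` table are hypothetical, and the function `translate` is not the laboratory's method.

```python
# Hypothetical hierarchical rule: source "X1 no X2" maps to target "X2 of X1".
# X1 and X2 are gaps filled by recursively translated sub-phrases.
RULE_TARGET = ["X2", "of", "X1"]

# Hypothetical phrase table for the sub-phrases (romanized Japanese).
PHRASES = {
    ("nihon",): ["Japan"],
    ("shuto",): ["the", "capital"],
    ("jinko",): ["the", "population"],
}


def translate(source_words):
    """Translate by phrase lookup, falling back to the hierarchical rule."""
    key = tuple(source_words)
    if key in PHRASES:
        return list(PHRASES[key])
    # Try the rule: split the input around the terminal word "no".
    for i, word in enumerate(source_words):
        if word == "no" and 0 < i < len(source_words) - 1:
            gaps = {
                "X1": translate(source_words[:i]),
                "X2": translate(source_words[i + 1:]),
            }
            output = []
            for symbol in RULE_TARGET:
                # Gap symbols expand to sub-translations; "of" is kept as is.
                output.extend(gaps.get(symbol, [symbol]))
            return output
    raise KeyError(f"no rule covers {source_words}")


print(" ".join(translate(["nihon", "no", "shuto"])))
```

Because the gaps are translated recursively, the same rule also handles nested inputs such as `["nihon", "no", "shuto", "no", "jinko"]`.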


■English-to-Japanese Translation based on Preordering

In Japanese, the syntactic head tends to be located at the end of phrases, such as noun and verb phrases. We can reorder English words into Japanese word order by shifting the English head to the end of each phrase. By combining this preordering method with conventional SMT, our English-to-Japanese translation system drastically improves accuracy.
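The head-shifting idea can be sketched as a recursive tree walk. This is a minimal illustration under stated assumptions: the parse tree and its head markings are supplied by hand, the `Phrase` class and `head_final` function are hypothetical names, and a practical system would need a parser and additional rules that this text does not describe.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Phrase:
    """A phrase node; `head` is the index of its head child (assumed given)."""
    children: List[Union["Phrase", str]]
    head: int


def head_final(node):
    """Return the words with every phrase's head child moved to the end."""
    if isinstance(node, str):
        return [node]
    non_heads = [c for i, c in enumerate(node.children) if i != node.head]
    ordered = non_heads + [node.children[node.head]]
    words = []
    for child in ordered:
        words.extend(head_final(child))
    return words


# "John bought a book": the sentence head is the verb phrase, the verb
# phrase head is "bought", and the noun phrase head is "book".
noun_phrase = Phrase(children=["a", "book"], head=1)
verb_phrase = Phrase(children=["bought", noun_phrase], head=0)
sentence = Phrase(children=["John", verb_phrase], head=1)

print(" ".join(head_final(sentence)))
```

Moving the verb behind its object yields a verb-final order that resembles Japanese, so the subsequent translation step needs much less long-distance reordering.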


■Automatic Evaluation Metrics Considering Word Order

SMT is trained on the basis of automatic evaluation metrics, so developing better evaluation metrics is important for improving SMT accuracy. However, conventional automatic evaluation metrics had problems evaluating translation results for distant language pairs such as Japanese and English. Our newly developed Rank-based Intuitive Bilingual Evaluation Score (RIBES), which is based on the rank correlation of word order, allows evaluation of language pairs whose word orders are very different.
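The notion of rank correlation of word order can be illustrated with a normalized Kendall's tau. This sketch is not the RIBES metric itself, whose full definition is not given in this text: it assumes every word occurs at most once in the reference, ignores hypothesis words that are missing from the reference, and uses the hypothetical function name `word_order_score`.

```python
from itertools import combinations


def word_order_score(reference, hypothesis):
    """Word-order agreement in [0, 1]; 1.0 means identical relative order.

    Each hypothesis word is mapped to its position in the reference, and the
    score is the fraction of word pairs that appear in the same relative
    order. That fraction equals (tau + 1) / 2 for Kendall's tau.
    """
    position = {word: i for i, word in enumerate(reference)}
    ranks = [position[w] for w in hypothesis if w in position]
    pairs = list(combinations(ranks, 2))
    if not pairs:
        return 0.0  # fewer than two matched words: order is undefined
    concordant = sum(1 for first, second in pairs if first < second)
    return concordant / len(pairs)


reference = "he bought a book yesterday".split()
same_order = word_order_score(reference, reference)
reversed_order = word_order_score(reference, reference[::-1])
fronted = word_order_score(reference, "yesterday he bought a book".split())
print(same_order, reversed_order, fronted)
```

A hypothesis that contains the right words in a scrambled order scores low here, which is the kind of error that matters for pairs such as Japanese and English.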