NTT Communication Science Laboratories Innovative Communication Laboratory




Natural Language Processing

Language understanding for text analysis

We are working on machine learning algorithms and large-scale semantic databases for accurate language analysis, in order to implement advanced text applications that require an understanding of language, such as information extraction, machine translation, summarization, and text classification.

■How is it used?

With the growth of the Internet, huge amounts of text such as web pages, blogs, and social networking (SNS) posts, written by a wide range of people on a variety of topics and of varying quality, are readily available. Demand is increasing for text analysis technology that can process this vast amount of text data for business and service applications, including sentiment analysis and the filtering of illegal or harmful information.
Globalization has also increased opportunities for ordinary citizens to access first-hand, up-to-date information written in foreign languages and to communicate directly with people overseas. This has renewed the demand for machine translation.
However, the casually and often hastily written colloquial text found on the Internet and the spontaneous spoken language used in conversation contain lexical and grammatical errors. Moreover, things that can be understood from context are often not explicitly mentioned in such text, which greatly complicates automatic understanding.
We are approaching these problems by building semantic knowledge databases from large amounts of text and devising sophisticated machine learning algorithms, in order to implement accurate language analysis technology for text applications that require language understanding.


■High-accuracy dependency parser based on semi-supervised learning

Semi-supervised learning uses not only manually labeled data but also unlabeled data for training. We have developed a semi-supervised learning method for dependency parsing that achieves high accuracy by exploiting large-scale unlabeled text data such as Web text. It achieved the best published accuracy in English dependency parsing. The technology can be used for machine translation and sentiment analysis.
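As a rough illustration of how unlabeled data can improve a parser, the following is a minimal self-training sketch: a toy attachment model is trained on labeled arcs, confident predictions on unlabeled data are added to the training set, and the model is retrained. The counter-based scorer, the tag set, and the confidence threshold are illustrative assumptions for this sketch, not the actual learning method described above.

```python
from collections import Counter

# Toy "labeled" data: (dependent_tag, head_tag) pairs observed as arcs
# in a treebank. All tags and counts here are illustrative.
labeled_arcs = [("DET", "NOUN"), ("ADJ", "NOUN"), ("NOUN", "VERB"),
                ("ADV", "VERB"), ("DET", "NOUN"), ("NOUN", "VERB")]

def train(arcs):
    """Count how often each dependent tag attaches to each head tag."""
    counts = Counter(arcs)
    totals = Counter(dep for dep, _ in arcs)
    return counts, totals

def score(model, dep, head):
    """Relative frequency of this attachment for the dependent tag."""
    counts, totals = model
    if totals[dep] == 0:
        return 0.0
    return counts[(dep, head)] / totals[dep]

def predict_head(model, dep, candidates):
    """Pick the candidate head with the highest attachment score."""
    return max(candidates, key=lambda h: score(model, dep, h))

# Unlabeled data: each dependent tag with its candidate heads.
unlabeled = [("ADJ", ["NOUN", "VERB"]), ("DET", ["NOUN", "VERB"]),
             ("ADV", ["NOUN", "VERB"])]

# Self-training loop: parse unlabeled data, keep confident arcs, retrain.
model = train(labeled_arcs)
for _ in range(2):
    new_arcs = []
    for dep, cands in unlabeled:
        head = predict_head(model, dep, cands)
        if score(model, dep, head) > 0.5:   # confidence threshold
            new_arcs.append((dep, head))
    model = train(labeled_arcs + new_arcs)

print(predict_head(model, "ADJ", ["NOUN", "VERB"]))  # → NOUN
```

The key design point is the confidence threshold: only arcs the current model is already sure about are added back, so the unlabeled data sharpens the statistics without flooding the model with its own errors.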


■Predicate argument structure analysis coping with a variety of expressions

Understanding who did what to whom (5W1H) is important for grasping the state or action expressed by a sentence. We have developed a learning method that obtains, from a large amount of annotated training data, a set of rules for determining the subject and object of a verb. Even when there is no direct dependency between a predicate and an argument, or when the argument is omitted, the method can determine the predicate-argument relation from context. The technology can be used for sentiment analysis and for filtering illegal and harmful information.
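To make the omitted-argument case concrete, here is a small sketch of rule-based predicate-argument analysis with a context fallback for omitted (zero-pronoun) subjects. The dependency-triple representation, the relation labels, and the "inherit the most recent subject" rule are illustrative assumptions for this sketch, not the rules actually learned by the method above.

```python
def analyze(sentences):
    """For each predicate, find its subject and object from dependency
    triples; if the subject is omitted, inherit the most recent subject
    seen in the preceding context."""
    results = []
    last_subject = None
    for deps in sentences:
        frame = {"pred": None, "subj": None, "obj": None}
        for head, rel, word in deps:
            if rel == "root":
                frame["pred"] = word
            elif rel == "subj":
                frame["subj"] = word
            elif rel == "obj":
                frame["obj"] = word
        if frame["subj"] is None:       # omitted argument: fall back to context
            frame["subj"] = last_subject
        else:
            last_subject = frame["subj"]
        results.append(frame)
    return results

# "Ken bought a book. Read it on the train." (second subject omitted,
# as commonly happens in Japanese)
doc = [
    [(None, "root", "bought"), ("bought", "subj", "Ken"),
     ("bought", "obj", "book")],
    [(None, "root", "read"), ("read", "obj", "it")],
]
for frame in analyze(doc):
    print(frame)
# The omitted subject of "read" is resolved to "Ken" from context.
```

In the actual task the fallback decision is what the learned rules must get right: the nearest preceding subject is only one of many candidate antecedents, and the learned rules choose among them using contextual features rather than a fixed recency heuristic.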


■Language resources including “Nihongo Goi-Taikei,” the largest Japanese thesaurus

We have developed various semantic databases for advanced language analysis. The following have been published in book form: "Nihongo Goi-Taikei" (a thesaurus), "Nihongo no Goi-Tokusei" (a psycholinguistic database), and "Kihongo Database" (a semantic database of basic words).