NTT's Communication Science Laboratories (CS Labs) was founded in Keihanna Science City on July 4, 1991, the same day of the year as American Independence Day, as the institute responsible for NTT's basic research on information systems. While NTT's other laboratories are all concentrated in the Kanto region (the eastern part of Japan), we are a small research institute that broke away to the Kansai region (the western part of Japan) in something of an independence movement, so to speak. Since our establishment, we have consistently pursued broad research freed from the confines of existing research areas. In these endeavors, we have focused on questions such as ‘What does it mean to do research on communication sciences?’, and we have carried out interdisciplinary research spanning information science, the social sciences, brain science, and human science. The synergy among these research themes has been growing, especially between human science and information science. In information science, we have made so-called artificial intelligence (AI) our central concern, represented in particular by speech and acoustic signal processing, media processing, natural language processing, and machine learning. In human science, we have been analyzing the mechanisms of sensory perception and processing in the human brain and of language acquisition by infants.
The remarkable progress of deep learning is representative of recent developments in AI technologies. As a research laboratory, we must master these leading technologies at a high level and use them successfully to address the challenges we face. However, since CS Labs is responsible for long-term basic research, we constantly take on challenges that are not merely extensions of conventional research but that open up new dimensions, while drawing on our experience and knowledge to deepen our technologies even further and to strengthen our footing. For this reason, we believe it is becoming even more important to shift our research themes boldly.
For example, in research on speech recognition, we are moving from the situation of a single person speaking directly into a microphone to that of several people sitting around a table and talking freely at some distance from the microphone. Naturally, in a situation like this, we must improve the performance of speech recognition itself. However, it is even more important to integrate it with frontend technologies: noise suppression, which removes ambient noise even in noisy environments, and speech enhancement techniques such as dereverberation, which eliminates the reverberation caused by sound bouncing off walls. By combining these technologies, the CS Labs speech recognition engine won first place for accuracy among 25 technologies competing in an international technical evaluation, the 3rd CHiME Speech Separation and Recognition Challenge, held in 2015 [1]. (CHiME stands for Computational Hearing in Multisource Environments.) Furthermore, we are also working on a technique that can selectively listen to a particular voice in multitalker situations.
In machine learning, CS Labs has been researching and developing technologies that automatically find, in large amounts of data, characteristic patterns that human experts have overlooked. We are now extending this research to the spatiotemporal dimensions; namely, we are shifting to technologies that can carry out spatiotemporal, multidimensional, and collective data analysis [2]. Furthermore, we are focusing on learning multi-agent simulation technology that not only predicts the future but also proposes, in real time, optimal strategies for acting on that prediction. For example, the technology predicts the flow of people in real time and proposes, as corrective feedback, the best strategy for guiding this flow. We are also working on combinatorial optimization problems, in which the goal is to find the globally optimal solution that satisfies given conditions among a large number of possible combinations. By using data structures such as BDDs (Binary Decision Diagrams) and ZDDs (Zero-suppressed BDDs), we can represent the candidate combinations compactly, avoid combinatorial explosion, and solve large-scale problems that were previously unsolvable.
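To make the idea of avoiding combinatorial explosion concrete, the following is a minimal sketch of a ZDD in Python. It is not the CS Labs implementation; the class `ZDD`, the helper `k_subsets`, and the choice of example (all 20-element subsets of 40 items) are assumptions made purely for illustration. The point is that the diagram shares identical substructures, so a family with over a hundred billion members can be stored and counted using only a few hundred nodes.

```python
import math


class ZDD:
    """Minimal zero-suppressed decision diagram (illustrative sketch only).

    A ZDD is a shared DAG that encodes a family of sets. Each internal node
    (var, lo, hi) means: 'lo' holds the sets without var, 'hi' the sets with var.
    """
    EMPTY, BASE = 0, 1  # terminals: the empty family, and the family {empty set}

    def __init__(self):
        self.nodes = [None, None]  # node id -> (var, lo, hi); ids 0 and 1 are terminals
        self.unique = {}           # (var, lo, hi) -> node id, so equal subgraphs are shared

    def node(self, var, lo, hi):
        if hi == self.EMPTY:       # zero-suppression rule: drop nodes whose hi-branch is empty
            return lo
        key = (var, lo, hi)
        if key not in self.unique:
            self.nodes.append(key)
            self.unique[key] = len(self.nodes) - 1
        return self.unique[key]

    def count(self, root):
        """Number of sets in the family, computed once per shared node."""
        memo = {self.EMPTY: 0, self.BASE: 1}

        def walk(u):
            if u not in memo:
                _, lo, hi = self.nodes[u]
                memo[u] = walk(lo) + walk(hi)
            return memo[u]

        return walk(root)

    def size(self, root):
        """Number of internal nodes reachable from root."""
        seen, stack = set(), [root]
        while stack:
            u = stack.pop()
            if u <= 1 or u in seen:
                continue
            seen.add(u)
            _, lo, hi = self.nodes[u]
            stack += [lo, hi]
        return len(seen)


def k_subsets(zdd, n, k):
    """Build the ZDD of all k-element subsets of {0, ..., n-1} (hypothetical helper)."""
    memo = {}

    def build(i, need):
        if need == 0:
            return zdd.BASE        # nothing more to pick: exactly one way (pick nothing)
        if n - i < need:
            return zdd.EMPTY       # not enough items left: no valid set
        if (i, need) not in memo:
            memo[(i, need)] = zdd.node(i, build(i + 1, need), build(i + 1, need - 1))
        return memo[(i, need)]

    return build(0, k)


if __name__ == "__main__":
    z = ZDD()
    root = k_subsets(z, 40, 20)
    print("sets represented:", z.count(root))
    print("matches C(40, 20):", z.count(root) == math.comb(40, 20))
    print("nodes used:", z.size(root))
```

In practice, the constraints of a real optimization problem would replace the simple "exactly k items" condition used here, and operations such as union and intersection would be applied directly on the shared diagram rather than on the enumerated sets.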
In research on dialogue processing with robots, we are shifting from interaction with one robot to interaction with multiple (two) robots. For talking with a single human being, one robot may seem sufficient. However, by using multiple robots and dividing the roles between them appropriately, a natural dialogue between the robots and a human being can continue for a longer time, even if problems occur such as speech recognition errors or a breakdown in the context of the dialogue. In addition, we are combining a chat-oriented dialogue system with question answering. While chatting to find out the user's interests, the system can teach the user about a topic of interest through to-the-point questions and answers [3].
In media processing, we are moving from processing a single medium to processing several media in combination, and we are shifting to cross-modal scene analysis that treats media interdependently. Once this technology is developed, it will be possible, for example, to reproduce indoor visual scenes solely from audio information recorded with a microphone.
Changes are also taking place in human science. At CS Labs, we are focusing on explaining the workings of implicit brain functions related to basic human senses such as vision, hearing, and kinesthesia. We will continue to work on these issues, but as a new theme, we have been making progress in research on issues in brain science that relate to sports [4]. This research uses ICT and our knowledge of brain science for ‘training the brain to win’ in sports. Recently, we have also launched research on well-being, a concept related to the richness of the human mind and the satisfaction of people's mental and emotional needs. Together with research in the science of touch (haptics), these are new initiatives that focus on both the "mind and body" of human beings.
There are risks accompanying these new initiatives. It is not certain that we will immediately achieve the scientific results we hope for. In the field of AI research, the exploration-exploitation dilemma is well known and arises in many problems [5]. In managing research and development activities, we face the same dilemma of whether to pursue exploitation or exploration. In exploitation, we dig deeper into a field we have already studied. The research will be productive, but major breakthroughs are unlikely. In exploration, we change our line of sight and investigate widely in a field in which we have less experience. This approach may achieve major breakthroughs, but it has less chance of yielding immediate results than work done on the tried-and-true paths of conventional research fields. Faced with this dilemma, we need to find a balance. I hope that, as the CS Labs responsible for fundamental research, we will take the spirit of independence to heart, set off boldly and resolutely on the great wide ocean, and seek to discover and explore new, unknown expanses of knowledge.
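For readers less familiar with the AI formulation of this dilemma, the following is a minimal sketch of an epsilon-greedy strategy on a multi-armed bandit, one of the simplest settings in which it appears. The function name `epsilon_greedy`, the reward probabilities, and the parameter values are assumptions chosen for illustration and do not come from any CS Labs system. With probability `epsilon` the agent explores a random option; otherwise it exploits the option that currently looks best.

```python
import random


def epsilon_greedy(true_means, epsilon, steps, seed=0):
    """Simulate an epsilon-greedy agent on a Bernoulli bandit (illustrative sketch).

    true_means: hidden success probability of each arm (hypothetical values).
    epsilon:    probability of exploring a random arm at each step.
    Returns (pull counts per arm, estimated mean per arm, total reward).
    """
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    estimates = [0.0] * n
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                            # exploration: try any arm
        else:
            arm = max(range(n), key=lambda a: estimates[a])   # exploitation: best so far
        reward = 1 if rng.random() < true_means[arm] else 0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running average
        total += reward
    return counts, estimates, total


if __name__ == "__main__":
    means = [0.2, 0.5, 0.8]  # the best arm is the last one, but the agent does not know this
    for eps in (0.0, 0.1):
        counts, _, total = epsilon_greedy(means, eps, steps=500)
        print(f"epsilon={eps}: pulls per arm={counts}, total reward={total}")
```

With `epsilon=0.0` the agent never leaves the first arm it tries, however poor that arm is, which is the pure-exploitation extreme. A small positive `epsilon` sacrifices some immediate reward for the chance of discovering a better arm.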