Exhibition Program

Science of Media Information


Learning unknown objects from speech and vision

- Crossmodal audio-visual concept discovery -


For AI to visually perceive the world around it and to communicate using language, it needs a dictionary that associates the visual objects in the world with the spoken words that refer to them. We explore neural network models that learn semantic correspondences between objects and words, given images and multilingual spoken audio captions describing those images. We show that training a trilingual model simultaneously on English, Hindi, and newly recorded Japanese audio caption data improves retrieval performance over monolingual models. Further, we demonstrate that the trilingual model implicitly learns meaningful word-level translations based on images. We aim for a future in which AI discovers concepts autonomously by finding audio-visual co-occurrences, simply by being given media data that already exists in the world, such as TV broadcasts. We are also considering applications to large-scale archive retrieval and automatic annotation, which involve interactions between different sensory modalities such as vision, audio, and language.
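To make the notion of "retrieval performance" concrete, the following is a minimal sketch of crossmodal retrieval in a shared embedding space. It assumes that images and spoken captions have already been mapped to fixed-length vectors by some encoders, which are not shown here. The vectors are made-up toy values, and the names cosine, rank_candidates, and recall_at_k are hypothetical helpers chosen for illustration. Cosine similarity and recall@k are common choices for this kind of evaluation, not necessarily the ones used in the work described above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(query_vecs, candidate_vecs, k):
    """Fraction of queries whose paired candidate (same index) is in the top k."""
    hits = 0
    for idx, query in enumerate(query_vecs):
        if idx in rank_candidates(query, candidate_vecs)[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy embeddings (assumed values): item i in each list describes the same scene.
image_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
english_caps = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
japanese_caps = [[0.8, 0.0, 0.2], [0.2, 0.9, 0.0], [0.6, 0.1, 0.5]]

# Speech-to-image retrieval: how often is the paired image ranked first?
print("English  R@1:", recall_at_k(english_caps, image_vecs, 1))
print("Japanese R@1:", recall_at_k(japanese_caps, image_vecs, 1))

# Speech-to-speech retrieval across languages through the shared space.
print("Closest Japanese caption to English caption 0:",
      rank_candidates(english_caps[0], japanese_caps)[0])
```

Because every modality and language shares one vector space in this sketch, the same ranking function serves both speech-to-image search and cross-lingual caption matching.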


  • [1] Y. Ohishi, A. Kimura, T. Kawanishi, K. Kashino, D. Harwath, and J. Glass, “Crossmodal Search using Visually Grounded Multilingual Speech Signal,” IEICE Technical Report on Pattern Recognition and Media Understanding (to appear).
  • [2] D. Harwath, G. Chuang, and J. Glass, “Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), April 2018.




Yasunori Ohishi, Media Recognition Group, Media Information Laboratory