Trilingual semantic embeddings of visually grounded speech with self-attention mechanisms

Abstract

We propose a trilingual semantic embedding model that relates visual objects in an image to the segment of an audio signal corresponding to spoken words in an unsupervised manner. Unlike existing models, the proposed model incorporates three different languages, namely English, Hindi, and Japanese. To build the model, we collect a new Japanese speech caption dataset, in addition to the existing English and Hindi sets. We show that introducing the third language improves performance on average, in terms of cross-modal and cross-lingual information retrieval accuracy, and that the self-attention mechanisms added to the speech encoders also work effectively.
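The sketch below is a minimal illustration, not the paper's implementation, of the general idea the abstract describes: per-language speech encoders with self-attention pooling and an image encoder mapped into one shared embedding space, trained with a margin ranking (triplet) loss over cross-modal and cross-lingual pairs. All layer sizes, the pooling scheme, the loss margin, and the in-batch negative sampling are illustrative assumptions.

```python
# Illustrative sketch only: shared embedding space for images and speech in
# three languages, with self-attention pooling over speech frames and a
# triplet-style ranking loss. Shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechEncoder(nn.Module):
    """Maps a log-mel spectrogram (batch, frames, mels) to a unit-norm embedding
    using a 1-D convolution followed by self-attention pooling over time."""

    def __init__(self, n_mels=40, embed_dim=512, n_heads=4):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, embed_dim, kernel_size=5, padding=2)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))  # learned pooling query

    def forward(self, spec):
        h = F.relu(self.conv(spec.transpose(1, 2))).transpose(1, 2)  # (B, T, D)
        q = self.query.expand(h.size(0), -1, -1)
        pooled, _ = self.attn(q, h, h)                               # (B, 1, D)
        return F.normalize(pooled.squeeze(1), dim=-1)


class ImageEncoder(nn.Module):
    """Projects precomputed image features (e.g., CNN activations) into the shared space."""

    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)


def triplet_loss(anchor, positive, margin=1.0):
    """Margin ranking loss; negatives come from rolling the batch by one."""
    negative = positive.roll(shifts=1, dims=0)
    pos_sim = (anchor * positive).sum(-1)
    neg_sim = (anchor * negative).sum(-1)
    return F.relu(margin - pos_sim + neg_sim).mean()


if __name__ == "__main__":
    image_enc = ImageEncoder()
    speech_encs = {lang: SpeechEncoder() for lang in ("en", "hi", "ja")}

    img = image_enc(torch.randn(8, 2048))            # dummy image features
    caps = {lang: enc(torch.randn(8, 120, 40))       # dummy spectrograms
            for lang, enc in speech_encs.items()}

    # Cross-modal terms (image vs. each language) plus cross-lingual terms
    # (each pair of languages), all sharing one embedding space.
    loss = sum(triplet_loss(img, c) + triplet_loss(c, img) for c in caps.values())
    loss += (triplet_loss(caps["en"], caps["hi"])
             + triplet_loss(caps["hi"], caps["ja"])
             + triplet_loss(caps["en"], caps["ja"]))
    print("total loss:", loss.item())
```

At retrieval time, embeddings from any modality or language can be ranked by cosine similarity against the others, which is how the cross-modal and cross-lingual retrieval accuracy mentioned above would be measured under these assumptions.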

Publication
In International Conference on Acoustics, Speech and Signal Processing