Trilingual semantic embeddings of visually grounded speech with self-attention mechanisms

Abstract

We propose a trilingual semantic embedding model that relates visual objects in an image to the segment of an audio signal corresponding to spoken words in an unsupervised manner. Unlike existing models, the proposed model incorporates three different languages, namely English, Hindi, and Japanese. To build the model, we collect a new Japanese speech caption dataset, in addition to the existing English and Hindi sets. We show that introducing the third language improves performance on average, in terms of cross-modal and cross-lingual information retrieval accuracy, and that the self-attention mechanisms added to the speech encoders also work effectively.
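The sketch below is a minimal illustration, not the paper's implementation, of the general idea the abstract describes: per-language speech encoders with self-attention pooling and an image encoder mapped into one shared embedding space, trained with a margin ranking (triplet) loss over cross-modal and cross-lingual pairs. All layer sizes, the pooling scheme, the loss margin, and the in-batch negative sampling are illustrative assumptions.

```python
# Illustrative sketch only: shared embedding space for images and speech in
# three languages, with self-attention pooling over speech frames and a
# triplet-style ranking loss. Shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechEncoder(nn.Module):
    """Maps a log-mel spectrogram (batch, frames, mels) to a unit-norm embedding
    using a 1-D convolution followed by self-attention pooling over time."""

    def __init__(self, n_mels=40, embed_dim=512, n_heads=4):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, embed_dim, kernel_size=5, padding=2)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))  # learned pooling query

    def forward(self, spec):
        h = F.relu(self.conv(spec.transpose(1, 2))).transpose(1, 2)  # (B, T, D)
        q = self.query.expand(h.size(0), -1, -1)
        pooled, _ = self.attn(q, h, h)                               # (B, 1, D)
        return F.normalize(pooled.squeeze(1), dim=-1)


class ImageEncoder(nn.Module):
    """Projects precomputed image features (e.g., CNN activations) into the shared space."""

    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)


def triplet_loss(anchor, positive, margin=1.0):
    """Margin ranking loss; negatives come from rolling the batch by one."""
    negative = positive.roll(shifts=1, dims=0)
    pos_sim = (anchor * positive).sum(-1)
    neg_sim = (anchor * negative).sum(-1)
    return F.relu(margin - pos_sim + neg_sim).mean()


if __name__ == "__main__":
    image_enc = ImageEncoder()
    speech_encs = {lang: SpeechEncoder() for lang in ("en", "hi", "ja")}

    img = image_enc(torch.randn(8, 2048))            # dummy image features
    caps = {lang: enc(torch.randn(8, 120, 40))       # dummy spectrograms
            for lang, enc in speech_encs.items()}

    # Cross-modal terms (image vs. each language) plus cross-lingual terms
    # (each pair of languages), all sharing one embedding space.
    loss = sum(triplet_loss(img, c) + triplet_loss(c, img) for c in caps.values())
    loss += (triplet_loss(caps["en"], caps["hi"])
             + triplet_loss(caps["hi"], caps["ja"])
             + triplet_loss(caps["en"], caps["ja"]))
    print("total loss:", loss.item())
```

At retrieval time, embeddings from any modality or language can be ranked by cosine similarity against the others, which is how the cross-modal and cross-lingual retrieval accuracy mentioned above would be measured under these assumptions.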

Publication
In International Conference on Acoustics, Speech and Signal Processing