Our dataset is publicly available

The Places audio caption (Japanese) 100K corpus

We are excited to announce that our dataset, The Places audio caption (Japanese) 100K corpus, is now available.

https://zenodo.org/record/5563425#.YZZKy2DP0UE

This speech corpus was collected to investigate the learning of spoken language (words, sub-word units, higher-level semantics, etc.) from visually-grounded speech. For a description of the corpus, see:

Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, and James Glass, “Trilingual Semantic Embeddings of Visually Grounded Speech with Self-attention Mechanisms,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 4352–4356.

The corpus includes only the audio recordings, not the associated images; you will need to download the Places image dataset separately.
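Once you have both the audio corpus and the Places images, pairing them back together is a simple lookup. The sketch below shows one way to do this in Python; the directory names and the metadata layout (a JSON file mapping each WAV file to its Places image path) are hypothetical assumptions for illustration, so please check the Zenodo record for the actual structure.

```python
# A minimal sketch of pairing an audio caption with its Places image.
# All paths and the metadata layout are hypothetical; consult the Zenodo
# record for the corpus's actual directory structure and metadata format.
import json
from pathlib import Path

import soundfile as sf   # pip install soundfile
from PIL import Image    # pip install pillow

CORPUS_DIR = Path("places_audio_ja_100k")  # hypothetical: extracted Zenodo archive
PLACES_DIR = Path("places_images")         # hypothetical: separately downloaded Places images

# Hypothetical metadata: one record per utterance, mapping a WAV file to the
# relative path of the Places image it describes.
with open(CORPUS_DIR / "metadata.json", encoding="utf-8") as f:
    records = json.load(f)

record = records[0]
waveform, sample_rate = sf.read(CORPUS_DIR / record["wav"])
image = Image.open(PLACES_DIR / record["image"])

print(f"audio: {len(waveform) / sample_rate:.1f} s at {sample_rate} Hz, "
      f"image size: {image.size}")
```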

The data is distributed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.

If you use this data in your publications, please cite the paper above. This corpus is a collaborative work with MIT CSAIL.
