Our dataset is publicly available
The Places audio caption (Japanese) 100K corpus
We are excited to announce that our dataset, The Places audio caption (Japanese) 100K corpus, is now available.
https://zenodo.org/record/5563425#.YZZKy2DP0UE
This speech corpus was collected to investigate the learning of spoken language (words, sub-word units, higher-level semantics, etc.) from visually grounded speech. For a description of the corpus, see:
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino, David Harwath, and James Glass, “Trilingual Semantic Embeddings of Visually Grounded Speech with Self-attention Mechanisms,” in Proc. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 4352–4356.
The corpus includes only the audio recordings, not the associated images. You will need to download the Places image dataset separately.
The data is distributed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.
If you use this data in your own publications, please cite the paper above. This corpus is a collaborative work with MIT CSAIL.