Exhibition Program

Science of Media Information


Neural audio captioning

- Generating text describing non-speech audio -


Recently, the detection and classification of various sounds have attracted many researchers' attention. We propose an audio captioning system that describes various non-speech audio signals in natural language. Most existing audio captioning systems have focused on “what the individual sound is,” that is, on classifying sounds to find object labels or types. In contrast, given an audio signal as input, the proposed system generates (1) an onomatopoeia, i.e., a verbal simulation of non-speech sounds, and (2) a sentence describing the sounds. This allows the description to include more information, such as what the sound is like and how its tone or volume changes over time. Our approach also makes it possible to measure the distance between a sentence and an audio sample directly. Potential applications include sound effect search systems that accept detailed sentence queries, audio captioning systems for videos, and AI systems that can hear and represent sounds as humans do.
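
As an illustration of the sentence-to-audio distance mentioned above, the following minimal sketch ranks audio clips against a sentence query. It assumes that sentences and audio samples have already been mapped to vectors in a shared space; the abstract does not say how the system computes its distance, so the cosine distance, the function names, and the toy vectors here are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def rank_clips(query_vec, clip_vecs):
    """Return clip names ordered from nearest to farthest from the query."""
    return sorted(clip_vecs, key=lambda name: cosine_distance(query_vec, clip_vecs[name]))


# Hypothetical embeddings: in a real system these would come from the
# sentence and audio encoders, which are not described in the abstract.
query = [1.0, 0.0, 0.5]  # e.g. a sentence query describing a sharp, loud bang
clips = {
    "door_slam": [0.9, 0.1, 0.4],
    "bird_song": [0.0, 1.0, 0.2],
}

ranking = rank_clips(query, clips)
print(ranking)
```

In a sound effect search setting, the clip at the head of the ranking would be returned as the best match for the sentence query.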


Kunio Kashino, Media Information Laboratory