Crossmodal sound retrieval based on specific target co-occurrence denoted with weak labels

Abstract

Recent advances in representation learning enable crossmodal retrieval by modeling audio-visual co-occurrence in a single aspect, such as physical or linguistic co-occurrence. Unfortunately, in real-world media data, co-occurrences of various aspects are complexly mixed, making it difficult to distinguish a specific target co-occurrence from the many non-target co-occurrences, which causes crossmodal retrieval to fail. To overcome this problem, we propose a triplet-loss-based representation learning method that incorporates an awareness mechanism. We adopt weakly supervised event detection, which constrains the representation learning so that our method can “be aware” of a specific target audio-visual co-occurrence and discriminate it from other, non-target co-occurrences. We evaluated our method on a sound-effect retrieval task using recorded TV broadcast data, in which a sound effect appropriate for a given video input must be retrieved. Objective and subjective evaluations indicate that the proposed method associates sound effects with video significantly better than baselines without an awareness mechanism.
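To illustrate the triplet-loss component described above, the following is a minimal sketch in PyTorch, assuming a video embedding as the anchor and co-occurring / non-co-occurring sound embeddings as the positive and negative examples. The function and variable names are hypothetical and the sketch omits the paper's awareness mechanism (the weakly supervised event detection constraint); it is not the authors' implementation.

    # Illustrative sketch of a crossmodal triplet loss (not the authors' code).
    import torch
    import torch.nn.functional as F

    def crossmodal_triplet_loss(video_emb, audio_pos_emb, audio_neg_emb, margin=0.2):
        """Pull a video embedding toward its co-occurring (positive) sound embedding
        and push it away from a non-co-occurring (negative) sound embedding."""
        d_pos = F.pairwise_distance(video_emb, audio_pos_emb)  # anchor-positive distance
        d_neg = F.pairwise_distance(video_emb, audio_neg_emb)  # anchor-negative distance
        return F.relu(d_pos - d_neg + margin).mean()

    # Example usage with random embeddings (batch of 8 in a 128-dim shared space)
    video = torch.randn(8, 128)
    audio_pos = torch.randn(8, 128)
    audio_neg = torch.randn(8, 128)
    loss = crossmodal_triplet_loss(video, audio_pos, audio_neg)

In the proposed method, the weakly supervised event detection additionally constrains these embeddings so that the learned space reflects only the specific target co-occurrence rather than all mixed co-occurrences.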

Publication
In International Conference on Spoken Language Processing