Effects of word-frequency based pre- and post- processings for audio captioning

Abstract

The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements for audio captioning: data augmentation, multi-task learning, and post-processing. The system received the highest evaluation scores, but it has not yet been clarified which of the individual elements contributed most to its performance. Here, to assess their contributions, we first conducted an element-wise ablation study on our system to estimate how effective each element is. We then conducted a detailed module-wise ablation study to further clarify which processing modules are key to improving accuracy. The results show that data augmentation and post-processing significantly improve the score in our system. In particular, mix-up data augmentation and beam search in post-processing improve SPIDEr by 0.8 and 1.6 points, respectively.
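
For readers unfamiliar with mix-up, the following is a minimal, generic sketch of the technique applied to audio feature arrays. It is not the authors' implementation: the function name `mixup_features`, the `alpha` value, and the use of NumPy are assumptions made for illustration, and the abstract does not say how the corresponding captions are combined, so the sketch only returns the mixing weight.

```python
import numpy as np


def mixup_features(x_a, x_b, alpha=0.4, rng=None):
    """Blend two feature arrays of the same shape with a Beta-distributed weight.

    alpha is a hypothetical default; the abstract does not state the value used.
    Returns the mixed features and the mixing weight lam in [0, 1].
    """
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    if x_a.shape != x_b.shape:
        raise ValueError("inputs must have the same shape")
    rng = np.random.default_rng() if rng is None else rng
    # Draw the mixing weight from a symmetric Beta(alpha, alpha) distribution.
    lam = float(rng.beta(alpha, alpha))
    # Convex combination of the two inputs.
    mixed = lam * x_a + (1.0 - lam) * x_b
    return mixed, lam
```

Passing a seeded generator, e.g. `rng=np.random.default_rng(0)`, makes the mixing weight reproducible across runs.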

Publication
In International Workshop on Detection and Classification of Acoustic Scenes and Events