14
“Huh? What do you mean?” Summarize a long story short: Robust speech summarization against speech recognition errors
Speech summarization aims to create a summary from a long talk. It is an essential technology for realizing AI systems that can correctly understand human speech. One way to realize speech summarization is to cascade automatic speech recognition (ASR) and text summarization. One issue with such approaches is that ASR errors are difficult to avoid, and they degrade summarization performance. To alleviate this problem, we propose a speech summarization method that is robust against ASR errors. Our proposed system considers multiple ASR hypotheses and uses the context and the relationships between words to generate an accurate summary, even if each individual hypothesis contains errors. The proposed idea is general and can also be applied to other tasks, such as speech translation. This research brings us one step closer to machines that can deeply understand humans, not only by transcribing speech word by word but also by grasping its meaning and intent.
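The idea of fusing several ASR hypotheses can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each of the N hypotheses has already been encoded into a fixed-size vector (the toy 2-dimensional vectors below are made up), and it fuses them with standard scaled dot-product attention driven by a query vector that stands in for, e.g., a summarizer decoder state. The names `dot`, `softmax`, and `fuse_hypotheses` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_hypotheses(query: Vector, hyps: List[Vector]) -> Tuple[List[float], Vector]:
    """Attend from one query over N hypothesis encodings (hypothetical helper).

    Returns the attention weights (one per hypothesis, summing to 1) and the
    fused vector, i.e. the attention-weighted average of the hypothesis vectors.
    """
    d = len(query)
    # Scaled dot-product scores: similarity of the query to each hypothesis.
    scores = [dot(query, h) / math.sqrt(d) for h in hyps]
    weights = softmax(scores)
    # Weighted sum over hypotheses, computed dimension by dimension.
    fused = [sum(w * h[i] for w, h in zip(weights, hyps)) for i in range(d)]
    return weights, fused


if __name__ == "__main__":
    # Three toy hypothesis encodings (assumed values); the first two coincide.
    weights, fused = fuse_hypotheses([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    print([round(w, 2) for w in weights])  # [0.4, 0.4, 0.2]
```

Because the fused vector is a weighted average over all hypotheses rather than the top-1 transcript alone, content that one hypothesis gets wrong can still reach the summarizer through the others. In a trained system, the query and the hypothesis encodings would be learned rather than hand-picked as they are here.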

Takatomo Kano / Signal Processing Research Group, Media Information Laboratory
Email: cs-openhouse-ml@hco.ntt.co.jp