Automatic analysis of meetings promises to relieve humans from tedious transcription work. It comprises counting the number of meeting attendees (source counting), detecting which participant is active when (diarization), separating speech in regions where multiple participants talk at the same time ((blind) source separation), and transcribing the spoken words from the separated audio streams (speech recognition), for recordings that can span multiple hours. Each of these is a challenging task by itself, and they become even more demanding considering that meeting recordings can be arbitrarily long, which renders batch processing practically infeasible and calls for block-online processing.
Our recently proposed single-channel source separation and counting system, the "Recurrent Selective Attention Network" (RSAN), also called "Selective Hearing Network" (SHR) [Kinoshita et al., ICASSP 2018], can efficiently separate and count multiple talkers in clean recordings by treating source separation as iterative source extraction. It is based on a recurrent neural network (RNN) and applies the same network multiple times to extract one source after another, while maintaining information about which parts of the spectrogram have not yet been attributed to a speech signal. It achieves separation performance comparable to state-of-the-art methods such as Permutation Invariant Training, Deep Clustering, and Deep Attractor Networks.
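The iterative extraction loop can be sketched as follows. This is a minimal illustration, not the published implementation: `extract_fn` is a hypothetical stand-in for the trained network, and the residual mask simply tracks which time-frequency bins have not yet been assigned to any source.

```python
import numpy as np

def iterative_extraction(spectrogram, extract_fn, stop_threshold=0.1):
    """Sketch of RSAN-style iterative source extraction.

    extract_fn is a hypothetical placeholder for the trained network; it
    receives the spectrogram and the current residual mask and returns a
    mask for the next source. Extraction stops once the residual mask
    indicates that little unexplained energy remains, which also yields
    the source count as the number of extracted masks.
    """
    residual = np.ones_like(spectrogram)  # all bins still unexplained
    masks = []
    while residual.mean() > stop_threshold:
        mask = extract_fn(spectrogram, residual)   # mask for the next source
        masks.append(mask)
        residual = np.clip(residual - mask, 0.0, 1.0)  # shrink the residual
    return masks  # len(masks) == estimated number of sources
```

With a toy `extract_fn` that explains half of the remaining residual per step, the loop terminates after a few iterations, mirroring how the real network stops once no speaker energy is left.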
In our newly proposed method, the input spectrogram is split into time blocks of equal length, and the same neural network is applied not only once per detected source but also once per block, making it recurrent in two directions: over time and over the number of sources. Alongside the separation mask, the network is trained to output an embedding vector for each speaker that represents the identity of the extracted speaker. These embedding vectors are passed to the next time block as a speaker adaptation input, and the model is trained to extract the source represented by the input vector. If a speaker for whom an adaptation embedding exists is silent throughout a whole time block, a mask filled with zeros is output, but the speaker's embedding vector is still passed on to the next block. This causes the system to always output the same source in the same iteration, thus eliminating the permutation problem, and reduces source counting to simple thresholding of the estimated masks. Unlike prior works, our approach can in theory handle arbitrarily long recordings with an arbitrary number of talkers. It can track speaker identities even if a talker remains silent for a significant amount of time, and it maintains a stable output order for all present speakers.
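The block-online processing described above can be sketched as a simple loop over blocks. Again this is an illustrative skeleton under assumptions, not the published model: `net` is a hypothetical stand-in for the trained network that takes a block together with the embeddings from the previous block and returns masks and updated embeddings in a fixed speaker order.

```python
import numpy as np

def block_online_separation(blocks, net, silence_threshold=1e-3):
    """Sketch of block-online separation with speaker-adaptation embeddings.

    net is a hypothetical placeholder: given a spectrogram block and the
    embeddings carried over from the previous block, it returns one mask
    per speaker (same order as the embeddings, so there is no permutation
    problem) and the updated embeddings to propagate forward.
    """
    embeddings = []   # no known speakers before the first block
    results = []
    for block in blocks:
        masks, embeddings = net(block, embeddings)
        # A near-zero mask means the speaker was silent in this block;
        # the embedding is still propagated, so the output order stays stable.
        active = [m.mean() > silence_threshold for m in masks]
        results.append((masks, active))
    return results
```

The key design point mirrored here is that silent speakers are not dropped: their embeddings keep flowing from block to block, so a speaker who falls silent and later resumes talking reappears in the same output slot.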
This work will be presented at ICASSP 2019. A preprint of the paper is available here.