ConceptBeam

Concept Driven Target Speech Extraction

Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, Kunio Kashino

Jul 13, 2022

Abstract

We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker from a mixture of overlapping speakers. Typical approaches exploit properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to directly represent a target concept. In our proposed scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by means of deep metric learning using paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of concept for our proposed scheme, we perform experiments using a set of images associated with spoken captions. That is, we generate speech mixtures from these spoken captions and use the images or speech signals as the concept specifiers. We then extract the target speech using the acoustic characteristics estimated from the identified segments. We compare ConceptBeam with two methods: one based on keywords obtained from recognition systems and another based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.

Typical scenarios

Figure 2 illustrates the three different scenarios of concept driven target speech extraction. In Scenario 1, two speakers (A and B) speak about two different concepts ($c1$ and $c2$). Given a semantic embedding $\mathbf{E}^{c1}$, ConceptBeam should extract the speech of speaker A speaking about $c1$.
In Scenario 2, a mixture of four different speakers (A, B, C, and D) is received, with A and C speaking about the same concept, i.e., $c1$, and B and D about a different concept, i.e., $c2$. In this scenario, given a semantic embedding $\mathbf{E}^{c1}$, ConceptBeam should extract the speech signals related to $c1$, i.e., the speech of speakers A and C.
Finally, in Scenario 3, speakers A and C speak about one concept, i.e., $c1$, and speakers A and B about another concept, i.e., $c3$. In other words, speaker A speaks about both concepts. In this case, ConceptBeam should extract the speech signals related to $c1$, i.e., the speech of speakers A and C, and filter out the speech of speaker A when A speaks about the other concept, i.e., $c3$.
As presented above, ConceptBeam should focus on the coincidence of concepts and ignore other properties such as speakers.

Figure 2: Different scenarios of concept driven target speech extraction.
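
The expected behaviour in the three scenarios can be written down as a small toy specification. The sketch below is only an illustration of which utterances an ideal concept-driven extractor would keep; the `Utterance` type and the `target_utterances` helper are hypothetical names introduced here, and the concept labels are assumed to be known, which is not the case for a real mixture.

```python
from typing import List, NamedTuple


class Utterance(NamedTuple):
    speaker: str   # who is talking (ignored by the selection rule)
    concept: str   # what the utterance is about


def target_utterances(mixture: List[Utterance], target_concept: str) -> List[Utterance]:
    """Keep every utterance about the target concept, whoever speaks it."""
    return [u for u in mixture if u.concept == target_concept]


# Scenario 1: two speakers, two concepts.
scenario_1 = [Utterance("A", "c1"), Utterance("B", "c2")]
# Scenario 2: four speakers, A and C share concept c1.
scenario_2 = [Utterance("A", "c1"), Utterance("B", "c2"),
              Utterance("C", "c1"), Utterance("D", "c2")]
# Scenario 3: speaker A talks about both c1 and c3.
scenario_3 = [Utterance("A", "c1"), Utterance("C", "c1"),
              Utterance("A", "c3"), Utterance("B", "c3")]

for name, mixture in [("Scenario 1", scenario_1),
                      ("Scenario 2", scenario_2),
                      ("Scenario 3", scenario_3)]:
    kept = target_utterances(mixture, "c1")
    print(name, [(u.speaker, u.concept) for u in kept])
```

In Scenario 3 the rule keeps speaker A's utterance about $c1$ but drops A's utterance about $c3$, which is exactly why a speaker-based clue alone would not be sufficient here.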

Method

ConceptBeam is composed of four modules: a concept encoder, a concept activity computation NN, an acoustic embedding NN, and a speech extraction NN. ConceptBeam is related to the activity driven extraction network (ADEnet), which exploits speaker activity information for target speech extraction [8]. ConceptBeam instead uses a concept specifier to determine the target activity. Unlike in ADEnet, this activity may cover multiple speakers speaking about the same concept. Figure 3 is a schematic diagram of the ConceptBeam framework.
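
To make the data flow between the modules more concrete, here is a minimal sketch of the first steps: deciding which segments of the mixture match the concept, then summarising the acoustic characteristics of those segments. It is not the authors' implementation. The learned networks are replaced by stand-ins chosen purely for illustration (cosine similarity with a fixed threshold for the activity, and a mean over active segments for the acoustic embedding), all function names and the toy vectors are hypothetical, and the speech extraction NN itself is not shown.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def concept_activity(concept_emb: Vector, segment_embs: List[Vector],
                     threshold: float = 0.5) -> List[float]:
    """Assumed stand-in for the concept activity computation:
    1.0 where a segment's semantic embedding is close to the concept, else 0.0."""
    return [1.0 if cosine(concept_emb, s) >= threshold else 0.0
            for s in segment_embs]


def pooled_acoustic_embedding(acoustic_feats: List[Vector],
                              activity: List[float]) -> Vector:
    """Assumed stand-in for the acoustic embedding step:
    average the acoustic features over the active segments only."""
    total = sum(activity)
    if total == 0.0:
        raise ValueError("no segment matches the concept")
    dim = len(acoustic_feats[0])
    return [sum(a * f[d] for a, f in zip(activity, acoustic_feats)) / total
            for d in range(dim)]


# Toy inputs: one concept embedding, three mixture segments.
concept = [1.0, 0.0]                                  # output of a concept encoder
segments = [[0.9, 0.1], [0.0, 1.0], [0.8, -0.2]]      # semantic embeddings per segment
acoustic = [[1.0, 2.0], [5.0, 5.0], [3.0, 4.0]]       # acoustic features per segment

activity = concept_activity(concept, segments)
clue = pooled_acoustic_embedding(acoustic, activity)
print(activity, clue)
```

Because the activity is defined per segment rather than per speaker, the pooled clue can mix acoustic characteristics of several speakers who talk about the same concept, which is the difference from ADEnet noted above.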