Semi-blind source separation with multichannel variational autoencoder

Papers

Hirokazu Kameoka, Li Li, Shota Inoue, Shoji Makino, "Semi-blind source separation with multichannel variational autoencoder," arXiv:1808.00892 [stat.ML], Aug. 2018.

Multichannel variational autoencoder (MVAE)

MVAE is a multichannel source separation method that uses a conditional variational autoencoder (CVAE) to model and estimate the power spectrograms of the sources in a mixture. The CVAE is trained on the spectrograms of training examples labeled with their source class, so that the trained decoder distribution serves as a universal generative model able to generate spectrograms conditioned on a specified class label. Treating the latent space variables and the class label as unknown parameters of this generative model yields a convergence-guaranteed semi-blind source separation algorithm that iteratively estimates the power spectrograms of the underlying sources and the separation matrices.
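The alternating structure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions: the trained CVAE decoder is replaced by a fixed positive array S standing in for its output spectrograms, only a per-source scale g is re-estimated in place of the latent variables and class label, and the separation matrices are updated with the iterative projection rule known from auxiliary-function-based independent vector analysis. All function and variable names are hypothetical. Because each step minimizes the same negative log-likelihood over one block of parameters, the recorded cost should never increase, which is the sense in which such a scheme is convergence-guaranteed.

```python
import numpy as np

def separate(X, W):
    """Apply per-frequency separation matrices: y(f, n) = W[f] @ x(f, n)."""
    return np.einsum("fji,fni->fnj", W, X)

def neg_log_likelihood(X, W, V):
    """Negative log-likelihood (up to a constant) of a local Gaussian model.
    X: (F, N, I) mixture STFT, W: (F, I, I), V: (F, N, I) source variances."""
    Y = separate(X, W)
    n_frames = X.shape[1]
    _, logabsdet = np.linalg.slogdet(W)
    return np.sum(np.log(V) + np.abs(Y) ** 2 / V) - 2 * n_frames * np.sum(logabsdet)

def update_scale(X, W, S):
    """Closed-form scale g_j minimizing the cost when V = g_j * S (S held fixed)."""
    Y = separate(X, W)
    return np.mean(np.abs(Y) ** 2 / S, axis=(0, 1))

def ip_update(X, W, V):
    """One sweep of iterative projection over all sources (assumed update rule)."""
    F, N, I = X.shape
    W = W.copy()
    for j in range(I):
        # Weighted covariance: Sigma_j(f) = (1/N) sum_n x x^H / v_j(f, n)
        Xw = X / V[:, :, j:j + 1]
        Sigma = np.einsum("fni,fnk->fik", Xw, X.conj()) / N
        # Solve (W Sigma_j) w_j = e_j for every frequency bin
        e_j = np.zeros((F, I, 1), dtype=complex)
        e_j[:, j, 0] = 1.0
        w = np.linalg.solve(W @ Sigma, e_j)[:, :, 0]
        # Normalize so that w_j^H Sigma_j w_j = 1
        scale = np.sqrt(np.einsum("fi,fik,fk->f", w.conj(), Sigma, w).real)
        w = w / scale[:, None]
        W[:, j, :] = w.conj()  # row j of W[f] holds w_j^H
    return W

# Toy data: random complex "mixture" and a random stand-in for the decoder output.
rng = np.random.default_rng(0)
F, N, I = 4, 50, 2
X = rng.standard_normal((F, N, I)) + 1j * rng.standard_normal((F, N, I))
S = rng.uniform(0.5, 2.0, size=(F, N, I))  # placeholder, NOT a trained CVAE
W = np.tile(np.eye(I, dtype=complex), (F, 1, 1))
g = np.ones(I)

costs = [neg_log_likelihood(X, W, g * S)]
for _ in range(5):
    g = update_scale(X, W, S)        # source-model step (scale only in this sketch)
    costs.append(neg_log_likelihood(X, W, g * S))
    W = ip_update(X, W, g * S)       # separation-matrix step
    costs.append(neg_log_likelihood(X, W, g * S))
```

In the method itself, the source-model step would instead update the decoder inputs (latent variables and class label) for each source; the fixed array S is used here only to keep the sketch self-contained.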

Figure 1. Illustration of the present CVAE (source model).

Audio examples

Here, we present audio examples of MVAE applied to a multichannel source separation task. We selected speech from two female speakers, 'SF1' and 'SF2', and two male speakers, 'SM1' and 'SM2', in the Voice Conversion Challenge (VCC) 2018 dataset [3] for training and evaluation.