Papers
  • Hirokazu Kameoka, Li Li, Shota Inoue, and Shoji Makino, "Supervised determined source separation with multichannel variational autoencoder," Neural Computation, vol. 31, no. 9, pp. 1891-1914, Sep. 2019. (PDF)
  • Li Li, Hirokazu Kameoka, Shota Inoue, and Shoji Makino, "FastMVAE: A fast optimization algorithm for the multichannel variational autoencoder method," IEEE Access, vol. 8, pp. 228740-228753, Dec. 2020. (PDF)
  • Li Li, Hirokazu Kameoka, and Shoji Makino, "FastMVAE2: On improving and accelerating the fast variational autoencoder-based source separation algorithm for determined mixtures," arXiv:2109.13496, Sep. 2021. (PDF)
MVAE

The MVAE method [1] is a variational autoencoder (VAE)-based source separation algorithm for determined speech mixtures, that is, mixtures in which the number of microphones equals the number of sources. The basic idea is to model and estimate the power spectrogram of each speech signal in a mixture using a conditional VAE (CVAE) conditioned on a speaker code. To the best of our knowledge, this was the first method to incorporate the VAE concept into the multichannel source separation framework, although several similar attempts were made independently by different research groups around the same time.
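As a rough sketch of the model described above, the following equations write out a determined mixing model with a decoder-driven source variance. The notation is an assumption chosen for illustration and may differ from that of [1]: x(f,n) is the vector of microphone signals at frequency f and frame n, W(f) is the separation matrix, z_j and c_j are the latent variable and speaker code of source j, g_j is a scale factor, and sigma_theta^2 is the CVAE decoder output.

```latex
\begin{align}
\mathbf{x}(f,n) &= \mathbf{A}(f)\,\mathbf{s}(f,n),
  \qquad \mathbf{y}(f,n) = \mathbf{W}^{\mathsf H}(f)\,\mathbf{x}(f,n), \\
s_j(f,n) &\sim \mathcal{N}_{\mathbb C}\bigl(0,\; v_j(f,n)\bigr),
  \qquad v_j(f,n) = g_j\,\sigma_\theta^2(f,n;\,\mathbf{z}_j, c_j), \\
\log p(\mathcal{X} \mid \mathcal{W}, \Psi, \mathcal{G})
  &\overset{c}{=} 2N \sum_f \log \bigl|\det \mathbf{W}^{\mathsf H}(f)\bigr|
  - \sum_{f,n,j} \Bigl( \log v_j(f,n) + \frac{|y_j(f,n)|^2}{v_j(f,n)} \Bigr).
\end{align}
```

Here N is the number of frames, the symbol over the equals sign denotes equality up to a constant, and the sets W, Psi, and G collect the separation matrices, the latent variables and speaker codes, and the scale factors, respectively. Separation then amounts to maximizing this log-likelihood with respect to all three.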

FastMVAE and FastMVAE2

The FastMVAE [2] and FastMVAE2 [3] methods are faster versions of the MVAE method. One drawback of the MVAE method is the high computational cost of the backpropagation step in the separation-matrix estimation algorithm. To overcome this drawback, the FastMVAE method uses an auxiliary classifier VAE (ACVAE) to model the generative distribution of source spectrograms. With ACVAE, the backpropagation step in the MVAE algorithm can be replaced by forward propagation through the pretrained networks, which significantly reduces the computational cost. The FastMVAE2 method further improves on the FastMVAE method by using a model called ChimeraACVAE to increase separation accuracy while maintaining computational efficiency.
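The separation-matrix estimation mentioned above alternates between updating the source model and updating the demixing matrices. As an illustration of the second part only, the sketch below runs one sweep of the iterative projection (IP) update that is commonly used in determined separation methods such as ILRMA [4]. It handles a single frequency bin with two sources and assumes the source variances are already given (in the MVAE family they would come from the trained network). The function names `weighted_cov` and `ip_sweep_2x2` and the random toy data are hypothetical and are not taken from the authors' code.

```python
import math
import random


def weighted_cov(X, v):
    """Return C = (1/N) * sum_n x_n x_n^H / v[n] as a 2x2 nested list."""
    N = len(X)
    C = [[0j, 0j], [0j, 0j]]
    for x, v_n in zip(X, v):
        for a in range(2):
            for b in range(2):
                C[a][b] += x[a] * x[b].conjugate() / (v_n * N)
    return C


def ip_sweep_2x2(X, V_src, D):
    """One sweep of IP updates for a 2x2 demixing matrix D, where y = D x.

    X     : list of N pairs of complex values (one frequency bin, 2 microphones).
    V_src : V_src[j][n] > 0 is the assumed variance of source j at frame n.
    """
    D = [row[:] for row in D]  # work on a copy
    for j in range(2):
        C = weighted_cov(X, V_src[j])
        # M = D C_j, then solve M w = e_j with the closed-form 2x2 inverse.
        M = [[sum(D[a][k] * C[k][b] for k in range(2)) for b in range(2)]
             for a in range(2)]
        det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
        inv = [[M[1][1] / det, -M[0][1] / det],
               [-M[1][0] / det, M[0][0] / det]]
        w = [inv[0][j], inv[1][j]]
        # Normalise so that w^H C_j w = 1.
        q = sum(w[a].conjugate() * C[a][b] * w[b]
                for a in range(2) for b in range(2)).real
        w = [w_a / math.sqrt(q) for w_a in w]
        # Row j of D is w^H.
        D[j] = [w[0].conjugate(), w[1].conjugate()]
    return D


# Toy data: random observations and random positive variances (illustration only).
random.seed(0)
N = 50
X = [(complex(random.gauss(0, 1), random.gauss(0, 1)),
      complex(random.gauss(0, 1), random.gauss(0, 1))) for _ in range(N)]
V_src = [[random.uniform(0.5, 2.0) for _ in range(N)] for _ in range(2)]
D = ip_sweep_2x2(X, V_src, [[1 + 0j, 0j], [0j, 1 + 0j]])
```

Each row update has a closed form, so this part of the algorithm needs no gradient steps. In a full method, the sweep would be repeated for every frequency bin and interleaved with updates of the source variances.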

Links to related pages

Please also refer to the following web sites.

Audio examples

Here are audio examples of the input mixture signals, ground truth source signals, and separated signals obtained with the MVAE and FastMVAE2 methods and a conventional method called ILRMA [4].


Mixtures of 2 sources recorded by 2 microphones
Columns: Mixtures (input); Sources (ground truth); Separated signals (ILRMA, MVAE, fMVAE2)

Mixtures of 3 sources recorded by 3 microphones
Columns: Mixtures (input); Sources (ground truth); Separated signals (ILRMA, MVAE, fMVAE2)

Mixtures of 6 sources recorded by 6 microphones
Columns: Mixtures (input); Sources (ground truth); Separated signals (ILRMA, MVAE, fMVAE2)

Mixtures of 9 sources recorded by 9 microphones
Columns: Mixtures (input); Sources (ground truth); Separated signals (ILRMA, MVAE, fMVAE2)

Mixtures of 18 sources recorded by 18 microphones
Columns: Mixtures (input); Sources (ground truth); Separated signals (ILRMA, MVAE, fMVAE2)


References

[1] Hirokazu Kameoka, Li Li, Shota Inoue, and Shoji Makino, "Supervised determined source separation with multichannel variational autoencoder," Neural Computation, vol. 31, no. 9, pp. 1891-1914, Sep. 2019.

[2] Li Li, Hirokazu Kameoka, Shota Inoue, and Shoji Makino, "FastMVAE: A fast optimization algorithm for the multichannel variational autoencoder method," IEEE Access, vol. 8, pp. 228740-228753, Dec. 2020.

[3] Li Li, Hirokazu Kameoka, and Shoji Makino, "FastMVAE2: On improving and accelerating the fast variational autoencoder-based source separation algorithm for determined mixtures," arXiv:2109.13496, Sep. 2021.

[4] Daichi Kitamura, Nobutaka Ono, Hiroshi Sawada, Hirokazu Kameoka, and Hiroshi Saruwatari, "Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1626-1641, Sep. 2016.