RECENT ADVANCES IN DISTANT SPEECH RECOGNITION


1. Introduction

(Barker’13) Barker, J. et al. “The PASCAL CHiME speech separation and recognition challenge,” Computer Speech and Language (2013).
(Barker’15) Barker, J. et al. “The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” Proc. ASRU (2015).
(Carletta’05) Carletta, J. et al. “The AMI meeting corpus: A pre-announcement,” Springer (2005).
(Delcroix’13) Delcroix, M. et al. “Is speech enhancement pre-processing still relevant when using deep neural networks for acoustic modeling?” Proc. Interspeech (2013).
(Harper’15) Harper, M. “The Automatic Speech recognition In Reverberant Environments (ASpIRE) challenge,” Proc. ASRU (2015).
(Hinton’12) Hinton, G., et al. “Deep neural networks for acoustic modeling in speech recognition,” IEEE Sig. Proc. Mag. 29, 82–97 (2012).
(Juang’04) Juang, B. H. et al. “Automatic Speech Recognition – A Brief History of the Technology Development,” (2004).
(Kinoshita’13) Kinoshita, K. et al. “The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech,” Proc. WASPAA (2013).
(Kuttruff’09) Kuttruff, H. “Room Acoustics,” 5th ed. Taylor and Francis (2009).
(Lincoln’05) Lincoln, M. et al., “The multichannel wall street journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments,” Proc. ASRU (2005).
(Matassoni’14) Matassoni, M. et al. “The DIRHA-GRID corpus: baseline and tools for multi-room distant speech recognition using distributed microphones,” Proc. Interspeech (2014).
(Mohamed’12) Mohamed, A. et al. “Acoustic modeling using deep belief networks,” IEEE Trans. ASLP (2012).
(Pallett’03) Pallett, D. S. “A look at NIST’s benchmark ASR tests: past, present, and future,” Proc. ASRU (2003).
(Parihar’02) Parihar, N. et al. “DSR front-end large vocabulary continuous speech recognition evaluation,” (2002).
(Saon’15) Saon, G. et al. “The IBM 2015 English Conversational Telephone Speech Recognition System,” arXiv:1505.05899 (2015).
(Saon’16) Saon, G. et al. “The IBM 2016 English Conversational Telephone Speech Recognition System,” Proc. Interspeech (2016).
(Seltzer’14) Seltzer, M. “Robustness is dead! Long live Robustness!” Proc. REVERB (2014).
(Vincent’13) Vincent, E. et al. “The second ’CHiME’ speech separation and recognition challenge: Datasets, tasks and baselines,” Proc. ICASSP (2013).

2. Speech enhancement front-end

(Anguera’07) Anguera, X., et al. “Acoustic beamforming for speaker diarization of meetings,” IEEE Trans. ASLP (2007).
(Araki’07) Araki, S., et al. “Blind speech separation in a meeting situation with maximum SNR beamformers,” Proc. ICASSP (2007).
(Boll’79) Boll, S. “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. ASSP (1979).
(Brutti’08) Brutti, A., et al. “Comparison between different sound source localization techniques based on a real data collection,” Proc. HSCMA (2008).
(Delcroix’14) Delcroix, M., et al. “Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge,” Proc. REVERB (2014).
(Delcroix’15) Delcroix, M., et al. “Strategies for distant speech recognition in reverberant environments,” EURASIP Journal ASP (2015).
(Doclo’02) Doclo, S., et al. “GSVD-based optimal filtering for single and multi-microphone speech enhancement,” IEEE Trans. SP (2002).
(Erdogan’15) Erdogan, H., et al. “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” Proc. ICASSP (2015).
(Haykin’96) Haykin, S. “Adaptive Filter Theory (3rd Ed.),” Prentice-Hall, Inc., Upper Saddle River, NJ, USA (1996).
(Heymann’15) Heymann, J., et al. “BLSTM supported GEV Beamformer front-end for the 3rd CHiME challenge,” Proc. ASRU (2015).
(Heymann’16) Heymann, J., et al. “Neural network based spectral mask estimation for acoustic beamforming,” Proc. ICASSP (2016).
(Higuchi’16) Higuchi, T., et al. “Robust MVDR beamforming using time frequency masks for online/offline ASR in noise,” Proc. ICASSP (2016).
(Hori’15) Hori, T., et al. “The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition,” Proc. ASRU (2015).
(Jukic’14) Jukic, A., et al. “Speech dereverberation using weighted prediction error with Laplacian model of the desired signal,” Proc. ICASSP (2014).
(Kinoshita’07) Kinoshita, K., et al., “Multi-step linear prediction based speech dereverberation in noisy reverberant environment,” Proc. Interspeech (2007).
(Knapp’76) Knapp, C.H., et al. “The generalized correlation method for estimation of time delay,” IEEE Trans. ASSP (1976).
(Lebart’01) Lebart, K., et al. “A new method based on spectral subtraction for speech dereverberation,” Acta Acoust (2001).
(Lim’79) Lim, J.S., et al. “Enhancement and bandwidth compression of noisy speech,” Proc. IEEE (1979).
(Mandel’10) Mandel, M.I., et al. “Model-based expectation maximization source separation and localization,” IEEE Trans. ASLP (2010).
(Nakatani’10) Nakatani, T., et al. “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Trans. ASLP (2010).
(Narayanan’13) Narayanan, A., et al. “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” Proc. ICASSP (2013).
(Ozerov’10) Ozerov, A., et al. “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Trans. ASLP (2010).
(Sawada’04) Sawada, H., et al. “A robust and precise method for solving the permutation problem of frequency-domain blind source separation,” IEEE Trans. SAP (2004).
(Souden’13) Souden, M., et al. “A multichannel MMSE-based framework for speech source separation and noise reduction,” IEEE Trans. ASLP (2013).
(Tachioka’14) Tachioka, Y., et al. “Dual system combination approach for various reverberant environments with dereverberation techniques,” Proc. REVERB (2014).
(Van Trees’02) Van Trees, H.L. “Detection, Estimation, and Modulation Theory, Part IV: Optimum Array Processing,” Wiley-Interscience, New York (2002).
(Van Veen’88) Van Veen, B.D., et al. “Beamforming: A versatile approach to spatial filtering,” IEEE ASSP Mag. (1988).
(Virtanen’07) Virtanen, T. “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Trans. ASLP (2007).
(Wang’06) Wang, D. L., and Brown, G. J. (Eds.). “Computational auditory scene analysis: Principles, algorithms, and applications,” Hoboken, NJ: Wiley/IEEE Press (2006).
(Wang’16) Wang, Z.-Q., et al. “A joint training framework for robust automatic speech recognition,” IEEE/ACM Trans. ASLP (2016).
(Warsitz’07) Warsitz, E., et al. “Blind acoustic beamforming based on generalized eigenvalue decomposition,” IEEE Trans. ASLP (2007).
(Weninger’14) Weninger, F., et al. “The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement,” Proc. REVERB (2014).
(Weninger’15) Weninger, F., et al. “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” Proc. LVA/ICA (2015).
(Xiao’16) Xiao, X., et al. “Deep beamforming networks for multi-channel speech recognition,” Proc. ICASSP (2016).
(Xu’15) Xu, Y., et al. “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Trans. ASLP (2015).
(Yoshioka’12) Yoshioka, T., et al. “Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening,” IEEE Trans. ASLP (2012).
(Yoshioka’12b) Yoshioka, T., et al. “Making machines understand us in reverberant rooms: robustness against reverberation for automatic speech recognition,” IEEE Signal Process. Mag. (2012).
(Yoshioka’15) Yoshioka, T., et al. “The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices,” Proc. ASRU (2015).

3. Back-end techniques for distant ASR

(Chen’15) Chen, Z., et al. “Integration of Speech Enhancement and Recognition using Long-Short Term Memory Recurrent Neural Network,” Proc. Interspeech (2015).
(Wu’15) Wu, C., et al. “Multi-basis adaptive neural network for rapid adaptation in speech recognition,” Proc. ICASSP (2015).
(Delcroix’15a) Delcroix, M., et al. “Strategies for distant speech recognition in reverberant environments,” EURASIP Journal ASP (2015).
(Delcroix’15b) Delcroix, M., et al. “Context adaptive deep neural networks for fast acoustic model adaptation,” Proc. ICASSP (2015).
(Delcroix’16a) Delcroix, M., et al. “Context adaptive deep neural networks for fast acoustic model adaptation in noise conditions,” Proc. ICASSP (2016).
(Delcroix’16b) Delcroix, M., et al. “Context adaptive neural network for rapid adaptation of deep CNN based acoustic models,” Proc. Interspeech (2016).
(ETSI’07) Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Adv. Front-end Feature Extraction Algorithm; Compression Algorithms, ETSI ES 202 050 Ver. 1.1.5 (2007).
(Gemmello’06) Gemello, R., et al. “Adaptation of hybrid ANN/HMM models using linear hidden transformations and conservative training,” Proc. ICASSP (2006).
(Giri’15) Giri, R., et al. “Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning,” Proc. ICASSP (2015).
(Hori’15) Hori, T., et al. “The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition,” Proc. ASRU (2015).
(Hoshen’15) Hoshen, Y., et al. “Speech Acoustic Modeling from Raw Multichannel Waveforms,” Proc. ICASSP (2015).
(Kim’12) Kim, C., et al. "Power-normalized cepstral coefficients (PNCC) for robust speech recognition." Proc. ICASSP (2012).
(Kundu’16) Kundu, S., et al. “Joint acoustic factor learning for robust deep neural network based automatic speech recognition,” Proc. ICASSP (2016).
(Li’16) Li, B., et al. “Neural network adaptive beamforming for robust multichannel speech recognition,” Proc. Interspeech (2016).
(Liao’13) Liao, H. “Speaker adaptation of context dependent deep neural networks,” Proc. ICASSP (2013).
(Liu’14) Liu, Y., et al. “Using neural network front-ends on far field multiple microphones based speech recognition,” Proc. ICASSP (2014).
(Mitra’13) Mitra, V., et al. “Damped oscillator cepstral coefficients for robust speech recognition,” Proc. Interspeech (2013).
(Marino’11) Marino, D., et al. “An analysis of automatic speech recognition with multiple microphones,” Proc. Interspeech (2011).
(Neto’95) Neto, J., et al. “Speaker adaptation for hybrid HMM-ANN continuous speech recognition system,” Proc. Interspeech (1995).
(Ochiai’14) Ochiai, T., et al. “Speaker adaptive training using deep neural networks,” Proc. ICASSP (2014).
(Peddinti’15) Peddinti, V., et al. “A time delay neural network architecture for efficient modeling of long temporal contexts,” Proc. Interspeech (2015).
(Sainath’16) Sainath, T. N., et al. “Factored spatial and spectral multichannel raw waveform CLDNNS,” Proc. ICASSP (2016).
(Saon’13) Saon, G., et al. “Speaker adaptation of neural network acoustic models using i-vectors,” Proc. ASRU (2013).
(Schluter’07) Schluter, R., et al. “Gammatone features and feature combination for large vocabulary speech recognition,” Proc. ICASSP (2007).
(Seltzer’13) Seltzer, M.L., et al. “An investigation of deep neural networks for noise robust speech recognition,” Proc. ICASSP (2013).
(Swietojanski’13) Swietojanski, P., et al. “Hybrid acoustic models for distant and multichannel large vocabulary speech recognition,” Proc. ASRU (2013).
(Swietojanski’14a) Swietojanski, P., et al. “Convolutional neural networks for distant speech recognition,” IEEE Sig. Proc. Letters (2014).
(Swietojanski’14b) Swietojanski, P., et al. “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” Proc. SLT (2014).
(Tachioka’13) Tachioka, Y., et al. “Discriminative methods for noise robust speech recognition: A CHiME Challenge Benchmark,” Proc. CHiME, (2013).
(Tachioka’14) Tachioka, Y., et al. “Dual system combination approach for various reverberant environments with dereverberation techniques,” Proc. REVERB (2014).
(Tan’15) Tan, T., et al. “Cluster adaptive training for deep neural network,” Proc. ICASSP (2015).
(Waibel’89) Waibel, A., et al. “Phoneme recognition using time-delay neural networks,” IEEE Trans. ASSP (1989).
(Weng’14) Weng, C., et al. “Recurrent deep neural networks for robust speech recognition,” Proc. ICASSP (2014).
(Weninger’14) Weninger, F., et al. “The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement,” Proc. REVERB (2014).
(Weninger’15) Weninger, F., et al. “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” Proc. LVA/ICA (2015).
(Xiao’16) Xiao, X., et al. “Deep beamforming networks for multi-channel speech recognition,” Proc. ICASSP (2016).
(Yoshioka’15a) Yoshioka, T., et al. “Far-field speech recognition using CNN-DNN-HMM with convolution in time,” Proc. ICASSP (2015).
(Yoshioka’15b) Yoshioka, T., et al. “The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices,” Proc. ASRU (2015).
(Yu’13) Yu, D., et al. “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” Proc. ICASSP (2013).

4. Building robust ASR systems

(Anguera’07) Anguera, X., et al. “Acoustic beamforming for speaker diarization of meetings,” IEEE Trans. ASLP (2007).
(Barker’15) Barker, J., et al. “The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” Proc. ASRU (2015).
(Delcroix’15) Delcroix, M., et al. “Strategies for distant speech recognition in reverberant environments,” EURASIP Journal ASP (2015).
(Erdogan’16) Erdogan, H., et al. “Improved MVDR beamforming using single-channel mask prediction networks,” Proc. Interspeech (2016).
(Hori’14) Hori, T., et al. “Real-time one-pass decoding with recurrent neural network language model for speech recognition,” Proc. ICASSP (2014).
(Hori’15) Hori, T., et al. “The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition,” Proc. ASRU (2015).
(Mitra’13) Mitra, V., et al. “Damped oscillator cepstral coefficients for robust speech recognition,” Proc. Interspeech (2013).
(Mitra’14) Mitra, V., et al. “Medium duration modulation cepstral feature for robust speech recognition,” Proc. ICASSP (2014).
(Nakatani’13) Nakatani, T., et al. “Dominance based integration of spatial and spectral features for speech enhancement,” IEEE Trans. ASLP (2013).
(Tachioka’14) Tachioka, Y., et al. “Dual system combination approach for various reverberant environments with dereverberation techniques,” Proc. REVERB (2014).
(Wang’16) Wang, Z.-Q., et al. “A joint training framework for robust automatic speech recognition,” IEEE/ACM Trans. ASLP (2016).
(Weninger’14) Weninger, F., et al. “The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement,” Proc. REVERB (2014).
(Yoshioka’15) Yoshioka, T., et al. “The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices,” Proc. ASRU (2015).