NTT Communication Science Laboratories Media Information Laboratory


Speech and Audio Signal Modeling

Constructing a new sparse representation model and
a speech generating process model for audio applications

Our research focuses on modeling various types of audio signals, including speech and musical instruments. In particular, we are concerned with constructing a new sparse representation model and a statistical model of the speech generating process. The sparse representation model is intended for building an intelligent audio processing system that can autonomously and adaptively understand what is happening acoustically in complex audio scenes, by jointly performing source separation and unsupervised learning of the structure and regularity characterizing each source. The statistical model of the speech generating process is designed to support a novel speech analyzer that can extract parameters related to phonemic information and non-linguistic information (such as the speaker’s intention, level of attention, and mood).

■New sparse representation model

Although real-world sound sources produce an almost infinitely wide variety of waveforms, in some transform domain they can often be seen to consist of a limited number of elements. For example, speech can be regarded as consisting of a limited set of elements corresponding to phonemic symbols in the timbre domain, and piano music as consisting of a limited set of elements corresponding to semitone units in the pitch domain. A signal processing framework that exploits such properties of sound sources is called sparse signal processing. One approach applies Non-negative Matrix Factorization (NMF) to an observed magnitude spectrogram interpreted as a non-negative matrix. This method yields a finite set of spectrum atoms that are considered to be the dominant elements composing the observed spectrogram. This ability to capture the constituent spectrum atoms underlying the observed spectrogram has proven very powerful: the NMF approach has been applied with notable success to problems such as monaural source separation, noise reduction, music transcription, bandwidth expansion, and missing data imputation, and has attracted a lot of attention in the field of audio signal processing in recent years.

However, the model employed in this approach has one serious limitation. It is based on a sum-of-atoms representation described in the magnitude domain, under the assumption that the magnitude spectrum of the sum of constituent signals is equal to the sum of the magnitude spectra of those signals (which holds only approximately). This makes it difficult to incorporate the NMF model directly into classical signal processing systems formulated in the time domain, and thus severely limits the range of possible extensions. To overcome this limitation, we proposed a time-domain model (a complex spectrogram model) that possesses an NMF-like ability, and called it the “complex NMF” model.
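As a rough illustration of the standard magnitude-domain NMF described above (not the complex NMF model itself), the following minimal sketch factors a non-negative matrix into spectrum atoms and their activations. It assumes the widely used multiplicative update rules for the squared Euclidean error; the function name `nmf`, the toy data, and all parameter values are hypothetical choices made for this example. The last lines also check numerically that the magnitude of a sum of complex spectra never exceeds the sum of the magnitudes, which is why the additivity assumption holds only approximately.

```python
import numpy as np

def nmf(V, n_atoms, n_iter=200, eps=1e-12, seed=0):
    """Factor a non-negative matrix V (frequency x time) as W @ H.

    W holds the spectrum atoms (one per column) and H their activations
    over time. Uses multiplicative updates for the squared Euclidean
    error; this is an illustrative choice, not the authors' method.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_atoms)) + eps
    H = rng.random((n_atoms, n_frames)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H
        # stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "magnitude spectrogram" built from three hidden atoms (assumed data).
rng = np.random.default_rng(1)
true_W = rng.random((64, 3))
true_H = rng.random((3, 100))
V = true_W @ true_H

W, H = nmf(V, n_atoms=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Additivity in the magnitude domain is only approximate:
# |X1 + X2| <= |X1| + |X2|, with equality only when the phases agree.
x1 = rng.standard_normal(64) + 1j * rng.standard_normal(64)
x2 = rng.standard_normal(64) + 1j * rng.standard_normal(64)
mag_of_sum = np.abs(x1 + x2)
sum_of_mag = np.abs(x1) + np.abs(x2)
```

In practice `V` would be the magnitude of a short-time Fourier transform of a recording, and the number of atoms would be chosen to match the expected number of distinct spectral patterns.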


■New speech generating process model

One important issue in speech signal processing problems, including speech analysis, synthesis, modification, coding, and enhancement, is how well a speech signal model can describe the characteristics of real speech. We are specifically concerned with modeling the generating processes of phonemes and intonation, which can be thought of as the two major factors characterizing speech, based on physical modeling and knowledge derived from physiological studies. A unified model of speech is then designed by combining the phoneme and intonation models, so that one can develop a novel speech analyzer that can extract parameters related to phonemic information and non-linguistic information (such as the speaker’s intention, level of attention, and mood). We are also concerned with modeling the generating process of the pitch contour of a singing voice, with future practical applications in mind, such as singing voice synthesis, automatic singing skill evaluation, and singing style modification.
