Abstract of Doctoral Dissertation

``Estimation of Articulatory Movements from Speech Signal Using an HMM-Based Speech Production Model,'' Department of Information Processing, Tokyo Institute of Technology, August 2006.

Acoustic-to-articulatory inverse mapping is a difficult problem because of its non-linear and one-to-many nature. Over the years, many estimation techniques based on continuity constraints on articulatory movements have been proposed to solve this problem. However, previous studies have shown that the primary articulatory features of consonants are not estimated well by these constraints alone. To improve estimation accuracy, phoneme-specific dynamic constraints on articulatory movements, beyond simple continuity, are required.

This thesis describes a novel method for determining articulatory movements from a speech signal using an HMM-based speech production model. The model consists of hidden Markov models (HMMs) of articulatory parameters for each phoneme and an articulatory-to-acoustic mapping for each HMM state. It is constructed from simultaneously observed articulatory and acoustic data, collected with an electromagnetic articulography (EMA) system and audio recording. The articulatory parameters are the mid-sagittal positions of the lips, incisor, tongue, and velum. Using this model, I devise a maximum a posteriori (MAP) estimation of the articulatory parameters for a given speech signal that exploits their dynamic features. Results show that speaker-dependent, phoneme-specific dynamic constraints on articulatory movements significantly decrease the RMS error between the measured and estimated articulatory parameters.
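
The core of such an estimation can be pictured as the standard trajectory-generation step for HMMs with dynamic features: the static articulatory trajectory is the solution of a linear system that weights each frame's static and delta Gaussian targets by their precisions. The sketch below is a minimal, illustrative NumPy implementation, not the thesis's exact formulation: it assumes diagonal covariances, a simple central-difference delta window, one articulatory dimension, and shows only the maximum-likelihood step of the MAP estimation; all names are hypothetical.

    import numpy as np

    def generate_trajectory(means, variances):
        # means, variances: (T, 2) arrays holding the per-frame static and
        # delta Gaussian targets of one articulatory dimension, taken from
        # the aligned HMM state sequence.
        T = means.shape[0]
        W = np.zeros((2 * T, T))          # maps static trajectory c to [static; delta]
        for t in range(T):
            W[2 * t, t] = 1.0             # static coefficient
            for k, w in ((-1, -0.5), (1, 0.5)):   # central-difference delta window
                if 0 <= t + k < T:
                    W[2 * t + 1, t + k] = w
        mu = means.reshape(-1)            # interleaved static/delta means
        prec = 1.0 / variances.reshape(-1)     # diagonal precisions
        A = W.T @ (prec[:, None] * W)     # normal equations: W' P W c = W' P mu
        b = W.T @ (prec * mu)
        return np.linalg.solve(A, b)      # smoothed static trajectory (length T)

With diagonal covariances each articulatory dimension decouples, so the system is solved once per dimension; the matrix is banded, and a banded solver can replace np.linalg.solve for efficiency.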

To determine articulatory parameters from an unknown speaker's speech signal using the model, a new speaker adaptation technique is introduced. In this method, the articulatory-to-acoustic mapping of a reference model is adapted to the unknown speaker on the basis of the estimated articulatory parameters. This allows stochastic adaptation to geometrical differences in the vocal tract among speakers. Results show that the estimation error with this adaptation is significantly smaller than without it, but that differences in the dynamic constraints of articulatory parameters between the unknown and reference speakers still affect estimation accuracy.
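
One simplified way to picture this adaptation is as a re-fit of an affine articulatory-to-acoustic mapping from the estimated articulatory frames to the unknown speaker's observed spectral features. The sketch below uses a plain least-squares re-estimation purely for illustration; the thesis's adaptation is stochastic and state-dependent, and the function and variable names here are assumptions.

    import numpy as np

    def adapt_mapping(art, spec):
        # art:  (T, D_art)  articulatory parameters estimated for the speaker
        # spec: (T, D_spec) observed spectral features of the same utterances
        # Re-fit y ~ A x + b so the mapping matches the new vocal tract.
        X = np.hstack([art, np.ones((art.shape[0], 1))])   # append bias term
        coef, *_ = np.linalg.lstsq(X, spec, rcond=None)    # least-squares [A | b]
        A, b = coef[:-1].T, coef[-1]
        return A, b        # predict with: spec_hat = art @ A.T + b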

To further improve the estimation accuracy of articulatory parameters for an unknown speaker's speech signal, I propose a speaker normalization method for articulatory parameters based on speaker-independent, phoneme-specific articulatory HMMs and a speaker-adaptive matrix. These two components are statistically separated from a multi-speaker articulatory database by speaker-adaptive training (SAT). I evaluate the proposed method in terms of the RMS error between the measured and estimated articulatory parameters. Results show no statistically significant difference in error between speaker-dependent models and speaker-adaptive ones with two matrices per speaker, and that modeling inter-speaker variability in the articulatory parameter domain is better than modeling it in the speech spectrum domain.
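
Since two matrices per speaker suffice, the normalization can be illustrated as a per-regression-class affine transform that maps a speaker's articulatory frames into the canonical space of the speaker-independent HMMs. The following sketch assumes frames are already tagged with a regression class (e.g. one of two classes per speaker); the data layout and names are illustrative, not the thesis's implementation.

    import numpy as np

    def normalize_articulation(x, transforms, classes):
        # x:          (T, D) articulatory frames of one speaker
        # transforms: dict mapping regression-class id -> (A, b), the
        #             speaker-adaptive matrix A (D, D) and bias b (D,)
        # classes:    (T,) regression-class id of each frame
        out = np.empty_like(x)
        for c in np.unique(classes):
            A, b = transforms[c]
            sel = classes == c
            out[sel] = x[sel] @ A.T + b    # map frames into the canonical space
        return out

In SAT, the canonical phoneme HMMs and the per-speaker transforms are re-estimated alternately, so the HMMs absorb phonetic structure while the matrices absorb speaker-specific geometry.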

Finally, psychophysical experiments examine the relationship between vowel perception characteristics and the articulatory parameters that produce the vowels. Vowel stimuli are created from directly measured articulatory-acoustic data. Results show that thresholds for vowel formant frequency discrimination are significantly correlated with the ratio of formant change to articulatory movement, but not with predictions based on auditory frequency resolution. This finding is consistent with ``the motor theory of speech perception,'' which holds that speech perception involves inferring the articulatory gestures of the speaker.

These results support the idea that the small estimation error of the articulatory parameters obtained by my techniques ensures small perceptual differences between the input speech signal and the speech re-synthesized from the estimated articulatory parameters.
