Controlling the voice quality dimension of prosody in synthetic speech using an acoustic glottal model
Citation:
MURPHY, ANDREW, Controlling the voice quality dimension of prosody in synthetic speech using an acoustic glottal model, Trinity College Dublin. School of Linguistic Speech & Comm Sci, 2021
Abstract:
Statistical parametric speech synthesis (SPSS) offers a means of generating synthetic speech without the need for complex and extensive rules. One weakness of this approach is its reliance on simple excitation models, which may result in unnatural, robotic-sounding synthetic speech. A further limitation is that these systems tend to lack prosodic variation: for example, they do not capture the expressive nature of human spoken interaction. Such expressiveness would be highly desirable for applications utilising speech synthesis, such as educational games or synthetic voices for people with disordered speech. A more complex excitation model could offer the flexibility in the voice source needed to provide a basis for more adequate modelling of prosody. However, acoustic models of the voice source entail many potentially important parameters, and controlling these could be a challenge. The main aims of this work were to: investigate how an acoustic glottal model could be used to manipulate aspects of linguistic and paralinguistic prosody of synthetic speech using a minimal set of control parameters; incorporate the knowledge gained from this investigation into an analysis-and-synthesis system; use this system in SPSS; and conduct pilot tests to demonstrate how the system can be used to explore the voice source correlates of prosody, through user-driven manipulation tasks. To achieve the first goal, experiments were carried out to explore how the global wave shape parameter, Rd (Fant, 1995), could be used to control aspects of linguistic and paralinguistic prosody. This parameter can be used to generate glottal pulse shapes that result in voice qualities ranging from breathy to tense.
As the tense-lax dimension of voice quality is important in prosodic modulation, Rd appears to be ideal for minimising the number of control parameters needed to transform voice quality. To allow control of this parameter, the glottal source, and the vocal tract filter that shapes it, must be modelled using the principles of the source-filter model of speech production. This requires that the two components be separated effectively. Inverse filtering was used to obtain estimates of the source signal by removing the effects of the vocal tract transfer function from the speech signal. An acoustic glottal model, the Liljencrants-Fant (LF) model (Fant et al., 1985), was then used to parameterise the glottal source.
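To make the role of Rd as a single control parameter more concrete, the following is a minimal sketch of how Rd can be expanded into the LF model's shape parameters Ra, Rk and Rg. It assumes the regression equations commonly cited from Fant (1995) and the commonly cited working range of 0.3 to 2.7; neither is stated in this abstract, and the function names are hypothetical, so this should not be read as the thesis's own implementation. In the literature, lower Rd values are usually associated with tenser phonation and higher values with laxer, breathier phonation.

```python
from __future__ import annotations


def rd_to_r_params(rd: float) -> tuple[float, float, float]:
    """Predict LF shape parameters (Ra, Rk, Rg) from Rd.

    Assumption: regression equations as commonly cited from Fant (1995).
    They are not given in the abstract itself.
    """
    # Assumed validity range for the regressions (commonly cited, not from the abstract).
    if not 0.3 <= rd <= 2.7:
        raise ValueError("Rd outside the assumed 0.3-2.7 range")
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    # Rg follows from solving Rd = (1/0.11) * (0.5 + 1.2*Rk) * (Rk/(4*Rg) + Ra) for Rg.
    rg = rk / (4.0 * (0.11 * rd / (0.5 + 1.2 * rk) - ra))
    return ra, rk, rg


def r_params_to_rd(ra: float, rk: float, rg: float) -> float:
    """Inverse relation, used here only as a consistency check."""
    return (1.0 / 0.11) * (0.5 + 1.2 * rk) * (rk / (4.0 * rg) + ra)


if __name__ == "__main__":
    for rd in (0.5, 1.0, 2.0):
        ra, rk, rg = rd_to_r_params(rd)
        print(f"Rd={rd:.1f} -> Ra={ra:.4f}, Rk={rk:.4f}, Rg={rg:.4f}")
```

The round trip through `r_params_to_rd` is a cheap way to check that the three predicted parameters remain consistent with the Rd value that produced them.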
Three experiments were carried out, using manually inverse filtered data, to investigate how Rd could be used as a control parameter for linguistic and paralinguistic prosody, even in the absence of f0 modulation. Experiment 1 examined how manipulating Rd could be used to control where focal prominence (an aspect of linguistic prosody) occurs in an utterance. Experiment 2 explored how Rd could be used as a control parameter for perceived affect. Experiment 3 built upon the results of Experiment 1 to optimise the implementation of the Rd parameter contour.
The results of these experiments confirmed that Rd can serve as a control parameter to generate linguistic prominence as well as paralinguistic modification of affective colouring. They confirmed, and elaborated on, the findings of earlier research suggesting that tense-lax modulation of voice quality is important in prosodic expression. They indicated that tenser phonation on the focally accented item can be used to signal prominence, while laxer phonation of post-focal material provides source deaccentuation that further enhances the perceived prominence.
These experiments provided information concerning Rd ranges and settings that fed into the second goal of this work, i.e. the development of an analysis-and-synthesis system, called GlórCáil, for the control of parameters for prosodic variation in synthesis. The system also allows for some transformation of speaker characteristics, letting the user manipulate both voice source and vocal tract parameters before resynthesis. This provides the means to alter the prosodic pattern and speaker characteristics of an utterance. The interface allows the user to listen to any changes they make after resynthesis, to check whether they have the desired effect or whether further manipulations are required.
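The prominence pattern reported for Experiments 1 and 3 (tenser phonation on the focal item, laxer phonation afterwards) can be illustrated with a per-syllable Rd contour. The sketch below is purely illustrative: the function name, the baseline Rd and the shift sizes are hypothetical, since the abstract does not report the Rd ranges used, and it assumes the usual convention that lower Rd corresponds to tenser phonation.

```python
def build_focal_rd_contour(n_syllables, focal_index, base_rd=1.0,
                           focal_shift=0.4, postfocal_shift=0.5):
    """Return one Rd value per syllable for a single focal accent.

    All default values are hypothetical placeholders, not values from the thesis.
    Assumes lower Rd = tenser phonation and higher Rd = laxer phonation.
    """
    if not 0 <= focal_index < n_syllables:
        raise ValueError("focal_index must point at a syllable in the utterance")
    contour = []
    for i in range(n_syllables):
        if i < focal_index:
            rd = base_rd                     # pre-focal material: baseline quality
        elif i == focal_index:
            rd = base_rd - focal_shift       # focal syllable: tenser phonation
        else:
            rd = base_rd + postfocal_shift   # post-focal material: laxer (deaccented)
        contour.append(rd)
    return contour


if __name__ == "__main__":
    # Five syllables with focus on the third one.
    print(build_focal_rd_contour(5, 2))
```

A stepwise contour like this is the simplest possible shape; smoother transitions between syllables would be a natural refinement.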
The third goal was achieved by integrating the finished system into a DNN-based speech synthesis framework so that it could be used to generate unseen synthetic speech. The final goal of this work was achieved by demonstrating the system’s ability to transform linguistic and paralinguistic prosody, as well as speaker characteristics, in copy synthesis. Two manipulation tasks were carried out using purpose-built interfaces. Experiment 4 involved participants modifying an utterance so that it sounded like an appropriate response to a given question by moving sliders that controlled the Rd parameter. Experiment 5 involved participants manipulating parameters to make an utterance sound like it was being spoken by a particular speaker in a particular affective state. The responses were then used to modify the default parameters generated by the DNN-based speech synthesis system, by multiplying them by a scaling factor, to create a set of stimuli. These stimuli were used in the listening test of Experiment 6, where participants were asked to identify the speaker and the speaker's emotion, and to rate the magnitude of the emotion and the naturalness of the utterance on five-point scales. Although participants identified sad stimuli successfully, this was not the case for happy stimuli. It is likely that additional modifications of the vocal tract and f0 contour are needed to improve identification rates.
The last two experiments were intended as pilot demonstrations, given that the focus of the thesis is on the voice quality dimension and that the inclusion of vocal tract and f0 modulation is beyond the intended scope of the work. Future work using the GlórCáil system will explore how the voice quality dimension combines with f0 in both linguistic and paralinguistic prosodic expression.
How these combine in speaker voice transformation is another area which will be of interest.
The GlórCáil system experiments reported here are seen as a contribution not only towards better control of the voice quality dimension of prosody in speech synthesis, but also towards research methodologies that will enhance our understanding of this vital dimension of human communication.
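The stimulus-creation step described for Experiment 6, in which default parameters generated by the synthesis system are multiplied by a scaling factor, can be sketched as follows. The function name, the example values and the clamping to a 0.3 to 2.7 range are assumptions added for illustration (the range is the one commonly cited for Rd); the abstract states only that a scaling factor was applied.

```python
def scale_parameter_track(track, factor, lower=0.3, upper=2.7):
    """Multiply each frame of a parameter track by `factor`, then clamp.

    `lower` and `upper` are assumed bounds (a commonly cited Rd range);
    the clamping step is not described in the abstract.
    """
    if factor <= 0:
        raise ValueError("scaling factor must be positive")
    return [min(max(value * factor, lower), upper) for value in track]


if __name__ == "__main__":
    default_rd = [1.0, 2.0, 0.1]  # hypothetical per-frame defaults from a synthesiser
    # A factor above 1 raises Rd, i.e. shifts towards laxer phonation under the
    # usual convention; values leaving the assumed range are clamped.
    print(scale_parameter_track(default_rd, 1.5))
```

Clamping keeps a scaled contour inside the range where Rd-based predictions are usually considered meaningful, at the cost of flattening extreme values.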
Sponsor: Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media
Author's Homepage: https://tcdlocalportal.tcd.ie/pls/EnterApex/f?p=800:71:0::::P71_USERNAME:MURPHA61
Description: APPROVED
Author: MURPHY, ANDREW
Advisor: Gobl, Christer
Publisher: Trinity College Dublin. School of Linguistic Speech & Comm Sci. C.L.C.S.
Type of material: Thesis
Availability: Full text available
Keywords: Speech synthesis, Voice quality, Prosody, Irish, TTS, DNN, Voice source, Acoustic glottal model