Bayesian Cognition and Machine Learning for Speech Communication

Description

The "Bayesian Cognition and Machine Learning for Speech Communication" chair, part of the Grenoble MIAI institute, brings together researchers whose areas of expertise are in the fields of Speech Communication, Cognition, Machine Learning, and Probabilistic Modeling of sensorimotor systems. The team members come from Gipsa-lab (UMR CNRS 5216) and LPNC (UMR CNRS 5105), two labs at UGA and Grenoble INP. The aim of the chair is to build a global computational model of speech production and speech perception, that is, a system able to learn how to speak and to perceive speech from examples provided by the environment. To this end, an original approach is proposed, which combines the algorithmic and mathematical framework of data-driven Deep Learning with hypothesis-driven Probabilistic Modelling. This approach is designed to yield more interpretable, and thus more explainable and transferable, models, with more rapid and economical implementations and greater robustness and versatility. Our aim is to build models of speech communication that reach the state-of-the-art performance of current deep-learning-based systems while drastically limiting the amount of training data required.

Scientific approach

Figure 1 - Processes and variables included in the joint model of speech production and speech perception

Jointly modeling the speech production and perception processes amounts to designing models of the relationships between the various variables involved in these processes, namely motor/control variables, multi-sensory variables and linguistic/phonological variables (Figure 1). Over the years, the members of the group have explored two complementary approaches to do this. Firstly, a hypothesis-driven probabilistic programming approach (Tenenbaum et al., 2011; Bessière et al., 2013) has been used to explicitly design a set of multidimensional probabilistic functions that link the variables. These probabilistic models are defined from theoretical hypotheses about the physical mechanisms, neurocognitive processing and representations underlying speech production and speech perception in humans. This has led to a number of significant insights concerning speech perception in adverse conditions (Moulin-Frier et al., 2012; Laurent et al., 2017), variability and robustness in speech production (Patri et al., 2015, 2018), speech representations in the brain (Barnaud et al., 2018) and the emergence of sound systems in a society of communicating agents (Moulin-Frier et al., 2015; Schwartz et al., 2016). Secondly, data-driven deep-learning frameworks such as deep artificial neural networks make it possible to establish a direct mapping, i.e., a deterministic regression, between subsets of variables. This approach was the basis for the development of efficient systems for speech processing, voice conversion, noise reduction, feature extraction, acoustic-articulatory inversion, speech synthesis and brain-computer interfaces (Hueber et al., 2015; Hueber & Bailly, 2016; Bocquelet et al., 2016; Fabre et al., 2017; Girin et al., 2017; Schultz et al., 2017).
The computational challenge addressed in the program of the chair is to take the best from these two complementary approaches in order to elaborate a computational system able to learn how to produce and perceive speech from examples provided by the environment. The two approaches have recently begun to intersect with the emergence of Deep Generative Models, such as Variational AutoEncoders (VAEs) or Generative Adversarial Networks (GANs). In these models, the parameters characterizing probability distributions of the data are encoded/mapped within deep neural networks. Such models can be used as supervised probabilistic priors in a more general (probabilistic) model, and they provide efficient ways to extract, model and manipulate the low-dimensional latent space that represents the structure of the high-dimensional data.
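As a minimal sketch of this principle (NumPy only, with random placeholder weights; the linear "encoder" and its dimensions are purely illustrative and not part of any model described here), the code below shows two key ingredients of a VAE: the reparameterization trick, which makes sampling from the latent distribution differentiable, and the KL-divergence term that pulls the latent code toward a standard-normal prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder": maps an observation x to the mean and log-variance of a
# diagonal-Gaussian posterior q(z|x) over a low-dimensional latent code z.
# The linear weights below are random placeholders; a real VAE learns them.
def encode(x, W_mu, W_logvar):
    return W_mu @ x, W_logvar @ x

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I), so that
# gradients can flow through the sampling step during training.
def sample_latent(mu, logvar, rng):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# KL divergence between q(z|x) = N(mu, diag(exp(logvar))) and the prior
# N(0, I): the regularization term of the VAE objective (the ELBO).
def kl_to_standard_normal(mu, logvar):
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

x = rng.standard_normal(8)                  # high-dimensional observation
W_mu = 0.1 * rng.standard_normal((2, 8))    # placeholder encoder weights
W_logvar = 0.1 * rng.standard_normal((2, 8))

mu, logvar = encode(x, W_mu, W_logvar)
z = sample_latent(mu, logvar, rng)
print(z.shape, kl_to_standard_normal(mu, logvar) >= 0.0)   # → (2,) True
```

The 2-dimensional latent code z plays the role of the compact representation discussed above, and the KL term is what makes it a proper probability distribution rather than an arbitrary bottleneck.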

Research Program


Figure 2 - Components of the research program

In this context, it is crucial to clarify how complex multidimensional probabilistic relations between speech production and perception variables can be learned, and to make this learning more efficient. A systematic exploration of the numerous multidimensional spaces involved in these processes is unfeasible. This is why a cognition-based probabilistic programming approach is favored: it provides a structure for the development of the computational models and the learning processes, in relation with existing knowledge about physics (aeroacoustics, biomechanics) and neuroscience/psychology (sensory-motor control of speech, language representations in the brain, the developmental schedule of speech and language acquisition in infancy and childhood). The overall program of the project consists of combining the formalization of the speech communication model made possible by the Probabilistic Programming framework with the learning capabilities of advanced machine learning methods, in order to tackle these learning issues, elaborate new learning models and evaluate them (Figure 2).
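To make the idea of composing such probabilistic relations concrete, here is a toy sketch, in the spirit of perceptuo-motor Bayesian fusion, of how an auditory likelihood and a motor-mediated likelihood can be combined into a posterior over phonological objects. All distributions below are invented placeholders, not values from any published model:

```python
import numpy as np

# Two phonological objects O, three motor gestures M, one observed sensory
# input S. All numbers are illustrative placeholders.
p_O = np.array([0.5, 0.5])                   # prior P(O)
p_S_given_O = np.array([0.7, 0.2])           # auditory branch: P(S|O)
p_M_given_O = np.array([[0.8, 0.1, 0.1],     # motor branch: P(M|O)
                        [0.1, 0.2, 0.7]])
p_S_given_M = np.array([0.9, 0.5, 0.1])      # internal sensory model: P(S|M)

# Motor-mediated likelihood, marginalizing over gestures:
# P_motor(S|O) = sum_M P(S|M) P(M|O)
motor_likelihood = p_M_given_O @ p_S_given_M

# Fused posterior over objects: product of prior, auditory evidence and
# motor-mediated evidence, renormalized.
posterior = p_O * p_S_given_O * motor_likelihood
posterior /= posterior.sum()
print(posterior.round(3))   # the first object is strongly favored
```

The point of the sketch is structural: the motor branch contributes evidence only through a marginalization over gestures, which is exactly the kind of multidimensional probabilistic relation whose learning the program aims to make efficient.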

More specifically, over the next four years, research will be organized around three main topics:
 

  • Developing a computational agent able to learn speech representations from raw speech data in a weakly supervised scenario. This agent will contain an articulatory model of the human vocal tract, an articulatory-to-acoustic synthesis system, and a learning architecture combining deep learning algorithms and developmental principles inspired by cognitive science. The first step will consist in designing, implementing and testing a “deep” version of a Bayesian computational model of speech communication called COSMO (Moulin-Frier et al., 2012, 2015; Laurent et al., 2017; Barnaud et al., 2019), in which some of the probability distributions are implemented by generative neural models (e.g., VAEs, GANs). The second step will consist in reimplementing the speech communication agent entirely as an end-to-end neural architecture. PhD project of Marc-Antoine Georges.

  • Developing a general methodology for incremental sequence-to-sequence mapping. This requires end-to-end classification and regression neural models able to deliver chunks of output data on the fly, from only a partial observation of the input data. Possible approaches include: (1) predicting online “the future” of the output sequence from “the past and present” of the input sequence, with an acceptable tolerance to possible errors, or (2) learning automatically from the data an optimal “waiting policy” that prevents the model from outputting data when the uncertainty is too high. This will be applied to two speech processing problems: Incremental Text-to-Speech synthesis, in which speech is synthesized while the user is typing the text, and Incremental Speech Enhancement, in which unintelligible portions of the speech signal are replaced on the fly with reconstructed portions. This work is carried out in close collaboration with the MIAI chair “Artificial Intelligence & Language” held by Laurent Besacier. PhD project of Brooke Stephenson.

  • Developing a model of speech production that incorporates online processing of feedback information, by extending our current Bayesian model for speech motor planning, Bayesian GEPPETO (Patri et al., 2015). This first involves learning a “dynamical internal model” of the speech production apparatus, which predicts articulatory movements and spectro-temporal acoustic properties of speech from time-varying motor commands, and predicts with minimal delay the probability of reaching the intended sensory goals of speech production with the appropriate timing. This learning relies on simulations of a biomechanical model of the vocal tract, which generates articulatory movements from motor commands and whose computational complexity will be reduced using machine-learning-based methods of model order reduction (PhD project of Maxime Calka). Then, assuming that the brain works like an optimal controller/estimator, research will focus on how sensory feedback and internal prediction of the speech signal concurrently guide speech production, allow online error correction, and shape speech perception (forthcoming PhD project).
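The “waiting policy” idea from the second topic can be sketched with a simple threshold rule (the rule, the threshold and all values below are illustrative assumptions; a real policy would be learned from data): at each incoming input chunk the decoder produces a distribution over output symbols, and it only emits a symbol when the entropy of that distribution is low enough, otherwise it waits for more input:

```python
import math

# Shannon entropy of a discrete output distribution (natural log).
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def incremental_decode(step_distributions, max_entropy=0.5):
    """One output distribution per incoming input chunk; emit a symbol index
    when confident (low entropy), emit None (i.e., wait) when uncertain."""
    emitted = []
    for dist in step_distributions:
        if entropy(dist) <= max_entropy:                       # confident
            emitted.append(max(range(len(dist)), key=dist.__getitem__))
        else:                                                  # wait
            emitted.append(None)
    return emitted

# The peaked first and last steps emit a symbol; the flat middle step waits.
dists = [[0.97, 0.02, 0.01], [0.4, 0.35, 0.25], [0.01, 0.01, 0.98]]
print(incremental_decode(dists))   # → [0, None, 2]
```

In an incremental Text-to-Speech or Speech Enhancement setting, “waiting” corresponds to deferring synthesis of the next chunk until more of the input text or signal has arrived.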
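The control idea of the third topic can be sketched with a deliberately simplified internal forward model (the scalar linear “plant”, the gain and the step count are invented stand-ins for the biomechanical model and the learned predictor): the controller predicts the sensory consequence of a motor command, and the sensory prediction error drives online correction of the next command:

```python
# Minimal sketch of an internal forward model with online error correction,
# under an assumed scalar linear plant (the real plant would be a
# biomechanical vocal-tract model).

def forward_model(command, gain=0.9):
    # Internal prediction of the sensory consequence of a motor command.
    return gain * command

def control_loop(target, steps=50, correction_gain=0.5):
    command = 0.0
    for _ in range(steps):
        predicted = forward_model(command)
        # The sensory prediction error corrects the motor command online.
        command += correction_gain * (target - predicted)
    return forward_model(command)   # here feedback matches the internal model

print(round(control_loop(1.0), 3))   # → 1.0 (converges to the sensory target)
```

In the full model, the same loop would run with real (delayed, noisy) sensory feedback in place of the internal prediction, which is precisely where the optimal controller/estimator assumption becomes relevant.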

This work is also linked to an international project carried out by members of the chair in collaboration with Anne-Lise Giraud and Itsaso Olasagasti of the "Auditory, Speech and Language Neuroscience" group at UNIGE (Université de Genève, Switzerland), and funded by the IDEX "Université Grenoble Alpes Université de l’Innovation" (Bio-Bayes project - IDEX ISP19). In this project, Bayesian hierarchical and predictive models of speech communication will be developed that account for observations of neuronal oscillatory systems in the brain (PhD project of Mamady Nabé).

The evaluation of the impact of this research program for the development of speech technologies will be done in terms of rapidity and completeness of learning as well as in terms of the amount of data required to reach a satisfactory level of learning. Moreover, in the context of deep learning, the analysis of the representations learned by deep artificial neural networks from raw articulatory, acoustic and linguistic data could provide important insights into the sensory-motor representations potentially encoded in the human brain.

 

Bibliographic References

  • Barnaud, M.L., Bessière, P., Diard, J., & Schwartz, J.L. (2018). Reanalyzing neurocognitive data on the role of the motor system in speech perception within COSMO, a Bayesian perceptuo-motor model of speech communication. Brain and Language, 187, 19-32. https://doi.org/10.1016/j.bandl.2017.12.003
  • Barnaud, M.L., Schwartz, J.L., Bessière, P., & Diard, J. (2019). Computer simulations of coupled idiosyncrasies in speech perception and speech production with COSMO, a perceptuo-motor Bayesian model of speech communication. PLoS ONE, 14(1), e0210302.
  • Bessière, P., Mazer, E., Ahuactzin-Larios, J.-M., & Mekhnacha, K. (2013). Bayesian Programming. Boca Raton, FL: CRC Press.
  • Bocquelet, F., Hueber, T., Girin, L., Savariaux, C., & Yvert, B. (2016). Real-time control of an articulatory-based speech synthesizer for brain-computer interfaces. PLoS Computational Biology, 12(11), e1005119.
  • Fabre, D., Hueber, T., Girin, L., Alameda-Pineda, X., & Badin, P. (2017). Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract. Speech Communication, 93, 63-75.
  • Girin, L., Hueber, T., & Alameda-Pineda, X. (2017). Extending the cascaded Gaussian mixture regression framework for cross-speaker acoustic-articulatory mapping. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(3), 662-673.
  • Laurent, R., Barnaud, M. L., Schwartz, J. L., Bessière, P., & Diard, J. (2017). The complementary roles of auditory and motor information evaluated in a Bayesian perceptuomotor model of speech perception. Psychological Review, 124(5), 572-602. http://dx.doi.org/10.1037/rev0000069
  • Hueber, T., & Bailly, G. (2016). Statistical conversion of silent articulation into audible speech using full-covariance HMM. Computer Speech & Language, 36, 274-293.
  • Hueber, T., Girin, L., Alameda-Pineda, X., & Bailly, G. (2015). Speaker-adaptive acoustic-articulatory inversion using cascaded Gaussian mixture regression. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12), 2246-2259.
  • Moulin-Frier, C., Laurent, R., Bessière, P., Schwartz, J. L., & Diard, J. (2012). Adverse conditions improve distinguishability of auditory, motor, and perceptuo-motor theories of speech perception: An exploratory Bayesian modelling study. Language and Cognitive Processes, 27(7-8), 1240-1263.
  • Moulin-Frier, C., Diard, J., Schwartz, J. L., & Bessière, P. (2015). COSMO (“Communicating about Objects using Sensory–Motor Operations”): A Bayesian modeling framework for studying speech communication and the emergence of phonological systems. Journal of Phonetics, 53, 5-41.
  • Patri, J. F., Perrier, P., Schwartz, J. L., & Diard, J. (2018). What drives the perceptual change resulting from speech motor adaptation? Evaluation of hypotheses in a Bayesian modeling framework. PLoS Computational Biology, 14(1), e1005942.
  • Patri, J. F., Diard, J., & Perrier, P. (2015). Optimal speech motor control and token-to-token variability: a Bayesian modeling approach. Biological Cybernetics, 109(6), 611-626.
  • Schultz, T., Wand, M., Hueber, T., Krusienski, D. J., Herff, C., & Brumberg, J. S. (2017). Biosignal-based spoken communication: A survey. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12), 2257-2271.
  • Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022), 1279-1285.

Expected Outcomes

The research program, which combines cognitive plausibility and alignment with ground-truth data, should result in significant improvement along three basic dimensions:

  • "More explainable and transferable": Connecting data-driven machine learning approaches with cognitive and developmental assumptions should make emerging features and structures more explainable and interpretable. It should facilitate evaluating the limits of their applicability, predicting their errors, and suggesting ways to improve their behavior.
  • "More rapid and economical": Implementing plausible developmental sequences and plausible hierarchies in the model structure should favor learning transfer and ensure faster learning and quicker convergence. It should enable the models to learn from reduced sets of data, and favor economical hardware processing (exploiting mechanisms such as predictive coding, attentional filtering or multiplex coding).
  • “More robust and versatile”: The generative nature of the implemented models and the adequacy of the exploited developmental schedules should enable the models to process atypical or noisy speech thanks to internally generated outputs, and conversely to adapt their own speech productions in response to perturbations. It should lead to natural and coherent variability constrained and structured by the properties of the modelled system.

Industrial Partners

  • ProBayes, 38330 Montbonnot
  • ANSYS France, 69100 Villeurbanne

National and International Collaborations

  • UNIGE, Genève, Switzerland: Anne-Lise Giraud, auditory processing and neurophysiological modelling (grant from the IDEX “Université Grenoble Alpes Université de l’Innovation”)
  • TIMC-IMAG, UGA: Yohan Payan, Model Order Reduction of biomechanical models of speech articulators (CIFRE doctoral grant from ANSYS France and ANRT)

Scientific publications

Peer-reviewed publications

  • Hueber, T., Tatulli, E., Girin, L., & Schwartz, J-L. (2020). Evaluating the potential gain of auditory and audiovisual speech predictive coding using deep learning. Neural Computation, 32(3), 596-625.
  • Patri, J. F., Ostry, D. J., Diard, J., Schwartz, J. L., Trudeau-Fisette, P., Savariaux, C., & Perrier, P. (2020). Speakers are able to categorize vowels based on tongue somatosensation. Proceedings of the National Academy of Sciences, 117(11), 6255-6263.
  • Patri, J. F., Diard, J., & Perrier, P. (2019). Modeling sensory preference in speech motor planning: a Bayesian modeling framework. Frontiers in Psychology, 10, 2339.