Bayesian Cognition and Machine Learning for Speech Communication


The "Bayesian Cognition and Machine Learning for Speech Communication" chair, part of the Grenoble MIAI institute, brings together researchers whose areas of expertise are in the fields of Speech Communication, Cognition, Machine Learning, and Probabilistic Modeling of sensorimotor systems. The team members come from Gipsa-lab (UMR CNRS 5216) and LPNC (UMR CNRS 5105), two laboratories of UGA and Grenoble INP-UGA. The aim of the chair is to build a global computational model of speech production and speech perception, that is, a system able to learn how to speak and to perceive speech from examples provided by the environment. To this end, an original approach is proposed, which combines the algorithmic and mathematical frameworks of data-driven Deep Learning and hypothesis-driven Probabilistic Modeling. This approach was developed to design more interpretable, and thus more explainable and transferable, models, with faster and more economical implementations and greater robustness and versatility. Our aim is to build models of speech communication that reach the state-of-the-art performance of current deep-learning-based systems while drastically limiting the amount of training data required.


Figure 1 - Processes and variables included in the joint model of speech production and speech perception

Jointly modeling the speech production and perception processes amounts to designing models of the relationships between the various variables involved in these processes, namely motor/control variables, multi-sensory variables and linguistic/phonological variables (Figure 1). Over the years, the members of the group have explored two complementary approaches to this problem. Firstly, a hypothesis-driven probabilistic programming approach (Tenenbaum et al., 2011; Bessière et al., 2013) has been used to explicitly design a set of multidimensional probabilistic functions that link the variables. These probabilistic models are defined from theoretical hypotheses about the physical mechanisms, neurocognitive processing and representations underlying speech production and speech perception in humans. This has led to a number of significant insights concerning speech perception in adverse conditions (Moulin-Frier et al., 2012; Laurent et al., 2015), variability and robustness in speech production (Patri et al., 2015, 2018), speech representations in the brain (Barnaud et al., 2018) and the emergence of sound systems in a society of communicating agents (Moulin-Frier et al., 2015; Schwartz et al., 2016). Secondly, data-driven deep-learning frameworks such as artificial deep neural networks make it possible to establish a direct mapping, i.e., a deterministic regression, between subsets of variables. This approach was the basis for the development of efficient systems for speech processing, voice conversion, noise reduction, feature extraction, acoustic-articulatory inversion, speech synthesis and brain-computer interfaces (Hueber et al., 2015; Hueber & Bailly, 2016; Bocquelet et al., 2016; Fabre et al., 2017; Girin et al., 2017; Schultz et al., 2017).
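As a toy illustration of the first, hypothesis-driven approach, the sketch below links a discrete motor variable and a discrete sensory variable through a likelihood table and inverts the production model by Bayes' rule, in order to infer the motor configuration from a sensory observation. All variables, dimensions and probability values are hypothetical and chosen for illustration only; they are not taken from any of the models cited above.

```python
import numpy as np

# Hypothetical toy model: a motor variable M with 3 configurations and a
# sensory variable S with 2 outcomes, linked by a likelihood table P(S | M).
p_m = np.array([0.5, 0.3, 0.2])          # prior over motor configurations P(M)
p_s_given_m = np.array([[0.9, 0.1],      # P(S | M = 0)
                        [0.4, 0.6],      # P(S | M = 1)
                        [0.2, 0.8]])     # P(S | M = 2)

def infer_motor(s_obs):
    """Invert the production model: compute P(M | S = s_obs) by Bayes' rule."""
    joint = p_m * p_s_given_m[:, s_obs]  # P(M) * P(S = s_obs | M)
    return joint / joint.sum()           # normalize over M

posterior = infer_motor(s_obs=1)         # distribution over the 3 motor configs
```

The same inversion scheme, extended to continuous multidimensional variables, underlies perception-as-inference models of the kind cited above.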
The computational challenge addressed in the program of the chair is to take the best from these two complementary approaches to elaborate a computational system which is able to learn how to produce and perceive speech from examples provided by the environment. These two approaches have recently begun to intersect with the emergence of Deep Generative Models, such as Variational AutoEncoders (VAEs) or Generative Adversarial Networks (GANs). In these models, the parameters characterizing probability distributions over the data are encoded within deep neural networks. The resulting models can be used as supervised probabilistic priors in a more general probabilistic model, and they provide efficient ways to extract, model and manipulate the low-dimensional latent space that captures the structure of the high-dimensional data.
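A minimal numerical sketch of this latent-variable idea is given below: an "encoder" maps a high-dimensional observation to the parameters (mean and log-variance) of a Gaussian over a low-dimensional latent variable, from which a sample is drawn with the reparameterization trick used in VAEs. The linear encoder and its random weights are purely illustrative stand-ins for a trained deep network.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 20, 2                  # high-dim data, low-dim latent space

# Illustrative linear "encoder" with random weights, standing in for a
# trained deep network: it maps x to the parameters (mu, log-variance)
# of a Gaussian distribution over the latent variable z.
w_mu = 0.1 * rng.normal(size=(z_dim, x_dim))
w_logvar = 0.1 * rng.normal(size=(z_dim, x_dim))

def encode(x):
    return w_mu @ x, w_logvar @ x     # mu(x), log sigma^2(x)

def sample_latent(x):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    mu, logvar = encode(x)
    eps = rng.standard_normal(z_dim)
    return mu + np.exp(0.5 * logvar) * eps

x = rng.normal(size=x_dim)            # one high-dimensional observation
z = sample_latent(x)                  # a stochastic low-dimensional code for x
```

In a real VAE the encoder and decoder are deep networks trained jointly, and the latent distribution serves as exactly the kind of learned prior described above.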


Figure 2 - Components of the research program

In this context, it is crucial to clarify how complex multidimensional probabilistic relations between speech production and perception variables can be learned, and to make this learning more efficient. A systematic exploration of the numerous multidimensional spaces involved in these processes is unfeasible. This is why a cognition-based approach to probabilistic programming is favored, since it structures the development of the computational models and the learning processes according to existing knowledge about physics (aeroacoustics, biomechanics) and neuroscience/psychology (sensory-motor control of speech, language representations in the brain, developmental schedule of speech and language acquisition in infancy and childhood). The overall program of the project consists in combining the formalization of the speech communication model made possible by the probabilistic programming framework with the learning capabilities of advanced machine learning methods, in order to tackle these learning issues, elaborate new learning models and evaluate them (Figure 2).

More specifically, during the next four years, research will be organized around three main topics:

  • Developing a computational agent able to learn speech representations from raw speech data in a weakly supervised scenario. This agent will contain an articulatory model of the human vocal tract, an articulatory-to-acoustic synthesis system, and a learning architecture combining deep learning algorithms with developmental principles inspired by cognitive science. The first step will consist in designing, implementing and testing a "deep" version of a Bayesian computational model of speech communication called COSMO (Moulin-Frier et al., 2012, 2015; Laurent et al., 2017; Barnaud et al., 2019), in which some of the probability distributions are implemented by generative neural models (e.g., VAEs, GANs). The second step will consist in reimplementing the speech communication agent entirely in an end-to-end neural architecture (PhD project of Marc-Antoine Georges).
  • Developing a general methodology for incremental sequence-to-sequence mapping. This will require the development of end-to-end classification and regression neural models able to deliver chunks of output data on the fly, from only a partial observation of the input data. Possible approaches include: (1) predicting online "the future" of the output sequence from "the past and present" of the input sequence, with an acceptable tolerance to possible errors, or (2) automatically learning from the data an optimal "waiting policy" that prevents the model from outputting data when the uncertainty is too high. This will be applied to two speech processing problems: Incremental Text-to-Speech synthesis, in which speech is synthesized while the user is typing the text, and Incremental Speech Enhancement, in which unintelligible portions of the speech signal are replaced on the fly with reconstructed portions. This work is carried out in close collaboration with the MIAI chair "Artificial Intelligence & Language" held by Laurent Besacier (PhD project of Brooke Stephenson).
  • Developing a model of speech production that incorporates online processing of feedback information, by extending our current Bayesian model for speech motor planning, Bayesian GEPPETO (Patri et al., 2015). This involves first learning a "dynamical internal model" of the speech production apparatus, which predicts articulatory movements and spectro-temporal acoustic properties of speech from time-varying motor commands, and predicts with minimal delay the probability of reaching the intended sensory goals of speech production with the appropriate timing. This learning relies on simulations of a biomechanical model of the vocal tract, which generates articulatory movements from motor commands and whose computational complexity will be reduced using machine-learning-based methods of model order reduction (PhD project of Maxime Calka). Then, assuming that the brain works like an optimal controller/estimator, research will focus on how sensory feedback and internal predictions of the speech signal concurrently guide speech production, allow online error correction, and shape speech perception (forthcoming PhD project).
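The "waiting policy" mentioned in the second topic above can be illustrated with a deliberately simple stand-in: instead of a policy learned from data, a fixed entropy threshold decides at each step whether the model is confident enough to emit an output symbol or should wait for more input context. All posteriors and the threshold value below are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + 1e-12)))

def incremental_decode(step_posteriors, max_entropy=0.5):
    """Emit an output symbol only when predictive uncertainty is low;
    otherwise wait (None). A fixed threshold stands in for a learned
    waiting policy."""
    decisions = []
    for p in step_posteriors:
        if entropy(p) <= max_entropy:    # confident: commit on the fly
            decisions.append(int(np.argmax(p)))
        else:                            # uncertain: wait for more input
            decisions.append(None)
    return decisions

# Hypothetical per-step posteriors over 3 output symbols:
posteriors = [[0.95, 0.03, 0.02],        # low entropy  -> emit symbol 0
              [0.40, 0.35, 0.25],        # high entropy -> wait
              [0.05, 0.90, 0.05]]        # low entropy  -> emit symbol 1
decisions = incremental_decode(posteriors)
```

In the actual research program, such a policy would be learned from data jointly with the sequence-to-sequence model, rather than hand-set as here.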

This work is also linked to an international project, carried out by members of the chair in collaboration with Anne-Lise Giraud and Itsaso Olasagasti of the "Auditory, Speech and Language Neuroscience" group at UNIGE (Université de Genève, Switzerland), and funded by the IDEX "Université Grenoble Alpes Université de l'Innovation" (Bio-Bayes project - IDEX ISP19). In this project, hierarchical and predictive Bayesian models of speech communication will be developed that account for observations of neuronal oscillatory systems in the brain (PhD project of Mamady Nabé).

The impact of this research program on the development of speech technologies will be evaluated in terms of the speed and completeness of learning, as well as the amount of data required to reach a satisfactory level of performance. Moreover, in the context of deep learning, the analysis of the representations learned by deep artificial neural networks from raw articulatory, acoustic and linguistic data could provide important insights into the sensory-motor representations potentially encoded in the human brain.


Publications

  • Marc-Antoine Georges, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber. Learning robust speech representation with an articulatory-regularized variational autoencoder. Interspeech 2021 - 22nd Annual Conference of the International Speech Communication Association, Aug 2021, Brno, Czech Republic.
  • Brooke Stephenson, Thomas Hueber, Laurent Girin, Laurent Besacier. Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input. Interspeech 2021 - 22nd Annual Conference of the International Speech Communication Association, Aug 2021, Brno, Czech Republic. pp.3865-3869.
  • Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber, Xavier Alameda-Pineda. A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling. Interspeech 2021 - 22nd Annual Conference of the International Speech Communication Association, Aug 2021, Brno, Czech Republic. pp.1-5.
  • Pierre Baraduc, Tsiky Rakotomalala, Pascal Perrier. Tongue motor control: deriving articulator trajectories and muscle activation patterns from an optimization principle. Neural Control of Movement 2021 (Virtual Conference). Abstract. April 2021.
  • Pascal Perrier, Ny-Tsiky Rakotomalala, Pierre Baraduc. Some thoughts about trajectory formation in speech production. Neural bases of speech production, 2021, UCSF, San Francisco, Virtual Symposium, Invited Conference, May 2021.
  • Mamady Nabé, Jean-Luc Schwartz, Julien Diard. COSMO-Onset: A Neurally-Inspired Computational Model of Spoken Word Recognition, Combining Top-Down Prediction and Bottom-Up Detection of Syllabic Onsets. Frontiers in Systems Neuroscience, Frontiers, 2021, 15, pp.653975.
  • Girin, L., Leglaive, S., Bie, X., Diard, J., Hueber, T., & Alameda-Pineda, X. (2021). Dynamical Variational Autoencoders: A Comprehensive Review. Foundations and Trends in Machine Learning. Pending publication in December.
  • Georges, M.-A., Badin, P., Diard, J., Girin, L., Schwartz, J.-L., Hueber, T. (2020). Towards an articulatory-driven neural vocoder for speech synthesis. ISSP 2020 - 12th International Seminar on Speech Production, Dec 2020, Providence (virtual), United States.
  • Baraduc, P., Perrier, P. (2020). Tongue motor control stability: integrating feedback, dynamical internal representation and optimal planning. ISSP 2020 - 12th International Seminar on Speech Production, Dec 2020, Providence (virtual), United States.
  • Calka, M., Perrier, P., Ohayon, J., Grivot-Boichon, C., Rochette, M., Payan, Y. (2020). Real-time simulations of human tongue movements with a reduced order model of a non-linear dynamic biomechanical model. Computer Methods in Biomechanics and Biomedical Engineering, Taylor & Francis, 2020, 23 (sup1), pp.S55-S57.
  • Stephenson, B., Besacier, L., Girin, L., Hueber, T. (2020). What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS. Proceedings of Interspeech 2020 (pp. 215-219). October 25–29, 2020, Shanghai, China.
  • Girin, L., Leglaive, S., Bie, X., Diard, J., Hueber, T., & Alameda-Pineda, X. (2020). Dynamical Variational Autoencoders: A Comprehensive Review. ArXiv preprint arXiv:2008.12595.
  • Calka, M., Perrier, P., Ohayon, J., Grivot-Boichon, C., Rochette, M., Payan, Y. (2021). Machine-learning-based model order reduction of a biomechanical model of the human tongue. Computer Methods and Programs in Biomedicine, Vol. 198, 105786.
  • Hueber, T., Tatulli, E., Girin, L., & Schwartz, J-L. (2020). Evaluating the potential gain of auditory and audiovisual speech predictive coding using deep learning. Neural Computation, vol. 32(3), 596-625.
  • Patri, J. F., Ostry, D. J., Diard, J., Schwartz, J. L., Trudeau-Fisette, P., Savariaux, C., & Perrier, P. (2020). Speakers are able to categorize vowels based on tongue somatosensation. Proceedings of the National Academy of Sciences, 117(11), 6255-6263.
  • Patri, J.F., Diard, J., & Perrier, P. (2019). Modeling sensory preference in speech motor planning: a Bayesian modeling framework. Frontiers in Psychology, 10, Article 2339.


The research program, which combines cognitive plausibility and alignment with ground-truth data, should result in significant improvement along three basic dimensions:

  • "More explainable and transferable": Connecting data-driven machine learning approaches with cognitive and developmental assumptions should render emerging features and structures more explainable and interpretable. It should facilitate the evaluation of the limits of their applicability, the prediction of their errors, and the identification of ways to improve their behavior.
  • "More rapid and economical": Implementing plausible developmental sequences and plausible hierarchies in the model structure should favor learning transfer and ensure faster learning and quicker convergence. It should enable the models to learn from reduced data sets, and favor economical hardware processing (exploiting mechanisms such as predictive coding, attentional filtering or multiplex coding).
  • “More robust and versatile”: The generative nature of the implemented models and the adequacy of the exploited developmental schedules should enable the models to process atypical or noisy speech thanks to internally generated outputs, and conversely to adapt their own speech productions in response to perturbations. It should lead to natural and coherent variability constrained and structured by the properties of the modelled system.


Industrial partners

  • ProBayes, 38330 Montbonnot
  • ANSYS France, 69100 Villeurbanne


Academic collaborations

  • UNIGE, Genève, Switzerland: Anne-Lise Giraud, auditory processing and neurophysiological modelling (grant from the IDEX "Université Grenoble Alpes Université de l'Innovation")

  • TIMC-IMAG, UGA: Yohan Payan, Model Order Reduction of biomechanical models of speech articulators (CIFRE doctoral grant from ANSYS France and ANRT)

Published on January 11, 2024