Artificial Intelligence & Language

Description

Context

Modern natural language processing (NLP) systems are excessively dependent on the availability of annotated resources ("data addiction"), which (1) widens the digital gap between high- and low-resource languages or dialects, and (2) increases the risk of machine bias, where empirically trained models reproduce societal inequalities such as gender or racial bias. At the same time, improved algorithms for the automatic analysis of text and speech create new opportunities for basic and applied research in language science (e.g. descriptive and theoretical linguistics, sociolinguistics, the study of language development), but these algorithms must be able to learn from few examples, since corpora collected by linguists are of limited size. From these observations, the objective of the chair is to make NLP less data-addicted (and therefore fairer), and to contribute to a methodological turn in language-related social science research by leveraging machine learning and modern NLP techniques.

Goals

We aim to build models that (1) can learn to process language from as little data as a learning child and (2) are free of the social biases present in the data. To this end, human labeling is replaced by weaker supervision signals: multimodal information, prior knowledge, inductive biases, cross-language (or cross-task) similarities, and context.
 

Research avenues

Due to the cost of data annotation, an open issue in NLP is the design of learning methods that are data-efficient (i.e. generalize from a few examples) and can leverage diverse types of knowledge. In this context, we propose to use:

  1. modeling based on information shared between languages or tasks (e.g. multilingual or multitask learning);

  2. zero-shot methods that do not need annotated data (e.g. methods that use representations learnt from speech or text without supervision);

  3. expert knowledge in empirical systems (priors included in Bayesian or neural models, typological features);

  4. multimodal data for semantic supervision (models of visually grounded speech and language);

  5. parsimonious models (known to have better explanatory and predictive power);

  6. inductive biases (from psycholinguistic work on language acquisition);

  7. innovative language data collection methodologies (via crowdsourcing, mobile apps).
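As a toy illustration of avenue 2 (zero-shot transfer through a shared representation space), the sketch below classifies unlabelled French sentences using label centroids built from English examples only. All sentences, embeddings, and labels here are hand-crafted for the example; in a real system the vectors would come from a pretrained multilingual or self-supervised encoder.

```python
import math

# Toy shared "multilingual" embedding space. In practice these vectors would
# come from a pretrained encoder; here they are invented for illustration.
ENGLISH_LABELLED = {
    "the weather is sunny":     ([0.9, 0.1, 0.0], "weather"),
    "rain is expected tonight": ([0.8, 0.2, 0.1], "weather"),
    "the team won the match":   ([0.1, 0.9, 0.0], "sports"),
    "a great goal was scored":  ([0.0, 0.8, 0.2], "sports"),
}

# French test sentences live in the same space but were never labelled.
FRENCH = {
    "il pleut beaucoup":      [0.85, 0.15, 0.05],
    "le match etait superbe": [0.05, 0.90, 0.10],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Build one centroid per label from the English data only.
by_label = {}
for vec, label in ENGLISH_LABELLED.values():
    by_label.setdefault(label, []).append(vec)
centroids = {label: centroid(vecs) for label, vecs in by_label.items()}

def classify(vec):
    # Zero-shot: pick the nearest label centroid in the shared space.
    return max(centroids, key=lambda lab: cosine(vec, centroids[lab]))

for sentence, vec in FRENCH.items():
    print(sentence, "->", classify(vec))
```

The same nearest-centroid idea extends to cross-task transfer: only the source of the shared vectors changes, not the classification rule.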

Activities

Two PhD theses started in January 2020:
 
  • Brooke Stephenson: Incremental (low-latency) TTS (co-supervised with the chair of P. Perrier)
  • Lorenzo Lupo: Document-level neural machine translation

Two M2R (research master's) theses have been supervised:
 
  • End-to-end speech parsing (Ousama Gasmi)
  • Analysis of contextualized language models’ lexical functions (Vincent Bellue)

Collaborative work on self-supervised text representation learning: release of FlauBERT (a language model for French, trained on the Jean Zay supercomputer): https://github.com/getalp/Flaubert

Chair events

Organization of the ALPS (Advanced Language Processing School) winter school, which will take place (virtually)
from Sunday, January 17th to Friday, January 22nd, 2021 - http://lig-alps.imag.fr

Selected list of publications

  • Erica Shimomoto, François Portet, Kazuhiro Fukui. Text classification based on the word subspace representation. Pattern Analysis and Applications, Springer Verlag, 2021.

  • Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier. Investigating alignment interpretability for low-resource NMT. Machine Translation, Springer Verlag, 2021.

  • Maha Elbayad, Laurent Besacier, Jakob Verbeek. Joint Source-Target Encoding with Pervasive Attention. Machine Translation, Springer Verlag, 2021.

  • Hang Le, Juan Miguel Pino, Changhan Wang, Jiatao Gu, Didier Schwab, Laurent Besacier: Lightweight Adapter Tuning for Multilingual Speech Translation. ACL/IJCNLP (2) 2021: 817-824.

  • Zae Myung Kim, Laurent Besacier, Vassilina Nikoulina, Didier Schwab: Do Multilingual Neural Machine Translation Models Contain Language Pair Specific Attention Heads? ACL/IJCNLP (Findings) 2021: 2832-2841.

  • Ahmet Üstün, Alexandre Berard, Laurent Besacier, Matthias Gallé: Multilingual Unsupervised Neural Machine Translation with Denoising Adapters. EMNLP (1) 2021: 6650-6662.

  • Ha Nguyen, Yannick Estève, Laurent Besacier: An Empirical Study of End-To-End Simultaneous Speech Translation Decoding Strategies. ICASSP 2021: 7528-7532.

  • Ha Nguyen, Yannick Estève, Laurent Besacier: Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation. 22nd Annual Conference of the International Speech Communication Association, Aug 2021, Brno, Czech Republic.

  • Louise Tarrade, Jean-Pierre Chevrot, Jean-Philippe Magué. Buzz or Change: How the Social Network Structure Conditions the Fate of Lexical Innovations on Twitter. 8th Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2021), Oct 2021, Nijmegen, Radboud University, Netherlands.

  • Jean-Pierre Chevrot. Can we predict the socio-demographic characteristics of Twitter users from their tweets? The contribution of massive data and artificial intelligence. Linguistic variation in European languages – New perspectives on diasystematic variation at the occasion of the centenary of Coseriu’s birth (1921-2021), Nov 2021, Copenhagen, Denmark.

  • Solène Evain, Manh Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Estève, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier. Task Agnostic and Task Specific Self-Supervised Learning from Speech with LeBenchmark. Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021), Dec 2021, online, United States.

  • Brooke Stephenson, Thomas Hueber, Laurent Girin, Laurent Besacier. Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input. Interspeech 2021 - 22nd Annual Conference of the International Speech Communication Association, Aug 2021, Brno, Czech Republic. pp. 3865-3869.

  • Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Estève, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier. LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech. INTERSPEECH 2021: Conference of the International Speech Communication Association, Aug 2021, Brno, Czech Republic.

  • Maximin Coavoux. BERT-Proof Syntactic Structures: Investigating Errors in Discontinuous Constituency Parsing. Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021), Aug 2021, Online. pp. 3259-3272.

  • Hady Elsahar, Maximin Coavoux, Jos Rozen, Matthias Gallé. Self-Supervised and Controlled Multi-Document Opinion Summarization. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Apr 2021, Online. pp. 1646-1662.

  • Ali Can Kocabiyikoglu, Jean-Marc Babouchkine, François Portet, Raheel Qader. Neural Medication Extraction: A Comparison of Recent Models in Supervised and Semi-supervised Learning Settings. ICHI 2021: IEEE International Conference on Healthcare Informatics, Sep 2021, Victoria, Canada.

  • Sannara Ek, François Portet, Philippe Lalanda, German Vega. A Federated Learning Aggregation Algorithm for Pervasive Computing: Evaluation and Comparison. 19th IEEE International Conference on Pervasive Computing and Communications (PerCom 2021), Mar 2021, Kassel (virtual), Germany.

  • Anastasiia Usmanova, François Portet, Philippe Lalanda, German Vega. A distillation-based approach integrating continual learning and federated learning for pervasive services. 3rd Workshop on Continual and Multimodal Learning for Internet of Things, co-located with IJCAI 2021, Aug 2021, Montreal, Canada.

  • Odette Scharenborg, Laurent Besacier, Alan Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stüker, Pierre Godard, Markus Müller, Lucas Ondel, Shruti Palaskar, Philip Arthur, Francesco Ciannella, Mingxing Du, Elin Larsen, Danny Merkx, Rachid Riad, Liming Wang, and Emmanuel Dupoux. Speech technology for unwritten languages. IEEE/ACM Transactions on Audio, Speech and Language Processing, February 2020.

  • Eric Le Ferrand, Steven Bird, and Laurent Besacier. Enabling Interactive Transcription in an Indigenous Community. In COLING 2020 (short paper), Virtual, Spain, December 2020.

  • Vaishali Pal, Manish Shrivastava, and Laurent Besacier. ConfNet2Seq: Full Length Answer Generation from Spoken Questions. In Text, Speech and Dialogue (TSD) 2020, Brno, Czech Republic, September 2020.

  • Maha Elbayad, Ha Nguyen, Fethi Bougares, Natalia Tomashenko, Antoine Caubrière, Benjamin Lecouteux, Yannick Estève, and Laurent Besacier. ON-TRAC Consortium for End-to-End and Simultaneous Speech Translation Challenge Tasks at IWSLT 2020. In The International Conference on Spoken Language Translation ACL - 17th IWSLT, Seattle, WA, United States, July 2020.

  • Maha Elbayad, Laurent Besacier, and Jakob Verbeek. Efficient Wait-k Models for Simultaneous Machine Translation. In Interspeech 2020, Shanghai (Virtual Conf), China, October 2020.

  • Brooke Stephenson, Laurent Besacier, Laurent Girin, and Thomas Hueber. What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS. In Interspeech 2020, Shanghai (Virtual Conf), China, October 2020.

  • Vaishali Pal, Fabien Guillot, Manish Shrivastava, Jean-Michel Renders, and Laurent Besacier. Modeling ASR Ambiguity for Neural Dialogue State Tracking. In Interspeech 2020, Shanghai (Virtual Conf), China, October 2020.

  • Ha Nguyen, Fethi Bougares, Natalia Tomashenko, Yannick Estève, and Laurent Besacier. Investigating Self-supervised Pre-training for End-to-end Speech Translation. In Interspeech 2020, Shanghai (Virtual Conf), China, October 2020.

  • Ewan Dunbar, Julien Karadayi, Mathieu Bernard, Xuan-Nga Cao, Robin Algayres, Lucas Ondel, Laurent Besacier, Sakriani Sakti, and Emmanuel Dupoux. The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units. In Interspeech 2020, Shanghai (Virtual Conf), China, October 2020.

  • Loïc Vial, Benjamin Lecouteux, Didier Schwab, Hang Le, Laurent Besacier. The LIG system for the English-Czech Text Translation Task of IWSLT 2019. IWSLT (16th International Workshop on Spoken Language Translation), 2019, Hong Kong, China.

  • Ewan Dunbar, Robin Algayres, Julien Karadayi, Mathieu Bernard, Juan Benjumea, et al. The Zero Resource Speech Challenge 2019: TTS without T. Interspeech 2019 - 20th Annual Conference of the International Speech Communication Association, Sep 2019, Graz, Austria.

  • Laurent Besacier, Elodie Gauthier, Sylvie Voisin. Lessons Learned After Development and Use of a Data Collection App for Language Documentation (LIG-AIKUMA). International Congress of Phonetic Sciences ICPhS 2019, Aug 2019, Melbourne, Australia.

  • Marcely Zanon Boito, Aline Villavicencio, Laurent Besacier. Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings. Interspeech 2019, Sep 2019, Graz, Austria.

  • Pierre Godard, Laurent Besacier, François Yvon. Controlling Utterance Length in NMT-based Word Segmentation with Attention. International Workshop on Spoken Language Translation, Nov 2019, Hong Kong, China.

  • Manh Ha Nguyen, Natalia Tomashenko, Marcely Zanon Boito, Antoine Caubrière, Fethi Bougares, et al. ON-TRAC Consortium End-to-End Speech Translation Systems for the IWSLT 2019 Shared Task. 16th International Workshop on Spoken Language Translation 2019, Nov 2019, Hong Kong, China.

  • William Havard, Jean-Pierre Chevrot, Laurent Besacier. Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese. International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, Brighton, United Kingdom. pp. 8618-8622.

  • Mahault Garnerin, Solange Rossato, Laurent Besacier. Gender Representation in French Broadcast Corpora and Its Impact on ASR Performance. 1st International Workshop, Oct 2019, Nice, France. pp. 3-9.

  • William Havard, Jean-Pierre Chevrot, Laurent Besacier. Word Recognition, Competition, and Activation in a Model of Visually Grounded Speech. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Nov 2019, Hong Kong, China. pp. 339-348.

