Friday 28 November 2025, 11:00–12:00

Speaker: Thomas Hueber (Grenoble-Alpes University)

Title: Computational Models of Speech Acquisition with Self-Supervised Deep Learning: From Audiovisual Predictive Coding to Vocal Imitation

Abstract: Learning speech and language is an extraordinary challenge: it requires mastering a complex, multilayered system (from phonetics to syntax) despite ambiguity, variability, and the absence of explicit supervision. How do children accomplish this? Experimental and behavioral studies have highlighted the contributions of multiple mechanisms, including innate biases, the ability to extract statistical regularities from sensory input (statistical learning), visual grounding, the role of self-production (i.e., sensorimotor grounding), and social interaction. Yet the precise interplay among these mechanisms remains an open question. Computational modeling using self- or weakly supervised learning offers a complementary approach to test and quantify these processes. By building learner models that process raw, unlabeled auditory and visual inputs, researchers can investigate how linguistically meaningful representations emerge from data and how these representations support both speech perception and production.

In this talk, I will present our recent studies following this computational modeling approach. The first focuses on speech perception and examines how self-supervised learning (SSL) models based on predictive coding can benefit from additional visual information from the speaker's lips, shedding light on the respective contributions of auditory and visual cues [1]. The second shifts to speech production and introduces a computational agent that learns to imitate speech in a self-supervised manner, linking acoustics, articulatory gestures, and emerging discrete speech units [2]. Finally, I will present a follow-up study showing that rich acoustic representations extracted from a pre-trained SSL model can significantly enhance articulatory control in such an agent, providing new insights into the role of statistical learning mechanisms in bootstrapping articulatory control [3].
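To make the predictive-coding idea concrete, here is a minimal sketch of the kind of experiment described in [1]: a recurrent model predicts the next audio frame from past frames, with or without synchronous lip features, and the benefit of the visual stream is read off as the reduction in prediction error. This is an illustrative toy, not the authors' architecture; all dimensions, names, and the use of a GRU are assumptions.

```python
# Toy audiovisual predictive coding (illustrative sketch, not the paper's code).
# Predict audio frame t+1 from frames <= t, optionally with lip features.
import torch
import torch.nn as nn

class PredictiveCoder(nn.Module):
    def __init__(self, audio_dim=40, lip_dim=20, hidden=256, use_lips=True):
        super().__init__()
        self.use_lips = use_lips
        in_dim = audio_dim + (lip_dim if use_lips else 0)
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, audio_dim)

    def forward(self, audio, lips=None):
        # audio: (batch, T, audio_dim) log-mel frames
        # lips:  (batch, T, lip_dim) synchronous lip-shape features (optional)
        x = torch.cat([audio, lips], dim=-1) if self.use_lips else audio
        h, _ = self.rnn(x)
        return self.readout(h)  # prediction of the next frame at each step

def prediction_loss(model, audio, lips=None):
    # Shift targets by one frame: predict audio[t+1] from inputs up to t.
    pred = model(audio[:, :-1], None if lips is None else lips[:, :-1])
    return nn.functional.mse_loss(pred, audio[:, 1:])

# Compare audio-only vs audiovisual prediction error (dummy data here).
audio = torch.randn(8, 100, 40)
lips = torch.randn(8, 100, 20)
av, a_only = PredictiveCoder(use_lips=True), PredictiveCoder(use_lips=False)
print(prediction_loss(av, audio, lips).item(), prediction_loss(a_only, audio).item())
```

On real audiovisual data, a lower prediction error for the audiovisual model would quantify the informational gain contributed by the lips.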
References:
[1] Hueber, T., Tatulli, E., Girin, L., Schwartz, J.-L. (2020), "Evaluating the potential gain of auditory and audiovisual speech predictive coding using deep learning", Neural Computation, vol. 32, no. 3, pp. 596–625.
[2] Georges, M.-A., Lavechin, M., Schwartz, J.-L., Hueber, T. (2024), "Decode, Move and Speak! Self-supervised Learning of Speech Units, Gestures, and Sound Relationships Using Vocal Imitation", Computational Linguistics, vol. 50, no. 4, pp. 1345–1373.
[3] Lavechin, M., Hueber, T. (2025), "From perception to production: how acoustic invariance facilitates articulatory learning in a self-supervised vocal imitation model", Proc. of EMNLP 2025, pp. 23863–23874.

Biography: Thomas Hueber is a Senior Research Scientist (Directeur de recherche) at CNRS, conducting his research at GIPSA-lab in Grenoble, France, where he leads the CRISSP team (Cognitive Robotics, Interactive Systems, and Speech Processing). He received his Ph.D. in Computer Science from Pierre and Marie Curie University (Paris) in 2009 and his "Habilitation à diriger des recherches" (HDR) from Grenoble Alpes University in 2019. His research focuses on automatic speech processing, with an emphasis on models and systems inspired by, or grounded in, human speech perception and production mechanisms. Applications include assistive technologies and computational models of speech acquisition, perception, and control. He served as Associate Editor of the EURASIP Journal on Audio, Speech, and Music Processing (2019–2024) and is currently co-head of the CNRS national research group on natural language processing (GDR TAL). Since 2025, he has held the Chair in Developmental AI for Speech and Language at the Grenoble Institute on Artificial Intelligence.
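As background for the "emerging discrete speech units" mentioned in the abstract, the common recipe in SSL speech work is to cluster frame-level representations and treat cluster ids as pseudo-phonemic units. The sketch below illustrates that recipe under stated assumptions: the features are random stand-ins, whereas in practice they would come from a pretrained SSL encoder (e.g. a HuBERT- or wav2vec 2.0-style model); it is not the pipeline of [2] or [3].

```python
# Illustrative unit discovery by clustering frame-level features
# (assumption-laden sketch, not the papers' pipeline).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(5000, 256))  # stand-in for (n_frames, dim) SSL features

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
units = kmeans.predict(frames)  # one discrete unit id per frame
print(units[:20])
```

A vocal imitation agent can then learn mappings from such unit sequences to articulatory gestures, and from gestures back to sound.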