Title: Beyond More Data: Advancing Low-Resource Natural Language Processing from Tokenization to Inference

Speaker: Rico Sennrich (University of Zurich)

Date: Monday 15th September 2025, 11.00 - 12.00

Location: Bayes Centre, G.03 (note unusual location)

Abstract: In a field where the state of the art is often advanced by scale - building larger models on more data - I will argue that a surprising amount of progress can be achieved by modifying other parts of the NLP pipeline, especially for low-resource languages. I will introduce a parity-aware modification of byte-pair encoding that is optimized for cross-lingual fairness in tokenization length; it can yield more equitable models in settings where length translates directly into cost, while also performing better on cross-lingual benchmarks. I will also discuss machine translation, where massively multilingual models and large language models have been shown to handle many translation directions, but still suffer from problems such as hallucinations or translations in the wrong language. I will show how these issues can be massively reduced with contrastive decoding methods that pair each input with appropriate contrastive inputs, and sketch wider applications of this strategy to interacting with LLMs.

Biography: Rico Sennrich is an assistant professor at the University of Zurich, where he has worked since 2019. He has a long history with the University of Edinburgh as a visiting researcher, postdoc, lecturer, and now honorary fellow. His research covers various areas of NLP, with a special focus on high-quality machine translation, multilingual and low-resource NLP, and multimodal models.

https://www.cl.uzh.ch/de/about-us/people/team/compling/sennrich.html