Friday 30 January 2026 - 11am

Speaker: Uri Berger (The Hebrew University of Jerusalem)

Title: Human-Inspired Multimodal Language Modeling: Acquisition, Pragmatic Effects, and Human Alignment

Abstract: Vision–language models (VLMs) achieve impressive performance across many multimodal tasks, yet their outputs often diverge from human behavior. In this talk, I outline a path toward understanding the sources of these differences and toward making VLM outputs more human-like, following three complementary steps.

First, I examine the semantic structures that emerge when models are trained with both visual and linguistic input. I show that, much like in humans, the categories learned from multimodal data tend to be scene-based (e.g., “water-related objects,” “tree-related objects”), in contrast to the taxonomic categories (e.g., “animals,” “vehicles”) that arise in text-only training.

Second, I investigate how pragmatic cues, such as salient visual categories and speakers’ cultural backgrounds, influence image descriptions. I demonstrate that visual features shape the syntactic form of the generated description, and that cultural background strongly affects which entities speakers choose to mention.

Finally, I present our efforts to make VLM outputs more human-aligned. I introduce reformulation feedback, a technique inspired by parents' feedback to their children, and show that applying it to captioning models at inference time significantly improves human judgments of caption quality. I then survey current evaluation practices for image captioning models, highlight that the field relies on five widely used metrics that correlate poorly with human ratings, and propose directions for substantially improving these correlations.

Biography: Uri Berger is a final-year PhD candidate in a joint program at the University of Melbourne and the Hebrew University of Jerusalem, under the supervision of Lea Frermann, Omri Abend, and Gabriel Stanovsky. Before that, Uri completed an MSc at the Hebrew University of Jerusalem, working with Ari Rappoport on spiking neural networks. Uri is interested in learning in non–text-only environments, particularly those involving multimodality or interactivity.