IPAB Workshop - 22/1/26

Title: On Inverse Language Modelling via End-to-End Differentiable Language Models

Speaker: Kevin Denamganai
Date: Jan 22 2026, 13.00 - 14.00
Venue: G.03

Abstract: Language Models (LMs) are ubiquitous in AI thanks to their impressive capabilities. Yet, we lack understanding of their inner workings and of what drives these capabilities. Critically, the AI community has mainly studied their forward mapping. In this work, we investigate their inverse mapping. We formalise the invertibility problem of LMs as the problem of finding an inverse prompt to a given target output for a given subject LM, and propose an optimisation-based inverse operator that can provide inverse prompts to any given target output for any white-box subject LM. Our inverse operator relies on our proposed end-to-end Differentiable Language Model (DLM) extension to common, non-differentiable LMs. Indeed, despite their simple tokens-in, tokens-out autoregressive processing, LMs are not monolithic beyond their tokenizers, but rather made up of different modules. Among these, the hard embedding module and the next-token sampling module are not differentiable. To render the whole LM pipeline differentiable end-to-end, we first shift from sequences of tokens (SoTs) to sequences of distributions over tokens (SDoTs) as the main abstraction. This shift highlights a distributional viewpoint over LMs, which allows us to swap the two SoT-processing, non-differentiable modules for differentiable, SDoT-processing counterparts, namely a soft embedding module and a Gumbel-Softmax (GS)-based next-token sampling module. We present experiments on the invertibility of LMs with variants of the GS gradient estimator at different LM scales, showing that our DLM-powered inverse operator can reliably provide inverse prompts to varied target outputs for any white-box LM, out of the box. This work paves the way for the community to build tools to better understand LMs' capabilities through a focus not only on their forward mapping but also on their inverse mapping.
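The following is a minimal sketch (PyTorch), assuming a toy GRU language model rather than the actual subject LMs discussed in the talk. It only illustrates the two swaps the abstract describes: a soft embedding module that consumes distributions over tokens instead of hard token ids, and a Gumbel-Softmax-based next-token sampling module, so that gradients can flow through the whole pipeline end-to-end. All names and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, HIDDEN = 100, 32  # assumed toy dimensions

class ToyDLM(nn.Module):
    """Toy autoregressive LM operating on sequences of distributions over tokens (SDoTs)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, HIDDEN)   # embedding table
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB_SIZE)     # next-token logits

    def soft_embed(self, dists):
        # dists: (batch, seq, VOCAB_SIZE) distributions over tokens.
        # Expected embedding under each distribution: differentiable w.r.t. dists.
        return dists @ self.emb.weight                # (batch, seq, HIDDEN)

    def forward(self, dists):
        h, _ = self.rnn(self.soft_embed(dists))
        return self.head(h)                           # (batch, seq, VOCAB_SIZE)

def gs_sample(logits, tau=1.0, hard=True):
    # Gumbel-Softmax next-token "sample": a (near-)one-hot distribution whose
    # gradient flows through the relaxed softmax (straight-through when hard=True).
    return F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)

# Free-running generation stays within the SDoT abstraction: each sampled
# distribution is appended to the sequence and soft-embedded at the next step.
dists = F.one_hot(torch.randint(0, VOCAB_SIZE, (1, 4)), VOCAB_SIZE).float()
lm = ToyDLM()
for _ in range(3):
    next_dist = gs_sample(lm(dists)[:, -1, :])        # (1, VOCAB_SIZE)
    dists = torch.cat([dists, next_dist.unsqueeze(1)], dim=1)
```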
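Below is a minimal sketch of what an optimisation-based inverse operator could look like, reusing the ToyDLM class from the snippet above. It uses a teacher-forced likelihood loss as a simplification of the approach described in the abstract: the subject LM's weights stay frozen while the prompt, kept as a sequence of distributions over tokens, is optimised by gradient descent so that the LM assigns high likelihood to a given target output. The prompt length, learning rate, and step count are assumptions, not values from the talk.

```python
import torch
import torch.nn.functional as F

lm = ToyDLM().eval()
for p in lm.parameters():
    p.requires_grad_(False)                            # the subject LM is frozen

PROMPT_LEN = 5
target = torch.randint(0, VOCAB_SIZE, (1, 8))          # stand-in target output tokens

# Free parameters: unnormalised logits of the prompt's token distributions.
prompt_logits = torch.zeros(1, PROMPT_LEN, VOCAB_SIZE, requires_grad=True)
optimiser = torch.optim.Adam([prompt_logits], lr=0.1)

target_dists = F.one_hot(target, VOCAB_SIZE).float()   # target output as an SDoT
for step in range(500):
    prompt_dists = F.softmax(prompt_logits, dim=-1)    # SDoT prompt
    logits = lm(torch.cat([prompt_dists, target_dists], dim=1))
    # Next-token prediction loss restricted to the target span.
    pred = logits[:, PROMPT_LEN - 1:-1, :]
    loss = F.cross_entropy(pred.reshape(-1, VOCAB_SIZE), target.reshape(-1))
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

inverse_prompt = prompt_logits.argmax(dim=-1)           # discretised inverse prompt
```

Note that this teacher-forced setup is only a stand-in: the talk's DLM formulation would instead let the model generate the continuation autoregressively through the Gumbel-Softmax sampling module, keeping the whole rollout differentiable.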