Multimodal Interpretability from Partial Sight

We seek to build deep generative models (DGMs) that capture the joint distribution over co-observed visual and language data.

Led by Siddharth Narayanaswamy and Ivan Titov with Victor Prokhorov (Postdoctoral Researcher)


We seek to build DGMs that capture the joint distribution over co-observed visual and language data (e.g. abstract scenes, COCO, VQA), while faithfully representing the conceptual mapping between the observations in an interpretable manner. Our approach relies on two key observations: (a) perceptual domains (e.g. images) are inherently interpretable, and (b) a key characteristic of useful abstractions is that they are low(er)-dimensional than the data and correspond to some conceptually meaningful component of the observation. We will leverage recent work on conditional neural processes (Garnelo et al., 2018) to develop partial-image representations that mediate effectively, and interpretably, between vision and language data. Evaluation of this framework will involve both the ability to generate multimodal data, measured against state-of-the-art approaches, and the human-measured interpretability of the learnt representations. Our project image represents multimodal data (images, text) as a "partial specification" that allows effective encoding and reconstruction of the data.
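To make the notion of a partial-observation encoder concrete, the sketch below is a minimal conditional neural process in the spirit of Garnelo et al. (2018), written in PyTorch: an image is treated as a set of (pixel coordinate, value) pairs, a random subset of which is encoded into a single aggregate representation that conditions predictions at unobserved pixels. It is an illustrative sketch, not the project's actual model; all layer sizes, names, and the toy pixel-regression setup are assumptions.

```python
# Minimal conditional neural process (CNP) sketch in PyTorch.
# Layer sizes and the toy setup are illustrative assumptions.
import torch
import torch.nn as nn


class ConditionalNeuralProcess(nn.Module):
    def __init__(self, x_dim=2, y_dim=3, r_dim=128, h_dim=128):
        super().__init__()
        # Encoder: maps each observed (x, y) context pair to a representation r_i.
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, r_dim),
        )
        # Decoder: predicts a Gaussian over y at each target location x,
        # conditioned on the aggregated context representation r.
        self.decoder = nn.Sequential(
            nn.Linear(x_dim + r_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, 2 * y_dim),  # mean and pre-softplus std
        )

    def forward(self, x_context, y_context, x_target):
        # x_context: (B, Nc, x_dim), y_context: (B, Nc, y_dim), x_target: (B, Nt, x_dim)
        r_i = self.encoder(torch.cat([x_context, y_context], dim=-1))
        r = r_i.mean(dim=1, keepdim=True)         # permutation-invariant aggregation
        r = r.expand(-1, x_target.shape[1], -1)   # broadcast to every target point
        out = self.decoder(torch.cat([x_target, r], dim=-1))
        mean, pre_std = out.chunk(2, dim=-1)
        std = 0.01 + 0.99 * torch.nn.functional.softplus(pre_std)
        return torch.distributions.Normal(mean, std)


# Toy usage: condition on a random subset of observed pixels ("partial sight")
# and maximise the predictive likelihood of the remaining pixels.
if __name__ == "__main__":
    model = ConditionalNeuralProcess()
    x_ctx = torch.rand(4, 100, 2)   # 100 observed pixel coordinates per image
    y_ctx = torch.rand(4, 100, 3)   # their RGB values
    x_tgt = torch.rand(4, 400, 2)   # coordinates of pixels to reconstruct
    y_tgt = torch.rand(4, 400, 3)
    pred = model(x_ctx, y_ctx, x_tgt)
    loss = -pred.log_prob(y_tgt).mean()
    loss.backward()
```

The aggregate representation produced by mean-pooling plays the role of the "partial specification" mentioned above: it is low-dimensional relative to the raw image and is built only from the observed subset of pixels.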