IPAB Workshop - 9/4/26

Speaker: Yu Cheng

Title: Think, Look, and Revise: Inconsistency-Aware Visual Self-Correction in MLLMs

Abstract: Tool-augmented multimodal reasoning integrates external tools (e.g., object detection, depth estimation) into multimodal large language models (MLLMs) to address perceptual bottlenecks in complex visual tasks. However, existing approaches rarely verify tool outputs, limiting their ability to detect and recover from tool failures. We propose ReVISE, a framework that equips MLLMs with verification and dynamic error recovery for tool-augmented reasoning. ReVISE introduces (1) a curated training dataset that supervises reflective behaviors, enabling models to validate tool-derived evidence, reformulate queries when visual mismatches arise, and fall back to intrinsic grounding when external tools are unreliable; and (2) a reinforcement learning stage with targeted rewards that encourage internal reflection and penalize spatial misalignment. Experiments on several benchmarks demonstrate consistent improvements over existing methods, highlighting the importance of error detection and correction in tool-augmented multimodal reasoning.
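The verify, reformulate, and fall-back loop the abstract describes could be sketched roughly as follows. This is a toy illustration, not ReVISE's actual interface: the helper functions, the dict-based "image", and the synonym table are all placeholders invented for this example.

```python
# Hypothetical sketch of inconsistency-aware tool use: call a tool, verify its
# output against the image, reformulate the query on a mismatch, and fall back
# to the model's intrinsic grounding if the tool keeps failing.

def call_tool(query, image):
    # Stand-in for an external detector: look the query up in the toy scene.
    return {"query": query, "box": image.get(query)}

def verify(evidence):
    # Reflective check: did the tool actually localize the requested object?
    return evidence["box"] is not None

SYNONYMS = {"puppy": "dog", "automobile": "car"}  # toy reformulation table

def reformulate(query):
    # On a visual mismatch, rephrase the query (here: canonical synonym).
    return SYNONYMS.get(query, query)

def answer_with_tools(query, image, max_retries=1):
    for _ in range(max_retries + 1):
        evidence = call_tool(query, image)
        if verify(evidence):
            return ("tool", evidence["box"])   # accept verified tool evidence
        query = reformulate(query)             # mismatch: retry with new query
    return ("intrinsic", None)                 # fall back to the MLLM alone

scene = {"dog": (10, 20, 50, 60)}
print(answer_with_tools("puppy", scene))   # recovers via reformulation
print(answer_with_tools("cat", scene))     # tool unreliable: intrinsic fallback
```

The key design point mirrored here is that tool evidence is never trusted blindly: every output passes a verification gate before it enters the reasoning chain.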

Speaker: Xin Feng

Title: Continual Learning in Pixel-Level Generation: Using Integrated Gradient Attribution to Restore without Forgetting

Abstract: Modern generative AI and unified vision architectures have achieved remarkable progress in pixel-level generation tasks such as image restoration, yet they implicitly assume a static, closed-world setting where all degradations are known at training time and past data can be freely revisited. In practice, new corruption types, capture devices, and environmental conditions emerge over time, and access to historical data is often restricted by storage, privacy, or ownership constraints. This work presents Restoring without Forgetting (RwF), a filter-level continual adaptation framework that addresses a key open challenge: how can restoration models learn sequentially without catastrophic forgetting? RwF leverages parameter-space integrated gradient attribution to identify and isolate the small subset of convolutional filters responsible for modeling each degradation pattern. It maintains a filter bank and generates task-specific filters through compact factorized low-rank transformations, further augmented with cross-task attention and prototypical contrastive learning. Across six restoration tasks on two backbone architectures, RwF achieves competitive quality against all-in-one methods with full data access, while requiring roughly an order of magnitude fewer additional parameters than LoRA-style adaptation.
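Parameter-space integrated gradients, the attribution mechanism the abstract names, can be sketched on a toy model. Everything below is an illustrative assumption, not the paper's setup: a quadratic loss with an analytic gradient stands in for the restoration network, and a flat parameter vector stands in for the convolutional filters.

```python
import numpy as np

# Toy parameter-space integrated gradients (IG), assuming a quadratic loss
# L(w) = 0.5 * ||A w - b||^2; A, b, and the dimensions are illustrative.

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))
b = rng.normal(size=8)

def loss(w):
    return 0.5 * np.sum((A @ w - b) ** 2)

def grad(w):
    # Analytic gradient of the quadratic loss w.r.t. the parameter vector.
    return A.T @ (A @ w - b)

def integrated_gradients(w0, w1, steps=50):
    # IG_i = (w1_i - w0_i) * average gradient along the straight path
    # from baseline w0 to adapted w1, via a midpoint Riemann sum.
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = np.stack([grad(w0 + a * (w1 - w0)) for a in alphas])
    return (w1 - w0) * path_grads.mean(axis=0)

w0 = np.zeros(5)                           # baseline (e.g. pre-task) weights
w1 = np.linalg.lstsq(A, b, rcond=None)[0]  # adapted weights for the new task

ig = integrated_gradients(w0, w1)
# Completeness: attributions sum to the loss change between the weight sets.
assert np.isclose(ig.sum(), loss(w1) - loss(w0))
# Parameters with the largest |IG| are the ones an RwF-style method would
# isolate into its task-specific filter bank.
top = np.argsort(-np.abs(ig))[:2]
```

The completeness property checked by the assertion is what makes per-parameter attributions meaningful: the loss change of adapting to a task decomposes exactly over parameters, so ranking by |IG| identifies the small subset of filters carrying that task.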