Wednesday 27 August - 11.00

Speaker: Eve Fleisig (University of California, Berkeley)

Title: GRACE: A Granular Benchmark for Evaluating Model Calibration Against Human Calibration

Abstract: Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of questions containing a series of gradually easier clues, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. We then hosted live human vs. model competitions to gather 1,749 data points on human and model timing, accuracy, and confidence. We find that although humans are less accurate than models, humans are generally better calibrated. We also introduce CalScore, a metric based on GRACE, which we use to analyze types of model miscalibration that differ from human behavior.
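To make the clue-by-clue protocol concrete, here is a minimal sketch of the kind of evaluation loop the abstract describes. This is not the GRACE implementation: the `guess_fn` interface, the confidence threshold, and the toy clues are all invented for illustration.

```python
# Hypothetical sketch of an incremental-clue evaluation: clues are revealed
# one at a time (hardest first), the model "buzzes" once its confidence
# clears a threshold, and we record how early, how accurately, and how
# confidently it answered.

from dataclasses import dataclass

@dataclass
class Buzz:
    clue_index: int    # number of clues revealed before answering (earliness)
    answer: str
    confidence: float
    correct: bool

def evaluate_question(clues, gold_answer, guess_fn, threshold=0.8):
    """Reveal clues in order; answer once confidence >= threshold.

    `guess_fn(revealed_clues)` is a stand-in for a model call: it takes the
    clues revealed so far and returns (answer, confidence).
    """
    for i in range(1, len(clues) + 1):
        answer, confidence = guess_fn(clues[:i])
        if confidence >= threshold or i == len(clues):
            return Buzz(i, answer, confidence, answer == gold_answer)

# Toy stand-in model: grows more confident as easier clues are revealed.
def toy_model(revealed_clues):
    confidence = min(1.0, 0.3 * len(revealed_clues))
    return ("Marie Curie", confidence)

if __name__ == "__main__":
    clues = [
        "This scientist coined the term 'radioactivity'.",      # hardest
        "She won Nobel Prizes in both physics and chemistry.",
        "She discovered polonium and radium.",                   # easiest
    ]
    print(evaluate_question(clues, "Marie Curie", toy_model))
```

Scoring over many such questions can then reward answering both early and correctly, with confidence compared against observed accuracy; this is the granularity the benchmark exploits.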

Biography: Eve Fleisig is a rising fifth-year PhD student at UC Berkeley, advised by Dan Klein. Her research lies at the intersection of natural language processing and AI ethics, with a focus on preventing societal harms of language models and ensuring that AI systems account for the perspectives of diverse populations. Previously, she received a B.S. in computer science from Princeton University. Her research has been awarded an NSF Graduate Research Fellowship, a Berkeley Chancellor’s Fellowship, an EMNLP Outstanding Paper Award, and a NAACL Outstanding Paper Award.