AI benchmarks are often designed to measure capabilities, but what happens when they become optimization targets instead?
Join us for an AIMSEC seminar, where Sang Truong from Stanford will present a measurement-theoretic framework for addressing two of AI evaluation's hardest problems: predicting model performance from sparse data, and designing benchmarks that resist gaming.
📅 Wednesday, March 25 | 11:30am–12:30pm ET
📍 GHC 6501 or Zoom
Lunch provided for in-person attendees
Please register here if you plan to attend the talk:
You can also register to meet 1-on-1 with Sang here:
Title: AI Measurement Science: Predictive and Strategically Robust Model Evaluation
Abstract: As AI systems become agentic, strategically optimized, and widely deployed, existing evaluation paradigms face structural limitations. Benchmarks are typically treated as static scoreboards, yet they increasingly function as optimization targets. This creates two core challenges: how to measure system capabilities efficiently with limited data, and how to preserve the robustness of evaluation under strategic pressure. In this talk, I present a measurement-theoretic framework for frontier AI systems centered on two directions. First, predictive evaluation. We treat evaluation itself as a predictive modeling problem. Our amortized latent variable models infer model capability directly from benchmark data and predict performance on unseen tasks from sparse observations, enabling deployment-relevant prediction even in the small-data regime. Second, evaluation robustness. We model benchmarking as a Stackelberg game between evaluator and model builder. Using an information design perspective, we show that deterministic benchmarks are inherently gameable and construct stochastic, incentive-aligned evaluation mechanisms under which genuine capability improvement becomes the dominant strategy. Together, this work advances a science of AI measurement that connects statistical predictability, strategic robustness, and real-world deployment.
