What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models
This paper addresses the challenge of user control in generative AI. The evaluation metric it offers is an incremental contribution, but a practically relevant one.
The paper introduces steerability, a measure of whether a user with a specific goal can get a generative model to produce an output that satisfies that goal. In user studies of text-to-image and large language models, the models produce high-quality outputs yet perform poorly on steerability. Simple image-based steering mechanisms improve benchmark performance by more than 2x.
How should we evaluate the quality of generative models? Many existing metrics focus on a model's producibility, i.e., the quality and breadth of outputs it can generate. However, the actual value of a generative model stems not just from what it can produce but from whether a user with a specific goal can produce an output that satisfies that goal. We refer to this property as steerability. In this paper, we first introduce a mathematical decomposition for quantifying steerability independently of producibility. Steerability is more challenging to evaluate than producibility because it requires knowing a user's goals. We address this issue by creating a benchmark task that relies on one key idea: sample an output from a generative model and ask users to reproduce it. We implement this benchmark in user studies of text-to-image and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability. These results suggest that we need to focus on improving the steerability of generative models. We show that such improvements are indeed possible: simple image-based steering mechanisms achieve more than 2x improvement on this benchmark.
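The benchmark's key idea (sample a target from the model, then ask a user to reproduce it) can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions, not the paper's protocol or metric: `toy_model`, `similarity`, the simulated "user" update rule, and all numeric constants are hypothetical stand-ins chosen only to make the loop runnable.

```python
import random


def toy_model(prompt: float, rng: random.Random) -> float:
    """Hypothetical stand-in for a generative model: a prompt maps to a noisy output."""
    return prompt + rng.gauss(0.0, 0.1)


def similarity(a: float, b: float) -> float:
    """Hypothetical similarity score in (0, 1]; 1.0 means the outputs are identical."""
    return 1.0 / (1.0 + abs(a - b))


def run_trial(rng: random.Random, max_attempts: int = 5) -> list[float]:
    """One benchmark trial: returns the similarity to the target after each attempt."""
    # Step 1: sample the target from the model itself, so the goal is
    # producible by construction and any failure reflects steering.
    hidden_prompt = rng.uniform(0.0, 10.0)
    target = toy_model(hidden_prompt, rng)

    # Step 2: a simulated user, who sees the target but not the hidden
    # prompt, starts from a random prompt and tries to reproduce it.
    guess = rng.uniform(0.0, 10.0)
    scores = []
    for _ in range(max_attempts):
        output = toy_model(guess, rng)
        scores.append(similarity(output, target))
        # Assumed user behavior: move the prompt halfway toward closing the gap.
        guess += 0.5 * (target - output)
    return scores


def mean_first_and_last(n_trials: int, seed: int) -> tuple[float, float]:
    """Average similarity on the first and on the final attempt across trials."""
    rng = random.Random(seed)
    trials = [run_trial(rng) for _ in range(n_trials)]
    first = sum(t[0] for t in trials) / n_trials
    last = sum(t[-1] for t in trials) / n_trials
    return first, last


if __name__ == "__main__":
    first, last = mean_first_and_last(n_trials=200, seed=1)
    print(f"mean similarity, first attempt: {first:.3f}")
    print(f"mean similarity, final attempt: {last:.3f}")
```

Because every target was generated by the model, a low final-attempt similarity in a setup like this would point to a steering problem and not to a limit on what the model can produce. In a real study, the simulated user would be replaced by human participants and the toy similarity by a measure appropriate to images or text.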