CARES: Context-Aware Resolution Selector for VLMs
The work addresses efficiency issues for users of VLMs in multimodal tasks, though the contribution is incremental because it builds on existing VLM architectures.
The paper tackles the high compute and latency that vision-language models (VLMs) incur when processing images at high resolution. It introduces CARES, a lightweight module that predicts the minimal sufficient input resolution, preserving task performance while reducing compute by up to 80% across five multimodal benchmarks.
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates the visual token count, often to 97-99% of total tokens, resulting in high compute and latency even when low-resolution images would suffice. We introduce \emph{CARES}, a \textbf{C}ontext-\textbf{A}ware \textbf{R}esolution \textbf{S}elector: a lightweight preprocessing module that, given an image-query pair, predicts the \emph{minimal} sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict the resolution at which a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of candidate resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
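To make the interpolation step concrete, the following is a minimal sketch of one way a discrete resolution classifier could yield a continuous resolution at inference: take the probability-weighted average of the candidate resolutions. The abstract does not specify the interpolation rule, so this rule, the candidate list, the patch size, and the function names `interpolate_resolution` and `visual_token_count` are all illustrative assumptions, not the paper's implementation.

```python
from math import ceil, exp

# Hypothetical candidate resolutions (longest side, in pixels); not from the paper.
CANDIDATE_RESOLUTIONS = [384, 768, 1024]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_resolution(logits, resolutions=CANDIDATE_RESOLUTIONS):
    """Assumed rule: expected resolution under the classifier's probabilities."""
    probs = softmax(logits)
    return sum(p * r for p, r in zip(probs, resolutions))


def visual_token_count(side, patch=14):
    """Tokens for a square image under a plain ViT-style patch grid (assumption)."""
    per_side = ceil(side / patch)
    return per_side * per_side


if __name__ == "__main__":
    # Classifier logits for one image-query pair (made-up values).
    side = interpolate_resolution([2.0, 0.5, -1.0])
    print(round(side), visual_token_count(round(side)))
```

Under the assumed patch grid, the token count grows quadratically with the side length: a 1344-pixel square image maps to 9216 visual tokens, while a 448-pixel one maps to 1024. This is why selecting a lower sufficient resolution can cut compute substantially.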