CR LGMay 20, 2025

Lessons from Defending Gemini Against Indirect Prompt Injections

Chongyang Shi, Sharon Lin, Shuang Song, Jamie Hayes, Ilia Shumailov, Itay Yona, Juliette Pluto, Aneesh Pappu, Christopher A. Choquette-Choo, Milad Nasr, Chawin Sitawarin, Gena Gibson

DeepMind

arXiv:2505.14534v130.137 citationsh-index: 31

Originality Synthesis-oriented

AI Analysis

This addresses security risks for users of AI models like Gemini that interact with untrusted data, though it is incremental as it focuses on evaluation and lessons rather than a new defense method.

The paper tackles the problem of adversarial robustness in Gemini models against indirect prompt injections, where malicious instructions in untrusted data can cause models to mishandle user data or permissions. It describes Google DeepMind's evaluation approach using an adversarial framework with adaptive attacks, which helps improve Gemini's resilience.

Gemini is increasingly used to perform tasks on behalf of users, where function-calling and tool-use capabilities enable the model to access user data. Some tools, however, require access to untrusted data introducing risk. Adversaries can embed malicious instructions in untrusted data which cause the model to deviate from the user's expectations and mishandle their data or permissions. In this report, we set out Google DeepMind's approach to evaluating the adversarial robustness of Gemini models and describe the main lessons learned from the process. We test how Gemini performs against a sophisticated adversary through an adversarial evaluation framework, which deploys a suite of adaptive attack techniques to run continuously against past, current, and future versions of Gemini. We describe how these ongoing evaluations directly help make Gemini more resilient against manipulation.

View on arXiv PDF

Similar