CL | Jan 29, 2025

RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts

Eujeong Choi, Younghun Jeong, Soomin Kim, Won Ik Cho

arXiv:2501.17715v1 | 6.72 citations | Has Code | PACLIC

Originality: Synthesis-oriented

AI Analysis

This work addresses security concerns for developers of conversational AI systems by providing a dataset for testing models against unauthorized access or manipulation. The contribution is incremental, however, since it is limited to dataset creation in a single language.

The authors tackled the problem of jailbreaking in conversational agents by creating RICoTA, a Korean red teaming dataset of 609 prompts from in-the-wild user dialogues to evaluate LLMs' ability to identify and mitigate such risks.

User interactions with conversational agents (CAs) evolve in the era of heavily guardrailed large language models (LLMs). As users push beyond programmed boundaries to explore and build relationships with these systems, there is a growing concern regarding the potential for unauthorized access or manipulation, commonly referred to as "jailbreaking." Moreover, with CAs that possess highly human-like qualities, users show a tendency toward initiating intimate sexual interactions or attempting to tame their chatbots. To capture these in-the-wild interactions and reflect them in chatbot designs, we propose RICoTA, a Korean red teaming dataset that consists of 609 prompts challenging LLMs with in-the-wild user-made dialogues capturing jailbreak attempts. We utilize user-chatbot conversations that were self-posted on a Korean Reddit-like community, containing specific testing and gaming intentions with a social chatbot. With these prompts, we aim to evaluate LLMs' ability to identify the type of conversation and users' testing purposes to derive chatbot design implications for mitigating jailbreaking risks. Our dataset will be made publicly available via GitHub.
