AI | May 15
GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction
Liangyi Huang, Zichen Liu, Fei Shao et al.
Security knowledge graphs can provide computable external memory for security agents, but constructing them from long-form cyber threat intelligence (CTI) remains difficult: LLMs often lack grounded security-domain knowledge, and end-to-end document-to-graph training is hard to supervise with cheap, stable rewards. We present GRID (Graph Representation of Intelligence Data), an end-to-end framework for security text knowledge graph construction. GRID first builds security-domain supervision from CTI articles by creating traceable article-graph alignments through graph extraction and knowledge-graph-conditioned text revision. It then turns document-to-graph learning into a scripted task bank combining four-option multi-select questions with triple-level regex matching targets, yielding more stable task-specific rewards than repeatedly scoring full graph outputs with an LLM judge. Using this supervision pipeline, we train two 4B extractors based on Qwen3-4B-Instruct-2507: a primary Task-bank Reward model and a secondary End2End Reward model with LLM-as-judge precision/recall rewards. On 249 CTI articles from GRID, CASIE, CTINexus, MalKG, and SecureNLP, the Task-bank Reward model with the ontology-guided GRID extraction pipeline reaches 84.62% source-averaged precision, 64.91% source-averaged recall, and 68.53% Avg F1, achieving the best source-averaged recall and near-top Avg F1 with lower token usage and deployment cost. The End2End Reward model reaches 76.91% precision, 53.85% recall, and 58.06% Avg F1. Further analyses show that task-bank rewards can be built once offline and reused across later post-training runs, and that they outperform the online End2End LLM-as-judge reward as well as weaker alternatives such as Choice-only Reward and End2End SFT without RL.
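To make the task-bank idea concrete, the following is a minimal sketch of how a scripted reward might combine four-option multi-select questions with triple-level regex targets. It is an illustration under stated assumptions, not the GRID implementation: the names `ChoiceTask`, `TripleTask`, and `task_bank_reward`, the exact-match scoring rule, and the equal weighting of tasks are all hypothetical choices not specified in the abstract.

```python
import re
from dataclasses import dataclass


@dataclass
class ChoiceTask:
    # Hypothetical: the set of correct option letters among A-D.
    correct: frozenset


@dataclass
class TripleTask:
    # Hypothetical: one regex each for subject, relation, and object.
    patterns: tuple


def choice_reward(task, answer):
    """Return 1.0 only if the selected options exactly match the key."""
    return 1.0 if frozenset(answer) == task.correct else 0.0


def triple_reward(task, triples):
    """Return 1.0 if any predicted triple matches all three regexes."""
    for triple in triples:
        if len(triple) != 3:
            continue
        if all(re.search(p, part, re.IGNORECASE)
               for p, part in zip(task.patterns, triple)):
            return 1.0
    return 0.0


def task_bank_reward(choice_tasks, answers, triple_tasks, triples):
    """Average the per-task scores (equal weighting is an assumption)."""
    scores = [choice_reward(t, a) for t, a in zip(choice_tasks, answers)]
    scores += [triple_reward(t, triples) for t in triple_tasks]
    return sum(scores) / len(scores) if scores else 0.0


# Example usage with made-up tasks and model outputs.
choice_tasks = [ChoiceTask(frozenset({"A", "C"}))]
triple_tasks = [TripleTask((r"emotet", r"uses?", r"powershell"))]
predicted = [("Emotet", "uses", "PowerShell")]
reward = task_bank_reward(choice_tasks, [{"A", "C"}], triple_tasks, predicted)
```

Because every score comes from a set comparison or a regex match, the same task bank can be evaluated repeatedly without calling an LLM judge, which is the property the abstract credits for reward stability and reuse.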
CL | May 12
PreScam: A Benchmark for Predicting Scam Progression from Early Conversations
Weixiang Sun, Shang Ma, Yiyang Li et al.
Conversational scams, such as romance and investment scams, are emerging as a major form of online fraud. Unlike one-shot scam lures such as fake lottery or unpaid toll messages, they unfold through multi-turn conversations in which scammers gradually manipulate victims using evolving psychological techniques. However, existing research mainly focuses on static scam detection or synthetic scams, leaving open whether language models can understand how real-world scams progress over time. We introduce PreScam, a benchmark for modeling scam progression from early conversations. Built from user-submitted scam reports, PreScam filters and structures 177,989 raw reports into 11,573 conversational scam instances spanning 20 scam categories. Each instance is hierarchically structured according to the scam lifecycle defined by the proposed scam kill chain, and further annotated at the turn level with scammer psychological actions and victim responses. We benchmark models on two tasks: real-time termination prediction, which estimates whether a conversation is approaching the termination stage, and scammer action prediction, which forecasts the scammer's subsequent actions. Results show a clear gap between surface-level fluency and progression modeling: supervised encoders substantially outperform zero-shot LLMs on real-time termination prediction, while next-action prediction remains only moderately successful even for strong LLMs. Taken together, these results show that current models can capture some scam-related cues, yet still struggle to track how risk escalates and how manipulation unfolds across turns.
HC | Apr 15, 2025
The Obvious Invisible Threat: LLM-Powered GUI Agents' Vulnerability to Fine-Print Injections
Chaoran Chen, Zhiping Zhang, Bingcan Guo et al.
A Large Language Model (LLM) powered GUI agent is a specialized autonomous system that performs tasks on the user's behalf according to high-level instructions. It does so by perceiving and interpreting the graphical user interfaces (GUIs) of relevant apps, often visually, inferring the necessary sequences of actions, and then interacting with the GUIs by executing actions such as clicking, typing, and tapping. To complete real-world tasks, such as filling forms or booking services, GUI agents often need to process and act on sensitive user data. However, this autonomy introduces new privacy and security risks. Adversaries can inject malicious content into GUIs that alters agent behavior or induces unintended disclosure of private information. These attacks often exploit the discrepancy in visual saliency between agents and human users, or the agent's limited ability to detect violations of contextual integrity in task automation. In this paper, we characterize six types of such attacks and conduct an experimental study testing them with six state-of-the-art GUI agents, 234 adversarial webpages, and 39 human participants. Our findings suggest that GUI agents are highly vulnerable, particularly to contextually embedded threats. Moreover, human users are also susceptible to many of these attacks, indicating that simple human oversight may not reliably prevent failures. This misalignment highlights the need for privacy-aware agent design. We propose practical defense strategies to inform the development of safer and more reliable GUI agents.
SE | Apr 4
From UI to Code: Mobile Ads Detection via LLM-Unified Static-Dynamic Analysis
Shang Ma, Wei Cheng, Yanfang Ye et al.
Mobile advertisements (ads) are essential to the app economy, yet detecting them is challenging because ad content is dynamically fetched from remote servers and rendered through diverse user interfaces (UIs), making ads difficult to locate and trigger at runtime. To address this challenge, we present ADWISE, a novel framework that formulates mobile ads detection as LLM-guided, ad-oriented UI exploration. ADWISE first performs static program analysis to identify UI widgets used to place ads, which we call ad widgets. It then uses a grounded LLM reasoning loop to navigate toward and trigger these widgets under three complementary domain guidance signals: (1) WTG-based guidance, which provides global transition priors from a statically constructed window transition graph (WTG); (2) semantic guidance, which reasons over app functionality to prioritize user-likely interaction paths; and (3) structural guidance, which applies retrieval-augmented generation to match the current UI against recurring ad-heavy layouts from a knowledge base. By combining static program analysis with LLM-based reasoning over UI structure, app semantics, and retrieved analogies, ADWISE enables more effective ads detection in complex mobile UIs. Experiments on 100 benchmark apps show that ADWISE outperforms state-of-the-art baselines by 25.60% in ad widget detection. In addition, ADWISE uncovers 34.34% more ad regulation violations across six categories, directly benefiting downstream ad regulation.
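As a rough illustration of the WTG-based guidance signal described above, the sketch below runs a breadth-first search over a window transition graph to find the shortest sequence of UI actions from the current window to one that hosts an ad widget. This is not the ADWISE implementation: the adjacency-list encoding, the function name `shortest_path_to_ad_window`, and the window and widget names are hypothetical, and the abstract does not say how the transition priors are actually computed or passed to the LLM.

```python
from collections import deque


def shortest_path_to_ad_window(wtg, start, ad_windows):
    """Breadth-first search over a window transition graph.

    wtg maps a window name to a list of (widget, next_window) edges.
    Returns the list of (widget, next_window) steps leading to the
    nearest window in ad_windows, [] if start already is one, or None
    if no such window is reachable.
    """
    if start in ad_windows:
        return []
    parent = {start: None}  # window -> (previous window, widget used)
    queue = deque([start])
    while queue:
        window = queue.popleft()
        for widget, nxt in wtg.get(window, []):
            if nxt in parent:
                continue  # already visited
            parent[nxt] = (window, widget)
            if nxt in ad_windows:
                # Walk back through parents to rebuild the path.
                path = []
                node = nxt
                while parent[node] is not None:
                    prev, used = parent[node]
                    path.append((used, node))
                    node = prev
                return path[::-1]
            queue.append(nxt)
    return None


# Example usage with a made-up three-window app.
wtg = {
    "Main": [("btn_settings", "Settings"), ("btn_feed", "Feed")],
    "Feed": [("item", "Detail")],
}
path = shortest_path_to_ad_window(wtg, "Main", {"Detail"})
```

A path like this could serve as a global prior in the agent's prompt, leaving the semantic and structural signals to handle transitions that the static graph misses.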