h-index24
12papers
160citations
Novelty54%
AI Score55

12 Papers

CRJan 23
DeMark: A Query-Free Black-Box Attack on Deepfake Watermarking Defenses

Wei Song, Zhenchang Xing, Liming Zhu et al.

The rapid proliferation of realistic deepfakes has raised urgent concerns over their misuse, motivating the use of defensive watermarks in synthetic images for reliable detection and provenance tracking. However, this defense paradigm assumes such watermarks are inherently resistant to removal. We challenge this assumption with DeMark, a query-free black-box attack framework that targets defensive image watermarking schemes for deepfakes. DeMark exploits latent-space vulnerabilities in encoder-decoder watermarking models through a compressive sensing based sparsification process, suppressing watermark signals while preserving perceptual and structural realism appropriate for deepfakes. Across eight state-of-the-art watermarking schemes, DeMark reduces watermark detection accuracy from 100% to 32.9% on average while maintaining natural visual quality, outperforming existing attacks. We further evaluate three defense strategies, including image super resolution, sparse watermarking, and adversarial training, and find them largely ineffective. These results demonstrate that current encoder decoder watermarking schemes remain vulnerable to latent-space manipulations, underscoring the need for more robust watermarking methods to safeguard against deepfakes.

CVJan 28
FAIRT2V: Training-Free Debiasing for Text-to-Video Diffusion Models

Haonan Zhong, Wei Song, Tingxu Han et al.

Text-to-video (T2V) diffusion models have achieved rapid progress, yet their demographic biases, particularly gender bias, remain largely unexplored. We present FairT2V, a training-free debiasing framework for text-to-video generation that mitigates encoder-induced bias without finetuning. We first analyze demographic bias in T2V models and show that it primarily originates from pretrained text encoders, which encode implicit gender associations even for neutral prompts. We quantify this effect with a gender-leaning score that correlates with bias in generated videos. Based on this insight, FairT2V mitigates demographic bias by neutralizing prompt embeddings via anchor-based spherical geodesic transformations while preserving semantics. To maintain temporal coherence, we apply debiasing only during early identity-forming steps through a dynamic denoising schedule. We further propose a video-level fairness evaluation protocol combining VideoLLM-based reasoning with human verification. Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality.

CVFeb 11Code
VideoSTF: Stress-Testing Output Repetition in Video Large Language Models

Yuxin Cao, Wei Song, Shangzhi Xu et al.

Video Large Language Models (VideoLLMs) have recently achieved strong performance in video understanding tasks. However, we identify a previously underexplored generation failure: severe output repetition, where models degenerate into self-reinforcing loops of repeated phrases or sentences. This failure mode is not captured by existing VideoLLM benchmarks, which focus primarily on task accuracy and factual correctness. We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. VideoSTF formalizes repetition using three complementary n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos together with a library of controlled temporal transformations. Using VideoSTF, we conduct pervasive testing, temporal stress testing, and adversarial exploitation across 10 advanced VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Moreover, we show that simple temporal transformations can efficiently induce repetitive degeneration in a black-box setting, exposing output repetition as an exploitable security vulnerability. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems. Our evaluation code and scripts are available at: https://github.com/yuxincao22/VideoSTF_benchmark.

PLJun 14, 2017Code
Understanding and Analyzing Java Reflection

Yue Li, Tian Tan, Jingling Xue

Java reflection has been increasingly used in a wide range of software. It allows a software system to inspect and/or modify the behaviour of its classes, interfaces, methods and fields at runtime, enabling the software to adapt to dynamically changing runtime environments. However, this dynamic language feature imposes significant challenges to static analysis, because the behaviour of reflection-rich software is logically complex and statically hard to predict, especially when manipulated frequently by statically unknown string values. As a result, existing static analysis tools either ignore reflection or handle it partially, resulting in missed, important behaviours, i.e., unsound results. Therefore, improving or even achieving soundness in (static) reflection analysis -- an analysis that infers statically the behaviour of reflective code -- will provide significant benefits to many analysis clients, such as bug detectors, security analyzers and program verifiers. This paper makes two contributions: we provide a comprehensive understanding of Java reflection through examining its underlying concept, API and real-world usage, and, building on this, we introduce a new static approach to resolving Java reflection effectively in practice. We have implemented our reflection analysis in an open-source tool, called SOLAR, and evaluated its effectiveness extensively with large Java programs and libraries. Our experimental results demonstrate that SOLAR is able to (1) resolve reflection more soundly than the state-of-the-art reflection analysis; (2) automatically and accurately identify the parts of the program where reflection is resolved unsoundly or imprecisely; and (3) guide users to iteratively refine the analysis results by using lightweight annotations until their specific requirements are satisfied.

PLJan 20, 2017Code
Demand-Driven Pointer Analysis with Strong Updates via Value-Flow Refinement

Yulei Sui, Jingling Xue

We present a new demand-driven flow- and context-sensitive pointer analysis with strong updates for C programs, called SUPA, that enables computing points-to information via value-flow refinement, in environments with small time and memory budgets such as IDEs. We formulate SUPA by solving a graph reachability problem on an inter-procedural value-flow graph representing a program's def-use chains, which are pre-computed efficiently but over-approximately. To answer a client query (a request for a variable's points-to set), SUPA reasons about the flow of values along the pre-computed def-use chains sparsely (rather than across all program points), by performing only the work necessary for the query (rather than analyzing the whole program). In particular, strong updates are performed to filter out spurious def-use chains through value-flow refinement as long as the total budget is not exhausted. SUPA facilitates efficiency and precision tradeoffs by applying different pointer analyses in a hybrid multi-stage analysis framework. We have implemented SUPA in LLVM (3.5.0) and evaluate it by choosing uninitialized pointer detection as a major client on 18 open-source C programs. As the analysis budget increases, SUPA achieves improved precision, with its single-stage flow-sensitive analysis reaching 97.4% of that achieved by whole-program flow-sensitive analysis by consuming about 0.18 seconds and 65KB of memory per query, on average (with a budget of at most 10000 value-flow edges per query). With context-sensitivity also considered, SUPA's two- stage analysis becomes more precise for some programs but also incurs more analysis times. SUPA is also amenable to parallelization. A parallel implementation of its single-stage flow-sensitive analysis achieves a speedup of up to 6.9x with an average of 3.05x a 8-core machine with respect its sequential version.

MMAug 14, 2025
Failures to Surface Harmful Contents in Video Large Language Models

Yuxin Cao, Wei Song, Derui Wang et al.

Video Large Language Models (VideoLLMs) are increasingly deployed on numerous critical applications, where users rely on auto-generated summaries while casually skimming the video stream. We show that this interaction hides a critical safety gap: if harmful content is embedded in a video, either as full-frame inserts or as small corner patches, state-of-the-art VideoLLMs rarely mention the harmful content in the output, despite its clear visibility to human viewers. A root-cause analysis reveals three compounding design flaws: (1) insufficient temporal coverage resulting from the sparse, uniformly spaced frame sampling used by most leading VideoLLMs, (2) spatial information loss introduced by aggressive token downsampling within sampled frames, and (3) encoder-decoder disconnection, whereby visual cues are only weakly utilized during text generation. Leveraging these insights, we craft three zero-query black-box attacks, aligning with these flaws in the processing pipeline. Our large-scale evaluation across five leading VideoLLMs shows that the harmfulness omission rate exceeds 90% in most cases. Even when harmful content is clearly present in all frames, these models consistently fail to identify it. These results underscore a fundamental vulnerability in current VideoLLMs' designs and highlight the urgent need for sampling strategies, token compression, and decoding mechanisms that guarantee semantic coverage rather than speed alone.

CVSep 25, 2025
Poisoning Prompt-Guided Sampling in Video Large Language Models

Yuxin Cao, Wei Song, Jingling Xue et al.

Video Large Language Models (VideoLLMs) have emerged as powerful tools for understanding videos, supporting tasks such as summarization, captioning, and question answering. Their performance has been driven by advances in frame sampling, progressing from uniform-based to semantic-similarity-based and, most recently, prompt-guided strategies. While vulnerabilities have been identified in earlier sampling strategies, the safety of prompt-guided sampling remains unexplored. We close this gap by presenting PoisonVID, the first black-box poisoning attack that undermines prompt-guided sampling in VideoLLMs. PoisonVID compromises the underlying prompt-guided sampling mechanism through a closed-loop optimization strategy that iteratively optimizes a universal perturbation to suppress harmful frame relevance scores, guided by a depiction set constructed from paraphrased harmful descriptions leveraging a shadow VideoLLM and a lightweight language model, i.e., GPT-4o-mini. Comprehensively evaluated on three prompt-guided sampling strategies and across three advanced VideoLLMs, PoisonVID achieves 82% - 99% attack success rate, highlighting the importance of developing future advanced sampling strategies for VideoLLMs.

AIAug 18, 2025
Help or Hurdle? Rethinking Model Context Protocol-Augmented Large Language Models

Wei Song, Haonan Zhong, Ziqi Ding et al.

The Model Context Protocol (MCP) enables large language models (LLMs) to access external resources on demand. While commonly assumed to enhance performance, how LLMs actually leverage this capability remains poorly understood. We introduce MCPGAUGE, the first comprehensive evaluation framework for probing LLM-MCP interactions along four key dimensions: proactivity (self-initiated tool use), compliance (adherence to tool-use instructions), effectiveness (task performance post-integration), and overhead (computational cost incurred). MCPGAUGE comprises a 160-prompt suite and 25 datasets spanning knowledge comprehension, general reasoning, and code generation. Our large-scale evaluation, spanning six commercial LLMs, 30 MCP tool suites, and both one- and two-turn interaction settings, comprises around 20,000 API calls and over USD 6,000 in computational cost. This comprehensive study reveals four key findings that challenge prevailing assumptions about the effectiveness of MCP integration. These insights highlight critical limitations in current AI-tool integration and position MCPGAUGE as a principled benchmark for advancing controllable, tool-augmented LLMs.

NIMay 2, 2025
ai.txt: A Domain-Specific Language for Guiding AI Interactions with the Internet

Yuekang Li, Wei Song, Bangshuo Zhu et al.

We introduce ai.txt, a novel domain-specific language (DSL) designed to explicitly regulate interactions between AI models, agents, and web content, addressing critical limitations of the widely adopted robots.txt standard. As AI increasingly engages with online materials for tasks such as training, summarization, and content modification, existing regulatory methods lack the necessary granularity and semantic expressiveness to ensure ethical and legal compliance. ai.txt extends traditional URL-based access controls by enabling precise element-level regulations and incorporating natural language instructions interpretable by AI systems. To facilitate practical deployment, we provide an integrated development environment with code autocompletion and automatic XML generation. Furthermore, we propose two compliance mechanisms: XML-based programmatic enforcement and natural language prompt integration, and demonstrate their effectiveness through preliminary experiments and case studies. Our approach aims to aid the governance of AI-Internet interactions, promoting responsible AI use in digital ecosystems.

NEOct 30, 2020
Fusion-Catalyzed Pruning for Optimizing Deep Learning on Intelligent Edge Devices

Guangli Li, Xiu Ma, Xueying Wang et al.

The increasing computational cost of deep neural network models limits the applicability of intelligent applications on resource-constrained edge devices. While a number of neural network pruning methods have been proposed to compress the models, prevailing approaches focus only on parametric operators (e.g., convolution), which may miss optimization opportunities. In this paper, we present a novel fusion-catalyzed pruning approach, called FuPruner, which simultaneously optimizes the parametric and non-parametric operators for accelerating neural networks. We introduce an aggressive fusion method to equivalently transform a model, which extends the optimization space of pruning and enables non-parametric operators to be pruned in a similar manner as parametric operators, and a dynamic filter pruning method is applied to decrease the computational cost of models while retaining the accuracy requirement. Moreover, FuPruner provides configurable optimization options for controlling fusion and pruning, allowing much more flexible performance-accuracy trade-offs to be made. Evaluation with state-of-the-art residual neural networks on five representative intelligent edge platforms, Jetson TX2, Jetson Nano, Edge TPU, NCS, and NCS2, demonstrates the effectiveness of our approach, which can accelerate the inference of models on CIFAR-10 and ImageNet datasets.

SEMay 4, 2019
A Feature-Oriented Corpus for Understanding, Evaluating and Improving Fuzz Testing

Xiaogang Zhu, Xiaotao Feng, Tengyun Jiao et al.

Fuzzing is a promising technique for detecting security vulnerabilities. Newly developed fuzzers are typically evaluated in terms of the number of bugs found on vulnerable programs/binaries. However,existing corpora usually do not capture the features that prevent fuzzers from finding bugs, leading to ambiguous conclusions on the pros and cons of the fuzzers evaluated. A typical example is that Driller detects more bugs than AFL, but its evaluation cannot establish if the advancement of Driller stems from the concolic execution or not, since, for example, its ability in resolving a dataset`s magic values is unclear. In this paper, we propose to address the above problem by generating corpora based on search-hampering features. As a proof-of-concept, we have designed FEData, a prototype corpus that currently focuses on four search-hampering features to generate vulnerable programs for fuzz testing. Unlike existing corpora that can only answer "how", FEData can also further answer "why" by exposing (or understanding) the reasons for the identified weaknesses in a fuzzer. The "why" information serves as the key to the improvement of fuzzers.

CRDec 16, 2016
Ripple: Reflection Analysis for Android Apps in Incomplete Information Environments

Yifei Zhang, Tian Tan, Yue Li et al.

Despite its widespread use in Android apps, reflection poses graving problems for static security analysis. Currently, string inference is applied to handle reflection, resulting in significantly missed security vulnerabilities. In this paper, we bring forward the ubiquity of incomplete information environments (IIEs) for Android apps, where some critical data-flows are missing during static analysis, and the need for resolving reflective calls under IIEs. We present Ripple, the first IIE-aware static reflection analysis for Android apps that resolves reflective calls more soundly than string inference. Validation with 17 popular Android apps from Google Play demonstrates the effectiveness of Ripple in discovering reflective targets with a low false positive rate. As a result, Ripple enables FlowDroid, a taint analysis for Android apps, to find hundreds of sensitive data leakages that would otherwise be missed. As a fundamental analysis, Ripple will be valuable for many security analysis clients, since more program behaviors can now be analyzed under IIEs.