Shenghao Wu

h-index: 3

4 papers

96 citations

4 Papers

12.6 · CR · Jun 9

When Poison Fails After Retrieval: Revisiting Corpus Poisoning under Chunking and Reranking Pipelines

Xi Nie, Hongwei Li, Shenghao Wu et al.

Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks that manipulate downstream model outputs through malicious knowledge injection. Existing studies mainly evaluate poisoning under simplified retrieval settings, overlooking practical RAG pipelines involving document chunking, dense retrieval, reranking, and grounded generation. In this paper, we revisit corpus poisoning under realistic multi-stage retrieval pipelines and show that many existing attacks substantially degrade after reranking despite achieving high retrieval-stage relevance. We identify retrieval granularity mismatch as a key reason for this failure: document-level adversarial signals are often fragmented during chunking, while rerankers favor locally coherent and answer-bearing passages rather than globally optimized semantic similarity. Based on this observation, we propose Chunk-aware and Rerank-Consistent Poisoning (CRCP), a poisoning framework that jointly optimizes retrieval relevance, reranker consistency, and chunk-boundary robustness. CRCP explicitly models chunking transformations during optimization to generate locally self-contained adversarial passages that remain effective under varying chunking configurations. Experiments on standard RAG benchmarks with multiple retrievers and rerankers show that existing poisoning methods are highly sensitive to chunk size and reranking strategies, whereas CRCP achieves substantially higher attack success rates and stronger robustness across realistic retrieval pipelines. Our findings highlight an important realism gap in current RAG security evaluation and suggest that poisoning in modern RAG systems should be studied as a multi-stage retrieval consistency problem rather than a retrieval-only problem.
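The fragmentation effect the abstract describes can be illustrated with a minimal sketch, assuming a simple fixed-size word chunker. The helper names `chunk_words` and `survives_chunking` are hypothetical and are not taken from the paper; the sketch only shows how a passage written at document level can be split across chunk boundaries, and says nothing about how CRCP itself is implemented.

```python
def chunk_words(text, size, overlap=0):
    """Split text into chunks of `size` words; neighbours share `overlap` words.

    Hypothetical helper: real RAG pipelines may chunk by tokens,
    sentences, or document structure instead.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def survives_chunking(text, span, size, overlap=0):
    """Return True if `span` appears intact inside at least one chunk."""
    return any(span in chunk for chunk in chunk_words(text, size, overlap))
```

Checking the same span under several chunk sizes and overlaps shows whether it stays self-contained, which is the sensitivity to chunking configuration that the abstract reports for existing attacks.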

12.7 · NE · Jun 29

Semantics-Aware Bilevel Co-Evolution: Towards Automated Multicomponent Algorithm Design

Zhiyao Zhang, Shenghao Wu, Xingyu Wu et al.

LLM-assisted evolutionary search (LES) has emerged as a promising paradigm for automated algorithm design. However, existing methods suffer from two inherent limitations when applied to the automated design of complex real-world algorithms, which usually consist of multiple components. The first limitation is that they either focus on modifying entire algorithms, making it difficult to reuse high-quality components, or concentrate on component refinement within a limited set of predefined multicomponent configurations. The second limitation is the insufficient explicit modeling and exploitation of algorithm semantics. These limitations severely degrade search efficiency and hinder effective exploration of complex design spaces. Therefore, this paper proposes STABLE (Semantics-Aware Bilevel Co-Evolution), an LES method purpose-built for automated multicomponent algorithm design that introduces structural algorithm formulation and semantics-driven evolution. In STABLE, complex algorithms are organized into hierarchical and modular architectures rooted in domain knowledge, aligning the search space with their intrinsic compositional traits. Based on this structured algorithm formulation, STABLE simultaneously optimizes high-level multicomponent configurations and low-level functional components, enabling coordinated cross-level updates while maintaining suitable granularities for design space exploration. At each level, STABLE establishes a multi-faceted semantic model to assist LLMs in capturing structural correlations, functional compatibilities, and inherent rationalities among algorithm components. This semantic model serves as the core guidance for evolutionary search, enabling principled algorithm generation and algorithm evaluation. Extensive experiments demonstrate that STABLE outperforms both human-designed baselines and algorithms produced by advanced LES methods.

11.8 · CV · Jun 1, 2025

Perceptual Inductive Bias Is What You Need Before Contrastive Learning

Tianqin Li, Junru Zhao, Dunhan Jiang et al.

David Marr's seminal theory of human perception stipulates that visual processing is a multi-stage process, prioritizing the derivation of boundary and surface properties before forming semantic object representations. In contrast, contrastive representation learning frameworks typically bypass this explicit multi-stage approach, defining their objective as the direct learning of a semantic representation space for objects. While effective in general contexts, this approach sacrifices the inductive biases of vision, leading to slower convergence and to shortcut learning that results in texture bias. In this work, we demonstrate that leveraging Marr's multi-stage theory, by first constructing boundary and surface-level representations using perceptual constructs from early visual processing stages and subsequently training for object semantics, leads to 2x faster convergence on ResNet18, improved final representations on semantic segmentation, depth estimation, and object recognition, and enhanced robustness and out-of-distribution capability. Overall, we propose a pretraining stage before the general contrastive representation pretraining to further enhance the final representation quality and reduce the overall convergence time via inductive bias from human vision systems.

17.9 · ML · May 25, 2023 · Code

Counterfactual Generative Models for Time-Varying Treatments

Shenghao Wu, Wenbin Zhou, Minshuo Chen et al.

Estimating the counterfactual outcome of treatment is essential for decision-making in public health and clinical science, among others. Often, treatments are administered in a sequential, time-varying manner, leading to an exponentially increased number of possible counterfactual outcomes. Furthermore, in modern applications, the outcomes are high-dimensional, and conventional average treatment effect estimation fails to capture disparities across individuals. To tackle these challenges, we propose a novel conditional generative framework capable of producing counterfactual samples under time-varying treatment, without the need for explicit density estimation. Our method carefully addresses the distribution mismatch between the observed and counterfactual distributions via a loss function based on inverse probability re-weighting, and supports integration with state-of-the-art conditional generative models such as guided diffusion and the conditional variational autoencoder. We present a thorough evaluation of our method using both synthetic and real-world data. Our results demonstrate that our method is capable of generating high-quality counterfactual samples and outperforms the state-of-the-art baselines.
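As a rough illustration of inverse probability re-weighting for sequential treatments, the sketch below uses the textbook form in which each unit's weight is the inverse of the product of its per-step treatment probabilities. This is an assumption for illustration, not the paper's loss: the function names, the input layout, and the self-normalised average are all hypothetical, and the propensities are assumed to be already estimated.

```python
import math


def ipw_weights(propensities):
    """Compute one inverse-probability weight per unit.

    propensities[i][t] is assumed to be the estimated probability that
    unit i received its observed treatment at step t, given its history.
    """
    weights = []
    for per_step in propensities:
        if any(not 0.0 < p <= 1.0 for p in per_step):
            raise ValueError("propensities must lie in (0, 1]")
        # Product over time steps, then invert.
        weights.append(1.0 / math.prod(per_step))
    return weights


def weighted_loss(losses, weights):
    """Self-normalised weighted average of per-unit losses (an assumed choice)."""
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, losses)) / total
```

Because the weight multiplies across time steps, units with unlikely treatment sequences receive large weights, which is why practical implementations often clip or normalise them.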