LGOct 27, 2023
Optimal Transport for Treatment Effect EstimationHao Wang, Zhichao Chen, Jiajun Fan et al. · pku
Estimating conditional average treatment effect from observational data is highly challenging due to the existence of treatment selection bias. Prevalent methods mitigate this issue by aligning distributions of different treatment groups in the latent space. However, there are two critical problems that these methods fail to address: (1) mini-batch sampling effects (MSE), which causes misalignment in non-ideal mini-batches with outcome imbalance and outliers; (2) unobserved confounder effects (UCE), which results in inaccurate discrepancy calculation due to the neglect of unobserved confounders. To tackle these problems, we propose a principled approach named Entire Space CounterFactual Regression (ESCFR), which is a new take on optimal transport in the context of causality. Specifically, based on the framework of stochastic optimal transport, we propose a relaxed mass-preserving regularizer to address the MSE issue and design a proximal factual outcome regularizer to handle the UCE issue. Extensive experiments demonstrate that our proposed ESCFR can successfully tackle the treatment selection bias and achieve significantly better performance than state-of-the-art methods.
HCOct 20, 2022
3D Human Mesh Construction Leveraging Wi-FiYichao Wang, Jie Yang
In this paper, we present, Wi-Mesh, a WiFi vision-based 3D human mesh construction system. Our system leverages the advances of WiFi to visualize the shape and deformations of the human body for 3D mesh construction. In particular, it leverages multiple transmitting and receiving antennas on WiFi devices to estimate the two-dimensional angle of arrival (2D AoA) of the WiFi signal reflections to enable WiFi devices to see the physical environment as we humans do. It then extracts only the images of the human body from the physical environment and leverages deep learning models to digitize the extracted human body into a 3D mesh representation. Experimental evaluation under various indoor environments shows that Wi-Mesh achieves an average vertices location error of 2.81cm and joint position error of 2.4cm, which is comparable to the systems that utilize specialized and dedicated hardware. The proposed system has the advantage of re-using the WiFi devices that already exist in the environment for potential mass adoption. It can also work in non-line of sight (NLoS), poor lighting conditions, and baggy clothes, where the camera-based systems do not work well.
64.9IRMay 17Code
RAGR: Review-Augmented Generative RecommendationYingyi Zhang, Junyi Li, Yejing Wang et al.
Sequential recommendation (SR) is traditionally formulated as next-item prediction over a chronological sequence of interacted items. Although recent generative recommendation (GR) methods introduce new machinery, such as semantic IDs, autoregressive decoding, and unified token spaces, they largely inherit the same item-only modeling assumption. We argue that this design constitutes a structural bottleneck, because user decision-making is not purely behavioral: while item interactions reveal what users choose, review feedback often explain why they choose it by exposing latent evaluative factors. Motivated by this observation, we propose Review-Augmented Generative Recommendation (RAGR), a novel GR framework that incorporates review feedback directly into the generative user sequence rather than treating reviews as auxiliary side information. Specifically, RAGR introduces a Review-Augmented User Sequence Modeling mechanism that interleaves item semantic IDs and review semantic IDs in chronological order to construct a mixed behavioral-semantic sequence, enabling review signals to participate directly in autoregressive next-token generation. To preserve the recommendation objective, we further introduce an Item-Centric Task Generation Alignment strategy based on direct preference optimization (DPO), which encourages the model to favor item tokens over review tokens at prediction positions. Experiments on three real-world datasets show that RAGR yields consistent and significant gains over strong GR backbones across all metrics. Our code and data are available at \url{https://github.com/Zhang-Yingyi/TKDE_RAGR}.
90.3IRMay 11Code
Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge DiscoveryXiaopeng Li, Wenlin Zhang, Yingyi Zhang et al.
Deep Research agents driven by LLMs have automated the scholarly discovery pipeline, from planning and query formulation to iterative web exploration. Yet they remain constrained by a static, ``one-size-fits-all'' retrieval paradigm. Current systems fail to adaptively adjust the depth and breadth of exploration based on the user's existing expertise or latent interests, frequently resulting in reports that are either redundant for experts or overly dense for novices. To address this, we introduce Personalized Deep Research (PDR), a framework that integrates dynamic user context into the core retrieval-reasoning loop. Rather than treating personalization as a post-hoc formatting step, PDR unifies user profile modeling with iterative query development, dual-stage (private/public) retrieval, and context-aware synthesis. This allows the system to autonomously align research sub-goals with user intent and optimize the stopping criteria for evidence collection. To facilitate benchmarking, we release the PDR Dataset, covering four realistic user tasks, and propose a hybrid evaluation framework combining lexical metrics with LLM-based judgments to assess factual accuracy and personalization alignment. Experimental results against commercial baselines demonstrate that PDR significantly improves retrieval utility and report relevance, effectively bridging the gap between generic information retrieval and personalized knowledge acquisition. The resource is available to the public at https://github.com/Applied-Machine-Learning-Lab/SIGIR2026_PDR.
CLAug 11, 2022
A Comprehensive Survey of Natural Language Generation Advances from the Perspective of Digital DeceptionKeenan Jones, Enes Altuncu, Virginia N. L. Franqueira et al.
In recent years there has been substantial growth in the capabilities of systems designed to generate text that mimics the fluency and coherence of human language. From this, there has been considerable research aimed at examining the potential uses of these natural language generators (NLG) towards a wide number of tasks. The increasing capabilities of powerful text generators to mimic human writing convincingly raises the potential for deception and other forms of dangerous misuse. As these systems improve, and it becomes ever harder to distinguish between human-written and machine-generated text, malicious actors could leverage these powerful NLG systems to a wide variety of ends, including the creation of fake news and misinformation, the generation of fake online product reviews, or via chatbots as means of convincing users to divulge private information. In this paper, we provide an overview of the NLG field via the identification and examination of 119 survey-like papers focused on NLG research. From these identified papers, we outline a proposed high-level taxonomy of the central concepts that constitute NLG, including the methods used to develop generalised NLG systems, the means by which these systems are evaluated, and the popular NLG tasks and subtasks that exist. In turn, we provide an overview and discussion of each of these items with respect to current research and offer an examination of the potential roles of NLG in deception and detection systems to counteract these threats. Moreover, we discuss the broader challenges of NLG, including the risks of bias that are often exhibited by existing text generation systems. This work offers a broad overview of the field of NLG with respect to its potential for misuse, aiming to provide a high-level understanding of this rapidly developing area of research.
99.1IRMar 10
Evoking User Memory: Personalizing LLM via Recollection-Familiarity Adaptive RetrievalYingyi Zhang, Junyi Li, Wenlin Zhang et al.
Personalized large language models (LLMs) rely on memory retrieval to incorporate user-specific histories, preferences, and contexts. Existing approaches either overload the LLM by feeding all the user's past memory into the prompt, which is costly and unscalable, or simplify retrieval into a one-shot similarity search, which captures only surface matches. Cognitive science, however, shows that human memory operates through a dual process: Familiarity, offering fast but coarse recognition, and Recollection, enabling deliberate, chain-like reconstruction for deeply recovering episodic content. Current systems lack both the ability to perform recollection retrieval and mechanisms to adaptively switch between the dual retrieval paths, leading to either insufficient recall or the inclusion of noise. To address this, we propose RF-Mem (Recollection-Familiarity Memory Retrieval), a familiarity uncertainty-guided dual-path memory retriever. RF-Mem measures the familiarity signal through the mean score and entropy. High familiarity leads to the direct top-K Familiarity retrieval path, while low familiarity activates the Recollection path. In the Recollection path, the system clusters candidate memories and applies alpha-mix with the query to iteratively expand evidence in embedding space, simulating deliberate contextual reconstruction. This design embeds human-like dual-process recognition into the retriever, avoiding full-context overhead and enabling scalable, adaptive personalization. Experiments across three benchmarks and corpus scales demonstrate that RF-Mem consistently outperforms both one-shot retrieval and full-context reasoning under fixed budget and latency constraints. Our code can be found in the Reproducibility Statement.
90.1IRApr 19
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic SearchSheng Zhang, Junyi Li, Yingyi Zhang et al.
Recent advances in large language models (LLMs) have scaled the potential for reasoning and agentic search, wherein models autonomously plan, retrieve, and reason over external knowledge to answer complex queries. However, the iterative think-search loop accumulates long system memories, leading to memory dilution problem. In addition, existing memory management methods struggle to capture fine-grained semantic relations between queries and documents and often lose substantial information. Therefore, we propose MemSearch-o1, an agentic search framework built on reasoning-aligned memory growth and retracing. MemSearch-o1 dynamically grows fine-grained memory fragments from memory seed tokens from the queries, then retraces and deeply refines the memory via a contribution function, and finally reorganizes a globally connected memory path. This shifts memory management from stream-like concatenation to structured, token-level growth with path-based reasoning. Experiments on eight benchmark datasets show that MemSearch-o1 substantially mitigates memory dilution, and more effectively activates the reasoning potential of diverse LLMs, establishing a solid foundation for memory-aware agentic intelligence.
CLDec 11, 2024Code
Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented GenerationPengyue Jia, Derong Xu, Xiaopeng Li et al.
The reranker and generator are two critical components in the Retrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking relevant documents and generating responses. However, due to differences in pre-training data and objectives, there is an inevitable gap between the documents ranked as relevant by the reranker and those required by the generator to support answering the query. To address this gap, we propose RADIO, a novel and practical preference alignment framework with RAtionale DIstillatiOn. Specifically, we first propose a rationale extraction method that leverages the reasoning capabilities of Large Language Models (LLMs) to extract the rationales necessary for answering the query. Subsequently, a rationale-based alignment process is designed to rerank the documents based on the extracted rationales, and fine-tune the reranker to align the preferences. We conduct extensive experiments on two tasks across three datasets to demonstrate the effectiveness of our approach compared to baseline methods. Our code is released online to ease reproduction.
CLMay 26, 2025Code
From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational AgentsDerong Xu, Yi Wen, Pengyue Jia et al.
Large Language Models (LLMs) have recently been widely adopted in conversational agents. However, the increasingly long interactions between users and agents accumulate extensive dialogue records, making it difficult for LLMs with limited context windows to maintain a coherent long-term dialogue memory and deliver personalized responses. While retrieval-augmented memory systems have emerged to address this issue, existing methods often depend on single-granularity memory segmentation and retrieval. This approach falls short in capturing deep memory connections, leading to partial retrieval of useful information or substantial noise, resulting in suboptimal performance. To tackle these limits, we propose MemGAS, a framework that enhances memory consolidation by constructing multi-granularity association, adaptive selection, and retrieval. MemGAS is based on multi-granularity memory units and employs Gaussian Mixture Models to cluster and associate new memories with historical ones. An entropy-based router adaptively selects optimal granularity by evaluating query relevance distributions and balancing information completeness and noise. Retrieved memories are further refined via LLM-based filtering. Experiments on four long-term memory benchmarks demonstrate that MemGAS outperforms state-of-the-art methods on both question answer and retrieval tasks, achieving superior performance across different query types and top-K settings. \footnote{https://github.com/quqxui/MemGAS}
IRMar 19, 2024Code
ERASE: Benchmarking Feature Selection Methods for Deep Recommender SystemsPengyue Jia, Yejing Wang, Zhaocheng Du et al.
Deep Recommender Systems (DRS) are increasingly dependent on a large number of feature fields for more precise recommendations. Effective feature selection methods are consequently becoming critical for further enhancing the accuracy and optimizing storage efficiencies to align with the deployment demands. This research area, particularly in the context of DRS, is nascent and faces three core challenges. Firstly, variant experimental setups across research papers often yield unfair comparisons, obscuring practical insights. Secondly, the existing literature's lack of detailed analysis on selection attributes, based on large-scale datasets and a thorough comparison among selection techniques and DRS backbones, restricts the generalizability of findings and impedes deployment on DRS. Lastly, research often focuses on comparing the peak performance achievable by feature selection methods, an approach that is typically computationally infeasible for identifying the optimal hyperparameters and overlooks evaluating the robustness and stability of these methods. To bridge these gaps, this paper presents ERASE, a comprehensive bEnchmaRk for feAture SElection for DRS. ERASE comprises a thorough evaluation of eleven feature selection methods, covering both traditional and deep learning approaches, across four public datasets, private industrial datasets, and a real-world commercial platform, achieving significant enhancement. Our code is available online for ease of reproduction.
81.8CRMay 4
APIOT: Autonomous Vulnerability Management Across Bare-Metal Industrial OT NetworksAdel ElZemity, Budi Arief, Shujun Li et al.
Bare-metal operational technology (OT) devices -- especially the microcontrollers running Modbus/TCP and CoAP at the base of industrial control systems -- have remained outside the reach of autonomous security attacks. Prior autonomous pentesting studies target Linux and web systems, whose shells and filesystems are familiar to LLM agents. Bare-metal OT has neither, so agents must reason directly over protocol fields and parser semantics. This requires new action-space designs and runtime controls, and opens new research questions about protocol-level exploit reasoning and its deployment envelope. We present APIOT (Autonomous Purple-teaming for Industrial OT), the first large language model (LLM) framework demonstrating an autonomous attack and remediation of bare-metal OT devices, achieving the full discovery -> exploitation -> patching -> verification cycle without step-by-step human intervention. We implemented and evaluated this framework on Zephyr RTOS firmware across heterogeneous industrial IoT (IIoT) topologies. Through 290 experiment runs spanning five frontier LLMs, three network topologies, two impairment levels, and guided versus unguided conditions, APIOT achieved a mission success rate of 90.0% on the full attack-remediation cycle. We found that the runtime governance layer (which we call an overseer) is a critical engineering variable: without it, agents exhibit systematic degenerate patterns, including repetition loops, missing crash verification, and reconnaissance deadlocks. Together, these findings carry two implications beyond our testbed. Attacker expertise is no longer the binding constraint on bare-metal OT exploitation, and defender threat models must now assume LLM-augmented adversaries capable of executing autonomous discovery-through-remediation cycles against industrial firmware.
IROct 24, 2025
Doc-Researcher: A Unified System for Multimodal Document Parsing and Deep ResearchKuicai Dong, Shurui Huang, Fangda Ye et al.
Deep Research systems have revolutionized how LLMs solve complex questions through iterative reasoning and evidence gathering. However, current systems remain fundamentally constrained to textual web data, overlooking the vast knowledge embedded in multimodal documents Processing such documents demands sophisticated parsing to preserve visual semantics (figures, tables, charts, and equations), intelligent chunking to maintain structural coherence, and adaptive retrieval across modalities, which are capabilities absent in existing systems. In response, we present Doc-Researcher, a unified system that bridges this gap through three integrated components: (i) deep multimodal parsing that preserves layout structure and visual semantics while creating multi-granular representations from chunk to document level, (ii) systematic retrieval architecture supporting text-only, vision-only, and hybrid paradigms with dynamic granularity selection, and (iii) iterative multi-agent workflows that decompose complex queries, progressively accumulate evidence, and synthesize comprehensive answers across documents and modalities. To enable rigorous evaluation, we introduce M4DocBench, the first benchmark for Multi-modal, Multi-hop, Multi-document, and Multi-turn deep research. Featuring 158 expert-annotated questions with complete evidence chains across 304 documents, M4DocBench tests capabilities that existing benchmarks cannot assess. Experiments demonstrate that Doc-Researcher achieves 50.6% accuracy, 3.4xbetter than state-of-the-art baselines, validating that effective document research requires not just better retrieval, but fundamentally deep parsing that preserve multimodal integrity and support iterative research. Our work establishes a new paradigm for conducting deep research on multimodal document collections.
CLMay 22, 2025
Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented GenerationDerong Xu, Pengyue Jia, Xiaopeng Li et al.
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. Building on this foundation, graph-based RAG systems go a step further by retrieving subgraphs, which preserve the relationships between knowledge entities and provide more comprehensive context. However, graph RAG faces two challenges: (1) Retrieving relevant information introduces irrelevant nodes (especially in dense graph databases, where retrieval usually extends to adjacent nodes), and leads to overly lengthy inputs that hinder efficiency; (2) The representation gap between graph and language during generation with LLMs limits the ability to fully leverage graph structures for enhanced understanding. To address these limitations, we propose Align-GRAG, a novel reasoning-guided dual alignment framework in post-retrieval phrase. It first formulates a subgraph by retrieving nodes and edges. Then an Aligner is proposed to jointly optimize a graph encoder with an LLM-summarized reasoning chain. It achieves dual alignment of graph node and representation by leveraging KL divergence loss and contrastive loss, facilitating efficient pruning of irrelevant knowledge and establishing a unified semantic space. The Generator integrates the aligned graph data with LLM to produce coherent and accurate answers. Experiments on the GraphQA benchmark across three tasks (including common sense reasoning, scene graph understanding, and knowledge graph reasoning) validate the effectiveness of our method. The codes are available in this repository\footnote{https://anonymous.4open.science/r/Align-GRAG-F3D8/}.
CVJul 7, 2025
MCFormer: A Multi-Cost-Volume Network and Comprehensive Benchmark for Particle Image VelocimetryZicheng Lin, Xiaoqiang Li, Yichao Wang et al.
Particle Image Velocimetry (PIV) is fundamental to fluid dynamics, yet deep learning applications face significant hurdles. A critical gap exists: the lack of comprehensive evaluation of how diverse optical flow models perform specifically on PIV data, largely due to limitations in available datasets and the absence of a standardized benchmark. This prevents fair comparison and hinders progress. To address this, our primary contribution is a novel, large-scale synthetic PIV benchmark dataset generated from diverse CFD simulations (JHTDB and Blasius). It features unprecedented variety in particle densities, flow velocities, and continuous motion, enabling, for the first time, a standardized and rigorous evaluation of various optical flow and PIV algorithms. Complementing this, we propose Multi Cost Volume PIV (MCFormer), a new deep network architecture leveraging multi-frame temporal information and multiple cost volumes, specifically designed for PIV's sparse nature. Our comprehensive benchmark evaluation, the first of its kind, reveals significant performance variations among adapted optical flow models and demonstrates that MCFormer significantly outperforms existing methods, achieving the lowest overall normalized endpoint error (NEPE). This work provides both a foundational benchmark resource essential for future PIV research and a state-of-the-art method tailored for PIV challenges. We make our benchmark dataset and code publicly available to foster future research in this area.
IRJun 18, 2024
LLM4MSR: An LLM-Enhanced Paradigm for Multi-Scenario RecommendationYuhao Wang, Yichao Wang, Zichuan Fu et al.
As the demand for more personalized recommendation grows and a dramatic boom in commercial scenarios arises, the study on multi-scenario recommendation (MSR) has attracted much attention, which uses the data from all scenarios to simultaneously improve their recommendation performance. However, existing methods tend to integrate insufficient scenario knowledge and neglect learning personalized cross-scenario preferences, thus leading to sub-optimal performance. Meanwhile, though large language model (LLM) has shown great capability of reasoning and capturing semantic information, the high inference latency and high computation cost of tuning hinder its implementation in industrial recommender systems. To fill these gaps, we propose an LLM-enhanced paradigm LLM4MSR in this work. Specifically, we first leverage LLM to uncover multi-level knowledge from the designed scenario- and user-level prompt without fine-tuning the LLM, then adopt hierarchical meta networks to generate multi-level meta layers to explicitly improve the scenario-aware and personalized recommendation capability. Our experiments on KuaiSAR-small, KuaiSAR, and Amazon datasets validate significant advantages of LLM4MSR: (i) the effectiveness and compatibility with different multi-scenario backbone models, (ii) high efficiency and deployability on industrial recommender systems, and (iii) improved interpretability. The implemented code and data is available to ease reproduction.
IRSep 4, 2020
A Practical Incremental Method to Train Deep CTR ModelsYichao Wang, Huifeng Guo, Ruiming Tang et al.
Deep learning models in recommender systems are usually trained in the batch mode, namely iteratively trained on a fixed-size window of training data. Such batch mode training of deep learning models suffers from low training efficiency, which may lead to performance degradation when the model is not produced on time. To tackle this issue, incremental learning is proposed and has received much attention recently. Incremental learning has great potential in recommender systems, as two consecutive window of training data overlap most of the volume. It aims to update the model incrementally with only the newly incoming samples from the timestamp when the model is updated last time, which is much more efficient than the batch mode training. However, most of the incremental learning methods focus on the research area of image recognition where new tasks or classes are learned over time. In this work, we introduce a practical incremental method to train deep CTR models, which consists of three decoupled modules (namely, data, feature and model module). Our method can achieve comparable performance to the conventional batch mode training with much better training efficiency. We conduct extensive experiments on a public benchmark and a private dataset to demonstrate the effectiveness of our proposed method.
IRApr 14, 2020
Personalized Re-ranking for Improving Diversity in Live Recommender SystemsYichao Wang, Xiangyu Zhang, Zhirong Liu et al.
Users of industrial recommender systems are normally suggesteda list of items at one time. Ideally, such list-wise recommendationshould provide diverse and relevant options to the users. However, in practice, list-wise recommendation is implemented as top-N recommendation. Top-N recommendation selects the first N items from candidates to display. The list is generated by a ranking function, which is learned from labeled data to optimize accuracy.However, top-N recommendation may lead to suboptimal, as it focuses on accuracy of each individual item independently and overlooks mutual influence between items. Therefore, we propose a personalized re-ranking model for improving diversity of the recommendation list in real recommender systems. The proposed re-ranking model can be easily deployed as a follow-up component after any existing ranking function. The re-ranking model improves the diversity by employing personalized Determinental Point Process (DPP). DPP has been applied in some recommender systems to improve the diversity and increase the user engagement.However, DPP does not take into account the fact that users may have individual propensities to the diversity. To overcome such limitation, our re-ranking model proposes a personalized DPP to model the trade-off between accuracy and diversity for each individual user. We implement and deploy the personalized DPP model on alarge scale industrial recommender system. Experimental results on both offline and online demonstrate the efficiency of our proposed re-ranking model.