Zehao Lin

CL
h-index32
19papers
1,972citations
Novelty56%
AI Score59

19 Papers

CVSep 24, 2024Code
MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

Jiacheng Ruan, Wenzhen Yuan, Zehao Lin et al.

Large visual-language models (LVLMs) have achieved great success in multiple applications. However, they still encounter challenges in complex scenes, especially those involving camouflaged objects. This is primarily due to the lack of samples related to camouflaged scenes in the training dataset. To mitigate this issue, we construct the MM-CamObj dataset for the first time, comprising two subsets: CamObj-Align and CamObj-Instruct. Specifically, CamObj-Align contains 11,363 image-text pairs, and it is designed for VL alignment and injecting rich knowledge of camouflaged scenes into LVLMs. CamObj-Instruct is collected for fine-tuning the LVLMs with improved instruction-following capabilities, and it includes 11,363 images and 68,849 conversations with diverse instructions. Based on the MM-CamObj dataset, we propose the CamObj-Llava, an LVLM specifically designed for addressing tasks in camouflaged scenes. To facilitate our model's effective acquisition of knowledge about camouflaged objects and scenes, we introduce a curriculum learning strategy with six distinct modes. Additionally, we construct the CamObj-Bench to evaluate the existing LVLMs' capabilities of understanding, recognition, localization and count in camouflage scenes. This benchmark includes 600 images and 7 tasks, with a total of 9,449 questions. Extensive experiments are conducted on the CamObj-Bench with CamObj-Llava, 8 existing open-source and 3 closed-source LVLMs. Surprisingly, the results indicate that our model achieves a 25.84% improvement in 4 out of 7 tasks compared to GPT-4o. Code and datasets will be available at https://github.com/JCruan519/MM-CamObj.

93.5PFJun 1
SparseX: Efficient Segment-Level KV Cache Sharing for Interleaved LLM Serving

Quqing Zhang, Kai Chen, Ning Liao et al.

In long-context LLM serving, the prefill stage often dominates time-to-first-token and computational cost. Although Prefix Cache in vLLM/PagedAttention has been widely used to reuse identical prompt prefixes, repeated content in practical applications frequently appears as non-prefix, cross-request, cross-turn, and cross-agent segments, which makes conventional cache mechanisms insufficient. This paper presents SparseX, a segment-level KV Cache sharing method for common serving scenarios. SparseX uses contiguous token segments as reuse units and exploits Sparse-Q indices that naturally arise in KV Cache reuse workloads to estimate the key tokens that require correction. Based on this estimate, SparseX performs Sparse-KV Recomputation within a single forward pass, thereby restoring cross-segment contextual interactions under complex interleaved reuse patterns while avoiding additional models or separate preprocessing stages for token selection. SparseX further implements a full+sparse hybrid attention mode based on a layer-specific threshold: early layers retain full attention to obtain a more stable token-importance signal, and later layers switch to sparse recomputation to improve reuse quality on complex long-context tasks. We implement SparseX-vLLM on top of vLLM, integrating segment-level cache lookup, PagedAttention management, RoPE alignment, Sparse-Q token selection, and FlashAttention backends into a unified execution path. SparseX is model-agnostic, training-free, and compatible with Prefix Cache, and it provides unified support for common online serving scenarios including multi-round chat, retrieval-augmented generation (RAG), and agent workflows.

CLJul 1, 2024
$\text{Memory}^3$: Language Modeling with Explicit Memory

Hongkang Yang, Zehao Lin, Wenjin Wang et al.

The training and inference of large language models (LLMs) are together a costly process that transports knowledge from raw data to meaningful computation. Inspired by the memory hierarchy of the human brain, we reduce this cost by equipping LLMs with explicit memory, a memory format cheaper than model parameters and text retrieval-augmented generation (RAG). Conceptually, with most of its knowledge externalized to explicit memories, the LLM can enjoy a smaller parameter size, training cost, and inference cost, all proportional to the amount of remaining "abstract knowledge". As a preliminary proof of concept, we train from scratch a 2.4B LLM, which achieves better performance than much larger LLMs as well as RAG models, and maintains higher decoding speed than RAG. The model is named $\text{Memory}^3$, since explicit memory is the third form of memory in LLMs after implicit memory (model parameters) and working memory (context key-values). We introduce a memory circuitry theory to support the externalization of knowledge, and present novel techniques including a memory sparsification mechanism that makes storage tractable and a two-stage pretraining scheme that facilitates memory formation.

CVOct 16, 2024Code
FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion

Jiacheng Ruan, Yebin Yang, Zehao Lin et al.

Benefiting from the revolutionary advances in large language models (LLMs) and foundational vision models, large vision-language models (LVLMs) have also made significant progress. However, current benchmarks focus on tasks that evaluating only a single aspect of LVLM capabilities (e.g., recognition, detection, understanding). These tasks fail to fully demonstrate LVLMs' potential in complex application scenarios. To comprehensively assess the performance of existing LVLMs, we propose a more challenging task called the Flow Text with Image Insertion task (FTII). This task requires LVLMs to simultaneously possess outstanding abilities in image comprehension, instruction understanding, and long-text interpretation. Specifically, given several text paragraphs and a set of candidate images, as the text paragraphs accumulate, the LVLMs are required to select the most suitable image from the candidates to insert after the corresponding paragraph. Constructing a benchmark for such a task is highly challenging, particularly in determining the sequence of flowing text and images. To address this challenge, we turn to professional news reports, which naturally contain a gold standard for image-text sequences. Based on this, we introduce the Flow Text with Image Insertion Benchmark (FTII-Bench), which includes 318 high-quality Chinese image-text news articles and 307 high-quality English image-text news articles, covering 10 different news domains. Using these 625 high-quality articles, we construct problems of two different types with multiple levels of difficulty. Furthermore, we establish two different evaluation pipelines based on the CLIP model and existing LVLMs. We evaluate 9 open-source and 2 closed-source LVLMs as well as 2 CLIP-based models. Results indicate that even the most advanced models (e.g., GPT-4o) face significant challenges when tackling the FTII task.

93.1CRApr 17
A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty

Zehao Lin, Chunyu Li, Kai Chen

Research on large language model (LLM) security is shifting from "will the model leak training data" to a more consequential question: can an agent with persistent, long-term memory be continuously shaped, cross-session poisoned, accessed without authorization, and propagated across shared organizational state? Recent surveys cover memory architectures and agent mechanisms, but fewer center the epistemic and governance properties of persistent, writable memory as the reason memory is an independent security problem. This survey addresses that gap. Drawing on cognitive neuroscience and the philosophy of memory, we characterize agent memory as malleable, rewritable, and socially propagating, and develop a memory-lifecycle framework organized around six phases -- Write, Store, Retrieve, Execute, Share, Forget/Rollback -- cross-tabulated against four security objectives: integrity, confidentiality, availability, governance. We organize the literature on memory poisoning, extraction, retrieval corruption, control-flow hijacking, cross-agent propagation, rollback, and governance, and situate representative architectures as determinants of which phases are explicitly governable. Three findings stand out: the literature concentrates on write- and retrieve-time integrity attacks, while confidentiality, availability, store/forget, and benign-persistence failures remain sparsely studied; no published architecture covers all nine governance primitives we identify; and using LLMs themselves for memory security remains sparse yet essential. We unify these under mnemonic sovereignty -- verifiable, recoverable governance over what may be written, who may read, when updates are authorized, and which states may be forgotten -- arguing future secure agents will be differentiated not only by recall capacity, but by memory governance quality.

CLJul 4, 2025
MemOS: A Memory OS for AI System

Zhiyu Li, Shichao Song, Chenyang Xi et al.

Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge consistency.Existing models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended periods.While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent representations.Recent work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.

CLFeb 20, 2025
SurveyX: Academic Survey Automation via Large Language Models

Xun Liang, Jiawei Yang, Yezhaohui Wang et al.

Large Language Models (LLMs) have demonstrated exceptional comprehension capabilities and a vast knowledge base, suggesting that LLMs can serve as efficient tools for automated survey generation. However, recent research related to automated survey generation remains constrained by some critical limitations like finite context window, lack of in-depth content discussion, and absence of systematic evaluation frameworks. Inspired by human writing processes, we propose SurveyX, an efficient and organized system for automated survey generation that decomposes the survey composing process into two phases: the Preparation and Generation phases. By innovatively introducing online reference retrieval, a pre-processing method called AttributeTree, and a re-polishing process, SurveyX significantly enhances the efficacy of survey composition. Experimental evaluation results show that SurveyX outperforms existing automated survey generation systems in content quality (0.259 improvement) and citation quality (1.76 enhancement), approaching human expert performance across multiple evaluation dimensions. Examples of surveys generated by SurveyX are available on www.surveyx.cn

CLMay 28, 2025
MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models

Zhiyu Li, Shichao Song, Hanyu Wang et al.

Large Language Models (LLMs) have emerged as foundational infrastructure in the pursuit of Artificial General Intelligence (AGI). Despite their remarkable capabilities in language perception and generation, current LLMs fundamentally lack a unified and structured architecture for handling memory. They primarily rely on parametric memory (knowledge encoded in model weights) and ephemeral activation memory (context-limited runtime states). While emerging methods like Retrieval-Augmented Generation (RAG) incorporate plaintext memory, they lack lifecycle management and multi-modal integration, limiting their capacity for long-term knowledge evolution. To address this, we introduce MemOS, a memory operating system designed for LLMs that, for the first time, elevates memory to a first-class operational resource. It builds unified mechanisms for representation, organization, and governance across three core memory types: parametric, activation, and plaintext. At its core is the MemCube, a standardized memory abstraction that enables tracking, fusion, and migration of heterogeneous memory, while offering structured, traceable access across tasks and contexts. MemOS establishes a memory-centric execution framework with strong controllability, adaptability, and evolvability. It fills a critical gap in current LLM infrastructure and lays the groundwork for continual adaptation, personalized intelligence, and cross-platform coordination in next-generation intelligent systems.

LGMay 29, 2025
Scalable Complexity Control Facilitates Reasoning Ability of LLMs

Liangkai Hang, Junjie Yao, Zhiwei Bai et al.

The reasoning ability of large language models (LLMs) has been rapidly advancing in recent years, attracting interest in more fundamental approaches that can reliably enhance their generalizability. This work demonstrates that model complexity control, conveniently implementable by adjusting the initialization rate and weight decay coefficient, improves the scaling law of LLMs consistently over varying model sizes and data sizes. This gain is further illustrated by comparing the benchmark performance of 2.4B models pretrained on 1T tokens with different complexity hyperparameters. Instead of fixing the initialization std, we found that a constant initialization rate (the exponent of std) enables the scaling law to descend faster in both model and data sizes. These results indicate that complexity control is a promising direction for the continual advancement of LLMs.

CLNov 16, 2025
TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction

Jie Zhang, Bo Tang, Wanzi Shao et al.

Retrieval-Augmented Generation (RAG) improves large language models by retrieving external knowledge, often truncated into smaller chunks due to the input context window, which leads to information loss, resulting in response hallucinations and broken reasoning chains. Moreover, traditional RAG retrieves unstructured knowledge, introducing irrelevant details that hinder accurate reasoning. To address these issues, we propose TAdaRAG, a novel RAG framework for on-the-fly task-adaptive knowledge graph construction from external sources. Specifically, we design an intent-driven routing mechanism to a domain-specific extraction template, followed by supervised fine-tuning and a reinforcement learning-based implicit extraction mechanism, ensuring concise, coherent, and non-redundant knowledge integration. Evaluations on six public benchmarks and a real-world business benchmark (NowNewsQA) across three backbone models demonstrate that TAdaRAG outperforms existing methods across diverse domains and long-text tasks, highlighting its strong generalization and practical effectiveness.

LGJul 24, 2025
Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling

Ning Liao, Xiaoxing Wang, Zehao Lin et al.

A large language model (LLM) with knowledge in both scientific and general tasks is the foundation of science general intelligence. However, directly continued pretraining an LLM using science data usually leads to catastrophic forgetting, which indicates severe degradation in general ability. In this report, we present Innovator, which solves this problem by upcycling a pre-trained dense LLM into a fine-grained Mixtures-of-Experts model during continued pretraining, where different experts are expected to learn science knowledge in different disciplines, and a shared expert is utilized for general tasks. Innovator introduces a four-stage upcycle training paradigm: (1) Scientific Expert Induction on discipline-specific data, (2) Fine-grained Expert Splitting via FFN dimension decomposition, (3) Science-Aware Routing warmup, and (4) Generalist-Scientist Integration training on hybrid datasets. Such a paradigm enables knowledge in the general domain, and different scientific disciplines can be decoupled, avoiding the negative influence among knowledge in different domains. With 53.3B total parameters and 13.3B activated, Innovator extends Qwen2.5-7B using a shared general expert and 64 specialized scientific experts with 8 activated. Trained on 300B tokens with tri-level quality-controlled data, Innovator achieves 25% average improvement across 30 scientific tasks with a win rate as 70%, while retaining 99% performance in general tasks. Furthermore, Innovator-Reason, which is post-trained from Innovator for reasoning boosting, exhibits excellent reasoning performance in solving complex scientific problems with improvements over 30%.

IRMay 28, 2025
Xinyu AI Search: Enhanced Relevance and Comprehensive Results with Rich Answer Presentations

Bo Tang, Junyi Zhu, Chenyang Xi et al.

Traditional search engines struggle to synthesize fragmented information for complex queries, while generative AI search engines face challenges in relevance, comprehensiveness, and presentation. To address these limitations, we introduce Xinyu AI Search, a novel system that incorporates a query-decomposition graph to dynamically break down complex queries into sub-queries, enabling stepwise retrieval and generation. Our retrieval pipeline enhances diversity through multi-source aggregation and query expansion, while filtering and re-ranking strategies optimize passage relevance. Additionally, Xinyu AI Search introduces a novel approach for fine-grained, precise built-in citation and innovates in result presentation by integrating timeline visualization and textual-visual choreography. Evaluated on recent real-world queries, Xinyu AI Search outperforms eight existing technologies in human assessments, excelling in relevance, comprehensiveness, and insightfulness. Ablation studies validate the necessity of its key sub-modules. Our work presents the first comprehensive framework for generative AI search engines, bridging retrieval, generation, and user-centric presentation.

CVAug 26, 2021
Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning

Guodun Li, Yuchen Zhai, Zehao Lin et al.

Stylized image captioning systems aim to generate a caption not only semantically related to a given image but also consistent with a given style description. One of the biggest challenges with this task is the lack of sufficient paired stylized data. Many studies focus on unsupervised approaches, without considering from the perspective of data augmentation. We begin with the observation that people may recall similar emotions when they are in similar scenes, and often express similar emotions with similar style phrases, which underpins our data augmentation idea. In this paper, we propose a novel Extract-Retrieve-Generate data augmentation framework to extract style phrases from small-scale stylized sentences and graft them to large-scale factual captions. First, we design the emotional signal extractor to extract style phrases from small-scale stylized sentences. Second, we construct the plugable multi-modal scene retriever to retrieve scenes represented with pairs of an image and its stylized caption, which are similar to the query image or caption in the large-scale factual data. In the end, based on the style phrases of similar scenes and the factual description of the current scene, we build the emotion-aware caption generator to generate fluent and diversified stylized captions for the current scene. Extensive experimental results show that our framework can alleviate the data scarcity problem effectively. It also significantly boosts the performance of several existing image captioning models in both supervised and unsupervised settings, which outperforms the state-of-the-art stylized image captioning methods in terms of both sentence relevance and stylishness by a substantial margin.

CLJul 12, 2021
Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations

Jingyao Zhou, Haipang Wu, Zehao Lin et al.

Most recently proposed approaches in dialogue state tracking (DST) leverage the context and the last dialogue states to track current dialogue states, which are often slot-value pairs. Although the context contains the complete dialogue information, the information is usually indirect and even requires reasoning to obtain. The information in the lastly predicted dialogue states is direct, but when there is a prediction error, the dialogue information from this source will be incomplete or erroneous. In this paper, we propose the Dialogue State Tracking with Multi-Level Fusion of Predicted Dialogue States and Conversations network (FPDSC). This model extracts information of each dialogue turn by modeling interactions among each turn utterance, the corresponding last dialogue states, and dialogue slots. Then the representation of each dialogue turn is aggregated by a hierarchical structure to form the passage information, which is utilized in the current turn of DST. Experimental results validate the effectiveness of the fusion network with 55.03% and 59.07% joint accuracy on MultiWOZ 2.0 and MultiWOZ 2.1 datasets, which reaches the state-of-the-art performance. Furthermore, we conduct the deleted-value and related-slot experiments on MultiWOZ 2.1 to evaluate our model.

CLMay 27, 2020
Predict-then-Decide: A Predictive Approach for Wait or Answer Task in Dialogue Systems

Zehao Lin, Shaobo Cui, Guodun Li et al.

Different people have different habits of describing their intents in conversations. Some people tend to deliberate their intents in several successive utterances, i.e., they use several consistent messages for readability instead of a long sentence to express their question. This creates a predicament faced by the application of dialogue systems, especially in real-world industry scenarios, in which the dialogue system is unsure whether it should answer the query of user immediately or wait for further supplementary input. Motivated by such an interesting predicament, we define a novel Wait-or-Answer task for dialogue systems. We shed light on a new research topic about how the dialogue system can be more intelligent to behave in this Wait-or-Answer quandary. Further, we propose a predictive approach named Predict-then-Decide (PTD) to tackle this Wait-or-Answer task. More specifically, we take advantage of a decision model to help the dialogue system decide whether to wait or answer. The decision of decision model is made with the assistance of two ancillary prediction models: a user prediction and an agent prediction. The user prediction model tries to predict what the user would supplement and uses its prediction to persuade the decision model that the user has some information to add, so the dialogue system should wait. The agent prediction model tries to predict the answer of the dialogue system and convince the decision model that it is a superior choice to answer the query of user immediately since the input of user has come to an end. We conduct our experiments on two real-life scenarios and three public datasets. Experimental results on five datasets show our proposed PTD approach significantly outperforms the existing models in solving this Wait-or-Answer problem.

CLMay 21, 2020
MTSS: Learn from Multiple Domain Teachers and Become a Multi-domain Dialogue Expert

Shuke Peng, Feng Ji, Zehao Lin et al.

How to build a high-quality multi-domain dialogue system is a challenging work due to its complicated and entangled dialogue state space among each domain, which seriously limits the quality of dialogue policy, and further affects the generated response. In this paper, we propose a novel method to acquire a satisfying policy and subtly circumvent the knotty dialogue state representation problem in the multi-domain setting. Inspired by real school teaching scenarios, our method is composed of multiple domain-specific teachers and a universal student. Each individual teacher only focuses on one specific domain and learns its corresponding domain knowledge and dialogue policy based on a precisely extracted single domain dialogue state representation. Then, these domain-specific teachers impart their domain knowledge and policies to a universal student model and collectively make this student model a multi-domain dialogue expert. Experiment results show that our method reaches competitive results with SOTAs in both multi-domain and single domain setting.

CLFeb 22, 2020
"Wait, I'm Still Talking!" Predicting the Dialogue Interaction Behavior Using Imagine-Then-Arbitrate Model

Zehao Lin, Shaobo Cui, Guodun Li et al.

Producing natural and accurate responses like human beings is the ultimate goal of intelligent dialogue agents. So far, most of the past works concentrate on selecting or generating one pertinent and fluent response according to current query and its context. These models work on a one-to-one environment, making one response to one utterance each round. However, in real human-human conversations, human often sequentially sends several short messages for readability instead of a long message in one turn. Thus messages will not end with an explicit ending signal, which is crucial for agents to decide when to reply. So the first step for an intelligent dialogue agent is not replying but deciding if it should reply at the moment. To address this issue, in this paper, we propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to help the agent decide whether to wait or to make a response directly. Our method has two imaginator modules and an arbitrator module. The two imaginators will learn the agent's and user's speaking style respectively, generate possible utterances as the input of the arbitrator, combining with dialogue history. And the arbitrator decides whether to wait or to make a response to the user directly. To verify the performance and effectiveness of our method, we prepared two dialogue datasets and compared our approach with several popular models. Experimental results show that our model performs well on addressing ending prediction issue and outperforms baseline models.

CLSep 25, 2019
Task-Oriented Conversation Generation Using Heterogeneous Memory Networks

Zehao Lin, Xinjing Huang, Feng Ji et al.

How to incorporate external knowledge into a neural dialogue model is critically important for dialogue systems to behave like real humans. To handle this problem, memory networks are usually a great choice and a promising way. However, existing memory networks do not perform well when leveraging heterogeneous information from different sources. In this paper, we propose a novel and versatile external memory networks called Heterogeneous Memory Networks (HMNs), to simultaneously utilize user utterances, dialogue history and background knowledge tuples. In our method, historical sequential dialogues are encoded and stored into the context-aware memory enhanced by gating mechanism while grounding knowledge tuples are encoded and stored into the context-free memory. During decoding, the decoder augmented with HMNs recurrently selects each word in one response utterance from these two memories and a general vocabulary. Experimental results on multiple real-world datasets show that HMNs significantly outperform the state-of-the-art data-driven task-oriented dialogue models in most domains.

CLAug 20, 2019
Teacher-Student Framework Enhanced Multi-domain Dialogue Generation

Shuke Peng, Xinjing Huang, Zehao Lin et al.

Dialogue systems dealing with multi-domain tasks are highly required. How to record the state remains a key problem in a task-oriented dialogue system. Normally we use human-defined features as dialogue states and apply a state tracker to extract these features. However, the performance of such a system is limited by the error propagation of a state tracker. In this paper, we propose a dialogue generation model that needs no external state trackers and still benefits from human-labeled semantic data. By using a teacher-student framework, several teacher models are firstly trained in their individual domains, learn dialogue policies from labeled states. And then the learned knowledge and experience are merged and transferred to a universal student model, which takes raw utterance as its input. Experiments show that the dialogue system trained under our framework outperforms the one uses a belief tracker.