Xiaojie Wang

CL
h-index22
65papers
7,796citations
Novelty48%
AI Score60

65 Papers

CLMar 21, 2022Code
A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots

Sai Zhang, Yuwei Hu, Yuchuan Wu et al.

A slot value might be provided segment by segment over multiple-turn interactions in a dialog, especially for some important information such as phone numbers and names. It is a common phenomenon in daily life, but little attention has been paid to it in previous work. To fill the gap, this paper defines a new task named Sub-Slot based Task-Oriented Dialog (SSTOD) and builds a Chinese dialog dataset SSD for boosting research on SSTOD. The dataset includes a total of 40K dialogs and 500K utterances from four different domains: Chinese names, phone numbers, ID numbers and license plate numbers. The data is well annotated with sub-slot values, slot values, dialog states and actions. We find some new linguistic phenomena and interactive manners in SSTOD which raise critical challenges of building dialog agents for the task. We test three state-of-the-art dialog models on SSTOD and find they cannot handle the task well on any of the four domains. We also investigate an improved model by involving slot knowledge in a plug-in manner. More work should be done to meet the new challenges raised from SSTOD which widely exists in real-life applications. The dataset and code are publicly available via https://github.com/shunjiu/SSTOD.

NAJun 3
Multimodal sampling via Schrödinger-Föllmer samplers with temperatures

Xiaojie Wang, Xiaoyan Zhang

Generating samples from complex and high-dimensional distributions is ubiquitous in various scientific fields of statistical physics, Bayesian inference, scientific computing and machine learning. Very recently, Huang et al. (IEEE Trans. Inform. Theory, 2025) proposed new Schrödinger-Föllmer samplers (SFS), based on the Euler discretization of the Schrödinger-Föllmer diffusion evolving on the unit interval $[0, 1]$. There, a convergence rate of order $\mathcal{O}(\sqrt{h})$ in the $L^2$-Wasserstein distance was obtained for the Euler discretization with a uniform time step-size $h>0$. By introducing a temperature parameter, different samplers are proposed in this paper, based on the Euler discretization of the Schrödinger-Föllmer process with temperatures. As revealed by numerical experiments, high temperatures are vital, particularly in sampling from multimodal distributions. Further, a novel approach of error analysis is developed for the time discretization and an enhanced convergence rate of order $\mathcal{O}(h)$ is obtained in the $L^2$-Wasserstein distance, under certain smoothness conditions on the drift. This significantly improves the existing order-half convergence in the aforementioned paper. Unlike Langevin samplers, SFS is gradient-free, works in a unit interval $[0, 1]$ and does not require any ergodicity. Numerical experiments confirm the convergence rate and show that, the SFS substantially outperforms vanilla Langevin samplers, particularly in sampling from multimodal distributions.

CVMay 7Code
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model

Junhui Yin, Nan Pu, Xinyu Zhang et al.

Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt design, leading to suboptimal outcomes. The underlying reasons are: 1) class-specific prompts offer more fine-grained supervision compared to coarse class-shared prompts, which helps prevent misclassification of data from different classes into a single class; 2) compared to class-specific prompts, instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing data from the same class to be divided into multiple classes. To effectively supplement the class-specific knowledge into existing methods, we propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI comprises two key components, i.e., class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples that belong to the same class and stores the learned prompts in a class-level knowledge bank. The latter provides a plug-and-play mechanism for each test instance to retrieve relevant class-level knowledge from the knowledge bank and inject such knowledge to refine model predictions. Extensive experiments demonstrate that our CAKI effectively improves the performance of existing methods on base and novel classes. Code is publicly available at \href{https://github.com/yjh576/CAKI}{this https URL}.

LGOct 27, 2023
A Survey on Trustworthy Edge Intelligence: From Security and Reliability To Transparency and Sustainability

Xiaojie Wang, Beibei Wang, Yu Wu et al.

Edge Intelligence (EI) integrates Edge Computing (EC) and Artificial Intelligence (AI) to push the capabilities of AI to the network edge for real-time, efficient and secure intelligent decision-making and computation. However, EI faces various challenges due to resource constraints, heterogeneous network environments, and diverse service requirements of different applications, which together affect the trustworthiness of EI in the eyes of stakeholders. This survey comprehensively summarizes the characteristics, architecture, technologies, and solutions of trustworthy EI. Specifically, we first emphasize the need for trustworthy EI in the context of the trend toward large models. We then provide an initial definition of trustworthy EI, explore its key characteristics and give a multi-layered architecture for trustworthy EI. Then, we summarize several important issues that hinder the achievement of trustworthy EI. Subsequently, we present enabling technologies for trustworthy EI systems and provide an in-depth literature review of the state-of-the-art solutions for realizing the trustworthiness of EI. Finally, we discuss the corresponding research challenges and open issues.

CLMay 24Code
Lngram: N-gram Conditional Memory in Latent Space

Yunao Zheng, Guoyang Xia, Xiaojie Wang et al.

Sequence modeling requires both compositional reasoning and local static knowledge retrieval, yet standard Transformers handle both through dense computation. Engram partially decouples retrieval from the backbone, but its token-based keys remain tied to text tokenization and hash compression. We propose Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains. Analyses with LogitLens and CKA suggest that Lngram enables prediction-relevant information to emerge earlier, increasing effective depth with limited inference and memory overhead. Code is available at https://github.com/zyaaa-ux/Lngram.

NANov 20, 2016
Exponential integrability properties of numerical approximation processes for nonlinear stochastic differential equations

Martin Hutzenthaler, Arnulf Jentzen, Xiaojie Wang

Exponential integrability properties of numerical approximations are a key tool for establishing positive rates of strong and numerically weak convergence for a large class of nonlinear stochastic differential equations. It turns out that well-known numerical approximation processes such as Euler-Maruyama approximations, linear-implicit Euler approximations, and some tamed Euler approximations from the literature rarely preserve exponential integrability properties of the exact solution. The main contribution of this article is to identify a class of stopped increment-tamed Euler approximations which preserve exponential integrability properties of the exact solution under minor additional assumptions on the involved functions.

NANov 25, 2015
Full discretisation of semi-linear stochastic wave equations driven by multiplicative noise

Rikard Anton, David Cohen, Stig Larsson et al.

A fully discrete approximation of the semi-linear stochastic wave equation driven by multiplicative noise is presented. A standard linear finite element approximation is used in space and a stochastic trigonometric method for the temporal approximation. This explicit time integrator allows for mean-square error bounds independent of the space discretisation and thus do not suffer from a step size restriction as in the often used Störmer-Verlet-leap-frog scheme. Furthermore, it satisfies an almost trace formula (i.e., a linear drift of the expected value of the energy of the problem). Numerical experiments are presented and confirm the theoretical results.

CVOct 24, 2022
Towards Unifying Reference Expression Generation and Comprehension

Duo Zheng, Tao Kong, Ya Jing et al. · bytedance

Reference Expression Generation (REG) and Comprehension (REC) are two highly correlated tasks. Modeling REG and REC simultaneously for utilizing the relation between them is a promising way to improve both. However, the problem of distinct inputs, as well as building connections between them in a single model, brings challenges to the design and training of the joint model. To address the problems, we propose a unified model for REG and REC, named UniRef. It unifies these two tasks with the carefully-designed Image-Region-Text Fusion layer (IRTF), which fuses the image, region and text via the image cross-attention and region cross-attention. Additionally, IRTF could generate pseudo input regions for the REC task to enable a uniform way for sharing the identical representation space across the REC and REG. We further propose Vision-conditioned Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP) to pre-train UniRef model on multi-granular corpora. The VMLM and TRP are directly related to REG and REC, respectively, but could help each other. We conduct extensive experiments on three benchmark datasets, RefCOCO, RefCOCO+ and RefCOCOg. Experimental results show that our model outperforms previous state-of-the-art methods on both REG and REC.

CVAug 7, 2024Code
Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis

Zebin Yao, Fangxiang Feng, Ruifan Li et al.

The customization of text-to-image models has seen significant advancements, yet generating multiple personalized concepts remains a challenging task. Current methods struggle with attribute leakage and layout confusion when handling multiple concepts, leading to reduced concept fidelity and semantic consistency. In this work, we introduce a novel training-free framework, Concept Conductor, designed to ensure visual fidelity and correct layout in multi-concept customization. Concept Conductor isolates the sampling processes of multiple custom models to prevent attribute leakage between different concepts and corrects erroneous layouts through self-attention-based spatial guidance. Additionally, we present a concept injection technique that employs shape-aware masks to specify the generation area for each concept. This technique injects the structure and appearance of personalized concepts through feature fusion in the attention layers, ensuring harmony in the final image. Extensive qualitative and quantitative experiments demonstrate that Concept Conductor can consistently generate composite images with accurate layouts while preserving the visual details of each concept. Compared to existing baselines, Concept Conductor shows significant performance improvements. Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts. The code and models are available at https://github.com/Nihukat/Concept-Conductor.

NAAug 24, 2014
The tamed Milstein method for commutative stochastic differential equations with non-globally Lipschitz continuous coefficients

Xiaojie Wang, Siqing Gan

For stochastic differential equations (SDEs) with a superlinearly growing and globally one-sided Lipschitz continuous drift coefficient, the classical explicit Euler scheme fails to converge strongly to the exact solution. Recently, an explicit strongly convergent numerical scheme, called the tamed Euler method, is proposed in [Hutzenthaler, Jentzen & Kloeden, Ann. Appl. Probab., 22 (2012), pp. 1611-1641.] for such SDEs. Motivated by their work, we here introduce a tamed version of the Milstein scheme for SDEs with commutative noise. The proposed method is also explicit and easily implementable, but achieves higher strong convergence order than the tamed Euler method does. In recovering the strong convergence order one of the new method, new difficulties arise and kind of a bootstrap argument is developed to overcome them. Finally, an illustrative example confirms the computational efficiency of the tamed Milstein method compared to the tamed Euler method.

CLJul 23, 2023
FATRER: Full-Attention Topic Regularizer for Accurate and Robust Conversational Emotion Recognition

Yuzhao Mao, Di Lu, Xiaojie Wang et al.

This paper concentrates on the understanding of interlocutors' emotions evoked in conversational utterances. Previous studies in this literature mainly focus on more accurate emotional predictions, while ignoring model robustness when the local context is corrupted by adversarial attacks. To maintain robustness while ensuring accuracy, we propose an emotion recognizer augmented by a full-attention topic regularizer, which enables an emotion-related global view when modeling the local context in a conversation. A joint topic modeling strategy is introduced to implement regularization from both representation and loss perspectives. To avoid over-regularization, we drop the constraints on prior distributions that exist in traditional topic modeling and perform probabilistic approximations based entirely on attention alignment. Experiments show that our models obtain more favorable results than state-of-the-art models, and gain convincing robustness under three types of adversarial attacks.

CVMar 16, 2022
Spot the Difference: A Cooperative Object-Referring Game in Non-Perfectly Co-Observable Scene

Duo Zheng, Fandong Meng, Qingyi Si et al. · tsinghua

Visual dialog has witnessed great progress after introducing various vision-oriented goals into the conversation, especially such as GuessWhich and GuessWhat, where the only image is visible by either and both of the questioner and the answerer, respectively. Researchers explore more on visual dialog tasks in such kind of single- or perfectly co-observable visual scene, while somewhat neglect the exploration on tasks of non perfectly co-observable visual scene, where the images accessed by two agents may not be exactly the same, often occurred in practice. Although building common ground in non-perfectly co-observable visual scene through conversation is significant for advanced dialog agents, the lack of such dialog task and corresponding large-scale dataset makes it impossible to carry out in-depth research. To break this limitation, we propose an object-referring game in non-perfectly co-observable visual scene, where the goal is to spot the difference between the similar visual scenes through conversing in natural language. The task addresses challenges of the dialog strategy in non-perfectly co-observable visual scene and the ability of categorizing objects. Correspondingly, we construct a large-scale multimodal dataset, named SpotDiff, which contains 87k Virtual Reality images and 97k dialogs generated by self-play. Finally, we give benchmark models for this task, and conduct extensive experiments to evaluate its performance as well as analyze its main challenges.

NAJul 4, 2011
The improved split-step backward Euler method for stochastic differential delay equations

Xiaojie Wang, Siqing Gan

A new, improved split-step backward Euler (SSBE) method is introduced and analyzed for stochastic differential delay equations(SDDEs) with generic variable delay. The method is proved to be convergent in mean-square sense under conditions (Assumption 3.1) that the diffusion coefficient $g(x,y)$ is globally Lipschitz in both $x$ and $y$, but the drift coefficient $f(x,y)$ satisfies one-sided Lipschitz condition in $x$ and globally Lipschitz in $y$. Further, exponential mean-square stability of the proposed method is investigated for SDDEs that have a negative one-sided Lipschitz constant. Our results show that the method has the unconditional stability property in the sense that it can well reproduce stability of underlying system, without any restrictions on stepsize $h$. Numerical experiments and comparisons with existing methods for SDDEs illustrate the computational efficiency of our method.

NAMay 14, 2016
Sharp mean-square regularity results for SPDEs with fractional noise and optimal convergence rates for the numerical approximations

Xiaojie Wang, Ruisheng Qi, Fengze Jiang

This article offers sharp spatial and temporal mean-square regularity results for a class of semi-linear parabolic stochastic partial differential equations (SPDEs) driven by infinite dimensional fractional Brownian motion with the Hurst parameter greater than one-half. In addition, mean-square numerical approximation of such problem are investigated, performed by the spectral Galerkin method in space and the linear implicit Euler method in time. The obtained sharp regularity properties of the problems enable us to identify optimal mean-square convergence rates of the full discrete scheme. These theoretical findings are accompanied by several numerical examples.

CLApr 2, 2022
Co-VQA : Answering by Interactive Sub Question Sequence

Ruonan Wang, Yuxi Qian, Fangxiang Feng et al.

Most existing approaches to Visual Question Answering (VQA) answer questions directly, however, people usually decompose a complex question into a sequence of simple sub questions and finally obtain the answer to the original question after answering the sub question sequence(SQS). By simulating the process, this paper proposes a conversation-based VQA (Co-VQA) framework, which consists of three components: Questioner, Oracle, and Answerer. Questioner raises the sub questions using an extending HRED model, and Oracle answers them one-by-one. An Adaptive Chain Visual Reasoning Model (ACVRM) for Answerer is also proposed, where the question-answer pair is used to update the visual representation sequentially. To perform supervised learning for each model, we introduce a well-designed method to build a SQS for each question on VQA 2.0 and VQA-CP v2 datasets. Experimental results show that our method achieves state-of-the-art on VQA-CP v2. Further analyses show that SQSs help build direct semantic connections between questions and images, provide question-adaptive variable-length reasoning chains, and with explicit interpretability as well as error traceability.

NAApr 1, 2012
A Runge-Kutta type scheme for nonlinear stochastic partial differential equations with multiplicative trace class noise

Xiaojie Wang, Siqing Gan

In this paper a new Runge-Kutta type scheme is introduced for nonlinear stochastic partial differential equations (SPDEs) with multiplicative trace class noise. The proposed scheme converges with respect to the computational effort with a higher order than the well-known linear implicit Euler scheme. In comparison to the infinite dimensional analog of Milstein type scheme recently proposed in [Jentzen & Röckner (2012); A Milstein scheme for SPDEs, Arxiv preprint arXiv:1001.2751v4], our scheme is easier to implement and needs less computational effort due to avoiding the derivative of the diffusion function. The new scheme can be regarded as an infinite dimensional analog of Runge-Kutta method for finite dimensional stochastic ordinary differential equations (SODEs). Numerical examples are reported to support the theoretical results.

CVApr 3, 2022
Question-Driven Graph Fusion Network For Visual Question Answering

Yuxi Qian, Yuncong Hu, Ruonan Wang et al.

Existing Visual Question Answering (VQA) models have explored various visual relationships between objects in the image to answer complex questions, which inevitably introduces irrelevant information brought by inaccurate object detection and text grounding. To address the problem, we propose a Question-Driven Graph Fusion Network (QD-GFN). It first models semantic, spatial, and implicit visual relations in images by three graph attention networks, then question information is utilized to guide the aggregation process of the three graphs, further, our QD-GFN adopts an object filtering mechanism to remove question-irrelevant objects contained in the image. Experiment results demonstrate that our QD-GFN outperforms the prior state-of-the-art on both VQA 2.0 and VQA-CP v2 datasets. Further analysis shows that both the novel graph aggregation method and object filtering mechanism play a significant role in improving the performance of the model.

CLJan 5, 2023
SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph

Yuxing Long, Binyuan Hui, Fulong Ye et al.

Existing multimodal conversation agents have shown impressive abilities to locate absolute positions or retrieve attributes in simple scenarios, but they fail to perform well when complex relative positions and information alignments are involved, which poses a bottleneck in response quality. In this paper, we propose a Situated Conversation Agent Petrained with Multimodal Questions from INcremental Layout Graph (SPRING) with abilities of reasoning multi-hops spatial relations and connecting them with visual attributes in crowded situated scenarios. Specifically, we design two types of Multimodal Question Answering (MQA) tasks to pretrain the agent. All QA pairs utilized during pretraining are generated from novel Incremental Layout Graphs (ILG). QA pair difficulty labels automatically annotated by ILG are used to promote MQA-based Curriculum Learning. Experimental results verify the SPRING's effectiveness, showing that it significantly outperforms state-of-the-art approaches on both SIMMC 1.0 and SIMMC 2.0 datasets.

NAFeb 19, 2016
An accelerated exponential time integrator for semi-linear stochastic strongly damped wave equation with additive noise

Ruisheng Qi, Xiaojie Wang

This paper is concerned with the strong approximation of a semi-linear stochastic wave equation with strong damping, driven by additive noise. Based on a spatial discretization performed by a spectral Galerkin method, we introduce a kind of accelerated exponential time integrator involving linear functionals of the noise. Under appropriate assumptions, we provide error bounds for the proposed full-discrete scheme. It is shown that the scheme achieves higher strong order in time direction than the order of temporal regularity of the underlying problem, which allows for higher convergence rate than usual time-stepping schemes. For the space-time white noise case in two or three spatial dimensions, the scheme still exhibits a good convergence performance. Another striking finding is that, even for the velocity with low regularity the scheme always promises first order strong convergence in time. Numerical examples are finally reported to confirm our theoretical findings.

CVAug 19, 2023
Whether you can locate or not? Interactive Referring Expression Generation

Fulong Ye, Yuxing Long, Fangxiang Feng et al.

Referring Expression Generation (REG) aims to generate unambiguous Referring Expressions (REs) for objects in a visual scene, with a dual task of Referring Expression Comprehension (REC) to locate the referred object. Existing methods construct REG models independently by using only the REs as ground truth for model training, without considering the potential interaction between REG and REC models. In this paper, we propose an Interactive REG (IREG) model that can interact with a real REC model, utilizing signals indicating whether the object is located and the visual region located by the REC model to gradually modify REs. Our experimental results on three RE benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg show that IREG outperforms previous state-of-the-art methods on popular evaluation metrics. Furthermore, a human evaluation shows that IREG generates better REs with the capability of interaction.

CVMay 23, 2022
GR-GAN: Gradual Refinement Text-to-image Generation

Bo Yang, Fangxiang Feng, Xiaojie Wang

A good Text-to-Image model should not only generate high quality images, but also ensure the consistency between the text and the generated image. Previous models failed to simultaneously fix both sides well. This paper proposes a Gradual Refinement Generative Adversarial Network (GR-GAN) to alleviates the problem efficiently. A GRG module is designed to generate images from low resolution to high resolution with the corresponding text constraints from coarse granularity (sentence) to fine granularity (word) stage by stage, a ITM module is designed to provide image-text matching losses at both sentence-image level and word-region level for corresponding stages. We also introduce a new metric Cross-Model Distance (CMD) for simultaneously evaluating image quality and image-text consistency. Experimental results show GR-GAN significant outperform previous models, and achieve new state-of-the-art on both FID and CMD. A detailed analysis demonstrates the efficiency of different generation stages in GR-GAN.

CLSep 12, 2023
AKEM: Aligning Knowledge Base to Queries with Ensemble Model for Entity Recognition and Linking

Di Lu, Zhongping Liang, Caixia Yuan et al.

This paper presents a novel approach to address the Entity Recognition and Linking Challenge at NLPCC 2015. The task involves extracting named entity mentions from short search queries and linking them to entities within a reference Chinese knowledge base. To tackle this problem, we first expand the existing knowledge base and utilize external knowledge to identify candidate entities, thereby improving the recall rate. Next, we extract features from the candidate entities and utilize Support Vector Regression and Multiple Additive Regression Tree as scoring functions to filter the results. Additionally, we apply rules to further refine the results and enhance precision. Our method is computationally efficient and achieves an F1 score of 0.535.

LGJul 25, 2024
A Two-Stage Imaging Framework Combining CNN and Physics-Informed Neural Networks for Full-Inverse Tomography: A Case Study in Electrical Impedance Tomography (EIT)

Xuanxuan Yang, Yangming Zhang, Haofeng Chen et al.

Electrical Impedance Tomography (EIT) is a highly ill-posed inverse problem, with the challenge of reconstructing internal conductivities using only boundary voltage measurements. Although Physics-Informed Neural Networks (PINNs) have shown potential in solving inverse problems, existing approaches are limited in their applicability to EIT, as they often rely on impractical prior knowledge and assumptions that cannot be satisfied in real-world scenarios. To address these limitations, we propose a two-stage hybrid learning framework that combines Convolutional Neural Networks (CNNs) and PINNs. This framework integrates data-driven and model-driven paradigms, blending supervised and unsupervised learning to reconstruct conductivity distributions while ensuring adherence to the underlying physical laws, thereby overcoming the constraints of existing methods.

CLOct 20, 2023
Explicit Alignment and Many-to-many Entailment Based Reasoning for Conversational Machine Reading

Yangyang Luo, Shiyu Tian, Caixia Yuan et al.

Conversational Machine Reading (CMR) requires answering a user's initial question through multi-turn dialogue interactions based on a given document. Although there exist many effective methods, they largely neglected the alignment between the document and the user-provided information, which significantly affects the intermediate decision-making and subsequent follow-up question generation. To address this issue, we propose a pipeline framework that (1) aligns the aforementioned two sides in an explicit way, (2)makes decisions using a lightweight many-to-many entailment reasoning module, and (3) directly generates follow-up questions based on the document and previously asked questions. Our proposed method achieves state-of-the-art in micro-accuracy and ranks the first place on the public leaderboard of the CMR benchmark dataset ShARC.

CLOct 1, 2023
A Task-oriented Dialog Model with Task-progressive and Policy-aware Pre-training

Lucen Zhong, Hengtong Lu, Caixia Yuan et al.

Pre-trained conversation models (PCMs) have achieved promising progress in recent years. However, existing PCMs for Task-oriented dialog (TOD) are insufficient for capturing the sequential nature of the TOD-related tasks, as well as for learning dialog policy information. To alleviate these problems, this paper proposes a task-progressive PCM with two policy-aware pre-training tasks. The model is pre-trained through three stages where TOD-related tasks are progressively employed according to the task logic of the TOD system. A global policy consistency task is designed to capture the multi-turn dialog policy sequential relation, and an act-based contrastive learning task is designed to capture similarities among samples with the same dialog policy. Our model achieves better results on both MultiWOZ and In-Car end-to-end dialog modeling benchmarks with only 18\% parameters and 25\% pre-training data compared to the previous state-of-the-art PCM, GALAXY.

CVApr 9, 2024Code
DiffHarmony: Latent Diffusion Model Meets Image Harmonization

Pengfei Zhou, Fangxiang Feng, Xiaojie Wang

Image harmonization, which involves adjusting the foreground of a composite image to attain a unified visual consistency with the background, can be conceptualized as an image-to-image translation task. Diffusion models have recently promoted the rapid development of image-to-image translation tasks . However, training diffusion models from scratch is computationally intensive. Fine-tuning pre-trained latent diffusion models entails dealing with the reconstruction error induced by the image compression autoencoder, making it unsuitable for image generation tasks that involve pixel-level evaluation metrics. To deal with these issues, in this paper, we first adapt a pre-trained latent diffusion model to the image harmonization task to generate the harmonious but potentially blurry initial images. Then we implement two strategies: utilizing higher-resolution images during inference and incorporating an additional refinement stage, to further enhance the clarity of the initially harmonized images. Extensive experiments on iHarmony4 datasets demonstrate the superiority of our proposed method. The code and model will be made publicly available at https://github.com/nicecv/DiffHarmony .

CLNov 3, 2025
BARD: budget-aware reasoning distillation

Lujie Niu, Lei Shen, Yi Jiang et al.

While long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models, the reasoning process often remains redundant and computational budget uncontrollable, leading to inefficient resource usage. To address this limitation, we propose \textbf{Budget-Aware Reasoning Distillation (BARD)}, a novel framework that simultaneously distills reasoning capability and enables fine-grained control over the reasoning length. BARD uses the thinking budget as a user-specified control signal, allowing the model to dynamically balance reasoning performance and computational efficiency. To achieve this concept, BARD introduces a two-phase training regimen. The first phase, Supervised Fine-Tuning (SFT) on teacher-generated long CoT data compressed to various budget levels, bootstrapping the model's understanding of budget constraints. The second phase leverages Reinforcement Learning (RL) from a reward signal in consideration of reasoning performance and budget fidelity simultaneously. Incorporating the two-phase regimen is crucial to avoiding policy degradation and ensuring that both objectives are optimized jointly. Extensive experiments demonstrate that our method empowers an 8B student model to achieve strong performance on challenging reasoning benchmarks (\textit{AIME24, AIME25, GPQA}) while providing precise and adaptive control over its reasoning length across a wide range of budgets.

CVJun 4, 2025Code
Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

Ziming Cheng, Binrui Xu, Lisheng Gong et al.

With enhanced capabilities and widespread applications, Multimodal Large Language Models (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the $\textbf{Multimodal Multi-image Reasoning Benchmark (MMRB)}$, the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises $\textbf{92 sub-tasks}$ covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios. To support fast and scalable evaluation, we propose a sentence-level matching framework using open-source LLMs. Extensive baseline experiments on $\textbf{40 MLLMs}$, including 9 reasoning-specific models and 8 reward models, demonstrate that open-source MLLMs still lag significantly behind commercial MLLMs in multi-image reasoning tasks. Furthermore, current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.

CLFeb 27, 2025Code
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Haochen Sun, Shuwen Zhang, Lujie Niu et al.

Large Language Models (LLMs) based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-based Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks in two novel ways. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are significant shortcomings in active collaboration and continuous adaptation, which are critical for efficiently fulfilling complex tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-source benchmark. The environments, 30 open-ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab-Overcooked.

ROFeb 12
Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning

Carl Qi, Xiaojie Wang, Silong Yong et al.

Reasoning about failures is crucial for building reliable and trustworthy robotic systems. Prior approaches either treat failure reasoning as a closed-set classification problem or assume access to ample human annotations. Failures in the real world are typically subtle, combinatorial, and difficult to enumerate, whereas rich reasoning labels are expensive to acquire. We address this problem by introducing ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. We formulate detection and reasoning as a multi-task self-refinement process, where the model iteratively predicts detection outcomes and natural language reasoning conditioned on past outputs. During training, ARMOR learns from heterogeneous supervision - large-scale sparse binary labels and small-scale rich reasoning annotations - optimized via a combination of offline and online imitation learning. At inference time, ARMOR generates multiple refinement trajectories and selects the most confident prediction via a self-certainty metric. Experiments across diverse environments show that ARMOR achieves state-of-the-art performance by improving over the previous approaches by up to 30% on failure detection rate and up to 100% in reasoning measured through LLM fuzzy match score, demonstrating robustness to heterogeneous supervision and open-ended reasoning beyond predefined failure modes. We provide dditional visualizations on our website: https://sites.google.com/utexas.edu/armor

LGDec 3, 2025
Physics-Driven Learning Framework for Tomographic Tactile Sensing

Xuanxuan Yang, Xiuyang Zhang, Haofeng Chen et al.

Electrical impedance tomography (EIT) provides an attractive solution for large-area tactile sensing due to its minimal wiring and shape flexibility, but its nonlinear inverse problem often leads to severe artifacts and inaccurate contact reconstruction. This work presents PhyDNN, a physics-driven deep reconstruction framework that embeds the EIT forward model directly into the learning objective. By jointly minimizing the discrepancy between predicted and ground-truth conductivity maps and enforcing consistency with the forward PDE, PhyDNN reduces the black-box nature of deep networks and improves both physical plausibility and generalization. To enable efficient backpropagation, we design a differentiable forward-operator network that accurately approximates the nonlinear EIT response, allowing fast physics-guided training. Extensive simulations and real tactile experiments on a 16-electrode soft sensor show that PhyDNN consistently outperforms NOSER, TV, and standard DNNs in reconstructing contact shape, location, and pressure distribution. PhyDNN yields fewer artifacts, sharper boundaries, and higher metric scores, demonstrating its effectiveness for high-quality tomographic tactile sensing.

NAMar 24
Long-time error analysis of finite element fully discrete schemes for SPDEs with non-globally Lipschitz coefficients

Ruisheng Qi, Xiaojie Wang

The present paper proposes new fully discrete schemes for long-time approximations of stochastic partial differential equations (SPDEs) with non-globally Lipschitz coefficients in a bounded domain $D \subset \R^d, d =1,2,3 $. A novel family of linearly implicit time-stepping schemes is introduced, based on a standard Galerkin finite element spatial semi-discretization. A distinguishing feature of the schemes is that the proposed finite element fully discrete approximations preserve uniform-in-time moment bounds in a Banach space $L^{r}(D), r >2$, without requiring any restriction on the time-space discretization stepsize ratio. %established... To show it, some non-standard arguments are developed. First, we derive long-time error estimates in the Banach space $L^r(D)$ for finite element fully discrete approximations of the deterministic linear parabolic equation with non-smooth initial value, which is, to our knowledge, new for the literature on numerical PDEs and of independent interest. These error estimates together with the contractive property of the semi-group in $L^{r}(D), r > 2$, the dissipativity of the nonlinearity and the particular benefit of the taming strategy help us establish the desired uniform-in-time moment bounds. Then both strong and weak error bounds of the proposed schemes are carefully analyzed in a setting of low regularity, with uniform-in-time convergence rates obtained for cases of both space-time white and trace-class noises. The analysis is highly nontrivial, due to the finite element discretization, the low regularity and the presence of the super-linearly growing nonlinearity. %in the long-time scenario... the discretization parameters $h$ and $τ$. Finally, numerical results are presented to verify the previous theoretical findings.

CVApr 22, 2025Code
FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

Zebin Yao, Lei Ren, Huixing Jiang et al.

Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance. However, existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive, subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor leverages semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated images. Additionally, our framework introduces a novel noise initialization strategy to preserve the geometry priors of reference subjects, facilitating robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.

CLJun 1, 2024Code
Phased Instruction Fine-Tuning for Large Language Models

Wei Pang, Chuan Zhou, Xiao-Hua Zhou et al.

Instruction Fine-Tuning enhances pre-trained language models from basic next-word prediction to complex instruction-following. However, existing One-off Instruction Fine-Tuning (One-off IFT) method, applied on a diverse instruction, may not effectively boost models' adherence to instructions due to the simultaneous handling of varying instruction complexities. To improve this, Phased Instruction Fine-Tuning (Phased IFT) is proposed, based on the idea that learning to follow instructions is a gradual process. It assesses instruction difficulty using GPT-4, divides the instruction data into subsets of increasing difficulty, and uptrains the model sequentially on these subsets. Experiments with Llama-2 7B/13B/70B, Llama3 8/70B and Mistral-7B models using Alpaca data show that Phased IFT significantly outperforms One-off IFT, supporting the progressive alignment hypothesis and providing a simple and efficient way to enhance large language models. Codes and datasets from our experiments are freely available at https://github.com/xubuvd/PhasedSFT.

CLMay 4, 2023Code
An Asynchronous Updating Reinforcement Learning Framework for Task-oriented Dialog System

Sai Zhang, Yuwei Hu, Xiaojie Wang et al.

Reinforcement learning has been applied to train the dialog systems in many works. Previous approaches divide the dialog system into multiple modules including DST (dialog state tracking) and DP (dialog policy), and train these modules simultaneously. However, different modules influence each other during training. The errors from DST might misguide the dialog policy, and the system action brings extra difficulties for the DST module. To alleviate this problem, we propose Asynchronous Updating Reinforcement Learning framework (AURL) that updates the DST module and the DP module asynchronously under a cooperative setting. Furthermore, curriculum learning is implemented to address the problem of unbalanced data distribution during reinforcement learning sampling, and multiple user models are introduced to increase the dialog diversity. Results on the public SSD-PHONE dataset show that our method achieves a compelling result with a 31.37% improvement on the dialog success rate. The code is publicly available via https://github.com/shunjiu/AURL.

CLSep 10, 2021Code
Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization

Junpeng Liu, Yanyan Zou, Hainan Zhang et al.

Unlike well-structured text, such as news reports and encyclopedia articles, dialogue content often comes from two or more interlocutors, exchanging information with each other. In such a scenario, the topic of a conversation can vary upon progression and the key information for a certain topic is often scattered across multiple utterances of different speakers, which poses challenges to abstractly summarize dialogues. To capture the various topic information of a conversation and outline salient facts for the captured topics, this work proposes two topic-aware contrastive learning objectives, namely coherence detection and sub-summary generation objectives, which are expected to implicitly model the topic change and handle information scattering challenges for the dialogue summarization task. The proposed contrastive objectives are framed as auxiliary tasks for the primary dialogue summarization task, united via an alternative parameter updating strategy. Extensive experiments on benchmark datasets demonstrate that the proposed simple method significantly outperforms strong baselines and achieves new state-of-the-art performance. The code and trained models are publicly available via \href{https://github.com/Junpliu/ConDigSum}{https://github.com/Junpliu/ConDigSum}.

STMay 8
Accelerating Langevin Monte Carlo via Efficient Stochastic Runge--Kutta Methods beyond Log-Concavity

Bin Yang, Xiaojie Wang

Sampling from a high-dimensional probability distribution is a fundamental algorithmic task arising in wide-ranging applications across multiple disciplines, including scientific computing, computational statistics and machine learning. Langevin Monte Carlo (LMC) algorithms are among the most widely used sampling methods in high-dimensional settings. This paper introduces a novel higher-order and Hessian-free LMC sampling algorithm based on an efficient stochastic Runge--Kutta method of strong order $1.5$ for the overdamped Langevin dynamics. In contrast to the existing Runge--Kutta type LMC (Li et al., 2019) involved with three gradient evaluations, the newly proposed algorithm is computationally cheaper and requires only two gradient evaluations for one iteration. Under certain log-smooth conditions, non-asymptotic error bounds of the proposed algorithms are analyzed in $\mathcal{W}_2$-distance. In particular, a uniform-in-time convergence rate of order $O(d ^{\frac32} h^{\frac32})$ is derived in a non-log-concave setting, matching the convergence rate proved in the aforementioned work but under the log-concavity condition. Numerical experiments are finally presented to demonstrate the effectiveness of the new sampling algorithm.

ITDec 8, 2023
Joint User Association, Interference Cancellation and Power Control for Multi-IRS Assisted UAV Communications

Zhaolong Ning, Hao Hu, Xiaojie Wang et al.

Intelligent reflecting surface (IRS)-assisted unmanned aerial vehicle (UAV) communications are expected to alleviate the load of ground base stations in a cost-effective way. Existing studies mainly focus on the deployment and resource allocation of a single IRS instead of multiple IRSs, whereas it is extremely challenging for joint multi-IRS multi-user association in UAV communications with constrained reflecting resources and dynamic scenarios. To address the aforementioned challenges, we propose a new optimization algorithm for joint IRS-user association, trajectory optimization of UAVs, successive interference cancellation (SIC) decoding order scheduling and power allocation to maximize system energy efficiency. We first propose an inverse soft-Q learning-based algorithm to optimize multi-IRS multi-user association. Then, SCA and Dinkelbach-based algorithm are leveraged to optimize UAV trajectory followed by the optimization of SIC decoding order scheduling and power allocation. Finally, theoretical analysis and performance results show significant advantages of the designed algorithm in convergence rate and energy efficiency.

CLJul 28, 2025
Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

Jiaju Chen, Yuxuan Lu, Xiaojie Wang et al.

Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging "LLM-as-a-judge" paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in-group debates with multi-agents to Generate multi-dimensional feedback. Our evaluation experiments in both the educational and medical domains demonstrate that MAJ-EVAL can generate evaluation results that better align with human experts' ratings compared with conventional automated evaluation metrics and existing LLM-as-a-judge methods.

CVMar 10, 2024
In-context Prompt Learning for Test-time Vision Recognition with Frozen Vision-language Model

Junhui Yin, Xinyu Zhang, Lin Wu et al.

Current pre-trained vision-language models, such as CLIP, have demonstrated remarkable zero-shot generalization capabilities across various downstream tasks. However, their performance significantly degrades when test inputs exhibit different distributions. In this paper, we explore the concept of test-time prompt tuning (TTPT), which facilitates the adaptation of the CLIP model to novel downstream tasks through a one-step unsupervised optimization that involves only test samples. Inspired by in-context learning in natural language processing (NLP), we propose In-Context Prompt Learning (InCPL) for test-time visual recognition tasks, which empowers a pre-trained vision-language model with labeled examples as context information on downstream task. Specifically, InCPL associates a new test sample with very few labeled examples (sometimes just one) as context information, enabling reliable label estimation for the test sample and facilitating model adaptation. To achieve this, InCPL employs an efficient language-to-vision translator to explore the textual prior information for visual prompt learning. Further, we introduce a context-aware unsupervised loss to optimize visual prompts tailored to test samples. Finally, we design a cyclic learning strategy for visual and textual prompts to ensure mutual synergy across different modalities. This enables a pre-trained, frozen CLIP model to adapt to any task using its learned adaptive prompt. Our method demonstrates superior performance and achieves state-of-the-art results across various downstream datasets.

ROMar 31
Generalizable Dense Reward for Long-Horizon Robotic Tasks

Silong Yong, Stephen Sheng, Carl Qi et al.

Existing robotic foundation policies are trained primarily via large-scale imitation learning. While such models demonstrate strong capabilities, they often struggle with long-horizon tasks due to distribution shift and error accumulation. While reinforcement learning (RL) can finetune these models, it cannot work well across diverse tasks without manual reward engineering. We propose VLLR, a dense reward framework combining (1) an extrinsic reward from Large Language Models (LLMs) and Vision-Language Models (VLMs) for task progress recognition, and (2) an intrinsic reward based on policy self-certainty. VLLR uses LLMs to decompose tasks into verifiable subtasks and then VLMs to estimate progress to initialize the value function for a brief warm-up phase, avoiding prohibitive inference cost during full training; and self-certainty provides per-step intrinsic guidance throughout PPO finetuning. Ablation studies reveal complementary benefits: VLM-based value initialization primarily improves task completion efficiency, while self-certainty primarily enhances success rates, particularly on out-of-distribution tasks. On the CHORES benchmark covering mobile manipulation and navigation, VLLR achieves up to 56% absolute success rate gains over the pretrained policy, up to 5% gains over state-of-the-art RL finetuning methods on in-distribution tasks, and up to $10\%$ gains on out-of-distribution tasks, all without manual reward engineering. Additional visualizations can be found in https://silongyong.github.io/vllr_project_page/

CLMay 17, 2025
Enhancing Complex Instruction Following for Large Language Models with Mixture-of-Contexts Fine-tuning

Yuheng Lu, ZiMeng Bai, Caixia Yuan et al.

Large language models (LLMs) exhibit remarkable capabilities in handling natural language tasks; however, they may struggle to consistently follow complex instructions including those involve multiple constraints. Post-training LLMs using supervised fine-tuning (SFT) is a standard approach to improve their ability to follow instructions. In addressing complex instruction following, existing efforts primarily focus on data-driven methods that synthesize complex instruction-output pairs for SFT. However, insufficient attention allocated to crucial sub-contexts may reduce the effectiveness of SFT. In this work, we propose transforming sequentially structured input instruction into multiple parallel instructions containing subcontexts. To support processing this multi-input, we propose MISO (Multi-Input Single-Output), an extension to currently dominant decoder-only transformer-based LLMs. MISO introduces a mixture-of-contexts paradigm that jointly considers the overall instruction-output alignment and the influence of individual sub-contexts to enhance SFT effectiveness. We apply MISO fine-tuning to complex instructionfollowing datasets and evaluate it with standard LLM inference. Empirical results demonstrate the superiority of MISO as a fine-tuning method for LLMs, both in terms of effectiveness in complex instruction-following scenarios and its potential for training efficiency.

CLOct 22, 2024
Controlled Low-Rank Adaptation with Subspace Regularization for Continued Training on Large Language Models

Yuheng Lu, Bingshuo Qian, Caixia Yuan et al.

Large language models (LLMs) exhibit remarkable capabilities in natural language processing but face catastrophic forgetting when learning new tasks, where adaptation to a new domain leads to a substantial decline in performance on previous tasks. In this paper, we propose Controlled LoRA (CLoRA), a sub-space regularization method on LoRA structure. Aiming to reduce the scale of output change while introduce minimal constraint on model capacity, CLoRA imposes constraint on the direction of updating matrix's null space. Experimental results on one-stage LLM finetuning tasks and continual learning settings highlight the superority of CLoRA as a effective parameter efficient finetuning method with catastrophic forgetting mitigating.Further investigation for model parameters indicates that CLoRA effectively balances the trade-off between model capacity and degree of forgetting.

CVNov 22, 2025
FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

Guoyang Xia, Yifeng Ding, Fengfa Li et al.

Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to ease computational/memory burdens while preserving performance, enabling MLLM deployment in resource-constrained or latency-sensitive scenarios. Current visual token pruning methods mainly rely on attention-based redundancy analysis and are tailored to dense architectures. We propose Fast Multimodal Mixture-of-Experts (FastMMoE), a training-free acceleration framework for mixture-of-experts (MoE) based MLLMs, developed from a routing analysis perspective. FastMMoE combines two complementary strategies: (i) expert activation reduction for visual tokens to minimize unnecessary expert computation; and (ii) routing-aware token pruning that leverages similarity in routing probability distributions to identify and remove highly redundant visual tokens. Experiments on large-scale MoE-MLLMs such as DeepSeek-VL2 and InternVL3.5 demonstrate that FastMMoE can reduce FLOPs by up to 55.0% while retaining approximately 95.5% of the original performance, consistently outperforming dense-model pruning baselines including FastV and SparseVLM across multiple retention rates.

MLSep 30, 2025
When Langevin Monte Carlo Meets Randomization: Non-asymptotic Error Bounds beyond Log-Concavity and Gradient Lipschitzness

Xiaojie Wang, Bin Yang

Efficient sampling from complex and high dimensional target distributions turns out to be a fundamental task in diverse disciplines such as scientific computing, statistics and machine learning. In this paper, we revisit the randomized Langevin Monte Carlo (RLMC) for sampling from high dimensional distributions without log-concavity. Under the gradient Lipschitz condition and the log-Sobolev inequality, we prove a uniform-in-time error bound in $\mathcal{W}_2$-distance of order $O(\sqrt{d}h)$ for the RLMC sampling algorithm, which matches the best one in the literature under the log-concavity condition. Moreover, when the gradient of the potential $U$ is non-globally Lipschitz with superlinear growth, modified RLMC algorithms are proposed and analyzed, with non-asymptotic error bounds established. To the best of our knowledge, the modified RLMC algorithms and their non-asymptotic error bounds are new in the non-globally Lipschitz setting.

CLSep 28, 2025
ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

Haojie Ouyang, Jianwei Lv, Lei Ren et al.

Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to the self-attention's quadratic complexity with input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they either have issues with semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. Particularly, ChunkLLM attains a maximum speedup of 4.48x in comparison to the vanilla Transformer in the processing of 120K long texts.

CVSep 23, 2025
Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models

Yueyan Li, Chenggong Zhao, Zeyuan Zang et al.

Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the "what" and "where" pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model's perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.

LGFeb 10, 2025
Fine-Tuning is Subgraph Search: A New Lens on Learning Dynamics

Yueyan Li, Wenhao Gao, Caixia Yuan et al.

The study of mechanistic interpretability aims to reverse-engineer a model to explain its behaviors. While recent studies have focused on the static mechanism of a certain behavior, the learning dynamics inside a model remain to be explored. In this work, we develop a fine-tuning method for analyzing the mechanism behind learning. Inspired by the concept of intrinsic dimension, we view a model as a computational graph with redundancy for a specific task, and treat the fine-tuning process as a search for and optimization of a subgraph within this graph. Based on this hypothesis, we propose circuit-tuning, an algorithm that iteratively builds the subgraph for a specific task and updates the relevant parameters in a heuristic way. We first validate our hypothesis through a carefully designed experiment and provide a detailed analysis of the learning dynamics during fine-tuning. Subsequently, we conduct experiments on more complex tasks, demonstrating that circuit-tuning could strike a balance between the performance on the target task and the general capabilities. Our work offers a new analytical method for the dynamics of fine-tuning, provides new findings on the mechanisms behind the training process, and inspires the design of superior algorithms for the training of neural networks.

IRMay 26, 2023
Multimodal Recommendation Dialog with Subjective Preference: A New Challenge and Benchmark

Yuxing Long, Binyuan Hui, Caixia Yuan1 et al.

Existing multimodal task-oriented dialog data fails to demonstrate the diverse expressions of user subjective preferences and recommendation acts in the real-life shopping scenario. This paper introduces a new dataset SURE (Multimodal Recommendation Dialog with SUbjective PREference), which contains 12K shopping dialogs in complex store scenes. The data is built in two phases with human annotations to ensure quality and diversity. SURE is well-annotated with subjective preferences and recommendation acts proposed by sales experts. A comprehensive analysis is given to reveal the distinguishing features of SURE. Three benchmark tasks are then proposed on the data to evaluate the capability of multimodal recommendation agents. Based on the SURE, we propose a baseline model, powered by a state-of-the-art multimodal model, for these tasks.

IROct 29, 2021
Deep Keyphrase Completion

Yu Zhao, Jia Song, Huali Feng et al.

Keyphrase provides accurate information of document content that is highly compact, concise, full of meanings, and widely used for discourse comprehension, organization, and text retrieval. Though previous studies have made substantial efforts for automated keyphrase extraction and generation, surprisingly, few studies have been made for \textit{keyphrase completion} (KPC). KPC aims to generate more keyphrases for document (e.g. scientific publication) taking advantage of document content along with a very limited number of known keyphrases, which can be applied to improve text indexing system, etc. In this paper, we propose a novel KPC method with an encoder-decoder framework. We name it \textit{deep keyphrase completion} (DKPC) since it attempts to capture the deep semantic meaning of the document content together with known keyphrases via a deep learning framework. Specifically, the encoder and the decoder in DKPC play different roles to make full use of the known keyphrases. The former considers the keyphrase-guiding factors, which aggregates information of known keyphrases into context. On the contrary, the latter considers the keyphrase-inhibited factor to inhibit semantically repeated keyphrase generation. Extensive experiments on benchmark datasets demonstrate the efficacy of our proposed model.