CL Mar 14, 2022
WCL-BBCD: A Contrastive Learning and Knowledge Graph Approach to Named Entity Recognition
Renjie Zhou, Qiang Hu, Jian Wan et al.
Named entity recognition is one of the core tasks of information extraction. Word ambiguity and word abbreviation are important causes of low recognition rates for named entities. In this paper, we propose a novel named entity recognition model, WCL-BBCD (Word Contrastive Learning with BERT-BiLSTM-CRF-DBpedia), which incorporates the idea of contrastive learning. The model first trains on the sentence pairs in the text, calculates the similarity between sentence pairs, and fine-tunes the BERT used for the named entity recognition task according to that similarity, so as to alleviate word ambiguity. Then, the fine-tuned BERT is combined with BiLSTM-CRF to perform the named entity recognition task. Finally, the recognition results are corrected using prior knowledge such as knowledge graphs, so as to alleviate the low recognition rate caused by word abbreviations. The results of experiments conducted on the CoNLL-2003 English dataset and the OntoNotes V5 English dataset show that our model outperforms other similar models on.
98.4 IR Apr 7 Code
QKVQA: Question-Focused Filtering for Knowledge-based VQA
Wei Ye, Yixin Su, Yueguo Chen et al.
Visual Question Answering (VQA) is the task of answering questions based on image content. Building upon this, Knowledge-Based VQA (KB-VQA) requires models to answer questions that depend on external knowledge beyond the visual content of an image. In such settings, effective knowledge filtering is essential for achieving high question answering accuracy. Typical filtering methods suffer from two issues: they fail to focus on parts relevant to the question during candidate section encoding, and they use similarity metrics to locate a section from a single article, resulting in information limitation. To address these issues, this paper proposes a question-focused, cross-article filtering method. Specifically, we design a trainable Question-Focused Filter (QFF) and a Chunk-based Dynamic Cross-Article Selection module (CDA). This approach maintains inference time comparable to the optimal method with the shorter context length, efficiently obtaining high-quality filtered knowledge. The accuracy outperforms current state-of-the-art methods by 3.2 and 2.2 percentage points on Encyclopedic-VQA and InfoSeek, respectively. The code is publicly available at: https://github.com/leaffeall/QKVQA.
40.4 IR May 8 Code
DCGL: Dual-Channel Graph Learning with Large Language Models for Knowledge-Aware Recommendation
Xinchi Zou, Tongzhenzhi Su, Jianjun Li et al.
Knowledge Graphs (KGs) have proven highly effective for recommendation systems by capturing latent item relationships, while recent integration of Large Language Models (LLMs) has further enhanced semantic understanding and addressed knowledge sparsity issues. Nevertheless, current KG-and-LLM-based methods still face three main limitations: 1) inadequate modeling of implicit semantic relationships beyond explicit KG links; 2) suboptimal single-channel fusion of ID and LLM embeddings, which often leads to signal interference and blurred representations; and 3) insufficient consideration of user-item interaction frequency variations in recommendation strategies. To address these challenges, we propose the Dual-Channel Graph Learning (DCGL) framework, featuring three key innovations: 1) a dual-channel architecture that structurally decouples rich semantic information from user behavioral patterns, preventing early interference; 2) a multi-level contrastive learning mechanism that enhances robustness against KG noise through intra-view contrasts and bridges semantic gaps between channels via inter-view alignment; and 3) a dynamic fusion mechanism that adaptively balances semantic generalization and behavioral specificity based on interaction frequency, resolving the cascading limitation. Extensive experiments on four real-world datasets show that DCGL consistently outperforms state-of-the-art methods, yielding substantial improvements in sparse scenarios while maintaining precision for active users. Our code is available at https://github.com/XinchiZou/DCGL.
CV Oct 15, 2024 Code
Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices
Zhiyuan Ma, Yuzhu Zhang, Guoli Jia et al.
As one of the most popular and sought-after classes of generative models in recent years, diffusion models have sparked the interest of many researchers and steadily shown excellent advantages in various generative tasks such as image synthesis, video generation, molecule design, 3D scene rendering and multimodal generation, relying on their dense theoretical principles and reliable application practices. The remarkable success of these recent efforts on diffusion models comes largely from progressive design principles and efficient architecture, training, inference, and deployment methodologies. However, there has not been a comprehensive and in-depth review that summarizes these principles and practices to support the rapid understanding and application of diffusion models. In this survey, we provide a new efficiency-oriented perspective on these existing efforts, which mainly focuses on the profound principles and efficient practices in architecture designs, model training, fast inference and reliable deployment, to guide further theoretical research, algorithm migration and model application for new scenarios in a reader-friendly way. https://github.com/ponyzym/Efficient-DMs-Survey
CV Jun 10, 2025 Code
Flow Diverse and Efficient: Learning Momentum Flow Matching via Stochastic Velocity Field Sampling
Zhiyuan Ma, Ruixun Liu, Sixian Liu et al.
Recently, the rectified flow (RF) has emerged as the new state-of-the-art among flow-based diffusion models due to its high efficiency advantage in straight path sampling, especially with the amazing images generated by a series of RF models such as Flux 1.0 and SD 3.0. Although a straight-line connection between the noisy and natural data distributions is intuitive, fast, and easy to optimize, it still inevitably leads to: 1) Diversity concerns, which arise since straight-line paths only cover a fairly restricted sampling space. 2) Multi-scale noise modeling concerns, since the straight line flow only needs to optimize the constant velocity field $\bm{v}$ between the two distributions $\bm{\pi}_0$ and $\bm{\pi}_1$. In this work, we present Discretized-RF, a new family of rectified flow (also called momentum flow models since they refer to the previous velocity component and the random velocity component in each diffusion step), which discretizes the straight path into a series of variable velocity field sub-paths (namely "momentum fields") to expand the search space, especially when close to the distribution $p_\text{noise}$. Different from the previous case where noise is directly superimposed on $\bm{x}$, we introduce noise on the velocity $\bm{v}$ of the sub-path to change its direction in order to improve the diversity and multi-scale noise modeling abilities. Experimental results on several representative datasets demonstrate that learning momentum flow matching by sampling random velocity fields will produce trajectories that are both diverse and efficient, and can consistently generate high-quality and diverse results. Code is available at https://github.com/liuruixun/momentum-fm.
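To make the contrast in this abstract concrete, the following is a minimal sketch of a straight rectified-flow path and of a rollout that perturbs the velocity, rather than the sample, on each sub-path. The function names, the number of sub-paths, and the Gaussian perturbation with scale `sigma` are assumptions made for illustration; the abstract does not specify how Discretized-RF samples its velocity fields, and this is not the authors' implementation.

```python
import random

def straight_path(x0, x1, t):
    """Point at time t on the straight line from x0 (noise) to x1 (data)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def constant_velocity(x0, x1):
    """Velocity of the straight path: the same vector x1 - x0 at every t."""
    return [b - a for a, b in zip(x0, x1)]

def noisy_subpath_rollout(x0, x1, num_subpaths=4, sigma=0.1, seed=0):
    """Hypothetical rollout: split [0, 1] into equal sub-paths and add
    Gaussian noise to the velocity (not to x) on each sub-path."""
    rng = random.Random(seed)
    v = constant_velocity(x0, x1)
    dt = 1.0 / num_subpaths
    x = list(x0)
    trajectory = [list(x)]
    for _ in range(num_subpaths):
        # Perturb the direction of travel for this sub-path only.
        v_noisy = [vi + sigma * rng.gauss(0.0, 1.0) for vi in v]
        x = [xi + dt * vi for xi, vi in zip(x, v_noisy)]
        trajectory.append(list(x))
    return trajectory
```

With `sigma=0.0` the rollout reduces to the straight path and ends exactly at `x1`; larger values spread the intermediate points around that line, which is the intuition behind the diversity claim.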
49.6 MM Apr 16
Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD
Junhao Xiao, Shun Feng, Zhiyu Wu et al.
Audio-Visual Speaker Detection (AVSD) hinges on modeling both individual temporal continuity and inter-personal social context. Existing coupled architectures struggle to reconcile these tasks in shared representation spaces due to conflicting inductive biases: temporal modeling favors low-frequency smoothness, while inter-personal interaction requires high-frequency discriminability. We propose D$^2$Stream, a decoupled dual-stream framework that explicitly isolates these functionalities into parallel, task-specific branches. Specifically, the Intra-speaker Temporal Continuity (ITC) stream captures longitudinal stability, whereas the Inter-personal Social Relation (ISR) stream models transversal social cues. Quantitative gradient analysis reveals an evolutionary divergence in update directions, stabilizing at 86.1°, which confirms the inherent task conflict and the effectiveness of our structural decoupling. D$^2$Stream breaks the long-standing performance plateau, achieving a state-of-the-art 95.6% mAP on AVA-ActiveSpeaker and superior generalization on Columbia ASD, all within a lightweight and efficient design.
CV Jan 7
I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing
Jinghan Yu, Junhao Xiao, Chenyu Zhu et al.
Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still struggles significantly with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. It is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. Further, we construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
98.9 LG May 9
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
Jiaming Li, Chenyu Zhu, Zhiyuan Ma et al.
Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most of them still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1%, and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
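One way to read the distribution-matching idea is sketched below: rewards for K sampled trajectories define a Boltzmann target, the policy's trajectory log-probabilities are normalized with a softmax over the same K samples, and the loss is the forward KL divergence from target to policy. The exact form of the Softmax-TB objective is not given in the abstract, so the normalization over K samples, the `temperature` parameter, and all function names here are assumptions for illustration only.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_target(rewards, temperature=1.0):
    """Reward-induced Boltzmann distribution over K trajectories."""
    return softmax([r / temperature for r in rewards])

def trajectory_matching_loss(log_probs, rewards, temperature=1.0):
    """Forward KL(target || policy) over K sampled trajectories, where the
    policy distribution is a softmax of the trajectory log-probabilities."""
    target = boltzmann_target(rewards, temperature)
    policy = softmax(log_probs)
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(target, policy)
        if p > 0.0
    )
```

Because the forward KL penalizes the policy wherever the target has mass and the policy does not, minimizing it discourages collapsing onto a single high-reward trajectory, which matches the mode-covering argument in the abstract.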
17.2 IR May 1
Time-Interval-Aware Disentangled Expert Modeling for Next-Basket Recommendation
Zhiying Deng, Yuan Fu, Usman Farooq et al.
Next-basket recommendation (NBR) is a type of recommendation that aims to predict a set of items a user will purchase based on their historical transaction basket sequences. It is governed by a dynamic interplay between two distinct user intents: habitual repurchase, which involves repeating past behaviors, and exploratory interest, which involves discovering new items. However, existing NBR methods generally suffer from two limitations: (1) they often entangle these conflicting motives within a single representation, causing habits to overshadow discovery, and (2) they rely on discrete sequential modeling that ignores continuous-time intervals and item-specific periodicities. In this paper, we propose a novel solution named Time-Interval Disentangled Experts (TIDE) to address these challenges. TIDE incorporates a Hawkes-enhanced Fourier Time Encoding to capture item-specific temporal periodicities and dynamic decay. To decouple user intentions, TIDE utilizes a dual-expert architecture that integrates a Habit Expert for recurring needs and a Pattern-Guided Exploration Expert for discovery. Combined with an item-aware gating mechanism, TIDE adaptively balances repurchase and exploration. Extensive experiments on four diverse real-world datasets demonstrate that TIDE consistently outperforms representative state-of-the-art NBR methods.
CV Dec 13, 2023
LMD: Faster Image Reconstruction with Latent Masking Diffusion
Zhiyuan Ma, Zhihuan Yu, Jianjun Li et al.
As a class of fruitful approaches, diffusion probabilistic models (DPMs) have shown excellent advantages in high-resolution image reconstruction. On the other hand, masked autoencoders (MAEs), as popular self-supervised vision learners, have demonstrated simpler and more effective image reconstruction and transfer capabilities on downstream tasks. However, both require extremely high training costs, either due to inherently high temporal dependence (i.e., excessively long diffusion steps) or due to artificially low spatial dependence (i.e., a human-formulated high mask ratio, such as 0.75). To this end, this paper presents LMD, a faster image reconstruction framework with latent masking diffusion. First, we propose to project and reconstruct images in latent space through a pre-trained variational autoencoder, which is theoretically more efficient than working in pixel space. Then, we combine the advantages of MAEs and DPMs to design a progressive masking diffusion model, which gradually increases the masking proportion with three different schedulers and reconstructs the latent features from simple to difficult, without sequentially performing denoising diffusion as in DPMs or using a fixed high masking ratio as in MAEs, so as to alleviate the high training-time cost. Our approach allows for learning high-capacity models, accelerates their training (by 3x or more), and barely reduces the original accuracy. Inference speed in downstream tasks also significantly outperforms that of previous approaches.
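The progressive masking idea can be illustrated with a mask ratio that grows over training instead of staying fixed. The abstract mentions three schedulers but does not name them, so the linear, cosine, and exponential shapes below, the starting ratio of 0.15, and the function names are all assumptions; only the 0.75 end point echoes the MAE ratio quoted in the abstract.

```python
import math

def _progress(step, total_steps):
    """Training progress clipped to [0, 1]."""
    return min(max(step / total_steps, 0.0), 1.0)

def linear_mask_ratio(step, total_steps, start=0.15, end=0.75):
    """Mask ratio that grows at a constant rate."""
    p = _progress(step, total_steps)
    return start + (end - start) * p

def cosine_mask_ratio(step, total_steps, start=0.15, end=0.75):
    """Mask ratio that grows slowly at first and at the end, faster mid-way."""
    p = _progress(step, total_steps)
    return start + (end - start) * (1.0 - math.cos(math.pi * p)) / 2.0

def exponential_mask_ratio(step, total_steps, start=0.15, end=0.75):
    """Mask ratio that grows by a constant factor per unit of progress."""
    p = _progress(step, total_steps)
    return start * (end / start) ** p
```

All three schedules start from an easy reconstruction problem (few masked latents) and end at the same hard one; they differ only in how quickly difficulty ramps up.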
CV Oct 16, 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
Zhiyuan Ma, Jianjun Li, Guohui Li et al.
With the flourishing of social media platforms, vision-language pre-training (VLP) has recently received great attention, and remarkable progress has been achieved. The success of VLP largely benefits from the information complementation and enhancement between different modalities. However, most recent studies focus on cross-modal contrastive learning (CMCL) to promote image-text alignment by pulling embeddings of positive sample pairs together while pushing those of negative pairs apart, which ignores the natural asymmetry between different modalities and requires a large-scale image-text corpus to make progress. To mitigate this predicament, we propose CMAL, a Cross-Modal Associative Learning framework with anchor point detection and cross-modal associative learning for VLP. Specifically, we first embed visual objects and textual tokens into separate hypersphere spaces to learn intra-modal hidden features, and then design a cross-modal associative prompt layer to perform anchor point masking and swap feature filling for constructing a hybrid cross-modal associative prompt. Afterwards, we exploit a unified semantic encoder to learn their cross-modal interactive features for context adaptation. Finally, we design an associative mapping classification layer to learn potential associative mappings between modalities at anchor points, within which we develop a new self-supervised associative mapping classification task to boost CMAL's performance. Experimental results verify the effectiveness of CMAL, showing that it achieves competitive performance against previous CMCL-based methods on four common downstream vision-and-language tasks, with a significantly smaller corpus. In particular, CMAL obtains new state-of-the-art results on SNLI-VE and REC (testA).
CV May 18, 2025
Context-Aware Autoregressive Models for Multi-Conditional Image Generation
Yixiao Chen, Zhiyuan Ma, Guoli Jia et al.
Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence, offering a concise solution for multi-conditional image generation tasks. In this work, we propose ContextAR, a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduce computational complexity while preserving effective intra-condition perception. Without any fine-tuning, ContextAR supports arbitrary combinations of conditions at inference time. Experimental results demonstrate the powerful controllability and versatility of our approach, and show that it achieves performance competitive with diffusion-based multi-conditional control approaches and the existing autoregressive baseline across diverse multi-condition driven scenarios. Project page: https://context-ar.github.io/
IR Aug 22, 2025
OPERA: A Reinforcement Learning-Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval
Yu Liu, Yanbing Liu, Fangfang Yuan et al.
Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks: 1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions. 2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents. 3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge. Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.
CL Dec 13, 2025
SCIR: A Self-Correcting Iterative Refinement Framework for Enhanced Information Extraction Based on Schema
Yushen Fang, Jianjun Li, Mingqian Ding et al.
Although Large language Model (LLM)-powered information extraction (IE) systems have shown impressive capabilities, current fine-tuning paradigms face two major limitations: high training costs and difficulties in aligning with LLM preferences. To address these issues, we propose a novel universal IE paradigm, the Self-Correcting Iterative Refinement (SCIR) framework, along with a Multi-task Bilingual (Chinese-English) Self-Correcting (MBSC) dataset containing over 100,000 entries. The SCIR framework achieves plug-and-play compatibility with existing LLMs and IE systems through its Dual-Path Self-Correcting module and feedback-driven optimization, thereby significantly reducing training costs. Concurrently, the MBSC dataset tackles the challenge of preference alignment by indirectly distilling GPT-4's capabilities into IE result detection models. Experimental results demonstrate that SCIR outperforms state-of-the-art IE methods across three key tasks: named entity recognition, relation extraction, and event extraction, achieving a 5.27 percent average improvement in span-based Micro-F1 while reducing training costs by 87 percent compared to baseline approaches. These advancements not only enhance the flexibility and accuracy of IE systems but also pave the way for lightweight and efficient IE paradigms.
CV Aug 5, 2025
MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing
Jinghan Yu, Junhao Xiao, Zhiyuan Ma et al.
Recent years have witnessed the success of diffusion models in image customization tasks. However, existing mask-guided human erasing methods still struggle in complex scenarios such as human-human occlusion, human-object entanglement, and human-background interference, mainly due to the lack of large-scale multi-instance datasets and effective spatial decoupling to separate foreground from background. To bridge these gaps, we curate the MILD dataset capturing diverse poses, occlusions, and complex multi-instance interactions. We then define the Cross-Domain Attention Gap (CAG), an attention-gap metric to quantify semantic leakage. On top of these, we propose Multi-Layer Diffusion (MILD), which decomposes the generation process into independent denoising pathways, enabling separate reconstruction of each foreground instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, a plug-and-play module that incorporates pose, parsing, and spatial relationships into the diffusion process to improve structural awareness and restoration quality. Additionally, we present Spatially-Modulated Attention, an adaptive mechanism that leverages spatial mask priors to modulate attention across semantic regions, further widening the CAG to effectively minimize boundary artifacts and mitigate semantic leakage. Experiments show that MILD significantly outperforms existing methods. Datasets and code are publicly available at: https://mild-multi-layer-diffusion.github.io/.
IR Oct 4, 2020
A Light Heterogeneous Graph Collaborative Filtering Model using Textual Information
Chaoyang Wang, Zhiqiang Guo, Guohui Li et al.
Due to the development of graph neural networks, graph-based representation learning methods have made great progress in recommender systems. However, data sparsity is still a challenging problem that most graph-based recommendation methods are confronted with. Recent works try to address this problem by utilizing side information. In this paper, we exploit the relevant and easily accessible textual information by advanced natural language processing (NLP) models and propose a light RGCN-based (RGCN, relational graph convolutional network) collaborative filtering method on heterogeneous graphs. Specifically, to incorporate rich textual knowledge, we utilize a pre-trained NLP model to initialize the embeddings of text nodes. Afterward, by performing a simplified RGCN-based node information propagation on the constructed heterogeneous graph, the embeddings of users and items can be adjusted with textual knowledge, which effectively alleviates the negative effects of data sparsity. Moreover, the matching function used by most graph-based representation learning methods is the inner product, which is not appropriate for the obtained embeddings that contain complex semantics. We design a predictive network that combines graph-based representation learning with neural matching function learning, and demonstrate that this architecture can bring a significant performance improvement. Extensive experiments are conducted on three publicly available datasets, and the results verify the superior performance of our method over several baselines.
IR Apr 14, 2020
A Text-based Deep Reinforcement Learning Framework for Interactive Recommendation
Chaoyang Wang, Zhiqiang Guo, Jianjun Li et al.
Due to its nature of learning from dynamic interactions and planning for long-run performance, reinforcement learning (RL) recently has received much attention in interactive recommender systems (IRSs). IRSs usually face the large discrete action space problem, which makes most of the existing RL-based recommendation methods inefficient. Moreover, data sparsity is another challenging problem that most IRSs are confronted with. While the textual information like reviews and descriptions is less sensitive to sparsity, existing RL-based recommendation methods either neglect or are not suitable for incorporating textual information. To address these two problems, in this paper, we propose a Text-based Deep Deterministic Policy Gradient framework (TDDPG-Rec) for IRSs. Specifically, we leverage textual information to map items and users into a feature space, which greatly alleviates the sparsity problem. Moreover, we design an effective method to construct an action candidate set. By the policy vector dynamically learned from TDDPG-Rec that expresses the user's preference, we can select actions from the candidate set effectively. Through experiments on three public datasets, we demonstrate that TDDPG-Rec achieves state-of-the-art performance over several baselines in a time-efficient manner.
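The selection step described here, picking actions from a candidate set with a learned policy vector, can be sketched as a ranking problem. The abstract does not say how the policy vector scores candidates, so the inner-product scoring, the `select_actions` name, and the toy embeddings below are assumptions for illustration, not the TDDPG-Rec implementation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def select_actions(policy_vector, candidates, k=2):
    """Rank candidate items by their score against the policy vector and
    return the ids of the top k. `candidates` maps item id -> embedding."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: dot(policy_vector, item[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:k]]

# Toy usage: three items embedded in a 2-dimensional text feature space.
candidates = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
top_items = select_actions([1.0, 0.2], candidates, k=2)
```

Restricting the ranking to a small candidate set is what keeps the cost manageable when the full item catalogue, and hence the discrete action space, is large.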
CV Sep 3, 2019
PSDNet and DPDNet: Efficient channel expansion, Depthwise-Pointwise-Depthwise Inverted Bottleneck Block
Guoqing Li, Meng Zhang, Qianru Zhang et al.
In many real-time applications, the deployment of deep neural networks is constrained by high computational cost, and efficient lightweight neural networks have therefore attracted wide attention. In this paper, we propose using depthwise convolution (DWC) to expand the number of channels in a bottleneck block, which is more efficient than 1x1 convolution. The proposed Pointwise-Standard-Depthwise network (PSDNet), based on channel expansion with DWC, has fewer parameters, lower computational cost and higher accuracy than the corresponding ResNet on CIFAR datasets. To design a more efficient lightweight convolutional neural network, a Depthwise-Pointwise-Depthwise inverted bottleneck block (DPD block) is proposed, and DPDNet is designed by stacking DPD blocks. The number of parameters of DPDNet is only about 60% of that of MobileNetV2 for networks with the same number of layers, yet it achieves comparable accuracy. Additionally, two hyperparameters of DPDNet allow a trade-off between accuracy and computational cost, which makes DPDNet suitable for diverse tasks. Furthermore, we find that networks with more DWC layers outperform networks with more 1x1 convolution layers, which indicates that extracting spatial information is more important than combining channel information.
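The efficiency claim for channel expansion can be checked with a simple weight count. The sketch below compares a 1x1 convolution that expands C channels to t·C channels against a depthwise convolution with channel multiplier t; bias terms are ignored, and the function names and the example sizes are assumptions chosen for illustration rather than configurations from the paper.

```python
def pointwise_expand_params(in_channels, expansion):
    """Weights of a 1x1 convolution mapping C channels to expansion * C:
    every output channel mixes all C input channels."""
    return in_channels * (in_channels * expansion)

def depthwise_expand_params(in_channels, expansion, kernel_size=3):
    """Weights of a depthwise convolution with channel multiplier
    `expansion`: each output channel has one k x k filter over a single
    input channel."""
    return kernel_size * kernel_size * in_channels * expansion

# Example: expand 64 channels by a factor of 4.
pointwise = pointwise_expand_params(64, 4)       # 64 * 256 weights
depthwise = depthwise_expand_params(64, 4, 3)    # 9 * 256 weights
```

Under these assumptions the depthwise expansion uses fewer weights whenever the kernel area k·k is smaller than the input channel count C, which holds for typical 3x3 kernels once C exceeds 9.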
CHEM-PH Aug 2, 2019
Retrosynthesis with Attention-Based NMT Model and Chemical Analysis of the "Wrong" Predictions
Hongliang Duan, Ling Wang, Chengyun Zhang et al.
We cast retrosynthesis as a machine translation problem by introducing a special Tensor2Tensor model, which is entirely attention-based and fully data-driven. Given a data set comprising about 50,000 diverse reactions extracted from USPTO patents, the model significantly outperforms the seq2seq model (34.7%) in top-1 accuracy, achieving 54.1%. To obtain better results, parameters such as batch size and training time are thoroughly investigated when training the model. Additionally, we offer a novel insight into the causes of grammatically invalid SMILES, and conduct a test in which experienced chemists pick out and analyze the "wrong" predictions that may be chemically plausible but differ from the ground truth. In fact, the effectiveness of our model is underestimated, and the "true" top-1 accuracy can reach 64.6%.
CV Jul 12, 2019
VarGNet: Variable Group Convolutional Neural Network for Efficient Embedded Computing
Qian Zhang, Jianjun Li, Meng Yao et al.
In this paper, we propose a novel network design mechanism for efficient embedded computing. Inspired by the limited computing patterns, we propose to fix the number of channels in each group of a group convolution, instead of the existing practice of fixing the total number of groups. The resulting network, named Variable Group Convolutional Network (VarGNet), can be optimized more easily on the hardware side, due to the more unified computing schemes among the layers. Extensive experiments on various vision tasks, including classification, detection, pixel-wise parsing and face recognition, have demonstrated the practical value of VarGNet.
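The design choice can be illustrated by computing the group count from a fixed number of channels per group, so that the number of groups varies with layer width while the per-group workload stays constant. The helper names and the example widths are assumptions for illustration; the weight formula is the standard one for a group convolution without bias.

```python
def variable_groups(in_channels, channels_per_group):
    """Number of groups when the channels per group are held fixed."""
    if in_channels % channels_per_group != 0:
        raise ValueError("in_channels must be divisible by channels_per_group")
    return in_channels // channels_per_group

def group_conv_params(in_channels, out_channels, groups, kernel_size=3):
    """Weights of a group convolution: each output channel sees only
    in_channels / groups input channels."""
    if in_channels % groups != 0 or out_channels % groups != 0:
        raise ValueError("channel counts must be divisible by groups")
    return kernel_size * kernel_size * (in_channels // groups) * out_channels

# With 8 channels per group, a 64-wide layer gets 8 groups and a 256-wide
# layer gets 32 groups; every group still processes 8 input channels.
narrow_groups = variable_groups(64, 8)
wide_groups = variable_groups(256, 8)
```

Because each group always covers the same number of input channels, the inner computation has the same shape in every layer, which is the kind of uniformity the abstract credits for easier hardware optimization.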