CLJul 11, 2023Code
Towards Understanding In-Context Learning with Contrastive Demonstrations and Saliency MapsFuxiao Liu, Paiheng Xu, Zongxia Li et al.
We investigate the role of various demonstration components in the in-context learning (ICL) performance of large language models (LLMs). Specifically, we explore the impacts of ground-truth labels, input distribution, and complementary explanations, particularly when these are altered or perturbed. We build on previous work, which offers mixed findings on how these elements influence ICL. To probe these questions, we employ explainable NLP (XNLP) methods and utilize saliency maps of contrastive demonstrations for both qualitative and quantitative analysis. Our findings reveal that flipping ground-truth labels significantly affects the saliency, though it's more noticeable in larger LLMs. Our analysis of the input distribution at a granular level reveals that changing sentiment-indicative terms in a sentiment analysis task to neutral ones does not have as substantial an impact as altering ground-truth labels. Finally, we find that the effectiveness of complementary explanations in boosting ICL performance is task-dependent, with limited benefits seen in sentiment analysis tasks compared to symbolic reasoning tasks. These insights are critical for understanding the functionality of LLMs and guiding the development of effective demonstrations, which is increasingly relevant in light of the growing use of LLMs in applications such as ChatGPT. Our research code is publicly available at https://github.com/paihengxu/XICL.
90.2CVJun 1
Explainable Forensics of Manipulated Segments in Untrimmed Long VideosYue Feng, Jingjing Li, Qijia Lu et al.
The rapid advancement of AI-driven video generation has transformed content creation, while simultaneously increasing the risk of misinformation through localized manipulations in long-form videos. Existing video forensic methods predominantly operate on short, independent clips, and thus fail to capture realistic scenarios where AI-generated content is sparsely embedded within otherwise authentic footage. To bridge this gap, we formulate the task of Temporal AI-Generated Segment Localization and Explanation, which targets authenticity detection, temporal localization, and interpretable analysis of manipulated segments in untrimmed long videos. We further introduce TASLE, a large-scale benchmark comprising 12,472 untrimmed videos with diverse manipulation patterns and rich annotation signals, including temporal boundaries, authenticity labels, and segment-level rationales. In addition, we propose MSLoc, a coarse-to-fine forensic baseline that combines a boundary-sensitive proposal generation module for efficient long-video scanning with an MLLM-based refinement module for precise boundary localization and interpretable reasoning. Experiments validate the effectiveness of the proposed baseline, highlighting the importance of segment-level explainable forensics for long-form AI-generated video analysis. Our dataset and code are publicly available at https://debby-0527.github.io/TASLE.
NAMar 4, 2019
Long time error analysis of finite difference time domain methods for the nonlinear Klein-Gordon equation with weak nonlinearityWeizhu Bao, Yue Feng, Wenfan Yi
We establish error bounds of the finite difference time domain (FDTD) methods for the long time dynamics of the nonlinear Klein-Gordon equation (NKGE) with a cubic nonlinearity, while the nonlinearity strength is characterized by $\varepsilon^2$ with $0 <\varepsilon \leq 1$ a dimensionless parameter. When $0 < \varepsilon \ll 1$, it is in the weak nonlinearity regime and the problem is equivalent to the NKGE with small initial data, while the amplitude of the initial data (and the solution) is at $O(\varepsilon)$. Four different FDTD methods are adapted to discretize the problem and rigorous error bounds of the FDTD methods are established for the long time dynamics, i.e. error bounds are valid up to the time at $O(1/\varepsilon^β)$ with $0 \le β\leq 2$, by using the energy method and the techniques of either the cut-off of the nonlinearity or the mathematical induction to bound the numerical approximate solutions. In the error bounds, we pay particular attention to how error bounds depend explicitly on the mesh size $h$ and time step $τ$ as well as the small parameter $\varepsilon\in (0,1]$, especially in the weak nonlinearity regime when $0 < \varepsilon \ll 1$. Our error bounds indicate that, in order to get ``correct'' numerical solutions up to the time at $O(1/\varepsilon^β)$, the $\varepsilon$-scalability (or meshing strategy) of the FDTD methods should be taken as: $h = O(\varepsilon^{β/2})$ and $τ= O(\varepsilon^{β/2})$. Extensive numerical results are reported to confirm our error bounds and to demonstrate that they are sharp.
CLApr 14, 2022
Dynamic Schema Graph Fusion Network for Multi-Domain Dialogue State TrackingYue Feng, Aldo Lipani, Fanghua Ye et al.
Dialogue State Tracking (DST) aims to keep track of users' intentions during the course of a conversation. In DST, modelling the relations among domains and slots is still an under-studied problem. Existing approaches that have considered such relations generally fall short in: (1) fusing prior slot-domain membership relations and dialogue-aware dynamic slot relations explicitly, and (2) generalizing to unseen domains. To address these issues, we propose a novel \textbf{D}ynamic \textbf{S}chema \textbf{G}raph \textbf{F}usion \textbf{Net}work (\textbf{DSGFNet}), which generates a dynamic schema graph to explicitly fuse the prior slot-domain membership relations and dialogue-aware dynamic slot relations. It also uses the schemata to facilitate knowledge transfer to new domains. DSGFNet consists of a dialogue utterance encoder, a schema graph encoder, a dialogue-aware schema graph evolving network, and a schema graph enhanced dialogue state decoder. Empirical results on benchmark datasets (i.e., SGD, MultiWOZ2.1, and MultiWOZ2.2), show that DSGFNet outperforms existing methods.
IRAug 11, 2023
A Large Language Model Enhanced Conversational Recommender SystemYue Feng, Shuchang Liu, Zhenghai Xue et al.
Conversational recommender systems (CRSs) aim to recommend high-quality items to users through a dialogue interface. It usually contains multiple sub-tasks, such as user preference elicitation, recommendation, explanation, and item information search. To develop effective CRSs, there are some challenges: 1) how to properly manage sub-tasks; 2) how to effectively solve different sub-tasks; and 3) how to correctly generate responses that interact with users. Recently, Large Language Models (LLMs) have exhibited an unprecedented ability to reason and generate, presenting a new opportunity to develop more powerful CRSs. In this work, we propose a new LLM-based CRS, referred to as LLMCRS, to address the above challenges. For sub-task management, we leverage the reasoning ability of LLM to effectively manage sub-task. For sub-task solving, we collaborate LLM with expert models of different sub-tasks to achieve the enhanced performance. For response generation, we utilize the generation ability of LLM as a language interface to better interact with users. Specifically, LLMCRS divides the workflow into four stages: sub-task detection, model matching, sub-task execution, and response generation. LLMCRS also designs schema-based instruction, demonstration-based instruction, dynamic sub-task and model matching, and summary-based generation to instruct LLM to generate desired results in the workflow. Finally, to adapt LLM to conversational recommendations, we also propose to fine-tune LLM with reinforcement learning from CRSs performance feedback, referred to as RLPF. Experimental results on benchmark datasets show that LLMCRS with RLPF outperforms the existing methods.
CVDec 4, 2025Code
SAM3-I: Segment Anything with InstructionsJingjing Li, Yue Feng, Yuchen Guo et al.
Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.
CLJun 16, 2023
Unlocking the Potential of User Feedback: Leveraging Large Language Model as User Simulator to Enhance Dialogue SystemZhiyuan Hu, Yue Feng, Anh Tuan Luu et al.
Dialogue systems and large language models (LLMs) have gained considerable attention. However, the direct utilization of LLMs as task-oriented dialogue (TOD) models has been found to underperform compared to smaller task-specific models. Nonetheless, it is crucial to acknowledge the significant potential of LLMs and explore improved approaches for leveraging their impressive abilities. Motivated by the goal of leveraging LLMs, we propose an alternative approach called User-Guided Response Optimization (UGRO) to combine it with a smaller TOD model. This approach uses LLM as annotation-free user simulator to assess dialogue responses, combining them with smaller fine-tuned end-to-end TOD models. By utilizing the satisfaction feedback generated by LLMs, UGRO further optimizes the supervised fine-tuned TOD model. Specifically, the TOD model takes the dialogue history as input and, with the assistance of the user simulator's feedback, generates high-satisfaction responses that meet the user's requirements. Through empirical experiments on two TOD benchmarks, we validate the effectiveness of our method. The results demonstrate that our approach outperforms previous state-of-the-art (SOTA) results.
64.9IRMay 17Code
RAGR: Review-Augmented Generative RecommendationYingyi Zhang, Junyi Li, Yejing Wang et al.
Sequential recommendation (SR) is traditionally formulated as next-item prediction over a chronological sequence of interacted items. Although recent generative recommendation (GR) methods introduce new machinery, such as semantic IDs, autoregressive decoding, and unified token spaces, they largely inherit the same item-only modeling assumption. We argue that this design constitutes a structural bottleneck, because user decision-making is not purely behavioral: while item interactions reveal what users choose, review feedback often explain why they choose it by exposing latent evaluative factors. Motivated by this observation, we propose Review-Augmented Generative Recommendation (RAGR), a novel GR framework that incorporates review feedback directly into the generative user sequence rather than treating reviews as auxiliary side information. Specifically, RAGR introduces a Review-Augmented User Sequence Modeling mechanism that interleaves item semantic IDs and review semantic IDs in chronological order to construct a mixed behavioral-semantic sequence, enabling review signals to participate directly in autoregressive next-token generation. To preserve the recommendation objective, we further introduce an Item-Centric Task Generation Alignment strategy based on direct preference optimization (DPO), which encourages the model to favor item tokens over review tokens at prediction positions. Experiments on three real-world datasets show that RAGR yields consistent and significant gains over strong GR backbones across all metrics. Our code and data are available at \url{https://github.com/Zhang-Yingyi/TKDE_RAGR}.
CLSep 16, 2023
Enhancing Large Language Model Induced Task-Oriented Dialogue Systems Through Look-Forward Motivated GoalsZhiyuan Hu, Yue Feng, Yang Deng et al.
Recently, the development of large language models (LLMs) has been significantly enhanced the question answering and dialogue generation, and makes them become increasingly popular in current practical scenarios. While unlike the general dialogue system which emphasizes the semantic performance, the task-oriented dialogue (ToD) systems aim to achieve the dialogue goal efficiently and successfully in multiple turns. Unfortunately, existing LLM-induced ToD systems lack the direct reward toward the final goal and do not take account of the dialogue proactivity that can strengthen the dialogue efficiency. To fill these gaps, we introduce the ProToD (Proactively Goal-Driven LLM-Induced ToD) approach, which anticipates the future dialogue actions and incorporates the goal-oriented reward signal to enhance ToD systems. Additionally, we present a novel evaluation method that assesses ToD systems based on goal-driven dialogue simulations. This method allows us to gauge user satisfaction, system efficiency and successful rate while overcoming the limitations of current Information and Success metrics. Empirical experiments conducted on the MultiWoZ 2.1 dataset demonstrate that our model can achieve superior performance using only 10% of the data compared to previous end-to-end fully supervised models. This improvement is accompanied by enhanced user satisfaction and efficiency.
CLApr 18, 2022
Learning to Execute Actions or Ask Clarification QuestionsZhengxiang Shi, Yue Feng, Aldo Lipani
Collaborative tasks are ubiquitous activities where a form of communication is required in order to reach a joint goal. Collaborative building is one of such tasks. We wish to develop an intelligent builder agent in a simulated building environment (Minecraft) that can build whatever users wish to build by just talking to the agent. In order to achieve this goal, such agents need to be able to take the initiative by asking clarification questions when further information is needed. Existing works on Minecraft Corpus Dataset only learn to execute instructions neglecting the importance of asking for clarifications. In this paper, we extend the Minecraft Corpus Dataset by annotating all builder utterances into eight types, including clarification questions, and propose a new builder agent model capable of determining when to ask or execute instructions. Experimental results show that our model achieves state-of-the-art performance on the collaborative building task with a substantial improvement. We also define two new tasks, the learning to ask task and the joint learning task. The latter consists of solving both collaborating building and learning to ask tasks jointly.
AIJul 3, 2023
ChatGPT vs. Google: A Comparative Study of Search Performance and User ExperienceRuiyun Xu, Yue Feng, Hailiang Chen
The advent of ChatGPT, a large language model-powered chatbot, has prompted questions about its potential implications for traditional search engines. In this study, we investigate the differences in user behavior when employing search engines and chatbot tools for information-seeking tasks. We carry out a randomized online experiment, dividing participants into two groups: one using a ChatGPT-like tool and the other using a Google Search-like tool. Our findings reveal that the ChatGPT group consistently spends less time on all tasks, with no significant difference in overall task performance between the groups. Notably, ChatGPT levels user search performance across different education levels and excels in answering straightforward questions and providing general solutions but falls short in fact-checking tasks. Users perceive ChatGPT's responses as having higher information quality compared to Google Search, despite displaying a similar level of trust in both tools. Furthermore, participants using ChatGPT report significantly better user experiences in terms of usefulness, enjoyment, and satisfaction, while perceived ease of use remains comparable between the two tools. However, ChatGPT may also lead to overreliance and generate or replicate misinformation, yielding inconsistent results. Our study offers valuable insights for search engine management and highlights opportunities for integrating chatbot technologies into search engine designs.
33.7CLApr 12
From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal ConsultationMingfei Lu, Yi Zhang, Mengjia Wu et al.
Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high-quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context-aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.
GRSep 8, 2023
Dynamic Mesh-Aware Radiance FieldsYi-Ling Qiao, Alexander Gao, Yiran Xu et al.
Embedding polygonal mesh assets within photorealistic Neural Radience Fields (NeRF) volumes, such that they can be rendered and their dynamics simulated in a physically consistent manner with the NeRF, is under-explored from the system perspective of integrating NeRF into the traditional graphics pipeline. This paper designs a two-way coupling between mesh and NeRF during rendering and simulation. We first review the light transport equations for both mesh and NeRF, then distill them into an efficient algorithm for updating radiance and throughput along a cast ray with an arbitrary number of bounces. To resolve the discrepancy between the linear color space that the path tracer assumes and the sRGB color space that standard NeRF uses, we train NeRF with High Dynamic Range (HDR) images. We also present a strategy to estimate light sources and cast shadows on the NeRF. Finally, we consider how the hybrid surface-volumetric formulation can be efficiently integrated with a high-performance physics simulator that supports cloth, rigid and soft bodies. The full rendering and simulation system can be run on a GPU at interactive rates. We show that a hybrid system approach outperforms alternatives in visual realism for mesh insertion, because it allows realistic light transport from volumetric NeRF media onto surfaces, which affects the appearance of reflective/refractive surfaces and illumination of diffuse surfaces informed by the dynamic scene.
CLJun 14, 2023
PoetryDiffusion: Towards Joint Semantic and Metrical Manipulation in Poetry GenerationZhiyuan Hu, Chumin Liu, Yue Feng et al.
Controllable text generation is a challenging and meaningful field in natural language generation (NLG). Especially, poetry generation is a typical one with well-defined and strict conditions for text generation which is an ideal playground for the assessment of current methodologies. While prior works succeeded in controlling either semantic or metrical aspects of poetry generation, simultaneously addressing both remains a challenge. In this paper, we pioneer the use of the Diffusion model for generating sonnets and Chinese SongCi poetry to tackle such challenges. In terms of semantics, our PoetryDiffusion model, built upon the Diffusion model, generates entire sentences or poetry by comprehensively considering the entirety of sentence information. This approach enhances semantic expression, distinguishing it from autoregressive and large language models (LLMs). For metrical control, the separation feature of diffusion generation and its constraint control module enable us to flexibly incorporate a novel metrical controller to manipulate and evaluate metrics (format and rhythm). The denoising process in PoetryDiffusion allows for gradual enhancement of semantics and flexible integration of the metrical controller which can calculate and impose penalties on states that stray significantly from the target control distribution. Experimental results on two datasets demonstrate that our model outperforms existing models in automatic evaluation of semantic, metrical, and overall performance as well as human evaluation.
CLMar 5, 2024Code
Learning to Use Tools via Cooperative and Interactive AgentsZhengliang Shi, Shen Gao, Xiuyi Chen et al. · baidu
Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility. Existing methods employ one single LLM-based agent to iteratively select and execute tools, thereafter incorporating execution results into the next action prediction. Despite their progress, these methods suffer from performance degradation when addressing practical tasks due to: (1) the pre-defined pipeline with restricted flexibility to calibrate incorrect actions, and (2) the struggle to adapt a general LLM-based agent to perform a variety of specialized actions. To mitigate these problems, we propose ConAgents, a Cooperative and interactive Agents framework, which coordinates three specialized agents for tool selection, tool execution, and action calibration separately. ConAgents introduces two communication protocols to enable the flexible cooperation of agents. To effectively generalize the ConAgents into open-source models, we also propose specialized action distillation, enhancing their ability to perform specialized actions in our framework. Our extensive experiments on three datasets show that the LLMs, when equipped with the ConAgents, outperform baselines with substantial improvement (i.e., up to 14% higher success rate).
CLDec 10, 2022
Topic-Aware Response Generation in Task-Oriented Dialogue with Unstructured Knowledge AccessYue Feng, Gerasimos Lampouras, Ignacio Iacobacci
To alleviate the problem of structured databases' limited coverage, recent task-oriented dialogue systems incorporate external unstructured knowledge to guide the generation of system responses. However, these usually use word or sentence level similarities to detect the relevant knowledge context, which only partially capture the topical level relevance. In this paper, we examine how to better integrate topical information in knowledge grounded task-oriented dialogue and propose ``Topic-Aware Response Generation'' (TARG), an end-to-end response generation model. TARG incorporates multiple topic-aware attention mechanisms to derive the importance weighting scheme over dialogue utterances and external knowledge sources towards a better understanding of the dialogue history. Experimental results indicate that TARG achieves state-of-the-art performance in knowledge selection and response generation, outperforming previous state-of-the-art by 3.2, 3.6, and 4.2 points in EM, F1 and BLEU-4 respectively on Doc2Dial, and performing comparably with previous work on DSTC9; both being knowledge-grounded task-oriented dialogue datasets.
CLMar 18, 2023
A Graph-Guided Reasoning Approach for Open-ended Commonsense Question AnsweringZhen Han, Yue Feng, Mingming Sun
Recently, end-to-end trained models for multiple-choice commonsense question answering (QA) have delivered promising results. However, such question-answering systems cannot be directly applied in real-world scenarios where answer candidates are not provided. Hence, a new benchmark challenge set for open-ended commonsense reasoning (OpenCSR) has been recently released, which contains natural science questions without any predefined choices. On the OpenCSR challenge set, many questions require implicit multi-hop reasoning and have a large decision space, reflecting the difficult nature of this task. Existing work on OpenCSR sorely focuses on improving the retrieval process, which extracts relevant factual sentences from a textual knowledge base, leaving the important and non-trivial reasoning task outside the scope. In this work, we extend the scope to include a reasoner that constructs a question-dependent open knowledge graph based on retrieved supporting facts and employs a sequential subgraph reasoning process to predict the answer. The subgraph can be seen as a concise and compact graphical explanation of the prediction. Experiments on two OpenCSR datasets show that the proposed model achieves great performance on benchmark OpenCSR datasets.
51.5ROMay 19
Bilateral Teleoperation with Compliant 6-DOF Pose-and-Force SensingYue Feng, Weicheng Huang, I-Ming Chen
Existing bilateral teleoperation platforms still rely on costly rigid six-axis force/torque sensors, tightly coupled leader-follower hardware, and kilohertz control loops. We present a Cartesian bilateral framework built on the hardware-agnostic WinGs Operating Studio (WOS) middleware, in which a low-cost compliant 6-DOF pose-and-force sensing end-effector, Delta6, is mounted on both sides so that each manipulator behaves as an end-effector 6-DOF series elastic actuator (SEA). The leader runs a damping-only admittance loop with a 6-D biquad notch filter; the follower realizes a stiffness-damping impedance through a position-based outer loop with a PID wrench-to-pose mapping. Three time scales (hardware I/O, mid-rate impedance/admittance, low-rate teleoperation messages) are explicitly decoupled, enabling the same application to drive heterogeneous arms. On a Lite6/FR3 testbed at 150 Hz, the system tracks stably under delays up to $120\pm40$ ms and 1% packet loss, matches the prescribed virtual stiffness in contact, and shows a favorable cumulative energy signature in passivity-style tests.
46.0ROMay 19
Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich ManipulationYue Feng, Weicheng Huang, I-Ming Chen
Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch attention by an entropy-regularized Optimal Transport (OT) alignment between force-pose-derived sub-queries and visual patches. Explicit marginal constraints act as a structured inductive bias for contact-rich tasks, encouraging conditioning-aware spatial selection that is stable across illumination, distractors, and partial occlusion. SO-TA is paired with a diffusion-based sequence policy mapping observation windows to pose-action chunks. We evaluate SO-TA on three real-robot tasks: tight peg-in-hole assembly, BCM wiring-connector insertion, and curved-surface mark erasing. With ~200 rollouts per condition, SO-TA reaches 100% success on tight peg-in-hole versus 93% for cross-attention at matched capacity, and retains 82.5% success under illumination, distractor, and partial-occlusion perturbations where a concatenation baseline drops to 43.5%. OT-derived patch heatmaps and leave-one-out modality-influence ratios provide interpretable, phase-dependent diagnostics.
LGMay 4, 2025Code
An Empirical Study of Qwen3 QuantizationXingyu Zheng, Yuye Li, Haoran Chu et al.
The Qwen series has emerged as a leading family of open-source Large Language Models (LLMs), demonstrating remarkable capabilities in natural language understanding tasks. With the recent release of Qwen3, which exhibits superior performance across diverse benchmarks, there is growing interest in deploying these models efficiently in resource-constrained environments. Low-bit quantization presents a promising solution, yet its impact on Qwen3's performance remains underexplored. This study conducts a systematic evaluation of Qwen3's robustness under various quantization settings, aiming to uncover both opportunities and challenges in compressing this state-of-the-art model. We rigorously assess 5 existing classic post-training quantization techniques applied to Qwen3, spanning bit-widths from 1 to 8 bits, and evaluate their effectiveness across multiple datasets. Our findings reveal that while Qwen3 maintains competitive performance at moderate bit-widths, it experiences notable degradation in linguistic tasks under ultra-low precision, underscoring the persistent hurdles in LLM compression. These results emphasize the need for further research to mitigate performance loss in extreme quantization scenarios. We anticipate that this empirical analysis will provide actionable insights for advancing quantization methods tailored to Qwen3 and future LLMs, ultimately enhancing their practicality without compromising accuracy. Our project is released on https://github.com/Efficient-ML/Qwen3-Quantization and https://huggingface.co/collections/Efficient-ML/qwen3-quantization-68164450decb1c868788cb2b.
24.4ROApr 7Code
Delta6: A Low-Cost, 6-DOF Force-Sensing Flexible End-EffectorYue Feng, Weicheng Huang, Chen Qiu et al.
This paper presents Delta6, a low-cost, six-degree-of-freedom (6-DOF) force/torque end-effector that combines antagonistic springs with magnetic encoders to deliver accurate wrench sensing while remaining as simple to assemble as flat-pack furniture. A fully 3D-printed prototype, assembled entirely from off-the-shelf parts, withstands peak forces above +/-14.4 N and torques of +/-0.33 N.m per axis; these limits can be further extended by leveraging the proposed parametric analytical model. Without calibration, Delta6 attains a 99th-percentile error of 7% full scale (FS). With lightweight sequence models, the error is reduced to 3.8% FS by the best-performing network. Benchmarks on multiple computing platforms confirm that the device's bandwidth is adjustable, enabling balanced trade-offs among update rate, accuracy, and cost, while durability, thermal drift, and zero-calibration tests confirm its robustness. With Delta6 mounted on a robot arm governed by a force-impedance controller, the system successfully performs two contact-rich tasks: buffing curved surfaces and tight assemblies. Experiments validate the design, showing that Delta6 is a robust, low-cost alternative to existing 6-DOF force sensing solutions. Open-source site: https://wings-robotics.github.io/delta6 .
92.5NAMay 13
A Uniformly Accurate Multiscale Time Integrator for the Klein-Gordon-Schrödinger Equations in the Nonrelativistic Regime via Simplified Transmission ConditionsYue Feng, Caoyi Liu
We propose a novel and simplified multiscale time integrator Fourier pseudospectral (MTI-FP) method for the Klein-Gordon-Schrödinger (KGS) equations with a dimensionless parameter epsilon in (0,1], where epsilon is inversely proportional to the speed of light. The proposed MTI-FP method is rigorously proved to achieve uniform first-order accuracy in time in the nonrelativistic regime, i.e., as epsilon->0. In this regime, the solution of the KGS equations exhibits temporal oscillations with an O(epsilon^2)-wavelength, imposing stringent resolution requirements on classical numerical methods. The uniformly accurate MTI-FP method is built upon two key points: (i) a multiscale decomposition by frequency in each time interval with simplified transmission conditions, and (ii) an exponential integrator for temporal discretization combined with the Fourier pseudospectral method for spatial discretization. Using the energy method and mathematical induction, we rigorously establish two independent error bounds in H^1-norm at O(h^{m0-1} + tau^2/epsilon^2) and O(h^{m0-1} + epsilon^2) with mesh size h, time step tau and m0 an integer dependent on the regularity of the solution. These estimates imply that the MTI-FP method converges uniformly and optimally in space, and uniformly in time at O(tau) with respect to epsilon in (0,1]. Furthermore, by incorporating a linear interpolation of the micro-variables with the multiscale decomposition in each time interval, we obtain a uniformly accurate numerical solution for any t>0. Consequently, the proposed MTI-FP method has a super-resolution property in time from the perspective of Shannon sampling theory. Ample numerical experiments are provided to validate the error estimates and to demonstrate the super-resolution property. Finally, the method is applied to numerically investigate the convergence rates of the KGS equations to different limiting models.
50.9CVMay 11
Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report GenerationKedi Sun, Chaohui Dang, Yue Feng et al.
Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.
CVOct 24, 2025Code
MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly DetectionShengtian Yang, Yue Feng, Yingshi Liu et al.
Video Anomaly Detection (VAD) aims to locate unusual activities or behaviors within videos. Recently, offline VAD has garnered substantial research attention, which has been invigorated by the progress in large language models (LLMs) and vision-language models (VLMs), offering the potential for a more nuanced understanding of anomalies. However, online VAD has seldom received attention due to real-time constraints and computational intensity. In this paper, we introduce a novel Memory-based online scoring queue scheme for Training-free VAD (MoniTor), to address the inherent complexities in online VAD. Specifically, MoniTor applies a streaming input to VLMs, leveraging the capabilities of pre-trained large-scale models. To capture temporal dependencies more effectively, we incorporate a novel prediction mechanism inspired by Long Short-Term Memory (LSTM) networks. This ensures the model can effectively model past states and leverage previous predictions to identify anomalous behaviors. Thereby, it better understands the current frame. Moreover, we design a scoring queue and an anomaly prior to dynamically store recent scores and cover all anomalies in the monitoring scenario, providing guidance for LLMs to distinguish between normal and abnormal behaviors over time. We evaluate MoniTor on two large datasets (i.e., UCF-Crime and XD-Violence) containing various surveillance and real-world scenarios. The results demonstrate that MoniTor outperforms state-of-the-art methods and is competitive with weakly supervised methods without training. Code is available at https://github.com/YsTvT/MoniTor.
CVOct 24, 2025Code
MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual CorrespondenceYue Feng, Jinwei Hu, Qijia Lu et al.
We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as MLLMs in multi-video understanding and reranking. Our code and benchmark is available at https://github.com/debby-0527/MUVR.
CLOct 18, 2025Code
Unleashing Diverse Thinking Modes in LLMs through Multi-Agent CollaborationZhixuan He, Yue Feng
Large Language Models (LLMs) demonstrate strong performance but often lack interpretable reasoning. This paper introduces the Multi-Agent Collaboration Framework for Diverse Thinking Modes (DiMo), which enhances both performance and interpretability by simulating a structured debate among four specialized LLM agents. Each agent embodies a distinct reasoning paradigm, allowing the framework to collaboratively explore diverse cognitive approaches. Through iterative debate, agents challenge and refine initial responses, yielding more robust conclusions and an explicit, auditable reasoning chain. Across six benchmarks and under a unified open-source setup, DiMo improves accuracy over widely used single-model and debate baselines, with the largest gains on math. We position DiMo as a semantics-aware, Web-native multi-agent framework: it models human-machine intelligence with LLM agents that produce semantically typed, URL-annotated evidence chains for explanations and user-friendly interactions. Although our experiments use standard reasoning benchmarks, the framework is designed to be instantiated over Web corpora and knowledge graphs, combining retrieval-augmented reasoning with structured justifications that downstream systems can inspect and reuse.
CLMay 23, 2023Code
Towards Asking Clarification Questions for Information Seeking on Task-Oriented DialoguesYue Feng, Hossein A. Rahmani, Aldo Lipani et al.
Task-oriented dialogue systems aim at providing users with task-specific services. Users of such systems often do not know all the information about the task they are trying to accomplish, requiring them to seek information about the task. To provide accurate and personalized task-oriented information seeking results, task-oriented dialogue systems need to address two potential issues: 1) users' inability to describe their complex information needs in their requests; and 2) ambiguous/missing information the system has about the users. In this paper, we propose a new Multi-Attention Seq2Seq Network, named MAS2S, which can ask questions to clarify the user's information needs and the user's profile in task-oriented information seeking. We also extend an existing dataset for task-oriented information seeking, leading to the \ourdataset which contains about 100k task-oriented information seeking dialogues that are made publicly available\footnote{Dataset and code is available at \href{https://github.com/sweetalyssum/clarit}{https://github.com/sweetalyssum/clarit}.}. Experimental results on \ourdataset show that MAS2S outperforms baselines on both clarification question generation and answer prediction.
18.0ROApr 9
One Interface, Many Robots: Unified Real-Time Low-Level Motion Planning for Collaborative ArmsYue Feng, Weicheng Huang, I-Ming Chen
This paper proposes a common interface for real-time low-level motion planning of collaborative robotic arms, aimed at enabling broader applicability and improved portability across heterogeneous hardware platforms. In previous work, we introduced WinGs Operating Studio (WOS), a middleware solution that abstracts diverse robotic components into uniform software resources and provides a broad suite of language-agnostic APIs. This paper specifically focuses on its minimal yet flexible interface for real-time end-effector trajectory control. By employing an n-degree polynomial interpolator in conjunction with a quadratic programming solver, the proposed method generates smooth, continuously differentiable trajectories with precise position, velocity, and acceleration profiles. We validate our approach in three distinct scenarios. First, in an offline demonstration, a collaborative arm accurately draws various geometric shapes on paper. Second, in an interruptible, low-frequency re-planning setting, a robotic manipulator grasps a dynamic object placed on a moving mobile robot. Finally, we conducted a teleoperation experiment in which one robotic arm controlled another to perform a series of dexterous manipulations, confirming the proposed method's reliability, versatility, and ease of use.
CLOct 11, 2024
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation FrameworkFan Liu, Yue Feng, Zhao Xu et al.
Despite advancements in enhancing LLM safety against jailbreak attacks, evaluating LLM defenses remains a challenge, with current methods often lacking explainability and generalization to complex scenarios, leading to incomplete assessments (e.g., direct judgment without reasoning, low F1 score of GPT-4 in complex cases, bias in multilingual scenarios). To address this, we present JAILJUDGE, a comprehensive benchmark featuring diverse risk scenarios, including synthetic, adversarial, in-the-wild, and multilingual prompts, along with high-quality human-annotated datasets. The JAILJUDGE dataset includes over 35k+ instruction-tune data with reasoning explainability and JAILJUDGETEST, a 4.5k+ labeled set for risk scenarios, and a 6k+ multilingual set across ten languages. To enhance evaluation with explicit reasoning, we propose the JailJudge MultiAgent framework, which enables explainable, fine-grained scoring (1 to 10). This framework supports the construction of instruction-tuning ground truth and facilitates the development of JAILJUDGE Guard, an end-to-end judge model that provides reasoning and eliminates API costs. Additionally, we introduce JailBoost, an attacker-agnostic attack enhancer, and GuardShield, a moderation defense, both leveraging JAILJUDGE Guard. Our experiments demonstrate the state-of-the-art performance of JailJudge methods (JailJudge MultiAgent, JAILJUDGE Guard) across diverse models (e.g., GPT-4, Llama-Guard) and zero-shot scenarios. JailBoost and GuardShield significantly improve jailbreak attack and defense tasks under zero-shot settings, with JailBoost enhancing performance by 29.24% and GuardShield reducing defense ASR from 40.46% to 0.15%.
54.6CVMar 21
Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought ReasoningGuangfu Guo, Xiaoqian Lu, Yue Feng et al.
Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception-where attention is selectively and sequentially shifted from the most informative regions to secondary cues-we propose Structural Sequential Visual CoT SSV-CoT. First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Second, reasoning is performed following this discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues. This method is trained end-to-end, using text cot and answer supervision, without relying on region-level annotations or specialized external tools. Experiments on diverse visual reasoning benchmarks show gains, validating structured and sequential visual cognition.
CLFeb 16
Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language ModelsLance Calvin Lim Gamboa, Yue Feng, Mark Lee
With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by generative models. We expand the linguistic scope of BBQ and construct FilBBQ through a four-phase development process consisting of template categorization, culturally aware translation, new template construction, and prompt generation. These processes resulted in a bias test composed of more than 10,000 prompts which assess whether models demonstrate sexist and homophobic prejudices relevant to the Philippine context. We then apply FilBBQ on models trained in Filipino but do so with a robust evaluation protocol that improves upon the reliability and accuracy of previous BBQ implementations. Specifically, we account for models' response instability by obtaining prompt responses across multiple seeds and averaging the bias scores calculated from these distinctly seeded runs. Our results confirm both the variability of bias scores across different seeds and the presence of sexist and homophobic biases relating to emotion, domesticity, stereotyped queer interests, and polygamy. FilBBQ is available via GitHub.
CLOct 17, 2024
AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative LearningHao Sun, Jiayi Wu, Hengyi Cai et al.
Recent advancements in large language models (LLMs) have been remarkable. Users face a choice between using cloud-based LLMs for generation quality and deploying local-based LLMs for lower computational cost. The former option is typically costly and inefficient, while the latter usually fails to deliver satisfactory performance for reasoning steps requiring deliberate thought processes. In this work, we propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller local-deployed LLMs. Our framework comprises two primary modules: the local agent instantiated with a relatively smaller LLM, handling less complex reasoning steps, and the cloud agent equipped with a larger LLM, managing more intricate reasoning steps. This collaborative processing is enabled through an adaptive mechanism where the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent, thereby effectively integrating the strengths of both locally-deployed and cloud-based LLMs, resulting in significant enhancements in task completion performance and efficiency. We evaluate AdaSwitch across 7 benchmarks, ranging from mathematical reasoning and complex question answering, using various types of LLMs to instantiate the local and cloud agents. The empirical results show that AdaSwitch effectively improves the performance of the local agent, and sometimes achieves competitive results compared to the cloud agent while utilizing much less computational overhead.
CLApr 5, 2025
Self-Adaptive Cognitive Debiasing for Large Language Models in Decision-MakingYougang Lyu, Shijie Ren, Yue Feng et al.
Large language models (LLMs) have shown potential in supporting decision-making applications, particularly as personal assistants in the financial, healthcare, and legal domains. While prompt engineering strategies have enhanced the capabilities of LLMs in decision-making, cognitive biases inherent to LLMs present significant challenges. Cognitive biases are systematic patterns of deviation from norms or rationality in decision-making that can lead to the production of inaccurate outputs. Existing cognitive bias mitigation strategies assume that input prompts only contain one type of cognitive bias, limiting their effectiveness in more challenging scenarios involving multiple cognitive biases. To fill this gap, we propose a cognitive debiasing approach, self-adaptive cognitive debiasing (SACD), that enhances the reliability of LLMs by iteratively refining prompts. Our method follows three sequential steps - bias determination, bias analysis, and cognitive debiasing - to iteratively mitigate potential cognitive biases in prompts. We evaluate SACD on finance, healthcare, and legal decision-making tasks using both open-weight and closed-weight LLMs. Compared to advanced prompt engineering methods and existing cognitive debiasing techniques, SACD achieves the lowest average bias scores in both single-bias and multi-bias settings.
CVDec 12, 2024
Illusion3D: 3D Multiview Illusion with 2D Diffusion PriorsYue Feng, Vaibhav Sanjay, Spencer Lutz et al.
Automatically generating multiview illusions is a compelling challenge, where a single piece of visual content offers distinct interpretations from different viewing perspectives. Traditional methods, such as shadow art and wire art, create interesting 3D illusions but are limited to simple visual outputs (i.e., figure-ground or line drawing), restricting their artistic expressiveness and practical versatility. Recent diffusion-based illusion generation methods can generate more intricate designs but are confined to 2D images. In this work, we present a simple yet effective approach for creating 3D multiview illusions based on user-provided text prompts or images. Our method leverages a pre-trained text-to-image diffusion model to optimize the textures and geometry of neural 3D representations through differentiable rendering. When viewed from multiple angles, this produces different interpretations. We develop several techniques to improve the quality of the generated 3D multiview illusions. We demonstrate the effectiveness of our approach through extensive experiments and showcase illusion generation with diverse 3D forms.
CLJun 8, 2025
Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative LanguagesLance Calvin Lim Gamboa, Yue Feng, Mark Lee
Emerging research on bias attribution and interpretability have revealed how tokens contribute to biased behavior in language models processing English texts. We build on this line of inquiry by adapting the information-theoretic bias attribution score metric for implementation on models handling agglutinative languages, particularly Filipino. We then demonstrate the effectiveness of our adapted method by using it on a purely Filipino model and on three multilingual models: one trained on languages worldwide and two on Southeast Asian data. Our results show that Filipino models are driven towards bias by words pertaining to people, objects, and relationships, entity-based themes that stand in contrast to the action-heavy nature of bias-contributing themes in English (i.e., criminal, sexual, and prosocial behaviors). These findings point to differences in how English and non-English models process inputs linked to sociodemographic groups and bias.
CLApr 11, 2025
UoB-NLP at SemEval-2025 Task 11: Leveraging Adapters for Multilingual and Cross-Lingual Emotion DetectionFrances Laureano De Leon, Yixiao Wang, Yue Feng et al.
Emotion detection in natural language processing is a challenging task due to the complexity of human emotions and linguistic diversity. While significant progress has been made in high-resource languages, emotion detection in low-resource languages remains underexplored. In this work, we address multilingual and cross-lingual emotion detection by leveraging adapter-based fine-tuning with multilingual pre-trained language models. Adapters introduce a small number of trainable parameters while keeping the pre-trained model weights fixed, offering a parameter-efficient approach to adaptation. We experiment with different adapter tuning strategies, including task-only adapters, target-language-ready task adapters, and language-family-based adapters. Our results show that target-language-ready task adapters achieve the best overall performance, particularly for low-resource African languages with our team ranking 7th for Tigrinya, and 8th for Kinyarwanda in Track A. In Track C, our system ranked 3rd for Amharic, and 4th for Oromo, Tigrinya, Kinyarwanda, Hausa, and Igbo. Our approach outperforms large language models in 11 languages and matches their performance in four others, despite our models having significantly fewer parameters. Furthermore, we find that adapter-based models retain cross-linguistic transfer capabilities while requiring fewer computational resources compared to full fine-tuning for each language.
CVJan 2, 2025
DynamicLip: Shape-Independent Continuous Authentication via Lip Articulator DynamicsHuashan Chen, Yifan Xu, Yue Feng et al.
Biometrics authentication has become increasingly popular due to its security and convenience; however, traditional biometrics are becoming less desirable in scenarios such as new mobile devices, Virtual Reality, and Smart Vehicles. For example, while face authentication is widely used, it suffers from significant privacy concerns. The collection of complete facial data makes it less desirable for privacy-sensitive applications. Lip authentication, on the other hand, has emerged as a promising biometrics method. However, existing lip-based authentication methods heavily depend on static lip shape when the mouth is closed, which can be less robust due to lip shape dynamic motion and can barely work when the user is speaking. In this paper, we revisit the nature of lip biometrics and extract shape-independent features from the lips. We study the dynamic characteristics of lip biometrics based on articulator motion. Building on the knowledge, we propose a system for shape-independent continuous authentication via lip articulator dynamics. This system enables robust, shape-independent and continuous authentication, making it particularly suitable for scenarios with high security and privacy requirements. We conducted comprehensive experiments in different environments and attack scenarios and collected a dataset of 50 subjects. The results indicate that our system achieves an overall accuracy of 99.06% and demonstrates robustness under advanced mimic attacks and AI deepfake attacks, making it a viable solution for continuous biometric authentication in various applications.
AIOct 21, 2025
Med-VRAgent: A Framework for Medical Visual Reasoning-Enhanced AgentsGuangfu Guo, Xiaoqian Lu, Yue Feng
Visual Language Models (VLMs) achieve promising results in medical reasoning but struggle with hallucinations, vague descriptions, inconsistent logic and poor localization. To address this, we propose a agent framework named Medical Visual Reasoning Agent (\textbf{Med-VRAgent}). The approach is based on Visual Guidance and Self-Reward paradigms and Monte Carlo Tree Search (MCTS). By combining the Visual Guidance with tree search, Med-VRAgent improves the medical visual reasoning capabilities of VLMs. We use the trajectories collected by Med-VRAgent as feedback to further improve the performance by fine-tuning the VLMs with the proximal policy optimization (PPO) objective. Experiments on multiple medical VQA benchmarks demonstrate that our method outperforms existing approaches.
CLAug 27, 2025
Social Bias in Multilingual Language Models: A SurveyLance Calvin Lim Gamboa, Yue Feng, Mark Lee
Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field's dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature's inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.
CVJul 8, 2025
Exploring Partial Multi-Label Learning via Integrating Semantic Co-occurrence KnowledgeXin Wu, Fei Teng, Yue Feng et al.
Partial multi-label learning aims to extract knowledge from incompletely annotated data, which includes known correct labels, known incorrect labels, and unknown labels. The core challenge lies in accurately identifying the ambiguous relationships between labels and instances. In this paper, we emphasize that matching co-occurrence patterns between labels and instances is key to addressing this challenge. To this end, we propose Semantic Co-occurrence Insight Network (SCINet), a novel and effective framework for partial multi-label learning. Specifically, SCINet introduces a bi-dominant prompter module, which leverages an off-the-shelf multimodal model to capture text-image correlations and enhance semantic alignment. To reinforce instance-label interdependencies, we develop a cross-modality fusion module that jointly models inter-label correlations, inter-instance relationships, and co-occurrence patterns across instance-label assignments. Moreover, we propose an intrinsic semantic augmentation strategy that enhances the model's understanding of intrinsic data semantics by applying diverse image transformations, thereby fostering a synergistic relationship between label confidence and sample difficulty. Extensive experiments on four widely-used benchmark datasets demonstrate that SCINet surpasses state-of-the-art methods.
AIApr 20, 2025
PolicyEvol-Agent: Evolving Policy via Environment Perception and Self-Awareness with Theory of MindYajie Yu, Yue Feng
Multi-agents has exhibited significant intelligence in real-word simulations with Large language models (LLMs) due to the capabilities of social cognition and knowledge retrieval. However, existing research on agents equipped with effective cognition chains including reasoning, planning, decision-making and reflecting remains limited, especially in the dynamically interactive scenarios. In addition, unlike human, prompt-based responses face challenges in psychological state perception and empirical calibration during uncertain gaming process, which can inevitably lead to cognition bias. In light of above, we introduce PolicyEvol-Agent, a comprehensive LLM-empowered framework characterized by systematically acquiring intentions of others and adaptively optimizing irrational strategies for continual enhancement. Specifically, PolicyEvol-Agent first obtains reflective expertise patterns and then integrates a range of cognitive operations with Theory of Mind alongside internal and external perspectives. Simulation results, outperforming RL-based models and agent-based methods, demonstrate the superiority of PolicyEvol-Agent for final gaming victory. Moreover, the policy evolution mechanism reveals the effectiveness of dynamic guideline adjustments in both automatic and human evaluation.
CVJan 23, 2025
Device-aware Optical Adversarial Attack for a Portable Projector-camera SystemNing Jiang, Yanhong Liu, Dingheng Zeng et al.
Deep-learning-based face recognition (FR) systems are susceptible to adversarial examples in both digital and physical domains. Physical attacks present a greater threat to deployed systems as adversaries can easily access the input channel, allowing them to provide malicious inputs to impersonate a victim. This paper addresses the limitations of existing projector-camera-based adversarial light attacks in practical FR setups. By incorporating device-aware adaptations into the digital attack algorithm, such as resolution-aware and color-aware adjustments, we mitigate the degradation from digital to physical domains. Experimental validation showcases the efficacy of our proposed algorithm against real and spoof adversaries, achieving high physical similarity scores in FR models and state-of-the-art commercial systems. On average, there is only a 14% reduction in scores from digital to physical attacks, with high attack success rate in both white- and black-box scenarios.
CLMay 26, 2023
Schema-Guided User Satisfaction Modeling for Task-Oriented DialoguesYue Feng, Yunlong Jiao, Animesh Prasad et al.
User Satisfaction Modeling (USM) is one of the popular choices for task-oriented dialogue systems evaluation, where user satisfaction typically depends on whether the user's task goals were fulfilled by the system. Task-oriented dialogue systems use task schema, which is a set of task attributes, to encode the user's task goals. Existing studies on USM neglect explicitly modeling the user's task goals fulfillment using the task schema. In this paper, we propose SG-USM, a novel schema-guided user satisfaction modeling framework. It explicitly models the degree to which the user's preferences regarding the task attributes are fulfilled by the system for predicting the user's satisfaction level. SG-USM employs a pre-trained language model for encoding dialogue context and task attributes. Further, it employs a fulfillment representation layer for learning how many task attributes have been fulfilled in the dialogue, an importance predictor component for calculating the importance of task attributes. Finally, it predicts the user satisfaction based on task attribute fulfillment and task attribute importance. Experimental results on benchmark datasets (i.e. MWOZ, SGD, ReDial, and JDDC) show that SG-USM consistently outperforms competitive existing methods. Our extensive analysis demonstrates that SG-USM can improve the interpretability of user satisfaction modeling, has good scalability as it can effectively deal with unseen tasks and can also effectively work in low-resource settings by leveraging unlabeled data.
CLFeb 26, 2022
ASSIST: Towards Label Noise-Robust Dialogue State TrackingFanghua Ye, Yue Feng, Emine Yilmaz
The MultiWOZ 2.0 dataset has greatly boosted the research on dialogue state tracking (DST). However, substantial noise has been discovered in its state annotations. Such noise brings about huge challenges for training DST models robustly. Although several refined versions, including MultiWOZ 2.1-2.4, have been published recently, there are still lots of noisy labels, especially in the training set. Besides, it is costly to rectify all the problematic annotations. In this paper, instead of improving the annotation quality further, we propose a general framework, named ASSIST (lAbel noiSe-robuSt dIalogue State Tracking), to train DST models robustly from noisy labels. ASSIST first generates pseudo labels for each sample in the training set by using an auxiliary model trained on a small clean dataset, then puts the generated pseudo labels and vanilla noisy labels together to train the primary model. We show the validity of ASSIST theoretically. Experimental results also demonstrate that ASSIST improves the joint goal accuracy of DST by up to $28.16\%$ on MultiWOZ 2.0 and $8.41\%$ on MultiWOZ 2.4, compared to using only the vanilla noisy labels.
CLNov 18, 2020
A Sequence-to-Sequence Approach to Dialogue State TrackingYue Feng, Yang Wang, Hang Li
This paper is concerned with dialogue state tracking (DST) in a task-oriented dialogue system. Building a DST module that is highly effective is still a challenging issue, although significant progresses have been made recently. This paper proposes a new approach to dialogue state tracking, referred to as Seq2Seq-DU, which formalizes DST as a sequence-to-sequence problem. Seq2Seq-DU employs two BERT-based encoders to respectively encode the utterances in the dialogue and the descriptions of schemas, an attender to calculate attentions between the utterance embeddings and the schema embeddings, and a decoder to generate pointers to represent the current state of dialogue. Seq2Seq-DU has the following advantages. It can jointly model intents, slots, and slot values; it can leverage the rich representations of utterances and schemas based on BERT; it can effectively deal with categorical and non-categorical slots, and unseen schemas. In addition, Seq2Seq-DU can also be used in the NLU (natural language understanding) module of a dialogue system. Experimental results on benchmark datasets in different settings (SGD, MultiWOZ2.2, MultiWOZ2.1, WOZ2.0, DSTC2, M2M, SNIPS, and ATIS) show that Seq2Seq-DU outperforms the existing methods.
CLApr 29, 2019
Logician: A Unified End-to-End Neural Approach for Open-Domain Information ExtractionMingming Sun, Xu Li, Xin Wang et al.
In this paper, we consider the problem of open information extraction (OIE) for extracting entity and relation level intermediate structures from sentences in open-domain. We focus on four types of valuable intermediate structures (Relation, Attribute, Description, and Concept), and propose a unified knowledge expression form, SAOKE, to express them. We publicly release a data set which contains more than forty thousand sentences and the corresponding facts in the SAOKE format labeled by crowd-sourcing. To our knowledge, this is the largest publicly available human labeled data set for open information extraction tasks. Using this labeled SAOKE data set, we train an end-to-end neural model using the sequenceto-sequence paradigm, called Logician, to transform sentences into facts. For each sentence, different to existing algorithms which generally focus on extracting each single fact without concerning other possible facts, Logician performs a global optimization over all possible involved facts, in which facts not only compete with each other to attract the attention of words, but also cooperate to share words. An experimental study on various types of open domain relation extraction tasks reveals the consistent superiority of Logician to other states-of-the-art algorithms. The experiments verify the reasonableness of SAOKE format, the valuableness of SAOKE data set, the effectiveness of the proposed Logician model, and the feasibility of the methodology to apply end-to-end learning paradigm on supervised data sets for the challenging tasks of open information extraction.