93.1LGJun 4
When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent TrainingYuanfan Li, Qi Zhou, Wenjing Duan et al.
Long-horizon LLM agents require reinforcement learning methods that can assign credit to intermediate decisions under sparse and delayed rewards. Recent group-based methods such as GiGPO improve over GRPO by constructing step-level advantages at repeated anchor states. However, we show that such dense credit can be statistically unreliable: under limited rollouts, rare but lucky actions may receive overly large advantages, producing divergent anchor bias and late-stage training oscillation. We propose Evidence-Calibrated Policy Optimization (ECPO), a critic-free policy optimization algorithm that calibrates step-level credit before policy updates. ECPO combines Evidence-Calibrated Action Advantage, which groups rollouts by canonical actions and shrinks low-count estimates, with Variance-Gated Credit Weighting, which suppresses anchor states dominated by within-action noise. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B show that ECPO consistently outperforms strong baselines, improving GiGPO by +5.2/+7.3 success points on ALFWorld/WebShop with Qwen2.5-1.5B while adding only 0.1% additional advantage-computation overhead.
92.1SEJun 3
UModel: An Agent-Ready Observability Data Modeling Method at ScaleChanghua Pei, Zheyuan Li, Zexin Wang et al.
When networked system failures occur, automatically performing Root Cause Analysis (RCA) using observability data is critical for ensuring networked system reliability. Recently, LLM-based agents have shown promise for automating this diagnosis process through advanced reasoning and autonomous exploration. However, existing observability frameworks remain archaic, characterized by fragmented data silos, incompatible schemas, and insufficient semantic metadata, preventing agents from establishing the complex relationships required for effective RCA. To address these challenges, we present UModel, a unified ontological framework that shifts observability from data-centric to object-centric modeling. UModel constructs a virtual ontological layer where heterogeneous telemetry, entities, and expert knowledge are standardized as objects and interconnected via semantic graphs. In addition, we introduce U-SPL, a pipeline-based query interface that enables agents to autonomously explore system topologies and correlate multimodal data. By re-modeling the "AIOps 2025 Challenge" dataset using UModel, the precision of root cause localization improved by 8%, demonstrating that enhanced data organization can significantly increase the accuracy of downstream tasks. UModel provides a scalable modeling framework that, in its deployment at Alibaba Cloud for more than one year, has served tens of thousands of users, sustained millions of operations per second, and delivered sub-second query latency.
79.5CVMay 25Code
ERNIE-Image Technical ReportJiaxiang Liu, Zhida Feng, Pengyu Zou et al.
We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.
LGFeb 12, 2023Code
Robust Representation Learning by Clustering with Bisimulation Metrics for Visual Reinforcement Learning with DistractionsQiyuan Liu, Qi Zhou, Rui Yang et al.
Recent work has shown that representation learning plays a critical role in sample-efficient reinforcement learning (RL) from pixels. Unfortunately, in real-world scenarios, representation learning is usually fragile to task-irrelevant distractions such as variations in background or viewpoint. To tackle this problem, we propose a novel clustering-based approach, namely Clustering with Bisimulation Metrics (CBM), which learns robust representations by grouping visual observations in the latent space. Specifically, CBM alternates between two steps: (1) grouping observations by measuring their bisimulation distances to the learned prototypes; (2) learning a set of prototypes according to the current cluster assignments. Computing cluster assignments with bisimulation metrics enables CBM to capture task-relevant information, as bisimulation metrics quantify the behavioral similarity between observations. Moreover, CBM encourages the consistency of representations within each group, which facilitates filtering out task-irrelevant information and thus induces robust representations against distractions. An appealing feature is that CBM can achieve sample-efficient representation learning even if multiple distractions exist simultaneously.Experiments demonstrate that CBM significantly improves the sample efficiency of popular visual RL algorithms and achieves state-of-the-art performance on both multiple and single distraction settings. The code is available at https://github.com/MIRALab-USTC/RL-CBM.
LGFeb 19, 2023
Generalization in Visual Reinforcement Learning with the Reward Sequence DistributionJie Wang, Rui Yang, Zijie Geng et al.
Generalization in partially observed markov decision processes (POMDPs) is critical for successful applications of visual reinforcement learning (VRL) in real scenarios. A widely used idea is to learn task-relevant representations that encode task-relevant information of common features in POMDPs, i.e., rewards and transition dynamics. As transition dynamics in the latent state space -- which are task-relevant and invariant to visual distractions -- are unknown to the agents, existing methods alternatively use transition dynamics in the observation space to extract task-relevant information in transition dynamics. However, such transition dynamics in the observation space involve task-irrelevant visual distractions, degrading the generalization performance of VRL methods. To tackle this problem, we propose the reward sequence distribution conditioned on the starting observation and the predefined subsequent action sequence (RSD-OA). The appealing features of RSD-OA include that: (1) RSD-OA is invariant to visual distractions, as it is conditioned on the predefined subsequent action sequence without task-irrelevant information from transition dynamics, and (2) the reward sequence captures long-term task-relevant information in both rewards and transition dynamics. Experiments demonstrate that our representation learning approach based on RSD-OA significantly improves the generalization performance on unseen environments, outperforming several state-of-the-arts on DeepMind Control tasks with visual distractions.
LGDec 14, 2022
Efficient Exploration in Resource-Restricted Reinforcement LearningZhihai Wang, Taoxing Pan, Qi Zhou et al.
In many real-world applications of reinforcement learning (RL), performing actions requires consuming certain types of resources that are non-replenishable in each episode. Typical applications include robotic control with limited energy and video games with consumable items. In tasks with non-replenishable resources, we observe that popular RL methods such as soft actor critic suffer from poor sample efficiency. The major reason is that, they tend to exhaust resources fast and thus the subsequent exploration is severely restricted due to the absence of resources. To address this challenge, we first formalize the aforementioned problem as a resource-restricted reinforcement learning, and then propose a novel resource-aware exploration bonus (RAEB) to make reasonable usage of resources. An appealing feature of RAEB is that, it can significantly reduce unnecessary resource-consuming trials while effectively encouraging the agent to explore unvisited states. Experiments demonstrate that the proposed RAEB significantly outperforms state-of-the-art exploration strategies in resource-restricted reinforcement learning environments, improving the sample efficiency by up to an order of magnitude.
SEAug 25, 2024
LogParser-LLM: Advancing Efficient Log Parsing with Large Language ModelsAoxiao Zhong, Dengyao Mo, Guiyang Liu et al.
Logs are ubiquitous digital footprints, playing an indispensable role in system diagnostics, security analysis, and performance optimization. The extraction of actionable insights from logs is critically dependent on the log parsing process, which converts raw logs into structured formats for downstream analysis. Yet, the complexities of contemporary systems and the dynamic nature of logs pose significant challenges to existing automatic parsing techniques. The emergence of Large Language Models (LLM) offers new horizons. With their expansive knowledge and contextual prowess, LLMs have been transformative across diverse applications. Building on this, we introduce LogParser-LLM, a novel log parser integrated with LLM capabilities. This union seamlessly blends semantic insights with statistical nuances, obviating the need for hyper-parameter tuning and labeled training data, while ensuring rapid adaptability through online parsing. Further deepening our exploration, we address the intricate challenge of parsing granularity, proposing a new metric and integrating human interactions to allow users to calibrate granularity to their specific needs. Our method's efficacy is empirically demonstrated through evaluations on the Loghub-2k and the large-scale LogPub benchmark. In evaluations on the LogPub benchmark, involving an average of 3.6 million logs per dataset across 14 datasets, our LogParser-LLM requires only 272.5 LLM invocations on average, achieving a 90.6% F1 score for grouping accuracy and an 81.1% for parsing accuracy. These results demonstrate the method's high efficiency and accuracy, outperforming current state-of-the-art log parsers, including pattern-based, neural network-based, and existing LLM-enhanced approaches.
70.7IRMay 25
SIREN: Unified Multi-Granularity Semantic Interaction for Multi-Modal Lifelong User Interest ModelingYaqian Zhang, Ruyi Yu, Tianyi Li et al.
Industrial recommender systems increasingly leverage lifelong user behavior histories and rich multi-modal content to capture evolving user preferences. However, effectively integrating multi-modal features into lifelong interest modeling remains challenging due to the inherent misalignment between multi-modal and collaborative spaces. Existing paradigms typically rely on separate modeling of multi-modal sequence and behavior sequence, and late fusion to alleviate the modality gap, which results in coarse-grained multi-modal representation and limited integration. In this paper, we propose SIREN, a unified multi-granularity semantic interaction framework for multi-modal lifelong user interest modeling. In the General Search Unit stage, we introduce two alternative retrieval strategies: multi-modal similarity-based soft retrieval for retrieval effectiveness, and Semantic ID (SemID)-based hard retrieval for efficient industrial serving. For the Exact Search Unit stage, we explicitly incorporate target-aware relevance via coarse similarity buckets and fine-grained prefix-encoded SemIDs, enabling unified interaction with collaborative ID features within the target-conditioned transformer architecture. Extensive experiments on the offline dataset demonstrate that SIREN achieves a state-of-the-art GAUC. Online A/B tests further demonstrate consistent GMV gains across multiple production scenarios, including +2.28% in Weixin Moments, +3.87% in Weixin Official Accounts, and +1.61% in Weixin Channels. From July 2025, SIREN has been fully launched for full-traffic serving in Tencent's advertising platform.
97.4NAMar 29
Quasiperiodic Elliptic Operators: Projection Method and Convergence AnalysisKai Jiang, Jiaqi Tang, Qilong Zhai et al.
Quasiperiodic elliptic operators (QEOs) serve as fundamental models in both mathematics and physics, as exemplified by their role in the numerical modeling of one-dimensional photonic quasicrystals. However, distinct from periodic elliptic operators, approximating eigenpairs for QEOs poses significant challenges, particularly in capturing the full spectral structure (notably the continuous spectrum) and deriving convergence guarantees in the absence of compactness. In this paper, we develop a high-accuracy numerical method to compute eigenpairs of QEOs based on the projection method, which embeds quasiperiodic operators into a higher-dimensional periodic torus. To address the non-compactness issue, we construct a directional-derivative Hilbert space along irrational manifolds of a high-dimensional torus and characterize operators equivalent to QEOs within this space. By integrating spectral theory for non-compact operators into the Babuška-Osborn eigenproblem framework, we establish rigorous convergence analysis and prove that our method achieves spectral accuracy. Numerical experiments validate the accuracy and efficiency of the proposed method, including a one-dimensional photonic quasicrystal and two- and three-dimensional QEOs.
23.3NEMar 29
A Novel Immune Algorithm for Multiparty Multiobjective OptimizationKesheng Chen, Wenjian Luo, Qi Zhou et al.
Traditional multiobjective optimization problems (MOPs) are insufficiently equipped for scenarios involving multiple decision makers (DMs), which are prevalent in many practical applications. These scenarios are categorized as multiparty multiobjective optimization problems (MPMOPs). For MPMOPs, the goal is to find a solution set that is as close to the Pareto front of each DM as much as possible. This poses challenges for evolutionary algorithms in terms of searching and selecting. To better solve MPMOPs, this paper proposes a novel approach called the multiparty immune algorithm (MPIA). The MPIA incorporates an inter-party guided crossover strategy based on the individual's non-dominated sorting ranks from different DM perspectives and an adaptive activation strategy based on the proposed multiparty cover metric (MCM). These strategies enable MPIA to activate suitable individuals for the next operations, maintain population diversity from different DM perspectives, and enhance the algorithm's search capability. To evaluate the performance of MPIA, we compare it with ordinary multiobjective evolutionary algorithms (MOEAs) and state-of-the-art multiparty multiobjective optimization evolutionary algorithms (MPMOEAs) by solving synthetic multiparty multiobjective problems and real-world biparty multiobjective unmanned aerial vehicle path planning (BPUAV-PP) problems involving multiple DMs. Experimental results demonstrate that MPIA outperforms other algorithms.
CVDec 19, 2023Code
Regulating Intermediate 3D Features for Vision-Centric Autonomous DrivingJunkai Xu, Liang Peng, Haoran Cheng et al.
Multi-camera perception tasks have gained significant attention in the field of autonomous driving. However, existing frameworks based on Lift-Splat-Shoot (LSS) in the multi-camera setting cannot produce suitable dense 3D features due to the projection nature and uncontrollable densification process. To resolve this problem, we propose to regulate intermediate dense 3D features with the help of volume rendering. Specifically, we employ volume rendering to process the dense 3D features to obtain corresponding 2D features (e.g., depth maps, semantic maps), which are supervised by associated labels in the training. This manner regulates the generation of dense 3D features on the feature level, providing appropriate dense and unified features for multiple perception tasks. Therefore, our approach is termed Vampire, stands for "Volume rendering As Multi-camera Perception Intermediate feature REgulator". Experimental results on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates fine-grained and appropriate extraction of dense 3D features, and is competitive with existing SOTA methods across diverse downstream perception tasks like 3D occupancy prediction, LiDAR segmentation and 3D objection detection, while utilizing moderate GPU resources. We provide a video demonstration in the supplementary materials and Codes are available at github.com/cskkxjk/Vampire.
CRJul 7, 2024
Evolutionary Trigger Detection and Lightweight Model Repair Based Backdoor DefenseQi Zhou, Zipeng Ye, Yubo Tang et al.
Deep Neural Networks (DNNs) have been widely used in many areas such as autonomous driving and face recognition. However, DNN model is fragile to backdoor attack. A backdoor in the DNN model can be activated by a poisoned input with trigger and leads to wrong prediction, which causes serious security issues in applications. It is challenging for current defenses to eliminate the backdoor effectively with limited computing resources, especially when the sizes and numbers of the triggers are variable as in the physical world. We propose an efficient backdoor defense based on evolutionary trigger detection and lightweight model repair. In the first phase of our method, CAM-focus Evolutionary Trigger Filter (CETF) is proposed for trigger detection. CETF is an effective sample-preprocessing based method with the evolutionary algorithm, and our experimental results show that CETF not only distinguishes the images with triggers accurately from the clean images, but also can be widely used in practice for its simplicity and stability in different backdoor attack situations. In the second phase of our method, we leverage several lightweight unlearning methods with the trigger detected by CETF for model repair, which also constructively demonstrate the underlying correlation of the backdoor with Batch Normalization layers. Source code will be published after accepted.
CVJul 10, 2024
Coherent and Multi-modality Image Inpainting via Latent Space OptimizationLingzhi Pan, Tong Zhang, Bingyuan Chen et al.
With the advancements in denoising diffusion probabilistic models (DDPMs), image inpainting has significantly evolved from merely filling information based on nearby regions to generating content conditioned on various prompts such as text, exemplar images, and sketches. However, existing methods, such as model fine-tuning and simple concatenation of latent vectors, often result in generation failures due to overfitting and inconsistency between the inpainted region and the background. In this paper, we argue that the current large diffusion models are sufficiently powerful to generate realistic images without further tuning. Hence, we introduce PILOT (in\textbf{P}ainting v\textbf{I}a \textbf{L}atent \textbf{O}p\textbf{T}imization), an optimization approach grounded on a novel \textit{semantic centralization} and \textit{background preservation loss}. Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background. Furthermore, we propose a strategy to balance optimization expense and image quality, significantly enhancing generation efficiency. Our method seamlessly integrates with any pre-trained model, including ControlNet and DreamBooth, making it suitable for deployment in multi-modal editing tools. Our qualitative and quantitative evaluations demonstrate that PILOT outperforms existing approaches by generating more coherent, diverse, and faithful inpainted regions in response to provided prompts.
17.3CVApr 1
Gaze to Insight: A Scalable AI Approach for Detecting Gaze Behaviours in Face-to-Face Collaborative LearningJunyuan Liang, Qi Zhou, Sahan Bulathwela et al.
Previous studies have illustrated the potential of analysing gaze behaviours in collaborative learning to provide educationally meaningful information for students to reflect on their learning. Over the past decades, machine learning approaches have been developed to automatically detect gaze behaviours from video data. Yet, since these approaches often require large amounts of labelled data for training, human annotation remains necessary. Additionally, researchers have questioned the cross-configuration robustness of machine learning models developed, as training datasets often fail to encompass the full range of situations encountered in educational contexts. To address these challenges, this study proposes a scalable artificial intelligence approach that leverages pretrained and foundation models to automatically detect gaze behaviours in face-to-face collaborative learning contexts without requiring human-annotated data. The approach utilises pretrained YOLO11 for person tracking, YOLOE-26 with text-prompt capability for education-related object detection, and the Gaze-LLE model for gaze target prediction. The results indicate that the proposed approach achieves an F1-score of 0.829 in detecting students' gaze behaviours from video data, with strong performance for laptop-directed gaze and peer-directed gaze, yet weaker performance for other gaze targets. Furthermore, when compared to other supervised machine learning approaches, the proposed method demonstrates superior and more stable performance in complex contexts, highlighting its better cross-configuration robustness. The implications of this approach for supporting students' collaborative learning in real-world environments are also discussed.
CLMay 20, 2025Code
InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model FusionYuanyi Wang, Zhaoyi Yan, Yiming Zhang et al.
Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose \textbf{InfiGFusion}, the first structure-aware fusion framework with a novel \textit{Graph-on-Logits Distillation} (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
CYDec 9, 2025
Examining Student Interactions with a Pedagogical AI-Assistant for Essay Writing and their Impact on Students Writing QualityWicaksono Febriantoro, Qi Zhou, Wannapon Suraworachet et al.
The dynamic nature of interactions between students and GenAI, as well as their relationship to writing quality, remains underexplored. While most research has examined how general-purpose GenAI can support writing, fewer studies have investigated how students interact with pedagogically designed systems across different phases of the writing process. To address this gap, we evaluated a GenAI-driven essay-writing assistant (EWA) designed to support higher education students in argumentative writing. Drawing on 1,282 interaction logs from 32 undergraduates during a two-hour writing session, Sequential Pattern Mining and K-Means clustering were used to identify behavioral patterns. Two clusters emerged: Cluster 1 emphasized outline planning and essay structure, while Cluster 2 focused on content development. A Mann-Whitney U test revealed a moderate effect size (r = 0.36) in the essay Organization dimension, with Cluster 1 showing higher scores. Qualitative analysis indicated that students with better performance actively wrote and shared essay sections with EWA for feedback, rather than interacted passively by asking questions. These findings suggest implications for teaching and system design. Teachers can encourage active engagement, while future EWAs may integrate automatic labeling and monitoring to prompt students to move from questioning to writing, enabling fuller benefits from GenAI-supported learning.
AIAug 7, 2025Code
InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning CapabilitiesShuo Cai, Su Lu, Qi Zhou et al.
Large language models (LLMs) have exhibited impressive reasoning abilities on a wide range of complex tasks. However, enhancing these capabilities through post-training remains resource intensive, particularly in terms of data and computational cost. Although recent efforts have sought to improve sample efficiency through selective data curation, existing methods often rely on heuristic or task-specific strategies that hinder scalability. In this work, we introduce InfiAlign, a scalable and sample-efficient post-training framework that integrates supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to align LLMs for enhanced reasoning. At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources. When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks. Additional improvements are obtained through the application of DPO, with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 24/25 benchmarks. Our results highlight the effectiveness of combining principled data selection with full-stage post-training, offering a practical solution for aligning large reasoning models in a scalable and data-efficient manner. The model checkpoints are available at https://huggingface.co/InfiX-ai/InfiAlign-Qwen-7B-SFT.
CVMay 7, 2025Code
DFVO: Learning Darkness-free Visible and Infrared Image Disentanglement and Fusion All at OnceQi Zhou, Yukai Shi, Xiaojun Yang et al.
Visible and infrared image fusion is one of the most crucial tasks in the field of image fusion, aiming to generate fused images with clear structural information and high-quality texture features for high-level vision tasks. However, when faced with severe illumination degradation in visible images, the fusion results of existing image fusion methods often exhibit blurry and dim visual effects, posing major challenges for autonomous driving. To this end, a Darkness-Free network is proposed to handle Visible and infrared image disentanglement and fusion all at Once (DFVO), which employs a cascaded multi-task approach to replace the traditional two-stage cascaded training (enhancement and fusion), addressing the issue of information entropy loss caused by hierarchical data transmission. Specifically, we construct a latent-common feature extractor (LCFE) to obtain latent features for the cascaded tasks strategy. Firstly, a details-extraction module (DEM) is devised to acquire high-frequency semantic information. Secondly, we design a hyper cross-attention module (HCAM) to extract low-frequency information and preserve texture features from source images. Finally, a relevant loss function is designed to guide the holistic network learning, thereby achieving better image fusion. Extensive experiments demonstrate that our proposed approach outperforms state-of-the-art alternatives in terms of qualitative and quantitative evaluations. Particularly, DFVO can generate clearer, more informative, and more evenly illuminated fusion results in the dark environments, achieving best performance on the LLVIP dataset with 63.258 dB PSNR and 0.724 CC, providing more effective information for high-level vision tasks. Our code is publicly accessible at https://github.com/DaVin-Qi530/DFVO.
20.3CVMar 30
CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language ModelsKesheng Chen, Yamin Hu, Qi Zhou et al.
Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon \textbf{commonsense-driven hallucination} (CDH). To evaluate it, we introduce \textbf{CDH-Bench}, a benchmark designed to create explicit \textbf{visual evidence--commonsense conflicts}. CDH-Bench covers three dimensions: \textit{counting anomalies}, \textit{relational anomalies}, and \textit{attribute anomalies}. We evaluate frontier VLMs under \textit{binary Question Answering (QA)} and \textit{multiple-choice QA}, and report metrics including \textit{Counterfactual Accuracy} (CF-Acc), \textit{Commonsense Accuracy} (CS-Acc), \textit{Counterfactual Accuracy Drop} (CFAD), \textit{Commonsense Collapse Rate} (CCR), and \textit{Relative Prior Dependency} (RPD). Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence--commonsense conflict. CDH-Bench provides a controlled diagnostic of visual fidelity under visual evidence--commonsense conflict.
72.6CRMay 4
Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial TrainingWenjing Duan, Qi Zhou, Yuanfan Li
Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors often underperform in few-shot settings and remain vulnerable to adversarial, humanizing attacks. To build accurate and robust detectors under limited supervision, we adopt a threat-modeling perspective and study detector vulnerabilities from an attacker's viewpoint under an output-only black-box setting. Motivated by this perspective, we propose RAG-GuidEd Attacker Strengthens ConTrastive Few-shot Detector (REACT), an adversarial training framework that improves both few-shot detection performance and robustness against attacks. REACT couples a humanization-oriented attacker with a target detector: the attacker leverages retrieval-augmented generation (RAG) to craft highly human-like adversarial examples to evade detection, while the detector learns from these adversaries with a contrastive objective to stabilize few-shot representation learning and enhance robustness. We alternately update the attacker and the detector to enable their co-evolution. Experiments on 4 datasets with 4 shot sizes and 3 random seeds show that REACT improves average detection F1 by 4.95 points over 8 state-of-the-art (SOTA) detectors and reduces the average attack success rate (ASR) under 4 strong attacks by 3.66 percentage points.
81.0CRApr 28
MGTEVAL: An Interactive Platform for Systemtic Evaluation of Machine-Generated Text DetectorsYuanfan Li, Qi Zhou, Chengzhengxu Li et al.
We present MGTEVAL, an extensible platform for systematic evaluation of Machine-Generated Text (MGT) detectors. Despite rapid progress in MGT detection, existing evaluations are often fragmented across datasets, preprocessing, attacks, and metrics, making results hard to compare and reproduce. MGTEVAL organizes the workflow into four components: Dataset Building, Dataset Attack, Detector Training, and Performance Evaluation. It supports constructing custom benchmarks by generating MGT with configurable LLMs, applying 12 text attacks to test sets, training detectors via a unified interface, and reporting effectiveness, robustness, and efficiency. The platform provides both command-line and Web-based interfaces for user-friendly experimentation without code rewriting.
4.4LGApr 23
Relocation of compact sets in $\mathbb{R}^n$ by diffeomorphisms and linear separability of datasets in $\mathbb{R}^n$Xiao-Song Yang, Xuan Zhou, Qi Zhou
Relocation of compact sets in an $n$-dimensional manifold by self-diffeomorphism is of its own interest as well as significant potential applications to data classification in data science. This paper presents a theory for relocating a finite number of compact sets in $\mathbb{R}^n$ to be relocated to arbitrary target domains in $\mathbb{R}^n$ by diffeomorphisms of $\mathbb{R}^n$. Furthermore, we prove that for any such collection, there exists a differentiable embedding into $\mathbb{R}^{n+1}$ such that their images become linearly separable. As applications of the established theory, we show that a finite number of compact datasets in $\mathbb{R}^n$ can be made linearly separable by width-$n$ deep neural networks (DNNs) with Leaky-ReLU, ELU, or SELU activation functions, under a mild condition. In addition, we show that any finite number of mutually disjoint compact datasets in $\mathbb{R}^n$ can be made linearly separable in $\mathbb{R}^{n+1}$ by a width-$(n+1)$ DNN.
LGNov 10, 2025
Minimum Width of Deep Narrow Networks for Universal ApproximationXiao-Song Yang, Qi Zhou, Xuan Zhou
Determining the minimum width of fully connected neural networks has become a fundamental problem in recent theoretical studies of deep neural networks. In this paper, we study the lower bounds and upper bounds of the minimum width required for fully connected neural networks in order to have universal approximation capability, which is important in network design and training. We show that $w_{min}\leq\max(2d_x+1, d_y)$ for networks with ELU, SELU, and the upper bound of this inequality is attained when $d_y=2d_x$, where $d_x$, $d_y$ denote the input and output dimensions, respectively. Besides, we show that $d_x+1\leq w_{min}\leq d_x+d_y$ for networks with LeakyReLU, ELU, CELU, SELU, Softplus, by proving that ReLU can be approximated by these activation functions. In addition, in the case that the activation function is injective or can be uniformly approximated by a sequence of injective functions (e.g., ReLU), we present a new proof of the inequality $w_{min}\ge d_y+\mathbf{1}_{d_x<d_y\leq2d_x}$ by constructing a more intuitive example via a new geometric approach based on Poincar$\acute{\text{e}}$-Miranda Theorem.
CLOct 17, 2024
Unconstrained Model Merging for Enhanced LLM ReasoningYiming Zhang, Baoyi He, Shengyu Zhang et al.
Recent advancements in building domain-specific large language models (LLMs) have shown remarkable success, especially in tasks requiring reasoning abilities like logical inference over complex relationships and multi-step problem solving. However, creating a powerful all-in-one LLM remains challenging due to the need for proprietary data and vast computational resources. As a resource-friendly alternative, we explore the potential of merging multiple expert models into a single LLM. Existing studies on model merging mainly focus on generalist LLMs instead of domain experts, or the LLMs under the same architecture and size. In this work, we propose an unconstrained model merging framework that accommodates both homogeneous and heterogeneous model architectures with a focus on reasoning tasks. A fine-grained layer-wise weight merging strategy is designed for homogeneous models merging, while heterogeneous model merging is built upon the probabilistic distribution knowledge derived from instruction-response fine-tuning data. Across 7 benchmarks and 9 reasoning-optimized LLMs, we reveal key findings that combinatorial reasoning emerges from merging which surpasses simple additive effects. We propose that unconstrained model merging could serve as a foundation for decentralized LLMs, marking a notable progression from the existing centralized LLM framework. This evolution could enhance wider participation and stimulate additional advancement in the field of artificial intelligence, effectively addressing the constraints posed by centralized models.
LGMar 2, 2024
MPIPN: A Multi Physics-Informed PointNet for solving parametric acoustic-structure systemsChu Wang, Jinhong Wu, Yanzhi Wang et al.
Machine learning is employed for solving physical systems governed by general nonlinear partial differential equations (PDEs). However, complex multi-physics systems such as acoustic-structure coupling are often described by a series of PDEs that incorporate variable physical quantities, which are referred to as parametric systems. There are lack of strategies for solving parametric systems governed by PDEs that involve explicit and implicit quantities. In this paper, a deep learning-based Multi Physics-Informed PointNet (MPIPN) is proposed for solving parametric acoustic-structure systems. First, the MPIPN induces an enhanced point-cloud architecture that encompasses explicit physical quantities and geometric features of computational domains. Then, the MPIPN extracts local and global features of the reconstructed point-cloud as parts of solving criteria of parametric systems, respectively. Besides, implicit physical quantities are embedded by encoding techniques as another part of solving criteria. Finally, all solving criteria that characterize parametric systems are amalgamated to form distinctive sequences as the input of the MPIPN, whose outputs are solutions of systems. The proposed framework is trained by adaptive physics-informed loss functions for corresponding computational domains. The framework is generalized to deal with new parametric conditions of systems. The effectiveness of the MPIPN is validated by applying it to solve steady parametric acoustic-structure coupling systems governed by the Helmholtz equations. An ablation experiment has been implemented to demonstrate the efficacy of physics-informed impact with a minority of supervised data. The proposed method yields reasonable precision across all computational domains under constant parametric conditions and changeable combinations of parametric conditions for acoustic-structure systems.
CLJan 6, 2025
InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM FusionZhaoyi Yan, Yiming Zhang, Baoyi He et al.
We introduce InfiFusion, an efficient training pipeline designed to integrate multiple domain-specialized Large Language Models (LLMs) into a single pivot model, effectively harnessing the strengths of each source model. Traditional fusion methods either merge model parameters directly or rely on knowledge distillation with rigid assumptions, limiting their flexibility and efficiency. InfiFusion overcomes these limitations by enhancing Universal Logit Distillation (ULD) with Top-K selection and Logits Standardization. We propose two fusion strategies: Pairwise Fusion (InfiFusion$_p$), where each source model knowledge is distilled individually into the pivot model followed by merging and Unified Fusion (InfiFusion$_u$), where knowledge from all source models is distilled simultaneously into the pivot model. InfiFusion outperforms the state-of-the-art models, such as Qwen-2.5-14B-Instruct and Phi-4, across 11 widely applied benchmarks covering reasoning, coding, mathematics, and instruction-following tasks. Notably, InfiFusion achieves this superior performance while significantly reduces computational costs, completing full training with only 160 H800 GPU hours compared to the millions typically required for traditional LLM training.
LGMay 20, 2025
InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language ModelsYanggan Gu, Yuanyi Wang, Zhaoyi Yan et al.
Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA) --a critical phase for enhancing LLM performance--largely unexplored. The current few fusion methods on PA phase, like WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing complex vocabulary alignment challenges in previous works and meanwhile maintaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improve its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.
CLFeb 17, 2025
InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in ReasoningCongkai Xie, Shuo Cai, Wenjun Wang et al.
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have made significant advancements in reasoning capabilities. However, they still face challenges such as high computational demands and privacy concerns. This paper focuses on developing efficient Small Language Models (SLMs) and Multimodal Small Language Models (MSLMs) that retain competitive reasoning abilities. We introduce a novel training pipeline that enhances reasoning capabilities and facilitates deployment on edge devices, achieving state-of-the-art performance while minimizing development costs. \InfR~ aims to advance AI systems by improving reasoning, reducing adoption barriers, and addressing privacy concerns through smaller model sizes. Resources are available at https://github. com/Reallm-Labs/InfiR.
CVFeb 6, 2024
FoolSDEdit: Deceptively Steering Your Edits Towards Targeted Attribute-aware DistributionQi Zhou, Dongxia Wang, Tianlin Li et al.
Guided image synthesis methods, like SDEdit based on the diffusion model, excel at creating realistic images from user inputs such as stroke paintings. However, existing efforts mainly focus on image quality, often overlooking a key point: the diffusion model represents a data distribution, not individual images. This introduces a low but critical chance of generating images that contradict user intentions, raising ethical concerns. For example, a user inputting a stroke painting with female characteristics might, with some probability, get male faces from SDEdit. To expose this potential vulnerability, we aim to build an adversarial attack forcing SDEdit to generate a specific data distribution aligned with a specified attribute (e.g., female), without changing the input's attribute characteristics. We propose the Targeted Attribute Generative Attack (TAGA), using an attribute-aware objective function and optimizing the adversarial noise added to the input stroke painting. Empirical studies reveal that traditional adversarial noise struggles with TAGA, while natural perturbations like exposure and motion blur easily alter generated images' attributes. To execute effective attacks, we introduce FoolSDEdit: We design a joint adversarial exposure and blur attack, adding exposure and motion blur to the stroke painting and optimizing them together. We optimize the execution strategy of various perturbations, framing it as a network architecture search problem. We create the SuperPert, a graph representing diverse execution strategies for different perturbations. After training, we obtain the optimized execution strategy for effective TAGA against SDEdit. Comprehensive experiments on two datasets show our method compelling SDEdit to generate a targeted attribute-aware data distribution, significantly outperforming baselines.
CYJan 3, 2024
Harnessing Transparent Learning Analytics for Individualized Support through Auto-detection of Engagement in Face-to-Face Collaborative LearningQi Zhou, Wannapon Suraworachet, Mutlu Cukurova
Using learning analytics to investigate and support collaborative learning has been explored for many years. Recently, automated approaches with various artificial intelligence approaches have provided promising results for modelling and predicting student engagement and performance in collaborative learning tasks. However, due to the lack of transparency and interpretability caused by the use of "black box" approaches in learning analytics design and implementation, guidance for teaching and learning practice may become a challenge. On the one hand, the black box created by machine learning algorithms and models prevents users from obtaining educationally meaningful learning and teaching suggestions. On the other hand, focusing on group and cohort level analysis only can make it difficult to provide specific support for individual students working in collaborative groups. This paper proposes a transparent approach to automatically detect student's individual engagement in the process of collaboration. The results show that the proposed approach can reflect student's individual engagement and can be used as an indicator to distinguish students with different collaborative learning challenges (cognitive, behavioural and emotional) and learning outcomes. The potential of the proposed collaboration analytics approach for scaffolding collaborative learning practice in face-to-face contexts is discussed and future research suggestions are provided.
CVDec 17, 2024
Defending LVLMs Against Vision Attacks through Partial-Perception SupervisionQi Zhou, Tianlin Li, Qing Guo et al.
Recent studies have raised significant concerns regarding the vulnerability of Large Vision Language Models (LVLMs) to maliciously injected or perturbed input images, which can mislead their responses. Existing defense methods show that such vision attacks are sensitive to image modifications especially cropping, using majority voting across responses of modified images as corrected responses. However, these modifications often result in partial images and distort the semantics, which reduces response quality on clean images after voting. Instead of directly using responses from partial images for voting, we investigate using them to supervise the LVLM's responses to the original images. We propose a black-box, training-free method called DPS (Defense through Partial-Perception Supervision). In this approach, the model is prompted using the responses generated by a model that perceives only a partial image. With DPS, the model can adjust its response based on partial image understanding when under attack, while confidently maintaining its original response for clean input. Our findings show that the weak model can supervise the strong model: when faced with an attacked input, the strong model becomes less confident and adjusts its response based on the weak model's partial understanding, effectively defending against the attack. With clean input, it confidently maintains its original response. Empirical experiments show our method outperforms the baseline, cutting the average attack success rate by 76.3% across six datasets on three popular models.
AISep 29, 2025
Model Merging Scaling Laws in Large Language ModelsYuanyi Wang, Yanggan Gu, Yiming Zhang et al.
We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget--turning merging from heuristic practice into a computationally efficient, planable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
AIMay 17, 2025
Fair-PP: A Synthetic Dataset for Aligning LLM with Personalized Preferences of Social EquityQi Zhou, Jie Zhang, Dongxia Wang et al.
Human preference plays a crucial role in the refinement of large language models (LLMs). However, collecting human preference feedback is costly and most existing datasets neglect the correlation between personalization and preferences. To address this issue, we introduce Fair-PP, a synthetic dataset of personalized preferences targeting social equity, derived from real-world social survey data, which includes 28 social groups, 98 equity topics, and 5 personal preference dimensions. Leveraging GPT-4o-mini, we engage in role-playing based on seven representative persona portrayals guided by existing social survey data, yielding a total of 238,623 preference records. Through Fair-PP, we also contribute (i) An automated framework for generating preference data, along with a more fine-grained dataset of personalized preferences; (ii) analysis of the positioning of the existing mainstream LLMs across five major global regions within the personalized preference space; and (iii) a sample reweighting method for personalized preference alignment, enabling alignment with a target persona while maximizing the divergence from other personas. Empirical experiments show our method outperforms the baselines.
CVMar 8
RobustSCI: Beyond Reconstruction to Restoration for Snapshot Compressive Imaging under Real-World DegradationsHao Wang, Yuanfan Li, Qi Zhou et al.
Deep learning algorithms for video Snapshot Compressive Imaging (SCI) have achieved great success, yet they predominantly focus on reconstructing from clean measurements. This overlooks a critical real-world challenge: the captured signal itself is often severely degraded by motion blur and low light. Consequently, existing models falter in practical applications. To break this limitation, we pioneer the first study on robust video SCI restoration, shifting the goal from "reconstruction" to "restoration"--recovering the underlying pristine scene from a degraded measurement. To facilitate this new task, we first construct a large-scale benchmark by simulating realistic, continuous degradations on the DAVIS 2017 dataset. Second, we propose RobustSCI, a network that enhances a strong encoder-decoder backbone with a novel RobustCFormer block. This block introduces two parallel branches--a multi-scale deblur branch and a frequency enhancement branch--to explicitly disentangle and remove degradations during the recovery process. Furthermore, we introduce RobustSCI-C (RobustSCI-Cascade), which integrates a pre-trained Lightweight Post-processing Deblurring Network to significantly boost restoration performance with minimal overhead. Extensive experiments demonstrate that our methods outperform all SOTA models on the new degraded testbeds, with additional validation on real-world degraded SCI data confirming their practical effectiveness, elevating SCI from merely reconstructing what is captured to restoring what truly happened.
CLFeb 2
Graph-Augmented Reasoning with Large Language Models for Tobacco Pest and Disease ManagementSiyu Li, Chenwei Song, Qi Zhou et al.
This paper proposes a graph-augmented reasoning framework for tobacco pest and disease management that integrates structured domain knowledge into large language models. Building on GraphRAG, we construct a domain-specific knowledge graph and retrieve query-relevant subgraphs to provide relational evidence during answer generation. The framework adopts ChatGLM as the Transformer backbone with LoRA-based parameter-efficient fine-tuning, and employs a graph neural network to learn node representations that capture symptom-disease-treatment dependencies. By explicitly modeling diseases, symptoms, pesticides, and control measures as linked entities, the system supports evidence-aware retrieval beyond surface-level text similarity. Retrieved graph evidence is incorporated into the LLM input to guide generation toward domain-consistent recommendations and to mitigate hallucinated or inappropriate treatments. Experimental results show consistent improvements over text-only baselines, with the largest gains observed on multi-hop and comparative reasoning questions that require chaining multiple relations.
CYNov 24, 2025
Towards Synergistic Teacher-AI Interactions with Generative Artificial IntelligenceMutlu Cukurova, Wannapon Suraworachet, Qi Zhou et al.
Generative artificial intelligence (GenAI) is increasingly used in education, posing significant challenges for teachers adapting to these changes. GenAI offers unprecedented opportunities for accessibility, scalability and productivity in educational tasks. However, the automation of teaching tasks through GenAI raises concerns about reduced teacher agency, potential cognitive atrophy, and the broader deprofessionalisation of teaching. Drawing findings from prior literature on AI in Education, and refining through a recent systematic literature review, this chapter presents a conceptualisation of five levels of teacher-AI teaming: transactional, situational, operational, praxical and synergistic teaming. The framework aims to capture the nuanced dynamics of teacher-AI interactions, particularly with GenAI, that may lead to the replacement, complementarity, or augmentation of teachers' competences and professional practice. GenAI technological affordances required in supporting teaming, along with empirical studies, are discussed. Drawing on empirical observations, we outline a future vision that moves beyond individual teacher agency toward collaborative decision-making between teachers and AI, in which both agents engage in negotiation, constructive challenge, and co-reasoning that enhance each other's capabilities and enable outcomes neither could realise independently. Further discussion of socio-technical factors beyond teacher-AI teaming is also included to streamline the synergy of teachers and AI in education ethically and practically.
CRSep 24, 2025
A Set of Generalized Components to Achieve Effective Poison-only Clean-label Backdoor Attacks with Collaborative Sample Selection and TriggersZhixiao Wu, Yao Lu, Jie Wen et al.
Poison-only Clean-label Backdoor Attacks aim to covertly inject attacker-desired behavior into DNNs by merely poisoning the dataset without changing the labels. To effectively implant a backdoor, multiple \textbf{triggers} are proposed for various attack requirements of Attack Success Rate (ASR) and stealthiness. Additionally, sample selection enhances clean-label backdoor attacks' ASR by meticulously selecting ``hard'' samples instead of random samples to poison. Current methods 1) usually handle the sample selection and triggers in isolation, leading to severely limited improvements on both ASR and stealthiness. Consequently, attacks exhibit unsatisfactory performance on evaluation metrics when converted to PCBAs via a mere stacking of methods. Therefore, we seek to explore the bidirectional collaborative relations between the sample selection and triggers to address the above dilemma. 2) Since the strong specificity within triggers, the simple combination of sample selection and triggers fails to substantially enhance both evaluation metrics, with generalization preserved among various attacks. Therefore, we seek to propose a set of components to significantly improve both stealthiness and ASR based on the commonalities of attacks. Specifically, Component A ascertains two critical selection factors, and then makes them an appropriate combination based on the trigger scale to select more reasonable ``hard'' samples for improving ASR. Component B is proposed to select samples with similarities to relevant trigger implanted samples to promote stealthiness. Component C reassigns trigger poisoning intensity on RGB colors through distinct sensitivity of the human visual system to RGB for higher ASR, with stealthiness ensured by sample selection, including Component B. Furthermore, all components can be strategically integrated into diverse PCBAs.
LGAug 20, 2025
Large Foundation Model for Ads RecommendationShangyu Zhang, Shijie Quan, Zhongren Wang et al.
Online advertising relies on accurate recommendation models, with recent advances using pre-trained large-scale foundation models (LFMs) to capture users' general interests across multiple scenarios and tasks. However, existing methods have critical limitations: they extract and transfer only user representations (URs), ignoring valuable item representations (IRs) and user-item cross representations (CRs); and they simply use a UR as a feature in downstream applications, which fails to bridge upstream-downstream gaps and overlooks more transfer granularities. In this paper, we propose LFM4Ads, an All-Representation Multi-Granularity transfer framework for ads recommendation. It first comprehensively transfers URs, IRs, and CRs, i.e., all available representations in the pre-trained foundation model. To effectively utilize the CRs, it identifies the optimal extraction layer and aggregates them into transferable coarse-grained forms. Furthermore, we enhance the transferability via multi-granularity mechanisms: non-linear adapters for feature-level transfer, an Isomorphic Interaction Module for module-level transfer, and Standalone Retrieval for model-level transfer. LFM4Ads has been successfully deployed in Tencent's industrial-scale advertising platform, processing tens of billions of daily samples while maintaining terabyte-scale model parameters with billions of sparse embedding keys across approximately two thousand features. Since its production deployment in Q4 2024, LFM4Ads has achieved 10+ successful production launches across various advertising scenarios, including primary ones like Weixin Moments and Channels. These launches achieve an overall GMV lift of 2.45% across the entire platform, translating to estimated annual revenue increases in the hundreds of millions of dollars.
CVMar 6, 2025
A Novel Solution for Drone Photogrammetry with Low-overlap Aerial Images using Monocular Depth EstimationJiageng Zhong, Qi Zhou, Ming Li et al.
Low-overlap aerial imagery poses significant challenges to traditional photogrammetric methods, which rely heavily on high image overlap to produce accurate and complete mapping products. In this study, we propose a novel workflow based on monocular depth estimation to address the limitations of conventional techniques. Our method leverages tie points obtained from aerial triangulation to establish a relationship between monocular depth and metric depth, thus transforming the original depth map into a metric depth map, enabling the generation of dense depth information and the comprehensive reconstruction of the scene. For the experiments, a high-overlap drone dataset containing 296 images is processed using Metashape to generate depth maps and DSMs as ground truth. Subsequently, we create a low-overlap dataset by selecting 20 images for experimental evaluation. Results demonstrate that while the recovered depth maps and resulting DSMs achieve meter-level accuracy, they provide significantly better completeness compared to traditional methods, particularly in regions covered by single images. This study showcases the potential of monocular depth estimation in low-overlap aerial photogrammetry.
DCNov 15, 2024
Domain Adaptation-based Edge Computing for Cross-Conditions Fault DiagnosisYanzhi Wang, Jinhong Wu, Chu Wang et al.
Fault diagnosis of mechanical equipment provides robust support for industrial production. It is worth noting that, the operation of mechanical equipment is accompanied by changes in factors such as speed and load, leading to significant differences in data distribution, which pose challenges for fault diagnosis. Additionally, in terms of application deployment, commonly used cloud-based fault diagnosis methods often encounter issues such as time delays and data security concerns, while common fault diagnosis methods cannot be directly applied to edge computing devices. Therefore, conducting fault diagnosis under cross-operating conditions based on edge computing holds significant research value. This paper proposes a domain-adaptation-based lightweight fault diagnosis framework tailored for edge computing scenarios. Incorporating the local maximum mean discrepancy into knowledge transfer aligns the feature distributions of different domains in a high-dimensional feature space, to discover a common feature space across domains. The acquired fault diagnosis expertise from the cloud-based deep neural network model is transferred to the lightweight edge-based model (edge model) using adaptation knowledge transfer methods. It aims to achieve accurate fault diagnosis under cross-working conditions while ensuring real-time diagnosis capabilities. We utilized the NVIDIA Jetson Xavier NX kit as the edge computing platform and conducted validation experiments on two devices. In terms of diagnostic performance, the proposed method significantly improved diagnostic accuracy, with average increases of 34.44% and 17.33% compared to existing methods, respectively.
LGMar 5, 2024
Dynamic Gaussian Graph Operator: Learning parametric partial differential equations in arbitrary discrete mechanics problemsChu Wang, Jinhong Wu, Yanzhi Wang et al.
Deep learning methods have access to be employed for solving physical systems governed by parametric partial differential equations (PDEs) due to massive scientific data. It has been refined to operator learning that focuses on learning non-linear mapping between infinite-dimensional function spaces, offering interface from observations to solutions. However, state-of-the-art neural operators are limited to constant and uniform discretization, thereby leading to deficiency in generalization on arbitrary discretization schemes for computational domain. In this work, we propose a novel operator learning algorithm, referred to as Dynamic Gaussian Graph Operator (DGGO) that expands neural operators to learning parametric PDEs in arbitrary discrete mechanics problems. The Dynamic Gaussian Graph (DGG) kernel learns to map the observation vectors defined in general Euclidean space to metric vectors defined in high-dimensional uniform metric space. The DGG integral kernel is parameterized by Gaussian kernel weighted Riemann sum approximating and using dynamic message passing graph to depict the interrelation within the integral term. Fourier Neural Operator is selected to localize the metric vectors on spatial and frequency domains. Metric vectors are regarded as located on latent uniform domain, wherein spatial and spectral transformation offer highly regular constraints on solution space. The efficiency and robustness of DGGO are validated by applying it to solve numerical arbitrary discrete mechanics problems in comparison with mainstream neural operators. Ablation experiments are implemented to demonstrate the effectiveness of spatial transformation in the DGG kernel. The proposed method is utilized to forecast stress field of hyper-elastic material with geometrically variable void as engineering application.
LGDec 20, 2021
Learning Robust Policy against Disturbance in Transition Dynamics via State-Conservative Policy OptimizationYufei Kuang, Miao Lu, Jie Wang et al.
Deep reinforcement learning algorithms can perform poorly in real-world tasks due to the discrepancy between source and target environments. This discrepancy is commonly viewed as the disturbance in transition dynamics. Many existing algorithms learn robust policies by modeling the disturbance and applying it to source environments during training, which usually requires prior knowledge about the disturbance and control of simulators. However, these algorithms can fail in scenarios where the disturbance from target environments is unknown or is intractable to model in simulators. To tackle this problem, we propose a novel model-free actor-critic algorithm -- namely, state-conservative policy optimization (SCPO) -- to learn robust policies without modeling the disturbance in advance. Specifically, SCPO reduces the disturbance in transition dynamics to that in state space and then approximates it by a simple gradient-based regularizer. The appealing features of SCPO include that it is simple to implement and does not require additional knowledge about the disturbance or specially designed simulators. Experiments in several robot control tasks demonstrate that SCPO learns robust policies against the disturbance in transition dynamics.
LGDec 16, 2021
Sample-Efficient Reinforcement Learning via Conservative Model-Based Actor-CriticZhihai Wang, Jie Wang, Qi Zhou et al.
Model-based reinforcement learning algorithms, which aim to learn a model of the environment to make decisions, are more sample efficient than their model-free counterparts. The sample efficiency of model-based approaches relies on whether the model can well approximate the environment. However, learning an accurate model is challenging, especially in complex and noisy environments. To tackle this problem, we propose the conservative model-based actor-critic (CMBAC), a novel approach that achieves high sample efficiency without the strong reliance on accurate learned models. Specifically, CMBAC learns multiple estimates of the Q-value function from a set of inaccurate models and uses the average of the bottom-k estimates -- a conservative estimate -- to optimize the policy. An appealing feature of CMBAC is that the conservative estimates effectively encourage the agent to avoid unreliable "promising actions" -- whose values are high in only a small fraction of the models. Experiments demonstrate that CMBAC significantly outperforms state-of-the-art approaches in terms of sample efficiency on several challenging tasks, and the proposed method is more robust than previous methods in noisy environments.
DBMay 20, 2021
Testing DBMS Performance with MutationsXinyu Liu, Qi Zhou, Joy Arulraj et al.
Because database systems are the critical component of modern data-intensive applications, it is important to ensure that they operate correctly. To this end, developers extensively test these systems to eliminate bugs that negatively affect functionality. In addition to functional bugs, however, there is another important class of bugs: performance bugs. These bugs negatively affect the response time of a database system and can therefore affect the overall performance of the system. Despite their impact on end-user experience, performance bugs have received considerably less attention than functional bugs. In this paper, we present AMOEBA, a system for automatically detecting performance bugs in database systems. The core idea behind AMOEBA is to construct query pairs that are semantically equivalent to each other and then compare their response time on the same database system. If the queries exhibit a significant difference in their runtime performance, then the root cause is likely a performance bug in the system. We propose a novel set of structure and predicate mutation rules for constructing query pairs that are likely to uncover performance bugs. We introduce feedback mechanisms for improving the efficacy and computational efficiency of the tool. We evaluate AMOEBA on two widely-used DBMSs, namely PostgreSQL and CockroachDB. AMOEBA has discovered 20 previously-unknown performance bugs, among which developers have already confirmed 14 and fixed 4.
LGDec 9, 2020
EvaLDA: Efficient Evasion Attacks Towards Latent Dirichlet AllocationQi Zhou, Haipeng Chen, Yitao Zheng et al.
As one of the most powerful topic models, Latent Dirichlet Allocation (LDA) has been used in a vast range of tasks, including document understanding, information retrieval and peer-reviewer assignment. Despite its tremendous popularity, the security of LDA has rarely been studied. This poses severe risks to security-critical tasks such as sentiment analysis and peer-reviewer assignment that are based on LDA. In this paper, we are interested in knowing whether LDA models are vulnerable to adversarial perturbations of benign document examples during inference time. We formalize the evasion attack to LDA models as an optimization problem and prove it to be NP-hard. We then propose a novel and efficient algorithm, EvaLDA to solve it. We show the effectiveness of EvaLDA via extensive empirical evaluations. For instance, in the NIPS dataset, EvaLDA can averagely promote the rank of a target topic from 10 to around 7 by only replacing 1% of the words with similar words in a victim document. Our work provides significant insights into the power and limitations of evasion attacks to LDA models.
CRAug 11, 2020
Blockchain-Enabled Internet-of-Things Platform for End-to-End Industrial Hemp Supply ChainKeqi Wang, Wei Xie, Wencen Wu et al.
After being legalized as an agricultural commodity by the 2018 U.S. Farm Bill, the Industrial Hemp production is moved from limited pilot programs to a regulated agriculture production system, and the market keeps increasing since then. However, Industrial Hemp Supply Chain (IHSC) faces several critical challenges, including high complexity and variability, data tampering, and lack of immutable information tracking system. In this paper, we develop a blockchain enabled internet-of-things (IoT) platform for IHSC to support process tracking, scalability, interoperability, and risk management. Basically, we create a two-layer blockchain with proof-of-authority based smart contract, which can leverage local authorities with state/federal regulators to ensure and accelerate quality control verification and regulatory compliance. Then, we develop a user-friendly mobile app so that each participant can use smart phone to real-time collect and upload their data to the cloud, and further share the process verification and tracking information through the blockchain network. Our study indicates the proposed platform can support interoperability, improve the efficiency of quality control verification, and ensure the safety of regulated IHSC.
CRJun 19, 2020
Simulation-Based Digital Twin Development for Blockchain Enabled End-to-End Industrial Hemp Supply Chain Risk ManagementKeqi Wang, Wei Xie, Wencen Wu et al.
With the passage of the 2018 U.S. Farm Bill, Industrial Hemp production is moved from limited pilot programs to a regulated agriculture production system. However, Industrial Hemp Supply Chain (IHSC) faces critical challenges, including: high complexity and variability, very limited production knowledge, lack of data and information tracking. In this paper, we propose blockchain-enabled IHSC and develop a preliminary simulation-based digital twin for this distributed cyber-physical system (CPS) to support the process learning and risk management. Basically, we develop a two-layer blockchain with proof of authority smart contract, which can track the data and key information, improve the supply chain transparency, and leverage local authorities and state regulators to ensure the quality control verification. Then, we introduce a stochastic simulation-based digital twin for IHSC risk management, which can characterize the process spatial-temporal causal interdependencies and dynamic evolution to guide risk control and decision making. Our empirical study demonstrates the promising performance of proposed platform.
NIJan 21, 2020
Intelligent Bandwidth Allocation for Latency Management in NG-EPON using Reinforcement Learning MethodsQi Zhou, Jingjie Zhu, Junwen Zhang et al.
A novel intelligent bandwidth allocation scheme in NG-EPON using reinforcement learning is proposed and demonstrated for latency management. We verify the capability of the proposed scheme under both fixed and dynamic traffic loads scenarios to achieve <1ms average latency. The RL agent demonstrates an efficient intelligent mechanism to manage the latency, which provides a promising IBA solution for the next-generation access network.
IVJan 5, 2020
Deep Transfer Convolutional Neural Network and Extreme Learning Machine for Lung Nodule Diagnosis on CT imagesXufeng Huang, Qiang Lei, Tingli Xie et al.
Some content of the article needs to be kept secret
LGNov 28, 2019
Deep Model-Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy OptimizationQi Zhou, Houqiang Li, Jie Wang
Model-based reinforcement learning algorithms tend to achieve higher sample efficiency than model-free methods. However, due to the inevitable errors of learned models, model-based methods struggle to achieve the same asymptotic performance as model-free methods. In this paper, We propose a Policy Optimization method with Model-Based Uncertainty (POMBU)---a novel model-based approach---that can effectively improve the asymptotic performance using the uncertainty in Q-values. We derive an upper bound of the uncertainty, based on which we can approximate the uncertainty accurately and efficiently for model-based methods. We further propose an uncertainty-aware policy optimization algorithm that optimizes the policy conservatively to encourage performance improvement with high probability. This can significantly alleviate the overfitting of policy to inaccurate models. Experiments show POMBU can outperform existing state-of-the-art policy optimization algorithms in terms of sample efficiency and asymptotic performance. Moreover, the experiments demonstrate the excellent robustness of POMBU compared to previous model-based approaches.