Yuyang Zhang

CV
h-index22
31papers
805citations
Novelty57%
AI Score59

31 Papers

CVJan 14Code
STEP3-VL-10B Technical Report

Ailin Huang, Chengyuan Yao, Chunrui Han et al.

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

CVMar 14, 2022
MTLDesc: Looking Wider to Describe Better

Changwei Wang, Rongtao Xu, Yuyang Zhang et al.

Limited by the locality of convolutional neural networks, most existing local features description methods only learn local descriptors with local information and lack awareness of global and surrounding spatial context. In this work, we focus on making local descriptors "look wider to describe better" by learning local Descriptors with More Than just Local information (MTLDesc). Specifically, we resort to context augmentation and spatial attention mechanisms to make our MTLDesc obtain non-local awareness. First, Adaptive Global Context Augmented Module and Diverse Local Context Augmented Module are proposed to construct robust local descriptors with context information from global to local. Second, Consistent Attention Weighted Triplet Loss is designed to integrate spatial attention awareness into both optimization and matching stages of local descriptors learning. Third, Local Features Detection with Feature Pyramid is given to obtain more stable and accurate keypoints localization. With the above innovations, the performance of our MTLDesc significantly surpasses the prior state-of-the-art local descriptors on HPatches, Aachen Day-Night localization and InLoc indoor localization benchmarks.

RONov 13, 2023
Predicting Continuous Locomotion Modes via Multidimensional Feature Learning from sEMG

Peiwen Fu, Wenjuan Zhong, Yuyang Zhang et al.

Walking-assistive devices require adaptive control methods to ensure smooth transitions between various modes of locomotion. For this purpose, detecting human locomotion modes (e.g., level walking or stair ascent) in advance is crucial for improving the intelligence and transparency of such robotic systems. This study proposes Deep-STF, a unified end-to-end deep learning model designed for integrated feature extraction in spatial, temporal, and frequency dimensions from surface electromyography (sEMG) signals. Our model enables accurate and robust continuous prediction of nine locomotion modes and 15 transitions at varying prediction time intervals, ranging from 100 to 500 ms. In addition, we introduced the concept of 'stable prediction time' as a distinct metric to quantify prediction efficiency. This term refers to the duration during which consistent and accurate predictions of mode transitions are made, measured from the time of the fifth correct prediction to the occurrence of the critical event leading to the task transition. This distinction between stable prediction time and prediction time is vital as it underscores our focus on the precision and reliability of mode transition predictions. Experimental results showcased Deep-STP's cutting-edge prediction performance across diverse locomotion modes and transitions, relying solely on sEMG data. When forecasting 100 ms ahead, Deep-STF surpassed CNN and other machine learning techniques, achieving an outstanding average prediction accuracy of 96.48%. Even with an extended 500 ms prediction horizon, accuracy only marginally decreased to 93.00%. The averaged stable prediction times for detecting next upcoming transitions spanned from 28.15 to 372.21 ms across the 100-500 ms time advances.

ROJul 25, 2023
Gait Cycle-Inspired Learning Strategy for Continuous Prediction of Knee Joint Trajectory from sEMG

Xueming Fu, Hao Zheng, Luyan Liu et al.

Predicting lower limb motion intent is vital for controlling exoskeleton robots and prosthetic limbs. Surface electromyography (sEMG) attracts increasing attention in recent years as it enables ahead-of-time prediction of motion intentions before actual movement. However, the estimation performance of human joint trajectory remains a challenging problem due to the inter- and intra-subject variations. The former is related to physiological differences (such as height and weight) and preferred walking patterns of individuals, while the latter is mainly caused by irregular and gait-irrelevant muscle activity. This paper proposes a model integrating two gait cycle-inspired learning strategies to mitigate the challenge for predicting human knee joint trajectory. The first strategy is to decouple knee joint angles into motion patterns and amplitudes former exhibit low variability while latter show high variability among individuals. By learning through separate network entities, the model manages to capture both the common and personalized gait features. In the second, muscle principal activation masks are extracted from gait cycles in a prolonged walk. These masks are used to filter out components unrelated to walking from raw sEMG and provide auxiliary guidance to capture more gait-related features. Experimental results indicate that our model could predict knee angles with the average root mean square error (RMSE) of 3.03(0.49) degrees and 50ms ahead of time. To our knowledge this is the best performance in relevant literatures that has been reported, with reduced RMSE by at least 9.5%.

CLAug 6, 2024
LLM-based MOFs Synthesis Condition Extraction using Few-Shot Demonstrations

Lei Shi, Zhimeng Liu, Yi Yang et al.

The extraction of Metal-Organic Frameworks (MOFs) synthesis route from literature has been crucial for the logical MOFs design with desirable functionality. The recent advent of large language models (LLMs) provides disruptively new solution to this long-standing problem. While the latest researches mostly stick to primitive zero-shot LLMs lacking specialized material knowledge, we introduce in this work the few-shot LLM in-context learning paradigm. First, a human-AI interactive data curation approach is proposed to secure high-quality demonstrations. Second, an information retrieval algorithm is applied to pick and quantify few-shot demonstrations for each extraction. Over three datasets randomly sampled from nearly 90,000 well-defined MOFs, we conduct triple evaluations to validate our method. The synthesis extraction, structure inference, and material design performance of the proposed few-shot LLMs all significantly outplay zero-shot LLM and baseline methods. The lab-synthesized material guided by LLM surpasses 91.1% high-quality MOFs of the same class reported in the literature, on the key physical property of specific surface area.

LGSep 9, 2024
Enhancing Preference-based Linear Bandits via Human Response Time

Shen Li, Yuyang Zhang, Zhaolin Ren et al.

Interactive preference learning systems infer human preferences by presenting queries as pairs of options and collecting binary choices. Although binary choices are simple and widely used, they provide limited information about preference strength. To address this, we leverage human response times, which are inversely related to preference strength, as an additional signal. We propose a computationally efficient method that combines choices and response times to estimate human utility functions, grounded in the EZ diffusion model from psychology. Theoretical and empirical analyses show that for queries with strong preferences, response times complement choices by providing extra information about preference strength, leading to significantly improved utility estimation. We incorporate this estimator into preference-based linear bandits for fixed-budget best-arm identification. Simulations on three real-world datasets demonstrate that using response times significantly accelerates preference learning compared to choice-only approaches. Additional materials, such as code, slides, and talk video, are available at https://shenlirobot.github.io/pages/NeurIPS24.html

81.0LGMay 19
A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

Yuyang Zhang, Yifu Zhang, Xuehai Zhou et al.

While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $Ω(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-$k$ language, we establish a strict circuit depth lower bound for $\text{TC}^0$ Transformers. Scaling physical layer depth is necessary to avert representation collapse -- a constraint that scaling representation width cannot bypass due to irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk degrades monotonically with the Wasserstein domain shift.

24.5CVMay 18
LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators

Xinhe Zhang, Yuyang Zhang, Pengfei Jin et al.

High-resolution 3D medical image generation remains challenging because fully volumetric models are computationally expensive, while efficient 2D slice generators often fail to preserve anatomical consistency across the third dimension. We propose LiFT, a framework for Lifted inter-slice Feature Trajectories that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning. Rather than modeling the volumetric distribution end-to-end, LiFT treats a volume as an ordered trajectory in feature space, capturing how anatomical structures appear, transform, and disappear across depth. A tri-planar drifting loss aligns the trajectory of generated slices with the trajectories of real volumes, enabling distributional learning over inter-slice progressions in unconditional generation; in paired translation, a bidirectional $z$-context mixer trained against the registered target supplies through-plane coherence while preserving per-slice fidelity. We evaluate LiFT on BraTS 2023 (unconditional and missing-modality MR) and SynthRAD2023 (MR-to-CT). Across these settings, LiFT preserves per-slice quality, approaches the reported cWDM missing-MR reconstruction quality at $\sim$$135\times$ lower inference cost (without formal equivalence testing), and improves through-plane coherence on MR-to-CT relative to a no-mapper ablation, demonstrating that lightweight inter-slice trajectory learning is a viable route to high-resolution 3D medical synthesis.

AINov 3, 2025
Human-AI Co-Embodied Intelligence for Scientific Experimentation and Manufacturing

Xinyi Lin, Yuyang Zhang, Yuanhang Gan et al.

Scientific experiment and manufacture rely on complex, multi-step procedures that demand continuous human expertise for precise execution and decision-making. Despite advances in machine learning and automation, conventional models remain confined to virtual domains, while real-world experiment and manufacture still rely on human supervision and expertise. This gap between machine intelligence and physical execution limits reproducibility, scalability, and accessibility across scientific and manufacture workflows. Here, we introduce human-AI co-embodied intelligence, a new form of physical AI that unites human users, agentic AI, and wearable hardware into an integrated system for real-world experiment and intelligent manufacture. In this paradigm, humans provide precise execution and control, while agentic AI contributes memory, contextual reasoning, adaptive planning, and real-time feedback. The wearable interface continuously captures the experimental and manufacture processes, facilitates seamless communication between humans and AI for corrective guidance and interpretable collaboration. As a demonstration, we present Agentic-Physical Experimentation (APEX) system, coupling agentic reasoning with physical execution through mixed-reality. APEX observes and interprets human actions, aligns them with standard operating procedures, provides 3D visual guidance, and analyzes every step. Implemented in a cleanroom for flexible electronics fabrication, APEX system achieves context-aware reasoning with accuracy exceeding general multimodal large language models, corrects errors in real time, and transfers expertise to beginners. These results establish a new class of agentic-physical-human intelligence that extends agentic reasoning beyond computation into the physical domain, transforming scientific research and manufacturing into autonomous, traceable, interpretable, and scalable processes.

LGNov 30, 2025
Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning

Qi Wang, Mian Wu, Yuyang Zhang et al.

Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging and may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc design of reward. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of agent's trajectories and the generated goal videos. To enable more fine-grained goal-achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state. We then employ a learned forward-backward representation that represents the probability of visiting the goal state from a given state-action pair as frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on various Meta-World tasks demonstrate the effectiveness of our approach.

84.4CLApr 5Code
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

Xinyu Geng, Yanjing Xiao, Yuyang Zhang et al.

Deep research agents integrate fragmented evidence through multi-step tool use. BrowseComp offers a text-only testbed for such agents, but existing multimodal benchmarks rarely require both weak visual cues composition and BrowseComp-style multi-hop verification. Geolocation is a natural testbed because answers depend on combining multiple ambiguous visual cues and validating them with open-web evidence. Thus, we introduce GeoBrowse, a geolocation benchmark that combines visual reasoning with knowledge-intensive multi-hop queries. Level 1 tests extracting and composing fragmented visual cues, and Level 2 increases query difficulty by injecting long-tail knowledge and obfuscating key entities. To support evaluation, we provide an agentic workflow GATE with five think-with-image tools and four knowledge-intensive tools, and release expert-annotated stepwise traces grounded in verifiable evidence for trajectory-level analysis. Experiments show that GATE outperforms direct inference and open-source agents, indicating that no-tool, search-only or image-only setups are insufficient. Gains come from coherent, level-specific tool-use plans rather than more tool calls, as they more reliably reach annotated key evidence steps and make fewer errors when integrating into the final decision. The GeoBrowse bernchmark and codes are provided in https://github.com/ornamentt/GeoBrowse

77.3SYMay 15
Provably Efficient Sensor Allocation for Unknown High-dimensional Systems with Limited Sensing

Yuyang Zhang, Derya Cansever, Na Li

This paper focuses on learning efficient sensor allocations that ensure observability of unknown high-dimensional linear systems using only a small number of sensors. Existing methods either require an impractically large number of sensors or assume access to an observable allocation in advance. We propose a two-stage framework that overcomes these limitations: first, a novel system identification algorithm integrates information from multiple trajectories, each observing different subsets of state coordinates; then, a classic sensor allocation method is adapted to operate on the learned system parameters. Our non-asymptotic guarantees show that the proposed approach learns a sensor allocation with a near-optimal number of sensors when sensors can be allocated on any state coordinate. We further extend the results to settings with inaccessible state coordinates that are unavailable for sensor allocation.

21.5HCApr 6
A Spatio-Temporal Graph Convolutional Network for Gesture Recognition from High-Density Electromyography

Wenjuan Zhong, Yuyang Zhang, Peiwen Fu et al.

Accurate hand gesture prediction is crucial for effective upper-limb prosthetic limbs control. As the high flexibility and multiple degrees of freedom exhibited by human hands, there has been a growing interest in integrating deep networks with high-density surface electromyography (HD-sEMG) grids to enhance gesture recognition capabilities. However, many existing methods fall short in fully exploit the specific spatial topology and temporal dependencies present in HD-sEMG data. Additionally, these studies are often limited number of gestures and lack generality. Hence, this study introduces a novel gesture recognition method, named STGCN-GR, which leverages spatio-temporal graph convolution networks for HD-sEMG-based human-machine interfaces. Firstly, we construct muscle networks based on functional connectivity between channels, creating a graph representation of HD-sEMG recordings. Subsequently, a temporal convolution module is applied to capture the temporal dependences in the HD-sEMG series and a spatial graph convolution module is employed to effectively learn the intrinsic spatial topology information among distinct HD-sEMG channels. We evaluate our proposed model on a public HD-sEMG dataset comprising a substantial number of gestures (i.e., 65). Our results demonstrate the remarkable capability of the STGCN-GR method, achieving an impressive accuracy of 91.07% in predicting gestures, which surpasses state-of-the-art deep learning methods applied to the same dataset.

34.7CVMar 21
Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark

Yifei Deng, Chenglong Li, Yuyang Zhang et al.

Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text--image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network, which quantifies the token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions, for text--aerial person retrieval. In particular, we design the Fuzzy Token Alignment module that employs the fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module to incorporate the ground-view agent as a bridge in text--aerial alignment and adaptively combine direct alignment and agent-assisted alignment to improve the robustness. In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method.

CLOct 7, 2023
Zero-shot Cross-lingual Transfer without Parallel Corpus

Yuyang Zhang, Xiaofeng Han, Baojun Wang

Recently, although pre-trained language models have achieved great success on multilingual NLP (Natural Language Processing) tasks, the lack of training data on many tasks in low-resource languages still limits their performance. One effective way of solving that problem is to transfer knowledge from rich-resource languages to low-resource languages. However, many previous works on cross-lingual transfer rely heavily on the parallel corpus or translation models, which are often difficult to obtain. We propose a novel approach to conduct zero-shot cross-lingual transfer with a pre-trained model. It consists of a Bilingual Task Fitting module that applies task-related bilingual information alignment; a self-training module generates pseudo soft and hard labels for unlabeled data and utilizes them to conduct self-training. We got the new SOTA on different tasks without any dependencies on the parallel corpus or translation models.

CVJan 20
Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model

Haoran Xu, Yanlin Liu, Zizhao Tong et al.

Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.

71.9MAMay 8
Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

Yuyang Zhang, Haldun Balim, Na Li

Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), addresses this through energy-based policy updates. In practice, however, such energy-based policies are intractable to maintain and are commonly projected onto the Gaussian policy class. In this work, we show that the limited expressiveness of Gaussian policies severely hinders exploration in DecSPG, and this limitation worsens as the number of agents grows. To address this issue, we propose decentralized diffusion policy learning (DDPL), which parameterizes each agent's policy with a denoising diffusion probabilistic model, an expressive generative model that captures multi-modal action distributions for enhanced exploration. DDPL enables efficient online training of diffusion policies via importance sampling score matching (ISSM), a novel training method with theoretical guarantee. We evaluate DDPL on representative continuous-action MARL benchmarks, including multi-agent particle environment, multi-agent MuJoCo, IsaacLab, and JAX-reimplemented StarCraft multi-agent challenge, and observe consistently improved performance.

LGDec 29, 2025
Max-Entropy Reinforcement Learning with Flow Matching and A Case Study on LQR

Yuyang Zhang, Yang Hu, Bo Dai et al.

Soft actor-critic (SAC) is a popular algorithm for max-entropy reinforcement learning. In practice, the energy-based policies in SAC are often approximated using simple policy classes for efficiency, sacrificing the expressiveness and robustness. In this paper, we propose a variant of the SAC algorithm that parameterizes the policy with flow-based models, leveraging their rich expressiveness. In the algorithm, we evaluate the flow-based policy utilizing the instantaneous change-of-variable technique and update the policy with an online variant of flow matching developed in this paper. This online variant, termed importance sampling flow matching (ISFM), enables policy update with only samples from a user-specified sampling distribution rather than the unknown target distribution. We develop a theoretical analysis of ISFM, characterizing how different choices of sampling distributions affect the learning efficiency. Finally, we conduct a case study of our algorithm on the max-entropy linear quadratic regulator problems, demonstrating that the proposed algorithm learns the optimal action distribution.

20.8ROMar 16
MoRoCo: An Online Topology-Adaptive Framework for Multi-Operator Multi-Robot Coordination under Restricted Communication

Zhuoli Tian, Yanze Bao, Yuyang Zhang et al.

Fleets of autonomous robots are increasingly deployed with multiple human operators in communication-restricted environments for exploration and intervention tasks such as subterranean inspection, reconnaissance, and search-and-rescue. In these settings, communication is often limited to short-range ad-hoc links, making it difficult to coordinate exploration while supporting online human-fleet interactions. Existing work on multi-robot exploration largely focuses on information gathering itself, but pays limited attention to the fact that operators and robots issue time-critical requests during execution. These requests may require different communication structures, ranging from intermittent status delivery to sustained video streaming and teleoperation. To address this challenge, this paper presents MoRoCo, an online topology-adaptive framework for multi-operator multi-robot coordination under restricted communication. MoRoCo is built on a latency-bounded intermittent communication backbone that guarantees a prescribed delay for information collected by any robot to reach an operator, together with a detach-and-rejoin mechanism that enables online team resizing and topology reconfiguration. On top of this backbone, the framework instantiates request-consistent communication subgraphs to realize different modes of operator-robot interaction by jointly assigning robot roles, positions, and communication topology. It further supports the online decomposition and composition of these subgraphs using only local communication, allowing multiple requests to be serviced during exploration. The framework extends to heterogeneous fleets, multiple teams, and robot failures. Extensive human-in-the-loop simulations and hardware experiments demonstrate effective and reliable coordination under restricted communication.

CVMar 25, 2024
Transfer Learning of Real Image Features with Soft Contrastive Loss for Fake Image Detection

Ziyou Liang, Weifeng Liu, Run Wang et al.

In the last few years, the artifact patterns in fake images synthesized by different generative models have been inconsistent, leading to the failure of previous research that relied on spotting subtle differences between real and fake. In our preliminary experiments, we find that the artifacts in fake images always change with the development of the generative model, while natural images exhibit stable statistical properties. In this paper, we employ natural traces shared only by real images as an additional target for a classifier. Specifically, we introduce a self-supervised feature mapping process for natural trace extraction and develop a transfer learning based on soft contrastive loss to bring them closer to real images and further away from fake ones. This motivates the detector to make decisions based on the proximity of images to the natural traces. To conduct a comprehensive experiment, we built a high-quality and diverse dataset that includes generative models comprising GANs and diffusion models, to evaluate the effectiveness in generalizing unknown forgery techniques and robustness in surviving different transformations. Experimental results show that our proposed method gives 96.2% mAP significantly outperforms the baselines. Extensive experiments conducted on popular commercial platforms reveal that our proposed method achieves an accuracy exceeding 78.4%, underscoring its practicality for real-world application deployment.

LGJun 25, 2025
Directed Link Prediction using GNN with Local and Global Feature Fusion

Yuyang Zhang, Xu Shen, Yu Xie et al.

Link prediction is a classical problem in graph analysis with many practical applications. For directed graphs, recently developed deep learning approaches typically analyze node similarities through contrastive learning and aggregate neighborhood information through graph convolutions. In this work, we propose a novel graph neural network (GNN) framework to fuse feature embedding with community information. We theoretically demonstrate that such hybrid features can improve the performance of directed link prediction. To utilize such features efficiently, we also propose an approach to transform input graphs into directed line graphs so that nodes in the transformed graph can aggregate more information during graph convolutions. Experiments on benchmark datasets show that our approach outperforms the state-of-the-art in most cases when 30%, 40%, 50%, and 60% of the connected links are used as training data, respectively.

CLOct 20, 2025
Empowering Real-World: A Survey on the Technology, Practice, and Evaluation of LLM-driven Industry Agents

Yihong Tang, Kehai Chen, Liang Yue et al.

With the rise of large language models (LLMs), LLM agents capable of autonomous reasoning, planning, and executing complex tasks have become a frontier in artificial intelligence. However, how to translate the research on general agents into productivity that drives industry transformations remains a significant challenge. To address this, this paper systematically reviews the technologies, applications, and evaluation methods of industry agents based on LLMs. Using an industry agent capability maturity framework, it outlines the evolution of agents in industry applications, from "process execution systems" to "adaptive social systems." First, we examine the three key technological pillars that support the advancement of agent capabilities: Memory, Planning, and Tool Use. We discuss how these technologies evolve from supporting simple tasks in their early forms to enabling complex autonomous systems and collective intelligence in more advanced forms. Then, we provide an overview of the application of industry agents in real-world domains such as digital engineering, scientific discovery, embodied intelligence, collaborative business execution, and complex system simulation. Additionally, this paper reviews the evaluation benchmarks and methods for both fundamental and specialized capabilities, identifying the challenges existing evaluation systems face regarding authenticity, safety, and industry specificity. Finally, we focus on the practical challenges faced by industry agents, exploring their capability boundaries, developmental potential, and governance issues in various scenarios, while providing insights into future directions. By combining technological evolution with industry practices, this review aims to clarify the current state and offer a clear roadmap and theoretical foundation for understanding and building the next generation of industry agents.

AIOct 12, 2025
ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding

Xinbang Dai, Huikang Hu, Yongrui Chen et al.

While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs' comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results-even harming accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.

CROct 1, 2025
EditTrack: Detecting and Attributing AI-assisted Image Editing

Zhengyuan Jiang, Yuyang Zhang, Moyang Guo et al.

In this work, we formulate and study the problem of image-editing detection and attribution: given a base image and a suspicious image, detection seeks to determine whether the suspicious image was derived from the base image using an AI editing model, while attribution further identifies the specific editing model responsible. Existing methods for detecting and attributing AI-generated images are insufficient for this problem, as they focus on determining whether an image was AI-generated/edited rather than whether it was edited from a particular base image. To bridge this gap, we propose EditTrack, the first framework for this image-editing detection and attribution problem. Building on four key observations about the editing process, EditTrack introduces a novel re-editing strategy and leverages carefully designed similarity metrics to determine whether a suspicious image originates from a base image and, if so, by which model. We evaluate EditTrack on five state-of-the-art editing models across six datasets, demonstrating that it consistently achieves accurate detection and attribution, significantly outperforming five baselines.

LGApr 18, 2025
A Model-Based Approach to Imitation Learning through Multi-Step Predictions

Haldun Balim, Yang Hu, Yuyang Zhang et al.

Imitation learning is a widely used approach for training agents to replicate expert behavior in complex decision-making tasks. However, existing methods often struggle with compounding errors and limited generalization, due to the inherent challenge of error correction and the distribution shift between training and deployment. In this paper, we present a novel model-based imitation learning framework inspired by model predictive control, which addresses these limitations by integrating predictive modeling through multi-step state predictions. Our method outperforms traditional behavior cloning numerical benchmarks, demonstrating superior robustness to distribution shift and measurement noise both in available data and during execution. Furthermore, we provide theoretical guarantees on the sample complexity and error bounds of our method, offering insights into its convergence properties.

CVApr 5, 2025
Interpretable Single-View 3D Gaussian Splatting using Unsupervised Hierarchical Disentangled Representation Learning

Yuyang Zhang, Baao Xie, Hu Zhu et al.

Gaussian Splatting (GS) has recently marked a significant advancement in 3D reconstruction, delivering both rapid rendering and high-quality results. However, existing 3DGS methods pose challenges in understanding underlying 3D semantics, which hinders model controllability and interpretability. To address it, we propose an interpretable single-view 3DGS framework, termed 3DisGS, to discover both coarse- and fine-grained 3D semantics via hierarchical disentangled representation learning (DRL). Specifically, the model employs a dual-branch architecture, consisting of a point cloud initialization branch and a triplane-Gaussian generation branch, to achieve coarse-grained disentanglement by separating 3D geometry and visual appearance features. Subsequently, fine-grained semantic representations within each modality are further discovered through DRL-based encoder-adapters. To our knowledge, this is the first work to achieve unsupervised interpretable 3DGS. Evaluations indicate that our model achieves 3D disentanglement while preserving high-quality and rapid reconstruction.

SYMay 9, 2024
Learning Low-dimensional Latent Dynamics from High-dimensional Observations: Non-asymptotics and Lower Bounds

Yuyang Zhang, Shahriar Talebi, Na Li

In this paper, we focus on learning a linear time-invariant (LTI) model with low-dimensional latent variables but high-dimensional observations. We provide an algorithm that recovers the high-dimensional features, i.e. column space of the observer, embeds the data into low dimensions and learns the low-dimensional model parameters. Our algorithm enjoys a sample complexity guarantee of order $\tilde{\mathcal{O}}(n/ε^2)$, where $n$ is the observation dimension. We further establish a fundamental lower bound indicating this complexity bound is optimal up to logarithmic factors and dimension-independent constants. We show that this inevitable linear factor of $n$ is due to the learning error of the observer's column space in the presence of high-dimensional noises. Extending our results, we consider a meta-learning problem inspired by various real-world applications, where the observer column space can be collectively learned from datasets of multiple LTI systems. An end-to-end algorithm is then proposed, facilitating learning LTI systems from a meta-dataset which breaks the sample complexity lower bound in certain scenarios.

LGOct 10, 2021
A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization

Yuyang Zhang, Dik Hin Leung, Min Guo et al.

Matrix multiplication is the bedrock in Deep Learning inference application. When it comes to hardware acceleration on edge computing devices, matrix multiplication often takes up a great majority of the time. To achieve better performance in edge computing, we introduce a low-power Multi-layer Perceptron (MLP) accelerator based on a pipelined matrix multiplication scheme and a nonuniform quantization methodology. The implementation is running on Field-programmable Gate Array (FPGA) devices and tested its performance on handwritten digit classification and Q-learning tasks. Results show that our method can achieve better performance with fewer power consumption.

CLSep 18, 2021
DyLex: Incorporating Dynamic Lexicons into BERT for Sequence Labeling

Baojun Wang, Zhao Zhang, Kun Xu et al.

Incorporating lexical knowledge into deep learning models has been proved to be very effective for sequence labeling tasks. However, previous works commonly have difficulty dealing with large-scale dynamic lexicons which often cause excessive matching noise and problems of frequent updates. In this paper, we propose DyLex, a plug-in lexicon incorporation approach for BERT based sequence labeling tasks. Instead of leveraging embeddings of words in the lexicon as in conventional methods, we adopt word-agnostic tag embeddings to avoid re-training the representation while updating the lexicon. Moreover, we employ an effective supervised lexical knowledge denoising method to smooth out matching noise. Finally, we introduce a col-wise attention based knowledge fusion mechanism to guarantee the pluggability of the proposed framework. Experiments on ten datasets of three tasks show that the proposed framework achieves new SOTA, even with very large scale lexicons.

ARApr 11, 2021
iELAS: An ELAS-Based Energy-Efficient Accelerator for Real-Time Stereo Matching on FPGA Platform

Tian Gao, Zishen Wan, Yuyang Zhang et al.

Stereo matching is a critical task for robot navigation and autonomous vehicles, providing the depth estimation of surroundings. Among all stereo matching algorithms, Efficient Large-scale Stereo (ELAS) offers one of the best tradeoffs between efficiency and accuracy. However, due to the inherent iterative process and unpredictable memory access pattern, ELAS can only run at 1.5-3 fps on high-end CPUs and difficult to achieve real-time performance on low-power platforms. In this paper, we propose an energy-efficient architecture for real-time ELAS-based stereo matching on FPGA platform. Moreover, the original computational-intensive and irregular triangulation module is reformed in a regular manner with points interpolation, which is much more hardware-friendly. Optimizations, including memory management, parallelism, and pipelining, are further utilized to reduce memory footprint and improve throughput. Compared with Intel i7 CPU and the state-of-the-art CPU+FPGA implementation, our FPGA realization achieves up to 38.4x and 3.32x frame rate improvement, and up to 27.1x and 1.13x energy efficiency improvement, respectively.

ARApr 1, 2021
An Energy-Efficient Quad-Camera Visual System for Autonomous Machines on FPGA Platform

Zishen Wan, Yuyang Zhang, Arijit Raychowdhury et al.

In our past few years' of commercial deployment experiences, we identify localization as a critical task in autonomous machine applications, and a great acceleration target. In this paper, based on the observation that the visual frontend is a major performance and energy consumption bottleneck, we present our design and implementation of an energy-efficient hardware architecture for ORB (Oriented-Fast and Rotated- BRIEF) based localization system on FPGAs. To support our multi-sensor autonomous machine localization system, we present hardware synchronization, frame-multiplexing, and parallelization techniques, which are integrated in our design. Compared to Nvidia TX1 and Intel i7, our FPGA-based implementation achieves 5.6x and 3.4x speedup, as well as 3.0x and 34.6x power reduction, respectively.