Hao Ye

CV
h-index46
39papers
3,949citations
Novelty49%
AI Score59

39 Papers

CVSep 27, 2023Code
Aperture Diffraction for Compact Snapshot Spectral Imaging

Tao Lv, Hao Ye, Quan Yuan et al.

We demonstrate a compact, cost-effective snapshot spectral imaging system named Aperture Diffraction Imaging Spectrometer (ADIS), which consists only of an imaging lens with an ultra-thin orthogonal aperture mask and a mosaic filter sensor, requiring no additional physical footprint compared to common RGB cameras. Then we introduce a new optical design that each point in the object space is multiplexed to discrete encoding locations on the mosaic filter sensor by diffraction-based spatial-spectral projection engineering generated from the orthogonal mask. The orthogonal projection is uniformly accepted to obtain a weakly calibration-dependent data form to enhance modulation robustness. Meanwhile, the Cascade Shift-Shuffle Spectral Transformer (CSST) with strong perception of the diffraction degeneration is designed to solve a sparsity-constrained inverse problem, realizing the volume reconstruction from 2D measurements with Large amount of aliasing. Our system is evaluated by elaborating the imaging optical theory and reconstruction algorithm with demonstrating the experimental imaging under a single exposure. Ultimately, we achieve the sub-super-pixel spatial resolution and high spectral resolution imaging. The code will be available at: https://github.com/Krito-ex/CSST.

LGOct 2, 2022
Metric Distribution to Vector: Constructing Data Representation via Broad-Scale Discrepancies

Xue Liu, Dan Sun, Xiaobo Cao et al.

Graph embedding provides a feasible methodology to conduct pattern classification for graph-structured data by mapping each data into the vectorial space. Various pioneering works are essentially coding method that concentrates on a vectorial representation about the inner properties of a graph in terms of the topological constitution, node attributions, link relations, etc. However, the classification for each targeted data is a qualitative issue based on understanding the overall discrepancies within the dataset scale. From the statistical point of view, these discrepancies manifest a metric distribution over the dataset scale if the distance metric is adopted to measure the pairwise similarity or dissimilarity. Therefore, we present a novel embedding strategy named $\mathbf{MetricDistribution2vec}$ to extract such distribution characteristics into the vectorial representation for each data. We demonstrate the application and effectiveness of our representation method in the supervised prediction tasks on extensive real-world structural graph datasets. The results have gained some unexpected increases compared with a surge of baselines on all the datasets, even if we take the lightweight models as classifiers. Moreover, the proposed methods also conducted experiments in Few-Shot classification scenarios, and the results still show attractive discrimination in rare training samples based inference.

98.0CVApr 20
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

Kangan Qian, ChuChu Xie, Yang Zhong et al.

Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics due to their 2D image-text pretraining. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and interaction with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.

ITAug 29, 2024
Semantic Communication for Cooperative Perception using HARQ

Yucheng Sheng, Le Liang, Hao Ye et al.

Cooperative perception, offering a wider field of view than standalone perception, is becoming increasingly crucial in autonomous driving. This perception is enabled through vehicle-to-vehicle (V2V) communication, allowing connected automated vehicles (CAVs) to exchange sensor data, such as light detection and ranging (LiDAR) point clouds, thereby enhancing the collective understanding of the environment. In this paper, we leverage an importance map to distill critical semantic information, introducing a cooperative perception semantic communication framework that employs intermediate fusion. To counter the challenges posed by time-varying multipath fading, our approach incorporates the use of orthogonal frequency-division multiplexing (OFDM) along with channel estimation and equalization strategies. Furthermore, recognizing the necessity for reliable transmission, especially in the low SNR scenarios, we introduce a novel semantic error detection method that is integrated with our semantic communication framework in the spirit of hybrid automatic repeated request (HARQ). Simulation results show that our model surpasses the traditional separate source-channel coding methods in perception performance, both with and without HARQ. Additionally, in terms of throughput, our proposed HARQ schemes demonstrate superior efficiency to the conventional coding approaches.

ROMay 21, 2025Code
AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

Kangan Qian, Sicong Jiang, Yang Zhong et al. · tsinghua

Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce \textbf{AgentThink}, a pioneering unified framework that integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's core innovations include: \textbf{(i) Structured Data Generation}, which establishes an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; \textbf{(ii) A Two-stage Training Pipeline}, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and \textbf{(iii) Agent-style Tool-Usage Evaluation}, introducing a novel multi-tool assessment protocol to rigorously evaluate the model's tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate that AgentThink significantly boosts overall reasoning scores by \textbf{53.91%} and enhances answer accuracy by \textbf{33.54%}, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models. Code is available at https://github.com/curryqka/AgentThink.

99.3ITMar 15
Reducing Pilots in Channel Estimation with Predictive Foundation Models

Xingyu Zhou, Le Liang, Hao Ye et al.

Accurate channel state information (CSI) acquisition is essential for modern wireless systems, which becomes increasingly difficult under large antenna arrays, strict pilot overhead constraints, and diverse deployment environments. Existing artificial intelligence-based solutions often lack robustness and fail to generalize across scenarios. To address this limitation, this paper introduces a predictive-foundation-model-based channel estimation framework that enables accurate, low-overhead, and generalizable CSI acquisition. The proposed framework employs a predictive foundation model trained on large-scale cross-domain CSI data to extract universal channel representations and provide predictive priors with strong cross-scenario transferability. A pilot processing network based on a vision transformer architecture is further designed to capture spatial, temporal, and frequency correlations from pilot observations. An efficient fusion mechanism integrates predictive priors with real-time measurements, enabling reliable CSI reconstruction even under sparse or noisy conditions. Extensive evaluations across diverse configurations demonstrate that the proposed estimator significantly outperforms both classical and data-driven baselines in accuracy, robustness, and generalization capability.

CVJun 30, 2025Code
A Survey on Vision-Language-Action Models for Autonomous Driving

Sicong Jiang, Zilin Huang, Kangan Qian et al.

The rapid progress of multimodal large language models (MLLM) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructions, reason about complex traffic scenes, and make their own decisions. However, the literature remains fragmented and is rapidly expanding. This survey offers the first comprehensive overview of VLA for Autonomous Driving (VLA4AD). We (i) formalize the architectural building blocks shared across recent work, (ii) trace the evolution from early explainer to reasoning-centric VLA models, and (iii) compare over 20 representative models according to VLA's progress in the autonomous driving domain. We also consolidate existing datasets and benchmarks, highlighting protocols that jointly measure driving safety, accuracy, and explanation quality. Finally, we detail open challenges - robustness, real-time efficiency, and formal verification - and outline future directions of VLA4AD. This survey provides a concise yet complete reference for advancing interpretable socially aligned autonomous vehicles. Github repo is available at \href{https://github.com/JohnsonJiang1996/Awesome-VLA4AD}{SicongJiang/Awesome-VLA4AD}.

61.9LGApr 8
A Graph Foundation Model for Wireless Resource Allocation

Yucheng Sheng, Jiacheng Wang, Le Liang et al.

The aggressive densification of modern wireless networks necessitates judicious resource allocation to mitigate severe mutual interference. However, classical iterative algorithms remain computationally prohibitive for real-time applications requiring rapid responsiveness. While recent deep learning-based methods show promise, they typically function as task-specific solvers lacking the flexibility to adapt to different objectives and scenarios without expensive retraining. To address these limitations, we propose a graph foundation model for resource allocation (GFM-RA) based on a pre-training and fine-tuning paradigm to extract unified representations, thereby enabling rapid adaptation to different objectives and scenarios. Specifically, we introduce an interference-aware Transformer architecture with a bias projector that injects interference topologies into global attention mechanisms. Furthermore, we develop a hybrid self-supervised pre-training strategy that synergizes masked edge prediction with negative-free Teacher-Student contrastive learning, enabling the model to capture transferable structural representations from massive unlabeled datasets. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art performance and scales effectively with increased model capacity. Crucially, leveraging its unified representations, the foundation model exhibits exceptional sample efficiency, enabling robust few-shot adaptation to diverse and unsupervised downstream objectives in out-of-distribution (OOD) scenarios. These results demonstrate the promise of pre-trained foundation models for adaptable wireless resource allocation and provide a strong foundation for future research on generalizable learning-based wireless optimization.

AIJun 1, 2025Code
SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning

Jisheng Dang, Yizhou Zhang, Hao Ye et al.

Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detailed information. In this paper, we leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning, while mitigating several limitations inherent to direct preference optimization (DPO). First, we propose a pipeline for constructing preference pairs that leverages the intrinsic properties of VLMs along with partial assistance from large language models, achieving an optimal balance between cost and data quality. Second, we propose Synergistic Preference Optimization (SynPO), a novel optimization method offering significant advantages over DPO and its variants. SynPO prevents negative preferences from dominating the optimization, explicitly preserves the model's language capability to avoid deviation of the optimization objective, and improves training efficiency by eliminating the need for the reference model. We extensively evaluate SynPO not only on video captioning benchmarks (e.g., VDC, VDD, VATEX) but also across well-established NLP tasks, including general language understanding and preference evaluation, using diverse pretrained models. Results demonstrate that SynPO consistently outperforms DPO variants while achieving 20\% improvement in training efficiency. Code is available at https://github.com/longmalongma/SynPO

CVJan 7, 2025Code
Action Quality Assessment via Hierarchical Pose-guided Multi-stage Contrastive Regression

Mengshi Qi, Hao Ye, Jiaxuan Peng et al.

Action Quality Assessment (AQA), which aims at automatic and fair evaluation of athletic performance, has gained increasing attention in recent years. However, athletes are often in rapid movement and the corresponding visual appearance variances are subtle, making it challenging to capture fine-grained pose differences and leading to poor estimation performance. Furthermore, most common AQA tasks, such as diving in sports, are usually divided into multiple sub-actions, each of which contains different durations. However, existing methods focus on segmenting the video into fixed frames, which disrupts the temporal continuity of sub-actions resulting in unavoidable prediction errors. To address these challenges, we propose a novel action quality assessment method through hierarchically pose-guided multi-stage contrastive regression. Firstly, we introduce a multi-scale dynamic visual-skeleton encoder to capture fine-grained spatio-temporal visual and skeletal features. Then, a procedure segmentation network is introduced to separate different sub-actions and obtain segmented features. Afterwards, the segmented visual and skeletal features are both fed into a multi-modal fusion module as physics structural priors, to guide the model in learning refined activity similarities and variances. Finally, a multi-stage contrastive learning regression approach is employed to learn discriminative representations and output prediction results. In addition, we introduce a newly-annotated FineDiving-Pose Dataset to improve the current low-quality human pose labels. In experiments, the results on FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of our proposed approach. Our source code and dataset are available at https://github.com/Lumos0507/HP-MCoRe.

AIJul 29, 2025Code
SafeDriveRAG: Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation

Hao Ye, Mengshi Qi, Zhaohong Liu et al.

In this work, we study how vision-language models (VLMs) can be utilized to enhance the safety for the autonomous driving system, including perception, situational understanding, and path planning. However, existing research has largely overlooked the evaluation of these models in traffic safety-critical driving scenarios. To bridge this gap, we create the benchmark (SafeDrive228K) and propose a new baseline based on VLM with knowledge graph-based retrieval-augmented generation (SafeDriveRAG) for visual question answering (VQA). Specifically, we introduce SafeDrive228K, the first large-scale multimodal question-answering benchmark comprising 228K examples across 18 sub-tasks. This benchmark encompasses a diverse range of traffic safety queries, from traffic accidents and corner cases to common safety knowledge, enabling a thorough assessment of the comprehension and reasoning abilities of the models. Furthermore, we propose a plug-and-play multimodal knowledge graph-based retrieval-augmented generation approach that employs a novel multi-scale subgraph retrieval algorithm for efficient information retrieval. By incorporating traffic safety guidelines collected from the Internet, this framework further enhances the model's capacity to handle safety-critical situations. Finally, we conduct comprehensive evaluations on five mainstream VLMs to assess their reliability in safety-sensitive driving tasks. Experimental results demonstrate that integrating RAG significantly improves performance, achieving a +4.73% gain in Traffic Accidents tasks, +8.79% in Corner Cases tasks and +14.57% in Traffic Safety Commonsense across five mainstream VLMs, underscoring the potential of our proposed benchmark and methodology for advancing research in traffic safety. Our source code and data are available at https://github.com/Lumos0507/SafeDriveRAG.

LGAug 15, 2017Code
Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks

Jun Xiao, Hao Ye, Xiangnan He et al.

Factorization Machines (FMs) are a supervised learning approach that enhances the linear regression model by incorporating the second-order feature interactions. Despite effectiveness, FM can be hindered by its modelling of all feature interactions with the same weight, as not all feature interactions are equally useful and predictive. For example, the interactions with useless features may even introduce noises and adversely degrade the performance. In this work, we improve FM by discriminating the importance of different feature interactions. We propose a novel model named Attentional Factorization Machine (AFM), which learns the importance of each feature interaction from data via a neural attention network. Extensive experiments on two real-world datasets demonstrate the effectiveness of AFM. Empirically, it is shown on regression task AFM betters FM with a $8.6\%$ relative improvement, and consistently outperforms the state-of-the-art deep learning methods Wide&Deep and DeepCross with a much simpler structure and fewer model parameters. Our implementation of AFM is publicly available at: https://github.com/hexiangnan/attentional_factorization_machine

96.7LGMay 7
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

Hao Ye, Jisheng Dang, Junfeng Fang et al.

Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during the training process. Furthermore, we characterize three distinct properties of RL training: (1) The effective rank-1 component in RLVR don't maintain other model knowledge except mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum. The distribution of singular values of almost all linear layers in RLVR-trained model behaves like heavy-tailed distribution. (3) the left singular vectors associated with rank-1 components demonstrate a stronger alignment tendency during training, which echoes the discovery that RLVR is optimizing sampling efficiency in essence. Taken together, our findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms or other training paradigms to implement continual learning.

CVMay 19, 2025
DD-Ranking: Rethinking the Evaluation of Dataset Distillation

Zekai Li, Xinhao Zhong, Samir Khaki et al.

In recent years, dataset distillation has provided a reliable solution for data compression, where models trained on the resulting smaller synthetic datasets achieve performance comparable to those trained on the original datasets. To further improve the performance of synthetic datasets, various training pipelines and optimization objectives have been proposed, greatly advancing the field of dataset distillation. Recent decoupled dataset distillation methods introduce soft labels and stronger data augmentation during the post-evaluation phase and scale dataset distillation up to larger datasets (e.g., ImageNet-1K). However, this raises a question: Is accuracy still a reliable metric to fairly evaluate dataset distillation methods? Our empirical findings suggest that the performance improvements of these methods often stem from additional techniques rather than the inherent quality of the images themselves, with even randomly sampled images achieving superior results. Such misaligned evaluation settings severely hinder the development of DD. Therefore, we propose DD-Ranking, a unified evaluation framework, along with new general evaluation metrics to uncover the true performance improvements achieved by different methods. By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research advancements.

CVAug 8, 2024
Integrated Dynamic Phenological Feature for Remote Sensing Image Land Cover Change Detection

Yi Liu, Chenhao Sun, Hao Ye et al.

Remote sensing image change detection (CD) is essential for analyzing land surface changes over time, with a significant challenge being the differentiation of actual changes from complex scenes while filtering out pseudo-changes. A primary contributor to this challenge is the intra-class dynamic changes due to phenological characteristics in natural areas. To overcome this, we introduce the InPhea model, which integrates phenological features into a remote sensing image CD framework. The model features a detector with a differential attention module for improved feature representation of change information, coupled with high-resolution feature extraction and spatial pyramid blocks to enhance performance. Additionally, a constrainer with four constraint modules and a multi-stage contrastive learning approach is employed to aid in the model's understanding of phenological characteristics. Experiments on the HRSCD, SECD, and PSCD-Wuhan datasets reveal that InPhea outperforms other models, confirming its effectiveness in addressing phenological pseudo-changes and its overall model superiority.

AIJul 29, 2025
Large Language Models for Wireless Communications: From Adaptation to Autonomy

Le Liang, Hao Ye, Yucheng Sheng et al.

The emergence of large language models (LLMs) has revolutionized artificial intelligence, offering unprecedented capabilities in reasoning, generalization, and zero-shot learning. These strengths open new frontiers in wireless communications, where increasing complexity and dynamics demand intelligent and adaptive solutions. This article explores the role of LLMs in transforming wireless systems across three key directions: adapting pretrained LLMs for core communication tasks, developing wireless-specific foundation models to balance versatility and efficiency, and enabling agentic LLMs with autonomous reasoning and coordination capabilities. We highlight recent advances, practical case studies, and the unique benefits of LLM-based approaches over traditional methods. Finally, we outline open challenges and research opportunities, including multimodal fusion, collaboration with lightweight models, and self-improving capabilities, charting a path toward intelligent, adaptive, and autonomous wireless networks of the future.

AIJul 8, 2025
A Wireless Foundation Model for Multi-Task Prediction

Yucheng Sheng, Jiacheng Wang, Xingyu Zhou et al.

With the growing complexity and dynamics of the mobile communication networks, accurately predicting key system parameters, such as channel state information (CSI), user location, and network traffic, has become essential for a wide range of physical (PHY)-layer and medium access control (MAC)-layer tasks. Although traditional deep learning (DL)-based methods have been widely applied to such prediction tasks, they often struggle to generalize across different scenarios and tasks. In response, we propose a unified foundation model for multi-task prediction in wireless networks that supports diverse prediction intervals. The proposed model enforces univariate decomposition to unify heterogeneous tasks, encodes granularity for interval awareness, and uses a causal Transformer backbone for accurate predictions. Additionally, we introduce a patch masking strategy during training to support arbitrary input lengths. After trained on large-scale datasets, the proposed foundation model demonstrates strong generalization to unseen scenarios and achieves zero-shot performance on new tasks that surpass traditional full-shot baselines.

SYMay 19, 2025
Power Allocation for Delay Optimization in Device-to-Device Networks: A Graph Reinforcement Learning Approach

Hao Fang, Kai Huang, Hao Ye et al.

The pursuit of rate maximization in wireless communication frequently encounters substantial challenges associated with user fairness. This paper addresses these challenges by exploring a novel power allocation approach for delay optimization, utilizing graph neural networks (GNNs)-based reinforcement learning (RL) in device-to-device (D2D) communication. The proposed approach incorporates not only channel state information but also factors such as packet delay, the number of backlogged packets, and the number of transmitted packets into the components of the state information. We adopt a centralized RL method, where a central controller collects and processes the state information. The central controller functions as an agent trained using the proximal policy optimization (PPO) algorithm. To better utilize topology information in the communication network and enhance the generalization of the proposed method, we embed GNN layers into both the actor and critic networks of the PPO algorithm. This integration allows for efficient parameter updates of GNNs and enables the state information to be parameterized as a low-dimensional embedding, which is leveraged by the agent to optimize power allocation strategies. Simulation results demonstrate that the proposed method effectively reduces average delay while ensuring user fairness, outperforms baseline methods, and exhibits scalability and generalization capability.

LGFeb 12, 2025
Deep Reinforcement Learning-Based User Scheduling for Collaborative Perception

Yandi Liu, Guowei Liu, Le Liang et al.

Stand-alone perception systems in autonomous driving suffer from limited sensing ranges and occlusions at extended distances, potentially resulting in catastrophic outcomes. To address this issue, collaborative perception is envisioned to improve perceptual accuracy by using vehicle-to-everything (V2X) communication to enable collaboration among connected and autonomous vehicles and roadside units. However, due to limited communication resources, it is impractical for all units to transmit sensing data such as point clouds or high-definition video. As a result, it is essential to optimize the scheduling of communication links to ensure efficient spectrum utilization for the exchange of perceptual data. In this work, we propose a deep reinforcement learning-based V2X user scheduling algorithm for collaborative perception. Given the challenges in acquiring perceptual labels, we reformulate the conventional label-dependent objective into a label-free goal, based on characteristics of 3D object detection. Incorporating both channel state information (CSI) and semantic information, we develop a double deep Q-Network (DDQN)-based user scheduling framework for collaborative perception, named SchedCP. Simulation results verify the effectiveness and robustness of SchedCP compared with traditional V2X scheduling methods. Finally, we present a case study to illustrate how our proposed algorithm adaptively modifies the scheduling decisions by taking both instantaneous CSI and perceptual semantics into account.

LGAug 5, 2025
EvaDrive: Evolutionary Adversarial Policy Optimization for End-to-End Autonomous Driving

Siwen Jiao, Kangan Qian, Hao Ye et al.

Autonomous driving faces significant challenges in achieving human-like iterative decision-making, which continuously generates, evaluates, and refines trajectory proposals. Current generation-evaluation frameworks isolate trajectory generation from quality assessment, preventing iterative refinement essential for planning, while reinforcement learning methods collapse multi-dimensional preferences into scalar rewards, obscuring critical trade-offs and yielding scalarization bias.To overcome these issues, we present EvaDrive, a novel multi-objective reinforcement learning framework that establishes genuine closed-loop co-evolution between trajectory generation and evaluation via adversarial optimization. EvaDrive frames trajectory planning as a multi-round adversarial game. In this game, a hierarchical generator continuously proposes candidate paths by combining autoregressive intent modeling for temporal causality with diffusion-based refinement for spatial flexibility. These proposals are then rigorously assessed by a trainable multi-objective critic that explicitly preserves diverse preference structures without collapsing them into a single scalarization bias.This adversarial interplay, guided by a Pareto frontier selection mechanism, enables iterative multi-round refinement, effectively escaping local optima while preserving trajectory diversity.Extensive experiments on NAVSIM and Bench2Drive benchmarks demonstrate SOTA performance, achieving 94.9 PDMS on NAVSIM v1 (surpassing DiffusionDrive by 6.8, DriveSuprim by 5.0, and TrajHF by 0.9) and 64.96 Driving Score on Bench2Drive. EvaDrive generates diverse driving styles via dynamic weighting without external preference data, introducing a closed-loop adversarial framework for human-like iterative decision-making, offering a novel scalarization-free trajectory optimization approach.

CVAug 26, 2025
SoccerNet 2025 Challenges Results

Silvio Giancola, Anthony Cioppa, Marc Gutiérrez-Pérez et al.

The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, dedicated to advancing computer vision research in football video understanding. This year's challenges span four vision-based tasks: (1) Team Ball Action Spotting, focused on detecting ball-related actions in football broadcasts and assigning actions to teams; (2) Monocular Depth Estimation, targeting the recovery of scene geometry from single-camera broadcast clips through relative depth estimation for each pixel; (3) Multi-View Foul Recognition, requiring the analysis of multiple synchronized camera views to classify fouls and their severity; and (4) Game State Reconstruction, aimed at localizing and identifying all players from a broadcast video to reconstruct the game state on a 2D top-view of the field. Across all tasks, participants were provided with large-scale annotated datasets, unified evaluation protocols, and strong baselines as starting points. This report presents the results of each challenge, highlights the top-performing solutions, and provides insights into the progress made by the community. The SoccerNet Challenges continue to serve as a driving force for reproducible, open research at the intersection of computer vision, artificial intelligence, and sports. Detailed information about the tasks, challenges, and leaderboards can be found at https://www.soccer-net.org, with baselines and development kits available at https://github.com/SoccerNet.

LGMay 6, 2025
Small-Scale-Fading-Aware Resource Allocation in Wireless Federated Learning

Jiacheng Wang, Le Liang, Hao Ye et al.

Judicious resource allocation can effectively enhance federated learning (FL) training performance in wireless networks by addressing both system and statistical heterogeneity. However, existing strategies typically rely on block fading assumptions, which overlooks rapid channel fluctuations within each round of FL gradient uploading, leading to a degradation in FL training performance. Therefore, this paper proposes a small-scale-fading-aware resource allocation strategy using a multi-agent reinforcement learning (MARL) framework. Specifically, we establish a one-step convergence bound of the FL algorithm and formulate the resource allocation problem as a decentralized partially observable Markov decision process (Dec-POMDP), which is subsequently solved using the QMIX algorithm. In our framework, each client serves as an agent that dynamically determines spectrum and power allocations within each coherence time slot, based on local observations and a reward derived from the convergence analysis. The MARL setting reduces the dimensionality of the action space and facilitates decentralized decision-making, enhancing the scalability and practicality of the solution. Experimental results demonstrate that our QMIX-based resource allocation strategy significantly outperforms baseline methods across various degrees of statistical heterogeneity. Additionally, ablation studies validate the critical importance of incorporating small-scale fading dynamics, highlighting its role in optimizing FL performance.

CVNov 25, 2019
Cascaded Detail-Preserving Networks for Super-Resolution of Document Images

Zhichao Fu, Yu Kong, Yingbin Zheng et al.

The accuracy of OCR is usually affected by the quality of the input document image and different kinds of marred document images hamper the OCR results. Among these scenarios, the low-resolution image is a common and challenging case. In this paper, we propose the cascaded networks for document image super-resolution. Our model is composed by the Detail-Preserving Networks with small magnification. The loss function with perceptual terms is designed to simultaneously preserve the original patterns and enhance the edge of the characters. These networks are trained with the same architecture and different parameters and then assembled into a pipeline model with a larger magnification. The low-resolution images can upscale gradually by passing through each Detail-Preserving Network until the final high-resolution images. Through extensive experiments on two scanning document image datasets, we demonstrate that the proposed approach outperforms recent state-of-the-art image super-resolution methods, and combining it with standard OCR system lead to signification improvements on the recognition results.

CVNov 4, 2019
Scene Text Recognition with Temporal Convolutional Encoder

Xiangcheng Du, Tianlong Ma, Yingbin Zheng et al.

Texts from scene images typically consist of several characters and exhibit a characteristic sequence structure. Existing methods capture the structure with the sequence-to-sequence models by an encoder to have the visual representations and then a decoder to translate the features into the label sequence. In this paper, we study text recognition framework by considering the long-term temporal dependencies in the encoder stage. We demonstrate that the proposed Temporal Convolutional Encoder with increased sequential extents improves the accuracy of text recognition. We also study the impact of different attention modules in convolutional blocks for learning accurate text representations. We conduct comparisons on seven datasets and the experiments demonstrate the effectiveness of our proposed approach.

SPAug 12, 2019
Learn to Compress CSI and Allocate Resources in Vehicular Networks

Liang Wang, Hao Ye, Le Liang et al.

Resource allocation has a direct and profound impact on the performance of vehicle-to-everything (V2X) networks. In this paper, we develop a hybrid architecture consisting of centralized decision making and distributed resource sharing (the C-Decision scheme) to maximize the long-term sum rate of all vehicles. To reduce the network signaling overhead, each vehicle uses a deep neural network to compress its observed information that is thereafter fed back to the centralized decision making unit. The centralized decision unit employs a deep Q-network to allocate resources and then sends the decision results to all vehicles. We further adopt a quantization layer for each vehicle that learns to quantize the continuous feedback. In addition, we devise a mechanism to balance the transmission of vehicle-to-vehicle (V2V) links and vehicle-to-infrastructure (V2I) links. To further facilitate distributed spectrum sharing, we also propose a distributed decision making and spectrum sharing architecture (the D-Decision scheme) for each V2V link. Through extensive simulation results, we demonstrate that the proposed C-Decision and D-Decision schemes can both achieve near-optimal performance and are robust to feedback interval variations, input noise, and feedback noise.

NIJul 30, 2019
Learn to Allocate Resources in Vehicular Networks

Liang Wang, Hao Ye, Le Liang et al.

Resource allocation has a direct and profound impact on the performance of vehicle-to-everything (V2X) networks. Considering the dynamic nature of vehicular environments, it is appealing to devise a decentralized strategy to perform effective resource sharing. In this paper, we exploit deep learning to promote coordination among multiple vehicles and propose a hybrid architecture consisting of centralized decision making and distributed resource sharing to maximize the long-term sum rate of all vehicles. To reduce the network signaling overhead, each vehicle uses a deep neural network to compress its own observed information that is thereafter fed back to the centralized decision-making unit, which employs a deep Q-network to allocate resources and then sends the decision results to all vehicles. We further adopt a quantization layer for each vehicle that learns to quantize the continuous feedback. Extensive simulation results demonstrate that the proposed hybrid architecture can achieve near-optimal performance. Meanwhile, there exists an optimal number of continuous feedback and binary feedback, respectively. Besides, this architecture is robust to different feedback intervals, input noise, and feedback noise.

ITJul 7, 2019
Deep Learning based Wireless Resource Allocation with Application to Vehicular Networks

Le Liang, Hao Ye, Guanding Yu et al.

It has been a long-held belief that judicious resource allocation is critical to mitigating interference, improving network efficiency, and ultimately optimizing wireless communication performance. The traditional wisdom is to explicitly formulate resource allocation as an optimization problem and then exploit mathematical programming to solve the problem to a certain level of optimality. Nonetheless, as wireless networks become increasingly diverse and complex, e.g., in the high-mobility vehicular networks, the current design methodologies face significant challenges and thus call for rethinking of the traditional design philosophy. Meanwhile, deep learning, with many success stories in various disciplines, represents a promising alternative due to its remarkable power to leverage data for problem solving. In this paper, we discuss the key motivations and roadblocks of using deep learning for wireless resource allocation with application to vehicular networks. We review major recent studies that mobilize the deep learning philosophy in wireless resource allocation and achieve impressive results. We first discuss deep learning assisted optimization for resource allocation. We then highlight the deep reinforcement learning approach to address resource allocation problems that are difficult to handle in the traditional optimization framework. We also identify some research directions that deserve further investigation.

CVJul 4, 2019
Edge-Aware Deep Image Deblurring

Zhichao Fu, Tianlong Ma, Yingbin Zheng et al.

Image deblurring is a fundamental and challenging low-level vision problem. Previous vision research indicates that edge structure in natural scenes is one of the most important factors to estimate the abilities of human visual perception. In this paper, we resort to human visual demands of sharp edges and propose a two-phase edge-aware deep network to improve deep image deblurring. An edge detection convolutional subnet is designed in the first phase and a residual fully convolutional deblur subnet is then used for generating deblur results. The introduction of the edge-aware network enables our model with the specific capacity of enhancing images with sharp edges. We successfully apply our framework on standard benchmarks and promising results are achieved by our proposed deblur model.

CVJul 4, 2019
Fast Video Crowd Counting with a Temporal Aware Network

Xingjiao Wu, Baohan Xu, Yingbin Zheng et al.

Crowd counting aims to count the number of instantaneous people in a crowded space, and many promising solutions have been proposed for single image crowd counting. With the ubiquitous video capture devices in public safety field, how to effectively apply the crowd counting technique to video content has become an urgent problem. In this paper, we introduce a novel framework based on temporal aware modeling of the relationship between video frames. The proposed network contains a few dilated residual blocks, and each of them consists of the layers that compute the temporal convolutions of features from the adjacent frames to improve the prediction. To alleviate the expensive computation and satisfy the demand of fast video crowd counting, we also introduce a lightweight network to balance the computational cost with representation ability. We conduct experiments on the crowd counting benchmarks and demonstrate its superiority in terms of effectiveness and efficiency over previous video-based approaches.

CVMar 23, 2019
Detecting Curve Text with Local Segmentation Network and Curve Connection

Zhao Zhou, Hao Ye, Luhui Chen et al.

Curve text or arbitrary shape text is very common in real-world scenarios. In this paper, we propose a novel framework with the local segmentation network (LSN) followed by the curve connection to detect text in horizontal, oriented and curved forms. The LSN is composed of two elements, i.e., proposal generation to get the horizontal rectangle proposals with high overlap with text and text segmentation to find the arbitrary shape text region within proposals. The curve connection is then designed to connect the local mask to the detection results. We conduct experiments using the proposed framework on two real-world curve text detection datasets and demonstrate the effectiveness over previous approaches.

CVDec 6, 2018
Adaptive Scenario Discovery for Crowd Counting

Xingjiao Wu, Yingbin Zheng, Hao Ye et al.

Crowd counting, i.e., estimation number of the pedestrian in crowd images, is emerging as an important research problem with the public security applications. A key component for the crowd counting systems is the construction of counting models which are robust to various scenarios under facts such as camera perspective and physical barriers. In this paper, we present an adaptive scenario discovery framework for crowd counting. The system is structured with two parallel pathways that are trained with different sizes of the receptive field to represent different scales and crowd densities. After ensuring that these components are present in the proper geometric configuration, a third branch is designed to adaptively recalibrate the pathway-wise responses by discovering and modeling the dynamic scenarios implicitly. Our system is able to represent highly variable crowd images and achieves state-of-the-art results in two challenging benchmarks.

CVJun 26, 2018
Crowd Counting with Density Adaption Networks

Li Wang, Weiyuan Shao, Yao Lu et al.

Crowd counting is one of the core tasks in various surveillance applications. A practical system involves estimating accurate head counts in dynamic scenarios under different lightning, camera perspective and occlusion states. Previous approaches estimate head counts despite that they can vary dramatically in different density settings; the crowd is often unevenly distributed and the results are therefore unsatisfactory. In this paper, we propose a lightweight deep learning framework that can automatically estimate the crowd density level and adaptively choose between different counter networks that are explicitly trained for different density domains. Experiments on two recent crowd counting datasets, UCF_CC_50 and ShanghaiTech, show that the proposed mechanism achieves promising improvements over state-of-the-art methods. Moreover, runtime speed is 20 FPS on a single GPU.

CVApr 13, 2018
Precise Temporal Action Localization by Evolving Temporal Proposals

Haonan Qiu, Yingbin Zheng, Hao Ye et al.

Locating actions in long untrimmed videos has been a challenging problem in video content analysis. The performances of existing action localization approaches remain unsatisfactory in precisely determining the beginning and the end of an action. Imitating the human perception procedure with observations and refinements, we propose a novel three-phase action localization framework. Our framework is embedded with an Actionness Network to generate initial proposals through frame-wise similarity grouping, and then a Refinement Network to conduct boundary adjustment on these proposals. Finally, the refined proposals are sent to a Localization Network for further fine-grained location regression. The whole process can be deemed as multi-stage refinement using a novel non-local pyramid feature under various temporal granularities. We evaluate our framework on THUMOS14 benchmark and obtain a significant improvement over the state-of-the-arts approaches. Specifically, the performance gain is remarkable under precise localization with high IoU thresholds. Our proposed framework achieves mAP@IoU=0.5 of 34.2%.

ITApr 1, 2018
Toward Intelligent Vehicular Networks: A Machine Learning Framework

Le Liang, Hao Ye, Geoffrey Ye Li

As wireless networks evolve towards high mobility and providing better support for connected vehicles, a number of new challenges arise due to the resulting high dynamics in vehicular environments and thus motive rethinking of traditional wireless design methodologies. Future intelligent vehicles, which are at the heart of high mobility networks, are increasingly equipped with multiple advanced onboard sensors and keep generating large volumes of data. Machine learning, as an effective approach to artificial intelligence, can provide a rich set of tools to exploit such data for the benefit of the networks. In this article, we first identify the distinctive characteristics of high mobility vehicular networks and motivate the use of machine learning to address the resulting challenges. After a brief introduction of the major concepts of machine learning, we discuss its applications to learn the dynamics of vehicular networks and make informed decisions to optimize network performance. In particular, we discuss in greater detail the application of reinforcement learning in managing network resources as an alternative to the prevalent optimization approach. Finally, some open issues worth further investigation are highlighted.

CVMar 3, 2017
Arbitrary-Oriented Scene Text Detection via Rotation Proposals

Jianqi Ma, Weiyuan Shao, Hao Ye et al.

This paper introduces a novel rotation-based framework for arbitrary-oriented text detection in natural scene images. We present the Rotation Region Proposal Networks (RRPN), which are designed to generate inclined proposals with text orientation angle information. The angle information is then adapted for bounding box regression to make the proposals more accurately fit into the text region in terms of the orientation. The Rotation Region-of-Interest (RRoI) pooling layer is proposed to project arbitrary-oriented proposals to a feature map for a text region classifier. The whole framework is built upon a region-proposal-based architecture, which ensures the computational efficiency of the arbitrary-oriented text detection compared with previous text detection systems. We conduct experiments using the rotation-based framework on three real-world scene text detection datasets and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.

CVFeb 1, 2017
Evolving Boxes for Fast Vehicle Detection

Li Wang, Yao Lu, Hong Wang et al.

We perform fast vehicle detection from traffic surveillance cameras. A novel deep learning framework, namely Evolving Boxes, is developed that proposes and refines the object boxes under different feature representations. Specifically, our framework is embedded with a light-weight proposal network to generate initial anchor boxes as well as to early discard unlikely regions; a fine-turning network produces detailed features for these candidate boxes. We show intriguingly that by applying different feature fusion techniques, the initial boxes can be refined for both localization and recognition. We evaluate our network on the recent DETRAC benchmark and obtain a significant improvement over the state-of-the-art Faster RCNN by 9.5% mAP. Further, our network achieves 9-13 FPS detection speed on a moderate commercial GPU.

CVSep 21, 2015
Fusing Multi-Stream Deep Networks for Video Classification

Zuxuan Wu, Yu-Gang Jiang, Xi Wang et al.

This paper studies deep network architectures to address the problem of video classification. A multi-stream framework is proposed to fully utilize the rich multimodal information in videos. Specifically, we first train three Convolutional Neural Networks to model spatial, short-term motion and audio clues respectively. Long Short Term Memory networks are then adopted to explore long-term temporal dynamics. With the outputs of the individual streams, we propose a simple and effective fusion method to generate the final predictions, where the optimal fusion weights are learned adaptively for each class, and the learning process is regularized by automatically estimated class relationships. Our contributions are two-fold. First, the proposed multi-stream framework is able to exploit multimodal features that are more comprehensive than those previously attempted. Second, we demonstrate that the adaptive fusion method using the class relationship as a regularizer outperforms traditional alternatives that estimate the weights in a "free" fashion. Our framework produces significantly better results than the state of the arts on two popular benchmarks, 92.2\% on UCF-101 (without using audio) and 84.9\% on Columbia Consumer Videos.

CVApr 8, 2015
Evaluating Two-Stream CNN for Video Classification

Hao Ye, Zuxuan Wu, Rui-Wei Zhao et al.

Videos contain very rich semantic information. Traditional hand-crafted features are known to be inadequate in analyzing complex video semantics. Inspired by the huge success of the deep learning methods in analyzing image, audio and text data, significant efforts are recently being devoted to the design of deep nets for video analytics. Among the many practical needs, classifying videos (or video clips) based on their major semantic categories (e.g., "skiing") is useful in many applications. In this paper, we conduct an in-depth study to investigate important implementation options that may affect the performance of deep nets on video classification. Our evaluations are conducted on top of a recent two-stream convolutional neural network (CNN) pipeline, which uses both static frames and motion optical flows, and has demonstrated competitive performance against the state-of-the-art methods. In order to gain insights and to arrive at a practical guideline, many important options are studied, including network architectures, model fusion, learning parameters and the final prediction methods. Based on the evaluations, very competitive results are attained on two popular video classification benchmarks. We hope that the discussions and conclusions from this work can help researchers in related fields to quickly set up a good basis for further investigations along this very promising direction.

CVApr 7, 2015
Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

Zuxuan Wu, Xi Wang, Yu-Gang Jiang et al.

Classifying videos according to content semantics is an important problem with a wide range of applications. In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos. Specifically, the spatial and the short-term motion features are extracted separately by two Convolutional Neural Networks (CNN). These two types of CNN-based features are then combined in a regularized feature fusion network for classification, which is able to learn and utilize feature relationships for improved performance. In addition, Long Short Term Memory (LSTM) networks are applied on top of the two features to further model longer-term temporal clues. The main contribution of this work is the hybrid learning framework that can model several important aspects of the video data. We also show that (1) combining the spatial and the short-term motion features in the regularized fusion network is better than direct classification and fusion using the CNN with a softmax layer, and (2) the sequence-based LSTM is highly complementary to the traditional classification strategy without considering the temporal frame orders. Extensive experiments are conducted on two popular and challenging benchmarks, the UCF-101 Human Actions and the Columbia Consumer Videos (CCV). On both benchmarks, our framework achieves to-date the best reported performance: $91.3\%$ on the UCF-101 and $83.5\%$ on the CCV.