Hang Song

CV
h-index33
17papers
435citations
Novelty51%
AI Score59

17 Papers

CVApr 12
NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models: Datasets, Methods and Results

Xin Li, Jiachao Gong, Xijun Wang et al.

This paper presents an overview of the NTIRE 2026 Challenge on Short-form UGC Video Restoration in the Wild with Generative Models. This challenge utilizes a new short-form UGC (S-UGC) video restoration benchmark, termed KwaiVIR, which is contributed by USTC and Kuaishou Technology. It contains both synthetically distorted videos and real-world short-form UGC videos in the wild. For this edition, the released data include 200 synthetic training videos, 48 wild training videos, 11 validation videos, and 20 testing videos. The primary goal of this challenge is to establish a strong and practical benchmark for restoring short-form UGC videos under complex real-world degradations, especially in the emerging paradigm of generative-model-based S-UGC video restoration. This challenge has two tracks: (i) the primary track is a subjective track, where the evaluation is based on a user study; (ii) the second track is an objective track. These two tracks enable a comprehensive assessment of restoration quality. In total, 95 teams have registered for this competition. And 12 teams submitted valid final solutions and fact sheets for the testing phase. The submitted methods achieved strong performance on the KwaiVIR benchmark, demonstrating encouraging progress in short-form UGC video restoration in the wild.

CLAug 5, 2024Code
UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Zhaowei Li, Wei Wang, YiQing Cai et al.

Significant advancements has recently been achieved in the field of multi-modal large language models (MLLMs), demonstrating their remarkable capabilities in understanding and reasoning across diverse tasks. However, these models are often trained for specific tasks and rely on task-specific input-output formats, limiting their applicability to a broader range of tasks. This raises a fundamental question: Can we develop a unified approach to represent and handle different multi-modal tasks to maximize the generalizability of MLLMs? In this paper, we propose UnifiedMLLM, a comprehensive model designed to represent various tasks using a unified representation. Our model exhibits strong capabilities in comprehending the implicit intent of user instructions and preforming reasoning. In addition to generating textual responses, our model also outputs task tokens and grounding tokens, serving as indicators of task types and task granularity. These outputs are subsequently routed through the task router and directed to specific expert models for task completion. To train our model, we construct a task-specific dataset and an 100k multi-task dataset encompassing complex scenarios. Employing a three-stage training strategy, we equip our model with robust reasoning and task processing capabilities while preserving its generalization capacity and knowledge reservoir. Extensive experiments showcase the impressive performance of our unified representation approach across various tasks, surpassing existing methodologies. Furthermore, our approach exhibits exceptional scalability and generality. Our code, model, and dataset will be available at \url{https://github.com/lzw-lzw/UnifiedMLLM}.

NIDec 29, 2022
Pensieve 5G: Implementation of RL-based ABR Algorithm for UHD 4K/8K Content Delivery on Commercial 5G SA/NR-DC Network

Kasidis Arunruangsirilert, Bo Wei, Hang Song et al.

While the rollout of the fifth-generation mobile network (5G) is underway across the globe with the intention to deliver 4K/8K UHD videos, Augmented Reality (AR), and Virtual Reality (VR) content to the mass amounts of users, the coverage and throughput are still one of the most significant issues, especially in the rural areas, where only 5G in the low-frequency band are being deployed. This called for a high-performance adaptive bitrate (ABR) algorithm that can maximize the user quality of experience given 5G network characteristics and data rate of UHD contents. Recently, many of the newly proposed ABR techniques were machine-learning based. Among that, Pensieve is one of the state-of-the-art techniques, which utilized reinforcement-learning to generate an ABR algorithm based on observation of past decision performance. By incorporating the context of the 5G network and UHD content, Pensieve has been optimized into Pensieve 5G. New QoE metrics that more accurately represent the QoE of UHD video streaming on the different types of devices were proposed and used to evaluate Pensieve 5G against other ABR techniques including the original Pensieve. The results from the simulation based on the real 5G Standalone (SA) network throughput shows that Pensieve 5G outperforms both conventional algorithms and Pensieve with the average QoE improvement of 8.8% and 14.2%, respectively. Additionally, Pensieve 5G also performed well on the commercial 5G NR-NR Dual Connectivity (NR-DC) Network, despite the training being done solely using the data from the 5G Standalone (SA) network.

CVJan 11, 2024Code
GroundingGPT:Language Enhanced Multi-modal Grounding Model

Zhaowei Li, Qi Xu, Dong Zhang et al.

Multi-modal large language models have demonstrated impressive performance across various tasks in different modalities. However, existing multi-modal models primarily emphasize capturing global information within each modality while neglecting the importance of perceiving local information across modalities. Consequently, these models lack the ability to effectively understand the fine-grained details of input data, limiting their performance in tasks that require a more nuanced understanding. To address this limitation, there is a compelling need to develop models that enable fine-grained understanding across multiple modalities, thereby enhancing their applicability to a wide range of tasks. In this paper, we propose GroundingGPT, a language enhanced multi-modal grounding model. Beyond capturing global information like other multi-modal models, our proposed model excels at tasks demanding a detailed understanding of local information within the input. It demonstrates precise identification and localization of specific regions in images or moments in videos. To achieve this objective, we design a diversified dataset construction pipeline, resulting in a multi-modal, multi-granularity dataset for model training. The code, dataset, and demo of our model can be found at https: //github.com/lzw-lzw/GroundingGPT.

CLMay 30, 2025Code
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models

Shilin Xu, Yanwei Li, Rui Yang et al.

Recent works on large language models (LLMs) have successfully demonstrated the emergence of reasoning capabilities via reinforcement learning (RL). Although recent efforts leverage group relative policy optimization (GRPO) for MLLMs post-training, they constantly explore one specific aspect, such as grounding tasks, math problems, or chart analysis. There are no works that can leverage multi-source MLLM tasks for stable reinforcement learning. In this work, we present a unified perspective to solve this problem. We present Mixed-R1, a unified yet straightforward framework that contains a mixed reward function design (Mixed-Reward) and a mixed post-training dataset (Mixed-45K). We first design a data engine to select high-quality examples to build the Mixed-45K post-training dataset. Then, we present a Mixed-Reward design, which contains various reward functions for various MLLM tasks. In particular, it has four different reward functions: matching reward for binary answer or multiple-choice problems, chart reward for chart-aware datasets, IoU reward for grounding problems, and open-ended reward for long-form text responses such as caption datasets. To handle the various long-form text content, we propose a new open-ended reward named Bidirectional Max-Average Similarity (BMAS) by leveraging tokenizer embedding matching between the generated response and the ground truth. Extensive experiments show the effectiveness of our proposed method on various MLLMs, including Qwen2.5-VL and Intern-VL on various sizes. Our dataset and model are available at https://github.com/xushilin1/mixed-r1.

CVDec 12, 2025
EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

Wei Chow, Linfeng Li, Lingdong Kong et al.

Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similarity performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

SYApr 9
A Passive Software-Defined Radio-based mmWave Sensing System for Blind Integrated Communication and Sensing

Shiqi Liu, Hang Song, Bo Wei et al.

Integrated Sensing and Communication (ISAC) is considered as a key component of future 6G technologies, especially in the millimeter-wave (mmWave) bands. Recently, the performances of ISAC were experimentally evaluated and demonstrated in various scenarios by developing ISAC systems. These systems generally consist of coherent transmitting (Tx) and receiving (Rx) modules. However, actively transmitting radio waves for experiments is not easy due to regulatory restrictions of radio. Meanwhile, the Tx/Rx should be synchronized and Rx need the information of Tx. In this paper, a fully passive mmWave sensing system is developed with software-defined radio for blind ISAC. It only consists of a passive Rx module which does not depend on the Tx. Since the proposed system is not synchronized with Tx and has no knowledge of the transmitted signals, a differential structure with two oppositely-oriented receivers is introduced to realize the sensing function. This structure can mitigate the influences of unknown source signals and other distortions. With the proposed sensing system, the ambient mmWave communication signals are leveraged for sensing without interrupting the existing systems. It can be deployed for field applications such as signal detection and dynamic human activity recognition since it does not emit signals. The efficacy of the developed system is first verified with a metallic plate with known motion pattern. The measured Doppler spectrogram shows good agreement with the simulation results, demonstrating the correctness of the sensing results. Further, the system is evaluated in complex scenarios, including handwaving, single- and multi-person motion detection. The sensing results successfully reflect the corresponding motions, demonstrating that the proposed sensing system can be utilized for blind ISAC in various applications.

CVMay 11
Masked Generative Transformer Is What You Need for Image Editing

Wei Chow, Linfeng Li, Xian Sun et al.

Diffusion models dominate image editing, yet their global denoising mechanism entangles edited regions with surrounding context, causing modifications to propagate into areas that should remain intact. We propose a fundamentally different approach by leveraging Masked Generative Transformers (MGTs), whose localized token-prediction paradigm naturally confines changes to intended regions. We present EditMGT, an MGT-based editing framework that is the first of its kind. Our approach employs multi-layer attention consolidation to aggregate cross-attention maps into precise edit localization signals, and region-hold sampling to explicitly prevent token flipping in non-target areas. To support training, we construct CrispEdit-2M, a 2M-sample high-resolution (>1024) editing dataset spanning seven categories. With only 960M parameters, EditMGT achieves state-of-the-art image similarity on multiple benchmarks while delivering 6x faster editing, demonstrating that MGTs offer a compelling alternative to diffusion-based editing.

CVApr 5
NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results

Shuhong Liu, Chenyu Bao, Ziteng Cui et al.

This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify robust reconstruction pipelines that are robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.

CVNov 14, 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

Wei Wang, Zhaowei Li, Qi Xu et al.

Multi-modal large language models (MLLMs) have achieved remarkable success in fine-grained visual understanding across a range of tasks. However, they often encounter significant challenges due to inadequate alignment for fine-grained knowledge, which restricts their ability to accurately capture local details and attain a comprehensive global perception. While recent advancements have focused on aligning object expressions with grounding information, they typically lack explicit integration of object images, which contain affluent information beyond mere texts or coordinates. To bridge this gap, we introduce a novel fine-grained visual knowledge alignment method that effectively aligns and integrates multi-scale knowledge of objects, including texts, coordinates, and images. This innovative method is underpinned by our multi-scale fine-grained enhancement data synthesis pipeline, which provides over 300K essential training data to enhance alignment and improve overall performance. Furthermore, we present TinyGroundingGPT, a series of compact models optimized for high-level alignments. With a scale of approximately 3B parameters, TinyGroundingGPT achieves outstanding results in grounding tasks while delivering performance comparable to larger MLLMs in complex visual scenarios.

DCSep 9, 2025
Astra: A Multi-Agent System for GPU Kernel Performance Optimization

Anjiang Wei, Tianran Sun, Yogesh Seenichamy et al. · stanford

GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining high performance typically requires extensive manual tuning. Compiler-based systems reduce some of this burden, but still demand substantial manual design and engineering effort. Recently, researchers have explored using LLMs for GPU kernel generation, though prior work has largely focused on translating high-level PyTorch modules into CUDA code. In this work, we introduce Astra, the first LLM-based multi-agent system for GPU kernel optimization. Unlike previous approaches, Astra starts from existing CUDA implementations extracted from SGLang, a widely deployed framework for serving LLMs, rather than treating PyTorch modules as the specification. Within Astra, specialized LLM agents collaborate through iterative code generation, testing, profiling, and planning to produce kernels that are both correct and high-performance. On kernels from SGLang, Astra achieves an average speedup of 1.32x using zero-shot prompting with OpenAI o4-mini. A detailed case study further demonstrates that LLMs can autonomously apply loop transformations, optimize memory access patterns, exploit CUDA intrinsics, and leverage fast math operations to yield substantial performance gains. Our work highlights multi-agent LLM systems as a promising new paradigm for GPU kernel optimization.

CLMay 14, 2024
QCRD: Quality-guided Contrastive Rationale Distillation for Large Language Models

Wei Wang, Zhaowei Li, Qi Xu et al.

The deployment of large language models (LLMs) faces considerable challenges concerning resource constraints and inference efficiency. Recent research has increasingly focused on smaller, task-specific models enhanced by distilling knowledge from LLMs. However, prior studies have often overlooked the diversity and quality of knowledge, especially the untapped potential of negative knowledge. Constructing effective negative knowledge remains severely understudied. In this paper, we introduce a novel framework called quality-guided contrastive rationale distillation aimed at enhancing reasoning capabilities through contrastive knowledge learning. For positive knowledge, we enrich its diversity through temperature sampling and employ self-consistency for further denoising and refinement. For negative knowledge, we propose an innovative self-adversarial approach that generates low-quality rationales by sampling previous iterations of smaller language models, embracing the idea that one can learn from one's own weaknesses. A contrastive loss is developed to distill both positive and negative knowledge into smaller language models, where an online-updating discriminator is integrated to assess qualities of rationales and assign them appropriate weights, optimizing the training process. Through extensive experiments across multiple reasoning tasks, we demonstrate that our method consistently outperforms existing distillation techniques, yielding higher-quality rationales.

CVJun 3, 2025
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query

Wei Chow, Yuan Gao, Linfeng Li et al.

Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to single languages, single images, or singular retrieval conditions, often failing to fully exploit the expressive capacity of visual information as evidenced by maintained performance when images are replaced with captions. However, practical retrieval scenarios frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries with 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify existing models's limitation: focusing solely on global semantic information while neglecting specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction to preserve fine-grained conditional elements and contrastive learning to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization capabilities validated across 8 established retrieval benchmarks. Collectively, our contributions - a novel dataset, identification of critical limitations in existing approaches, and an innovative fine-tuning framework - establish a foundation for future research in interleaved multi-condition semantic retrieval.

NISep 25, 2021
Adaptive video transmission using QUBO method and Digital Annealer based on Ising machine

Bo Wei, Hang Song, Jiro Katto

With the dramatically increasing video streaming in the total network traffic, it is critical to develop effective algorithms to promote the content delivery service of high quality. Adaptive bitrate (ABR) control is the most essential technique which determines the proper bitrate to be chosen based on network conditions, thus realize high-quality video streaming. In this paper, a novel ABR strategy is proposed based on Ising machine by using the quadratic unconstrained binary optimization (QUBO) method and Digital Annealer (DA) for the first time. The proposed method is evaluated by simulation with the real-world measured throughput, and compared with other state-of-the-art methods. Experiment results show that the proposed QUBO-based method can outperform the existing methods, which demonstrating the superior of the proposed QUBO-based method.

ROSep 17, 2019
Synchronous Maneuver Searching and Trajectory Planning for Autonomous Vehicles in Dynamic Traffic Environments

Lilin Qian, Xin Xu, Yujun Zeng et al.

In the real-time decision-making and local planning process of autonomous vehicles in dynamic environments, the autonomous driving system may fail to find a reasonable policy or even gets trapped in some situation due to the complexity of global tasks and the incompatibility between upper-level maneuver decisions with the low-level lower level trajectory planning. To solve this problem, this paper presents a synchronous maneuver searching and trajectory planning (SMSTP) algorithm based on the topological concept of homotopy. Firstly, a set of alternative maneuvers with boundary limits are enumerated on a multi-lane road. Instead of sampling numerous paths in the whole spatio-temporal space, we, for the first time, propose using Trajectory Profiles (TPs) to quickly construct the topological maneuvers represented by different routes, and put forward a corridor generation algorithm based on graph-search. The bounded corridor further constrains the maneuver's space in the spatial space. A step-wise heuristic optimization algorithm is then proposed to synchronously generate a feasible trajectory for each maneuver. To achieve real-time performance, we initialize the states to be optimized with the boundary constraints of maneuvers, and we set some heuristic states as terminal targets in the quadratic cost function. The solution of a feasible trajectory is always guaranteed only if a specific maneuver is given. The simulation and realistic driving-test experiments verified that the proposed SMSTP algorithm has a short computation time which is less than 37ms, and the experimental results showed the validity and effectiveness of the SMSTP algorithm.

CVJul 30, 2017
Discover and Learn New Objects from Documentaries

Kai Chen, Hang Song, Chen Change Loy et al.

Despite the remarkable progress in recent years, detecting objects in a new context remains a challenging task. Detectors learned from a public dataset can only work with a fixed list of categories, while training from scratch usually requires a large amount of training data with detailed annotations. This work aims to explore a novel approach -- learning object detectors from documentary films in a weakly supervised manner. This is inspired by the observation that documentaries often provide dedicated exposition of certain object categories, where visual presentations are aligned with subtitles. We believe that object detectors can be learned from such a rich source of information. Towards this goal, we develop a joint probabilistic framework, where individual pieces of information, including video frames and subtitles, are brought together via both visual and linguistic links. On top of this formulation, we further derive a weakly supervised learning algorithm, where object model learning and training set mining are unified in an optimization procedure. Experimental results on a real world dataset demonstrate that this is an effective approach to learning new object detectors.

CVAug 2, 2016
CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016

Yuanjun Xiong, Limin Wang, Zhe Wang et al.

This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016. We follow the basic pipeline of temporal segment networks and further raise the performance via a number of other techniques. Specifically, we use the latest deep model architecture, e.g., ResNet and Inception V3, and introduce new aggregation schemes (top-k and attention-weighted pooling). Additionally, we incorporate the audio as a complementary channel, extracting relevant information via a CNN applied to the spectrograms. With these techniques, we derive an ensemble of deep models, which, together, attains a high classification accuracy (mAP $93.23\%$) on the testing set and secured the first place in the challenge.