CL · Apr 10, 2025
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
ByteDance Seed, Jiaze Chen, Tiantian Fan et al. · bytedance
We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.
97.1 · RO · May 29
TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments
Zhiyu Huang, Yun Zhang, Johnson Liu et al.
Robots in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, despite semantic inference being inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency. Project website: https://ucla-mobility.github.io/TIC-VLA/
IV · Apr 8, 2022
Deep Learning-Based Intra Mode Derivation for Versatile Video Coding
Linwei Zhu, Yun Zhang, Na Li et al.
In intra coding, Rate Distortion Optimization (RDO) is performed to select the optimal intra mode from a pre-defined candidate list. Besides the residual signal, the optimal intra mode must also be encoded and transmitted to the decoder side, which consumes many coding bits. To further improve the performance of intra coding in Versatile Video Coding (VVC), this paper proposes an intelligent intra mode derivation method, termed Deep Learning based Intra Mode Derivation (DLIMD). Specifically, intra mode derivation is formulated as a multi-class classification task, which aims to skip intra mode signaling and thereby reduce coding bits. The architecture of DLIMD is developed to adapt to different quantization parameter settings and variable coding blocks, including non-square ones, all handled by a single trained model. Unlike existing deep learning based classification approaches, hand-crafted features are fed into the intra mode derivation network in addition to the features learned by the feature learning network. To compete with the traditional method, one additional binary flag is used in the video codec to indicate the scheme selected by RDO. Extensive experimental results show that the proposed method achieves average bit rate reductions of 2.28%, 1.74%, and 2.18% for the Y, U, and V components on the VVC test model platform, outperforming state-of-the-art works.
CV · Sep 25, 2023
Data Upcycling Knowledge Distillation for Image Super-Resolution
Yun Zhang, Wei Li, Simiao Li et al.
Knowledge distillation (KD) compresses deep neural networks by transferring task-related knowledge from cumbersome pre-trained teacher models to compact student models. However, current KD methods for super-resolution (SR) networks overlook a property of the SR task: the outputs of the teacher model are noisy approximations of the ground-truth distribution of high-quality images (GT), which obscures the teacher model's knowledge and limits the effect of KD. To utilize the teacher model beyond the GT upper bound, we present Data Upcycling Knowledge Distillation (DUKD), which transfers the teacher model's knowledge to the student model through upcycled in-domain data derived from the training data. In addition, we impose label consistency regularization on KD for SR through paired invertible augmentations to improve the student model's performance and robustness. Comprehensive experiments demonstrate that DUKD significantly outperforms prior methods on several SR tasks.
85.6 · CV · May 29
nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving
Zhiyu Huang, Johnson Liu, Rui Song et al.
Reasoning is essential for autonomous driving (AD) in long-tail scenarios, where vehicles must apply commonsense knowledge, understand spatial relations, infer agent interactions, and make safe decisions. However, existing AD datasets and benchmarks mainly target perception, prediction, or planning, and provide limited supervision for reasoning over realistic long-tail driving scenes. We introduce nuReasoning, a large-scale real-world dataset and benchmark for reasoning-centric AD. Following the lineage of nuScenes and nuPlan, nuReasoning advances real-world AD datasets and benchmarks toward reasoning in long-tail driving scenarios. The dataset contains 20,000 clips, each 20 seconds long, collected across multiple cities, with synchronized multi-camera images, LiDAR data, HD maps, object annotations, and human-verified reasoning annotations spanning Spatial Reasoning, Decision Reasoning, and Counterfactual Reasoning. Unlike prior datasets that focus primarily on visual question answering, nuReasoning supports both reasoning evaluation and planning evaluation, enabling a direct study of how reasoning supervision affects driving performance. Experiments show that fine-tuning VLMs on nuReasoning substantially improves driving-specific question answering, while incorporating reasoning supervision into VLA training improves planning performance even when textual reasoning outputs are disabled at inference time. These results establish nuReasoning as a foundation for evaluating and improving robust, interpretable, reasoning-driven AD systems in realistic long-tail settings.
CL · Jul 14, 2024
Multi-Granularity Semantic Revision for Large Language Model Distillation
Xiaoyu Liu, Yun Zhang, Wei Li et al.
Knowledge distillation plays a key role in compressing the Large Language Models (LLMs), which boosts a small-size student model under large teacher models' guidance. However, existing LLM distillation methods overly rely on student-generated outputs, which may introduce generation errors and misguide the distillation process. Moreover, the distillation loss functions introduced in previous art struggle to align the most informative part due to the complex distribution of LLMs' outputs. To address these problems, we propose a multi-granularity semantic revision method for LLM distillation. At the sequence level, we propose a sequence correction and re-generation (SCRG) strategy. SCRG first calculates the semantic cognitive difference between the teacher and student to detect the error token, then corrects it with the teacher-generated one, and re-generates the sequence to reduce generation errors and enhance generation diversity. At the token level, we design a distribution adaptive clipping Kullback-Leibler (DAC-KL) loss as the distillation objective function. DAC-KL loss exploits a learnable sub-network to adaptively extract semantically dense areas from the teacher's output, avoiding the interference of redundant information in the distillation process. Finally, at the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent, further enhancing the transfer of semantic information. Extensive experiments across different model families with parameters ranging from 0.1B to 13B demonstrate the superiority of our method compared to existing methods.
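As background for the token-level objective mentioned in this abstract, here is a minimal sketch of a plain token-level KL distillation loss. It is not the DAC-KL loss itself: the learnable clipping sub-network is omitted, and the function names, the temperature parameter, and the list-of-logits input layout are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """Forward KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def token_level_kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over token positions.

    Both arguments are lists of per-token logit lists (hypothetical layout).
    This is the unclipped baseline objective, not DAC-KL.
    """
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)
```

The loss is zero when the student reproduces the teacher's distribution at every position and grows as the two distributions diverge.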
CV · Apr 16, 2024 · Code
The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report
Bin Ren, Yawei Li, Nancy Mehta et al.
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum of the scores of all the sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered, and the score calculated from the FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered, and the score calculated from the parameter count was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. Together, these submissions gauge the state of the art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained models of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.
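The PSNR thresholds quoted above follow the standard definition, PSNR = 10 · log10(MAX² / MSE). The sketch below computes it for two images given as flat lists of pixel values. The function name and the 8-bit peak value of 255 are assumptions, and the challenge's own evaluation script may crop borders or convert color spaces differently.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are flat lists of pixel intensities (hypothetical layout);
    max_value is the peak intensity, assumed 255 for 8-bit data.
    """
    if not reference or len(reference) != len(restored):
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by about 3 dB.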
IV · Sep 18, 2024
LFIC-DRASC: Deep Light Field Image Compression Using Disentangled Representation and Asymmetrical Strip Convolution
Shiyu Feng, Yun Zhang, Linwei Zhu et al.
Light-field (LF) images are an emerging form of 4D light-ray data capable of realistically presenting the spatial and angular information of a 3D scene. However, the large data volume of LF images is the most challenging issue in real-time processing, transmission, and storage. In this paper, we propose an end-to-end deep LF Image Compression method Using Disentangled Representation and Asymmetrical Strip Convolution (LFIC-DRASC) to improve coding efficiency. First, we formulate the LF image compression problem as learning a disentangled LF representation network and an image encoding-decoding network. Second, we propose two novel feature extractors that leverage the structural prior of LF data by integrating features across different dimensions. Meanwhile, a disentangled LF representation network is proposed to enhance LF feature disentangling and decoupling. Third, we propose LFIC-DRASC for LF image compression, where two Asymmetrical Strip Convolution (ASC) operators, i.e., horizontal and vertical, are proposed to capture long-range correlation in the LF feature space. These two ASC operators can be combined with square convolution to further decouple LF features, which enhances the model's ability to represent intricate spatial relationships. Experimental results demonstrate that the proposed LFIC-DRASC achieves an average bit rate reduction of 20.5% compared with state-of-the-art methods.
IV · Mar 7, 2022
Remote blood pressure measurement via spatiotemporal mapping of a short-time facial video
Jialiang Zhuang, Bin Li, Yun Zhang et al.
Blood pressure (BP) monitoring is vital in daily healthcare, especially for cardiovascular diseases. However, BP values are mainly acquired through contact sensing, which is inconvenient and ill-suited to continuous BP measurement. Hence, we propose an efficient end-to-end network that estimates BP values from a facial video to achieve remote BP measurement in daily life. In this study, we first derive a spatial-temporal map of a short-time (~15 s) facial video. From the spatial-temporal map, we then predict the BP range with a dedicated blood pressure classifier and simultaneously calculate the specific value within each BP range with a blood pressure calculator. In addition, we develop an oversampling training strategy to handle the unbalanced data distribution problem. Finally, we train the proposed network on a private dataset, ASPD, and test it on the popular dataset MMSE-HR. The proposed network achieves a state-of-the-art MAE of 12.35 mmHg and 9.5 mmHg on systolic and diastolic BP measurements, respectively, which is better than recent works. We conclude that the proposed method has excellent potential for camera-based BP monitoring in real-world scenarios.
CV · Mar 11, 2024 · Code
Eliminating Warping Shakes for Unsupervised Online Video Stitching
Lang Nie, Chunyu Lin, Kang Liao et al.
In this paper, we retarget video stitching to an emerging issue, named warping shake, when extending image stitching to video stitching. It unveils the temporal instability of warped content in non-overlapping regions, despite image stitching having endeavored to preserve the natural structures. Therefore, in most cases, even if the input videos to be stitched are stable, the stitched video will inevitably cause undesired warping shakes and affect the visual experience. To eliminate the shakes, we propose StabStitch to simultaneously realize video stitching and video stabilization in a unified unsupervised learning framework. Starting from the camera paths in video stabilization, we first derive the expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Then a warp smoothing model is presented to optimize them with a comprehensive consideration regarding content alignment, trajectory smoothness, spatial consistency, and online collaboration. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Compared with existing stitching solutions, StabStitch exhibits significant superiority in scene robustness and inference speed in addition to stitching and stabilization performance, contributing to a robust and real-time online video stitching system. The code and dataset are available at https://github.com/nie-lang/StabStitch.
CV · Jun 25, 2022
FastBVP-Net: a lightweight pulse extraction network for measuring heart rhythm via facial videos
Jialiang Zhuang, Yuheng Chen, Yun Zhang et al.
Remote photoplethysmography (rPPG) is an attractive camera-based health monitoring method that can measure the heart rhythm from facial videos. Many well-established deep-learning models have been reported to measure heart rate (HR) and heart rate variability (HRV). However, most of these models usually require a 30-second facial video and enormous computational resources to obtain accurate and robust results, which significantly limits their applications in real-world scenarios. Hence, we propose a lightweight pulse extraction network, FastBVP-Net, to quickly measure heart rhythm via facial videos. The proposed FastBVP-Net uses a multi-frequency mode signal fusion (MMSF) mechanism to characterize the different modes of the raw signals in a decompose module and reconstruct the blood volume pulse (BVP) signal under a complex noise environment in a compose module. Meanwhile, an oversampling training scheme is used to solve the over-fitting problem caused by the limitations of the datasets. Then, the HR and HRV can be estimated based on the extracted BVP signals. Comprehensive experiments are conducted on the benchmark datasets to validate the proposed FastBVP-Net. For intra-dataset and cross-dataset testing, the proposed approach achieves better performance for HR and HRV estimation from 30-second facial videos with fewer computational burdens than the current well-established methods. Moreover, the proposed approach also achieves competitive results from 15-second facial videos. Therefore, the proposed FastBVP-Net has the potential to be applied in many real-world scenarios with shorter videos.
CV · Nov 12, 2024 · Code
Joint multi-dimensional dynamic attention and transformer for general image restoration
Huan Zhang, Xu Zhang, Nian Cai et al.
Outdoor images often suffer from severe degradation due to rain, haze, and noise, impairing image quality and challenging high-level tasks. Current image restoration methods struggle to handle complex degradation while maintaining efficiency. This paper introduces a novel image restoration architecture that combines multi-dimensional dynamic attention and self-attention within a U-Net framework. To leverage the global modeling capabilities of transformers and the local modeling capabilities of convolutions, we integrate sole CNNs in the encoder-decoder and sole transformers in the latent layer. Additionally, we design convolutional kernels with selected multi-dimensional dynamic attention to capture diverse degraded inputs efficiently. A transformer block with transposed self-attention further enhances global feature extraction while maintaining efficiency. Extensive experiments demonstrate that our method achieves a better balance between performance and computational complexity across five image restoration tasks: deraining, deblurring, denoising, dehazing, and enhancement, as well as superior performance for high-level vision tasks. The source code will be available at https://github.com/House-yuyu/MDDA-former.
MM · Aug 18, 2023
LSCD: A Large-Scale Screen Content Dataset for Video Compression
Yuhao Cheng, Siru Zhang, Yiqiang Yan et al.
Multimedia compression allows us to watch videos, view pictures, and hear sounds within a limited bandwidth, which has helped the internet flourish. Over the past decades, multimedia compression has achieved great success using hand-crafted features and systems. With the development of artificial intelligence and video compression, a large body of research has emerged on applying neural networks to video compression in order to replace these complicated systems. Beyond producing advanced algorithms, researchers have also extended compression to different content types, such as User Generated Content (UGC). With the rapid development of mobile devices, screen content videos have become an important part of multimedia data. However, we find that the community lacks a large-scale dataset for screen content video compression, which impedes the rapid development of the corresponding learning-based algorithms. To fill this gap and accelerate research on this special type of video, we propose the Large-scale Screen Content Dataset (LSCD), which contains 714 source sequences. We also provide an analysis of the proposed dataset to show some features of screen content videos, which will help researchers better understand how to explore new algorithms. Besides collecting and post-processing the data to organize the dataset, we provide a benchmark covering the performance of both traditional codecs and learning-based methods.
CV · May 11, 2025
Seed1.5-VL Technical Report
Dong Guo, Faming Wu, Feida Zhu et al. · pku
We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).
CV · Oct 28, 2025 · Code
MIC-BEV: Multi-Infrastructure Camera Bird's-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection
Yun Zhang, Zhaoliang Zheng, Johnson Liu et al.
Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
CV · Aug 27, 2025 · Code
Image Quality Assessment for Machines: Paradigm, Large-scale Database, and Models
Xiaoqi Wang, Yun Zhang, Weisi Lin
Machine vision systems (MVS) are intrinsically vulnerable to performance degradation under adverse visual conditions. To address this, we propose a machine-centric image quality assessment (MIQA) framework that quantifies the impact of image degradations on MVS performance. We establish an MIQA paradigm encompassing the end-to-end assessment workflow. To support this, we construct a machine-centric image quality database (MIQD-2.5M), comprising 2.5 million samples that capture distinctive degradation responses in both consistency and accuracy metrics, spanning 75 vision models, 250 degradation types, and three representative vision tasks. We further propose a region-aware MIQA (RA-MIQA) model to evaluate MVS visual quality through fine-grained spatial degradation analysis. Extensive experiments benchmark the proposed RA-MIQA against seven human visual system (HVS)-based IQA metrics and five retrained classical backbones. Results demonstrate RA-MIQA's superior performance in multiple dimensions, e.g., achieving SRCC gains of 13.56% on consistency and 13.37% on accuracy for image classification, while also revealing task-specific degradation sensitivities. Critically, HVS-based metrics prove inadequate for MVS quality prediction, while even specialized MIQA models struggle with background degradations, accuracy-oriented estimation, and subtle distortions. This study can advance MVS reliability and establish foundations for machine-centric image processing and optimization. The model and code are available at: https://github.com/XiaoqiWang/MIQA.
75.2 · CV · Apr 29
EnerGS: Energy-Based Gaussian Splatting with Partial Geometric Priors
Rui Song, Tianhui Cai, Markus Gross et al.
3D Gaussian Splatting (3DGS) has been widely adopted for scene reconstruction, where training inherently constitutes a highly coupled and non-convex optimization problem. Recent works commonly incorporate geometric priors, such as LiDAR measurements, either for initialization or as training constraints, with the goal of improving photometric reconstruction quality. However, in large-scale outdoor scenarios, such geometric supervision is often spatially incomplete and uneven, which limits its effectiveness as a reliable prior and can even be detrimental to the final reconstruction. To address this challenge, we model partially observable geometry as a continuous energy field induced by geometric evidence and propose EnerGS. Rather than enforcing geometry as a hard constraint, EnerGS provides a soft geometric guidance for the optimization of Gaussian primitives, allowing geometric information to steer the optimization process without directly restricting the solution space. Extensive experiments on large-scale outdoor scenes demonstrate that, under both sparse multi-view and monocular settings, EnerGS consistently improves photometric quality and geometric stability, while effectively mitigating overfitting during 3DGS training.
CV · Jun 16, 2025
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
Zewei Zhou, Tianhui Cai, Seth Z. Zhao et al.
Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.
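The core of GRPO, as it is commonly described, is that each sampled response is scored relative to the other responses drawn for the same prompt, so no learned value function is needed. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the epsilon term are assumptions, and AutoVLA's actual reward design is not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    rewards: scalar rewards for several responses to the same prompt
    (hypothetical values). eps guards against division by zero when
    every response receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat the group average get a positive advantage and the rest a negative one. A group with identical rewards contributes no learning signal, since every advantage is zero.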
LG · Dec 13, 2023
CBQ: Cross-Block Quantization for Large Language Models
Xin Ding, Xiaoyu Liu, Zhijun Tu et al.
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs. However, existing PTQ methods only focus on handling the outliers within one layer or one block, which ignores the dependency of blocks and leads to severe performance degradation in low-bit settings. In this paper, we propose CBQ, a cross-block reconstruction-based PTQ method for LLMs. CBQ employs a cross-block dependency using a homologous reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation. Furthermore, CBQ incorporates a coarse-to-fine preprocessing (CFP) strategy for suppressing weight and activation outliers, coupled with an adaptive LoRA-Rounding technique for precise weight quantization. These innovations enable CBQ to not only handle extreme outliers effectively but also improve overall quantization accuracy. Extensive experiments show that CBQ achieves superior low-bit quantization (W4A4, W4A8, W2A16) and outperforms existing state-of-the-art methods across various LLMs and datasets. Notably, CBQ quantizes the 4-bit LLAMA1-65B model within only 4.3 hours on a single GPU, achieving a commendable tradeoff between performance and quantization efficiency.
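For readers unfamiliar with notation such as W4A4, the sketch below shows the simplest weight quantizer that PTQ methods build on: symmetric, per-tensor, round-to-nearest quantization to a given bit width. It is a baseline illustration only and does not implement CBQ's cross-block reconstruction, CFP, or LoRA-Rounding. All names are hypothetical.

```python
def quantize_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantization of a flat list of weights.

    Returns integer codes in [-2**(bits-1), 2**(bits-1) - 1] and the
    scale needed to map them back to floating point.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit weights
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]
```

Because the scale is set by the largest magnitude, a single outlier weight stretches the grid and coarsens every other value, which is why outlier handling matters so much in low-bit settings.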
CV · Apr 20, 2025
NTIRE 2025 Challenge on Real-World Face Restoration: Methods and Results
Zheng Chen, Jingkai Wang, Kai Liu et al.
This paper provides a review of the NTIRE 2025 challenge on real-world face restoration, highlighting the proposed solutions and the resulting outcomes. The challenge focuses on generating natural, realistic outputs while maintaining identity consistency. Its goal is to advance state-of-the-art solutions for perceptual quality and realism, without imposing constraints on computational resources or training data. The track of the challenge evaluates performance using a weighted image quality assessment (IQA) score and employs the AdaFace model as an identity checker. The competition attracted 141 registrants, with 13 teams submitting valid models, and ultimately, 10 teams achieved a valid score in the final ranking. This collaborative effort advances the performance of real-world face restoration while offering an in-depth overview of the latest trends in the field.
LG · Nov 15, 2025
Understanding InfoNCE: Transition Probability Matrix Induced Feature Clustering
Ge Cheng, Shuo Wang, Yun Zhang
Contrastive learning has emerged as a cornerstone of unsupervised representation learning across vision, language, and graph domains, with InfoNCE as its dominant objective. Despite its empirical success, the theoretical underpinnings of InfoNCE remain limited. In this work, we introduce an explicit feature space to model augmented views of samples and a transition probability matrix to capture data augmentation dynamics. We demonstrate that InfoNCE optimizes the probability of two views sharing the same source toward a constant target defined by this matrix, naturally inducing feature clustering in the representation space. Leveraging this insight, we propose Scaled Convergence InfoNCE (SC-InfoNCE), a novel loss function that introduces a tunable convergence target to flexibly control feature similarity alignment. By scaling the target matrix, SC-InfoNCE enables flexible control over feature similarity alignment, allowing the training objective to better match the statistical properties of downstream data. Experiments on benchmark datasets, including image, graph, and text tasks, show that SC-InfoNCE consistently achieves strong and reliable performance across diverse domains.
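For reference, the standard InfoNCE objective that this abstract analyzes can be sketched as a cross-entropy over similarities, where the matching view is the positive and the other samples in the batch are negatives. This is the plain loss, not SC-InfoNCE. The function names, the cosine similarity, and the temperature of 0.1 are assumptions, and all feature vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss where view_b[i] is the positive for view_a[i].

    view_a and view_b are equal-length lists of feature vectors from two
    augmentations of the same samples (hypothetical layout).
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[k]) / temperature for k in range(n)]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive
    return total / n
```

The loss is small when each anchor is most similar to its own positive and large when a negative is closer. That pull toward same-source views is the behavior the abstract links to feature clustering.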
CV · Dec 2, 2024
V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction
Zewei Zhou, Hao Xiang, Zhaoliang Zheng et al.
Vehicle-to-everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single-vehicle systems. Prior work primarily focuses on single-frame cooperative perception, which fuses agents' information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus on the spatio-temporal fusion in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies - early, late, and intermediate (what to transmit), providing comprehensive benchmarks with 11 fusion models (how to fuse). Furthermore, we propose V2XPnP, a novel intermediate fusion framework within one-step communication for end-to-end perception and prediction. Our framework employs a unified Transformer-based architecture to effectively model complex spatio-temporal relationships across multiple agents, frames, and high-definition maps. Moreover, we introduce the V2XPnP Sequential Dataset that supports all V2X collaboration modes and addresses the limitations of existing real-world datasets, which are restricted to single-frame or single-mode cooperation. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods in both perception and prediction tasks.
CVAug 7, 2024
No-Reference Image Quality Assessment with Global-Local Progressive Integration and Semantic-Aligned Quality TransferXiaoqi Wang, Yun Zhang
Accurate measurement of image quality without reference signals remains a fundamental challenge in low-level visual perception applications. In this paper, we propose a global-local progressive integration model that addresses this challenge through three key contributions: 1) We develop a dual-measurement framework that combines vision Transformer (ViT)-based global feature extractor and convolutional neural networks (CNNs)-based local feature extractor to comprehensively capture and quantify image distortion characteristics at different granularities. 2) We propose a progressive feature integration scheme that utilizes multi-scale kernel configurations to align global and local features, and progressively aggregates them via an interactive stack of channel-wise self-attention and spatial interaction modules for multi-grained quality-aware representations. 3) We introduce a semantic-aligned quality transfer method that extends the training data by automatically labeling the quality scores of diverse image content with subjective opinion scores. Experimental results demonstrate that our model yields 5.04% and 5.40% improvements in Spearman's rank-order correlation coefficient (SROCC) for cross-authentic and cross-synthetic dataset generalization tests, respectively. Furthermore, the proposed semantic-aligned quality transfer further yields 2.26% and 13.23% performance gains in evaluations on single-synthetic and cross-synthetic datasets.
CLJun 2, 2025
Reasoning-Table: Exploring Reinforcement Learning for Table ReasoningFangyu Lei, Jinxiang Meng, Yiming Huang et al.
Table reasoning, encompassing tasks such as table question answering, fact verification, and text-to-SQL, requires precise understanding of structured tabular data, coupled with numerical computation and code manipulation for effective inference. Supervised fine-tuning (SFT) approaches have achieved notable success but often struggle with generalization and robustness due to biases inherent in imitative learning. We introduce Reasoning-Table, the first application of reinforcement learning (RL) to table reasoning, achieving state-of-the-art performance. Through rigorous data preprocessing, reward design, and tailored training strategies, our method leverages simple rule-based outcome rewards to outperform SFT across multiple benchmarks. Unified training across diverse tasks enables Reasoning-Table to emerge as a robust table reasoning large language model, surpassing larger proprietary models like Claude-3.7-Sonnet by 4.0% on table reasoning benchmarks. The approach also achieves excellent performance on text-to-SQL tasks, reaching 68.3% performance on the BIRD dev dataset with a 7B model. Further experiments demonstrate that Reasoning-Table enhances the model's generalization capabilities and robustness.
CVMay 8, 2025
StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional WarpsLang Nie, Chunyu Lin, Kang Liao et al.
We retarget video stitching to an emerging issue, named warping shake, which unveils the temporal content shakes induced by sequentially unsmooth warps when extending image stitching to video stitching. Even if the input videos are stable, the stitched video can inevitably cause undesired warping shakes and affect the visual experience. To address this issue, we propose StabStitch++, a novel video stitching framework to realize spatial stitching and temporal stabilization with unsupervised learning simultaneously. First, different from existing learning-based image stitching solutions that typically warp one image to align with another, we suppose a virtual midplane between original image planes and project them onto it. Concretely, we design a differentiable bidirectional decomposition module to disentangle the homography transformation and incorporate it into our spatial warp, evenly spreading alignment burdens and projective distortions across two views. Then, inspired by camera paths in video stabilization, we derive the mathematical expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Finally, a warp smoothing model is presented to produce stable stitched videos with a hybrid loss to simultaneously encourage content alignment, trajectory smoothness, and online collaboration. Compared with StabStitch that sacrifices alignment for stabilization, StabStitch++ makes no compromise and optimizes both of them simultaneously, especially in the online mode. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Experiments exhibit that StabStitch++ surpasses current solutions in stitching performance, robustness, and efficiency, offering compelling advancements in this field by building a real-time online video stitching system.
CVMar 13, 2025
V2X-ReaLO: An Open Online Framework and Dataset for Cooperative Perception in RealityHao Xiang, Zhaoliang Zheng, Xin Xia et al.
Cooperative perception enabled by Vehicle-to-Everything (V2X) communication holds significant promise for enhancing the perception capabilities of autonomous vehicles, allowing them to overcome occlusions and extend their field of view. However, existing research predominantly relies on simulated environments or static datasets, leaving the feasibility and effectiveness of V2X cooperative perception, especially for intermediate fusion, in real-world scenarios largely unexplored. In this work, we introduce V2X-ReaLO, an open online cooperative perception framework deployed on real vehicles and smart infrastructure that integrates early, late, and intermediate fusion methods within a unified pipeline and provides the first practical demonstration of online intermediate fusion's feasibility and performance under genuine real-world conditions. Additionally, we present an open benchmark dataset specifically designed to assess the performance of online cooperative perception systems. This new dataset extends the V2X-Real dataset to dynamic, synchronized ROS bags and provides 25,028 test frames with 6,850 annotated key frames in challenging urban scenarios. By enabling real-time assessments of perception accuracy and communication latency under dynamic conditions, V2X-ReaLO sets a new benchmark for advancing and optimizing cooperative perception systems in real-world applications. The codes and datasets will be released to further advance the field.
LGSep 19, 2025
Robust LLM Training Infrastructure at ByteDanceBorui Wan, Gaohong Liu, Zuquan Song et al.
The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of LLM training process and gives top priorities to detecting and recovering failures in a routine manner. Leveraging parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
CVDec 9, 2024
AgentAlign: Misalignment-Adapted Multi-Agent Perception for Resilient Inter-Agent Sensor CorrelationsZonglin Meng, Yun Zhang, Zhaoliang Zheng et al.
Cooperative perception has attracted wide attention given its capability to leverage shared information across connected automated vehicles (CAVs) and smart infrastructures to address sensing occlusion and range limitation issues. However, existing research overlooks the fragile multi-sensor correlations in multi-agent settings, as the heterogeneous agent sensor measurements are highly susceptible to environmental factors, leading to weakened inter-agent sensor interactions. The varying operational conditions and other real-world factors inevitably introduce multifactorial noise and consequently lead to multi-sensor misalignment, making the deployment of multi-agent multi-modality perception particularly challenging in the real world. In this paper, we propose AgentAlign, a real-world heterogeneous agent cross-modality feature alignment framework, to effectively address these multi-modality misalignment issues. Our method introduces a cross-modality feature alignment space (CFAS) and heterogeneous agent feature alignment (HAFA) mechanism to harmonize multi-modality features across various agents dynamically. Additionally, we present a novel V2XSet-Noise dataset that simulates realistic sensor imperfections under diverse environmental conditions, facilitating a systematic evaluation of our approach's robustness. Extensive experiments on the V2X-Real and V2XSet-Noise benchmarks demonstrate that our framework achieves state-of-the-art performance, underscoring its potential for real-world applications in cooperative autonomous driving. The controllable V2XSet-Noise dataset and generation pipeline will be released in the future.
CVApr 3, 2024
Knowledge Distillation with Multi-granularity Mixture of Priors for Image Super-ResolutionSimiao Li, Yun Zhang, Wei Li et al.
Knowledge distillation (KD) is a promising yet challenging model compression technique that transfers rich learning representations from a well-performing but cumbersome teacher model to a compact student model. Previous methods for image super-resolution (SR) mostly compare the feature maps directly or after standardizing the dimensions with basic algebraic operations (e.g. average, dot-product). However, the intrinsic semantic differences among feature maps are overlooked, which are caused by the disparate expressive capacity between the networks. This work presents MiPKD, a multi-granularity mixture of prior KD framework, to facilitate efficient SR model through the feature mixture in a unified latent space and stochastic network block mixture. Extensive experiments demonstrate the effectiveness of the proposed MiPKD method.
93.8IVApr 10
Multi-task Just Recognizable Difference for Video Coding for Machines: Database, Model, and Coding ApplicationJunqi Liu, Yun Zhang, Xiaoxia Huang et al.
Just Recognizable Difference (JRD) boosts coding efficiency for machine vision through visibility threshold modeling, but is currently limited to a single-task scenario. To address this issue, we propose a Multi-Task JRD (MT-JRD) dataset and an Attribute-assisted MT-JRD (AMT-JRD) model for Video Coding for Machines (VCM), enhancing both prediction accuracy and coding efficiency. First, we construct a dataset comprising 27,264 JRD annotations from machines, supporting three representative tasks including object detection, instance segmentation, and keypoint detection. Secondly, we propose the AMT-JRD prediction model, which integrates Generalized Feature Extraction Module (GFEM) and Specialized Feature Extraction Module (SFEM) to facilitate joint learning across multiple tasks. Thirdly, we innovatively incorporate object attribute information into object-wise JRD prediction through the Attribute Feature Fusion Module (AFFM), which introduces prior knowledge about object size and location. This design effectively compensates for the limitations of relying solely on image features and enhances the model's capacity to represent the perceptual mechanisms of machine vision. Finally, we apply the AMT-JRD model to VCM, where the accurately predicted JRDs are applied to reduce the coding bit rate while preserving accuracy across multiple machine vision tasks. Extensive experimental results demonstrate that AMT-JRD achieves precise and robust multi-task prediction with a mean absolute error of 3.781 and error variance of 5.332 across three tasks, outperforming the state-of-the-art single-task prediction model by 6.7% and 6.3%, respectively. Coding experiments further reveal that compared to the baseline VVC and JPEG, the AMT-JRD-based VCM improves an average of 3.861% and 7.886% Bjontegaard Delta-mean Average Precision (BD-mAP), respectively.
83.3IVApr 7
CI-ICM: Channel Importance-driven Learned Image Coding for MachinesYun Zhang, Junle Liu, Huan Zhang et al.
Traditional human vision-centric image compression methods are suboptimal for machine vision-centric compression due to different visual properties and feature characteristics. To address this problem, we propose a Channel Importance-driven learned Image Coding for Machines (CI-ICM), aiming to maximize the performance of machine vision tasks at a given bitrate constraint. First, we propose a Channel Importance Generation (CIG) module to quantify channel importance in machine vision and develop a channel order loss to rank channels in descending order. Second, to properly allocate bitrate among feature channels, we propose a Feature Channel Grouping and Scaling (FCGS) module that non-uniformly groups the feature channels based on their importance and adjusts the dynamic range of each group. Based on FCGS, we further propose a Channel Importance-based Context (CI-CTX) module to allocate bits among feature groups and to preserve higher fidelity in critical channels. Third, to adapt to multiple machine tasks, we propose a Task-Specific Channel Adaptation (TSCA) module to adaptively enhance features for multiple downstream machine tasks. Experimental results on the COCO2017 dataset show that the proposed CI-ICM achieves BD-mAP@50:95 gains of 16.25% in object detection and 13.72% in instance segmentation over the established baseline codec. Ablation studies validate the effectiveness of each contribution, and computation complexity analysis reveals the practicability of the CI-ICM. This work establishes feature channel optimization for machine vision-centric compression, bridging the gap between image coding and machine perception.
CLOct 25, 2025
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language FoundationLing Team, Ang Li, Ben Liu et al.
We introduce Ling 2.0, a reasoning-oriented language foundation series built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.
CVMar 25, 2025
Multiscale Feature Importance-based Bit Allocation for End-to-End Feature Coding for MachinesJunle Liu, Yun Zhang, Zixi Guo
Feature Coding for Machines (FCM) aims to compress intermediate features effectively for remote intelligent analytics, which is crucial for future intelligent visual applications. In this paper, we propose a Multiscale Feature Importance-based Bit Allocation (MFIBA) for end-to-end FCM. First, we find that the importance of features for machine vision tasks varies with the scales, object size, and image instances. Based on this finding, we propose a Multiscale Feature Importance Prediction (MFIP) module to predict the importance weight for each scale of features. Secondly, we propose a task loss-rate model to establish the relationship between the task accuracy losses of using compressed features and the bitrate of encoding these features. Finally, we develop a MFIBA for end-to-end FCM, which is able to assign coding bits of multiscale features more reasonably based on their importance. Experimental results demonstrate that when combined with a retained Efficient Learned Image Compression (ELIC), the proposed MFIBA achieves an average of 38.202% bitrate savings in object detection compared to the anchor ELIC. Moreover, the proposed MFIBA achieves an average of 17.212% and 36.492% feature bitrate savings for instance segmentation and keypoint detection, respectively. When the proposed MFIBA is applied to the LIC-TCM, it achieves an average of 18.103%, 19.866% and 19.597% bit rate savings on three machine vision tasks, respectively, which validates the proposed MFIBA has good generalizability and adaptability to different machine vision tasks and FCM base codecs.
LGOct 11, 2025
PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior ModelingGuilin Li, Yun Zhang, Xiuyuan Chen et al.
Large language models (LLMs) have shown that generative pretraining can distill vast world knowledge into compact token representations. While LLMs encapsulate extensive world knowledge, they remain limited in modeling the behavioral knowledge contained within user interaction histories. User behavior forms a distinct modality, where each action, defined by multi-dimensional attributes such as time, context, and transaction type, constitutes a behavioral token. Modeling these high-cardinality sequences is challenging, and discriminative models often falter under limited supervision. To bridge this gap, we extend generative pretraining to user behavior, learning transferable representations from unlabeled behavioral data analogous to how LLMs learn from text. We present PANTHER, a hybrid generative-discriminative framework that unifies user behavior pretraining and downstream adaptation, enabling large-scale sequential user representation learning and real-time inference. PANTHER introduces: (1) Structured Tokenization to compress multi-dimensional transaction attributes into an interpretable vocabulary; (2) Sequence Pattern Recognition Module (SPRM) for modeling periodic transaction motifs; (3) a Unified User-Profile Embedding that fuses static demographics with dynamic transaction histories; and (4) Real-time scalability enabled by offline caching of pretrained embeddings for millisecond-level inference. Fully deployed and operational online at WeChat Pay, PANTHER delivers a 25.6 percent boost in next-transaction prediction HitRate@1 and a 38.6 percent relative improvement in fraud detection recall over baselines. Cross-domain evaluations on public benchmarks show strong generalization, achieving up to 21 percent HitRate@1 gains over transformer baselines, establishing PANTHER as a scalable, high-performance framework for industrial sequential user behavior modeling.
CVAug 13, 2025
Hierarchical Graph Attention Network for No-Reference Omnidirectional Image Quality AssessmentHao Yang, Xu Zhang, Jiaqi Ma et al.
Current Omnidirectional Image Quality Assessment (OIQA) methods struggle to evaluate locally non-uniform distortions due to inadequate modeling of spatial variations in quality and ineffective feature representation capturing both local details and global context. To address this, we propose a graph neural network-based OIQA framework that explicitly models structural relationships between viewports to enhance perception of spatial distortion non-uniformity. Our approach employs Fibonacci sphere sampling to generate viewports with well-structured topology, representing each as a graph node. Multi-stage feature extraction networks then derive high-dimensional node representation. To holistically capture spatial dependencies, we integrate a Graph Attention Network (GAT) modeling fine-grained local distortion variations among adjacent viewports, and a graph transformer capturing long-range quality interactions across distant regions. Extensive experiments on two large-scale OIQA databases with complex spatial distortions demonstrate that our method significantly outperforms existing approaches, confirming its effectiveness and strong generalization capability.
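The Fibonacci sphere sampling mentioned above is a standard way to spread points almost evenly over a sphere. The sketch below shows the common golden-angle construction; the function name `fibonacci_sphere` and the choice of 20 viewports are assumptions for illustration, and the abstract does not state how many viewports the authors use or how graph edges are formed between them.

```python
import math


def fibonacci_sphere(num_points):
    """Return `num_points` near-uniform unit vectors (x, y, z) on the sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(num_points):
        # Heights are evenly spaced strictly inside (-1, 1), so each band has equal area.
        z = 1.0 - (2.0 * i + 1.0) / num_points
        ring_radius = math.sqrt(1.0 - z * z)
        # Rotating by the golden angle avoids points lining up along meridians.
        theta = golden_angle * i
        points.append((ring_radius * math.cos(theta), ring_radius * math.sin(theta), z))
    return points


viewport_centers = fibonacci_sphere(20)
```

Each returned vector can serve as a viewport center (one graph node per viewport); neighbouring nodes could then be linked by angular distance, which is one plausible way to obtain the structured topology the abstract refers to.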
CVJul 29, 2025
RelMap: Enhancing Online Map Construction with Class-Aware Spatial Relation and Semantic PriorsTianhui Cai, Yun Zhang, Zewei Zhou et al.
Online high-definition (HD) map construction is crucial for scaling autonomous driving systems. While Transformer-based methods have become prevalent in online HD map construction, most existing approaches overlook the inherent spatial dependencies and semantic relationships among map elements, which constrains their accuracy and generalization capabilities. To address this, we propose RelMap, an end-to-end framework that explicitly models both spatial relations and semantic priors to enhance online HD map construction. Specifically, we introduce a Class-aware Spatial Relation Prior, which explicitly encodes relative positional dependencies between map elements using a learnable class-aware relation encoder. Additionally, we design a Mixture-of-Experts-based Semantic Prior, which routes features to class-specific experts based on predicted class probabilities, refining instance feature decoding. RelMap is compatible with both single-frame and temporal perception backbones, achieving state-of-the-art performance on both the nuScenes and Argoverse 2 datasets.
CVJun 28, 2025
Deep Learning based Joint Geometry and Attribute Up-sampling for Large-Scale Colored Point CloudsYun Zhang, Feifan Chen, Na Li et al.
Colored point cloud, which includes geometry and attribute components, is a mainstream representation enabling realistic and immersive 3D applications. To generate large-scale and denser colored point clouds, we propose a deep learning-based Joint Geometry and Attribute Up-sampling (JGAU) method that learns to model both geometry and attribute patterns while leveraging spatial attribute correlations. First, we establish and release a large-scale dataset for colored point cloud up-sampling called SYSU-PCUD, containing 121 large-scale colored point clouds with diverse geometry and attribute complexities across six categories and four sampling rates. Second, to improve the quality of up-sampled point clouds, we propose a deep learning-based JGAU framework that jointly up-samples geometry and attributes. It consists of a geometry up-sampling network and an attribute up-sampling network, where the latter leverages the up-sampled auxiliary geometry to model neighborhood correlations of the attributes. Third, we propose two coarse attribute up-sampling methods, Geometric Distance Weighted Attribute Interpolation (GDWAI) and Deep Learning-based Attribute Interpolation (DLAI), to generate coarse up-sampled attributes for each point. Then, an attribute enhancement module is introduced to refine these up-sampled attributes and produce high-quality point clouds by further exploiting intrinsic attribute and geometry patterns. Extensive experiments show that the Peak Signal-to-Noise Ratio (PSNR) achieved by the proposed JGAU method is 33.90 decibels, 32.10 decibels, 31.10 decibels, and 30.39 decibels for up-sampling rates of 4 times, 8 times, 12 times, and 16 times, respectively. Compared to state-of-the-art methods, JGAU achieves average PSNR gains of 2.32 decibels, 2.47 decibels, 2.28 decibels, and 2.11 decibels at these four up-sampling rates, demonstrating significant improvement.
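As a rough illustration of distance-weighted attribute interpolation, the sketch below colors a new point from its nearest original points using generic inverse-distance weighting. This is an assumed stand-in, not the paper's GDWAI definition, which the abstract does not give; the function name `idw_color`, the neighbour count `k`, and the `eps` guard are all hypothetical.

```python
import math


def idw_color(query, points, colors, k=3, eps=1e-8):
    """Interpolate a color at `query` from its k nearest points, weighted by 1/distance."""
    # Keep the k source points closest to the query position.
    nearest = sorted(zip(points, colors), key=lambda pc: math.dist(query, pc[0]))[:k]
    # eps avoids division by zero when the query coincides with a source point.
    weights = [1.0 / (math.dist(query, p) + eps) for p, _ in nearest]
    total = sum(weights)
    channels = len(nearest[0][1])
    return tuple(
        sum(w * color[ch] for w, (_, color) in zip(weights, nearest)) / total
        for ch in range(channels)
    )


source_points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
source_colors = [(0.0, 0.0, 0.0), (255.0, 255.0, 255.0)]
midpoint_color = idw_color((1.0, 0.0, 0.0), source_points, source_colors, k=2)
coincident_color = idw_color((0.0, 0.0, 0.0), source_points, source_colors, k=2)
```

A point halfway between a black and a white neighbour comes out mid-gray, while a point that coincides with a source point essentially inherits that point's color. A coarse result of this kind is what a later enhancement stage would refine.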
LGJun 23, 2025
Online high-precision prediction method for injection molding product weight by integrating time series/non-time series mixed features and feature attention mechanismMaoyuan Li, Sihong Li, Guancheng Shen et al.
To address the challenges of untimely detection and online monitoring lag in injection molding quality anomalies, this study proposes a mixed feature attention-artificial neural network (MFA-ANN) model for high-precision online prediction of product weight. By integrating mechanism-based with data-driven analysis, the proposed architecture decouples time series data (e.g., melt flow dynamics, thermal profiles) from non-time series data (e.g., mold features, pressure settings), enabling hierarchical feature extraction. A self-attention mechanism is strategically embedded during cross-domain feature fusion to dynamically calibrate inter-modality feature weights, thereby emphasizing critical determinants of weight variability. The results demonstrate that the MFA-ANN model achieves an RMSE of 0.0281 with 0.5 g weight fluctuation tolerance, outperforming conventional benchmarks: a 25.1% accuracy improvement over non-time series ANN models, 23.0% over LSTM networks, 25.7% over SVR, and 15.6% over RF models. Ablation studies quantitatively validate the synergistic enhancement derived from the integration of mixed feature modeling (contributing 22.4%) and the attention mechanism (contributing 11.2%), significantly enhancing the model's adaptability to varying working conditions and its resistance to noise. Moreover, sensitivity analyses further reveal that data resolution significantly impacts prediction reliability: low-fidelity sensor inputs degrade performance by 23.8% RMSE compared to high-precision measurements. Overall, this study provides an efficient and reliable solution for the intelligent quality control of injection molding processes.
MMFeb 25, 2025
Deep-JGAC: End-to-End Deep Joint Geometry and Attribute Compression for Dense Colored Point CloudsYun Zhang, Zixi Guo, Linwei Zhu et al.
Colored point clouds have become a fundamental representation in the realm of 3D vision. Effective Point Cloud Compression (PCC) is urgently needed due to the huge amount of data. In this paper, we propose an end-to-end Deep Joint Geometry and Attribute point cloud Compression (Deep-JGAC) framework for dense colored point clouds, which exploits the correlation between the geometry and attribute for high compression efficiency. Firstly, we propose a flexible Deep-JGAC framework, where the geometry and attribute sub-encoders are compatible with either learning or non-learning based geometry and attribute encoders. Secondly, we propose an attribute-assisted deep geometry encoder that enhances the geometry latent representation with the help of the attribute, where the geometry decoding remains unchanged. Moreover, an Attribute Information Fusion Module (AIFM) is proposed to fuse attribute information in geometry coding. Thirdly, to solve the mismatch between the point cloud geometry and attribute caused by the geometry compression distortion, we present an optimized re-colorization module to attach the attribute to the geometrically distorted point cloud for attribute coding. It enhances the colorization and lowers the computational complexity. Extensive experimental results demonstrate that in terms of the geometry quality metric D1-PSNR, the proposed Deep-JGAC achieves an average of 82.96%, 36.46%, 41.72%, and 31.16% bit-rate reductions as compared to the state-of-the-art G-PCC, V-PCC, GRASP, and PCGCv2, respectively. In terms of perceptual joint quality metric MS-GraphSIM, the proposed Deep-JGAC achieves an average of 48.72%, 14.67%, and 57.14% bit-rate reductions compared to the G-PCC, V-PCC, and IT-DL-PCC, respectively. The encoding/decoding time costs are also reduced by 94.29%/24.70%, and 96.75%/91.02% on average as compared with the V-PCC and IT-DL-PCC.
IVNov 14, 2024
DT-JRD: Deep Transformer based Just Recognizable Difference Prediction Model for Video Coding for MachinesJunqi Liu, Yun Zhang, Xiaoqi Wang et al.
Just Recognizable Difference (JRD) represents the minimum visual difference that is detectable by machine vision, which can be exploited to promote machine vision oriented visual signal processing. In this paper, we propose a Deep Transformer based JRD (DT-JRD) prediction model for Video Coding for Machines (VCM), where the accurately predicted JRD can be used to reduce the coding bit rate while maintaining the accuracy of machine tasks. Firstly, we model the JRD prediction as a multi-class classification and propose a DT-JRD prediction model that integrates an improved embedding, a content and distortion feature extraction, a multi-class classification and a novel learning strategy. Secondly, inspired by the perception property that machine vision exhibits a similar response to distortions near JRD, we propose an asymptotic JRD loss by using Gaussian Distribution-based Soft Labels (GDSL), which significantly extends the number of training labels and relaxes classification boundaries. Finally, we propose a DT-JRD based VCM to reduce the coding bits while maintaining the accuracy of object detection. Extensive experimental results demonstrate that the mean absolute error of the predicted JRD by the DT-JRD is 5.574, outperforming the state-of-the-art JRD prediction model by 13.1%. Coding experiments show that, compared with VVC, the DT-JRD based VCM achieves an average of 29.58% bit rate reduction while maintaining the object detection accuracy.
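The idea of Gaussian distribution-based soft labels can be sketched as follows: instead of a one-hot target on the true JRD class, probability mass is spread over neighbouring classes with a Gaussian profile, and the classifier is trained against that softened target. This is only an assumed illustration; the class count of 64, the width `sigma`, and both function names are hypothetical, and the abstract does not give the exact form of the asymptotic JRD loss.

```python
import math


def gaussian_soft_labels(true_class, num_classes, sigma=1.5):
    """Spread a one-hot label into a normalized Gaussian centered on `true_class`."""
    weights = [
        math.exp(-((k - true_class) ** 2) / (2.0 * sigma ** 2))
        for k in range(num_classes)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(log_probs, soft_labels):
    """Cross-entropy between predicted log-probabilities and a soft target."""
    return -sum(q * lp for q, lp in zip(soft_labels, log_probs))


target = gaussian_soft_labels(true_class=30, num_classes=64)
uniform_log_probs = [math.log(1.0 / 64)] * 64
loss_for_uniform_prediction = soft_cross_entropy(uniform_log_probs, target)
```

Because classes adjacent to the true JRD receive non-zero target mass, a prediction that is off by one level is penalized less than one that is far away, which matches the relaxed classification boundaries described above.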
LGMar 14, 2024
Soften to Defend: Towards Adversarial Robustness via Self-Guided Label RefinementDaiwei Yu, Zhuorong Li, Lina Wei et al.
Adversarial training (AT) is currently one of the most effective ways to obtain the robustness of deep neural networks against adversarial attacks. However, most AT methods suffer from robust overfitting, i.e., a significant generalization gap in adversarial robustness between the training and testing curves. In this paper, we first identify a connection between robust overfitting and the excessive memorization of noisy labels in AT from a view of gradient norm. As such label noise is mainly caused by a distribution mismatch and improper label assignments, we are motivated to propose a label refinement approach for AT. Specifically, our Self-Guided Label Refinement first self-refines a more accurate and informative label distribution from over-confident hard labels, and then it calibrates the training by dynamically incorporating knowledge from self-distilled models into the current model and thus requiring no external teachers. Empirical results demonstrate that our method can simultaneously boost the standard accuracy and robust performance across multiple benchmark datasets, attack types, and architectures. In addition, we also provide a set of analyses from the perspectives of information theory to dive into our method and suggest the importance of soft labels for robust generalization.
LGJan 2, 2024
GEN: A Practical Alternative to Graph Transformers for Long-Range Graph ModelingShuo Wang, Ge Cheng, Yun Zhang
Message Passing Neural Networks (MPNNs) model local relations effectively but struggle to propagate information over long distances. Graph Transformers (GTs) mitigate this via global self-attention, yet their quadratic cost in the number of nodes limits scalability. We propose Graph Elimination Networks (GENs), an MPNN variant that approximates GT-like long-range modeling while maintaining high efficiency. GENs combine edge-wise and hop-wise self-attention in parallel; their multiplicative composition yields an attention kernel separable across edge and hop factors within a bounded K-hop receptive field. To enable hop-wise attention, we introduce the Graph Elimination Algorithm (GEA), which prevents double counting across hops, ensuring that each round injects the k-hop incremental contribution exactly once. Taking differences between successive rounds recovers the k-hop increment and yields disentangled multi-hop features as inputs for hop-wise attention. This preserves clearer structural distinctions across hop distances and enables more faithful modeling of pairwise dependencies between distant nodes within the K-hop neighborhood. On the Long-Range Graph Benchmark (LRGB), GENs outperform strong MPNN baselines by 7.7 and 6.0 percentage points (pp) on PascalVOC-SP and COCO-SP, and achieve performance on par with or better than state-of-the-art Graph Transformers. On OGBN-Products, GENs support full-batch training/inference, while sparse-attention baselines like Exphormer struggle with memory limits under comparable budgets, highlighting GENs as a practical alternative for large, sparse graphs.
CVFeb 20, 2022
Distortion-Aware Loop Filtering of Intra 360° Video Coding with Equirectangular ProjectionPingping Zhang, Xu Wang, Linwei Zhu et al.
In this paper, we propose a distortion-aware loop filtering model to improve the performance of intra coding for 360° videos projected in the equirectangular projection (ERP) format. To enable distortion awareness, our proposed module analyzes content characteristics based on a coding unit (CU) partition mask and processes them through partial convolution to activate the specified area. The feature recalibration module, which leverages cascaded residual channel-wise attention blocks (RCABs) to adjust the inter-channel and intra-channel features automatically, is capable of adapting to different quality levels. The perceptual geometry optimization, which combines weighted mean squared error (WMSE) with the perceptual loss, ensures high-quality reconstruction of both the local field of view (FoV) and the global image. Extensive experimental results show that our proposed scheme achieves significant bitrate savings compared with the anchor (HM + 360Lib), with average bitrate reductions of 8.9%, 9.0%, 7.1% and 7.4% in terms of PSNR, WPSNR, and PSNR of two viewports for the luminance component of 360° videos, respectively.
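A weighted MSE for ERP frames is commonly computed with per-row weights that shrink toward the poles, where ERP oversamples the sphere. The sketch below uses the cosine-of-latitude weighting familiar from WS-PSNR; whether the paper's WMSE uses exactly this weighting is an assumption, and the function names `erp_row_weights` and `weighted_mse` are hypothetical.

```python
import math


def erp_row_weights(height):
    """Cosine-of-latitude weight for each pixel row of an ERP frame."""
    return [
        math.cos((j + 0.5 - height / 2) * math.pi / height)
        for j in range(height)
    ]


def weighted_mse(ref, dist):
    """Weighted MSE between two equally sized frames given as lists of rows."""
    height, width = len(ref), len(ref[0])
    weights = erp_row_weights(height)
    num = 0.0
    den = 0.0
    for j in range(height):
        for i in range(width):
            diff = ref[j][i] - dist[j][i]
            num += weights[j] * diff * diff
            den += weights[j]
    return num / den


# Rows near the equator (middle of the frame) weigh more than rows near the poles.
weights = erp_row_weights(4)
```

Normalizing by the sum of the weights keeps the result on the same scale as an ordinary MSE.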
MMDec 23, 2021
A Survey on Perceptually Optimized Video CodingYun Zhang, Linwei Zhu, Gangyi Jiang et al.
To provide users with more realistic visual experiences, videos are trending toward Ultra High Definition (UHD), High Frame Rate (HFR), High Dynamic Range (HDR), Wide Color Gamut (WCG) and high clarity. However, the amount of video data increases exponentially, which requires highly efficient video compression for storage and network transmission. Perceptually optimized video coding aims to maximize compression efficiency by exploiting visual redundancies. In this paper, we present a broad and systematic survey on perceptually optimized video coding. Firstly, we present the problem formulation and framework of perceptually optimized video coding, which includes visual perception modelling, visual quality assessment and perceptual video coding optimization. Secondly, recent advances on visual factors, computational perceptual models and quality assessment models are presented. Thirdly, we review perceptual video coding optimizations from four key aspects: perceptually optimized bit allocation, rate-distortion optimization, transform and quantization, and filtering and enhancement. In each part, the problem formulation, workflow, recent advances, advantages and challenges are presented. Fourthly, the perceptual coding performance of the latest coding standards and tools is experimentally analyzed. Finally, challenging issues and future opportunities are identified.
CVDec 4, 2021
Sphere Face Model: A 3D Morphable Model with Hypersphere Manifold Latent SpaceDiqiong Jiang, Yiwei Jin, Fanglue Zhang et al.
3D Morphable Models (3DMMs) are generative models for face shape and appearance. However, the shape parameters of traditional 3DMMs follow a multivariate Gaussian distribution while the identity embeddings follow a hypersphere distribution, and this conflict makes it challenging for face reconstruction models to preserve faithfulness and shape consistency simultaneously. To address this issue, we propose the Sphere Face Model (SFM), a novel 3DMM for monocular face reconstruction, which can preserve both shape fidelity and identity consistency. The core of our SFM is the basis matrix, which can be used to reconstruct 3D face shapes; the basis matrix is learned with a two-stage training approach in which 3D and 2D training data are used in the first and second stages, respectively. To resolve the distribution mismatch, we design a novel loss that gives the shape parameters a hyperspherical latent space. Extensive experiments show that SFM has high representation ability and that its shape parameter space clusters well. Moreover, it produces faithful face shapes that remain consistent under challenging conditions in monocular face reconstruction.
ROAug 26, 2021
Design of a Flying Humanoid Robot Based on Thrust Vector ControlYuhang Li, Yuhao Zhou, Junbin Huang et al.
Achieving short-distance flight helps improve the efficiency of humanoid robots moving in complex environments (e.g., crossing large obstacles or reaching high places) for rapid emergency missions. This study proposes a design of a flying humanoid robot named Jet-HR2. The robot has 10 joints driven by brushless motors and harmonic drives for locomotion. To overcome the challenge of the stable-attitude takeoff in small thrust-to-weight conditions, the robot was designed based on the concept of thrust vectoring. The propulsion system consists of four ducted fans, that is, two fixed on the waist of the robot and the other two mounted on the feet, for thrust vector control. The thrust vector is controlled by adjusting the attitude of the foot during the flight. A simplified model and control strategies are proposed to solve the problem of attitude instability caused by mass errors and joint position errors during takeoff. The experimental results show that the robot's spin and dive behaviors during takeoff were effectively suppressed by controlling the thrust vector of the ducted fan on the foot. The robot successfully achieved takeoff at a thrust-to-weight ratio of 1.17 (17 kg / 20 kg) and maintained a stable attitude, reaching a takeoff height of over 1000 mm.
CVJul 31, 2021
A Dynamic 3D Spontaneous Micro-expression Database: Establishment and EvaluationFengping Wang, Jie Li, Siqi Zhang et al.
Micro-expressions are spontaneous, unconscious facial movements that reveal people's true inner emotions and have great potential in fields related to psychological testing. Since the face is a deformable 3D object, an expression causes spatial deformation of the face, but the available databases are limited to 2D videos and lack a description of the 3D spatial information of micro-expressions. Therefore, we propose a new micro-expression database containing 2D video sequences and 3D point cloud sequences. The database includes 373 micro-expression sequences, and these samples were classified using an objective method based on the facial action coding system, as well as a non-objective method that combines video contents and participants' self-reports. We extracted 2D and 3D features using local binary patterns on three orthogonal planes (LBP-TOP) and curvature algorithms, respectively, and evaluated the classification accuracies of these two features and their fusion with leave-one-subject-out (LOSO) and 10-fold cross-validation. Further, we applied various neural network algorithms to database classification; the results show that fusing 3D features improves classification accuracy over using only 2D features. The database offers original and cropped micro-expression samples, which will facilitate the exploration and research of 3D spatio-temporal features of micro-expressions.
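Leave-one-subject-out evaluation holds out all samples of one participant per fold, so a classifier is never tested on a subject it saw during training. A minimal sketch of the split follows; the function name `loso_splits` and the subject labels are hypothetical and not taken from the database.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids[i] is the subject who produced sample i.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Four samples from three subjects give three folds.
folds = list(loso_splits(["a", "b", "a", "c"]))
```

The number of folds equals the number of distinct subjects, which is why LOSO is usually reported alongside a fixed k-fold protocol such as 10-fold cross-validation.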
CVMar 10, 2020
DymSLAM: 4D Dynamic Scene Reconstruction Based on Geometrical Motion SegmentationChenjie Wang, Bin Luo, Yun Zhang et al.
Most SLAM algorithms assume that the scene is static. In practice, however, most scenes are dynamic and contain moving objects, so these methods are not suitable. In this paper, we introduce DymSLAM, a dynamic stereo visual SLAM system capable of reconstructing a 4D (3D + time) dynamic scene with rigid moving objects. The only input of DymSLAM is stereo video, and its output includes a dense map of the static environment, 3D models of the moving objects, and the trajectories of the camera and the moving objects. We first detect and match interest points between successive frames using traditional SLAM methods. The interest points belonging to different motion models (including ego-motion and the motion models of rigid moving objects) are then segmented by a multi-model fitting approach. Based on the interest points belonging to the ego-motion, we estimate the trajectory of the camera and reconstruct the static background. The interest points belonging to the motion models of rigid moving objects are then used to estimate their motion relative to the camera and to reconstruct the 3D models of the objects. We then transform the relative motion into the trajectories of the moving objects in the global reference frame. Finally, we fuse the 3D models of the moving objects into the 3D map of the environment by considering their motion trajectories to obtain a 4D (3D + time) sequence. DymSLAM obtains information about the dynamic objects instead of ignoring them and is suitable for unknown rigid objects. Hence, the proposed system allows the robot to be employed for high-level tasks, such as obstacle avoidance for dynamic objects. We conducted experiments in a real-world environment where both the camera and the objects were moving in a wide range.
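Converting an object's pose relative to the camera into the global frame amounts to composing rigid transforms: the camera pose in the world is multiplied by the object pose in the camera frame. The sketch below uses 4x4 homogeneous matrices in plain Python; the names `T_world_cam` and `T_cam_obj` and the example poses are assumptions for illustration, not DymSLAM's actual data structures.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def object_pose_in_world(T_world_cam, T_cam_obj):
    """Compose camera-in-world with object-in-camera to get object-in-world."""
    return matmul4(T_world_cam, T_cam_obj)


# Camera at (1, 0, 0) in the world, rotated 90 degrees about the z axis.
T_world_cam = [
    [0, -1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
# Object 2 units along the camera's x axis, with no relative rotation.
T_cam_obj = [
    [1, 0, 0, 2],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
T_world_obj = object_pose_in_world(T_world_cam, T_cam_obj)
object_position = [row[3] for row in T_world_obj[:3]]
```

Applying this composition at every frame turns a sequence of relative poses into an object trajectory in the global reference frame.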
MMJan 7, 2020
SUR-FeatNet: Predicting the Satisfied User Ratio Curve for Image Compression with Deep Feature LearningHanhe Lin, Vlad Hosu, Chunling Fan et al.
The satisfied user ratio (SUR) curve for a lossy image compression scheme, e.g., JPEG, characterizes the complementary cumulative distribution function of the just noticeable difference (JND), the smallest distortion level that can be perceived by a subject when a reference image is compared to a distorted one. A sequence of JNDs can be defined with a suitable successive choice of reference images. We propose the first deep learning approach to predict SUR curves. We show how to apply maximum likelihood estimation and the Anderson-Darling test to select a suitable parametric model for the distribution function. We then use deep feature learning to predict samples of the SUR curve and apply the method of least squares to fit the parametric model to the predicted samples. Our deep learning approach relies on a siamese convolutional neural network, transfer learning, and deep feature learning, using pairs consisting of a reference image and a compressed image for training. Experiments on the MCL-JCI dataset showed state-of-the-art performance. For example, the mean Bhattacharyya distances between the predicted and ground truth first, second, and third JND distributions were 0.0810, 0.0702, and 0.0522, respectively, and the corresponding average absolute differences of the peak signal-to-noise ratio at a median of the first JND distribution were 0.58, 0.69, and 0.58 dB. Further experiments on the JND-Pano dataset showed that the method transfers well to high resolution panoramic images viewed on head-mounted displays.
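Because the SUR curve is the complementary cumulative distribution function of the JND, an empirical SUR value at a given distortion level is the fraction of subjects whose JND lies above that level, i.e. who do not yet perceive a difference. The sketch below computes it from raw JND samples; the function name `empirical_sur` and the sample values are hypothetical and not taken from MCL-JCI.

```python
def empirical_sur(jnd_samples, distortion_level):
    """Fraction of subjects whose JND exceeds the given distortion level."""
    if not jnd_samples:
        raise ValueError("at least one JND sample is required")
    satisfied = sum(1 for jnd in jnd_samples if jnd > distortion_level)
    return satisfied / len(jnd_samples)


# Hypothetical per-subject JND levels on an arbitrary distortion scale.
jnd_samples = [10, 20, 30, 40]
sur_curve = [empirical_sur(jnd_samples, level) for level in (5, 25, 40)]
```

The curve is non-increasing in the distortion level: it starts at 1 when nobody can see a difference and falls to 0 once every subject has reached their JND.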
CVOct 15, 2019
Stereo-based Multi-motion Visual Odometry for Mobile RobotsQing Zhao, Bin Luo, Yun Zhang
With the development of computer vision, visual odometry is adopted by more and more mobile robots. However, we found that not only the robot's own pose but also the poses of other moving objects are crucial for the robot's decisions. In addition, visual odometry is greatly disturbed when a significant moving object appears. In this letter, a stereo-based multi-motion visual odometry method is proposed to acquire the poses of the robot and other moving objects. In order to obtain the poses simultaneously, a continuous motion segmentation module and a coordinate conversion module are added to the traditional visual odometry pipeline. As a result, the poses of all moving objects can be acquired and transformed into the ground coordinate system. The experimental results show that the proposed multi-motion visual odometry can effectively eliminate the influence of moving objects on the visual odometry, and achieves a root mean square error (RMSE) of 10 cm in position and 3° in orientation for each moving object.
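For reference, a position RMSE of this kind is typically computed from the per-frame Euclidean distance between estimated and ground-truth positions. The sketch below shows that calculation; the function names and the sample trajectory are hypothetical, and the abstract does not specify the exact evaluation protocol.

```python
import math


def position_errors(estimated, ground_truth):
    """Euclidean distance between matching estimated and true positions."""
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]


def rmse(errors):
    """Root mean square of a list of scalar errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical 2D positions in metres for two frames.
estimated = [(3.0, 0.0), (0.0, -3.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0)]
position_rmse = rmse(position_errors(estimated, ground_truth))
```

Orientation RMSE follows the same formula with angular differences in place of distances, provided the angles are wrapped to a consistent range first.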