Yixuan Xu

CV
h-index16
14papers
119citations
Novelty50%
AI Score51

14 Papers

CVMar 4, 2022
A Versatile Multi-View Framework for LiDAR-based 3D Object Detection with Guidance from Panoptic Segmentation

Hamidreza Fazlali, Yixuan Xu, Yuan Ren et al.

3D object detection using LiDAR data is an indispensable component for autonomous driving systems. Yet, only a few LiDAR-based 3D object detection methods leverage segmentation information to further guide the detection process. In this paper, we propose a novel multi-task framework that jointly performs 3D object detection and panoptic segmentation. In our method, the 3D object detection backbone in Bird's-Eye-View (BEV) plane is augmented by the injection of Range-View (RV) feature maps from the 3D panoptic segmentation backbone. This enables the detection backbone to leverage multi-view information to address the shortcomings of each projection view. Furthermore, foreground semantic information is incorporated to ease the detection task by highlighting the locations of each object class in the feature maps. Finally, a new center density heatmap generated based on the instance-level information further guides the detection backbone by suggesting possible box center locations for objects. Our method works with any BEV-based 3D object detection method, and as shown by extensive experiments on the nuScenes dataset, it provides significant performance gains. Notably, the proposed method based on a single-stage CenterPoint 3D object detection network achieved state-of-the-art performance on nuScenes 3D Detection Benchmark with 67.3 NDS.

CVFeb 2, 2023
AOP-Net: All-in-One Perception Network for Joint LiDAR-based 3D Object Detection and Panoptic Segmentation

Yixuan Xu, Hamidreza Fazlali, Yuan Ren et al.

LiDAR-based 3D object detection and panoptic segmentation are two crucial tasks in the perception systems of autonomous vehicles and robots. In this paper, we propose All-in-One Perception Network (AOP-Net), a LiDAR-based multi-task framework that combines 3D object detection and panoptic segmentation. In this method, a dual-task 3D backbone is developed to extract both panoptic- and detection-level features from the input LiDAR point cloud. Also, a new 2D backbone that intertwines Multi-Layer Perceptron (MLP) and convolution layers is designed to further improve the detection task performance. Finally, a novel module is proposed to guide the detection head by recovering useful features discarded during down-sampling operations in the 3D backbone. This module leverages estimated instance segmentation masks to recover detailed information from each candidate object. The AOP-Net achieves state-of-the-art performance for published works on the nuScenes benchmark for both 3D object detection and panoptic segmentation tasks. Also, experiments show that our method easily adapts to and significantly improves the performance of any BEV-based 3D object detection method.

CVJul 12, 2024
Uplifting Range-View-based 3D Semantic Segmentation in Real-Time with Multi-Sensor Fusion

Shiqi Tan, Hamidreza Fazlali, Yixuan Xu et al.

Range-View(RV)-based 3D point cloud segmentation is widely adopted due to its compact data form. However, RV-based methods fall short in providing robust segmentation for the occluded points and suffer from distortion of projected RGB images due to the sparse nature of 3D point clouds. To alleviate these problems, we propose a new LiDAR and Camera Range-view-based 3D point cloud semantic segmentation method (LaCRange). Specifically, a distortion-compensating knowledge distillation (DCKD) strategy is designed to remedy the adverse effect of RV projection of RGB images. Moreover, a context-based feature fusion module is introduced for robust and preservative sensor fusion. Finally, in order to address the limited resolution of RV and its insufficiency of 3D topology, a new point refinement scheme is devised for proper aggregation of features in 2D and augmentation of point features in 3D. We evaluated the proposed method on large-scale autonomous driving datasets \ie SemanticKITTI and nuScenes. In addition to being real-time, the proposed method achieves state-of-the-art results on nuScenes benchmark

86.8CVMay 20
Rethinking Cross-Layer Information Routing in Diffusion Transformers

Chao Xu, Maohua Li, Qirui Li et al.

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (\textsc{DAR}), a drop-in residual replacement that performs \emph{learnable, timestep-adaptive, and non-incremental} aggregation over the history of sublayer outputs. Moreover, the proposed \textsc{DAR} is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet $256\times256$, \textsc{DAR} improves SiT-XL/2 by $2.11$ FID ($7.56$ vs.\ $9.67$) and matches the baseline's converged quality with $8.75\times$ fewer training iterations. Stacked on top of REPA, it yields a $2\times$ training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, \textsc{DAR} can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

40.4CVMay 20
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

Kesong Li, Yixuan Xu, Kuo-kun Tseng et al.

Direct Preference Optimization (DPO) is successful for alignment in LLMs but still faces challenges in text-to-image generation. Existing studies are confined to denoising diffusion models while overlooking flow-matching, and suffer from an objective mismatch when applying discrete NLP-based DPO to regression-based generative tasks.\ In this paper, we derive a generalized DPO objective that covers both diffusion and flow-matching via a unified reverse-time SDE framework, and point out from a gradient perspective that the standard DPO objective is suboptimal for text-to-image generation. Consequently, we propose Linear-DPO, which replaces the aggressive sigmoid-based utility function with a sustained linear utility and incorporates an EMA-updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and flow-matching model (SD3-Medium) demonstrate the superiority of our approach over existing baselines.

CVApr 30, 2025Code
Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space

Hong Zhang, Zhongjie Duan, Xingjun Wang et al.

Unified multimodal generative models aim to integrate image understanding and generation abilities, offering significant advantages in harnessing multimodal corpora, particularly interleaved text-image data. However, existing unified models exhibit limitations in image synthesis quality, autoregressive error accumulation, and image editing capability. In this work, we propose Nexus-Gen, a novel architecture that unifies image understanding, generation, and editing tasks in a shared image embedding space. This shared space serves as a bridge for the autoregressive and diffusion models, which seamlessly integrates their complementary strengths in cross-modal modeling. To mitigate the severe error accumulation during autoregressive embedding prediction, we propose a novel prefilled autoregression strategy that aligns training-inference dynamics by prefilling input sequences with learnable embeddings. After multi-stage and multi-task training on our constructed large-scale dataset with 26.3 million samples, Nexus-Gen achieves state-of-the-art performance on the evaluation benchmarks spanning image understanding, generation and editing tasks. All models, datasets, and source codes are released in https://github.com/modelscope/Nexus-Gen to facilitate further advancements across the field.

86.3CVMay 11
Qwen-Image-2.0 Technical Report

Bing Zhao, Chenfei Wu, Deqing Li et al.

We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long text rendering, multilingual typography, high-resolution photorealism, robust instruction following, and efficient deployment, especially in text-rich and compositionally complex scenarios. Qwen-Image-2.0 addresses these challenges by coupling Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer for joint condition-target modeling, supported by large-scale data curation and a customized multi-stage training pipeline. This enables strong multimodal understanding while preserving flexible generation and editing capabilities. The model supports instructions of up to 1K tokens for generating text-rich content such as slides, posters, infographics, and comics, while significantly improving multilingual text fidelity and typography. It also enhances photorealistic generation with richer details, more realistic textures, and coherent lighting, and follows complex prompts more reliably across diverse styles. Extensive human evaluations show that Qwen-Image-2.0 substantially outperforms previous Qwen-Image models in both generation and editing, marking a step toward more general, reliable, and practical image generation foundation models.

CLSep 17, 2025
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang et al. · eth-zurich

We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.

CLMay 19, 2025
Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks

Yixuan Xu, Antoni-Joan Solergibert i Llaquet, Antoine Bosselut et al.

Large language models are known to memorize parts of their training data, posing risk of copyright violations. To systematically examine this risk, we pretrain language models (1B/3B/8B) from scratch on 83B tokens, mixing web-scale data with public domain books used to simulate copyrighted content at controlled frequencies at lengths at least ten times longer than prior work. We thereby identified the offset effect, a phenomenon characterized by two key findings: (1) verbatim memorization is most strongly triggered by short prefixes drawn from the beginning of the context window, with memorization decreasing counterintuitively as prefix length increases; and (2) a sharp decline in verbatim recall when prefix begins offset from the initial tokens of the context window. We attribute this to positional fragility: models rely disproportionately on the earliest tokens in their context window as retrieval anchors, making them sensitive to even slight shifts. We further observe that when the model fails to retrieve memorized content, it often produces degenerated text. Leveraging these findings, we show that shifting sensitive data deeper into the context window suppresses both extractable memorization and degeneration. Our results suggest that positional offset is a critical and previously overlooked axis for evaluating memorization risks, since prior work implicitly assumed uniformity by probing only from the beginning of training sequences.

CVJun 17, 2024
Video Frame Interpolation for Polarization via Swin-Transformer

Feng Huang, Xin Zhang, Yixuan Xu et al.

Video Frame Interpolation (VFI) has been extensively explored and demonstrated, yet its application to polarization remains largely unexplored. Due to the selective transmission of light by polarized filters, longer exposure times are typically required to ensure sufficient light intensity, which consequently lower the temporal sample rates. Furthermore, because polarization reflected by objects varies with shooting perspective, focusing solely on estimating pixel displacement is insufficient to accurately reconstruct the intermediate polarization. To tackle these challenges, this study proposes a multi-stage and multi-scale network called Swin-VFI based on the Swin-Transformer and introduces a tailored loss function to facilitate the network's understanding of polarization changes. To ensure the practicality of our proposed method, this study evaluates its interpolated frames in Shape from Polarization (SfP) and Human Shape Reconstruction tasks, comparing them with other state-of-the-art methods such as CAIN, FLAVR, and VFIT. Experimental results demonstrate our approach's superior reconstruction accuracy across all tasks.

LGDec 12, 2023
Analyze the Robustness of Classifiers under Label Noise

Cheng Zeng, Yixuan Xu, Jiaqi Tian

This study explores the robustness of label noise classifiers, aiming to enhance model resilience against noisy data in complex real-world scenarios. Label noise in supervised learning, characterized by erroneous or imprecise labels, significantly impairs model performance. This research focuses on the increasingly pertinent issue of label noise's impact on practical applications. Addressing the prevalent challenge of inaccurate training data labels, we integrate adversarial machine learning (AML) and importance reweighting techniques. Our approach involves employing convolutional neural networks (CNN) as the foundational model, with an emphasis on parameter adjustment for individual training samples. This strategy is designed to heighten the model's focus on samples critically influencing performance.

LGDec 3, 2023
Analyze the robustness of three NMF algorithms (Robust NMF with L1 norm, L2-1 norm NMF, L2 NMF)

Cheng Zeng, Jiaqi Tian, Yixuan Xu

Non-negative matrix factorization (NMF) and its variants have been widely employed in clustering and classification tasks (Long, & Jian , 2021). However, noises can seriously affect the results of our experiments. Our research is dedicated to investigating the noise robustness of non-negative matrix factorization (NMF) in the face of different types of noise. Specifically, we adopt three different NMF algorithms, namely L1 NMF, L2 NMF, and L21 NMF, and use the ORL and YaleB data sets to simulate a series of experiments with salt-and-pepper noise and Block-occlusion noise separately. In the experiment, we use a variety of evaluation indicators, including root mean square error (RMSE), accuracy (ACC), and normalized mutual information (NMI), to evaluate the performance of different NMF algorithms in noisy environments. Through these indicators, we quantify the resistance of NMF algorithms to noise and gain insights into their feasibility in practical applications.

CVNov 2, 2021
CPSeg: Cluster-free Panoptic Segmentation of 3D LiDAR Point Clouds

Enxu Li, Ryan Razani, Yixuan Xu et al.

A fast and accurate panoptic segmentation system for LiDAR point clouds is crucial for autonomous driving vehicles to understand the surrounding objects and scenes. Existing approaches usually rely on proposals or clustering to segment foreground instances. As a result, they struggle to achieve real-time performance. In this paper, we propose a novel real-time end-to-end panoptic segmentation network for LiDAR point clouds, called CPSeg. In particular, CPSeg comprises a shared encoder, a dual-decoder, and a cluster-free instance segmentation head, which is able to dynamically pillarize foreground points according to the learned embedding. Then, it acquires instance labels by finding connected pillars with a pairwise embedding comparison. Thus, the conventional proposal-based or clustering-based instance segmentation is transformed into a binary segmentation problem on the pairwise embedding comparison matrix. To help the network regress instance embedding, a fast and deterministic depth completion algorithm is proposed to calculate the surface normal of each point cloud in real-time. The proposed method is benchmarked on two large-scale autonomous driving datasets: SemanticKITTI and nuScenes. Notably, extensive experimental results show that CPSeg achieves state-of-the-art results among real-time approaches on both datasets.

CVAug 31, 2021
SMAC-Seg: LiDAR Panoptic Segmentation via Sparse Multi-directional Attention Clustering

Enxu Li, Ryan Razani, Yixuan Xu et al.

Panoptic segmentation aims to address semantic and instance segmentation simultaneously in a unified framework. However, an efficient solution of panoptic segmentation in applications like autonomous driving is still an open research problem. In this work, we propose a novel LiDAR-based panoptic system, called SMAC-Seg. We present a learnable sparse multi-directional attention clustering to segment multi-scale foreground instances. SMAC-Seg is a real-time clustering-based approach, which removes the complex proposal network to segment instances. Most existing clustering-based methods use the difference of the predicted and ground truth center offset as the only loss to supervise the instance centroid regression. However, this loss function only considers the centroid of the current object, but its relative position with respect to the neighbouring objects is not considered when learning to cluster. Thus, we propose to use a novel centroid-aware repel loss as an additional term to effectively supervise the network to differentiate each object cluster with its neighbours. Our experimental results show that SMAC-Seg achieves state-of-the-art performance among all real-time deployable networks on both large-scale public SemanticKITTI and nuScenes panoptic segmentation datasets.