Shizhen Zhao

CV
h-index9
17papers
211citations
Novelty59%
AI Score59

17 Papers

CVOct 11, 2022
Prototypical VoteNet for Few-Shot 3D Point Cloud Object Detection

Shizhen Zhao, Xiaojuan Qi

Most existing 3D point cloud object detection approaches heavily rely on large amounts of labeled training data. However, the labeling process is costly and time-consuming. This paper considers few-shot 3D point cloud object detection, where only a few annotated samples of novel classes are needed with abundant samples of base classes. To this end, we propose Prototypical VoteNet to recognize and localize novel instances, which incorporates two new modules: Prototypical Vote Module (PVM) and Prototypical Head Module (PHM). Specifically, as the 3D basic geometric structures can be shared among categories, PVM is designed to leverage class-agnostic geometric prototypes, which are learned from base classes, to refine local features of novel categories.Then PHM is proposed to utilize class prototypes to enhance the global feature of each object, facilitating subsequent object localization and classification, which is trained by the episodic training strategy. To evaluate the model in this new setting, we contribute two new benchmark datasets, FS-ScanNet and FS-SUNRGBD. We conduct extensive experiments to demonstrate the effectiveness of Prototypical VoteNet, and our proposed method shows significant and consistent improvements compared to baselines on two benchmark datasets.

CVSep 8, 2024
Can OOD Object Detectors Learn from Foundation Models?

Jiahui Liu, Xin Wen, Shizhen Zhao et al.

Out-of-distribution (OOD) object detection is a challenging task due to the absence of open-set OOD data. Inspired by recent advancements in text-to-image generative models, such as Stable Diffusion, we study the potential of generative models trained on large-scale open-set data to synthesize OOD samples, thereby enhancing OOD object detection. We introduce SyncOOD, a simple data curation method that capitalizes on the capabilities of large foundation models to automatically extract meaningful OOD data from text-to-image generative models. This offers the model access to open-world knowledge encapsulated within off-the-shelf foundation models. The synthetic OOD samples are then employed to augment the training of a lightweight, plug-and-play OOD detector, thus effectively optimizing the in-distribution (ID)/OOD decision boundaries. Extensive experiments across multiple benchmarks demonstrate that SyncOOD significantly outperforms existing methods, establishing new state-of-the-art performance with minimal synthetic data usage.

89.2DCMay 21
LiveR: Fine-Grained Elasticity via Live Reconfiguration for Model Training

Haoyuan Liu, Kairui Zhou, Shuyao Qi et al.

To reduce user costs and maximize cluster utilization, large model training increasingly leverages volatile but inexpensive GPU capacity, such as spot instances and reclaimable resources in shared clusters. Yet, capitalizing on these economic benefits requires jobs to adapt within the short warning windows that many such environments provide. Existing elastic training systems still treat reconfiguration as stop-and-restart: they externalize distributed state through checkpoints, rebuild the distributed runtime on a new topology, and restart training, turning each resize event into a storage-heavy recovery procedure that incurs substantial downtime from checkpoint I/O, process restart, CUDA initialization, and communicator setup. We present LiveR, a live reconfiguration runtime for elastic LLM training that replaces storage-backed restart with a live, bounded-memory handoff between mixed-parallel training worlds. While the current world continues training, LiveR asynchronously prepares the target world, bootstraps newly added workers in isolation to keep heavyweight initialization off the critical path, and streams model state directly over high-bandwidth interconnects while reshaping it online across tensor, pipeline, and data parallel dimensions. Once the target world is ready, LiveR performs a lightweight commit that switches training to the new configuration without stop-and-restart on the live path. We implement LiveR atop Megatron-LM and PyTorch and evaluate it end-to-end on a multi-node GPU cluster. Across diverse reconfiguration scenarios, LiveR reduces downtime from minutes to seconds, accelerates reconfiguration by 14$\times$-23$\times$ over checkpoint/restart baselines, incurs minimal steady-state overhead, and sustains up to 99% training goodput under volatile-resource conditions, making volatile low-cost GPU capacity far more practical for LLM training.

CVJun 2, 2025Code
Data Pruning by Information Maximization

Haoru Tan, Sitong Wu, Wei Huang et al.

In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. This enables InfoMax to seamlessly scale to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models. Code is available at https://github.com/hrtan/InfoMax.

83.8NIMar 31
Leaf-centric Logical Topology Design for OCS-based GPU Clusters

Xinchi Han, Weihao Jiang, Yingming Mao et al.

Recent years have witnessed the growing deployment of optical circuit switches (OCS) in commercial GPU clusters (e.g., Google A3 GPU cluster) optimized for machine learning (ML) workloads. Such clusters adopt a three-tier leaf-spine-OCS topology, servers attach to leaf-layer electronic packet switches (EPSes); these leaf switches aggregate into spine-layer EPSes to form a Pod; and multiple Pods are interconnected via core-layer OCSes. Unlike EPSes, OCSes only support circuit-based paths between directly connected spine switches, potentially inducing a phenomenon termed routing polarization, which refers to the scenario where the bandwidth requirements between specific pairs of Pods are unevenly fulfilled through links among different spine switches. The resulting imbalance induces traffic contention and bottlenecks on specific leaf-to-spine links, ultimately reducing ML training throughput. To mitigate this issue, we introduce a leaf-centric paradigm to ensure traffic originating from the same leaf switch is evenly distributed across multiple spine switches with balanced loads. Through rigorous theoretical analysis, we establish a sufficient condition for avoiding routing polarization and propose a corresponding logical topology design algorithm with polynomial-time complexity. Large-scale simulations validate up to 19.27% throughput improvement and a 99.16% reduction in logical topology computation overhead compared to Mixed Integer Programming (MIP)-based methods.

34.4NIMay 13
NeuroRisk: Physics-Informed Neural Optimization for Risk-Aware Traffic Engineering

Yingming Mao, Ximeng Liu, Jingyi Cheng et al.

In production Wide-Area Networks (WANs), correlated failures dominate availability losses, forcing operators to reserve large safety margins that leave substantial capacity underutilized. Achieving high utilization under strict availability targets therefore requires risk-aware Traffic Engineering (TE) over dozens to hundreds of probabilistic failure scenarios-yet solving this problem at operational timescales remains elusive. We demonstrate that existing risk-aware formulations can be unified under an embedded Sort-and-Select structure, exposing a fundamental trade-off between expressiveness and tractability: classical optimizers either restrict scenario selection for efficiency or incur prohibitive decomposition costs. While deep learning appears promising, prior Deep TE methods mainly target maximum link utilization and rely on scaling-based feasibility, which fundamentally breaks under explicit capacity constraints and scenario-dependent risk. We present NeuroRisk, a physics-informed deep unrolled optimizer that exploits the structure of Sort-and-Select. NeuroRisk enforces feasibility via gated edge-local reservations and represents scenario sets through permutation-invariant, gradient-aligned cues. Evaluations on production-style WANs show that NeuroRisk achieves small optimality gaps relative to the solver with orders of magnitude speedup $(10^2- 10^5 \times)$ on risk objectives, while outperforming neural baselines on nominal throughput.

CVDec 10, 2025
ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation

Shengchao Zhou, Jiehong Lin, Jiahui Liu et al.

Class-agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without semantic class reliance. Current methods struggle with generalization due to the scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed as ASSIST-3D, to synthesize proper data for model generalization enhancement. Specifically, ASSIST-3D features three key innovations, including 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM-guided spatial reasoning combined with depth-first search for reasonable object placements; and 3) Realistic Point Cloud Construction via multi-view RGB-D image rendering and fusion from the synthetic scenes, closely mimicking real-world sensor data acquisition. Experiments on ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST-3D-generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose-built pipeline over existing 3D scene synthesis approaches.

LGJan 29
SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning

Jianchang Su, Yifan Zhang, Shengkai Lin et al.

Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration. We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller, improving its policy online from reward-labeled interaction histories without gradient updates. SAIR combines Pareto-dominance reward shaping with a provable separation margin, surprisal-guided experience retrieval for context efficiency, and fine-grained GPU rate control via user-space CUDA interception. We provide regret analysis decomposing error into retrieval coverage and LLM selection components. On four ML serving pipelines under three workload patterns, SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97% (under GPU rate-control assumptions), with 86% bottleneck detection accuracy and no offline training.

84.5NIApr 30
Libra: Accelerating Socket I/O via Programmable Selective Data Copying

Kairui Zhou, Shengkai Lin, Wei Zhang et al.

Layer-7 (L7) proxies are critical to modern cloud-native systems, yet their performance is increasingly bottlenecked by copying entire payloads across the kernel-user boundary. Existing approaches reduce this overhead but typically sacrifice compatibility with unmodified POSIX applications, introduce new APIs, or require specialized environments. We show that, under conventional OS abstractions, fully eliminating kernel-user copies while preserving standard socket semantics for unmodified proxies is fundamentally impossible. This leads to a practical insight: in common L7 workloads, proxies inspect only small metadata (e.g., HTTP headers) for routing, while forwarding the bulk payload unchanged. Based on this insight, we present Libra, an OS-level selective-copy framework that copies only metadata to the user space and retains the bulk payload in the kernel for forwarding, reducing data movement without breaking compatibility. Libra uses eBPF to identify protocol-specific metadata boundaries and coordinate selective copy and payload reuse across receive and transmit paths, all without modifying the socket API. Implemented in Linux and evaluated with unmodified Nginx and HAProxy, Libra improves plaintext throughput by up to 4.2x and reduces P99 tail latency by over 90%. With hardware-offloaded kTLS, it boosts encrypted throughput by 2.0x and cuts tail latency by 65%.

73.0DCApr 21
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

Shuyao Qi, Haoyuan Liu, Shizhen Zhao

Fine-grained, per-micro-batch load balancing is essential for efficient Mixture-of-Experts (MoE) training, yet every prior dynamic scheduling scheme pays for it with extra communication that is hard to hide. Especially on modern bulk-transfer backends such as DeepEP. We make a simple but consequential observation: on the NVIDIA Hopper architecture the NVLink Copy Engine can move data between intra-node GPUs without consuming any SM cycles, effectively providing a nearly free communication channel that runs in parallel with compute kernels. FEPLB turns this idle hardware into a new parallel dimension for MoE load rebalancing. Its Two-Phase Dispatch first routes tokens across nodes via the standard EP backend, then redistributes dynamic-expert tokens and weights within the NVLink domain through the Copy Engine at nearly zero cost, while a lightweight CPU scheduler runs concurrently with static expert computation. Because FEPLB uses only Copy Engine and CPU that are orthogonal to those consumed by EP and PP, it coexists with existing parallel strategies without reconfiguration. On GLM-5's MoE layers (128 experts, no auxiliary loss, up to 16 H100 GPUs), FEPLB reduces the token straggler by 51-70% and the GEMM straggler by 50-68% with no measurable EP communication overhead. Its advantage grows with the EP degree: at EP=8, it achieves 2x lower token straggler than FasterMoE.

CVOct 21, 2024
Learning from Neighbors: Category Extrapolation for Long-Tail Learning

Shizhen Zhao, Xin Wen, Jiahui Liu et al.

Balancing training on long-tail data distributions remains a long-standing challenge in deep learning. While methods such as re-weighting and re-sampling help alleviate the imbalance issue, limited sample diversity continues to hinder models from learning robust and generalizable feature representations, particularly for tail classes. In contrast to existing methods, we offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance. In this paper, we investigate this phenomenon through both quantitative and qualitative studies, showing that increased granularity enhances the generalization of learned features in tail categories. Motivated by these findings, we propose a method to increase dataset granularity through category extrapolation. Specifically, we introduce open-set auxiliary classes that are visually similar to existing ones, aiming to enhance representation learning for both head and tail classes. This forms the core contribution and insight of our approach. To automate the curation of auxiliary data, we leverage large language models (LLMs) as knowledge bases to search for auxiliary categories and retrieve relevant images through web crawling. To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss that encourages the model to focus on class discrimination within the target dataset. During inference, the classifier weights for auxiliary categories are masked out, leaving only the target class weights for use. Extensive experiments and ablation studies on three standard long-tail benchmarks demonstrate the effectiveness of our approach, notably outperforming strong baseline methods that use the same amount of data. The code will be made publicly available.

CVOct 12, 2025
Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection

Shizhen Zhao, Jiahui Liu, Xin Wen et al.

Pre-trained vision foundation models have transformed many computer vision tasks. Despite their strong ability to learn discriminative and generalizable features crucial for out-of-distribution (OOD) detection, their impact on this task remains underexplored. Motivated by this gap, we systematically investigate representative vision foundation models for OOD detection. Our findings reveal that a pre-trained DINOv2 model, even without fine-tuning on in-domain (ID) data, naturally provides a highly discriminative feature space for OOD detection, achieving performance comparable to existing state-of-the-art methods without requiring complex designs. Beyond this, we explore how fine-tuning foundation models on in-domain (ID) data can enhance OOD detection. However, we observe that the performance of vision foundation models remains unsatisfactory in scenarios with a large semantic space. This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. Further, we introduce a Dynamic-$β$ Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments demonstrate the effectiveness of our approach, significantly outperforming baseline methods.

CVAug 8, 2025
Aligning Effective Tokens with Video Anomaly in Large Language Models

Yingxian Chen, Jiahui Liu, Ruidi Fan et al.

Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vison Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.

NIJun 30, 2025
Geminet: Learning the Duality-based Iterative Process for Lightweight Traffic Engineering in Changing Topologies

Ximeng Liu, Shizhen Zhao, Xinbing Wang

Recently, researchers have explored ML-based Traffic Engineering (TE), leveraging neural networks to solve TE problems traditionally addressed by optimization. However, existing ML-based TE schemes remain impractical: they either fail to handle topology changes or suffer from poor scalability due to excessive computational and memory overhead. To overcome these limitations, we propose Geminet, a lightweight and scalable ML-based TE framework that can handle changing topologies. Geminet is built upon two key insights: (i) a methodology that decouples neural networks from topology by learning an iterative gradient-descent-based adjustment process, as the update rule of gradient descent is topology-agnostic, relying only on a few gradient-related quantities; (ii) shifting optimization from path-level routing weights to edge-level dual variables, reducing memory consumption by leveraging the fact that edges are far fewer than paths. Evaluations on WAN and data center datasets show that Geminet significantly improves scalability. Its neural network size is only 0.04% to 7% of existing schemes, while handling topology variations as effectively as HARP, a state-of-the-art ML-based TE approach, without performance degradation. When trained on large-scale topologies, Geminet consumes under 10 GiB of memory, more than eight times less than the 80-plus GiB required by HARP, while achieving 5.45 times faster convergence speed, demonstrating its potential for large-scale deployment.

CVSep 11, 2020
Devil's in the Details: Aligning Visual Clues for Conditional Embedding in Person Re-Identification

Fufu Yu, Xinyang Jiang, Yifei Gong et al.

Although Person Re-Identification has made impressive progress, difficult cases like occlusion, change of view-pointand similar clothing still bring great challenges. Besides overall visual features, matching and comparing detailed information is also essential for tackling these challenges. This paper proposes two key recognition patterns to better utilize the detail information of pedestrian images, that most of the existing methods are unable to satisfy. Firstly, Visual Clue Alignment requires the model to select and align decisive regions pairs from two images for pair-wise comparison, while existing methods only align regions with predefined rules like high feature similarity or same semantic labels. Secondly, the Conditional Feature Embedding requires the overall feature of a query image to be dynamically adjusted based on the gallery image it matches, while most of the existing methods ignore the reference images. By introducing novel techniques including correspondence attention module and discrepancy-based GCN, we propose an end-to-end ReID method that integrates both patterns into a unified framework, called CACE-Net((C)lue(A)lignment and (C)onditional (E)mbedding). The experiments show that CACE-Net achieves state-of-the-art performance on three public datasets.

CVAug 16, 2020
Do Not Disturb Me: Person Re-identification Under the Interference of Other Pedestrians

Shizhen Zhao, Changxin Gao, Jun Zhang et al.

In the conventional person Re-ID setting, it is widely assumed that cropped person images are for each individual. However, in a crowded scene, off-shelf-detectors may generate bounding boxes involving multiple people, where the large proportion of background pedestrians or human occlusion exists. The representation extracted from such cropped images, which contain both the target and the interference pedestrians, might include distractive information. This will lead to wrong retrieval results. To address this problem, this paper presents a novel deep network termed Pedestrian-Interference Suppression Network (PISNet). PISNet leverages a Query-Guided Attention Block (QGAB) to enhance the feature of the target in the gallery, under the guidance of the query. Furthermore, the involving Guidance Reversed Attention Module and the Multi-Person Separation Loss promote QGAB to suppress the interference of other pedestrians. Our method is evaluated on two new pedestrian-interference datasets and the results show that the proposed method performs favorably against existing Re-ID methods.

CVJan 19, 2020
GTNet: Generative Transfer Network for Zero-Shot Object Detection

Shizhen Zhao, Changxin Gao, Yuanjie Shao et al.

We propose a Generative Transfer Network (GTNet) for zero shot object detection (ZSD). GTNet consists of an Object Detection Module and a Knowledge Transfer Module. The Object Detection Module can learn large-scale seen domain knowledge. The Knowledge Transfer Module leverages a feature synthesizer to generate unseen class features, which are applied to train a new classification layer for the Object Detection Module. In order to synthesize features for each unseen class with both the intra-class variance and the IoU variance, we design an IoU-Aware Generative Adversarial Network (IoUGAN) as the feature synthesizer, which can be easily integrated into GTNet. Specifically, IoUGAN consists of three unit models: Class Feature Generating Unit (CFU), Foreground Feature Generating Unit (FFU), and Background Feature Generating Unit (BFU). CFU generates unseen features with the intra-class variance conditioned on the class semantic embeddings. FFU and BFU add the IoU variance to the results of CFU, yielding class-specific foreground and background features, respectively. We evaluate our method on three public datasets and the results demonstrate that our method performs favorably against the state-of-the-art ZSD approaches.