Zhiheng Xu

h-index18

6papers

166citations

Novelty62%

AI Score54

Ranked #27,077 of 201,326 authors (top 13%)#11,231 in CV (top 19%)

6 Papers

93.6CRMay 24Code

RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry

Bo Lv, Zhiheng Xu, KeDong Xiu et al.

Mixture-of-Experts (MoE) architectures have become an increasingly important paradigm for scaling Large Language Models (LLMs). As MoE models are increasingly deployed in real-world services, safety auditing becomes necessary to verify whether these models produce or facilitate harmful behaviors during operation. However, existing content-based auditing methods typically require access to user prompts, model inputs, or generated outputs, potentially exposing sensitive user information and creating a fundamental tension between LLM safety and user privacy. On the other hand, we observe that, in MoE models, sparse expert routing maps different inputs to activate different expert-execution patterns, producing measurable footprints in low-level GPU execution telemetry. Inspired by this observation, we propose RouteScan, a non-intrusive auditing framework for detecting harmful behaviors through GPU-level expert routing telemetry. Specifically, RouteScan utilizes the number of active GPU threads allocated to expert modules during the prefilling phase as a discriminative micro-architectural fingerprint, and builds a lightweight detection pipeline that isolates cross-domain invariant risk indicators for the precise identification of malicious prompts. Comprehensive evaluations on open-source MoE LLMs with distinct routing designs demonstrate that RouteScan achieves strong generalization, with an AUROC exceeding 0.93 on unseen harmful domains and 0.96 under novel jailbreak wrappers. Moreover, empirical inversion tests show that the collected expert routing telemetry provides limited information for prompt reconstruction, suggesting a practical privacy advantage over content-based auditing methods.

CVFeb 25

SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Guibin Chen, Dixuan Lin, Jiangping Yang et al.

SkyReels V4 is a unified multi modal video foundation model for joint video audio generation, inpainting, and editing. The model adopts a dual stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while sharing a powerful text encoder based on the Multimodal Large Language Models (MMLM). SkyReels V4 accepts rich multi modal instructions, including text, images, video clips, masks, and audio references. By combining the MMLMs multi modal instruction following capability with in context learning in the video branch MMDiT, the model can inject fine grained visual guidance under complex conditioning, while the audio branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel concatenation formulation that unifies a wide range of inpainting style tasks, such as image to video, video extension, and video editing under a single interface, and naturally extends to vision referenced inpainting and editing via multi modal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15 second duration, enabling high fidelity, multi shot, cinema level video generation with synchronized audio. To make such high resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: Joint generation of low resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multi-modal input, joint video audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.

CVApr 17, 2025Code

SkyReels-V2: Infinite-length Film Generative Model

Guibin Chen, Dixuan Lin, Jiangping Yang et al.

Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at https://github.com/SkyworkAI/SkyReels-V2.

53.1MMApr 22

Feedback-Driven Rate Control for Learned Video Compression

Zhiheng Xu, Xuerui Ma, Chunhua Peng et al.

End-to-end learned video compression has achieved strong rate-distortion performance, but rate control remains underexplored, especially in target-bitrate-driven and budget-constrained scenarios. Existing methods mainly rely on explicit R-D-lambda modeling or feed-forward prediction, which may lack stable online adjustment when video content varies dynamically. We propose a feedback-driven rate control framework for learned video compression. First, we build a single-model multi-rate coding interface on top of a DCVC-style framework, enabling continuous bitrate control through the rate-distortion parameter lambda. Then, a log-domain PI/PID closed-loop controller updates lambda online according to the error between the target bitrate and the entropy-estimated bitrate, achieving stable target bitrate tracking. To further improve frame-level bit allocation under budget constraints, we introduce a dual-branch GRU-based adjustment controller that refines the base control signal using budget-state features and causal coding statistics. Experiments on UVG and HEVC show that the proposed PI/PID controller achieves average bitrate errors of 2.88% and 2.95% on DCVC and DCVC-TCM, respectively. With the proposed adjustment controller, the method further achieves average BD-rate reductions of 5.69% and 4.49%, while reducing the average bitrate errors to 2.13% and 2.24%. These results show that the proposed method provides a practical solution for learned video compression with both controllable bitrate and improved rate-distortion performance.

SYApr 13, 2020

Automatic Generation of Hierarchical Contracts for Resilience in Cyber-Physical Systems

Zhiheng Xu, Daniel Jun Xian Ng, Arvind Easwaran

With the growing scale of Cyber-Physical Systems (CPSs), it is challenging to maintain their stability under all operating conditions. How to reduce the downtime and locate the failures becomes a core issue in system design. In this paper, we employ a hierarchical contract-based resilience framework to guarantee the stability of CPS. In this framework, we use Assume Guarantee (A-G) contracts to monitor the non-functional properties of individual components (e.g., power and latency), and hierarchically compose such contracts to deduce information about faults at the system level. The hierarchical contracts enable rapid fault detection in large-scale CPS. However, due to the vast number of components in CPS, manually designing numerous contracts and the hierarchy becomes challenging. To address this issue, we propose a technique to automatically decompose a root contract into multiple lower-level contracts depending on I/O dependencies between components. We then formulate a multi-objective optimization problem to search the optimal parameters of each lower-level contract. This enables automatic contract refinement taking into consideration the communication overhead between components. Finally, we use a case study from the manufacturing domain to experimentally demonstrate the benefits of the proposed framework.

MMMar 6, 2017

Learning from Experience: A Dynamic Closed-Loop QoE Optimization for Video Adaptation and Delivery

Imen Triki, Quanyan Zhu, Rachid Elazouzi et al.

The quality of experience (QoE) is known to be subjective and context-dependent. Identifying and calculating the factors that affect QoE is indeed a difficult task. Recently, a lot of effort has been devoted to estimate the users QoE in order to improve video delivery. In the literature, most of the QoE-driven optimization schemes that realize trade-offs among different quality metrics have been addressed under the assumption of homogenous populations. Nevertheless, people perceptions on a given video quality may not be the same, which makes the QoE optimization harder. This paper aims at taking a step further in order to address this limitation and meet users profiles. To do so, we propose a closed-loop control framework based on the users(subjective) feedbacks to learn the QoE function and optimize it at the same time. Our simulation results show that our system converges to a steady state, where the resulting QoE function noticeably improves the users feedbacks.