Xinggong Zhang

NI
h-index: 20
15 papers
249 citations
Novelty: 49%
AI Score: 50

15 Papers

NI | Feb 13 | Code
Artic: AI-oriented Real-time Communication for MLLM Video Assistant

Jiangkai Wu, Zhiyuan Ren, Junquan Zhong et al.

AI Video Assistant emerges as a new paradigm for Real-time Communication (RTC), where one peer is a Multimodal Large Language Model (MLLM) deployed in the cloud. This makes interaction between humans and AI more intuitive, akin to chatting with a real person. However, a fundamental mismatch exists between current RTC frameworks and AI Video Assistants, stemming from the drastic shift in Quality of Experience (QoE) and more challenging networks. Measurements on our production prototype also confirm that current RTC fails, causing latency spikes and accuracy drops. To address these challenges, we propose Artic, an AI-oriented RTC framework for MLLM Video Assistants, exploring the shift from "humans watching video" to "AI understanding video." Specifically, Artic proposes: (1) Response Capability-aware Adaptive Bitrate, which utilizes MLLM accuracy saturation to proactively cap bitrate, reserving bandwidth headroom to absorb future fluctuations for latency reduction; (2) Zero-overhead Context-aware Streaming, which allocates limited bitrate to regions most important for the response, maintaining accuracy even under ultra-low bitrates; and (3) Degraded Video Understanding Benchmark, the first benchmark evaluating how RTC-induced video degradation affects MLLM accuracy. Prototype experiments using real-world uplink traces show that, compared with existing methods, Artic improves accuracy by 15.12% and reduces latency by 135.31 ms. We will release the benchmark and code at https://github.com/pku-netvideo/DeViBench.
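
The bitrate-capping idea in (1) can be illustrated with a minimal sketch. It is not the Artic implementation: the function names, the kbps units, and the assumption that the saturation bitrate is already known as a single number are all hypothetical.

```python
def choose_bitrate(estimated_bandwidth_kbps: float, saturation_kbps: float) -> float:
    """Pick a sending bitrate capped at the (assumed known) accuracy-saturation point."""
    # Sending above the saturation point is assumed to add no MLLM accuracy,
    # so the cap applies even when the network could carry more.
    return min(estimated_bandwidth_kbps, saturation_kbps)


def headroom(estimated_bandwidth_kbps: float, bitrate_kbps: float) -> float:
    """Bandwidth left unused, available to absorb a sudden capacity drop."""
    return max(estimated_bandwidth_kbps - bitrate_kbps, 0.0)
```

With a 2000 kbps estimate and a hypothetical 400 kbps saturation point, the sender would use 400 kbps and keep 1600 kbps in reserve; when the estimate falls below the cap, the sender simply follows the estimate.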

NI | May 13, 2024 | Code
DoLLM: How Large Language Models Understanding Network Flow Data to Detect Carpet Bombing DDoS

Qingyang Li, Yihang Zhang, Zhidong Jia et al.

Whether and how Large Language Models (LLMs) can understand non-language network data and help detect unknown malicious flows is an interesting question. This paper takes Carpet Bombing as a case study and shows how to exploit LLMs' powerful capabilities in the networking area. Carpet Bombing is a new DDoS attack that has dramatically increased in recent years, significantly threatening network infrastructures. It targets multiple victim IPs within subnets, causing congestion on access links and disrupting network services for a vast number of users. Characterized by low rates and multiple vectors, these attacks challenge traditional DDoS defenses. We propose DoLLM, a DDoS detection model that uses open-source LLMs as its backbone. By reorganizing non-contextual network flows into Flow-Sequences and projecting them into the LLMs' semantic space as token embeddings, DoLLM leverages LLMs' contextual understanding to extract flow representations in the overall network context. The representations are used to improve DDoS detection performance. We evaluate DoLLM on the public dataset CIC-DDoS2019 and a real NetFlow trace from a top-3 countrywide ISP. The tests show that DoLLM has strong detection capabilities: its F1 score increases by up to 33.3% in zero-shot scenarios and by at least 20.6% on real ISP traces.

CR | Feb 10 | Code
MalMoE: Mixture-of-Experts Enhanced Encrypted Malicious Traffic Detection Under Graph Drift

Yunpeng Tan, Qingyang Li, Mingxin Yang et al.

Encryption has been commonly used in network traffic to secure transmission, but it also brings challenges for malicious traffic detection, due to the invisibility of the packet payload. Graph-based methods are emerging as promising solutions by leveraging multi-host interactions to promote detection accuracy. But most of them face a critical problem: Graph Drift, where the flow statistics or topological information of a graph change over time. To overcome these drawbacks, we propose a graph-assisted encrypted traffic detection system, MalMoE, which applies Mixture of Experts (MoE) to select the best expert model for drift-aware classification. Particularly, we design 1-hop-GNN-like expert models that handle different graph drifts by analyzing graphs with different features. Then, the redesigned gate model conducts expert selection according to the actual drift. MalMoE is trained with a stable two-stage training strategy with data augmentation, which effectively guides the gate on how to perform routing. Experiments on open-source, synthetic, and real-world datasets show that MalMoE can perform precise and real-time detection.

CV | Apr 14, 2025 | Code
MCBlock: Boosting Neural Radiance Field Training Speed by MCTS-based Dynamic-Resolution Ray Sampling

Yunpeng Tan, Junlin Hao, Jiangkai Wu et al.

Neural Radiance Field (NeRF) is widely known for high-fidelity novel view synthesis. However, even the state-of-the-art NeRF model, Gaussian Splatting, requires minutes for training, far from the real-time performance required by multimedia scenarios like telemedicine. One of the obstacles is its inefficient sampling, which is only partially addressed by existing works. Existing point-sampling algorithms uniformly sample simple-texture regions (easy to fit) and complex-texture regions (hard to fit), while existing ray-sampling algorithms sample all of these regions at the finest granularity (i.e. the pixel level), both wasting GPU training resources. In fact, regions with different texture intensities require different sampling granularities. To this end, we propose a novel dynamic-resolution ray-sampling algorithm, MCBlock, which employs Monte Carlo Tree Search (MCTS) to partition each training image into pixel blocks of different sizes for active block-wise training. Specifically, the trees are initialized according to the texture of training images to boost the initialization speed, and an expansion/pruning module dynamically optimizes the block partition. MCBlock is implemented in Nerfstudio, an open-source toolset, and achieves a training acceleration of up to 2.33x, surpassing other ray-sampling algorithms. We believe MCBlock can be applied to any cone-tracing NeRF model and contribute to the multimedia community.

CV | Aug 19, 2024
$R^2$-Mesh: Reinforcement Learning Powered Mesh Reconstruction via Geometry and Appearance Refinement

Haoyang Wang, Liming Liu, Xinggong Zhang

Mesh reconstruction from Neural Radiance Fields (NeRF) is widely used in 3D reconstruction and has been applied across numerous domains. However, existing methods typically rely solely on the given training set images, which restricts supervision to limited observations and makes it difficult to fully constrain geometry and appearance. Moreover, the contribution of each viewpoint for training is not uniform and changes dynamically during the optimization process, which can result in suboptimal guidance for both geometric refinement and rendering quality. To address these limitations, we propose $R^2$-Mesh, a reinforcement learning framework that combines NeRF-rendered pseudo-supervision with online viewpoint selection. Our key insight is to exploit NeRF's rendering ability to synthesize additional high-quality images, enriching training with diverse viewpoint information. To ensure that supervision focuses on the most beneficial perspectives, we introduce a UCB-based strategy with a geometry-aware reward, which dynamically balances exploration and exploitation to identify informative viewpoints throughout training. Within this framework, we jointly optimize SDF geometry and view-dependent appearance under differentiable rendering, while periodically refining meshes to capture fine geometric details. Experiments demonstrate that our method achieves competitive results in both geometric accuracy and rendering quality.
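
The UCB-based viewpoint selection can be illustrated with the textbook UCB1 rule. This is a sketch under assumptions, not the paper's method: the geometry-aware reward is replaced by an arbitrary scalar reward, and `select_viewpoint`, `update`, and the exploration constant `c` are hypothetical names.

```python
import math


def select_viewpoint(counts, total_reward, c=1.0):
    """Return the index of the candidate viewpoint with the highest UCB1 score."""
    # Try every viewpoint once before trusting any average reward.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    t = sum(counts)
    scores = [
        total_reward[i] / counts[i]                      # exploitation: mean reward
        + c * math.sqrt(2.0 * math.log(t) / counts[i])   # exploration bonus
        for i in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def update(counts, total_reward, i, reward):
    """Record the reward observed after training on viewpoint i."""
    counts[i] += 1
    total_reward[i] += reward
```

A training loop would call `select_viewpoint`, render and supervise from that viewpoint, compute a reward, and then call `update`. Rarely visited viewpoints keep a large bonus, which is how the rule balances exploration against exploitation.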

NI | Feb 3
Morphe: High-Fidelity Generative Video Streaming with Vision Foundation Model

Tianyi Gong, Zijian Cao, Zixing Zhang et al.

Video streaming is a fundamental Internet service, yet its quality still cannot be guaranteed, especially under poor network conditions such as bandwidth-constrained links and remote areas. Existing work follows two main directions: traditional pixel-codec streaming is approaching its limit and is hard to push further in compression, while emerging neural-enhanced or generative streaming usually falls short in latency and visual fidelity, hindering practical deployment. Inspired by the recent success of vision foundation models (VFMs), we strive to harness the powerful video understanding and processing capacities of VFMs to achieve generalization, high fidelity and loss resilience for real-time video streaming at an even higher compression rate. We present the first paradigm that enables VFM-based end-to-end generative video streaming towards this goal. Specifically, Morphe employs joint training of visual tokenizers and variable-resolution spatiotemporal optimization under simulated network constraints. Additionally, a robust streaming system is constructed that leverages intelligent packet dropping to resist real-world network perturbations. Extensive evaluation demonstrates that Morphe achieves comparable visual quality while saving 62.5% bandwidth compared to H.265, and accomplishes real-time, loss-resilient video delivery in challenging network environments, representing a milestone in VFM-enabled multimedia streaming solutions.

NI | Apr 19, 2024
Large Language Models for Networking: Workflow, Advances and Challenges

Chang Liu, Xiaohui Xie, Xinggong Zhang et al.

The networking field is characterized by its high complexity and rapid iteration, requiring extensive expertise to accomplish network tasks ranging from network design and configuration to diagnosis and security. The inherent complexity of these tasks, coupled with the ever-changing landscape of networking technologies and protocols, poses significant hurdles for traditional machine learning-based methods. These methods often struggle to generalize and automate complex tasks in networking, as they require extensive labeled data, domain-specific feature engineering, and frequent retraining to adapt to new scenarios. However, the recent emergence of large language models (LLMs) has sparked a new wave of possibilities in addressing these challenges. LLMs have demonstrated remarkable capabilities in natural language understanding, generation, and reasoning. These models, trained on extensive data, can benefit the networking domain. Some efforts have already explored the application of LLMs in the networking domain and revealed promising results. By reviewing recent advances, we present an abstract workflow to describe the fundamental process involved in applying LLMs to networking. We introduce the highlights of existing works by category and explain in detail how they operate at different stages of the workflow. Furthermore, we delve into the challenges encountered, discuss potential solutions, and outline future research prospects. We hope that this survey will provide insight for researchers and practitioners, promoting the development of this interdisciplinary research field.

NI | Jul 14, 2025
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI

Jiangkai Wu, Zhiyuan Ren, Liming Liu et al.

AI Video Chat emerges as a new paradigm for Real-time Communication (RTC), where one peer is not a human, but a Multimodal Large Language Model (MLLM). This makes interaction between humans and AI more intuitive, as if chatting face-to-face with a real person. However, this poses significant challenges to latency, because the MLLM inference takes up most of the response time, leaving very little time for video streaming. Due to network uncertainty and instability, transmission latency becomes a critical bottleneck preventing AI from being like a real person. To address this, we propose Artic, an AI-oriented Real-time Communication framework, exploring the network requirement shift from "humans watching video" to "AI understanding video". To reduce bitrate dramatically while maintaining MLLM accuracy, we propose Context-Aware Video Streaming that recognizes the importance of each video region for chat and allocates bitrate almost exclusively to chat-important regions. To avoid packet retransmission, we propose Loss-Resilient Adaptive Frame Rate that leverages previous frames to substitute for lost/delayed frames while avoiding bitrate waste. To evaluate the impact of video streaming quality on MLLM accuracy, we build the first benchmark, named Degraded Video Understanding Benchmark (DeViBench). Finally, we discuss some open questions and ongoing solutions for AI Video Chat.

CV | Apr 16, 2025
R-Meshfusion: Reinforcement Learning Powered Sparse-View Mesh Reconstruction with Diffusion Priors

Haoyang Wang, Liming Liu, Peiheng Wang et al.

Mesh reconstruction from multi-view images is a fundamental problem in computer vision, but its performance degrades significantly under sparse-view conditions, especially in unseen regions where no ground-truth observations are available. While recent advances in diffusion models have demonstrated strong capabilities in synthesizing novel views from limited inputs, their outputs often suffer from visual artifacts and lack 3D consistency, posing challenges for reliable mesh optimization. In this paper, we propose a novel framework that leverages diffusion models to enhance sparse-view mesh reconstruction in a principled and reliable manner. To address the instability of diffusion outputs, we propose a Consensus Diffusion Module that filters unreliable generations via interquartile range (IQR) analysis and performs variance-aware image fusion to produce robust pseudo-supervision. Building on this, we design an online reinforcement learning strategy based on the Upper Confidence Bound (UCB) to adaptively select the most informative viewpoints for enhancement, guided by diffusion loss. Finally, the fused images are used to jointly supervise a NeRF-based model alongside sparse-view ground truth, ensuring consistency across both geometry and appearance. Extensive experiments demonstrate that our method achieves significant improvements in both geometric quality and rendering quality.
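
The IQR-based filtering step can be sketched as follows. This is an illustration under assumptions rather than the paper's Consensus Diffusion Module: each generated image is assumed to carry one scalar score, images are flat lists of pixel values, and the fusion shown is only a per-pixel mean with its variance. The names `iqr_filter` and `fuse` and the multiplier `k=1.5` are hypothetical.

```python
import statistics


def iqr_filter(scores, k=1.5):
    """Return indices whose score lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, s in enumerate(scores) if lo <= s <= hi]


def fuse(images):
    """Per-pixel mean and population variance across equally sized images."""
    pixels = list(zip(*images))  # group values belonging to the same pixel
    mean = [statistics.fmean(p) for p in pixels]
    var = [statistics.pvariance(p) for p in pixels]
    return mean, var
```

One way to use the variance, again as an assumption, is to down-weight high-variance pixels when the fused image supervises the reconstruction, since the generations disagree there.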

CV | Apr 14, 2025
GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting

Junlin Hao, Peiheng Wang, Haoyang Wang et al.

Single-image 3D scene reconstruction presents significant challenges due to its inherently ill-posed nature and limited input constraints. Recent advances have explored two promising directions: multiview generative models that train on 3D consistent datasets but struggle with out-of-distribution generalization, and 3D scene inpainting and completion frameworks that suffer from cross-view inconsistency and suboptimal error handling, as they depend exclusively on depth data or 3D smoothness, which ultimately degrades output quality and computational performance. Building upon these approaches, we present GaussVideoDreamer, which advances generative multimedia approaches by bridging the gap between image, video, and 3D generation, integrating their strengths through two key innovations: (1) A progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence. (2) A 3D Gaussian Splatting consistency mask to guide the video diffusion with 3D consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and a progressive video inpainting strategy. Experimental results demonstrate that our approach achieves 32% higher LLaVA-IQA scores and at least 2x speedup compared to existing methods while maintaining robust performance across diverse scenes.

NI | Mar 24, 2025
Large Language Models powered Malicious Traffic Detection: Architecture, Opportunities and Case Study

Xinggong Zhang, Haotian Meng, Qingyang Li et al.

Malicious traffic detection is a pivotal technology for network security, used to identify abnormal network traffic and detect network attacks. Large Language Models (LLMs), trained on a vast corpus of text, have amassed remarkable capabilities in context understanding and commonsense knowledge. This has opened a new door for network attack detection. Researchers have already initiated discussions regarding the application of LLMs to specific cyber-security tasks. Unfortunately, there remains a lack of comprehensive analysis on harnessing LLMs for traffic detection, as well as of the opportunities and challenges. In this paper, we focus on unleashing the full potential of LLMs in malicious traffic detection. We present a holistic view of the architecture of LLM-powered malicious traffic detection, including the procedures of Pre-training, Fine-tuning, and Detection. In particular, by exploring the knowledge and capabilities of LLMs, we identify three distinct roles an LLM can play in traffic classification: Classifier, Encoder, and Predictor. For each of them, the modeling paradigm, opportunities and challenges are elaborated. Finally, we present our design for LLM-powered DDoS detection as a case study. The proposed framework attains accurate detection of carpet bombing DDoS by exploiting LLMs' capabilities in contextual mining. The evaluation shows its efficacy, exhibiting a nearly 35% improvement compared to existing systems.

NI | Mar 20, 2025
PromptMobile: Efficient Promptus for Low Bandwidth Mobile Video Streaming

Liming Liu, Jiangkai Wu, Haoyang Wang et al.

Traditional video compression algorithms exhibit significant quality degradation at extremely low bitrates. Promptus emerges as a new paradigm for video streaming, substantially cutting down the bandwidth essential for video streaming. However, Promptus is computationally intensive and cannot run in real time on mobile devices. This paper presents PromptMobile, an efficient acceleration framework tailored for on-device Promptus. Specifically, we propose (1) a two-stage efficient generation framework to reduce computational cost by 8.1x, (2) a fine-grained inter-frame caching to reduce redundant computations by 16.6%, and (3) system-level optimizations to further enhance efficiency. The evaluations demonstrate that compared with the original Promptus, PromptMobile achieves a 13.6x increase in image generation speed. Compared with other streaming methods, PromptMobile achieves an average LPIPS improvement of 0.016 (compared with H.265) and reduces severely distorted frames by 60% (compared with VQGAN).

MM | Nov 11, 2019
Pano: Optimizing 360° Video Streaming with a Better Understanding of Quality Perception

Yu Guan, Chengyuan Zheng, Zongming Guo et al.

Streaming 360° videos requires more bandwidth than non-360° videos. This is because current solutions assume that users perceive the quality of 360° videos in the same way they perceive the quality of non-360° videos. This means the bandwidth demand must be proportional to the size of the user's field of view. However, we found several quality-determining factors unique to 360° videos, which can help reduce the bandwidth demand. They include the moving speed of a user's viewpoint (center of the user's field of view), the recent change of video luminance, and the difference in depth-of-fields of visual objects around the viewpoint. This paper presents Pano, a 360° video streaming system that leverages the 360° video-specific factors. We make three contributions. (1) We build a new quality model for 360° videos that captures the impact of the 360° video-specific factors. (2) Pano proposes a variable-sized tiling scheme in order to strike a balance between the perceived quality and video encoding efficiency. (3) Pano proposes a new quality-adaptation logic that maximizes 360° video user-perceived quality and is readily deployable. Our evaluation (based on user study and trace analysis) shows that compared with state-of-the-art techniques, Pano can save 41-46% bandwidth without any drop in the perceived quality, or it can raise the perceived quality (user rating) by 25%-142% without using more bandwidth.

MM | May 15, 2019
Statistical Learning Based Congestion Control for Real-time Video Communication

Tongyu Dai, Xinggong Zhang, Yihang Zhang et al.

With the increasing demand for interactive video applications, adapting the video bit rate to avoid network congestion has become critical, since congestion results in self-inflicted delay and packet loss, which deteriorate the quality of real-time video service. Existing congestion control schemes struggle to simultaneously achieve low latency, high throughput, good adaptability and fair bandwidth allocation, mainly because of their hardwired control strategies and egocentric convergence objectives. To address these issues, we propose an end-to-end statistical learning based congestion control scheme, named Iris. By exploring the underlying principles of self-inflicted delay, we reveal that congestion delay is determined by sending rate, receiving rate and network status, which inspires us to control video bit rate using a statistical-learning congestion control model. The key idea of Iris is to force all flows to converge to the same queue load, and adjust the bit rate by the model. All flows keep a small and fixed number of packets queuing in the network, so both fair bandwidth allocation and low latency are achieved. In addition, the adjustment step size of the sending rate is updated by online learning, to better adapt to dynamically changing networks. We carried out extensive experiments to evaluate the performance of Iris, with implementations at the transport layer (UDP) and the application layer (QUIC). The testing environment includes an emulated network, the real-world Internet and commercial LTE networks. Compared against TCP flavors and state-of-the-art protocols, Iris achieves high bandwidth utilization, low latency and good fairness concurrently. In particular, over QUIC, Iris increases the video bitrate by up to 25% and PSNR by up to 1 dB.
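
The idea of holding a small, fixed number of packets in the queue can be illustrated with a delay-based rate update. This sketch is not the Iris model: it uses a simple backlog estimate from Little's law and a fixed step size, whereas the abstract states that Iris learns its step size online. All names, the target of 5 packets, and the step of 0.1 are hypothetical.

```python
def queued_packets(recv_rate_pps: float, rtt_s: float, base_rtt_s: float) -> float:
    """Estimate this flow's backlog in the bottleneck queue (rate x queuing delay)."""
    return recv_rate_pps * max(rtt_s - base_rtt_s, 0.0)


def next_rate(send_rate_pps: float, recv_rate_pps: float, rtt_s: float,
              base_rtt_s: float, target_packets: float = 5.0,
              step: float = 0.1) -> float:
    """Nudge the sending rate so the estimated backlog moves toward the target."""
    backlog = queued_packets(recv_rate_pps, rtt_s, base_rtt_s)
    # Positive error means too few packets are queued, so the rate goes up;
    # negative error means the queue is too long, so the rate goes down.
    error = target_packets - backlog
    return max(send_rate_pps + step * error / max(rtt_s, 1e-6), 1.0)
```

If every flow steers toward the same backlog target, each flow occupies a similar share of the queue, which is the intuition behind obtaining fairness and low delay together.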

MM | Apr 27, 2017
TFDASH: A Fairness, Stability, and Efficiency Aware Rate Control Approach for Multiple Clients over DASH

Chao Zhou, Chia-Wen Lin, Xinggong Zhang et al.

Dynamic adaptive streaming over HTTP (DASH) has recently been widely deployed in the Internet and adopted in the industry. It, however, does not impose any adaptation logic for selecting the quality of video fragments requested by clients and suffers from lackluster performance with respect to a number of desirable properties: efficiency, stability, and fairness when multiple players compete for a bottleneck link. In this paper, we propose a throughput-friendly DASH (TFDASH) rate control scheme for video streaming with multiple clients over DASH to balance the trade-offs among efficiency, stability, and fairness. The core idea behind guaranteeing fairness and high efficiency (bandwidth utilization) is to avoid OFF periods during the downloading process for all clients, i.e., the bandwidth is in perfect-subscription or over-subscription, with bandwidth utilization approaching 100%. We also propose a dual-threshold buffer model to solve the instability problem caused by the above idea. Finally, by integrating these components, we propose a probability-driven rate adaptation logic that takes into account several key factors that most influence visual quality, including buffer occupancy, video playback quality, and video bit-rate switching frequency and amplitude, to guarantee high-quality video streaming. Our experiments demonstrate the superior performance of the proposed method.