Jianing Zhang

CV
h-index10
15papers
13citations
Novelty60%
AI Score56

15 Papers

CVFeb 26
TrajTok: Learning Trajectory Tokens enables better Video Understanding

Chenhao Zheng, Jieyu Zhang, Jianing Zhang et al.

Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.

CVMay 8Code
How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

Zhiheng Li, Zongyang Ma, Jiaxian Chen et al.

The past year has seen over 20 open-source document parsing models, yet thefield still benchmarks almost exclusively on OmniDocBench, a 1,355-pagemanually annotated dataset whose top scores have saturated above 90%. Athree-stage audit pipeline we run on OmniDocBench screens its 21,353evaluator-scored blocks and confirms 2,580 errors (12.08%); combined with overa year of public availability, both annotation quality and contamination riskcall its rankings into question. To address these issues, we presentPureDocBench, a programmatically generated, source-traceable benchmark thatrenders document images from HTML/CSS and produces verifiable annotations fromthe same source, covering 10 domains, 66 subcategories, and 1,475 pages, eachin three versions: clean, digitally degraded, and real-degraded (4,425 imagestotal). Evaluating 40 models spanning pipeline specialists, end-to-endspecialists, and general-purpose VLMs, we find: (i) document parsing is farfrom solved: the best model scores only ~74 out of 100, with a 44.6-point gapbetween the strongest and weakest models; (ii) specialist parsers with <=4Bparameters rival or surpass general VLMs that are 5-100x larger, yet formularecognition remains a shared bottleneck where no model exceeds 67% whenaveraging the formula metric across all three tracks; (iii) general VLMs loseonly 0.99/8.52 Overall points under digital/real degradation versus 4.90/14.21for pipeline specialists, producing ranking reversals that make clean-onlyevaluation misleading for deployment. All data, code, and artifacts arepublicly released.

CVMay 4Code
RAFNet: Region-Aware Fusion Network for Pansharpening

Jianing Zhang, Zijian Zhou, Kai Sun

Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and high-resolution panchromatic (PAN) images. Although deep learning has advanced this field, mainstream frequency-based methods relying on standard scaled dot-product attention suffer from quadratic computational complexity and fail to exploit the inherent regional sparsity of remote sensing imagery. Furthermore, existing spatial enhancement strategies typically employ static convolution kernels, which struggle to adapt to the complex frequency and regional variations of PAN and MS images. To address these bottlenecks, we propose a Region-Aware Fusion (RAFNet) Network that synergistically models spatial and frequency information. Specifically, we design a Spatial Adaptive Refinement (SAR) module that leverages the discrete wavelet transform (DWT) for directional frequency separation and K-means clustering for regional partitioning, which enables the dynamic construction of region-specific adaptive convolution kernels, achieving spatially and frequency-adaptive feature enhancement. Moreover, we introduce a Clustered Frequency Aggregation (CFA) module based on a sparse attention mechanism guided by the semantic clusters, which executes a region-aware sparse attention strategy that drastically reduces computational redundancy while ensuring high-quality frequency feature extraction. In addition we integrated these modules into a progressive, multi-level spatial-frequency network architecture to facilitate robust interaction and accurate image reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that the proposed RAFNet significantly outperforms state-of-the-art pansharpening methods in both reduced- and full-resolution assessments. The code is available at https://github.com/PatrickNod/RAFNet.

CVApr 8Code
Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Zhiheng Li, Zongyang Ma, Yuntong Pan et al.

Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.

NIMar 19
AutORAN: LLM-driven Natural Language Programming for Agile xApp Development

Xin Li, Shiming Yu, Leming Shen et al.

Traditional RAN systems are closed and monolithic, stifling innovation. The openness and programmability enabled by Open Radio Access Network (O-RAN) are envisioned to revolutionize cellular networks with control-plane applications--xApps. The development of xApps (typically by third-party developers), however, remains time-consuming and cumbersome, often requiring months of manual coding and integration, which hinders the roll-out of new functionalities in practice. To lower the barrier of xApp development for both developers and network operators, we present AutORAN, the first LLM-driven natural language programming framework for agile xApps that automates the entire xApp development pipeline. In a nutshell, AutORAN turns high-level user intents into swiftly deployable xApps within minutes, eliminating the need for manual coding or testing. To this end, AutORAN builds a fully automated xApp generation pipeline, which integrates multiple functional modules (from user requirement elicitation, AI/ML function design and validation, to xApp synthesis and deployment). We design, implement, and comprehensively evaluate AutORAN on representative xApp tasks. Results show AutORAN-generated xApps can achieve similar or even better performance than the best known hand-crafted baselines. AutORAN drastically accelerates the xApp development cycle (from user intent elicitation to roll-out), streamlining O-RAN innovation.

CVMar 3
ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

Jiayi Zhu, Jianing Zhang, Yiying Yang et al.

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/ rear/ left/ right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

CVNov 24, 2025Code
3M-TI: High-Quality Mobile Thermal Imaging via Calibration-free Multi-Camera Cross-Modal Diffusion

Minchong Chen, Xiaoyun Yuan, Junzhe Wan et al.

The miniaturization of thermal sensors for mobile platforms inherently limits their spatial resolution and textural fidelity, leading to blurry and less informative images. Existing thermal super-resolution (SR) methods can be grouped into single-image and RGB-guided approaches: the former struggles to recover fine structures from limited information, while the latter relies on accurate and laborious cross-camera calibration, which hinders practical deployment and robustness. Here, we propose 3M-TI, a calibration-free Multi-camera cross-Modality diffusion framework for Mobile Thermal Imaging. At its core, 3M-TI integrates a cross-modal self-attention module (CSM) into the diffusion UNet, replacing the original self-attention layers to adaptively align thermal and RGB features throughout the denoising process, without requiring explicit camera calibration. This design enables the diffusion network to leverage its generative prior to enhance spatial resolution, structural fidelity, and texture detail in the super-resolved thermal images. Extensive evaluations on real-world mobile thermal cameras and public benchmarks validate our superior performance, achieving state-of-the-art results in both visual quality and quantitative metrics. More importantly, the thermal images enhanced by 3M-TI lead to substantial gains in critical downstream tasks like object detection and segmentation, underscoring its practical value for robust mobile thermal perception systems. More materials: https://github.com/work-submit/3MTI.

DCOct 24, 2025Code
Towards Straggler-Resilient Split Federated Learning: An Unbalanced Update Approach

Dandan Liang, Jianing Zhang, Evan Chen et al.

Split Federated Learning (SFL) enables scalable training on edge devices by combining the parallelism of Federated Learning (FL) with the computational offloading of Split Learning (SL). Despite its great success, SFL suffers significantly from the well-known straggler issue in distributed learning systems. This problem is exacerbated by the dependency between Split Server and clients: the Split Server side model update relies on receiving activations from clients. Such synchronization requirement introduces significant time latency, making straggler a critical bottleneck to the scalability and efficiency of the system. To mitigate this problem, we propose MU-SplitFed, a straggler-resilient SFL algorithm in zeroth-order optimization that decouples training progress from straggler delays via a simple yet effective unbalanced update mechanism. By enabling the server to perform $τ$ local updates per client round, MU-SplitFed achieves a convergence rate of $O(\sqrt{d/(τT)})$ for non-convex objectives, demonstrating a linear speedup of $τ$ in communication rounds. Experiments demonstrate that MU-SplitFed consistently outperforms baseline methods with the presence of stragglers and effectively mitigates their impact through adaptive tuning of $τ$. The code for this project is available at https://github.com/Johnny-Zip/MU-SplitFed.

CVApr 8
Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation

Jianing Zhang, Runan Li, Honglin Pang et al.

Deciphering ancient Chinese Oracle Bone Script (OBS) is a challenging task that offers insights into the beliefs, systems, and culture of the ancient era. Existing approaches treat decipherment as a closed-set image recognition problem, which fails to bridge the ``interpretation gap'': while individual characters are often unique and rare, they are composed of a limited set of recurring, pictographic components that carry transferable semantic meanings. To leverage this structural logic, we propose an agent-driven Vision-Language Model (VLM) framework that integrates a VLM for precise visual grounding with an LLM-based agent to automate a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference for linguistically accurate interpretation. To support this, we also introduce OB-Radix, an expert-annotated dataset providing structural and semantic data absent from prior corpora, comprising 1,022 character images (934 unique characters) and 1,853 fine-grained component images across 478 distinct components with verified explanations. By evaluating our system across three benchmarks of different tasks, we demonstrate that our framework yields more detailed and precise decipherments compared to baseline methods.

CVMay 2
CGFformer: Cluster-Guidance Frequency Transformer for Pansharpening

Zijian Zhou, Jianing Zhang, Kai Sun et al.

Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images. However, the current mainstream frequency-based pansharpening methods employ fixed frequency filters, which cannot precisely adapt to complex and spatially diversified frequency distributions in PAN and MS images. Furthermore, existing denoising strategies insufficiently exploit frequency components for denoising and struggle to suppress various noise types accurately. To address these challenges, we propose CGFformer, a cluster-guidance frequency Transformer that focuses on varying frequency distribution and interactions between frequency and spatial components. Specifically, we design an adaptive separation module that integrates local features and non-local information through K-means clustering, enabling more precise separation of high- and low-frequency components. Subsequently, we introduce a dual-stream refinement module combined with Transformer-based cross-attention to remove various noise, allowing the network to jointly suppress frequency-relevant and irrelevant disturbances. In addition, we develop a frequency-spatial fusion module designed to enhance detail and facilitate spatial-frequency interaction, ensuring more effective reconstruction of spatial structures in the fused results. Extensive experiments on multiple benchmark datasets demonstrate that the proposed CGFformer achieves notable improvements over existing pansharpening approaches.

CVJun 28, 2025
Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

Jianing Zhang, Jiayi Zhu, Feiyu Ji et al.

Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, a lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside \textit{pseudo} data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: https://dmdiff.github.io/.

LGFeb 27, 2025
DPZV: Elevating the Tradeoff between Privacy and Utility in Zeroth-Order Vertical Federated Learning

Jianing Zhang, Evan Chen, Chaoyue Liu et al.

Vertical Federated Learning (VFL) enables collaborative training with feature-partitioned data, yet remains vulnerable to privacy leakage through gradient transmissions. Standard differential privacy (DP) techniques such as DP-SGD are difficult to apply in this setting due to VFL's distributed nature and the high variance incurred by vector-valued noise. On the other hand, zeroth-order (ZO) optimization techniques can avoid explicit gradient exposure but lack formal privacy guarantees. In this work, we propose DPZV, the first ZO optimization framework for VFL that achieves tunable DP with performance guarantees. DPZV overcomes these limitations by injecting low-variance scalar noise at the server, enabling controllable privacy with reduced memory overhead. We conduct a comprehensive theoretical analysis showing that DPZV matches the convergence rate of first-order optimization methods while satisfying formal ($ε, δ$)-DP guarantees. Experiments on image and language benchmarks demonstrate that DPZV outperforms several baselines in terms of accuracy under a wide range of privacy constraints ($ε\le 10$), thereby elevating the privacy-utility tradeoff in VFL.

LGFeb 4, 2025
Gradient Correction in Federated Learning with Adaptive Optimization

Evan Chen, Shiqiang Wang, Jianing Zhang et al.

In federated learning (FL), model training performance is strongly impacted by data heterogeneity across clients. Client-drift compensation methods have recently emerged as a solution to this issue, introducing correction terms into local model updates. To date, these methods have only been considered under stochastic gradient descent (SGD)-based model training, while modern FL frameworks also employ adaptive optimizers (e.g., Adam) for improved convergence. However, due to the complex interplay between first and second moments found in most adaptive optimization methods, naively injecting correction terms can lead to performance degradation in heterogeneous settings. In this work, we propose {\tt FAdamGC}, the first algorithm to integrate drift compensation into adaptive federated optimization. The key idea of {\tt FAdamGC} is injecting a pre-estimation correction term that aligns with the moment structure of adaptive methods. We provide a rigorous convergence analysis of our algorithm under non-convex settings, showing that {\tt FAdamGC} results in better rate and milder assumptions than naively porting SGD-based correction algorithms into adaptive optimizers. Our experimental results demonstrate that {\tt FAdamGC} consistently outperform existing methods in total communication and computation cost across varying levels of data heterogeneity, showing the efficacy of correcting gradient information in federated adaptive optimization.

CVDec 8, 2024
GBR: Generative Bundle Refinement for High-fidelity Gaussian Splatting with Enhanced Mesh Reconstruction

Jianing Zhang, Yuchao Zheng, Ziwei Li et al.

Gaussian splatting has gained attention for its efficient representation and rendering of 3D scenes using continuous Gaussian primitives. However, it struggles with sparse-view inputs due to limited geometric and photometric information, causing ambiguities in depth, shape, and texture. we propose GBR: Generative Bundle Refinement, a method for high-fidelity Gaussian splatting and meshing using only 4-6 input views. GBR integrates a neural bundle adjustment module to enhance geometry accuracy and a generative depth refinement module to improve geometry fidelity. More specifically, the neural bundle adjustment module integrates a foundation network to produce initial 3D point maps and point matches from unposed images, followed by bundle adjustment optimization to improve multiview consistency and point cloud accuracy. The generative depth refinement module employs a diffusion-based strategy to enhance geometric details and fidelity while preserving the scale. Finally, for Gaussian splatting optimization, we propose a multimodal loss function incorporating depth and normal consistency, geometric regularization, and pseudo-view supervision, providing robust guidance under sparse-view conditions. Experiments on widely used datasets show that GBR significantly outperforms existing methods under sparse-view inputs. Additionally, GBR demonstrates the ability to reconstruct and render large-scale real-world scenes, such as the Pavilion of Prince Teng and the Great Wall, with remarkable details using only 6 views.

CVFeb 16, 2021
LEAD: LiDAR Extender for Autonomous Driving

Jianing Zhang, Wei Li, Honggang Gou et al.

3D perception using sensors under vehicle industrial standard is the rigid demand in autonomous driving. MEMS LiDAR emerges with irresistible trend due to its lower cost, more robust, and meeting the mass-production standards. However, it suffers small field of view (FoV), slowing down the step of its population. In this paper, we propose LEAD, i.e., LiDAR Extender for Autonomous Driving, to extend the MEMS LiDAR by coupled image w.r.t both FoV and range. We propose a multi-stage propagation strategy based on depth distributions and uncertainty map, which shows effective propagation ability. Moreover, our depth outpainting/propagation network follows a teacher-student training fashion, which transfers depth estimation ability to depth completion network without any scale error passed. To validate the LiDAR extension quality, we utilize a high-precise laser scanner to generate a ground-truth dataset. Quantitative and qualitative evaluations show that our scheme outperforms SOTAs with a large margin. We believe the proposed LEAD along with the dataset would benefit the community w.r.t depth researches.