Jingjing Zhao

CV
h-index: 16
Papers: 8
Citations: 75
Novelty: 42%
AI Score: 50

8 Papers

CV · Jan 29, 2023
LiDAR-CS Dataset: LiDAR Point Cloud Dataset with Cross-Sensors for 3D Object Detection

Jin Fang, Dingfu Zhou, Jingjing Zhao et al.

Over the past few years, there has been remarkable progress in research on 3D point clouds, and their use in autonomous driving scenarios has become widespread. However, deep learning methods rely heavily on annotated data and often face domain generalization issues. Unlike 2D images, whose domains usually pertain to the texture information present in them, the features derived from a 3D point cloud are affected by the distribution of the points. The lack of a 3D domain adaptation benchmark leads to the common practice of training a model on one benchmark (e.g., Waymo) and then assessing it on another dataset (e.g., KITTI). This setting results in two distinct domain gaps, scenarios and sensors, making it difficult to analyze and evaluate a method accurately. To tackle this problem, this paper presents the LiDAR Dataset with Cross Sensors (LiDAR-CS Dataset), which contains large-scale annotated LiDAR point clouds under six groups of different sensors but with the same corresponding scenarios, captured from a hybrid realistic LiDAR simulator. To our knowledge, the LiDAR-CS Dataset is the first dataset that addresses the sensor-related gaps in the domain of 3D object detection in real traffic. Furthermore, we evaluate and analyze the performance using various baseline detectors and demonstrate its potential applications. Project page: https://opendriving.github.io/lidar-cs.

CV · Nov 1, 2025
ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation

Panwang Pan, Jingjing Zhao, Yuchen Lin et al.

Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image, limiting controllability and applicability. We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency. To faithfully preserve subject consistency and textual information in synthesized videos, ID-Composer designs a hierarchical identity-preserving attention mechanism, which effectively aggregates features within and across subjects and modalities. To follow the semantics of user intention, we introduce semantic understanding via a pretrained vision-language model (VLM), leveraging the VLM's superior semantic understanding to provide fine-grained guidance and capture complex interactions between multiple subjects. Considering that the standard diffusion loss often fails to align critical concepts such as subject ID, we employ an online reinforcement learning phase to drive the overall training objective of ID-Composer into RLVR. Extensive experiments demonstrate that our model surpasses existing methods in identity preservation, temporal consistency, and video quality.

CV · Nov 1, 2025
Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

Panwang Pan, Chenguo Lin, Jingjing Zhao et al.

We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.

CV · Nov 1, 2025
HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation

Panwang Pan, Tingting Shen, Chenxin Li et al.

Recent advances in generative models have achieved high fidelity in 3D human reconstruction, yet their utility for specific tasks (e.g., human 3D segmentation) remains constrained. We propose HumanCrafter, a unified framework that enables the joint modeling of appearance and human-part semantics from a single image in a feed-forward manner. Specifically, we integrate human geometric priors in the reconstruction stage and self-supervised semantic priors in the segmentation stage. To address the scarcity of labeled 3D human datasets, we further develop an interactive annotation procedure for generating high-quality data-label pairs. Our pixel-aligned aggregation enables cross-task synergy, while the multi-task objective simultaneously optimizes texture modeling fidelity and semantic consistency. Extensive experiments demonstrate that HumanCrafter surpasses existing state-of-the-art methods in both 3D human-part segmentation and 3D human reconstruction from a single image.

72.4 · IT · May 9
Fluid Antennas Assisted RIS-NOMA Communication Networks

Xinwei Yue, He Geng, Jingjing Zhao et al.

This paper introduces a fluid antenna system (FAS) into reconfigurable intelligent surface (RIS) assisted non-orthogonal multiple access (NOMA) communication networks, where the non-orthogonal users are equipped with planar fluid antennas. Specifically, we formulate a sum rate maximization problem for FAS-RIS-NOMA networks, which jointly optimizes the fluid ports, the RIS deployment, and the phase shift matrix. To solve the resulting non-convex optimization problem involving highly coupled variables, an iterative algorithm based on alternating optimization is employed to decompose the original problem into three subproblems. Exhaustive search is employed for optimizing the fluid ports, particle swarm optimization is used for the RIS deployment, and semidefinite relaxation with successive convex approximation is adopted for optimizing the phase shift matrix. Finally, the simulation results show that: 1) compared with traditional antenna systems and orthogonal multiple access, the FAS-RIS-NOMA networks achieve higher system throughput under high signal-to-noise ratio conditions; and 2) by increasing the number of RIS elements and enlarging the FAS size, the sum rate of FAS-RIS-NOMA networks can be significantly enhanced.
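The alternating-optimization structure described in this abstract can be sketched as a block-coordinate loop. This is a toy illustration only: the `objective` function is a made-up stand-in for the sum rate, and all three blocks use plain grid search, whereas the paper uses exhaustive search for the ports, particle swarm optimization for the RIS deployment, and semidefinite relaxation with successive convex approximation for the phase shifts. Every name below is hypothetical.

```python
import math

def objective(port, pos, phase):
    """Toy stand-in for the sum rate (NOT the paper's objective)."""
    return -(port - 2) ** 2 - (pos - 0.5) ** 2 + math.cos(phase - pos)

def alternating_optimization(ports, pos_grid, phase_grid, max_iters=20, tol=1e-9):
    """Maximize the objective one block at a time, holding the others fixed."""
    # Start from the first candidate of each block, so the current value is
    # always among the candidates and the objective can never decrease.
    port, pos, phase = ports[0], pos_grid[0], phase_grid[0]
    history = [objective(port, pos, phase)]
    for _ in range(max_iters):
        # Block 1: exhaustive search over the discrete port index.
        port = max(ports, key=lambda p: objective(p, pos, phase))
        # Block 2: grid search over the position (stand-in for PSO).
        pos = max(pos_grid, key=lambda x: objective(port, x, phase))
        # Block 3: grid search over the phase (stand-in for SDR with SCA).
        phase = max(phase_grid, key=lambda t: objective(port, pos, t))
        history.append(objective(port, pos, phase))
        if history[-1] - history[-2] < tol:
            break  # no further improvement from a full sweep
    return (port, pos, phase), history

ports = list(range(5))
pos_grid = [i / 100 for i in range(101)]
phase_grid = [2 * math.pi * k / 360 for k in range(360)]
solution, history = alternating_optimization(ports, pos_grid, phase_grid)
```

Because each block update picks the best candidate from a set that contains the current value, the recorded objective values in `history` are non-decreasing, which is the property that makes this kind of decomposition attractive for coupled variables.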

21.9 · IT · Apr 28
Performance Analysis of Pinching Antenna Systems Enabled NOMA Communications

Xinwei Yue, Xinglun Tao, Jingjing Zhao et al.

Pinching antenna systems (PASS) offer advantages in flexible antenna reconfiguration, line-of-sight (LoS) creation, and scalability. To highlight the benefits of PASS, we investigate the integration of PASS into non-orthogonal multiple access (NOMA) networks. The locations of nodes are randomly distributed within a circular coverage region. The influence of LoS and non-line-of-sight (NLoS) propagation links from PASS to the non-orthogonal nodes is taken into consideration. To characterize the performance of PASS-NOMA, we derive expressions for the blockage probability and ergodic data rates of two nodes over LoS/NLoS fading channels. In light of these theoretical results, the infinite diversity gain is also analyzed for near node n under non-ideal successive interference cancellation (NISIC) and far node f over LoS links. The slopes of the ergodic data rate for node n with NISIC and for node f are equal to zero. In addition, the PASS-NOMA system throughput is evaluated in different transmission modes. The numerical results show that: 1) the blockage outage behaviors of PASS-NOMA networks under LoS/NLoS conditions outperform those of PASS-aided traditional orthogonal multiple access (OMA); 2) the employment of PASS enables larger ergodic data rates relative to PASS-OMA networks; and 3) as the number of pinching antennas rises, the performance of PASS-NOMA networks is enhanced over LoS/NLoS propagation links.

CL · Mar 13, 2024
PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency

Zhishuai Li, Xiang Wang, Jingjing Zhao et al.

Recent advancements in Text-to-SQL (Text2SQL) emphasize stimulating large language models (LLMs) through in-context learning, achieving significant results. Nevertheless, they face challenges when dealing with verbose database information and complex user intentions. This paper presents a two-stage framework to enhance the performance of current LLM-based natural language to SQL systems. We first introduce a novel prompt representation, called reference-enhanced representation, which includes schema information and randomly sampled cell values from tables to instruct LLMs in generating SQL queries. Then, in the first stage, question-SQL pairs are retrieved as few-shot demonstrations, prompting the LLM to generate a preliminary SQL (PreSQL). After that, the entities mentioned in PreSQL are parsed to conduct schema linking, which can significantly compact the useful information. In the second stage, with the linked schema, we simplify the prompt's schema information and instruct the LLM to produce the final SQL. Finally, as the post-refinement module, we propose using cross-consistency across different LLMs rather than self-consistency within a particular LLM. Our methods achieve new SOTA results on the Spider benchmark, with an execution accuracy of 87.6%.
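The cross-consistency step can be illustrated with a minimal sketch, assuming it amounts to a majority vote over the execution results of SQL produced by different LLMs. The abstract does not give the exact voting rule, so the helper names, the tie-breaking by insertion order, and the toy `singer` table are all hypothetical.

```python
import sqlite3
from collections import Counter

def execute(conn, sql):
    """Run a query; return an order-insensitive, hashable result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def cross_consistency_vote(conn, candidates):
    """Return the SQL whose execution result is shared by the most candidates.

    `candidates` maps a (hypothetical) model name to the SQL it produced.
    Queries that fail to execute are excluded from the vote.
    """
    results = {name: execute(conn, sql) for name, sql in candidates.items()}
    valid = [r for r in results.values() if r is not None]
    if not valid:
        return None
    winning_result, _ = Counter(valid).most_common(1)[0]
    # Return the first candidate (in insertion order) that produced the winner.
    for name, sql in candidates.items():
        if results[name] == winning_result:
            return sql
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)",
                 [("Ann", 30), ("Bo", 41), ("Cy", 25)])

candidates = {
    "model_a": "SELECT name FROM singer WHERE age > 28",
    "model_b": "SELECT name FROM singer WHERE age >= 29",
    "model_c": "SELECT name FROM singer WHERE age > 40",
}
chosen = cross_consistency_vote(conn, candidates)
```

Here `model_a` and `model_b` write different SQL that returns the same rows, so their shared result outvotes `model_c`. Voting on execution results rather than on SQL text is what lets syntactically different but equivalent queries agree.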

CV · Apr 9, 2025
EDIT: Enhancing Vision Transformers by Mitigating Attention Sink through an Encoder-Decoder Architecture

Wenfeng Feng, Hongxiang Wang, Jianlong Wang et al.

In this paper, we propose EDIT (Encoder-Decoder Image Transformer), a novel architecture designed to mitigate the attention sink phenomenon observed in Vision Transformer models. Attention sink occurs when an excessive amount of attention is allocated to the [CLS] token, distorting the model's ability to effectively process image patches. To address this, we introduce a layer-aligned encoder-decoder architecture, where the encoder utilizes self-attention to process image patches, while the decoder uses cross-attention to focus on the [CLS] token. Unlike the traditional encoder-decoder framework, where the decoder depends solely on high-level encoder representations, EDIT allows the decoder to extract information starting from low-level features, progressively refining the representation layer by layer. EDIT is naturally interpretable, as demonstrated through sequential attention maps that illustrate the refined, layer-by-layer focus on key image features. Experiments on ImageNet-1k and ImageNet-21k, along with transfer learning tasks, show that EDIT achieves consistent performance improvements over DeiT3 models. These results highlight the effectiveness of EDIT's design in addressing attention sink and improving visual feature extraction.
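One way to make the attention sink description concrete is to measure how much attention the patch tokens give to the [CLS] token. The sketch below is an assumption for illustration, not the paper's metric: it takes a matrix of raw attention scores with [CLS] at index 0, applies a row-wise softmax, and averages the weight that each patch query places on the [CLS] key.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cls_attention_share(scores, cls_index=0):
    """Average attention weight that patch queries assign to the [CLS] key.

    `scores[i][j]` is the raw score of query token i for key token j.
    The [CLS] query row itself is skipped.
    """
    patch_rows = [softmax(row) for i, row in enumerate(scores) if i != cls_index]
    return sum(row[cls_index] for row in patch_rows) / len(patch_rows)

# Uniform scores: each of the 3 tokens receives an equal share of attention.
uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]

# Sink-like scores: both patch queries score the [CLS] key far above the patches.
sink_scores = [
    [0.0, 0.0, 0.0],
    [8.0, 0.0, 0.0],
    [8.0, 0.0, 0.0],
]

uniform_share = cls_attention_share(uniform_scores)
sink_share = cls_attention_share(sink_scores)
```

A share close to 1 means the patches attend almost exclusively to [CLS], which is the situation the abstract describes as distorting patch processing.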