Chen Bai

CV
h-index11
10papers
59citations
Novelty60%
AI Score58

10 Papers

CVJul 6, 2022Code
3DG-STFM: 3D Geometric Guided Student-Teacher Feature Matching

Runyu Mao, Chen Bai, Yatong An et al.

We tackle the essential task of finding dense visual correspondences between a pair of images. This is a challenging problem due to various factors such as poor texture, repetitive patterns, illumination variation, and motion blur in practical scenarios. In contrast to methods that use dense correspondence ground-truths as direct supervision for local feature matching training, we train 3DG-STFM: a multi-modal matching model (Teacher) to enforce the depth consistency under 3D dense correspondence supervision and transfer the knowledge to 2D unimodal matching model (Student). Both teacher and student models consist of two transformer-based matching modules that obtain dense correspondences in a coarse-to-fine manner. The teacher model guides the student model to learn RGB-induced depth information for the matching purpose on both coarse and fine branches. We also evaluate 3DG-STFM on a model compression task. To the best of our knowledge, 3DG-STFM is the first student-teacher learning method for the local feature matching task. The experiments show that our method outperforms state-of-the-art methods on indoor and outdoor camera pose estimations, and homography estimation problems. Code is available at: https://github.com/Ryan-prime/3DG-STFM.

ARMay 18Code
CPPL: A Circuit Prompt Programming Language

Shuo Yin, Yihe Wang, Lancheng Zou et al.

Large language models (LLMs) have shown promise in register-transfer level (RTL) design automation, but direct RTL generation remains difficult to validate, optimize, and integrate with compiler-based hardware design flows. Hardware compiler infrastructures such as CIRCT provide typed intermediate representations, legality checks, and optimization passes, yet current LLMs struggle to emit raw compiler IR because of MLIR syntax, SSA discipline, dialect-specific operations, and strict width constraints. This paper presents CPPL, a compiler-mediated design framework that turns LLM-assisted hardware generation into a statically checkable frontend problem rather than an unconstrained RTL text-generation task. CPPL combines a Python frontend DSL for declaring module interfaces and hierarchy with CPPL IR, a JSON-based circuit IR designed to expose compiler-visible structure while remaining accessible to LLMs. The compiler infers operation widths from declared module ports, validates generated IR, checks hierarchy and port bindings, and deterministically lowers the result to CIRCT for synthesizable Verilog generation. On the RTLLM benchmark, CPPL improves functional correctness over direct Verilog and direct CIRCT IR generation, while CIRCT optimization reduces post-synthesis AIG node counts. These results show that a compiler-mediated interface can make LLM-assisted hardware design more reliable, analyzable, and amenable to backend optimization. CPPL is available at https://github.com/SawyDust1228/CPPL.

CRApr 14
Post-Quantum Security of Block Cipher Constructions

Gorjan Alagic, Chen Bai, Christian Majenz et al.

Block ciphers are versatile cryptographic ingredients that are used in a wide range of applications ranging from secure Internet communications to disk encryption. While post-quantum security of public-key cryptography has received significant attention, the case of symmetric-key cryptography (and block ciphers in particular) remains a largely unexplored topic. In this work, we set the foundations for a theory of post-quantum security for block ciphers and associated constructions. Leveraging our new techniques, we provide the first post-quantum security proofs for the key-length extension scheme FX, the tweakable block ciphers LRW and XEX, and most block cipher encryption and authentication modes. Our techniques can be used for security proofs in both the plain model and the quantum ideal cipher model. Our work takes significant initial steps in establishing a rigorous understanding of the post-quantum security of practical symmetric-key cryptography.

ARMay 3Code
PipeRTL: Timing-Aware Pipeline Optimization at IR-Level for RTL Generation

Shuo Yin, Fangzhou Liu, Lancheng Zou et al.

Modern hardware compilers increasingly rely on rich intermediate representations (IRs) to preserve optimization-relevant semantics before generating RTL code. However, one important optimization is still largely deferred to backend tools: pipeline optimization. In common RTL flows, registers are inserted by frontend heuristics or hardware designers and later adjusted by backend retiming after the design has been lowered to a much lower-level netlist representation. At that point, much of the operator-level structure originally exposed by the compiler IR has already been weakened or lost, limiting opportunities for global, compiler-level pipeline optimization. This paper presents PipeRTL, an IR-level pipeline optimization framework for hardware compilers, instantiated in CIRCT. PipeRTL makes the legality of register relocation explicit in the IR, uses a learned timing predictor to approximate downstream delay behavior, and formulates timing-aware register relocation as a global min-cost flow problem under timing constraints. Evaluation on open-source designs under a commercial backend synthesis flow shows that PipeRTL improves downstream implementation quality on average, reducing critical-path delay, power, and area across the evaluated benchmarks, while also providing a stronger starting point for backend retiming. These results indicate that exposing pipeline optimization as an explicit compiler pass can deliver backend-meaningful gains by improving the sequential structure presented to later stages and the resulting downstream implementation quality.

CVFeb 26
DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

Zhechao Wang, Yiming Zeng, Lufan Ma et al.

Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.

CVJan 29
Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

Linhan Wang, Zichong Yang, Chen Bai et al.

End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.

CVMay 17, 2023Code
Dynamic Structural Brain Network Construction by Hierarchical Prototype Embedding GCN using T1-MRI

Yilin Leng, Wenju Cui, Chen Bai et al.

Constructing structural brain networks using T1-weighted magnetic resonance imaging (T1-MRI) presents a significant challenge due to the lack of direct regional connectivity information. Current methods with T1-MRI rely on predefined regions or isolated pretrained location modules to obtain atrophic regions, which neglects individual specificity. Besides, existing methods capture global structural context only on the whole-image-level, which weaken correlation between regions and the hierarchical distribution nature of brain connectivity.We hereby propose a novel dynamic structural brain network construction method based on T1-MRI, which can dynamically localize critical regions and constrain the hierarchical distribution among them for constructing dynamic structural brain network. Specifically, we first cluster spatially-correlated channel and generate several critical brain regions as prototypes. Further, we introduce a contrastive loss function to constrain the prototypes distribution, which embed the hierarchical brain semantic structure into the latent space. Self-attention and GCN are then used to dynamically construct hierarchical correlations of critical regions for brain network and explore the correlation, respectively. Our method is evaluated on ADNI-1 and ADNI-2 databases for mild cognitive impairment (MCI) conversion prediction, and acheive the state-of-the-art (SOTA) performance. Our source code is available at http://github.com/*******.

CVJan 30, 2024
Anything in Any Scene: Photorealistic Video Object Insertion

Chen Bai, Zeman Shao, Guoxiang Zhang et al.

Realistic video simulation has shown significant potential across diverse applications, from virtual reality to film production. This is particularly true for scenarios where capturing videos in real-world settings is either impractical or expensive. Existing approaches in video simulation often fail to accurately model the lighting environment, represent the object geometry, or achieve high levels of photorealism. In this paper, we propose Anything in Any Scene, a novel and generic framework for realistic video simulation that seamlessly inserts any object into an existing dynamic video with a strong emphasis on physical realism. Our proposed general framework encompasses three key processes: 1) integrating a realistic object into a given scene video with proper placement to ensure geometric realism; 2) estimating the sky and environmental lighting distribution and simulating realistic shadows to enhance the light realism; 3) employing a style transfer network that refines the final video output to maximize photorealism. We experimentally demonstrate that Anything in Any Scene framework produces simulated videos of great geometric realism, lighting realism, and photorealism. By significantly mitigating the challenges associated with video data generation, our framework offers an efficient and cost-effective solution for acquiring high-quality videos. Furthermore, its applications extend well beyond video data augmentation, showing promising potential in virtual reality, video editing, and various other video-centric applications. Please check our project website https://anythinginanyscene.github.io for access to our project code and more high-resolution video results.

LGApr 15, 2024
GNNavigator: Towards Adaptive Training of Graph Neural Networks via Automatic Guideline Exploration

Tong Qiao, Jianlei Yang, Yingjie Qi et al.

Graph Neural Networks (GNNs) succeed significantly in many applications recently. However, balancing GNNs training runtime cost, memory consumption, and attainable accuracy for various applications is non-trivial. Previous training methodologies suffer from inferior adaptability and lack a unified training optimization solution. To address the problem, this work proposes GNNavigator, an adaptive GNN training configuration optimization framework. GNNavigator meets diverse GNN application requirements due to our unified software-hardware co-abstraction, proposed GNNs training performance model, and practical design space exploration solution. Experimental results show that GNNavigator can achieve up to 3.1x speedup and 44.9% peak memory reduction with comparable accuracy to state-of-the-art approaches.

ROJul 7, 2025
NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving

Qucheng Peng, Chen Bai, Guoxiang Zhang et al.

Autonomous driving systems have made significant advances in Q&A, perception, prediction, and planning based on local visual information, yet they struggle to incorporate broader navigational context that human drivers routinely utilize. We address this critical gap between local sensor data and global navigation information by proposing NavigScene, an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. Moreover, we develop three complementary paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses by establishing preferences for navigation-relevant summarized information; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion. Extensive experiments demonstrate that our approaches significantly improve performance across perception, prediction, planning, and question-answering tasks by enabling reasoning capabilities beyond visual range and improving generalization to diverse driving scenarios. This work represents a significant step toward more comprehensive autonomous driving systems capable of navigating complex, unfamiliar environments with greater reliability and safety.