Zixiang Ren

24.0 | CV | May 29

NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

Jiahui Li, Jiawei Sun, Zixiang Ren et al.

Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, which places few constraints on the visual information they encode. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework that directly constrains the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface that biases reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E, 94.1 PDMS on NavSim v1, and 90.9 EPDMS on NavSim v2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.
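The abstract reports lower pairwise redundancy and higher effective rank for the learned scene tokens but does not define either metric. The sketch below shows one common way such quantities could be computed, assuming an entropy-based effective rank (exponential of the Shannon entropy of the normalized singular values) and mean absolute off-diagonal cosine similarity as the redundancy measure. The function names and the toy token matrices are hypothetical and are not taken from the paper.

```python
import numpy as np


def effective_rank(tokens: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (num_tokens, dim) matrix.

    Assumed definition: exp of the Shannon entropy of the singular values
    after normalizing them to sum to one.
    """
    singular_values = np.linalg.svd(tokens, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero entries before taking the log
    return float(np.exp(-(p * np.log(p)).sum()))


def mean_pairwise_cosine(tokens: np.ndarray, eps: float = 1e-12) -> float:
    """Mean absolute cosine similarity over all distinct token pairs."""
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + eps)
    similarity = normed @ normed.T
    off_diagonal = ~np.eye(tokens.shape[0], dtype=bool)
    return float(np.abs(similarity[off_diagonal]).mean())


# Toy data (hypothetical): 16 scene tokens of dimension 64.
rng = np.random.default_rng(0)
diverse_tokens = rng.standard_normal((16, 64))
shared_direction = rng.standard_normal(64)
redundant_tokens = (
    np.outer(np.ones(16), shared_direction)
    + 0.01 * rng.standard_normal((16, 64))
)

diverse_rank = effective_rank(diverse_tokens)
redundant_rank = effective_rank(redundant_tokens)
diverse_redundancy = mean_pairwise_cosine(diverse_tokens)
redundant_redundancy = mean_pairwise_cosine(redundant_tokens)
```

Under these assumed definitions, tokens that are near copies of one another score a high mean cosine similarity and an effective rank close to one, while diverse tokens score the opposite way, which is the direction of improvement the abstract describes.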

16.7 | IT | Apr 5

Environment-Aware Near-Field Channel Estimation Leveraging CKM and ISAC

Yuan Guo, Yilong Chen, Zixiang Ren et al.

This paper proposes an environment-aware near-field channel estimation framework for integrated sensing and communication (ISAC) systems equipped with extremely large-scale antenna arrays (ELAAs). The proposed framework jointly exploits channel knowledge maps (CKMs) and ISAC to obtain a priori information on static and dynamic environmental features for facilitating channel estimation. In particular, we propose a novel CKM representation, termed the virtual object map (VOM), which stores the locations of virtual environment objects (EOs) to characterize the dominant multipath components (MPCs) induced by static physical EOs. In addition, we design a sensing-assisted channel training protocol, in which the ISAC-enabled base station (BS) transmits downlink pilots while simultaneously collecting monostatic echoes for sensing dynamic targets in the environment, and the user equipment (UE) feeds back a quantized version of its received pilot observation. Based on the VOM prior and the sensed dynamic information, the BS jointly estimates the coefficients of the static and dynamic MPCs to recover the near-field channel. Numerical results demonstrate that the proposed joint VOM- and sensing-aided channel estimation scheme significantly outperforms conventional schemes without VOM-based priors and/or dynamic sensing in terms of both channel estimation accuracy and achievable rate.
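The final estimation step described above, recovering the channel from multipath coefficients once the object locations are known, can be illustrated with a minimal sketch. It assumes a uniform linear array, a single-antenna UE, a spherical-wave steering vector with unit amplitude, noise-free and unquantized pilot observations, and a plain least-squares fit. The abstract specifies none of these, so the function names, geometry, and parameter values are hypothetical and this is not the authors' estimator.

```python
import numpy as np


def near_field_steering(antenna_pos, point, wavelength):
    """Spherical-wave steering vector: phase set by the exact distance
    from each antenna element to the (virtual or sensed) object location."""
    distances = np.linalg.norm(antenna_pos - point, axis=1)
    return np.exp(-2j * np.pi * distances / wavelength)


def estimate_channel(pilots, received, steering):
    """Least-squares fit of path coefficients, then channel reconstruction.

    pilots:   (T, N) downlink pilot matrix
    received: (T,)   pilot observations at the UE
    steering: (N, L) steering vectors at the known object locations
    """
    effective = pilots @ steering  # (T, L) pilot response per path
    coeffs, *_ = np.linalg.lstsq(effective, received, rcond=None)
    return steering @ coeffs, coeffs


# Hypothetical setup: 64-element half-wavelength ULA along the x-axis.
wavelength = 0.01
num_antennas = 64
x_coords = (np.arange(num_antennas) - (num_antennas - 1) / 2) * wavelength / 2
antenna_pos = np.stack([x_coords, np.zeros(num_antennas)], axis=1)

# Two static locations (as if read from a map prior) and one dynamic target
# (as if obtained from sensing); all three are made-up values.
points = np.array([[1.0, 3.0], [-2.0, 4.0], [0.5, 2.0]])
steering = np.stack(
    [near_field_steering(antenna_pos, p, wavelength) for p in points], axis=1
)

rng = np.random.default_rng(0)
true_coeffs = rng.standard_normal(3) + 1j * rng.standard_normal(3)
h_true = steering @ true_coeffs

num_pilots = 8  # far fewer pilots than antennas
pilots = (
    rng.standard_normal((num_pilots, num_antennas))
    + 1j * rng.standard_normal((num_pilots, num_antennas))
) / np.sqrt(2)
received = pilots @ h_true  # noise-free, unquantized observation

h_hat, coeffs_hat = estimate_channel(pilots, received, steering)
relative_error = np.linalg.norm(h_hat - h_true) / np.linalg.norm(h_true)
```

With the locations known, only three coefficients are estimated instead of 64 channel entries, which is why eight pilots suffice in this toy case. This illustrates the kind of gain that location priors could provide, not the performance reported in the paper.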
