CV · Jun 5

TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance

Duc Tri Tran, Trung Thanh Nguyen, Vijay John, Phi Le Nguyen, Yasutomo Kawanishi

arXiv:2606.07161 · 17.1 · Has Code

Originality: Incremental advance

AI Analysis

For urban surveillance and intelligent transportation systems, TraRA provides a plug-and-play method to improve video text spotting robustness under challenging conditions.

TraRA addresses unreliable frame-level text recognition in video streams, which is caused by motion blur, occlusion, and scale variation. It aggregates information over entire text trajectories using temporal clustering and vision-language fusion, and reports consistent improvements over state-of-the-art methods on four benchmarks.

Video Text Spotting (VTS) is essential for urban surveillance and intelligent transportation systems, enabling automated reading of street signs, vehicle markings, and scene text in video streams. However, reliable recognition remains challenging because dynamic video factors common in surveillance scenarios, including motion blur, occlusion, and scale variation, degrade frame-level recognition. Existing VTS methods typically perform recognition independently on each frame, leading to inconsistent and inaccurate results across sequences. To address these limitations, we propose TraRA (Trajectory-level Recognition Aggregation for VTS), a plug-and-play method that performs trajectory-level text recognition by leveraging temporal and multimodal consistency. TraRA integrates two key modules: (1) Temporal Clustering and (2) Vision-Language Aggregation. The former refines noisy trajectories by grouping temporally and visually coherent text instances, while the latter employs a Low-Rank Adaptation (LoRA)-enhanced vision-language model to fuse visual cues with linguistic context across frames. By aggregating information over entire text trajectories, TraRA achieves robust text recognition even under challenging surveillance conditions. Extensive experiments on four public benchmarks, including road and urban scene datasets (RoadText, BOVText, ArTVideo, and ICDAR15), demonstrate that TraRA consistently improves tracking and recognition performance over state-of-the-art VTS methods. The source code is available at https://github.com/trid2912/TraRA.
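To illustrate why reading a whole trajectory can be more reliable than reading each frame alone, here is a minimal Python sketch. It is not the TraRA implementation: the paper's Temporal Clustering also uses visual coherence, and its aggregation relies on a LoRA-enhanced vision-language model. The sketch swaps in two much simpler stand-ins, splitting a trajectory on frame gaps and taking a confidence-weighted vote over per-frame readings. All names (`FrameReading`, `split_by_gap`, `vote`, `max_gap`) and the sample values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FrameReading:
    frame: int    # frame index within the video
    text: str     # per-frame recognition result (hypothetical recognizer output)
    score: float  # recognizer confidence, assumed to lie in [0, 1]


def split_by_gap(readings, max_gap=5):
    """Split one noisy trajectory into segments of temporally close readings.

    Simplified stand-in for temporal clustering: only frame distance is
    used here, with no visual similarity.
    """
    segments = []
    for reading in sorted(readings, key=lambda r: r.frame):
        if segments and reading.frame - segments[-1][-1].frame <= max_gap:
            segments[-1].append(reading)
        else:
            segments.append([reading])
    return segments


def vote(readings):
    """Return the text with the largest summed confidence, or '' if empty.

    Simplified stand-in for trajectory-level aggregation.
    """
    totals = defaultdict(float)
    for reading in readings:
        totals[reading.text] += reading.score
    return max(totals, key=totals.get) if totals else ""


# Illustrative data: a sign read over four nearby frames, plus one stray
# detection much later that was wrongly linked to the same track.
track = [
    FrameReading(0, "STOP", 0.9),
    FrameReading(1, "ST0P", 0.4),   # blurred frame misreads 'O' as '0'
    FrameReading(2, "STOP", 0.8),
    FrameReading(3, "SIOP", 0.3),   # partial occlusion
    FrameReading(40, "EXIT", 0.7),  # likely a different text instance
]

segments = split_by_gap(track, max_gap=5)
labels = [vote(segment) for segment in segments]
```

In this toy example the blurred and occluded frames are outvoted by the confident readings in the same segment, and the stray detection at frame 40 gets its own segment instead of contaminating the first one.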

