CVApr 30, 2022Code
SVTR: Scene Text Recognition with a Single Visual ModelYongkun Du, Zhineng Chen, Caiyan Jia et al.
Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference. The code is publicly available at https://github.com/PaddlePaddle/PaddleOCR.
CVJun 7, 2022Code
PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR SystemChenxia Li, Weiwei Liu, Ruoyu Guo et al.
Optical character recognition (OCR) technology has been widely used in various scenes, as shown in Figure 1. Designing a practical OCR system is still a meaningful but challenging task. In previous work, considering the efficiency and accuracy, we proposed a practical ultra lightweight OCR system (PP-OCR), and an optimized version PP-OCRv2. In order to further improve the performance of PP-OCRv2, a more robust OCR system PP-OCRv3 is proposed in this paper. PP-OCRv3 upgrades the text detection model and text recognition model in 9 aspects based on PP-OCRv2. For text detector, we introduce a PAN module with large receptive field named LK-PAN, a FPN module with residual attention mechanism named RSE-FPN, and DML distillation strategy. For text recognizer, the base model is replaced from CRNN to SVTR, and we introduce lightweight text recognition network SVTR LCNet, guided training of CTC by attention, data augmentation strategy TextConAug, better pre-trained model by self-supervised TextRotNet, UDML, and UIM to accelerate the model and improve the effect. Experiments on real data show that the hmean of PP-OCRv3 is 5% higher than PP-OCRv2 under comparable inference speed. All the above mentioned models are open-sourced and the code is available in the GitHub repository PaddleOCR which is powered by PaddlePaddle.
CVJul 11, 2023Code
Towards Anytime Optical Flow Estimation with Event CamerasYaozu Ye, Hao Shi, Kailun Yang et al.
Event cameras respond to changes in log-brightness at the millisecond level, making them ideal for optical flow estimation. However, existing datasets from event cameras provide only low frame rate ground truth for optical flow, limiting the research potential of event-driven optical flow. To address this challenge, we introduce a low-latency event representation, Unified Voxel Grid, and propose EVA-Flow, an EVent-based Anytime Flow estimation network to produce high-frame-rate event optical flow with only low-frame-rate optical flow ground truth for supervision. Furthermore, we propose the Rectified Flow Warp Loss (RFWL) for the unsupervised assessment of intermediate optical flow. A comprehensive variety of experiments on MVSEC, DESC, and our EVA-FlowSet demonstrates that EVA-Flow achieves competitive performance, super-low-latency (5ms), time-dense motion estimation (200Hz), and strong generalization. Our code will be available at https://github.com/Yaozhuwa/EVA-Flow.
CVJul 23, 2023Code
Context Perception Parallel Decoder for Scene Text RecognitionYongkun Du, Zhineng Chen, Caiyan Jia et al.
Scene text recognition (STR) methods have struggled to attain high accuracy and fast inference speed. Autoregressive (AR)-based models implement the recognition in a character-by-character manner, showing superiority in accuracy but with slow inference speed. Alternatively, parallel decoding (PD)-based models infer all characters in a single decoding pass, offering faster inference speed but generally worse accuracy. We first present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception. Consequently, we propose Context Perception Parallel Decoder (CPPD) to predict the character sequence in a PD pass. CPPD devises a character counting module to infer the occurrence count of each character, and a character ordering module to deduce the content-free reading order and placeholders. Meanwhile, the character prediction task associates the placeholders with characters. They together build a comprehensive recognition context. We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts. Moreover, the plugged models achieve significant accuracy improvements. Code is at \href{https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_en/algorithm_rec_cppd_en.md}{this https URL}.
CVNov 21, 2022Code
Beyond the Field-of-View: Enhancing Scene Visibility and Perception with Clip-Recurrent TransformerHao Shi, Qi Jiang, Kailun Yang et al.
Vision sensors are widely applied in vehicles, robots, and roadside infrastructure. However, due to limitations in hardware cost and system size, camera Field-of-View (FoV) is often restricted and may not provide sufficient coverage. Nevertheless, from a spatiotemporal perspective, it is possible to obtain information beyond the camera's physical FoV from past video streams. In this paper, we propose the concept of online video inpainting for autonomous vehicles to expand the field of view, thereby enhancing scene visibility, perception, and system safety. To achieve this, we introduce the FlowLens architecture, which explicitly employs optical flow and implicitly incorporates a novel clip-recurrent transformer for feature propagation. FlowLens offers two key features: 1) FlowLens includes a newly designed Clip-Recurrent Hub with 3D-Decoupled Cross Attention (DDCA) to progressively process global information accumulated over time. 2) It integrates a multi-branch Mix Fusion Feed Forward Network (MixF3N) to enhance the precise spatial flow of local features. To facilitate training and evaluation, we derive the KITTI360 dataset with various FoV mask, which covers both outer- and inner FoV expansion scenarios. We also conduct both quantitative assessments and qualitative comparisons of beyond-FoV semantics and beyond-FoV object detection across different models. We illustrate that employing FlowLens to reconstruct unseen scenes even enhances perception within the field of view by providing reliable semantic context. Extensive experiments and user studies involving offline and online video inpainting, as well as beyond-FoV perception tasks, demonstrate that FlowLens achieves state-of-the-art performance. The source code and dataset are made publicly available at https://github.com/MasterHow/FlowLens.
CVNov 8, 2023Code
Exploring Event-based Human Pose Estimation with 3D Event RepresentationsXiaoting Yin, Hao Shi, Jiaan Chen et al.
Human pose estimation is a fundamental and appealing task in computer vision. Although traditional cameras are commonly applied, their reliability decreases in scenarios under high dynamic range or heavy motion blur, where event cameras offer a robust solution. Predominant event-based methods accumulate events into frames, ignoring the asynchronous and high temporal resolution that is crucial for distinguishing distinct actions. To address this issue and to unlock the 3D potential of event information, we introduce two 3D event representations: the Rasterized Event Point Cloud (RasEPC) and the Decoupled Event Voxel (DEV). The RasEPC aggregates events within concise temporal slices at identical positions, preserving their 3D attributes along with statistical information, thereby significantly reducing memory and computational demands. Meanwhile, the DEV representation discretizes events into voxels and projects them across three orthogonal planes, utilizing decoupled event attention to retrieve 3D cues from the 2D planes. Furthermore, we develop and release EV-3DPW, a synthetic event-based dataset crafted to facilitate training and quantitative analysis in outdoor scenes. Our methods are tested on the DHP19 public dataset, MMHPSD dataset, and our EV-3DPW dataset, with further qualitative validation via a derived driving scene dataset EV-JAAD and an outdoor collection vehicle. Our code and dataset have been made publicly available at https://github.com/MasterHow/EventPointPose.
CVMar 13, 2024Code
Offboard Occupancy Refinement with Hybrid Propagation for Autonomous DrivingHao Shi, Song Wang, Jiaming Zhang et al.
Vision-based occupancy prediction, also known as 3D Semantic Scene Completion (SSC), presents a significant challenge in computer vision. Previous methods, confined to onboard processing, struggle with simultaneous geometric and semantic estimation, continuity across varying viewpoints, and single-view occlusion. Our paper introduces OccFiner, a novel offboard framework designed to enhance the accuracy of vision-based occupancy predictions. OccFiner operates in two hybrid phases: 1) a multi-to-multi local propagation network that implicitly aligns and processes multiple local frames for correcting onboard model errors and consistently enhancing occupancy accuracy across all distances. 2) the region-centric global propagation, focuses on refining labels using explicit multi-view geometry and integrating sensor bias, particularly for increasing the accuracy of distant occupied voxels. Extensive experiments demonstrate that OccFiner improves both geometric and semantic accuracy across various types of coarse occupancy, setting a new state-of-the-art performance on the SemanticKITTI dataset. Notably, OccFiner significantly boosts the performance of vision-based SSC models, achieving accuracy levels competitive with established LiDAR-based onboard SSC methods. Furthermore, OccFiner is the first to achieve automatic annotation of SSC in a purely vision-based approach. Quantitative experiments prove that OccFiner successfully facilitates occupancy data loop-closure in autonomous driving. Additionally, we quantitatively and qualitatively validate the superiority of the offboard approach on city-level SSC static maps. The source code will be made publicly available at https://github.com/MasterHow/OccFiner.
CVOct 22, 2024Code
E-3DGS: Gaussian Splatting with Exposure and Motion EventsXiaoting Yin, Hao Shi, Yuhan Bao et al.
Achieving 3D reconstruction from images captured under optimal conditions has been extensively studied in the vision and imaging fields. However, in real-world scenarios, challenges such as motion blur and insufficient illumination often limit the performance of standard frame-based cameras in delivering high-quality images. To address these limitations, we incorporate a transmittance adjustment device at the hardware level, enabling event cameras to capture both motion and exposure events for diverse 3D reconstruction scenarios. Motion events (triggered by camera or object movement) are collected in fast-motion scenarios when the device is inactive, while exposure events (generated through controlled camera exposure) are captured during slower motion to reconstruct grayscale images for high-quality training and optimization of event-based 3D Gaussian Splatting (3DGS). Our framework supports three modes: High-Quality Reconstruction using exposure events, Fast Reconstruction relying on motion events, and Balanced Hybrid optimizing with initial exposure events followed by high-speed motion events. On the EventNeRF dataset, we demonstrate that exposure events significantly improve fine detail reconstruction compared to motion events and outperform frame-based cameras under challenging conditions such as low illumination and overexposure. Furthermore, we introduce EME-3D, a real-world 3D dataset with exposure events, motion events, camera calibration parameters, and sparse point clouds. Our method achieves faster and higher-quality reconstruction than event-based NeRF and is more cost-effective than methods combining event and RGB data. E-3DGS sets a new benchmark for event-based 3D reconstruction with robust performance in challenging conditions and lower hardware demands. The source code and dataset will be available at https://github.com/MasterHow/E-3DGS.
CVJan 30, 2024Code
Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational TransformersJianbin Jiao, Xina Cheng, Weijie Chen et al.
3D human pose estimation captures the human joint points in three-dimensional space while keeping the depth information and physical structure. That is essential for applications that require precise pose information, such as human-computer interaction, scene understanding, and rehabilitation training. Due to the challenges in data collection, mainstream datasets of 3D human pose estimation are primarily composed of multi-view video data collected in laboratory environments, which contains rich spatial-temporal correlation information besides the image frame content. Given the remarkable self-attention mechanism of transformers, capable of capturing the spatial-temporal correlation from multi-view video datasets, we propose a multi-stage framework for 3D sequence-to-sequence (seq2seq) human pose detection. Firstly, the spatial module represents the human pose feature by intra-image content, while the frame-image relation module extracts temporal relationships and 3D spatial positional relationship features between the multi-perspective images. Secondly, the self-attention mechanism is adopted to eliminate the interference from non-human body parts and reduce computing resources. Our method is evaluated on Human3.6M, a popular 3D human pose detection dataset. Experimental results demonstrate that our approach achieves state-of-the-art performance on this dataset. The source code will be available at https://github.com/WUJINHUAN/3D-human-pose.
CVFeb 4, 2025Code
Event-aided Semantic Scene CompletionShangwei Guo, Hao Shi, Song Wang et al.
Autonomous driving systems rely on robust 3D scene understanding. Recent advances in Semantic Scene Completion (SSC) for autonomous driving underscore the limitations of RGB-based approaches, which struggle under motion blur, poor lighting, and adverse weather. Event cameras, offering high dynamic range and low latency, address these challenges by providing asynchronous data that complements RGB inputs. We present DSEC-SSC, the first real-world benchmark specifically designed for event-aided SSC, which includes a novel 4D labeling pipeline for generating dense, visibility-aware labels that adapt dynamically to object motion. Our proposed RGB-Event fusion framework, EvSSC, introduces an Event-aided Lifting Module (ELM) that effectively bridges 2D RGB-Event features to 3D space, enhancing view transformation and the robustness of 3D volume construction across SSC models. Extensive experiments on DSEC-SSC and simulated SemanticKITTI-E demonstrate that EvSSC is adaptable to both transformer-based and LSS-based SSC architectures. Notably, evaluations on SemanticKITTI-C demonstrate that EvSSC achieves consistently improved prediction accuracy across five degradation modes and both In-domain and Out-of-domain settings, achieving up to a 52.5% relative improvement in mIoU when the image sensor partially fails. Additionally, we quantitatively and qualitatively validate the superiority of EvSSC under motion blur and extreme weather conditions, where autonomous driving is challenged. The established datasets and our codebase will be made publicly at https://github.com/Pandapan01/EvSSC.
CVMar 16, 2025Code
EgoEvGesture: Gesture Recognition Based on Egocentric Event CameraLuming Wang, Hao Shi, Xiaoting Yin et al.
Egocentric gesture recognition is a pivotal technology for enhancing natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras show distinct advantages in handling high dynamic range with ultra-low power consumption, existing RGB-based architectures face inherent limitations in processing asynchronous event streams due to their synchronous frame-based nature. Moreover, from an egocentric perspective, event cameras record data that includes events generated by both head movements and hand gestures, thereby increasing the complexity of gesture recognition. To address this, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions to reduce parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model as context block that decouples head movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BTSM) that shifts features along bins and temporal dimensions to fuse sparse events efficiently. We further establish the EgoEvGesture dataset, the first large-scale dataset for egocentric gesture recognition using event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy tested on unseen subjects with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high inter-personal variability and unseen test patterns differing from training data. Moreover, our approach achieved a remarkable accuracy of 97.0% on the DVS128 Gesture, demonstrating the effectiveness and generalization capability of our method on public datasets. The dataset and models are made available at https://github.com/3190105222/EgoEv_Gesture.
CVFeb 27, 2022Code
PanoFlow: Learning 360° Optical Flow for Surrounding Temporal UnderstandingHao Shi, Yifan Zhou, Kailun Yang et al.
Optical flow estimation is a basic task in self-driving and robotics systems, which enables to temporally interpret traffic scenes. Autonomous vehicles clearly benefit from the ultra-wide Field of View (FoV) offered by 360° panoramic sensors. However, due to the unique imaging process of panoramic cameras, models designed for pinhole images do not directly generalize satisfactorily to 360° panoramic images. In this paper, we put forward a novel network framework--PanoFlow, to learn optical flow for panoramic images. To overcome the distortions introduced by equirectangular projection in panoramic transformation, we design a Flow Distortion Augmentation (FDA) method, which contains radial flow distortion (FDA-R) or equirectangular flow distortion (FDA-E). We further look into the definition and properties of cyclic optical flow for panoramic videos, and hereby propose a Cyclic Flow Estimation (CFE) method by leveraging the cyclicity of spherical images to infer 360° optical flow and converting large displacement to relatively small displacement. PanoFlow is applicable to any existing flow estimation method and benefits from the progress of narrow-FoV flow estimation. In addition, we create and release a synthetic panoramic dataset FlowScape based on CARLA to facilitate training and quantitative analysis. PanoFlow achieves state-of-the-art performance on the public OmniFlowNet and the established FlowScape benchmarks. Our proposed approach reduces the End-Point-Error (EPE) on FlowScape by 27.3%. On OmniFlowNet, PanoFlow achieves a 55.5% error reduction from the best published result. We also qualitatively validate our method via a collection vehicle and a public real-world OmniPhotos dataset, indicating strong potential and robustness for real-world navigation applications. Code and dataset are publicly available at https://github.com/MasterHow/PanoFlow.
CVFeb 2, 2022Code
CSFlow: Learning Optical Flow via Cross Strip Correlation for Autonomous DrivingHao Shi, Yifan Zhou, Kailun Yang et al.
Optical flow estimation is an essential task in self-driving systems, which helps autonomous vehicles perceive temporal continuity information of surrounding scenes. The calculation of all-pair correlation plays an important role in many existing state-of-the-art optical flow estimation methods. However, the reliance on local knowledge often limits the model's accuracy under complex street scenes. In this paper, we propose a new deep network architecture for optical flow estimation in autonomous driving--CSFlow, which consists of two novel modules: Cross Strip Correlation module (CSC) and Correlation Regression Initialization module (CRI). CSC utilizes a striping operation across the target image and the attended image to encode global context into correlation volumes, while maintaining high efficiency. CRI is used to maximally exploit the global context for optical flow initialization. Our method has achieved state-of-the-art accuracy on the public autonomous driving dataset KITTI-2015. Code is publicly available at https://github.com/MasterHow/CSFlow.
CVSep 21, 2020Code
PP-OCR: A Practical Ultra Lightweight OCR SystemYuning Du, Chenxia Li, Ruoyu Guo et al.
The Optical Character Recognition (OCR) systems have been widely used in various of application scenarios, such as office automation (OA) systems, factory automations, online educations, map productions etc. However, OCR is still a challenging task due to the various of text appearances and the demand of computational efficiency. In this paper, we propose a practical ultra lightweight OCR system, i.e., PP-OCR. The overall model size of the PP-OCR is only 3.5M for recognizing 6622 Chinese characters and 2.8M for recognizing 63 alphanumeric symbols, respectively. We introduce a bag of strategies to either enhance the model ability or reduce the model size. The corresponding ablation experiments with the real data are also provided. Meanwhile, several pre-trained models for the Chinese and English recognition are released, including a text detector (97K images are used), a direction classifier (600K images are used) as well as a text recognizer (17.9M images are used). Besides, the proposed PP-OCR are also verified in several other language recognition tasks, including French, Korean, Japanese and German. All of the above mentioned models are open-sourced and the codes are available in the GitHub repository, i.e., https://github.com/PaddlePaddle/PaddleOCR.
CVSep 23, 2025
Event-guided 3D Gaussian Splatting for Dynamic Human and Scene ReconstructionXiaoting Yin, Hao Shi, Kailun Yang et al.
Reconstructing dynamic humans together with static scenes from monocular videos remains difficult, especially under fast motion, where RGB frames suffer from motion blur. Event cameras exhibit distinct advantages, e.g., microsecond temporal resolution, making them a superior sensing choice for dynamic human reconstruction. Accordingly, we present a novel event-guided human-scene reconstruction framework that jointly models human and scene from a single monocular event camera via 3D Gaussian Splatting. Specifically, a unified set of 3D Gaussians carries a learnable semantic attribute; only Gaussians classified as human undergo deformation for animation, while scene Gaussians stay static. To combat blur, we propose an event-guided loss that matches simulated brightness changes between consecutive renderings with the event stream, improving local fidelity in fast-moving regions. Our approach removes the need for external human masks and simplifies managing separate Gaussian sets. On two benchmark datasets, ZJU-MoCap-Blur and MMHPSD-Blur, it delivers state-of-the-art human-scene reconstruction, with notable gains over strong baselines in PSNR/SSIM and reduced LPIPS, especially for high-speed subjects.