CV · Aug 21, 2022

DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection

arXiv:2208.09878v1 · 3 citations · h-index: 23
Originality: Incremental advance
AI Analysis

This paper addresses the problem of detecting text of arbitrary shapes and extreme aspect ratios in images. It is an incremental improvement over existing segmentation-based methods.

The paper tackles scene text detection by proposing DPTNet, a dual-path transformer architecture that combines a convolutional network with self-attention to model both global and local information. It achieves state-of-the-art results on the MSRA-TD500 dataset and competitive performance on other benchmarks.

The prosperity of deep learning contributes to the rapid progress in scene text detection. Among all the methods with convolutional networks, segmentation-based ones have drawn extensive attention due to their superiority in detecting text instances of arbitrary shapes and extreme aspect ratios. However, the bottom-up methods are limited by the performance of their segmentation models. In this paper, we propose DPTNet (Dual-Path Transformer Network), a simple yet effective architecture to model the global and local information for the scene text detection task. We further propose a parallel design that integrates the convolutional network with a powerful self-attention mechanism to provide complementary clues between the attention path and convolutional path. Moreover, a bi-directional interaction module across the two paths is developed to provide complementary clues in the channel and spatial dimensions. We also upgrade the concentration operation by adding an extra multi-head attention layer to it. Our DPTNet achieves state-of-the-art results on the MSRA-TD500 dataset, and provides competitive results on other standard benchmarks in terms of both detection accuracy and speed.
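
To make the parallel design in the abstract more concrete, here is a minimal NumPy sketch of one dual-path block. It is not the authors' implementation. Every function name is hypothetical, the convolution is replaced by a fixed 3x3 mean filter, the attention uses a single head with identity projections, and the direction of each gate (a channel gate from the convolutional path into the attention path, a spatial gate from the attention path into the convolutional path) is an assumption chosen for illustration. The extra multi-head attention layer mentioned in the abstract is not modelled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_path(x):
    """Local path: 3x3 mean filter per channel, zero padded.
    Stand-in for a learned convolution (assumption). x has shape (C, H, W)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def attention_path(x, channel_gate=None):
    """Global path: single-head self-attention over all H*W positions.
    Identity Q/K/V projections are used instead of learned weights (assumption)."""
    c, h, w = x.shape
    tokens = x.reshape(c, h * w).T                  # (N, C), N = H*W
    q, k, v = tokens, tokens, tokens
    if channel_gate is not None:
        v = v * channel_gate                        # channel interaction, (C,) broadcast
    attn = softmax(q @ k.T / np.sqrt(c), axis=-1)   # (N, N) attention weights
    return (attn @ v).T.reshape(c, h, w)

def dual_path_block(x):
    """Run both paths, exchange gates in both directions, then concatenate."""
    local = conv_path(x)
    # Conv path -> attention path: one gate value per channel.
    channel_gate = sigmoid(local.mean(axis=(1, 2)))
    global_feat = attention_path(x, channel_gate)
    # Attention path -> conv path: one gate value per spatial position.
    spatial_gate = sigmoid(global_feat.mean(axis=0, keepdims=True))
    local = local * spatial_gate
    return np.concatenate([global_feat, local], axis=0)   # (2C, H, W)

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))   # toy feature map: 4 channels, 8x8
out = dual_path_block(feat)
print(out.shape)                        # (8, 8, 8)
```

The output stacks the gated global and local features along the channel axis, so a block with C input channels yields 2C output channels in this sketch. A real model would follow this with a learned projection.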

