CVJul 14, 2022Code
Dynamic Low-Resolution Distillation for Cost-Efficient End-to-End Text SpottingYing Chen, Liang Qiao, Zhanzhan Cheng et al.
End-to-end text spotting has attached great attention recently due to its benefits on global optimization and high maintainability for real applications. However, the input scale has always been a tough trade-off since recognizing a small text instance usually requires enlarging the whole image, which brings high computational costs. In this paper, to address this problem, we propose a novel cost-efficient Dynamic Low-resolution Distillation (DLD) text spotting framework, which aims to infer images in different small but recognizable resolutions and achieve a better balance between accuracy and efficiency. Concretely, we adopt a resolution selector to dynamically decide the input resolutions for different images, which is constraint by both inference accuracy and computational cost. Another sequential knowledge distillation strategy is conducted on the text recognition branch, making the low-res input obtains comparable performance to a high-res image. The proposed method can be optimized end-to-end and adopted in any current text spotting framework to improve the practicability. Extensive experiments on several text spotting benchmarks show that the proposed method vastly improves the usability of low-res models. The code is available at https://github.com/hikopensource/DAVAR-Lab-OCR/.
CVOct 17, 2022Code
Distilling Object Detectors With Global KnowledgeSanli Tang, Zhongyu Zhang, Zhanzhan Cheng et al.
Knowledge distillation learns a lightweight student model that mimics a cumbersome teacher. Existing methods regard the knowledge as the feature of each instance or their relations, which is the instance-level knowledge only from the teacher model, i.e., the local knowledge. However, the empirical studies show that the local knowledge is much noisy in object detection tasks, especially on the blurred, occluded, or small instances. Thus, a more intrinsic approach is to measure the representations of instances w.r.t. a group of common basis vectors in the two feature spaces of the teacher and the student detectors, i.e., global knowledge. Then, the distilling algorithm can be applied as space alignment. To this end, a novel prototype generation module (PGM) is proposed to find the common basis vectors, dubbed prototypes, in the two feature spaces. Then, a robust distilling module (RDM) is applied to construct the global knowledge based on the prototypes and filtrate noisy global and local knowledge by measuring the discrepancy of the representations in two feature spaces. Experiments with Faster-RCNN and RetinaNet on PASCAL and COCO datasets show that our method achieves the best performance for distilling object detectors with various backbones, which even surpasses the performance of the teacher model. We also show that the existing methods can be easily combined with global knowledge and obtain further improvement. Code is available: https://github.com/hikvision-research/DAVAR-Lab-ML.
SPApr 12, 2022
GMSS: Graph-Based Multi-Task Self-Supervised Learning for EEG Emotion RecognitionYang Li, Ji Chen, Fu Li et al.
Previous electroencephalogram (EEG) emotion recognition relies on single-task learning, which may lead to overfitting and learned emotion features lacking generalization. In this paper, a graph-based multi-task self-supervised learning model (GMSS) for EEG emotion recognition is proposed. GMSS has the ability to learn more general representations by integrating multiple self-supervised tasks, including spatial and frequency jigsaw puzzle tasks, and contrastive learning tasks. By learning from multiple tasks simultaneously, GMSS can find a representation that captures all of the tasks thereby decreasing the chance of overfitting on the original task, i.e., emotion recognition task. In particular, the spatial jigsaw puzzle task aims to capture the intrinsic spatial relationships of different brain regions. Considering the importance of frequency information in EEG emotional signals, the goal of the frequency jigsaw puzzle task is to explore the crucial frequency bands for EEG emotion recognition. To further regularize the learned features and encourage the network to learn inherent representations, contrastive learning task is adopted in this work by mapping the transformed data into a common feature space. The performance of the proposed GMSS is compared with several popular unsupervised and supervised methods. Experiments on SEED, SEED-IV, and MPED datasets show that the proposed model has remarkable advantages in learning more discriminative and general features for EEG emotional signals.
CVJul 14, 2022Code
DavarOCR: A Toolbox for OCR and Multi-Modal Document UnderstandingLiang Qiao, Hui Jiang, Ying Chen et al.
This paper presents DavarOCR, an open-source toolbox for OCR and document understanding tasks. DavarOCR currently implements 19 advanced algorithms, covering 9 different task forms. DavarOCR provides detailed usage instructions and the trained models for each algorithm. Compared with the previous opensource OCR toolbox, DavarOCR has relatively more complete support for the sub-tasks of the cutting-edge technology of document understanding. In order to promote the development and application of OCR technology in academia and industry, we pay more attention to the use of modules that different sub-domains of technology can share. DavarOCR is publicly released at https://github.com/hikopensource/Davar-Lab-OCR.
CVJul 14, 2022
TRIE++: Towards End-to-End Information Extraction from Visually Rich DocumentsZhanzhan Cheng, Peng Zhang, Can Li et al.
Recently, automatically extracting information from visually rich documents (e.g., tickets and resumes) has become a hot and vital research topic due to its widespread commercial value. Most existing methods divide this task into two subparts: the text reading part for obtaining the plain text from the original document images and the information extraction part for extracting key contents. These methods mainly focus on improving the second, while neglecting that the two parts are highly correlated. This paper proposes a unified end-to-end information extraction framework from visually rich documents, where text reading and information extraction can reinforce each other via a well-designed multi-modal context block. Specifically, the text reading part provides multi-modal features like visual, textual and layout features. The multi-modal context block is developed to fuse the generated multi-modal features and even the prior knowledge from the pre-trained language model for better semantic representation. The information extraction part is responsible for generating key contents with the fused context features. The framework can be trained in an end-to-end trainable manner, achieving global optimization. What is more, we define and group visually rich documents into four categories across two dimensions, the layout and text type. For each document category, we provide or recommend the corresponding benchmarks, experimental settings and strong baselines for remedying the problem that this research area lacks the uniform evaluation standard. Extensive experiments on four kinds of benchmarks (from fixed layout to variable layout, from full-structured text to semi-unstructured text) are reported, demonstrating the proposed method's effectiveness. Data, source code and models are available.
CVMar 16, 2022
PMAL: Open Set Recognition via Robust Prototype MiningJing Lu, Yunxu Xu, Hao Li et al.
Open Set Recognition (OSR) has been an emerging topic. Besides recognizing predefined classes, the system needs to reject the unknowns. Prototype learning is a potential manner to handle the problem, as its ability to improve intra-class compactness of representations is much needed in discrimination between the known and the unknowns. In this work, we propose a novel Prototype Mining And Learning (PMAL) framework. It has a prototype mining mechanism before the phase of optimizing embedding space, explicitly considering two crucial properties, namely high-quality and diversity of the prototype set. Concretely, a set of high-quality candidates are firstly extracted from training samples based on data uncertainty learning, avoiding the interference from unexpected noise. Considering the multifarious appearance of objects even in a single category, a diversity-based strategy for prototype set filtering is proposed. Accordingly, the embedding space can be better optimized to discriminate therein the predefined classes and between known and unknowns. Extensive experiments verify the two good characteristics (i.e., high-quality and diversity) embraced in prototype mining, and show the remarkable performance of the proposed framework compared to state-of-the-arts.
CVJul 14, 2022
E2-AEN: End-to-End Incremental Learning with Adaptively Expandable NetworkGuimei Cao, Zhanzhan Cheng, Yunlu Xu et al.
Expandable networks have demonstrated their advantages in dealing with catastrophic forgetting problem in incremental learning. Considering that different tasks may need different structures, recent methods design dynamic structures adapted to different tasks via sophisticated skills. Their routine is to search expandable structures first and then train on the new tasks, which, however, breaks tasks into multiple training stages, leading to suboptimal or overmuch computational cost. In this paper, we propose an end-to-end trainable adaptively expandable network named E2-AEN, which dynamically generates lightweight structures for new tasks without any accuracy drop in previous tasks. Specifically, the network contains a serial of powerful feature adapters for augmenting the previously learned representations to new tasks, and avoiding task interference. These adapters are controlled via an adaptive gate-based pruning strategy which decides whether the expanded structures can be pruned, making the network structure dynamically changeable according to the complexity of the new tasks. Moreover, we introduce a novel sparsity-activation regularization to encourage the model to learn discriminative features with limited parameters. E2-AEN reduces cost and can be built upon any feed-forward architectures in an end-to-end manner. Extensive experiments on both classification (i.e., CIFAR and VDD) and detection (i.e., COCO, VOC and ICCV2021 SSLAD challenge) benchmarks demonstrate the effectiveness of the proposed method, which achieves the new remarkable results.
CVSep 19, 2023
RECALL+: Adversarial Web-based Replay for Continual Learning in Semantic SegmentationChang Liu, Giulia Rizzoli, Francesco Barbato et al.
Catastrophic forgetting of previous knowledge is a critical issue in continual learning typically handled through various regularization strategies. However, existing methods struggle especially when several incremental steps are performed. In this paper, we extend our previous approach (RECALL) and tackle forgetting by exploiting unsupervised web-crawled data to retrieve examples of old classes from online databases. In contrast to the original methodology, which did not incorporate an assessment of web-based data, the present work proposes two advanced techniques: an adversarial approach and an adaptive threshold strategy. These methods are utilized to meticulously choose samples from web data that exhibit strong statistical congruence with the no longer available training data. Furthermore, we improved the pseudo-labeling scheme to achieve a more accurate labeling of web data that also considers classes being learned in the current step. Experimental results show that this enhanced approach achieves remarkable results, particularly when the incremental scenario spans multiple steps.
CVSep 19, 2023
Retinex-guided Channel-grouping based Patch Swap for Arbitrary Style TransferChang Liu, Yi Niu, Mingming Ma et al.
The basic principle of the patch-matching based style transfer is to substitute the patches of the content image feature maps by the closest patches from the style image feature maps. Since the finite features harvested from one single aesthetic style image are inadequate to represent the rich textures of the content natural image, existing techniques treat the full-channel style feature patches as simple signal tensors and create new style feature patches via signal-level fusion, which ignore the implicit diversities existed in style features and thus fail for generating better stylised results. In this paper, we propose a Retinex theory guided, channel-grouping based patch swap technique to solve the above challenges. Channel-grouping strategy groups the style feature maps into surface and texture channels, which prevents the winner-takes-all problem. Retinex theory based decomposition controls a more stable channel code rate generation. In addition, we provide complementary fusion and multi-scale generation strategy to prevent unexpected black area and over-stylised results respectively. Experimental results demonstrate that the proposed method outperforms the existing techniques in providing more style-consistent textures while keeping the content fidelity.
CVJul 18, 2024
Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic SegmentationChang Liu, Giulia Rizzoli, Pietro Zanuttigh et al.
Current weakly-supervised incremental learning for semantic segmentation (WILSS) approaches only consider replacing pixel-level annotations with image-level labels, while the training images are still from well-designed datasets. In this work, we argue that widely available web images can also be considered for the learning of new classes. To achieve this, firstly we introduce a strategy to select web images which are similar to previously seen examples in the latent space using a Fourier-based domain discriminator. Then, an effective caption-driven reharsal strategy is proposed to preserve previously learnt classes. To our knowledge, this is the first work to rely solely on web images for both the learning of new concepts and the preservation of the already learned ones in WILSS. Experimental results show that the proposed approach can reach state-of-the-art performances without using manually selected and annotated data in the incremental steps.
CRMar 26
Efficient ML-DSA Public Key Management Method with Identity for PKI and Its ApplicationPenghui Liu, Yi Niu, Xiaoxiong Zhong et al.
With the rapid evolution of the Industrial Internet of Things (IIoT), the boundaries and scale of the Internet are continuously expanding. Consequently, the limitations of traditional certificate-based Public Key Infrastructure (PKI) have become increasingly evident, particularly in scenarios requiring large-scale certificate storage, verification, and frequent transmission. These challenges are expected to be further amplified by the widespread adoption of post-quantum cryptography. In this paper, we propose a novel identity-based public key management framework for PKI based on post-quantum cryptography, termed \textit{IPK-pq}. This approach implements an identity key generation protocol leveraging NIST ML-DSA and random matrix theory. Building on the concept of the Composite Public Key (CPK), \textit{IPK-pq} addresses the linear collusion problem inherent in CPK through an enhanced identity mapping mechanism. Furthermore, it simplifies the verification of the declared public key's authenticity, effectively reducing the complexity associated with certificate-based key management. We also provide a formal security proof for \textit{IPK-pq}, covering both individual private key components and the composite private key. To validate our approach, formally, we directly implement and evaluate \textit{IPK-pq} within a typical PKI application scenario: Resource PKI (RPKI). Comparative experimental results demonstrate that an RPKI system based on \textit{IPK-pq} yields significant improvements in efficiency and scalability. These results validate the feasibility and rationality of \textit{IPK-pq}, positioning it as a strong candidate for next-generation RPKI systems capable of securely managing large-scale routing information.
CVMay 30, 2025Code
The Butterfly Effect in Pathology: Exploring Security in Pathology Foundation ModelsJiashuai Liu, Yingjia Shang, Yingkang Zhan et al.
With the widespread adoption of pathology foundation models in both research and clinical decision support systems, exploring their security has become a critical concern. However, despite their growing impact, the vulnerability of these models to adversarial attacks remains largely unexplored. In this work, we present the first systematic investigation into the security of pathology foundation models for whole slide image~(WSI) analysis against adversarial attacks. Specifically, we introduce the principle of \textit{local perturbation with global impact} and propose a label-free attack framework that operates without requiring access to downstream task labels. Under this attack framework, we revise four classical white-box attack methods and redefine the perturbation budget based on the characteristics of WSI. We conduct comprehensive experiments on three representative pathology foundation models across five datasets and six downstream tasks. Despite modifying only 0.1\% of patches per slide with imperceptible noise, our attack leads to downstream accuracy degradation that can reach up to 20\% in the worst cases. Furthermore, we analyze key factors that influence attack success, explore the relationship between patch-level vulnerability and semantic content, and conduct a preliminary investigation into potential defence strategies. These findings lay the groundwork for future research on the adversarial robustness and reliable deployment of pathology foundation models. Our code is publicly available at: https://github.com/Jiashuai-Liu-hmos/Attack-WSI-pathology-foundation-models.
CVMay 13, 2021Code
Reciprocal Feature Learning via Explicit and Implicit Tasks in Scene Text RecognitionHui Jiang, Yunlu Xu, Zhanzhan Cheng et al.
Text recognition is a popular topic for its broad applications. In this work, we excavate the implicit task, character counting within the traditional text recognition, without additional labor annotation cost. The implicit task plays as an auxiliary branch for complementing the sequential recognition. We design a two-branch reciprocal feature learning framework in order to adequately utilize the features from both the tasks. Through exploiting the complementary effect between explicit and implicit tasks, the feature is reliably enhanced. Extensive experiments on 7 benchmarks show the advantages of the proposed methods in both text recognition and the new-built character counting tasks. In addition, it is convenient yet effective to equip with variable networks and tasks. We offer abundant ablation studies, generalizing experiments with deeper understanding on the tasks. Code is available.
CVApr 8, 2024
MOSE: Boosting Vision-based Roadside 3D Object Detection with Scene CuesXiahan Chen, Mingjian Chen, Sanli Tang et al.
3D object detection based on roadside cameras is an additional way for autonomous driving to alleviate the challenges of occlusion and short perception range from vehicle cameras. Previous methods for roadside 3D object detection mainly focus on modeling the depth or height of objects, neglecting the stationary of cameras and the characteristic of inter-frame consistency. In this work, we propose a novel framework, namely MOSE, for MOnocular 3D object detection with Scene cuEs. The scene cues are the frame-invariant scene-specific features, which are crucial for object localization and can be intuitively regarded as the height between the surface of the real road and the virtual ground plane. In the proposed framework, a scene cue bank is designed to aggregate scene cues from multiple frames of the same scene with a carefully designed extrinsic augmentation strategy. Then, a transformer-based decoder lifts the aggregated scene cues as well as the 3D position embeddings for 3D object location, which boosts generalization ability in heterologous scenes. The extensive experiment results on two public benchmarks demonstrate the state-of-the-art performance of the proposed method, which surpasses the existing methods by a large margin.
SPJan 24, 2025
Adaptive Progressive Attention Graph Neural Network for EEG Emotion RecognitionTianzhi Feng, Chennan Wu, Yi Niu et al.
In recent years, numerous neuroscientific studies demonstrate that specific areas of the brain are connected to human emotional responses, with these regions exhibiting variability across individuals and emotional states. To fully leverage these neural patterns, we propose an Adaptive Progressive Attention Graph Neural Network (APAGNN), which dynamically captures the spatial relationships among brain regions during emotional processing. The APAGNN employs three specialized experts that progressively analyze brain topology. The first expert captures global brain patterns, the second focuses on region-specific features, and the third examines emotion-related channels. This hierarchical approach enables increasingly refined analysis of neural activity. Additionally, a weight generator integrates the outputs of all three experts, balancing their contributions to produce the final predictive label. Extensive experiments conducted on SEED, SEED-IV and MPED datasets indicate that our method enhances EEG emotion recognition performance, achieving superior results compared to baseline methods.
CVMar 21, 2025
PH2ST:ST-Prompt Guided Histological Hypergraph Learning for Spatial Gene Expression PredictionYi Niu, Jiashuai Liu, Yingkang Zhan et al.
Spatial Transcriptomics (ST) reveals the spatial distribution of gene expression in tissues, offering critical insights into biological processes and disease mechanisms. However, the high cost, limited coverage, and technical complexity of current ST technologies restrict their widespread use in clinical and research settings, making obtaining high-resolution transcriptomic profiles across large tissue areas challenging. Predicting ST from H\&E-stained histology images has emerged as a promising alternative to address these limitations but remains challenging due to the heterogeneous relationship between histomorphology and gene expression, which is affected by substantial variability across patients and tissue sections. In response, we propose PH2ST, a prompt-guided hypergraph learning framework, which leverages limited ST signals to guide multi-scale histological representation learning for accurate and robust spatial gene expression prediction. Extensive evaluations on two public ST datasets and multiple prompt sampling strategies simulating real-world scenarios demonstrate that PH2ST not only outperforms existing state-of-the-art methods, but also shows strong potential for practical applications such as imputing missing spots, ST super-resolution, and local-to-global prediction, highlighting its value for scalable and cost-effective spatial gene expression mapping in biomedical contexts.
CVJan 13, 2022
Technical Report for ICCV 2021 Challenge SSLAD-Track3B: Transformers Are Better Continual LearnersDuo Li, Guimei Cao, Yunlu Xu et al.
In the SSLAD-Track 3B challenge on continual learning, we propose the method of COntinual Learning with Transformer (COLT). We find that transformers suffer less from catastrophic forgetting compared to convolutional neural network. The major principle of our method is to equip the transformer based feature extractor with old knowledge distillation and head expanding strategies to compete catastrophic forgetting. In this report, we first introduce the overall framework of continual learning for object detection. Then, we analyse the key elements' effect on withstanding catastrophic forgetting in our solution. Our method achieves 70.78 mAP on the SSLAD-Track 3B challenge test set.
CVOct 21, 2021
A Strong Baseline for Semi-Supervised Incremental Few-Shot LearningLinglan Zhao, Dashan Guo, Yunlu Xu et al.
Few-shot learning (FSL) aims to learn models that generalize to novel classes with limited training samples. Recent works advance FSL towards a scenario where unlabeled examples are also available and propose semi-supervised FSL methods. Another line of methods also cares about the performance of base classes in addition to the novel ones and thus establishes the incremental FSL scenario. In this paper, we generalize the above two under a more realistic yet complex setting, named by Semi-Supervised Incremental Few-Shot Learning (S2 I-FSL). To tackle the task, we propose a novel paradigm containing two parts: (1) a well-designed meta-training algorithm for mitigating ambiguity between base and novel classes caused by unreliable pseudo labels and (2) a model adaptation mechanism to learn discriminative features for novel classes while preserving base knowledge using few labeled and all the unlabeled data. Extensive experiments on standard FSL, semi-supervised FSL, incremental FSL, and the firstly built S2 I-FSL benchmarks demonstrate the effectiveness of our proposed method.
CVMay 13, 2021
LGPMA: Complicated Table Structure Recognition with Local and Global Pyramid Mask AlignmentLiang Qiao, Zaisheng Li, Zhanzhan Cheng et al.
Table structure recognition is a challenging task due to the various structures and complicated cell spanning relations. Previous methods handled the problem starting from elements in different granularities (rows/columns, text regions), which somehow fell into the issues like lossy heuristic rules or neglect of empty cell division. Based on table structure characteristics, we find that obtaining the aligned bounding boxes of text region can effectively maintain the entire relevant range of different cells. However, the aligned bounding boxes are hard to be accurately predicted due to the visual ambiguities. In this paper, we aim to obtain more reliable aligned bounding boxes by fully utilizing the visual information from both text regions in proposed local features and cell relations in global features. Specifically, we propose the framework of Local and Global Pyramid Mask Alignment, which adopts the soft pyramid mask learning mechanism in both the local and global feature maps. It allows the predicted boundaries of bounding boxes to break through the limitation of original proposals. A pyramid mask re-scoring module is then integrated to compromise the local and global information and refine the predicted boundaries. Finally, we propose a robust table structure recovery pipeline to obtain the final structure, in which we also effectively solve the problems of empty cells locating and division. Experimental results show that the proposed method achieves competitive and even new state-of-the-art performance on several public benchmarks.
CVMay 13, 2021
VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and RelationsPeng Zhang, Can Li, Liang Qiao et al.
Document layout analysis is crucial for understanding document structures. On this task, vision and semantics of documents, and relations between layout components contribute to the understanding process. Though many works have been proposed to exploit the above information, they show unsatisfactory results. NLP-based methods model layout analysis as a sequence labeling task and show insufficient capabilities in layout modeling. CV-based methods model layout analysis as a detection or segmentation task, but bear limitations of inefficient modality fusion and lack of relation modeling between layout components. To address the above limitations, we propose a unified framework VSR for document layout analysis, combining vision, semantics and relations. VSR supports both NLP-based and CV-based methods. Specifically, we first introduce vision through document image and semantics through text embedding maps. Then, modality-specific visual and semantic features are extracted using a two-stream network, which are adaptively fused to make full use of complementary information. Finally, given component candidates, a relation module based on graph neural network is incorported to model relations between components and output final results. On three popular benchmarks, VSR outperforms previous models by large margins. Code will be released soon.
CVApr 8, 2021
1st Place Solution to ICDAR 2021 RRC-ICTEXT End-to-end Text Spotting and Aesthetic Assessment on Integrated CircuitQiyao Wang, Pengfei Li, Li Zhu et al.
This paper presents our proposed methods to ICDAR 2021 Robust Reading Challenge - Integrated Circuit Text Spotting and Aesthetic Assessment (ICDAR RRC-ICTEXT 2021). For the text spotting task, we detect the characters on integrated circuit and classify them based on yolov5 detection model. We balance the lowercase and non-lowercase by using SynthText, generated data and data sampler. We adopt semi-supervised algorithm and distillation to furtherly improve the model's accuracy. For the aesthetic assessment task, we add a classification branch of 3 classes to differentiate the aesthetic classes of each character. Finally, we make model deployment to accelerate inference speed and reduce memory consumption based on NVIDIA Tensorrt. Our methods achieve 59.1 mAP on task 3.1 with 31 FPS and 306M memory (rank 1), 78.7\% F2 score on task 3.2 with 30 FPS and 306M memory (rank 1).
CVDec 8, 2020
MANGO: A Mask Attention Guided One-Stage Scene Text SpotterLiang Qiao, Ying Chen, Zhanzhan Cheng et al.
Recently end-to-end scene text spotting has become a popular research topic due to its advantages of global optimization and high maintainability in real applications. Most methods attempt to develop various region of interest (RoI) operations to concatenate the detection part and the sequence recognition part into a two-stage text spotting framework. However, in such framework, the recognition part is highly sensitive to the detected results (e.g.), the compactness of text contours). To address this problem, in this paper, we propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO, in which character sequences can be directly recognized without RoI operation. Concretely, a position-aware mask attention module is developed to generate attention weights on each text instance and its characters. It allows different text instances in an image to be allocated on different feature map channels which are further grouped as a batch of instance features. Finally, a lightweight sequence decoder is applied to generate the character sequences. It is worth noting that MANGO inherently adapts to arbitrary-shaped text spotting and can be trained end-to-end with only coarse position information (e.g.), rectangular bounding box) and text annotations. Experimental results show that the proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks, i.e., ICDAR 2013, ICDAR 2015, Total-Text, and SCUT-CTW1500.
CVNov 16, 2020
2D+3D Facial Expression Recognition via Discriminative Dynamic Range Enhancement and Multi-Scale LearningYang Jiao, Yi Niu, Trac D. Tran et al.
In 2D+3D facial expression recognition (FER), existing methods generate multi-view geometry maps to enhance the depth feature representation. However, this may introduce false estimations due to local plane fitting from incomplete point clouds. In this paper, we propose a novel Map Generation technique from the viewpoint of information theory, to boost the slight 3D expression differences from strong personality variations. First, we examine the HDR depth data to extract the discriminative dynamic range $r_{dis}$, and maximize the entropy of $r_{dis}$ to a global optimum. Then, to prevent the large deformation caused by over-enhancement, we introduce a depth distortion constraint and reduce the complexity from $O(KN^2)$ to $O(KNτ)$. Furthermore, the constrained optimization is modeled as a $K$-edges maximum weight path problem in a directed acyclic graph, and we solve it efficiently via dynamic programming. Finally, we also design an efficient Facial Attention structure to automatically locate subtle discriminative facial parts for multi-scale learning, and train it with a proposed loss function $\mathcal{L}_{FA}$ without any facial landmarks. Experimental results on different datasets show that the proposed method is effective and outperforms the state-of-the-art 2D+3D FER methods in both FER accuracy and the output entropy of the generated maps.
CVJul 6, 2020
Learning a Domain Classifier Bank for Unsupervised Adaptive Object DetectionSanli Tang, Zhanzhan Cheng, Shiliang Pu et al.
In real applications, object detectors based on deep networks still face challenges of the large domain gap between the labeled training data and unlabeled testing data. To reduce the gap, recent techniques are proposed by aligning the image/instance-level features between source and unlabeled target domains. However, these methods suffer from the suboptimal problem mainly because of ignoring the category information of object instances. To tackle this issue, we develop a fine-grained domain alignment approach with a well-designed domain classifier bank that achieves the instance-level alignment respecting to their categories. Specifically, we first employ the mean teacher paradigm to generate pseudo labels for unlabeled samples. Then we implement the class-level domain classifiers and group them together, called domain classifier bank, in which each domain classifier is responsible for aligning features of a specific class. We assemble the bare object detector with the proposed fine-grained domain alignment mechanism as the adaptive detector, and optimize it with a developed crossed adaptive weighting mechanism. Extensive experiments on three popular transferring benchmarks demonstrate the effectiveness of our method and achieve the new remarkable state-of-the-arts.
CVJun 22, 2020
Text Recognition in Real Scenarios with a Few Labeled SamplesJinghuang Lin, Zhanzhan Cheng, Fan Bai et al.
Scene text recognition (STR) is still a hot research topic in computer vision field due to its various applications. Existing works mainly focus on learning a general model with a huge number of synthetic text images to recognize unconstrained scene texts, and have achieved substantial progress. However, these methods are not quite applicable in many real-world scenarios where 1) high recognition accuracy is required, while 2) labeled samples are lacked. To tackle this challenging problem, this paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to build sequence adaptation between the synthetic source domain (with many synthetic labeled samples) and a specific target domain (with only some or a few real labeled samples). This is done by simultaneously learning each character's feature representation with an attention mechanism and establishing the corresponding character-level latent subspace with adversarial learning. Our approach can maximize the character-level confusion between the source domain and the target domain, thus achieves the sequence-level adaptation with even a small number of labeled samples in the target domain. Extensive experiments on various datasets show that our method significantly outperforms the finetuning scheme, and obtains comparable performance to the state-of-the-art STR methods.
CVMay 27, 2020
TRIE: End-to-End Text Reading and Information Extraction for Document UnderstandingPeng Zhang, Yunlu Xu, Zhanzhan Cheng et al.
Since real-world ubiquitous documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks, (1) text reading for detecting and recognizing texts in images and (2) information extraction for analyzing and extracting key elements from previously extracted plain text. However, they mainly focus on improving information extraction task, while neglecting the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network, where the two tasks can reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction and in turn, the semantics in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed layout to variable layout, from structured text to semi-structured text), our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
CVMay 27, 2020
SPIN: Structure-Preserving Inner Offset Network for Scene Text RecognitionChengwei Zhang, Yunlu Xu, Zhanzhan Cheng et al.
Arbitrary text appearance poses a great challenge in scene text recognition tasks. Existing works mostly handle with the problem in consideration of the shape distortion, including perspective distortions, line curvature or other style variations. Therefore, methods based on spatial transformers are extensively studied. However, chromatic difficulties in complex scenes have not been paid much attention on. In this work, we introduce a new learnable geometric-unrelated module, the Structure-Preserving Inner Offset Network (SPIN), which allows the color manipulation of source data within the network. This differentiable module can be inserted before any recognition architecture to ease the downstream tasks, giving neural networks the ability to actively transform input intensity rather than the existing spatial rectification. It can also serve as a complementary module to known spatial transformations and work in both independent and collaborative ways with them. Extensive experiments show that the use of SPIN results in a significant improvement on multiple text recognition benchmarks compared to the state-of-the-arts.
CVMay 27, 2020
Object-QA: Towards High Reliable Object Quality AssessmentJing Lu, Baorui Zou, Zhanzhan Cheng et al.
In object recognition applications, object images usually appear with different quality levels. Practically, it is very important to indicate object image qualities for better application performance, e.g. filtering out low-quality object image frames to maintain robust video object recognition results and speed up inference. However, no previous works are explicitly proposed for addressing the problem. In this paper, we define the problem of object quality assessment for the first time and propose an effective approach named Object-QA to assess high-reliable quality scores for object images. Concretely, Object-QA first employs a well-designed relative quality assessing module that learns the intra-class-level quality scores by referring to the difference between object images and their estimated templates. Then an absolute quality assessing module is designed to generate the final quality scores by aligning the quality score distributions in inter-class. Besides, Object-QA can be implemented with only object-level annotations, and is also easily deployed to a variety of object recognition tasks. To our best knowledge this is the first work to put forward the definition of this problem and conduct quantitative evaluations. Validations on 5 different datasets show that Object-QA can not only assess high-reliable quality scores according with human cognition, but also improve application performance.
CVFeb 26, 2020
Refined Gate: A Simple and Effective Gating Mechanism for Recurrent UnitsZhanzhan Cheng, Yunlu Xu, Mingjian Cheng et al.
Recurrent neural network (RNN) has been widely studied in sequence learning tasks, while the mainstream models (e.g., LSTM and GRU) rely on the gating mechanism (in control of how information flows between hidden states). However, the vanilla gates in RNN (e.g., the input gate in LSTM) suffer from the problem of gate undertraining, which can be caused by various factors, such as the saturating activation functions, the gate layouts (e.g., the gate number and gating functions), or even the suboptimal memory state etc.. Those may result in failures of learning gating switch roles and thus the weak performance. In this paper, we propose a new gating mechanism within general gated recurrent neural networks to handle this issue. Specifically, the proposed gates directly short connect the extracted input features to the outputs of vanilla gates, denoted as refined gates. The refining mechanism allows enhancing gradient back-propagation as well as extending the gating activation scope, which can guide RNN to reach possibly deeper minima. We verify the proposed gating mechanism on three popular types of gated RNNs including LSTM, GRU and MGU. Extensive experiments on 3 synthetic tasks, 3 language modeling tasks and 5 scene text recognition benchmarks demonstrate the effectiveness of our method.
CVFeb 17, 2020
Text Perceptron: Towards End-to-End Arbitrary-Shaped Text SpottingLiang Qiao, Sanli Tang, Zhanzhan Cheng et al.
Many approaches have recently been proposed to detect irregular scene text and achieved promising results. However, their localization results may not well satisfy the following text recognition part mainly because of two reasons: 1) recognizing arbitrary shaped text is still a challenging task, and 2) prevalent non-trainable pipeline strategies between text detection and text recognition will lead to suboptimal performances. To handle this incompatibility problem, in this paper we propose an end-to-end trainable text spotting approach named Text Perceptron. Concretely, Text Perceptron first employs an efficient segmentation-based text detector that learns the latent text reading order and boundary information. Then a novel Shape Transform Module (abbr. STM) is designed to transform the detected feature regions into regular morphologies without extra parameters. It unites text detection and the following recognition part into a whole framework, and helps the whole network achieve global optimization. Experiments show that our method achieves competitive performance on two standard text benchmarks, i.e., ICDAR 2013 and ICDAR 2015, and also obviously outperforms existing methods on irregular text benchmarks SCUT-CTW1500 and Total-Text.
CVAug 7, 2019
Adversarial Seeded Sequence Growing for Weakly-Supervised Temporal Action LocalizationChengwei Zhang, Yunlu Xu, Zhanzhan Cheng et al.
Temporal action localization is an important yet challenging research topic due to its various applications. Since the frame-level or segment-level annotations of untrimmed videos require amounts of labor expenditure, studies on the weakly-supervised action detection have been springing up. However, most of existing frameworks rely on Class Activation Sequence (CAS) to localize actions by minimizing the video-level classification loss, which exploits the most discriminative parts of actions but ignores the minor regions. In this paper, we propose a novel weakly-supervised framework by adversarial learning of two modules for eliminating such demerits. Specifically, the first module is designed as a well-designed Seeded Sequence Growing (SSG) Network for progressively extending seed regions (namely the highly reliable regions initialized by a CAS-based framework) to their expected boundaries. The second module is a specific classifier for mining trivial or incomplete action regions, which is trained on the shared features after erasing the seeded regions activated by SSG. In this way, a whole network composed of these two modules can be trained in an adversarial manner. The goal of the adversary is to mine features that are difficult for the action classifier. That is, erasion from SSG will force the classifier to discover minor or even new action regions on the input feature sequence, and the classifier will drive the seeds to grow, alternately. At last, we could obtain the action locations and categories from the well-trained SSG and the classifier. Extensive experiments on two public benchmarks THUMOS'14 and ActivityNet1.3 demonstrate the impressive performance of our proposed method compared with the state-of-the-arts.
CVAug 6, 2019
REAPS: Towards Better Recognition of Fine-grained Images by Region Attending and Part SequencingPeng Zhang, Xinyu Zhu, Zhanzhan Cheng et al.
Fine-grained image recognition has been a hot research topic in computer vision due to its various applications. The-state-of-the-art is the part/region-based approaches that first localize discriminative parts/regions, and then learn their fine-grained features. However, these approaches have some inherent drawbacks: 1) the discriminative feature representation of an object is prone to be disturbed by complicated background; 2) it is unreasonable and inflexible to fix the number of salient parts, because the intended parts may be unavailable under certain circumstances due to occlusion or incompleteness, and 3) the spatial correlation among different salient parts has not been thoroughly exploited (if not completely neglected). To overcome these drawbacks, in this paper we propose a new, simple yet robust method by building part sequence model on the attended object region. Concretely, we first try to alleviate the background effect by using a region attention mechanism to generate the attended region from the original image. Then, instead of localizing different salient parts and extracting their features separately, we learn the part representation implicitly by applying a mapping function on the serialized features of the object. Finally, we combine the region attending network and the part sequence learning network into a unified framework that can be trained end-to-end with only image-level labels. Our extensive experiments on three fine-grained benchmarks show that the proposed method achieves the state of the art performance.
CVMar 8, 2019
You Only Recognize Once: Towards Fast Video Text SpottingZhanzhan Cheng, Jing Lu, Yi Niu et al.
Video text spotting is still an important research topic due to its various real-applications. Previous approaches usually fall into the four-staged pipeline: text detection in individual images, framewisely recognizing localized text regions, tracking text streams and generating final results with complicated post-processing skills, which might suffer from the huge computational cost as well as the interferences of low-quality text. In this paper, we propose a fast and robust video text spotting framework by only recognizing the localized text one-time instead of frame-wisely recognition. Specifically, we first obtain text regions in videos with a well-designed spatial-temporal detector. Then we concentrate on developing a novel text recommender for selecting the highest-quality text from text streams and only recognizing the selected ones. Here, the recommender assembles text tracking, quality scoring and recognition into an end-to-end trainable module, which not only avoids the interferences from low-quality text but also dramatically speeds up the video text spotting process. In addition, we collect a larger scale video text dataset (LSVTD) for promoting the video text spotting community, which contains 100 text videos from 22 different real-life scenarios. Extensive experiments on two public benchmarks show that our method greatly speeds up the recognition process averagely by 71 times compared with the frame-wise manner, and also achieves the remarkable state-of-the-art.
CVNov 19, 2018
Segregated Temporal Assembly Recurrent Networks for Weakly Supervised Multiple Action DetectionYunlu Xu, Chengwei Zhang, Zhanzhan Cheng et al.
This paper proposes a segregated temporal assembly recurrent (STAR) network for weakly-supervised multiple action detection. The model learns from untrimmed videos with only supervision of video-level labels and makes prediction of intervals of multiple actions. Specifically, we first assemble video clips according to class labels by an attention mechanism that learns class-variable attention weights and thus helps the noise relieving from background or other actions. Secondly, we build temporal relationship between actions by feeding the assembled features into an enhanced recurrent neural network. Finally, we transform the output of recurrent neural network into the corresponding action distribution. In order to generate more precise temporal proposals, we design a score term called segregated temporal gradient-weighted class activation mapping (ST-GradCAM) fused with attention weights. Experiments on THUMOS'14 and ActivityNet1.3 datasets show that our approach outperforms the state-of-the-art weakly-supervised method, and performs at par with the fully-supervised counterparts.
CVMay 9, 2018
Edit Probability for Scene Text RecognitionFan Bai, Zhanzhan Cheng, Yi Niu et al.
We consider the scene text recognition problem under the attention-based encoder-decoder framework, which is the state of the art. The existing methods usually employ a frame-wise maximal likelihood loss to optimize the models. When we train the model, the misalignment between the ground truth strings and the attention's output sequences of probability distribution, which is caused by missing or superfluous characters, will confuse and mislead the training process, and consequently make the training costly and degrade the recognition accuracy. To handle this problem, we propose a novel method called edit probability (EP) for scene text recognition. EP tries to effectively estimate the probability of generating a string from the output sequence of probability distribution conditioned on the input image, while considering the possible occurrences of missing/superfluous characters. The advantage lies in that the training process can focus on the missing, superfluous and unrecognized characters, and thus the impact of the misalignment problem can be alleviated or even overcome. We conduct extensive experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets. Experimental results show that the EP can substantially boost scene text recognition performance.
CVNov 12, 2017
AON: Towards Arbitrarily-Oriented Text RecognitionZhanzhan Cheng, Yangliu Xu, Fan Bai et al.
Recognizing text from natural images is a hot research topic in computer vision due to its various applications. Despite the enduring research of several decades on optical character recognition (OCR), recognizing texts from natural images is still a challenging task. This is because scene texts are often in irregular (e.g. curved, arbitrarily-oriented or seriously distorted) arrangements, which have not yet been well addressed in the literature. Existing methods on text recognition mainly work with regular (horizontal and frontal) texts and cannot be trivially generalized to handle irregular texts. In this paper, we develop the arbitrary orientation network (AON) to directly capture the deep features of irregular texts, which are combined into an attention-based decoder to generate character sequence. The whole network can be trained end-to-end by using only images and word-level annotations. Extensive experiments on various benchmarks, including the CUTE80, SVT-Perspective, IIIT5k, SVT and ICDAR datasets, show that the proposed AON-based method achieves the-state-of-the-art performance in irregular datasets, and is comparable to major existing methods in regular datasets.