Kun Yao

CV
h-index32
28papers
1,214citations
Novelty52%
AI Score50

28 Papers

CVJul 26, 2022Code
Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment

Qiang Chen, Xiaokang Chen, Jian Wang et al.

Detection transformer (DETR) relies on one-to-one assignment, assigning one ground-truth object to one prediction, for end-to-end detection without NMS post-processing. It is known that one-to-many assignment, assigning one ground-truth object to multiple predictions, succeeds in detection methods such as Faster R-CNN and FCOS. While the naive one-to-many assignment does not work for DETR, and it remains challenging to apply one-to-many assignment for DETR training. In this paper, we introduce Group DETR, a simple yet efficient DETR training approach that introduces a group-wise way for one-to-many assignment. This approach involves using multiple groups of object queries, conducting one-to-one assignment within each group, and performing decoder self-attention separately. It resembles data augmentation with automatically-learned object query augmentation. It is also equivalent to simultaneously training parameter-sharing networks of the same architecture, introducing more supervision and thus improving DETR training. The inference process is the same as DETR trained normally and only needs one group of queries without any architecture modification. Group DETR is versatile and is applicable to various DETR variants. The experiments show that Group DETR significantly speeds up the training convergence and improves the performance of various DETR-based models. Code will be available at \url{https://github.com/Atten4Vis/GroupDETR}.

CVAug 14, 2023Code
Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

Huan Liu, Qiang Chen, Zichang Tan et al.

In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e.g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR. We present a simple yet effective transformer approach, named Group Pose. We simply regard $K$-keypoint pose estimation as predicting a set of $N\times K$ keypoint positions, each from a keypoint query, as well as representing each pose with an instance query for scoring $N$ pose predictions. Motivated by the intuition that the interaction, among across-instance queries of different types, is not directly helpful, we make a simple modification to decoder self-attention. We replace single self-attention over all the $N\times(K+1)$ queries with two subsequent group self-attentions: (i) $N$ within-instance self-attention, with each over $K$ keypoint queries and one instance query, and (ii) $(K+1)$ same-type across-instance self-attention, each over $N$ queries of the same type. The resulting decoder removes the interaction among across-instance type-different queries, easing the optimization and thus improving the performance. Experimental results on MS COCO and CrowdPose show that our approach without human box supervision is superior to previous methods with complex decoders, and even is slightly better than ED-Pose that uses human box supervision. $\href{https://github.com/Michel-liu/GroupPose-Paddle}{\rm Paddle}$ and $\href{https://github.com/Michel-liu/GroupPose}{\rm PyTorch}$ code are available.

CVSep 26, 2023
GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction

Pengyuan Lyu, Weihong Ma, Hongyi Wang et al. · microsoft-research

All tables can be represented as grids. Based on this observation, we propose GridFormer, a novel approach for interpreting unconstrained table structures by predicting the vertex and edge of a grid. First, we propose a flexible table representation in the form of an MXN grid. In this representation, the vertexes and edges of the grid store the localization and adjacency information of the table. Then, we introduce a DETR-style table structure recognizer to efficiently predict this multi-objective information of the grid in a single shot. Specifically, given a set of learned row and column queries, the recognizer directly outputs the vertexes and edges information of the corresponding rows and columns. Extensive experiments on five challenging benchmarks which include wired, wireless, multi-merge-cell, oriented, and distorted tables demonstrate the competitive performance of our model over other methods.

CVOct 31, 2023
HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

Junkun Yuan, Xinyu Zhang, Hao Zhou et al. · tencent-ai

Model pre-training is essential in human-centric perception. In this paper, we first introduce masked image modeling (MIM) as a pre-training approach for this task. Upon revisiting the MIM training strategy, we reveal that human structure priors offer significant potential. Motivated by this insight, we further incorporate an intuitive human structure prior - human parts - into pre-training. Specifically, we employ this prior to guide the mask sampling process. Image patches, corresponding to human part regions, have high priority to be masked out. This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks. To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views, guided by the human part prior, to be closely aligned for the same image. We term the entire method as HAP. HAP simply uses a plain ViT as the encoder yet establishes new state-of-the-art performance on 11 human-centric benchmarks, and on-par result on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation.

CVNov 17, 2022
CAE v2: Context Autoencoder with CLIP Target

Xinyu Zhang, Jiahui Chen, Junkun Yuan et al. · tencent-ai

Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches. Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM. However, it is still under-explored how CLIP supervision in MIM influences performance. To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio, and reveal two interesting perspectives, relying on our developed simple pipeline, context autodecoder with CLIP target (CAE v2). Firstly, we observe that the supervision on visible patches achieves remarkable performance, even better than that on masked patches, where the latter is the standard format in the existing MIM methods. Secondly, the optimal mask ratio positively correlates to the model size. That is to say, the smaller the model, the lower the mask ratio needs to be. Driven by these two discoveries, our simple and concise approach CAE v2 achieves superior performance on a series of downstream tasks. For example, a vanilla ViT-Large model achieves 81.7% and 86.7% top-1 accuracy on linear probing and fine-tuning on ImageNet-1K, and 55.9% mIoU on semantic segmentation on ADE20K with the pre-training for 300 epochs. We hope our findings can be helpful guidelines for the pre-training in the MIM area, especially for the small-scale models.

CVJun 5, 2023
ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images

Wenwen Yu, Chengquan Zhang, Haoyu Cao et al.

Structured text extraction is one of the most valuable and challenging application directions in the field of Document AI. However, the scenarios of past benchmarks are limited, and the corresponding evaluation protocols usually focus on the submodules of the structured text extraction scheme. In order to eliminate these problems, we organized the ICDAR 2023 competition on Structured text extraction from Visually-Rich Document images (SVRD). We set up two tracks for SVRD including Track 1: HUST-CELL and Track 2: Baidu-FEST, where HUST-CELL aims to evaluate the end-to-end performance of Complex Entity Linking and Labeling, and Baidu-FEST focuses on evaluating the performance and generalization of Zero-shot / Few-shot Structured Text extraction from an end-to-end perspective. Compared to the current document benchmarks, our two tracks of competition benchmark enriches the scenarios greatly and contains more than 50 types of visually-rich document images (mainly from the actual enterprise applications). The competition opened on 30th December, 2022 and closed on 24th March, 2023. There are 35 participants and 91 valid submissions received for Track 1, and 15 participants and 26 valid submissions received for Track 2. In this report we will presents the motivation, competition datasets, task definition, evaluation protocol, and submission summaries. According to the performance of the submissions, we believe there is still a large gap on the expected information extraction performance for complex and zero-shot scenarios. It is hoped that this competition will attract many researchers in the field of CV and NLP, and bring some new thoughts to the field of Document AI.

CVMar 1, 2023
StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

Yuechen Yu, Yulin Li, Chengquan Zhang et al.

In this paper, we present StrucTexTv2, an effective document image pre-training framework, by performing masked visual-textual prediction. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling, based on text region-level image masking. The proposed method randomly masks some image regions according to the bounding box coordinates of text words. The objectives of our pre-training tasks are reconstructing the pixels of masked image regions and the corresponding masked tokens simultaneously. Hence the pre-trained encoder can capture more textual semantics in comparison to the masked image modeling that usually predicts the masked image patches. Compared to the masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and potentially deals with more application scenarios free from OCR pre-processing. Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.

CVNov 7, 2022
Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

Qiang Chen, Jian Wang, Chuchu Han et al.

We present a strong object detector with encoder-decoder pretraining and finetuning. Our method, called Group DETR v2, is built upon a vision transformer encoder ViT-Huge~\cite{dosovitskiy2020image}, a DETR variant DINO~\cite{zhang2022dino}, and an efficient DETR training method Group DETR~\cite{chen2022group}. The training process consists of self-supervised pretraining and finetuning a ViT-Huge encoder on ImageNet-1K, pretraining the detector on Object365, and finally finetuning it on COCO. Group DETR v2 achieves $\textbf{64.5}$ mAP on COCO test-dev, and establishes a new SoTA on the COCO leaderboard https://paperswithcode.com/sota/object-detection-on-coco

CVJun 1, 2022
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

Pengyuan Lyu, Chengquan Zhang, Shanshan Liu et al.

Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images, which allows us to learn strong visual representations. In contrast to introducing linguistic knowledge with an additional language model, we directly pre-train the sequence decoder. Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language modeling scheme. Significantly, the encoder is frozen during the pre-training phase of the sequence decoder. Experimental results demonstrate that our proposed method achieves superior performance on benchmark datasets, including Chinese and English text images.

CVAug 14, 2023
Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning

Xugong Qin, Pengyuan Lyu, Chengquan Zhang et al.

Due to the flexible representation of arbitrary-shaped scene text and simple pipeline, bottom-up segmentation-based methods begin to be mainstream in real-time scene text detection. Despite great progress, these methods show deficiencies in robustness and still suffer from false positives and instance adhesion. Different from existing methods which integrate multiple-granularity features or multiple outputs, we resort to the perspective of representation learning in which auxiliary tasks are utilized to enable the encoder to jointly learn robust features with the main task of per-pixel classification during optimization. For semantic representation learning, we propose global-dense semantic contrast (GDSC), in which a vector is extracted for global semantic representation, then used to perform element-wise contrast with the dense grid features. To learn instance-aware representation, we propose to combine top-down modeling (TDM) with the bottom-up framework to provide implicit instance-level clues for the encoder. With the proposed GDSC and TDM, the encoder network learns stronger representation without introducing any parameters and computations during inference. Equipped with a very light decoder, the detector can achieve more robust real-time scene text detection. Experimental results on four public datasets show that the proposed method can outperform or be comparable to the state-of-the-art on both accuracy and speed. Specifically, the proposed method achieves 87.2% F-measure with 48.2 FPS on Total-Text and 89.6% F-measure with 36.9 FPS on MSRA-TD500 on a single GeForce RTX 2080 Ti GPU.

CVJun 29, 2023
Learning Structure-Guided Diffusion Model for 2D Human Pose Estimation

Zhongwei Qiu, Qiansheng Yang, Jian Wang et al.

One of the mainstream schemes for 2D human pose estimation (HPE) is learning keypoints heatmaps by a neural network. Existing methods typically improve the quality of heatmaps by customized architectures, such as high-resolution representation and vision Transformers. In this paper, we propose \textbf{DiffusionPose}, a new scheme that formulates 2D HPE as a keypoints heatmaps generation problem from noised heatmaps. During training, the keypoints are diffused to random distribution by adding noises and the diffusion model learns to recover ground-truth heatmaps from noised heatmaps with respect to conditions constructed by image feature. During inference, the diffusion model generates heatmaps from initialized heatmaps in a progressive denoising way. Moreover, we further explore improving the performance of DiffusionPose with conditions from human structural information. Extensive experiments show the prowess of our DiffusionPose, with improvements of 1.6, 1.2, and 1.2 mAP on widely-used COCO, CrowdPose, and AI Challenge datasets, respectively.

CVJul 15, 2024Code
OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

Yu Wang, Xiangbo Su, Qiang Chen et al.

Open-vocabulary object detection focusing on detecting novel categories guided by natural language. In this report, we propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment friendly open-vocabulary detector with strong performance and low latency. Building upon OVLW-DETR, we provide an end-to-end training recipe that transferring knowledge from vision-language model (VLM) to object detector with simple alignment. We align detector with the text encoder from VLM by replacing the fixed classification layer weights in detector with the class-name embeddings extracted from the text encoder. Without additional fusing module, OVLW-DETR is flexible and deployment friendly, making it easier to implement and modulate. improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior over existing real-time open-vocabulary detectors on standard Zero-Shot LVIS benchmark. Source code and pre-trained models are available at [https://github.com/Atten4Vis/LW-DETR].

CVAug 31, 2022
TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers

Zengyuan Guo, Yuechen Yu, Pengyuan Lv et al.

Table structure recognition is a crucial part of document image analysis domain. Its difficulty lies in the need to parse the physical coordinates and logical indices of each cell at the same time. However, the existing methods are difficult to achieve both these goals, especially when the table splitting lines are blurred or tilted. In this paper, we propose an accurate and end-to-end transformer-based table structure recognition method, referred to as TRUST. Transformers are suitable for table structure recognition because of their global computations, perfect memory, and parallel computation. By introducing novel Transformer-based Query-based Splitting Module and Vertex-based Merging Module, the table structure recognition problem is decoupled into two joint optimization sub-tasks: multi-oriented table row/column splitting and table grid merging. The Query-based Splitting Module learns strong context information from long dependencies via Transformer networks, accurately predicts the multi-oriented table row/column separators, and obtains the basic grids of the table accordingly. The Vertex-based Merging Module is capable of aggregating local contextual information between adjacent basic grids, providing the ability to merge basic girds that belong to the same spanning cell accurately. We conduct experiments on several popular benchmarks including PubTabNet and SynthTable, our method achieves new state-of-the-art results. In particular, TRUST runs at 10 FPS on PubTabNet, surpassing the previous methods by a large margin.

CVJul 30, 2024Code
Add-SD: Rational Generation without Manual Reference

Lingfeng Yang, Xinyu Zhang, Xiang Li et al.

Diffusion models have exhibited remarkable prowess in visual generalization. Building on this success, we introduce an instruction-based object addition pipeline, named Add-SD, which automatically inserts objects into realistic scenes with rational sizes and positions. Different from layout-conditioned methods, Add-SD is solely conditioned on simple text prompts rather than any other human-costly references like bounding boxes. Our work contributes in three aspects: proposing a dataset containing numerous instructed image pairs; fine-tuning a diffusion model for rational generation; and generating synthetic data to boost downstream tasks. The first aspect involves creating a RemovalDataset consisting of original-edited image pairs with textual instructions, where an object has been removed from the original image while maintaining strong pixel consistency in the background. These data pairs are then used for fine-tuning the Stable Diffusion (SD) model. Subsequently, the pretrained Add-SD model allows for the insertion of expected objects into an image with good rationale. Additionally, we generate synthetic instances for downstream task datasets at scale, particularly for tail classes, to alleviate the long-tailed problem. Downstream tasks benefit from the enriched dataset with enhanced diversity and rationale. Experiments on LVIS val demonstrate that Add-SD yields an improvement of 4.3 mAP on rare classes over the baseline. Code and models are available at https://github.com/ylingfeng/Add-SD.

AIMar 2Code
Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory

Boqin Yuan, Yue Su, Kun Yao

Memory-augmented LLM agents store and retrieve information from prior interactions, yet the relative importance of how memories are written versus how they are retrieved remains unclear. We introduce a diagnostic framework that analyzes how performance differences manifest across write strategies, retrieval methods, and memory utilization behavior, and apply it to a 3x3 study crossing three write strategies (raw chunks, Mem0-style fact extraction, MemGPT-style summarization) with three retrieval methods (cosine, BM25, hybrid reranking). On LoCoMo, retrieval method is the dominant factor: average accuracy spans 20 points across retrieval methods (57.1% to 77.2%) but only 3-8 points across write strategies. Raw chunked storage, which requires zero LLM calls, matches or outperforms expensive lossy alternatives, suggesting that current memory pipelines may discard useful context that downstream retrieval mechanisms fail to compensate for. Failure analysis shows that performance breakdowns most often manifest at the retrieval stage rather than at utilization. We argue that, under current retrieval practices, improving retrieval quality yields larger gains than increasing write-time sophistication. Code is publicly available at https://github.com/boqiny/memory-probe.

CVJul 24, 2023
MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary

Beiya Dai, Xing li, Qunyi Xie et al.

Document dewarping from a distorted camera-captured image is of great value for OCR and document understanding. The document boundary plays an important role which is more evident than the inner region in document dewarping. Current learning-based methods mainly focus on complete boundary cases, leading to poor document correction performance of documents with incomplete boundaries. In contrast to these methods, this paper proposes MataDoc, the first method focusing on arbitrary boundary document dewarping with margin and text aware regularizations. Specifically, we design the margin regularization by explicitly considering background consistency to enhance boundary perception. Moreover, we introduce word position consistency to keep text lines straight in rectified document images. To produce a comprehensive evaluation of MataDoc, we propose a novel benchmark ArbDoc, mainly consisting of document images with arbitrary boundaries in four typical scenarios. Extensive experiments confirm the superiority of MataDoc with consideration for the incomplete boundary on ArbDoc and also demonstrate the effectiveness of the proposed method on DocUNet, DIR300, and WarpDoc datasets.

CVSep 20, 2024
FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs

Jing Hao, Yuxiang Zhao, Song Chen et al.

Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the simplicity of the prompts used to instruct the model. To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions. This engine is characterized by its cascade annotation process, which involves multiple expert models and employs rich prompts to instruct LLMs in generating dense image captions. We re-annotated the COCO and Visual Genome datasets using our FullAnno system, tripling the number of object annotations and increasing the length of the original image captions by a factor of 15. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks. The re-annotated data are available at: https://arcana-project-page.github.io

AIMay 7
AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

Yang Xu, Kun Yao, Yiming Deng et al.

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the model to focus on rare correct paths while suppressing updates from trivial paths. Our experiments on five mathematical benchmarks demonstrate that AGPO achieves state-of-the-art accuracy while consistently improving pass@$k$ performance at scale. In a large-scale industrial application for search ads relevance optimization, AGPO effectively enhances the quality of the data annotation, leading to substantial performance gains in downstream student models.

CVJun 5, 2024Code
LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

Qiang Chen, Xiangbo Su, Xinyu Zhang et al.

In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, and the intermediate and final feature maps in the ViT encoder, forming richer feature maps, and introduce window-major feature map organization for improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior over existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at (https://github.com/Atten4Vis/LW-DETR).

CVFeb 5, 2024
FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition

Xiaohu Huang, Hao Zhou, Kun Yao et al.

In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability stemming from pretaining on massive image-text pairs. However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence of temporal information in CLIP's pretraining. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, resulting in unsatisfactory results when dealing with unseen actions. To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task. Specifically, the residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability exhibited by the original CLIP and supervises the feature learning for the extraction of video-specific features to bridge the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to reach a balance between the two distinct objectives of learning generalizable and video-specific features. We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets across the board. Project page: https://visual-ai.github.io/froster.

CVJan 3, 2025
Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models

Guosheng Zhang, Keyao Wang, Haixiao Yue et al.

Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model's supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model's perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.

CVJun 13, 2024
Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

Zhengqi Zhao, Xiaohu Huang, Hao Zhou et al.

The key to action counting is accurately locating each video's repetitive actions. Instead of estimating the probability of each frame belonging to an action directly, we propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner. The model draws inspiration from empirical observations indicating that humans typically engage in coarse skimming of entire sequences to grasp the general action pattern initially, followed by a finer, frame-by-frame focus to determine if it aligns with the target action. Specifically, SkimFocusNet incorporates a skim branch and a focus branch. The skim branch scans the global contextual information throughout the sequence to identify potential target action for guidance. Subsequently, the focus branch utilizes the guidance to diligently identify repetitive actions using a long-short adaptive guidance (LSAG) block. Additionally, we have observed that videos in existing datasets often feature only one type of repetitive action, which inadequately represents real-world scenarios. To more accurately describe real-life situations, we establish the Multi-RepCount dataset, which includes videos containing multiple repetitive motions. On Multi-RepCount, our SkimFoucsNet can perform specified action counting, that is, to enable counting a particular action type by referencing an exemplary video. This capability substantially exhibits the robustness of our method. Extensive experiments demonstrate that SkimFocusNet achieves state-of-the-art performances with significant improvements. We also conduct a thorough ablation study to evaluate the network components. The source code will be published upon acceptance.

CVMay 19, 2023
Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding

Mingliang Zhai, Yulin Li, Xiameng Qin et al.

Transformers achieve promising performance in document understanding because of their high effectiveness and still suffer from quadratic computational complexity dependency on the sequence length. General efficient transformers are challenging to be directly adapted to model document. They are unable to handle the layout representation in documents, e.g. word, line and paragraph, on different granularity levels and seem hard to achieve a good trade-off between efficiency and performance. To tackle the concerns, we propose Fast-StrucTexT, an efficient multi-modal framework based on the StrucTexT algorithm with an hourglass transformer architecture, for visual document understanding. Specifically, we design a modality-guided dynamic token merging block to make the model learn multi-granularity representation and prunes redundant tokens. Additionally, we present a multi-modal interaction module called Symmetry Cross Attention (SCA) to consider multi-modal fusion and efficiently guide the token mergence. The SCA allows one modality input as query to calculate cross attention with another modality in a dual phase. Extensive experiments on FUNSD, SROIE, and CORD datasets demonstrate that our model achieves the state-of-the-art performance and almost 1.9X faster inference time than the state-of-the-art methods.

CVMar 31, 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval

Mengjun Cheng, Yipeng Sun, Longchao Wang et al.

Visual appearance is considered to be the most important cue to understand images for cross-modal retrieval, while sometimes the scene text appearing in images can provide valuable information to understand the visual semantics. Most of existing cross-modal retrieval approaches ignore the usage of scene text information and directly adding this information may lead to performance degradation in scene text free scenarios. To address this issue, we propose a full transformer architecture to unify these cross-modal retrieval scenarios in a single $\textbf{Vi}$sion and $\textbf{S}$cene $\textbf{T}$ext $\textbf{A}$ggregation framework (ViSTA). Specifically, ViSTA utilizes transformer blocks to directly encode image patches and fuse scene text embedding to learn an aggregated visual representation for cross-modal retrieval. To tackle the modality missing problem of scene text, we propose a novel fusion token based transformer aggregation approach to exchange the necessary scene text information only through the fusion token and concentrate on the most important features in each modality. To further strengthen the visual modality, we develop dual contrastive learning losses to embed both image-text pairs and fusion-text pairs into a common cross-modal space. Compared to existing methods, ViSTA enables to aggregate relevant scene text semantics with visual appearance, and hence improve results under both scene text free and scene text aware scenarios. Experimental results show that ViSTA outperforms other methods by at least $\bf{8.4}\%$ at Recall@1 for scene text aware retrieval task. Compared with state-of-the-art scene text free retrieval methods, ViSTA can achieve better accuracy on Flicker30K and MSCOCO while running at least three times faster during the inference stage, which validates the effectiveness of the proposed framework.

CVAug 6, 2021
StrucTexT: Structured Text Understanding with Multi-Modal Transformers

Yulin Li, Yuxi Qian, Yuchen Yu et al.

Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Due to the complexity of content and layout in VRDs, structured text understanding has been a challenging task. Most existing studies decoupled this problem into two sub-tasks: entity labeling and entity linking, which require an entire understanding of the context of documents at both token and segment levels. However, little work has been concerned with the solutions that efficiently extract the structured data from different levels. This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks. Specifically, based on the transformer, we introduce a segment-token aligned encoder to deal with the entity labeling and entity linking tasks at different levels of granularity. Moreover, we design a novel pre-training strategy with three self-supervised tasks to learn a richer representation. StrucTexT uses the existing Masked Visual Language Modeling task and the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate the multi-modal information across text, image, and layout. We evaluate our method for structured text understanding at segment-level and token-level and show it outperforms the state-of-the-art counterparts with significantly superior performance on the FUNSD, SROIE, and EPHOIE datasets.

CHEM-PHOct 31, 2018
Compressing physical properties of atomic species for improving predictive chemistry

John E. Herr, Kevin Koh, Kun Yao et al.

The answers to many unsolved problems lie in the intractable chemical space of molecules and materials. Machine learning techniques are rapidly growing in popularity as a way to compress and explore chemical space efficiently. One of the most important aspects of machine learning techniques is representation through the feature vector, which should contain the most important descriptors necessary to make accurate predictions, not least of which is the atomic species in the molecule or material. In this work we introduce a compressed representation of physical properties for atomic species we call the elemental modes. The elemental modes provide an excellent representation by capturing many of the nuances of the periodic table and the similarity of atomic species. We apply the elemental modes to several different tasks for machine learning algorithms and show that they enable us to make improvements to these tasks even beyond simply achieving higher accuracy predictions.

CHEM-PHDec 19, 2017
Metadynamics for Training Neural Network Model Chemistries: a Competitive Assessment

John E. Herr, Kun Yao, Ryker McIntyre et al.

Neural network (NN) model chemistries (MCs) promise to facilitate the accurate exploration of chemical space and simulation of large reactive systems. One important path to improving these models is to add layers of physical detail, especially long-range forces. At short range, however, these models are data driven and data limited. Little is systematically known about how data should be sampled, and `test data' chosen randomly from some sampling techniques can provide poor information about generality. If the sampling method is narrow `test error' can appear encouragingly tiny while the model fails catastrophically elsewhere. In this manuscript we competitively evaluate two common sampling methods: molecular dynamics (MD), normal-mode sampling (NMS) and one uncommon alternative, Metadynamics (MetaMD), for preparing training geometries. We show that MD is an inefficient sampling method in the sense that additional samples do not improve generality. We also show MetaMD is easily implemented in any NNMC software package with cost that scales linearly with the number of atoms in a sample molecule. MetaMD is a black-box way to ensure samples always reach out to new regions of chemical space, while remaining relevant to chemistry near $k_bT$. It is one cheap tool to address the issue of generalization.

CHEM-PHSep 22, 2016
The Many-Body Expansion Combined with Neural Networks

Kun Yao, John E. Herr, John Parkhill

Fragmentation methods such as the many-body expansion (MBE) are a common strategy to model large systems by partitioning energies into a hierarchy of decreasingly significant contributions. The number of fragments required for chemical accuracy is still prohibitively expensive for ab-initio MBE to compete with force field approximations for applications beyond single-point energies. Alongside the MBE, empirical models of ab-initio potential energy surfaces have improved, especially non-linear models based on neural networks (NN) which can reproduce ab-initio potential energy surfaces rapidly and accurately. Although they are fast, NNs suffer from their own curse of dimensionality; they must be trained on a representative sample of chemical space. In this paper we examine the synergy of the MBE and NN's, and explore their complementarity. The MBE offers a systematic way to treat systems of arbitrary size and intelligently sample chemical space. NN's reduce, by a factor in excess of $10^6$ the computational overhead of the MBE and reproduce the accuracy of ab-initio calculations without specialized force fields. We show they are remarkably general, providing comparable accuracy with drastically different chemical embeddings. To assess this we test a new chemical embedding which can be inverted to predict molecules with desired properties.