Xiaoyuan Yu

h-index: 12

Papers: 3

Citations: 101

Novelty: 50%

AI Score: 28

Ranked #151,041 of 194,257 authors (top 78%); #49,278 in CV (top 83%)

3 Papers

15.9 | CV | Jul 10, 2021 | Code

TA2N: Two-Stage Action Alignment Network for Few-shot Action Recognition

Shuyuan Li, Huabin Liu, Rui Qian et al.

Few-shot action recognition aims to recognize novel action classes (query) using just a few samples (support). Most current approaches follow the metric learning paradigm, which learns to compare the similarity between videos. Recent work has observed that directly measuring this similarity is not ideal, since different action instances may show distinctive temporal distributions, resulting in severe misalignment between query and support videos. In this paper, we tackle this problem from two distinct aspects: action duration misalignment and action evolution misalignment. We address them sequentially through a Two-stage Action Alignment Network (TA2N). The first stage locates the action by learning a temporal affine transform, which warps each video feature to its action duration while dismissing action-irrelevant features (e.g., background). The second stage then coordinates the query feature to match the spatial-temporal action evolution of the support by performing temporal rearrangement and spatial offset prediction. Extensive experiments on benchmark datasets show the potential of the proposed method in achieving state-of-the-art performance for few-shot action recognition. The code of this project can be found at https://github.com/R00Kie-Liu/TA2N
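
To make the first-stage idea concrete, here is a minimal sketch of warping a sequence of per-frame features with a temporal affine transform. It is an illustration only, not the TA2N implementation: the function name `temporal_affine_warp`, the `scale` and `shift` parameters, the normalized time axis in [-1, 1], and the use of linear interpolation are all assumptions. In the paper the transform is learned, whereas here the parameters are passed in by hand.

```python
def temporal_affine_warp(features, scale, shift):
    """Resample per-frame feature vectors along the time axis.

    features: list of T feature vectors (each a list of floats).
    scale, shift: hypothetical affine parameters on a normalized time
    axis in [-1, 1]; output time t reads from source time scale * t + shift.
    """
    T = len(features)
    if T == 1:
        return [list(features[0])]
    warped = []
    for i in range(T):
        t_out = -1.0 + 2.0 * i / (T - 1)      # normalized output time
        t_src = scale * t_out + shift         # affine map to source time
        pos = (t_src + 1.0) * (T - 1) / 2.0   # back to a fractional frame index
        pos = min(max(pos, 0.0), T - 1)       # clamp to the valid frame range
        lo = int(pos)
        hi = min(lo + 1, T - 1)
        w = pos - lo
        # Linear interpolation between the two neighbouring frames.
        warped.append([(1 - w) * a + w * b
                       for a, b in zip(features[lo], features[hi])])
    return warped


# Toy video of 5 frames with 1-D features; scale=0.5 zooms into the middle half.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0]]
zoomed = temporal_affine_warp(frames, scale=0.5, shift=0.0)
```

With `scale < 1` the output covers a shorter window of the source video, which is one way frames outside the action duration could be dropped.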

1.4 | CV | Dec 2, 2021

Vision Pair Learning: An Efficient Training Framework for Image Classification

Bei Tong, Xiaoyuan Yu

The transformer is a potentially powerful architecture for vision tasks. Although it is equipped with more parameters and an attention mechanism, its performance is currently not as dominant as that of CNNs. CNNs are usually computationally cheaper and remain the leading competitors in various vision tasks. One research direction is to adopt the successful ideas of CNNs to improve transformers, but this often relies on elaborate and heuristic network design. Observing that transformers and CNNs are complementary in representation learning and convergence speed, we propose an efficient training framework called Vision Pair Learning (VPL) for the image classification task. VPL builds a network composed of a transformer branch, a CNN branch, and a pair learning module. With a multi-stage training strategy, VPL enables the branches to learn from their partners during the appropriate stage of the training process, so that both achieve better performance at lower time cost. Without external data, VPL raises the top-1 accuracy of ViT-Base and ResNet-50 on the ImageNet-1k validation set to 83.47% and 79.61%, respectively. Experiments on other datasets from various domains demonstrate the efficacy of VPL and suggest that the transformer performs better when paired with a differently structured CNN in VPL. We also analyze the importance of the components through an ablation study.
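
The abstract does not specify how the two branches learn from each other, so the following is only a sketch of one plausible ingredient: a symmetric KL term between the two branches' predictions, switched on during a chosen training stage. The symmetric KL choice (borrowed from mutual-learning setups), the function names, and the `start_epoch`/`end_epoch` gating are all assumptions, not the VPL method itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pair_loss(logits_transformer, logits_cnn):
    """Hypothetical symmetric KL term that pulls the two branches together."""
    p = softmax(logits_transformer)
    q = softmax(logits_cnn)
    return kl_divergence(p, q) + kl_divergence(q, p)


def staged_pair_loss(logits_transformer, logits_cnn, epoch, start_epoch, end_epoch):
    """Apply the pair term only inside an assumed training stage."""
    if start_epoch <= epoch < end_epoch:
        return pair_loss(logits_transformer, logits_cnn)
    return 0.0
```

In a real training loop this term would be added to each branch's usual classification loss. The staging here is a stand-in for the multi-stage strategy the abstract mentions.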

5.6 | CV | Nov 25, 2021

A Close Look at Few-shot Real Image Super-resolution from the Distortion Relation Perspective

Xin Li, Xin Jin, Jun Fu et al.

Collecting large amounts of distorted/clean image pairs in the real world is non-trivial, which seriously limits the practical application of supervised learning-based methods to real-world image super-resolution (RealSR). Previous works usually address this problem by leveraging unsupervised learning-based techniques to alleviate the dependency on paired training samples. However, these methods typically suffer from unsatisfactory texture synthesis due to the lack of supervision from clean images. To overcome this problem, we are the first to take a close look at an under-explored direction for RealSR, namely few-shot real-world image super-resolution, which aims to tackle the challenging RealSR problem with only a few distorted/clean image pairs. Under this new scenario, we propose Distortion Relation guided Transfer Learning (DRTL) for few-shot RealSR, which transfers the rich restoration knowledge from auxiliary distortions (i.e., synthetic distortions) to the target RealSR task under the guidance of the distortion relation. Concretely, DRTL builds a knowledge graph to capture the distortion relation between the auxiliary distortions and the target distortion (i.e., real distortions in RealSR). Based on this relation, DRTL adopts a gradient reweighting strategy to guide the knowledge transfer process between the auxiliary and target distortions. In this way, DRTL can quickly learn the most relevant knowledge from the synthetic distortions for the target distortion. We instantiate DRTL with two commonly used transfer learning paradigms, pre-training and meta-learning, to realize distortion relation-aware few-shot RealSR. Extensive experiments on multiple benchmarks and thorough ablation studies demonstrate the effectiveness of DRTL.
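
The gradient reweighting step can be illustrated with a small sketch. It assumes each auxiliary distortion already has a scalar relation score to the target distortion, and it turns those scores into weights with a softmax. The softmax, the `temperature` parameter, and the function names are assumptions for illustration. The abstract does not say how DRTL derives weights from its knowledge graph.

```python
import math


def relation_weights(relations, temperature=1.0):
    """Map relation scores to positive weights that sum to 1 (assumed softmax)."""
    m = max(relations)
    exps = [math.exp((r - m) / temperature) for r in relations]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_gradients(aux_grads, relations, temperature=1.0):
    """Combine per-distortion gradients, favouring distortions closer to the target.

    aux_grads: one flat gradient vector (list of floats) per auxiliary distortion.
    relations: one hypothetical relation score per auxiliary distortion.
    """
    weights = relation_weights(relations, temperature)
    dim = len(aux_grads[0])
    return [sum(w * g[k] for w, g in zip(weights, aux_grads)) for k in range(dim)]


# Two synthetic distortions; the second is assumed to be closer to the target.
grads = [[1.0, 0.0], [0.0, 1.0]]
combined = reweight_gradients(grads, relations=[0.2, 0.9])
```

The combined vector would then drive the parameter update, so auxiliary distortions with a stronger relation to the real distortion contribute more to what the model learns.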