Fanglin Chen

CV
h-index26
17papers
302citations
Novelty55%
AI Score34

17 Papers

CVDec 19, 2022Code
Universal Object Detection with Large Vision Model

Feng Lin, Wenze Hu, Yaowei Wang et al.

Over the past few years, there has been growing interest in developing a broad, universal, and general-purpose computer vision system. Such systems have the potential to address a wide range of vision tasks simultaneously, without being limited to specific problems or data domains. This universality is crucial for practical, real-world computer vision applications. In this study, our focus is on a specific challenge: the large-scale, multi-domain universal object detection problem, which contributes to the broader goal of achieving a universal vision system. This problem presents several intricate challenges, including cross-dataset category label duplication, label conflicts, and the necessity to handle hierarchical taxonomies. To address these challenges, we introduce our approach to label handling, hierarchy-aware loss design, and resource-efficient model training utilizing a pre-trained large vision model. Our method has demonstrated remarkable performance, securing a prestigious second-place ranking in the object detection track of the Robust Vision Challenge 2022 (RVC 2022) on a million-scale cross-dataset object detection benchmark. We believe that our comprehensive study will serve as a valuable reference and offer an alternative approach for addressing similar challenges within the computer vision community. The source code for our work is openly available at https://github.com/linfeng93/Large-UniDet.

CVJul 22, 2022
Multi-Faceted Distillation of Base-Novel Commonality for Few-shot Object Detection

Shuang Wu, Wenjie Pei, Dianwen Mei et al.

Most of existing methods for few-shot object detection follow the fine-tuning paradigm, which potentially assumes that the class-agnostic generalizable knowledge can be learned and transferred implicitly from base classes with abundant samples to novel classes with limited samples via such a two-stage training strategy. However, it is not necessarily true since the object detector can hardly distinguish between class-agnostic knowledge and class-specific knowledge automatically without explicit modeling. In this work we propose to learn three types of class-agnostic commonalities between base and novel classes explicitly: recognition-related semantic commonalities, localization-related semantic commonalities and distribution commonalities. We design a unified distillation framework based on a memory bank, which is able to perform distillation of all three types of commonalities jointly and efficiently. Extensive experiments demonstrate that our method can be readily integrated into most of existing fine-tuning based methods and consistently improve the performance by a large margin.

CVJul 16, 2022
Global-Local Stepwise Generative Network for Ultra High-Resolution Image Restoration

Xin Feng, Haobo Ji, Wenjie Pei et al.

While the research on image background restoration from regular size of degraded images has achieved remarkable progress, restoring ultra high-resolution (e.g., 4K) images remains an extremely challenging task due to the explosion of computational complexity and memory usage, as well as the deficiency of annotated data. In this paper we present a novel model for ultra high-resolution image restoration, referred to as the Global-Local Stepwise Generative Network (GLSGN), which employs a stepwise restoring strategy involving four restoring pathways: three local pathways and one global pathway. The local pathways focus on conducting image restoration in a fine-grained manner over local but high-resolution image patches, while the global pathway performs image restoration coarsely on the scale-down but intact image to provide cues for the local pathways in a global view including semantics and noise patterns. To smooth the mutual collaboration between these four pathways, our GLSGN is designed to ensure the inter-pathway consistency in four aspects in terms of low-level content, perceptual attention, restoring intensity and high-level semantics, respectively. As another major contribution of this work, we also introduce the first ultra high-resolution dataset to date for both reflection removal and rain streak removal, comprising 4,670 real-world and synthetic images. Extensive experiments across three typical tasks for image background restoration, including image reflection removal, image rain streak removal and image dehazing, show that our GLSGN consistently outperforms state-of-the-art methods.

MTRL-SCIApr 25, 2022
Crystal Transformer: Self-learning neural language model for Generative and Tinkering Design of Materials

Lai Wei, Qinyang Li, Yuqi Song et al.

Self-supervised neural language models have recently achieved unprecedented success, from natural language processing to learning the languages of biological sequences and organic molecules. These models have demonstrated superior performance in the generation, structure classification, and functional predictions for proteins and molecules with learned representations. However, most of the masking-based pre-trained language models are not designed for generative design, and their black-box nature makes it difficult to interpret their design logic. Here we propose BLMM Crystal Transformer, a neural network based probabilistic generative model for generative and tinkering design of inorganic materials. Our model is built on the blank filling language model for text generation and has demonstrated unique advantages in learning the "materials grammars" together with high-quality generation, interpretability, and data efficiency. It can generate chemically valid materials compositions with as high as 89.7\% charge neutrality and 84.8\% balanced electronegativity, which are more than 4 and 8 times higher compared to a pseudo random sampling baseline. The probabilistic generation process of BLMM allows it to recommend tinkering operations based on learned materials chemistry and makes it useful for materials doping. Combined with the TCSP crysal structure prediction algorithm, We have applied our model to discover a set of new materials as validated using DFT calculations. Our work thus brings the unsupervised transformer language models based generative artificial intelligence to inorganic materials. A user-friendly web app has been developed for computational materials doping and can be accessed freely at \url{www.materialsatlas.org/blmtinker}.

CVJul 25, 2022
Few-Shot Object Detection by Knowledge Distillation Using Bag-of-Visual-Words Representations

Wenjie Pei, Shuang Wu, Dianwen Mei et al.

While fine-tuning based methods for few-shot object detection have achieved remarkable progress, a crucial challenge that has not been addressed well is the potential class-specific overfitting on base classes and sample-specific overfitting on novel classes. In this work we design a novel knowledge distillation framework to guide the learning of the object detector and thereby restrain the overfitting in both the pre-training stage on base classes and fine-tuning stage on novel classes. To be specific, we first present a novel Position-Aware Bag-of-Visual-Words model for learning a representative bag of visual words (BoVW) from a limited size of image set, which is used to encode general images based on the similarities between the learned visual words and an image. Then we perform knowledge distillation based on the fact that an image should have consistent BoVW representations in two different feature spaces. To this end, we pre-learn a feature space independently from the object detection, and encode images using BoVW in this space. The obtained BoVW representation for an image can be considered as distilled knowledge to guide the learning of object detector: the extracted features by the object detector for the same image are expected to derive the consistent BoVW representations with the distilled knowledge. Extensive experiments validate the effectiveness of our method and demonstrate the superiority over other state-of-the-art methods.

CVJul 25, 2022
Learning Generalizable Latent Representations for Novel Degradations in Super Resolution

Fengjun Li, Xin Feng, Fanglin Chen et al.

Typical methods for blind image super-resolution (SR) focus on dealing with unknown degradations by directly estimating them or learning the degradation representations in a latent space. A potential limitation of these methods is that they assume the unknown degradations can be simulated by the integration of various handcrafted degradations (e.g., bicubic downsampling), which is not necessarily true. The real-world degradations can be beyond the simulation scope by the handcrafted degradations, which are referred to as novel degradations. In this work, we propose to learn a latent representation space for degradations, which can be generalized from handcrafted (base) degradations to novel degradations. The obtained representations for a novel degradation in this latent space are then leveraged to generate degraded images consistent with the novel degradation to compose paired training data for SR model. Furthermore, we perform variational inference to match the posterior of degradations in latent representation space with a prior distribution (e.g., Gaussian distribution). Consequently, we are able to sample more high-quality representations for a novel degradation to augment the training data for SR model. We conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness and advantages of our method for blind super-resolution with novel degradations.

CVJul 28, 2024
WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting

Jingjing Wu, Zhengyao Fang, Pengyuan Lyu et al.

Transcription-only Supervised Text Spotting aims to learn text spotters relying only on transcriptions but no text boundaries for supervision, thus eliminating expensive boundary annotation. The crux of this task lies in locating each transcription in scene text images without location annotations. In this work, we formulate this challenging problem as a Weakly Supervised Cross-modality Contrastive Learning problem, and design a simple yet effective model dubbed WeCromCL that is able to detect each transcription in a scene image in a weakly supervised manner. Unlike typical methods for cross-modality contrastive learning that focus on modeling the holistic semantic correlation between an entire image and a text description, our WeCromCL conducts atomistic contrastive learning to model the character-wise appearance consistency between a text transcription and its correlated region in a scene image to detect an anchor point for the transcription in a weakly supervised manner. The detected anchor points by WeCromCL are further used as pseudo location labels to guide the learning of text spotting. Extensive experiments on four challenging benchmarks demonstrate the superior performance of our model over other methods. Code will be released.

CVNov 27, 2022
Semantic-Aware Local-Global Vision Transformer

Jiatong Zhang, Zengwei Yao, Fanglin Chen et al.

Vision Transformers have achieved remarkable progresses, among which Swin Transformer has demonstrated the tremendous potential of Transformer for vision tasks. It surmounts the key challenge of high computational complexity by performing local self-attention within shifted windows. In this work we propose the Semantic-Aware Local-Global Vision Transformer (SALG), to further investigate two potential improvements towards Swin Transformer. First, unlike Swin Transformer that performs uniform partition to produce equal size of regular windows for local self-attention, our SALG performs semantic segmentation in an unsupervised way to explore the underlying semantic priors in the image. As a result, each segmented region can correspond to a semantically meaningful part in the image, potentially leading to more effective features within each of segmented regions. Second, instead of only performing local self-attention within local windows as Swin Transformer does, the proposed SALG performs both 1) local intra-region self-attention for learning fine-grained features within each region and 2) global inter-region feature propagation for modeling global dependencies among all regions. Consequently, our model is able to obtain the global view when learning features for each token, which is the essential advantage of Transformer. Owing to the explicit modeling of the semantic priors and the proposed local-global modeling mechanism, our SALG is particularly advantageous for small-scale models when the modeling capacity is not sufficient for other models to learn semantics implicitly. Extensive experiments across various vision tasks demonstrates the merit of our model over other vision Transformers, especially in the small-scale modeling scenarios.

LGJul 16, 2022
BCRLSP: An Offline Reinforcement Learning Framework for Sequential Targeted Promotion

Fanglin Chen, Xiao Liu, Bo Tang et al.

We utilize an offline reinforcement learning (RL) model for sequential targeted promotion in the presence of budget constraints in a real-world business environment. In our application, the mobile app aims to boost customer retention by sending cash bonuses to customers and control the costs of such cash bonuses during each time period. To achieve the multi-task goal, we propose the Budget Constrained Reinforcement Learning for Sequential Promotion (BCRLSP) framework to determine the value of cash bonuses to be sent to users. We first find out the target policy and the associated Q-values that maximizes the user retention rate using an RL model. A linear programming (LP) model is then added to satisfy the constraints of promotion costs. We solve the LP problem by maximizing the Q-values of actions learned from the RL model given the budget constraints. During deployment, we combine the offline RL model with the LP model to generate a robust policy under the budget constraints. Using both online and offline experiments, we demonstrate the efficacy of our approach by showing that BCRLSP achieves a higher long-term customer retention rate and a lower cost than various baselines. Taking advantage of the near real-time cost control method, the proposed framework can easily adapt to data with a noisy behavioral policy and/or meet flexible budget constraints.

CVApr 16, 2024Code
Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation

Jiapeng Su, Qi Fan, Guangming Lu et al.

Few-shot semantic segmentation (FSS) has achieved great success on segmenting objects of novel classes, supported by only a few annotated samples. However, existing FSS methods often underperform in the presence of domain shifts, especially when encountering new domain styles that are unseen during training. It is suboptimal to directly adapt or generalize the entire model to new domains in the few-shot scenario. Instead, our key idea is to adapt a small adapter for rectifying diverse target domain styles to the source domain. Consequently, the rectified target domain features can fittingly benefit from the well-optimized source domain segmentation model, which is intently trained on sufficient source domain data. Training domain-rectifying adapter requires sufficiently diverse target domains. We thus propose a novel local-global style perturbation method to simulate diverse potential target domains by perturbating the feature channel statistics of the individual images and collective statistics of the entire source domain, respectively. Additionally, we propose a cyclic domain alignment module to facilitate the adapter effectively rectifying domains using a reverse domain rectification supervision. The adapter is trained to rectify the image features from diverse synthesized target domains to align with the source domain. During testing on target domains, we start by rectifying the image features and then conduct few-shot segmentation on the domain-rectified features. Extensive experiments demonstrate the effectiveness of our method, achieving promising results on cross-domain few-shot semantic segmentation tasks. Our code is available at https://github.com/Matt-Su/DR-Adapter.

CVDec 16, 2023Code
SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt

Wenjie Pei, Tongqi Xia, Fanglin Chen et al.

As a prominent parameter-efficient fine-tuning technique in NLP, prompt tuning is being explored its potential in computer vision. Typical methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP, which represents an input image as a flattened sequence of token embeddings and then learns a set of unordered parameterized tokens prefixed to the sequence representation as the visual prompts for task adaptation of large vision models. While such sequential modeling paradigm of visual prompt has shown great promise, there are two potential limitations. First, the learned visual prompts cannot model the underlying spatial relations in the input image, which is crucial for image encoding. Second, since all prompt tokens play the same role of prompting for all image tokens without distinction, it lacks the fine-grained prompting capability, i.e., individual prompting for different image tokens. In this work, we propose the \mymodel model (\emph{SA$^2$VP}), which learns a two-dimensional prompt token map with equal (or scaled) size to the image token map, thereby being able to spatially align with the image map. Each prompt token is designated to prompt knowledge only for the spatially corresponding image tokens. As a result, our model can conduct individual prompting for different image tokens in a fine-grained manner. Moreover, benefiting from the capability of preserving the spatial structure by the learned prompt token map, our \emph{SA$^2$VP} is able to model the spatial relations in the input image, leading to more effective prompting. Extensive experiments on three challenging benchmarks for image classification demonstrate the superiority of our model over other state-of-the-art methods for visual prompt tuning. Code is available at \emph{https://github.com/tommy-xq/SA2VP}.

CVMar 20, 2024
OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning

Xinyu Geng, Jiaming Wang, Jiawei Gong et al.

Redundancy is a persistent challenge in Capsule Networks (CapsNet),leading to high computational costs and parameter counts. Although previous works have introduced pruning after the initial capsule layer, dynamic routing's fully connected nature and non-orthogonal weight matrices reintroduce redundancy in deeper layers. Besides, dynamic routing requires iterating to converge, further increasing computational demands. In this paper, we propose an Orthogonal Capsule Network (OrthCaps) to reduce redundancy, improve routing performance and decrease parameter counts. Firstly, an efficient pruned capsule layer is introduced to discard redundant capsules. Secondly, dynamic routing is replaced with orthogonal sparse attention routing, eliminating the need for iterations and fully connected structures. Lastly, weight matrices during routing are orthogonalized to sustain low capsule similarity, which is the first approach to introduce orthogonality into CapsNet as far as we know. Our experiments on baseline datasets affirm the efficiency and robustness of OrthCaps in classification tasks, in which ablation studies validate the criticality of each component. Remarkably, OrthCaps-Shallow outperforms other Capsule Network benchmarks on four datasets, utilizing only 110k parameters, which is a mere 1.25% of a standard Capsule Network's total. To the best of our knowledge, it achieves the smallest parameter count among existing Capsule Networks. Similarly, OrthCaps-Deep demonstrates competitive performance across four datasets, utilizing only 1.2% of the parameters required by its counterparts.

CVNov 17, 2021
Pedestrian Detection by Exemplar-Guided Contrastive Learning

Zebin Lin, Wenjie Pei, Fanglin Chen et al.

Typical methods for pedestrian detection focus on either tackling mutual occlusions between crowded pedestrians, or dealing with the various scales of pedestrians. Detecting pedestrians with substantial appearance diversities such as different pedestrian silhouettes, different viewpoints or different dressing, remains a crucial challenge. Instead of learning each of these diverse pedestrian appearance features individually as most existing methods do, we propose to perform contrastive learning to guide the feature learning in such a way that the semantic distance between pedestrians with different appearances in the learned feature space is minimized to eliminate the appearance diversities, whilst the distance between pedestrians and background is maximized. To facilitate the efficiency and effectiveness of contrastive learning, we construct an exemplar dictionary with representative pedestrian appearances as prior knowledge to construct effective contrastive training pairs and thus guide contrastive learning. Besides, the constructed exemplar dictionary is further leveraged to evaluate the quality of pedestrian proposals during inference by measuring the semantic distance between the proposal and the exemplar dictionary. Extensive experiments on both daytime and nighttime pedestrian detection validate the effectiveness of the proposed method.

ASOct 10, 2021
Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-order Latent Domain

Zengwei Yao, Wenjie Pei, Fanglin Chen et al.

The crux of single-channel speech separation is how to encode the mixture of signals into such a latent embedding space that the signals from different speakers can be precisely separated. Existing methods for speech separation either transform the speech signals into frequency domain to perform separation or seek to learn a separable embedding space by constructing a latent domain based on convolutional filters. While the latter type of methods learning an embedding space achieves substantial improvement for speech separation, we argue that the embedding space defined by only one latent domain does not suffice to provide a thoroughly separable encoding space for speech separation. In this paper, we propose the Stepwise-Refining Speech Separation Network (SRSSN), which follows a coarse-to-fine separation framework. It first learns a 1-order latent domain to define an encoding space and thereby performs a rough separation in the coarse phase. Then the proposed SRSSN learns a new latent domain along each basis function of the existing latent domain to obtain a high-order latent domain in the refining phase, which enables our model to perform a refining separation to achieve a more precise speech separation. We demonstrate the effectiveness of our SRSSN by conducting extensive experiments, including speech separation in a clean (noise-free) setting on WSJ0-2/3mix datasets as well as in noisy/reverberant settings on WHAM!/WHAMR! datasets. Furthermore, we also perform experiments of speech recognition on separated speech signals by our model to evaluate the performance of speech separation indirectly.

CVOct 1, 2021
Generative Memory-Guided Semantic Reasoning Model for Image Inpainting

Xin Feng, Wenjie Pei, Fengjun Li et al.

Most existing methods for image inpainting focus on learning the intra-image priors from the known regions of the current input image to infer the content of the corrupted regions in the same image. While such methods perform well on images with small corrupted regions, it is challenging for these methods to deal with images with large corrupted area due to two potential limitations: 1) such methods tend to overfit each single training pair of images relying solely on the intra-image prior knowledge learned from the limited known area; 2) the inter-image prior knowledge about the general distribution patterns of visual semantics, which can be transferred across images sharing similar semantics, is not exploited. In this paper, we propose the Generative Memory-Guided Semantic Reasoning Model (GM-SRM), which not only learns the intra-image priors from the known regions, but also distills the inter-image reasoning priors to infer the content of the corrupted regions. In particular, the proposed GM-SRM first pre-learns a generative memory from the whole training data to capture the semantic distribution patterns in a global view. Then the learned memory are leveraged to retrieve the matching inter-image priors for the current corrupted image to perform semantic reasoning during image inpainting. While the intra-image priors are used for guaranteeing the pixel-level content consistency, the inter-image priors are favorable for performing high-level semantic reasoning, which is particularly effective for inferring semantic content for large corrupted area. Extensive experiments on Paris Street View, CelebA-HQ, and Places2 benchmarks demonstrate that our GM-SRM outperforms the state-of-the-art methods for image inpainting in terms of both the visual quality and quantitative metrics.

CVOct 9, 2020
Deep-Masking Generative Network: A Unified Framework for Background Restoration from Superimposed Images

Xin Feng, Wenjie Pei, Zihui Jia et al.

Restoring the clean background from the superimposed images containing a noisy layer is the common crux of a classical category of tasks on image restoration such as image reflection removal, image deraining and image dehazing. These tasks are typically formulated and tackled individually due to the diverse and complicated appearance patterns of noise layers within the image. In this work we present the Deep-Masking Generative Network (DMGN), which is a unified framework for background restoration from the superimposed images and is able to cope with different types of noise. Our proposed DMGN follows a coarse-to-fine generative process: a coarse background image and a noise image are first generated in parallel, then the noise image is further leveraged to refine the background image to achieve a higher-quality background image. In particular, we design the novel Residual Deep-Masking Cell as the core operating unit for our DMGN to enhance the effective information and suppress the negative information during image generation via learning a gating mask to control the information flow. By iteratively employing this Residual Deep-Masking Cell, our proposed DMGN is able to generate both high-quality background image and noisy image progressively. Furthermore, we propose a two-pronged strategy to effectively leverage the generated noise image as contrasting cues to facilitate the refinement of the background image. Extensive experiments across three typical tasks for image background restoration, including image reflection removal, image rain steak removal and image dehazing, show that our DMGN consistently outperforms state-of-the-art methods specifically designed for each single task.

LGMay 21, 2020
Studying Product Competition Using Representation Learning

Fanglin Chen, Xiao Liu, Davide Proserpio et al.

Studying competition and market structure at the product level instead of brand level can provide firms with insights on cannibalization and product line optimization. However, it is computationally challenging to analyze product-level competition for the millions of products available on e-commerce platforms. We introduce Product2Vec, a method based on the representation learning algorithm Word2Vec, to study product-level competition, when the number of products is large. The proposed model takes shopping baskets as inputs and, for every product, generates a low-dimensional embedding that preserves important product information. In order for the product embeddings to be useful for firm strategic decision making, we leverage economic theories and causal inference to propose two modifications to Word2Vec. First of all, we create two measures, complementarity and exchangeability, that allow us to determine whether product pairs are complements or substitutes. Second, we combine these vectors with random utility-based choice models to forecast demand. To accurately estimate price elasticities, i.e., how demand responds to changes in price, we modify Word2Vec by removing the influence of price from the product vectors. We show that, compared with state-of-the-art models, our approach is faster, and can produce more accurate demand forecasts and price elasticities.