CVApr 11, 2023
FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-trainingYunpeng Han, Lisai Zhang, Qingcai Chen et al.
Fashion vision-language pre-training models have shown efficacy for a wide range of downstream tasks. However, general vision-language pre-training models pay less attention to fine-grained domain features, while these features are important in distinguishing the specific domain tasks from general tasks. We propose a method for fine-grained fashion vision-language pre-training based on fashion Symbols and Attributes Prompt (FashionSAP) to model fine-grained multi-modalities fashion attributes and characteristics. Firstly, we propose the fashion symbols, a novel abstract fashion concept layer, to represent different fashion items and to generalize various kinds of fine-grained fashion features, making modelling fine-grained attributes more effective. Secondly, the attributes prompt method is proposed to make the model learn specific attributes of fashion items explicitly. We design proper prompt templates according to the format of fashion data. Comprehensive experiments are conducted on two public fashion benchmarks, i.e., FashionGen and FashionIQ, and FashionSAP gets SOTA performances for four popular fashion tasks. The ablation study also shows the proposed abstract fashion symbols, and the attribute prompt method enables the model to acquire fine-grained semantics in the fashion domain effectively. The obvious performance gains from FashionSAP provide a new baseline for future fashion task research.
CVMar 9, 2023
Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-trainingLisai Zhang, Qingcai Chen, Zhijian Chen et al.
Fine-grained supervision based on object annotations has been widely used for vision and language pre-training (VLP). However, in real-world application scenarios, aligned multi-modal data is usually in the image-caption format, which only provides coarse-grained supervision. It is not only cost-expensive but also compute-expensive to collect object annotations and build object annotation pre-extractor for different scenarios. In this paper, we propose a fine-grained VLP scheme without object annotations from the linguistic perspective. First, we propose a homonym sentence rewriting (HSR) algorithm to provide token-level supervision. The algorithm replaces a verb/noun/adjective/quantifier word of the caption with its homonyms from WordNet. Correspondingly, we propose refined vision-language modeling (RVLM) framework to exploit the token-level supervision. Three refined tasks, i.e., refined image-text contrastive (RITC), refined image-text matching (RITM), and replace language modeling (RLM) are proposed to learn the fine-grained alignment. Extensive experiments on several downstream tasks demonstrate the superior performance of the proposed method.
CLJun 8, 2021Code
Multi-hop Graph Convolutional Network with High-order Chebyshev Approximation for Text ReasoningShuoran Jiang, Qingcai Chen, Xin Liu et al.
Graph convolutional network (GCN) has become popular in various natural language processing (NLP) tasks with its superiority in long-term and non-consecutive word interactions. However, existing single-hop graph reasoning in GCN may miss some important non-consecutive dependencies. In this study, we define the spectral graph convolutional network with the high-order dynamic Chebyshev approximation (HDGCN), which augments the multi-hop graph reasoning by fusing messages aggregated from direct and long-term dependencies into one convolutional layer. To alleviate the over-smoothing in high-order Chebyshev approximation, a multi-vote-based cross-attention (MVCAttn) with linear computation complexity is also proposed. The empirical results on four transductive and inductive NLP tasks and the ablation study verify the efficacy of the proposed model. Our source code is available at https://github.com/MathIsAll/HDGCN-pytorch.
CVSep 10, 2020Code
Prototype Completion with Primitive Knowledge for Few-Shot LearningBaoquan Zhang, Xutao Li, Yunming Ye et al.
Few-shot learning is a challenging task, which aims to learn a classifier for novel classes with few examples. Pre-training based meta-learning methods effectively tackle the problem by pre-training a feature extractor and then fine-tuning it through the nearest centroid based meta-learning. However, results show that the fine-tuning step makes very marginal improvements. In this paper, 1) we figure out the key reason, i.e., in the pre-trained feature space, the base classes already form compact clusters while novel classes spread as groups with large variances, which implies that fine-tuning the feature extractor is less meaningful; 2) instead of fine-tuning the feature extractor, we focus on estimating more representative prototypes during meta-learning. Consequently, we propose a novel prototype completion based meta-learning framework. This framework first introduces primitive knowledge (i.e., class-level part or attribute annotations) and extracts representative attribute features as priors. Then, we design a prototype completion network to learn to complete prototypes with these priors. To avoid the prototype completion error caused by primitive knowledge noises or class differences, we further develop a Gaussian based prototype fusion strategy that combines the mean-based and completed prototypes by exploiting the unlabeled samples. Extensive experiments show that our method: (i) can obtain more accurate prototypes; (ii) outperforms state-of-the-art techniques by 2% - 9% in terms of classification accuracy. Our code is available online.
CVApr 7, 2020Code
Text-Guided Neural Image InpaintingLisai Zhang, Qingcai Chen, Baotian Hu et al.
Image inpainting task requires filling the corrupted image with contents coherent with the context. This research field has achieved promising progress by using neural image inpainting methods. Nevertheless, there is still a critical challenge in guessing the missed content with only the context pixels. The goal of this paper is to fill the semantic information in corrupted images according to the provided descriptive text. Unique from existing text-guided image generation works, the inpainting models are required to compare the semantic content of the given text and the remaining part of the image, then find out the semantic content that should be filled for missing part. To fulfill such a task, we propose a novel inpainting model named Text-Guided Dual Attention Inpainting Network (TDANet). Firstly, a dual multimodal attention mechanism is designed to extract the explicit semantic information about the corrupted regions, which is done by comparing the descriptive text and complementary image areas through reciprocal attention. Secondly, an image-text matching loss is applied to maximize the semantic similarity of the generated image and the text. Experiments are conducted on two open datasets. Results show that the proposed TDANet model reaches new state-of-the-art on both quantitative and qualitative measures. Result analysis suggests that the generated images are consistent with the guidance text, enabling the generation of various results by providing different descriptions. Codes are available at https://github.com/idealwhite/TDANet
47.3CVMay 9
S2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum DomainBaoquan Zhang, Zhehao Yu, Lisai Zhang et al.
Parameter Efficient Fine-Tuning (PEFT) is a key technique for adapting a large pretrained model to downstream tasks by fine-tuning only a small number of parameters. Recent methods based on Fourier transforms have further reduced the fine-tuned parameters scale by only fine-tuning a few spectral coefficients. Its basic assumption is that the weight change δW is a spatial-domain matrix with a sparse spectrum. However, in this paper, we observe that the spectrum of weight change is not sparse, but instead distributed like power-uniform. This fact implies that fine-tuning only a few spectral coefficients is insufficient to accurately model the weight change with uniform spectrum. To address this issue, we propose to seek an invertible transformation that can transform a latent spatial-domain matrix with sparse spectrum to the weight change, and then perform PEFT on such sparse spectrum domain with few spectral coefficients, called S2FT. To seek such transformation, we first pre-estimate a coarse weight change as a prior. Then, inspired by that sparse spectrum often correspond to locally smooth spatial structures, we regard this transformation as a row and column rearrangement operation on the pre-estimated weight change that smooth spatial structures while keep the structure information of neurons. Finally, we propose to solve the rearrangement search problem in a simple nearest neighbor search manner, thereby obtaining the invertible transformation. Extensive results show our S2FT achieves superior performance by only using 0.08% training parameters.
CVDec 11, 2024
AsyncDSB: Schedule-Asynchronous Diffusion Schrödinger Bridge for Image InpaintingZihao Han, Baoquan Zhang, Lisai Zhang et al.
Image inpainting is an important image generation task, which aims to restore corrupted image from partial visible area. Recently, diffusion Schrödinger bridge methods effectively tackle this task by modeling the translation between corrupted and target images as a diffusion Schrödinger bridge process along a noising schedule path. Although these methods have shown superior performance, in this paper, we find that 1) existing methods suffer from a schedule-restoration mismatching issue, i.e., the theoretical schedule and practical restoration processes usually exist a large discrepancy, which theoretically results in the schedule not fully leveraged for restoring images; and 2) the key reason causing such issue is that the restoration process of all pixels are actually asynchronous but existing methods set a synchronous noise schedule to them, i.e., all pixels shares the same noise schedule. To this end, we propose a schedule-Asynchronous Diffusion Schrödinger Bridge (AsyncDSB) for image inpainting. Our insight is preferentially scheduling pixels with high frequency (i.e., large gradients) and then low frequency (i.e., small gradients). Based on this insight, given a corrupted image, we first train a network to predict its gradient map in corrupted area. Then, we regard the predicted image gradient as prior and design a simple yet effective pixel-asynchronous noise schedule strategy to enhance the diffusion Schrödinger bridge. Thanks to the asynchronous schedule at pixels, the temporal interdependence of restoration process between pixels can be fully characterized for high-quality image inpainting. Experiments on real-world datasets show that our AsyncDSB achieves superior performance, especially on FID with around 3% - 14% improvement over state-of-the-art baseline methods.
AIAug 26, 2025
AniME: Adaptive Multi-Agent Planning for Long Animation GenerationLisai Zhang, Baohan Xu, Siqian Yang et al.
We present AniME, a director-oriented multi-agent system for automated long-form anime production, covering the full workflow from a story to the final video. The director agent keeps a global memory for the whole workflow, and coordinates several downstream specialized agents. By integrating customized Model Context Protocol (MCP) with downstream model instruction, the specialized agent adaptively selects control conditions for diverse sub-tasks. AniME produces cinematic animation with consistent characters and synchronized audio visual elements, offering a scalable solution for AI-driven anime creation.
CVOct 20, 2021
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal RetrievalLisai Zhang, Hongfa Wu, Qingcai Chen et al.
Cross-model retrieval has emerged as one of the most important upgrades for text-only search engines (SE). Recently, with powerful representation for pairwise text-image inputs via early interaction, the accuracy of vision-language (VL) transformers has outperformed existing methods for text-image retrieval. However, when the same paradigm is used for inference, the efficiency of the VL transformers is still too low to be applied in a real cross-modal SE. Inspired by the mechanism of human learning and using cross-modal knowledge, this paper presents a novel Vision-Language Decomposed Transformer (VLDeformer), which greatly increases the efficiency of VL transformers while maintaining their outstanding accuracy. By the proposed method, the cross-model retrieval is separated into two stages: the VL transformer learning stage, and the VL decomposition stage. The latter stage plays the role of single modal indexing, which is to some extent like the term indexing of a text SE. The model learns cross-modal knowledge from early-interaction pre-training and is then decomposed into an individual encoder. The decomposition requires only small target datasets for supervision and achieves both $1000+$ times acceleration and less than $0.6$\% average recall drop. VLDeformer also outperforms state-of-the-art visual-semantic embedding methods on COCO and Flickr30k.
CLDec 1, 2019
Semi-supervised Visual Feature Integration for Pre-trained Language ModelsLisai Zhang, Qingcai Chen, Dongfang Li et al.
Integrating visual features has been proved useful for natural language understanding tasks. Nevertheless, in most existing multimodal language models, the alignment of visual and textual data is expensive. In this paper, we propose a novel semi-supervised visual integration framework for pre-trained language models. In the framework, the visual features are obtained through a visualization and fusion mechanism. The uniqueness includes: 1) the integration is conducted via a semi-supervised approach, which does not require aligned images for every sentences 2) the visual features are integrated as an external component and can be directly used by pre-trained language models. To verify the efficacy of the proposed framework, we conduct the experiments on both natural language inference and reading comprehension tasks. The results demonstrate that our mechanism brings improvement to two strong baseline models. Considering that our framework only requires an image database, and no not requires further alignments, it provides an efficient and feasible way for multimodal language learning.