CLAug 19, 2022
Coarse-to-Fine: Hierarchical Multi-task Learning for Natural Language UnderstandingZhaoye Fei, Yu Tian, Yongkang Wu et al.
Generalized text representations are the foundation of many natural language understanding tasks. To fully utilize the different corpus, it is inevitable that models need to understand the relevance among them. However, many methods ignore the relevance and adopt a single-channel model (a coarse paradigm) directly for all tasks, which lacks enough rationality and interpretation. In addition, some existing works learn downstream tasks by stitches skill block(a fine paradigm), which might cause irrationalresults due to its redundancy and noise. Inthis work, we first analyze the task correlation through three different perspectives, i.e., data property, manual design, and model-based relevance, based on which the similar tasks are grouped together. Then, we propose a hierarchical framework with a coarse-to-fine paradigm, with the bottom level shared to all the tasks, the mid-level divided to different groups, and the top-level assigned to each of the tasks. This allows our model to learn basic language properties from all tasks, boost performance on relevant tasks, and reduce the negative impact from irrelevant tasks. Our experiments on 13 benchmark datasets across five natural language understanding tasks demonstrate the superiority of our method.
CVOct 20, 2021
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal RetrievalLisai Zhang, Hongfa Wu, Qingcai Chen et al.
Cross-model retrieval has emerged as one of the most important upgrades for text-only search engines (SE). Recently, with powerful representation for pairwise text-image inputs via early interaction, the accuracy of vision-language (VL) transformers has outperformed existing methods for text-image retrieval. However, when the same paradigm is used for inference, the efficiency of the VL transformers is still too low to be applied in a real cross-modal SE. Inspired by the mechanism of human learning and using cross-modal knowledge, this paper presents a novel Vision-Language Decomposed Transformer (VLDeformer), which greatly increases the efficiency of VL transformers while maintaining their outstanding accuracy. By the proposed method, the cross-model retrieval is separated into two stages: the VL transformer learning stage, and the VL decomposition stage. The latter stage plays the role of single modal indexing, which is to some extent like the term indexing of a text SE. The model learns cross-modal knowledge from early-interaction pre-training and is then decomposed into an individual encoder. The decomposition requires only small target datasets for supervision and achieves both $1000+$ times acceleration and less than $0.6$\% average recall drop. VLDeformer also outperforms state-of-the-art visual-semantic embedding methods on COCO and Flickr30k.