CVMar 23, 2023Code
Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete LabelsZixuan Ding, Ao Wang, Hui Chen et al.
Multi-label recognition (MLR) with incomplete labels is very challenging. Recent works strive to explore the image-to-label correspondence in the vision-language model, \ie, CLIP, to compensate for insufficient annotations. In spite of promising performance, they generally overlook the valuable prior about the label-to-label correspondence. In this paper, we advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior about the label-to-label correspondence via a semantic prior prompter. We then present a novel Semantic Correspondence Prompt Network (SCPNet), which can thoroughly explore the structured semantic prior. A Prior-Enhanced Self-Supervised Learning method is further introduced to enhance the use of the prior. Comprehensive experiments and analyses on several widely used benchmark datasets show that our method significantly outperforms existing methods on all datasets, well demonstrating the effectiveness and the superiority of our method. Our code will be available at https://github.com/jameslahm/SCPNet.
CVSep 2, 2024Code
TempMe: Video Temporal Token Merging for Efficient Text-Video RetrievalLeqi Shen, Tianxiang Hao, Tao He et al.
Most text-video retrieval methods utilize the text-image pre-trained models like CLIP as a backbone. These methods process each sampled frame independently by the image encoder, resulting in high computational overhead and limiting practical deployment. Addressing this, we focus on efficient text-video retrieval by tackling two key challenges: 1. From the perspective of trainable parameters, current parameter-efficient fine-tuning methods incur high inference costs; 2. From the perspective of model complexity, current token compression methods are mainly designed for images to reduce spatial redundancy but overlook temporal redundancy in consecutive frames of a video. To tackle these challenges, we propose Temporal Token Merging (TempMe), a parameter-efficient and training-inference efficient text-video retrieval architecture that minimizes trainable parameters and model complexity. Specifically, we introduce a progressive multi-granularity framework. By gradually combining neighboring clips, we reduce spatio-temporal redundancy and enhance temporal modeling across different frames, leading to improved efficiency and performance. Extensive experiments validate the superiority of our TempMe. Compared to previous parameter-efficient text-video retrieval methods, TempMe achieves superior performance with just 0.50M trainable parameters. It significantly reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8X speedup and a 4.4% R-Sum improvement. With full fine-tuning, TempMe achieves a significant 7.9% R-Sum improvement, trains 1.57X faster, and utilizes 75.2% GPU memory usage. The code is available at https://github.com/LunarShen/TempMe.
CVMar 14, 2024Code
PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task AdaptationYizhe Xiong, Hui Chen, Tianxiang Hao et al.
Recently, the scale of transformers has grown rapidly, which introduces considerable challenges in terms of training overhead and inference efficiency in the scope of task adaptation. Existing works, namely Parameter-Efficient Fine-Tuning (PEFT) and model compression, have separately investigated the challenges. However, PEFT cannot guarantee the inference efficiency of the original backbone, especially for large-scale models. Model compression requires significant training costs for structure searching and re-training. Consequently, a simple combination of them cannot guarantee accomplishing both training efficiency and inference efficiency with minimal costs. In this paper, we propose a novel Parallel Yielding Re-Activation (PYRA) method for such a challenge of training-inference efficient task adaptation. PYRA first utilizes parallel yielding adaptive weights to comprehensively perceive the data distribution in downstream tasks. A re-activation strategy for token modulation is then applied for tokens to be merged, leading to calibrated token features. Extensive experiments demonstrate that PYRA outperforms all competing methods under both low compression rate and high compression rate, demonstrating its effectiveness and superiority in maintaining both training efficiency and inference efficiency for large-scale foundation models. Our code is available at https://github.com/THU-MIG/PYRA.
CVSep 9, 2018Code
Dual Attention Network for Scene SegmentationJun Fu, Jing Liu, Haijie Tian et al.
In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the selfattention mechanism. Unlike previous works that capture contexts by multi-scale features fusion, we propose a Dual Attention Networks (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of traditional dilated FCN, which model the semantic interdependencies in spatial and channel dimensions respectively. The position attention module selectively aggregates the features at each position by a weighted sum of the features at all positions. Similar features would be related to each other regardless of their distances. Meanwhile, the channel attention module selectively emphasizes interdependent channel maps by integrating associated features among all channel maps. We sum the outputs of the two attention modules to further improve feature representation which contributes to more precise segmentation results. We achieve new state-of-the-art segmentation performance on three challenging scene segmentation datasets, i.e., Cityscapes, PASCAL Context and COCO Stuff dataset. In particular, a Mean IoU score of 81.5% on Cityscapes test set is achieved without using coarse data. We make the code and trained model publicly available at https://github.com/junfu1115/DANet
IRFeb 2, 2022
Adaptive Experimentation with Delayed Binary FeedbackZenan Wang, Carlos Carrion, Xiliang Lin et al.
Conducting experiments with objectives that take significant delays to materialize (e.g. conversions, add-to-cart events, etc.) is challenging. Although the classical "split sample testing" is still valid for the delayed feedback, the experiment will take longer to complete, which also means spending more resources on worse-performing strategies due to their fixed allocation schedules. Alternatively, adaptive approaches such as "multi-armed bandits" are able to effectively reduce the cost of experimentation. But these methods generally cannot handle delayed objectives directly out of the box. This paper presents an adaptive experimentation solution tailored for delayed binary feedback objectives by estimating the real underlying objectives before they materialize and dynamically allocating variants based on the estimates. Experiments show that the proposed method is more efficient for delayed feedback compared to various other approaches and is robust in different settings. In addition, we describe an experimentation product powered by this algorithm. This product is currently deployed in the online experimentation platform of JD.com, a large e-commerce company and a publisher of digital ads.
LGMay 28, 2021
Blending Advertising with Organic Content in E-Commerce: A Virtual Bids Optimization ApproachCarlos Carrion, Zenan Wang, Harikesh Nair et al.
In e-commerce platforms, sponsored and non-sponsored content are jointly displayed to users and both may interactively influence their engagement behavior. The former content helps advertisers achieve their marketing goals and provides a stream of ad revenue to the platform. The latter content contributes to users' engagement with the platform, which is key to its long-term health. A burning issue for e-commerce platform design is how to blend advertising with content in a way that respects these interactions and balances these multiple business objectives. This paper describes a system developed for this purpose in the context of blending personalized sponsored content with non-sponsored content on the product detail pages of JD.COM, an e-commerce company. This system has three key features: (1) Optimization of multiple competing business objectives through a new virtual bids approach and the expressiveness of the latent, implicit valuation of the platform for the multiple objectives via these virtual bids. (2) Modeling of users' click behavior as a function of their characteristics, the individual characteristics of each sponsored content and the influence exerted by other sponsored and non-sponsored content displayed alongside through a deep learning approach; (3) Consideration of externalities in the allocation of ads, thereby making it directly compatible with a Vickrey-Clarke-Groves (VCG) auction scheme for the computation of payments in the presence of these externalities. The system is currently deployed and serving all traffic through JD.COM's mobile application. Experiments demonstrating the performance and advantages of the system are presented.
CLMar 2, 2021
Probing Product Description Generation via Posterior DistillationHaolan Zhan, Hainan Zhang, Hongshen Chen et al.
In product description generation (PDG), the user-cared aspect is critical for the recommendation system, which can not only improve user's experiences but also obtain more clicks. High-quality customer reviews can be considered as an ideal source to mine user-cared aspects. However, in reality, a large number of new products (known as long-tailed commodities) cannot gather sufficient amount of customer reviews, which brings a big challenge in the product description generation task. Existing works tend to generate the product description solely based on item information, i.e., product attributes or title words, which leads to tedious contents and cannot attract customers effectively. To tackle this problem, we propose an adaptive posterior network based on Transformer architecture that can utilize user-cared information from customer reviews. Specifically, we first extend the self-attentive Transformer encoder to encode product titles and attributes. Then, we apply an adaptive posterior distillation module to utilize useful review information, which integrates user-cared aspects to the generation process. Finally, we apply a Transformer-based decoding phase with copy mechanism to automatically generate the product description. Besides, we also collect a large-scare Chinese product description dataset to support our work and further research in this field. Experimental results show that our model is superior to traditional generative models in both automatic indicators and human evaluation.
AINov 24, 2020
DADNN: Multi-Scene CTR Prediction via Domain-Aware Deep Neural NetworkJunyou He, Guibao Mei, Feng Xing et al.
Click through rate(CTR) prediction is a core task in advertising systems. The booming e-commerce business in our company, results in a growing number of scenes. Most of them are so-called long-tail scenes, which means that the traffic of a single scene is limited, but the overall traffic is considerable. Typical studies mainly focus on serving a single scene with a well designed model. However, this method brings excessive resource consumption both on offline training and online serving. Besides, simply training a single model with data from multiple scenes ignores the characteristics of their own. To address these challenges, we propose a novel but practical model named Domain-Aware Deep Neural Network(DADNN) by serving multiple scenes with only one model. Specifically, shared bottom block among all scenes is applied to learn a common representation, while domain-specific heads maintain the characteristics of every scene. Besides, knowledge transfer is introduced to enhance the opportunity of knowledge sharing among different scenes. In this paper, we study two instances of DADNN where its shared bottom block is multilayer perceptron(MLP) and Multi-gate Mixture-of-Experts(MMoE) respectively, for which we denote as DADNN-MLP and DADNN-MMoE.Comprehensive offline experiments on a real production dataset from our company show that DADNN outperforms several state-of-the-art methods for multi-scene CTR prediction. Extensive online A/B tests reveal that DADNN-MLP contributes up to 6.7% CTR and 3.0% CPM(Cost Per Mille) promotion compared with a well-engineered DCN model. Furthermore, DADNN-MMoE outperforms DADNN-MLP with a relative improvement of 2.2% and 2.7% on CTR and CPM respectively. More importantly, DADNN utilizes a single model for multiple scenes which saves a lot of offline training and online serving resources.
LGOct 2, 2020
Kalman Filtering Attention for User Behavior Modeling in CTR PredictionHu Liu, Jing Lu, Xiwei Zhao et al.
Click-through rate (CTR) prediction is one of the fundamental tasks for e-commerce search engines. As search becomes more personalized, it is necessary to capture the user interest from rich behavior data. Existing user behavior modeling algorithms develop different attention mechanisms to emphasize query-relevant behaviors and suppress irrelevant ones. Despite being extensively studied, these attentions still suffer from two limitations. First, conventional attentions mostly limit the attention field only to a single user's behaviors, which is not suitable in e-commerce where users often hunt for new demands that are irrelevant to any historical behaviors. Second, these attentions are usually biased towards frequent behaviors, which is unreasonable since high frequency does not necessarily indicate great importance. To tackle the two limitations, we propose a novel attention mechanism, termed Kalman Filtering Attention (KFAtt), that considers the weighted pooling in attention as a maximum a posteriori (MAP) estimation. By incorporating a priori, KFAtt resorts to global statistics when few user behaviors are relevant. Moreover, a frequency capping mechanism is incorporated to correct the bias towards frequent behaviors. Offline experiments on both benchmark and a 10 billion scale real production dataset, together with an Online A/B test, show that KFAtt outperforms all compared state-of-the-arts. KFAtt has been deployed in the ranking system of a leading e commerce website, serving the main traffic of hundreds of millions of active users everyday.
CLSep 16, 2020
Group-wise Contrastive Learning for Neural Dialogue GenerationHengyi Cai, Hongshen Chen, Yonghao Song et al.
Neural dialogue response generation has gained much popularity in recent years. Maximum Likelihood Estimation (MLE) objective is widely adopted in existing dialogue model learning. However, models trained with MLE objective function are plagued by the low-diversity issue when it comes to the open-domain conversational setting. Inspired by the observation that humans not only learn from the positive signals but also benefit from correcting behaviors of undesirable actions, in this work, we introduce contrastive learning into dialogue generation, where the model explicitly perceives the difference between the well-chosen positive and negative utterances. Specifically, we employ a pretrained baseline model as a reference. During contrastive learning, the target dialogue model is trained to give higher conditional probabilities for the positive samples, and lower conditional probabilities for those negative samples, compared to the reference model. To manage the multi-mapping relations prevailed in human conversation, we augment contrastive dialogue learning with group-wise dual sampling. Extensive experimental results show that the proposed group-wise contrastive learning framework is suited for training a wide range of neural dialogue generation models with very favorable performance over the baseline training approaches.
LGJun 18, 2020
Category-Specific CNN for Visual-aware CTR Prediction at JD.comHu Liu, Jing Lu, Hao Yang et al.
As one of the largest B2C e-commerce platforms in China, JD com also powers a leading advertising system, serving millions of advertisers with fingertip connection to hundreds of millions of customers. In our system, as well as most e-commerce scenarios, ads are displayed with images.This makes visual-aware Click Through Rate (CTR) prediction of crucial importance to both business effectiveness and user experience. Existing algorithms usually extract visual features using off-the-shelf Convolutional Neural Networks (CNNs) and late fuse the visual and non-visual features for the finally predicted CTR. Despite being extensively studied, this field still face two key challenges. First, although encouraging progress has been made in offline studies, applying CNNs in real systems remains non-trivial, due to the strict requirements for efficient end-to-end training and low-latency online serving. Second, the off-the-shelf CNNs and late fusion architectures are suboptimal. Specifically, off-the-shelf CNNs were designed for classification thus never take categories as input features. While in e-commerce, categories are precisely labeled and contain abundant visual priors that will help the visual modeling. Unaware of the ad category, these CNNs may extract some unnecessary category-unrelated features, wasting CNN's limited expression ability. To overcome the two challenges, we propose Category-specific CNN (CSCNN) specially for CTR prediction. CSCNN early incorporates the category knowledge with a light-weighted attention-module on each convolutional layer. This enables CSCNN to extract expressive category-specific visual patterns that benefit the CTR prediction. Offline experiments on benchmark and a 10 billion scale real production dataset from JD, together with an Online A/B test show that CSCNN outperforms all compared state-of-the-art algorithms.
CVNov 5, 2019
Adaptive Context Network for Scene ParsingJun Fu, Jing Liu, Yuhang Wang et al.
Recent works attempt to improve scene parsing performance by exploring different levels of contexts, and typically train a well-designed convolutional network to exploit useful contexts across all pixels equally. However, in this paper, we find that the context demands are varying from different pixels or regions in each image. Based on this observation, we propose an Adaptive Context Network (ACNet) to capture the pixel-aware contexts by a competitive fusion of global context and local context according to different per-pixel demands. Specifically, when given a pixel, the global context demand is measured by the similarity between the global feature and its local feature, whose reverse value can be used to measure the local context demand. We model the two demand measurements by the proposed global context module and local context module, respectively, to generate adaptive contextual features. Furthermore, we import multiple such modules to build several adaptive context blocks in different levels of network to obtain a coarse-to-fine result. Finally, comprehensive experimental evaluations demonstrate the effectiveness of the proposed ACNet, and new state-of-the-arts performances are achieved on all four public datasets, i.e. Cityscapes, ADE20K, PASCAL Context, and COCO Stuff.
LGNov 3, 2019
Regularized Adversarial Sampling and Deep Time-aware Attention for Click-Through Rate PredictionYikai Wang, Liang Zhang, Quanyu Dai et al.
Improving the performance of click-through rate (CTR) prediction remains one of the core tasks in online advertising systems. With the rise of deep learning, CTR prediction models with deep networks remarkably enhance model capacities. In deep CTR models, exploiting users' historical data is essential for learning users' behaviors and interests. As existing CTR prediction works neglect the importance of the temporal signals when embed users' historical clicking records, we propose a time-aware attention model which explicitly uses absolute temporal signals for expressing the users' periodic behaviors and relative temporal signals for expressing the temporal relation between items. Besides, we propose a regularized adversarial sampling strategy for negative sampling which eases the classification imbalance of CTR data and can make use of the strong guidance provided by the observed negative CTR samples. The adversarial sampling strategy significantly improves the training efficiency, and can be co-trained with the time-aware attention model seamlessly. Experiments are conducted on real-world CTR datasets from both in-station and out-station advertising places.
LGMar 22, 2019
Deep Hierarchical Reinforcement Learning Based Recommendations via Multi-goals AbstractionDongyang Zhao, Liang Zhang, Bo Zhang et al.
The recommender system is an important form of intelligent application, which assists users to alleviate from information redundancy. Among the metrics used to evaluate a recommender system, the metric of conversion has become more and more important. The majority of existing recommender systems perform poorly on the metric of conversion due to its extremely sparse feedback signal. To tackle this challenge, we propose a deep hierarchical reinforcement learning based recommendation framework, which consists of two components, i.e., high-level agent and low-level agent. The high-level agent catches long-term sparse conversion signals, and automatically sets abstract goals for low-level agent, while the low-level agent follows the abstract goals and interacts with real-time environment. To solve the inherent problem in hierarchical reinforcement learning, we propose a novel deep hierarchical reinforcement learning algorithm via multi-goals abstraction (HRL-MG). Our proposed algorithm contains three characteristics: 1) the high-level agent generates multiple goals to guide the low-level agent in different stages, which reduces the difficulty of approaching high-level goals; 2) different goals share the same state encoder parameters, which increases the update frequency of the high-level agent and thus accelerates the convergence of our proposed algorithm; 3) an appreciate benefit assignment function is designed to allocate rewards in each goal so as to coordinate different goals in a consistent direction. We evaluate our proposed algorithm based on a real-world e-commerce dataset and validate its effectiveness.