An-xiang Zeng

LG
7papers
290citations
Novelty54%
AI Score26

7 Papers

IRDec 16, 2020
Scenario-aware and Mutual-based approach for Multi-scenario Recommendation in E-Commerce

Yuting Chen, Yanshi Wang, Yabo Ni et al.

Recommender systems (RSs) are essential for e-commerce platforms to help meet the enormous needs of users. How to capture user interests and make accurate recommendations for users in heterogeneous e-commerce scenarios is still a continuous research topic. However, most existing studies overlook the intrinsic association of the scenarios: the log data collected from platforms can be naturally divided into different scenarios (e.g., country, city, culture). We observed that the scenarios are heterogeneous because of the huge differences among them. Therefore, a unified model is difficult to effectively capture complex correlations (e.g., differences and similarities) between multiple scenarios thus seriously reducing the accuracy of recommendation results. In this paper, we target the problem of multi-scenario recommendation in e-commerce, and propose a novel recommendation model named Scenario-aware Mutual Learning (SAML) that leverages the differences and similarities between multiple scenarios. We first introduce scenario-aware feature representation, which transforms the embedding and attention modules to map the features into both global and scenario-specific subspace in parallel. Then we introduce an auxiliary network to model the shared knowledge across all scenarios, and use a multi-branch network to model differences among specific scenarios. Finally, we employ a novel mutual unit to adaptively learn the similarity between various scenarios and incorporate it into multi-branch network. We conduct extensive experiments on both public and industrial datasets, empirical results show that SAML consistently and significantly outperforms state-of-the-art methods.

LGMar 25, 2020
AliExpress Learning-To-Rank: Maximizing Online Model Performance without Going Online

Guangda Huzhang, Zhen-Jia Pang, Yongqing Gao et al.

Learning-to-rank (LTR) has become a key technology in E-commerce applications. Most existing LTR approaches follow a supervised learning paradigm from offline labeled data collected from the online system. However, it has been noticed that previous LTR models can have a good validation performance over offline validation data but have a poor online performance, and vice versa, which implies a possible large inconsistency between the offline and online evaluation. We investigate and confirm in this paper that such inconsistency exists and can have a significant impact on AliExpress Search. Reasons for the inconsistency include the ignorance of item context during the learning, and the offline data set is insufficient for learning the context. Therefore, this paper proposes an evaluator-generator framework for LTR with item context. The framework consists of an evaluator that generalizes to evaluate recommendations involving the context, and a generator that maximizes the evaluator score by reinforcement learning, and a discriminator that ensures the generalization of the evaluator. Extensive experiments in simulation environments and AliExpress Search online system show that, firstly, the classic data-based metrics on the offline dataset can show significant inconsistency with online performance, and can even be misleading. Secondly, the proposed evaluator score is significantly more consistent with the online performance than common ranking metrics. Finally, as the consequence, our method achieves a significant improvement (\textgreater$2\%$) in terms of Conversion Rate (CR) over the industrial-level fine-tuned model in online A/B tests.

LGNov 18, 2018
Policy Optimization with Model-based Explorations

Feiyang Pan, Qingpeng Cai, An-Xiang Zeng et al.

Model-free reinforcement learning methods such as the Proximal Policy Optimization algorithm (PPO) have successfully applied in complex decision-making problems such as Atari games. However, these methods suffer from high variances and high sample complexity. On the other hand, model-based reinforcement learning methods that learn the transition dynamics are more sample efficient, but they often suffer from the bias of the transition estimation. How to make use of both model-based and model-free learning is a central problem in reinforcement learning. In this paper, we present a new technique to address the trade-off between exploration and exploitation, which regards the difference between model-free and model-based estimations as a measure of exploration value. We apply this new technique to the PPO algorithm and arrive at a new policy optimization method, named Policy Optimization with Model-based Explorations (POME). POME uses two components to predict the actions' target values: a model-free one estimated by Monte-Carlo sampling and a model-based one which learns a transition model and predicts the value of the next state. POME adds the error of these two target estimations as the additional exploration value for each state-action pair, i.e, encourages the algorithm to explore the states with larger target errors which are hard to estimate. We compare POME with PPO on Atari 2600 games, and it shows that POME outperforms PPO on 33 games out of 49 games.

CVJul 19, 2018
Selective Zero-Shot Classification with Augmented Attributes

Jie Song, Chengchao Shen, Jie Lei et al.

In this paper, we introduce a selective zero-shot classification problem: how can the classifier avoid making dubious predictions? Existing attribute-based zero-shot classification methods are shown to work poorly in the selective classification scenario. We argue the under-complete human defined attribute vocabulary accounts for the poor performance. We propose a selective zero-shot classifier based on both the human defined and the automatically discovered residual attributes. The proposed classifier is constructed by firstly learning the defined and the residual attributes jointly. Then the predictions are conducted within the subspace of the defined attributes. Finally, the prediction confidence is measured by both the defined and the residual attributes. Experiments conducted on several benchmarks demonstrate that our classifier produces a superior performance to other methods under the risk-coverage trade-off metric.

LGJul 2, 2018
Speeding up the Metabolism in E-commerce by Reinforcement Mechanism Design

Hua-Lin He, Chun-Xiang Pan, Qing Da et al.

In a large E-commerce platform, all the participants compete for impressions under the allocation mechanism of the platform. Existing methods mainly focus on the short-term return based on the current observations instead of the long-term return. In this paper, we formally establish the lifecycle model for products, by defining the introduction, growth, maturity and decline stages and their transitions throughout the whole life period. Based on such model, we further propose a reinforcement learning based mechanism design framework for impression allocation, which incorporates the first principal component based permutation and the novel experiences generation method, to maximize short-term as well as long-term return of the platform. With the power of trial-and-error, it is possible to optimize impression allocation strategies globally which is contribute to the healthy development of participants and the platform itself. We evaluate our algorithm on a simulated environment built based on one of the largest E-commerce platforms, and a significant improvement has been achieved in comparison with the baseline solutions.

AIMay 25, 2018
Virtual-Taobao: Virtualizing Real-world Online Retail Environment for Reinforcement Learning

Jing-Cheng Shi, Yang Yu, Qing Da et al.

Applying reinforcement learning in physical-world tasks is extremely challenging. It is commonly infeasible to sample a large number of trials, as required by current reinforcement learning methods, in a physical environment. This paper reports our project on using reinforcement learning for better commodity search in Taobao, one of the largest online retail platforms and meanwhile a physical environment with a high sampling cost. Instead of training reinforcement learning in Taobao directly, we present our approach: first we build Virtual Taobao, a simulator learned from historical customer behavior data through the proposed GAN-SD (GAN for Simulating Distributions) and MAIL (multi-agent adversarial imitation learning), and then we train policies in Virtual Taobao with no physical costs in which ANC (Action Norm Constraint) strategy is proposed to reduce over-fitting. In experiments, Virtual Taobao is trained from hundreds of millions of customers' records, and its properties are compared with the real environment. The results disclose that Virtual Taobao faithfully recovers important properties of the real environment. We also show that the policies trained in Virtual Taobao can have significantly superior online performance to the traditional supervised approaches. We hope our work could shed some light on reinforcement learning applications in complex physical environments.

IRMar 2, 2018
Accelerating E-Commerce Search Engine Ranking by Contextual Factor Selection

Yusen Zhan, Qing Da, Fei Xiao et al.

In industrial large-scale search systems, such as Taobao.com search for commodities, the quality of the ranking result is getting continually improved by introducing more factors from complex procedures, e.g., deep neural networks for extracting image factors. Meanwhile, the increasing of the factors demands more computation resource and raises the system response latency. It has been observed that a search instance usually requires only a small set of effective factors, instead of all factors. Therefore, removing ineffective factors significantly improves the system efficiency. This paper studies the \emph{Contextual Factor Selection} (CFS), which selects only a subset of effective factors for every search instance, for a well balance between the search quality and the response latency. We inject CFS into the search engine ranking score to accelerate the engine, considering both ranking effectiveness and efficiency. The learning of the CFS model involves a combinatorial optimization, which is transformed as a sequential decision-making problem. Solving the problem by reinforcement learning, we propose the RankCFS, which has been assessed in an off-line environment as well as a real-world on-line environment (Taobao.com). The empirical results show that, the proposed CFS approach outperforms several existing supervised/unsupervised methods for feature selection in the off-line environment, and also achieves significant real-world performance improvement, in term of service latency, in daily test as well as Singles' Day Shopping Festival in $2017$.