CLApr 18, 2023
Towards Zero-Shot Personalized Table-to-Text Generation with Contrastive Persona DistillationHaolan Zhan, Xuming Lin, Shaobo Cui et al.
Existing neural methods have shown great potentials towards generating informative text from structured tabular data as well as maintaining high content fidelity. However, few of them shed light on generating personalized expressions, which often requires well-aligned persona-table-text datasets that are difficult to obtain. To overcome these obstacles, we explore personalized table-to-text generation under a zero-shot setting, by assuming no well-aligned persona-table-text triples are required during training. To this end, we firstly collect a set of unpaired persona information and then propose a semi-supervised approach with contrastive persona distillation (S2P-CPD) to generate personalized context. Specifically, tabular data and persona information are firstly represented as latent variables separately. Then, we devise a latent space fusion technique to distill persona information into the table representation. Besides, a contrastive-based discriminator is employed to guarantee the style consistency between the generated context and its corresponding persona. Experimental results on two benchmarks demonstrate S2P-CPD's ability on keeping both content fidelity and personalized expressions.
IROct 6, 2022
Digital Human Interactive Recommendation Decision-Making Based on Reinforcement LearningXiong Junwu, Xiaoyun Feng, YunZhou Shi et al.
Digital human recommendation system has been developed to help customers find their favorite products and is playing an active role in various recommendation contexts. How to timely catch and learn the dynamics of the preferences of the customers, while meeting their exact requirements, becomes crucial in the digital human recommendation domain. We design a novel practical digital human interactive recommendation agent framework based on Reinforcement Learning(RL) to improve the efficiency of the interactive recommendation decision-making by leveraging both the digital human features and the superior flexibility of RL. Our proposed framework learns through real-time interactions between the digital human and customers dynamically through the state-of-art RL algorithms, combined with multimodal embedding and graph embedding, to improve the accuracy of personalization and thus enable the digital human agent to timely catch the attention of the customer. Experiments on real business data demonstrate that our framework can provide better personalized customer engagement and better customer experiences.
SDJun 27, 2023
CASEIN: Cascading Explicit and Implicit Control for Fine-grained Emotion Intensity RegulationYuhao Cui, Xiongwei Wang, Zhongzhou Zhao et al.
Existing fine-grained intensity regulation methods rely on explicit control through predicted emotion probabilities. However, these high-level semantic probabilities are often inaccurate and unsmooth at the phoneme level, leading to bias in learning. Especially when we attempt to mix multiple emotion intensities for specific phonemes, resulting in markedly reduced controllability and naturalness of the synthesis. To address this issue, we propose the CAScaded Explicit and Implicit coNtrol framework (CASEIN), which leverages accurate disentanglement of emotion manifolds from the reference speech to learn the implicit representation at a lower semantic level. This representation bridges the semantical gap between explicit probabilities and the synthesis model, reducing bias in learning. In experiments, our CASEIN surpasses existing methods in both controllability and naturalness. Notably, we are the first to achieve fine-grained control over the mixed intensity of multiple emotions.
CLDec 15, 2023
VK-G2T: Vision and Context Knowledge enhanced Gloss2TextLiqiang Jing, Xuemeng Song, Xinxing Zu et al.
Existing sign language translation methods follow a two-stage pipeline: first converting the sign language video to a gloss sequence (i.e. Sign2Gloss) and then translating the generated gloss sequence into a spoken language sentence (i.e. Gloss2Text). While previous studies have focused on boosting the performance of the Sign2Gloss stage, we emphasize the optimization of the Gloss2Text stage. However, this task is non-trivial due to two distinct features of Gloss2Text: (1) isolated gloss input and (2) low-capacity gloss vocabulary. To address these issues, we propose a vision and context knowledge enhanced Gloss2Text model, named VK-G2T, which leverages the visual content of the sign language video to learn the properties of the target sentence and exploit the context knowledge to facilitate the adaptive translation of gloss words. Extensive experiments conducted on a Chinese benchmark validate the superiority of our model.
CLMay 5, 2023
Stylized Data-to-Text Generation: A Case Study in the E-Commerce DomainLiqiang Jing, Xuemeng Song, Xuming Lin et al.
Existing data-to-text generation efforts mainly focus on generating a coherent text from non-linguistic input data, such as tables and attribute-value pairs, but overlook that different application scenarios may require texts of different styles. Inspired by this, we define a new task, namely stylized data-to-text generation, whose aim is to generate coherent text for the given non-linguistic data according to a specific style. This task is non-trivial, due to three challenges: the logic of the generated text, unstructured style reference, and biased training samples. To address these challenges, we propose a novel stylized data-to-text generation model, named StyleD2T, comprising three components: logic planning-enhanced data embedding, mask-based style embedding, and unbiased stylized text generation. In the first component, we introduce a graph-guided logic planner for attribute organization to ensure the logic of generated text. In the second component, we devise feature-level mask-based style embedding to extract the essential style signal from the given unstructured style reference. In the last one, pseudo triplet augmentation is utilized to achieve unbiased text generation, and a multi-condition based confidence assignment function is designed to ensure the quality of pseudo samples. Extensive experiments on a newly collected dataset from Taobao have been conducted, and the results show the superiority of our model over existing methods.
IRJan 24, 2022
Dual Preference Distribution Learning for Item RecommendationXue Dong, Xuemeng Song, Na Zheng et al.
Recommender systems can automatically recommend users with items that they probably like. The goal of them is to model the user-item interaction by effectively representing the users and items. Existing methods have primarily learned the user's preferences and item's features with vectorized embeddings, and modeled the user's general preferences to items by the interaction of them. In fact, users have their specific preferences to item attributes and different preferences are usually related. Therefore, exploring the fine-grained preferences as well as modeling the relationships among user's different preferences could improve the recommendation performance. Toward this end, we propose a dual preference distribution learning framework (DUPLE), which aims to jointly learn a general preference distribution and a specific preference distribution for a given user, where the former corresponds to the user's general preference to items and the latter refers to the user's specific preference to item attributes. Notably, the mean vector of each Gaussian distribution can capture the user's preferences, and the covariance matrix can learn their relationship. Moreover, we can summarize a preferred attribute profile for each user, depicting his/her preferred item attributes. We then can provide the explanation for each recommended item by checking the overlap between its attributes and the user's preferred attribute profile. Extensive quantitative and qualitative experiments on six public datasets demonstrate the effectiveness and explainability of the DUPLE method.
IRSep 13, 2021
AliMe MKG: A Multi-modal Knowledge Graph for Live-streaming E-commerceGuohai Xu, Hehong Chen, Feng-Lin Li et al.
Live streaming is becoming an increasingly popular trend of sales in E-commerce. The core of live-streaming sales is to encourage customers to purchase in an online broadcasting room. To enable customers to better understand a product without jumping out, we propose AliMe MKG, a multi-modal knowledge graph that aims at providing a cognitive profile for products, through which customers are able to seek information about and understand a product. Based on the MKG, we build an online live assistant that highlights product search, product exhibition and question answering, allowing customers to skim over item list, view item details, and ask item-related questions. Our system has been launched online in the Taobao app, and currently serves hundreds of thousands of customers per day.
CLAug 18, 2021
GGP: A Graph-based Grouping Planner for Explicit Control of Long Text GenerationXuming Lin, Shaobo Cui, Zhongzhou Zhao et al.
Existing data-driven methods can well handle short text generation. However, when applied to the long-text generation scenarios such as story generation or advertising text generation in the commercial scenario, these methods may generate illogical and uncontrollable texts. To address these aforementioned issues, we propose a graph-based grouping planner(GGP) following the idea of first-plan-then-generate. Specifically, given a collection of key phrases, GGP firstly encodes these phrases into an instance-level sequential representation and a corpus-level graph-based representation separately. With these two synergic representations, we then regroup these phrases into a fine-grained plan, based on which we generate the final long text. We conduct our experiments on three long text generation datasets and the experimental results reveal that GGP significantly outperforms baselines, which proves that GGP can control the long text generation by knowing how to say and in what order.
CLAug 17, 2021
SPMoE: Generate Multiple Pattern-Aware Outputs with Sparse Pattern Mixture of ExpertsShaobo Cui, Xintong Bao, Xuming Lin et al.
Many generation tasks follow a one-to-many mapping relationship: each input could be associated with multiple outputs. Existing methods like Conditional Variational AutoEncoder(CVAE) employ a latent variable to model this one-to-many relationship. However, this high-dimensional and dense latent variable lacks explainability and usually leads to poor and uncontrollable generations. In this paper, we innovatively introduce the linguistic concept of pattern to decompose the one-to-many mapping into multiple one-to-one mappings and further propose a model named Sparse Pattern Mixture of Experts(SPMoE). Each one-to-one mapping is associated with a conditional generation pattern and is modeled with an expert in SPMoE. To ensure each language pattern can be exclusively handled with an expert model for better explainability and diversity, a sparse mechanism is employed to coordinate all the expert models in SPMoE. We assess the performance of our SPMoE on the paraphrase generation task and the experiment results prove that SPMoE can achieve a good balance in terms of quality, pattern-level diversity, and corpus-level diversity.
CVAug 16, 2021
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge IntegrationYuhao Cui, Zhou Yu, Chunqi Wang et al.
Vision-and-language pretraining (VLP) aims to learn generic multimodal representations from massive image-text pairs. While various successful attempts have been proposed, learning fine-grained semantic alignments between image-text pairs plays a key role in their approaches. Nevertheless, most existing VLP approaches have not fully utilized the intrinsic knowledge within the image-text pairs, which limits the effectiveness of the learned alignments and further restricts the performance of their models. To this end, we introduce a new VLP method called ROSITA, which integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments. Specifically, we introduce a novel structural knowledge masking (SKM) strategy to use the scene graph structure as a priori to perform masked language (region) modeling, which enhances the semantic alignments by eliminating the interference information within and across modalities. Extensive ablation studies and comprehensive analysis verifies the effectiveness of ROSITA in semantic alignments. Pretrained with both in-domain and out-of-domain datasets, ROSITA significantly outperforms existing state-of-the-art VLP methods on three typical vision-and-language tasks over six benchmark datasets.
CLFeb 24, 2021
OneStop QAMaker: Extract Question-Answer Pairs from Text in a One-Stop ApproachShaobo Cui, Xintong Bao, Xinxing Zu et al.
Large-scale question-answer (QA) pairs are critical for advancing research areas like machine reading comprehension and question answering. To construct QA pairs from documents requires determining how to ask a question and what is the corresponding answer. Existing methods for QA pair generation usually follow a pipeline approach. Namely, they first choose the most likely candidate answer span and then generate the answer-specific question. This pipeline approach, however, is undesired in mining the most appropriate QA pairs from documents since it ignores the connection between question generation and answer extraction, which may lead to incompatible QA pair generation, i.e., the selected answer span is inappropriate for question generation. However, for human annotators, we take the whole QA pair into account and consider the compatibility between question and answer. Inspired by such motivation, instead of the conventional pipeline approach, we propose a model named OneStop generate QA pairs from documents in a one-stop approach. Specifically, questions and their corresponding answer span is extracted simultaneously and the process of question generation and answer extraction mutually affect each other. Additionally, OneStop is much more efficient to be trained and deployed in industrial scenarios since it involves only one model to solve the complex QA generation task. We conduct comprehensive experiments on three large-scale machine reading comprehension datasets: SQuAD, NewsQA, and DuReader. The experimental results demonstrate that our OneStop model outperforms the baselines significantly regarding the quality of generated questions, quality of generated question-answer pairs, and model efficiency.
CLMay 27, 2020
Predict-then-Decide: A Predictive Approach for Wait or Answer Task in Dialogue SystemsZehao Lin, Shaobo Cui, Guodun Li et al.
Different people have different habits of describing their intents in conversations. Some people tend to deliberate their intents in several successive utterances, i.e., they use several consistent messages for readability instead of a long sentence to express their question. This creates a predicament faced by the application of dialogue systems, especially in real-world industry scenarios, in which the dialogue system is unsure whether it should answer the query of user immediately or wait for further supplementary input. Motivated by such an interesting predicament, we define a novel Wait-or-Answer task for dialogue systems. We shed light on a new research topic about how the dialogue system can be more intelligent to behave in this Wait-or-Answer quandary. Further, we propose a predictive approach named Predict-then-Decide (PTD) to tackle this Wait-or-Answer task. More specifically, we take advantage of a decision model to help the dialogue system decide whether to wait or answer. The decision of decision model is made with the assistance of two ancillary prediction models: a user prediction and an agent prediction. The user prediction model tries to predict what the user would supplement and uses its prediction to persuade the decision model that the user has some information to add, so the dialogue system should wait. The agent prediction model tries to predict the answer of the dialogue system and convince the decision model that it is a superior choice to answer the query of user immediately since the input of user has come to an end. We conduct our experiments on two real-life scenarios and three public datasets. Experimental results on five datasets show our proposed PTD approach significantly outperforms the existing models in solving this Wait-or-Answer problem.
CLFeb 22, 2020
"Wait, I'm Still Talking!" Predicting the Dialogue Interaction Behavior Using Imagine-Then-Arbitrate ModelZehao Lin, Shaobo Cui, Guodun Li et al.
Producing natural and accurate responses like human beings is the ultimate goal of intelligent dialogue agents. So far, most of the past works concentrate on selecting or generating one pertinent and fluent response according to current query and its context. These models work on a one-to-one environment, making one response to one utterance each round. However, in real human-human conversations, human often sequentially sends several short messages for readability instead of a long message in one turn. Thus messages will not end with an explicit ending signal, which is crucial for agents to decide when to reply. So the first step for an intelligent dialogue agent is not replying but deciding if it should reply at the moment. To address this issue, in this paper, we propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to help the agent decide whether to wait or to make a response directly. Our method has two imaginator modules and an arbitrator module. The two imaginators will learn the agent's and user's speaking style respectively, generate possible utterances as the input of the arbitrator, combining with dialogue history. And the arbitrator decides whether to wait or to make a response to the user directly. To verify the performance and effectiveness of our method, we prepared two dialogue datasets and compared our approach with several popular models. Experimental results show that our model performs well on addressing ending prediction issue and outperforms baseline models.
CLNov 28, 2018
A Deep Cascade Model for Multi-Document Reading ComprehensionMing Yan, Jiangnan Xia, Chen Wu et al.
A fundamental trade-off between effectiveness and efficiency needs to be balanced when designing an online question answering system. Effectiveness comes from sophisticated functions such as extractive machine reading comprehension (MRC), while efficiency is obtained from improvements in preliminary retrieval components such as candidate document selection and paragraph ranking. Given the complexity of the real-world multi-document MRC scenario, it is difficult to jointly optimize both in an end-to-end system. To address this problem, we develop a novel deep cascade learning model, which progressively evolves from the document-level and paragraph-level ranking of candidate texts to more precise answer extraction with machine reading comprehension. Specifically, irrelevant documents and paragraphs are first filtered out with simple functions for efficiency consideration. Then we jointly train three modules on the remaining texts for better tracking the answer: the document extraction, the paragraph extraction and the answer extraction. Experiment results show that the proposed method outperforms the previous state-of-the-art methods on two large-scale multi-document benchmark datasets, i.e., TriviaQA and DuReader. In addition, our online system can stably serve typical scenarios with millions of daily requests in less than 50ms.
CLMay 1, 2018
Memory-augmented Dialogue Management for Task-oriented Dialogue SystemsZheng Zhang, Minlie Huang, Zhongzhou Zhao et al.
Dialogue management (DM) decides the next action of a dialogue system according to the current dialogue state, and thus plays a central role in task-oriented dialogue systems. Since dialogue management requires to have access to not only local utterances, but also the global semantics of the entire dialogue session, modeling the long-range history information is a critical issue. To this end, we propose a novel Memory-Augmented Dialogue management model (MAD) which employs a memory controller and two additional memory structures, i.e., a slot-value memory and an external memory. The slot-value memory tracks the dialogue state by memorizing and updating the values of semantic slots (for instance, cuisine, price, and location), and the external memory augments the representation of hidden states of traditional recurrent neural networks through storing more context information. To update the dialogue state efficiently, we also propose slot-level attention on user utterances to extract specific semantic information for each slot. Experiments show that our model can obtain state-of-the-art performance and outperforms existing baselines.
CLJan 12, 2018
AliMe Assist: An Intelligent Assistant for Creating an Innovative E-commerce ExperienceFeng-Lin Li, Minghui Qiu, Haiqing Chen et al.
We present AliMe Assist, an intelligent assistant designed for creating an innovative online shopping experience in E-commerce. Based on question answering (QA), AliMe Assist offers assistance service, customer service, and chatting service. It is able to take voice and text input, incorporate context to QA, and support multi-round interaction. Currently, it serves millions of customer questions per day and is able to address 85% of them. In this paper, we demonstrate the system, present the underlying techniques, and share our experience in dealing with real-world QA in the E-commerce field.