CLNov 24, 2023Code
Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural LanguageDi Jin, Shikib Mehri, Devamanyu Hazarika et al. · amazon-science
Learning from human feedback is a prominent technique to align the output of large language models (LLMs) with human expectations. Reinforcement learning from human feedback (RLHF) leverages human preference signals that are in the form of ranking of response pairs to perform this alignment. However, human preference on LLM outputs can come in much richer forms including natural language, which may provide detailed feedback on strengths and weaknesses of a given response. In this work we investigate data efficiency of modeling human feedback that is in natural language. Specifically, we fine-tune an open-source LLM, e.g., Falcon-40B-Instruct, on a relatively small amount (1000 records or even less) of human feedback in natural language in the form of critiques and revisions of responses. We show that this model is able to improve the quality of responses from even some of the strongest LLMs such as ChatGPT, BARD, and Vicuna, through critique and revision of those responses. For instance, through one iteration of revision of ChatGPT responses, the revised responses have 56.6% win rate over the original ones, and this win rate can be further improved to 65.9% after applying the revision for five iterations.
CLFeb 17, 2023Code
Cluster-Guided Label Generation in Extreme Multi-Label ClassificationTaehee Jung, Joo-Kyung Kim, Sungjin Lee et al. · amazon-science
For extreme multi-label classification (XMC), existing classification-based models poorly perform for tail labels and often ignore the semantic relations among labels, like treating "Wikipedia" and "Wiki" as independent and separate labels. In this paper, we cast XMC as a generation task (XLGen), where we benefit from pre-trained text-to-text models. However, generating labels from the extremely large label space is challenging without any constraints or guidance. We, therefore, propose to guide label generation using label cluster information to hierarchically generate lower-level labels. We also find that frequency-based label ordering and using decoding ensemble methods are critical factors for the improvements in XLGen. XLGen with cluster guidance significantly outperforms the classification and generation baselines on tail labels, and also generally improves the overall performance in four popular XMC benchmarks. In human evaluation, we also find XLGen generates unseen but plausible labels. Our code is now available at https://github.com/alexa/xlgen-eacl-2023.
CLFeb 10, 2023
Selective In-Context Data Augmentation for Intent Detection using Pointwise V-InformationYen-Ting Lin, Alexandros Papangelis, Seokhwan Kim et al.
This work focuses on in-context data augmentation for intent detection. Having found that augmentation via in-context prompting of large pre-trained language models (PLMs) alone does not improve performance, we introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model. Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents. It then employs intent-aware filtering, based on PVI, to remove datapoints that are not helpful to the downstream intent classifier. Our method is thus able to leverage the expressive power of large language models to produce diverse training data. Empirical results demonstrate that our method can produce synthetic training data that achieve state-of-the-art performance on three challenging intent detection datasets under few-shot settings (1.28% absolute improvement in 5-shot and 1.18% absolute in 10-shot, on average) and perform on par with the state-of-the-art in full-shot settings (within 0.01% absolute, on average).
LGOct 22, 2022
PENTATRON: PErsonalized coNText-Aware Transformer for Retrieval-based cOnversational uNderstandingNiranjan Uma Naresh, Ziyan Jiang, Ankit et al. · amazon-science
Conversational understanding is an integral part of modern intelligent devices. In a large fraction of the global traffic from customers using smart digital assistants, frictions in dialogues may be attributed to incorrect understanding of the entities in a customer's query due to factors including ambiguous mentions, mispronunciation, background noise and faulty on-device signal processing. Such errors are compounded by two common deficiencies from intelligent devices namely, (1) the device not being tailored to individual customers, and (2) the device responses being unaware of the context in the conversation session. Viewing this problem via the lens of retrieval-based search engines, we build and evaluate a scalable entity correction system, PENTATRON. The system leverages a parametric transformer-based language model to learn patterns from in-session customer-device interactions coupled with a non-parametric personalized entity index to compute the correct query, which aids downstream components in reasoning about the best response. In addition to establishing baselines and demonstrating the value of personalized and context-aware systems, we use multitasking to learn the domain of the correct entity. We also investigate the utility of language model prompts. Through extensive experiments, we show a significant upward movement of the key metric (Exact Match) by up to 500.97% (relative to the baseline).
CLMar 9, 2023
Open World Classification with Adaptive Negative SamplesKe Bai, Guoyin Wang, Jiwei Li et al.
Open world classification is a task in natural language processing with key practical relevance and impact. Since the open or {\em unknown} category data only manifests in the inference phase, finding a model with a suitable decision boundary accommodating for the identification of known classes and discrimination of the open category is challenging. The performance of existing models is limited by the lack of effective open category data during the training stage or the lack of a good mechanism to learn appropriate decision boundaries. We propose an approach based on \underline{a}daptive \underline{n}egative \underline{s}amples (ANS) designed to generate effective synthetic open category samples in the training stage and without requiring any prior knowledge or external datasets. Empirically, we find a significant advantage in using auxiliary one-versus-rest binary classifiers, which effectively utilize the generated negative samples and avoid the complex threshold-seeking stage in previous works. Extensive experiments on three benchmark datasets show that ANS achieves significant improvements over state-of-the-art methods.
LGJul 28, 2022
Learning Personalized Representations using Graph Convolutional NetworkHongyu Shen, Jinoh Oh, Shuai Zhao et al.
Generating representations that precisely reflect customers' behavior is an important task for providing personalized skill routing experience in Alexa. Currently, Dynamic Routing (DR) team, which is responsible for routing Alexa traffic to providers or skills, relies on two features to be served as personal signals: absolute traffic count and normalized traffic count of every skill usage per customer. Neither of them considers the network based structure for interactions between customers and skills, which contain richer information for customer preferences. In this work, we first build a heterogeneous edge attributed graph based customers' past interactions with the invoked skills, in which the user requests (utterances) are modeled as edges. Then we propose a graph convolutional network(GCN) based model, namely Personalized Dynamic Routing Feature Encoder(PDRFE), that generates personalized customer representations learned from the built graph. Compared with existing models, PDRFE is able to further capture contextual information in the graph convolutional function. The performance of our proposed model is evaluated by a downstream task, defect prediction, that predicts the defect label from the learned embeddings of customers and their triggered skills. We observe up to 41% improvements on the cross entropy metric for our proposed models compared to the baselines.
LGApr 5, 2022
Domain-Aware Contrastive Knowledge Transfer for Multi-domain Imbalanced DataZixuan Ke, Mohammad Kachuee, Sungjin Lee
In many real-world machine learning applications, samples belong to a set of domains e.g., for product reviews each review belongs to a product category. In this paper, we study multi-domain imbalanced learning (MIL), the scenario that there is imbalance not only in classes but also in domains. In the MIL setting, different domains exhibit different patterns and there is a varying degree of similarity and divergence among domains posing opportunities and challenges for transfer learning especially when faced with limited or insufficient training data. We propose a novel domain-aware contrastive knowledge transfer method called DCMI to (1) identify the shared domain knowledge to encourage positive transfer among similar domains (in particular from head domains to tail domains); (2) isolate the domain-specific knowledge to minimize the negative transfer from dissimilar domains. We evaluated the performance of DCMI on three different datasets showing significant improvements in different MIL scenarios.
LGApr 14, 2022
Scalable and Robust Self-Learning for Skill Routing in Large-Scale Conversational AI SystemsMohammad Kachuee, Jinseok Nam, Sarthak Ahuja et al.
Skill routing is an important component in large-scale conversational systems. In contrast to traditional rule-based skill routing, state-of-the-art systems use a model-based approach to enable natural conversations. To provide supervision signal required to train such models, ideas such as human annotation, replication of a rule-based system, relabeling based on user paraphrases, and bandit-based learning were suggested. However, these approaches: (a) do not scale in terms of the number of skills and skill on-boarding, (b) require a very costly expert annotation/rule-design, (c) introduce risks in the user experience with each model update. In this paper, we present a scalable self-learning approach to explore routing alternatives without causing abrupt policy changes that break the user experience, learn from the user interaction, and incrementally improve the routing via frequent model refreshes. To enable such robust frequent model updates, we suggest a simple and effective approach that ensures controlled policy updates for individual domains, followed by an off-policy evaluation for making deployment decisions without any need for lengthy A/B experimentation. We conduct various offline and online A/B experiments on a commercial large-scale conversational system to demonstrate the effectiveness of the proposed method in real-world production settings.
LGSep 17, 2022
Constrained Policy Optimization for Controlled Self-Learning in Conversational AI SystemsMohammad Kachuee, Sungjin Lee
Recently, self-learning methods based on user satisfaction metrics and contextual bandits have shown promising results to enable consistent improvements in conversational AI systems. However, directly targeting such metrics by off-policy bandit learning objectives often increases the risk of making abrupt policy changes that break the current user experience. In this study, we introduce a scalable framework for supporting fine-grained exploration targets for individual domains via user-defined constraints. For example, we may want to ensure fewer policy deviations in business-critical domains such as shopping, while allocating more exploration budget to domains such as music. Furthermore, we present a novel meta-gradient learning approach that is scalable and practical to address this problem. The proposed method adjusts constraint violation penalty terms adaptively through a meta objective that encourages balanced constraint satisfaction across domains. We conduct extensive experiments using data from a real-world conversational AI on a set of realistic constraint benchmarks. Based on the experimental results, we demonstrate that the proposed approach is capable of achieving the best balance between the policy value and constraint satisfaction rate.
CLJun 7, 2023
Data Augmentation for Improving Tail-traffic Robustness in Skill-routing for Dialogue SystemsTing-Wei Wu, Fatemeh Sheikholeslami, Mohammad Kachuee et al.
Large-scale conversational systems typically rely on a skill-routing component to route a user request to an appropriate skill and interpretation to serve the request. In such system, the agent is responsible for serving thousands of skills and interpretations which create a long-tail distribution due to the natural frequency of requests. For example, the samples related to play music might be a thousand times more frequent than those asking for theatre show times. Moreover, inputs used for ML-based skill routing are often a heterogeneous mix of strings, embedding vectors, categorical and scalar features which makes employing augmentation-based long-tail learning approaches challenging. To improve the skill-routing robustness, we propose an augmentation of heterogeneous skill-routing data and training targeted for robust operation in long-tail data regimes. We explore a variety of conditional encoder-decoder generative frameworks to perturb original data fields and create synthetic training data. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments using real-world data from a commercial conversational system. Based on the experiment results, the proposed approach improves more than 80% (51 out of 63) of intents with less than 10K of traffic instances in the skill-routing replication task.
CLJun 10, 2021Code
AUGNLG: Few-shot Natural Language Generation using Self-trained Data AugmentationXinnuo Xu, Guoyin Wang, Young-Bum Kim et al.
Natural Language Generation (NLG) is a key component in a task-oriented dialogue system, which converts the structured meaning representation (MR) to the natural language. For large-scale conversational systems, where it is common to have over hundreds of intents and thousands of slots, neither template-based approaches nor model-based approaches are scalable. Recently, neural NLGs started leveraging transfer learning and showed promising results in few-shot settings. This paper proposes AUGNLG, a novel data augmentation approach that combines a self-trained neural retrieval model with a few-shot learned NLU model, to automatically create MR-to-Text data from open-domain texts. The proposed system mostly outperforms the state-of-the-art methods on the FewShotWOZ data in both BLEU and Slot Error Rate. We further confirm improved results on the FewShotSGD data and provide comprehensive analysis results on key components of our system. Our code and data are available at https://github.com/XinnuoXu/AugNLG.
CLApr 18, 2019Code
ConvLab: Multi-Domain End-to-End Dialog System PlatformSungjin Lee, Qi Zhu, Ryuichi Takanobu et al.
We present ConvLab, an open-source multi-domain end-to-end dialog system platform, that enables researchers to quickly set up experiments with reusable components and compare a large set of different approaches, ranging from conventional pipeline systems to end-to-end neural models, in common environments. ConvLab offers a set of fully annotated datasets and associated pre-trained reference models. As a showcase, we extend the MultiWOZ dataset with user dialog act annotations to train all component models and demonstrate how ConvLab makes it easy and effortless to conduct complicated experiments in multi-domain end-to-end dialog settings.
CLFeb 7, 2025
SeDi-Instruct: Enhancing Alignment of Language Models through Self-Directed Instruction GenerationJungwoo Kim, Minsang Kim, Sungjin Lee
The rapid evolution of Large Language Models (LLMs) has enabled the industry to develop various AI-based services. Instruction tuning is considered essential in adapting foundation models for target domains to provide high-quality services to customers. A key challenge in instruction tuning is obtaining high-quality instruction data. Self-Instruct, which automatically generates instruction data using ChatGPT APIs, alleviates the data scarcity problem. To improve the quality of instruction data, Self-Instruct discards many of the instructions generated from ChatGPT, even though it is inefficient in terms of cost owing to many useless API calls. To generate high-quality instruction data at a low cost, we propose a novel data generation framework, Self-Direct Instruction generation (SeDi-Instruct), which employs diversity-based filtering and iterative feedback task generation. Diversity-based filtering maintains model accuracy without excessively discarding low-quality generated instructions by enhancing the diversity of instructions in a batch. This reduces the cost of synthesizing instruction data. The iterative feedback task generation integrates instruction generation and training tasks and utilizes information obtained during the training to create high-quality instruction sets. Our results show that SeDi-Instruct enhances the accuracy of AI models by 5.2%, compared with traditional methods, while reducing data generation costs by 36%.
CVSep 2, 2025
A Diffusion-Based Framework for Configurable and Realistic Multi-Storage Trace GenerationSeohyun Kim, Junyoung Lee, Jongho Park et al.
We propose DiTTO, a novel diffusion-based framework for generating realistic, precisely configurable, and diverse multi-device storage traces. Leveraging advanced diffusion techniques, DiTTO enables the synthesis of high-fidelity continuous traces that capture temporal dynamics and inter-device dependencies with user-defined configurations. Our experimental results demonstrate that DiTTO can generate traces with high fidelity and diversity while aligning closely with guided configurations with only 8% errors.
DBAug 26, 2025
Rethinking Caching for LLM Serving Systems: Beyond Traditional HeuristicsJungwoo Kim, Minsang Kim, Jaeheon Lee et al.
Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. Building on this, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to 1.71$\times$ higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.
LGFeb 17, 2022
DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta CompressionJisung Park, Jeoggyun Kim, Yeseong Kim et al.
Data reduction in storage systems is becoming increasingly important as an effective solution to minimize the management cost of a data center. To maximize data-reduction efficiency, existing post-deduplication delta-compression techniques perform delta compression along with traditional data deduplication and lossless compression. Unfortunately, we observe that existing techniques achieve significantly lower data-reduction ratios than the optimal due to their limited accuracy in identifying similar data blocks. In this paper, we propose DeepSketch, a new reference search technique for post-deduplication delta compression that leverages the learning-to-hash method to achieve higher accuracy in reference search for delta compression, thereby improving data-reduction efficiency. DeepSketch uses a deep neural network to extract a data block's sketch, i.e., to create an approximate data signature of the block that can preserve similarity with other blocks. Our evaluation using eleven real-world workloads shows that DeepSketch improves the data-reduction ratio by up to 33% (21% on average) over a state-of-the-art post-deduplication delta-compression technique.
CLSep 25, 2021
Deciding Whether to Ask Clarifying Questions in Large-Scale Spoken Language UnderstandingJoo-Kyung Kim, Guoyin Wang, Sungjin Lee et al.
A large-scale conversational agent can suffer from understanding user utterances with various ambiguities such as ASR ambiguity, intent ambiguity, and hypothesis ambiguity. When ambiguities are detected, the agent should engage in a clarifying dialog to resolve the ambiguities before committing to actions. However, asking clarifying questions for all the ambiguity occurrences could lead to asking too many questions, essentially hampering the user experience. To trigger clarifying questions only when necessary for the user satisfaction, we propose a neural self-attentive model that leverages the hypotheses with ambiguities and contextual signals. We conduct extensive experiments on five common ambiguity types using real data from a large-scale commercial conversational agent and demonstrate significant improvement over a set of baseline approaches.
LGJun 4, 2021
Learning Slice-Aware Representations with Mixture of AttentionsCheng Wang, Sungjin Lee, Sunghyun Park et al.
Real-world machine learning systems are achieving remarkable performance in terms of coarse-grained metrics like overall accuracy and F-1 score. However, model improvement and development often require fine-grained modeling on individual data subsets or slices, for instance, the data slices where the models have unsatisfactory results. In practice, it gives tangible values for developing such models that can pay extra attention to critical or interested slices while retaining the original overall performance. This work extends the recent slice-based learning (SBL)~\cite{chen2019slice} with a mixture of attentions (MoA) to learn slice-aware dual attentive representations. We empirically show that the MoA approach outperforms the baseline method as well as the original SBL approach on monitored slices with two natural language understanding (NLU) tasks.
LGApr 26, 2021
Handling Long-Tail Queries with Slice-Aware Conversational SystemsCheng Wang, Sun Kim, Taiwoo Park et al.
We have been witnessing the usefulness of conversational AI systems such as Siri and Alexa, directly impacting our daily lives. These systems normally rely on machine learning models evolving over time to provide quality user experience. However, the development and improvement of the models are challenging because they need to support both high (head) and low (tail) usage scenarios, requiring fine-grained modeling strategies for specific data subsets or slices. In this paper, we explore the recent concept of slice-based learning (SBL) (Chen et al., 2019) to improve our baseline conversational skill routing system on the tail yet critical query traffic. We first define a set of labeling functions to generate weak supervision data for the tail intents. We then extend the baseline model towards a slice-aware architecture, which monitors and improves the model performance on the selected tail intents. Applied to de-identified live traffic from a commercial conversational AI system, our experiments show that the slice-aware model is beneficial in improving model performance for the tail intents while maintaining the overall performance.
CLMar 4, 2021
Neural model robustness for skill routing in large-scale conversational AI systems: A design choice explorationHan Li, Sunghyun Park, Aswarth Dara et al.
Current state-of-the-art large-scale conversational AI or intelligent digital assistant systems in industry comprises a set of components such as Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). For some of these systems that leverage a shared NLU ontology (e.g., a centralized intent/slot schema), there exists a separate skill routing component to correctly route a request to an appropriate skill, which is either a first-party or third-party application that actually executes on a user request. The skill routing component is needed as there are thousands of skills that can either subscribe to the same intent and/or subscribe to an intent under specific contextual conditions (e.g., device has a screen). Ensuring model robustness or resilience in the skill routing component is an important problem since skills may dynamically change their subscription in the ontology after the skill routing model has been deployed to production. We show how different modeling design choices impact the model robustness in the context of skill routing on a state-of-the-art commercial conversational AI system, specifically on the choices around data augmentation, model architecture, and optimization method. We show that applying data augmentation can be a very effective and practical way to drastically improve model robustness.
CLMar 1, 2021
DEUS: A Data-driven Approach to Estimate User Satisfaction in Multi-turn DialoguesZiming Li, Dookun Park, Julia Kiseleva et al.
Digital assistants are experiencing rapid growth due to their ability to assist users with day-to-day tasks where most dialogues are happening multi-turn. However, evaluating multi-turn dialogues remains challenging, especially at scale. We suggest a context-sensitive method to estimate the turn-level satisfaction for dialogue considering various types of user preferences. The costs of interactions between users and dialogue systems are formulated using a budget consumption concept. We assume users have an initial interaction budget for a dialogue formed based on the task complexity and that each turn has a cost. When the task is completed, or the budget has been exhausted, users quit the dialogue. We demonstrate our method's effectiveness by extensive experimentation with a simulated dialogue platform and real multi-turn dialogues.
CLOct 23, 2020
A scalable framework for learning from implicit user feedback to improve natural language understanding in large-scale conversational AI systemsSunghyun Park, Han Li, Ameen Patel et al.
Natural Language Understanding (NLU) is an established component within a conversational AI or digital assistant system, and it is responsible for producing semantic understanding of a user request. We propose a scalable and automatic approach for improving NLU in a large-scale conversational AI system by leveraging implicit user feedback, with an insight that user interaction data and dialog context have rich information embedded from which user satisfaction and intention can be inferred. In particular, we propose a general domain-agnostic framework for curating new supervision data for improving NLU from live production traffic. With an extensive set of experiments, we show the results of applying the framework and improving NLU for a large-scale production system and show its impact across 10 domains.
LGOct 21, 2020
Self-Supervised Contrastive Learning for Efficient User Satisfaction Prediction in Conversational AgentsMohammad Kachuee, Hao Yuan, Young-Bum Kim et al.
Turn-level user satisfaction is one of the most important performance metrics for conversational agents. It can be used to monitor the agent's performance and provide insights about defective user experiences. Moreover, a powerful satisfaction model can be used as an objective function that a conversational agent continuously optimizes for. While end-to-end deep learning has shown promising results, having access to a large number of reliable annotated samples required by these methods remains challenging. In a large-scale conversational system, there is a growing number of newly developed skills, making the traditional data collection, annotation, and modeling process impractical due to the required annotation costs as well as the turnaround times. In this paper, we suggest a self-supervised contrastive learning approach that leverages the pool of unlabeled data to learn user-agent interactions. We show that the pre-trained models using the self-supervised objective are transferable to the user satisfaction prediction. In addition, we propose a novel few-shot transfer learning approach that ensures better transferability for very small sample sizes. The suggested few-shot method does not require any inner loop optimization process and is scalable to very large datasets and complex models. Based on our experiments using real-world data from a large-scale commercial system, the suggested approach is able to significantly reduce the required number of annotations, while improving the generalization on unseen out-of-domain skills.
HCMay 29, 2020
Large-scale Hybrid Approach for Predicting User Satisfaction with Conversational AgentsDookun Park, Hao Yuan, Dongmin Kim et al.
Measuring user satisfaction level is a challenging task, and a critical component in developing large-scale conversational agent systems serving the needs of real users. An widely used approach to tackle this is to collect human annotation data and use them for evaluation or modeling. Human annotation based approaches are easier to control, but hard to scale. A novel alternative approach is to collect user's direct feedback via a feedback elicitation system embedded to the conversational agent system, and use the collected user feedback to train a machine-learned model for generalization. User feedback is the best proxy for user satisfaction, but is not available for some ineligible intents and certain situations. Thus, these two types of approaches are complementary to each other. In this work, we tackle the user satisfaction assessment problem with a hybrid approach that fuses explicit user feedback, user satisfaction predictions inferred by two machine-learned models, one trained on user feedback data and the other human annotation data. The hybrid approach is based on a waterfall policy, and the experimental results with Amazon Alexa's large-scale datasets show significant improvements in inferring user satisfaction. A detailed hybrid architecture, an in-depth analysis on user feedback data, and an algorithm that generates data sets to properly simulate the live traffic are presented in this paper.
AIApr 7, 2020
Guided Dialog Policy Learning without Adversarial Learning in the LoopZiming Li, Sungjin Lee, Baolin Peng et al.
Reinforcement Learning (RL) methods have emerged as a popular choice for training an efficient and effective dialogue policy. However, these methods suffer from sparse and unstable reward signals returned by a user simulator only when a dialogue finishes. Besides, the reward signal is manually designed by human experts, which requires domain knowledge. Recently, a number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy. However, to alternatively update the dialogue policy and the reward model on the fly, we are limited to policy-gradient-based algorithms, such as REINFORCE and PPO. Moreover, the alternating training of a dialogue agent and the reward model can easily get stuck in local optima or result in mode collapse. To overcome the listed issues, we propose to decompose the adversarial training into two steps. First, we train the discriminator with an auxiliary dialogue generator and then incorporate a derived reward model into a common RL method to guide the dialogue policy learning. This approach is applicable to both on-policy and off-policy RL methods. Based on our extensive experimentation, we can conclude the proposed method: (1) achieves a remarkable task success rate using both on-policy and off-policy RL methods; and (2) has the potential to transfer knowledge from existing domains to a new domain.
CLNov 14, 2019
The Eighth Dialog System Technology ChallengeSeokhwan Kim, Michel Galley, Chulaka Gunasekara et al.
This paper introduces the Eighth Dialog System Technology Challenge. In line with recent challenges, the eighth edition focuses on applying end-to-end dialog technologies in a pragmatic way for multi-domain task-completion, noetic response selection, audio visual scene-aware dialog, and schema-guided dialog state tracking tasks. This paper describes the task definition, provided datasets, and evaluation set-up for each track. We also summarize the results of the submitted systems to highlight the overall trends of the state-of-the-art technologies for the tasks.
CLOct 25, 2019
Generating a Common Question from Multiple Documents using Multi-source Encoder-Decoder ModelsWoon Sang Cho, Yizhe Zhang, Sudha Rao et al.
Ambiguous user queries in search engines result in the retrieval of documents that often span multiple topics. One potential solution is for the search engine to generate multiple refined queries, each of which relates to a subset of the documents spanning the same topic. A preliminary step towards this goal is to generate a question that captures common concepts of multiple documents. We propose a new task of generating common question from multiple documents and present simple variant of an existing multi-source encoder-decoder framework, called the Multi-Source Question Generator (MSQG). We first train an RNN-based single encoder-decoder generator from (single document, question) pairs. At test time, given multiple documents, the 'Distribute' step of our MSQG model predicts target word distributions for each document using the trained model. The 'Aggregate' step aggregates these distributions to generate a common question. This simple yet effective strategy significantly outperforms several existing baseline models applied to the new task when evaluated using automated metrics and human judgments on the MS-MARCO-QA dataset.
CLOct 3, 2019
Data-Efficient Goal-Oriented Conversation with Dialogue Knowledge Transfer NetworksIgor Shalyminov, Sungjin Lee, Arash Eshghi et al.
Goal-oriented dialogue systems are now being widely adopted in industry where it is of key importance to maintain a rapid prototyping cycle for new products and domains. Data-driven dialogue system development has to be adapted to meet this requirement --- therefore, reducing the amount of data and annotations necessary for training such systems is a central research problem. In this paper, we present the Dialogue Knowledge Transfer Network (DiKTNet), a state-of-the-art approach to goal-oriented dialogue generation which only uses a few example dialogues (i.e. few-shot learning), none of which has to be annotated. We achieve this by performing a 2-stage training. Firstly, we perform unsupervised dialogue representation pre-training on a large source of goal-oriented dialogues in multiple domains, the MetaLWOz corpus. Secondly, at the transfer stage, we train DiKTNet using this representation together with 2 other textual knowledge sources with different levels of generality: ELMo encoder and the main dataset's source domains. Our main dataset is the Stanford Multi-Domain dialogue corpus. We evaluate our model on it in terms of BLEU and Entity F1 scores, and show that our approach significantly and consistently improves upon a series of baseline models as well as over the previous state-of-the-art dialogue generation model, ZSDG. The improvement upon the latter --- up to 10% in Entity F1 and the average of 3% in BLEU score --- is achieved using only the equivalent of 10% of ZSDG's in-domain training data.
CLSep 3, 2019
Structuring Latent Spaces for Stylized Response GenerationXiang Gao, Yizhe Zhang, Sungjin Lee et al.
Generating responses in a targeted style is a useful yet challenging task, especially in the absence of parallel data. With limited data, existing methods tend to generate responses that are either less stylized or less context-relevant. We propose StyleFusion, which bridges conversation modeling and non-parallel style transfer by sharing a structured latent space. This structure allows the system to generate stylized relevant responses by sampling in the neighborhood of the conversation model prediction, and continuously control the style level. We demonstrate this method using dialogues from Reddit data and two sets of sentences with distinct styles (arXiv and Sherlock Holmes novels). Automatic and human evaluation show that, without sacrificing appropriateness, the system generates responses of the targeted style and outperforms competitive baselines.
CLAug 16, 2019
Few-Shot Dialogue Generation Without Annotated Data: A Transfer Learning ApproachIgor Shalyminov, Sungjin Lee, Arash Eshghi et al.
Learning with minimal data is one of the key challenges in the development of practical, production-ready goal-oriented dialogue systems. In a real-world enterprise setting where dialogue systems are developed rapidly and are expected to work robustly for an ever-growing variety of domains, products, and scenarios, efficient learning from a limited number of examples becomes indispensable. In this paper, we introduce a technique to achieve state-of-the-art dialogue generation performance in a few-shot setup, without using any annotated data. We do this by leveraging background knowledge from a larger, more highly represented dialogue source --- namely, the MetaLWOz dataset. We evaluate our model on the Stanford Multi-Domain Dialogue Dataset, consisting of human-human goal-oriented dialogues in in-car navigation, appointment scheduling, and weather information domains. We show that our few-shot approach achieves state-of-the art results on that dataset by consistently outperforming the previous best model in terms of BLEU and Entity F1 scores, while being more data-efficient by not requiring any data annotation.
CLMay 24, 2019
Contextual Out-of-Domain Utterance Handling With Counterfeit Data AugmentationSungjin Lee, Igor Shalyminov
Neural dialog models often lack robustness to anomalous user input and produce inappropriate responses which leads to frustrating user experience. Although there are a set of prior approaches to out-of-domain (OOD) utterance detection, they share a few restrictions: they rely on OOD data or multiple sub-domains, and their OOD detection is context-independent which leads to suboptimal performance in a dialog. The goal of this paper is to propose a novel OOD detection method that does not require OOD data by utilizing counterfeit OOD turns in the context of a dialog. For the sake of fostering further research, we also release new dialog datasets which are 3 publicly available dialog corpora augmented with OOD turns in a controllable way. Our method outperforms state-of-the-art dialog models equipped with a conventional OOD detection mechanism by a large margin in the presence of OOD utterances.
CLMar 13, 2019
Consistent Dialogue Generation with Self-supervised Feature LearningYizhe Zhang, Xiang Gao, Sungjin Lee et al.
Generating responses that are consistent with the dialogue context is one of the central challenges in building engaging conversational agents. We demonstrate that neural conversation models can be geared towards generating consistent responses by maintaining certain features related to topics and personas throughout the conversation. Past work has required external supervision that exploits features such as user identities that are often unavailable. In our approach, topic and persona feature extractors are trained using a contrastive training scheme that utilizes the natural structure of dialogue data. We further adopt a feature disentangling loss which, paired with controllable response generation techniques, allows us to promote or demote certain learned topics and persona features. Evaluation results demonstrate the model's ability to capture meaningful topics and persona features. The incorporation of the learned features brings significant improvement in terms of the quality of generated responses on two dialogue datasets.
CLFeb 28, 2019
Jointly Optimizing Diversity and Relevance in Neural Response GenerationXiang Gao, Sungjin Lee, Yizhe Zhang et al.
Although recent neural conversation models have shown great potential, they often generate bland and generic responses. While various approaches have been explored to diversify the output of the conversation model, the improvement often comes at the cost of decreased relevance. In this paper, we propose a SpaceFusion model to jointly optimize diversity and relevance that essentially fuses the latent space of a sequence-to-sequence model and that of an autoencoder model by leveraging novel regularization terms. As a result, our approach induces a latent space in which the distance and direction from the predicted response vector roughly match the relevance and diversity, respectively. This property also lends itself well to an intuitive visualization of the latent space. Both automatic and human evaluation results demonstrate that the proposed approach brings significant improvement compared to strong baselines in both diversity and relevance.
CLNov 29, 2018
Improving Robustness of Neural Dialog Systems in a Data-Efficient Way with Turn DropoutIgor Shalyminov, Sungjin Lee
Neural network-based dialog models often lack robustness to anomalous, out-of-domain (OOD) user input which leads to unexpected dialog behavior and thus considerably limits such models' usage in mission-critical production environments. The problem is especially relevant in the setting of dialog system bootstrapping with limited training data and no access to OOD examples. In this paper, we explore the problem of robustness of such systems to anomalous input and the associated to it trade-off in accuracies on seen and unseen data. We present a new dataset for studying the robustness of dialog systems to OOD input, which is bAbI Dialog Task 6 augmented with OOD content in a controlled way. We then present turn dropout, a simple yet efficient negative sampling-based technique for improving robustness of neural dialog models. We demonstrate its effectiveness applied to Hybrid Code Network-family models (HCNs) which reach state-of-the-art results on our OOD-augmented dataset as well as the original one. Specifically, an HCN trained with turn dropout achieves state-of-the-art performance of more than 75% per-utterance accuracy on the augmented dataset's OOD turns and 74% F1-score as an OOD detector. Furthermore, we introduce a Variational HCN enhanced with turn dropout which achieves more than 56.5% accuracy on the original bAbI Task 6 dataset, thus outperforming the initially reported HCN's result.
CLNov 15, 2018
Nudging Neural Conversational Model with Domain KnowledgeSungjin Lee
Neural conversation models are attractive because one can train a model directly on dialog examples with minimal labeling. With a small amount of data, however, they often fail to generalize over test data since they tend to capture spurious features instead of semantically meaningful domain knowledge. To address this issue, we propose a novel approach that allows any human teachers to transfer their domain knowledge to the conversation model in the form of natural language rules. We tested our method with three different dialog datasets. The improved performance across all domains demonstrates the efficacy of our proposed method.
CLAug 29, 2018
Zero-Shot Adaptive Transfer for Conversational Language UnderstandingSungjin Lee, Rahul Jha
Conversational agents such as Alexa and Google Assistant constantly need to increase their language understanding capabilities by adding new domains. A massive amount of labeled data is required for training each new domain. While domain adaptation approaches alleviate the annotation cost, prior approaches suffer from increased training time and suboptimal concept alignments. To tackle this, we introduce a novel Zero-Shot Adaptive Transfer method for slot tagging that utilizes the slot description for transferring reusable concepts across domains, and enjoys efficient training without any explicit concept alignments. Extensive experimentation over a dataset of 10 domains relevant to our commercial personal digital assistant shows that our model outperforms previous state-of-the-art systems by a large margin, and achieves an even higher improvement in the low data regime.
CLJan 16, 2018
OneNet: Joint Domain, Intent, Slot Prediction for Spoken Language UnderstandingYoung-Bum Kim, Sungjin Lee, Karl Stratos
In practice, most spoken language understanding systems process user input in a pipelined manner; first domain is predicted, then intent and semantic slots are inferred according to the semantic frames of the predicted domain. The pipeline approach, however, has some disadvantages: error propagation and lack of information sharing. To address these issues, we present a unified neural network that jointly performs domain, intent, and slot predictions. Our approach adopts a principled architecture for multitask learning to fold in the state-of-the-art models for each task. With a few more ingredients, e.g. orthography-sensitive input encoding and curriculum training, our model delivered significant improvements in all three tasks across all domains over strong baselines, including one using oracle prediction for domain detection, on real user data of a commercial personal assistant.
CLDec 28, 2017
Toward Continual Learning for Conversational AgentsSungjin Lee
While end-to-end neural conversation models have led to promising advances in reducing hand-crafted features and errors induced by the traditional complex system architecture, they typically require an enormous amount of data due to the lack of modularity. Previous studies adopted a hybrid approach with knowledge-based components either to abstract out domain-specific information or to augment data to cover more diverse patterns. On the contrary, we propose to directly address the problem using recent developments in the space of continual learning for neural models. Specifically, we adopt a domain-independent neural conversational model and introduce a novel neural continual learning algorithm that allows a conversational agent to accumulate skills across different tasks in a data-efficient way. To the best of our knowledge, this is the first work that applies continual learning to conversation systems. We verified the efficacy of our method through a conversational skill transfer from either synthetic dialogs or human-human dialogs to human-computer conversations in a customer support domain.
CLNov 29, 2017
Speaker-Sensitive Dual Memory Networks for Multi-Turn Slot TaggingYoung-Bum Kim, Sungjin Lee, Ruhi Sarikaya
In multi-turn dialogs, natural language understanding models can introduce obvious errors by being blind to contextual information. To incorporate dialog history, we present a neural architecture with Speaker-Sensitive Dual Memory Networks which encode utterances differently depending on the speaker. This addresses the different extents of information available to the system - the system knows only the surface form of user utterances while it has the exact semantics of system output. We performed experiments on real user data from Microsoft Cortana, a commercial personal assistant. The result showed a significant performance improvement over the state-of-the-art slot tagging models using contextual information.
CLApr 10, 2017
Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement LearningBaolin Peng, Xiujun Li, Lihong Li et al.
Building a dialogue agent to fulfill complex tasks, such as travel planning, is challenging because the agent has to learn to collectively complete multiple subtasks. For example, the agent needs to reserve a hotel and book a flight so that there leaves enough time for commute between arrival and hotel check-in. This paper addresses this challenge by formulating the task in the mathematical framework of options over Markov Decision Processes (MDPs), and proposing a hierarchical deep reinforcement learning approach to learning a dialogue manager that operates at different temporal scales. The dialogue manager consists of: (1) a top-level dialogue policy that selects among subtasks or options, (2) a low-level dialogue policy that selects primitive actions to complete the subtask given by the top-level policy, and (3) a global state tracker that helps ensure all cross-subtask constraints be satisfied. Experiments on a travel planning task with simulated and real users show that our approach leads to significant improvements over three baselines, two based on handcrafted rules and the other based on flat deep reinforcement learning.