CLOct 2, 2023
RA-DIT: Retrieval-Augmented Dual Instruction TuningXi Victoria Lin, Xilun Chen, Mingda Chen et al. · meta-ai
Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option by retrofitting any LLM with retrieval capabilities. Our approach operates in two distinct fine-tuning steps: (1) one updates a pre-trained LM to better use retrieved information, while (2) the other updates the retriever to return more relevant results, as preferred by the LM. By fine-tuning over tasks that require both knowledge utilization and contextual awareness, we demonstrate that each stage yields significant performance improvements, and using both leads to additional gains. Our best model, RA-DIT 65B, achieves state-of-the-art performance across a range of knowledge-intensive zero- and few-shot learning benchmarks, significantly outperforming existing in-context RALM approaches by up to +8.9% in 0-shot setting and +1.4% in 5-shot setting on average.
CLDec 16, 2022
BLASER: A Text-Free Speech-to-Speech Translation Evaluation MetricMingda Chen, Paul-Ambroise Duquenne, Pierre Andrews et al. · meta-ai
End-to-End speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output and reference into a shared embedding space and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that when evaluated at the sentence level, BLASER correlates significantly better with human judgment compared to ASR-dependent metrics including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows combining speech and text as inputs to BLASER does not increase the correlation with human scores, but best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
CLJun 22, 2023
xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource LanguagesMingda Chen, Kevin Heffernan, Onur Çelebi et al. · meta-ai
We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xSIM++. In comparison to xSIM, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xSIM, we show that xSIM++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xSIM++ also reports performance for different error types, offering more fine-grained feedback for model development.
CLMay 3, 2022
Improving In-Context Few-Shot Learning via Self-Supervised TrainingMingda Chen, Jingfei Du, Ramakanth Pasunuru et al. · uw
Self-supervised pretraining has made few-shot learning possible for many NLP tasks. But the pretraining objectives are not typically adapted specifically for in-context few-shot learning. In this paper, we propose to use self-supervision in an intermediate training stage between pretraining and downstream few-shot usage with the goal to teach the model to perform in-context few shot learning. We propose and evaluate four self-supervised objectives on two benchmarks. We find that the intermediate self-supervision stage produces models that outperform strong baselines. Ablation study shows that several factors affect the downstream performance, such as the amount of training data and the diversity of the self-supervised objectives. Human-annotated cross-task supervision and self-supervision are complementary. Qualitative analysis suggests that the self-supervised-trained models are better at following task requirements.
LGSep 30, 2024
Characterizing and Efficiently Accelerating Multimodal Generation Model InferenceYejin Lee, Anna Sun, Basil Hosmer et al. · meta-ai, stanford
Generative artificial intelligence (AI) technology is revolutionizing the computing industry. Not only its applications have broadened to various sectors but also poses new system design and optimization opportunities. The technology is capable of understanding and responding in multiple modalities. However, the advanced capability currently comes with significant system resource demands. To sustainably scale generative AI capabilities to billions of users in the world, inference must be fast and efficient. This paper pinpoints key system design and optimization opportunities by characterizing a family of emerging multi-modal generation models on real systems. Auto-regressive token generation is a critical latency performance bottleneck, typically dominated by GPU idle time. In addition to memory-intensive attention across the generative AI models, linear operations constitute significant inference latency due to the feed forward networks in Transformer-based models. We demonstrate that state-of-the-art optimization levers, spanning from applications to system software and hardware, set a 3.88x better baseline.
71.2CVMay 31
Knowledge-Intensive Video GenerationChenxu Wang, Mingda Chen
Text-to-video generation has advanced rapidly in visual quality, but remains under-evaluated for factuality and practical usefulness. We introduce knowledge-intensive video generation (KIVI), where models generate videos from short information-seeking prompts that ask for explanations, procedures, or demonstrations. To evaluate this setting, we construct KIVI-Bench, a benchmark of 1,080 prompts, and propose automatic metrics for factuality and helpfulness. Human evaluation shows that our metrics significantly better align with human annotations than existing alternatives. Experiments on seven state-of-the-art video generation models show that current systems still lag behind human performance, especially on visual properties, procedural operations, and clear information presentation. These results highlight KIVI as a challenging direction for factual and instructionally useful video generation.
65.5CLApr 1
Procedural Knowledge at Scale Improves ReasoningDi Wu, Devendra Singh Sachan, Wen-tau Yih et al. · mila
Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.
CLAug 14, 2025Code
A Survey on Diffusion Language ModelsTianyi Li, Mingda Chen, Bowei Guo et al.
Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.
CLJul 21, 2022
Leveraging Natural Supervision for Language Representation Learning and GenerationMingda Chen
Recent breakthroughs in Natural Language Processing (NLP) have been driven by language models trained on a massive amount of plain text. While powerful, deriving supervision from textual resources is still an open question. For example, language model pretraining often neglects the rich, freely-available structures in textual data. In this thesis, we describe three lines of work that seek to improve the training and evaluation of neural models using naturally-occurring supervision. We first investigate self-supervised training losses to help enhance the performance of pretrained language models for various NLP tasks. Specifically, we alter the sentence prediction loss to make it better suited to other pretraining losses and more challenging to solve. We design an intermediate finetuning step that uses self-supervised training to promote models' ability in cross-task generalization. Then we describe methods to leverage the structures in Wikipedia and paraphrases. In particular, we propose training losses to exploit hyperlinks, article structures, and article category graphs for entity-, discourse-, entailment-related knowledge. We propose a framework that uses paraphrase pairs to disentangle semantics and syntax in sentence representations. We extend the framework for a novel generation task that controls the syntax of output text with a sentential exemplar. Lastly, we discuss our work on tailoring textual resources for establishing challenging evaluation tasks. We introduce three datasets by defining novel tasks using various fan-contributed websites, including a long-form data-to-text generation dataset, a screenplay summarization dataset, and a long-form story generation dataset. These datasets have unique characteristics offering challenges to future work in their respective task settings.
CLSep 26, 2019Code
ALBERT: A Lite BERT for Self-supervised Learning of Language RepresentationsZhenzhong Lan, Mingda Chen, Sebastian Goodman et al.
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and \squad benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.
54.2SEMar 25
Boosting LLMs for Mutation GenerationBo Wang, Ming Deng, Mingda Chen et al.
LLM-based mutation testing is a promising testing technology, but existing approaches typically rely on a fixed set of mutations as few-shot examples or none at all. This can result in generic low-quality mutations, missed context-specific mutation patterns, substantial numbers of redundant and uncompilable mutants, and limited semantic similarity to real bugs. To overcome these limitations, we introduce SMART (Semantic Mutation with Adaptive Retrieval and Tuning). SMART integrates retrieval-augmented generation (RAG) on a vectorized dataset of real-world bugs, focused code chunking, and supervised fine-tuning using mutations coupled with real-world bugs. We conducted an extensive empirical study of SMART using 1,991 real-world Java bugs from the Defects4J and ConDefects datasets, comparing SMART to the state-of-the-art LLM-based approaches, LLMut and LLMorpheus. The results reveal that SMART substantially improves mutation validity, effectiveness, and efficiency (even enabling small-scale 7B-scale models to match or even surpass large models like GPT-4o). We also demonstrate that SMART significantly improves downstream software engineering applications, including test case prioritization and fault localization. More specifically, SMART improves validity (weighted average generation rate) from 42.89% to 65.6%. It raises the non-duplicate rate from 87.38% to 95.62%, and the compilable rate from 88.85% to 90.21%. In terms of effectiveness, it achieves a real bug detection rate of 92.61% (vs. 57.86% for LLMut) and improves the average Ochiai coefficient from 25.61% to 38.44%. For fault localization, SMART ranks 64 more bugs as Top-1 under MUSE and 57 more under Metallaxis.
CLDec 24, 2024
Improving Factuality with Explicit Working MemoryMingda Chen, Yang Li, Karthik Padthe et al.
Large language models can generate factually inaccurate content, a problem known as hallucination. Recent works have built upon retrieved-augmented generation to improve factuality through iterative prompting but these methods are limited by the traditional RAG design. To address these challenges, we introduce EWE (Explicit Working Memory), a novel approach that enhances factuality in long-form text generation by integrating a working memory that receives real-time feedback from external resources. The memory is refreshed based on online fact-checking and retrieval feedback, allowing EWE to rectify false claims during the generation process and ensure more accurate and reliable outputs. Our experiments demonstrate that Ewe outperforms strong baselines on four fact-seeking long-form generation datasets, increasing the factuality metric, VeriScore, by 2 to 6 points absolute without sacrificing the helpfulness of the responses. Further analysis reveals that the design of rules for memory updates, configurations of memory units, and the quality of the retrieval datastore are crucial factors for influencing model performance.
CLJul 31, 2025
FACTORY: A Challenging Human-Verified Prompt Set for Long-Form FactualityMingda Chen, Yang Li, Xilun Chen et al.
Long-form factuality evaluation assesses the ability of models to generate accurate, comprehensive responses to short prompts. Existing benchmarks often lack human verification, leading to potential quality issues. To address this limitation, we introduce FACTORY, a large-scale, human-verified prompt set. Developed using a model-in-the-loop approach and refined by humans, FACTORY includes challenging prompts that are fact-seeking, answerable, and unambiguous. We conduct human evaluations on 6 state-of-the-art language models using FACTORY and existing datasets. Our results show that FACTORY is a challenging benchmark: approximately 40% of the claims made in the responses of SOTA models are not factual, compared to only 10% for other datasets. Our analysis identifies the strengths of FACTORY over prior benchmarks, emphasizing its reliability and the necessity for models to reason across long-tailed facts.
CLJun 2, 2025
ImpRAG: Retrieval-Augmented Generation with Implicit QueriesWenzheng Zhang, Xi Victoria Lin, Karl Stratos et al.
Retrieval-Augmented Generation (RAG) systems traditionally treat retrieval and generation as separate processes, requiring explicit textual queries to connect them. This separation can limit the ability of models to generalize across diverse tasks. In this work, we propose a query-free RAG system, named ImpRAG, which integrates retrieval and generation into a unified model. ImpRAG allows models to implicitly express their information needs, eliminating the need for human-specified queries. By dividing pretrained decoder-only language models into specialized layer groups, ImpRAG optimizes retrieval and generation tasks simultaneously. Our approach employs a two-stage inference process, using the same model parameters and forward pass for both retrieval and generation, thereby minimizing the disparity between retrievers and language models. Experiments on 8 knowledge-intensive tasks demonstrate that ImpRAG achieves 3.6-11.5 improvements in exact match scores on unseen tasks with diverse formats, highlighting its effectiveness in enabling models to articulate their own information needs and generalize across tasks. Our analysis underscores the importance of balancing retrieval and generation parameters and leveraging generation perplexities as retrieval training objectives for enhanced performance.
CLMay 23, 2023
Few-Shot Data Synthesis for Open Domain Multi-Hop Question AnsweringMingda Chen, Xilun Chen, Wen-tau Yih
Few-shot learning for open domain multi-hop question answering typically relies on the incontext learning capability of large language models (LLMs). While powerful, these LLMs usually contain tens or hundreds of billions of parameters, making them rather inefficient at inference time. To improve performance of smaller language models, we propose a data synthesis framework for multi-hop question answering that requires less than 10 human annotated question answer pairs. Our framework depends only on rich, naturally-occurring relationships among documents and is built upon the data generation functions parameterized by LLMs and prompts. We synthesize millions of multi-hop questions and claims to finetune language models, evaluated on popular benchmarks for multi-hop question answering and fact verification. Empirically, our approach improves model performance significantly, allowing the finetuned models to be competitive with GPT-3.5 based approaches while being almost one-third the size in parameter count.
CLSep 18, 2021
TVStoryGen: A Dataset for Generating Stories with Character DescriptionsMingda Chen, Kevin Gimpel
We introduce TVStoryGen, a story generation dataset that requires generating detailed TV show episode recaps from a brief summary and a set of documents describing the characters involved. Unlike other story generation datasets, TVStoryGen contains stories that are authored by professional screen-writers and that feature complex interactions among multiple characters. Generating stories in TVStoryGen requires drawing relevant information from the lengthy provided documents about characters based on the brief summary. In addition, we propose to train reverse models on our dataset for evaluating the faithfulness of generated stories. We create TVStoryGen from fan-contributed websites, which allows us to collect 26k episode recaps with 1868.7 tokens on average. Empirically, we take a hierarchical story generation approach and find that the neural model that uses oracle content selectors for character descriptions demonstrates the best performance on automatic metrics, showing the potential of our dataset to inspire future research on story generation with constraints. Qualitative analysis shows that the best-performing model sometimes generates content that is unfaithful to the short summaries, suggesting promising directions for future work.
CLApr 14, 2021
SummScreen: A Dataset for Abstractive Screenplay SummarizationMingda Chen, Zewei Chu, Sam Wiseman et al.
We introduce SummScreen, a summarization dataset comprised of pairs of TV series transcripts and human written recaps. The dataset provides a challenging testbed for abstractive summarization for several reasons. Plot details are often expressed indirectly in character dialogues and may be scattered across the entirety of the transcript. These details must be found and integrated to form the succinct plot descriptions in the recaps. Also, TV scripts contain content that does not directly pertain to the central plot but rather serves to develop characters or provide comic relief. This information is rarely contained in recaps. Since characters are fundamental to TV series, we also propose two entity-centric evaluation metrics. Empirically, we characterize the dataset by evaluating several methods, including neural models and those based on nearest neighbors. An oracle extractive approach outperforms all benchmarked models according to automatic metrics, showing that the neural models are unable to fully exploit the input transcripts. Human evaluation and qualitative analysis reveal that our non-oracle models are competitive with their oracle counterparts in terms of generating faithful plot events and can benefit from better content selectors. Both oracle and non-oracle models generate unfaithful facts, suggesting future research directions.
CLDec 29, 2020
WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article SectionsMingda Chen, Sam Wiseman, Kevin Gimpel
Datasets for data-to-text generation typically focus either on multi-domain, single-sentence generation or on single-domain, long-form generation. In this work, we cast generating Wikipedia sections as a data-to-text generation task and create a large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata. WikiTableT contains millions of instances, covering a broad range of topics, as well as a variety of flavors of generation tasks with different levels of flexibility. We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they struggle with coherence and factuality, showing the potential for our dataset to inspire future work on long-form generation.
CLOct 12, 2020
Exemplar-Controllable Paraphrasing and Translation using BitextMingda Chen, Sam Wiseman, Kevin Gimpel
Most prior work on exemplar-based syntactically controlled paraphrase generation relies on automatically-constructed large-scale paraphrase datasets, which are costly to create. We sidestep this prerequisite by adapting models from prior work to be able to learn solely from bilingual text (bitext). Despite only using bitext for training, and in near zero-shot conditions, our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions. To evaluate these tasks quantitatively, we create three novel evaluation datasets. Our experimental results show that our models achieve competitive results on controlled paraphrase generation and strong performance on controlled machine translation. Analysis shows that our models learn to disentangle semantics and syntax in their latent representations, but still suffer from semantic drift.
CLOct 3, 2020
Mining Knowledge for Natural Language Inference from Wikipedia CategoriesMingda Chen, Zewei Chu, Karl Stratos et al.
Accurate lexical entailment (LE) and natural language inference (NLI) often require large quantities of costly annotations. To alleviate the need for labeled data, we introduce WikiNLI: a resource for improving model performance on NLI and LE tasks. It contains 428,899 pairs of phrases constructed from naturally annotated category hierarchies in Wikipedia. We show that we can improve strong baselines such as BERT and RoBERTa by pretraining them on WikiNLI and transferring the models on downstream tasks. We conduct systematic comparisons with phrases extracted from other knowledge bases such as WordNet and Wikidata to find that pretraining on WikiNLI gives the best performance. In addition, we construct WikiNLI in other languages, and show that pretraining on them improves performance on NLI tasks of corresponding languages.
CLMay 16, 2020
Learning Probabilistic Sentence Representations from ParaphrasesMingda Chen, Kevin Gimpel
Probabilistic word embeddings have shown effectiveness in capturing notions of generality and entailment, but there is very little work on doing the analogous type of investigation for sentences. In this paper we define probabilistic models that produce distributions for sentences. Our best-performing model treats each word as a linear transformation operator applied to a multivariate Gaussian distribution. We train our models on paraphrases and demonstrate that they naturally capture sentence specificity. While our proposed model achieves the best performance overall, we also show that specificity is represented by simpler architectures via the norm of the sentence vectors. Qualitative analysis shows that our probabilistic model captures sentential entailment and provides ways to analyze the specificity and preciseness of individual words.
CLNov 21, 2019
How to Ask Better Questions? A Large-Scale Multi-Domain Dataset for Rewriting Ill-Formed QuestionsZewei Chu, Mingda Chen, Jing Chen et al.
We present a large-scale dataset for the task of rewriting an ill-formed natural language question to a well-formed one. Our multi-domain question rewriting MQR dataset is constructed from human contributed Stack Exchange question edit histories. The dataset contains 427,719 question pairs which come from 303 domains. We provide human annotations for a subset of the dataset as a quality estimate. When moving from ill-formed to well-formed questions, the question quality improves by an average of 45 points across three aspects. We train sequence-to-sequence neural models on the constructed dataset and obtain an improvement of 13.2% in BLEU-4 over baseline methods built from other data resources. We release the MQR dataset to encourage research on the problem of question rewriting.
CLAug 31, 2019
Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence RepresentationsMingda Chen, Zewei Chu, Kevin Gimpel
Prior work on pretrained sentence embeddings and benchmarks focus on the capabilities of stand-alone sentences. We propose DiscoEval, a test suite of tasks to evaluate whether sentence representations include broader context information. We also propose a variety of training objectives that makes use of natural annotations from Wikipedia to build sentence encoders capable of modeling discourse. We benchmark sentence encoders pretrained with our proposed training objectives, as well as other popular pretrained sentence encoders on DiscoEval and other sentence evaluation tasks. Empirically, we show that these training objectives help to encode different aspects of information in document structures. Moreover, BERT and ELMo demonstrate strong performances over DiscoEval with individual hidden layers showing different characteristics.
CLAug 31, 2019
EntEval: A Holistic Evaluation Benchmark for Entity RepresentationsMingda Chen, Zewei Chu, Yang Chen et al.
Rich entity representations are useful for a wide class of problems involving entities. Despite their importance, there is no standardized benchmark that evaluates the overall quality of entity representations. In this work, we propose EntEval: a test suite of diverse tasks that require nontrivial understanding of entities including entity typing, entity similarity, entity relation prediction, and entity disambiguation. In addition, we develop training techniques for learning better entity representations by using natural hyperlink annotations in Wikipedia. We identify effective objectives for incorporating the contextual information in hyperlinks into state-of-the-art pretrained language models and show that they improve strong baselines on multiple EntEval tasks.
CLJun 23, 2019
Variational Sequential Labelers for Semi-Supervised LearningMingda Chen, Qingming Tang, Karen Livescu et al.
We introduce a family of multitask variational methods for semi-supervised sequence labeling. Our model family consists of a latent-variable generative model and a discriminative labeler. The generative models use latent variables to define the conditional probability of a word given its context, drawing inspiration from word prediction objectives commonly used in learning word embeddings. The labeler helps inject discriminative information into the latent space. We explore several latent variable configurations, including ones with hierarchical structure, which enables the model to account for both label-specific and word-specific information. Our models consistently outperform standard sequential baselines on 8 sequence labeling datasets, and improve further with unlabeled data.
CLJun 23, 2019
Smaller Text Classifiers with Discriminative Cluster EmbeddingsMingda Chen, Kevin Gimpel
Word embedding parameters often dominate overall model sizes in neural methods for natural language processing. We reduce deployed model sizes of text classifiers by learning a hard word clustering in an end-to-end manner. We use the Gumbel-Softmax distribution to maximize over the latent clustering while minimizing the task loss. We propose variations that selectively assign additional parameters to words, which further improves accuracy while still remaining parameter-efficient.
CLJun 3, 2019
Controllable Paraphrase Generation with a Syntactic ExemplarMingda Chen, Qingming Tang, Sam Wiseman et al.
Prior work on controllable text generation usually assumes that the controlled attribute can take on one of a small set of values known a priori. In this work, we propose a novel task, where the syntax of a generated sentence is controlled rather by a sentential exemplar. To evaluate quantitatively with standard metrics, we create a novel dataset with human annotations. We also develop a variational model with a neural module specifically designed for capturing syntactic knowledge and several multitask training objectives to promote disentangled representation learning. Empirically, the proposed model is observed to achieve improvements over baselines and learn to capture desirable characteristics.
CLApr 2, 2019
A Multi-Task Approach for Disentangling Syntax and Semantics in Sentence RepresentationsMingda Chen, Qingming Tang, Sam Wiseman et al.
We propose a generative model for a sentence that uses two latent variables, with one intended to represent the syntax of the sentence and the other to represent its semantics. We show we can achieve better disentanglement between semantic and syntactic representations by training with multiple losses, including losses that exploit aligned paraphrastic sentences and word-order information. We also investigate the effect of moving from bag-of-words to recurrent neural network modules. We evaluate our models as well as several popular pretrained embeddings on standard semantic similarity tasks and novel syntactic similarity tasks. Empirically, we find that the model with the best performing syntactic and semantic representations also gives rise to the most disentangled representations.