IRNov 18, 2022Code
CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector RetrievalMinghan Li, Sheng-Chieh Lin, Barlas Oguz et al. · meta-ai
Multi-vector retrieval methods combine the merits of sparse (e.g. BM25) and dense (e.g. DPR) retrievers and have achieved state-of-the-art performance on various retrieval tasks. These methods, however, are orders of magnitude slower and need much more space to store their indices compared to their single-vector counterparts. In this paper, we unify different multi-vector retrieval models from a token routing viewpoint and propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval. CITADEL learns to route different token vectors to the predicted lexical ``keys'' such that a query token vector only interacts with document token vectors routed to the same key. This design significantly reduces the computation cost while maintaining high accuracy. Notably, CITADEL achieves the same or slightly better performance than the previous state of the art, ColBERT-v2, on both in-domain (MS MARCO) and out-of-domain (BEIR) evaluations, while being nearly 40 times faster. Code and data are available at https://github.com/facebookresearch/dpr-scale.
CLDec 19, 2022
Improving Faithfulness of Abstractive Summarization by Controlling Confounding Effect of Irrelevant SentencesAsish Ghoshal, Arash Einolghozati, Ankit Arun et al. · berkeley, meta-ai
Lack of factual correctness is an issue that still plagues state-of-the-art summarization systems despite their impressive progress on generating seemingly fluent summaries. In this paper, we show that factual inconsistency can be caused by irrelevant parts of the input text, which act as confounders. To that end, we leverage information-theoretic measures of causal effects to quantify the amount of confounding and precisely quantify how they affect the summarization performance. Based on insights derived from our theoretical results, we design a simple multi-task model to control such confounding by leveraging human-annotated relevant sentences when available. Crucially, we give a principled characterization of data distributions where such confounding can be large thereby necessitating the use of human annotated relevant sentences to generate factual summaries. Our approach improves faithfulness scores by 20\% over strong baselines on AnswerSumm \citep{fabbri2021answersumm}, a conversation summarization dataset where lack of faithfulness is a significant issue due to the subjective nature of the task. Our best method achieves the highest faithfulness score while also achieving state-of-the-art results on standard metrics like ROUGE and METEOR. We corroborate these improvements through human evaluation.
LGDec 29, 2020Code
Towards Understanding the Behaviors of Optimal Deep Active Learning AlgorithmsYilun Zhou, Adithya Renduchintala, Xian Li et al.
Active learning (AL) algorithms may achieve better performance with fewer data because the model guides the data selection process. While many algorithms have been proposed, there is little study on what the optimal AL algorithm looks like, which would help researchers understand where their models fall short and iterate on the design. In this paper, we present a simulated annealing algorithm to search for this optimal oracle and analyze it for several tasks. We present qualitative and quantitative insights into the behaviors of this oracle, comparing and contrasting them with those of various heuristics. Moreover, we are able to consistently improve the heuristics using one particular insight. We hope that our findings can better inform future active learning research. The code is available at https://github.com/YilunZhou/optimal-active-learning.
CLDec 31, 2020
FiD-Ex: Improving Sequence-to-Sequence Models for Extractive Rationale GenerationKushal Lakhotia, Bhargavi Paranjape, Asish Ghoshal et al.
Natural language (NL) explanations of model predictions are gaining popularity as a means to understand and verify decisions made by large black-box pre-trained models, for NLP tasks such as Question Answering (QA) and Fact Verification. Recently, pre-trained sequence to sequence (seq2seq) models have proven to be very effective in jointly making predictions, as well as generating NL explanations. However, these models have many shortcomings; they can fabricate explanations even for incorrect predictions, they are difficult to adapt to long input documents, and their training requires a large amount of labeled data. In this paper, we develop FiD-Ex, which addresses these shortcomings for seq2seq models by: 1) introducing sentence markers to eliminate explanation fabrication by encouraging extractive generation, 2) using the fusion-in-decoder architecture to handle long input contexts, and 3) intermediate fine-tuning on re-structured open domain QA datasets to improve few-shot performance. FiD-Ex significantly improves over prior work in terms of explanation metrics and task accuracy, on multiple tasks from the ERASER explainability benchmark, both in the fully supervised and in the few-shot settings.
CLOct 7, 2020
Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic ParsingXilun Chen, Asish Ghoshal, Yashar Mehdad et al.
Task-oriented semantic parsing is a critical component of virtual assistants, which is responsible for understanding the user's intents (set reminder, play music, etc.). Recent advances in deep learning have enabled several approaches to successfully parse more complex queries (Gupta et al., 2018; Rongali et al.,2020), but these models require a large amount of annotated training data to parse queries on new domains (e.g. reminder, music). In this paper, we focus on adapting task-oriented semantic parsers to low-resource domains, and propose a novel method that outperforms a supervised neural model at a 10-fold data reduction. In particular, we identify two fundamental factors for low-resource domain adaptation: better representation learning and better training techniques. Our representation learning uses BART (Lewis et al., 2019) to initialize our model which outperforms encoder-only pre-trained representations used in previous work. Furthermore, we train with optimization-based meta-learning (Finn et al., 2017) to improve generalization to low-resource domains. This approach significantly outperforms all baseline methods in the experiments on a newly collected multi-domain task-oriented semantic parsing dataset (TOPv2), which we release to the public.
MEJun 28, 2019
Direct Learning with Guarantees of the Difference DAG Between Structural Equation ModelsAsish Ghoshal, Kevin Bello, Jean Honorio
Discovering cause-effect relationships between variables from observational data is a fundamental challenge in many scientific disciplines. However, in many situations it is desirable to directly estimate the change in causal relationships across two different conditions, e.g., estimating the change in genetic expression across healthy and diseased subjects can help isolate genetic factors behind the disease. This paper focuses on the problem of directly estimating the structural difference between two structural equation models (SEMs), having the same topological ordering, given two sets of samples drawn from the individual SEMs. We present an principled algorithm that can recover the difference SEM in $\mathcal{O}(d^2 \log p)$ samples, where $d$ is related to the number of edges in the difference SEM of $p$ nodes. We also study the fundamental limits and show that any method requires at least $Ω(d' \log \frac{p}{d'})$ samples to learn difference SEMs with at most $d'$ parents per node. Finally, we validate our theoretical results with synthetic experiments and show that our method outperforms the state-of-the-art. Moreover, we show the usefulness of our method by using data from the medical domain.
LGJun 2, 2019
Minimax bounds for structured predictionKevin Bello, Asish Ghoshal, Jean Honorio
Structured prediction can be considered as a generalization of many standard supervised learning tasks, and is usually thought as a simultaneous prediction of multiple labels. One standard approach is to maximize a score function on the space of labels, which decomposes as a sum of unary and pairwise potentials, each depending on one or two specific labels, respectively. For this approach, several learning and inference algorithms have been proposed over the years, ranging from exact to approximate methods while balancing the computational complexity. However, in contrast to binary and multiclass classification, results on the necessary number of samples for achieving learning is still limited, even for a specific family of predictors such as factor graphs. In this work, we provide minimax bounds for a class of factor-graph inference models for structured prediction. That is, we characterize the necessary sample complexity for any conceivable algorithm to achieve learning of factor-graph predictors.
LGMay 21, 2018
Learning Maximum-A-Posteriori Perturbation Models for Structured Prediction in Polynomial TimeAsish Ghoshal, Jean Honorio
MAP perturbation models have emerged as a powerful framework for inference in structured prediction. Such models provide a way to efficiently sample from the Gibbs distribution and facilitate predictions that are robust to random noise. In this paper, we propose a provably polynomial time randomized algorithm for learning the parameters of perturbed MAP predictors. Our approach is based on minimizing a novel Rademacher-based generalization bound on the expected loss of a perturbed MAP predictor, which can be computed in polynomial time. We obtain conditions under which our randomized learning algorithm can guarantee generalization to unseen examples.
LGJul 15, 2017
Learning linear structural equation models in polynomial time and sample complexityAsish Ghoshal, Jean Honorio
The problem of learning structural equation models (SEMs) from data is a fundamental problem in causal inference. We develop a new algorithm --- which is computationally and statistically efficient and works in the high-dimensional regime --- for learning linear SEMs from purely observational data with arbitrary noise distribution. We consider three aspects of the problem: identifiability, computational efficiency, and statistical efficiency. We show that when data is generated from a linear SEM over $p$ nodes and maximum degree $d$, our algorithm recovers the directed acyclic graph (DAG) structure of the SEM under an identifiability condition that is more general than those considered in the literature, and without faithfulness assumptions. In the population setting, our algorithm recovers the DAG structure in $\mathcal{O}(p(d^2 + \log p))$ operations. In the finite sample setting, if the estimated precision matrix is sparse, our algorithm has a smoothed complexity of $\widetilde{\mathcal{O}}(p^3 + pd^7)$, while if the estimated precision matrix is dense, our algorithm has a smoothed complexity of $\widetilde{\mathcal{O}}(p^5)$. For sub-Gaussian noise, we show that our algorithm has a sample complexity of $\mathcal{O}(\frac{d^8}{\varepsilon^2} \log (\frac{p}{\sqrtδ}))$ to achieve $\varepsilon$ element-wise additive error with respect to the true autoregression matrix with probability at most $1 - δ$, while for noise with bounded $(4m)$-th moment, with $m$ being a positive integer, our algorithm has a sample complexity of $\mathcal{O}(\frac{d^8}{\varepsilon^2} (\frac{p^2}δ)^{1/m})$.
LGJun 18, 2017
Learning Sparse Polymatrix Games in Polynomial Time and Sample ComplexityAsish Ghoshal, Jean Honorio
We consider the problem of learning sparse polymatrix games from observations of strategic interactions. We show that a polynomial time method based on $\ell_{1,2}$-group regularized logistic regression recovers a game, whose Nash equilibria are the $ε$-Nash equilibria of the game from which the data was generated (true game), in $\mathcal{O}(m^4 d^4 \log (pd))$ samples of strategy profiles --- where $m$ is the maximum number of pure strategies of a player, $p$ is the number of players, and $d$ is the maximum degree of the game graph. Under slightly more stringent separability conditions on the payoff matrices of the true game, we show that our method learns a game with the exact same Nash equilibria as the true game. We also show that $Ω(d \log (pm))$ samples are necessary for any method to consistently recover a game, with the same Nash-equilibria as the true game, from observations of strategic interactions. We verify our theoretical results through simulation experiments.
LGMar 3, 2017
Learning Graphical Games from Behavioral Data: Sufficient and Necessary ConditionsAsish Ghoshal, Jean Honorio
In this paper we obtain sufficient and necessary conditions on the number of samples required for exact recovery of the pure-strategy Nash equilibria (PSNE) set of a graphical game from noisy observations of joint actions. We consider sparse linear influence games --- a parametric class of graphical games with linear payoffs, and represented by directed graphs of n nodes (players) and in-degree of at most k. We show that one can efficiently recover the PSNE set of a linear influence game with $O(k^2 \log n)$ samples, under very general observation models. On the other hand, we show that $Ω(k \log n)$ samples are necessary for any procedure to recover the PSNE set from observations of joint actions.
LGMar 3, 2017
Learning Identifiable Gaussian Bayesian Networks in Polynomial Time and Sample ComplexityAsish Ghoshal, Jean Honorio
Learning the directed acyclic graph (DAG) structure of a Bayesian network from observational data is a notoriously difficult problem for which many hardness results are known. In this paper we propose a provably polynomial-time algorithm for learning sparse Gaussian Bayesian networks with equal noise variance --- a class of Bayesian networks for which the DAG structure can be uniquely identified from observational data --- under high-dimensional settings. We show that $O(k^4 \log p)$ number of samples suffices for our method to recover the true DAG structure with high probability, where $p$ is the number of variables and $k$ is the maximum Markov blanket size. We obtain our theoretical guarantees under a condition called Restricted Strong Adjacency Faithfulness, which is strictly weaker than strong faithfulness --- a condition that other methods based on conditional independence testing need for their success. The sample complexity of our method matches the information-theoretic limits in terms of the dependence on $p$. We show that our method out-performs existing state-of-the-art methods for learning Gaussian Bayesian networks in terms of recovering the true DAG structure while being comparable in speed to heuristic methods.
GTJul 11, 2016
From Behavior to Sparse Graphical Games: Efficient Recovery of EquilibriaAsish Ghoshal, Jean Honorio
In this paper we study the problem of exact recovery of the pure-strategy Nash equilibria (PSNE) set of a graphical game from noisy observations of joint actions of the players alone. We consider sparse linear influence games --- a parametric class of graphical games with linear payoffs, and represented by directed graphs of n nodes (players) and in-degree of at most k. We present an $\ell_1$-regularized logistic regression based algorithm for recovering the PSNE set exactly, that is both computationally efficient --- i.e. runs in polynomial time --- and statistically efficient --- i.e. has logarithmic sample complexity. Specifically, we show that the sufficient number of samples required for exact PSNE recovery scales as $\mathcal{O}(\mathrm{poly}(k) \log n)$. We also validate our theoretical results using synthetic experiments.
LGJan 27, 2016
Information-theoretic limits of Bayesian network structure learningAsish Ghoshal, Jean Honorio
In this paper, we study the information-theoretic limits of learning the structure of Bayesian networks (BNs), on discrete as well as continuous random variables, from a finite number of samples. We show that the minimum number of samples required by any procedure to recover the correct structure grows as $Ω(m)$ and $Ω(k \log m + (k^2/m))$ for non-sparse and sparse BNs respectively, where $m$ is the number of variables and $k$ is the maximum number of parents per node. We provide a simple recipe, based on an extension of the Fano's inequality, to obtain information-theoretic limits of structure recovery for any exponential family BN. We instantiate our result for specific conditional distributions in the exponential family to characterize the fundamental limits of learning various commonly used BNs, such as conditional probability table based networks, gaussian BNs, noisy-OR networks, and logistic regression networks. En route to obtaining our main results, we obtain tight bounds on the number of sparse and non-sparse essential-DAGs. Finally, as a byproduct, we recover the information-theoretic limits of sparse variable selection for logistic regression.