Vishwajeet Kumar

CL
h-index43
24papers
6,395citations
Novelty43%
AI Score55

24 Papers

CLJan 23, 2023Code
PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

Avirup Sil, Jaydeep Sen, Bhavani Iyer et al. · ibm-research

The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA re-search and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation.It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on pub-lic benchmarks, and expanding pre-existing methods. PRIMEQA is available at : https://github.com/primeqa.

88.3IRMay 13Code
Granite Embedding Multilingual R2 Models

Parul Awasthy, Aashka Trivedi, Yushu Yang et al.

We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size, and a 97M-parameter compact model built via model pruning and vocabulary selection that achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight, and released under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, designed to support responsible use and enable unrestricted research and enterprise adoption.

LGJul 18, 2024
INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages

Abhishek Kumar Singh, Vishwajeet kumar, Rudra Murthy et al.

Large Language Models (LLMs) perform well on unseen tasks in English, but their abilities in non English languages are less explored due to limited benchmarks and training data. To bridge this gap, we introduce the Indic QA Benchmark, a large dataset for context grounded question answering in 11 major Indian languages, covering both extractive and abstractive tasks. Evaluations of multilingual LLMs, including instruction finetuned versions, revealed weak performance in low resource languages due to a strong English language bias in their training data. We also investigated the Translate Test paradigm,where inputs are translated to English for processing and the results are translated back into the source language for output. This approach outperformed multilingual LLMs, particularly in low resource settings. By releasing Indic QA, we aim to promote further research into LLMs question answering capabilities in low resource languages. This benchmark offers a critical resource to address existing limitations and foster multilingual understanding.

IRFeb 27, 2025Code
Granite Embedding Models

Parul Awasthy, Aashka Trivedi, Yulong Li et al. · ibm-research

We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense-retrieval and sparse retrieval architectures, with both English and Multilingual capabilities. This report provides the technical details of training these highly effective 12 layer embedding models, along with their efficient 6 layer distilled counterparts. Extensive evaluations show that the models, developed with techniques like retrieval oriented pretraining, contrastive finetuning, knowledge distillation, and model merging significantly outperform publicly available models of similar sizes on both internal IBM retrieval and search tasks, and have equivalent performance on widely used information retrieval benchmarks, while being trained on high-quality data suitable for enterprise use. We publicly release all our Granite Embedding models under the Apache 2.0 license, allowing both research and commercial use at https://huggingface.co/collections/ibm-granite.

IRAug 20, 2024
Mistral-SPLADE: LLMs for better Learned Sparse Retrieval

Meet Doshi, Vishwajeet Kumar, Rudra Murthy et al.

Learned Sparse Retrievers (LSR) have evolved into an effective retrieval strategy that can bridge the gap between traditional keyword-based sparse retrievers and embedding-based dense retrievers. At its core, learned sparse retrievers try to learn the most important semantic keyword expansions from a query and/or document which can facilitate better retrieval with overlapping keyword expansions. LSR like SPLADE has typically been using encoder only models with MLM (masked language modeling) style objective in conjunction with known ways of retrieval performance improvement such as hard negative mining, distillation, etc. In this work, we propose to use decoder-only model for learning semantic keyword expansion. We posit, decoder only models that have seen much higher magnitudes of data are better equipped to learn keyword expansions needed for improved retrieval. We use Mistral as the backbone to develop our Learned Sparse Retriever similar to SPLADE and train it on a subset of sentence-transformer data which is often used for training text embedding models. Our experiments support the hypothesis that a sparse retrieval model based on decoder only large language model (LLM) surpasses the performance of existing LSR systems, including SPLADE and all its variants. The LLM based model (Echo-Mistral-SPLADE) now stands as a state-of-the-art learned sparse retrieval model on the BEIR text retrieval benchmark.

CLAug 26, 2025Code
Granite Embedding R2 Models

Parul Awasthy, Aashka Trivedi, Yulong Li et al. · ibm-research

We introduce the Granite Embedding R2 models, a comprehensive family of high-performance English encoder-based embedding models engineered for enterprise-scale dense retrieval applications. Building upon our first-generation release, these models deliver substantial improvements, including 16x expanded context length (8,192 tokens), state-of-the-art performance across diverse retrieval domains - text, code, long-document search, multi-turn conversational, and tabular data - and measurable speed advantages of 19-44\% over leading competitors while maintaining superior accuracy. Our release encompasses both bi-encoder and cross-encoder architectures, featuring a highly effective 22-layer retriever model and its efficient 12-layer counterpart, alongside a high-quality reranker model, all trained exclusively on enterprise-appropriate data with comprehensive governance oversight. The models demonstrate exceptional versatility across standard benchmarks, IBM-developed evaluation suites, and real-world enterprise use cases, establishing new performance standards for open-source embedding models. In an era where retrieval speed and accuracy are paramount for competitive advantage, the Granite R2 models deliver a compelling combination of cutting-edge performance, enterprise-ready licensing, and transparent data provenance that organizations require for mission-critical deployments. All models are publicly available under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, enabling unrestricted research and commercial use.

IRSep 9, 2024
Benchmarking and Building Zero-Shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5

Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar et al.

Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, comprehensive benchmarks for evaluating retrieval models in Hindi are lacking. To address this gap, we introduce the Hindi-BEIR benchmark, comprising 15 datasets across seven distinct tasks. We evaluate state-of-the-art multilingual retrieval models on the Hindi-BEIR benchmark, identifying task and domain-specific challenges that impact Hindi retrieval performance. Building on the insights from these results, we introduce NLLB-E5, a multilingual retrieval model that leverages a zero-shot approach to support Hindi without the need for Hindi training data. We believe our contributions, which include the release of the Hindi-BEIR benchmark and the NLLB-E5 model, will prove to be a valuable resource for researchers and promote advancements in multilingual retrieval models.

IRAug 18, 2024
Hindi-BEIR : A Large Scale Retrieval Benchmark in Hindi

Arkadeep Acharya, Rudra Murthy, Vishwajeet Kumar et al.

Given the large number of Hindi speakers worldwide, there is a pressing need for robust and efficient information retrieval systems for Hindi. Despite ongoing research, there is a lack of comprehensive benchmark for evaluating retrieval models in Hindi. To address this gap, we introduce the Hindi version of the BEIR benchmark, which includes a subset of English BEIR datasets translated to Hindi, existing Hindi retrieval datasets, and synthetically created datasets for retrieval. The benchmark is comprised of $15$ datasets spanning across $8$ distinct tasks. We evaluate state-of-the-art multilingual retrieval models on this benchmark to identify task and domain-specific challenges and their impact on retrieval performance. By releasing this benchmark and a set of relevant baselines, we enable researchers to understand the limitations and capabilities of current Hindi retrieval models, promoting advancements in this critical area. The datasets from Hindi-BEIR are publicly available.

CLJan 29
LMK > CLS: Landmark Pooling for Dense Embeddings

Meet Doshi, Aashka Trivedi, Vishwajeet Kumar et al.

Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence to a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.

IRJan 29
Influence Guided Sampling for Domain Adaptation of Text Retrievers

Meet Doshi, Vishwajeet Kumar, Yulong Li et al.

General-purpose open-domain dense retrieval systems are usually trained with a large, eclectic mix of corpora and search tasks. How should these diverse corpora and tasks be sampled for training? Conventional approaches sample them uniformly, proportional to their instance population sizes, or depend on human-level expert supervision. It is well known that the training data sampling strategy can greatly impact model performance. However, how to find the optimal strategy has not been adequately studied in the context of embedding models. We propose Inf-DDS, a novel reinforcement learning driven sampling framework that adaptively reweighs training datasets guided by influence-based reward signals and is much more lightweight with respect to GPU consumption. Our technique iteratively refines the sampling policy, prioritizing datasets that maximize model performance on a target development set. We evaluate the efficacy of our sampling strategy on a wide range of text retrieval tasks, demonstrating strong improvements in retrieval performance and better adaptation compared to existing gradient-based sampling methods, while also being 1.5x to 4x cheaper in GPU compute. Our sampling strategy achieves a 5.03 absolute NDCG@10 improvement while training a multilingual bge-m3 model and an absolute NDCG@10 improvement of 0.94 while training all-MiniLM-L6-v2, even when starting from expert-assigned weights on a large pool of training datasets.

CLNov 4, 2024
MILU: A Multi-task Indic Language Understanding Benchmark

Sshubam Verma, Mohammed Safi Ur Rahman Khan, Vishwajeet Kumar et al.

Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts like those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. We introduce MILU, a Multi task Indic Language Understanding Benchmark, a comprehensive evaluation benchmark designed to address this gap. MILU spans 8 domains and 41 subjects across 11 Indic languages, reflecting both general and culturally specific knowledge. With an India-centric design, incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. We evaluate over 42 LLMs, and find that current LLMs struggle with MILU, with GPT-4o achieving the highest average accuracy at 74 percent. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines. Models also perform better in high resource languages as compared to low resource ones. Domain-wise analysis indicates that models perform poorly in culturally relevant areas like Arts and Humanities, Law and Governance compared to general fields like STEM. To the best of our knowledge, MILU is the first of its kind benchmark focused on Indic languages, serving as a crucial step towards comprehensive cultural evaluation. All code, benchmarks, and artifacts are publicly available to foster open research.

CLDec 14, 2021
Multi-Row, Multi-Span Distant Supervision For Table+Text Question

Vishwajeet Kumar, Yash Gupta, Saneem Chemmengath et al.

Question answering (QA) over tables and linked text, also called TextTableQA, has witnessed significant research in recent years, as tables are often found embedded in documents along with related text. HybridQA and OTT-QA are the two best-known TextTableQA datasets, with questions that are best answered by combining information from both table cells and linked text passages. A common challenge in both datasets, and TextTableQA in general, is that the training instances include just the question and answer, where the gold answer may match not only multiple table cells across table rows but also multiple text spans within the scope of a table row and its associated text. This leads to a noisy multi instance training regime. We present MITQA, a transformer-based TextTableQA system that is explicitly designed to cope with distant supervision along both these axes, through a multi-instance loss objective, together with careful curriculum design. Our experiments show that the proposed multi-instance distant supervision approach helps MITQA get state-of-the-art results beating the existing baselines for both HybridQA and OTT-QA, putting MITQA at the top of HybridQA leaderboard with best EM and F1 scores on a held out test set.

CLSep 15, 2021
Topic Transferable Table Question Answering

Saneem Ahmed Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj et al.

Weakly-supervised table question-answering(TableQA) models have achieved state-of-art performance by using pre-trained BERT transformer to jointly encoding a question and a table to produce structured query for the question. However, in practical settings TableQA systems are deployed over table corpora having topic and word distributions quite distinct from BERT's pretraining corpus. In this work we simulate the practical topic shift scenario by designing novel challenge benchmarks WikiSQL-TS and WikiTQ-TS, consisting of train-dev-test splits in five distinct topic groups, based on the popular WikiSQL and WikiTableQuestions datasets. We empirically show that, despite pre-training on large open-domain text, performance of models degrades significantly when they are evaluated on unseen topics. In response, we propose T3QA (Topic Transferable Table Question Answering) a pragmatic adaptation framework for TableQA comprising of: (1) topic-specific vocabulary injection into BERT, (2) a novel text-to-text transformer generator (such as T5, GPT2) based natural language question generation pipeline focused on generating topic specific training data, and (3) a logical form reranker. We show that T3QA provides a reasonably good baseline for our topic shift benchmarks. We believe our topic split benchmarks will lead to robust TableQA solutions that are better suited for practical deployment.

CLJun 24, 2021
AIT-QA: Question Answering Dataset over Complex Tables in the Airline Industry

Yannis Katsis, Saneem Chemmengath, Vishwajeet Kumar et al.

Recent advances in transformers have enabled Table Question Answering (Table QA) systems to achieve high accuracy and SOTA results on open domain datasets like WikiTableQuestions and WikiSQL. Such transformers are frequently pre-trained on open-domain content such as Wikipedia, where they effectively encode questions and corresponding tables from Wikipedia as seen in Table QA dataset. However, web tables in Wikipedia are notably flat in their layout, with the first row as the sole column header. The layout lends to a relational view of tables where each row is a tuple. Whereas, tables in domain-specific business or scientific documents often have a much more complex layout, including hierarchical row and column headers, in addition to having specialized vocabulary terms from that domain. To address this problem, we introduce the domain-specific Table QA dataset AIT-QA (Airline Industry Table QA). The dataset consists of 515 questions authored by human annotators on 116 tables extracted from public U.S. SEC filings (publicly available at: https://www.sec.gov/edgar.shtml) of major airline companies for the fiscal years 2017-2019. We also provide annotations pertaining to the nature of questions, marking those that require hierarchical headers, domain-specific terminology, and paraphrased forms. Our zero-shot baseline evaluation of three transformer-based SOTA Table QA methods - TaPAS (end-to-end), TaBERT (semantic parsing-based), and RCI (row-column encoding-based) - clearly exposes the limitation of these methods in this practical setting, with the best accuracy at just 51.8\% (RCI). We also present pragmatic table preprocessing steps used to pivot and project these complex tables into a layout suitable for the SOTA Table QA models.

AIApr 16, 2021
Capturing Row and Column Semantics in Transformer Based Question Answering over Tables

Michael Glass, Mustafa Canim, Alfio Gliozzo et al.

Transformer based architectures are recently used for the task of answering questions over tables. In order to improve the accuracy on this task, specialized pre-training techniques have been developed and applied on millions of open-domain web tables. In this paper, we propose two novel approaches demonstrating that one can achieve superior performance on table QA task without even using any of these specialized pre-training techniques. The first model, called RCI interaction, leverages a transformer based architecture that independently classifies rows and columns to identify relevant cells. While this model yields extremely high accuracy at finding cell values on recent benchmarks, a second model we propose, called RCI representation, provides a significant efficiency advantage for online QA systems over tables by materializing embeddings for existing tables. Experiments on recent benchmarks prove that the proposed methods can effectively locate cell values on tables (up to ~98% Hit@1 accuracy on WikiSQL lookup questions). Also, the interaction model outperforms the state-of-the-art transformer based approaches, pre-trained on very large table corpora (TAPAS and TaBERT), achieving ~3.4% and ~18.86% additional precision improvement on the standard WikiSQL benchmark.

CLApr 14, 2021
WARM: A Weakly (+Semi) Supervised Model for Solving Math word Problems

Oishik Chatterjee, Isha Pandey, Aashish Waikar et al.

Solving math word problems (MWPs) is an important and challenging problem in natural language processing. Existing approaches to solve MWPs require full supervision in the form of intermediate equations. However, labeling every MWP with its corresponding equations is a time-consuming and expensive task. In order to address this challenge of equation annotation, we propose a weakly supervised model for solving MWPs by requiring only the final answer as supervision. We approach this problem by first learning to generate the equation using the problem description and the final answer, which we subsequently use to train a supervised MWP solver. We propose and compare various weakly supervised techniques to learn to generate equations directly from the problem description and answer. Through extensive experiments, we demonstrate that without using equations for supervision, our approach achieves accuracy gains of 4.5% and 32% over the state-of-the-art weakly supervised approach, on the standard Math23K and AllArith datasets respectively. Additionally, we curate and release new datasets of roughly 10k MWPs each in English and in Hindi (a low resource language).These datasets are suitable for training weakly supervised models. We also present an extension of WARMM to semi-supervised learning and present further improvements on results, along with insights.

CVMar 9, 2021
Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering

Aman Jain, Mayank Kothyari, Vishwajeet Kumar et al.

Multimodal IR, spanning text corpus, knowledge graph and images, called outside knowledge visual question answering (OKVQA), is of much recent interest. However, the popular data set has serious limitations. A surprisingly large fraction of queries do not assess the ability to integrate cross-modal information. Instead, some are independent of the image, some depend on speculation, some require OCR or are otherwise answerable from the image alone. To add to the above limitations, frequency-based guessing is very effective because of (unintended) widespread answer overlaps between the train and test folds. Overall, it is hard to determine when state-of-the-art systems exploit these weaknesses rather than really infer the answers, because they are opaque and their 'reasoning' process is uninterpretable. An equally important limitation is that the dataset is designed for the quantitative assessment only of the end-to-end answer retrieval task, with no provision for assessing the correct(semantic) interpretation of the input query. In response, we identify a key structural idiom in OKVQA ,viz., S3 (select, substitute and search), and build a new data set and challenge around it. Specifically, the questioner identifies an entity in the image and asks a question involving that entity which can be answered only by consulting a knowledge graph or corpus passage mentioning the entity. Our challenge consists of (i)OKVQAS3, a subset of OKVQA annotated based on the structural idiom and (ii)S3VQA, a new dataset built from scratch. We also present a neural but structurally transparent OKVQA system, S3, that explicitly addresses our challenge dataset, and outperforms recent competitive baselines.

CLJan 25, 2021
Meta-Learning for Effective Multi-task and Multilingual Modelling

Ishan Tarunesh, Sushil Khyalia, Vishwajeet Kumar et al.

Natural language processing (NLP) tasks (e.g. question-answering in English) benefit from knowledge of other tasks (e.g. named entity recognition in English) and knowledge of other languages (e.g. question-answering in Spanish). Such shared representations are typically learned in isolation, either across tasks or across languages. In this work, we propose a meta-learning approach to learn the interactions between both tasks and languages. We also investigate the role of different sampling strategies used during meta-learning. We present experiments on five different tasks and six different languages from the XTREME multilingual benchmark dataset. Our meta-learned model clearly improves in performance compared to competitive baseline models that also include multi-task baselines. We also present zero-shot evaluations on unseen target languages to demonstrate the utility of our proposed model.

CLNov 8, 2019
Question Generation from Paragraphs: A Tale of Two Hierarchical Models

Vishwajeet Kumar, Raktim Chaki, Sai Teja Talluri et al.

Automatic question generation from paragraphs is an important and challenging problem, particularly due to the long context from paragraphs. In this paper, we propose and study two hierarchical models for the task of question generation from paragraphs. Specifically, we propose (a) a novel hierarchical BiLSTM model with selective attention and (b) a novel hierarchical Transformer architecture, both of which learn hierarchical representations of paragraphs. We model a paragraph in terms of its constituent sentences, and a sentence in terms of its constituent words. While the introduction of the attention mechanism benefits the hierarchical BiLSTM model, the hierarchical Transformer, with its inherent attention and positional encoding mechanisms also performs better than flat transformer model. We conducted empirical evaluation on the widely used SQuAD and MS MARCO datasets using standard metrics. The results demonstrate the overall effectiveness of the hierarchical models over their flat counterparts. Qualitatively, our hierarchical models are able to generate fluent and relevant questions

CLSep 4, 2019
ParaQG: A System for Generating Questions and Answers from Paragraphs

Vishwajeet Kumar, Sivaanandh Muneeswaran, Ganesh Ramakrishnan et al.

Generating syntactically and semantically valid and relevant questions from paragraphs is useful with many applications. Manual generation is a labour-intensive task, as it requires the reading, parsing and understanding of long passages of text. A number of question generation models based on sequence-to-sequence techniques have recently been proposed. Most of them generate questions from sentences only, and none of them is publicly available as an easy-to-use service. In this paper, we demonstrate ParaQG, a Web-based system for generating questions from sentences and paragraphs. ParaQG incorporates a number of novel functionalities to make the question generation process user-friendly. It provides an interactive interface for a user to select answers with visual insights on generation of questions. It also employs various faceted views to group similar questions as well as filtering techniques to eliminate unanswerable questions

CLJun 6, 2019
Cross-Lingual Training for Automatic Question Generation

Vishwajeet Kumar, Nitish Joshi, Arijit Mukherjee et al.

Automatic question generation (QG) is a challenging problem in natural language understanding. QG systems are typically built assuming access to a large number of training instances where each instance is a question and its corresponding answer. For a new language, such training instances are hard to obtain making the QG problem even more challenging. Using this as our motivation, we study the reuse of an available large QG dataset in a secondary language (e.g. English) to learn a QG model for a primary language (e.g. Hindi) of interest. For the primary language, we assume access to a large amount of monolingual text but only a small QG dataset. We propose a cross-lingual QG model which uses the following training regime: (i) Unsupervised pretraining of language models in both primary and secondary languages and (ii) joint supervised training for QG in both languages. We demonstrate the efficacy of our proposed approach using two different primary languages, Hindi and Chinese. We also create and release a new question answering dataset for Hindi consisting of 6555 sentences.

CLAug 15, 2018
Putting the Horse Before the Cart:A Generator-Evaluator Framework for Question Generation from Text

Vishwajeet Kumar, Ganesh Ramakrishnan, Yuan-Fang Li

Automatic question generation (QG) is a useful yet challenging task in NLP. Recent neural network-based approaches represent the state-of-the-art in this task. In this work, we attempt to strengthen them significantly by adopting a holistic and novel generator-evaluator framework that directly optimizes objectives that reward semantics and structure. The {\it generator} is a sequence-to-sequence model that incorporates the {\it structure} and {\it semantics} of the question being generated. The generator predicts an answer in the passage that the question can pivot on. Employing the copy and coverage mechanisms, it also acknowledges other contextually important (and possibly rare) keywords in the passage that the question needs to conform to, while not redundantly repeating words. The {\it evaluator} model evaluates and assigns a reward to each predicted question based on its conformity to the {\it structure} of ground-truth questions. We propose two novel QG-specific reward functions for text conformity and answer conformity of the generated question. The evaluator also employs structure-sensitive rewards based on evaluation measures such as BLEU, GLEU, and ROUGE-L, which are suitable for QG. In contrast, most of the previous works only optimize the cross-entropy loss, which can induce inconsistencies between training (objective) and testing (evaluation) measures. Our evaluation shows that our approach significantly outperforms state-of-the-art systems on the widely-used SQuAD benchmark as per both automatic and human evaluation.

CLMar 7, 2018
Automating Reading Comprehension by Generating Question and Answer Pairs

Vishwajeet Kumar, Kireeti Boorla, Yogesh Meena et al.

Neural network-based methods represent the state-of-the-art in question generation from text. Existing work focuses on generating only questions from text without concerning itself with answer generation. Moreover, our analysis shows that handling rare words and generating the most appropriate question given a candidate answer are still challenges facing existing approaches. We present a novel two-stage process to generate question-answer pairs from the text. For the first stage, we present alternatives for encoding the span of the pivotal answer in the sentence using Pointer Networks. In our second stage, we employ sequence to sequence models for question generation, enhanced with rich linguistic features. Finally, global attention and answer encoding are used for generating the question most relevant to the answer. We motivate and linguistically analyze the role of each component in our framework and consider compositions of these. This analysis is supported by extensive experimental evaluations. Using standard evaluation metrics as well as human evaluations, our experimental results validate the significant improvement in the quality of questions generated by our framework over the state-of-the-art. The technique presented here represents another step towards more automated reading comprehension assessment. We also present a live system \footnote{Demo of the system is available at \url{https://www.cse.iitb.ac.in/~vishwajeet/autoqg.html}.} to demonstrate the effectiveness of our approach.

CLOct 14, 2016
Civique: Using Social Media to Detect Urban Emergencies

Diptesh Kanojia, Vishwajeet Kumar, Krithi Ramamritham

We present the Civique system for emergency detection in urban areas by monitoring micro blogs like Tweets. The system detects emergency related events, and classifies them into appropriate categories like "fire", "accident", "earthquake", etc. We demonstrate our ideas by classifying Twitter posts in real time, visualizing the ongoing event on a map interface and alerting users with options to contact relevant authorities, both online and offline. We evaluate our classifiers for both the steps, i.e., emergency detection and categorization, and obtain F-scores exceeding 70% and 90%, respectively. We demonstrate Civique using a web interface and on an Android application, in realtime, and show its use for both tweet detection and visualization.