Vasudeva Varma

CL
h-index11
47papers
5,490citations
Novelty44%
AI Score45

47 Papers

CLMar 22, 2023
XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages

Dhaval Taunk, Shivprasad Sagare, Anupam Patil et al. · microsoft-research

Lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused on English only where English reference articles are summarized to generate English Wikipedia pages. But, for low-resource languages, the scarcity of reference articles makes monolingual summarization ineffective in solving this problem. Hence, in this work, we propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system where the input is a set of citations and a section title and the output is a section-specific LR summary. The proposed system is based on a novel idea of neural unsupervised extractive summarization to coarsely identify salient information followed by a neural abstractive model to generate the section-specific text. Extensive experiments show that multi-domain training is better than the multi-lingual setup on average.

CLDec 31, 2022Code
Towards Proactively Forecasting Sentence-Specific Information Popularity within Online News Documents

Sayar Ghosh Roy, Anshul Padhi, Risubh Jain et al.

Multiple studies have focused on predicting the prospective popularity of an online document as a whole, without paying attention to the contributions of its individual parts. We introduce the task of proactively forecasting popularities of sentences within online news documents solely utilizing their natural language content. We model sentence-specific popularity forecasting as a sequence regression task. For training our models, we curate InfoPop, the first dataset containing popularity labels for over 1.7 million sentences from over 50,000 online news documents. To the best of our knowledge, this is the first dataset automatically created using streams of incoming search engine queries to generate sentence-level popularity annotations. We propose a novel transfer learning approach involving sentence salience prediction as an auxiliary task. Our proposed technique coupled with a BERT-based neural model exceeds nDCG values of 0.8 for proactive sentence-specific popularity forecasting. Notably, our study presents a non-trivial takeaway: though popularity and salience are different concepts, transfer learning from salience prediction enhances popularity forecasting. We release InfoPop and make our code publicly available: https://github.com/sayarghoshroy/InfoPopularity

CLMar 22, 2023
GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering

Dhaval Taunk, Lakshya Khanna, Pavan Kandru et al.

Commonsense question-answering (QA) methods combine the power of pre-trained Language Models (LM) with the reasoning provided by Knowledge Graphs (KG). A typical approach collects nodes relevant to the QA pair from a KG to form a Working Graph (WG) followed by reasoning using Graph Neural Networks(GNNs). This faces two major challenges: (i) it is difficult to capture all the information from the QA in the WG, and (ii) the WG contains some irrelevant nodes from the KG. To address these, we propose GrapeQA with two simple improvements on the WG: (i) Prominent Entities for Graph Augmentation identifies relevant text chunks from the QA pair and augments the WG with corresponding latent representations from the LM, and (ii) Context-Aware Node Pruning removes nodes that are less relevant to the QA pair. We evaluate our results on OpenBookQA, CommonsenseQA and MedQA-USMLE and see that GrapeQA shows consistent improvements over its LM + KG predecessor (QA-GNN in particular) and large improvements on OpenBookQA.

CLOct 21, 2022
Gui at MixMT 2022 : English-Hinglish: An MT approach for translation of code mixed data

Akshat Gahoi, Jayant Duneja, Anshul Padhi et al.

Code-mixed machine translation has become an important task in multilingual communities and extending the task of machine translation to code mixed data has become a common task for these languages. In the shared tasks of WMT 2022, we try to tackle the same for both English + Hindi to Hinglish and Hinglish to English. The first task dealt with both Roman and Devanagari script as we had monolingual data in both English and Hindi whereas the second task only had data in Roman script. To our knowledge, we achieved one of the top ROUGE-L and WER scores for the first task of Monolingual to Code-Mixed machine translation. In this paper, we discuss the use of mBART with some special pre-processing and post-processing (transliteration from Devanagari to Roman) for the first task in detail and the experiments that we performed for the second task of translating code-mixed Hinglish to monolingual English.

CLJun 15, 2023
Neural models for Factual Inconsistency Classification with Explanations

Tathagata Raha, Mukund Choudhary, Abhinav Menon et al.

Factual consistency is one of the most important requirements when editing high quality documents. It is extremely important for automatic text generation systems like summarization, question answering, dialog modeling, and language modeling. Still, automated factual inconsistency detection is rather under-studied. Existing work has focused on (a) finding fake news keeping a knowledge base in context, or (b) detecting broad contradiction (as part of natural language inference literature). However, there has been no work on detecting and explaining types of factual inconsistencies in text, without any knowledge base in context. In this paper, we leverage existing work in linguistics to formally define five types of factual inconsistencies. Based on this categorization, we contribute a novel dataset, FICLE (Factual Inconsistency CLassification with Explanation), with ~8K samples where each sample consists of two sentences (claim and context) annotated with type and span of inconsistency. When the inconsistency relates to an entity type, it is labeled as well at two levels (coarse and fine-grained). Further, we leverage this dataset to train a pipeline of four neural models to predict inconsistency type with explanations, given a (claim, context) sentence pair. Explanations include inconsistent claim fact triple, inconsistent context span, inconsistent claim component, coarse and fine-grained inconsistent entity types. The proposed system first predicts inconsistent spans from claim and context; and then uses them to predict inconsistency types and inconsistent entity types (when inconsistency is due to entities). We experiment with multiple Transformer-based natural language classification as well as generative models, and find that DeBERTa performs the best. Our proposed methods provide a weighted F1 of ~87% for inconsistency type classification across the five classes.

CLSep 22, 2022
XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages

Shivprasad Sagare, Tushar Abhishek, Bhavyajeet Singh et al.

Multiple business scenarios require an automated generation of descriptive human-readable text from structured input data. Hence, fact-to-text generation systems have been developed for various downstream tasks like generating soccer reports, weather and financial reports, medical reports, person biographies, etc. Unfortunately, previous work on fact-to-text (F2T) generation has focused primarily on English mainly due to the high availability of relevant datasets. Only recently, the problem of cross-lingual fact-to-text (XF2T) was proposed for generation across multiple languages alongwith a dataset, XALIGN for eight languages. However, there has been no rigorous work on the actual XF2T generation problem. We extend XALIGN dataset with annotated data for four more languages: Punjabi, Malayalam, Assamese and Oriya. We conduct an extensive study using popular Transformer-based text generation models on our extended multi-lingual dataset, which we call XALIGNV2. Further, we investigate the performance of different text generation strategies: multiple variations of pretraining, fact-aware embeddings and structure-aware input encoding. Our extensive experiments show that a multi-lingual mT5 model which uses fact-aware embeddings with structure-aware input encoding leads to best results on average across the twelve languages. We make our code, dataset and model publicly available, and hope that this will help advance further research in this critical area.

CLMar 29, 2023
Summarizing Indian Languages using Multilingual Transformers based Models

Dhaval Taunk, Vasudeva Varma

With the advent of multilingual models like mBART, mT5, IndicBART etc., summarization in low resource Indian languages is getting a lot of attention now a days. But still the number of datasets is low in number. In this work, we (Team HakunaMatata) study how these multilingual models perform on the datasets which have Indian languages as source and target text while performing summarization. We experimented with IndicBART and mT5 models to perform the experiments and report the ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-4 scores as a performance metric.

CLJan 7, 2023
Graph-based Keyword Planning for Legal Clause Generation from Topics

Sagar Joshi, Sumanth Balaji, Aparna Garimella et al.

Generating domain-specific content such as legal clauses based on minimal user-provided information can be of significant benefit in automating legal contract generation. In this paper, we propose a controllable graph-based mechanism that can generate legal clauses using only the topic or type of the legal clauses. Our pipeline consists of two stages involving a graph-based planner followed by a clause generator. The planner outlines the content of a legal clause as a sequence of keywords in the order of generic to more specific clause information based on the input topic using a controllable graph-based mechanism. The generation stage takes in a given plan and generates a clause. The pipeline consists of a graph-based planner followed by text generation. We illustrate the effectiveness of our proposed two-stage approach on a broad set of clause topics in contracts.

CLFeb 9, 2023
Massively Multilingual Language Models for Cross Lingual Fact Extraction from Low Resource Indian Languages

Bhavyajeet Singh, Pavan Kandru, Anubhav Sharma et al.

Massive knowledge graphs like Wikidata attempt to capture world knowledge about multiple entities. Recent approaches concentrate on automatically enriching these KGs from text. However a lot of information present in the form of natural text in low resource languages is often missed out. Cross Lingual Information Extraction aims at extracting factual information in the form of English triples from low resource Indian Language text. Despite its massive potential, progress made on this task is lagging when compared to Monolingual Information Extraction. In this paper, we propose the task of Cross Lingual Fact Extraction(CLFE) from text and devise an end-to-end generative approach for the same which achieves an overall F1 score of 77.46.

CLJan 21, 2023
Investigating Strategies for Clause Recommendation

Sagar Joshi, Sumanth Balaji, Jerrin Thomas et al.

Clause recommendation is the problem of recommending a clause to a legal contract, given the context of the contract in question and the clause type to which the clause should belong. With not much prior work being done toward the generation of legal contracts, this problem was proposed as a first step toward the bigger problem of contract generation. As an open-ended text generation problem, the distinguishing characteristics of this problem lie in the nature of legal language as a sublanguage and the considerable similarity of textual content within the clauses of a specific type. This similarity aspect in legal clauses drives us to investigate the importance of similar contracts' representation for recommending clauses. In our work, we experiment with generating clauses for 15 commonly occurring clause types in contracts expanding upon the previous work on this problem and analyzing clause recommendations in varying settings using information derived from similar contracts.

SEDec 1, 2025
Teaching an Online Multi-Institutional Research Level Software Engineering Course with Industry - an Experience Report

Pankaj Jalote, Y. Raghu Reddy, Vasudeva Varma

Covid has made online teaching and learning acceptable and students, faculty, and industry professionals are all comfortable with this mode. This comfort can be leveraged to offer an online multi-institutional research-level course in an area where individual institutions may not have the requisite faculty to teach and/or research students to enroll. If the subject is of interest to industry, online offering also allows industry experts to contribute and participate with ease. Advanced topics in Software Engineering are ideally suited for experimenting with this approach as industry, which is often looking to incorporate advances in software engineering in their practices, is likely to agree to contribute and participate. In this paper we describe an experiment in teaching a course titled "AI in Software Engineering" jointly between two institutions with active industry participation, and share our and student's experience. We believe this collaborative teaching approach can be used for offering research level courses in any applied area of computer science by institutions who are small and find it difficult to offer research level courses on their own.

SEMar 4, 2024
Can LLMs Generate Architectural Design Decisions? -An Exploratory Empirical study

Rudra Dhar, Karthik Vaidhyanathan, Vasudeva Varma

Architectural Knowledge Management (AKM) involves the organized handling of information related to architectural decisions and design within a project or organization. An essential artifact of AKM is the Architecture Decision Records (ADR), which documents key design decisions. ADRs are documents that capture decision context, decision made and various aspects related to a design decision, thereby promoting transparency, collaboration, and understanding. Despite their benefits, ADR adoption in software development has been slow due to challenges like time constraints and inconsistent uptake. Recent advancements in Large Language Models (LLMs) may help bridge this adoption gap by facilitating ADR generation. However, the effectiveness of LLM for ADR generation or understanding is something that has not been explored. To this end, in this work, we perform an exploratory study that aims to investigate the feasibility of using LLM for the generation of ADRs given the decision context. In our exploratory study, we utilize GPT and T5-based models with 0-shot, few-shot, and fine-tuning approaches to generate the Decision of an ADR given its Context. Our results indicate that in a 0-shot setting, state-of-the-art models such as GPT-4 generate relevant and accurate Design Decisions, although they fall short of human-level performance. Additionally, we observe that more cost-effective models like GPT-3.5 can achieve similar outcomes in a few-shot setting, and smaller models such as Flan-T5 can yield comparable results after fine-tuning. To conclude, this exploratory study suggests that LLM can generate Design Decisions, but further research is required to attain human-level generation and establish standardized widespread adoption.

CLDec 12, 2024
Towards Understanding the Robustness of LLM-based Evaluations under Perturbations

Manav Chaudhary, Harshit Gupta, Savita Bhat et al.

Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score as well as a justification for the score. Furthermore, we explore the robustness of the LLM evaluator by using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations and significant improvements are required for their standalone use as reliable evaluators for subjective metrics.

CLApr 10, 2024
MetaCheckGPT -- A Multi-task Hallucination Detector Using LLM Uncertainty and Meta-models

Rahul Mehta, Andrew Hoblitzell, Jack O'Keefe et al.

Hallucinations in large language models (LLMs) have recently become a significant problem. A recent effort in this direction is a shared task at Semeval 2024 Task 6, SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. This paper describes our winning solution ranked 1st and 2nd in the 2 sub-tasks of model agnostic and model aware tracks respectively. We propose a meta-regressor framework of LLMs for model evaluation and integration that achieves the highest scores on the leaderboard. We also experiment with various transformer-based models and black box methods like ChatGPT, Vectara, and others. In addition, we perform an error analysis comparing GPT4 against our best model which shows the limitations of the former.

CLMay 20, 2025
AutoRev: Multi-Modal Graph Retrieval for Automated Peer-Review Generation

Maitreya Prafulla Chitale, Ketaki Mangesh Shetye, Harshit Gupta et al.

Enhancing the quality and efficiency of academic publishing is critical for both authors and reviewers, as research papers are central to scholarly communication and a major source of high-quality content on the web. To support this goal, we propose AutoRev, an automatic peer-review system designed to provide actionable, high-quality feedback to both reviewers and authors. AutoRev leverages a novel Multi-Modal Retrieval-Augmented Generation (RAG) framework that combines textual and graphical representations of academic papers. By modelling documents as graphs, AutoRev effectively retrieves the most pertinent information, significantly reducing the input context length for LLMs and thereby enhancing their review generation capabilities. Experimental results show that AutoRev outperforms state-of-the-art baselines by up to 58.72% and demonstrates competitive performance in human evaluations against ground truth reviews. We envision AutoRev as a powerful tool to streamline the peer-review workflow, alleviating challenges and enabling scalable, high-quality scholarly publishing. By guiding both authors and reviewers, AutoRev has the potential to accelerate the dissemination of quality research on the web at a larger scale. Code will be released upon acceptance.

CLMay 18, 2024
BrainStorm @ iREL at #SMM4H 2024: Leveraging Translation and Topical Embeddings for Annotation Detection in Tweets

Manav Chaudhary, Harshit Gupta, Vasudeva Varma

The proliferation of LLMs in various NLP tasks has sparked debates regarding their reliability, particularly in annotation tasks where biases and hallucinations may arise. In this shared task, we address the challenge of distinguishing annotations made by LLMs from those made by human domain experts in the context of COVID-19 symptom detection from tweets in Latin American Spanish. This paper presents BrainStorm @ iRELs approach to the SMM4H 2024 Shared Task, leveraging the inherent topical information in tweets, we propose a novel approach to identify and classify annotations, aiming to enhance the trustworthiness of annotated data.

CLFeb 21
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models

Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma

Large Language Models (LLMs) demonstrate strong reasoning performance, yet their ability to reliably monitor, diagnose, and correct their own errors remains limited. We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight dual-process MetaController for adaptive effort allocation. Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold increase in successful self-correction. Blinded human evaluations over 580 query pairs show an 84% aggregate preference for trustworthiness and metacognitive self-awareness over standard and Chain-of-Thought baselines. Grounding LLM reasoning in established cognitive theory offers a principled path toward more transparent and diagnostically robust AI systems.

SEApr 11, 2025
DRAFT-ing Architectural Design Decisions using LLMs

Rudra Dhar, Adyansh Kakran, Amey Karan et al.

Architectural Knowledge Management (AKM) is crucial for software development but remains challenging due to the lack of standardization and high manual effort. Architecture Decision Records (ADRs) provide a structured approach to capture Architecture Design Decisions (ADDs), but their adoption is limited due to the manual effort involved and insufficient tool support. Our previous work has shown that Large Language Models (LLMs) can assist in generating ADDs. However, simply prompting the LLM does not produce quality ADDs. Moreover, using third-party LLMs raises privacy concerns, while self-hosting them poses resource challenges. To this end, we experimented with different approaches like few-shot, retrieval-augmented generation (RAG) and fine-tuning to enhance LLM's ability to generate ADDs. Our results show that both techniques improve effectiveness. Building on this, we propose Domain Specific Retreival Augumented Few Shot Fine Tuninng, DRAFT, which combines the strengths of all these three approaches for more effective ADD generation. DRAFT operates in two phases: an offline phase that fine-tunes an LLM on generating ADDs augmented with retrieved examples and an online phase that generates ADDs by leveraging retrieved ADRs and the fine-tuned model. We evaluated DRAFT against existing approaches on a dataset of 4,911 ADRs and various LLMs and analyzed them using automated metrics and human evaluations. Results show DRAFT outperforms all other approaches in effectiveness while maintaining efficiency. Our findings indicate that DRAFT can aid architects in drafting ADDs while addressing privacy and resource constraints.

CLDec 18, 2024
Curriculum Learning for Cross-Lingual Data-to-Text Generation With Noisy Data

Kancharla Aditya Hari, Manish Gupta, Vasudeva Varma

Curriculum learning has been used to improve the quality of text generation systems by ordering the training samples according to a particular schedule in various tasks. In the context of data-to-text generation (DTG), previous studies used various difficulty criteria to order the training samples for monolingual DTG. These criteria, however, do not generalize to the crosslingual variant of the problem and do not account for noisy data. We explore multiple criteria that can be used for improving the performance of cross-lingual DTG systems with noisy data using two curriculum schedules. Using the alignment score criterion for ordering samples and an annealing schedule to train the model, we show increase in BLEU score by up to 4 points, and improvements in faithfulness and coverage of generations by 5-15% on average across 11 Indian languages and English in 2 separate datasets. We make code and data publicly available

CLDec 23, 2023
Multilingual Bias Detection and Mitigation for Indian Languages

Ankita Maity, Anubhav Sharma, Rudra Dhar et al.

Lack of diverse perspectives causes neutrality bias in Wikipedia content leading to millions of worldwide readers getting exposed by potentially inaccurate information. Hence, neutrality bias detection and mitigation is a critical problem. Although previous studies have proposed effective solutions for English, no work exists for Indian languages. First, we contribute two large datasets, mWikiBias and mWNC, covering 8 languages, for the bias detection and mitigation tasks respectively. Next, we investigate the effectiveness of popular multilingual Transformer-based models for the two tasks by modeling detection as a binary classification problem and mitigation as a style transfer problem. We make the code and data publicly available.

CLMay 5, 2023
LLM-RM at SemEval-2023 Task 2: Multilingual Complex NER using XLM-RoBERTa

Rahul Mehta, Vasudeva Varma

Named Entity Recognition(NER) is a task of recognizing entities at a token level in a sentence. This paper focuses on solving NER tasks in a multilingual setting for complex named entities. Our team, LLM-RM participated in the recently organized SemEval 2023 task, Task 2: MultiCoNER II,Multilingual Complex Named Entity Recognition. We approach the problem by leveraging cross-lingual representation provided by fine-tuning XLM-Roberta base model on datasets of all of the 12 languages provided -- Bangla, Chinese, English, Farsi, French, German, Hindi, Italian, Portuguese, Spanish, Swedish and Ukrainian

CLFeb 1, 2022
XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages

Tushar Abhishek, Shivprasad Sagare, Bhavyajeet Singh et al.

Multiple critical scenarios (like Wikipedia text generation given English Infoboxes) need automated generation of descriptive text in low resource (LR) languages from English fact triples. Previous work has focused on English fact-to-text (F2T) generation. To the best of our knowledge, there has been no previous attempt on cross-lingual alignment or generation for LR languages. Building an effective cross-lingual F2T (XF2T) system requires alignment between English structured facts and LR sentences. We propose two unsupervised methods for cross-lingual alignment. We contribute XALIGN, an XF2T dataset with 0.45M pairs across 8 languages, of which 5402 pairs have been manually annotated. We also train strong baseline XF2T generation models on the XAlign dataset.

CLSep 5, 2021
Transformer Models for Text Coherence Assessment

Tushar Abhishek, Daksh Rawat, Manish Gupta et al.

Coherence is an important aspect of text quality and is crucial for ensuring its readability. It is essential desirable for outputs from text generation systems like summarization, question answering, machine translation, question generation, table-to-text, etc. An automated coherence scoring model is also helpful in essay scoring or providing writing feedback. A large body of previous work has leveraged entity-based methods, syntactic patterns, discourse relations, and more recently traditional deep learning architectures for text coherence assessment. Previous work suffers from drawbacks like the inability to handle long-range dependencies, out-of-vocabulary words, or model sequence information. We hypothesize that coherence assessment is a cognitively complex task that requires deeper models and can benefit from other related tasks. Accordingly, in this paper, we propose four different Transformer-based architectures for the task: vanilla Transformer, hierarchical Transformer, multi-task learning-based model, and a model with fact-based input representation. Our experiments with popular benchmark datasets across multiple domains on four different coherence assessment tasks demonstrate that our models achieve state-of-the-art results outperforming existing models by a good margin.

CLMar 10, 2021
Knowledge-based Extraction of Cause-Effect Relations from Biomedical Text

Sachin Pawar, Ravina More, Girish K. Palshikar et al.

We propose a knowledge-based approach for extraction of Cause-Effect (CE) relations from biomedical text. Our approach is a combination of an unsupervised machine learning technique to discover causal triggers and a set of high-precision linguistic rules to identify cause/effect arguments of these causal triggers. We evaluate our approach using a corpus of 58,761 Leukaemia-related PubMed abstracts consisting of 568,528 sentences. We could extract 152,655 CE triplets from this corpus where each triplet consists of a cause phrase, an effect phrase and a causal trigger. As compared to the existing knowledge base - SemMedDB (Kilicoglu et al., 2012), the number of extractions are almost twice. Moreover, the proposed approach outperformed the existing technique SemRep (Rindflesch and Fiszman, 2003) on a dataset of 500 sentences.

CLJan 28, 2021
Identifying COVID-19 Fake News in Social Media

Tathagata Raha, Vijayasaradhi Indurthi, Aayush Upadhyaya et al.

The evolution of social media platforms have empowered everyone to access information easily. Social media users can easily share information with the rest of the world. This may sometimes encourage spread of fake news, which can result in undesirable consequences. In this work, we train models which can identify health news related to COVID-19 pandemic as real or fake. Our models achieve a high F1-score of 98.64%. Our models achieve second place on the leaderboard, tailing the first position with a very narrow margin 0.05% points.

CLJan 10, 2021
Summaformers @ LaySumm 20, LongSumm 20

Sayar Ghosh Roy, Nikhil Pinnaparaju, Risubh Jain et al.

Automatic text summarization has been widely studied as an important task in natural language processing. Traditionally, various feature engineering and machine learning based systems have been proposed for extractive as well as abstractive text summarization. Recently, deep learning based, specifically Transformer-based systems have been immensely popular. Summarization is a cognitively challenging task - extracting summary worthy sentences is laborious, and expressing semantics in brief when doing abstractive summarization is complicated. In this paper, we specifically look at the problem of summarizing scientific research papers from multiple domains. We differentiate between two types of summaries, namely, (a) LaySumm: A very short summary that captures the essence of the research paper in layman terms restricting overtly specific technical jargon and (b) LongSumm: A much longer detailed summary aimed at providing specific insights into various ideas touched upon in the paper. While leveraging latest Transformer-based models, our systems are simple, intuitive and based on how specific paper sections contribute to human summaries of the two types described above. Evaluations against gold standard summaries using ROUGE metrics prove the effectiveness of our approach. On blind test corpora, our system ranks first and third for the LongSumm and LaySumm tasks respectively.

CLJan 9, 2021
Task Adaptive Pretraining of Transformers for Hostility Detection

Tathagata Raha, Sayar Ghosh Roy, Ujwal Narayan et al.

Identifying adverse and hostile content on the web and more particularly, on social media, has become a problem of paramount interest in recent years. With their ever increasing popularity, fine-tuning of pretrained Transformer-based encoder models with a classifier head are gradually becoming the new baseline for natural language classification tasks. In our work, we explore the gains attributed to Task Adaptive Pretraining (TAPT) prior to fine-tuning of Transformer-based architectures. We specifically study two problems, namely, (a) Coarse binary classification of Hindi Tweets into Hostile or Not, and (b) Fine-grained multi-label classification of Tweets into four categories: hate, fake, offensive, and defamation. Building up on an architecture which takes emojis and segmented hashtags into consideration for classification, we are able to experimentally showcase the performance upgrades due to TAPT. Our system (with team name 'iREL IIIT') ranked first in the 'Hostile Post Detection in Hindi' shared task with an F1 score of 97.16% for coarse-grained detection and a weighted F1 score of 62.96% for fine-grained multi-label classification on the provided blind test corpora.

CLJan 8, 2021
Leveraging Multilingual Transformers for Hate Speech Detection

Sayar Ghosh Roy, Ujwal Narayan, Tathagata Raha et al.

Detecting and classifying instances of hate in social media text has been a problem of interest in Natural Language Processing in the recent years. Our work leverages state of the art Transformer language models to identify hate speech in a multilingual setting. Capturing the intent of a post or a comment on social media involves careful evaluation of the language style, semantic content and additional pointers such as hashtags and emojis. In this paper, we look at the problem of identifying whether a Twitter post is hateful and offensive or not. We further discriminate the detected toxic content into one of the following three classes: (a) Hate Speech (HATE), (b) Offensive (OFFN) and (c) Profane (PRFN). With a pre-trained multilingual Transformer-based text encoder at the base, we are able to successfully identify and classify hate speech from multiple languages. On the provided testing corpora, we achieve Macro F1 scores of 90.29, 81.87 and 75.40 for English, German and Hindi respectively while performing hate speech detection and of 60.70, 53.28 and 49.74 during fine-grained classification. In our experiments, we show the efficacy of Perspective API features for hate speech classification and the effects of exploiting a multilingual training scheme. A feature selection study is provided to illustrate impacts of specific features upon the architecture's classification head.

CLJan 15, 2020
Stereotypical Bias Removal for Hate Speech Detection Task using Knowledge-based Generalizations

Pinkesh Badjatiya, Manish Gupta, Vasudeva Varma

With the ever-increasing cases of hate spread on social media platforms, it is critical to design abuse detection mechanisms to proactively avoid and control such incidents. While there exist methods for hate speech detection, they stereotype words and hence suffer from inherently biased training. Bias removal has been traditionally studied for structured datasets, but we aim at bias mitigation from unstructured text data. In this paper, we make two important contributions. First, we systematically design methods to quantify the bias for any model and propose algorithms for identifying the set of words which the model stereotypes. Second, we propose novel methods leveraging knowledge-based generalizations for bias-free learning. Knowledge-based generalization provides an effective way to encode knowledge because the abstraction they provide not only generalizes content but also facilitates retraction of information from the hate speech detection classifier, thereby reducing the imbalance. We experiment with multiple knowledge generalization policies and analyze their effect on general performance and in mitigating bias. Our experiments with two real-world datasets, a Wikipedia Talk Pages dataset (WikiDetox) of size ~96k and a Twitter dataset of size ~24k, show that the use of knowledge-based generalizations results in better performance by forcing the classifier to learn from generalized content. Our methods utilize existing knowledge-bases and can easily be extended to other tasks

CLDec 25, 2019
Hybrid MemNet for Extractive Summarization

Abhishek Kumar Singh, Manish Gupta, Vasudeva Varma

Extractive text summarization has been an extensive research problem in the field of natural language understanding. While the conventional approaches rely mostly on manually compiled features to generate the summary, few attempts have been made in developing data-driven systems for extractive summarization. To this end, we present a fully data-driven end-to-end deep network which we call as Hybrid MemNet for single document summarization task. The network learns the continuous unified representation of a document before generating its summary. It jointly captures local and global sentential information along with the notion of summary worthy sentences. Experimental results on two different corpora confirm that our model shows significant performance gains compared with the state-of-the-art baselines.

CLDec 25, 2019
Unity in Diversity: Learning Distributed Heterogeneous Sentence Representation for Extractive Summarization

Abhishek Kumar Singh, Manish Gupta, Vasudeva Varma

Automated multi-document extractive text summarization is a widely studied research problem in the field of natural language understanding. Such extractive mechanisms compute in some form the worthiness of a sentence to be included into the summary. While the conventional approaches rely on human crafted document-independent features to generate a summary, we develop a data-driven novel summary system called HNet, which exploits the various semantic and compositional aspects latent in a sentence to capture document independent features. The network learns sentence representation in a way that, salient sentences are closer in the vector space than non-salient sentences. This semantic and compositional feature vector is then concatenated with the document-dependent features for sentence ranking. Experiments on the DUC benchmark datasets (DUC-2001, DUC-2002 and DUC-2004) indicate that our model shows significant performance gain of around 1.5-2 points in terms of ROUGE score compared with the state-of-the-art baselines.

CLOct 10, 2019
Multi-label Categorization of Accounts of Sexism using a Neural Framework

Pulkit Parikh, Harika Abburi, Pinkesh Badjatiya et al.

Sexism, an injustice that subjects women and girls to enormous suffering, manifests in blatant as well as subtle ways. In the wake of growing documentation of experiences of sexism on the web, the automatic categorization of accounts of sexism has the potential to assist social scientists and policy makers in studying and countering sexism better. The existing work on sexism classification, which is different from sexism detection, has certain limitations in terms of the categories of sexism used and/or whether they can co-occur. To the best of our knowledge, this is the first work on the multi-label classification of sexism of any kind(s), and we contribute the largest dataset for sexism categorization. We develop a neural solution for this multi-label classification that can combine sentence representations obtained using models such as BERT with distributional and linguistic word embeddings using a flexible, hierarchical architecture involving recurrent components and optional convolutional ones. Further, we leverage unlabeled accounts of sexism to infuse domain-specific elements into our framework. The best proposed method outperforms several deep learning as well as traditional machine learning baselines by an appreciable margin.

CLSep 22, 2019
Adapting Language Models for Non-Parallel Author-Stylized Rewriting

Bakhtiyar Syed, Gaurav Verma, Balaji Vasan Srinivasan et al.

Given the recent progress in language modeling using Transformer-based neural models and an active interest in generating stylized text, we present an approach to leverage the generalization capabilities of a language model to rewrite an input text in a target author's style. Our proposed approach adapts a pre-trained language model to generate author-stylized text by fine-tuning on the author-specific corpus using a denoising autoencoder (DAE) loss in a cascaded encoder-decoder framework. Optimizing over DAE loss allows our model to learn the nuances of an author's style without relying on parallel data, which has been a severe limitation of the previous related works in this space. To evaluate the efficacy of our approach, we propose a linguistically-motivated framework to quantify stylistic alignment of the generated text to the target author at lexical, syntactic and surface levels. The evaluation framework is both interpretable as it leads to several insights about the model, and self-contained as it does not rely on external classifiers, e.g. sentiment or formality classifiers. Qualitative and quantitative assessment indicates that the proposed approach rewrites the input text with better alignment to the target style while preserving the original content better than state-of-the-art baselines.

LGAug 29, 2018
Attention-based Neural Text Segmentation

Pinkesh Badjatiya, Litton J Kurisinkel, Manish Gupta et al.

Text segmentation plays an important role in various Natural Language Processing (NLP) tasks like summarization, context understanding, document indexing and document noise removal. Previous methods for this task require manual feature engineering, huge memory requirements and large execution times. To the best of our knowledge, this paper is the first one to present a novel supervised neural approach for text segmentation. Specifically, we propose an attention-based bidirectional LSTM model where sentence embeddings are learned using CNNs and the segments are predicted based on contextual information. This model can automatically handle variable sized context information. Compared to the existing competitive baselines, the proposed model shows a performance improvement of ~7% in WinDiff score on three benchmark datasets.

IRAug 2, 2018
SWDE : A Sub-Word And Document Embedding Based Engine for Clickbait Detection

Vaibhav Kumar, Mrinal Dhar, Dhruv Khattar et al.

In order to expand their reach and increase website ad revenue, media outlets have started using clickbait techniques to lure readers to click on articles on their digital platform. Having successfully enticed the user to open the article, the article fails to satiate his curiosity serving only to boost click-through rates. Initial methods for this task were dependent on feature engineering, which varies with each dataset. Industry systems have relied on an exhaustive set of rules to get the job done. Neural networks have barely been explored to perform this task. We propose a novel approach considering different textual embeddings of a news headline and the related article. We generate sub-word level embeddings of the title using Convolutional Neural Networks and use them to train a bidirectional LSTM architecture. An attention layer allows for calculation of significance of each term towards the nature of the post. We also generate Doc2Vec embeddings of the title and article text and model how they interact, following which it is concatenated with the output of the previous component. Finally, this representation is passed through a neural network to obtain a score for the headline. We test our model over 2538 posts (having trained it on 17000 records) and achieve an accuracy of 83.49% outscoring previous state-of-the-art approaches.

IRFeb 14, 2018
Multi-Task Learning for Extraction of Adverse Drug Reaction Mentions from Tweets

Shashank Gupta, Manish Gupta, Vasudeva Varma et al.

Adverse drug reactions (ADRs) are one of the leading causes of mortality in health care. Current ADR surveillance systems are often associated with a substantial time lag before such events are officially published. On the other hand, online social media such as Twitter contain information about ADR events in real-time, much before any official reporting. Current state-of-the-art in ADR mention extraction uses Recurrent Neural Networks (RNN), which typically need large labeled corpora. Towards this end, we propose a multi-task learning based method which can utilize a similar auxiliary task (adverse drug event detection) to enhance the performance of the main task, i.e., ADR extraction. Furthermore, in the absence of auxiliary task dataset, we propose a novel joint multi-task learning method to automatically generate weak supervision dataset for the auxiliary task when a large pool of unlabeled tweets is available. Experiments with 0.48M tweets show that the proposed approach outperforms the state-of-the-art methods for the ADR mention extraction task by 7.2% in terms of F1 score.

IRFeb 14, 2018
Co-training for Extraction of Adverse Drug Reaction Mentions from Tweets

Shashank Gupta, Manish Gupta, Vasudeva Varma et al.

Adverse drug reactions (ADRs) are one of the leading causes of mortality in health care. Current ADR surveillance systems are often associated with a substantial time lag before such events are officially published. On the other hand, online social media such as Twitter contain information about ADR events in real-time, much before any official reporting. Current state-of-the-art methods in ADR mention extraction use Recurrent Neural Networks (RNN), which typically need large labeled corpora. Towards this end, we propose a semi-supervised method based on co-training which can exploit a large pool of unlabeled tweets to augment the limited supervised training data, and as a result enhance the performance. Experiments with 0.1M tweets show that the proposed approach outperforms the state-of-the-art methods for the ADR mention extraction task by 5% in terms of F1 score.

IROct 4, 2017
Identifying Clickbait: A Multi-Strategy Approach Using Neural Networks

Vaibhav Kumar, Dhruv Khattar, Siddhartha Gairola et al.

Online media outlets, in a bid to expand their reach and subsequently increase revenue through ad monetisation, have begun adopting clickbait techniques to lure readers to click on articles. The article fails to fulfill the promise made by the headline. Traditional methods for clickbait detection have relied heavily on feature engineering which, in turn, is dependent on the dataset it is built for. The application of neural networks for this task has only been explored partially. We propose a novel approach considering all information found in a social media post. We train a bidirectional LSTM with an attention mechanism to learn the extent to which a word contributes to the post's clickbait score in a differential manner. We also employ a Siamese net to capture the similarity between source and target information. Information gleaned from images has not been considered in previous approaches. We learn image embeddings from large amounts of data using Convolutional Neural Networks to add another layer of complexity to our model. Finally, we concatenate the outputs from the three separate components, serving it as input to a fully connected layer. We conduct experiments over a test corpus of 19538 social media posts, attaining an F1 score of 65.37% on the dataset bettering the previous state-of-the-art, as well as other proposed approaches, feature engineering or otherwise.

IRSep 6, 2017
Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction Mention Extraction

Shashank Gupta, Sachin Pawar, Nitin Ramrakhiyani et al.

Social media is an useful platform to share health-related information due to its vast reach. This makes it a good candidate for public-health monitoring tasks, specifically for pharmacovigilance. We study the problem of extraction of Adverse-Drug-Reaction (ADR) mentions from social media, particularly from twitter. Medical information extraction from social media is challenging, mainly due to short and highly information nature of text, as compared to more technical and formal medical reports. Current methods in ADR mention extraction relies on supervised learning methods, which suffers from labeled data scarcity problem. The State-of-the-art method uses deep neural networks, specifically a class of Recurrent Neural Network (RNN) which are Long-Short-Term-Memory networks (LSTMs) \cite{hochreiter1997long}. Deep neural networks, due to their large number of free parameters relies heavily on large annotated corpora for learning the end task. But in real-world, it is hard to get large labeled data, mainly due to heavy cost associated with manual annotation. Towards this end, we propose a novel semi-supervised learning based RNN model, which can leverage unlabeled data also present in abundance on social media. Through experiments we demonstrate the effectiveness of our method, achieving state-of-the-art performance in ADR mention extraction.

IRJun 3, 2017
Simultaneous Inference of User Representations and Trust

Shashank Gupta, Pulkit Parikh, Manish Gupta et al.

Inferring trust relations between social media users is critical for a number of applications wherein users seek credible information. The fact that available trust relations are scarce and skewed makes trust prediction a challenging task. To the best of our knowledge, this is the first work on exploring representation learning for trust prediction. We propose an approach that uses only a small amount of binary user-user trust relations to simultaneously learn user embeddings and a model to predict trust between user pairs. We empirically demonstrate that for trust prediction, our approach outperforms classifier-based approaches which use state-of-the-art representation learning methods like DeepWalk and LINE as features. We also conduct experiments which use embeddings pre-trained with DeepWalk and LINE each as an input to our model, resulting in further performance improvement. Experiments with a dataset of $\sim$356K user pairs show that the proposed method can obtain an high F-score of 92.65%.

CLJun 1, 2017
Deep Learning for Hate Speech Detection in Tweets

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta et al.

Hate speech detection on Twitter is critical for applications like controversial event extraction, building AI chatterbots, content recommendation, and sentiment analysis. We define this task as being able to classify a tweet as racist, sexist or neither. The complexity of the natural language constructs makes this task very challenging. We perform extensive experiments with multiple deep learning architectures to learn semantic word embeddings to handle this complexity. Our experiments on a benchmark dataset of 16K annotated tweets show that such deep learning methods outperform state-of-the-art char/word n-gram methods by ~18 F1 points.

CLApr 4, 2017
Interpretation of Semantic Tweet Representations

J Ganesh, Manish Gupta, Vasudeva Varma

Research in analysis of microblogging platforms is experiencing a renewed surge with a large number of works applying representation learning models for applications like sentiment analysis, semantic textual similarity computation, hashtag prediction, etc. Although the performance of the representation learning models has been better than the traditional baselines for such tasks, little is known about the elementary properties of a tweet encoded within these representations, or why particular representations work better for certain tasks. Our work presented here constitutes the first step in opening the black-box of vector embeddings for tweets. Traditional feature engineering methods for high-level applications have exploited various elementary properties of tweets. We believe that a tweet representation is effective for an application because it meticulously encodes the application-specific elementary properties of tweets. To understand the elementary properties encoded in a tweet representation, we evaluate the representations on the accuracy to which they can model each of those properties such as tweet length, presence of particular words, hashtags, mentions, capitalization, etc. Our systematic extensive study of nine supervised and four unsupervised tweet representations against most popular eight textual and five social elementary properties reveal that Bi-directional LSTMs (BLSTMs) and Skip-Thought Vectors (STV) best encode the textual and social properties of tweets respectively. FastText is the best model for low resource settings, providing very little degradation with reduction in embedding size. Finally, we draw interesting insights by correlating the model performance obtained for elementary property prediction tasks with the highlevel downstream applications.

CLDec 19, 2016
Improving Tweet Representations using Temporal and User Context

Ganesh J, Manish Gupta, Vasudeva Varma

In this work we propose a novel representation learning model which computes semantic representations for tweets accurately. Our model systematically exploits the chronologically adjacent tweets ('context') from users' Twitter timelines for this task. Further, we make our model user-aware so that it can do well in modeling the target tweet by exploiting the rich knowledge about the user such as the way the user writes the post and also summarizing the topics on which the user writes. We empirically demonstrate that the proposed models outperform the state-of-the-art models in predicting the user profile attributes like spouse, education and job by 19.66%, 2.27% and 2.22% respectively.

CLNov 15, 2016
Interpreting the Syntactic and Social Elements of the Tweet Representations via Elementary Property Prediction Tasks

J Ganesh, Manish Gupta, Vasudeva Varma

Research in social media analysis is experiencing a recent surge with a large number of works applying representation learning models to solve high-level syntactico-semantic tasks such as sentiment analysis, semantic textual similarity computation, hashtag prediction and so on. Although the performance of the representation learning models are better than the traditional baselines for the tasks, little is known about the core properties of a tweet encoded within the representations. Understanding these core properties would empower us in making generalizable conclusions about the quality of representations. Our work presented here constitutes the first step in opening the black-box of vector embedding for social media posts, with emphasis on tweets in particular. In order to understand the core properties encoded in a tweet representation, we evaluate the representations to estimate the extent to which it can model each of those properties such as tweet length, presence of words, hashtags, mentions, capitalization, and so on. This is done with the help of multiple classifiers which take the representation as input. Essentially, each classifier evaluates one of the syntactic or social properties which are arguably salient for a tweet. This is also the first holistic study on extensively analysing the ability to encode these properties for a wide variety of tweet representation models including the traditional unsupervised methods (BOW, LDA), unsupervised representation learning methods (Siamese CBOW, Tweet2Vec) as well as supervised methods (CNN, BLSTM).

CLNov 2, 2016
Towards Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text

Ameya Prabhu, Aditya Joshi, Manish Shrivastava et al.

Sentiment analysis (SA) using code-mixed data from social media has several applications in opinion mining ranging from customer satisfaction to social campaign analysis in multilingual societies. Advances in this area are impeded by the lack of a suitable annotated dataset. We introduce a Hindi-English (Hi-En) code-mixed dataset for sentiment analysis and perform empirical analysis comparing the suitability and performance of various state-of-the-art SA methods in social media. In this paper, we introduce learning sub-word level representations in LSTM (Subword-LSTM) architecture instead of character-level or word-level representations. This linguistic prior in our architecture enables us to learn the information about sentiment value of important morphemes. This also seems to work well in highly noisy text containing misspellings as shown in our experiments which is demonstrated in morpheme-level feature maps learned by our model. Also, we hypothesize that encoding this linguistic prior in the Subword-LSTM architecture leads to the superior performance. Our system attains accuracy 4-5% greater than traditional approaches on our dataset, and also outperforms the available system for sentiment analysis in Hi-En code-mixed text by 18%.

IRJan 13, 2015
Towards Deep Semantic Analysis Of Hashtags

Piyush Bansal, Romil Bansal, Vasudeva Varma

Hashtags are semantico-syntactic constructs used across various social networking and microblogging platforms to enable users to start a topic specific discussion or classify a post into a desired category. Segmenting and linking the entities present within the hashtags could therefore help in better understanding and extraction of information shared across the social media. However, due to lack of space delimiters in the hashtags (e.g #nsavssnowden), the segmentation of hashtags into constituent entities ("NSA" and "Edward Snowden" in this case) is not a trivial task. Most of the current state-of-the-art social media analytics systems like Sentiment Analysis and Entity Linking tend to either ignore hashtags, or treat them as a single word. In this paper, we present a context aware approach to segment and link entities in the hashtags to a knowledge base (KB) entry, based on the context within the tweet. Our approach segments and links the entities in hashtags such that the coherence between hashtag semantics and the tweet is maximized. To the best of our knowledge, no existing study addresses the issue of linking entities in hashtags for extracting semantic information. We evaluate our method on two different datasets, and demonstrate the effectiveness of our technique in improving the overall entity linking in tweets via additional semantic information provided by segmenting and linking entities in a hashtag.

CLMar 29, 2013
Exploring the Role of Logically Related Non-Question Phrases for Answering Why-Questions

Niraj Kumar, Rashmi Gangadharaiah, Kannan Srinathan et al.

In this paper, we show that certain phrases although not present in a given question/query, play a very important role in answering the question. Exploring the role of such phrases in answering questions not only reduces the dependency on matching question phrases for extracting answers, but also improves the quality of the extracted answers. Here matching question phrases means phrases which co-occur in given question and candidate answers. To achieve the above discussed goal, we introduce a bigram-based word graph model populated with semantic and topical relatedness of terms in the given document. Next, we apply an improved version of ranking with a prior-based approach, which ranks all words in the candidate document with respect to a set of root words (i.e. non-stopwords present in the question and in the candidate document). As a result, terms logically related to the root words are scored higher than terms that are not related to the root words. Experimental results show that our devised system performs better than state-of-the-art for the task of answering Why-questions.