Girish Keshav Palshikar

h-index: 14

14 papers

215 citations

Novelty: 38%

AI Score: 39

Ranked #82,032 of 194,257 authors (top 42%); #15,556 in CL (top 51%)

14 Papers

6.7 CL Dec 26, 2025

Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs

Sachin Pawar, Manoj Apte, Kshitij Jadhav et al.

Tokenization is the first step in training any Large Language Model (LLM), where the text is split into a sequence of tokens as per the model's fixed vocabulary. This tokenization in LLMs is different from the traditional tokenization in NLP where the text is split into a sequence of "natural" words. In LLMs, a natural word may also be broken into multiple tokens due to limited vocabulary size of the LLMs (e.g., Mistral's tokenizer splits "martial" into "mart" and "ial"). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text for a specific LLM, indicating how "bad" the tokenization is. We establish statistical significance of our hypothesis on multiple NLP tasks for a set of different LLMs.
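To make the idea of a tokenization penalty concrete, here is a minimal Python sketch. It is not one of the paper's penalty functions, which the abstract does not define. It assumes the simplest possible penalty (the fraction of whitespace-separated words that are split into more than one token) and uses a hypothetical greedy longest-match tokenizer, `toy_tokenize`, with a made-up vocabulary in place of a real LLM tokenizer.

```python
from typing import Callable, List, Set


def toy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match split of one word against a fixed vocabulary.

    Hypothetical stand-in for a real LLM tokenizer: characters not covered
    by the vocabulary fall back to single-character tokens.
    """
    tokens: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at i, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def broken_word_penalty(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Assumed penalty: share of words split into more than one token."""
    words = text.split()
    if not words:
        return 0.0
    broken = sum(1 for w in words if len(tokenize(w)) > 1)
    return broken / len(words)


# Made-up vocabulary in which "martial" is not a single token.
VOCAB = {"the", "mart", "ial", "arts"}
pieces = toy_tokenize("martial", VOCAB)
penalty = broken_word_penalty("the martial arts", lambda w: toy_tokenize(w, VOCAB))
print(pieces, round(penalty, 3))
```

With this toy vocabulary, "martial" is split into "mart" and "ial" as in the abstract's Mistral example, so one of the three words is broken. To score a real model, the `tokenize` argument could be swapped for a call into that model's own tokenizer.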

2.7 CL Dec 26, 2025

Explainable Statute Prediction via Attention-based Model and LLM Prompting

Sachin Pawar, Girish Keshav Palshikar, Anindita Sinha Banerjee et al.

In this paper, we explore the problem of automatic statute prediction where, for a given case description, a subset of relevant statutes is to be predicted. Here, the term "statute" refers to a section, a sub-section, or an article of any specific Act. Addressing this problem would be useful in several applications such as AI assistants for lawyers and legal question answering systems. For better user acceptance of such Legal AI systems, we believe the predictions should also be accompanied by human-understandable explanations. We propose two techniques for addressing this problem of statute prediction with explanations -- (i) AoS (Attention-over-Sentences), which uses attention over sentences in a case description to predict the statutes relevant to it, and (ii) LLMPrompt, which prompts an LLM to predict as well as explain the relevance of a certain statute. AoS uses smaller language models, specifically sentence transformers, and is trained in a supervised manner, whereas LLMPrompt uses larger language models in a zero-shot manner and explores both standard and Chain-of-Thought (CoT) prompting techniques. Both these models produce explanations for their predictions in human-understandable forms. We compare the statute prediction performance of the two proposed techniques with each other as well as with a set of competent baselines, across two popular datasets. We also evaluate the quality of the generated explanations through automated counterfactual evaluation as well as through human evaluation.
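The attention-over-sentences idea can be illustrated with a small Python sketch. This is not the AoS model itself: it assumes plain dot-product attention between toy sentence vectors and a hypothetical statute query vector, and it assumes that the attention weights are what would be surfaced as the explanation. The function name `attention_pool` and all vectors are invented for illustration.

```python
import math
from typing import List, Tuple


def attention_pool(
    sent_vecs: List[List[float]], query: List[float]
) -> Tuple[List[float], List[float]]:
    """Dot-product attention over sentence vectors.

    Returns (weights, pooled): softmax weights per sentence and the
    weighted sum of the sentence vectors.
    """
    scores = [sum(s * q for s, q in zip(vec, query)) for vec in sent_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, sent_vecs))
        for d in range(len(query))
    ]
    return weights, pooled


# Toy 2-d "sentence embeddings" for a three-sentence case description.
sentences = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
statute_query = [1.0, 0.0]  # hypothetical learned query for one statute
weights, pooled = attention_pool(sentences, statute_query)
most_relevant = weights.index(max(weights))
print(most_relevant, [round(w, 3) for w in weights])
```

In this toy setup the third sentence receives the largest weight, so it would be the one shown to a user as the evidence for the predicted statute. A real system would obtain the sentence vectors from a sentence transformer and learn the query during supervised training.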

2.7 CL May 12, 2025

Matching Tasks with Industry Groups for Augmenting Commonsense Knowledge

Rituraj Singh, Sachin Pawar, Girish Palshikar

Commonsense knowledge bases (KB) are a source of specialized knowledge that is widely used to improve machine learning applications. However, even for a large KB such as ConceptNet, capturing explicit knowledge from each industry domain is challenging. For example, only a few samples of general tasks performed by various industries are available in ConceptNet. Here, a task is a well-defined knowledge-based volitional action to achieve a particular goal. In this paper, we aim to fill this gap and present a weakly-supervised framework to augment commonsense KB with tasks carried out by various industry groups (IG). We attempt to match each task with one or more suitable IGs by training a neural model to learn task-IG affinity and apply clustering to select the top-k tasks per IG. We extract a total of 2339 triples of the form ⟨IG, is capable of, task⟩ from two publicly available news datasets for 24 IGs with a precision of 0.86. This validates the reliability of the extracted task-IG pairs that can be directly added to existing KBs.

0.2 CL Dec 16, 2021

CLICKER: A Computational LInguistics Classification Scheme for Educational Resources

Swapnil Hingmire, Irene Li, Rena Kawamura et al.

A classification scheme of a scientific subject gives an overview of its body of knowledge. It can also be used to facilitate access to research articles and other materials related to the subject. For example, the ACM Computing Classification System (CCS) is used in the ACM Digital Library search interface and also for indexing computer science papers. We observed that a comprehensive classification system like CCS or Mathematics Subject Classification (MSC) does not exist for Computational Linguistics (CL) and Natural Language Processing (NLP). We propose a classification scheme -- CLICKER for CL/NLP based on the analysis of online lectures from 77 university courses on this subject. The currently proposed taxonomy includes 334 topics and focuses on educational aspects of CL/NLP; it is based primarily, but not exclusively, on lecture notes from NLP courses. We discuss how such a taxonomy can help in various real-world applications, including tutoring platforms, resource retrieval, resource recommendation, prerequisite chain learning, and survey generation.

0.2 CL Mar 10, 2021

Techniques for Jointly Extracting Entities and Relations: A Survey

Sachin Pawar, Pushpak Bhattacharyya, Girish K. Palshikar

Relation Extraction is an important task in Information Extraction which deals with identifying semantic relations between entity mentions. Traditionally, relation extraction is carried out after entity extraction in a "pipeline" fashion, so that relation extraction only focuses on determining whether any semantic relation exists between a pair of extracted entity mentions. This leads to propagation of errors from entity extraction stage to relation extraction stage. Also, entity extraction is carried out without any knowledge about the relations. Hence, it was observed that jointly performing entity and relation extraction is beneficial for both the tasks. In this paper, we survey various techniques for jointly extracting entities and relations. We categorize techniques based on the approach they adopt for joint extraction, i.e. whether they employ joint inference or joint modelling or both. We further describe some representative techniques for joint inference and joint modelling. We also describe two standard datasets, evaluation techniques and performance of the joint extraction approaches on these datasets. We present a brief analysis of application of a general domain joint extraction approach to a Biomedical dataset. This survey is useful for researchers as well as practitioners in the field of Information Extraction, by covering a broad landscape of joint extraction techniques.

0.5 CL Mar 10, 2021

Knowledge-based Extraction of Cause-Effect Relations from Biomedical Text

Sachin Pawar, Ravina More, Girish K. Palshikar et al.

We propose a knowledge-based approach for extraction of Cause-Effect (CE) relations from biomedical text. Our approach is a combination of an unsupervised machine learning technique to discover causal triggers and a set of high-precision linguistic rules to identify cause/effect arguments of these causal triggers. We evaluate our approach using a corpus of 58,761 Leukaemia-related PubMed abstracts consisting of 568,528 sentences. We could extract 152,655 CE triplets from this corpus, where each triplet consists of a cause phrase, an effect phrase and a causal trigger. Compared to the existing knowledge base SemMedDB (Kilicoglu et al., 2012), the number of extractions is almost twice as large. Moreover, the proposed approach outperformed the existing technique SemRep (Rindflesch and Fiszman, 2003) on a dataset of 500 sentences.

0.2 CL Jun 15, 2020

Extracting N-ary Cross-sentence Relations using Constrained Subsequence Kernel

Sachin Pawar, Pushpak Bhattacharyya, Girish K. Palshikar

Most of the past work in relation extraction deals with relations occurring within a sentence and having only two entity arguments. We propose a new formulation of the relation extraction task where the relations are more general than intra-sentence relations in the sense that they may span multiple sentences and may have more than two arguments. Moreover, the relations are more specific than corpus-level relations in the sense that their scope is limited only within a document and not valid globally throughout the corpus. We propose a novel sequence representation to characterize instances of such relations. We then explore various classifiers whose features are derived from this sequence representation. For SVM classifier, we design a Constrained Subsequence Kernel which is a variant of Generalized Subsequence Kernel. We evaluate our approach on three datasets across two domains: biomedical and general domain.

3.2 IR Feb 14, 2018

Multi-Task Learning for Extraction of Adverse Drug Reaction Mentions from Tweets

Shashank Gupta, Manish Gupta, Vasudeva Varma et al.

Adverse drug reactions (ADRs) are one of the leading causes of mortality in health care. Current ADR surveillance systems are often associated with a substantial time lag before such events are officially published. On the other hand, online social media such as Twitter contain information about ADR events in real-time, much before any official reporting. Current state-of-the-art in ADR mention extraction uses Recurrent Neural Networks (RNN), which typically need large labeled corpora. Towards this end, we propose a multi-task learning based method which can utilize a similar auxiliary task (adverse drug event detection) to enhance the performance of the main task, i.e., ADR extraction. Furthermore, in the absence of auxiliary task dataset, we propose a novel joint multi-task learning method to automatically generate weak supervision dataset for the auxiliary task when a large pool of unlabeled tweets is available. Experiments with 0.48M tweets show that the proposed approach outperforms the state-of-the-art methods for the ADR mention extraction task by 7.2% in terms of F1 score.

5.6 IR Feb 14, 2018

Co-training for Extraction of Adverse Drug Reaction Mentions from Tweets

Shashank Gupta, Manish Gupta, Vasudeva Varma et al.

Adverse drug reactions (ADRs) are one of the leading causes of mortality in health care. Current ADR surveillance systems are often associated with a substantial time lag before such events are officially published. On the other hand, online social media such as Twitter contain information about ADR events in real-time, much before any official reporting. Current state-of-the-art methods in ADR mention extraction use Recurrent Neural Networks (RNN), which typically need large labeled corpora. Towards this end, we propose a semi-supervised method based on co-training which can exploit a large pool of unlabeled tweets to augment the limited supervised training data, and as a result enhance the performance. Experiments with 0.1M tweets show that the proposed approach outperforms the state-of-the-art methods for the ADR mention extraction task by 5% in terms of F1 score.

12.2 CL Dec 14, 2017

Relation Extraction : A Survey

Sachin Pawar, Girish K. Palshikar, Pushpak Bhattacharyya

With the advent of the Internet, a large amount of digital text is generated every day in the form of news articles, research publications, blogs, question answering forums and social media. It is important to develop techniques for extracting information automatically from these documents, as a lot of important information is hidden within them. This extracted information can be used to improve access and management of knowledge hidden in large text corpora. Several applications such as Question Answering and Information Retrieval would benefit from this information. Entities like persons and organizations form the most basic unit of the information. Occurrences of entities in a sentence are often linked through well-defined relations; e.g., occurrences of person and organization in a sentence may be linked through relations such as "employed at". The task of Relation Extraction (RE) is to identify such relations automatically. In this paper, we survey several important supervised, semi-supervised and unsupervised RE techniques. We also cover the paradigms of Open Information Extraction (OIE) and Distant Supervision. Finally, we describe some of the recent trends in the RE techniques and possible future research directions. This survey would be useful for three kinds of readers: i) newcomers in the field who want to quickly learn about RE; ii) researchers who want to know how the various RE techniques evolved over time and what the possible future research directions are; and iii) practitioners who just need to know which RE technique works best in various settings.

0.3 CL Dec 4, 2017

Mining Supervisor Evaluation and Peer Feedback in Performance Appraisals

Girish Keshav Palshikar, Sachin Pawar, Saheb Chourasia et al.

Performance appraisal (PA) is an important HR process to periodically measure and evaluate every employee's performance vis-a-vis the goals established by the organization. A PA process involves purposeful multi-step multi-modal communication between employees, their supervisors and their peers, such as self-appraisal, supervisor assessment and peer feedback. Analysis of the structured data and text produced in PA is crucial for measuring the quality of appraisals and tracking actual improvements. In this paper, we apply text mining techniques to produce insights from PA text. First, we perform sentence classification to identify strengths, weaknesses and suggestions of improvements found in the supervisor assessments and then use clustering to discover broad categories among them. Next we use multi-class multi-label classification techniques to match supervisor assessments to predefined broad perspectives on performance. Finally, we propose a short-text summarization technique to produce a summary of peer feedback comments for a given employee and compare it with manual summaries. All techniques are illustrated using a real-life dataset of supervisor assessment and peer feedback text produced during the PA of 4528 employees in a large multi-national IT company.

10.7 AI Dec 4, 2017

End-to-End Relation Extraction using Markov Logic Networks

Sachin Pawar, Pushpak Bhattacharyya, Girish K. Palshikar

The task of end-to-end relation extraction consists of two sub-tasks: i) identifying entity mentions along with their types and ii) recognizing semantic relations among the entity mention pairs. It has been shown that for better performance, it is necessary to address these two sub-tasks jointly. We propose an approach for simultaneous extraction of entity mentions and relations in a sentence, by using inference in Markov Logic Networks (MLN). We learn three different classifiers: i) local entity classifier, ii) local relation classifier and iii) "pipeline" relation classifier which uses predictions of the local entity classifier. Predictions of these classifiers may be inconsistent with each other. We represent these predictions along with some domain knowledge using weighted first-order logic rules in an MLN and perform joint inference over the MLN to obtain a global output with minimum inconsistencies. Experiments on the ACE (Automatic Content Extraction) 2004 dataset demonstrate that our approach of joint extraction using MLNs outperforms the baselines of individual classifiers. Our end-to-end relation extraction performance is better than 2 out of 3 previous results reported on the ACE 2004 dataset.

0.3 CL Dec 4, 2017

Topics and Label Propagation: Best of Both Worlds for Weakly Supervised Text Classification

Sachin Pawar, Nitin Ramrakhiyani, Swapnil Hingmire et al.

We propose a Label Propagation based algorithm for weakly supervised text classification. We construct a graph where each document is represented by a node and edge weights represent similarities among the documents. Additionally, we discover underlying topics using Latent Dirichlet Allocation (LDA) and enrich the document graph by including the topics in the form of additional nodes. The edge weights between a topic and a text document represent the level of "affinity" between them. Our approach does not require document-level labelling; instead, it expects manual labels only for topic nodes. This significantly reduces the level of supervision needed, as only a few topics are observed to be enough for achieving sufficiently high accuracy. The Label Propagation Algorithm is employed on this enriched graph to propagate labels among the nodes. Our approach combines the advantages of Label Propagation (through document-document similarities) and Topic Modelling (for minimal but smart supervision). We demonstrate the effectiveness of our approach on various datasets and compare with state-of-the-art weakly supervised text classification approaches.
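The enriched graph described in this abstract can be sketched in a few lines of Python. This is an illustrative toy, not the authors' algorithm: it assumes a basic clamped label propagation (each unlabeled node repeatedly takes the weighted average of its neighbours' label distributions), and every node name and edge weight below is invented. In the paper, topic nodes would come from LDA and the weights from document similarity and topic affinity.

```python
from typing import Dict, List

Graph = Dict[str, Dict[str, float]]


def propagate(graph: Graph, seeds: Dict[str, int],
              n_classes: int, n_iters: int = 50) -> Dict[str, int]:
    """Clamped label propagation; returns the most likely class per node."""
    # Start every node from a uniform distribution over classes.
    dist: Dict[str, List[float]] = {
        node: [1.0 / n_classes] * n_classes for node in graph
    }
    # Seed (topic) nodes get a one-hot distribution and stay fixed.
    for node, label in seeds.items():
        dist[node] = [1.0 if k == label else 0.0 for k in range(n_classes)]
    for _ in range(n_iters):
        updated = {}
        for node, nbrs in graph.items():
            if node in seeds:
                updated[node] = dist[node]
                continue
            total = sum(nbrs.values())
            updated[node] = [
                sum(w * dist[m][k] for m, w in nbrs.items()) / total
                for k in range(n_classes)
            ]
        dist = updated
    return {n: max(range(n_classes), key=lambda k: dist[n][k]) for n in graph}


# Four documents plus two topic nodes; edges are symmetric.
graph: Graph = {
    "t_sports": {"d0": 0.9, "d1": 0.8},
    "t_politics": {"d2": 0.9},
    "d0": {"t_sports": 0.9, "d1": 0.5},
    "d1": {"t_sports": 0.8, "d0": 0.5},
    "d2": {"t_politics": 0.9, "d3": 0.7},
    "d3": {"d2": 0.7},  # no topic edge: labelled only via document d2
}
labels = propagate(graph, seeds={"t_sports": 0, "t_politics": 1}, n_classes=2)
print(labels)
```

Only the two topic nodes carry manual labels here. Document `d3` has no topic edge at all and still receives the politics label through its similarity edge to `d2`, which is the kind of benefit the abstract attributes to combining document-document similarities with topic-level supervision.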

9.3 IR Sep 6, 2017

Semi-Supervised Recurrent Neural Network for Adverse Drug Reaction Mention Extraction

Shashank Gupta, Sachin Pawar, Nitin Ramrakhiyani et al.

Social media is a useful platform to share health-related information due to its vast reach. This makes it a good candidate for public-health monitoring tasks, specifically for pharmacovigilance. We study the problem of extraction of Adverse-Drug-Reaction (ADR) mentions from social media, particularly from Twitter. Medical information extraction from social media is challenging, mainly due to the short and highly informal nature of the text, as compared to more technical and formal medical reports. Current methods in ADR mention extraction rely on supervised learning, which suffers from the scarcity of labeled data. The state-of-the-art method uses deep neural networks, specifically a class of Recurrent Neural Networks (RNNs) known as Long Short-Term Memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997). Deep neural networks, due to their large number of free parameters, rely heavily on large annotated corpora for learning the end task. But in the real world, it is hard to get large labeled datasets, mainly due to the heavy cost associated with manual annotation. Towards this end, we propose a novel semi-supervised learning based RNN model, which can also leverage the unlabeled data present in abundance on social media. Through experiments we demonstrate the effectiveness of our method, achieving state-of-the-art performance in ADR mention extraction.