Tavpritesh Sethi

CL
h-index: 7
14 papers
70 citations
Novelty: 26%
AI Score: 34

14 Papers

AI · Dec 23, 2025
Benchmarking LLMs for Predictive Applications in the Intensive Care Units

Chehak Malhotra, Mehak Gopal, Akshaya Devadiga et al.

With the advent of LLMs, various tasks across the natural language processing domain have been transformed. However, their application in predictive tasks remains less researched. This study compares large language models, including GatorTron-Base (trained on clinical data), Llama 8B, and Mistral 7B, against small language models (SLMs) like BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec, setting benchmarks for predicting shock in critically ill patients. Timely prediction of shock can enable early interventions, thus improving patient outcomes. Text data from 17,294 ICU stays of patients in the MIMIC-III database were scored for length of stay > 24 hours and shock index (SI) > 0.7 to yield 355 and 87 patients with normal and abnormal SI, respectively. Both focal and cross-entropy losses were used during finetuning to address class imbalances. Our findings indicate that while GatorTron-Base achieved the highest weighted recall of 80.5%, the overall performance metrics were comparable between SLMs and LLMs. This suggests that LLMs are not inherently superior to SLMs in predicting future clinical events despite their strong performance on text-based tasks. To achieve meaningful clinical outcomes, future efforts in training LLMs should prioritize developing models capable of predicting clinical trajectories rather than focusing on simpler tasks such as named entity recognition or phenotyping.
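
As a rough illustration of two ingredients mentioned in this abstract, the sketch below computes the shock index under its usual definition (heart rate divided by systolic blood pressure) and compares cross-entropy with focal loss for a single example. The function names, the `gamma` and `alpha` defaults, and the sample values are assumptions made for illustration; they are not taken from the paper.

```python
import math


def shock_index(heart_rate: float, systolic_bp: float) -> float:
    """Shock index under its usual definition: heart rate / systolic blood pressure."""
    return heart_rate / systolic_bp


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one example, given the probability assigned to its true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss: cross-entropy scaled by (1 - p_true) ** gamma.

    Confidently correct examples (p_true near 1) are down-weighted, so
    hard examples, often from the minority class, dominate the total loss.
    gamma and alpha here are illustrative defaults, not the paper's settings.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


if __name__ == "__main__":
    # Hypothetical vitals: 100 beats/min with 120 mmHg systolic pressure.
    si = shock_index(100, 120)
    print("SI above 0.7 threshold:", si > 0.7)
    # Compare the two losses on an easy and a hard example.
    for p in (0.9, 0.2):
        print(p, cross_entropy(p), focal_loss(p))
```

With `gamma=0` the focal loss reduces to plain cross-entropy, which makes the relationship between the two losses easy to check.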

CL · Jun 24, 2023
Characterizing the Emotion Carriers of COVID-19 Misinformation and Their Impact on Vaccination Outcomes in India and the United States

Ridam Pal, Sanjana S, Deepak Mahto et al.

The COVID-19 Infodemic had an unprecedented impact on health behaviors and outcomes at a global scale. While many studies have focused on a qualitative and quantitative understanding of misinformation, including sentiment analysis, there is a gap in understanding the emotion-carriers of misinformation and their differences across geographies. In this study, we characterized emotion carriers and their impact on vaccination rates in India and the United States. A manually labelled dataset was created from 2.3 million tweets and collated with three publicly available datasets (CoAID, AntiVax, CMU) to train deep learning models for misinformation classification. Misinformation labelled tweets were further analyzed for behavioral aspects by leveraging Plutchik Transformers to determine the emotion for each tweet. Time series analysis was conducted to study the impact of misinformation on spatial and temporal characteristics. Further, categorical classification was performed using transformer models to assign categories for the misinformation tweets. Word2Vec+BiLSTM was the best model for misinformation classification, with an F1-score of 0.92. The US had the highest proportion of misinformation tweets (58.02%), followed by the UK (10.38%) and India (7.33%). Disgust, anticipation, and anger were associated with an increased prevalence of misinformation tweets. Disgust was the predominant emotion associated with misinformation tweets in the US, while anticipation was the predominant emotion in India. For India, the misinformation rate exhibited a lead relationship with vaccination, while in the US it lagged behind vaccination. Our study deciphered that emotions acted as differential carriers of misinformation across geography and time. These carriers can be monitored to develop strategic interventions for countering misinformation, leading to improved public health.

LG · Aug 16, 2021 · Code
WiseR: An end-to-end structure learning and deployment framework for causal graphical models

Shubham Maheshwari, Khushbu Pahwa, Tavpritesh Sethi

Structure learning offers an expressive, versatile and explainable approach to causal and mechanistic modeling of complex biological data. We present wiseR, an open source application for learning, evaluating and deploying robust causal graphical models using graph neural networks and Bayesian networks. We demonstrate the utility of this application by applying it to biomarker discovery in a COVID-19 clinical dataset.

CL · Oct 30, 2020 · Code
A Cross-lingual Natural Language Processing Framework for Infodemic Management

Ridam Pal, Rohan Pandey, Vaibhav Gautam et al.

The COVID-19 pandemic has put immense pressure on health systems which are further strained due to the misinformation surrounding it. Under such a situation, providing the right information at the right time is crucial. There is a growing demand for the management of information spread using Artificial Intelligence. Hence, we have exploited the potential of Natural Language Processing for identifying relevant information that needs to be disseminated amongst the masses. In this work, we present a novel Cross-lingual Natural Language Processing framework to provide relevant information by matching daily news with trusted guidelines from the World Health Organization. The proposed pipeline deploys various techniques of NLP such as summarizers, word embeddings, and similarity metrics to provide users with news articles along with a corresponding healthcare guideline. A total of 36 models were evaluated and a combination of LexRank based summarizer on Word2Vec embedding with Word Mover distance metric outperformed all other models. This novel open-source approach can be used as a template for proactive dissemination of relevant healthcare information in the midst of misinformation spread associated with epidemics.

AI · Sep 14, 2020 · Code
VacSIM: Learning Effective Strategies for COVID-19 Vaccine Distribution using Reinforcement Learning

Raghav Awasthi, Keerat Kaur Guliani, Saif Ahmad Khan et al.

A COVID-19 vaccine is our best bet for mitigating the ongoing onslaught of the pandemic. However, the vaccine is also expected to be a limited resource. An optimal allocation strategy, especially in countries with access inequities and temporal separation of hot-spots, might be an effective way of halting the disease spread. We approach this problem by proposing a novel pipeline VacSIM that dovetails Deep Reinforcement Learning models into a Contextual Bandits approach for optimizing the distribution of COVID-19 vaccine. Whereas the Reinforcement Learning models suggest better actions and rewards, Contextual Bandits allow online modifications that may need to be implemented on a day-to-day basis in a real-world scenario. We evaluate this framework against a naive allocation approach of distributing vaccine proportional to the incidence of COVID-19 cases in five different States across India (Assam, Delhi, Jharkhand, Maharashtra and Nagaland) and demonstrate up to 9039 potential infections prevented and a significant increase in the efficacy of limiting the spread over a period of 45 days through the VacSIM approach. Our models and the platform are extensible to all states of India and potentially across the globe. We also propose novel evaluation strategies including standard compartmental model-based projections and a causality-preserving evaluation of our model. Since all models carry assumptions that may need to be tested in various contexts, we open source our model VacSIM and contribute a new reinforcement learning environment compatible with OpenAI gym to make it extensible for real-world applications across the globe (http://vacsim.tavlab.iiitd.edu.in:8000/).

CL · May 12, 2020 · Code
Psychometric Analysis and Coupling of Emotions Between State Bulletins and Twitter in India during COVID-19 Infodemic

Baani Leen Kaur Jolly, Palash Aggrawal, Amogh Gulati et al.

The COVID-19 infodemic has been spreading faster than the pandemic itself. The misinformation riding upon the infodemic wave poses a major threat to people's health and governance systems. Since social media is the largest source of information, managing the infodemic requires not only mitigating misinformation but also an early understanding of the psychological patterns resulting from it. During the COVID-19 crisis, Twitter alone has seen a sharp 45% increase in the usage of its curated events page, and a 30% increase in its direct messaging usage, since March 6th 2020. In this study, we analyze the psychometric impact and coupling of the COVID-19 infodemic with the official bulletins related to COVID-19 at the national and state level in India. We look at these two sources with a psycho-linguistic lens of emotions and quantify the extent and coupling between the two. We modified path, a deep skip-gram based open-sourced lexicon builder for effective capture of health-related emotions. We were then able to capture the time-evolution of health-related emotions in social media and official bulletins. An analysis of lead-lag relationships between the time series of extracted emotions from official bulletins and social media using Granger's causality showed that state bulletins were leading the social media for some emotions such as Medical Emergency. Further insights that are potentially relevant for policymakers and communicators actively engaged in mitigating misinformation are also discussed. Our paper also introduces CoronaIndiaDataset, the first social media based COVID-19 dataset at national and state levels from India with over 5.6 million national and 2.6 million state-level tweets. Finally, we present our findings as COVibes, an interactive web application presenting psychometric insights derived from the CoronaIndiaDataset, both at a national and state level.

CL · Sep 30, 2021
Variance of Twitter Embeddings and Temporal Trends of COVID-19 cases

Mayank Sethi, Ambika Sadhu, Khushbu Pahwa et al.

The severity of the coronavirus pandemic necessitates effective administrative decisions. Over 4 lakh people in India succumbed to COVID-19, with over 3 crore confirmed cases, and still counting. The threat of a plausible third wave continues to haunt millions. In this ever-changing dynamic of the virus, predictive modeling methods can serve as an integral tool. The pandemic has further triggered an unprecedented usage of social media. This paper proposes a method for harnessing social media, specifically Twitter, to predict upcoming scenarios related to COVID-19 cases. In this study, we seek to understand how surges in COVID-19 related tweets can indicate a rise in cases. This prospective analysis can be utilised to aid administrators in timely resource allocation to lessen the severity of the damage. Using word embeddings to capture the semantic meaning of tweets, we identify Significant Dimensions (SDs). Our methodology predicts the rise in cases with a lead time of 15 days and 30 days with R² scores of 0.80 and 0.62 respectively. Finally, we explain the thematic utility of the SDs.
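
To make the lead-time evaluation concrete, here is a minimal sketch of how a signal observed on one day can be paired with case counts a fixed number of days later, and how an R² score is computed for the resulting predictions. The helper names and the toy numbers are hypothetical; the abstract does not describe the paper's actual pipeline at this level of detail.

```python
def make_lead_pairs(signal, cases, lead):
    """Pair the signal at day t with the case count at day t + lead."""
    usable = len(signal) - lead
    return list(zip(signal[:usable], cases[lead:]))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy daily series, invented for illustration only.
    signal = [1, 2, 3, 4]
    cases = [10, 20, 30, 40]
    print(make_lead_pairs(signal, cases, lead=1))
    print(r2_score([10, 20, 30], [12, 19, 29]))
```

An R² of 1 means the predictions match the observed values exactly, while predicting the mean of the observed values for every day gives an R² of 0.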

LG · Aug 16, 2021
Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring

Sezal Chug, Priya Kaushal, Ponnurangam Kumaraguru et al.

Data is expanding at an unimaginable rate, and with this development comes the responsibility for data quality. Data Quality refers to the relevance of the information present and helps in various operations like decision making and planning in a particular organization. Data quality is mostly measured on an ad-hoc basis, and hence none of the developed concepts provide any practical application. The current empirical study was undertaken to formulate a concrete automated data quality platform to assess the quality of an incoming dataset and generate a quality label, score and comprehensive report. We utilize various datasets from healthdata.gov, opendata.nhs and Demographics and Health Surveys (DHS) Program to observe the variations in the quality score and formulate a label using Principal Component Analysis (PCA). The results of the current empirical study revealed a metric that encompasses nine quality ingredients, namely provenance, dataset characteristics, uniformity, metadata coupling, percentage of missing cells and duplicate rows, skewness of data, the ratio of inconsistencies of categorical columns, and correlation between these attributes. The study also provides an illustrative case study and validation of the metric following Mutation Testing approaches. This research study provides an automated platform which takes an incoming dataset and metadata to provide the DQ score, report and label. The results of this study would be useful to data scientists, as the value of this quality label would instill confidence before they deploy the data in their practical applications.

SI · May 17, 2021
The State of Infodemic on Twitter

Drishti Jain, Tavpritesh Sethi

Following the wave of misinterpreted, manipulated and malicious information growing on the Internet, the misinformation surrounding COVID-19 has become a paramount issue. In the context of the current COVID-19 pandemic, social media posts and platforms are at risk of rumors and misinformation in the face of the serious uncertainty surrounding the virus itself. At the same time, the uncertainty and new nature of COVID-19 means that other unconfirmed information that may appear "rumored" may be an important indicator of the behavior and impact of this new virus. Twitter, in particular, has taken center stage in this storm, where COVID-19 has been a much talked about subject. We present an exploratory analysis of the tweets and the users who are involved in spreading misinformation and then delve into machine learning models and natural language processing techniques to identify if a tweet contains misinformation.

CL · Apr 2, 2021
Mining Trends of COVID-19 Vaccine Beliefs on Twitter with Lexical Embeddings

Harshita Chopra, Aniket Vashishtha, Ridam Pal et al.

Social media plays a pivotal role in disseminating news globally and acts as a platform for people to express their opinions on various topics. A wide variety of views accompanies COVID-19 vaccination drives across the globe, often colored by emotions, which change along with rising cases, approval of vaccines, and multiple factors discussed online. This study aims at analyzing the temporal evolution of different Emotion categories: Hesitation, Rage, Sorrow, Anticipation, Faith, and Contentment with Influencing Factors: Vaccine Rollout, Misinformation, Health Effects, and Inequities as lexical categories created from Tweets belonging to five countries with vital vaccine roll-out programs, namely, India, United States of America, Brazil, United Kingdom, and Australia. We extracted a corpus of nearly 1.8 million Twitter posts related to COVID-19 vaccination. Using cosine distance from selected seed words, we expanded the vocabulary of each category and tracked the longitudinal change in their strength from June 2020 to April 2021. We used community detection algorithms to find modules in positive correlation networks. Our findings suggest that tweets expressing hesitancy towards vaccines contain the highest mentions of health-related effects in all countries. Our results indicated that the patterns of hesitancy were variable across geographies and can help us learn targeted interventions. We also observed a significant change in the linear trends of categories like hesitation and contentment before and after approval of vaccines. Negative emotions like rage and sorrow gained the highest importance in the alluvial diagram. They formed a significant module with all the influencing factors in April 2021, when India observed the second wave of COVID-19 cases. The relationship between Emotions and Influencing Factors was found to be variable across the countries.
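
The vocabulary expansion step described in this abstract (growing a lexical category from seed words by cosine distance) can be sketched as follows. The two-dimensional embeddings, the seed word, and the `max_distance` threshold are all invented for illustration; a real study would use trained word embeddings and a threshold chosen by the authors.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def expand_category(seeds, embeddings, max_distance=0.3):
    """Add every word lying within max_distance of any seed word."""
    known_seeds = [s for s in seeds if s in embeddings]
    expanded = set(seeds)
    for word, vector in embeddings.items():
        if any(cosine_distance(vector, embeddings[s]) <= max_distance
               for s in known_seeds):
            expanded.add(word)
    return expanded


if __name__ == "__main__":
    # Hypothetical toy embeddings, not derived from any real corpus.
    toy_embeddings = {
        "hesitant": [1.0, 0.1],
        "doubt": [0.9, 0.2],
        "joy": [0.0, 1.0],
    }
    print(expand_category(["hesitant"], toy_embeddings))
```

Tracking how often the expanded word set appears in tweets over time would then give a per-category strength series of the kind the abstract describes.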

LG · Nov 30, 2020
Learning Explainable Interventions to Mitigate HIV Transmission in Sex Workers Across Five States in India

Raghav Awasthi, Prachi Patel, Vineet Joshi et al.

Female sex workers (FSWs) are one of the most vulnerable and stigmatized groups in society. As a result, they often suffer from a lack of quality access to care. Grassroots organizations engaged in improving health services are often faced with the challenge of improving the effectiveness of interventions due to complex influences. This work combines structure learning, discriminative modeling, and grassroots-level expertise of designing interventions across five different Indian states to discover the influence of non-obvious factors for improving safe-sex practices in FSWs. A bootstrapped, ensemble-averaged Bayesian Network structure was learned to quantify the factors that could maximize condom usage as revealed from the model. A discriminative model was then constructed using XGBoost and random forest in order to predict condom use behavior. The best model achieved 83% sensitivity, 99% specificity, and 99% area under the precision-recall curve for the prediction. Both generative and discriminative modeling approaches revealed that financial literacy training was the primary influence and predictor of condom use in FSWs. These insights have led to a currently ongoing field trial for assessing the real-world utility of this approach. Our work highlights the potential of explainable models for transparent discovery and prioritization of anti-HIV interventions in female sex workers in a resource-limited setting.

CV · Oct 30, 2020
(Un)Masked COVID-19 Trends from Social Media

Asmit Kumar Singh, Paras Mehan, Divyanshu Sharma et al.

Wearing masks is a useful protection method against COVID-19, which has caused widespread economic and social impact worldwide. Across the globe, governments have issued mandates for the use of face masks, which have received both positive and negative reactions. Online social media provides an exciting platform to study the use of masks and analyze underlying mask-wearing patterns. In this article, we analyze 2.04 million social media images for six US cities. An increase in masks worn in images is seen as the COVID-19 cases rose, particularly when their respective states imposed strict regulations. We also found a decrease in the posting of group pictures as stay-at-home laws were put into place. Furthermore, mask compliance in the Black Lives Matter protest was analyzed, revealing that 40% of the people in group photos wore masks, and 45% of them wore the masks with a fit score of greater than 80%. We introduce two new datasets, VAriety MAsks - Classification (VAMA-C) and VAriety MAsks - Segmentation (VAMA-S), for mask detection and mask fit analysis tasks, respectively. For the analysis, we create two frameworks, face mask detector (for classifying masked and unmasked faces) and mask fit analyzer (a semantic segmentation based model to calculate a mask-fit score). The face mask detector achieved a classification accuracy of 98%, and the semantic segmentation model for the mask fit analyzer achieved an Intersection Over Union (IOU) score of 98%. We conclude that such a framework can be used to evaluate the effectiveness of such public health strategies using social media platforms in times of pandemic.
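
For readers unfamiliar with the Intersection Over Union metric reported for the segmentation model, the sketch below computes it for two small binary masks. The masks are invented, and returning 1.0 when both masks are empty is a convention assumed here; the abstract does not define how the paper's mask-fit score is derived from the segmentation output.

```python
def intersection_over_union(pred, target):
    """IoU of two equally sized binary masks given as nested lists of 0/1."""
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    intersection = sum(p and t for p, t in zip(flat_pred, flat_target))
    union = sum(p or t for p, t in zip(flat_pred, flat_target))
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return intersection / union


if __name__ == "__main__":
    # Hypothetical 2x2 masks: the prediction covers one extra pixel.
    predicted = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    print(intersection_over_union(predicted, ground_truth))
```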

CY · Mar 16, 2020
A Machine Learning Application for Raising WASH Awareness in the Times of COVID-19 Pandemic

Rohan Pandey, Vaibhav Gautam, Ridam Pal et al.

Background: The COVID-19 pandemic has uncovered the potential of digital misinformation in shaping the health of nations. The deluge of unverified information that spreads faster than the epidemic itself is an unprecedented phenomenon that has put millions of lives in danger. Mitigating this Infodemic requires strong health messaging systems that are engaging, vernacular, scalable, effective and continuously learn the new patterns of misinformation. Objective: We created WashKaro, a multi-pronged intervention for mitigating misinformation through conversational AI, machine translation and natural language processing. WashKaro provides the right information matched against WHO guidelines through AI, and delivers it in the right format in local languages. Methods: We theorize (i) an NLP based AI engine that could continuously incorporate user feedback to improve relevance of information, (ii) bite sized audio in the local language to improve penetrance in a country with skewed gender literacy ratios, and (iii) conversational but interactive AI engagement with users towards an increased health awareness in the community. Results: A total of 5026 people downloaded the app during the study window, of whom 1545 were active users. Our study shows that 3.4 times more females engaged with the App in Hindi as compared to males, the relevance of AI-filtered news content doubled within 45 days of continuous machine learning, and the prudence of the integrated AI chatbot Satya increased, thus proving the usefulness of an mHealth platform to mitigate health misinformation. Conclusion: We conclude that a multi-pronged machine learning application delivering vernacular bite-sized audios and conversational AI is an effective approach to mitigate health misinformation.

AP · Sep 18, 2018
Learning to Address Health Inequality in the United States with a Bayesian Decision Network

Tavpritesh Sethi, Anant Mittal, Shubham Maheshwari et al.

Life-expectancy is a complex outcome driven by genetic, socio-demographic, environmental and geographic factors. Increasing socio-economic and health disparities in the United States are propagating the longevity-gap, making it a cause for concern. Earlier studies have probed individual factors but an integrated picture to reveal quantifiable actions has been missing. There is a growing concern about a further widening of healthcare inequality caused by Artificial Intelligence (AI) due to differential access to AI-driven services. Hence, it is imperative to explore and exploit the potential of AI for illuminating biases and enabling transparent policy decisions for positive social and health impact. In this work, we reveal actionable interventions for decreasing the longevity-gap in the United States by analyzing a County-level data resource containing healthcare, socio-economic, behavioral, education and demographic features. We learn an ensemble-averaged structure, draw inferences using the joint probability distribution and extend it to a Bayesian Decision Network for identifying policy actions. We draw quantitative estimates for the impact of diversity, preventive-care quality and stable-families within the unified framework of our decision network. Finally, we make this analysis and dashboard available as an interactive web-application for enabling users and policy-makers to validate our reported findings and to explore impacts beyond those reported in this work.