Sopan Khosla

h-index5

9papers

4,622citations

Novelty59%

AI Score49

Ranked #23,483 of 194,257 authors (top 12%)#4,927 in CL (top 16%)

9 Papers

11.2CLNov 14, 2023Code

Bring Your Own KG: Self-Supervised Program Synthesis for Zero-Shot KGQA

Dhruv Agarwal, Rajarshi Das, Sopan Khosla et al.

We present BYOKG, a universal question-answering (QA) system that can operate on any knowledge graph (KG), requires no human-annotated training data, and can be ready to use within a day -- attributes that are out-of-scope for current KGQA systems. BYOKG draws inspiration from the remarkable ability of humans to comprehend information present in an unseen KG through exploration -- starting at random nodes, inspecting the labels of adjacent nodes and edges, and combining them with their prior world knowledge. In BYOKG, exploration leverages an LLM-backed symbolic agent that generates a diverse set of query-program exemplars, which are then used to ground a retrieval-augmented reasoning procedure to predict programs for arbitrary questions. BYOKG is effective over both small- and large-scale graphs, showing dramatic gains in QA accuracy over a zero-shot baseline of 27.89 and 58.02 F1 on GrailQA and MetaQA, respectively. On GrailQA, we further show that our unsupervised BYOKG outperforms a supervised in-context learning method, demonstrating the effectiveness of exploration. Lastly, we find that performance of BYOKG reliably improves with continued exploration as well as improvements in the base LLM, notably outperforming a state-of-the-art fine-tuned model by 7.08 F1 on a sub-sampled zero-shot split of GrailQA.

27.4CLNov 13, 2025

Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

Yun He, Wenzhe Li, Hejia Zhang et al.

Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF)-especially for complex, multi-turn, and system-prompted instructions-remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.

31.0CLAug 11, 2020Code

LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification

Sopan Khosla, Rishabh Joshi, Ritam Dutt et al.

In this paper we describe our submission for the task of Propaganda Span Identification in news articles. We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda. The "multi-granular" model incorporates linguistic knowledge at various levels of text granularity, including word, sentence and document level syntactic, semantic and pragmatic affect features, which significantly improve model performance, compared to its language-agnostic variant. To facilitate better representation learning, we also collect a corpus of 10k news articles, and use it for fine-tuning the model. The final model is a majority-voting ensemble which learns different propaganda class boundaries by leveraging different subsets of incorporated knowledge and attains $4^{th}$ position on the test leaderboard. Our final model and code is released at https://github.com/sopu/PropagandaSemEval2020.

16.9LGFeb 25, 2025

FairGen: Controlling Sensitive Attributes for Fair Generations in Diffusion Models via Adaptive Latent Guidance

Mintong Kang, Vinayshekhar Bannihatti Kumar, Shamik Roy et al.

Text-to-image diffusion models often exhibit biases toward specific demographic groups, such as generating more males than females when prompted to generate images of engineers, raising ethical concerns and limiting their adoption. In this paper, we tackle the challenge of mitigating generation bias towards any target attribute value (e.g., "male" for "gender") in diffusion models while preserving generation quality. We propose FairGen, an adaptive latent guidance mechanism which controls the generation distribution during inference. In FairGen, a latent guidance module dynamically adjusts the diffusion process to enforce specific attributes, while a memory module tracks the generation statistics and steers latent guidance to align with the targeted fair distribution of the attribute values. Furthermore, we address the limitations of existing datasets by introducing the Holistic Bias Evaluation (HBE) benchmark, which covers diverse domains and incorporates complex prompts to assess bias more comprehensively. Extensive evaluations on HBE and Stable Bias datasets demonstrate that FairGen outperforms existing bias mitigation approaches, achieving substantial bias reduction (e.g., 68.5% gender bias reduction on Stable Diffusion 2). Ablation studies highlight FairGen's ability to flexibly control the output distribution at any user-specified granularity, ensuring adaptive and targeted bias mitigation.

31.7CLApr 20, 2021

Evaluating the Impact of a Hierarchical Discourse Representation on Entity Coreference Resolution Performance

Sopan Khosla, James Fiacco, Carolyn Rose

Recent work on entity coreference resolution (CR) follows current trends in Deep Learning applied to embeddings and relatively simple task-related features. SOTA models do not make use of hierarchical representations of discourse structure. In this work, we leverage automatically constructed discourse parse trees within a neural approach and demonstrate a significant improvement on two benchmark entity coreference-resolution datasets. We explore how the impact varies depending upon the type of mention.

31.1CLOct 12, 2020

Using Type Information to Improve Entity Coreference Resolution

Sopan Khosla, Carolyn Rose

Coreference resolution (CR) is an essential part of discourse analysis. Most recently, neural approaches have been proposed to improve over SOTA models from earlier paradigms. So far none of the published neural models leverage external semantic knowledge such as type information. This paper offers the first such model and evaluation, demonstrating modest gains in accuracy by introducing either gold standard or predicted types. In the proposed approach, type information serves both to (1) improve mention representation and (2) create a soft type consistency check between coreference candidate mentions. Our evaluation covers two different grain sizes of types over four different benchmark corpora.

0.3CLOct 5, 2020Code

MedFilter: Improving Extraction of Task-relevant Utterances from Doctor-Patient Conversations through Integration of Discourse Structure and Ontological Knowledge

Sopan Khosla, Shikhar Vashishth, Jill Fain Lehman et al.

Information extraction from conversational data is particularly challenging because the task-centric nature of conversation allows for effective communication of implicit information by humans, but is challenging for machines. The challenges may differ between utterances depending on the role of the speaker within the conversation, especially when relevant expertise is distributed asymmetrically across roles. Further, the challenges may also increase over the conversation as more shared context is built up through information communicated implicitly earlier in the dialogue. In this paper, we propose the novel modeling approach MedFilter, which addresses these insights in order to increase performance at identifying and categorizing task-relevant utterances, and in so doing, positively impacts performance at a downstream information extraction task. We evaluate this approach on a corpus of nearly 7,000 doctor-patient conversations where MedFilter is used to identify medically relevant contributions to the discussion (achieving a 10% improvement over SOTA baselines in terms of area under the PR curve). Identifying task-relevant utterances benefits downstream medical processing, achieving improvements of 15%, 105%, and 23% respectively for the extraction of symptoms, medications, and complaints.

28.3CLMay 4, 2020Code

Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques

Kundan Krishna, Sopan Khosla, Jeffrey P. Bigham et al.

Following each patient visit, physicians draft long semi-structured clinical summaries called SOAP notes. While invaluable to clinicians and researchers, creating digital SOAP notes is burdensome, contributing to physician burnout. In this paper, we introduce the first complete pipelines to leverage deep summarization models to generate these notes based on transcripts of conversations between physicians and patients. After exploring a spectrum of methods across the extractive-abstractive spectrum, we propose Cluster2Sent, an algorithm that (i) extracts important utterances relevant to each summary section; (ii) clusters together related utterances; and then (iii) generates one summary sentence per cluster. Cluster2Sent outperforms its purely abstractive counterpart by 8 ROUGE-1 points, and produces significantly more factual and coherent sentences as assessed by expert human evaluators. For reproducibility, we demonstrate similar benefits on the publicly available AMI dataset. Our results speak to the benefits of structuring summaries into sections and annotating supporting evidence when constructing summarization corpora.

32.0CLMay 21, 2018

Aff2Vec: Affect--Enriched Distributional Word Representations

Sopan Khosla, Niyati Chhaya, Kushal Chawla

Human communication includes information, opinions, and reactions. Reactions are often captured by the affective-messages in written as well as verbal communications. While there has been work in affect modeling and to some extent affective content generation, the area of affective word distributions in not well studied. Synsets and lexica capture semantic relationships across words. These models however lack in encoding affective or emotional word interpretations. Our proposed model, Aff2Vec provides a method for enriched word embeddings that are representative of affective interpretations of words. Aff2Vec outperforms the state--of--the--art in intrinsic word-similarity tasks. Further, the use of Aff2Vec representations outperforms baseline embeddings in downstream natural language understanding tasks including sentiment analysis, personality detection, and frustration prediction.