Subhankar Maity

CL
h-index13
20papers
239citations
Novelty28%
AI Score26

20 Papers

CLSep 26, 2024
Code Generation and Algorithmic Problem Solving Using Llama 3.1 405B

Aniket Deroy, Subhankar Maity

Code generation by Llama 3.1 models, such as Meta's Llama 3.1 405B, represents a significant advancement in the field of artificial intelligence, particularly in natural language processing and programming automation. This paper explores the capabilities and applications of Llama-driven code generation, highlighting its ability to translate natural language prompts into executable code across multiple programming languages. Key features include contextual awareness, multi-language support, and enhanced debugging and optimization functionalities. By examining these aspects, we illustrate how Llama can serve as a versatile tool for developers of all skill levels, improving productivity and efficiency in software development. The potential implications for education, industry, and the future of coding practices are also discussed, underscoring the transformative impact of AI in programming. Experimentation shows that while Llama 3.1 405B performs well with simple algorithmic and data structure based problems, it still struggles with problems on Quantum Computing, Bioinformatics, and Artificial Intelligence.

CLDec 17, 2023Code
Multi-Label Classification of COVID-Tweets Using Large Language Models

Aniket Deroy, Subhankar Maity

Vaccination is important to minimize the risk and spread of various diseases. In recent years, vaccination has been a key step in countering the COVID-19 pandemic. However, many people are skeptical about the use of vaccines for various reasons, including the politics involved, the potential side effects of vaccines, etc. The goal in this task is to build an effective multi-label classifier to label a social media post (particularly, a tweet) according to the specific concern(s) towards vaccines as expressed by the author of the post. We tried three different models-(a) Supervised BERT-large-uncased, (b) Supervised HateXplain model, and (c) Zero-Shot GPT-3.5 Turbo model. The Supervised BERT-large-uncased model performed best in our case. We achieved a macro-F1 score of 0.66, a Jaccard similarity score of 0.66, and received the sixth rank among other submissions. Code is available at-https://github.com/anonmous1981/AISOME

CLOct 14, 2024
Generative AI and Its Impact on Personalized Intelligent Tutoring Systems

Subhankar Maity, Aniket Deroy

Generative Artificial Intelligence (AI) is revolutionizing educational technology by enabling highly personalized and adaptive learning environments within Intelligent Tutoring Systems (ITS). This report delves into the integration of Generative AI, particularly large language models (LLMs) like GPT-4, into ITS to enhance personalized education through dynamic content generation, real-time feedback, and adaptive learning pathways. We explore key applications such as automated question generation, customized feedback mechanisms, and interactive dialogue systems that respond to individual learner needs. The report also addresses significant challenges, including ensuring pedagogical accuracy, mitigating inherent biases in AI models, and maintaining learner engagement. Future directions highlight the potential advancements in multimodal AI integration, emotional intelligence in tutoring systems, and the ethical implications of AI-driven education. By synthesizing current research and practical implementations, this report underscores the transformative potential of Generative AI in creating more effective, equitable, and engaging educational experiences.

CLJan 13, 2024
A Novel Multi-Stage Prompting Approach for Language Agnostic MCQ Generation using GPT

Subhankar Maity, Aniket Deroy, Sudeshna Sarkar

We introduce a multi-stage prompting approach (MSP) for the generation of multiple choice questions (MCQs), harnessing the capabilities of GPT models such as text-davinci-003 and GPT-4, renowned for their excellence across various NLP tasks. Our approach incorporates the innovative concept of chain-of-thought prompting, a progressive technique in which the GPT model is provided with a series of interconnected cues to guide the MCQ generation process. Automated evaluations consistently demonstrate the superiority of our proposed MSP method over the traditional single-stage prompting (SSP) baseline, resulting in the production of high-quality distractors. Furthermore, the one-shot MSP technique enhances automatic evaluation results, contributing to improved distractor generation in multiple languages, including English, German, Bengali, and Hindi. In human evaluations, questions generated using our approach exhibit superior levels of grammaticality, answerability, and difficulty, highlighting its efficacy in various languages.

CLMay 19, 2024
Exploring the Capabilities of Prompted Large Language Models in Educational and Assessment Applications

Subhankar Maity, Aniket Deroy, Sudeshna Sarkar

In the era of generative artificial intelligence (AI), the fusion of large language models (LLMs) offers unprecedented opportunities for innovation in the field of modern education. We embark on an exploration of prompted LLMs within the context of educational and assessment applications to uncover their potential. Through a series of carefully crafted research questions, we investigate the effectiveness of prompt-based techniques in generating open-ended questions from school-level textbooks, assess their efficiency in generating open-ended questions from undergraduate-level technical textbooks, and explore the feasibility of employing a chain-of-thought inspired multi-stage prompting approach for language-agnostic multiple-choice question (MCQ) generation. Additionally, we evaluate the ability of prompted LLMs for language learning, exemplified through a case study in the low-resource Indian language Bengali, to explain Bengali grammatical errors. We also evaluate the potential of prompted LLMs to assess human resource (HR) spoken interview transcripts. By juxtaposing the capabilities of LLMs with those of human experts across various educational tasks and domains, our aim is to shed light on the potential and limitations of LLMs in reshaping educational practices.

CLOct 14, 2024
Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models

Shubham Kumar Nigam, Aniket Deroy, Subhankar Maity et al.

This study investigates judgment prediction in a realistic scenario within the context of Indian judgments, utilizing a range of transformer-based models, including InLegalBERT, BERT, and XLNet, alongside LLMs such as Llama-2 and GPT-3.5 Turbo. In this realistic scenario, we simulate how judgments are predicted at the point when a case is presented for a decision in court, using only the information available at that time, such as the facts of the case, statutes, precedents, and arguments. This approach mimics real-world conditions, where decisions must be made without the benefit of hindsight, unlike retrospective analyses often found in previous studies. For transformer models, we experiment with hierarchical transformers and the summarization of judgment facts to optimize input for these models. Our experiments with LLMs reveal that GPT-3.5 Turbo excels in realistic scenarios, demonstrating robust performance in judgment prediction. Furthermore, incorporating additional legal information, such as statutes and precedents, significantly improves the outcome of the prediction task. The LLMs also provide explanations for their predictions. To evaluate the quality of these predictions and explanations, we introduce two human evaluation metrics: Clarity and Linking. Our findings from both automatic and human evaluations indicate that, despite advancements in LLMs, they are yet to achieve expert-level performance in judgment prediction and explanation tasks.

CLOct 12, 2024
The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models

Subhankar Maity, Aniket Deroy

In recent years, large language models (LLMs) and generative AI have revolutionized natural language processing (NLP), offering unprecedented capabilities in education. This chapter explores the transformative potential of LLMs in automated question generation and answer assessment. It begins by examining the mechanisms behind LLMs, emphasizing their ability to comprehend and generate human-like text. The chapter then discusses methodologies for creating diverse, contextually relevant questions, enhancing learning through tailored, adaptive strategies. Key prompting techniques, such as zero-shot and chain-of-thought prompting, are evaluated for their effectiveness in generating high-quality questions, including open-ended and multiple-choice formats in various languages. Advanced NLP methods like fine-tuning and prompt-tuning are explored for their role in generating task-specific questions, despite associated costs. The chapter also covers the human evaluation of generated questions, highlighting quality variations across different methods and areas for improvement. Furthermore, it delves into automated answer assessment, demonstrating how LLMs can accurately evaluate responses, provide constructive feedback, and identify nuanced understanding or misconceptions. Examples illustrate both successful assessments and areas needing improvement. The discussion underscores the potential of LLMs to replace costly, time-consuming human assessments when appropriately guided, showcasing their advanced understanding and reasoning capabilities in streamlining educational processes.

CYOct 18, 2024
Human-Centric eXplainable AI in Education

Subhankar Maity, Aniket Deroy

As artificial intelligence (AI) becomes more integrated into educational environments, how can we ensure that these systems are both understandable and trustworthy? The growing demand for explainability in AI systems is a critical area of focus. This paper explores Human-Centric eXplainable AI (HCXAI) in the educational landscape, emphasizing its role in enhancing learning outcomes, fostering trust among users, and ensuring transparency in AI-driven tools, particularly through the innovative use of large language models (LLMs). What challenges arise in the implementation of explainable AI in educational contexts? This paper analyzes these challenges, addressing the complexities of AI models and the diverse needs of users. It outlines comprehensive frameworks for developing HCXAI systems that prioritize user understanding and engagement, ensuring that educators and students can effectively interact with these technologies. Furthermore, what steps can educators, developers, and policymakers take to create more effective, inclusive, and ethically responsible AI solutions in education? The paper provides targeted recommendations to address this question, highlighting the necessity of prioritizing explainability. By doing so, how can we leverage AI's transformative potential to foster equitable and engaging educational experiences that support diverse learners?

CLOct 16, 2024
MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation

Aniket Deroy, Subhankar Maity, Sudeshna Sarkar

Automatic question generation is a critical task that involves evaluating question quality by considering factors such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack. However, human evaluations are costly and impractical for large-scale samples of generated questions. Therefore, we propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation process for questions generated by automated question generation systems. We experimented with several state-of-the-art LLMs, such as GPT-4, Gemini, and Llama2-70b. We observed that the scores of human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved when using the feedback-based approach called MIRROR, tending to be closer to the human baseline scores. Furthermore, we observed that Pearson's correlation coefficient between GPT-4 and human experts improved when using our proposed feedback-based approach, MIRROR, compared to direct prompting for evaluation. Error analysis shows that our proposed approach, MIRROR, significantly helps to improve relevance and appropriateness.

CLJan 29, 2025
Leveraging In-Context Learning and Retrieval-Augmented Generation for Automatic Question Generation in Educational Domains

Subhankar Maity, Aniket Deroy, Sudeshna Sarkar

Question generation in education is a time-consuming and cognitively demanding task, as it requires creating questions that are both contextually relevant and pedagogically sound. Current automated question generation methods often generate questions that are out of context. In this work, we explore advanced techniques for automated question generation in educational contexts, focusing on In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and a novel Hybrid Model that merges both methods. We implement GPT-4 for ICL using few-shot examples and BART with a retrieval module for RAG. The Hybrid Model combines RAG and ICL to address these issues and improve question quality. Evaluation is conducted using automated metrics, followed by human evaluation metrics. Our results show that both the ICL approach and the Hybrid Model consistently outperform other methods, including baseline models, by generating more contextually accurate and relevant questions.

CLNov 6, 2024
YouTube Comments Decoded: Leveraging LLMs for Low Resource Language Classification

Aniket Deroy, Subhankar Maity

Sarcasm detection is a significant challenge in sentiment analysis, particularly due to its nature of conveying opinions where the intended meaning deviates from the literal expression. This challenge is heightened in social media contexts where code-mixing, especially in Dravidian languages, is prevalent. Code-mixing involves the blending of multiple languages within a single utterance, often with non-native scripts, complicating the task for systems trained on monolingual data. This shared task introduces a novel gold standard corpus designed for sarcasm and sentiment detection within code-mixed texts, specifically in Tamil-English and Malayalam-English languages. The primary objective of this task is to identify sarcasm and sentiment polarity within a code-mixed dataset of Tamil-English and Malayalam-English comments and posts collected from social media platforms. Each comment or post is annotated at the message level for sentiment polarity, with particular attention to the challenges posed by class imbalance, reflecting real-world scenarios.In this work, we experiment with state-of-the-art large language models like GPT-3.5 Turbo via prompting to classify comments into sarcastic or non-sarcastic categories. We obtained a macro-F1 score of 0.61 for Tamil language. We obtained a macro-F1 score of 0.50 for Malayalam language.

CLNov 14, 2024
HateGPT: Unleashing GPT-3.5 Turbo to Combat Hate Speech on X

Aniket Deroy, Subhankar Maity

The widespread use of social media platforms like Twitter and Facebook has enabled people of all ages to share their thoughts and experiences, leading to an immense accumulation of user-generated content. However, alongside the benefits, these platforms also face the challenge of managing hate speech and offensive content, which can undermine rational discourse and threaten democratic values. As a result, there is a growing need for automated methods to detect and mitigate such content, especially given the complexity of conversations that may require contextual analysis across multiple languages, including code-mixed languages like Hinglish, German-English, and Bangla. We participated in the English task where we have to classify English tweets into two categories namely Hate and Offensive and Non Hate-Offensive. In this work, we experiment with state-of-the-art large language models like GPT-3.5 Turbo via prompting to classify tweets into Hate and Offensive or Non Hate-Offensive. In this study, we evaluate the performance of a classification model using Macro-F1 scores across three distinct runs. The Macro-F1 score, which balances precision and recall across all classes, is used as the primary metric for model evaluation. The scores obtained are 0.756 for run 1, 0.751 for run 2, and 0.754 for run 3, indicating a high level of performance with minimal variance among the runs. The results suggest that the model consistently performs well in terms of precision and recall, with run 1 showing the highest performance. These findings highlight the robustness and reliability of the model across different runs.

CLNov 6, 2024
Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages

Aniket Deroy, Subhankar Maity

Language Identification (LI) is crucial for various natural language processing tasks, serving as a foundational step in applications such as sentiment analysis, machine translation, and information retrieval. In multilingual societies like India, particularly among the youth engaging on social media, text often exhibits code-mixing, blending local languages with English at different linguistic levels. This phenomenon presents formidable challenges for LI systems, especially when languages intermingle within single words. Dravidian languages, prevalent in southern India, possess rich morphological structures yet suffer from under-representation in digital platforms, leading to the adoption of Roman or hybrid scripts for communication. This paper introduces a prompt based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages. In this work, we leveraged GPT-3.5 Turbo to understand whether the large language models is able to correctly classify words into correct categories. Our findings show that the Kannada model consistently outperformed the Tamil model across most metrics, indicating a higher accuracy and reliability in identifying and categorizing Kannada language instances. In contrast, the Tamil model showed moderate performance, particularly needing improvement in precision and recall.

CLApr 8, 2025
STRIVE: A Think & Improve Approach with Iterative Refinement for Enhancing Question Quality Estimation

Aniket Deroy, Subhankar Maity

Automatically assessing question quality is crucial for educators as it saves time, ensures consistency, and provides immediate feedback for refining teaching materials. We propose a novel methodology called STRIVE (Structured Thinking and Refinement with multiLLMs for Improving Verified Question Estimation) using a series of Large Language Models (LLMs) for automatic question evaluation. This approach aims to improve the accuracy and depth of question quality assessment, ultimately supporting diverse learners and enhancing educational practices. The method estimates question quality in an automated manner by generating multiple evaluations based on the strengths and weaknesses of the provided question and then choosing the best solution generated by the LLM. Then the process is improved by iterative review and response with another LLM until the evaluation metric values converge. This sophisticated method of evaluating question quality improves the estimation of question quality by automating the task of question quality evaluation. Correlation scores show that using this proposed method helps to improve correlation with human judgments compared to the baseline method. Error analysis shows that metrics like relevance and appropriateness improve significantly relative to human judgments by using STRIVE.

CLApr 8, 2025
Leveraging Prompt-Tuning for Bengali Grammatical Error Explanation Using Large Language Models

Subhankar Maity, Aniket Deroy

We propose a novel three-step prompt-tuning method for Bengali Grammatical Error Explanation (BGEE) using state-of-the-art large language models (LLMs) such as GPT-4, GPT-3.5 Turbo, and Llama-2-70b. Our approach involves identifying and categorizing grammatical errors in Bengali sentences, generating corrected versions of the sentences, and providing natural language explanations for each identified error. We evaluate the performance of our BGEE system using both automated evaluation metrics and human evaluation conducted by experienced Bengali language experts. Our proposed prompt-tuning approach shows that GPT-4, the best performing LLM, surpasses the baseline model in automated evaluation metrics, with a 5.26% improvement in F1 score and a 6.95% improvement in exact match. Furthermore, compared to the previous baseline, GPT-4 demonstrates a decrease of 25.51% in wrong error type and a decrease of 26.27% in wrong error explanation. However, the results still lag behind the human baseline.

CLApr 8, 2025
Towards Smarter Hiring: Are Zero-Shot and Few-Shot Pre-trained LLMs Ready for HR Spoken Interview Transcript Analysis?

Subhankar Maity, Aniket Deroy, Sudeshna Sarkar

This research paper presents a comprehensive analysis of the performance of prominent pre-trained large language models (LLMs), including GPT-4 Turbo, GPT-3.5 Turbo, text-davinci-003, text-babbage-001, text-curie-001, text-ada-001, llama-2-7b-chat, llama-2-13b-chat, and llama-2-70b-chat, in comparison to expert human evaluators in providing scores, identifying errors, and offering feedback and improvement suggestions to candidates during mock HR (Human Resources) interviews. We introduce a dataset called HURIT (Human Resource Interview Transcripts), which comprises 3,890 HR interview transcripts sourced from real-world HR interview scenarios. Our findings reveal that pre-trained LLMs, particularly GPT-4 Turbo and GPT-3.5 Turbo, exhibit commendable performance and are capable of producing evaluations comparable to those of expert human evaluators. Although these LLMs demonstrate proficiency in providing scores comparable to human experts in terms of human evaluation metrics, they frequently fail to identify errors and offer specific actionable advice for candidate performance improvement in HR interviews. Our research suggests that the current state-of-the-art pre-trained LLMs are not fully conducive for automatic deployment in an HR interview assessment. Instead, our findings advocate for a human-in-the-loop approach, to incorporate manual checks for inconsistencies and provisions for improving feedback quality as a more suitable strategy.

CLNov 12, 2024
CryptoLLM: Unleashing the Power of Prompted LLMs for SmartQnA and Classification of Crypto Posts

Aniket Deroy, Subhankar Maity

The rapid growth of social media has resulted in an large volume of user-generated content, particularly in niche domains such as cryptocurrency. This task focuses on developing robust classification models to accurately categorize cryptocurrency-related social media posts into predefined classes, including but not limited to objective, positive, negative, etc. Additionally, the task requires participants to identify the most relevant answers from a set of posts in response to specific questions. By leveraging advanced LLMs, this research aims to enhance the understanding and filtering of cryptocurrency discourse, thereby facilitating more informed decision-making in this volatile sector. We have used a prompt-based technique to solve the classification task for reddit posts and twitter posts. Also, we have used 64-shot technique along with prompts on GPT-4-Turbo model to determine whether a answer is relevant to a question or not.

CLNov 11, 2024
Cancer-Answer: Empowering Cancer Care with Advanced Large Language Models

Aniket Deroy, Subhankar Maity

Gastrointestinal (GI) tract cancers account for a substantial portion of the global cancer burden, where early diagnosis is critical for improved management and patient outcomes. The complex aetiologies and overlapping symptoms across GI cancers often delay diagnosis, leading to suboptimal treatment strategies. Cancer-related queries are crucial for timely diagnosis, treatment, and patient education, as access to accurate, comprehensive information can significantly influence outcomes. However, the complexity of cancer as a disease, combined with the vast amount of available data, makes it difficult for clinicians and patients to quickly find precise answers. To address these challenges, we leverage large language models (LLMs) such as GPT-3.5 Turbo to generate accurate, contextually relevant responses to cancer-related queries. Pre-trained with medical data, these models provide timely, actionable insights that support informed decision-making in cancer diagnosis and care, ultimately improving patient outcomes. We calculate two metrics: A1 (which represents the fraction of entities present in the model-generated answer compared to the gold standard) and A2 (which represents the linguistic correctness and meaningfulness of the model-generated answer with respect to the gold standard), achieving maximum values of 0.546 and 0.881, respectively.

CLNov 7, 2024
RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval

Aniket Deroy, Subhankar Maity

Code-mixing, the integration of lexical and grammatical elements from multiple languages within a single sentence, is a widespread linguistic phenomenon, particularly prevalent in multilingual societies. In India, social media users frequently engage in code-mixed conversations using the Roman script, especially among migrant communities who form online groups to share relevant local information. This paper focuses on the challenges of extracting relevant information from code-mixed conversations, specifically within Roman transliterated Bengali mixed with English. This study presents a novel approach to address these challenges by developing a mechanism to automatically identify the most relevant answers from code-mixed conversations. We have experimented with a dataset comprising of queries and documents from Facebook, and Query Relevance files (QRels) to aid in this task. Our results demonstrate the effectiveness of our approach in extracting pertinent information from complex, code-mixed digital conversations, contributing to the broader field of natural language processing in multilingual and informal text environments. We use GPT-3.5 Turbo via prompting alongwith using the sequential nature of relevant documents to frame a mathematical model which helps to detect relevant documents corresponding to a query.

CLJun 21, 2024
How Effective is GPT-4 Turbo in Generating School-Level Questions from Textbooks Based on Bloom's Revised Taxonomy?

Subhankar Maity, Aniket Deroy, Sudeshna Sarkar

We evaluate the effectiveness of GPT-4 Turbo in generating educational questions from NCERT textbooks in zero-shot mode. Our study highlights GPT-4 Turbo's ability to generate questions that require higher-order thinking skills, especially at the "understanding" level according to Bloom's Revised Taxonomy. While we find a notable consistency between questions generated by GPT-4 Turbo and those assessed by humans in terms of complexity, there are occasional differences. Our evaluation also uncovers variations in how humans and machines evaluate question quality, with a trend inversely related to Bloom's Revised Taxonomy levels. These findings suggest that while GPT-4 Turbo is a promising tool for educational question generation, its efficacy varies across different cognitive levels, indicating a need for further refinement to fully meet educational standards.