CLJan 4Code
Reasoning Over Recall: Evaluating the Efficacy of Generalist Architectures vs. Specialized Fine-Tunes in RAG-Based Mental Health Dialogue SystemsMd Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal
The deployment of Large Language Models (LLMs) in mental health counseling faces the dual challenges of hallucinations and lack of empathy. While the former may be mitigated by RAG (retrieval-augmented generation) by anchoring answers in trusted clinical sources, there remains an open question as to whether the most effective model under this paradigm would be one that is fine-tuned on mental health data, or a more general and powerful model that succeeds purely on the basis of reasoning. In this paper, we perform a direct comparison by running four open-source models through the same RAG pipeline using ChromaDB: two generalist reasoners (Qwen2.5-3B and Phi-3-Mini) and two domain-specific fine-tunes (MentalHealthBot-7B and TherapyBot-7B). We use an LLM-as-a-Judge framework to automate evaluation over 50 turns. We find a clear trend: the generalist models outperform the domain-specific ones in empathy (3.72 vs. 3.26, $p < 0.001$) in spite of being much smaller (3B vs. 7B), and all models perform well in terms of safety, but the generalist models show better contextual understanding and are less prone to overfitting as we observe in the domain-specific models. Overall, our results indicate that for RAG-based therapy systems, strong reasoning is more important than training on mental health-specific vocabulary; i.e. a well-reasoned general model would provide more empathetic and balanced support than a larger narrowly fine-tuned model, so long as the answer is already grounded in clinical evidence.
CLNov 25, 2025
A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP PipelinesMd Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal
Text normalization is an essential preprocessing step in many natural language processing (NLP) tasks, and stemming is one such normalization technique that reduces words to their base or root form. However, evaluating stemming methods is challenging because current evaluation approaches are limited and do not capture the potential harm caused by excessive stemming; therefore, it is essential to develop new approaches to evaluate stemming methods. To address this issue, this study propose a novel, task-oriented approach to evaluate stemming methods, which considers three aspects: (1) the utility of stemming using Stemming Effectiveness Score (SES), (2) the impact of stemming on downstream tasks using Model Performance Delta (MPD), and (3) the semantic similarity between stemmed and original words using Average Normalized Levenshtein Distance (ANLD), thus providing a comprehensive evaluation framework. We apply our evaluation framework to compare two stemmers for Bangla (BNLTK) and English (Snowball), and our results reveal a significant issue, prompting us to analyze their performance in detail. While the Bangla stemmer achieves the highest SES (1.67) due to effective word reduction (CR = 1.90), SES alone is insufficient because our proposed safety measure, ANLD, reveals that this high SES is due to harmful over-stemming (ANLD = 0.26), which correlates with the observed decrease in downstream performance.In contrast, the English stemmer achieves a moderate SES (1.31) with a safe meaning distance (ANLD = 0.14), allowing its word reduction to contribute positively to downstream performance; therefore, it is a more reliable stemmer. Our study provides a valuable tool for distinguishing between potential efficiency gains (high SES) and meaning preservation (low ANLD).
CVNov 22, 2025
Less Is More: An Explainable AI Framework for Lightweight Malaria ClassificationMd Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal
Background and Objective: Deep learning models have high computational needs and lack interpretability but are often the first choice for medical image classification tasks. This study addresses whether complex neural networks are essential for the simple binary classification task of malaria. We introduce the Extracted Morphological Feature Engineered (EMFE) pipeline, a transparent, reproducible, and low compute machine learning approach tailored explicitly for simple cell morphology, designed to achieve deep learning performance levels on a simple CPU only setup with the practical aim of real world deployment. Methods: The study used the NIH Malaria Cell Images dataset, with two features extracted from each cell image: the number of non background pixels and the number of holes within the cell. Logistic Regression and Random Forest were compared against ResNet18, DenseNet121, MobileNetV2, and EfficientNet across accuracy, model size, and CPU inference time. An ensemble model was created by combining Logistic Regression and Random Forests to achieve higher accuracy while retaining efficiency. Results: The single variable Logistic Regression model achieved a test accuracy of 94.80 percent with a file size of 1.2 kB and negligible inference latency (2.3 ms). The two stage ensemble improved accuracy to 97.15 percent. In contrast, the deep learning methods require 13.6 MB to 44.7 MB of storage and show significantly higher inference times (68 ms). Conclusion: This study shows that a compact feature engineering approach can produce clinically meaningful classification performance while offering gains in transparency, reproducibility, speed, and deployment feasibility. The proposed pipeline demonstrates that simple interpretable features paired with lightweight models can serve as a practical diagnostic solution for environments with limited computational resources.