Jamaludin Mohd-Yusof

LG
h-index29
3papers
7citations
Novelty32%
AI Score22

3 Papers

LGMar 18, 2025
Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis

Alexander Partin, Priyanka Vasanthakumari, Oleksandr Narykov et al.

Deep learning (DL) and machine learning (ML) models have shown promise in drug response prediction (DRP), yet their ability to generalize across datasets remains an open question, raising concerns about their real-world applicability. Due to the lack of standardized benchmarking approaches, model evaluations and comparisons often rely on inconsistent datasets and evaluation criteria, making it difficult to assess true predictive capabilities. In this work, we introduce a benchmarking framework for evaluating cross-dataset prediction generalization in DRP models. Our framework incorporates five publicly available drug screening datasets, six standardized DRP models, and a scalable workflow for systematic evaluation. To assess model generalization, we introduce a set of evaluation metrics that quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., performance drop compared to within-dataset results), enabling a more comprehensive assessment of model transferability. Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments. While several models demonstrate relatively strong cross-dataset generalization, no single model consistently outperforms across all datasets. Furthermore, we identify CTRPv2 as the most effective source dataset for training, yielding higher generalization scores across target datasets. By sharing this standardized evaluation framework with the community, our study aims to establish a rigorous foundation for model comparison, and accelerate the development of robust DRP models for real-world applications.

LGApr 1, 2025
Global explainability of a deep abstaining classifier

Sayera Dhaubhadel, Jamaludin Mohd-Yusof, Benjamin H. McMahon et al.

We present a global explainability method to characterize sources of errors in the histology prediction task of our real-world multitask convolutional neural network (MTCNN)-based deep abstaining classifier (DAC), for automated annotation of cancer pathology reports from NCI-SEER registries. Our classifier was trained and evaluated on 1.04 million hand-annotated samples and makes simultaneous predictions of cancer site, subsite, histology, laterality, and behavior for each report. The DAC framework enables the model to abstain on ambiguous reports and/or confusing classes to achieve a target accuracy on the retained (non-abstained) samples, but at the cost of decreased coverage. Requiring 97% accuracy on the histology task caused our model to retain only 22% of all samples, mostly the less ambiguous and common classes. Local explainability with the GradInp technique provided a computationally efficient way of obtaining contextual reasoning for thousands of individual predictions. Our method, involving dimensionality reduction of approximately 13000 aggregated local explanations, enabled global identification of sources of errors as hierarchical complexity among classes, label noise, insufficient information, and conflicting evidence. This suggests several strategies such as exclusion criteria, focused annotation, and reduced penalties for errors involving hierarchically related classes to iteratively improve our DAC in this complex real-world implementation.

LGSep 10, 2020
Why I'm not Answering: Understanding Determinants of Classification of an Abstaining Classifier for Cancer Pathology Reports

Sayera Dhaubhadel, Jamaludin Mohd-Yusof, Kumkum Ganguly et al.

Safe deployment of deep learning systems in critical real world applications requires models to make very few mistakes, and only under predictable circumstances. In this work, we address this problem using an abstaining classifier that is tuned to have $>$95% accuracy, and then identify the determinants of abstention using LIME. Essentially, we are training our model to learn the attributes of pathology reports that are likely to lead to incorrect classifications, albeit at the cost of reduced sensitivity. We demonstrate an abstaining classifier in a multitask setting for classifying cancer pathology reports from the NCI SEER cancer registries on six tasks of interest. For these tasks, we reduce the classification error rate by factors of 2--5 by abstaining on 25--45% of the reports. For the specific task of classifying cancer site, we are able to identify metastasis, reports involving lymph nodes, and discussion of multiple cancer sites as responsible for many of the classification mistakes, and observe that the extent and types of mistakes vary systematically with cancer site (e.g., breast, lung, and prostate). When combining across three of the tasks, our model classifies 50% of the reports with an accuracy greater than 95% for three of the six tasks\edit, and greater than 85% for all six tasks on the retained samples. Furthermore, we show that LIME provides a better determinant of classification than measures of word occurrence alone. By combining a deep abstaining classifier with feature identification using LIME, we are able to identify concepts responsible for both correctness and abstention when classifying cancer sites from pathology reports. The improvement of LIME over keyword searches is statistically significant, presumably because words are assessed in context and have been identified as a local determinant of classification.