Elham Dolatabadi

CL
h-index20
24papers
413citations
Novelty37%
AI Score56

24 Papers

CLFeb 7, 2023Code
Bringing the State-of-the-Art to Customers: A Neural Agent Assistant Framework for Customer Service Support

Stephen Obadinma, Faiza Khan Khattak, Shirley Wang et al. · utoronto

Building Agent Assistants that can help improve customer service support requires inputs from industry users and their customers, as well as knowledge about state-of-the-art Natural Language Processing (NLP) technology. We combine expertise from academia and industry to bridge the gap and build task/domain-specific Neural Agent Assistants (NAA) with three high-level components for: (1) Intent Identification, (2) Context Retrieval, and (3) Response Generation. In this paper, we outline the pipeline of the NAA's core system and also present three case studies in which three industry partners successfully adapt the framework to find solutions to their unique challenges. Our findings suggest that a collaborative process is instrumental in spurring the development of emerging NLP models for Conversational AI tasks in industry. The full reference implementation code and results are available at \url{https://github.com/VectorInstitute/NAA}

CLMay 26
Keyphrase Generative Representation of Youth Crisis Conversations Beyond Static Taxonomies

Abeer Badawi, Will Aitken, Lydia Sequeira et al.

Crisis Responders (CRs) rapidly assess thousands of youth SMS conversations each year to identify mental health concerns and guide support. Yet youth distress is increasingly expressed through evolving and context-specific language that often does not fit fixed-label taxonomies. This work analyzed 703,975 de-identified Kids Help Phone conversations (2018-2023) and expanded KHP's 19-label issue taxonomy into a 39-label hierarchical schema. We then introduce Keyphrase Generative Representation (KGR), a constrained LLM generating concise, conversation-specific keyphrases, evaluated across 129 conversations and 387 expert annotations. The expanded taxonomy achieved expert consensus reliability, with an accuracy of 0.96, and expert review found that 81% of keyphrases accurately reflected content and 74% improved clarity. KGR surfaced identity-linked themes absent from the fixed taxonomy, including immigration problems and caregiver burden, and supported a topic-retrieval workflow that increased accuracy from 0.25 to 0.70 (+0.45) over the manual analyst process. KGR marks a shift toward hybrid, interpretable generative representations that extend crisis response beyond static taxonomies to surface emerging and culturally grounded patterns of youth distress.

AIJan 26Code
Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human Evaluation

Abeer Badawi, Md Tahmid Rahman Laskar, Elahe Rahimi et al.

The escalating global mental health crisis, marked by persistent treatment gaps, availability, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. While LLMs offer potential for accessible emotional assistance, their reliability, therapeutic relevance, and alignment with human standards remain challenging to address. This paper introduces a human-grounded evaluation methodology designed to assess LLM generated responses in therapeutic dialogue. Our approach involved curating a dataset of 500 mental health conversations from datasets with real-world scenario questions and evaluating the responses generated by nine diverse LLMs, including closed source and open source models. More specifically, these responses were evaluated by two psychiatric trained experts, who independently rated each on a 5 point Likert scale across a comprehensive 6 attribute rubric. This rubric captures Cognitive Support and Affective Resonance, providing a multidimensional perspective on therapeutic quality. Our analysis reveals that LLMs provide strong cognitive reliability by producing safe, coherent, and clinically appropriate information, but they demonstrate unstable affective alignment. Although closed source models (e.g., GPT-4o) offer balanced therapeutic responses, open source models show greater variability and emotional flatness. We reveal a persistent cognitive-affective gap and highlight the need for failure aware, clinically grounded evaluation frameworks that prioritize relational sensitivity alongside informational accuracy in mental health oriented LLMs. We advocate for balanced evaluation protocols with human in the loop that center on therapeutic sensitivity and provide a framework to guide the responsible design and clinical oversight of mental health oriented conversational AI.

LGAug 23, 2024
Exploring Bias and Prediction Metrics to Characterise the Fairness of Machine Learning for Equity-Centered Public Health Decision-Making: A Narrative Review

Shaina Raza, Arash Shaban-Nejad, Elham Dolatabadi et al.

Background: The rapid advancement of Machine Learning (ML) represents novel opportunities to enhance public health research, surveillance, and decision-making. However, there is a lack of comprehensive understanding of algorithmic bias, systematic errors in predicted population health outcomes, resulting from the public health application of ML. The objective of this narrative review is to explore the types of bias generated by ML and quantitative metrics to assess these biases. Methods : We performed search on PubMed, MEDLINE, IEEE (Institute of Electrical and Electronics Engineers), ACM (Association for Computing Machinery) Digital Library, Science Direct, and Springer Nature. We used keywords to identify studies describing types of bias and metrics to measure these in the domain of ML and public and population health published in English between 2008 and 2023, inclusive. Results: A total of 72 articles met the inclusion criteria. Our review identified the commonly described types of bias and quantitative metrics to assess these biases from an equity perspective. Conclusion : The review will help formalize the evaluation framework for ML on public health from an equity perspective.

LGMay 12Code
Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Mohammed Saidul Islam, Negin Baghbanzadeh, Farnaz Kohankhaki et al.

Evaluation of foundation models often rely on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.

CLJun 1, 2025Code
Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge

Md Tahmid Rahman Laskar, Israt Jahan, Elham Dolatabadi et al.

Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction, even in zero-shot scenarios. However, evaluating LLMs in this task remains challenging due to their ability to generate human-like text, often producing synonyms or abbreviations of gold-standard answers, making traditional automatic evaluation metrics unreliable. On the other hand, while human evaluation is more reliable, it is costly and time-consuming, making it impractical for real-world applications. This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction. We benchmark 8 LLMs as judges to evaluate the responses generated by 5 other LLMs across 3 biomedical relation extraction datasets. Unlike other text-generation tasks, we observe that LLM-based judges perform quite poorly (usually below 50% accuracy) in the biomedical relation extraction task. Our findings reveal that it happens mainly because relations extracted by LLMs do not adhere to any standard format. To address this, we propose structured output formatting for LLM-generated responses that helps LLM-Judges to improve their performance by about 15% (on average). We also introduce a domain adaptation technique to further enhance LLM-Judge performance by effectively transferring knowledge between datasets. We release both our human-annotated and LLM-annotated judgment data (36k samples in total) for public use here: https://github.com/tahmedge/llm_judge_biomedical_re.

CLFeb 17, 2025Code
VLDBench Evaluating Multimodal Disinformation with Regulatory Alignment

Shaina Raza, Ashmal Vayani, Aditya Jain et al.

Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI safety benchmarks focus on single modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitate credible news, remains largely unaddressed. We introduce the Vision-Language Disinformation Detection Benchmark (VLDBench), the first large-scale resource supporting both unimodal (text-only) and multimodal (text + image) disinformation detection. VLDBench comprises approximately 62,000 labeled text-image pairs across 13 categories, curated from 58 news outlets. Using a semi-automated pipeline followed by expert review, 22 domain experts invested over 500 hours to produce high-quality annotations with substantial inter-annotator agreement. Evaluations of state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) on VLDBench show that incorporating visual cues improves detection accuracy by 5 to 35 percentage points over text-only models. VLDBench provides data and code for evaluation, fine-tuning, and robustness testing to support disinformation analysis. Developed in alignment with AI governance frameworks (e.g., the MIT AI Risk Repository), VLDBench offers a principled foundation for advancing trustworthy disinformation detection in multimodal media. Project: https://vectorinstitute.github.io/VLDBench/ Dataset: https://huggingface.co/datasets/vector-institute/VLDBench Code: https://github.com/VectorInstitute/VLDBench

CYJul 2, 2024
Practical Guide for Causal Pathways and Sub-group Disparity Analysis

Farnaz Kohankhaki, Shaina Raza, Oluwanifemi Bamgbose et al.

In this study, we introduce the application of causal disparity analysis to unveil intricate relationships and causal pathways between sensitive attributes and the targeted outcomes within real-world observational data. Our methodology involves employing causal decomposition analysis to quantify and examine the causal interplay between sensitive attributes and outcomes. We also emphasize the significance of integrating heterogeneity assessment in causal disparity analysis to gain deeper insights into the impact of sensitive attributes within specific sub-groups on outcomes. Our two-step investigation focuses on datasets where race serves as the sensitive attribute. The results on two datasets indicate the benefit of leveraging causal analysis and heterogeneity assessment not only for quantifying biases in the data but also for disentangling their influences on outcomes. We demonstrate that the sub-groups identified by our approach to be affected the most by disparities are the ones with the largest ML classification errors. We also show that grouping the data only based on a sensitive attribute is not enough, and through these analyses, we can find sub-groups that are directly affected by disparities. We hope that our findings will encourage the adoption of such methodologies in future ethical AI practices and bias audits, fostering a more equitable and fair technological landscape.

CLOct 21, 2025Code
When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation

Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar et al.

Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale, reliability, often relying on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real scenarios datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k}reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and codes at: https://github.com/abeerbadawi/MentalBench/

IVMar 18, 2025Code
Advancing Medical Representation Learning Through High-Quality Data

Negin Baghbanzadeh, Adibvafa Fallahpour, Yasaman Parhizkar et al.

Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality-not just size-drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.

CVMar 3, 2025Code
A Shared Encoder Approach to Multimodal Representation Learning

Shuvendu Roy, Franklin Ogidi, Ali Etemad et al.

Multimodal representation learning has demonstrated remarkable potential in enabling models to process and integrate diverse data modalities, such as text and images, for improved understanding and performance. While the medical domain can benefit significantly from this paradigm, the scarcity of paired multimodal data and reliance on proprietary or pretrained encoders pose significant challenges. In this work, we present a shared encoder framework for multimodal representation learning tailored to the medical domain. Our approach employs a single set of encoder parameters shared across modalities, augmented with learnable modality features. Empirical results demonstrate that our shared encoder idea achieves superior performance compared to separate modality-specific encoders, demonstrating improved generalization in data-constrained settings. Notably, the performance gains are more pronounced with fewer training examples, underscoring the efficiency of our shared encoder framework for real-world medical applications with limited data. Our code and experiment setup are available at https://github.com/VectorInstitute/shared_encoder.

CVMar 1
When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh et al.

Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.

CVMar 14, 2025
Similarity-Aware Token Pruning: Your VLM but Faster

Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi et al.

The computational demands of Vision Transformers (ViTs) and Vision-Language Models (VLMs) remain a significant challenge due to the quadratic complexity of self-attention. While token pruning offers a promising solution, existing methods often introduce training overhead or fail to adapt dynamically across layers. We present SAINT, a training-free token pruning framework that leverages token similarity and a graph-based formulation to dynamically optimize pruning rates and redundancy thresholds. Through systematic analysis, we identify a universal three-stage token evolution process (aligner-explorer-aggregator) in transformers, enabling aggressive pruning in early stages without sacrificing critical information. For ViTs, SAINT doubles the throughput of ViT-H/14 at 224px with only 0.6% accuracy loss on ImageNet-1K, surpassing the closest competitor by 0.8%. For VLMs, we apply SAINT in three modes: ViT-only, LLM-only, and hybrid. SAINT reduces LLaVA-13B's tokens by 75%, achieving latency comparable to LLaVA-7B with less than 1% performance loss across benchmarks. Our work establishes a unified, practical framework for efficient inference in ViTs and VLMs.

LGMar 21, 2025
A Flexible Fairness Framework with Surrogate Loss Reweighting for Addressing Sociodemographic Disparities

Wen Xu, Elham Dolatabadi

This paper presents a new algorithmic fairness framework called $\boldsymbolα$-$\boldsymbolβ$ Fair Machine Learning ($\boldsymbolα$-$\boldsymbolβ$ FML), designed to optimize fairness levels across sociodemographic attributes. Our framework employs a new family of surrogate loss functions, paired with loss reweighting techniques, allowing precise control over fairness-accuracy trade-offs through tunable hyperparameters $\boldsymbolα$ and $\boldsymbolβ$. To efficiently solve the learning objective, we propose Parallel Stochastic Gradient Descent with Surrogate Loss (P-SGD-S) and establish convergence guarantees for both convex and nonconvex loss functions. Experimental results demonstrate that our framework improves overall accuracy while reducing fairness violations, offering a smooth trade-off between standard empirical risk minimization and strict minimax fairness. Results across multiple datasets confirm its adaptability, ensuring fairness improvements without excessive performance degradation.

LGMay 22, 2025
Automated Capability Evaluation of Foundation Models

Arash Afkanpour, Omkar Dige, Fatemeh Tavakoli et al.

Current evaluation frameworks for foundation models rely heavily on static, manually curated benchmarks, limiting their ability to capture the full breadth of model capabilities. This paper introduces Active learning for Capability Evaluation (ACE), a novel framework for scalable, automated, and fine-grained evaluation of foundation models. ACE leverages the knowledge embedded in powerful frontier models to decompose a domain into semantically meaningful capabilities and generates diverse evaluation tasks, significantly reducing human effort. In Mathematics, ACE generated 433 capabilities and 11,800 tasks, covering 94% of Wikipedia-defined skills in the domain while introducing novel, coherent ones. To maximize efficiency, ACE fits a capability model in latent semantic space, allowing reliable approximation of a subject model's performance by evaluating only a subset of capabilities via active learning. It reaches within 0.01 RMSE of exhaustive evaluation by evaluating less than half of capabilities. Compared to static datasets, ACE provides more balanced coverage and uncovers fine-grained differences that aggregate metrics fail to capture. Our results demonstrate that ACE provides a more complete and informative picture of model capabilities, which is essential for safe and well-informed deployment of foundation models.

CVJun 3, 2025
Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning

Negin Baghbanzadeh, Sajad Ashkezari, Elham Dolatabadi et al.

Compound figures, which are multi-panel composites containing diverse subfigures, are ubiquitous in biomedical literature, yet large-scale subfigure extraction remains largely unaddressed. Prior work on subfigure extraction has been limited in both dataset size and generalizability, leaving a critical open question: How does high-fidelity image-text alignment via large-scale subfigure extraction impact representation learning in vision-language models? We address this gap by introducing a scalable subfigure extraction pipeline based on transformer-based object detection, trained on a synthetic corpus of 500,000 compound figures, and achieving state-of-the-art performance on both ImageCLEF 2016 and synthetic benchmarks. Using this pipeline, we release OPEN-PMC-18M, a large-scale high quality biomedical vision-language dataset comprising 18 million clinically relevant subfigure-caption pairs spanning radiology, microscopy, and visible light photography. We train and evaluate vision-language models on our curated datasets and show improved performance across retrieval, zero-shot classification, and robustness benchmarks, outperforming existing baselines. We release our dataset, models, and code to support reproducible benchmarks and further study into biomedical vision-language modeling and representation learning.

HCFeb 21, 2025
Position: Beyond Assistance -- Reimagining LLMs as Ethical and Adaptive Co-Creators in Mental Health Care

Abeer Badawi, Md Tahmid Rahman Laskar, Jimmy Xiangji Huang et al.

This position paper argues for a fundamental shift in how Large Language Models (LLMs) are integrated into the mental health care domain. We advocate for their role as co-creators rather than mere assistive tools. While LLMs have the potential to enhance accessibility, personalization, and crisis intervention, their adoption remains limited due to concerns about bias, evaluation, over-reliance, dehumanization, and regulatory uncertainties. To address these challenges, we propose two structured pathways: SAFE-i (Supportive, Adaptive, Fair, and Ethical Implementation) Guidelines for ethical and responsible deployment, and HAAS-e (Human-AI Alignment and Safety Evaluation) Framework for multidimensional, human-centered assessment. SAFE-i provides a blueprint for data governance, adaptive model engineering, and real-world integration, ensuring LLMs align with clinical and ethical standards. HAAS-e introduces evaluation metrics that go beyond technical accuracy to measure trustworthiness, empathy, cultural sensitivity, and actionability. We call for the adoption of these structured approaches to establish a responsible and scalable model for LLM-driven mental health support, ensuring that AI complements, rather than replaces, human expertise.

CLJan 21, 2025
Academic case reports lack diversity: Assessing the presence and diversity of sociodemographic and behavioral factors related to Post COVID-19 Condition

Juan Andres Medina Florez, Shaina Raza, Rashida Lynn et al.

Understanding the prevalence, disparities, and symptom variations of Post COVID-19 Condition (PCC) for vulnerable populations is crucial to improving care and addressing intersecting inequities. This study aims to develop a comprehensive framework for integrating social determinants of health (SDOH) into PCC research by leveraging NLP techniques to analyze disparities and variations in SDOH representation within PCC case reports. Following construction of a PCC Case Report Corpus, comprising over 7,000 case reports from the LitCOVID repository, a subset of 709 reports were annotated with 26 core SDOH-related entity types using pre-trained named entity recognition (NER) models, human review, and data augmentation to improve quality, diversity and representation of entity types. An NLP pipeline integrating NER, natural language inference (NLI), trigram and frequency analyses was developed to extract and analyze these entities. Both encoder-only transformer models and RNN-based models were assessed for the NER objective. Fine-tuned encoder-only BERT models outperformed traditional RNN-based models in generalizability to distinct sentence structures and greater class sparsity. Exploratory analysis revealed variability in entity richness, with prevalent entities like condition, age, and access to care, and underrepresentation of sensitive categories like race and housing status. Trigram analysis highlighted frequent co-occurrences among entities, including age, gender, and condition. The NLI objective (entailment and contradiction analysis) showed attributes like "Experienced violence or abuse" and "Has medical insurance" had high entailment rates (82.4%-80.3%), while attributes such as "Is female-identifying," "Is married," and "Has a terminal condition" exhibited high contradiction rates (70.8%-98.5%).

CVJun 11, 2024
Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning

Shuvendu Roy, Yasaman Parhizkar, Franklin Ogidi et al.

We perform a comprehensive benchmarking of contrastive frameworks for learning multimodal representations in the medical domain. Through this study, we aim to answer the following research questions: (i) How transferable are general-domain representations to the medical domain? (ii) Is multimodal contrastive training sufficient, or does it benefit from unimodal training as well? (iii) What is the impact of feature granularity on the effectiveness of multimodal medical representation learning? To answer these questions, we investigate eight contrastive learning approaches under identical training setups, and train them on 2.8 million image-text pairs from four datasets, and evaluate them on 25 downstream tasks, including classification (zero-shot and linear probing), image-to-text and text-to-image retrieval, and visual question-answering. Our findings suggest a positive answer to the first question, a negative answer to the second question, and the benefit of learning fine-grained features. Finally, we make our code publicly available.

CLJan 19, 2024
FAIR Enough: How Can We Develop and Assess a FAIR-Compliant Dataset for Large Language Models' Training?

Shaina Raza, Shardul Ghuge, Chen Ding et al.

The rapid evolution of Large Language Models (LLMs) highlights the necessity for ethical considerations and data integrity in AI development, particularly emphasizing the role of FAIR (Findable, Accessible, Interoperable, Reusable) data principles. While these principles are crucial for ethical data stewardship, their specific application in the context of LLM training data remains an under-explored area. This research gap is the focus of our study, which begins with an examination of existing literature to underline the importance of FAIR principles in managing data for LLM training. Building upon this, we propose a novel framework designed to integrate FAIR principles into the LLM development lifecycle. A contribution of our work is the development of a comprehensive checklist intended to guide researchers and developers in applying FAIR data principles consistently across the model development process. The utility and effectiveness of our framework are validated through a case study on creating a FAIR-compliant dataset aimed at detecting and mitigating biases in LLMs. We present this framework to the community as a tool to foster the creation of technologically advanced, ethically grounded, and socially responsible AI models.

LGMay 4, 2023
MLHOps: Machine Learning for Healthcare Operations

Faiza Khan Khattak, Vallijah Subasri, Amrit Krishnan et al.

Machine Learning Health Operations (MLHOps) is the combination of processes for reliable, efficient, usable, and ethical deployment and maintenance of machine learning models in healthcare settings. This paper provides both a survey of work in this area and guidelines for developers and clinicians to deploy and maintain their own models in clinical practice. We cover the foundational concepts of general machine learning operations, describe the initial setup of MLHOps pipelines (including data sources, preparation, engineering, and tools). We then describe long-term monitoring and updating (including data distribution shifts and model updating) and ethical considerations (including bias, fairness, interpretability, and privacy). This work therefore provides guidance across the full pipeline of MLHOps from conception to initial and ongoing deployment.

CLDec 31, 2020
An Experimental Evaluation of Transformer-based Language Models in the Biomedical Domain

Paul Grouchy, Shobhit Jain, Michael Liu et al.

With the growing amount of text in health data, there have been rapid advances in large pre-trained models that can be applied to a wide variety of biomedical tasks with minimal task-specific modifications. Emphasizing the cost of these models, which renders technical replication challenging, this paper summarizes experiments conducted in replicating BioBERT and further pre-training and careful fine-tuning in the biomedical domain. We also investigate the effectiveness of domain-specific and domain-agnostic pre-trained models across downstream biomedical NLP tasks. Our finding confirms that pre-trained models can be impactful in some downstream NLP tasks (QA and NER) in the biomedical domain; however, this improvement may not justify the high cost of domain-specific pre-training.

CVSep 24, 2020
A Computer Vision Approach to Combat Lyme Disease

Sina Akbarian, Tania Cawston, Laurent Moreno et al.

Lyme disease is an infectious disease transmitted to humans by a bite from an infected Ixodes species (blacklegged ticks). It is one of the fastest growing vector-borne illness in North America and is expanding its geographic footprint. Lyme disease treatment is time-sensitive, and can be cured by administering an antibiotic (prophylaxis) to the patient within 72 hours after a tick bite by the Ixodes species. However, the laboratory-based identification of each tick that might carry the bacteria is time-consuming and labour intensive and cannot meet the maximum turn-around-time of 72 hours for an effective treatment. Early identification of blacklegged ticks using computer vision technologies is a potential solution in promptly identifying a tick and administering prophylaxis within a crucial window period. In this work, we build an automated detection tool that can differentiate blacklegged ticks from other ticks species using advanced deep learning and computer vision approaches. We demonstrate the classification of tick species using Convolution Neural Network (CNN) models, trained end-to-end from tick images directly. Advanced knowledge transfer techniques within teacher-student learning frameworks are adopted to improve the performance of classification of tick species. Our best CNN model achieves 92% accuracy on test set. The tool can be integrated with the geography of exposure to determine the risk of Lyme disease infection and need for prophylaxis treatment.

IVAug 31, 2020
Evaluating Knowledge Transfer in Neural Network for Medical Images

Sina Akbarian, Laleh Seyyed-Kalantari, Farzad Khalvati et al.

Deep learning and knowledge transfer techniques have permeated the field of medical imaging and are considered as key approaches for revolutionizing diagnostic imaging practices. However, there are still challenges for the successful integration of deep learning into medical imaging tasks due to a lack of large annotated imaging data. To address this issue, we propose a teacher-student learning framework to transfer knowledge from a carefully pre-trained convolutional neural network (CNN) teacher to a student CNN. In this study, we explore the performance of knowledge transfer in the medical imaging setting. We investigate the proposed network's performance when the student network is trained on a small dataset (target dataset) as well as when teacher's and student's domains are distinct. The performances of the CNN models are evaluated on three medical imaging datasets including Diabetic Retinopathy, CheXpert, and ChestX-ray8. Our results indicate that the teacher-student learning framework outperforms transfer learning for small imaging datasets. Particularly, the teacher-student learning framework improves the area under the ROC Curve (AUC) of the CNN model on a small sample of CheXpert (n=5k) by 4% and on ChestX-ray8 (n=5.6k) by 9%. In addition to small training data size, we also demonstrate a clear advantage of the teacher-student learning framework in the medical imaging setting compared to transfer learning. We observe that the teacher-student network holds a great promise not only to improve the performance of diagnosis but also to reduce overfitting when the dataset is small.