38.3 · AI · May 24
Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients · Abdurrahim Yilmaz, Ayşe Esra Koku Aksu, Duygu Yamen et al.
Chronic dermatologic diseases such as pemphigus require long-term follow-up, generating extensive longitudinal clinical documentation that is difficult to review comprehensively during routine visits. This increases clinician workload and the risk of missing critical historical information. We evaluated whether a locally deployed, privacy-preserving small language model (SLM) could retrieve structured clinical features and generate longitudinal summaries from long-term dermatology follow-up records. In this retrospective case series, thirty pemphigus patients contributed 541 visit notes that were aggregated into full longitudinal records (89,336 words); 56 clinically relevant features were annotated by two expert dermatologists. The locally deployed SLM (Qwen3 4B Thinking 2507) was queried with each complete record to retrieve the 56 features and generate one final summary report. Across 1,680 feature retrieval tasks (30 patients × 56 features), mean accuracy was 82.25%. Dermatologists' ratings of AI-generated summaries were high for overall quality (8.23-8.47), clinical accuracy (7.93-8.20), and usefulness (8.47-8.50), with no significant inter-evaluator differences and an overall preference for AI summaries in 53.3% of evaluations. These findings suggest that privacy-preserving, locally deployed SLMs can retrieve structured clinical features with good accuracy and generate clinically meaningful longitudinal summaries. SLMs may support clinical decision-making when integrated with appropriate oversight.
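The feature-retrieval accuracy reported above can be illustrated with a minimal sketch. It assumes exact-match scoring of each (patient, feature) pair against the expert annotation; the abstract does not state the scoring rule, and the function name and data layout here are hypothetical.

```python
def retrieval_accuracy(predicted, reference):
    """Fraction of (patient, feature) pairs where the model output
    matches the expert annotation (exact match is an assumption)."""
    total = 0
    correct = 0
    for patient_id, ref_features in reference.items():
        pred_features = predicted.get(patient_id, {})
        for name, ref_value in ref_features.items():
            total += 1
            # A feature the model did not return counts as incorrect.
            if pred_features.get(name) == ref_value:
                correct += 1
    return correct / total if total else 0.0


# Task count from the abstract: 30 patients x 56 features.
N_PATIENTS, N_FEATURES = 30, 56
N_TASKS = N_PATIENTS * N_FEATURES  # 1680
```

With this layout, each patient record maps feature names to values, so the score is a plain average over all annotated pairs.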
CV · Jan 20
DermaBench: A Clinician-Annotated Benchmark Dataset for Dermatology Visual Question Answering and Reasoning · Abdurrahim Yilmaz, Ozan Erdem, Ece Gokyayla et al.
Vision-language models (VLMs) are increasingly important in medical applications; however, their evaluation in dermatology remains limited by datasets that focus primarily on image-level classification tasks such as lesion recognition. While valuable for recognition, such datasets cannot assess the full visual understanding, language grounding, and clinical reasoning capabilities of multimodal models. Visual question answering (VQA) benchmarks are required to evaluate how models interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions. We introduce DermaBench, a clinician-annotated dermatology VQA benchmark built on the Diverse Dermatology Images (DDI) dataset. DermaBench comprises 656 clinical images from 570 unique patients spanning Fitzpatrick skin types I-VI. Using a hierarchical annotation schema with 22 main questions (single-choice, multi-choice, and open-ended), expert dermatologists annotated each image for diagnosis, anatomic site, lesion morphology, distribution, surface features, color, and image quality, together with open-ended narrative descriptions and summaries, yielding approximately 14,474 VQA-style annotations. DermaBench is released as a metadata-only dataset to respect upstream licensing and is publicly available at Harvard Dataverse.
CV · Feb 22
Artefact-Aware Fungal Detection in Dermatophytosis: A Real-Time Transformer-Based Approach for KOH Microscopy · Rana Gursoy, Abdurrahim Yilmaz, Baris Kizilyaprak et al.
Dermatophytosis is commonly assessed using potassium hydroxide (KOH) microscopy, yet accurate recognition of fungal hyphae is hindered by artefacts, heterogeneous keratin clearance, and notable inter-observer variability. This study presents a transformer-based detection framework using the RT-DETR model architecture to achieve precise, query-driven localization of fungal structures in high-resolution KOH images. A dataset of 2,540 routinely acquired microscopy images was manually annotated using a multi-class strategy to explicitly distinguish fungal elements from confounding artefacts. The model was trained with morphology-preserving augmentations to maintain the structural integrity of thin hyphae. Evaluation on an independent test set demonstrated robust object-level performance, with a recall of 0.9737, precision of 0.8043, and an AP@0.50 of 93.56%. When aggregated for image-level diagnosis, the model achieved 100% sensitivity and 98.8% accuracy, correctly identifying all positive cases without missing a single diagnosis. Qualitative outputs confirmed the robust localization of low-contrast hyphae even in artefact-rich fields. These results highlight that an artificial intelligence (AI) system can serve as a highly reliable, automated screening tool, effectively bridging the gap between image-level analysis and clinical decision-making in dermatomycology.
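The step from object-level detections to an image-level diagnosis can be sketched as follows. This is an illustration only: the abstract does not describe the aggregation rule, so the "any fungal detection above a score threshold" rule, the class names, and the threshold value are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # hypothetical class names, e.g. "fungus" or "artefact"
    score: float  # detector confidence in [0, 1]


def image_is_positive(detections, fungal_label="fungus", threshold=0.5):
    """Assumed rule: an image is positive if at least one fungal
    detection reaches the confidence threshold."""
    return any(d.label == fungal_label and d.score >= threshold
               for d in detections)


def sensitivity_and_accuracy(predictions, ground_truth):
    """Image-level sensitivity and accuracy from boolean labels."""
    pairs = list(zip(predictions, ground_truth))
    tp = sum(1 for p, g in pairs if p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    accuracy = (tp + tn) / len(pairs) if pairs else float("nan")
    return sensitivity, accuracy
```

Under such a rule, a single false-positive box is enough to flip an image to positive, which is one way image-level sensitivity can reach 100% while object-level precision stays lower.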
LG · Jan 24, 2025
Humanity's Last Exam · Long Phan, Alice Gatti, Ziwen Han et al. · amazon-science, apple-ml
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
CV · Jan 31, 2025 · Code
DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets · Abdurrahim Yilmaz, Furkan Yuceyalcin, Ece Gokyayla et al.
A major barrier to developing vision large language models (LLMs) in dermatology is the lack of large image-text pair datasets. We introduce DermaSynth, a dataset comprising 92,020 synthetic image-text pairs curated from 45,205 images (13,568 clinical and 35,561 dermatoscopic) for dermatology-related clinical tasks. Using Gemini 2.0, a state-of-the-art LLM, we applied clinically relevant prompts and a self-instruct method to generate diverse and rich synthetic texts. Metadata from the source datasets was incorporated into the input prompts to reduce potential hallucinations. The resulting dataset builds upon open access dermatological image repositories (DERM12345, BCN20000, PAD-UFES-20, SCIN, and HIBA) that have permissive CC-BY-4.0 licenses. We also fine-tuned a preliminary Llama-3.2-11B-Vision-Instruct model, DermatoLlama 1.0, on 5,000 samples. We anticipate that this dataset will support and accelerate AI research in dermatology. Data and code underlying this work are accessible at https://github.com/abdurrahimyilmaz/DermaSynth.
CV · Apr 7, 2025
An ensemble deep learning approach to detect tumors on Mohs micrographic surgery slides · Abdurrahim Yilmaz, Serra Atilla Aydin, Deniz Temur et al.
Mohs micrographic surgery (MMS) is the gold standard technique for removing high-risk nonmelanoma skin cancer; however, intraoperative histopathological examination demands significant time, effort, and expertise. The objective of this study is to develop a deep learning model to detect basal cell carcinoma (BCC) and artifacts on Mohs slides. A total of 731 Mohs slides from 51 patients with BCCs were used in this study: 91 containing tumor and 640 without tumor, the latter defined as non-tumor. The dataset was used to train U-Net based models that segment tumor and non-tumor regions on the slides. The segmented patches were then classified as tumor or non-tumor to produce predictions for whole slide images (WSIs). In the segmentation phase, the models achieved Dice scores of 0.70 and 0.67 and area under the curve (AUC) scores of 0.98 and 0.96 for tumor and non-tumor, respectively. For tumor classification, an AUC of 0.98 for patch-based detection and an AUC of 0.91 for slide-based detection were obtained on the test dataset. We present an AI system that can detect tumor and non-tumor regions in Mohs slides with high performance. Deep learning can aid Mohs surgeons and dermatopathologists in making more accurate decisions.
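For readers unfamiliar with the segmentation metric, the Dice score is twice the overlap between the predicted and reference masks divided by the sum of their sizes. The sketch below computes it for binary masks given as nested lists; it is a generic illustration, not the authors' evaluation code, and treating two empty masks as a perfect match is a convention assumed here.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks
    given as equally sized nested lists of 0/1 values."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, true) if p and t)
    denominator = sum(pred) + sum(true)
    # Assumed convention: two empty masks count as a perfect match.
    return 2.0 * intersection / denominator if denominator else 1.0
```

A Dice score of 0.70 therefore means the overlap is 70% of the average size of the two masks.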
IV · Jun 11, 2024
DERM12345: A Large, Multisource Dermatoscopic Skin Lesion Dataset with 38 Subclasses · Abdurrahim Yilmaz, Sirin Pekcan Yasar, Gulsum Gencoglan et al.
Skin lesion datasets provide essential information for understanding various skin conditions and developing effective diagnostic tools. They aid the artificial intelligence-based early detection of skin cancer, facilitate treatment planning, and contribute to medical education and research. Published large datasets only partially cover the subclassifications of skin lesions. This limitation highlights the need for more expansive and varied datasets to reduce false predictions and improve failure analysis for skin lesions. This study presents a diverse dataset comprising 12,345 dermatoscopic images with 38 subclasses of skin lesions collected in Turkiye, whose population spans different skin types in the transition zone between Europe and Asia. Each subgroup contains high-resolution photos and expert annotations, providing a strong and reliable basis for future research. The detailed analysis of each subgroup provided in this study facilitates targeted research and deepens the understanding of skin lesions. This dataset distinguishes itself through its diverse structure, with 5 super classes, 15 main classes, and 38 subclasses across its 12,345 high-resolution dermatoscopic images.
IV · Oct 23, 2021
Benchmarking of Lightweight Deep Learning Architectures for Skin Cancer Classification using ISIC 2017 Dataset · Abdurrahim Yilmaz, Mucahit Kalebasi, Yegor Samoylenko et al.
Skin cancer is one of the deadliest types of cancer and is common worldwide. Recently, the incidence of skin cancer has risen sharply. For this reason, the number of studies on skin cancer classification with deep learning is increasing day by day. To support work in this area, the International Skin Imaging Collaboration (ISIC) was established and created an open dataset archive. In this study, images were taken from the ISIC 2017 Challenge. The images were preprocessed and augmented, and deep learning models were then trained on them using transfer learning and fine-tuning. Three mobile deep learning architectures were each trained with three batch sizes, giving a total of 9 models. Among these, the NASNetMobile model with a batch size of 16 achieved the best result, with an accuracy of 82.00%, a precision of 81.77%, and an F1 score of 0.8038. Our method benchmarks mobile deep learning models with few parameters and compares their results.
RO · Aug 4, 2021
Mechatronic Investigation of Wound Healing Process by Using Micro Robot · Abdurrahim Yilmaz, Ali Anil Demircali, Serra Ozkasap et al.
The purpose of this study is to find the ideal forces for reducing cell stress during the wound healing process using micro robots. To this end, we ran two simulations in COMSOL Multiphysics with a micro robot to find the appropriate force. From these simulations, we created force curves to determine the minimum force and the friction force that could lift the cells from the surface. For a system of two micro robots, each with a 2 mm x 0.25 mm x 0.4 mm SU-8 body and 3 NdFeB magnets of 0.25 thickness and diameter, the simulated maximum force along the x-axis was 4.640 mN at a distance of 150 µm between the two robots.
CV · Jun 30, 2021
Deep Convolutional Neural Networks for Onychomycosis Detection · Abdurrahim Yilmaz, Fatih Goktay, Rahmetullah Varol et al.
The diagnosis of superficial fungal infections in dermatology is still mostly based on manual direct microscopic examination with potassium hydroxide (KOH) solution. However, this method can be time-consuming, and its diagnostic accuracy varies widely with the clinician's experience. With the growth of neural network applications in clinical microscopy, it is now possible to automate such manual processes, increasing both efficiency and accuracy. This study presents a deep neural network approach that addresses these problems and performs automatic fungi detection in grayscale images without dyes. We collected 160 microscopic field photographs containing fungal elements, obtained from patients with onychomycosis, and 297 microscopic field photographs containing dissolved keratin, obtained from normal nails. Smaller patches containing 4234 fungi and 4981 keratin were extracted from these images. To detect fungus and keratin, VGG16 and InceptionV3 models were developed. The VGG16 model achieved 95.98% accuracy and an area under the curve (AUC) of 0.9930, while the InceptionV3 model achieved 95.90% accuracy and an AUC of 0.9917. In comparison, the clinicians' average accuracy and AUC were 72.8% and 0.87, respectively. This deep learning model allows the development of an automated system that can detect fungi within microscopic images.
FLU-DYN · Jun 3, 2021
The Effect of Pore Structure in Flapping Wings on Flight Performance · Abdurrahim Yilmaz, Asli Tekeci, Meryem Ece Ozyetkin et al.
This study investigates the effects of porosity on the wings of flying creatures such as dragonflies, moths, and hummingbirds, and shows that pores can affect wing performance. The study was performed using 3D flow analyses of a porous flapping wing in COMSOL Multiphysics. We analyzed wings with different numbers of pores at different pore inclination angles in order to see the effect of pores on lift and drag forces. To compare the results, 9 different analyses were performed. In these analyses, airflow velocity was set to 5 m/s, angle of attack to 5 degrees, flapping frequency to 25 Hz, and flapping angle to 30 degrees. Keeping these values constant, the number of pores was varied over 36, 48, and 60, and the pore inclination angle over 60, 70, and 80 degrees. The analyses were carried out under laminar flow over the wing modeled in COMSOL Multiphysics. The importance of pores was investigated by comparing the results of these analyses.
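The nine analyses correspond to a full 3 × 3 grid over pore count and pore inclination angle, with the remaining parameters held fixed. A minimal sketch of that case matrix is below; the dictionary keys and function name are hypothetical, and the simulation itself (run in COMSOL Multiphysics) is not reproduced.

```python
from itertools import product

# Fixed conditions taken from the abstract (key names are illustrative).
FIXED = {
    "airflow_velocity_m_s": 5.0,
    "angle_of_attack_deg": 5.0,
    "flapping_frequency_hz": 25.0,
    "flapping_angle_deg": 30.0,
}
PORE_COUNTS = (36, 48, 60)
PORE_ANGLES_DEG = (60, 70, 80)


def build_cases():
    """Return one parameter dict per (pore count, pore angle) pair."""
    return [
        dict(FIXED, pore_count=count, pore_angle_deg=angle)
        for count, angle in product(PORE_COUNTS, PORE_ANGLES_DEG)
    ]


cases = build_cases()
print(len(cases))  # 9
```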