Ivan Sviridov

CL
h-index13
7papers
16citations
Novelty38%
AI Score49

7 Papers

31.0CLJun 5Code
LLM-Guided Evolution for Medical Decision Pipelines

Ivan Sviridov, Artem Oskin, Ivan Panin et al.

Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at https://github.com/univanxx/llm_guided_evo_medical. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions. Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from $77.3\%$ to $87.1\%$ and emergency recall from $0.60$ to $0.97$, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy--cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. In PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms, calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules, rather than superficial prompt rewording alone.

HCMar 26, 2025Code
3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark

Ivan Sviridov, Amina Miftakhova, Artemiy Tereshchenko et al.

Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via Assessor Agent. It includes 2996 cases across 34 diagnoses from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.

CLNov 19, 2025Code
Multimodal Evaluation of Russian-language Architectures

Artem Chervyakov, Ulyana Isaeva, Anton Emelyanov et al.

Multimodal large language models (MLLMs) are currently at the center of research attention, showing rapid progress in scale and capabilities, yet their intelligence, limitations, and risks remain insufficiently understood. To address these issues, particularly in the context of the Russian language, where no multimodal benchmarks currently exist, we introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities, comprising 18 newly constructed evaluation tasks for both general-purpose models and modality-specific architectures (image-to-text, video-to-text, and audio-to-text). Our contributions include: (i) a universal taxonomy of multimodal abilities; (ii) 18 datasets created entirely from scratch with attention to Russian cultural and linguistic specificity, unified prompts, and metrics; (iii) baseline results for both closed-source and open-source models; (iv) a methodology for preventing benchmark leakage, including watermarking and licenses for private sets. While our current focus is on Russian, the proposed benchmark provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages, particularly within the Slavic language family.

CLFeb 28, 2025Code
RuCCoD: Towards Automated ICD Coding in Russian

Aleksandr Nesterov, Andrey Sakhovskiy, Ivan Sviridov et al.

This study investigates the feasibility of automating clinical coding in Russian, a language with limited biomedical resources. We present a new dataset for ICD coding, which includes diagnosis fields from electronic health records (EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD codes. This dataset serves as a benchmark for several state-of-the-art models, including BERT, LLaMA with LoRA, and RAG, with additional experiments examining transfer learning across domains (from PubMed abstracts to medical diagnosis) and terminologies (from UMLS concepts to ICD codes). We then apply the best-performing model to label an in-house EHR dataset containing patient histories from 2017 to 2021. Our experiments, conducted on a carefully curated test set, demonstrate that training with the automated predicted codes leads to a significant improvement in accuracy compared to manually annotated data from physicians. We believe our findings offer valuable insights into the potential for automating clinical coding in resource-limited languages like Russian, which could enhance clinical efficiency and data accuracy in these contexts. Our code and dataset are available at https://github.com/auto-icd-coding/ruccod.

SPMar 3, 2025
Conditional Electrocardiogram Generation Using Hierarchical Variational Autoencoders

Ivan Sviridov, Konstantin Egorov

Cardiovascular diseases (CVDs) are disorders impacting the heart and circulatory system. These disorders are the foremost and continuously escalating cause of mortality worldwide. One of the main tasks when working with CVDs is analyzing and identifying pathologies on a 12-lead electrocardiogram (ECG) with a standard 10-second duration. Using machine learning (ML) in automatic ECG analysis increases CVD diagnostics' availability, speed, and accuracy. However, the most significant difficulty in developing ML models is obtaining a sufficient training dataset. Due to the limitations of medical data usage, such as expensiveness, errors, the ambiguity of labels, imbalance of classes, and privacy issues, utilizing synthetic samples depending on specific pathologies bypasses these restrictions and improves algorithm quality. Existing solutions for the conditional generation of ECG signals are mainly built on Generative Adversarial Networks (GANs), and only a few papers consider the architectures based on Variational Autoencoders (VAEs), showing comparable results in recent works. This paper proposes the publicly available conditional Nouveau VAE model for ECG signal generation (cNVAE-ECG), which produces high-resolution ECGs with multiple pathologies. We provide an extensive comparison of the proposed model on various practical downstream tasks, including transfer learning scenarios showing an area under the receiver operating characteristic (AUROC) increase up to 2% surpassing GAN-like competitors.

CLFeb 25, 2025
Uncertainty-aware abstention in medical diagnosis based on medical texts

Artem Vazhentsev, Ivan Sviridov, Alvard Barseghyan et al.

This study addresses the critical issue of reliability for AI-assisted medical diagnosis. We focus on the selection prediction approach that allows the diagnosis system to abstain from providing the decision if it is not confident in the diagnosis. Such selective prediction (or abstention) approaches are usually based on the modeling predictive uncertainty of machine learning models involved. This study explores uncertainty quantification in machine learning models for medical text analysis, addressing diverse tasks across multiple datasets. We focus on binary mortality prediction from textual data in MIMIC-III, multi-label medical code prediction using ICD-10 codes from MIMIC-IV, and multi-class classification with a private outpatient visits dataset. Additionally, we analyze mental health datasets targeting depression and anxiety detection, utilizing various text-based sources, such as essays, social media posts, and clinical descriptions. In addition to comparing uncertainty methods, we introduce HUQ-2, a new state-of-the-art method for enhancing reliability in selective prediction tasks. Our results provide a detailed comparison of uncertainty quantification methods. They demonstrate the effectiveness of HUQ-2 in capturing and evaluating uncertainty, paving the way for more reliable and interpretable applications in medical text analysis.

AIFeb 26, 2024
GigaPevt: Multimodal Medical Assistant

Pavel Blinov, Konstantin Egorov, Ivan Sviridov et al.

Building an intelligent and efficient medical assistant is still a challenging AI problem. The major limitation comes from the data modality scarceness, which reduces comprehensive patient perception. This demo paper presents the GigaPevt, the first multimodal medical assistant that combines the dialog capabilities of large language models with specialized medical models. Such an approach shows immediate advantages in dialog quality and metric performance, with a 1.18% accuracy improvement in the question-answering task.