Martin J. Menten

CV
h-index: 71
25 papers
310 citations
Novelty: 53%
AI Score: 55

25 Papers

SP Aug 9, 2023 Code
Unlocking the diagnostic potential of electrocardiograms through information transfer from cardiac magnetic resonance imaging

Özgün Turgut, Philip Müller, Paul Hager et al.

Cardiovascular diseases (CVD) can be diagnosed using various diagnostic modalities. The electrocardiogram (ECG) is a cost-effective and widely available diagnostic aid that provides functional information of the heart. However, its ability to classify and spatially localise CVD is limited. In contrast, cardiac magnetic resonance (CMR) imaging provides detailed structural information of the heart and thus enables evidence-based diagnosis of CVD, but long scan times and high costs limit its use in clinical routine. In this work, we present a deep learning strategy for cost-effective and comprehensive cardiac screening solely from ECG. Our approach combines multimodal contrastive learning with masked data modelling to transfer domain-specific information from CMR imaging to ECG representations. In extensive experiments using data from 40,044 UK Biobank subjects, we demonstrate the utility and generalisability of our method for subject-specific risk prediction of CVD and the prediction of cardiac phenotypes using only ECG data. Specifically, our novel multimodal pre-training paradigm improves performance by up to 12.19 % for risk prediction and 27.59 % for phenotype prediction. In a qualitative analysis, we demonstrate that our learned ECG representations incorporate information from CMR image regions of interest. Our entire pipeline is publicly available at https://github.com/oetu/MMCL-ECG-CMR.

CV Mar 24, 2023
Best of Both Worlds: Multimodal Contrastive Learning with Tabular and Imaging Data

Paul Hager, Martin J. Menten, Daniel Rueckert

Medical datasets, and especially biobanks, often contain extensive tabular data with rich clinical information in addition to images. In practice, clinicians typically have less data, both in terms of diversity and scale, but still wish to deploy deep learning solutions. Combined with increasing medical dataset sizes and expensive annotation costs, the necessity for unsupervised methods that can pretrain multimodally and predict unimodally has risen. To address these needs, we propose the first self-supervised contrastive learning framework that takes advantage of images and tabular data to train unimodal encoders. Our solution combines SimCLR and SCARF, two leading contrastive learning strategies, and is simple and effective. In our experiments, we demonstrate the strength of our framework by predicting risks of myocardial infarction and coronary artery disease (CAD) using cardiac MR images and 120 clinical features from 40,000 UK Biobank subjects. Furthermore, we show the generalizability of our approach to natural images using the DVM car advertisement dataset. We take advantage of the high interpretability of tabular data and, through attribution and ablation experiments, find that morphometric tabular features, describing size and shape, have outsized importance during the contrastive learning process and improve the quality of the learned embeddings. Finally, we introduce a novel form of supervised contrastive learning, label as a feature (LaaF), by appending the ground truth label as a tabular feature during multimodal pretraining, outperforming all supervised contrastive baselines.
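
To make the idea of contrasting paired image and tabular embeddings concrete, here is a minimal sketch of a symmetric InfoNCE (CLIP-style) loss in plain Python. It is an illustration only: the abstract does not spell out the loss, so the function name `info_nce`, the temperature value, and the symmetric formulation are assumptions rather than the authors' implementation.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(image_emb, tabular_emb, temperature=0.1):
    """Symmetric InfoNCE loss; row i of each input is assumed to be the same subject."""
    img = [l2_normalize(v) for v in image_emb]
    tab = [l2_normalize(v) for v in tabular_emb]
    n = len(img)
    # logits[i][j]: similarity of image i and tabular row j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img[i], tab[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(matrix):
        # Cross-entropy where the correct "class" for row i is column i.
        total = 0.0
        for i, row in enumerate(matrix):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(matrix)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))
```

The loss is small when each image embedding is closest to its own tabular embedding and large when the pairing is scrambled, which is the signal that trains both unimodal encoders.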

CV Sep 15, 2023 Code
3D Arterial Segmentation via Single 2D Projections and Depth Supervision in Contrast-Enhanced CT Images

Alina F. Dima, Veronika A. Zimmer, Martin J. Menten et al.

Automated segmentation of the blood vessels in 3D volumes is an essential step for the quantitative diagnosis and treatment of many vascular diseases. 3D vessel segmentation is being actively investigated in existing works, mostly in deep learning approaches. However, training 3D deep networks requires large amounts of manual 3D annotations from experts, which are laborious to obtain. This is especially the case for 3D vessel segmentation, as vessels are sparse yet spread out over many slices and disconnected when visualized in 2D slices. In this work, we propose a novel method to segment the 3D peripancreatic arteries solely from one annotated 2D projection per training image with depth supervision. We perform extensive experiments on the segmentation of peripancreatic arteries on 3D contrast-enhanced CT images and demonstrate how well we capture the rich depth information from 2D projections. We demonstrate that by annotating a single, randomly chosen projection for each training sample, we obtain comparable performance to annotating multiple 2D projections, thereby reducing the annotation effort. Furthermore, by mapping the 2D labels to the 3D space using depth information and incorporating this into training, we almost close the performance gap between 3D supervision and 2D supervision. Our code is available at: https://github.com/alinafdima/3Dseg-mip-depth.

IV Jun 19, 2023
Synthetic optical coherence tomography angiographs for detailed retinal vessel segmentation without human annotations

Linus Kreitner, Johannes C. Paetzold, Nikolaus Rauch et al.

Optical coherence tomography angiography (OCTA) is a non-invasive imaging modality that can acquire high-resolution volumes of the retinal vasculature and aid the diagnosis of ocular, neurological and cardiac diseases. Segmenting the visible blood vessels is a common first step when extracting quantitative biomarkers from these images. Classical segmentation algorithms based on thresholding are strongly affected by image artifacts and limited signal-to-noise ratio. The use of modern, deep learning-based segmentation methods has been inhibited by a lack of large datasets with detailed annotations of the blood vessels. To address this issue, recent work has employed transfer learning, where a segmentation network is trained on synthetic OCTA images and is then applied to real data. However, the previously proposed simulations fail to faithfully model the retinal vasculature and do not provide effective domain adaptation. Because of this, current methods are unable to fully segment the retinal vasculature, in particular the smallest capillaries. In this work, we present a lightweight simulation of the retinal vascular network based on space colonization for faster and more realistic OCTA synthesis. We then introduce three contrast adaptation pipelines to decrease the domain gap between real and artificial images. We demonstrate the superior segmentation performance of our approach in extensive quantitative and qualitative experiments on three public datasets that compare our method to traditional computer vision algorithms and supervised training using human annotations. Finally, we make our entire pipeline publicly available, including the source code, pretrained models, and a large dataset of synthetic OCTA images.

IV Jul 22, 2022
Physiology-based simulation of the retinal vasculature enables annotation-free segmentation of OCT angiographs

Martin J. Menten, Johannes C. Paetzold, Alina Dima et al.

Optical coherence tomography angiography (OCTA) can non-invasively image the eye's circulatory system. In order to reliably characterize the retinal vasculature, there is a need to automatically extract quantitative metrics from these images. The calculation of such biomarkers requires a precise semantic segmentation of the blood vessels. However, deep-learning-based methods for segmentation mostly rely on supervised training with voxel-level annotations, which are costly to obtain. In this work, we present a pipeline to synthesize large amounts of realistic OCTA images with intrinsically matching ground truth labels; thereby obviating the need for manual annotation of training data. Our proposed method is based on two novel components: 1) a physiology-based simulation that models the various retinal vascular plexuses and 2) a suite of physics-based image augmentations that emulate the OCTA image acquisition process including typical artifacts. In extensive benchmarking experiments, we demonstrate the utility of our synthetic data by successfully training retinal vessel segmentation algorithms. Encouraged by our method's competitive quantitative and superior qualitative performance, we believe that it constitutes a versatile tool to advance the quantitative analysis of OCTA images.

CV Aug 4, 2022
Metadata-enhanced contrastive learning from retinal optical coherence tomography images

Robbie Holland, Oliver Leingang, Hrvoje Bogunović et al.

Deep learning has potential to automate screening, monitoring and grading of disease in medical images. Pretraining with contrastive learning enables models to extract robust and generalisable features from natural image datasets, facilitating label-efficient downstream image analysis. However, the direct application of conventional contrastive methods to medical datasets introduces two domain-specific issues. Firstly, several image transformations which have been shown to be crucial for effective contrastive learning do not translate from the natural image to the medical image domain. Secondly, the assumption made by conventional methods, that any two images are dissimilar, is systematically misleading in medical datasets depicting the same anatomy and disease. This is exacerbated in longitudinal image datasets that repeatedly image the same patient cohort to monitor their disease progression over time. In this paper we tackle these issues by extending conventional contrastive frameworks with a novel metadata-enhanced strategy. Our approach employs widely available patient metadata to approximate the true set of inter-image contrastive relationships. To this end we employ records for patient identity, eye position (i.e. left or right) and time series information. In experiments using two large longitudinal datasets containing 170,427 retinal OCT images of 7,912 patients with age-related macular degeneration (AMD), we evaluate the utility of using metadata to incorporate the temporal dynamics of disease progression into pretraining. Our metadata-enhanced approach outperforms both standard contrastive methods and a retinal image foundation model in five out of six image-level downstream tasks related to AMD. Due to its modularity, our method can be quickly and cost-effectively tested to establish the potential benefits of including available metadata in contrastive pretraining.
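
As a rough illustration of how patient metadata could define contrastive relationships, the sketch below marks two scans as a positive pair when they share patient and eye and were acquired within a time window. The pairing rule, the field names (`patient`, `eye`, `day`), and the 90-day threshold are hypothetical choices for this example; the abstract only states that patient identity, eye position, and time series information are used.

```python
def positive_pairs(records, max_days_apart=90):
    """Return index pairs (i, j) with i < j that are treated as positives.

    Each record is a dict with hypothetical keys: 'patient', 'eye', 'day'.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = records[i], records[j]
            same_eye = a["patient"] == b["patient"] and a["eye"] == b["eye"]
            close_in_time = abs(a["day"] - b["day"]) <= max_days_apart
            if same_eye and close_in_time:
                pairs.append((i, j))
    return pairs


scans = [
    {"patient": "p1", "eye": "L", "day": 0},
    {"patient": "p1", "eye": "L", "day": 30},
    {"patient": "p1", "eye": "R", "day": 0},
    {"patient": "p2", "eye": "L", "day": 0},
    {"patient": "p1", "eye": "L", "day": 400},
]
print(positive_pairs(scans))
```

All remaining pairs would be treated as negatives under this assumed rule, so scans of the same eye taken years apart are still contrasted against each other.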

IV Oct 28, 2022
Automated analysis of diabetic retinopathy using vessel segmentation maps as inductive bias

Linus Kreitner, Ivan Ezhov, Daniel Rueckert et al.

Recent studies suggest that early stages of diabetic retinopathy (DR) can be diagnosed by monitoring vascular changes in the deep vascular complex. In this work, we investigate a novel method for automated DR grading based on optical coherence tomography angiography (OCTA) images. Our work combines OCTA scans with their vessel segmentations, which then serve as inputs to task-specific networks for lesion segmentation, image quality assessment and DR grading. For this, we generate synthetic OCTA images to train a segmentation network that can be directly applied to real OCTA data. We test our approach on MICCAI 2022's DR analysis challenge (DRAC). In our experiments, the proposed method performs as well as the baseline model.

CV Sep 5, 2023
A skeletonization algorithm for gradient-based optimization

Martin J. Menten, Johannes C. Paetzold, Veronika A. Zimmer et al.

The skeleton of a digital image is a compact representation of its topology, geometry, and scale. It has utility in many computer vision applications, such as image description, segmentation, and registration. However, skeletonization has only seen limited use in contemporary deep learning solutions. Most existing skeletonization algorithms are not differentiable, making it impossible to integrate them with gradient-based optimization. Compatible algorithms based on morphological operations and neural networks have been proposed, but their results often deviate from the geometry and topology of the true medial axis. This work introduces the first three-dimensional skeletonization algorithm that is both compatible with gradient-based optimization and preserves an object's topology. Our method is exclusively based on matrix additions and multiplications, convolutional operations, basic non-linear functions, and sampling from a uniform probability distribution, allowing it to be easily implemented in any major deep learning library. In benchmarking experiments, we prove the advantages of our skeletonization algorithm compared to non-differentiable, morphological, and neural-network-based baselines. Finally, we demonstrate the utility of our algorithm by integrating it with two medical image processing applications that use gradient-based optimization: deep-learning-based blood vessel segmentation, and multimodal registration of the mandible in computed tomography and magnetic resonance images.

IV Aug 1, 2023
Metrics to Quantify Global Consistency in Synthetic Medical Images

Daniel Scholz, Benedikt Wiestler, Daniel Rueckert et al.

Image synthesis is increasingly being adopted in medical image processing, for example for data augmentation or inter-modality image translation. In these critical applications, the generated images must fulfill a high standard of biological correctness. A particular requirement for these images is global consistency, i.e., an image being overall coherent and structured so that all parts of the image fit together in a realistic and meaningful way. Yet, established image quality metrics do not explicitly quantify this property of synthetic images. In this work, we introduce two metrics that can measure the global consistency of synthetic images on a per-image basis. To measure the global consistency, we presume that a realistic image exhibits consistent properties, e.g., a person's body fat in a whole-body MRI, throughout the depicted object or scene. Hence, we quantify global consistency by predicting and comparing explicit attributes of images on patches using supervised trained neural networks. Next, we adapt this strategy to an unlabeled setting by measuring the similarity of implicit image features predicted by a self-supervised trained network. Our results demonstrate that predicting explicit attributes of synthetic images on patches can distinguish globally consistent from inconsistent images. Implicit representations of images are less sensitive in assessing global consistency but are still serviceable when labeled data is unavailable. Compared to established metrics, such as the FID, our method can explicitly measure global consistency on a per-image basis, enabling a dedicated analysis of the biological plausibility of single synthetic images.
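
The patch-based idea can be sketched in a few lines: predict an attribute on every patch and measure how much the predictions disagree. In this toy version the attribute predictor is a stand-in (mean intensity) for the supervised network mentioned in the abstract, and using the standard deviation as the disagreement measure is an assumption made for illustration.

```python
import statistics


def split_into_patches(image, size):
    """Cut a 2D list into non-overlapping size x size patches."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            patches.append([row[c:c + size] for row in image[r:r + size]])
    return patches


def mean_intensity(patch):
    """Hypothetical stand-in for a trained attribute predictor."""
    return statistics.fmean([value for row in patch for value in row])


def patch_inconsistency(image, predict_attribute, size):
    """Spread of per-patch attribute predictions; 0 means perfectly consistent."""
    predictions = [predict_attribute(p) for p in split_into_patches(image, size)]
    return statistics.pstdev(predictions)


consistent = [[1.0] * 4 for _ in range(4)]
inconsistent = [[0.0] * 4, [0.0] * 4, [1.0] * 4, [1.0] * 4]
print(patch_inconsistency(consistent, mean_intensity, 2))
print(patch_inconsistency(inconsistent, mean_intensity, 2))
```

Because the score is computed from a single image, it can flag individual implausible samples, which a distribution-level metric such as FID cannot do.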

IV Jan 11, 2023
Clustering disease trajectories in contrastive feature space for biomarker discovery in age-related macular degeneration

Robbie Holland, Oliver Leingang, Christopher Holmes et al.

Age-related macular degeneration (AMD) is the leading cause of blindness in the elderly. Current grading systems based on imaging biomarkers only coarsely group disease stages into broad categories and are unable to predict future disease progression. It is widely believed that this is due to their focus on a single point in time, disregarding the dynamic nature of the disease. In this work, we present the first method to automatically discover biomarkers that capture temporal dynamics of disease progression. Our method represents patient time series as trajectories in a latent feature space built with contrastive learning. Then, individual trajectories are partitioned into atomic sub-sequences that encode transitions between disease states. These are clustered using a newly introduced distance metric. In quantitative experiments we found our method yields temporal biomarkers that are predictive of conversion to late AMD. Furthermore, these clusters were highly interpretable to ophthalmologists who confirmed that many of the clusters represent dynamics that have previously been linked to the progression of AMD, even though they are currently not included in any clinical grading system.

AI Jul 11, 2024
Specialized curricula for training vision-language models in retinal image analysis

Robbie Holland, Thomas R. P. Taylor, Christopher Holmes et al.

Clinicians spend a significant amount of time reviewing medical images and transcribing their findings regarding patient diagnosis, referral and treatment in text form. Vision-language models (VLMs), which automatically interpret images and summarize their findings as text, have enormous potential to alleviate clinical workloads and increase patient access to high-quality medical care. While foundational models have stirred considerable interest in the medical community, it is unclear whether their general capabilities translate to real-world clinical utility. In this work, we demonstrate that OpenAI's ChatGPT-4o model, in addition to two foundation VLMs designed for medical use, markedly underperform compared to practicing ophthalmologists on specialist tasks crucial to the care of patients with age-related macular degeneration (AMD). To address this, we initially identified the essential capabilities required for image-based clinical decision-making, and then developed a curriculum to selectively train VLMs in these skills. The resulting model, RetinaVLM, can be instructed to write reports that significantly outperform those written by leading foundation medical VLMs and ChatGPT-4o in disease staging (F1 score of 0.63 vs. 0.33) and patient referral (0.67 vs. 0.50), and approaches the diagnostic performance of junior ophthalmologists (who achieve 0.77 and 0.78 on the respective tasks). Furthermore, in a single-blind reader study, two senior ophthalmologists with up to 32 years of experience found RetinaVLM's reports to be substantially more accurate than those by ChatGPT-4o (64.3% vs. 14.3%). These results reinforce that our curriculum-based approach provides a blueprint towards specializing foundation medical VLMs for real-world clinical tasks.

CV Sep 28, 2023
Propagation and Attribution of Uncertainty in Medical Imaging Pipelines

Leonhard F. Feiner, Martin J. Menten, Kerstin Hammernik et al.

Uncertainty estimation, which provides a means of building explainable neural networks for medical imaging applications, has mostly been studied for single deep learning models that focus on a specific task. In this paper, we propose a method to propagate uncertainty through cascades of deep learning models in medical imaging pipelines. This allows us to aggregate the uncertainty in later stages of the pipeline and to obtain a joint uncertainty measure for the predictions of later models. Additionally, we can separately report contributions of the aleatoric (data-based) uncertainty of every component in the pipeline. We demonstrate the utility of our method on a realistic imaging pipeline that reconstructs undersampled brain and knee magnetic resonance (MR) images and subsequently predicts quantitative information from the images, such as brain volume, knee side, or patient's sex. We quantitatively show that the propagated uncertainty is correlated with input uncertainty and compare the proportions of contributions of pipeline stages to the joint uncertainty measure.
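
One generic way to propagate uncertainty through a two-stage pipeline and attribute it to each stage is Monte Carlo sampling combined with the law of total variance. The sketch below shows that idea under stated assumptions: each stage is a hypothetical callable returning a Gaussian mean and standard deviation, and the sampling scheme is an illustrative choice, not necessarily the method of the paper.

```python
import random
import statistics


def propagate(stage1, stage2, x, n_samples=2000, seed=0):
    """Sample stage-1 outputs, push them through stage 2, and split the variance.

    Returns (mean prediction, variance attributed to stage 1,
    variance attributed to stage 2).
    """
    rng = random.Random(seed)
    mean1, std1 = stage1(x)
    means, variances = [], []
    for _ in range(n_samples):
        sample = rng.gauss(mean1, std1)   # draw one plausible stage-1 output
        mean2, std2 = stage2(sample)      # stage 2 reports its own uncertainty
        means.append(mean2)
        variances.append(std2 ** 2)
    from_stage1 = statistics.pvariance(means)   # spread caused by stage-1 noise
    from_stage2 = statistics.fmean(variances)   # average stage-2 noise
    return statistics.fmean(means), from_stage1, from_stage2


# Hypothetical stages: a noisy "reconstruction" and a linear "predictor".
reconstruct = lambda x: (x, 0.5)
predict = lambda y: (2.0 * y, 0.1)
mean, var_stage1, var_stage2 = propagate(reconstruct, predict, 3.0)
```

The joint variance is the sum of the two returned terms, so their ratio gives a simple per-stage share of the overall uncertainty.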

LG Dec 29, 2025
Stochastic Siamese MAE Pretraining for Longitudinal Medical Images

Taha Emre, Arunava Chakravarty, Thomas Pinetz et al.

Temporally aware image representations are crucial for capturing disease progression in 3D volumes of longitudinal medical datasets. However, recent state-of-the-art self-supervised learning approaches like Masked Autoencoding (MAE), despite their strong representation learning capabilities, lack temporal awareness. In this paper, we propose STAMP (Stochastic Temporal Autoencoder with Masked Pretraining), a Siamese MAE framework that encodes temporal information through a stochastic process by conditioning on the time difference between the two input volumes. Unlike deterministic Siamese approaches, which compare scans from different time points but fail to account for the inherent uncertainty in disease evolution, STAMP learns temporal dynamics stochastically by reframing the MAE reconstruction loss as a conditional variational inference objective. We evaluated STAMP on two OCT datasets and one MRI dataset with multiple visits per patient. STAMP-pretrained ViT models outperformed both existing temporal MAE methods and foundation models on different late-stage Age-Related Macular Degeneration and Alzheimer's Disease progression prediction tasks, which require models to learn the underlying non-deterministic temporal dynamics of the diseases.

22.8 CL Apr 22
Weighting What Matters: Boosting Sample Efficiency in Medical Report Generation via Token Reweighting

Alexander Weers, Daniel Rueckert, Martin J. Menten

Training vision-language models (VLMs) for medical report generation is often hindered by the scarcity of high-quality annotated data. This work evaluates the use of a weighted loss function to improve data efficiency. Compared to standard cross-entropy loss, which treats all token prediction errors equally, the reweighted loss shifts the focus to semantically salient tokens with outsized clinical importance. In experiments on ophthalmological report generation, we show that this simple method improves efficiency across multiple data scales, achieving similar report quality with up to ten times less training data.
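
The contrast with standard cross-entropy can be shown with a small weighted negative log-likelihood. This is a sketch under assumptions: the abstract does not say how the weights are derived, so the weights here are hypothetical per-token importance values, and normalising by their sum is one possible design choice.

```python
import math


def weighted_cross_entropy(correct_token_probs, weights):
    """Weighted mean negative log-likelihood over one report.

    correct_token_probs[i] is the model's probability for the reference token
    at position i; weights[i] is a hypothetical clinical-importance weight.
    """
    total_weight = sum(weights)
    weighted_nll = sum(
        -w * math.log(p) for p, w in zip(correct_token_probs, weights)
    )
    return weighted_nll / total_weight


# Two filler tokens predicted well, one clinically salient token predicted badly.
probs = [0.9, 0.9, 0.1]
uniform = weighted_cross_entropy(probs, [1.0, 1.0, 1.0])   # standard cross-entropy
reweighted = weighted_cross_entropy(probs, [1.0, 1.0, 4.0])
print(uniform, reweighted)
```

With uniform weights the function reduces to ordinary cross-entropy; raising the weight of the salient token makes its error dominate the loss, which is the intended shift in focus.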

LG Feb 10
Step-resolved data attribution for looped transformers

Georgios Kaissis, David Mildenberger, Juan Felipe Gomez et al.

We study how individual training examples shape the internal computation of looped transformers, where a shared block is applied for $τ$ recurrent iterations to enable latent reasoning. Existing training-data influence estimators such as TracIn yield a single scalar score that aggregates over all loop iterations, obscuring when during the recurrent computation a training example matters. We introduce Step-Decomposed Influence (SDI), which decomposes TracIn into a length-$τ$ influence trajectory by unrolling the recurrent computation graph and attributing influence to specific loop iterations. To make SDI practical at transformer scale, we propose a TensorSketch implementation that never materialises per-example gradients. Experiments on looped GPT-style models and algorithmic reasoning tasks show that SDI scales excellently, matches full-gradient baselines with low error and supports a broad range of data attribution and interpretability tasks with per-step insights into the latent reasoning process.
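
The decomposition can be illustrated on a toy looped model with a single shared scalar weight, where the gradient with respect to that weight is a sum of one term per loop iteration. The sketch below is an assumption-laden illustration: the scalar tanh model, the single-checkpoint TracIn score, and the choice to split only the training-side gradient by step are simplifications for this example, and the TensorSketch compression is omitted entirely.

```python
import math


def forward(w, x, tau):
    """Apply the shared block h -> tanh(w * h) for tau iterations."""
    hs = [x]
    for _ in range(tau):
        hs.append(math.tanh(w * hs[-1]))
    return hs


def loss(w, x, y, tau):
    return 0.5 * (forward(w, x, tau)[-1] - y) ** 2


def per_step_grads(w, x, y, tau):
    """Gradient of the loss w.r.t. w, split into one term per loop iteration."""
    hs = forward(w, x, tau)
    upstream = hs[-1] - y                     # dL/dh_tau
    grads = [0.0] * tau
    for t in reversed(range(tau)):
        local = 1.0 - hs[t + 1] ** 2          # derivative of tanh at step t
        grads[t] = upstream * local * hs[t]   # contribution of the t-th use of w
        upstream = upstream * local * w       # dL/dh_t for the next step back
    return grads


def influence_trajectory(w, train, test, tau, lr=1.0):
    """Length-tau trajectory whose sum is the single-checkpoint TracIn score."""
    g_train = per_step_grads(w, *train, tau)
    g_test = sum(per_step_grads(w, *test, tau))
    return [lr * g * g_test for g in g_train]


trajectory = influence_trajectory(0.8, (0.5, 0.2), (0.3, 0.1), tau=3, lr=0.1)
```

Summing the trajectory recovers the usual scalar influence score, while the individual entries show which loop iterations carry it.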

CV Dec 6, 2023 Code
How Low Can You Go? Surfacing Prototypical In-Distribution Samples for Unsupervised Anomaly Detection

Felix Meissen, Johannes Getzner, Alexander Ziller et al.

Unsupervised anomaly detection (UAD) alleviates large labeling efforts by training exclusively on unlabeled in-distribution data and detecting outliers as anomalies. Generally, the assumption prevails that large training datasets allow the training of higher-performing UAD models. However, in this work, we show that UAD with extremely few training samples can already match, and in some cases even surpass, the performance of training with the whole training dataset. Building upon this finding, we propose an unsupervised method to reliably identify prototypical samples to further boost UAD performance. We demonstrate the utility of our method on seven different established UAD benchmarks from computer vision, industrial defect detection, and medicine. With just 25 selected samples, we even exceed the performance of full training in 25/67 categories in these benchmarks. Additionally, we show that the prototypical in-distribution samples identified by our proposed method generalize well across models and datasets and that observing their sample selection criteria allows for a successful manual selection of small subsets of high-performing samples. Our code is available at https://anonymous.4open.science/r/uad_prototypical_samples/

IV Mar 12, 2024
Deep-learning-based clustering of OCT images for biomarker discovery in age-related macular degeneration (Pinnacle study report 4)

Robbie Holland, Rebecca Kaye, Ahmed M. Hagag et al.

Diseases are currently managed using grading systems, which stratify patients into stages that indicate patient risk and guide clinical management. However, these broad categories typically lack prognostic value, and proposals for new biomarkers are currently limited to anecdotal observations. In this work, we introduce a deep-learning-based biomarker proposal system for the purpose of accelerating biomarker discovery in age-related macular degeneration (AMD). It works by first training a neural network using self-supervised contrastive learning to discover, without any clinical annotations, features relating to both known and unknown AMD biomarkers present in 46,496 retinal optical coherence tomography (OCT) images. To interpret the discovered biomarkers, we partition the images into 30 subsets, termed clusters, that contain similar features. We then conduct two parallel 1.5-hour semi-structured interviews with two independent teams of retinal specialists that describe each cluster in clinical language. Overall, both teams independently identified clearly distinct characteristics in 27 of 30 clusters, of which 23 were related to AMD. Seven were recognised as known biomarkers already used in established grading systems and 16 depicted biomarker combinations or subtypes that are either not yet used in grading systems, were only recently proposed, or were unknown. Clusters separated incomplete from complete retinal atrophy, intraretinal from subretinal fluid and thick from thin choroids, and in simulation outperformed clinically-used grading systems in prognostic value. Overall, contrastive learning enabled the automatic proposal of AMD biomarkers that go beyond the set used by clinically established grading systems. Ultimately, we envision that equipping clinicians with discovery-oriented deep-learning tools can accelerate discovery of novel prognostic biomarkers.

IV Mar 25, 2024
Diff-Def: Diffusion-Generated Deformation Fields for Conditional Atlases

Sophie Starck, Vasiliki Sideri-Lampretsa, Bernhard Kainz et al.

Anatomical atlases are widely used for population studies and analysis. Conditional atlases target a specific sub-population defined via certain conditions, such as demographics or pathologies, and allow for the investigation of fine-grained anatomical differences like morphological changes associated with ageing or disease. Existing approaches use either registration-based methods that are often unable to handle large anatomical variations or generative adversarial models, which are challenging to train since they can suffer from training instabilities. Instead of generating atlases directly as intensities, we propose using latent diffusion models to generate deformation fields, which transform a general population atlas into one representing a specific sub-population. Our approach ensures structural integrity, enhances interpretability and avoids hallucinations that may arise during direct image synthesis by generating this deformation field and regularising it using a neighbourhood of images. We compare our method to several state-of-the-art atlas generation methods using brain MR images from the UK Biobank. Our method generates highly realistic atlases with smooth transformations and high anatomical fidelity, outperforming existing baselines. We demonstrate the quality of these atlases through comprehensive evaluations, including quantitative metrics for anatomical accuracy, perceptual similarity, and qualitative analyses displaying the consistency and realism of the generated atlases.

CV Mar 12, 2025
Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis

Chenjun Li, Laurin Lux, Alexander H. Berger et al.

Accurate staging of Diabetic Retinopathy (DR) is essential for guiding timely interventions and preventing vision loss. However, current staging models are hardly interpretable, and most public datasets contain no clinical reasoning or interpretation beyond image-level labels. In this paper, we present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis. Our approach leverages optical coherence tomography angiography (OCTA) images by constructing biologically informed graphs that encode key retinal vascular features such as vessel morphology and spatial connectivity. A graph neural network (GNN) then performs DR staging while integrated gradients highlight critical nodes and edges and their individual features that drive the classification decisions. We collect this graph-based knowledge which attributes the model's prediction to physiological structures and their characteristics. We then transform it into textual descriptions for VLMs. We perform instruction-tuning with these textual descriptions and the corresponding image to train a student VLM. This final agent can classify the disease and explain its decision in a human interpretable way solely based on a single image input. Experimental evaluations on both proprietary and public datasets demonstrate that our method not only improves classification accuracy but also offers more clinically interpretable results. An expert study further demonstrates that our method provides more accurate diagnostic explanations and paves the way for precise localization of pathologies in OCTA images.

CV Feb 23, 2025
Interpretable Retinal Disease Prediction Using Biology-Informed Heterogeneous Graph Representations

Laurin Lux, Alexander H. Berger, Maria Romeo Tricas et al.

Interpretability is crucial to enhance trust in machine learning models for medical diagnostics. However, most state-of-the-art image classifiers based on neural networks are not interpretable. As a result, clinicians often resort to known biomarkers for diagnosis, although biomarker-based classification typically performs worse than large neural networks. This work proposes a method that surpasses the performance of established machine learning models while simultaneously improving prediction interpretability for diabetic retinopathy staging from optical coherence tomography angiography (OCTA) images. Our method is based on a novel biology-informed heterogeneous graph representation that models retinal vessel segments, intercapillary areas, and the foveal avascular zone (FAZ) in a human-interpretable way. This graph representation allows us to frame diabetic retinopathy staging as a graph-level classification task, which we solve using an efficient graph neural network. We benchmark our method against well-established baselines, including classical biomarker-based classifiers, convolutional neural networks (CNNs), and vision transformers. Our model outperforms all baselines on two datasets. Crucially, we use our biology-informed graph to provide explanations of unprecedented detail. Our approach surpasses existing methods in precisely localizing and identifying critical vessels or intercapillary areas. In addition, we give informative and human-interpretable attributions to critical characteristics. Our work contributes to the development of clinical decision-support tools in ophthalmology.

CV Mar 12, 2024
Spatiotemporal Representation Learning for Short and Long Medical Image Time Series

Chengzhi Shen, Martin J. Menten, Hrvoje Bogunović et al.

Analyzing temporal developments is crucial for the accurate prognosis of many medical conditions. Temporal changes that occur over short time scales are key to assessing the health of physiological functions, such as the cardiac cycle. Moreover, tracking longer term developments that occur over months or years in evolving processes, such as age-related macular degeneration (AMD), is essential for accurate prognosis. Despite the importance of both short and long term analysis to clinical decision making, they remain understudied in medical deep learning. State of the art methods for spatiotemporal representation learning, developed for short natural videos, prioritize the detection of temporal constants rather than temporal developments. Moreover, they do not account for varying time intervals between acquisitions, which are essential for contextualizing observed changes. To address these issues, we propose two approaches. First, we combine clip-level contrastive learning with a novel temporal embedding to adapt to irregular time series. Second, we propose masking and predicting latent frame representations of the temporal sequence. Our two approaches outperform all prior methods on temporally-dependent tasks including cardiac output estimation and three prognostic AMD tasks. Overall, this enables the automated analysis of temporal patterns which are typically overlooked in applications of deep learning to medicine.

IVMar 10, 2025
Skelite: Compact Neural Networks for Efficient Iterative Skeletonization

Luis D. Reyes Vargas, Martin J. Menten, Johannes C. Paetzold et al.

Skeletonization extracts thin representations from images that compactly encode their geometry and topology. These representations have become an important topological prior for preserving connectivity in curvilinear structures, aiding medical tasks like vessel segmentation. Existing compatible skeletonization algorithms face significant trade-offs: morphology-based approaches are computationally efficient but prone to frequent breakages, while topology-preserving methods require substantial computational resources. We propose a novel framework for training iterative skeletonization algorithms with a learnable component. The framework leverages synthetic data, task-specific augmentation, and a model distillation strategy to learn compact neural networks that produce thin, connected skeletons with a fully differentiable iterative algorithm. Our method demonstrates a 100 times speedup over topology-constrained algorithms while maintaining high accuracy and generalizing effectively to new domains without fine-tuning. Benchmarking and downstream validation in 2D and 3D tasks demonstrate its computational efficiency and real-world applicability.

CVDec 11, 2025
Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning

Chenjun Li, Cheng Wan, Laurin Lux et al.

Vision-Language Models (VLMs) offer a promising path toward interpretable medical diagnosis by allowing users to ask about clinical explanations alongside predictions and across different modalities. However, training VLMs for detailed reasoning requires large-scale image-text datasets. In many specialized domains, for example in reading Optical Coherence Tomography Angiography (OCTA) images, such precise text with grounded descriptions of pathologies is scarce or even non-existent. To overcome this bottleneck, we introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes images and corresponding text. Specifically, it generates realistic retinal vasculature with Diabetic Retinopathy (DR) features (capillary dropout, microaneurysms, neovascularization, and tortuosity) while automatically generating granular reasoning texts. Based on this, we curate OCTA-100K-SVR, an OCTA image-reasoning dataset with 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8b) trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Through human expert evaluation, we also demonstrate that it significantly enhances explanation quality and pathology localization on clinical data.
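
For readers unfamiliar with the metric quoted above, balanced accuracy is the unweighted mean of per-class recall, so a rare class counts as much as a common one. The sketch below is a generic illustration of that definition, not the authors' evaluation code; the function name `balanced_accuracy` and the toy labels are assumptions made for this example.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class recall (illustrative helper, hypothetical name)."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    # Recall is computed per true class, then averaged without class weighting.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: a classifier that always predicts the majority class.
toy_true = [0, 0, 0, 0, 1]
toy_pred = [0, 0, 0, 0, 0]
toy_score = balanced_accuracy(toy_true, toy_pred)
```

On these toy labels, plain accuracy would be 4/5, whereas the balanced score averages a recall of 1 for class 0 and a recall of 0 for class 1, which is why the metric is commonly preferred for imbalanced datasets.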

LGOct 8, 2025
Efficient numeracy in language models through single-token number embeddings

Linus Kreitner, Paul Hager, Jonathan Mengedoht et al.

To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either limiting the numerical intuition of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel tokenization strategy that embeds any number into a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.
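
As a rough illustration of the representation the abstract refers to, the sketch below converts a Python float into the 64 bits of its IEEE 754 double-precision encoding and back. It is only a minimal sketch of the bit layout: how BitTokens turns these bits into a token embedding is not described here, and the helper names `float_to_bits` and `bits_to_float` are assumptions made for this example.

```python
import struct
from typing import List


def float_to_bits(x: float) -> List[int]:
    """Return the 64 bits of x's IEEE 754 double encoding, most significant bit first.

    Layout: 1 sign bit, 11 exponent bits, 52 fraction bits.
    """
    # Pack as a big-endian double, then reinterpret the same 8 bytes as an unsigned int.
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(as_int >> shift) & 1 for shift in range(63, -1, -1)]


def bits_to_float(bits: List[int]) -> float:
    """Inverse of float_to_bits (hypothetical helper for this sketch)."""
    if len(bits) != 64:
        raise ValueError("expected exactly 64 bits")
    as_int = 0
    for b in bits:
        as_int = (as_int << 1) | (b & 1)
    (x,) = struct.unpack(">d", struct.pack(">Q", as_int))
    return x


bits = float_to_bits(-2.0)
sign, exponent, fraction = bits[0], bits[1:12], bits[12:]
```

Because every double maps to a fixed-length bit vector, any representable number fits in one fixed-size input, regardless of how many decimal digits it would take to write out.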

CVMar 11, 2024
Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers

Alexander H. Berger, Laurin Lux, Suprosanna Shit et al.

Direct image-to-graph transformation is a challenging task that involves solving object detection and relationship prediction in a single model. Due to this task's complexity, large training datasets are rare in many domains, making the training of deep-learning methods challenging. This data sparsity necessitates transfer learning strategies akin to the state-of-the-art in general computer vision. In this work, we introduce a set of methods enabling cross-domain and cross-dimension learning for image-to-graph transformers. We propose (1) a regularized edge sampling loss to effectively learn object relations in multiple domains with different numbers of edges, (2) a domain adaptation framework for image-to-graph transformers aligning image- and graph-level features from different domains, and (3) a projection function that allows using 2D data for training 3D transformers. We demonstrate our method's utility in cross-domain and cross-dimension experiments, where we utilize labeled data from 2D road networks for simultaneous learning in vastly different target domains. Our method consistently outperforms standard transfer learning and self-supervised pretraining on challenging benchmarks, such as retinal or whole-brain vessel graph extraction.