RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture
This work addresses the limited availability of paired image-text data in medical imaging by enabling effective self-supervised learning for radiology. The contribution is incremental: it builds on existing joint embedding architectures and applies them specifically to chest X-rays.
The paper asks whether robust radiology encoders can be learned without language supervision. It introduces RadJEPA, a self-supervised framework that predicts latent representations of masked chest X-ray regions, and reports performance exceeding state-of-the-art approaches on disease classification, semantic segmentation, and report generation.
Recent advances in medical vision-language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
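
To make the idea of latent-space prediction concrete, the following is a minimal toy sketch in plain Python, not the RadJEPA implementation. It assumes things the abstract does not specify: a separate target encoder (as is common in JEPA-style methods), linear maps standing in for the encoders and predictor, mean pooling of visible-patch latents as the context, a scalar position hint for each masked patch, and a mean squared error loss. All function and parameter names (`split_patches`, `latent_prediction_loss`, `mask_ratio`, and so on) are hypothetical.

```python
import random


def split_patches(num_patches, mask_ratio, rng):
    """Randomly split patch indices into visible (context) and masked (target) sets."""
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(round(num_patches * mask_ratio))
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def linear(weights, vec):
    """Apply a dense matrix (given as a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def latent_prediction_loss(patches, visible, masked, enc_ctx, enc_tgt, predictor):
    """Mean squared error between predicted and target latents of masked patches.

    The loss is computed in latent space: raw pixels are never reconstructed.
    """
    if not visible or not masked:
        raise ValueError("need at least one visible and one masked patch")

    # Assumption: the context is the mean latent of the visible patches only.
    context = mean_vector([linear(enc_ctx, patches[i]) for i in visible])

    total = 0.0
    for i in masked:
        # Assumption: a normalized index tells the predictor which patch to predict.
        position = i / len(patches)
        predicted = linear(predictor, context + [position])
        # Assumption: targets come from a separate target encoder.
        target = linear(enc_tgt, patches[i])
        total += sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
    return total / len(masked)


if __name__ == "__main__":
    rng = random.Random(0)
    patch_dim, latent_dim, num_patches = 4, 3, 16

    def rand_matrix(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    # Stand-in for a patchified chest X-ray: 16 patches of 4 values each.
    patches = [[rng.random() for _ in range(patch_dim)] for _ in range(num_patches)]
    visible, masked = split_patches(num_patches, mask_ratio=0.75, rng=rng)
    loss = latent_prediction_loss(
        patches, visible, masked,
        enc_ctx=rand_matrix(latent_dim, patch_dim),
        enc_tgt=rand_matrix(latent_dim, patch_dim),
        predictor=rand_matrix(latent_dim, latent_dim + 1),  # +1 for the position hint
    )
    print("toy latent prediction loss:", loss)
```

The point of the sketch is the shape of the objective: the predictor sees only latents of visible patches and is scored against latents of masked patches, in contrast to aligning global representations across views or modalities. In a real system the encoders would be deep networks trained by gradient descent; none of that is shown here.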