CV | AI | LG | Oct 3, 2025

Align Your Query: Representation Alignment for Multimodality Medical Object Detection

arXiv:2510.02789v1 | 1 citation | h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses robust multimodality medical object detection for healthcare applications. It offers a practical solution, but the advance is incremental because it builds on existing DETR-style architectures.

The paper tackles the problem of medical object detection across mixed modalities (CXR, CT, MRI) by proposing a representation alignment framework to address heterogeneous statistics and disjoint representation spaces, resulting in consistent AP improvements with minimal overhead and no architectural modifications.

Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Across diverse modalities trained altogether, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection. Project page: https://araseo.github.io/alignyourquery/.
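The two components described in the abstract can be illustrated with a toy sketch. The abstract does not give the projections, pooling, or exact loss, so everything below is an assumption made for illustration: `moca_mix` uses identity query/key/value projections and a residual update, and `alignment_loss` uses a generic InfoNCE-style objective in place of the paper's task-specific contrastive loss. The function names and the 4-dimensional vectors are hypothetical and are not taken from the authors' code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def moca_mix(queries, modality_token):
    """Self-attention over [modality token] + object queries.

    Assumption: identity Q/K/V projections and a residual connection.
    Only the updated object queries are returned, so the query count
    seen by the rest of the detector is unchanged.
    """
    tokens = [modality_token] + queries
    dim = len(modality_token)
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in tokens])
        context = [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
        mixed.append([qi + ci for qi, ci in zip(q, context)])
    return mixed

def alignment_loss(pooled_queries, modality_ids, modality_tokens, temperature=0.1):
    """InfoNCE-style stand-in for the alignment objective (an assumption).

    Each pooled query vector is pulled toward the token of its own
    modality and pushed away from the other modality tokens.
    """
    tokens = [normalize(t) for t in modality_tokens]
    total = 0.0
    for q, m in zip(pooled_queries, modality_ids):
        qn = normalize(q)
        probs = softmax([dot(qn, t) / temperature for t in tokens])
        total -= math.log(probs[m])
    return total / len(pooled_queries)

# Toy modality tokens for CXR, CT, MRI (hypothetical values).
modality_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
queries = [[0.5, 0.1, 0.0, 0.0], [0.2, 0.2, 0.2, 0.2]]

mixed = moca_mix(queries, modality_tokens[0])

# A modality-balanced toy batch: one pooled query vector per modality.
batch = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0], [0.0, 0.1, 0.9, 0.0]]
loss = alignment_loss(batch, [0, 1, 2], modality_tokens)
```

In this sketch the modality token only participates as an extra key and value, which is one way to read the claim that the detector architecture is preserved. The loss is small when each pooled query already points toward its own modality token and grows when the modality labels are mismatched.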

