3 Papers

34.4SDMay 18Code
SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

Md Hasan, Nyvenn Castro, Daiqi Liu et al.

Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM

36.9CVMay 18
Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI

Daiqi Liu, Lukas Mulzer, Md Hasan et al.

Segmenting vocal tract articulators in real-time MRI (rtMRI) is a challenging dynamic image segmentation problem characterized by low contrast, rapid motion, and limited spatial resolution. However, while rtMRI acquisitions may provide synchronized acoustic signals, existing methods discard this information, and the few multimodal approaches that incorporate audio cannot be deployed when audio is unavailable. We propose a three-stage framework that leverages acoustic and phonological supervision during training while requiring only the rtMRI image at inference: phonological representations are converted into spatial bounding-box priors for articulator localization, visual and acoustic encoders are aligned via dual-level cross-modal contrastive pretraining, and the learned representations are fused through a cross-attention decoder, effectively transferring multimodal knowledge into a single-modality inference pipeline. Evaluated on 75-Speaker~Annot-16 and USC-TIMIT datasets, our method outperforms existing unimodal and multimodal methods, demonstrating that multimodal supervision provides transferable benefits for precise and clinically deployable vocal tract segmentation.

34.3LGMay 8
A Deep Risk Estimator for Known Operator Learning

Andreas Maier, Md Hasan, Paulina Conrad et al.

We describe an approach for estimating the statistical risk of deep networks that contain a mix of learned and known operators. Building on the maximal training error bounds previously established for known operator learning, we derive a deep risk estimator that connects the expected error of a layered network to the size of the training sample. The estimator decomposes the total risk into a sum over learned layers; every known operator contributes zero to this sum, while every learned layer adds an approximation term inspired by Barron's classic work and an estimation term that decreases with the number of training samples. We are able to show that the bound shrinks whenever a learned layer is replaced by a known operator and that the corresponding sample requirement scales with the number of trainable parameters of the layer that is replaced. As an application, we use computed tomography as an example and compare an operator-aware filtered backprojection network with a fully connected substitute that collapses the entire reconstruction pipeline into a single learned dense matrix. The predicted parameter ratio coincides with the structural sparsity that the analytic decomposition into a circulant filter and a sparse backprojection exposes. We confirm the predicted scaling on CPU at small image scale and on GPU at medium image scale, all on the same scaling law. Beyond CT reconstruction, the estimator applies to physics-informed neural networks that hardcode a known physical operation in its architecture, and we expect the result to be of interest for a broad community working on operator-aware deep learning. Calibrating the per-layer constants on each sweep yields a bound that tracks the empirical test MSE within a factor of two at every training-set size, so the estimator can be inverted to predict how many training samples are required to reach a target error.