AS | AI | May 19, 2025

Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR

arXiv:2505.13079v1 | 1 citation | h-index: 10 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work targets the modality gap between linguistic and acoustic representations in speech recognition. It is an incremental advance over existing optimal transport methods.

The paper tackles the challenge of aligning linguistic and acoustic modalities in automatic speech recognition (ASR). It proposes Graph Matching Optimal Transport (GM-OT), which models both sequences as structured graphs and minimizes the distance between their nodes (feature embeddings) and between their edges (temporal and sequential relationships). The authors report significant performance gains over state-of-the-art models on Mandarin ASR.

Transferring linguistic knowledge from a pretrained language model (PLM) to acoustic feature learning has proven effective in enhancing end-to-end automatic speech recognition (E2E-ASR). However, aligning representations between linguistic and acoustic modalities remains a challenge due to inherent modality gaps. Optimal transport (OT) has shown promise in mitigating these gaps by minimizing the Wasserstein distance (WD) between linguistic and acoustic feature distributions. However, previous OT-based methods overlook structural relationships, treating feature vectors as unordered sets. To address this, we propose Graph Matching Optimal Transport (GM-OT), which models linguistic and acoustic sequences as structured graphs. Nodes represent feature embeddings, while edges capture temporal and sequential relationships. GM-OT minimizes both WD (between nodes) and Gromov-Wasserstein distance (GWD) (between edges), leading to a fused Gromov-Wasserstein distance (FGWD) formulation. This enables structured alignment and more efficient knowledge transfer compared to existing OT-based approaches. Theoretical analysis further shows that prior OT-based methods in linguistic knowledge transfer can be viewed as a special case within our GM-OT framework. We evaluate GM-OT on Mandarin ASR using a CTC-based E2E-ASR system with a PLM for knowledge transfer. Experimental results demonstrate significant performance gains over state-of-the-art models, validating the effectiveness of our approach.
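The fused objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the cosine node cost, the position-gap edge weights in `temporal_structure`, the squared edge loss, and the trade-off weight `alpha` are all assumptions chosen for illustration, and the coupling is given by hand rather than optimized. It only evaluates a node term (the WD part) plus an edge term (the GWD part) for a fixed transport plan.

```python
import numpy as np

def cosine_distance(a, b):
    """Pairwise cosine distance between rows of a (n, d) and b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def temporal_structure(n):
    """Hypothetical edge weights: normalized position gap |i - k| / (n - 1)."""
    idx = np.arange(n, dtype=float)
    return np.abs(idx[:, None] - idx[None, :]) / max(n - 1, 1)

def fgw_objective(M, C1, C2, T, alpha):
    """Fused objective for a fixed coupling T (assumed form, squared edge loss).

    node term: sum_ij M[i, j] * T[i, j]
    edge term: sum_ijkl (C1[i, k] - C2[j, l])**2 * T[i, j] * T[k, l]
    """
    wd = np.sum(M * T)
    # diff[i, j, k, l] = C1[i, k] - C2[j, l]
    diff = C1[:, None, :, None] - C2[None, :, None, :]
    gwd = np.einsum("ijkl,ij,kl->", diff ** 2, T, T)
    return (1.0 - alpha) * wd + alpha * gwd, wd, gwd

# Toy data: 6 acoustic frames and 3 text tokens, already projected to 4 dims.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(6, 4))
linguistic = rng.normal(size=(3, 4))

M = cosine_distance(acoustic, linguistic)   # node-to-node cost
C1 = temporal_structure(6)                  # acoustic graph edges
C2 = temporal_structure(3)                  # linguistic graph edges

# Two hand-made couplings with uniform marginals.
T_uniform = np.full((6, 3), 1.0 / 18.0)     # ignores temporal order
T_monotone = np.zeros((6, 3))
for i in range(6):
    T_monotone[i, i // 2] = 1.0 / 6.0       # two consecutive frames per token

_, _, gwd_uniform = fgw_objective(M, C1, C2, T_uniform, alpha=0.5)
_, _, gwd_monotone = fgw_objective(M, C1, C2, T_monotone, alpha=0.5)
print("edge term, uniform coupling: ", gwd_uniform)
print("edge term, monotone coupling:", gwd_monotone)
```

Under these assumptions the order-preserving coupling has a smaller edge term than the uniform one, which is the kind of structural preference a node-only cost cannot express. Setting `alpha=0` leaves only the node term, which matches the abstract's remark that node-only OT transfer is a special case, although the paper's exact weighting may differ.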

