CL, SD, AS · Jun 16, 2025

Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

arXiv:2506.13596v2 · 3 citations · h-index: 3 · Workshop on Multilingual Conversational Speech Language Model (MLC-SLM)
Originality: Synthesis-oriented
AI Analysis

This paper addresses multilingual speech recognition for the MLC-SLM Challenge 2025, presenting an incremental improvement with competitive results.

This paper tackles multilingual speech recognition by combining a fine-tuned Whisper encoder with projector architectures and LLM decoders, achieving a competitive average WER/CER of 16.63% with Gemma3-12B and 18.6% with Qwen2.5-7B.

This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance, with an average WER/CER on the private test set of 16.63% using Gemma3-12B and 18.6% using Qwen2.5-7B as the decoder-only language model.
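For readers unfamiliar with the reported metric, the following is a minimal sketch of how a word or character error rate is typically computed from an edit distance. It is not the challenge's official scorer: the function names (`edit_distance`, `error_rate`, `macro_average`), the whitespace tokenization, the absence of text normalization, and the unweighted averaging over languages are all assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Return WER (unit="word") or CER (unit="char") as a percentage.

    Assumption: words are split on whitespace and characters are compared
    with spaces removed; real scorers usually normalize text first.
    """
    if unit == "word":
        ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
    else:
        ref_tokens = list(reference.replace(" ", ""))
        hyp_tokens = list(hypothesis.replace(" ", ""))
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def macro_average(rates):
    """Unweighted mean over per-language rates (an assumed averaging scheme)."""
    return sum(rates) / len(rates)


if __name__ == "__main__":
    wer = error_rate("the cat sat", "the cat sit", unit="word")
    cer = error_rate("你好世界", "你好世", unit="char")
    print(f"WER: {wer:.2f}%  CER: {cer:.2f}%")
    print(f"Average: {macro_average([wer, cer]):.2f}%")
```

CER is usually preferred for languages written without spaces between words, which is why a multilingual benchmark reports a mixed WER/CER figure.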

