SparQLe: Speech Queries to Text Translation Through LLMs
This work addresses the problem of seamless multi-modal processing and speech understanding for applications that rely on speech-to-text translation.
This study addresses the problem of integrating speech representations with Large Language Models (LLMs) for speech-to-text translation and presents a method that preserves the semantic content of the input speech. The proposed approach serves as a bridge between self-supervised speech models and instruction-tuned LLMs.
With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that combines self-supervised speech representations with instruction-tuned LLMs for speech-to-text translation. The approach uses a modality adapter to align extracted speech features with instruction-tuned LLMs using English speech data. Our experiments demonstrate that this method preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, making it a promising basis for various speech understanding applications.