CL | AI | Dec 20, 2024

MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula

Sieun Hyeon, Kyudan Jung, Jaehee Won, Nam-Joon Kim, Hyun Gon Ryu, Hyuk-Jae Lee, Jaeyoung Do

arXiv:2412.15655v34.27 citations | h-index: 13 | Has Code | AAAI

Originality: Incremental advance

AI Analysis

The work addresses the need for clear communication of mathematical content in academic and professional settings, especially for hearing-impaired listeners and non-native speakers. It is an incremental advance because it builds on existing ASR and language-model techniques.

The paper tackles the problem of converting spoken mathematical expressions into accurate LaTeX formulas, a task hindered by errors in current Automatic Speech Recognition (ASR) models. It introduces MathSpeech, a pipeline that pairs ASR models with small language models and outperforms GPT-4o on LaTeX translation: CER drops from 0.390 to 0.298, and ROUGE and BLEU scores are higher.

In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearing-impaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler's Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., e to the power of i x equals cosine of x plus i $\textit{side}$ of x), instead of the concise $\LaTeX{}$ format (i.e., $ e^{ix} = \cos(x) + i\sin(x) $), which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured $\LaTeX{}$ representations. Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates $\LaTeX{}$ generation capabilities comparable to leading commercial Large Language Models (LLMs), while leveraging fine-tuned small language models of only 120M parameters. Specifically, in terms of CER, BLEU, and ROUGE scores for $\LaTeX{}$ translation, MathSpeech demonstrated significantly superior capabilities compared to GPT-4o. We observed a decrease in CER from 0.390 to 0.298, and higher ROUGE/BLEU scores compared to GPT-4o.
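For readers unfamiliar with the CER metric quoted above, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein edit distance between the reference and the predicted string, divided by the reference length. The function names `edit_distance` and `cer`, and the example hypothesis string, are illustrative assumptions. The abstract does not state the exact normalization or tokenization the authors used, so this may differ from their evaluation script.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i characters of ref
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate under the common definition (an assumption here)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: the reference LaTeX from the abstract versus an
# output that carries over the "sine" -> "side" recognition error.
reference = r"e^{ix} = \cos(x) + i\sin(x)"
hypothesis = r"e^{ix} = \cos(x) + i\side(x)"
example_cer = cer(reference, hypothesis)

# Relative CER reduction implied by the reported figures (0.390 -> 0.298).
relative_reduction = (0.390 - 0.298) / 0.390

print(f"example CER: {example_cer:.3f}")
print(f"relative reduction: {relative_reduction:.1%}")
```

Under this definition, the reported drop from 0.390 to 0.298 corresponds to roughly a 24% relative reduction in character errors.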
