Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer
This addresses the problem of automating LaTeX generation for researchers and students, but it is incremental as it builds on existing methods with architectural improvements.
The paper tackled converting handwritten or digital mathematical expression images into LaTeX code by comparing transformer-based architectures against CNN-RNN baselines, finding that vision transformers outperformed in accuracy, BLEU scores, and lower Levenshtein distances.
Transforming mathematical expressions into LaTeX poses a significant challenge. In this paper, we examine the application of advanced transformer-based architectures to address the task of converting handwritten or digital mathematical expression images into corresponding LaTeX code. As a baseline, we utilize the current state-of-the-art CNN encoder and LSTM decoder. Additionally, we explore enhancements to the CNN-RNN architecture by replacing the CNN encoder with the pretrained ResNet50 model with modification to suite the grey scale input. Further, we experiment with vision transformer model and compare with Baseline and CNN-LSTM model. Our findings reveal that the vision transformer architectures outperform the baseline CNN-RNN framework, delivering higher overall accuracy and BLEU scores while achieving lower Levenshtein distances. Moreover, these results highlight the potential for further improvement through fine-tuning of model parameters. To encourage open research, we also provide the model implementation, enabling reproduction of our results and facilitating further research in this domain.