Enhancing Presentation Slide Generation by LLMs with a Multi-Staged End-to-End Approach
The paper addresses the time-consuming, expertise-heavy task of manually creating slides for users who need presentations built from documents, though the contribution appears incremental because it builds on existing LLM and VLM methods.
The paper tackles the problem of generating presentation slides from long documents with multimodal content. It proposes a multi-staged end-to-end model that combines an LLM and a VLM, and shows that it outperforms direct LLM prompting on automated metrics and in human evaluations.
Generating presentation slides from a long document with multimodal elements such as text and images is an important task. Done manually, it is time-consuming and requires domain expertise. Existing approaches to generating a rich presentation from a document are often semi-automatic or simply place a flat summary into the slides, ignoring the importance of a good narrative. In this paper, we address this research gap by proposing a multi-staged end-to-end model that uses a combination of an LLM and a VLM. We show experimentally that, compared with applying LLMs directly with state-of-the-art prompting, our proposed multi-staged solution performs better on both automated metrics and human evaluation.