Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models
The work addresses the need for professional-quality video generation among creative, educational, and industrial users, but its contribution is incremental because it combines existing models rather than introducing new ones.
This work tackles automatic cinematic video synthesis from text inputs with a method that integrates Stable Diffusion for images, GPT-2 for narrative, and a hybrid audio pipeline, producing 60-second videos that the authors report as having high visual quality and narrative coherence.
Advances in generative artificial intelligence have transformed multimedia creation, making automatic cinematic video synthesis from text inputs possible. This work describes a method for creating 60-second cinematic videos that combines Stable Diffusion for high-fidelity image synthesis, GPT-2 for narrative structuring, and a hybrid audio pipeline using gTTS and YouTube-sourced music. The method uses a five-scene framework, augmented by linear frame interpolation, cinematic post-processing (e.g., sharpening), and audio-video synchronization, to provide professional-quality results. The system was implemented in Python 3.11 in a GPU-accelerated Google Colab environment. It offers a dual-mode Gradio interface (Simple and Advanced) that supports resolutions of up to 1024x768 and frame rates of 15-30 FPS. Optimizations such as CUDA memory management and error handling improve reliability. The experiments demonstrate outstanding visual quality, narrative coherence, and efficiency, advancing text-to-video synthesis for creative, educational, and industrial applications.
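The linear frame interpolation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal-length split of the 60 seconds across the five scenes, and the representation of a frame as a flat list of pixel intensities are all assumptions made for illustration. A real pipeline would most likely operate on image arrays instead.

```python
def frames_per_scene(total_seconds: int = 60, scenes: int = 5, fps: int = 24) -> int:
    """Frames each scene must fill, ASSUMING all scenes have equal length."""
    return (total_seconds * fps) // scenes


def lerp_frames(frame_a, frame_b, n_between):
    """Return n_between frames blended linearly between frame_a and frame_b.

    Frames are flat sequences of pixel intensities (a hypothetical,
    simplified representation) and must have the same length.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    blended = []
    for i in range(1, n_between + 1):
        # Blend weight t rises evenly and stays strictly between 0 and 1,
        # so the two endpoint frames themselves are not repeated.
        t = i / (n_between + 1)
        blended.append([(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)])
    return blended


if __name__ == "__main__":
    # At 24 FPS (within the 15-30 FPS range), each of five scenes spans 288 frames.
    print(frames_per_scene(60, 5, 24))
    # One in-between frame is the pixel-wise midpoint of its neighbours.
    print(lerp_frames([0.0, 100.0], [100.0, 0.0], 1))
```

Under these assumptions, a scene built from a handful of generated keyframes would fill its frame budget by inserting blended frames between consecutive keyframes. Linear blending is inexpensive but yields cross-fades rather than true motion, which is a trade-off to keep in mind when reading the quality claims.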