AS CL LG SD Jul 10, 2024

AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

arXiv:2407.07801v27.37 citations h-index: 4 Has Code

Originality: Incremental advance

AI Analysis

This paper addresses audio-visual captioning, but the contribution appears incremental because it builds on existing representation learning and language models.

The authors tackle audio-visual captioning by proposing AVCap, a framework that uses audio-visual features as text tokens, and they report that it outperforms existing audio-visual captioning methods across all metrics.

In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions. Leveraging these advancements, we propose AVCap, an Audio-Visual Captioning framework, a simple yet powerful baseline approach applicable to audio-visual captioning. AVCap utilizes audio-visual features as text tokens, which has many advantages not only in performance but also in the extensibility and scalability of the model. AVCap is designed around three pivotal dimensions: the exploration of optimal audio-visual encoder architectures, the adaptation of pre-trained models according to the characteristics of generated text, and the investigation into the efficacy of modality fusion in captioning. Our method outperforms existing audio-visual captioning methods across all metrics, and the code is available at https://github.com/JongSuk1/AVCap
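
The abstract does not spell out how audio-visual features are used "as text tokens". One common reading is that encoder features are projected to the text embedding size and placed in front of the caption token embeddings, so the decoder sees a single sequence. The sketch below illustrates only that idea under this assumption. The function names, dimensions, and random weights are hypothetical and are not taken from the AVCap repository.

```python
# Hypothetical sketch: audio-visual features treated as prefix "text tokens".
# All names and sizes are illustrative assumptions, not AVCap's actual code.
import random


def linear_project(features, weight):
    """Map each feature vector (length d_in) to the text embedding size (d_out).

    `weight` holds d_out rows, each of length d_in.
    """
    return [
        [sum(x * w for x, w in zip(vec, row)) for row in weight]
        for vec in features
    ]


def build_decoder_input(av_features, text_embeddings, weight):
    """Prepend projected audio-visual features to the caption token embeddings."""
    av_tokens = linear_project(av_features, weight)
    return av_tokens + text_embeddings


random.seed(0)
D_AV, D_TEXT = 4, 3  # assumed feature size and text embedding size

# 5 audio-visual feature vectors (e.g. patches) and 2 caption token embeddings
av_features = [[random.random() for _ in range(D_AV)] for _ in range(5)]
text_embeddings = [[random.random() for _ in range(D_TEXT)] for _ in range(2)]
weight = [[random.random() for _ in range(D_AV)] for _ in range(D_TEXT)]

sequence = build_decoder_input(av_features, text_embeddings, weight)
print(len(sequence), len(sequence[0]))  # 7 positions, each of size 3
```

In a real model the projection would be learned and the combined sequence would be fed to a language-model decoder. The point here is only that, after projection, audio-visual features and text embeddings share one shape and can sit in one sequence.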
