Factor-Conditioned Speaking-Style Captioning
This work aims to improve speaking-style captioning for applications such as audio description and accessibility tools, though it appears incremental because it builds on existing captioning methods.
The paper tackles the problem of generating diverse and accurate speaking-style captions. It introduces factor-conditioned captioning (FCC), which makes the model learn speaking-style factors explicitly, and greedy-then-sampling (GtS) decoding, which balances accuracy and diversity. The results show that FCC outperforms training on the original captions and that, with GtS, it generates more diverse captions while maintaining style prediction performance.
This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria use the original captions directly; these contain not only speaking-style factor terms but also syntax words, which hinder the learning of speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender and pitch) and then generates a caption, so that the model explicitly learns the speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy and then generates a caption by factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms training on the original captions and that, with GtS, it generates more diverse captions while maintaining style prediction performance.
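The two ideas above can be illustrated with a minimal sketch. It is not the authors' implementation: the `<sep>` and `<eos>` tokens, the comma-joined factor phrase, the helper names `build_fcc_target` and `gts_decode`, and the hand-written `toy_model` distribution are all assumptions made for illustration. The sketch only shows the control flow: the factor phrase is decoded greedily, and the caption that follows it is sampled.

```python
import random

# Hypothetical special tokens; the abstract does not specify a format.
SEP, EOS = "<sep>", "<eos>"


def build_fcc_target(factors, caption):
    """Build an FCC-style training target: factor phrase first, caption second."""
    factor_phrase = ", ".join(factors)
    return f"{factor_phrase} {SEP} {caption} {EOS}"


def gts_decode(next_token_probs, max_len=50, rng=None):
    """Greedy-then-sampling: argmax until SEP is emitted, then sample until EOS.

    `next_token_probs(prefix)` is an assumed interface that returns a
    dict mapping each candidate token to its probability.
    """
    if rng is None:
        rng = random.Random()
    tokens = []
    sampling = False  # switches to True once the factor phrase is complete
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        if sampling:
            candidates, weights = zip(*probs.items())
            token = rng.choices(candidates, weights=weights, k=1)[0]
        else:
            token = max(probs, key=probs.get)  # deterministic factor prediction
        tokens.append(token)
        if token == EOS:
            break
        if token == SEP:
            sampling = True
    return tokens


def toy_model(prefix):
    """Hand-written stand-in for a trained captioning model (illustrative only)."""
    if SEP not in prefix:
        factor_steps = [
            {"female": 0.7, "male": 0.3},
            {"high-pitch": 0.6, "low-pitch": 0.4},
            {SEP: 0.9, "fast": 0.1},
        ]
        return factor_steps[len(prefix)]
    step = len(prefix) - prefix.index(SEP) - 1
    caption_steps = [
        {"a": 0.5, "the": 0.5},
        {"woman": 0.5, "lady": 0.5},
        {"speaks": 0.5, "talks": 0.5},
        {EOS: 1.0},
    ]
    return caption_steps[step]


if __name__ == "__main__":
    print(build_fcc_target(["female", "high pitch"], "A woman speaks in a high voice."))
    for seed in range(3):
        print(" ".join(gts_decode(toy_model, rng=random.Random(seed))))
```

In this toy setup every decoded sequence starts with the same factor phrase, because that part is chosen by argmax, while the caption wording after `<sep>` can differ from run to run.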