From Simple to Professional: A Combinatorial Controllable Image Captioning Agent
This work addresses the need for user-friendly, high-quality captioning tools in image analysis, though it appears incremental, since it mainly combines existing methods to improve control.
The paper tackles the problem of generating professional image captions from simple user instructions. It introduces CapAgent, which automatically transforms those instructions and uses multimodal models and external tools to produce precise, context-aware captions that follow specified guidelines.
The Controllable Image Captioning Agent (CapAgent) is an innovative system designed to bridge the gap between user simplicity and professional-level outputs in image captioning tasks. CapAgent automatically transforms user-provided simple instructions into detailed, professional instructions, enabling precise and context-aware caption generation. By leveraging multimodal large language models (MLLMs) and external tools such as an object detection tool and search engines, the system ensures that captions adhere to specified guidelines, including sentiment, keywords, focus, and formatting. CapAgent transparently controls each step of the captioning process and showcases its reasoning and tool usage at every step, fostering user trust and engagement. The project code is available at https://github.com/xin-ran-w/CapAgent.
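The flow described above (expand a simple instruction, call external tools, generate a caption, and expose every step) can be illustrated with a minimal sketch. This is not the CapAgent implementation: the names `Guidelines`, `expand_instruction`, `missing_keywords`, and `run_agent` are hypothetical, and the tools and the captioner are passed in as plain callables standing in for an object detector, a search engine, and an MLLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Guidelines:
    """Hypothetical container for the controls named in the abstract."""
    sentiment: str = "neutral"
    keywords: List[str] = field(default_factory=list)
    focus: str = ""
    fmt: str = "one sentence"


def expand_instruction(simple: str, g: Guidelines) -> str:
    """Turn a short user request into a more explicit instruction."""
    parts = [f"Task: {simple.strip()}",
             f"Sentiment: {g.sentiment}",
             f"Format: {g.fmt}"]
    if g.focus:
        parts.append(f"Focus on: {g.focus}")
    if g.keywords:
        parts.append("Must include keywords: " + ", ".join(g.keywords))
    return "\n".join(parts)


def missing_keywords(caption: str, g: Guidelines) -> List[str]:
    """Return the required keywords that the caption does not contain."""
    lowered = caption.lower()
    return [k for k in g.keywords if k.lower() not in lowered]


def run_agent(simple: str,
              g: Guidelines,
              tools: Dict[str, Callable[[str], str]],
              captioner: Callable[[str], str],
              image_path: str) -> Tuple[str, List[tuple]]:
    """Run the pipeline and record a trace of every step for the user."""
    trace: List[tuple] = []
    instruction = expand_instruction(simple, g)
    trace.append(("expand_instruction", instruction))

    # Call each external tool on the image and keep its observation.
    observations = []
    for name, tool in tools.items():
        result = tool(image_path)
        trace.append((name, result))
        observations.append(f"{name}: {result}")

    # The captioner (an MLLM stand-in) sees the instruction and observations.
    prompt = instruction + "\nTool observations:\n" + "\n".join(observations)
    caption = captioner(prompt)
    trace.append(("caption", caption))
    trace.append(("missing_keywords", missing_keywords(caption, g)))
    return caption, trace
```

With stub callables in place of real models, `run_agent("describe the photo", Guidelines(keywords=["dog"]), {"detector": lambda p: "dog, ball"}, lambda prompt: "A dog with a ball.", "img.jpg")` returns the caption together with a step-by-step trace, which is the kind of transparency the abstract attributes to the system.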