CV · AI · Nov 27, 2024

CoVis: A Collaborative Framework for Fine-grained Graphic Visual Understanding

Xiaoyu Deng, Zhengjian Kang, Xintao Li, Yongzhe Zhang, Tianmin Guo

arXiv:2411.18764v15.21 citations · h-index: 4

Originality: Incremental advance

AI Analysis

The paper addresses the inefficient and limited interpretation of visual information by observers, though it appears incremental because it builds on existing segmentation and LLM techniques.

The paper tackles the problem of human reliance on personal knowledge for interpreting visual content, which affects information quality and efficiency, by proposing CoVis, a collaborative framework for fine-grained visual understanding that uses a cascaded dual-layer segmentation network and LLM-based content generator. Results from quantitative experiments and 32 human participants show that CoVis outperforms current methods in feature extraction and generates more comprehensive and detailed visual descriptions than general-purpose large models.

Graphic visual content promotes information communication and divergent inspiration. However, interpreting visual content currently relies mainly on the observer's personal knowledge background, which limits the quality and efficiency of information acquisition and understanding. To improve the quality and efficiency of visual information transmission and to keep observers from being confined by an information cocoon, we propose CoVis, a collaborative framework for fine-grained visual understanding. The framework couples a cascaded dual-layer segmentation network with a large language model (LLM) based content generator to extract as much knowledge as possible from an image. It then generates visual analytics for the image, helping observers comprehend imagery from a more holistic perspective. Quantitative experiments and qualitative experiments with 32 human participants indicate that CoVis outperforms current methods in feature extraction and generates more comprehensive and detailed visual descriptions than current general-purpose large models.
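The abstract gives only the outline of the pipeline: segmentation in two cascaded layers, followed by an LLM-based generator. The sketch below illustrates that data flow under stated assumptions and is not the authors' implementation. It assumes the second layer runs inside each region found by the first (a coarse-to-fine reading of "cascaded dual-layer"), and every name in it (`Region`, `cascade_segment`, `build_prompt`, `describe`, and the `toy_*` stand-ins) is hypothetical. The segmenters and the LLM are passed in as plain callables, so no real model or library API is implied.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), an assumed convention


@dataclass
class Region:
    """A labeled image region; `parts` holds second-layer sub-regions."""
    label: str
    box: Box
    parts: List["Region"] = field(default_factory=list)


def cascade_segment(
    image: Any,
    coarse: Callable[[Any], List[Region]],
    fine: Callable[[Any, Box], List[Region]],
) -> List[Region]:
    """Layer 1 segments the whole image; layer 2 segments inside each result."""
    regions = coarse(image)
    for region in regions:
        region.parts = fine(image, region.box)
    return regions


def build_prompt(regions: List[Region]) -> str:
    """Serialize the two-level segmentation into a text prompt for an LLM."""
    lines = ["Describe the image using these detected elements:"]
    for region in regions:
        lines.append(f"- {region.label} at {region.box}")
        for part in region.parts:
            lines.append(f"  - {part.label} at {part.box}")
    return "\n".join(lines)


def describe(image: Any, coarse, fine, llm: Callable[[str], Any]) -> Any:
    """Full flow: cascaded segmentation, prompt construction, generation."""
    return llm(build_prompt(cascade_segment(image, coarse, fine)))


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_coarse(image: Any) -> List[Region]:
    return [Region("poster", (0, 0, 100, 100))]


def toy_fine(image: Any, box: Box) -> List[Region]:
    return [Region("title text", (10, 10, 90, 30))]
```

Calling `describe(image, toy_coarse, toy_fine, llm)` with any callable `llm` that accepts a prompt string exercises the whole flow. Keeping the models behind callables means the data flow can be checked independently of whichever segmentation network or LLM is actually used.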
