CV · Feb 14, 2025

Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding

Thanh-Dat Truong, Hoang-Quan Nguyen, Xuan-Bac Nguyen, Ashley Dowling, Xin Li, Khoa Luu

arXiv:2502.09906v117.411 citations · h-index: 16 · Int J Comput Vis

Originality: Incremental advance

AI Analysis

The work addresses a fundamental problem in precision agriculture for sustainable development, but the advance is incremental: it adapts existing multimodal AI to a specific domain.

The paper tackles the lack of insect knowledge in multimodal conversational AI by introducing Insect-LLaVA, a model that achieves state-of-the-art performance on insect-related tasks through a new large-scale dataset and a foundation model with micro-feature learning.

Multimodal conversational generative AI has shown impressive capabilities in a variety of vision and language understanding tasks by learning from massive text-image data. However, current conversational models still lack visual knowledge of insects, since they are typically trained on general vision-language data. Meanwhile, understanding insects is a fundamental problem in precision agriculture and helps promote sustainable agricultural development. Therefore, this paper proposes a novel multimodal conversational model, Insect-LLaVA, to promote visual understanding in insect-domain knowledge. In particular, we first introduce a new large-scale Multimodal Insect Dataset with Visual Insect Instruction Data that enables the training of multimodal foundation models. The proposed dataset enables conversational models to comprehend the visual and semantic features of insects. Second, we propose Insect-LLaVA, a new general Large Language and Vision Assistant for Visual Insect Understanding. Third, to enhance the learning of insect features, we develop an Insect Foundation Model by introducing a new micro-feature self-supervised learning approach with a Patch-wise Relevant Attention mechanism that captures the subtle differences among insect images. We also present a Description Consistency loss to improve micro-feature learning via text descriptions. Experimental results on our new Visual Insect Question Answering benchmarks demonstrate the effectiveness of the proposed approach in visual insect understanding, and the approach achieves State-of-the-Art performance on standard benchmarks of insect-related tasks.
