CV | Mar 25

LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration

arXiv:2603.24696 | 51.4 | h-index: 3 | Has Code

AI Analysis

This work addresses the shortage of multimodal data for lunar terrain analysis, enabling more effective AI tools for planetary scientists. The contribution is incremental, since it adapts existing methods to a new domain.

The paper tackles the lack of large-scale datasets for applying vision-language models to planetary science by introducing LLaVA-LE, a model specialized for lunar exploration. LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and a reasoning score of 1.070, exceeding the judge's own reference score.

Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset), consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k images in the LUCID dataset. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070, exceeding the judge's own reference score, highlighting the effectiveness of domain-specific multimodal data and instruction tuning in advancing VLMs for planetary exploration. Code is available at https://github.com/OSUPCVLab/LLaVA-LE.
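A score such as 1.070 that "exceeds the judge's own reference score" is most naturally read as a ratio of the model's judged score to the judged score of a reference answer. The abstract does not give the exact formula, so the following is only a minimal sketch under that assumption. The function names `relative_score` and `gain` and the sample numbers are hypothetical and are not taken from the paper or its code.

```python
from statistics import mean


def relative_score(candidate_scores, reference_scores):
    """Ratio of the mean judged candidate score to the mean judged reference score.

    Assumption: a judge model rates both the candidate answer and a reference
    answer for every question on the same numeric scale. A result above 1.0
    means the candidate was rated higher than the reference on average.
    """
    if not reference_scores or len(candidate_scores) != len(reference_scores):
        raise ValueError("need one reference score per candidate score")
    reference_mean = mean(reference_scores)
    if reference_mean == 0:
        raise ValueError("mean reference score must be non-zero")
    return mean(candidate_scores) / reference_mean


def gain(new_score, baseline_score):
    """Multiplicative gain of one relative score over another (e.g. '3.3x')."""
    return new_score / baseline_score


# Hypothetical judge ratings on a 1-10 scale for three questions.
candidate = [9, 8, 10]
reference = [8, 8, 9]
ratio = relative_score(candidate, reference)
print(f"relative score: {ratio:.3f}")
```

Under this reading, a gain of 3.3x over Base LLaVA would mean the fine-tuned model's overall relative score is 3.3 times the base model's, which `gain` computes from two such ratios.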
