CL · LG · GEO-PH · Jun 24, 2024

Classification of Geological Borehole Descriptions Using a Domain Adapted Large Language Model

arXiv:2407.10991v1 · 3 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of processing unstructured geological data for domain experts, offering incremental improvements in efficiency and accuracy for geological analysis.

The paper tackles the extraction of structured information from unstructured geological borehole descriptions. It introduces GEOBERTje, a domain-adapted large language model trained on Dutch-language borehole descriptions from Flanders. A classifier fine-tuned on top of it outperforms both a rule-based approach and GPT-4 at assigning lithology classes.

Geological borehole descriptions contain detailed textual information about the composition of the subsurface. However, their unstructured format makes it difficult to extract relevant features into a structured form. This paper introduces GEOBERTje: a domain-adapted large language model trained on Dutch-language geological borehole descriptions from Flanders (Belgium). The model extracts relevant information from the borehole descriptions and represents it in a numeric vector space. To showcase one potential application of GEOBERTje, we fine-tune a classifier on a limited number of manually labeled observations. This classifier assigns each borehole description a main, second, and third lithology class. We show that our classifier outperforms both a rule-based approach and OpenAI's GPT-4. This study exemplifies how domain-adapted large language models improve the efficiency and accuracy of extracting information from complex, unstructured geological descriptions, which opens new opportunities for geological analysis and modeling using vast amounts of data.
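The embed-then-classify pattern described in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's implementation: a toy bag-of-words vector stands in for GEOBERTje's embeddings, a nearest-neighbour lookup stands in for the fine-tuned classifier, and the training examples, the `none` placeholder label, and the function names `embed`, `cosine`, and `classify` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical, hand-made examples (NOT from the paper's dataset).
# Each item: (Dutch description, (main, second, third lithology)).
# "none" is an assumed placeholder for an absent class.
# zand = sand, klei = clay, grind = gravel, leem = loam.
TRAIN = [
    ("fijn zand met weinig klei", ("zand", "klei", "none")),
    ("grijze klei zandhoudend", ("klei", "zand", "none")),
    ("grof zand met grind en leem", ("zand", "grind", "leem")),
    ("bruine leem", ("leem", "none", "none")),
]


def embed(text):
    """Toy stand-in for a language-model embedding: word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text):
    """Return the label triple of the most similar labeled description."""
    query = embed(text)
    _, labels = max(TRAIN, key=lambda ex: cosine(query, embed(ex[0])))
    return dict(zip(("main", "second", "third"), labels))


if __name__ == "__main__":
    print(classify("grijze klei"))
```

In a realistic setup the `embed` step would be replaced by vectors from the domain-adapted model and `classify` by a trained classification head; the sketch only shows how one description maps to three ordered lithology labels.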
