A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior
This provides a systematic benchmark for improving fault delineation in seismic interpretation, addressing domain shift challenges, but it is incremental as it builds on existing models and datasets.
The paper tackles the problem of limited generalizability of fault delineation models across diverse seismic data due to domain shifts, inconsistent evaluation, and training issues, finding that fine-tuning can cause catastrophic forgetting, larger models like Segformer are more robust, and domain adaptation outperforms fine-tuning for large shifts but underperforms for similar domains.
Machine learning has taken a critical role in seismic interpretation workflows, especially in fault delineation tasks. However, despite the recent proliferation of pretrained models and synthetic datasets, the field still lacks a systematic understanding of the generalizability limits of these models across seismic data representing diverse geologic, acquisition and processing settings. Distributional shifts between data sources, limitations in fine-tuning strategies and labeled data accessibility, and inconsistent evaluation protocols all remain major roadblocks to deploying reliable models in real-world exploration. In this paper, we present the first large-scale benchmarking study explicitly designed to provide guidelines for domain shift strategies in seismic interpretation. Our benchmark spans over 200 combinations of model architectures, datasets and training strategies, across three datasets (synthetic and real) including FaultSeg3D, CRACKS, and Thebe. We systematically assess pretraining, fine-tuning, and joint training under varying domain shifts. Our analysis shows that common fine-tuning practices can lead to catastrophic forgetting, especially when source and target datasets are disjoint, and that larger models such as Segformer are more robust than smaller architectures. We also find that domain adaptation methods outperform fine-tuning when shifts are large, yet underperform when domains are similar. Finally, we complement segmentation metrics with a novel analysis based on fault characteristic descriptors, revealing how models absorb structural biases from training datasets. Overall, we establish a robust experimental baseline that provides insights into tradeoffs in current fault delineation workflows and highlights directions for building more generalizable and interpretable models.