Viacheslav Barkov

h-index3

4papers

36citations

Novelty37%

AI Score34

Ranked #113,824 of 194,257 authors (top 59%)#25,047 in LG (top 62%)

4 Papers

6.9LGJun 19

Rejections Based on Predictive Uncertainty Enable Reliable Routine Soil Spectroscopy

Jonas Schmidinger, Robin Gebbers, Marc-Olivier Gasser et al.

Soil properties relevant to agricultural and environmental applications are conventionally measured using elaborate laboratory methods involving physical and chemical processing. While highly accurate, these conventional methods are costly and time-consuming. In contrast, optical spectroscopy paired with machine learning enables rapid and cost-effective predictions of multiple soil properties. However, spectroscopic modelling is often considered unreliable, as the predictive accuracy varies between soil properties and individual samples. To balance this trade-off between cost and reliability, we introduce reject-to-remeasure: an AI-based measurement framework that combines probabilistic modelling with uncertainty-guided rejection. In this framework, soil samples are first analysed using spectroscopy, after which predictions are rejected if their predictive uncertainty exceeds predefined quality constraints. Rejected samples are subsequently remeasured using conventional laboratory procedures. On a regional visible-near-infrared spectral soil library from Québec, we demonstrate that reject-to-remeasure with modern foundation models (TabPFNv2.5 and TabICLv2) can facilitate the integration of optical spectroscopy into routine laboratory workflows while meeting user-defined accuracy requirements and reducing measurement costs.

9.4LGAug 13, 2025

Modern Neural Networks for Small Tabular Datasets: The New Default for Field-Scale Digital Soil Mapping?

Viacheslav Barkov, Jonas Schmidinger, Robin Gebbers et al.

In the field of pedometrics, tabular machine learning is the predominant method for predicting soil properties from remote and proximal soil sensing data, forming a central component of digital soil mapping. At the field-scale, this predictive soil modeling (PSM) task is typically constrained by small training sample sizes and high feature-to-sample ratios in soil spectroscopy. Traditionally, these conditions have proven challenging for conventional deep learning methods. Classical machine learning algorithms, particularly tree-based models like Random Forest and linear models such as Partial Least Squares Regression, have long been the default choice for field-scale PSM. Recent advances in artificial neural networks (ANN) for tabular data challenge this view, yet their suitability for field-scale PSM has not been proven. We introduce a comprehensive benchmark that evaluates state-of-the-art ANN architectures, including the latest multilayer perceptron (MLP)-based models (TabM, RealMLP), attention-based transformer variants (FT-Transformer, ExcelFormer, T2G-Former, AMFormer), retrieval-augmented approaches (TabR, ModernNCA), and an in-context learning foundation model (TabPFN). Our evaluation encompasses 31 field- and farm-scale datasets containing 30 to 460 samples and three critical soil properties: soil organic matter or soil organic carbon, pH, and clay content. Our results reveal that modern ANNs consistently outperform classical methods on the majority of tasks, demonstrating that deep learning has matured sufficiently to overcome the long-standing dominance of classical machine learning for PSM. Notably, TabPFN delivers the strongest overall performance, showing robustness across varying conditions. We therefore recommend the adoption of modern ANNs for field-scale PSM and propose TabPFN as the new default choice in the toolkit of every pedometrician.

9.4LGFeb 27, 2025Code

LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping

J. Schmidinger, S. Vogel, V. Barkov et al.

Digital soil mapping (DSM) relies on a broad pool of statistical methods, yet determining the optimal method for a given context remains challenging and contentious. Benchmarking studies on multiple datasets are needed to reveal strengths and limitations of commonly used methods. Existing DSM studies usually rely on a single dataset with restricted access, leading to incomplete and potentially misleading conclusions. To address these issues, we introduce an open-access dataset collection called Precision Liming Soil Datasets (LimeSoDa). LimeSoDa consists of 31 field- and farm-scale datasets from various countries. Each dataset has three target soil properties: (1) soil organic matter or soil organic carbon, (2) clay content and (3) pH, alongside a set of features. Features are dataset-specific and were obtained by optical spectroscopy, proximal- and remote soil sensing. All datasets were aligned to a tabular format and are ready-to-use for modeling. We demonstrated the use of LimeSoDa for benchmarking by comparing the predictive performance of four learning algorithms across all datasets. This comparison included multiple linear regression (MLR), support vector regression (SVR), categorical boosting (CatBoost) and random forest (RF). The results showed that although no single algorithm was universally superior, certain algorithms performed better in specific contexts. MLR and SVR performed better on high-dimensional spectral datasets, likely due to better compatibility with principal components. In contrast, CatBoost and RF exhibited considerably better performances when applied to datasets with a moderate number (< 20) of features. These benchmarking results illustrate that the performance of a method is highly context-dependent. LimeSoDa therefore provides an important resource for improving the development and evaluation of statistical methods in DSM.

7.1LGSep 11, 2025

Kriging prior Regression: A Case for Kriging-Based Spatial Features with TabPFN in Soil Mapping

Jonas Schmidinger, Viacheslav Barkov, Sebastian Vogel et al.

Machine learning and geostatistics are two fundamentally different frameworks for predicting and spatially mapping soil properties. Geostatistics leverages the spatial structure of soil properties, while machine learning captures the relationship between available environmental features and soil properties. We propose a hybrid framework that enriches ML with spatial context through engineering of 'spatial lag' features from ordinary kriging. We call this approach 'kriging prior regression' (KpR), as it follows the inverse logic of regression kriging. To evaluate this approach, we assessed both the point and probabilistic prediction performance of KpR, using the TabPFN model across six fieldscale datasets from LimeSoDa. These datasets included soil organic carbon, clay content, and pH, along with features derived from remote sensing and in-situ proximal soil sensing. KpR with TabPFN demonstrated reliable uncertainty estimates and more accurate predictions in comparison to several other spatial techniques (e.g., regression/residual kriging with TabPFN), as well as to established non-spatial machine learning algorithms (e.g., random forest). Most notably, it significantly improved the average R2 by around 30% compared to machine learning algorithms without spatial context. This improvement was due to the strong prediction performance of the TabPFN algorithm itself and the complementary spatial information provided by KpR features. TabPFN is particularly effective for prediction tasks with small sample sizes, common in precision agriculture, whereas KpR can compensate for weak relationships between sensing features and soil properties when proximal soil sensing data are limited. Hence, we conclude that KpR with TabPFN is a very robust and versatile modelling framework for digital soil mapping in precision agriculture.