GEO-PH LG Jul 16, 2021

A Data-driven feature selection and machine-learning model benchmark for the prediction of longitudinal dispersion coefficient

Yifeng Zhao, Pei Zhang, S. A. Galindo-Torres, Stan Z. Li

arXiv:2107.12970v11.2

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable feature selection and model choice in predicting dispersion coefficients for environmental simulations, though it is incremental as it builds on existing ML approaches.

The study tackled the prediction of the longitudinal dispersion coefficient in natural streams by identifying an optimal feature set and benchmarking machine learning models. It found that the support vector machine performed best, while the decision tree was unsuitable because of its poor generalization.

Longitudinal dispersion (LD) is the dominant process of scalar transport in natural streams. An accurate prediction of the LD coefficient (Dl) can substantially improve related simulations. Emerging machine learning (ML) techniques provide a self-adaptive tool for this problem. However, most existing studies rely on an unverified four-parameter feature set obtained through simple theoretical deduction, and few have examined its reliability and rationality. In addition, because systematic comparisons are lacking, the proper choice of ML model in different scenarios remains unknown. In this study, the Feature Gradient selector was first adopted to distill locally optimal feature sets directly from multivariable data. Then, a globally optimal feature set (the channel width, the flow velocity, the channel slope, and the cross-sectional area) was proposed by numerically comparing the performance of the distilled local optima with representative ML models. The channel slope is identified as the key parameter for predicting the LD coefficient. Further, we designed a weighted evaluation metric that enables comprehensive model comparison. With a simple linear model as the baseline, a benchmark of single and ensemble learning models was provided. Advantages and disadvantages of the methods involved are also discussed. Results show that the support vector machine performs significantly better than the other models. The decision tree is not suitable for this problem because of its poor generalization ability. Notably, simple models outperform complicated models on this low-dimensional problem because they better balance regression and generalization.
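The abstract does not give the form of the weighted evaluation metric. The sketch below shows one plausible shape for such a score: a weighted sum of training and test R², with the test term weighted more heavily so that overfitting is penalized. The function names, the weight `w_test = 0.7`, and the toy scores for `model_a` and `model_b` are all assumptions made for illustration; none of them come from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def weighted_score(train_r2, test_r2, w_test=0.7):
    """Hypothetical weighted metric: blend of training fit and test fit.

    w_test is an assumed weight; a larger value rewards generalization
    (test fit) more than regression quality on the training data.
    """
    return (1.0 - w_test) * train_r2 + w_test * test_r2


def rank_models(scores, w_test=0.7):
    """Sort models from best to worst by the weighted score.

    scores maps a model name to a (train_r2, test_r2) pair.
    """
    ranked = [
        (name, weighted_score(train_r2, test_r2, w_test))
        for name, (train_r2, test_r2) in scores.items()
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Invented toy values, not results from the paper:
# model_a fits the training data perfectly but generalizes poorly,
# model_b fits slightly worse but is much more balanced.
toy_scores = {
    "model_a": (1.00, 0.40),
    "model_b": (0.80, 0.75),
}
ranking = rank_models(toy_scores)
print(ranking)
```

Under this assumed weighting, the balanced model ranks above the overfitted one, which mirrors the abstract's qualitative point about balancing regression and generalization. A different weight or a different base metric (for example RMSE) would change the numbers and possibly the ordering.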
