LG CV NE ML

Oct 13, 2020

Similarity Based Stratified Splitting: an approach to train better classifiers

Felipe Farias, Teresa Ludermir, Carmelo Bastos-Filho

arXiv:2010.06099v12.321 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses a practical problem for machine learning practitioners: how to split data so that classifiers are trained and evaluated more reliably. The contribution is incremental, since it builds on existing stratified splitting techniques.

The paper proposes a Similarity-Based Stratified Splitting technique for dividing data when training classifiers. It uses information from both the input and output spaces to place similar samples in different splits, which gives a more realistic performance estimate. The approach outperforms ordinary stratified 10-fold cross-validation in 75% of the assessed scenarios across 22 benchmark datasets.

We propose a Similarity-Based Stratified Splitting (SBSS) technique, which uses information from both the output and input spaces to split the data. The splits are generated using similarity functions among samples so that similar samples are placed in different splits. This approach allows for a better representation of the data in the training phase and leads to a more realistic performance estimate when the model is used in real-world applications. We evaluate our proposal on twenty-two benchmark datasets with four classifiers (Multi-Layer Perceptron, Support Vector Machine, Random Forest, and K-Nearest Neighbors) and five similarity functions (Cityblock, Chebyshev, Cosine, Correlation, and Euclidean). According to the Wilcoxon signed-rank test, our approach consistently outperformed ordinary stratified 10-fold cross-validation in 75% of the assessed scenarios.
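The abstract does not spell out the splitting procedure, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes one plausible reading: within each class, samples are ordered into a greedy nearest-neighbor chain, and consecutive chain members are dealt to folds round-robin so that close neighbors land in different folds. The function name `similarity_stratified_folds`, the chaining heuristic, and the choice of the first sample as the chain start are all assumptions made for illustration. Only three of the five distance functions named above are included, to keep the example dependency-free.

```python
from collections import defaultdict
from math import sqrt


def cityblock(a, b):
    # Sum of absolute coordinate differences (L1 distance).
    return sum(abs(p - q) for p, q in zip(a, b))


def chebyshev(a, b):
    # Largest absolute coordinate difference (L-infinity distance).
    return max(abs(p - q) for p, q in zip(a, b))


def euclidean(a, b):
    # Straight-line (L2) distance.
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def similarity_stratified_folds(X, y, n_splits, distance=euclidean):
    """Hypothetical sketch: return n_splits lists of sample indices.

    Assumption (not from the paper): similar samples are separated by
    chaining nearest neighbors within a class and dealing the chain
    round-robin across folds.
    """
    # Group sample indices by class label (the output-space information).
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    offset = 0  # rotates the starting fold so fold sizes stay balanced
    for label in sorted(by_class):
        remaining = by_class[label][:]
        current = remaining.pop(0)
        chain = [current]
        # Greedy chain: always move to the closest unvisited sample
        # (the input-space information).
        while remaining:
            nxt = min(remaining, key=lambda j: distance(X[current], X[j]))
            remaining.remove(nxt)
            chain.append(nxt)
            current = nxt
        # Consecutive chain members are similar, so send them to
        # different folds.
        for pos, idx in enumerate(chain):
            folds[(offset + pos) % n_splits].append(idx)
        offset += len(chain)
    return folds


# Toy usage: two tight clusters, one per class.
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 0.1], [5.0, 5.1]]
y = [0, 0, 1, 1, 0, 1]
folds = similarity_stratified_folds(X, y, n_splits=2)
```

Because each class is dealt round-robin, per-class counts differ by at most one between folds, which preserves the stratification property of ordinary stratified k-fold. The greedy chain costs quadratic time per class, so this sketch is only suitable for small datasets.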
