k-Nearest Neighbors by Means of Sequence to Sequence Deep Neural Networks and Memory Networks
This work addresses classification and imbalanced data problems for machine learning practitioners, offering an incremental improvement by adapting existing deep learning architectures to mimic and enhance kNN.
The paper tackles the problem of improving k-Nearest Neighbors classification by proposing models based on sequence-to-sequence and memory networks that generate sequences of labels and features and can therefore also function as oversamplers. On structured datasets, these models are reported to outperform kNN, a feed-forward neural network, XGBoost, LightGBM, random forest and a memory network. On image and text datasets their performance is close to that of many state-of-the-art deep models, and on imbalanced data the sequence-to-sequence variant often beats oversampling techniques such as SMOTE and ADASYN.
k-Nearest Neighbors is one of the most fundamental yet effective classification models. In this paper, we propose two families of models, built on a sequence to sequence model and a memory network model, that mimic the k-Nearest Neighbors model. Each model generates a sequence of labels, a sequence of out-of-sample feature vectors and a final label for classification, so it can also function as an oversampler. We also propose 'out-of-core' versions of our models, which assume that only a small portion of the data can be loaded into memory. Computational experiments show that, on structured datasets, our models outperform k-Nearest Neighbors, a feed-forward neural network, XGBoost, LightGBM, random forest and a memory network, because our models must produce additional output and not just the label. On image and text datasets, the performance of our models is close to that of many state-of-the-art deep models. As an oversampler on imbalanced datasets, the sequence to sequence kNN model often outperforms Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic Sampling (ADASYN).
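To make the three outputs named in the abstract concrete, the following is a minimal sketch of classical kNN, the baseline the proposed models are said to mimic. It is not the paper's neural architecture. For a single query it returns the labels of the k nearest training points, their feature vectors, and a majority-vote label. The function names `knn_targets` and `interpolate_sample`, the Euclidean distance, and the toy data are assumptions made for illustration. The second function shows SMOTE-style linear interpolation between a point and a neighbor, which is the kind of oversampling baseline the abstract compares against.

```python
from collections import Counter

import numpy as np


def knn_targets(X_train, y_train, x_query, k=3):
    """Classical kNN for one query (hypothetical helper, not from the paper).

    Returns the labels of the k nearest training points, their feature
    vectors, and the majority-vote label.
    """
    # Euclidean distance from the query to every training point (assumed metric).
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest points, nearest first.
    order = np.argsort(dists, kind="stable")[:k]
    neighbor_labels = y_train[order]
    neighbor_features = X_train[order]
    # Majority vote over the neighbor labels.
    final_label = Counter(neighbor_labels.tolist()).most_common(1)[0][0]
    return neighbor_labels, neighbor_features, final_label


def interpolate_sample(x, neighbor, lam):
    """SMOTE-style synthetic point on the segment from x to neighbor.

    lam is expected to lie in [0, 1]; SMOTE draws it at random.
    """
    return x + lam * (neighbor - x)


# Toy data, invented for illustration only.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1])

labels, feats, label = knn_targets(X_train, y_train, np.array([0.2, 0.1]), k=3)
synthetic = interpolate_sample(X_train[3], X_train[4], lam=0.5)
```

In this framing, a model that must reproduce `labels` and `feats` as well as `label` is given a richer training signal than one that predicts `label` alone, which is the explanation the abstract offers for the reported gains on structured data.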