Analytics Modelling over Multiple Datasets using Vector Embeddings
This addresses the problem of dataset selection for analysts, but it appears incremental as it builds on existing modeling frameworks with a new vectorization approach.
The paper tackles the challenge of selecting high-quality datasets for analytics by proposing a method that predicts analytics outcomes using vector embeddings, achieving accurate predictions and increased speedup compared to a state-of-the-art framework.
The massive increase in the data volume and dataset availability for analysts compels researchers to focus on data content and select high-quality datasets to enhance the performance of analytics operators. While selecting high-quality data significantly boosts analytical accuracy and efficiency, the exact process is very challenging given large-scale dataset availability. To address this issue, we propose a novel methodology that infers the outcome of analytics operators by creating a model from the available datasets. Each dataset is transformed to a vector embedding representation generated by our proposed deep learning model NumTabData2Vec, where similarity search are employed. Through experimental evaluation, we compare the prediction performance and the execution time of our framework to another state-of-the-art modelling operator framework, illustrating that our approach predicts analytics outcomes accurately, and increases speedup. Furthermore, our vectorization model can project different real-world scenarios to a lower vector embedding representation accurately and distinguish them.