LG · Oct 1, 2023

Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest

arXiv:2310.00820v1 · 2 citations · h-index: 9
Originality: Incremental advance
AI Analysis

The paper addresses a key challenge in time series clustering for data mining applications, namely choosing the number of clusters, though the contribution appears incremental because it extends an existing method.

The paper tackles the problem of determining the optimal number of clusters for time series datasets by extending the Symbolic Pattern Forest algorithm and selecting the number of clusters with the Silhouette Coefficient. It reports significant improvement over baselines on UCR archive datasets.

Clustering algorithms are among the most widely used data mining methods, both for their exploratory power and because they serve as an initial preprocessing step that paves the way for other techniques. However, calculating the optimal number of clusters (say k) is one of the significant challenges for such methods. The most widely used clustering algorithms in time series data mining, such as k-means and k-shape, also need the true number of clusters to be supplied as an input. In this work, we extended the Symbolic Pattern Forest (SPF) algorithm, another time series clustering algorithm, to determine the optimal number of clusters for time series datasets. We used SPF to generate the clusters from the datasets and chose the optimal number of clusters based on the Silhouette Coefficient, a metric that measures the quality of a clustering. The Silhouette Coefficient was calculated on both the bag-of-words vectors and the tf-idf vectors generated from the SAX (Symbolic Aggregate approXimation) words of each time series. We tested our approach on the UCR archive datasets, and our experimental results so far show significant improvement over the baseline.
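
As an illustration of the selection step described above, the following is a minimal Python sketch, not the authors' implementation. It assumes the SAX words for each series are already available, builds bag-of-words count vectors from them, and picks the candidate k with the highest mean Silhouette Coefficient, which is the usual way the metric is used for model selection. The names `bag_of_words`, `silhouette`, `toy_kmeans`, and `select_k` are hypothetical, the SAX words are made-up toy data, and a small k-means routine stands in for SPF, whose clustering procedure is not reproduced here.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Turn each series' list of SAX words into a count vector over a shared vocabulary."""
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    return [[c[w] for w in vocab] for c in counts]


def silhouette(points, labels):
    """Mean Silhouette Coefficient (Euclidean distance); singleton clusters score 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # s(i) = 0 for a singleton cluster, by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def toy_kmeans(points, k, iters=20):
    """Assumed stand-in for SPF: Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def select_k(points, candidate_ks, cluster_fn=toy_kmeans):
    """Cluster once per candidate k and return the k with the highest silhouette."""
    scores = {k: silhouette(points, cluster_fn(points, k)) for k in candidate_ks}
    return max(scores, key=scores.get), scores


# Toy SAX words for six series (illustrative data, not from the UCR archive).
sax_words = [
    ["aab", "abb", "aab"], ["aab", "abb", "abb"],
    ["cca", "cba", "cca"], ["cca", "cba", "cba"],
    ["bcb", "bbc", "bcb"], ["bcb", "bbc", "bbc"],
]
vectors = bag_of_words(sax_words)
best_k, scores = select_k(vectors, range(2, 5))
print(best_k, scores)
```

On this toy input the sketch selects k = 3, matching the three groups of SAX words. The same `select_k` call would work unchanged on tf-idf vectors, and any clustering routine with the signature `cluster_fn(points, k)` could be passed in place of `toy_kmeans`.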

