IR · Oct 24, 2016

SSH (Sketch, Shingle, & Hash) for Indexing Massive-Scale Time Series

arXiv:1610.07328v14.81 citations

Originality: Highly original

AI Analysis

SSH addresses a bottleneck in large-scale time series applications, where existing search methods degrade for longer queries, and offers a substantial speedup.

The paper tackles slow similarity search under Dynamic Time Warping (DTW) for time series by proposing SSH, an efficient hashing scheme that prunes around 95% of candidates and is about 20 times faster than the state-of-the-art UCR suite without significant loss in accuracy.
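To make the cost of DTW concrete, here is a minimal sketch of the textbook dynamic-programming formulation, not code from the paper or from the UCR suite. The function name `dtw_distance`, the squared point-wise cost, and the optional `window` band are assumptions chosen for illustration.

```python
import math

def dtw_distance(a, b, window=None):
    """Textbook DTW between two numeric sequences.

    Assumptions (not from the paper): squared point-wise cost, square root
    of the accumulated cost as the result, and an optional band `window`
    that limits how far the alignment may stray from the diagonal.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)              # no band: full quadratic table
    window = max(window, abs(n - m))    # band must be wide enough to reach (n, m)

    prev = [math.inf] * (m + 1)
    prev[0] = 0.0                       # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of: insertion, deletion, match
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])
```

Without a band, every query-candidate comparison fills a table of `len(a) * len(b)` cells, which is why long queries are expensive unless most candidates can be discarded before DTW is ever evaluated.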

Similarity search on time series is a frequent operation in large-scale data-driven applications. Because time series are usually misaligned, sophisticated similarity measures are standard for time series matching. Dynamic Time Warping (DTW) is the most widely used similarity measure for time series because it performs alignment and matching at the same time. However, the alignment makes DTW slow. To speed up the expensive similarity search with DTW, branch-and-bound pruning strategies are commonly adopted. These strategies are only useful for very short queries (low-dimensional time series); for longer queries the bounds are quite weak, and with such loose bounds branch-and-bound pruning degenerates into a brute-force search. To circumvent this issue, we design SSH (Sketch, Shingle, & Hash), an efficient, approximate hashing scheme that is much faster than the state-of-the-art branch-and-bound search technique, the UCR suite. SSH uses a novel combination of sketching, shingling, and hashing techniques to produce (probabilistic) indexes that align (near perfectly) with the DTW similarity measure. The generated indexes are then used to create hash buckets for sub-linear search. Our results show that SSH is very effective for longer time sequences and prunes around 95% of candidates, leading to a massive speedup in search with DTW. Empirical results on two large-scale benchmark time series datasets show that our proposed method can be around 20 times faster than the state-of-the-art package (the UCR suite) without any significant loss in accuracy.
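The abstract names three stages (sketch, shingle, hash) without spelling out their details, so the following is only a simplified illustration of how such a pipeline could fit together, not the authors' implementation. It assumes the sketch is the sign of a sliding dot product with a random filter, the shingles are n-grams of the resulting bit string, and the hash is a plain min-hash over the set of distinct shingles. All function names and parameters (`filt`, `step`, `n`, `seeds`) are hypothetical.

```python
import random

def sketch(series, filt, step):
    """Slide `filt` over `series`; emit 1 if the dot product is >= 0, else 0."""
    w = len(filt)
    bits = []
    for start in range(0, len(series) - w + 1, step):
        dot = sum(f * x for f, x in zip(filt, series[start:start + w]))
        bits.append(1 if dot >= 0 else 0)
    return bits

def shingle(bits, n):
    """Set of distinct n-grams (as tuples) found in the bit sketch."""
    return {tuple(bits[i:i + n]) for i in range(len(bits) - n + 1)}

def minhash_signature(shingles, seeds):
    """One min-hash value per seed; requires a non-empty shingle set.

    Hashing a tuple of ints is deterministic across Python runs, so the
    signature is reproducible. This is an illustrative hash family only.
    """
    return tuple(min(hash((seed, s)) for s in shingles) for seed in seeds)

def signature(series, filt, step, n, seeds):
    """Full pipeline: sketch, then shingle, then hash."""
    return minhash_signature(shingle(sketch(series, filt, step), n), seeds)

def build_index(dataset, filt, step, n, seeds):
    """Map each signature to the keys of the series that produced it."""
    index = {}
    for key, series in dataset.items():
        index.setdefault(signature(series, filt, step, n, seeds), []).append(key)
    return index

def candidates(index, query, filt, step, n, seeds):
    """Keys sharing the query's bucket; only these would need exact DTW."""
    return index.get(signature(query, filt, step, n, seeds), [])
```

A possible usage: draw `filt` once with `random.Random(0).gauss(0.0, 1.0)`, build the index over the dataset, then call `candidates` for each query and run exact DTW only on the returned keys. Using several independent tables instead of one would trade extra memory for higher recall; that choice, like the parameters above, is an assumption of this sketch.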
