Zhixin Qi

24.2 DB Mar 16

A New Lower Bounding Paradigm and Tighter Lower Bounds for Elastic Similarity Measures

Zemin Chao, Boyu Xiao, Zitong Li et al.

Elastic similarity measures are fundamental to time series similarity search because of their ability to handle temporal misalignments. These measures are computationally expensive, so lower bounds are needed to prune unnecessary comparisons. This paper proposes a new Bipartite Graph Edge-Cover Paradigm for deriving lower bounds, which applies to a broad class of elastic similarity measures. The paradigm formulates lower bounding as a vertex-weighting problem on a weighted bipartite graph induced from the input time series. Under this paradigm, most existing lower bounds for elastic similarity measures can be viewed as simple instantiations. We further propose BGLB, an instantiation of the paradigm that incorporates an additional augmentation term, yielding provably tighter lower bounds. Theoretical analysis and extensive experiments on 128 real-world datasets demonstrate that BGLB achieves the tightest known lower bounds for six elastic measures (ERP, MSM, TWED, LCSS, EDR, and SWALE). Moreover, BGLB remains highly competitive for DTW, with a favorable trade-off between tightness and computational efficiency. In nearest neighbor search, integrating BGLB into filter pipelines consistently outperforms state-of-the-art methods, achieving speedups ranging from 24.6% to 84.9% across various elastic similarity measures. BGLB also delivers a significant acceleration in density-based clustering applications, validating its practical potential in time series similarity search tasks based on elastic similarity measures.

DB Mar 16, 2018

Impacts of Dirty Data: An Experimental Evaluation

Zhixin Qi, Hongzhi Wang, Jianzhong Li et al.

Data quality issues have attracted widespread attention due to the negative impacts of dirty data on data mining and machine learning results. The relationship between data quality and the accuracy of results could be used to select an appropriate algorithm for a given level of data quality and to determine the share of data to clean. However, little research has explored this relationship. Motivated by this, this paper conducts an experimental comparison of the effects of missing, inconsistent, and conflicting data on classification and clustering algorithms. Based on the experimental findings, we provide guidelines for algorithm selection and data cleaning.
