XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series Classification
XTSC-Bench addresses the inconsistent and largely qualitative evaluation of explainers that researchers and practitioners in time series machine learning currently face, though the contribution is incremental in that it adapts benchmarking approaches from other data types.
The paper tackles the lack of standardized evaluation for explainability methods in time series classification by proposing XTSC-Bench, a benchmarking tool with datasets, models, and metrics. It finds that current methods need improvements in robustness and reliability, especially for multivariate data.
Despite the growing body of work on explainable machine learning in time series classification (TSC), it remains unclear how to evaluate different explainability methods. Resorting to qualitative assessment and user studies to evaluate explainers for TSC is difficult, since humans have difficulty understanding the underlying information contained in time series data. Therefore, a systematic review and quantitative comparison of explanation methods to confirm their correctness become crucial. While steps toward standardized evaluation have been taken for tabular, image, and textual data, benchmarking explainability methods on time series is challenging because a) traditional metrics are not directly applicable, b) implementations and adaptations of traditional metrics for time series vary across the literature, and c) baseline implementations vary. This paper proposes XTSC-Bench, a benchmarking tool providing standardized datasets, models, and metrics for evaluating explanation methods on TSC. We analyze 3 perturbation-based, 6 gradient-based, and 2 example-based explanation methods for TSC, showing that improvements in the explainers' robustness and reliability are necessary, especially for multivariate data.
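To make the idea of a quantitative robustness check concrete, the following minimal sketch scores a perturbation-based explainer on a univariate series. It is an illustration under stated assumptions, not the XTSC-Bench implementation: `toy_model`, `occlusion_attribution`, and `max_sensitivity` are hypothetical names, the classifier is a hand-written linear score, and the robustness measure is a simple max-sensitivity-style score (the largest change in the attribution under small input noise).

```python
import random


def toy_model(series):
    """Hypothetical classifier score: the mean of time steps 2..4."""
    return sum(series[2:5]) / 3.0


def occlusion_attribution(model, series, baseline=0.0):
    """Perturbation-based attribution: the drop in model score when a
    single time step is replaced by a baseline value."""
    reference = model(series)
    attributions = []
    for t in range(len(series)):
        perturbed = list(series)
        perturbed[t] = baseline
        attributions.append(reference - model(perturbed))
    return attributions


def max_sensitivity(model, explainer, series, radius=0.05, n_samples=20, seed=0):
    """Largest L2 distance between the attribution of the original series
    and the attributions of randomly perturbed copies (lower is more robust)."""
    rng = random.Random(seed)
    base = explainer(model, series)
    worst = 0.0
    for _ in range(n_samples):
        noisy = [x + rng.uniform(-radius, radius) for x in series]
        attr = explainer(model, noisy)
        dist = sum((a - b) ** 2 for a, b in zip(attr, base)) ** 0.5
        worst = max(worst, dist)
    return worst


series = [0.1, 0.2, 1.0, 1.2, 0.9, 0.1, 0.0]
attr = occlusion_attribution(toy_model, series)
score = max_sensitivity(toy_model, occlusion_attribution, series)
```

In this toy setting only time steps 2 to 4 influence the model, so only those steps receive non-zero attribution, and the robustness score stays small because the model is linear. A benchmark in the spirit described above would compute such scores with shared datasets, models, and metric definitions so that results are comparable across explainers.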