Rethinking Streaming Machine Learning Evaluation
This work addresses evaluation gaps in streaming ML. Its contribution is incremental: it builds on existing methods and does not introduce new algorithms.
The paper argues that traditional batch accuracy metrics are insufficient for streaming machine learning due to challenges like delayed labels, and proposes additional metrics to better evaluate performance in streaming settings.
While most work on evaluating machine learning (ML) models focuses on computing accuracy on batches of data, tracking accuracy alone in a streaming setting (i.e., unbounded, timestamp-ordered datasets) fails to appropriately identify when models are performing unexpectedly. In this position paper, we discuss how the nature of streaming ML problems introduces new real-world challenges (e.g., delayed arrival of labels) and recommend additional metrics to assess streaming ML performance.
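To make the delayed-label challenge concrete, the following is a minimal sketch of why a single accuracy number can mislead in a streaming setting. It is not taken from the paper. The class name `DelayedLabelAccuracy`, its methods, and the "coverage" quantity are hypothetical and chosen only for illustration; they are not the metrics the paper recommends. The sketch assumes each prediction carries a unique id and that `None` is never a valid label.

```python
from collections import OrderedDict


class DelayedLabelAccuracy:
    """Accuracy over the most recent `window` predictions,
    counting only predictions whose labels have arrived.
    Hypothetical helper for illustration only."""

    def __init__(self, window):
        self.window = window
        # prediction id -> [predicted value, true label or None], oldest first
        self.records = OrderedDict()

    def predict(self, pred_id, predicted):
        self.records[pred_id] = [predicted, None]
        # Evict the oldest predictions once the window is exceeded.
        while len(self.records) > self.window:
            self.records.popitem(last=False)

    def label(self, pred_id, true_label):
        # A label can arrive after its prediction has left the window;
        # in that case it is silently dropped.
        if pred_id in self.records:
            self.records[pred_id][1] = true_label

    def accuracy(self):
        labeled = [(p, y) for p, y in self.records.values() if y is not None]
        if not labeled:
            return None  # no labels yet, so accuracy is undefined
        return sum(p == y for p, y in labeled) / len(labeled)

    def coverage(self):
        # Fraction of windowed predictions that have a label so far.
        if not self.records:
            return 0.0
        labeled = sum(1 for _, y in self.records.values() if y is not None)
        return labeled / len(self.records)


tracker = DelayedLabelAccuracy(window=4)
for i, p in enumerate([1, 0, 1, 1]):
    tracker.predict(i, p)
tracker.label(0, 1)  # labels for predictions 2 and 3 have not arrived
tracker.label(1, 1)
print(tracker.accuracy(), tracker.coverage())
```

In this sketch the reported accuracy rests on only half of the windowed predictions, so a reading taken at that moment says nothing about the two predictions still awaiting labels. Reporting coverage next to accuracy is one simple way to expose that gap.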