Overview of streaming-data algorithms
This is an incremental overview paper that addresses the challenge of processing huge, fast-paced data streams for data miners and researchers in fields like sensor networks and finance.
The paper tackles the problem of clustering data streams, which are massive and potentially unbounded. The goal is to enable knowledge discovery in applications such as sensor data analysis and finance, which calls for single-pass algorithms with low memory consumption.
Recent advances in data collection techniques mean that massive amounts of data are being collected at an extremely fast pace, and these data are potentially unbounded. Boundless streams of data collected from sensors, equipment, and other data sources are referred to as data streams. Various data mining tasks can be performed on data streams in search of interesting patterns. This paper studies one such task, clustering, which can serve as the first step in many knowledge discovery processes. By grouping data streams into homogeneous clusters, data miners can learn about data characteristics, which can then be developed into classification models for new data or predictive models for unknown events. Recent research addresses data-stream mining for applications that must process huge amounts of data, such as sensor data analysis and financial applications. For such analysis, single-pass algorithms that consume a small amount of memory are critical.
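To make the single-pass, low-memory requirement concrete, the following is a minimal sketch of sequential (online) k-means in Python. The text above does not name a specific algorithm, so this technique is chosen here purely as an illustration; the names `OnlineKMeans` and `cluster_stream` are hypothetical. Each point is read once and then discarded, and only k centroids and k counters are kept in memory, regardless of how long the stream runs.

```python
from typing import Iterable, List, Sequence


class OnlineKMeans:
    """Illustrative sequential k-means: one pass, O(k * d) memory.

    Assumption: this is a generic textbook technique, not a method
    taken from the overview itself.
    """

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def _nearest(self, point: Sequence[float]) -> int:
        """Return the index of the centroid closest to `point`."""
        best, best_dist = 0, float("inf")
        for i, centroid in enumerate(self.centroids):
            dist = sum((a - b) ** 2 for a, b in zip(point, centroid))
            if dist < best_dist:
                best, best_dist = i, dist
        return best

    def update(self, point: Sequence[float]) -> int:
        """Consume one point and return the cluster it was assigned to."""
        # The first k points seed the centroids.
        if len(self.centroids) < self.k:
            self.centroids.append([float(v) for v in point])
            self.counts.append(1)
            return len(self.centroids) - 1
        # Otherwise move the nearest centroid toward the point so that
        # it stays the running mean of everything assigned to it.
        i = self._nearest(point)
        self.counts[i] += 1
        rate = 1.0 / self.counts[i]
        centroid = self.centroids[i]
        for j, value in enumerate(point):
            centroid[j] += rate * (value - centroid[j])
        return i


def cluster_stream(stream: Iterable[Sequence[float]], k: int) -> OnlineKMeans:
    """Cluster a stream in a single pass; no point is stored."""
    model = OnlineKMeans(k)
    for point in stream:
        model.update(point)
    return model
```

Because `cluster_stream` accepts any iterable, it can be fed a generator that yields readings as they arrive. The result depends on arrival order and on the first k points, which is one of the trade-offs that single-pass methods typically accept in exchange for bounded memory.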