Toward a generic representation of random variables for machine learning
This work addresses a foundational challenge in machine learning for stochastic processes, offering a generic representation that could benefit various applications, though it appears incremental as it builds on existing non-parametric methods.
The paper tackles the problem of representing random variables for machine learning on stochastic processes. It introduces a non-parametric representation that separates dependency from distribution without information loss, together with an associated metric. Besides experiments on synthetic datasets, the approach is applied to clustering financial time series, such as credit default swap prices, and reproducible code is provided.
This paper presents a pre-processing step and a distance which improve the performance of machine learning algorithms working on independent and identically distributed stochastic processes. We introduce a novel non-parametric approach to represent random variables which splits apart dependency and distribution without losing any information. We also propose an associated metric leveraging this representation, along with its statistical estimate. Besides experiments on synthetic datasets, the benefits of our contribution are illustrated through the example of clustering financial time series, for instance prices from the credit default swap market. Results are available on the website www.datagrapple.com and an IPython Notebook tutorial is available at www.datagrapple.com/Tech for reproducible research.
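To make the idea of splitting dependency from distribution concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not specify the construction, so the sketch assumes that dependency is carried by normalized ranks and distribution by the sorted sample values. The function names, the weighting parameter `theta`, and the form of the combined distance are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def split_representation(x: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split a sample into (normalized ranks, sorted values).

    Assumption for this sketch: the ranks carry the ordering (dependency)
    information and the sorted values carry the marginal distribution.
    With no ties, the pair determines the original sample exactly.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # indices in increasing order of value
    ranks = [0.0] * n
    for r, i in enumerate(order):
        ranks[i] = (r + 1) / n  # normalized rank in (0, 1]
    sorted_values = [x[i] for i in order]
    return ranks, sorted_values


def reconstruct(ranks: Sequence[float], sorted_values: Sequence[float]) -> List[float]:
    """Rebuild the original sample from the two parts (no ties assumed)."""
    n = len(ranks)
    return [sorted_values[round(r * n) - 1] for r in ranks]


def mixed_distance(x: Sequence[float], y: Sequence[float], theta: float = 0.5) -> float:
    """Hypothetical distance mixing a rank term and a distribution term.

    theta = 1 compares only the ranks (dependency);
    theta = 0 compares only the sorted values (distribution).
    """
    if len(x) != len(y):
        raise ValueError("samples must have the same length")
    rx, sx = split_representation(x)
    ry, sy = split_representation(y)
    n = len(x)
    dep = sum((a - b) ** 2 for a, b in zip(rx, ry)) / n
    dist = sum((a - b) ** 2 for a, b in zip(sx, sy)) / n
    return (theta * dep + (1 - theta) * dist) ** 0.5
```

Under these assumptions, an increasing transform of a series, for example `[2 * v + 1 for v in x]`, leaves the rank term unchanged, while a permutation of `x` leaves the distribution term unchanged. Varying `theta` therefore shifts the emphasis between the two kinds of information when building a distance matrix for clustering.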