IR

Oct 18, 2016

Finding Representative Points in Multivariate Data Using PCA

Ashwinkumar Ganesan, Tim Oates, Matt Schmill

arXiv:1610.05819v1

2.71 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for efficient data summarization in fields such as environmental science, though it appears incremental in that it applies PCA to an existing representation problem.

The paper tackles the problem of identifying representative points in multivariate data sets. It defines representativeness, develops a method that uses Principal Component Analysis (PCA) for dimension reduction, and tests the method on GLOBE project data, where it shows improved representativeness compared to random sampling.

The idea of representation has been used in various fields of study, from data analysis to political science. In this paper, we define representativeness and describe a method to isolate data points that can represent the entire data set. We also show how the minimum set of representative data points can be generated. We use data from GLOBE (a project to study the effects on land change based on a set of parameters that include temperature, forest cover, human population, atmospheric parameters, and many other variables) to test and validate the algorithm. Principal Component Analysis (PCA) is used to reduce the dimensions of the multivariate data set so that the representative points can be generated efficiently, and the representativeness of these points is compared against random sampling of points from the data set.
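
The pipeline outlined in the abstract (reduce dimensions with PCA, pick representative points, compare against random sampling) can be illustrated with a minimal sketch. The abstract does not give the paper's definition of representativeness or its selection rule, so everything beyond the PCA step is an assumption made for illustration: `greedy_representatives` uses farthest-point (k-center) selection as a stand-in, `coverage_radius` is a hypothetical quality measure, and the data are synthetic rather than GLOBE data.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def greedy_representatives(Z, k):
    """Farthest-point selection. ASSUMPTION: a stand-in, not the paper's rule."""
    # Start from the point closest to the centroid of the reduced data.
    first = int(np.argmin(np.linalg.norm(Z - Z.mean(axis=0), axis=1)))
    chosen = [first]
    # dist[i] is the distance from point i to its nearest chosen point.
    dist = np.linalg.norm(Z - Z[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(Z - Z[nxt], axis=1))
    return chosen

def coverage_radius(Z, idx):
    """ASSUMED measure: largest distance from any point to its nearest selected point."""
    d = np.linalg.norm(Z[:, None, :] - Z[idx][None, :, :], axis=2)
    return float(d.min(axis=1).max())

# Synthetic, correlated data standing in for a multivariate data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))

Z = pca_reduce(X, n_components=2)
reps = greedy_representatives(Z, k=10)
rand = rng.choice(len(Z), size=10, replace=False)

print("selected points:", coverage_radius(Z, reps))
print("random sample:  ", coverage_radius(Z, rand))
```

Under this assumed measure, a lower radius means every data point lies closer to some selected point. Selecting in the reduced space keeps each distance computation cheap, which is the efficiency argument the abstract makes for applying PCA first.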
