LG ML · Aug 5, 2018

Hybrid Subspace Learning for High-Dimensional Data

Micol Marchetti-Bowick, Benjamin J. Lengerich, Ankur P. Parikh, Eric P. Xing

arXiv:1808.01687v1 · 0.82 citations

Originality: Incremental advance

AI Analysis

This work addresses high-dimensional data analysis for applications such as genomics and computer vision. It is an incremental advance, since it extends existing subspace learning methods with a hybrid approach.

The paper tackles the high-dimensional setting, where traditional subspace learning assumes that all variables can be projected to a low-dimensional space. The authors argue that this assumption is unsuitable for many datasets. They propose a hybrid dimensionality reduction technique that maps some features to a low-dimensional subspace while keeping others in the original space. They report more accurate latent space estimation and lower reconstruction error, demonstrated on synthetic, gene expression, and video background subtraction datasets.

The high-dimensional data setting, in which p >> n, is a challenging statistical paradigm that appears in many real-world problems. In this setting, learning a compact, low-dimensional representation of the data can substantially help distinguish signal from noise. One way to achieve this goal is to perform subspace learning to estimate a small set of latent features that capture the majority of the variance in the original data. Most existing subspace learning models, such as PCA, assume that the data can be fully represented by its embedding in one or more latent subspaces. However, in this work, we argue that this assumption is not suitable for many high-dimensional datasets; often only some variables can easily be projected to a low-dimensional space. We propose a hybrid dimensionality reduction technique in which some features are mapped to a low-dimensional subspace while others remain in the original space. Our model leads to more accurate estimation of the latent space and lower reconstruction error. We present a simple optimization procedure for the resulting biconvex problem and show synthetic data results that demonstrate the advantages of our approach over existing methods. Finally, we demonstrate the effectiveness of this method for extracting meaningful features from both gene expression and video background subtraction datasets.
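The hybrid idea in the abstract can be illustrated with a small NumPy sketch. The abstract does not state the objective, so everything below is an assumption made for illustration and not the authors' formulation: the data matrix X (n samples by p features) is approximated as Z A + W, where Z A has rank at most k (the latent subspace part) and W holds features kept in the original space. A column-wise group penalty with weight lam encourages whole features of W to be zero. The function names `hybrid_subspace` and `group_soft_threshold` are hypothetical, and the alternating scheme shown (truncated SVD, then column shrinkage) is a generic one rather than the paper's procedure.

```python
import numpy as np

def group_soft_threshold(R, lam):
    """Shrink each column of R toward zero by lam in Euclidean norm."""
    norms = np.linalg.norm(R, axis=0)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return R * scale

def hybrid_subspace(X, k, lam, n_iter=50):
    """Alternating minimization of an ASSUMED objective:
        0.5 * ||X - Z A - W||_F^2 + lam * sum_j ||W[:, j]||_2
    with Z of shape (n, k) and A of shape (k, p)."""
    W = np.zeros_like(X, dtype=float)
    history = []
    for _ in range(n_iter):
        # Subspace step: best rank-k fit to the part not explained by W.
        U, s, Vt = np.linalg.svd(X - W, full_matrices=False)
        Z, A = U[:, :k] * s[:k], Vt[:k]
        L = Z @ A
        # Original-space step: exact minimizer over W for fixed L.
        W = group_soft_threshold(X - L, lam)
        obj = 0.5 * np.sum((X - L - W) ** 2) + lam * np.linalg.norm(W, axis=0).sum()
        history.append(obj)
    return Z, A, W, history

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, k = 40, 100, 2
    # Synthetic data: 90 features lie near a rank-2 subspace, 10 do not.
    X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
    X[:, 90:] = 3.0 * rng.normal(size=(n, 10))
    X += 0.1 * rng.normal(size=(n, p))
    Z, A, W, history = hybrid_subspace(X, k=k, lam=5.0)
    kept = np.flatnonzero(np.linalg.norm(W, axis=0) > 0)
    print("features kept in the original space:", kept)
    print("first and last objective:", history[0], history[-1])
```

Each step exactly minimizes the assumed objective over one block of variables, so the recorded objective cannot increase between iterations. Which features end up in W depends on lam and on the data, so the choice of lam here is arbitrary.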
