LG · CE · ML · Jun 27, 2012

Gene Expression Time Course Clustering with Countably Infinite Hidden Markov Models

arXiv:1206.6824v1 · 34 citations
Originality: Incremental advance
AI Analysis

This is an incremental improvement for bioinformatics researchers analyzing gene expression time series, offering a more automated and robust clustering method.

The paper addresses the clustering of gene expression time course data by proposing a hidden Markov model with a countably infinite state space (the HDP-HMM), which removes the need to choose model complexity by hand. On two large publicly available data sets, the infinite model outperforms both model selection over finite HMMs and traditional time-independent methods, as measured by external and internal clustering indices.

Most existing approaches to clustering gene expression time course data treat the different time points as independent dimensions and are invariant to permutations, such as reversal, of the experimental time course. Approaches utilizing HMMs have been shown to be helpful in this regard, but are hampered by having to choose model architectures with appropriate complexities. Here we propose for a clustering application an HMM with a countably infinite state space; inference in this model is possible by recasting it in the hierarchical Dirichlet process (HDP) framework (Teh et al. 2006), and hence we call it the HDP-HMM. We show that the infinite model outperforms model selection methods over finite models, and traditional time-independent methods, as measured by a variety of external and internal indices for clustering on two large publicly available data sets. Moreover, we show that the infinite models utilize more hidden states and employ richer architectures (e.g. state-to-state transitions) without the damaging effects of overfitting.
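To make the idea of a countably infinite state space more concrete, the following is a minimal sketch of how a transition matrix could be drawn from an HDP-HMM prior under a finite truncation. It is an illustration only and does not reproduce the paper's inference procedure. The function name `sample_truncated_hdp_hmm`, the truncation level `n_states`, and the concentration parameter names `gamma` and `alpha` are assumptions introduced here. Shared state weights are drawn by stick-breaking, and each row of the transition matrix is then drawn from a Dirichlet distribution centred on those shared weights, so all states tend to favour the same small set of successor states.

```python
import random


def sample_truncated_hdp_hmm(n_states, gamma, alpha, rng):
    """Draw shared state weights and a transition matrix from a truncated HDP-HMM prior.

    n_states is a hypothetical truncation level that stands in for the
    countably infinite state space; gamma and alpha are assumed names for
    the top-level and row-level concentration parameters.
    """
    # Stick-breaking draw of the shared weights, truncated to n_states.
    beta = []
    remaining = 1.0
    for _ in range(n_states):
        stick = rng.betavariate(1.0, gamma)
        beta.append(stick * remaining)
        remaining *= 1.0 - stick
    total = sum(beta)
    beta = [b / total for b in beta]  # renormalise after truncation

    # Each row is Dirichlet(alpha * beta), sampled via normalised Gamma draws.
    transitions = []
    for _ in range(n_states):
        draws = [rng.gammavariate(alpha * b, 1.0) for b in beta]
        row_total = sum(draws)
        transitions.append([d / row_total for d in draws])
    return beta, transitions


rng = random.Random(0)
beta, transitions = sample_truncated_hdp_hmm(n_states=10, gamma=2.0, alpha=5.0, rng=rng)
print("shared weights:", [round(b, 3) for b in beta])
print("first transition row:", [round(p, 3) for p in transitions[0]])
```

Because the stick-breaking weights decay quickly, most probability mass falls on a few states even when the truncation level is large, which is the sense in which the model can leave the effective number of states to be determined by the data.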
