ML AI Sep 25, 2017

Mining a Sub-Matrix of Maximal Sum

Vincent Branders, Pierre Schaus, Pierre Dupont

arXiv:1709.08461v1 3.45 citations

Originality: Incremental advance

AI Analysis

This incremental work addresses computational efficiency for bioinformatics researchers analyzing gene expression data.

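The paper tackles the NP-hard problem of finding a sub-matrix of maximal sum in a data matrix, such as gene expression data. It proposes two algorithms, CPGC (constraint programming with a global constraint) and MILP (mixed integer linear programming), and compares them experimentally with the earlier CP-LNS method. CPGC tends to be the fastest to reach a good solution, while MILP can also be competitive.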

Biclustering techniques have been widely used to identify homogeneous subgroups within large data matrices, such as subsets of genes similarly expressed across subsets of patients. Mining a max-sum sub-matrix is a related but distinct problem in which one looks for a (not necessarily contiguous) rectangular sub-matrix with a maximal sum of its entries. Le Van et al. (Ranked Tiling, 2014) already illustrated its applicability to gene expression analysis and addressed it with a constraint programming (CP) approach combined with large neighborhood search (CP-LNS). In this work, we exhibit some key properties of this NP-hard problem and define a bounding function such that larger problems can be solved in reasonable time. Two different algorithms are proposed to exploit the highlighted characteristics of the problem: a CP approach with a global constraint (CPGC) and mixed integer linear programming (MILP). Practical experiments conducted on both synthetic and real gene expression data exhibit the characteristics of these approaches and their relative benefits over the original CP-LNS method. Overall, the CPGC approach tends to be the fastest to produce a good solution. Yet, the MILP formulation is arguably the easiest to formulate and can also be competitive.
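
To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the CPGC, MILP, or CP-LNS method from the paper. It only illustrates the objective: choose a set of rows and a set of columns, not necessarily adjacent, so that the sum of the selected entries is maximal. The function name `max_sum_submatrix` is hypothetical. The sketch relies on one observation that follows directly from the definition: once the columns are fixed, each row contributes independently, so the best row set is exactly the rows whose sum over those columns is positive. The search over column subsets is exponential, so this is only usable on tiny matrices.

```python
from itertools import combinations


def max_sum_submatrix(matrix):
    """Exhaustive search for a max-sum sub-matrix (illustrative only).

    Returns (best_sum, row_indices, column_indices). The empty selection,
    with sum 0, is the fallback when no entry combination is positive.
    Runs in time exponential in the number of columns.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    best_sum, best_rows, best_cols = 0, [], []

    # Enumerate every non-empty subset of columns.
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            rows, total = [], 0
            for i in range(n_rows):
                # Sum of row i restricted to the chosen columns.
                row_sum = sum(matrix[i][j] for j in cols)
                # For fixed columns, keeping a row helps only if it adds
                # a positive amount to the total.
                if row_sum > 0:
                    rows.append(i)
                    total += row_sum
            if total > best_sum:
                best_sum, best_rows, best_cols = total, rows, list(cols)

    return best_sum, best_rows, best_cols


# Small made-up matrix (not data from the paper).
example = [
    [2, -1, 3],
    [-4, 5, -2],
    [1, -3, 4],
]
result = max_sum_submatrix(example)
```

On this made-up matrix the best selection is rows 0 and 2 with columns 0 and 2, for a sum of 10. The selected rows and columns are not adjacent, which is the sense in which the sub-matrix need not be contiguous.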
