Sha Cao

h-index22

6papers

27citations

Novelty54%

AI Score29

Ranked #144,417 of 194,257 authors (top 74%)#31,816 in LG (top 79%)

6 Papers

1.2MESep 1, 2021Code

Spatially and Robustly Hybrid Mixture Regression Model for Inference of Spatial Dependence

Wennan Chang, Pengtao Dang, Changlin Wan et al.

In this paper, we propose a Spatial Robust Mixture Regression model to investigate the relationship between a response variable and a set of explanatory variables over the spatial domain, assuming that the relationships may exhibit complex spatially dynamic patterns that cannot be captured by constant regression coefficients. Our method integrates the robust finite mixture Gaussian regression model with spatial constraints, to simultaneously handle the spatial nonstationarity, local homogeneity, and outlier contaminations. Compared with existing spatial regression models, our proposed model assumes the existence a few distinct regression models that are estimated based on observations that exhibit similar response-predictor relationships. As such, the proposed model not only accounts for nonstationarity in the spatial trend, but also clusters observations into a few distinct and homogenous groups. This provides an advantage on interpretation with a few stationary sub-processes identified that capture the predominant relationships between response and predictor variables. Moreover, the proposed method incorporates robust procedures to handle contaminations from both regression outliers and spatial outliers. By doing so, we robustly segment the spatial domain into distinct local regions with similar regression coefficients, and sporadic locations that are purely outliers. Rigorous statistical hypothesis testing procedure has been designed to test the significance of such segmentation. Experimental results on many synthetic and real-world datasets demonstrate the robustness, accuracy, and effectiveness of our proposed method, compared with other robust finite mixture regression, spatial regression and spatial segmentation methods.

2.3LGJul 31, 2020Code

Geometric All-Way Boolean Tensor Decomposition

Changlin Wan, Wennan Chang, Tong Zhao et al.

Boolean tensor has been broadly utilized in representing high dimensional logical data collected on spatial, temporal and/or other relational domains. Boolean Tensor Decomposition (BTD) factorizes a binary tensor into the Boolean sum of multiple rank-1 tensors, which is an NP-hard problem. Existing BTD methods have been limited by their high computational cost, in applications to large scale or higher order tensors. In this work, we presented a computationally efficient BTD algorithm, namely \textit{Geometric Expansion for all-order Tensor Factorization} (GETF), that sequentially identifies the rank-1 basis components for a tensor from a geometric perspective. We conducted rigorous theoretical analysis on the validity as well as algorithemic efficiency of GETF in decomposing all-order tensor. Experiments on both synthetic and real-world data demonstrated that GETF has significantly improved performance in reconstruction accuracy, extraction of latent structures and it is an order of magnitude faster than other state-of-the-art methods.

1.2LGJul 31, 2020Code

Denoising individual bias for a fairer binary submatrix detection

Changlin Wan, Wennan Chang, Tong Zhao et al.

Low rank representation of binary matrix is powerful in disentangling sparse individual-attribute associations, and has received wide applications. Existing binary matrix factorization (BMF) or co-clustering (CC) methods often assume i.i.d background noise. However, this assumption could be easily violated in real data, where heterogeneous row- or column-wise probability of binary entries results in disparate element-wise background distribution, and paralyzes the rationality of existing methods. We propose a binary data denoising framework, namely BIND, which optimizes the detection of true patterns by estimating the row- or column-wise mixture distribution of patterns and disparate background, and eliminating the binary attributes that are more likely from the background. BIND is supported by thoroughly derived mathematical property of the row- and column-wise mixture distributions. Our experiment on synthetic and real-world data demonstrated BIND effectively removes background noise and drastically increases the fairness and accuracy of state-of-the arts BMF and CC methods.

1.2MEMay 23, 2020

Component-wise Adaptive Trimming For Robust Mixture Regression

Wennan Chang, Xinyu Zhou, Yong Zang et al.

Parameter estimation of mixture regression model using the expectation maximization (EM) algorithm is highly sensitive to outliers. Here we propose a fast and efficient robust mixture regression algorithm, called Component-wise Adaptive Trimming (CAT) method. We consider simultaneous outlier detection and robust parameter estimation to minimize the effect of outlier contamination. Robust mixture regression has many important applications including in human cancer genomics data, where the population often displays strong heterogeneity added by unwanted technological perturbations. Existing robust mixture regression methods suffer from outliers as they either conduct parameter estimation in the presence of outliers, or rely on prior knowledge of the level of outlier contamination. CAT was implemented in the framework of classification expectation maximization, under which a natural definition of outliers could be derived. It implements a least trimmed squares (LTS) approach within each exclusive mixing component, where the robustness issue could be transformed from the mixture case to simple linear regression case. The high breakdown point of the LTS approach allows us to avoid the pre-specification of trimming parameter. Compared with multiple existing algorithms, CAT is the most competitive one that can handle and adaptively trim off outliers as well as heavy tailed noise, in different scenarios of simulated data and real genomic data. CAT has been implemented in an R package `RobMixReg' available in CRAN.

6.6LGSep 9, 2019

Fast And Efficient Boolean Matrix Factorization By Geometric Segmentation

Changlin Wan, Wennan Chang, Tong Zhao et al.

Boolean matrix has been used to represent digital information in many fields, including bank transaction, crime records, natural language processing, protein-protein interaction, etc. Boolean matrix factorization (BMF) aims to find an approximation of a binary matrix as the Boolean product of two low rank Boolean matrices, which could generate vast amount of information for the patterns of relationships between the features and samples. Inspired by binary matrix permutation theories and geometric segmentation, we developed a fast and efficient BMF approach called MEBF (Median Expansion for Boolean Factorization). Overall, MEBF adopted a heuristic approach to locate binary patterns presented as submatrices that are dense in 1's. At each iteration, MEBF permutates the rows and columns such that the permutated matrix is approximately Upper Triangular-Like (UTL) with so-called Simultaneous Consecutive-ones Property (SC1P). The largest submatrix dense in 1 would lies on the upper triangular area of the permutated matrix, and its location was determined based on a geometric segmentation of a triangular. We compared MEBF with other state of the art approaches on data scenarios with different sparsity and noise levels. MEBF demonstrated superior performances in lower reconstruction error, and higher computational efficiency, as well as more accurate sparse patterns than popular methods such as ASSO, PANDA and MP. We demonstrated the application of MEBF on both binary and non-binary data sets, and revealed its further potential in knowledge retrieving and data denoising.

1.2QMAug 6, 2019

Predicted disease compositions of human gliomas estimated from multiparametric MRI can predict endothelial proliferation, tumor grade, and overall survival

Emily E Diller, Sha Cao, Beth Ey et al.

Background and Purpose: Biopsy is the main determinants of glioma clinical management, but require invasive sampling that fail to detect relevant features because of tumor heterogeneity. The purpose of this study was to evaluate the accuracy of a voxel-wise, multiparametric MRI radiomic method to predict features and develop a minimally invasive method to objectively assess neoplasms. Methods: Multiparametric MRI were registered to T1-weighted gadolinium contrast-enhanced data using a 12 degree-of-freedom affine model. The retrospectively collected MRI data included T1-weighted, T1-weighted gadolinium contrast-enhanced, T2-weighted, fluid attenuated inversion recovery, and multi-b-value diffusion-weighted acquired at 1.5T or 3.0T. Clinical experts provided voxel-wise annotations for five disease states on a subset of patients to establish a training feature vector of 611,930 observations. Then, a k-nearest-neighbor (k-NN) classifier was trained using a 25% hold-out design. The trained k-NN model was applied to 13,018,171 observations from seventeen histologically confirmed glioma patients. Linear regression tested overall survival (OS) relationship to predicted disease compositions (PDC) and diagnostic age (alpha = 0.05). Canonical discriminant analysis tested if PDC and diagnostic age could differentiate clinical, genetic, and microscopic factors (alpha = 0.05). Results: The model predicted voxel annotation class with a Dice similarity coefficient of 94.34% +/- 2.98. Linear combinations of PDCs and diagnostic age predicted OS (p = 0.008), grade (p = 0.014), and endothelia proliferation (p = 0.003); but fell short predicting gene mutations for TP53BP1 and IDH1. Conclusions: This voxel-wise, multi-parametric MRI radiomic strategy holds potential as a non-invasive decision-making aid for clinicians managing patients with glioma.