LGAPMLFeb 21, 2014

Important Molecular Descriptors Selection Using Self Tuned Reweighted Sampling Method for Prediction of Antituberculosis Activity

arXiv:1402.5360v11 citations
Originality Incremental advance
AI Analysis

This work addresses descriptor selection for sulfonamide derivatives in drug discovery, but it is incremental as it builds on existing multivariate regression and sampling techniques.

The authors tackled the problem of selecting important molecular descriptors for predicting antituberculosis activity by developing a new method called self tuned reweighted sampling (STRS), which achieved better prediction accuracy compared to full descriptor set PLS modeling and Monte Carlo uninformative variable elimination (MC-UVE).

In this paper, a new descriptor selection method for selecting an optimal combination of important descriptors of sulfonamide derivatives data, named self tuned reweighted sampling (STRS), is developed. descriptors are defined as the descriptors with large absolute coefficients in a multivariate linear regression model such as partial least squares(PLS). In this study, the absolute values of regression coefficients of PLS model are used as an index for evaluating the importance of each descriptor Then, based on the importance level of each descriptor, STRS sequentially selects N subsets of descriptors from N Monte Carlo (MC) sampling runs in an iterative and competitive manner. In each sampling run, a fixed ratio (e.g. 80%) of samples is first randomly selected to establish a regresson model. Next, based on the regression coefficients, a two-step procedure including rapidly decreasing function (RDF) based enforced descriptor selection and self tuned sampling (STS) based competitive descriptor selection is adopted to select the important descriptorss. After running the loops, a number of subsets of descriptors are obtained and root mean squared error of cross validation (RMSECV) of PLS models established with subsets of descriptors is computed. The subset of descriptors with the lowest RMSECV is considered as the optimal descriptor subset. The performance of the proposed algorithm is evaluated by sulfanomide derivative dataset. The results reveal an good characteristic of STRS that it can usually locate an optimal combination of some important descriptors which are interpretable to the biologically of interest. Additionally, our study shows that better prediction is obtained by STRS when compared to full descriptor set PLS modeling, Monte Carlo uninformative variable elimination (MC-UVE).

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes