Shaohan Chen

ML
h-index 5
10 papers
58 citations
Novelty 40%
AI Score 40

10 Papers

GN Aug 28, 2024
Identification of Prognostic Biomarkers for Stage III Non-Small Cell Lung Carcinoma in Female Nonsmokers Using Machine Learning

Huili Zheng, Qimin Zhang, Yiru Gong et al.

Lung cancer remains a leading cause of cancer-related deaths globally, with non-small cell lung cancer (NSCLC) being the most common subtype. This study aimed to identify key biomarkers associated with stage III NSCLC in non-smoking females using gene expression profiling from the GDS3837 dataset. Utilizing XGBoost, a machine learning algorithm, the analysis achieved a strong predictive performance with an AUC score of 0.835. The top biomarkers identified - CCAAT enhancer binding protein alpha (C/EBP-alpha), lactate dehydrogenase A4 (LDHA), UNC-45 myosin chaperone B (UNC-45B), checkpoint kinase 1 (CHK1), and hypoxia-inducible factor 1 subunit alpha (HIF-1-alpha) - have been validated in the literature as being significantly linked to lung cancer. These findings highlight the potential of these biomarkers for early diagnosis and personalized therapy, emphasizing the value of integrating machine learning with molecular profiling in cancer research.
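The AUC figure quoted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly from pairwise comparisons. It is not the authors' XGBoost pipeline, and the labels and scores are hypothetical values chosen only for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier outputs (not data from the paper).
example_labels = [1, 1, 0, 0, 1, 0]
example_scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.4]
example_auc = auc(example_labels, example_scores)
```

The quadratic pairwise loop is only suitable for small samples; library implementations typically use a rank-based formula instead.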

64.3 CHEM-PH Mar 15
Life cycle assessment for all organic chemicals

Shaohan Chen, Tim Langhorst, Julian Nöhl et al.

Chemicals are embedded in nearly every aspect of modern society, yet their production poses substantial sustainability concerns. Achieving a sustainable chemical industry requires detailed Life Cycle Assessment (LCA); however, current assessments face many unknowns due to limited, partly inconsistent, and untransparent data coverage since existing Life Cycle Inventory (LCI) databases account for only a tiny fraction of traded chemicals. Here, we introduce the Chemical RetrosYnthesiS for Transparent Assessment of Life-cycles (CRYSTAL) framework, which automatically generates consistent and transparent LCI data for organic chemicals based on their molecular structure using retrosynthesis and machine-learned gate-to-gate inventories. Using the predictive power of CRYSTAL, we create a consistent database for more than 70000 organic chemicals, comprising over 110000 transparent LCI datasets that quantify both feedstock and energy demands, together with associated auxiliary materials, biosphere flows, and waste flows. From this comprehensive database, we identify 50 key environmental hotspots driving high impacts of organic chemical production across multiple environmental categories and pivotal hub chemicals that are most critical for downstream chemical production. In providing this comprehensive data foundation, the CRYSTAL framework offers systematic guidance for targeted engineering and policy interventions. Its transparent, modular nature is designed to shift chemical LCA from a reliance on "unknown unknowns" to a collaboratively improvable mapping of "known unknowns".

LG Mar 31, 2023
Domain Knowledge integrated for Blast Furnace Classifier Design

Shaohan Chen, Di Fan, Chuanhou Gao

Blast furnace modeling and control is an important problem in the industrial field, and black-box models are an effective means of describing the complex blast furnace system. In practice, learning targets such as safety and energy saving differ from one industrial application to another. For this reason, this paper proposes a framework for designing a domain-knowledge-integrated classification model that yields a classifier for industrial applications. Our knowledge-incorporated learning scheme allows users to create a classifier that identifies "important samples" (whose misclassification can lead to severe consequences) more accurately, while maintaining adequate precision on the remaining samples. The effectiveness of the proposed method is verified on two real blast furnace datasets, and the method guides operators in using their prior experience to better control blast furnace systems.

QM Sep 12, 2024
Graphical Structural Learning of rs-fMRI data in Heavy Smokers

Yiru Gong, Qimin Zhang, Huili Zheng et al.

Recent studies revealed structural and functional brain changes in heavy smokers. However, the specific changes in topological brain connections are not well understood. We used Gaussian Undirected Graphs with the graphical lasso algorithm on rs-fMRI data from smokers and non-smokers to identify significant changes in brain connections. Our results indicate high stability in the estimated graphs and identify several brain regions significantly affected by smoking, providing valuable insights for future clinical research.

42.5 LG Apr 6
MAVEN: A Mesh-Aware Volumetric Encoding Network for Simulating 3D Flexible Deformation

Zhe Feng, Shilong Tao, Haonan Sun et al.

Deep learning-based approaches, particularly graph neural networks (GNNs), have gained prominence in simulating flexible deformations and contacts of solids, due to their ability to handle unstructured physical fields and nonlinear regression on graph structures. However, existing GNNs commonly represent meshes with graphs built solely from vertices and edges. These approaches tend to overlook higher-dimensional spatial features, e.g., 2D facets and 3D cells, from the original geometry. As a result, it is challenging to accurately capture boundary representations and volumetric characteristics, though this information is critically important for modeling contact interactions and internal physical quantity propagation, particularly under sparse mesh discretization. In this paper, we introduce MAVEN, a mesh-aware volumetric encoding network for simulating 3D flexible deformation, which explicitly models geometric mesh elements of higher dimension to achieve a more accurate and natural physical simulation. MAVEN establishes learnable mappings among 3D cells, 2D facets, and vertices, enabling flexible mutual transformations. Explicit geometric features are incorporated into the model to alleviate the burden of implicitly learning geometric patterns. Experimental results show that MAVEN consistently achieves state-of-the-art performance across established datasets and a novel metal stretch-bending task featuring large deformations and prolonged contacts.

QM Dec 13, 2024
Cardiovascular Disease Detection By Leveraging Semi-Supervised Learning

Shaohan Chen, Zheyan Liu, Huili Zheng et al.

Cardiovascular disease (CVD) persists as a primary cause of death on a global scale, which calls for more effective and timely detection methods. Traditional supervised learning approaches for CVD detection rely heavily on large labeled datasets, which are often difficult to obtain. This paper employs semi-supervised learning models to boost the efficiency and accuracy of CVD detection when there are few labeled samples. By leveraging both labeled data and vast amounts of unlabeled data, our approach demonstrates improvements in prediction performance while reducing the dependency on labeled data. Experimental results on a publicly available dataset show that semi-supervised models outperform traditional supervised learning techniques, providing an intriguing approach for the initial identification of cardiovascular disease within clinical environments.

ML Jul 6, 2021
Transfer Learning in Information Criteria-based Feature Selection

Shaohan Chen, Nikolaos V. Sahinidis, Chuanhou Gao

This paper investigates the effectiveness of transfer learning based on Mallows' Cp. We propose a procedure that combines transfer learning with Mallows' Cp (TLCp) and prove that it outperforms the conventional Mallows' Cp criterion in terms of accuracy and stability. Our theoretical results indicate that, for any sample size in the target domain, the proposed TLCp estimator performs better than the Cp estimator by the mean squared error (MSE) metric in the case of orthogonal predictors, provided that i) the dissimilarity between the tasks from the source and target domains is small, and ii) the procedure parameters (complexity penalties) are tuned according to certain explicit rules. Moreover, we show that our transfer learning framework can be extended to other feature selection criteria, such as the Bayesian information criterion. By analyzing the solution of the orthogonalized Cp, we identify an estimator that asymptotically approximates the solution of the Cp criterion in the case of non-orthogonal predictors. Similar results are obtained for the non-orthogonal TLCp. Finally, simulation studies and applications with real data demonstrate the usefulness of the TLCp scheme.
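For readers unfamiliar with the baseline criterion, the sketch below illustrates the conventional Mallows' Cp (not the TLCp procedure) in the orthogonal-predictor setting the abstract mentions. It assumes orthonormal predictor columns and a known noise variance. Under those assumptions, minimizing Cp over subsets reduces to keeping predictor j exactly when its squared projection exceeds twice the noise variance. All names and numbers are hypothetical and chosen for illustration.

```python
from itertools import combinations


def mallows_cp(rss, sigma2, n, p):
    """Mallows' Cp for a submodel with p predictors: RSS / sigma^2 - n + 2p."""
    return rss / sigma2 - n + 2 * p


def cp_orthonormal(z, y_sq_norm, sigma2, n, subset):
    """Cp of a subset when the predictor columns are orthonormal.

    z[j] is the inner product of column j with the response y, so each
    included column j lowers the residual sum of squares by z[j] ** 2.
    """
    rss = y_sq_norm - sum(z[j] ** 2 for j in subset)
    return mallows_cp(rss, sigma2, n, len(subset))


def select_by_threshold(z, sigma2):
    """Closed-form minimizer: keep j exactly when z[j] ** 2 > 2 * sigma2."""
    return tuple(j for j, zj in enumerate(z) if zj ** 2 > 2 * sigma2)


def select_by_search(z, y_sq_norm, sigma2, n):
    """Exhaustive search over all subsets, used here only as a cross-check."""
    indices = range(len(z))
    subsets = (s for k in range(len(z) + 1) for s in combinations(indices, k))
    return min(subsets, key=lambda s: cp_orthonormal(z, y_sq_norm, sigma2, n, s))


# Hypothetical inputs (assumptions, not values from the paper).
z = [3.0, 0.5, -2.0, 1.0]   # projections of y onto each orthonormal column
sigma2 = 1.0                # noise variance, assumed known
n = 20                      # sample size
y_sq_norm = 30.0            # squared norm of the response vector

chosen = select_by_threshold(z, sigma2)
searched = select_by_search(z, y_sq_norm, sigma2, n)
best_cp = cp_orthonormal(z, y_sq_norm, sigma2, n, chosen)
```

The thresholding rule and the exhaustive search select the same subset here, which is the property that makes the orthogonal case convenient to analyze.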

ML Sep 5, 2018
Knowledge Integrated Classifier Design Based on Utility Optimization

Shaohan Chen, Chuanhou Gao

This paper proposes a systematic framework to design a classification model that yields a classifier which optimizes a utility function based on prior knowledge. Specifically, as the data size grows, we prove that the produced classifier asymptotically converges to the optimal classifier, an extended version of the Bayes rule, which maximizes the utility function. Therefore, we provide a meaningful theoretical interpretation for modeling with incorporated knowledge. Our knowledge incorporation method allows domain experts to guide the classifier towards correctly classifying data that they consider more significant.
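To make the idea of a utility-maximizing extension of the Bayes rule concrete, the sketch below uses the textbook expected-utility decision rule: given class posteriors, choose the prediction with the highest expected utility. It is an illustration under assumed inputs, not the paper's construction. The utility matrices and posterior values are hypothetical, and the paper's utility function may be defined differently.

```python
def expected_utility(posterior, utility, action):
    """Expected utility of predicting `action`.

    posterior[c] is P(true class = c | x); utility[action][c] is the
    payoff for predicting `action` when the true class is c.
    """
    return sum(p * utility[action][c] for c, p in enumerate(posterior))


def utility_optimal_action(posterior, utility):
    """Return the prediction that maximizes expected utility."""
    actions = range(len(utility))
    return max(actions, key=lambda a: expected_utility(posterior, utility, a))


# Hypothetical setup: class 1 marks "important" samples whose
# misclassification is costly; class 0 is everything else.
posterior = [0.8, 0.2]

# 0-1 utility: recovers the ordinary Bayes rule (pick the likeliest class).
zero_one_utility = [[1, 0],
                    [0, 1]]

# Expert-weighted utility: missing an important sample costs 10.
weighted_utility = [[1, -10],
                    [0, 1]]

plain_choice = utility_optimal_action(posterior, zero_one_utility)
weighted_choice = utility_optimal_action(posterior, weighted_utility)
```

With the 0-1 utility the rule predicts class 0, the more probable class. With the weighted utility it predicts class 1, because the penalty for missing an important sample outweighs the lower posterior probability.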

ML May 31, 2018
Efficacy of regularized multi-task learning based on SVM models

Shaohan Chen, Zhou Fang, Sijie Lu et al.

This paper investigates the efficacy of a regularized multi-task learning (MTL) framework based on SVM (M-SVM) to answer whether MTL always provides reliable results and how MTL outperforms independent learning. We first find that M-SVM is Bayes risk consistent in the limit of large sample size. This implies that, despite task dissimilarities, M-SVM always produces a reliable decision rule for each task in terms of misclassification error when the data size is large enough. Furthermore, we find that the task interaction vanishes as the data size goes to infinity, and that the convergence rates of M-SVM and its single-task counterpart have the same upper bound. The former suggests that M-SVM cannot improve the limit classifier's performance; based on the latter, we conjecture that the optimal convergence rate is not improved when the number of tasks is fixed. As a novel insight into MTL, our theoretical and experimental results agree closely that the benefit of MTL methods lies in improving the pre-convergence-rate (PCR) factor, defined in Section III, rather than the convergence rate. Moreover, this improvement in the PCR factor is more significant when the data size is small.

ML Oct 9, 2017
Enhancing Interpretability of Black-box Soft-margin SVM by Integrating Data-based Priors

Shaohan Chen, Chuanhou Gao, Ping Zhang

The lack of interpretability often makes black-box models difficult to apply in many practical domains. For this reason, the current work, starting from the input side of the black-box model, proposes incorporating data-based prior information into the black-box soft-margin SVM model to enhance its interpretability. The concept and incorporation mechanism of data-based prior information are developed in turn, based on which an interpretable or partly interpretable SVM optimization model is designed and then solved by rewriting the optimization problem as a nonlinear quadratic programming problem. An algorithm for mining data-based linear prior information from a data set is also proposed, which generates a linear expression with respect to two appropriate inputs identified from all inputs of the system. Finally, the proposed interpretability enhancement strategy is applied to eight benchmark examples to demonstrate its effectiveness.