Dongjin Choi

h-index: 7

Papers: 3

Citations: 17

Novelty: 42%

AI Score: 25

Ranked #165,088 of 194,257 authors (top 85%); #36,071 in LG (top 90%)

3 Papers

1.2 | NA | Dec 5, 2017

Fast, Accurate, and Scalable Method for Sparse Coupled Matrix-Tensor Factorization

Dongjin Choi, Jun-Gi Jang, U Kang

How can we capture the hidden properties of tensor and matrix data simultaneously in a fast, accurate, and scalable way? Coupled matrix-tensor factorization (CMTF) is a major tool for extracting latent factors from a tensor and matrices at once. Designing an accurate and efficient CMTF method has become more crucial as the size and dimension of real-world data grow explosively. However, existing methods for CMTF suffer from a lack of accuracy, slow running time, and limited scalability. In this paper, we propose S3CMTF, a fast, accurate, and scalable CMTF method. S3CMTF achieves high speed by exploiting the sparsity of real-world tensors, and high accuracy by capturing inter-relations between factors. S3CMTF also gains additional speed-up through lock-free parallel SGD updates on multi-core shared-memory systems. We present two methods, S3CMTF-naive and S3CMTF-opt. S3CMTF-naive is a basic version of S3CMTF, and S3CMTF-opt improves its speed by exploiting intermediate data. We theoretically and empirically show that S3CMTF is the fastest, outperforming existing methods. Experimental results show that S3CMTF is 11 to 43 times faster and 2.1 to 4.1 times more accurate than existing methods. S3CMTF shows linear scalability with respect to the number of data entries and the number of cores. In addition, we apply S3CMTF to Yelp recommendation tensor data coupled with 3 additional matrices to discover interesting properties.
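As a rough illustration of the coupled factorization idea in this abstract, the sketch below fits a sparse 3-way tensor and a matrix that share one factor, using plain single-threaded SGD over the observed entries only. It is not the S3CMTF algorithm: the CP-style model, the function names (`make_toy_data`, `observed_loss`, `coupled_sgd`), the toy data, and all hyperparameters are assumptions chosen for illustration, and the lock-free parallel updates and intermediate-data reuse mentioned in the abstract are left out.

```python
import numpy as np


def make_toy_data(seed=0, shape=(8, 7, 6), n_cols=5, rank=3,
                  n_tensor=150, n_matrix=25):
    """Sample sparse observations from hidden factors (hypothetical toy data)."""
    rng = np.random.default_rng(seed)
    I, J, K = shape
    A, B, C, D = (rng.uniform(0, 1, size=(n, rank)) for n in (I, J, K, n_cols))
    tensor_entries = []
    for _ in range(n_tensor):
        i, j, k = (int(rng.integers(n)) for n in shape)
        tensor_entries.append((i, j, k, float(np.sum(A[i] * B[j] * C[k]))))
    matrix_entries = []
    for _ in range(n_matrix):
        i, l = int(rng.integers(I)), int(rng.integers(n_cols))
        matrix_entries.append((i, l, float(A[i] @ D[l])))
    return tensor_entries, matrix_entries


def observed_loss(tensor_entries, matrix_entries, A, B, C, D):
    """Sum of squared errors over observed tensor and matrix entries."""
    loss = sum((x - np.sum(A[i] * B[j] * C[k])) ** 2
               for i, j, k, x in tensor_entries)
    loss += sum((y - A[i] @ D[l]) ** 2 for i, l, y in matrix_entries)
    return float(loss)


def coupled_sgd(tensor_entries, matrix_entries, shape, n_cols, rank=3,
                lr=0.02, reg=1e-3, epochs=50, seed=1):
    """Fit X[i,j,k] ~ sum_r A[i,r]B[j,r]C[k,r] and Y[i,l] ~ A[i] . D[l]."""
    rng = np.random.default_rng(seed)
    I, J, K = shape
    A, B, C, D = (rng.uniform(0.1, 0.6, size=(n, rank))
                  for n in (I, J, K, n_cols))
    history = [observed_loss(tensor_entries, matrix_entries, A, B, C, D)]
    for _ in range(epochs):
        # Only observed (nonzero) entries are visited, so the cost per
        # epoch scales with the number of entries, not the full tensor size.
        for idx in rng.permutation(len(tensor_entries)):
            i, j, k, x = tensor_entries[idx]
            a, b, c = A[i].copy(), B[j].copy(), C[k].copy()
            err = x - np.sum(a * b * c)
            A[i] += lr * (err * b * c - reg * a)
            B[j] += lr * (err * a * c - reg * b)
            C[k] += lr * (err * a * b - reg * c)
        for idx in rng.permutation(len(matrix_entries)):
            i, l, y = matrix_entries[idx]
            a, d = A[i].copy(), D[l].copy()
            err = y - a @ d
            A[i] += lr * (err * d - reg * a)  # A is shared with the tensor term
            D[l] += lr * (err * a - reg * d)
        history.append(observed_loss(tensor_entries, matrix_entries, A, B, C, D))
    return A, B, C, D, history


if __name__ == "__main__":
    t_entries, m_entries = make_toy_data()
    *_, losses = coupled_sgd(t_entries, m_entries, shape=(8, 7, 6), n_cols=5)
    print(f"loss before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

Because the factor `A` receives gradient updates from both the tensor entries and the matrix entries, the two data sources constrain each other, which is the defining feature of a coupled factorization.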

2.0 | LG | Aug 22, 2023

Patient Clustering via Integrated Profiling of Clinical and Digital Data

Dongjin Choi, Andy Xiang, Ozgur Ozturk et al.

We introduce a novel profile-based patient clustering model designed for clinical data in healthcare. Using a method grounded in constrained low-rank approximation, our model draws on patients' clinical data and digital interaction data, including browsing and search, to construct patient profiles. The method produces nonnegative embedding vectors that serve as a low-dimensional representation of the patients. Our model was assessed on real-world patient data from a healthcare web portal, with a comprehensive evaluation that considered both clustering and recommendation capabilities. Compared with other baselines, our approach demonstrated superior clustering coherence and recommendation accuracy.
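To make the notion of nonnegative embeddings from a constrained low-rank approximation concrete, here is a minimal sketch using standard nonnegative matrix factorization with multiplicative updates. The abstract does not specify the paper's exact formulation, so the choice of plain NMF, the concatenation of clinical and digital features into one matrix, the argmax cluster assignment, and the names `toy_patient_matrix` and `nmf_embed` are all assumptions.

```python
import numpy as np


def toy_patient_matrix(seed=0):
    """Hypothetical data: 6 patients, 4 clinical + 4 digital features."""
    rng = np.random.default_rng(seed)
    clinical = np.zeros((6, 4))
    digital = np.zeros((6, 4))
    clinical[:3, :2] = 1.0   # first group of patients
    digital[:3, :2] = 1.0
    clinical[3:, 2:] = 1.0   # second group of patients
    digital[3:, 2:] = 1.0
    noise = rng.uniform(0, 0.05, size=(6, 8))
    return np.hstack([clinical, digital]) + noise


def nmf_embed(X, rank, iters=200, eps=1e-9, seed=0):
    """Approximate X ~ W @ H with W, H >= 0 (multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.uniform(0.1, 1.0, size=(n, rank))
    H = rng.uniform(0.1, 1.0, size=(rank, m))
    errors = [float(np.linalg.norm(X - W @ H))]
    for _ in range(iters):
        # Ratios of nonnegative terms keep W and H nonnegative throughout.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(float(np.linalg.norm(X - W @ H)))
    return W, H, errors


if __name__ == "__main__":
    X = toy_patient_matrix()
    W, H, errors = nmf_embed(X, rank=2)
    clusters = W.argmax(axis=1)  # assign each patient to its dominant factor
    print("embeddings:\n", np.round(W, 2))
    print("cluster labels:", clusters)
```

Each row of `W` is one patient's nonnegative embedding. Reading the largest coordinate as a cluster label is one simple option; running a separate clustering algorithm on the rows of `W` would be another.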

1.4 | LG | Nov 22, 2017 | Code

SNeCT: Scalable network constrained Tucker decomposition for integrative multi-platform data analysis

Dongjin Choi, Lee Sael

Motivation: How do we integratively analyze large-scale multi-platform genomic data that are high dimensional and sparse? Furthermore, how can we systematically incorporate prior knowledge, such as the association between genes, in the analysis? Method: To solve this problem, we propose a Scalable Network Constrained Tucker decomposition method we call SNeCT. SNeCT adopts a parallel stochastic gradient descent approach on the proposed parallelizable network constrained optimization function. SNeCT decomposition is applied to a tensor constructed from large-scale multi-platform multi-cohort cancer data, PanCan12, constrained on a network built from the PathwayCommons database. Results: The decomposed factor matrices are applied to stratify cancers, to search for top-k similar patients, and to illustrate how the matrices can be used for personalized interpretation. In the stratification test, the combined twelve-cohort data are clustered to form thirteen subclasses. The thirteen subclasses correlate highly with tissue of origin, alongside other interesting observations such as a clear separation of OV cancers into two groups and high clinical correlation within subclusters formed in the BRCA and UCEC cohorts. In the top-k search, a new patient's genomic profile is generated and searched against existing patients based on the factor matrices. The similarity of the top-k patients to the query is high for 23 clinical features, including estrogen and progesterone receptor statuses of BRCA patients, with average precision values ranging from 0.72 to 0.86 and from 0.68 to 0.86, respectively. We also provide an illustration of how the factor matrices can be used for interpretable personalized analysis of each patient.
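The top-k patient search described in this abstract can be pictured as a nearest-neighbor lookup over the rows of a patient factor matrix. The sketch below assumes the query patient has already been mapped into the same latent space and that cosine similarity is the ranking measure; neither detail is given in the abstract, and `top_k_similar` is a hypothetical helper, not part of SNeCT.

```python
import numpy as np


def top_k_similar(query, patient_factors, k):
    """Return indices and cosine similarities of the k closest patients."""
    P = np.asarray(patient_factors, dtype=float)
    q = np.asarray(query, dtype=float)
    # Cosine similarity between the query and every row of the factor matrix.
    sims = (P @ q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)[:k]  # highest similarity first
    return order.tolist(), sims[order].tolist()


if __name__ == "__main__":
    # Hypothetical 2-dimensional latent factors for four existing patients.
    factors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    indices, scores = top_k_similar([1.0, 0.1], factors, k=2)
    print(indices, [round(s, 3) for s in scores])
```

Once the nearest patients are found, their clinical features can be compared with the query's known features, which is the kind of evaluation the precision figures in the abstract refer to.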