Seongwon Kim

CL
4 papers
108 citations
Novelty 26%
AI Score 37

4 Papers

BM · Aug 28, 2023 · Code
PeptideBERT: A Language Model based on Transformers for Peptide Property Prediction

Chakradhar Guntuboina, Adrita Das, Parisa Mollaei et al.

Recent advances in Language Models have given the protein modeling community a powerful tool, since protein sequences can be represented as text. Specifically, by taking advantage of Transformers, sequence-to-property prediction becomes feasible without the need for explicit structural data. In this work, inspired by recent progress in Large Language Models (LLMs), we introduce PeptideBERT, a protein language model for predicting three key properties of peptides (hemolysis, solubility, and non-fouling). PeptideBERT uses the ProtBERT pretrained transformer model with 12 attention heads and 12 hidden layers. We then fine-tuned the pretrained model for the three downstream tasks. Our model has achieved state-of-the-art (SOTA) performance in predicting hemolysis, the task of determining a peptide's potential to induce red blood cell lysis. Our PeptideBERT non-fouling model also achieved remarkable accuracy in predicting a peptide's capacity to resist non-specific interactions. This model, trained predominantly on shorter sequences, benefits from a dataset in which negative examples are largely associated with insoluble peptides. Code, models, and data used in this study are freely available at: https://github.com/ChakradharG/PeptideBERT
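
As a rough illustration of the input side of such a fine-tuning setup, the sketch below prepares a peptide string in the form that ProtBERT-style tokenizers expect: residues separated by spaces, with the rare amino acids U, Z, O, and B mapped to X, as the ProtBERT model card describes. The function name `prepare_peptide` and the example sequence are hypothetical, and nothing here is taken from the PeptideBERT repository.

```python
import re


def prepare_peptide(sequence: str) -> str:
    """Return a space-separated residue string for a ProtBERT-style tokenizer.

    Hypothetical helper for illustration; not part of PeptideBERT.
    """
    # Normalize case and strip surrounding whitespace.
    cleaned = sequence.strip().upper()
    # Map rare or ambiguous residues to X (assumption: mirrors the
    # preprocessing suggested on the ProtBERT model card).
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    # Separate residues with spaces so each amino acid becomes one token.
    return " ".join(cleaned)


if __name__ == "__main__":
    print(prepare_peptide("gllkuB"))
```

The resulting string could then be passed to a tokenizer and a classification head for each of the three property tasks; that part is omitted because the abstract does not specify it.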

LG · Oct 30, 2023
GPCR-BERT: Interpreting Sequential Design of G Protein Coupled Receptors Using Protein Language Models

Seongwon Kim, Parisa Mollaei, Akshay Antony et al.

With the rise of Transformers and Large Language Models (LLMs) in Chemistry and Biology, new avenues for the design and understanding of therapeutics have opened up to the scientific community. Protein sequences can be modeled as language and can take advantage of recent advances in LLMs, particularly given the abundance of accessible protein sequence datasets. In this paper, we developed the GPCR-BERT model for understanding the sequential design of G Protein-Coupled Receptors (GPCRs). GPCRs are the target of over one-third of FDA-approved pharmaceuticals. However, there is a lack of comprehensive understanding regarding the relationship between amino acid sequence, ligand selectivity, and conformational motifs (such as NPxxY, CWxP, E/DRY). By utilizing the pre-trained protein model (Prot-Bert) and fine-tuning it on tasks that predict variations in the motifs, we were able to shed light on several relationships between residues in the binding pocket and some of the conserved motifs. To achieve this, we interpreted the attention weights and hidden states of the model to quantify how much each amino acid contributes to determining the identity of the masked residues. The fine-tuned models demonstrated high accuracy in predicting hidden residues within the motifs. In addition, the embeddings were analyzed over 3D structures to elucidate the higher-order interactions within the conformations of the receptors.
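
To make the idea of predicting hidden residues within a motif concrete, here is a minimal sketch that locates a motif such as NPxxY in a sequence and replaces its variable `x` positions with a mask token. It assumes that the variable positions are the ones being masked, which the abstract does not state explicitly. The function name `mask_motif_variable_sites`, the `[MASK]` token string, and the toy sequence are hypothetical.

```python
import re
from typing import List

MASK = "[MASK]"  # assumed mask token string, as used by BERT-style models


def mask_motif_variable_sites(sequence: str, motif: str = "NPxxY") -> List[str]:
    """Split a sequence into residues and mask the 'x' sites of the first motif match."""
    # Build a regex from the motif: lowercase 'x' matches any single residue.
    pattern = "".join("." if ch == "x" else re.escape(ch) for ch in motif)
    tokens = list(sequence)
    match = re.search(pattern, sequence)
    if match is None:
        return tokens  # motif absent; return the residues unmasked
    # Replace only the variable positions inside the matched motif.
    for offset, ch in enumerate(motif):
        if ch == "x":
            tokens[match.start() + offset] = MASK
    return tokens


if __name__ == "__main__":
    print(" ".join(mask_motif_variable_sites("AANPLIYTT")))
```

A fine-tuned model would then be asked to recover the masked residues, and its attention weights over the remaining positions could be inspected to see which residues inform the prediction.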

CV · Nov 22, 2022
UpCycling: Semi-supervised 3D Object Detection without Sharing Raw-level Unlabeled Scenes

Sunwook Hwang, Youngseok Kim, Seongwon Kim et al.

Semi-supervised Learning (SSL) has received increasing attention in autonomous driving to reduce the enormous burden of 3D annotation. In this paper, we propose UpCycling, a novel SSL framework for 3D object detection with zero additional raw-level point clouds: learning from unlabeled de-identified intermediate features (i.e., smashed data) to preserve privacy. Since these intermediate features are naturally produced by the inference pipeline, no additional computation is required on autonomous vehicles. However, generating an effective consistency loss for unlabeled feature-level scenes turns out to be a critical challenge. The latest SSL frameworks for 3D object detection that enforce consistency regularization between different augmentations of an unlabeled raw-point scene become detrimental when applied to intermediate features. To solve the problem, we introduce a novel combination of hybrid pseudo labels and feature-level Ground Truth sampling (F-GT), which safely augments unlabeled multi-type 3D scene features and provides high-quality supervision. We implement UpCycling on two representative 3D object detection models: SECOND-IoU and PV-RCNN. Experiments on widely used datasets (Waymo, KITTI, and Lyft) verify that UpCycling outperforms other augmentation methods applied at the feature level. In addition, while preserving privacy, UpCycling performs better than or comparably to the state-of-the-art methods that utilize raw-level unlabeled data in both domain adaptation and partial-label scenarios.

CL · Jan 14 · Code
A.X K1 Technical Report

Sung Jun Cheon, Jaekyung Cho, Seongho Choi et al.

We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.