Bryan Cai

CLJul 1, 2023

Hierarchical Pretraining for Biomedical Term Embeddings

Bryan Cai, Sihang Zeng, Yucong Lin et al. · tsinghua

Electronic health records (EHR) contain narrative notes that provide extensive details on the medical condition and management of patients. Natural language processing (NLP) of clinical notes can use observed frequencies of clinical terms as predictive features for downstream applications such as clinical decision making and patient trajectory prediction. However, due to the vast number of highly similar and related clinical concepts, a more effective modeling strategy is to represent clinical terms as semantic embeddings via representation learning and use the low dimensional embeddings as feature vectors for predictive modeling. To achieve efficient representation, fine-tuning pretrained language models with biomedical knowledge graphs may generate better embeddings for biomedical terms than those from standard language models alone. These embeddings can effectively discriminate synonymous pairs of from those that are unrelated. However, they often fail to capture different degrees of similarity or relatedness for concepts that are hierarchical in nature. To overcome this limitation, we propose HiPrBERT, a novel biomedical term representation model trained on additionally complied data that contains hierarchical structures for various biomedical terms. We modify an existing contrastive loss function to extract information from these hierarchies. Our numerical experiments demonstrate that HiPrBERT effectively learns the pair-wise distance from hierarchical information, resulting in a substantially more informative embeddings for further biomedical applications

DSMar 29, 2017

Priv'IT: Private and Sample Efficient Identity Testing

Bryan Cai, Constantinos Daskalakis, Gautam Kamath

We develop differentially private hypothesis testing methods for the small sample regime. Given a sample $\cal D$ from a categorical distribution $p$ over some domain $Σ$, an explicitly described distribution $q$ over $Σ$, some privacy parameter $\varepsilon$, accuracy parameter $α$, and requirements $β_{\rm I}$ and $β_{\rm II}$ for the type I and type II errors of our test, the goal is to distinguish between $p=q$ and $d_{\rm{TV}}(p,q) \geq α$. We provide theoretical bounds for the sample size $|{\cal D}|$ so that our method both satisfies $(\varepsilon,0)$-differential privacy, and guarantees $β_{\rm I}$ and $β_{\rm II}$ type I and type II errors. We show that differential privacy may come for free in some regimes of parameters, and we always beat the sample complexity resulting from running the $χ^2$-test with noisy counts, or standard approaches such as repetition for endowing non-private $χ^2$-style statistics with differential privacy guarantees. We experimentally compare the sample complexity of our method to that of recently proposed methods for private hypothesis testing.

Bryan Cai

2 Papers