AIJul 11, 2023Code
An Open-Source Knowledge Graph Ecosystem for the Life SciencesTiffany J. Callahan, Ignacio J. Tripodi, Adrianne L. Stefanski et al. · berkeley, harvard
Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoints and abstraction algorithms), and benchmarks (e.g., prebuilt KGs and embeddings). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.
MEOct 13, 2020
A standardized framework for risk-based assessment of treatment effect heterogeneity in observational healthcare databasesAlexandros Rekkas, David van Klaveren, Patrick B. Ryan et al.
The Predictive Approaches to Treatment Effect Heterogeneity statement focused on baseline risk as a robust predictor of treatment effect and provided guidance on risk-based assessment of treatment effect heterogeneity in the RCT setting. The aim of this study was to extend this approach to the observational setting using a standardized scalable framework. The proposed framework consists of five steps: 1) definition of the research aim, i.e., the population, the treatment, the comparator and the outcome(s) of interest; 2) identification of relevant databases; 3) development of a prediction model for the outcome(s) of interest; 4) estimation of relative and absolute treatment effect within strata of predicted risk, after adjusting for observed confounding; 5) presentation of the results. We demonstrate our framework by evaluating heterogeneity of the effect of angiotensin-converting enzyme (ACE) inhibitors versus beta blockers on three efficacy and six safety outcomes across three observational databases. The proposed framework can supplement any comparative effectiveness study. We provide a publicly available R software package for applying this framework to any database mapped to the Observational Medical Outcomes Partnership Common Data Model. In our demonstration, patients at low risk of acute myocardial infarction received negligible absolute benefits for all three efficacy outcomes, though they were more pronounced in the highest risk quarter, especially for hospitalization with heart failure. However, failing diagnostics showed evidence of residual imbalances even after adjustment for observed confounding. Our framework allows for the evaluation of differential treatment effects across risk strata, which offers the opportunity to consider the benefit-harm trade-off between alternative treatments.
APAug 14, 2020
Logistic regression models for patient-level prediction based on massive observational data: Do we need all data?Luis H. John, Jan A. Kors, Jenna M. Reps et al.
Objective: Provide guidance on sample size considerations for developing predictive models by empirically establishing the adequate sample size, which balances the competing objectives of improving model performance and reducing model complexity as well as computational requirements. Materials and Methods: We empirically assess the effect of sample size on prediction performance and model complexity by generating learning curves for 81 prediction problems (23 outcomes predicted in a depression cohort, 58 outcomes predicted in a hypertension cohort) in three large observational health databases, requiring training of 17,248 prediction models. The adequate sample size was defined as the sample size for which the performance of a model equalled the maximum model performance minus a small threshold value. Results: The adequate sample size achieves a median reduction of the number of observations of 9.5%, 37.3%, 58.5%, and 78.5% for the thresholds of 0.001, 0.005, 0.01, and 0.02, respectively. The median reduction of the number of predictors in the models was 8.6%, 32.2%, 48.2%, and 68.3% for the thresholds of 0.001, 0.005, 0.01, and 0.02, respectively. Discussion: Based on our results a conservative, yet significant, reduction in sample size and model complexity can be estimated for future prediction work. Though, if a researcher is willing to generate a learning curve a much larger reduction of the model complexity may be possible as suggested by a large outcome-dependent variability. Conclusion: Our results suggest that in most cases only a fraction of the available data was sufficient to produce a model close to the performance of one developed on the full data set, but with a substantially reduced model complexity.