Quantitative Medicine Scientist, Critical Path Institute, United States
Disclosure(s):
Wes Anderson, PhD.: No financial relationships to disclose
Objectives: Real-World Data (RWD) from sources such as electronic health records is crucial for analyzing clinical characteristics and outcomes across numerous diseases and therapeutic areas. Researchers aim to use RWD, standardized and harmonized into an open-source common data model (CDM), to evaluate the clinical characteristics of COVID-19 patients (1). The objective of this work is to validate statistical and machine learning methods for the analysis of RWD. Such methods may improve prognosis assessments and treatment guidelines for future pandemics, and may support informed decision-making by policymakers and regulators through real-world evidence generation.
Methods: This study included 92,457 adult (18+ years) patients hospitalized with acute COVID-19 in 8 healthcare institutions between March 2020 and March 2024. Analysts harmonized patient-level data, including vital signs, demographics, treatments, interventions (e.g., oxygen support from various modalities), comorbidities, and laboratory tests, to the Observational Medical Outcomes Partnership (OMOP) CDM to support standardization across institutions. From this, a cross-sectional dataset with 115 initial covariates captured in the first 48 hours of admission was formed. Researchers performed data quality assessments and cleaning by removing outliers that were not clinically plausible, eliminating covariates with missingness > 10%, and performing complete case analysis. Researchers then performed time-to-event analysis on the remaining 76,847 patients using Random Survival Forests (RSF) (2) and Cox Proportional Hazards (CPH) (3) models. Feature selection used variable importance for the RSF model and the Least Absolute Shrinkage and Selection Operator (LASSO) for the CPH model. Results were compared using the concordance index (c-index).
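The cleaning steps described in the Methods can be illustrated with a minimal sketch in plain Python. It is not the study's pipeline: the column names, the toy records, and the plausibility bounds are hypothetical, and treating an implausible value as missing (rather than dropping the whole record) is an assumption, since the abstract does not say which approach was used. Only the 10% missingness threshold and the complete case step come from the text.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]

# Hypothetical plausibility limits; the abstract does not list the actual ones.
PLAUSIBLE_BOUNDS: Dict[str, Tuple[float, float]] = {"spo2": (50.0, 100.0)}


def mask_implausible(rows: List[Row], bounds: Dict[str, Tuple[float, float]]) -> List[Row]:
    """Set values outside the given bounds to None (assumed handling of outliers)."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for name, (low, high) in bounds.items():
            value = new_row.get(name)
            if value is not None and not (low <= value <= high):
                new_row[name] = None
        cleaned.append(new_row)
    return cleaned


def drop_sparse_covariates(rows: List[Row], max_missing: float = 0.10) -> List[Row]:
    """Remove covariates whose fraction of missing values exceeds max_missing."""
    if not rows:
        return []
    n = len(rows)
    keep = [
        name for name in rows[0]
        if sum(row[name] is None for row in rows) / n <= max_missing
    ]
    return [{name: row[name] for name in keep} for row in rows]


def complete_cases(rows: List[Row]) -> List[Row]:
    """Keep only records with no missing values in the remaining covariates."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Toy data: 40 records, three hypothetical covariates.
records: List[Row] = [
    {"age": float(40 + i), "spo2": 95.0, "bilirubin": 1.0} for i in range(40)
]
records[0]["spo2"] = None        # genuinely missing value
records[1]["spo2"] = 140.0       # not clinically plausible
for i in range(10):
    records[i]["bilirubin"] = None  # 25% missing, above the 10% threshold

cleaned = complete_cases(
    drop_sparse_covariates(mask_implausible(records, PLAUSIBLE_BOUNDS))
)
print(len(cleaned), sorted(cleaned[0]))
```

In this toy example the bilirubin column is dropped for exceeding 10% missingness, and the two records with a missing or implausible SpO2 value are removed by the complete case step. The order of the steps matters: filtering covariates before the complete case step avoids discarding records because of a covariate that would be dropped anyway.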
Results: The two methodologies overlapped in the features they selected. The most important variable identified by RSF variable importance was age, followed by oxygen support level, SpO2, and respiratory rate within the first 48 hours. The LASSO analysis highlighted age and SpO2 in addition to oxygen support level and critical lab tests (e.g., maximum bilirubin and minimum platelet counts). The RSF model achieved a c-index of 0.83, whereas the CPH model achieved 0.79. Although the CPH model demonstrated good predictive power, its large number of features increased the number of violations of the proportional hazards assumption. Mitigating these violations through methods such as stratification risks overfitting the model, which suggests that alternative machine learning methods may be worth exploring in future work.
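For readers unfamiliar with the c-index used to compare the two models, the following is a minimal sketch of Harrell's concordance index in plain Python. It is an illustration only, not the implementation used in the study: the function name and the toy data are hypothetical, pairs with tied event times are simply skipped, and the quadratic loop would be too slow for a cohort of this size.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[int],
    risk_scores: Sequence[float],
) -> float:
    """Fraction of comparable pairs in which the patient who died earlier
    was assigned the higher predicted risk (ties in risk count as 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event in a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: follow-up time, event indicator (1 = death, 0 = censored),
# and a hypothetical model risk score for four patients.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.4, 0.6, 0.1]

toy_c = concordance_index(toy_times, toy_events, toy_risk)
print(toy_c)
```

Here five pairs are comparable and four of them are ordered correctly, so the sketch returns 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.83 (RSF) and 0.79 (CPH) should be read.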
Conclusions: This work demonstrates the use of statistical and machine learning methods for analyzing RWD to identify patients at higher risk of death in critical care situations, specifically for COVID-19. The results provide insight into factors associated with survival and can inform potential risk management strategies.
Citations: [1] Heavner et al., doi: 10.1097/CCE.0000000000000893. [2] Ishwaran et al., Random survival forests for R. R News. 2007;7:25–31. [3] Cox, D. R., doi: 10.1111/j.2517-6161.1972.tb00899.x.