Machine learning methods as alternatives to regression modelling in propensity score and disease risk score based methods
Guo Y.
Background Propensity score (PS) and disease risk score (DRS) based methods have become increasingly popular in observational studies as a way to reduce confounding bias when estimating treatment effects. Traditionally, logistic regression has been the primary approach to estimating PS and DRS. However, logistic models have limitations, particularly in settings with large numbers of confounders, nonlinear relationships, and complex interactions. Machine learning (ML) methods have shown potential as alternatives to regression modelling because they can handle high-dimensional data and capture complex relationships. However, few studies have evaluated these methods against logistic regression in the context of PS and DRS estimation.
Methods I set out to evaluate various data-driven ML methods as alternatives to logistic regression modelling with pre-defined covariates for PS and DRS estimation. The ML methods tested were L1 regularization (LASSO), multi-layer perceptron (MLP), extreme gradient boosting (XGBoost), and random forest (RF). I compared these methods across scenarios that varied in complexity, non-additivity, and treatment and outcome prevalence. I conducted a range of simulation studies, including Monte Carlo and plasmode simulations, as well as an analysis of real-world (routinely collected) clinical data. Hyperparameter tuning methods were considered to optimize model performance.
Results My studies highlight the advantages and limitations of ML methods for PS and DRS estimation, and emphasize the need to apply these methods carefully. A key result of my thesis was the implementation of cross-validation hyperparameter tuning, a practice rarely seen in previous PS and DRS literature, which enhanced performance across all ML methods.
My research also revealed that both logistic regression and ML methods could achieve the best performance in different scenarios among those tested, highlighting the value of making a variety of methods available for PS or DRS estimation, and the need for guidance on their application according to the characteristics of the data at hand. An important advantage of ML methods is that they remove the need for a covariate selection process, unlike traditional logistic regression, which requires pre-selected covariates. This makes ML methods more scalable for problems involving a large number of covariates. In addition, ML methods can excel in some simulations with a large number of confounders, large sample sizes, and nonlinear data, while logistic regression can provide more stable results when sample size and treatment prevalence are lower. In the comparative analysis of PS and DRS, I showed that DRS performed better in scenarios with low treatment prevalence (below 0.1). However, when treatment was more common, PS outperformed DRS across all evaluated scenarios, in both real-world and simulated data. In both the clinical dataset and the corresponding plasmode simulation, XGBoost yielded the best performance in terms of PS and DRS estimation, covariate balance, and relative bias in treatment effect estimation.
Conclusions This thesis evaluated the use and performance of specific ML methods (mainly XGBoost, LASSO, and MLP) as potential alternatives to traditional logistic regression models for PS and DRS estimation. While logistic regression with pre-selected covariates performed better in scenarios with a small number of confounders, ML methods, especially XGBoost, performed better in some settings with a large number of covariates, larger sample sizes, and nonlinear data.
An important contribution of this thesis is the implementation of cross-validation hyperparameter tuning in the PS and DRS model training process. A comparative analysis of PS and DRS across diverse treatment prevalences and various models offered valuable guidance on the choice between PS and DRS depending on treatment prevalence. In summary, this thesis fills gaps in the understanding of the strengths and weaknesses of ML methods for estimating PS and DRS in health data sciences.
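To illustrate what cross-validation hyperparameter tuning of a PS model can look like in practice, the following is a minimal Python sketch and not the pipeline used in the thesis. It rests on several assumptions: the cohort is simulated with hypothetical coefficients, scikit-learn's GradientBoostingClassifier stands in for XGBoost, the hyperparameter grid is purely illustrative, and covariate balance is summarised with standardized mean differences under inverse probability of treatment weighting, which is one common diagnostic among several.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical simulated cohort (assumption): 5 confounders and a treatment
# assignment model with a nonlinear term and an interaction.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
logit = -0.5 + 0.8 * X[:, 0] - 0.6 * X[:, 1] ** 2 + 0.5 * X[:, 0] * X[:, 2]
treat = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# 5-fold cross-validated grid search over a small, illustrative grid.
# Gradient boosting from scikit-learn is a stand-in for XGBoost here.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "n_estimators": [50, 100],
        "max_depth": [2, 3],
        "learning_rate": [0.05, 0.1],
    },
    scoring="neg_log_loss",
    cv=5,
)
search.fit(X, treat)

# Estimated propensity scores from the refitted best model, clipped to
# avoid extreme weights.
ps = np.clip(search.best_estimator_.predict_proba(X)[:, 1], 0.01, 0.99)

# Inverse probability of treatment weights.
w = np.where(treat == 1, 1.0 / ps, 1.0 / (1.0 - ps))


def weighted_smd(x, t, weights):
    """Standardized mean difference of covariate x between treatment groups.

    Group means are weighted; the denominator is the pooled unweighted SD.
    """
    m1 = np.average(x[t == 1], weights=weights[t == 1])
    m0 = np.average(x[t == 0], weights=weights[t == 0])
    pooled_sd = np.sqrt((x[t == 1].var(ddof=1) + x[t == 0].var(ddof=1)) / 2.0)
    return (m1 - m0) / pooled_sd


unit_w = np.ones(n)
smd_before = np.array([weighted_smd(X[:, j], treat, unit_w) for j in range(X.shape[1])])
smd_after = np.array([weighted_smd(X[:, j], treat, w) for j in range(X.shape[1])])

print("Best hyperparameters:", search.best_params_)
print("SMD before weighting:", np.round(smd_before, 3))
print("SMD after weighting: ", np.round(smd_after, 3))
```

The same pattern would apply to a DRS model by replacing the treatment indicator with the outcome as the prediction target. Which subjects the DRS model is fitted on is a separate design choice that this sketch does not address.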