Background: Machine learning (ML) methods are promising alternatives for data-driven propensity score (PS) estimation. Several metrics are available for model evaluation and hyper-parameter tuning, but there is no clear guidance on which (if any) should be used when estimating PS with data-driven ML methods.

Objectives: We aimed to assess the usefulness of different metrics for model and hyper-parameter selection in ML-based PS estimation. To do this, we investigated the association between these metrics and observed bias using four different ML models.

Methods: Data (n = 100 000) were generated using parametric Monte Carlo simulations with 100 iterations and 100 confounders. A binary exposure with prevalence 0.5 and a binary outcome with prevalence 0.02 were generated via logistic distributions. We tested two treatment effect scenarios, with true odds ratios (OR) of 1.5 and 2. First, we used four ML methods to estimate PS: (1) LASSO (Least Absolute Shrinkage and Selection Operator), (2) RF (Random Forest), (3) MLP (Multilayer Perceptron), and (4) XGBoost (eXtreme Gradient Boosting). Second, 1:1 PS matching was applied using each estimated PS, and we obtained the relative bias of the treatment effect estimate and the ASAM (average absolute standardized mean difference). Third, we explored how relative bias and ASAM related to the following PS metrics: (1) AUC (Area Under the ROC Curve), (2) calibration slope, and (3) Brier score. Lastly, we tested different hyper-parameters and reported the treatment effect estimation bias alongside the best-performing PS metrics from the previous step.

Results: With fixed ML hyper-parameters, Brier score and calibration slope performed best at predicting bias after PS matching for both ORs. For example, with OR 1.5, the lowest relative bias was obtained by MLP at 19.78% (95% CI: 15.22%, 24.32%), and its calibration slope of 0.88 (95% CI: 0.86, 0.90) and Brier score of 0.27 (95% CI: 0.25, 0.29) were the best among all tested ML methods. Conversely, AUC was not consistently associated with bias. For example, RF had the best AUC at 0.99 (95% CI: 0.98, 1) but led to the largest bias at 34.32% (95% CI: 30.20%, 38.44%). Experiments run after hyper-parameter testing for the different ML models also showed that PS models with better Brier scores and calibration slopes can generate the lowest bias.

Conclusions: Calibration-related metrics (Brier score and calibration slope) were useful evaluation metrics for model selection and hyper-parameter tuning of PS estimation using ML models. Conversely, discrimination estimates (AUC) could be misleading, with some scenarios showing almost perfect discrimination but very large bias. More research is needed to confirm these findings and to provide guidance for data-driven ML-based PS estimation.
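The three PS metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code and does not reproduce the simulation: the function names, the toy data, and the Newton-Raphson fit for the calibration slope are all assumptions made for illustration. The calibration slope is taken here as the coefficient b in a logistic model of observed exposure on logit(PS), which is a common definition but may differ in detail from the one used in the study.

```python
import math

def brier_score(y, p):
    """Mean squared difference between estimated PS and observed exposure (lower is better)."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def auc(y, p):
    """Chance that a random exposed unit has a higher PS than a random unexposed unit (ties count 0.5)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def calibration_slope(y, p, iters=25, eps=1e-6):
    """Slope b of logit(P(y = 1)) = a + b * logit(p), fitted by Newton-Raphson (illustrative fit)."""
    clipped = [min(max(pi, eps), 1 - eps) for pi in p]
    x = [math.log(pi / (1 - pi)) for pi in clipped]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, xi in zip(y, x):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            g0 += yi - mu          # gradient w.r.t. intercept
            g1 += (yi - mu) * xi   # gradient w.r.t. slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return b

# Hypothetical toy data: 20 units in two groups, with 2/10 and 8/10 exposed.
y = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
p_cal = [0.2] * 10 + [0.8] * 10   # PS equal to the observed exposure rates
p_over = [0.1] * 10 + [0.9] * 10  # same ranking, but over-confident PS

for name, p in [("calibrated", p_cal), ("over-confident", p_over)]:
    print(name, auc(y, p), brier_score(y, p), calibration_slope(y, p))
```

In this toy case both sets of scores rank units identically, so the AUC is 0.8 for each, while the over-confident scores have a worse Brier score (about 0.17 versus 0.16) and a calibration slope of about 0.63 rather than 1. This only shows that discrimination and calibration can diverge; it says nothing about bias after matching, which is what the study measured.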

Conference paper

Publication Date





43 - 44