BACKGROUND: Bladder cancer (BC) symptoms often overlap with those of benign conditions, and no routine screening exists for the general population. We aimed to develop a machine learning (ML)-based screening pipeline for early BC detection using electronic health records (EHRs) in primary care.
METHODS: A multi-centred case-control cohort (1995-2018; n = 64,884) was created for model training and testing. We further validated the model prospectively on an independent cohort (2019-2020; n = 4,569). We proposed the Parsimony driven REweighting for Calibrated Input-based Screening for Early detection-Adjustable Grey Zone (PRECISE-AGZ) pipeline, which identified influential features from 48,261 candidates and developed a calibrated logistic regression screening model with optimised grey-zone thresholds.
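A grey-zone rule of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `assign_risk_band`, the cut-off values, and the treatment of probabilities that fall exactly on a cut-off are all assumptions made for the example.

```python
def assign_risk_band(probability, lower, upper):
    """Map a calibrated probability to a risk band using two cut-offs.

    Assumption: values exactly equal to a cut-off fall in the grey zone.
    """
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower < upper <= 1")
    if probability < lower:
        return "low"
    if probability > upper:
        return "high"
    return "grey"


# Hypothetical cut-offs, chosen only for illustration (not the study's values).
LOWER, UPPER = 0.30, 0.60
bands = [assign_risk_band(p, LOWER, UPPER) for p in (0.10, 0.45, 0.80)]
print(bands)
```

In a pipeline of this type, the two cut-offs would be tuned on held-out data so that the grey zone captures the predictions the model is least certain about.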
RESULTS: The final model retained 38 features and achieved an area under the curve (AUC) of 0.789 (95% CI: 0.780-0.798) on the testing set. Neurological disorders (e.g., Parkinson's disease; odds ratio [OR]: 0.86, 95% CI: 0.79-0.92) and medications (e.g., tamoxifen; OR: 1.13, 95% CI: 1.07-1.20) emerged as novel predictors for BC screening. The screening model stratified the population into three risk categories based on predicted probability: low-risk (0.55), achieving a sensitivity of 0.852, F1-score of 0.799, and screening population coverage (SPC) of 34.5%. When applied to the prospective validation cohort, model performance varied with the number of months before BC diagnosis, with sensitivity ranging from 0.872 (F1-score: 0.714, SPC: 29.9%) at the first month to 0.667 (F1-score: 0.690, SPC: 12.7%) at the twelfth month.
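For readers less familiar with the reported metrics, the sketch below computes sensitivity, F1-score, and a coverage figure from binary labels and screening flags. The abstract does not define SPC, so the code assumes it is the share of the whole population flagged for screening; that definition, the function name `screening_metrics`, and the toy data are all hypothetical.

```python
def screening_metrics(y_true, flagged):
    """Return (sensitivity, f1, coverage) for binary labels and screening flags.

    Assumption: coverage is the fraction of all individuals flagged for screening.
    """
    if len(y_true) != len(flagged) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = sum(1 for t, f in zip(y_true, flagged) if t == 1 and f)
    fn = sum(1 for t, f in zip(y_true, flagged) if t == 1 and not f)
    fp = sum(1 for t, f in zip(y_true, flagged) if t == 0 and f)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    coverage = sum(1 for f in flagged if f) / len(flagged)
    return sensitivity, f1, coverage


# Toy data for illustration only: 3 cases and 5 controls.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
flagged = [True, True, False, True, False, False, False, False]
sensitivity, f1, coverage = screening_metrics(y_true, flagged)
print(round(sensitivity, 3), round(f1, 3), round(coverage, 3))
```

How individuals in the grey zone enter these calculations is a design choice that the abstract does not specify, so the sketch treats the flag as a simple yes/no decision.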
CONCLUSION: The PRECISE-AGZ pipeline efficiently identified clinical signals from EHRs for early BC detection and shows promise for implementing population-based BC screening.