Prediction of Oswestry Disability Index and Numeric Rating Scale scores after lumbar spine surgery: machine learning model development and fairness assessment

By: Joakimsen, H. L.; Lund, J. A.; Burman, J.; Woldaregay, A. Z.; Berg, B.; Solberg, T. K.; Ingebrigtsen, T.; Mikalsen, K. O.

Background

One-third of patients operated on for degenerative conditions of the lumbar spine do not report substantial improvement after 12 months. Most previous outcome prediction models are classifiers, which limits the nuance of their predictions and their usefulness for decision support.

Objectives

To develop and test models for the prediction of continuous outcome scores and retrieval of similar patients’ outcomes, and to evaluate the models’ fairness.

Setting

Norwegian public and private specialist healthcare.

Participants and data source

All cases recorded with an elective operation for lumbar disc herniation (LDH, n=18 377) or lumbar spinal stenosis (LSS, n=24 540) in the Norwegian Registry for Spine Surgery from 1 January 2007 to 23 May 2023.

Outcome measures

All outcomes were patient-reported 12 months after the operation. The primary outcome was the Oswestry disability index (ODI), modelled on a scale ranging from 0 to 100. Numeric Rating Scale scores (range 0–10) for back and leg pain were secondary outcomes.

Model building and performance

We selected 22 predictors, recorded preoperatively by patients and clinicians, based on Shapley Additive Explanations (SHAP) values. Data for LDH and LSS were split into 80% training and 20% test samples. Six machine learning methods for regression, that is, for a continuous outcome (extreme gradient boosting (XGBoost), Gaussian process regression, gradient boosting regression, artificial neural networks and linear regression), were trained for both conditions using fivefold cross-validation. We report the magnitude and distribution of errors as mean absolute error (MAE) with 95% CIs, and explanatory power as the coefficient of determination (R²). Fairness and calibration were assessed with violin plots and calibration plots of error. We developed a patient-similarity function that uses a K-nearest neighbour model to retrieve the individual outcomes of the 50 most similar patients, and evaluated it by calculating L1 (Manhattan) distances across subgroups.
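
The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `manhattan` and `retrieve_similar`, the two-feature toy cohort and the outcome values are all hypothetical, and the sketch assumes the predictors have already been scaled to comparable ranges. It shows only the general idea of ranking patients by L1 distance and returning the outcomes of the k nearest.

```python
from typing import List, Sequence, Tuple


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 (Manhattan) distance between two equally long feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def retrieve_similar(
    query: Sequence[float],
    cohort: List[Sequence[float]],
    outcomes: List[float],
    k: int = 50,
) -> List[Tuple[float, float]]:
    """Return (distance, outcome) pairs for the k cohort patients closest to the query."""
    scored = sorted(
        (manhattan(query, features), outcome)
        for features, outcome in zip(cohort, outcomes)
    )
    return scored[:k]


# Toy cohort (hypothetical): two already-scaled predictors per patient.
cohort = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1), (3.0, 3.0)]
odi_12m = [10.0, 30.0, 12.0, 60.0]  # made-up 12-month ODI scores

# k=2 only because the toy cohort is tiny; the abstract uses 50 neighbours.
neighbours = retrieve_similar((0.1, 0.1), cohort, odi_12m, k=2)
```

Returning the individual neighbour outcomes, rather than only their mean, matches the stated aim of showing a patient how similar patients actually fared.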

Results

XGBoost regression performed best for both conditions. The models showed good calibration and predicted ODI with MAE 11.32 (95% CI 11.00 to 11.63) and R² 0.27 (95% CI 0.24 to 0.29) for LDH and MAE 12.05 (95% CI 11.76 to 12.32) and R² 0.31 (95% CI 0.28 to 0.34) for LSS. The MAEs for back and leg pain were 2.09 (95% CI 2.04 to 2.15) and 1.95 (95% CI 1.90 to 2.00) for LDH and 2.33 (95% CI 2.28 to 2.38) and 2.13 (95% CI 2.08 to 2.16) for LSS. All models were fair with differences in error between subgroups for sex, age, education level and native language. In the patient-similarity function, distances at baseline were evenly distributed across subgroups.
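
For readers unfamiliar with the reported metrics, the sketch below computes MAE and R² on a toy set of observed and predicted scores and attaches a percentile bootstrap interval to the MAE. The abstract does not say how the 95% CIs were derived, so the bootstrap is an assumption made here for illustration; the helper names and all data values are hypothetical.

```python
import random
from typing import Callable, Sequence, Tuple


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def bootstrap_ci(
    metric: Callable[[Sequence[float], Sequence[float]], float],
    y_true: Sequence[float],
    y_pred: Sequence[float],
    n_boot: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile 95% interval from resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Made-up observed and predicted ODI scores for six patients.
observed = [20, 35, 10, 50, 28, 42]
predicted = [25, 30, 15, 40, 30, 45]

point_mae = mae(observed, predicted)
point_r2 = r2(observed, predicted)
low, high = bootstrap_ci(mae, observed, predicted)
```

MAE is in the units of the outcome scale, so an MAE of about 11 to 12 on the ODI means predictions miss by that many points on average on a 0 to 100 scale.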

Conclusions

Our machine learning models predicted continuous outcomes with MAEs close to the standard errors of measurement. The models were fair across sociodemographic subgroups. We also developed a patient-similarity function that supplements the predictions.
