
Surgery Duration Prediction with AutoML

Leveraging automated machine learning to reduce NHS operating room inefficiencies and improve theatre utilisation

Python, AutoGluon, XGBoost, SHAP, Healthcare, AutoML, NHS

The Problem

Operating room (OR) inefficiencies cost the NHS over £400 million annually. With OR running costs of £14-£30 per minute, accurate surgery duration prediction is critical for reducing overruns, avoiding idle time, and maximising theatre utilisation.
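To make the per-minute figures concrete, the sketch below converts a block of theatre minutes into a cost range using the £14-£30 bounds quoted above. The function name and the 20-minute example are illustrative assumptions, not figures from the study.

```python
# Per-minute operating room cost bounds quoted in the text (GBP).
COST_PER_MINUTE_LOW = 14
COST_PER_MINUTE_HIGH = 30


def theatre_cost_range(minutes: float) -> tuple[float, float]:
    """Return the (low, high) cost in GBP of `minutes` of theatre time."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * COST_PER_MINUTE_LOW, minutes * COST_PER_MINUTE_HIGH


# Hypothetical example: a list that overruns by 20 minutes.
low, high = theatre_cost_range(20)
print(f"20 minutes of theatre time: GBP {low:.0f}-{high:.0f}")
```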

Key Challenges

  • 7.4 million patients on NHS waiting lists (March 2025)
  • Traditional methods are inaccurate:
    • Surgeon estimates: 59-70 minute mean absolute error (MAE)
    • Historical averages: 31-38 minute MAE
  • Consequences of poor estimation: Late starts, last-minute cancellations, staff overtime, wasted capacity
  • Limited technical resources: Many trusts lack ML expertise for implementing advanced predictive models
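The MAE figures in the list above are averages of the absolute gap between estimated and actual duration. A minimal sketch of that calculation follows; the durations are made-up values for illustration only.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted|, in the same unit as the inputs."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up durations in minutes, not data from the study.
actual_minutes = [90, 45, 120, 60]
estimated_minutes = [60, 60, 150, 45]

mae = mean_absolute_error(actual_minutes, estimated_minutes)
print(f"MAE: {mae} minutes")
```

An MAE of 59-70 minutes therefore means a typical estimate misses the actual duration by about an hour, in either direction.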
"Surgeon estimates incorporate real-time insights but are inherently subjective, prone to cognitive biases, and impose unnecessary cognitive load on clinical staff who should focus on patient care."

The Solution

This study evaluated AutoGluon, an automated machine learning (AutoML) framework, for surgery duration prediction using a retrospective dataset of 94,502 elective orthopaedic procedures from East Kent Hospitals University NHS Foundation Trust (2010-2025).

AutoGluon: Automated Machine Learning

AutoGluon automates the entire ML pipeline—algorithm selection, hyperparameter optimization, feature engineering, and model validation—enabling sophisticated predictive modeling without requiring extensive ML expertise.
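As a rough illustration of how little configuration this involves, a training call with AutoGluon's tabular API might look like the sketch below. The label column name `duration_minutes`, the `best_quality` preset, and the helper names are assumptions made for this example; the study's actual settings are not stated here.

```python
def fit_kwargs(time_limit_hours: float, presets: str = "best_quality") -> dict:
    """Translate a training budget in hours into TabularPredictor.fit arguments."""
    return {"time_limit": int(time_limit_hours * 3600), "presets": presets}


def train_duration_model(train_df, label="duration_minutes", time_limit_hours=1.0):
    """Fit an AutoGluon regressor on a pandas DataFrame (hypothetical column names)."""
    # Imported lazily so this module loads even where AutoGluon is not installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(
        label=label,
        problem_type="regression",
        eval_metric="mean_absolute_error",
    )
    # fit() runs model selection, tuning, and ensembling within the time limit.
    return predictor.fit(train_data=train_df, **fit_kwargs(time_limit_hours))
```

Calling `train_duration_model(train_df, time_limit_hours=4)` would correspond to a 4-hour budget; `predictor.predict(test_df)` then returns predicted durations.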

Multi-Layer Stacking Architecture

Figure: AutoGluon multi-layer stacking with bagging

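AutoGluon combines multi-layer stacking with multi-fold bagging: each layer's models are trained on out-of-fold predictions from the layer below, so each added layer can improve predictive power incrementally while its inputs stay free of label leakage from the previous layer.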
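The idea can be sketched in a few lines of NumPy. This is a simplified illustration, not AutoGluon's implementation: ridge regression stands in for the diverse model families AutoGluon trains, and the data are synthetic. The key point is that the second layer sees the original features plus out-of-fold predictions, which are made by models that never saw the row they predict.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalised intercept."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0  # do not shrink the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def ridge_predict(X, weights):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ weights


def out_of_fold_predictions(X, y, alpha, n_folds=5, seed=0):
    """k-fold bagging: every row is predicted by a model that never saw it."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    oof = np.empty(len(y))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        weights = ridge_fit(X[train], y[train], alpha)
        oof[held_out] = ridge_predict(X[held_out], weights)
    return oof


def stack_layer(X, y, alphas, n_folds=5):
    """Inputs for the next layer: original features plus each base model's OOF predictions."""
    oof_columns = [out_of_fold_predictions(X, y, a, n_folds) for a in alphas]
    return np.hstack([X, np.column_stack(oof_columns)])


# Synthetic data standing in for surgical cases (durations in minutes).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([30.0, -10.0, 5.0]) + 90 + rng.normal(scale=5, size=200)

layer2_inputs = stack_layer(X, y, alphas=[0.1, 10.0])
final_oof = out_of_fold_predictions(layer2_inputs, y, alpha=1.0)
print("Stacked out-of-fold MAE:", np.mean(np.abs(final_oof - y)))
```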

Model Comparison Methodology

AutoGluon's performance was rigorously compared against three baseline models under identical preprocessing and computational budgets:

  • Linear Regression - Simple baseline
  • XGBoost - State-of-the-art gradient boosting (current gold standard)
  • Feed-Forward Neural Network - Deep learning approach

Feature Analysis with SHAP

SHAP (SHapley Additive exPlanations) analysis identified key drivers of surgical duration and overrun likelihood by quantifying how each feature contributes to predictions. Two models were developed:

  • Duration Regressor: Predicts actual surgery length in minutes
  • Overrun Classifier: Predicts likelihood of exceeding scheduled time
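For intuition about how SHAP attributes a prediction to features, the sketch below computes exact Shapley values by brute force for a tiny, invented duration model. It replaces "absent" features with a baseline value, which is a simplification of what the SHAP library does, and the coefficients are made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(present):
        # Features in `present` take their real value; the rest take the baseline.
        return predict([x[i] if i in present else baseline[i] for i in range(n)])

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        contributions.append(total)
    return contributions


def toy_duration(features):
    """Invented model: base 60 min, plus effects of inpatient status and complexity."""
    inpatient, complex_case = features
    return 60 + 40 * inpatient + 25 * complex_case + 10 * inpatient * complex_case


phi = shapley_values(toy_duration, x=[1, 1], baseline=[0, 0])
print("inpatient:", phi[0], "complex:", phi[1])
```

The contributions sum to the difference between the prediction and the baseline prediction (135 - 60 = 75 minutes here), which is the additivity property that makes mean absolute SHAP values interpretable in minutes.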

Dataset: 94,502 elective orthopaedic procedures (2010-2025) with 36 features including patient demographics, clinical comorbidities, procedure details, scheduling context, and staff identifiers.

Results & Impact

Predictive Performance

AutoGluon achieved the lowest MAE of all models compared, with minimal technical configuration:

Model                         MAE (minutes)   R²     Improvement
AutoGluon (1 hour)            15.70           0.77   26% vs XGBoost
AutoGluon (4 hours)           11.84           0.88   46% vs surgeon estimates
XGBoost                       16.02           0.77   -
Neural Network                17.67           -      -
Linear Regression             20.25           0.69   -
Surgeon Estimates (baseline)  ~59-70          -      -
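Improvement percentages of this kind are relative reductions in MAE. As a worked example using only values from the table, comparing the 4-hour AutoGluon MAE (11.84) with XGBoost's (16.02) gives roughly 26%. The function name is an assumption for this sketch.

```python
def relative_mae_reduction(baseline_mae: float, model_mae: float) -> float:
    """Percentage by which model_mae is lower than baseline_mae."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return 100 * (baseline_mae - model_mae) / baseline_mae


xgboost_mae = 16.02
autogluon_4h_mae = 11.84

reduction = relative_mae_reduction(xgboost_mae, autogluon_4h_mae)
print(f"AutoGluon (4 hours) vs XGBoost: {reduction:.1f}% lower MAE")
```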

Key Feature Drivers

SHAP analysis revealed which factors most strongly influence surgery duration and overrun risk:

Duration Predictors

  1. Intended Management (inpatient vs day case)
    Mean SHAP: 17.8 minutes
  2. Procedure Code
    Mean SHAP: 15.8 minutes
  3. Anaesthetic Type
    General vs regional

Overrun Risk Drivers

  1. Procedure Code
    Mean SHAP: 9.9 percentage points
  2. Year
    Temporal scheduling changes
  3. Theatre Location
    Equipment and setup differences

Operational Insights

Key Finding: Intended management (inpatient status) strongly predicts duration but has minimal impact on overrun likelihood. This suggests schedulers already effectively account for inpatient status when allocating theatre time, but struggle to accommodate procedure-specific duration variability.

Actionable recommendations for OR scheduling:

  • Adjust buffer times by procedure code rather than uniformly by inpatient status
  • Batch similar procedure types together to reduce schedule uncertainty
  • Focus on anaesthetic type as a modifiable factor in scheduling optimization
  • Monitor temporal trends (year effects) for systematic changes in OR efficiency
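One possible way to act on the first recommendation is to derive a buffer per procedure code from historical gaps between scheduled and actual duration. The sketch below is an assumption about how this could be done, not a method from the study: the 80th-percentile rule, the minimum case count, and the data are all hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def buffer_by_procedure(cases, percentile=80, min_cases=5):
    """Per-procedure buffer in minutes.

    cases: iterable of (procedure_code, scheduled_minutes, actual_minutes).
    The buffer is the chosen percentile of (actual - scheduled), floored at zero.
    Codes with fewer than `min_cases` records are skipped.
    """
    gaps = defaultdict(list)
    for code, scheduled, actual in cases:
        gaps[code].append(actual - scheduled)

    buffers = {}
    for code, values in gaps.items():
        if len(values) < min_cases:
            continue
        cut_points = quantiles(values, n=100, method="inclusive")
        buffers[code] = max(0.0, cut_points[percentile - 1])
    return buffers


# Hypothetical history: code "A" has five cases, code "B" only two.
history = [("A", 60, 60), ("A", 60, 70), ("A", 60, 80), ("A", 60, 90),
           ("A", 60, 100), ("B", 45, 50), ("B", 45, 40)]
buffers = buffer_by_procedure(history)
print(buffers)
```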

Real-World Impact

What This Means for NHS Trusts

  • Reduced overtime: More accurate predictions = fewer late-running lists
  • Better capacity utilisation: Minimise idle time between cases
  • Data-driven scheduling: Replace subjective estimates with evidence-based predictions
  • Accessible implementation: AutoML requires minimal ML expertise, enabling rapid deployment across trusts
  • Cost savings: Even small improvements in prediction accuracy translate to significant financial impact at scale

Technical Implementation

Data Preprocessing Pipeline

To ensure fair model comparison, all models received identically preprocessed data:

  • Cleaning: Removed features with >50% missingness, handled procedure length outliers, imputed missing procedure codes (16.56% of cases) using domain knowledge
  • Encoding: Out-of-fold target encoding for high-cardinality features (prevents data leakage), one-hot encoding for low-cardinality features (<20 unique values)
  • Train/Val/Test Split: 64% / 16% / 20%
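Out-of-fold target encoding, mentioned in the encoding step above, replaces a category with the mean target of that category computed only on rows from other folds, so a row's own target never leaks into its encoded feature. The following is a minimal pure-Python sketch under that definition; the fold count, the fallback rule, and the example data are assumptions rather than the study's settings.

```python
import random
from collections import defaultdict


def oof_target_encode(categories, targets, n_folds=5, seed=0):
    """Encode each row with its category's mean target from the other folds only."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    fold_of = [0] * n
    for position, row in enumerate(order):
        fold_of[row] = position % n_folds

    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        for row in range(n):
            if fold_of[row] != fold:
                sums[categories[row]] += targets[row]
                counts[categories[row]] += 1
                total += targets[row]
                seen += 1
        fallback = total / seen  # mean target of the other folds
        for row in range(n):
            if fold_of[row] == fold:
                category = categories[row]
                if counts[category]:
                    encoded[row] = sums[category] / counts[category]
                else:
                    # Category not seen in the other folds: use their overall mean.
                    encoded[row] = fallback
    return encoded


# Made-up procedure codes and durations in minutes.
codes = ["hip"] * 10 + ["knee"] * 10 + ["rare"]
minutes = [100] * 10 + [60] * 10 + [500]
encoded = oof_target_encode(codes, minutes)
print(encoded[0], encoded[10], encoded[20])
```

The single "rare" case is encoded with the mean of the other rows rather than its own 500-minute duration, which is the leakage protection the technique is meant to provide.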

Hyperparameter Optimization

All baseline models (linear regression, XGBoost, neural network) were tuned with Optuna under identical computational budgets to ensure a fair comparison. The following AutoGluon variants were tested:

  • AutoGluon-raw: Minimal preprocessing (MAE: 15.38)
  • AutoGluon-clean: Basic cleaning (MAE: 15.40)
  • AutoGluon-processed: Full preprocessing (MAE: 15.70)
  • AutoGluon-full: Extended 4-hour training (MAE: 11.84)

Finding: AutoGluon performed robustly regardless of preprocessing quality: MAE differed by only 0.32 minutes between raw (15.38) and fully processed (15.70) data. Extending training to 4 hours lowered MAE to 11.84, which suggests that computational budget drives performance gains more than preprocessing refinements.
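The Optuna tuning of the baseline models described above could be set up along the lines of the sketch below, with a wall-clock timeout acting as the shared computational budget. The search space, parameter ranges, and the `evaluate_mae` callback are hypothetical; the study's actual ranges are not given here.

```python
def suggest_xgboost_params(trial):
    """Hypothetical XGBoost search space; ranges are illustrative only."""
    return {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }


def tune(evaluate_mae, budget_seconds):
    """Minimise validation MAE within a fixed wall-clock budget.

    evaluate_mae: user-supplied function mapping a parameter dict to validation MAE.
    """
    # Imported lazily so this module loads even where Optuna is not installed.
    import optuna

    study = optuna.create_study(direction="minimize")
    study.optimize(
        lambda trial: evaluate_mae(suggest_xgboost_params(trial)),
        timeout=budget_seconds,
    )
    return study.best_params
```

Passing the same `budget_seconds` to every baseline is one way to keep the comparison with a time-limited AutoGluon run even-handed.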

SHAP Analysis Methodology

SHAP TreeExplainer was applied to XGBoost models to generate feature importance values. Mean absolute SHAP values quantify each feature's average effect on predictions, with AutoGluon's permutation feature importance providing independent validation.
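A minimal sketch of this step is shown below, assuming a fitted XGBoost regressor and a feature matrix `X`. The helper names are invented for the example, and the small SHAP matrix at the end is made up to show the ranking calculation.

```python
import numpy as np


def explain_xgboost(model, X):
    """Return per-row, per-feature SHAP values for a fitted tree model."""
    # Imported lazily so this module loads even where shap is not installed.
    import shap

    explainer = shap.TreeExplainer(model)
    return explainer.shap_values(X)


def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean |SHAP| across rows, largest first."""
    importance = np.abs(np.asarray(shap_values)).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]


# Made-up SHAP matrix (2 cases x 2 features), in minutes.
example_shap = [[10.0, -2.0], [-20.0, 4.0]]
ranking = mean_abs_shap(example_shap, ["procedure_code", "theatre"])
print(ranking)
```

Taking the absolute value before averaging matters: a feature that adds 20 minutes for some cases and removes 20 for others would otherwise appear to have no effect.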

Code & Data Availability: The full implementation, including data preprocessing, model training, and SHAP analysis, is available on GitHub.

Conclusion & Future Work

This work demonstrates that AutoML frameworks can deliver state-of-the-art predictive performance with minimal technical expertise, addressing a critical barrier to ML adoption in healthcare operations. AutoGluon's 46% improvement over surgeon estimates, combined with actionable SHAP-based insights, provides a credible pathway for NHS trusts to implement evidence-based scheduling optimization.

Limitations

  • Limited to elective orthopaedic procedures at a single NHS Trust
  • Dataset lacked granular chronological data and staff experience variables
  • Prospective validation in live theatre scheduling required to evaluate real-world impact

Next Steps

  • Uncertainty quantification: Provide confidence intervals for duration predictions to enable risk-informed scheduling decisions
  • Multi-specialty expansion: Evaluate performance across surgical specialties
  • Real-time deployment: Integrate into theatre management systems for prospective evaluation
  • Temporal drift monitoring: Track feature importance changes over time to detect shifts in operational patterns
