Surgery Duration Prediction with AutoML

Leveraging automated machine learning to reduce NHS operating room inefficiencies and improve theatre utilisation

Tags: Python, AutoGluon, XGBoost, SHAP, Healthcare, AutoML, NHS

The Problem

Operating room inefficiencies cost the NHS over £400 million annually. With OR operational costs ranging from £14-£30 per minute, accurate surgery duration prediction is critical for reducing overruns, avoiding idle time, and maximising theatre utilisation.

Key Challenges

7.4 million patients on NHS waiting lists (March 2025)
Traditional methods are inaccurate:
- Surgeon estimates: 59-70 minute mean absolute error (MAE)
- Historical averages: 31-38 minute MAE
Consequences of poor estimation: Late starts, last-minute cancellations, staff overtime, wasted capacity
Limited technical resources: Many trusts lack ML expertise for implementing advanced predictive models

"Surgeon estimates incorporate real-time insights but are inherently subjective, prone to cognitive biases, and impose unnecessary cognitive load on clinical staff who should focus on patient care."

The Solution

This study evaluated AutoGluon, an automated machine learning (AutoML) framework, for surgery duration prediction using a retrospective dataset of 94,502 elective orthopaedic procedures from East Kent Hospitals University NHS Foundation Trust (2010-2025).

AutoGluon: Automated Machine Learning

AutoGluon automates the entire ML pipeline (algorithm selection, hyperparameter optimisation, feature engineering, and model validation), enabling sophisticated predictive modelling without extensive ML expertise.
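
As a rough illustration of how little configuration this involves, the sketch below fits an AutoGluon regressor under a fixed time budget. The target column name `duration_minutes`, the `best_quality` preset, and the helper name `fit_duration_regressor` are assumptions made for this example, not the study's actual configuration.

```python
# Sketch only: column name, preset and helper name are illustrative assumptions.
TIME_BUDGETS_S = {"1 hour": 60 * 60, "4 hours": 4 * 60 * 60}


def fit_duration_regressor(train_df, label="duration_minutes", budget="1 hour"):
    """Fit an AutoGluon regressor on a pandas DataFrame within a time budget."""
    # Imported lazily so this module loads even where AutoGluon is not installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(
        label=label,                        # hypothetical target column
        problem_type="regression",
        eval_metric="mean_absolute_error",  # MAE, the metric reported in this study
    )
    # time_limit is in seconds; AutoGluon selects, trains and ensembles models
    # itself within that limit.
    predictor.fit(
        train_data=train_df,
        time_limit=TIME_BUDGETS_S[budget],
        presets="best_quality",
    )
    return predictor
```

A fitted predictor can then be used with `predictor.predict(test_df)`, and `predictor.leaderboard(test_df)` lists the individual models it trained.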

Multi-Layer Stacking Architecture

[Figure: AutoGluon multi-layer stacking with bagging]

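AutoGluon combines multi-layer stacking with k-fold bagging. Each stacking layer is trained on the out-of-fold predictions of the layer below, alongside the original features, so no model learns from predictions made on its own training rows. This lets each added layer contribute incremental predictive gains without leaking the target between layers.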
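
A toy sketch of how stacking with k-fold bagging keeps layers independent: each base learner's prediction for a row comes from a model fitted without that row, and those out-of-fold predictions become extra inputs to the next layer. The two base learners, the `procedure` field, and the index-based fold assignment are illustrative assumptions, not AutoGluon's internals.

```python
from statistics import mean


def out_of_fold_predictions(rows, targets, fit, k=2):
    """Return one prediction per row, each made by a model that never saw that row."""
    preds = [None] * len(rows)
    for fold in range(k):
        held_out = [i for i in range(len(rows)) if i % k == fold]
        kept = [i for i in range(len(rows)) if i % k != fold]
        model = fit([rows[i] for i in kept], [targets[i] for i in kept])
        for i in held_out:
            preds[i] = model(rows[i])
    return preds


def fit_global_mean(rows, targets):
    """Toy base learner: always predicts the mean training duration."""
    avg = mean(targets)
    return lambda row: avg


def fit_mean_by_procedure(rows, targets):
    """Toy base learner: mean duration per procedure, global mean if unseen."""
    overall = mean(targets)
    by_code = {}
    for row, target in zip(rows, targets):
        by_code.setdefault(row["procedure"], []).append(target)
    means = {code: mean(values) for code, values in by_code.items()}
    return lambda row: means.get(row["procedure"], overall)


def stack_features(rows, targets, base_fits, k=2):
    """Build layer-2 inputs: the original row plus each base learner's OOF prediction."""
    columns = [out_of_fold_predictions(rows, targets, fit, k) for fit in base_fits]
    return [
        {**row, **{f"base_{j}": col[i] for j, col in enumerate(columns)}}
        for i, row in enumerate(rows)
    ]


# Four made-up cases: two procedure types, durations in minutes.
rows = [{"procedure": "A"}, {"procedure": "A"}, {"procedure": "B"}, {"procedure": "B"}]
targets = [60, 80, 120, 100]
features = stack_features(rows, targets, [fit_global_mean, fit_mean_by_procedure], k=2)
```

In this example the first case (60 minutes, procedure A) receives a per-procedure prediction of 80, taken from the other procedure A case, so its own duration never feeds its layer-2 input.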

Model Comparison Methodology

AutoGluon's performance was rigorously compared against three baseline models under identical preprocessing and computational budgets:

Linear Regression - Simple baseline
XGBoost - State-of-the-art gradient boosting (current gold standard)
Feed-Forward Neural Network - Deep learning approach

Feature Analysis with SHAP

SHAP (SHapley Additive exPlanations) analysis identified key drivers of surgical duration and overrun likelihood by quantifying how each feature contributes to predictions. Two models were developed:

Duration Regressor: Predicts actual surgery length in minutes
Overrun Classifier: Predicts likelihood of exceeding scheduled time
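
The two targets can be derived from the same pair of durations. The sketch below assumes hypothetical field names (`actual_minutes`, `scheduled_minutes`) and assumes an overrun means strictly exceeding the scheduled time with no tolerance; the study's exact definition may differ.

```python
def build_targets(cases):
    """Derive both modelling targets from actual and scheduled durations (minutes)."""
    targets = []
    for case in cases:
        actual = case["actual_minutes"]        # hypothetical field name
        scheduled = case["scheduled_minutes"]  # hypothetical field name
        targets.append({
            "duration_minutes": actual,        # regression target
            "overrun": actual > scheduled,     # binary classification target
        })
    return targets


# Two made-up cases: one overruns by 5 minutes, one finishes 15 minutes early.
example = build_targets([
    {"actual_minutes": 95, "scheduled_minutes": 90},
    {"actual_minutes": 60, "scheduled_minutes": 75},
])
```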

Dataset: 94,502 elective orthopaedic procedures (2010-2025) with 36 features including patient demographics, clinical comorbidities, procedure details, scheduling context, and staff identifiers.

Results & Impact

Predictive Performance

AutoGluon achieved state-of-the-art performance with minimal technical configuration:

Model	MAE (minutes)	R²	Improvement
AutoGluon (1 hour)	15.70	0.77	2% vs XGBoost
AutoGluon (4 hours)	11.84	0.88	26% vs XGBoost; 46% vs surgeon estimates
XGBoost	16.02	0.77	—
Neural Network	17.67	—	—
Linear Regression	20.25	0.69	—
Surgeon Estimates (baseline)	~59-70	—	—
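
For reference, the metrics in this table can be computed as follows. This is a minimal sketch using the standard definitions of MAE and R², with relative improvement taken as the fractional reduction in MAE against a baseline; the function names are chosen for this example.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted durations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of variance in actual durations explained by the predictions."""
    avg = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def relative_improvement(baseline_mae, model_mae):
    """Fractional reduction in MAE relative to a baseline model."""
    return (baseline_mae - model_mae) / baseline_mae


# MAE values (minutes) taken from the results table.
xgboost_mae, autogluon_1h_mae, autogluon_4h_mae = 16.02, 15.70, 11.84
gain_1h = relative_improvement(xgboost_mae, autogluon_1h_mae)
gain_4h = relative_improvement(xgboost_mae, autogluon_4h_mae)
```

With these figures, `gain_4h` is about 0.26 and `gain_1h` about 0.02.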

Key Feature Drivers

SHAP analysis revealed which factors most strongly influence surgery duration and overrun risk:

Duration Predictors

Intended Management (inpatient vs day case): mean SHAP 17.8 minutes
Procedure Code: mean SHAP 15.8 minutes
Anaesthetic Type: general vs regional

Overrun Risk Drivers

Procedure Code: mean SHAP 9.9 percentage points
Year: temporal scheduling changes
Theatre Location: equipment and setup differences

Operational Insights

Key Finding: Intended management (inpatient status) strongly predicts duration but has minimal impact on overrun likelihood. This suggests schedulers already effectively account for inpatient status when allocating theatre time, but struggle to accommodate procedure-specific duration variability.

Actionable recommendations for OR scheduling:

Adjust buffer times by procedure code rather than uniformly by inpatient status
Batch similar procedure types together to reduce schedule uncertainty
Focus on anaesthetic type as a modifiable factor in scheduling optimization
Monitor temporal trends (year effects) for systematic changes in OR efficiency
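
One possible way to act on the first recommendation is to size each procedure's buffer from its own history of prediction errors. The sketch below is an assumption for illustration, not a method from the study: it takes the 80th percentile (nearest-rank) of past residuals per procedure code, and the field names and codes are hypothetical.

```python
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100) in integer arithmetic
    return ordered[max(rank, 1) - 1]


def buffers_by_procedure(cases, pct=80):
    """Map each procedure code to a buffer (minutes) covering pct% of past residuals."""
    residuals = defaultdict(list)
    for case in cases:
        # Positive residual: the case ran longer than predicted.
        residual = case["actual_minutes"] - case["predicted_minutes"]
        residuals[case["procedure_code"]].append(residual)
    # A procedure that usually finishes early gets no buffer rather than a negative one.
    return {
        code: max(0, nearest_rank_percentile(values, pct))
        for code, values in residuals.items()
    }


# Made-up history: PROC_A is highly variable, PROC_B tends to finish early.
history = (
    [{"procedure_code": "PROC_A", "predicted_minutes": 90, "actual_minutes": 90 + r}
     for r in (-5, 0, 10, 20, 30)]
    + [{"procedure_code": "PROC_B", "predicted_minutes": 45, "actual_minutes": 45 + r}
       for r in (-10, -5)]
)
buffers = buffers_by_procedure(history)
```

Here the variable procedure receives a 20-minute buffer while the predictable one receives none, instead of both sharing a uniform allowance.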

Real-World Impact

What This Means for NHS Trusts

Reduced overtime: More accurate predictions mean fewer late-running lists
Better capacity utilisation: Minimise idle time between cases
Data-driven scheduling: Replace subjective estimates with evidence-based predictions
Accessible implementation: AutoML requires minimal ML expertise, enabling rapid deployment across trusts
Cost savings: Even small improvements in prediction accuracy translate to significant financial impact at scale

Technical Implementation

Data Preprocessing Pipeline

To ensure fair model comparison, all models received identically preprocessed data:

Cleaning: Removed features with >50% missingness, handled procedure length outliers, imputed missing procedure codes (16.56% of cases) using domain knowledge
Encoding: Out-of-fold target encoding for high-cardinality features (prevents data leakage), one-hot encoding for low-cardinality features (<20 unique values)
Train/Val/Test Split: 64% / 16% / 20%
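
The out-of-fold target encoding step can be sketched as follows: each row's category is replaced by the mean target computed only from rows in the other folds, so a row's own duration never leaks into its encoded feature. The index-based fold assignment and the global-mean fallback for unseen categories are simplifying assumptions; a real pipeline would typically shuffle rows before assigning folds.

```python
from collections import defaultdict


def oof_target_encode(categories, targets, k=5):
    """Encode each category as the mean target computed on the other folds only (k >= 2)."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(k):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        for i in range(n):
            if i % k != fold:  # collect statistics from the other folds
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        fallback = total / seen  # global mean, used for categories unseen in these folds
        for i in range(n):
            if i % k == fold:  # encode only the held-out rows
                cat = categories[i]
                encoded[i] = sums[cat] / counts[cat] if counts[cat] else fallback
    return encoded


# Made-up example: procedure codes and durations in minutes.
codes = ["A", "A", "B", "B", "C"]
durations = [60, 80, 120, 100, 90]
encoded_codes = oof_target_encode(codes, durations, k=2)
```

The single "C" case has no other "C" rows to learn from, so it falls back to the mean of the rows in the other fold.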

Hyperparameter Optimization

All baseline models (LR, XGBoost, NN) used Optuna with identical computational budgets to ensure fair comparison. AutoGluon variants tested:

AutoGluon-raw: Minimal preprocessing (MAE: 15.38)
AutoGluon-clean: Basic cleaning (MAE: 15.40)
AutoGluon-processed: Full preprocessing (MAE: 15.70)
AutoGluon-full: Extended 4-hour training (MAE: 11.84)

Finding: AutoGluon performed consistently regardless of preprocessing: MAE differed by only 0.32 minutes between raw (15.38) and fully processed (15.70) data, whereas extending training to 4 hours reduced MAE to 11.84. This suggests that compute budget drives performance gains more than preprocessing refinements.
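
For the baseline tuning described in this section, a shared compute budget can be enforced through Optuna's wall-clock `timeout`. The sketch below is illustrative: the search ranges, the parameter names, and the `train_and_score` callback (which would train a model and return its validation MAE) are assumptions, not the study's configuration.

```python
def make_objective(train_and_score):
    """Wrap a scoring function as an Optuna objective returning validation MAE."""
    def objective(trial):
        params = {
            # Ranges are illustrative assumptions, not the study's search space.
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "max_depth": trial.suggest_int("max_depth", 3, 10),
        }
        return train_and_score(params)
    return objective


def tune(train_and_score, budget_seconds):
    """Run a study that stops after a wall-clock budget shared by every baseline."""
    import optuna  # imported lazily so the sketch loads without Optuna installed

    study = optuna.create_study(direction="minimize")  # lower MAE is better
    study.optimize(make_objective(train_and_score), timeout=budget_seconds)
    return study.best_params, study.best_value
```

Calling `tune` with the same `budget_seconds` for each baseline gives every model the same search time.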

SHAP Analysis Methodology

SHAP TreeExplainer was applied to XGBoost models to generate feature importance values. Mean absolute SHAP values quantify each feature's average effect on predictions, with AutoGluon's permutation feature importance providing independent validation.
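
A minimal sketch of this step, assuming a fitted XGBoost model and a feature matrix `X`: `shap.TreeExplainer` produces one SHAP value per case and feature, and averaging the absolute values per feature yields the ranking. The helper names and the example feature names are illustrative.

```python
def mean_absolute_shap(shap_rows, feature_names):
    """Average |SHAP| per feature over all cases, most influential first."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def explain_xgboost(model, X):
    """Compute SHAP values for a fitted XGBoost model (requires the shap package)."""
    import shap  # imported lazily so the sketch loads without shap installed

    explainer = shap.TreeExplainer(model)
    return explainer.shap_values(X)  # one row per case, one column per feature


# Made-up SHAP values (minutes) for two cases and two features.
ranking = mean_absolute_shap(
    [[10.0, -2.0], [-20.0, 4.0]],
    ["intended_management", "anaesthetic_type"],
)
```

Positive and negative contributions do not cancel because the absolute value is taken before averaging. For the independent check, AutoGluon's `TabularPredictor.feature_importance(data)` returns permutation-based importances.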

Code & Data Availability: The full implementation, including data preprocessing, model training, and SHAP analysis, is available on GitHub.

Conclusion & Future Work

This work demonstrates that AutoML frameworks can deliver state-of-the-art predictive performance with minimal technical expertise, addressing a critical barrier to ML adoption in healthcare operations. AutoGluon's 46% improvement over surgeon estimates, combined with actionable SHAP-based insights, provides a credible pathway for NHS trusts to implement evidence-based scheduling optimization.

Limitations

Limited to elective orthopaedic procedures at a single NHS Trust
Dataset lacked granular chronological data and staff experience variables
Prospective validation in live theatre scheduling required to evaluate real-world impact

Next Steps

Uncertainty quantification: Provide confidence intervals for duration predictions to enable risk-informed scheduling decisions
Multi-specialty expansion: Evaluate performance across surgical specialties
Real-time deployment: Integrate into theatre management systems for prospective evaluation
Temporal drift monitoring: Track feature importance changes over time to detect shifts in operational patterns
