Surgery Duration Prediction with AutoML
Leveraging automated machine learning to reduce NHS operating room inefficiencies and improve theatre utilisation
The Problem
Operating room inefficiencies cost the NHS over £400 million annually. With OR operational costs ranging from £14-£30 per minute, accurate surgery duration prediction is critical for reducing overruns, avoiding idle time, and maximising theatre utilisation.
Key Challenges
- 7.4 million patients on NHS waiting lists (March 2025)
- Traditional methods are inaccurate:
  - Surgeon estimates: 59-70 minute mean absolute error (MAE)
  - Historical averages: 31-38 minute MAE
- Consequences of poor estimation: Late starts, last-minute cancellations, staff overtime, wasted capacity
- Limited technical resources: Many trusts lack ML expertise for implementing advanced predictive models
"Surgeon estimates incorporate real-time insights but are inherently subjective, prone to cognitive biases, and impose unnecessary cognitive load on clinical staff who should focus on patient care."
The Solution
This study evaluated AutoGluon, an automated machine learning (AutoML) framework, for surgery duration prediction using a retrospective dataset of 94,502 elective orthopaedic procedures from East Kent Hospitals University NHS Foundation Trust (2010-2025).
AutoGluon: Automated Machine Learning
AutoGluon automates the entire ML pipeline—algorithm selection, hyperparameter optimization, feature engineering, and model validation—enabling sophisticated predictive modeling without requiring extensive ML expertise.
Multi-Layer Stacking Architecture
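AutoGluon combines multi-layer stacking with multi-fold bagging. Each stacking layer is trained on out-of-fold predictions from the layer below, so additional layers can add predictive power without any model being fitted to predictions made on its own training rows.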
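As a rough illustration of how stacking and bagging are switched on, the sketch below configures AutoGluon's `TabularPredictor`. The label column name `procedure_minutes`, the fold and layer counts, and the `best_quality` preset are assumptions for illustration, not the study's reported settings.

```python
# Hypothetical settings; the study's actual configuration is not stated here.
LABEL = "procedure_minutes"  # assumed name of the target column

FIT_KWARGS = {
    "presets": "best_quality",  # preset that enables bagging and stacking
    "time_limit": 60 * 60,      # training budget in seconds (1 hour)
    "num_bag_folds": 8,         # k-fold bagging within each layer
    "num_stack_levels": 1,      # stacking layers on top of the base layer
}


def train_duration_model(train_df, fit_kwargs=FIT_KWARGS):
    """Fit a stacked, bagged regressor on a pandas DataFrame that contains LABEL."""
    # Imported lazily so the configuration can be inspected without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(
        label=LABEL,
        problem_type="regression",
        eval_metric="mean_absolute_error",
    )
    return predictor.fit(train_df, **fit_kwargs)
```

Raising `time_limit` is the simplest lever here: it is the only difference between a 1-hour and a 4-hour run in a setup like this one.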
Model Comparison Methodology
AutoGluon's performance was rigorously compared against three baseline models under identical preprocessing and computational budgets:
- Linear Regression - Simple baseline
- XGBoost - State-of-the-art gradient boosting (current gold standard)
- Feed-Forward Neural Network - Deep learning approach
Feature Analysis with SHAP
SHAP (SHapley Additive exPlanations) analysis identified key drivers of surgical duration and overrun likelihood by quantifying how each feature contributes to predictions. Two models were developed:
- Duration Regressor: Predicts actual surgery length in minutes
- Overrun Classifier: Predicts likelihood of exceeding scheduled time
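The two targets can be derived from the same case record. The sketch below is a minimal illustration; the field names `scheduled_minutes` and `actual_minutes` and the zero-minute tolerance are assumptions, since the exact overrun definition is not given here.

```python
from dataclasses import dataclass


@dataclass
class Case:
    scheduled_minutes: float  # time allocated on the theatre list (assumed field)
    actual_minutes: float     # recorded procedure length (assumed field)


def regression_target(case: Case) -> float:
    """Target for the duration regressor: actual length in minutes."""
    return case.actual_minutes


def overrun_label(case: Case, tolerance_minutes: float = 0.0) -> int:
    """Target for the overrun classifier: 1 if the case ran past its slot."""
    return int(case.actual_minutes > case.scheduled_minutes + tolerance_minutes)


# Toy cases, not study data.
cases = [Case(90, 105), Case(60, 55), Case(120, 120)]
durations = [regression_target(c) for c in cases]
overruns = [overrun_label(c) for c in cases]
print(durations)
print(overruns)
```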
Dataset: 94,502 elective orthopaedic procedures (2010-2025) with 36 features including patient demographics, clinical comorbidities, procedure details, scheduling context, and staff identifiers.
Results & Impact
Predictive Performance
AutoGluon achieved state-of-the-art performance with minimal technical configuration:
| Model | MAE (minutes) | R² | Improvement |
|---|---|---|---|
| AutoGluon (1 hour) | 15.70 | 0.77 | 2% vs XGBoost |
| AutoGluon (4 hours) | 11.84 | 0.88 | 26% vs XGBoost; 46% vs surgeon estimates |
| XGBoost | 16.02 | 0.77 | — |
| Neural Network | 17.67 | — | — |
| Linear Regression | 20.25 | 0.69 | — |
| Surgeon Estimates (baseline) | ~59-70 | — | — |
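For readers who want to reproduce the metrics, the sketch below shows how MAE, R², and a relative MAE improvement are computed. The toy durations are invented for illustration; only the MAE figures passed to `relative_improvement` come from the table above.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted minutes."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of variance in the actual durations explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_improvement(baseline_mae, model_mae):
    """Fractional reduction in MAE relative to a baseline."""
    return (baseline_mae - model_mae) / baseline_mae


# Toy durations in minutes, not study data.
actual = [60, 90, 120, 45]
predicted = [70, 80, 125, 45]
print(mean_absolute_error(actual, predicted))
print(round(r_squared(actual, predicted), 3))

# MAE values taken from the results table.
XGBOOST_MAE = 16.02
for name, mae in [("AutoGluon (1 hour)", 15.70), ("AutoGluon (4 hours)", 11.84)]:
    print(name, f"{relative_improvement(XGBOOST_MAE, mae):.1%} lower MAE than XGBoost")
```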
Key Feature Drivers
SHAP analysis revealed which factors most strongly influence surgery duration and overrun risk:
Duration Predictors
- Intended Management (inpatient vs day case): mean SHAP 17.8 minutes
- Procedure Code: mean SHAP 15.8 minutes
- Anaesthetic Type: general vs regional
Overrun Risk Drivers
- Procedure Code: mean SHAP 9.9 percentage points
- Year: temporal scheduling changes
- Theatre Location: equipment and setup differences
Operational Insights
Key Finding: Intended management (inpatient status) strongly predicts duration but has minimal impact on overrun likelihood. This suggests schedulers already effectively account for inpatient status when allocating theatre time, but struggle to accommodate procedure-specific duration variability.
Actionable recommendations for OR scheduling:
- Adjust buffer times by procedure code rather than uniformly by inpatient status
- Batch similar procedure types together to reduce schedule uncertainty
- Focus on anaesthetic type as a modifiable factor in scheduling optimization
- Monitor temporal trends (year effects) for systematic changes in OR efficiency
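One way the first recommendation could be put into practice is to size each procedure's buffer from its own history of prediction errors. The sketch below is an assumption about how that might look, not a method from the study: the 80th-percentile rule, the record layout, and the `HIP`/`KNEE` labels are all hypothetical.

```python
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def buffers_by_procedure(records, pct=80):
    """Buffer per procedure code from (code, predicted_minutes, actual_minutes) records.

    The buffer is the chosen percentile of (actual - predicted), floored at zero.
    """
    residuals = defaultdict(list)
    for code, predicted, actual in records:
        residuals[code].append(actual - predicted)
    return {
        code: max(0, nearest_rank_percentile(errors, pct))
        for code, errors in residuals.items()
    }


# Toy history with placeholder procedure labels, not study data.
history = [
    ("HIP", 90, 100), ("HIP", 90, 85), ("HIP", 90, 110),
    ("HIP", 90, 95), ("HIP", 90, 120),
    ("KNEE", 60, 58), ("KNEE", 60, 62), ("KNEE", 60, 55),
]
buffers = buffers_by_procedure(history)
print(buffers)
```

A procedure with widely scattered errors receives a larger buffer than one whose durations are predictable, which is the behaviour the recommendation asks for.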
Real-World Impact
What This Means for NHS Trusts
- Reduced overtime: More accurate predictions = fewer late-running lists
- Better capacity utilization: Minimize idle time between cases
- Data-driven scheduling: Replace subjective estimates with evidence-based predictions
- Accessible implementation: AutoML requires minimal ML expertise, enabling rapid deployment across trusts
- Cost savings: Even small improvements in prediction accuracy translate to significant financial impact at scale
Technical Implementation
Data Preprocessing Pipeline
To ensure fair model comparison, all models received identically preprocessed data:
- Cleaning: Removed features with >50% missingness, handled procedure length outliers, imputed missing procedure codes (16.56% of cases) using domain knowledge
- Encoding: Out-of-fold target encoding for high-cardinality features (prevents data leakage), one-hot encoding for low-cardinality features (<20 unique values)
- Train/Val/Test Split: 64% / 16% / 20%
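The encoding and split steps can be sketched in plain Python. This is a minimal illustration under stated assumptions: the fold count, the fallback to the out-of-fold mean for unseen categories, and the two-stage 80/20 split (which yields 64% / 16% / 20%) are choices made for the example, not details confirmed by the study.

```python
import random
from collections import defaultdict


def out_of_fold_target_encode(categories, targets, n_folds=5, seed=0):
    """Replace each category with its mean target, computed only on other folds."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    fold_of = {row: pos % n_folds for pos, row in enumerate(order)}

    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        for row in range(n):
            if fold_of[row] != fold:  # statistics come from the other folds only
                sums[categories[row]] += targets[row]
                counts[categories[row]] += 1
        # Fallback for categories that never appear outside the current fold.
        prior = sum(sums.values()) / max(sum(counts.values()), 1)
        for row in range(n):
            if fold_of[row] == fold:
                cat = categories[row]
                encoded[row] = sums[cat] / counts[cat] if counts[cat] else prior
    return encoded


def split_sizes(n_rows, test_frac=0.20, val_frac_of_rest=0.20):
    """Row counts for (train, validation, test) from a two-stage split."""
    n_test = round(n_rows * test_frac)
    n_val = round((n_rows - n_test) * val_frac_of_rest)
    return n_rows - n_test - n_val, n_val, n_test


# Toy data, not study data.
codes = ["A", "A", "A", "B", "B", "B"]
minutes = [60, 70, 80, 120, 130, 140]
print(out_of_fold_target_encode(codes, minutes, n_folds=3))
print(split_sizes(1000))
```

Because a row's own duration never enters its encoded value, the encoder cannot leak the target into the feature, which is the leakage the out-of-fold scheme is meant to prevent.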
Hyperparameter Optimization
All baseline models (LR, XGBoost, NN) used Optuna with identical computational budgets to ensure fair comparison. AutoGluon variants tested:
- AutoGluon-raw: Minimal preprocessing (MAE: 15.38)
- AutoGluon-clean: Basic cleaning (MAE: 15.40)
- AutoGluon-processed: Full preprocessing (MAE: 15.70)
- AutoGluon-full: Extended 4-hour training (MAE: 11.84)
Finding: AutoGluon performed robustly regardless of preprocessing quality. MAE varied by only 0.32 minutes (15.38 to 15.70) across the three preprocessing levels, whereas extending training to four hours lowered MAE to 11.84. This suggests computational resources drive performance gains more than preprocessing refinements.
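A time-boxed Optuna search is one way to give every baseline the same budget. The sketch below is illustrative only: the XGBoost search ranges, the one-hour budget, and the caller-supplied `evaluate_mae` function are assumptions, not the study's configuration.

```python
BUDGET_SECONDS = 60 * 60  # assumed tuning budget shared by every baseline


def suggest_xgb_params(trial):
    """Hypothetical XGBoost search space; the study's ranges are not given here."""
    return {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }


def tune(evaluate_mae, budget_seconds=BUDGET_SECONDS):
    """Minimise validation MAE within a wall-clock budget.

    evaluate_mae(params) -> float is supplied by the caller: it trains a model
    with `params` and returns the MAE on the validation split.
    """
    import optuna  # imported lazily so the search space can be inspected without Optuna

    study = optuna.create_study(direction="minimize")
    study.optimize(
        lambda trial: evaluate_mae(suggest_xgb_params(trial)),
        timeout=budget_seconds,  # stop starting new trials once the budget is spent
    )
    return study.best_params, study.best_value
```

Using `timeout` rather than a fixed trial count keeps the comparison fair across models whose individual trials take very different amounts of time.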
SHAP Analysis Methodology
SHAP TreeExplainer was applied to XGBoost models to generate feature importance values. Mean absolute SHAP values quantify each feature's average effect on predictions, with AutoGluon's permutation feature importance providing independent validation.
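The aggregation step can be sketched as follows. `mean_abs_shap` is a hypothetical helper that ranks features by mean absolute SHAP value; `tree_shap_values` shows how the per-case values might be obtained with the `shap` library for a fitted XGBoost model. The feature names and numbers in the example are invented and are not study results.

```python
def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean |SHAP|.

    shap_values holds one row per case and one column per feature, in the
    units of the model output (minutes for the duration regressor).
    """
    n_rows = len(shap_values)
    scores = {
        name: sum(abs(row[j]) for row in shap_values) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def tree_shap_values(xgb_model, X):
    """Per-case SHAP values for a fitted tree model (requires the shap package)."""
    import shap  # imported lazily so the aggregation above runs without shap installed

    return shap.TreeExplainer(xgb_model).shap_values(X)


# Toy SHAP matrix with placeholder feature names, not study results.
features = ["feature_a", "feature_b", "feature_c"]
toy_values = [
    [20.0, -10.0, 2.0],
    [-15.0, 18.0, -4.0],
    [18.0, -14.0, 3.0],
]
ranking = mean_abs_shap(toy_values, features)
for name, score in ranking:
    print(f"{name}: {score:.1f}")
```

Taking the absolute value before averaging matters: a feature that pushes some predictions up and others down would otherwise cancel to near zero despite having a large effect.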
Code & Data Availability: The full implementation, including data preprocessing, model training, and SHAP analysis, is available on GitHub.
Conclusion & Future Work
This work demonstrates that AutoML frameworks can deliver state-of-the-art predictive performance with minimal technical expertise, addressing a critical barrier to ML adoption in healthcare operations. AutoGluon's 46% improvement over surgeon estimates, combined with actionable SHAP-based insights, provides a credible pathway for NHS trusts to implement evidence-based scheduling optimization.
Limitations
- Limited to elective orthopaedic procedures at a single NHS Trust
- Dataset lacked granular chronological data and staff experience variables
- Prospective validation in live theatre scheduling required to evaluate real-world impact
Next Steps
- Uncertainty quantification: Provide confidence intervals for duration predictions to enable risk-informed scheduling decisions
- Multi-specialty expansion: Evaluate performance across surgical specialties
- Real-time deployment: Integrate into theatre management systems for prospective evaluation
- Temporal drift monitoring: Track feature importance changes over time to detect shifts in operational patterns