Multimodal Deep Learning for Emergency Department Triage Acuity Prediction
Combining structured clinical data and free-text chief complaints to predict patient triage acuity levels
Project Overview
Emergency department triage is a critical process that determines the order and urgency with which patients receive care. The Emergency Severity Index (ESI) categorizes patients from Level 1 (most severe, life-threatening) to Level 5 (least severe). This project developed a multimodal deep learning system to predict triage acuity levels by combining traditional structured clinical data (vital signs, demographics) with free-text chief complaints processed through advanced natural language understanding.
Challenge: Traditional triage prediction models rely solely on structured data, missing the rich contextual information contained in patient chief complaints. This project aimed to build a multimodal system that effectively fuses numerical clinical features with semantic embeddings from text to improve prediction accuracy.
The Journey: From Data to Deployment
Phase 1: Understanding the Problem
My journey began with studying the MIMIC-IV-ED dataset from PhysioNet—a comprehensive collection of emergency department visits containing structured clinical measurements alongside free-text chief complaints. I quickly realized that vital signs alone tell only part of the story. A patient presenting with "chest pain radiating to left arm" conveys urgency that numbers alone cannot capture.
The Emergency Severity Index presents a challenging multi-class classification problem with inherent class imbalance (ESI Level 1 patients are thankfully rare). I needed an approach that could handle both modalities effectively while maintaining clinical interpretability.
Phase 2: Building the Text Understanding Pipeline
The first major technical decision was choosing a text encoder. After researching biomedical NLP models, I selected all-MPNet-base-v2 from sentence-transformers. While not specifically trained on medical text like BioBERT or ClinicalBERT, MPNet offers strong general semantic understanding and produces high-quality sentence embeddings (768 dimensions).
I implemented batch processing for embedding generation to handle the dataset efficiently. The preprocessing pipeline:
- Tokenizes chief complaint text with padding and truncation (max 512 tokens)
- Generates contextualized embeddings using MPNet's transformer architecture
- Applies mean pooling across token embeddings for sentence-level representation
- Caches embeddings to disk for faster subsequent training runs
This approach transformed unstructured text like "difficulty breathing, fever for 3 days" into dense vector representations capturing semantic meaning and clinical context.
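The pooling and caching steps can be sketched as follows. NumPy stands in for the actual MPNet forward pass here — the token embeddings are random placeholders and the cache file name is illustrative:

```python
import numpy as np
from pathlib import Path

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings into one sentence vector, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # sum over real tokens only
    counts = mask.sum(axis=1).clip(min=1e-9)          # avoid divide-by-zero
    return summed / counts

def embed_batch(token_embeddings, attention_mask, cache_path: Path) -> np.ndarray:
    """Return cached sentence embeddings if present, else pool and cache to disk."""
    if cache_path.exists():
        return np.load(cache_path)
    sentence_embeddings = mean_pool(token_embeddings, attention_mask)
    np.save(cache_path, sentence_embeddings)
    return sentence_embeddings

# Stand-in for MPNet output: 2 chief complaints, 6 tokens each, 768-dim embeddings
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 6, 768))
mask = np.array([[1, 1, 1, 0, 0, 0],   # 3 real tokens, 3 padding
                 [1, 1, 1, 1, 1, 1]])
emb = embed_batch(tokens, mask, Path("cc_embeddings.npy"))
print(emb.shape)  # (2, 768)
```

Caching to `.npy` means subsequent training runs skip the expensive transformer forward pass entirely.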
Phase 3: Architecting the Multimodal Network
The core innovation was designing a neural architecture that could effectively fuse text and structured features. I developed a dual-branch convolutional neural network:
Text Branch (Convolutional Processing)
- Conv1D Layer 1: 1 → 64 channels, kernel size 5, extracts local patterns from embeddings
- MaxPool: Reduces dimensionality, retains salient features
- Conv1D Layer 2: 64 → 32 channels, deeper feature extraction
- MaxPool: Further compression to 32 × 384 = 12,288 features
- Flatten: Prepares for fusion with structured features
Structured Data Branch (Fully Connected)
- Linear Layer 1: Input features → 16 neurons with ReLU activation
- Linear Layer 2: 16 → 16 neurons, further non-linear transformation
- Processes vital signs (temperature, heart rate, respiratory rate, O2 saturation, blood pressure, pain score)
- Incorporates demographic features (gender, arrival transport method)
Fusion & Classification
- Concatenation: Combines 12,288 text features + 16 structured features
- Final Classifier: Linear layer mapping to 5-class output (ESI 1-5)
- Training: Cross-entropy loss, Adam optimizer (lr=0.001), 5 epochs
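A minimal PyTorch sketch of this dual-branch design. The channel counts, kernel sizes, and layer widths follow the description above, but the padding choices and the structured-input width (nine features) are assumptions; the exact flattened size depends on those choices, so the classifier uses `LazyLinear` to infer the fused feature dimension on the first forward pass:

```python
import torch
import torch.nn as nn

class MultiCNNAcuity(nn.Module):
    """Dual-branch network: Conv1D over text embeddings + MLP over vitals/demographics."""
    def __init__(self, n_structured: int = 9, n_classes: int = 5):
        super().__init__()
        self.text_branch = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, padding=2),   # local patterns in the embedding
            nn.ReLU(),
            nn.MaxPool1d(2),                              # retain salient features
            nn.Conv1d(64, 32, kernel_size=5, padding=2),  # deeper feature extraction
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Flatten(),                                 # prepare for fusion
        )
        self.structured_branch = nn.Sequential(
            nn.Linear(n_structured, 16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
        )
        # Infers the concatenated feature size (text + 16 structured) automatically
        self.classifier = nn.LazyLinear(n_classes)

    def forward(self, text_emb: torch.Tensor, structured: torch.Tensor) -> torch.Tensor:
        t = self.text_branch(text_emb.unsqueeze(1))       # (batch, 768) -> flat conv features
        s = self.structured_branch(structured)            # (batch, 16)
        return self.classifier(torch.cat([t, s], dim=1))  # (batch, 5) class logits

model = MultiCNNAcuity()
logits = model(torch.randn(4, 768), torch.randn(4, 9))
print(logits.shape)  # torch.Size([4, 5])
```

The logits feed directly into `nn.CrossEntropyLoss`, which applies softmax internally.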
Phase 4: Docker and DevOps Infrastructure
A critical aspect of this project was ensuring reproducible deployment in a Trusted Research Environment (TRE). Working with sensitive healthcare data required containerization using Docker, which packages applications and their dependencies into isolated, portable containers. The TRE provided a secure Linux environment where the MIMIC-IV-ED dataset could be accessed without compromising patient confidentiality.
The DevOps application lifecycle began with creating Docker images on my local machine from requirements.txt and Dockerfile specifications. These images were compressed into .tar.gz format and transferred securely into the TRE via the Airlock system. Within the TRE, images were loaded into the local Docker image store and instantiated as running containers to execute Python scripts for exploratory data analysis, model training, and evaluation.
Command-line scripting (bash) was essential for managing working directories, loading images, and launching successive Docker container iterations (e.g., docker load -i acuity_pred_image.tar). CUDA tooling was equally important for accessing the TRE's GPU resources, both for initializing Docker containers with GPU access and for assigning computationally intensive tasks, such as chief complaint embedding generation, to the GPU during script execution.
Phase 5: Data Preprocessing Challenges
Real-world healthcare data is messy. I encountered several data quality issues that required careful handling:
- Pain Score Validation: Filtered records to ensure pain scores were numeric and within 0-10 range
- Acuity Bounds: Validated ESI scores fell within 1-5 (converting to 0-indexed for model training)
- Missing Data: Dropped incomplete records to maintain data integrity
- Feature Scaling: Applied StandardScaler to vital signs for numerical stability
- Categorical Encoding: One-hot encoded arrival transport and gender features
This preprocessing pipeline ensured clean, normalized inputs for both the text embeddings and structured features branches.
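These steps can be sketched with pandas and scikit-learn. The column names and toy records below are stand-ins for the actual MIMIC-IV-ED schema, not the real data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy records standing in for MIMIC-IV-ED triage rows (column names are assumptions)
df = pd.DataFrame({
    "temperature": [98.6, 101.2, 99.1, 97.9],
    "heartrate":   [72, 110, None, 88],
    "pain":        ["3", "11", "7", "0"],        # "11" is outside the 0-10 range
    "acuity":      [2, 1, 3, 5],
    "gender":      ["F", "M", "F", "M"],
    "arrival_transport": ["WALK IN", "AMBULANCE", "WALK IN", "WALK IN"],
})

df["pain"] = pd.to_numeric(df["pain"], errors="coerce")   # non-numeric -> NaN
df = df[df["pain"].between(0, 10)]                        # pain score validation
df = df[df["acuity"].between(1, 5)]                       # ESI bounds check
df = df.dropna().copy()                                   # drop incomplete records
df["label"] = df["acuity"].astype(int) - 1                # 0-indexed target for training

vitals = ["temperature", "heartrate", "pain"]
df[vitals] = StandardScaler().fit_transform(df[vitals])   # zero mean, unit variance
df = pd.get_dummies(df, columns=["gender", "arrival_transport"])  # one-hot encoding
print(len(df), sorted(df["label"].unique()))
```

Filtering before scaling matters: fitting StandardScaler on records that will later be dropped would skew the normalization statistics.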
Phase 6: Training and Evaluation
Training employed stratified train-test splitting (80/20) to maintain class balance across severity levels. I used batch training (batch size 32) with GPU acceleration where available. The model converged within 5 epochs, balancing computational efficiency with learning capacity.
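A condensed sketch of this training setup on synthetic data, with a single linear layer standing in for the dual-branch model so the loop structure is visible on its own:

```python
import torch
from torch import nn
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 200 samples, 768-dim embeddings, 9 structured features
X_text = torch.randn(200, 768)
X_struct = torch.randn(200, 9)
y = torch.randint(0, 5, (200,))

# Stratified 80/20 split preserves the ESI class mix in both partitions
idx_train, idx_test = train_test_split(
    list(range(200)), test_size=0.2, stratify=y.numpy(), random_state=42)

model = nn.Sequential(nn.Linear(768 + 9, 5))   # placeholder for the dual-branch model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):                                    # 5 epochs
    for start in range(0, len(idx_train), 32):            # batch size 32
        batch = idx_train[start:start + 32]
        x = torch.cat([X_text[batch], X_struct[batch]], dim=1)
        loss = criterion(model(x), y[batch])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In the real pipeline the tensors would be moved to the GPU with `.to(device)` when CUDA is available.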
For evaluation, I implemented comprehensive metrics:
- Per-Class ROC-AUC: Evaluated discriminative ability for each acuity level separately
- Micro-Average AUC: Overall model performance across all classes
- Precision, Recall, F1-Score: Class-wise performance metrics
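These metrics map directly onto scikit-learn calls. The sketch below uses synthetic probabilities in place of real model outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=300)                  # 0-indexed ESI labels
logits = rng.normal(size=(300, 5))
logits[np.arange(300), y_true] += 2.0                  # make the fake model informative
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# One-vs-rest binarization enables per-class and micro-averaged ROC-AUC
y_bin = label_binarize(y_true, classes=list(range(5)))
per_class_auc = {esi + 1: roc_auc_score(y_bin[:, esi], probs[:, esi])
                 for esi in range(5)}                  # keyed by 1-indexed ESI level
micro_auc = roc_auc_score(y_bin, probs, average="micro")
macro_f1 = f1_score(y_true, probs.argmax(axis=1), average="macro")
print(per_class_auc, micro_auc, macro_f1)
```

Reporting per-class AUC alongside the micro-average is what surfaces the ESI 2-3 weakness that an aggregate score alone would hide.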
Results & Performance
The project successfully progressed through three key DevOps milestones, each representing a major advancement in deployment capability within the TRE. The first Docker container enabled successful exploratory data analysis (EDA) of the MIMIC-IV-ED dataset. The second container was adapted for the ML pipeline and included critical GPU initialization in the TRE, enabling accelerated model training. The third and final container supported the Streamlit application, integrating live text-embedding generation with the trained model weights to deliver real-time predictions.
This iterative DevOps cycle worked seamlessly with the agile methodology employed throughout the project, allowing the team to adapt to challenges during weekly sprints. Completion of these milestones underpinned the group's progression, ultimately enabling training and evaluation of four classification models: two for acuity prediction and two for hospital admission disposition.
Key Findings
To compare and assess model performance, accuracy and F1 score metrics were prioritized. Accuracy is widely used both in research and in quantifying nurse triage performance, making it an appropriate benchmark. F1 score balances precision and recall, providing more informative evaluation under class imbalance conditions. The MultiCNN-Acuity model achieved an accuracy of 0.70 and F1 score of 0.68, improving upon emergency nurse accuracy (0.64) though not yet surpassing emergency physician performance (0.75).
- Excellent ESI Level 1 Detection: The model achieved 0.93 AUC for the most critical patients, crucial for patient safety
- Strong Overall Performance: 0.93 micro-average AUC demonstrates robust multi-class discrimination
- Moderate Performance on ESI 2-3: Mid-acuity levels showed slightly lower AUC (0.79-0.83), likely due to clinical overlap between adjacent severity levels
- Multimodal Advantage: Combining text and structured features outperformed either modality alone (based on ablation testing)
Clinical Impact: The model demonstrates particular strength in identifying the most critical patients (ESI Level 1, AUC 0.93) and least urgent cases (ESI Levels 4-5, AUC 0.91-0.94), which is clinically valuable for triage prioritization. The slightly lower performance on mid-level acuity reflects the inherent difficulty even human clinicians face in distinguishing between adjacent ESI levels.
Phase 7: Deployment with Streamlit
To make the model accessible for demonstration and potential clinical evaluation, I developed a Streamlit web application. The application's function was to mimic a clinician decision support tool and enable live prediction of acuity. This required integrating several critical components: (i) a custom Docker container with Streamlit dependencies; (ii) live text-embedding generation using the MPNet model; (iii) the complete ML pipeline including data preprocessing and trained model weights; and (iv) thoughtful user interface and experience (UI/UX) design considerations.
From a UI/UX perspective, the interface was designed to balance simplicity, usability, and safety. Recognizing that such an application could potentially be regulated as a medical device due to its ability to guide clinical decisions, several safety features were incorporated. The app provides an intuitive interface for clinicians to:
- Input patient vital signs and demographics
- Enter free-text chief complaint
- Receive real-time acuity predictions with probability distributions across all ESI levels
The deployment was containerized using Docker for reproducibility and secure operation within the TRE. This ensures the model can be evaluated in secure healthcare settings without exposing sensitive patient data.
The final Streamlit interface accepts nine structured input variables from the clinical triage assessment (temperature, heart rate, respiratory rate, oxygen saturation, systolic and diastolic blood pressure, pain score, arrival transport method, and gender) alongside the free-text chief complaint. The chief complaint is embedded in real time using the same MPNet transformer employed during training. The application outputs a predicted ESI level (1 to 5) with appropriate action recommendations. Dual temperature units (Celsius and Fahrenheit) were included for flexibility, and a prominent notification informs users that the prediction is solely an AI-based recommendation and should not constitute a final clinical decision, supporting rather than replacing professional judgment.
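The inference step behind the interface can be sketched as follows. The action texts are illustrative placeholders, not the app's exact wording:

```python
import numpy as np

ACTIONS = {  # illustrative action text, not the app's exact wording
    1: "Immediate: life-saving intervention required",
    2: "Emergent: high-risk, assess promptly",
    3: "Urgent: multiple resources anticipated",
    4: "Less urgent: one resource anticipated",
    5: "Non-urgent: no resources anticipated",
}

def celsius_to_fahrenheit(temp_c: float) -> float:
    """Support dual temperature units in the input form."""
    return temp_c * 9 / 5 + 32

def predict_esi(probs: np.ndarray) -> tuple[int, str]:
    """Map the model's 5-class probability vector (0-indexed) to a 1-indexed ESI level."""
    esi = int(probs.argmax()) + 1
    return esi, ACTIONS[esi]

probs = np.array([0.05, 0.60, 0.20, 0.10, 0.05])   # example model output distribution
level, action = predict_esi(probs)
print(level)  # 2
```

In the app, `probs` comes from a softmax over the trained model's logits, and the full distribution is displayed alongside the predicted level so clinicians can see the model's uncertainty.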
Technical Architecture Summary
Lessons Learned & Future Directions
What Worked Well:
- Convolutional processing of text embeddings captured local semantic patterns effectively
- Late fusion (after separate branch processing) allowed each modality to develop specialized representations
- Caching embeddings dramatically reduced iteration time during hyperparameter tuning
- Stratified splitting maintained class balance despite severity-level imbalance
Challenges & Limitations:
- Version Control: Multiple team members concurrently developing Python scripts for preprocessing and modeling, further modified for TRE compatibility and GPU optimization, led to conflicting file versions in the GitHub repository. This was mitigated by establishing a canonical main folder requiring pull requests approved by two group members and detailed changelog documentation
- Class Imbalance: ESI Level 1 cases are rare; oversampling or class weighting will be considered in future iterations
- Mid-Level Confusion: ESI 2-3 distinction remains challenging (AUC 0.79-0.83); clinical guidelines themselves show overlap between adjacent severity levels
- Interpretability: Deep learning models lack transparency; future work could incorporate attention mechanisms to highlight influential chief complaint phrases
- Generalizability: Model trained on MIMIC-ED (US academic hospital); external validation on UK population needed for NHS deployment
Future Enhancements:
- Experiment with stronger embedding models such as BGE-M3 (BAAI General Embedding), which may produce more powerful and clinically useful embeddings than the all-mpnet-base-v2 transformer used in this study, potentially yielding better clustering of clinical text across diverse patient presentations
- Incorporate LLM-based ESI classification (e.g., LLaMA3.1-70B), which has demonstrated 0.91 accuracy on similar tasks
- Implement feature rankings and attention mechanisms to boost clinician trust, uptake, and model interpretability
- Improve predictive performance in the moderate ESI range (levels 2-3) where clinical overlap presents the greatest challenge
- Conduct external validation studies to test generalizability to UK NHS patient populations
- Incorporate temporal features (time of day, day of week, seasonal patterns)
- Develop ensemble approaches combining multiple architectures
- Extend to predict additional outcomes (admission likelihood, length of stay)
Tools & Technologies
- PyTorch: Deep learning framework for model development and training
- Sentence Transformers: MPNet embeddings for semantic text representation
- MIMIC-IV-ED: Emergency department dataset from PhysioNet
- scikit-learn: Preprocessing, metrics, train-test splitting
- Streamlit: Interactive web application for model demonstration
- Docker: Containerization for reproducible deployment in TREs
- CUDA: GPU acceleration for training and inference
Significance & Impact
This project demonstrates that multimodal deep learning can effectively combine structured clinical data with unstructured text to predict emergency department triage acuity. In the context of NHS emergency departments where only 57% of patients now meet the 4-hour admission target (compared to 95% a decade ago), and with record 61,529 people waiting more than 12 hours in January 2025, automated triage support systems could help address the critical pressure on healthcare resources. Key contributions include:
- Methodological: Demonstrated effective fusion of numerical and textual features using dual-branch CNN architecture with merged text CNN and structured data MLP
- Clinical Relevance: Achieved accuracy (0.70) exceeding emergency nurses (0.64) and strong performance on critical patient identification (ESI Level 1 AUC 0.93), essential for patient safety
- Reproducibility & Security: Provided complete containerized deployment pipeline enabling secure operation within Trusted Research Environments (TREs), ensuring patient data confidentiality while maintaining model accessibility
- Translational Infrastructure: Successfully demonstrated the complete DevOps lifecycle from data preprocessing through GPU-accelerated training to live web application deployment, addressing the translation gap between ML research and clinical implementation
- Educational Value: Serves as practical example of applied multimodal deep learning in healthcare, including agile methodology, version control, and team collaboration
While not intended for immediate clinical deployment without extensive validation, this work provides a foundation for future triage decision support systems and demonstrates the value of integrating diverse data modalities in predictive healthcare models. The iterative DevOps approach successfully delivered a live decision support tool that respects legal, ethical, and technical constraints required for healthcare AI applications.
Ethical Considerations: All development used the publicly available MIMIC-IV-ED demo dataset. Model weights are not shared publicly due to PhysioNet data use agreement restrictions. Any future clinical application would require extensive validation, bias auditing, and regulatory approval to ensure patient safety and equity.