Axel Winter, Bjarne Pfitzner, Robin P. van de Water, Lara Faraj, Christoph Riepe, Wolf-Heinrich Hahn, Felix Krenzien, Christian Schineis, Thomas Malinka, Wenzel Schöning, Christian Denecke, Bert Arnrich, Katharina Beyer, Johann Pratschke, Igor M. Sauer, Max M. Maurer
Background: Comprehensive preoperative risk stratification is essential for improving perioperative outcomes and guiding informed decisions in general surgery (GS). However, data scarcity remains a key challenge to developing robust, high-dimensional artificial intelligence (AI) models. To address this data barrier in surgical AI, transfer learning (TL) enables neural networks (NNs) to transfer and adapt knowledge from pretrained source models to new domains with critically limited data availability.
Methods: This multicenter study included patients undergoing advanced GS at three tertiary centers between 2015 and 2023. Multiple large-scale source models for 90-day mortality prediction were trained on 85 preoperative parameters. Subsequently, organ-specific fine-tuning was performed for esophageal, liver, pancreatic, and colorectal surgery individually. TL models were benchmarked against standard ML models and conventional risk scores using the area under the receiver-operating characteristic curve (AUROC), precision-recall curve (AUPRC), and F1-score, including 95% confidence intervals. Feature analyses were performed for each NN to investigate and compare model interpretability.
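As an illustration of the evaluation described in the Methods, the following minimal sketch computes AUROC and a step-wise average-precision estimate of AUPRC, with a percentile bootstrap for 95% confidence intervals. It is not the study's code: the abstract does not state how the intervals were derived, so the bootstrap procedure, the function names (`auroc`, `average_precision`, `bootstrap_ci`), the number of resamples, and the toy labels and scores are all assumptions made for illustration.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive case outranks a negative one.

    Equivalent to the Mann-Whitney U statistic; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise average precision, one common estimator of AUPRC.

    Cases are ranked by descending score; tied scores are not handled specially.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive label")
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


def bootstrap_ci(labels, scores, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` (assumed procedure)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lost one class, so the metric is undefined
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy data: 1 = death within 90 days, score = predicted risk.
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.50, 0.90]

if __name__ == "__main__":
    print("AUROC:", auroc(toy_labels, toy_scores))
    print("AUROC 95% CI:", bootstrap_ci(toy_labels, toy_scores, auroc))
    print("AUPRC:", average_precision(toy_labels, toy_scores))
    print("AUPRC 95% CI:", bootstrap_ci(toy_labels, toy_scores, average_precision))
```

Because 90-day mortality is a rare outcome, AUPRC is sensitive to the event rate in a way AUROC is not, which is one reason to report both metrics side by side.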
Results: 14 922 patients (mean [SD] age: 58.5 [16.1] years) were included. Conventional ML achieved AUROCs of 0.75 (0.72-0.79; esophageal surgery), 0.80 (0.79-0.82; liver surgery), 0.73 (0.71-0.76; pancreatic surgery), and 0.92 (0.92-0.92; colorectal surgery) with corresponding AUPRCs reaching 0.37 (0.33-0.43), 0.30 (0.29-0.31), 0.29 (0.24-0.34), and 0.57 (0.56-0.58), respectively. TL significantly improved AUPRCs by 38% in esophageal (0.54 [0.51-0.58], P < 0.001), 14% in liver (0.34 [0.32-0.36], P < 0.001), and 8% in pancreatic surgery (0.31 [0.28-0.37], P < 0.001). Patient age and the Charlson Comorbidity Index (CCI) consistently emerged as the highest-weight features across all TL models. All NNs outperformed the American Society of Anesthesiologists Physical Status and CCI as conventional risk scores in predicting mortality.
Conclusions: Machine learning outperforms conventional risk modeling in preoperative mortality prediction. TL can significantly enhance model performance in surgical domains with limited data availability, offering a promising approach to overcome persisting data constraints for AI in surgery.…

