Maximilian Schmutz, Sebastian Sommer, Julia Sander, David Graumann, Johannes Raffler, Iñaki Soto-Rey, Seyedmostafa Sheikhalishahi, Lisa Schmidt, Leonhard Paul Unkelbach, Levent Ortak, Tina Schaller, Sebastian Dintner, Kathrin Hildebrand, Michaela Kuhlen, Frank Jordan, Martin Trepel, Christian Hinske, Rainer Claus
Background
Large language models (LLMs) like ChatGPT 4.0 hold promise for enhancing clinical decision-making in precision oncology, particularly within molecular tumor boards (MTBs). This study assesses ChatGPT 4.0’s performance in generating therapy recommendations for complex real-world cancer cases compared to expert human MTB (hMTB) teams.
Methods
We retrospectively analyzed 20 anonymized MTB cases from the Comprehensive Cancer Center Augsburg (CCCA), covering breast cancer (n = 3), glioblastoma (n = 3), colorectal cancer (n = 2), and rare tumors. ChatGPT 4.0 recommendations were evaluated against hMTB outputs using metrics including recommendation type (therapeutic/diagnostic), information density metric (IDM) scores, consistency, quality (level of evidence [LoE]), and efficiency. Each case was prompted three times to assess inter-run variability (Fleiss' kappa).
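As an illustration of the replicate-consistency scoring described above, the following is a minimal Python sketch using statsmodels' Fleiss' kappa implementation. The category coding and data are hypothetical: the abstract does not specify how recommendations from the three runs were categorized before agreement was computed.

```python
# Hedged sketch: agreement across three replicate ChatGPT runs, scored
# with Fleiss' kappa as named in the Methods. Rows and category labels
# are invented for illustration only.
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = items rated for one case, columns = the three replicate runs;
# values = hypothetical category labels
# (0 = not recommended, 1 = off-label therapy, 2 = trial enrollment)
ratings = [
    [1, 1, 1],
    [2, 2, 1],
    [0, 0, 0],
    [1, 2, 1],
]

table, _ = aggregate_raters(ratings)  # -> items x categories count matrix
kappa = fleiss_kappa(table)           # ~0.47 here: "moderate" agreement
print(f"Fleiss' kappa: {kappa:.2f}")  # on the Landis-Koch scale (0.41-0.60)
```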
Results
ChatGPT 4.0 generated more therapeutic recommendations per case than the hMTB (median 3 vs. 1, p = 0.005), with a comparable number of diagnostic suggestions (median 1 vs. 2, p = 0.501). ChatGPT 4.0's therapeutic scope included off-label and clinical trial options. IDM scores indicated similar content depth for ChatGPT 4.0 (median 0.67) and the hMTB (median 0.75; p = 0.084). Consistency across replicate runs was moderate (median Fleiss' kappa = 0.51). ChatGPT 4.0 drew on lower-level or preclinical evidence more frequently (p = 0.0019). ChatGPT 4.0 was also significantly more efficient (median 15.2 vs. 34.7 minutes; p < 0.001).
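The abstract reports paired per-case comparisons but does not name the statistical test behind these p-values. One plausible choice, given that the same 20 cases were scored by both ChatGPT 4.0 and the hMTB, is a paired nonparametric test such as the Wilcoxon signed-rank test; the sketch below uses invented counts purely for illustration.

```python
# Hedged sketch: paired comparison of per-case therapeutic recommendation
# counts. The counts are hypothetical; the abstract reports only the
# medians (3 vs. 1) and the p-value (0.005), not the underlying data.
from scipy.stats import wilcoxon

# One entry per MTB case (paired: same case scored by both arms).
chatgpt_counts = [3, 4, 3, 2, 3, 5, 3, 3, 2, 4, 3, 3, 4, 2, 3, 3, 5, 3, 2, 4]
hmtb_counts    = [1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2]

stat, p = wilcoxon(chatgpt_counts, hmtb_counts)
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.4f}")
```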
Conclusion
Incorporating ChatGPT 4.0 into MTB workflows enhances efficiency and provides relevant recommendations, especially in guideline-supported cases. However, variability in evidence prioritization highlights the need for ongoing human oversight. A hybrid approach, integrating human expertise with LLM support, may optimize precision oncology decision-making.

