326P Pathology's last exam? A curated text-based benchmark dataset for diagnostic pathology [Abstract]

Reitsam, Nic G.; Gustav, M.; Foersch, S.; Märkl, Bruno; Kather, J. N.

doi:10.1016/j.esmorw.2025.100522

Background: Robust evaluation of large language models (LLMs) and agentic artificial intelligence (AI) in diagnostic pathology requires datasets that reflect the sequential, multimodal, and integrative nature of real-world practice. Existing resources rarely capture the structured interplay of clinical presentation, histology, immunohistochemistry, and molecular findings. To address this gap, we developed Pathology’s Last Exam, a text-based benchmark of pathology cases designed to rigorously assess LLM-based diagnostic systems.

Methods: We curated 100 pathology cases from practice and leading journals (e.g., Am J Surg Pathol, Mod Pathol), enriched for rare and emerging entities, aberrant immunophenotypes, lesions of intermediate biological potential, and other challenging scenarios. Each case comprises a clinical summary, histopathology, special stains/immunohistochemistry, molecular findings, final diagnosis with references, and standardized metadata. All diagnostic evidence was provided to four large language models (MedGemma-27B, GPT-OSS-120B, Llama-4-Maverick-17B, GPT-5-Mini), each tasked with generating a final diagnostic interpretation. The dataset further supports stepwise information release to emulate the temporal progression of real diagnostic workflows, enabling systematic evaluation of model reasoning at both early and fully informed stages.

Results: The dataset spans several organ systems and includes rare and complex neoplastic and non-neoplastic diagnoses (e.g., RUNX1-mutant AML mimicking B-lymphoblastic leukemia with aberrant B-cell immunophenotype; pilomatrix-like high-grade endometrioid carcinoma; POU2F3-positive, neuroendocrine-marker-low small cell carcinoma). On the full-information diagnostic task, accuracy ranged from 29% (MedGemma-27B) to 75% (GPT-5-Mini).

Conclusions: Pathology’s Last Exam provides a unique dataset for diagnostic reasoning in surgical pathology. Its structured, literature- and practice-derived cases support rigorous evaluation of AI models. Our findings underscore the need for expanded, pathology-specific reasoning benchmarks that combine curated literature-derived cases with new, expert-generated scenarios.

Editorial acknowledgement: During the preparation of this work the author(s) used ChatGPT 5 to assist with language editing. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.
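The stepwise information release described in the Methods can be pictured with a minimal Python sketch. It is illustrative only and is not the authors' code: the field names, the `Case` class, the stage order, and both helper functions are assumptions derived from the case components listed in the abstract, and how a model answer is judged correct is left outside the sketch because the abstract does not specify it.

```python
from dataclasses import dataclass

# Hypothetical stage order, mirroring the case components named in the abstract.
STAGES = ("clinical_summary", "histopathology", "stains_ihc", "molecular_findings")


@dataclass
class Case:
    """One benchmark case; field names are assumptions, not the dataset schema."""
    clinical_summary: str
    histopathology: str
    stains_ihc: str
    molecular_findings: str
    final_diagnosis: str


def stepwise_prompts(case):
    """Yield (stage_name, cumulative_evidence), revealing one component per step."""
    revealed = []
    for stage in STAGES:
        revealed.append(f"{stage}: {getattr(case, stage)}")
        yield stage, "\n".join(revealed)


def accuracy(flags):
    """Fraction of cases judged correct; `flags` is a list of booleans."""
    return sum(flags) / len(flags) if flags else 0.0
```

Under these assumptions, the last item yielded by `stepwise_prompts` corresponds to the full-information task, while earlier items correspond to the early, partially informed stages; `accuracy` would then be computed separately for each stage and each model.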

Author:	Nic G. Reitsam, M. Gustav, S. Foersch, Bruno Märkl, J. N. Kather
URN:	urn:nbn:de:bvb:384-opus4-1268851
Frontdoor URL:	https://opus.bibliothek.uni-augsburg.de/opus4/126885
ISSN:	2949-8201
Parent Title (English):	ESMO Real World Data and Digital Oncology
Publisher:	Elsevier BV
Place of publication:	Amsterdam
Type:	Article
Language:	English
Year of first Publication:	2025
Publishing Institution:	Universität Augsburg
Release Date:	2025/12/12
Volume:	10
Issue:	Supplement
First Page:	100522
DOI:	https://doi.org/10.1016/j.esmorw.2025.100522
Institutes:	Medizinische Fakultät
	Medizinische Fakultät / Universitätsklinikum
	Medizinische Fakultät / Lehrstuhl für Allgemeine und Spezielle Pathologie
Dewey Decimal Classification:	6 Technology, medicine, applied sciences / 61 Medicine and health / 610 Medicine and health
Licence:	CC-BY-NC-ND 4.0: Creative Commons: Attribution - NonCommercial - NoDerivatives

Open Access
