Background
Robust evaluation of large language models (LLMs) and agentic artificial intelligence (AI) in diagnostic pathology requires datasets that reflect the sequential, multimodal, and integrative nature of real-world practice. Existing resources rarely capture the structured interplay of clinical presentation, histology, immunohistochemistry, and molecular findings. To address this gap, we developed Pathology’s Last Exam, a text-based benchmark of pathology cases designed to rigorously assess LLM-based diagnostic systems.
Methods
We curated 100 pathology cases from practice and leading journals (e.g., Am J Surg Pathol, Mod Pathol), enriched for rare and emerging entities, aberrant immunophenotypes, lesions of intermediate biological potential, and other challenging scenarios. Each case comprises a clinical summary, histopathology, special stains/immunohistochemistry, molecular findings, final diagnosis with references, and standardized metadata. All diagnostic evidence was provided to four large language models (MedGemma-27B, GPT-OSS-120B, Llama-4-Maverick-17B, GPT-5-Mini), each tasked with generating a final diagnostic interpretation. The dataset further supports stepwise information release to emulate the temporal progression of real diagnostic workflows, enabling systematic evaluation of model reasoning at both early and fully informed stages.
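The stepwise information release described above can be illustrated with a minimal sketch. It assumes a release order that follows the case components as listed (clinical summary, then histopathology, then stains/immunohistochemistry, then molecular findings). The class name `PathologyCase`, its field names, the `staged_prompts` helper, and the prompt wording are all hypothetical and are not taken from the benchmark itself.

```python
from dataclasses import dataclass

# Assumed release order: (field name, label shown to the model).
# Both the order and the labels are illustrative, not the benchmark's schema.
STAGES = (
    ("clinical_summary", "Clinical summary"),
    ("histopathology", "Histopathology"),
    ("stains_ihc", "Special stains / immunohistochemistry"),
    ("molecular_findings", "Molecular findings"),
)


@dataclass(frozen=True)
class PathologyCase:
    """Hypothetical record for one benchmark case."""
    case_id: str
    clinical_summary: str
    histopathology: str
    stains_ihc: str
    molecular_findings: str
    final_diagnosis: str  # reference answer; never included in a prompt


def staged_prompts(case: PathologyCase) -> list[str]:
    """Return one prompt per stage, each containing all evidence released so far."""
    released: list[str] = []
    prompts: list[str] = []
    for field_name, label in STAGES:
        released.append(f"{label}: {getattr(case, field_name)}")
        prompts.append("\n".join(released) + "\nWhat is the most likely diagnosis?")
    return prompts


if __name__ == "__main__":
    # Placeholder text only; no real case content is reproduced here.
    demo = PathologyCase(
        case_id="demo-001",
        clinical_summary="<clinical summary text>",
        histopathology="<histopathology text>",
        stains_ihc="<stains and IHC text>",
        molecular_findings="<molecular findings text>",
        final_diagnosis="<reference diagnosis>",
    )
    for stage_number, prompt in enumerate(staged_prompts(demo), start=1):
        print(f"--- stage {stage_number} ---\n{prompt}\n")
```

Under these assumptions, the last prompt corresponds to the fully informed task, and the earlier prompts correspond to the early-stage evaluation.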
Results
The dataset spans several organ systems and includes rare and complex neoplastic and non-neoplastic diagnoses (e.g., RUNX1-mutant AML mimicking B-lymphoblastic leukemia with an aberrant B-cell immunophenotype; pilomatrix-like high-grade endometrioid carcinoma; and POU2F3-positive, neuroendocrine marker-low small cell carcinoma). On the full-information diagnostic task, accuracy ranged from 29% (MedGemma-27B) to 75% (GPT-5-Mini).
Conclusions
Pathology’s Last Exam provides a unique dataset for diagnostic reasoning in surgical pathology. Its structured, literature- and practice-derived cases support rigorous evaluation of AI models. Our findings underscore the need for expanded, pathology-specific reasoning benchmarks that combine curated literature-derived cases with new, expert-generated scenarios.
Editorial acknowledgement
During the preparation of this work the author(s) used ChatGPT 5 in order to assist with language editing. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.

