
ML auditing and reproducibility: applying a core criteria catalog to an early sepsis onset detection system

  • Background: On the way towards a commonly agreed framework for auditing ML algorithms, in our previous paper we proposed a 30-question core criteria catalog. In this paper, we apply our catalog to an early sepsis onset detection system use case. Methods: The assessment of the ML algorithm behind the sepsis prediction system takes the form of an external audit. We apply the questions of our catalog, with the described context, to the publicly available sepsis project resources. For the audit process we followed the three steps proposed by the Supreme Audit Institutions of Finland et al. and utilized inter-rater reliability techniques. We also conducted an extensive reproduction study, as encouraged by our catalog, including data perturbation experiments. Results: We were able to successfully apply our 30-question catalog to the sepsis ML algorithm development project.
37% of the questions were rated as fully addressed, 33% as partially addressed and 30% as not addressed, based on the first auditor. The weighted Cohen's kappa agreement coefficient is κ = 0.51. The focus of the sepsis project is on algorithm design, data properties and assessment metrics. In our reproduction study, using externally validated pooled prediction on the self-attention deep learning model, we achieved an AUC of 0.717 (95% CI, 0.693-0.740) and a PPV of 28.3% (95% CI, 24.5-32.0) at 80% TPR and 18.8% sepsis-case prevalence harmonization. For the lead time to sepsis onset, we could not reproduce meaningful values. In the perturbation experiment, the model showed an AUC of 0.799 (95% CI, 0.756-0.843) with modified input data in contrast to an AUC of 0.788 (95% CI, 0.743-0.833) with original input data, when trained on the AUMC dataset and validated externally. Discussion: The catalog application results are visualized in a radar diagram, allowing an auditor to quickly assess and compare strengths and weaknesses of ML algorithm development or implementation projects. In general, we were able to reproduce the magnitude of the sepsis project's reported performance metrics. However, certain steps of the reproduction study proved challenging due to necessary code changes and dependencies on package versions and the runtime environment. The deviation in the result metrics was −5.83% for the AUC and −11.03% for the PPV, presumably explained by our omission of tuning. The AUC change of 1.45% indicates resilience of the self-attention deep learning model to input data manipulation. An algorithmic error is most likely responsible for the missing lead time to sepsis onset metric. Even though the acquired weighted Cohen's kappa coefficient is interpreted as "fair to good" agreement between both auditors, potential subjectivity remains, leaving room for improvement.
This could be mitigated if more groups (multiple auditors) applied our catalog to existing ML development and implementation projects. A subsequent "catalog application guideline" could be established this way. Our activities might also help development or implementation teams prepare for future, legally required audits of their newly created ML algorithms/AI products.
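The inter-rater agreement reported above can be illustrated with a short sketch. The paper does not publish the per-question ratings or specify the weighting scheme, so the ratings below are hypothetical and linear weights are assumed; the three ordinal categories mirror the catalog's "fully / partially / not addressed" scale.

```python
import numpy as np

def weighted_kappa(rater1, rater2, n_cat, weights="linear"):
    """Weighted Cohen's kappa for two raters over ordinal categories 0..n_cat-1."""
    obs = np.zeros((n_cat, n_cat))
    for a, b in zip(rater1, rater2):
        obs[a, b] += 1
    obs /= obs.sum()                                       # observed proportion matrix
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0))  # chance agreement from marginals
    i, j = np.indices((n_cat, n_cat))
    if weights == "linear":
        w = np.abs(i - j) / (n_cat - 1)                    # linear disagreement weights
    else:
        w = ((i - j) / (n_cat - 1)) ** 2                   # quadratic disagreement weights
    return 1.0 - (w * obs).sum() / (w * expected).sum()

# Hypothetical ratings for 10 of the 30 catalog questions:
# 0 = fully addressed, 1 = partially addressed, 2 = not addressed
auditor_a = [0, 0, 1, 2, 1, 0, 2, 1, 0, 2]
auditor_b = [0, 1, 1, 2, 1, 0, 2, 2, 0, 1]
print(round(weighted_kappa(auditor_a, auditor_b, 3), 3))
```

Because the weights penalize a fully-vs-not disagreement twice as hard as an adjacent-category one, the weighted variant suits the ordinal rating scale better than plain Cohen's kappa would.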

Metadata
Author:Markus Schwarz, Ludwig Christian Hinske, Ulrich Mansmann, Fady Albashiti
URN:urn:nbn:de:bvb:384-opus4-1243839
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/124383
ISSN:2169-3536
Parent Title (English):IEEE Access
Publisher:Institute of Electrical and Electronics Engineers (IEEE)
Type:Article
Language:English
Year of first Publication:2025
Publishing Institution:Universität Augsburg
Release Date:2025/08/18
Volume:13
First Page:104899
Last Page:104915
DOI:https://doi.org/10.1109/access.2025.3579631
Institutes:Medizinische Fakultät
Medizinische Fakultät / Universitätsklinikum
Medizinische Fakultät / Lehrstuhl für Datenmanagement und Clinical Decision Support
Dewey Decimal Classification:0 Computer science, information, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Licence (German):CC-BY 4.0: Creative Commons: Attribution (with Print on Demand)