From proprietary printed forms to standardized digital exchange formats – care transition records to FHIR as an example
The transition from proprietary, paper-based care transition records (CTRs) to standardized digital formats like HL7 FHIR remains a significant challenge for healthcare institutions. Variability in document layouts, coupled with slow adoption of new interoperability standards, complicates efforts to digitize patient records while preserving data integrity and privacy. This study presents a machine learning-based pipeline for automated information extraction from scanned CTRs. Synthetic training data was generated using a custom CTR generator. A Detectron2-based object detection model, integrated with LayoutParser for document structure analysis and Tesseract OCR for text recognition, was trained on this synthetic dataset. Checkbox detection was performed via an image-processing pipeline based on pixel density analysis. Extracted information was mapped to the FHIR-based PIO-ULB (Pflegeinformationsobjekt - Überleitungsbogen) format using a custom serialization tool.
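The abstract does not give details of the pixel density analysis used for checkbox detection. As a rough illustration of the general idea only, the following sketch counts dark pixels inside a known checkbox region and compares their share against a threshold. The function names, the border margin, the darkness threshold, and the minimum fill ratio are all assumptions made for this example and are not taken from the study.

```python
import numpy as np


def checkbox_fill_ratio(gray, box, margin=2, dark_threshold=128):
    """Return the fraction of dark pixels inside a checkbox region.

    gray: 2-D uint8 array (0 = black, 255 = white).
    box: (x, y, width, height) of the checkbox, including its printed border.
    margin: pixels trimmed on every side so the border itself is not counted
            (assumed value, would need tuning per form and scan resolution).
    """
    x, y, w, h = box
    inner = gray[y + margin : y + h - margin, x + margin : x + w - margin]
    if inner.size == 0:
        raise ValueError("box is too small for the chosen margin")
    return float(np.mean(inner < dark_threshold))


def is_checked(gray, box, min_ratio=0.05, **kwargs):
    """Classify a checkbox as ticked if enough of its interior is dark.

    min_ratio is a hypothetical threshold chosen for this toy example.
    """
    return checkbox_fill_ratio(gray, box, **kwargs) >= min_ratio


def draw_box(img, x, y, size):
    """Draw a one-pixel black square outline (toy stand-in for a printed box)."""
    img[y, x : x + size] = 0
    img[y + size - 1, x : x + size] = 0
    img[y : y + size, x] = 0
    img[y : y + size, x + size - 1] = 0


# Toy page: white background with one empty and one ticked checkbox.
page = np.full((40, 80), 255, dtype=np.uint8)
empty_box = (5, 5, 20, 20)
ticked_box = (45, 5, 20, 20)
draw_box(page, 5, 5, 20)
draw_box(page, 45, 5, 20)

# Tick the second box with a thin diagonal cross.
for i in range(3, 17):
    page[5 + i, 45 + i] = 0
    page[5 + i, 45 + 19 - i] = 0

print("empty box checked: ", is_checked(page, empty_box))
print("ticked box checked:", is_checked(page, ticked_box))
```

On real scans, the threshold values would likely have to be calibrated against noise, skew, and pen thickness, and the checkbox coordinates would come from the layout detection step rather than being hard-coded.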
A synthetic dataset of 10,000 CTR samples was used for training and evaluation. On synthetic data, the model achieved accuracy, precision, recall, and F1-score of 97%, 98%, 95%, and 97%, respectively. On real-world data, performance remained robust, with corresponding values of 85%, 86%, 83%, and 85%. Lower performance on real-world data was attributed to layout variability and scanning artifacts absent from the synthetic training set.
The results demonstrate the feasibility of using machine learning for automated extraction and standardization of CTRs, particularly when relying on synthetic data to overcome data privacy constraints during development. While accuracy declines with real-world document variability, the approach provides a possible interim solution for facilitating interoperability in healthcare documentation. Future work will focus on extending data generation to cover complex document layouts and integrating advanced OCR and handwriting recognition methods to further improve extraction performance.
Author: | Viktor Werlitz, Lukas Kleybolte, Sabahudin Balic, Elisabeth V. Mess, Andreas Mahler, Claudia Reuter, Alexandra Teynor |
---|---|
URN: | urn:nbn:de:bvb:384-opus4-1251564 |
Frontdoor URL: | https://opus.bibliothek.uni-augsburg.de/opus4/125156 |
ISBN: | 9781643686158 |
ISSN: | 0926-9630 |
ISSN: | 1879-8365 |
Parent Title (English): | German Medical Data Sciences 2025: GMDS Illuminates Health: proceedings of the 70th Annual Meeting of the German Association of Medical Informatics, Biometry, and Epidemiology e.V. (gmds), Jena, Germany, 7-11 September 2025 |
Publisher: | IOS Press |
Place of publication: | Amsterdam |
Editor: | Rainer Röhrig, Thomas Ganslandt, Klaus Jung, Ann-Kristin Kock-Schoppenhauer, Jochem König, Ulrich Sax, Martin Sedlmayr, Cord Spreckelsen, Antonia Zapf |
Type: | Conference Proceeding |
Language: | English |
Year of first Publication: | 2025 |
Publishing Institution: | Universität Augsburg |
Release Date: | 2025/09/20 |
First Page: | 162 |
Last Page: | 169 |
Series: | Studies in Health Technology and Informatics ; 331 |
DOI: | https://doi.org/10.3233/shti251392 |
Institutes: | Medizinische Fakultät |
 | Medizinische Fakultät / Universitätsklinikum |
Dewey Decimal Classification: | 6 Technology, medicine, applied sciences / 61 Medicine and health / 610 Medicine and health |