
Development and validation of large language model rating scales for automatically transcribed psychological therapy sessions

• Rating scales have shaped psychological research, but they are resource-intensive and can burden participants. Large Language Models (LLMs) offer a tool to assess latent constructs in text. This study introduces LLM rating scales, which use LLM responses instead of human ratings. We demonstrate this approach with an LLM rating scale measuring patient engagement in therapy transcripts. Automatically transcribed videos of 1,131 sessions from 155 patients were analyzed using DISCOVER, a software framework for local multimodal human behavior analysis. A Llama 3.1 8B model rated 120 engagement items, and the top eight items were averaged into a total score. Psychometric evaluation showed a normal distribution, strong reliability (ω = 0.953), and acceptable fit (CFI = 0.968, SRMR = 0.022), with the exception of RMSEA = 0.108. Validity was supported by significant correlations with engagement determinants (e.g., motivation, r = .413), processes (e.g., between-session efforts, r = .390), and outcomes (e.g., symptoms, r = -.304). Results remained robust across bootstrap resampling and cross-validation, accounting for nested data. The LLM rating scale exhibited strong psychometric properties, demonstrating the potential of the approach as an assessment tool. Importantly, this automated approach uses interpretable items, ensuring a clear understanding of the measured constructs, while supporting local implementation and protecting confidential data.
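The scoring step described in the abstract can be illustrated with a minimal sketch. This is not the authors' published pipeline: it assumes a local Ollama server exposing a Llama 3.1 8B model, and the item wordings, the 1-5 response format, and the score_item/total_score helpers are hypothetical illustrations of the general approach (prompt the LLM once per item, parse a numeric rating, average the retained items into a total score).

# Minimal sketch (not the authors' pipeline): score one transcript with an
# LLM rating scale. Assumes a local Ollama server serving a Llama 3.1 8B
# model; pip install ollama.
import re
import ollama

# Hypothetical item wordings; the study selected the eight best of 120
# candidate items, which are not reproduced here.
RETAINED_ITEMS = [
    "The patient actively engages with the session's topics.",
    "The patient reports efforts made between sessions.",
    "The patient expresses motivation to change.",
]

def score_item(transcript: str, item: str) -> float:
    """Ask the LLM for a 1-5 rating of one item and parse the number."""
    prompt = (
        "Rate the following statement about the therapy transcript on a "
        "scale from 1 (not at all true) to 5 (completely true). "
        f"Answer with a single number only.\n\nStatement: {item}\n\n"
        f"Transcript:\n{transcript}"
    )
    reply = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": prompt}],
    )["message"]["content"]
    match = re.search(r"[1-5]", reply)  # crude parse; a real pipeline would validate
    if match is None:
        raise ValueError(f"Unparseable rating: {reply!r}")
    return float(match.group())

def total_score(transcript: str) -> float:
    """Average the retained item ratings into the scale's total score."""
    ratings = [score_item(transcript, item) for item in RETAINED_ITEMS]
    return sum(ratings) / len(ratings)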
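For reference, the reliability coefficient reported above, McDonald's omega, has a standard closed form for a single-factor model with standardized loadings λ_i and residual variances ψ_i over the k retained items (the paper's exact estimator may differ):

\omega = \frac{\left(\sum_{i=1}^{k} \lambda_i\right)^{2}}{\left(\sum_{i=1}^{k} \lambda_i\right)^{2} + \sum_{i=1}^{k} \psi_i}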

Metadata
Author:Steffen T. Eberhardt, Antonia Vehlen, Jana Schaffrath, Brian Schwartz, Tobias Baur, Dominik Schiller, Tobias Hallmen, Elisabeth André, Wolfgang Lutz
URN:urn:nbn:de:bvb:384-opus4-1247774
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/124777
ISSN:2045-2322
Parent Title (English):Scientific Reports
Publisher:Springer Science and Business Media LLC
Type:Article
Language:English
Year of first Publication:2025
Publishing Institution:Universität Augsburg
Release Date:2025/09/03
Volume:15
Issue:1
First Page:29541
DOI:https://doi.org/10.1038/s41598-025-14923-y
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Menschzentrierte Künstliche Intelligenz
Dewey Decimal Classification:0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing; computer science
Licence:CC-BY 4.0: Creative Commons: Attribution (with Print on Demand)