Towards automated annotation of infant-caregiver engagement phases with multimodal foundation models

  • Caregiver mental health disorders increase the risk of insecure infant attachment and can negatively impact multiple aspects of child development, including cognitive, emotional, and social growth. Infant-caregiver interactions contain subtle psychological and behavioral cues that reveal these adverse effects, underscoring the need for analytical methods to assess them effectively. The Face-to-Face-Still-Face (FFSF) paradigm is a key approach in psychological research for investigating these dynamics, and the Infant and Caregiver Engagement Phases revised German edition (ICEP-R) annotation scheme provides a structured framework for evaluating FFSF interactions. However, manual annotation is labor-intensive and limits scalability, thus hindering a deeper understanding of early developmental impairments. To address this, we developed a computational method that automates the annotation of caregiver-infant interactions using features extracted from audio-visual foundation models. Our approach was tested on 92 FFSF video sessions. Findings demonstrate that models based on bidirectional LSTM and linear classifiers show varying effectiveness depending on the role and feature modality. Specifically, bidirectional LSTM models generally perform better in predicting complex infant engagement phases across multimodal features, while linear models show competitive performance, particularly with unimodal feature encodings like Wav2Vec2-BERT. To support further research, we share our raw feature dataset annotated with ICEP-R labels, enabling broader refinement of computational methods in this area.
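  The abstract describes sequence classifiers (bidirectional LSTM vs. linear heads) operating on pre-extracted foundation-model features. The following is a minimal illustrative sketch, not the authors' implementation: a PyTorch bidirectional LSTM that maps a sequence of feature frames to per-frame engagement-phase labels. The class name BiLSTMPhaseTagger, the feature dimension, the hidden size, and the number of phases are all assumptions for illustration.

  # Minimal sketch (assumptions, not the paper's code): a bidirectional LSTM
  # tagging each pre-extracted feature frame (e.g. an audio embedding such as
  # one produced by Wav2Vec2-BERT) with an ICEP-R engagement-phase label.
  import torch
  import torch.nn as nn

  class BiLSTMPhaseTagger(nn.Module):
      def __init__(self, feat_dim=1024, hidden_dim=256, num_phases=5):
          super().__init__()
          self.lstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
          # 2 * hidden_dim because forward and backward states are concatenated
          self.head = nn.Linear(2 * hidden_dim, num_phases)

      def forward(self, x):          # x: (batch, time, feat_dim)
          h, _ = self.lstm(x)        # h: (batch, time, 2 * hidden_dim)
          return self.head(h)        # per-frame phase logits

  if __name__ == "__main__":
      model = BiLSTMPhaseTagger()
      feats = torch.randn(2, 300, 1024)   # two clips, 300 feature frames each
      logits = model(feats)                # (2, 300, num_phases)
      phases = logits.argmax(dim=-1)       # predicted phase index per frame
      print(phases.shape)

  Replacing the LSTM with a frame-wise linear layer over the same features would correspond to the linear baseline mentioned in the abstract.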

Metadata
Author:Daksitha Senel Withanage Don, Dominik Schiller, Tobias Hallmen, Silvan Mertes, Tobias Baur, Florian Lingenfelser, Mitho Müller, Lea Kaubisch, Corinna Reck, Elisabeth André
URN:urn:nbn:de:bvb:384-opus4-1166079
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/116607
ISBN:979-8-4007-0462-8
Parent Title (English):ICMI '24: International Conference on Multimodal Interaction, San Jose, Costa Rica, November 4-8, 2024
Publisher:ACM
Place of publication:New York, NY
Type:Conference Proceeding
Language:English
Year of first Publication:2024
Publishing Institution:Universität Augsburg
Release Date:2024/11/15
First Page:428
Last Page:438
DOI:https://doi.org/10.1145/3678957.3685704
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Menschzentrierte Künstliche Intelligenz
Dewey Decimal Classification:0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing; computer science
Licence:CC-BY 4.0: Creative Commons: Attribution (with Print on Demand)