- Caregiver mental health disorders increase the risk of insecure infant attachment and can negatively impact multiple aspects of child development, including cognitive, emotional, and social growth. Infant-caregiver interactions contain subtle psychological and behavioral cues that reveal these adverse effects, underscoring the need for analytical methods to assess them effectively. The Face-to-Face-Still-Face (FFSF) paradigm is a key approach in psychological research for investigating these dynamics, and the Infant and Caregiver Engagement Phases revised German edition (ICEP-R) annotation scheme provides a structured framework for evaluating FFSF interactions. However, manual annotation is labor-intensive and limits scalability, hindering a deeper understanding of early developmental impairments. To address this, we developed a computational method that automates the annotation of caregiver-infant interactions using features extracted from audio-visual foundation models. Our approach was tested on 92 FFSF video sessions. Findings demonstrate that models based on bidirectional LSTMs and linear classifiers vary in effectiveness depending on the role and feature modality: bidirectional LSTM models generally perform better at predicting complex infant engagement phases across multimodal features, while linear models show competitive performance, particularly with unimodal feature encodings such as Wav2Vec2-BERT. To support further research, we share our raw feature dataset annotated with ICEP-R labels, enabling broader refinement of computational methods in this area.
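
To make the modeling setup concrete, the sketch below shows what a bidirectional LSTM classifier over precomputed foundation-model features could look like. This is an illustration only, not the paper's implementation: the feature dimension, hidden size, and number of engagement-phase labels (`feat_dim`, `hidden`, `n_phases`) are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

class BiLSTMPhaseClassifier(nn.Module):
    """Toy bidirectional LSTM mapping a sequence of precomputed features
    (e.g., one vector per video/audio frame from a foundation model) to
    per-time-step engagement-phase logits. All sizes are illustrative
    assumptions, not the paper's configuration."""

    def __init__(self, feat_dim=1024, hidden=128, n_phases=5):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        # Bidirectional output concatenates forward and backward states.
        self.head = nn.Linear(2 * hidden, n_phases)

    def forward(self, x):           # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)         # h: (batch, time, 2 * hidden)
        return self.head(h)         # logits: (batch, time, n_phases)

# Example: a 10-second clip at 25 fps -> 250 feature vectors.
feats = torch.randn(1, 250, 1024)    # stand-in for extracted features
logits = BiLSTMPhaseClassifier()(feats)
phases = logits.argmax(dim=-1)       # predicted phase label per time step
```

The linear-classifier baseline mentioned above would amount to replacing the LSTM with the `nn.Linear` head applied independently to each time step's feature vector.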