Stress recognition is a key component of affect-aware systems for improving mental and physical well-being. While multimodal affect recognition systems based on physiological signals have shown promise, achieving robust generalization across datasets remains a major challenge due to variations in stress induction protocols and labeling practices. For instance, “stress” labels can vary widely between datasets: WESAD uses the Trier Social Stress Test to induce social-evaluative stress, while SWELL-KW relies on cognitive workload tasks. Such differences in the nature and intensity of stressors, together with inconsistencies in how labels are defined (e.g., “social stress” vs. “mental stress”), create major challenges for generalization. Prior work has explored deep transfer learning and unimodal self-supervised methods, but cross-dataset generalizability remains limited.
To address this gap, we propose a multimodal self-supervised learning (SSL) framework based on contrastive objectives that learns transferable representations from unlabeled physiological signals. Unlike conventional deep transfer learning approaches, our framework does not rely on stress labels during pretraining and is evaluated under a strict leave-one-subject-out (LOSO) protocol to ensure realistic cross-subject generalization. We systematically study the impact of SSL across multiple encoder architectures, including Convolutional Neural Networks (CNN), Temporal Convolutional Networks (TCN), ResNet34-1D, and a CNN–Transformer hybrid, analyzing how encoder choice affects representation transferability. Experiments span three laboratory datasets (WESAD, VERBIO, AffectHRI) and two daily-life datasets (SWEET, LD), covering lab-to-lab, lab-to-daily, and daily-to-lab transfer scenarios. Overall, our findings highlight multimodal self-supervised learning as an effective and label-efficient framework for improving cross-dataset generalization, particularly under realistic cross-subject and cross-context evaluation settings.
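For readers unfamiliar with the evaluation setup, the LOSO protocol mentioned above can be sketched as follows. This is a minimal illustrative implementation, not code from the paper; the function and variable names are hypothetical, and it assumes each signal window carries a subject identifier.

```python
# Minimal sketch of leave-one-subject-out (LOSO) splitting: every subject
# is held out exactly once as the test set, and the model is trained on
# the remaining subjects. Names here are illustrative assumptions.
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per fold."""
    subjects = sorted(set(subject_ids))
    for held_out in subjects:
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Example: six signal windows from three subjects.
ids = ["S1", "S1", "S2", "S2", "S3", "S3"]
splits = list(loso_splits(ids))
# Three folds; in the fold holding out S1, windows 0 and 1 form the test set.
```

Because no window from the held-out subject ever appears in training, LOSO measures cross-subject generalization rather than within-subject memorization, which is why it is stricter than a random train/test split.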

