Students' shift of attention away from a current learning task to task-unrelated thought, also called mind wandering, occurs about 30% of the time spent on education-related activities. Its frequent occurrence has a negative effect on learning outcomes across learning tasks. Automated detection of mind wandering might offer an opportunity to assess the attentional state continuously and non-intrusively over time and hence enable large-scale research on learning materials and responding to inattention with targeted interventions. To achieve this, an accessible detection approach that performs well across systems and settings is required. In this work, we explore a new, generalizable approach to video-based mind wandering detection that can be transferred to naturalistic settings across learning tasks. To this end, we leverage two datasets, consisting of facial videos during reading in the lab (N = 135) and lecture viewing in-the-wild (N = 15). When predicting mind wandering, deep neural networks (DNNs) and long short-term memory networks (LSTMs) achieve F1 scores of 0.44 (AUC-PR = 0.40) and 0.459 (AUC-PR = 0.39), above chance level, with latent features based on transfer learning on the lab data. When exploring generalizability by training on the lab dataset and predicting on the in-the-wild dataset, BiLSTMs on latent features perform comparably to the state of the art with an F1 score of 0.352 (AUC-PR = 0.26). Moreover, we investigate the fairness of the predictive models across gender and show, based on post-hoc explainability methods, that the employed latent features mainly encode information on the eye and mouth areas. We discuss the benefits of generalizability and possible applications.
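As a minimal illustration of the sequence-modeling step described above (not the authors' implementation), the sketch below feeds a window of per-frame latent face features to a BiLSTM for binary mind-wandering prediction; the feature dimension, window length, and hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MindWanderingBiLSTM(nn.Module):
    """Binary mind-wandering classifier over a window of per-frame latent features.

    Illustrative sketch: feature_dim=256 and hidden_dim=64 are assumptions,
    not values taken from the paper.
    """
    def __init__(self, feature_dim=256, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                 # x: (batch, frames, feature_dim)
        out, _ = self.lstm(x)             # (batch, frames, 2 * hidden_dim)
        logits = self.head(out[:, -1])    # use the last time step
        return logits.squeeze(-1)         # raw logits; apply sigmoid for a probability

# Example: a batch of 8 windows of 30 frames each (window length is illustrative).
model = MindWanderingBiLSTM()
features = torch.randn(8, 30, 256)
probs = torch.sigmoid(model(features))
```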
Comparison of clinical geneticist and computer visual attention in assessing genetic conditions
(2024)
Artificial intelligence (AI) for facial diagnostics is increasingly used in the genetics clinic to evaluate patients with potential genetic conditions. Current approaches focus on one type of AI called deep learning (DL). While DL-based facial diagnostic platforms have a high accuracy rate for many conditions, less is understood about how this technology assesses and classifies (categorizes) images, and how this compares to humans. To compare human and computer attention, we performed eye-tracking analyses of geneticist clinicians (n = 22) and non-clinicians (n = 22) who viewed images of people with 10 different genetic conditions, as well as images of unaffected individuals. We calculated the Intersection-over-Union (IoU) and Kullback–Leibler divergence (KL) to compare the visual attention of the two participant groups, and then that of the clinician group against the saliency maps of our deep learning classifier. We found that human visual attention differs greatly from the DL model's saliency results. Averaging over all the test images, the IoU and KL metrics for successful (accurate) clinician visual attention versus the saliency maps were 0.15 and 11.15, respectively. Individuals also tend to have a specific pattern of image inspection, and clinicians demonstrate different visual attention patterns than non-clinicians (IoU and KL of clinicians versus non-clinicians were 0.47 and 2.73, respectively). This study shows that humans (at different levels of expertise) and a computer vision model examine images differently. Understanding these differences can improve the design and use of AI tools, and lead to more meaningful interactions between clinicians and AI technologies.
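The two comparison metrics named above can be computed directly from attention maps. The sketch below is one possible implementation, assuming normalized 2D heatmaps; the binarization threshold used for IoU is an assumption rather than the study's exact protocol.

```python
import numpy as np
from scipy.stats import entropy

def attention_iou(map_a, map_b, threshold=0.5):
    """IoU of two attention maps after thresholding them into binary masks.

    Sketch only: the 0.5 threshold on max-normalized maps is an assumption.
    """
    a = (map_a / map_a.max()) >= threshold
    b = (map_b / map_b.max()) >= threshold
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return intersection / union if union > 0 else 0.0

def attention_kl(map_p, map_q, eps=1e-8):
    """KL divergence D(P || Q) between two attention maps treated as distributions."""
    p = map_p.flatten() + eps
    q = map_q.flatten() + eps
    p /= p.sum()
    q /= q.sum()
    return entropy(p, q)  # scipy computes sum(p * log(p / q))

# Example with random 64x64 maps standing in for a fixation heatmap and a saliency map.
clinician = np.random.rand(64, 64)
saliency = np.random.rand(64, 64)
print(attention_iou(clinician, saliency), attention_kl(clinician, saliency))
```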
Computer vision-based methods have valuable use cases in precision medicine, and recognizing the facial phenotypes of genetic disorders is one of them. Many genetic disorders are known to affect the visual appearance and geometry of the face. Automated classification and similarity retrieval aid physicians in decision-making to diagnose possible genetic conditions as early as possible. Previous work has addressed the problem as a classification problem and used deep learning methods. The challenging issue in practice is the sparse label distribution and huge class imbalance across categories. Furthermore, most disorders have few labeled samples in training sets, making representation learning and generalization essential to acquiring a reliable feature descriptor. In this study, we used a facial recognition model trained on a large corpus of healthy individuals as a pretext task and transferred it to facial phenotype recognition. Furthermore, we created simple baselines of few-shot meta-learning methods to improve our base feature descriptor. Our quantitative results on the GestaltMatcher Database show that our CNN baseline surpasses previous works, including GestaltMatcher, and that few-shot meta-learning strategies improve retrieval performance in both frequent and rare classes.
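The similarity-retrieval step can be illustrated with a small sketch: given embeddings from a transferred face-recognition descriptor, rank gallery patients by cosine similarity to a query. The embedding dimension and gallery size below are illustrative assumptions, not details from the study.

```python
import numpy as np

def cosine_retrieval(query_embedding, gallery_embeddings, top_k=5):
    """Rank gallery cases by cosine similarity to a query face embedding.

    Sketch of the retrieval step only; embeddings are assumed to come from a
    face-recognition backbone transferred (or meta-learned) on phenotype data.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    g = gallery_embeddings / np.linalg.norm(gallery_embeddings, axis=1, keepdims=True)
    similarities = g @ q                        # cosine similarity per gallery item
    ranked = np.argsort(-similarities)[:top_k]  # indices of the most similar cases
    return ranked, similarities[ranked]

# Example: 512-d embeddings for 1000 gallery images and one query (dimensions are illustrative).
gallery = np.random.randn(1000, 512)
query = np.random.randn(512)
indices, scores = cosine_retrieval(query, gallery)
```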
Rare genetic disorders affect more than 6% of the global population. Reaching a diagnosis is challenging because rare disorders are very diverse. Many disorders have recognizable facial features that serve as hints for clinicians to diagnose patients. Previous work, such as GestaltMatcher, utilized representation vectors produced by a DCNN similar to AlexNet to match patients in a high-dimensional feature space and thus support "unseen" ultra-rare disorders. However, the architecture and dataset used for transfer learning in GestaltMatcher have become outdated. Moreover, a way to train the model to generate better representation vectors for unseen ultra-rare disorders has not yet been studied. Because of the overall scarcity of patients with ultra-rare disorders, it is infeasible to train a model on them directly. Therefore, we first analyzed the influence of replacing the GestaltMatcher DCNN with a state-of-the-art face recognition approach, iResNet with ArcFace. Additionally, we experimented with different face recognition datasets for transfer learning. Furthermore, we proposed test-time augmentation and model ensembles that mix general face verification models with models specific to verifying disorders, improving the disorder verification accuracy on unseen ultra-rare disorders. Our proposed ensemble model achieves state-of-the-art performance on both seen and unseen disorders.
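A hedged sketch of the test-time augmentation and verification idea follows, assuming an `embed_fn` interface standing in for a face backbone such as an ArcFace-trained iResNet; the flip-and-average scheme and the decision threshold are illustrative, not the paper's exact configuration.

```python
import numpy as np

def tta_embedding(image, embed_fn):
    """Test-time augmentation sketch: average the embeddings of an image and its
    horizontal flip, then L2-normalize. `embed_fn` is an assumed interface for
    any face backbone returning a 1-D feature vector.
    """
    flipped = image[:, ::-1, :]                  # horizontal flip of an (H, W, C) array
    emb = embed_fn(image) + embed_fn(flipped)
    return emb / np.linalg.norm(emb)

def verify(emb_a, emb_b, threshold=0.3):
    """Decide whether two embeddings depict the same disorder via cosine similarity.
    The threshold is illustrative, not a value reported in the paper."""
    return float(emb_a @ emb_b) >= threshold

# Example with a dummy backbone that averages pixel values per channel (stand-in only).
dummy_embed = lambda img: img.mean(axis=(0, 1))
img = np.random.rand(112, 112, 3)
same = verify(tta_embedding(img, dummy_embed), tta_embedding(img, dummy_embed))
```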
In social signal processing and computer vision, a growing number of studies in recent years have drawn, to some extent, on the social and behavioural sciences. The affective state of a human has significant potential in many application areas, such as evaluating market trends, understanding decision-making, and interpreting social interactions and their underlying background. Among the cues that make our emotions understandable, facial expressions are the most prominent and descriptive sign of a human's affective state. This thesis presents a literature survey on the state of the art in facial expression recognition, compares different approaches to the automatic analysis of emotions, and proposes a new embedded framework for the facial expression recognition problem. Although there have been a large number of studies on facial expression recognition, the number of "affective" embedded systems is fairly scarce. In this study, an efficient embedded framework is implemented on a system-on-chip (SoC) development board. Many application areas of facial expression recognition systems necessitate mobility and embedded platforms that offer both hardware and software development tools, as well as low power consumption and increased adaptivity. In this study, different feature extraction methods, such as local binary patterns (LBP), local ternary patterns (LTP), and Gabor filters, are compared using different extraction strategies, and support vector machines (SVMs) with varied kernel functions and parameters are used in the learning phase. For the embedded facial expression system, a methodology based on local binary patterns and support vector machines is preferred because of its higher accuracy and time performance. Besides an OpenCV implementation on an embedded Linux operating system, a Zynq-7000 All Programmable SoC is used to measure the performance of LBP feature extraction. Our final system is capable of facial expression recognition in both static images and video sequences at 4-5 fps.
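The LBP-plus-SVM pipeline preferred in this thesis can be sketched as follows, using scikit-image and scikit-learn as stand-ins for the embedded implementation; the LBP parameters and the single global histogram (rather than a grid of cells) are simplifying assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(gray_face, points=8, radius=1):
    """Uniform LBP histogram of a grayscale face crop, used as an SVM feature vector.

    Sketch of the LBP+SVM pipeline described above; P=8, R=1 and a single global
    histogram (instead of per-cell histograms) are simplifying assumptions.
    """
    lbp = local_binary_pattern(gray_face, points, radius, method="uniform")
    n_bins = points + 2   # uniform patterns plus one catch-all bin
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return hist

# Example training on random stand-in data (real use: aligned face crops with expression labels).
rng = np.random.default_rng(0)
faces = rng.integers(0, 256, size=(40, 64, 64)).astype(np.uint8)
labels = rng.integers(0, 7, size=40)   # e.g. seven basic expression classes
features = np.stack([lbp_histogram(f) for f in faces])
clf = SVC(kernel="rbf", C=1.0).fit(features, labels)
```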
Many modern applications of artificial intelligence involve, to some extent, an understanding of human attention, activity, intention, and competence from multimodal visual data. Nonverbal behavioral cues detected using computer vision and machine learning methods carry valuable information for understanding human behaviors, including attention and engagement. The use of such automated methods in educational settings has tremendous potential for good. Beneficial uses include classroom analytics to measure teaching quality and the development of interventions to improve teaching based on these analytics, as well as presentation analysis to help students deliver their messages persuasively and effectively. This dissertation presents a general framework based on multimodal visual sensing to analyze engagement and related tasks from visual modalities. While the majority of the engagement literature in affective and social computing focuses on computer-based learning and educational games, we investigate automated engagement estimation in the classroom using different nonverbal behavioral cues and develop methods to extract attentional and emotional features. Furthermore, we validate the effectiveness of the proposed approaches on real-world data collected from videotaped classes at a university and a secondary school. In addition to learning activities, we perform behavior analysis on students giving short scientific presentations using multimodal cues, including face, body, and voice features. Besides engagement and presentation competence, we approach human behavior understanding from a broader perspective by studying the analysis of joint attention in a group of people, teachers' perception using an egocentric camera view and mobile eye trackers, and the automated anonymization of audiovisual data in classroom studies. Educational analytics present valuable opportunities to improve learning and teaching. The work in this dissertation suggests a computational framework for estimating student engagement and presentation competence, together with supportive computer vision problems.
Extensive use of the internet has enabled easy access to many different sources, such as news and social media. Content shared on the internet cannot be fully fact-checked and, as a result, misinformation can spread quickly and easily. Recently, psychologists and economists have shown in many experiments that prior beliefs, knowledge, and the willingness to think deliberately are important determinants of who falls for fake news. Many of these studies rely only on self-reports, which suffer from social desirability bias. We need more objective measures of information processing, such as eye movements, to effectively analyze the reading of news. To give the research community the opportunity to study human behavior in relation to news truthfulness, we propose the FakeNewsPerception dataset. FakeNewsPerception consists of eye movements during reading, perceived believability scores, questionnaires including the Cognitive Reflection Test (CRT) and News-Find-Me (NFM) perception, and political orientation, collected from 25 participants viewing 60 news items. Initial analyses of the eye movements reveal that human perception differs when viewing true and fake news.
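One possible eye-movement analysis on a dataset like FakeNewsPerception is sketched below; the column names, placeholder values, and the choice of Welch's t-test are assumptions for illustration, not the dataset's actual schema or reported results.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Illustrative placeholder data: column names and values are assumptions, not the
# FakeNewsPerception schema or its reported results.
rng = np.random.default_rng(0)
fixations = pd.DataFrame({
    "news_label": ["true"] * 30 + ["fake"] * 30,
    "fixation_duration_ms": rng.normal(220, 40, 60),
})

true_dur = fixations.loc[fixations.news_label == "true", "fixation_duration_ms"]
fake_dur = fixations.loc[fixations.news_label == "fake", "fixation_duration_ms"]

# Compare mean fixation duration between true and fake news items with Welch's t-test.
t_stat, p_value = ttest_ind(true_dur, fake_dur, equal_var=False)
print(f"true mean={true_dur.mean():.1f} ms, fake mean={fake_dur.mean():.1f} ms, p={p_value:.3f}")
```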