Topic-oriented text features can match visual deep models of video memorability

Kleinlein, Ricardo; Luna-Jiménez, Cristina; Arias-Cuadrado, David; Ferreiros, Javier; Fernández-Martínez, Fernando

doi:10.3390/app11167406

Not every visual media production is equally retained in memory. Recent studies have shown that the elements of an image, as well as their mutual semantic dependencies, provide a strong clue as to whether a video clip will be recalled on a second viewing. We believe that short textual descriptions capture most of these relationships among the elements of a video and thus represent a rich yet concise source of information for predicting media memorability. In this paper, we further study short captions as a means of conveying the visual semantics of a video in natural language. We propose using vector embeddings from a pretrained SBERT topic detection model, with no adaptation, as input features to a linear regression model, and show that, from such a representation, simpler algorithms can outperform deep visual models. Our results suggest that text descriptions expressed in natural language might be effective in embodying the visual semantics required to model video memorability.…

Metadata
Author:	Ricardo Kleinlein, Cristina Luna-Jiménez, David Arias-Cuadrado, Javier Ferreiros, Fernando Fernández-Martínez
URN:	urn:nbn:de:bvb:384-opus4-1226521
Frontdoor URL:	https://opus.bibliothek.uni-augsburg.de/opus4/122652
ISSN:	2076-3417
Parent Title (English):	Applied Sciences
Publisher:	MDPI AG
Type:	Article
Language:	English
Year of first Publication:	2021
Publishing Institution:	Universität Augsburg
Release Date:	2025/06/05
Volume:	11
Issue:	16
First Page:	7406
DOI:	https://doi.org/10.3390/app11167406
Institutes:	Faculty of Applied Computer Science
	Faculty of Applied Computer Science / Institute of Computer Science
	Faculty of Applied Computer Science / Institute of Computer Science / Chair for Human-Centered Artificial Intelligence
Dewey Decimal Classification:	0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Licence:	CC BY 4.0: Creative Commons: Attribution

Open Access
