Coherent multi-sentence video description with variable level of detail

Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description focus on generating only single sentences and are not able to vary the descriptions’ level of detail. In this paper, we address both of these limitations: we produce coherent multi-sentence descriptions of complex videos at a variable level of detail. To understand the difference between detailed and short descriptions, we collect and analyze a video description corpus with three levels of detail. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For our multi-sentence descriptions we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than those of related work.
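
The two-step idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the score table, the `(activity, object)` form of the SR, the sum-of-best-scores rule for choosing a topic, and the sentence template are all assumptions made purely for illustration. The sketch picks one topic for the whole video, selects the best SR per segment under that topic, and then renders one sentence per SR.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a semantic representation (SR): (activity, object).
SR = Tuple[str, str]
# Hypothetical classifier output: one dict per video segment,
# mapping topic -> candidate SR -> score.
Scores = List[Dict[str, Dict[SR, float]]]


def predict_consistent_srs(scores: Scores) -> Tuple[str, List[SR]]:
    """Step 1: choose a single topic for the video, then one SR per segment."""
    # Only topics that have candidates in every segment are considered.
    topics = set.intersection(*(set(segment) for segment in scores))

    def total(topic: str) -> float:
        # Assumed rule: sum of the best per-segment score under this topic.
        return sum(max(segment[topic].values()) for segment in scores)

    topic = max(sorted(topics), key=total)
    srs = [max(segment[topic], key=segment[topic].get) for segment in scores]
    return topic, srs


def generate(srs: List[SR]) -> str:
    """Step 2: turn each SR into a sentence with a fixed (assumed) template."""
    return " ".join(f"The person {activity} the {obj}." for activity, obj in srs)


# Made-up scores for three segments. Taken alone, the first segment
# slightly prefers "cucumber", but "carrot" wins over the whole video.
scores: Scores = [
    {"carrot": {("washes", "carrot"): 0.6}, "cucumber": {("washes", "cucumber"): 0.7}},
    {"carrot": {("peels", "carrot"): 0.9}, "cucumber": {("peels", "cucumber"): 0.4}},
    {"carrot": {("slices", "carrot"): 0.8}, "cucumber": {("slices", "cucumber"): 0.5}},
]

topic, srs = predict_consistent_srs(scores)
description = generate(srs)
```

With these made-up scores, choosing each segment independently would mention a cucumber in the first sentence and a carrot in the rest, whereas fixing the topic first keeps all three sentences about the same object.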

Metadata
Author:	Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, Bernt Schiele
Frontdoor URL:	https://opus.bibliothek.uni-augsburg.de/opus4/105700
ISBN:	978-3-319-11751-5
ISBN:	978-3-319-11752-2
ISSN:	0302-9743
ISSN:	1611-3349
Parent Title (English):	Pattern Recognition: 36th German Conference, GCPR 2014, Münster, Germany, September 2-5, 2014, Proceedings
Publisher:	Springer
Place of publication:	Cham
Editor:	Xiaoyi Jiang, Joachim Hornegger, Reinhard Koch
Type:	Conference Proceeding
Language:	English
Year of first Publication:	2014
Release Date:	2023/07/10
First Page:	184
Last Page:	195
Series:	Lecture Notes in Computer Science ; 8753
DOI:	https://doi.org/10.1007/978-3-319-11752-2_15
Institutes:	Fakultät für Angewandte Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik / Professur für Sprachverstehen mit der Anwendung Digital Humanities

Open Access