Frame-level event detection in athletics videos with pose-based convolutional sequence networks (2019)
In this paper we address the problem of motion event detection in athlete recordings from individual sports. In contrast to recent end-to-end approaches, we propose to use 2D human pose sequences as an intermediate representation that decouples human motion from the raw video information. Combined with domain-adapted athlete tracking, we describe two approaches to event detection on pose sequences and evaluate them in complementary domains: swimming and athletics. For swimming, we show how robust decision rules on pose statistics can detect different motion events during swim starts, with an F1 score of over 91% despite limited data. For athletics, we use a convolutional sequence model to infer stride-related events in long and triple jump recordings, leading to highly accurate detections with an F1 score of 96% at only ±5 ms temporal deviation. Our approach is not limited to these domains and shows the flexibility of pose-based motion event detection.
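To make the second approach concrete, here is a minimal sketch of a pose-based convolutional sequence network for frame-level event detection, written in PyTorch. All names, layer sizes, and the number of event classes are illustrative assumptions of ours, not the architecture from the paper; the point is only the pattern of mapping a sequence of 2D poses to per-frame event scores.

```python
import torch
import torch.nn as nn

class PoseEventDetector(nn.Module):
    """Illustrative sketch: per-frame 2D poses in, per-frame event scores out.
    Layer sizes and event classes are hypothetical, not from the paper."""
    def __init__(self, num_joints=14, num_events=4, hidden=128):
        super().__init__()
        in_dim = num_joints * 2  # flattened (x, y) per joint
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            # dilated convolution widens the temporal receptive field
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=8, dilation=2),
            nn.ReLU(),
            nn.Conv1d(hidden, num_events + 1, kernel_size=1),  # +1: background
        )

    def forward(self, poses):
        # poses: (batch, frames, joints, 2); Conv1d expects (batch, channels, frames)
        b, t, j, c = poses.shape
        x = poses.view(b, t, j * c).transpose(1, 2)
        return self.net(x).transpose(1, 2)  # (batch, frames, classes)

# Usage: per-frame event probabilities for a 100-frame pose sequence
model = PoseEventDetector()
probs = model(torch.randn(1, 100, 14, 2)).softmax(dim=-1)
```

A typical post-processing step would then pick local maxima of the per-frame scores as the detected event locations.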
The current state-of-the-art in monocular 3D human pose estimation is heavily influenced by weakly supervised methods. These allow 2D labels to be used to learn effective 3D human pose recovery, either directly from images or via 2D-to-3D pose uplifting. In this paper we present a detailed analysis of the most commonly used simplified projection models, which relate the estimated 3D pose representation to 2D labels: the normalized perspective and the weak perspective projection. Specifically, we derive theoretical lower bound errors for these projection models under the commonly used mean per-joint position error (MPJPE). Additionally, we show how the normalized perspective projection can be replaced to avoid this guaranteed minimal error. We evaluate the derived lower bounds on the most commonly used 3D human pose estimation benchmark datasets. Our results show that both projection models lead to an inherent minimal error between 19.3 mm and 54.7 mm, even after alignment in position and scale. This is a considerable share when compared with recent state-of-the-art results. Our paper thus establishes a theoretical baseline that shows the importance of suitable projection models in weakly supervised 3D human pose estimation.
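For orientation, the two quantities at the center of this analysis are easy to state. The NumPy sketch below gives the textbook form of the weak perspective projection and the MPJPE metric; the function names and toy data are our own, and the 19.3 mm to 54.7 mm bounds quoted above come from the paper's derivation, not from this code.

```python
import numpy as np

def weak_perspective(pose_3d, s):
    """Weak perspective projection: one shared scale s for all joints,
    so per-joint depth differences are ignored. pose_3d: (J, 3)."""
    return s * pose_3d[:, :2]

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example with random poses; the numbers are illustrative only.
gt = np.random.randn(17, 3)
pred = gt + 0.02 * np.random.randn(17, 3)
print(f"MPJPE: {mpjpe(pred, gt):.4f}")
```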
The state-of-the-art for monocular 3D human pose estimation in videos is dominated by the paradigm of 2D-to-3D pose uplifting. While the uplifting methods themselves are rather efficient, the true computational complexity depends on the per-frame 2D pose estimation. In this paper, we present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences but still produce temporally dense 3D pose estimates. We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks. This allows us to decouple the sampling rate of input 2D poses from the target frame rate of the video and drastically decreases the total computational complexity. Additionally, we explore the option of pre-training on large motion capture archives, which has been largely neglected so far. We evaluate our method on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP. With an MPJPE of 45.0 mm and 46.9 mm, respectively, our proposed method can compete with the state-of-the-art while reducing inference time by a factor of 12. This enables real-time throughput with variable consumer hardware in stationary and mobile applications. We release our code and models at https://github.com/goldbricklemon/uplift-upsample-3dhpe
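The linked repository contains the actual implementation. As a rough, self-contained sketch of the masked-token upsampling idea (all dimensions, names, and the 4x stride here are assumptions of ours): sparse input 2D poses are embedded as tokens, the missing frames are filled with a learned mask token, and the Transformer predicts a 3D pose for every frame.

```python
import torch
import torch.nn as nn

class SparseToDenseUplift(nn.Module):
    """Illustrative sketch of masked-token temporal upsampling: 2D poses
    sampled every `stride` frames in, dense per-frame 3D poses out."""
    def __init__(self, num_joints=17, dim=256, stride=4, num_frames=64):
        super().__init__()
        self.stride = stride
        self.embed = nn.Linear(num_joints * 2, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_frames, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_joints * 3)

    def forward(self, sparse_2d):
        # sparse_2d: (batch, frames // stride, joints, 2)
        b, t_sparse, j, _ = sparse_2d.shape
        t_dense = t_sparse * self.stride
        # every frame starts as a mask token ...
        tokens = self.mask_token.expand(b, t_dense, -1).clone()
        # ... and the sparsely observed frames overwrite their slots
        tokens[:, ::self.stride] = self.embed(sparse_2d.flatten(2))
        x = self.encoder(tokens + self.pos[:, :t_dense])
        return self.head(x).view(b, t_dense, j, 3)

# 16 input 2D poses (every 4th frame) -> 64 dense 3D pose estimates
model = SparseToDenseUplift()
print(model(torch.randn(2, 16, 17, 2)).shape)  # torch.Size([2, 64, 17, 3])
```

In this sketch the efficiency gain comes from running the (comparatively expensive) 2D pose estimator on only every fourth frame; the Transformer fills in the remaining frames.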
Whenever images or videos of humans are recorded, the intention is often to capture the behavior, the physical actions or the general body motion of the depicted persons. With the omnipresence of standalone or integrated cameras in all areas of daily life, creating such recordings has become trivial. But the extraction of information about human motion from visual data remains a difficult task that normally requires other persons to watch and understand the recordings. At the same time, a wide range of applications can benefit from human motion understanding in visual data. Surveillance systems, human-machine interfaces or the assessment of physical capabilities of humans are just some exemplary application domains that rely on the automatic retrieval of human motion information. In this work, we investigate different approaches and techniques for automated human motion analysis in video data. We build upon the huge progress in computer vision research over the past decade. Specifically, we propose to use the long-standing task of 2D human pose estimation (HPE) as an ideal computer vision building block for human motion representation and understanding.
The first main topic of this thesis covers solutions for motion analysis that directly build upon 2D human pose estimates from videos. Video-based performance analysis in professional sports serves as our main testing ground. In particular, we consider the automated analysis of swimming motion in specialized swimming channels. We present crucial changes to existing 2D HPE models in order to include domain-dependent prior information about expected human motion and to ensure a temporally coherent motion representation. We also tackle the specific task of motion event detection, where we locate characteristic moments during swimming and athletics motion at the video frame level. We present two different solutions that are designed for visually challenging environments and crowded scenes. We additionally consider an application for motion analysis in the context of rehabilitation medicine, where we detect speech impairments purely from facial motion.
Our second topic is the more general task of reconstructing human posture and motion in 3D space from 2D poses alone. Access to 3D motion information provides an even stronger foundation for subsequent, domain-independent motion analysis. We consider two challenges in the 2D-to-3D pose estimation approach. First, we provide a theoretical analysis of existing 3D HPE models that are particularly relevant for domains with limited annotated 3D pose data. We identify commonly used camera models as a severe structural limitation to the maximal precision of 3D pose estimates. The second challenge is the computational demand of current 3D HPE models on video data. We leverage the state-of-the-art in Transformer architectures and develop a flexible model that can be freely adjusted in its inherent trade-off between 3D pose precision and overall processing efficiency. This enables effective 3D HPE in videos on consumer hardware.