Learning approaches for interactive segmentation: from unimodal to multimodal interactive segmentation
One of the most important tasks computer vision has automated is the exact localization of objects in images. When we consider the notion of an exact location of an object in an image, we would like to know precisely which pixels are occupied by the object and which are not. This pixel-precise localization is referred to as image segmentation in the context of computer vision. In recent years, deep neural networks have become the central part of most systems for image segmentation. Although the methods and architectures for the various image segmentation tasks have improved over the years, training neural networks for such systems still relies on the availability of large amounts of annotated data. For segmentation tasks in particular, annotated data is hard to acquire: while the images themselves are usually easy to procure at a large scale, manually creating the masks for surfaces in an image requires considerable effort.
To ease the process of manually creating segmentation masks, interactive segmentation systems have been developed. These systems let the user place clicks on the image and then attempt to automatically infer a high-quality mask on the basis of these clicks.
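A common way such systems feed clicks to a network (a generic sketch of the widely used disk encoding, not necessarily the specific scheme employed in this thesis) is to rasterize positive and negative clicks into guidance maps and concatenate them with the RGB channels:

```python
import numpy as np

def clicks_to_guidance(shape, clicks, radius=5):
    """Rasterize (y, x) click coordinates into a binary disk map."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    guidance = np.zeros((h, w), dtype=np.float32)
    for cy, cx in clicks:
        # Mark a disk of the given radius around each click.
        disk = (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2
        guidance[disk] = 1.0
    return guidance

# Build a 5-channel network input: RGB plus positive and negative click maps.
image = np.zeros((64, 64, 3), dtype=np.float32)    # placeholder RGB image
pos = clicks_to_guidance((64, 64), [(20, 20)])     # clicks on the object
neg = clicks_to_guidance((64, 64), [(50, 50)])     # clicks on the background
net_input = np.concatenate([image, pos[..., None], neg[..., None]], axis=-1)
print(net_input.shape)  # (64, 64, 5)
```

The segmentation network then consumes this stacked tensor and is trained to produce a mask consistent with the clicked locations; after each new click, the guidance maps are updated and the network is queried again.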
In this thesis, we investigate interactive segmentation systems that are based on neural networks. In the first part, we focus on architectures that predict segmentation masks on the basis of RGB images. We first propose a method that allows the network to continue learning while it is in use, relying only on information generated as a byproduct of using the model. Afterwards, we propose a novel network architecture for interactive segmentation, designed to allow for quick responses after each click.
In the second part of this thesis, we leverage modalities other than RGB images to improve the segmentation performance of our networks. We use the geometric information from depth maps as an additional input modality alongside the RGB images, which results in better segmentation masks. Since corresponding depth maps are not generally available for arbitrary images, we generate pseudo depth maps using networks that have been pretrained for the task of monocular depth estimation. Even when replacing RGB images with high-quality depth maps as the input modality, we observe performance increases in some scenarios. We also develop a novel architecture that is capable of integrating information from an arbitrary number of modalities. The multi-modal fusion strategy is designed to allow the use of an inaccessible gray-box feature extractor for RGB images. On top of this, we propose an extended version of the evaluation mechanism for interactive segmentation that accounts for challenges arising when multiple surfaces in the same image must be segmented.


| Author: | Robin Schön |
|---|---|
| URN: | urn:nbn:de:bvb:384-opus4-1267244 |
| Frontdoor URL: | https://opus.bibliothek.uni-augsburg.de/opus4/126724 |
| Advisor: | Rainer Lienhart |
| Type: | Doctoral Thesis |
| Language: | English |
| Date of Publication (online): | 2026/01/19 |
| Year of first Publication: | 2026 |
| Publishing Institution: | Universität Augsburg |
| Granting Institution: | Universität Augsburg, Fakultät für Angewandte Informatik |
| Date of final exam: | 2025/12/01 |
| Release Date: | 2026/01/19 |
| Tag: | Computer Vision; Image Segmentation; Interactive Segmentation; Machine Learning; Multimodal Segmentation |
| GND-Keyword: | Mustererkennung; Bildverarbeitung; Maschinelles Lernen |
| Page Number: | xii, 170 |
| Institutes: | Fakultät für Angewandte Informatik |
| | Fakultät für Angewandte Informatik / Institut für Informatik |
| | Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Maschinelles Lernen und Maschinelles Sehen |
| Dewey Decimal Classification: | 0 Informatik, Informationswissenschaft, allgemeine Werke / 00 Informatik, Wissen, Systeme / 004 Datenverarbeitung; Informatik |
| Licence (German): | Deutsches Urheberrecht |



