Imitation learning by state-only distribution matching

Imitation learning from observation describes policy learning in a way similar to human learning: an agent's policy is trained by observing an expert performing a task. Although many state-only imitation learning approaches are based on adversarial imitation learning, one main drawback is that adversarial training is often unstable and lacks a reliable convergence estimator. If the true environment reward is unknown and cannot be used to select the best-performing model, this can result in poor real-world policy performance. We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric. Our training objective minimizes the Kullback-Leibler divergence (KLD) between the policy and expert state transition trajectories, which can be optimized in a non-adversarial fashion. Such methods demonstrate improved robustness when learned density models guide the optimization. We further improve the sample efficiency by rewriting the KLD minimization as the Soft Actor-Critic objective based on a modified reward using additional density models that estimate the environment's forward and backward dynamics. Finally, we evaluate the effectiveness of our approach on well-known continuous control environments and show state-of-the-art performance while having a reliable performance estimator compared to several recent learning-from-observation methods.

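The abstract outlines rewriting the state-transition KLD minimization as a Soft Actor-Critic objective with a modified reward derived from learned density models of the environment dynamics. The sketch below is only a rough illustration of that general idea using forward transition densities; the Gaussian model class, the exact reward form, and all names (ConditionalDensity, matching_reward, expert_model, policy_model) are assumptions made for this sketch, not the paper's implementation.

```python
# Illustrative sketch only: a distribution-matching reward built from two
# learned conditional density models, to be used in place of an unknown
# environment reward inside a SAC-style loop. The reward decomposition and
# model classes are assumptions, not taken from the paper.

import torch


class ConditionalDensity(torch.nn.Module):
    """Simple diagonal-Gaussian density model q(y | x), e.g. a forward dynamics model."""

    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(x_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 2 * y_dim),
        )

    def log_prob(self, x, y):
        mean, log_std = self.net(x).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
        return dist.log_prob(y).sum(dim=-1)


def matching_reward(s, s_next, expert_model, policy_model):
    """Reward that is high where the policy's state transitions resemble the
    expert's: log q_expert(s' | s) - log q_policy(s' | s). Maximizing its
    expectation under the policy corresponds to reducing a KLD between the
    two state-transition distributions (illustrative form only)."""
    with torch.no_grad():
        return expert_model.log_prob(s, s_next) - policy_model.log_prob(s, s_next)
```

In such a setup, the reward would be computed for sampled transitions and fed to the SAC update instead of the environment reward, with the density models refit from expert demonstrations and the policy's own rollouts. The paper's modified reward additionally uses a backward dynamics density model, which this simplified sketch omits.
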
Metadata
Author:Damian Boborzi, Christoph-Nikolas Straehle, Jens S. Buchner, Lars Mikelsons
URN:urn:nbn:de:bvb:384-opus4-1097996
Frontdoor URL:https://opus.bibliothek.uni-augsburg.de/opus4/109799
ISSN:0924-669X
ISSN:1573-7497
Parent Title (English):Applied Intelligence
Publisher:Springer
Place of publication:Berlin
Type:Article
Language:English
Year of first Publication:2023
Publishing Institution:Universität Augsburg
Release Date:2023/12/06
Tag:Artificial Intelligence
Volume:53
First Page:30865
Last Page:30886
DOI:https://doi.org/10.1007/s10489-023-05062-w
Institutes:Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Ingenieurinformatik mit Schwerpunkt Mechatronik
Dewey Decimal Classification:0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing; computer science
Licence:CC-BY 4.0: Creative Commons: Attribution (with Print on Demand)