To cite this version: Julien Audiffren, Michal Valko, Alessandro Lazaric, Mohammad Ghavamzadeh. MESSI: Maximum Entropy Semi-Supervised Inverse Reinforcement Learning. NIPS Workshop on Novel Trends and Applications in Reinforcement Learning, 2014, Montreal, Canada.

HAL Id: hal-01177446 https://hal.inria.fr/hal-01177446 Submitted on 16 Jul 2015



MESSI: Maximum Entropy Semi-Supervised Inverse Reinforcement Learning

Julien Audiffren CMLA, ENS Cachan [email protected]

Michal Valko INRIA Lille – Nord Europe [email protected]

Alessandro Lazaric INRIA Lille – Nord Europe [email protected]

Mohammad Ghavamzadeh INRIA/Adobe Research [email protected]

Introduction. The most common approach to solving a sequential decision-making problem is to formulate it as a Markov decision process (MDP). This formulation requires the definition of a reward function, but in many applications, such as driving or playing tennis, it is easier and more natural to learn how to perform the task by observing an expert’s demonstrations than by defining a reward function. The task of learning from an expert is called apprenticeship learning (AL). A powerful and relatively novel approach to AL is to formulate it as an inverse reinforcement learning (IRL) problem [6]. The basic idea is to assume that the expert is trying to optimize an MDP and to derive an algorithm for learning the task demonstrated by the expert [6, 1]. This approach has been shown to be effective in learning non-trivial tasks such as inverted helicopter flight control [5], ball-in-a-cup [2], and driving on a highway [1, 4]. In the IRL approach to AL, we assume that several trajectories generated by an expert are available and that the unknown reward function optimized by the expert can be specified as a linear combination of a number of state features. In many applications, in addition to the expert’s trajectories, we may have access to a large number of trajectories that are not necessarily performed by an “expert”. For example, in learning to drive, we may ask an expert driver to demonstrate a few trajectories and use them in an AL algorithm to mimic her behavior. At the same time, we may record trajectories from many other drivers whose quality we cannot assess and who may or may not demonstrate expert-level behavior. We refer to these as unsupervised trajectories and to the task of learning with them as semi-supervised apprenticeship learning, following Valko et al. [8], who combine the IRL approach of Abbeel and Ng [1] with semi-supervised SVMs.
However, unlike in classification, we do not regard the unsupervised trajectories as a mixture of expert and non-expert classes. This is because the unsupervised trajectories might have been generated by the expert herself, by other experts, by near-expert agents, or by agents maximizing different reward functions, or they may simply be noisy data. The objective of IRL is to find the reward function that the expert trajectories maximize, and thus semi-supervised apprenticeship learning cannot be considered a special case of semi-supervised classification.

Maximum Entropy Semi-Supervised Inverse Reinforcement Learning. We propose the algorithm MESSI (MaxEnt Semi-Supervised IRL, see Algorithm 1) to address the challenge above by combining the MaxEnt-IRL approach of Ziebart et al. [9] with semi-supervised learning (SSL). MESSI integrates the unsupervised trajectories in a principled way such that it performs better than MaxEnt-IRL. For this purpose, we assume that the learner is provided with a set of expert trajectories $\Sigma^* = \{\zeta_i^*\}_{i=1}^{l}$ and a set of unsupervised trajectories $\widetilde{\Sigma} = \{\zeta_j\}_{j=1}^{u}$. We also assume that a function $s$ is provided to measure the similarity $s(\zeta, \zeta')$ between any pair of trajectories $(\zeta, \zeta')$. We define the pairwise penalty $R$ as

$$R(\theta|\Sigma) = \frac{1}{|\Sigma|} \sum_{\zeta, \zeta' \in \Sigma} s(\zeta, \zeta') \left(\theta^{T} (f_{\zeta} - f_{\zeta'})\right)^2, \tag{1}$$

where $\Sigma = \Sigma^* \cup \widetilde{\Sigma}$, $f_{\zeta}$ and $f_{\zeta'}$ are the feature counts for trajectories $\zeta, \zeta' \in \Sigma$, and $(\theta^{T} (f_{\zeta} - f_{\zeta'}))^2 = (\bar{r}_{\theta}(\zeta) - \bar{r}_{\theta}(\zeta'))^2$ is the squared difference in rewards accumulated by the two trajectories w.r.t. the reward vector $\theta$. The purpose of the pairwise penalty is to penalize reward vectors $\theta$ that assign very different rewards to similar trajectories (as measured by $s(\zeta, \zeta')$). We then follow the framework of Erkan and Altun [3] to integrate the regularization $R$ into the learning objective. This leads to the following optimization problem:

$$\theta^* = \arg\max_{\theta} \left( L(\theta|\Sigma^*) - \lambda R(\theta|\Sigma) \right), \tag{2}$$

where $L$ is the log-likelihood of $\theta$ w.r.t. the expert’s trajectories and $\lambda$ is a parameter trading off between $L$ and the coherence with the similarity between the provided trajectories in both $\widetilde{\Sigma}$ and $\Sigma^*$. Although hand-crafted similarity functions usually perform better, our experiments show that even the simple RBF $s(\zeta, \zeta') = \exp(-\|f_{\zeta} - f_{\zeta'}\|^2 / 2\sigma)$, where $\sigma$ is the bandwidth, is an effective similarity for the feature counts.

Algorithm 1 MESSI - MaxEnt SSIRL
  Input: Set of $l$ expert trajectories $\Sigma^* = \{\zeta_i^*\}_{i=1}^{l}$, set of $u$ unsupervised trajectories $\widetilde{\Sigma} = \{\zeta_j\}_{j=1}^{u}$, similarity function $s$, number of iterations $T$, constraint $\theta_{\max}$, regularizer $\lambda_0$
  Initialization: Compute $\{f_{\zeta_i^*}\}_{i=1}^{l}$, $\{f_{\zeta_j}\}_{j=1}^{u}$ and $f^* = \frac{1}{l} \sum_{i=1}^{l} f_{\zeta_i^*}$, and generate a random reward vector $\theta_0$
  for $t = 1$ to $T$ do
    1. Compute policy $\pi_{t-1}$ from $\theta_{t-1}$ (solving the MDP)
    2. Compute feature counts $f_{t-1}$ of $\pi_{t-1}$ (forward pass of MaxEnt)
    3. Update the reward vector to $\theta_t$ by taking a gradient step on (2)
    4. If $\|\theta_t\|_\infty > \theta_{\max}$, project back by $\theta_t \leftarrow \theta_t \, \theta_{\max} / \|\theta_t\|_\infty$
  end for
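As a rough illustration of the pairwise penalty in (1), the RBF similarity, and steps 3 and 4 of Algorithm 1, the following minimal Python sketch may help. It is not the authors' implementation. It assumes that the gradient of the log-likelihood $L$ takes the standard MaxEnt-IRL form $f^* - f_{t-1}$ (expert feature counts minus the feature counts of the current policy), and that the gradient of $R$ is obtained by differentiating (1) directly. The function names, the fixed step size `step`, and the argument layout are hypothetical choices made for the example; solving the MDP and computing the policy feature counts are left to the caller.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def rbf_similarity(f1, f2, sigma):
    """RBF similarity exp(-||f1 - f2||^2 / (2 * sigma)) between feature counts."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(f1, f2))
    return math.exp(-sq_dist / (2.0 * sigma))


def pairwise_penalty(theta, feats, sigma):
    """Pairwise penalty R(theta | Sigma) of Eq. (1) over all trajectory pairs."""
    total = 0.0
    for fi in feats:
        for fj in feats:
            gap = dot(theta, fi) - dot(theta, fj)  # reward difference
            total += rbf_similarity(fi, fj, sigma) * gap ** 2
    return total / len(feats)


def penalty_gradient(theta, feats, sigma):
    """Gradient of Eq. (1): (2/|Sigma|) * sum s(z, z') (theta^T d) d, d = f_z - f_z'."""
    n, dim = len(feats), len(theta)
    grad = [0.0] * dim
    for fi in feats:
        for fj in feats:
            diff = [a - b for a, b in zip(fi, fj)]
            weight = rbf_similarity(fi, fj, sigma) * dot(theta, diff)
            for k in range(dim):
                grad[k] += 2.0 * weight * diff[k] / n
    return grad


def messi_step(theta, f_expert, f_policy, feats, sigma, lam, step, theta_max):
    """One update of theta: gradient ascent on (2), then sup-norm projection.

    Assumption: grad L = f_expert - f_policy (standard MaxEnt-IRL gradient).
    """
    g_pen = penalty_gradient(theta, feats, sigma)
    new_theta = [
        t + step * ((fe - fp) - lam * g)
        for t, fe, fp, g in zip(theta, f_expert, f_policy, g_pen)
    ]
    norm = max(abs(t) for t in new_theta)
    if norm > theta_max:  # step 4 of Algorithm 1
        new_theta = [t * theta_max / norm for t in new_theta]
    return new_theta


if __name__ == "__main__":
    # Toy feature counts for three trajectories (expert and unsupervised pooled).
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    theta = [0.5, -0.3]
    print(pairwise_penalty(theta, feats, sigma=1.0))
    print(messi_step(theta, [1.0, 0.0], [0.5, 0.5], feats, 1.0, 0.1, 0.05, 1.0))
```

The double loop over trajectory pairs costs $O(|\Sigma|^2)$ per evaluation, which is acceptable for a sketch but would likely need vectorization or a sparse similarity for large sets of unsupervised trajectories.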

[Figure 1 plots: vertical axis Reward; horizontal axis Number of iterations (top row) or parameter lambda on a logarithmic scale (bottom row). Curves: MaxEnt, MESSIMAX, and MESSI with distributions Pµ1, Pµ2, Pµ3.]

Figure 1: Results as a function of the number of iterations (top) and of the parameter lambda (bottom) for the MaxEnt, MESSIMAX (MESSI with all unsupervised trajectories drawn from the expert) and MESSI (with different distributions of unsupervised trajectories) algorithms on the highway driving (left) and grid-world (right) domains.

Experimental Results. Our experiments show that MESSI takes advantage of unsupervised trajectories and can perform better and more efficiently than MaxEnt-IRL in the highway driving problem of Syed et al. [7] and the grid-world domain of Abbeel and Ng [1] (see Figure 1).


References

[1] P. Abbeel and A. Ng. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[2] A. Boularias, J. Kober, and J. Peters. Relative Entropy Inverse Reinforcement Learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15, pages 182–189, 2011.
[3] A. Erkan and Y. Altun. Semi-Supervised Learning via Generalized Maximum Entropy. In Proceedings of JMLR Workshop, pages 209–216. New York University, 2009.
[4] S. Levine, Z. Popovic, and V. Koltun. Nonlinear Inverse Reinforcement Learning with Gaussian Processes. In Advances in Neural Information Processing Systems 24, pages 1–9, 2011.
[5] A. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Inverted Autonomous Helicopter Flight via Reinforcement Learning. In International Symposium on Experimental Robotics, 2004.
[6] A. Ng and S. Russell. Algorithms for Inverse Reinforcement Learning. In Proceedings of the 17th International Conference on Machine Learning, pages 663–670, 2000.
[7] U. Syed, R. Schapire, and M. Bowling. Apprenticeship Learning Using Linear Programming. In Proceedings of the 25th International Conference on Machine Learning, pages 1032–1039, 2008.
[8] M. Valko, M. Ghavamzadeh, and A. Lazaric. Semi-Supervised Apprenticeship Learning. In Proceedings of the 10th European Workshop on Reinforcement Learning, volume 24, pages 131–241, 2012.
[9] B. Ziebart, A. Maas, A. Bagnell, and A. Dey. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 2008.
