Papperoni

2017

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

AI summary

This paper introduces Proximal Policy Optimization (PPO), a new family of policy gradient methods that uses a novel objective function for multiple minibatch updates, empirically outperforming other online policy gradient methods on simulated robotic locomotion and Atari game playing benchmarks.

Main Contributions

Introduces Proximal Policy Optimization (PPO), a new family of policy gradient methods.
Proposes a novel objective function with clipped probability ratios that enables multiple epochs of minibatch updates per data sample.
PPO is simpler to implement and more general than TRPO, while offering similar benefits.
Empirically demonstrates PPO's superior sample complexity compared to other online policy gradient methods on benchmark tasks.
Achieves strong performance on simulated robotic locomotion and Atari game playing.
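
To make the clipped-ratio idea from the contributions above concrete, here is a minimal NumPy sketch of a clipped surrogate objective. The function name, argument names, and the default clip range of 0.2 are illustrative assumptions, not code from the paper; it assumes per-sample log-probabilities under the new and old policies and precomputed advantage estimates.

```python
import numpy as np

def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective, to be maximized (hypothetical helper)."""
    # Probability ratio between the new and old policy for each sampled action.
    ratio = np.exp(np.asarray(new_logp) - np.asarray(old_logp))
    advantages = np.asarray(advantages)
    unclipped = ratio * advantages
    # Clipping removes the incentive to push the ratio outside [1 - eps, 1 + eps].
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum gives a pessimistic bound on the unclipped objective.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

With a positive advantage, the objective stops growing once the ratio exceeds 1 + clip_eps; with a negative advantage, it stops improving once the ratio falls below 1 - clip_eps. This is what makes several passes over the same batch of data tolerable.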

Abstract

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
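
The "multiple epochs of minibatch updates" in the abstract can be pictured as reusing one batch of sampled data several times, reshuffled on each pass. The sketch below only generates the minibatch index sets; the function name and parameters are hypothetical, and the gradient step that would consume each index set is left out.

```python
import numpy as np

def minibatch_indices(n_samples, n_epochs, minibatch_size, seed=0):
    """Yield index arrays for several shuffled passes over one batch (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        # Reshuffle the same collected samples at the start of every epoch.
        perm = rng.permutation(n_samples)
        for start in range(0, n_samples, minibatch_size):
            # Each slice selects the samples for one gradient update.
            yield perm[start:start + minibatch_size]
```

For example, 8 samples, 3 epochs, and a minibatch size of 4 yield 6 index sets, so each sample contributes to 3 gradient updates instead of 1.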

References [14]

[1]Adam: A Method for Stochastic Optimization

D. P. Kingma, Jimmy Lei Ba - 2014

49 papers in library cite

Amazing paper! Very well explained and hugely impactful. I am amazed that they made something so simple, even though it requires a lot of background mathematical knowledge.

[2]Human-Level Control Through Deep Reinforcement Learning

V. Mnih - 2015

9 papers in library cite

Maybe the first instance of RL with NNs! I am impressed that it seems quite straightforward to implement (maybe tricky to get working).

[3]Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning

R. Williams - 1992

11 papers in library cite

It's alright for formalizing the concept, but it's a bit boring and doesn't add a lot from the middle on. It focuses too much on reviewing existing techniques and on stochastic units.

[4]Trust Region Policy Optimization

John Schulman, Sergey Levine, P. Abbeel, Michael I. Jordan, P. Moritz - 2015

4 papers in library cite

[5]High-Dimensional Continuous Control Using Generalized Advantage Estimation

John Schulman, P. Moritz, Sergey Levine, M. Jordan, P. Abbeel - 2015

5 papers in library cite

[6]The Arcade Learning Environment: An Evaluation Platform for General Agents

M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling - 2013

5 papers in library cite

[7]Asynchronous Methods for Deep Reinforcement Learning

V. Mnih, A. P. Badia, M. Mirza, Alex Graves, T. Lillicrap, T. Harley, D. Silver, Koray Kavukcuoglu - 2016

3 papers in library cite

[8]MuJoCo: A Physics Engine for Model-Based Control

E. Todorov, T. Erez, Y. Tassa - 2012

3 papers in library cite

[9]OpenAI Gym

Greg Brockman, V. Cheung, L. Pettersson, J. Schneider, John Schulman, Jie Tang, Wojciech Zaremba - 2016

3 papers in library cite

[10]Benchmarking Deep Reinforcement Learning for Continuous Control

Y. Duan, X. Chen, R. Houthooft, John Schulman, P. Abbeel - 2016

2 papers in library cite

[11]Approximately Optimal Approximate Reinforcement Learning

S. Kakade, John Langford - 2002

1 paper in library cites

[12]Emergence of Locomotion Behaviours in Rich Environments

N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Ziyu Wang, A. Eslami, M. Riedmiller - 2017

1 paper in library cites

[13]Learning Tetris Using the Noisy Cross-Entropy Method

I. Szita, A. Lorincz - 2006

1 paper in library cites

[14]Sample Efficient Actor-Critic With Experience Replay

Ziyu Wang, V. Bapst, N. Heess, V. Mnih, Rémi Munos, Koray Kavukcuoglu, N. de Freitas - 2016

1 paper in library cites

Read on May 21, 2026

Very simple methodology and very well explained. I also liked that they did a good job of motivating the method.