Why Reinforcement Learning?

In reinforcement learning (RL), an agent tries to maximize a cumulative reward while interacting with an environment. The agent observes the state of the environment, takes an action, and observes the reward received (if any) and the new state. Then the agent takes the next action, and the cycle continues until some final outcome ends the episode.
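The observe-act-observe cycle described above can be sketched in a few lines of plain Python. The `ToyEnv` class here is a hypothetical environment invented purely for illustration (a random walk that ends at position +3 or -3, rewarding only +3); it is not RLlib or Gym code, but its `reset()`/`step()` shape mirrors the standard agent-environment loop:

```python
import random

class ToyEnv:
    """Hypothetical toy environment: walk from 0 until reaching +3 or -3.

    Reward is 1.0 only for ending at +3. For illustration only.
    """
    def reset(self):
        self.pos = 0
        return self.pos                      # initial state

    def step(self, action):                  # action is +1 or -1
        self.pos += action
        done = abs(self.pos) == 3            # episode ends at either boundary
        reward = 1.0 if self.pos == 3 else 0.0
        return self.pos, reward, done        # new state, reward, episode over?

env = ToyEnv()
state = env.reset()
total_reward = 0.0
done = False
while not done:                              # the agent-environment loop
    action = random.choice([-1, 1])          # a (very naive) random policy
    state, reward, done = env.step(action)
    total_reward += reward
print(total_reward)
```

A real agent would replace `random.choice` with a learned policy that maps the observed state to an action.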

At each step, the agent tries to learn from experience what actions yield the best long-term, cumulative reward. When an action appears to be good, the agent can choose to exploit that action, but the agent should sometimes explore new actions, which might prove to be even better.
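The explore-versus-exploit trade-off can be made concrete with the classic epsilon-greedy strategy on a two-armed bandit. Everything here (the arm payoff probabilities, the `epsilon` value) is a made-up illustration, not RLlib's implementation:

```python
import random

# Hypothetical two-armed bandit: arm 1 pays off more often than arm 0.
true_win_prob = [0.3, 0.7]
estimates = [0.0, 0.0]   # running estimate of each arm's value
counts = [0, 0]          # how often each arm was pulled
epsilon = 0.1            # fraction of steps spent exploring

random.seed(0)
for step in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(2)              # explore: try a random arm
    else:
        arm = estimates.index(max(estimates))  # exploit: best-looking arm
    reward = 1.0 if random.random() < true_win_prob[arm] else 0.0
    counts[arm] += 1
    # Incremental average keeps the estimate up to date without storing history.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(counts, [round(e, 2) for e in estimates])
```

Even with only 10% of steps spent exploring, the agent discovers that arm 1 is better and exploits it for the vast majority of pulls.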

RL has been used to reach expert levels in Atari and other games, even beating the world’s best Go players. It is used to train autonomous vehicles and robots, automate industrial processes, and improve recommendations and ad-serving systems. New uses for RL emerge frequently.

Why Ray RLlib?

Reinforcement learning can be very demanding of compute resources, and it requires diverse compute patterns. The “smart” agent might have a large neural network behind it. The environment simulator may be a complex graph of parallel tasks and in-memory objects. Training requires many “episodes” of sequential agent-environment interaction, often run in parallel.

Ray RLlib is a flexible, high-performance system for building reinforcement learning applications that meets these requirements. It implements most of the state-of-the-art training algorithms available, and it is designed to be extended with new ones. It integrates with third-party systems such as TensorFlow and PyTorch for neural networks and OpenAI Gym for a wide variety of environment simulators.

At this time no other reinforcement learning system offers the breadth of options and state-of-the-art performance that RLlib offers. RLlib achieves these goals using Ray, a new system that makes it easy and fast to build distributed systems that scale across all the cores in your machine and all the machines in your cluster.

Using Ray RLlib

RLlib gives users a concise syntax for declaring what they want. Here is a “teaser” example: training the bipedal walker from OpenAI Gym to learn how to walk.

First, we train the agent:

import ray
import ray.rllib.agents.ppo as ppo

ray.init()                          # Start Ray before creating the trainer.
config = ppo.DEFAULT_CONFIG.copy()  # PPO's default configuration.
config['num_workers'] = 8           # 8 parallel rollout workers.
config['num_sgd_iter'] = 50         # 50 SGD iterations per training batch.
config['sgd_minibatch_size'] = 250  # 250 records per SGD minibatch.
# Neural network with two hidden layers of 512 units each.
config['model']['fcnet_hiddens'] = [512, 512]
agent = ppo.PPOTrainer(config, env="BipedalWalker-v3")
for n in range(100):
    result = agent.train()
    print(f'episode_reward_mean: {result["episode_reward_mean"]}')
checkpoint_path = agent.save()      # Save a checkpoint for rollouts below.

The episode_reward_mean is the most useful single measure of the agent’s overall “goodness”: the total reward earned per episode, averaged over recent episodes.

With our saved agent checkpoint, we can use the following shell commands to “roll out” episodes of the bipedal walker, using the trained agent to walk:

c='{"env":"BipedalWalker-v3","model":{"fcnet_hiddens":[512, 512]}}'
rllib rollout checkpoints/checkpoint_100/checkpoint-100 \
    --config "$c" --run PPO --steps 2000


You will see a window pop up showing the agent trying to make the “robot” walk, as I’ll demonstrate in my talk, “Hands-on Reinforcement Learning with Ray RLlib”!

Reinforcement learning is an exciting field of machine learning with many successes and great potential. Ray RLlib uses Ray to make RL accessible to any data scientist or developer.

About the author/ODSC Europe speaker:

Dean Wampler (@deanwampler) is an expert in data engineering, focusing on ML/AI. He is Head of Developer Relations at Anyscale.com, which develops Ray for distributed Python. Dean has written books and reports for O’Reilly, contributed to several open-source projects, and he speaks frequently at various conferences.

Connect on LinkedIn here.