As newer fields emerge within data science and their research remains hard to grasp, sometimes it’s best to talk to the experts and pioneers of the field. Recently, we spoke with Pieter Abbeel, PhD, Director of the Berkeley Robot Learning Lab and Co-Director of the Berkeley Artificial Intelligence Research (BAIR) Lab, ahead of his upcoming ODSC West 2022 training session on deep reinforcement learning. You can watch the full Lightning Interview here, and read the transcript of the first few questions below.

Q: First and foremost, tell us about your research.

My research, I would say, is generally in artificial intelligence, and it’s something I’ve been excited about for a very long time. I remember at the end of my undergraduate studies, and this is back in the late 90s, a long time ago, asking myself, “What is the most exciting thing I want to do for the rest of my life? What do I want to specialize in?” I remember thinking there’s just nothing more exciting than trying to understand how human intelligence works.

But then it also occurred to me that it’s very hard to understand human intelligence: it’s very hard to measure anything in the brain, and it was not clear how much progress was really possible. Then I thought, “Well, you know, what if we try to build intelligence?” Approaching it from an engineering perspective gives a lot of insight and also allows us to build systems that can be really helpful in many ways. Ever since then I’ve been pursuing AI research, and I’m still equally excited, if anything even more excited these days, because it’s proving useful now. When I started it was just an academic interest, but now there are so many applications, which is really exciting.

Q: Can you give us a quick overview of deep reinforcement learning?

Within AI, there are of course many areas of research, and my own research has been largely in deep reinforcement learning. The easiest way to explain reinforcement learning is by first explaining supervised learning. In supervised learning, the way you build an intelligent system is by giving examples. For example, you might give an image and say, “What’s in the image? A cat.” Give another image and say, “What’s in the image? Maybe a dog,” and keep giving examples. Then you would train a typically very large neural network to internalize the pattern between what it sees and what it should say is in the image. This is directly learning from examples, and it can work really well if you can collect enough examples for the pattern you want your AI system to internalize.
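The supervised setup described above can be sketched in a few lines: the learner internalizes a pattern from labeled examples and then predicts labels for new inputs. A minimal illustration, using made-up feature vectors and a nearest-neighbour rule as a stand-in for the large neural networks mentioned above:

```python
# Minimal sketch of supervised learning: a 1-nearest-neighbour classifier.
# The feature vectors and labels are invented for illustration; a real image
# classifier would train a large neural network on many more examples.

def nearest_neighbor(train, query):
    """Return the label of the training example closest to `query`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda ex: dist(ex[0], query))
    return label

# Toy labeled examples: (feature vector, label).
examples = [
    ((0.9, 0.1), "cat"),
    ((0.8, 0.2), "cat"),
    ((0.1, 0.9), "dog"),
    ((0.2, 0.8), "dog"),
]

print(nearest_neighbor(examples, (0.85, 0.15)))  # a point near the "cat" cluster
```

The point is only the shape of the problem: a mapping from inputs to labels is learned entirely from example pairs.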

But now reinforcement learning is a little different and has always been a bit more exciting to me, because in reinforcement learning, you’re actually not required to give examples. The idea in reinforcement learning is that you just specify the goal your AI is supposed to achieve. For example, maybe in a video game, maximize the score in the game. Or maybe for a robot, let’s say run as fast as possible, or maybe do a flip, or maybe pick and place items in a warehouse, as we do at Covariant. And so in reinforcement learning, you just specify what you want, and then the agent is supposed to figure out from its own experimentation what’s the best way to achieve this.

This has two really interesting benefits. One benefit is that it’s simpler, from a human perspective, as you don’t have to exhaustively give examples – you just provide a specification and then the system learns on its own. Of course, the system itself has a more complicated job to do, but it’s simpler for a human to specify what’s required. Another really interesting benefit is that it can actually exceed human capabilities, because you could specify something – some score or some metric – that it will then optimize, and it might find a way to achieve higher performance than you as a human could demonstrate. It effectively allows the system to discover new things in the direction you pointed it.
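The “specify a score, let the agent find the behavior” idea can be shown with a toy sketch. Here the only thing we write down is a reward function; a simple hill-climbing loop, standing in for a real RL algorithm, discovers a good setting on its own. All numbers are illustrative:

```python
import random

random.seed(0)

# The only thing the human specifies: a reward to maximize.
# (It happens to peak at x = 3.0, but the "agent" isn't told that.)
def reward(x):
    return -(x - 3.0) ** 2

# A trivial trial-and-error "agent": propose a random change,
# keep it if the reward improves.
x = 0.0
for _ in range(2000):
    candidate = x + random.uniform(-0.5, 0.5)
    if reward(candidate) > reward(x):
        x = candidate

print(round(x, 2))  # ends up close to 3.0
```

Note the contrast with the supervised setting: no example of “good behavior” was ever given, only a scoring rule.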

Q: Can you give us some background information for your upcoming session at ODSC West?

We’re going to cover a wide range of algorithms actually used in reinforcement learning. It’s interesting – in reinforcement learning, there’s no single approach that’s like “This is what you use.” Compare that to supervised learning, where you set up a neural net architecture, run backpropagation with gradient descent or stochastic gradient descent, and at the end of the day your model has been trained. In reinforcement learning, there’s a lot more complexity for the agent, and the field also hasn’t fully converged, so we’re going to cover a range of approaches.

When we think about reinforcement learning, there are two big categories of approaches. One category is model-based reinforcement learning. The idea here is that the agent doesn’t just learn from its trial and error in the world, but also builds an internal model of the world from that trial and error, and then simulates how things go in that model. So there’s additional learning happening in a learned simulator, which can often make it a lot more efficient in terms of the amount of real-world interaction required, because a lot of the agent’s learning happens in the simulator it builds on its own from its experiences.
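As a rough sketch of the model-based loop just described, the agent collects real experience, fits an empirical transition model, and then plans entirely inside that learned model. The toy chain environment and all constants here are invented for illustration:

```python
import random
from collections import defaultdict

random.seed(1)

# Toy chain environment: states 0..4, actions -1/+1, reward 1.0 at state 4.
def step(s, a):
    s2 = max(0, min(4, s + a))
    return s2, (1.0 if s2 == 4 else 0.0)

# 1) Collect real-world experience by random trial and error.
counts = defaultdict(lambda: defaultdict(int))
rewards = {}
s = 0
for _ in range(500):
    a = random.choice([-1, 1])
    s2, r = step(s, a)
    counts[(s, a)][s2] += 1
    rewards[(s, a, s2)] = r
    s = s2

# 2) Build an internal model: empirical transition probabilities ...
def model(s, a):
    total = sum(counts[(s, a)].values())
    return [(s2, n / total, rewards[(s, a, s2)]) for s2, n in counts[(s, a)].items()]

# 3) ... and plan inside it with value iteration, with no further
#    real-world interaction at all.
V = {s: 0.0 for s in range(5)}
for _ in range(50):
    V = {s: max(sum(p * (r + 0.9 * V[s2]) for s2, p, r in model(s, a))
                for a in [-1, 1])
         for s in range(5)}

print(V[0] < V[4])  # planned values should increase toward the rewarding state
```

Only step 1 touches the real environment; the bulk of the computation in step 3 runs in the learned model, which is exactly the sample-efficiency argument made above.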

We’ll also look at model-free approaches. In model-free approaches, the data the agent collects through trial and error is used to directly improve a policy or a Q function, which are essentially the underlying constructs the agent uses to make its decisions the next time it gets to experiment in the world.
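A minimal model-free counterpart is tabular Q-learning: each piece of experience directly updates a Q function, and the policy is read off as the argmax action, with no model of the environment ever built. The toy chain environment and constants here are purely illustrative:

```python
import random
from collections import defaultdict

random.seed(0)

# Toy chain environment: states 0..4, actions -1/+1, reward 1.0 at state 4.
def step(s, a):
    s2 = max(0, min(4, s + a))
    return s2, (1.0 if s2 == 4 else 0.0)

Q = defaultdict(float)        # Q[(state, action)], initialized to 0
alpha, gamma = 0.5, 0.9       # learning rate and discount factor

s = 0
for _ in range(5000):
    a = random.choice([-1, 1])            # explore at random
    s2, r = step(s, a)
    # Q-learning update: nudge Q(s, a) toward the observed reward plus
    # the discounted value of the best next action.
    target = r + gamma * max(Q[(s2, -1)], Q[(s2, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = 0 if s2 == 4 else s2              # restart the episode at the goal

# The greedy policy at each non-goal state picks the higher-valued action.
policy = {s: max([-1, 1], key=lambda a: Q[(s, a)]) for s in range(4)}
print(policy)  # expect +1 ("move right") at every non-goal state
```

The experience stream here is identical to the model-based sketch; the difference is that it flows straight into the Q function instead of into a learned simulator.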

We’ll look at both of these and the foundations underlying them. Even within those categories there are different methods, because it’s not a converged field, so I’ll show you the nuances of why people might prefer different methods in different situations. It can depend on how expensive it is to collect data in the environment, how expensive it is to simulate, how large the neural network being trained is, and so forth.

We’ll look at those nuances, and then from there, once you’ve seen the foundations, we’ll also look at a couple of very important aspects of getting reinforcement learning to work for real-world applications. When you think about reinforcement learning in the purest sense, you think of an agent that just starts from scratch and figures it out. But for many real-world problems, and even academic research problems like simulated robotics problems or games and so forth, it’s actually sometimes very hard for the agent to learn from scratch. So often we need other things to make the agent more efficient.

The way to think of that at a high level, and it’s actually where a lot of the research in my own lab at Berkeley is right now, is: can we make agents play just like children would play? When you think about play, what’s the fundamental concept of play? It’s that you actually don’t have to prescribe what the task is. The idea is that the agent should just, on its own, do a wide range of interesting things, and in the process, hopefully acquire interesting skills. Maybe a humanoid robot, when asked to play, would learn to crawl, get up, run, maybe open doors, pick up objects, and so forth, just because that’s a way to keep itself entertained. The challenge is getting that kind of play to emerge in AI systems, not just in humans.

More on Pieter Abbeel’s ODSC West 2022 Session on Deep Reinforcement Learning

Deep Reinforcement Learning equips AI agents with the ability to learn from their own trial and error. Success stories include learning to play Atari games, Go, and Dota 2, and robots learning to run, jump, and manipulate. This tutorial will cover the foundations of Deep Reinforcement Learning, including MDPs, DQN, Policy Gradients, TRPO, PPO, DDPG, SAC, TD3, and model-based RL, as well as current research frontiers.
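As a small taste of the policy-gradient family mentioned above, here is a REINFORCE-style sketch on a two-armed bandit: the policy is a softmax over two logits, and each sampled action nudges the logits along the gradient of log pi(a), scaled by the reward. All rewards, step sizes, and names are made up for illustration:

```python
import math
import random

random.seed(0)

theta = [0.0, 0.0]  # one logit per action; softmax gives action probabilities

def action_probs():
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [x / total for x in z]

# Two-armed bandit: arm 1 pays 1.0, arm 0 pays only 0.2.
for _ in range(2000):
    probs = action_probs()
    a = 0 if random.random() < probs[0] else 1   # sample from the policy
    r = 1.0 if a == 1 else 0.2
    # REINFORCE update: grad of log pi(a) for a softmax is (one-hot(a) - probs),
    # scaled here by the reward and a step size of 0.1.
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += 0.1 * r * grad

probs = action_probs()
print(round(probs[1], 2))  # probability of the better arm grows toward 1.0
```

Real policy-gradient methods like TRPO and PPO build on this same update with neural-network policies, baselines, and trust-region or clipping constraints.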

Session Outline:

  • Module 1: Introduction to Markov Decision Processes (MDPs) and Exact Solution Methods (which only apply to small problems)
  • Module 2: Deep Q Networks and Application to Atari
  • Module 3: Policy Gradients, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradients (DDPG), Twin Delayed Deep Deterministic Policy Gradients (TD3), Soft Actor Critic (SAC), and Application to Robot Learning
  • Module 4: Model-based Reinforcement Learning
  • Module 5: Current Research Frontier