Editor’s note: Alishba Imran is a speaker for ODSC West 2021. Be sure to check out her talk, “Machine Learning and Robotics in Healthcare Devices and Rehabilitation,” there!

Contact-rich manipulation tasks in unstructured environments remain especially difficult for robots to learn and perform.

➡️ One general line of research tackles this problem by training a policy in a simulated environment with deep reinforcement learning, then making that policy transferable so it can handle real-world tasks.

Learning Contact-Rich Manipulation Tasks

With this project, I designed a framework to train a policy in a simulated environment to enable a robotic arm to conduct contact-rich manipulation tasks. Although this framework can be broadly applied to many manipulation tasks, I demonstrate it on the peg insertion task (which is often a very complex task).

Specifically, I train a policy in PyBullet (on a Kuka LBR iiwa robot arm) using PPO and Twin-Delayed DDPG for peg-in-hole tasks. I used self-supervision to learn a compact, multimodal representation of the sensory inputs gathered from the scene, which can then be used to improve the sample efficiency of policy learning. This implementation can also be used to understand force-torque (F/T) control based on the F/T readings that are captured at each step.


Environment Set Up

I used the simulated environment developed here using PyBullet, in which a Kuka LBR iiwa robot arm is mounted on a table along with the target box:

The policy is defined by a neural network that contains a state encoder and a three-layer multi-layer perceptron (MLP). The encoder takes the multi-modal input and predicts the state vector. The MLP then takes in the state and generates a 3D displacement of the end-effector.
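To make the data flow concrete, here is a minimal sketch of that two-stage policy in plain Python. It is not the implementation used in the project: the single-layer stand-in encoder, the layer sizes, the random initialisation, and the `max_step` bound on the displacement are all assumptions chosen for illustration.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create one fully connected layer: random weights, zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Apply y = W x + b for a layer produced by make_layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


class PolicySketch:
    """Stand-in state encoder followed by a three-layer MLP head."""

    def __init__(self, obs_dim, state_dim=8, hidden_dim=16, max_step=0.01, seed=0):
        rng = random.Random(seed)
        # Placeholder for the real multi-modal encoder (assumption).
        self.encoder = make_layer(obs_dim, state_dim, rng)
        self.mlp = [
            make_layer(state_dim, hidden_dim, rng),
            make_layer(hidden_dim, hidden_dim, rng),
            make_layer(hidden_dim, 3, rng),  # output: (dx, dy, dz)
        ]
        self.max_step = max_step  # hypothetical per-step displacement bound

    def act(self, observation):
        """Map a flat observation vector to a 3D end-effector displacement."""
        state = [math.tanh(v) for v in linear(self.encoder, observation)]
        hidden = relu(linear(self.mlp[0], state))
        hidden = relu(linear(self.mlp[1], hidden))
        raw = linear(self.mlp[2], hidden)
        # tanh keeps every component within [-max_step, max_step].
        return [self.max_step * math.tanh(v) for v in raw]
```

Calling `PolicySketch(obs_dim=12).act([0.1] * 12)` returns a list of three small displacements. In a trained system the weights would come from PPO or TD3 updates and not from random initialisation.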

The goal position of the end-effector is

which is the center of the bottom inside the box. A negative reward is returned based on the distance between the position of the end-effector and the target.
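The exact reward formula is not given here, so the sketch below takes the simplest reading: the reward is the negative Euclidean distance between the end-effector and the goal. The `GOAL_POSITION` value is a made-up placeholder, not the goal used in the project.

```python
import math

# Hypothetical goal coordinates (x, y, z) in metres; the real goal is the
# center of the bottom inside the box in the simulated scene.
GOAL_POSITION = (0.5, 0.0, 0.1)


def distance_reward(end_effector_position, goal_position=GOAL_POSITION):
    """Return a negative reward that approaches zero as the end-effector nears the goal."""
    return -math.dist(end_effector_position, goal_position)
```

With this shaping, the reward is always at most zero and is maximised exactly when the end-effector sits on the goal.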

At each step, the environment generates an RGB color image and a depth image, and reports the F/T reading captured at the joint connected to the end-effector. The F/T readings represent the force and torque sensed by the joint along each axis of its Cartesian coordinate frame:

The color image (left) and the depth image (right)

F/T readings when the end-effector is touching something while moving
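As a rough illustration of how such a reading can be unpacked, the sketch below works on a tuple shaped like the return value of PyBullet's `getJointState`, which is `(position, velocity, reaction_forces, applied_motor_torque)` with `reaction_forces` ordered as `(Fx, Fy, Fz, Mx, My, Mz)`. PyBullet only fills in the reaction forces after `enableJointForceTorqueSensor` has been called for that joint. The `Wrench` type, the helper names, and the contact threshold are hypothetical and not taken from the project.

```python
from typing import NamedTuple, Sequence


class Wrench(NamedTuple):
    force: tuple   # (Fx, Fy, Fz)
    torque: tuple  # (Mx, My, Mz)


def wrench_from_joint_state(joint_state: Sequence) -> Wrench:
    """Split the 6-vector of joint reaction forces into force and torque parts."""
    reaction = joint_state[2]  # third entry holds (Fx, Fy, Fz, Mx, My, Mz)
    return Wrench(force=tuple(reaction[:3]), torque=tuple(reaction[3:6]))


def in_contact(wrench: Wrench, force_threshold: float = 1.0) -> bool:
    """Crude contact check: force magnitude above an assumed threshold."""
    magnitude = sum(f * f for f in wrench.force) ** 0.5
    return magnitude > force_threshold
```

In a live simulation, `joint_state` would come from `pybullet.getJointState(robot_id, joint_index)`; here it is an ordinary tuple so the helper can be tried without a simulator.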

Proximal Policy Optimization (PPO) Implementation

I implemented proximal policy optimization (PPO), an on-policy policy gradient method that improves on trust region policy optimization (TRPO).
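The most widely used PPO variant replaces TRPO's explicit trust-region constraint with a clipped surrogate objective. The sketch below computes that objective for a batch of samples. It assumes the clipped variant and a clip range of 0.2, neither of which is confirmed for this project.

```python
import math


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximised) over a batch."""
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes the incentive to push the ratio outside the clip range.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage. Once the ratio moves past the clip range in the direction favoured by the advantage, the objective stops growing, which keeps each update close to the previous policy.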

Multi-modal Sensing

A multi-modal fusion encoder is used to process the multi-modal observation and generate a compact state representation.

The main steps here are:

  • Encoder: the visual and haptic data are encoded individually by separate encoders. Each of these data types is transformed into a variational parameter vector.
  • Fusor: the variational parameters are fused during state representation learning, and this module produces the state representation at inference time.
  • Decoder: the learning objectives are designed so that all the ground truth for the predictions can be determined automatically through self-supervised learning.

The fused representation of the visual and haptic sensor data is learned from many data samples by a multi-branch auto-encoder. After training on this data, the modality encoders can focus on only the key features and generate a compact representation of the MDP state. This state representation is then used by the actor and critic models to decide on the action and to predict the state value that corresponds to the current policy.
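How the variational parameters are fused is not spelled out here. One common choice for Gaussian latents is a product of experts, in which each modality contributes a mean and a variance per latent dimension and more confident modalities get more weight. The sketch below assumes that choice, plus a standard-normal prior expert. It illustrates the general technique and does not describe the project's fusor.

```python
def fuse_gaussians(means, variances):
    """Fuse per-modality Gaussian parameters with a product of experts.

    means and variances are lists with one entry per modality; each entry is a
    list with one value per latent dimension. A N(0, 1) prior is included as an
    extra expert (assumption).
    """
    dim = len(means[0])
    fused_mean, fused_var = [], []
    for d in range(dim):
        precision = 1.0  # precision of the N(0, 1) prior expert
        weighted_sum = 0.0  # the prior mean is 0, so it adds nothing here
        for mu, var in zip(means, variances):
            precision += 1.0 / var[d]
            weighted_sum += mu[d] / var[d]
        fused_var.append(1.0 / precision)
        fused_mean.append(weighted_sum / precision)
    return fused_mean, fused_var
```

For example, `fuse_gaussians([visual_mean, haptic_mean], [visual_var, haptic_var])` returns the fused mean and variance. At inference time the fused mean could serve as the compact state passed to the actor and critic.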

This is the structure of the multi-modal encoder:

Demo

Here is a quick demo!

Off-Policy Learning Algorithm: Twin-Delayed DDPG

Alongside PPO, I also implemented the off-policy learning algorithm Twin-Delayed DDPG. DDPG can achieve strong results, but its learned Q-function can overestimate Q-values, which can lead the policy to break.

Twin Delayed DDPG (TD3) is an algorithm that attempts to address this issue by learning two Q-functions instead of one, updating the policy (and target networks) less frequently than the Q-function, and adding noise to the target action to make it harder for the policy to exploit Q-function errors.
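These three ideas can be sketched in a few lines. The code below computes the TD3 critic target for a single transition and shows the delayed-update check. The function names, the callables passed in, and the default hyperparameters (noise scale 0.2, noise clip 0.5, policy delay 2) are illustrative assumptions, not the settings used in the project.

```python
import random


def clip(value, low, high):
    return max(low, min(high, value))


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0, rng=random):
    """Bootstrapped critic target for one transition."""
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    next_action = []
    for a in target_actor(next_state):
        noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
        next_action.append(clip(a + noise, -max_action, max_action))
    # Clipped double-Q: use the smaller of the two target critics' estimates.
    q_min = min(target_q1(next_state, next_action), target_q2(next_state, next_action))
    # No bootstrapping past terminal states.
    return reward + gamma * (1.0 - float(done)) * q_min


def should_update_policy(step, policy_delay=2):
    """Delayed updates: refresh the actor and target networks every policy_delay critic steps."""
    return step % policy_delay == 0
```

Both critics are regressed toward the same target, while the actor and the target networks are only updated on the steps where `should_update_policy` returns `True`.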

If interested, you can learn more about Twin Delayed DDPG here.

A lot of the approach used for this project was inspired by and built on top of this paper! Feel free to check it out to learn more.

If you have questions, feedback, or thoughts about anything related to machine learning and robotics, feel free to get in touch:

Email: alishbai734@gmail.com

LinkedIn: Alishba Imran

Twitter: @alishbaimran_

About the author/ODSC West 2021 speaker on Machine Learning and Robotics:

Alishba Imran is a machine learning developer who works at the intersection of AI, rehabilitation tech, humanoids, and medical devices to create smarter machines. She’s developed a novel generative neural network and 3D printed prosthetic material that reduced patient costs from $10,000 to $700, led neuro-symbolic AI research for Sophia the Robot at Hanson Robotics, and helped build a portable, low-cost, soft robotic glove with Q-learning using the SRT by the Harvard Biodesign lab.