There are many methodologies in algorithmic trading — from automated trade entry and close points based on technical and fundamental indicators to intelligent forecasts and decision making using complex maths and, of course, artificial intelligence. Reinforcement learning here stands out as a Holy Grail — no need to do intermediate forecasts or rule creation — you just have to define a target and the algorithm will learn the exact rules by itself!

How do reinforcement learning agents learn to trade like this? | Illustration by the author

There are a lot of amazing advanced tutorials that teach about modern learning algorithms (A3C, TRPO, PPO, TD3, and other scary acronyms) and deep learning architectures from CNNs and RNNs to cool Transformers. However, **financial data science and machine learning are very different from the classic AI exercises in computer vision, language understanding, or recommender systems**.

The main issues are a very **low signal-to-noise ratio** and **complex, non-linear relationships** between all available market factors. Moreover, the factors are not constant: they change over time and can affect market moves differently at different moments. In such a complicated environment, we cannot afford to simply train models on the datasets we have and push them to production, because the price of a mistake is too high. That’s why we need to drill down to the fundamentals and details and test every piece of our modeling theory relentlessly.

This article is illustrated with code published on my GitHub; don’t forget to check it out and try the examples yourself! I will also be giving an expanded talk on this topic in a couple of weeks at the ODSC Europe conference, in the workshop “Hands-on RL in Finance: playing Atari vs Playing Markets”. You can treat this article as preparation for the workshop as well 😉

## Simplified trading exercise

In the spirit of the preamble above, learning reinforcement learning (not a very easy technology!) should start with **simplified scenarios**, where we can understand the basics. Before we move to real financial data, we have to be sure that our algorithm can actually exploit the peaks and dips of a market. A **simple cosine function** can help us evaluate this: if we can trade such a “market”, we can start from here and gradually make the environment harder for our agent.

Illustration of a simplified “market” with its ups and downs | Illustration by the author
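A cosine “market” like this can be generated in a few lines. This is a minimal sketch rather than the code from the repository: the function name `make_cosine_market` and the number of periods are assumptions made for illustration, while the 250-point length matches the environment length used later in the article.

```python
import numpy as np

def make_cosine_market(n_points=250, n_periods=5.0, amplitude=1.0):
    """Return a synthetic 'price' series shaped like a cosine wave."""
    # n_periods full oscillations spread over n_points observations
    t = np.linspace(0.0, 2.0 * np.pi * n_periods, n_points)
    return amplitude * np.cos(t)

prices = make_cosine_market()
print(prices.shape, prices[:3])
```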

This methodology follows the well-known article “Self-learning quant”. However, I structured the code with OOP slightly better (IMO), changed the math to match the classic textbook formulas, and tested it on more examples.

## A brief introduction to reinforcement learning

But wait… what’s an agent, anyway? As you might already know, the two main parts of the reinforcement learning framework are:

**— Environment** — a “playing field”, or a market in our case, which tells us what is happening right now and what our reward will be in the future if we take some action now

**— Agent** — a “player” that interacts with the environment and learns how to maximize long-term rewards by taking different actions in different situations

Graphically, it can be represented by the following diagram:

Illustration from https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html

This framework has been known for years, and there are several strategies for training an agent to act as profitably as possible at any moment. Intuitively, if we can enumerate all possible states (or types of market situations in our case), we can assign different rewards to the different actions performed in each state. For example, in a **bullish market**, **doing nothing** gives us a **reward of 0**, **going long** gives **+100**, and **shorting** such a market might give us a **negative reward of -100**. A table that maps each state and action to the reward we expect from it, estimated from past experience, is called a **Q function**.
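A minimal sketch of such a table in Python, assuming two hypothetical market states. The bullish row uses the numbers from the example above; the bearish row is simply mirrored for illustration and is not taken from the article.

```python
# Hypothetical tabular Q function: state -> {action: expected reward}.
q_table = {
    "bullish": {"flat": 0.0, "long": 100.0, "short": -100.0},
    # Assumption: a bearish market is treated as the mirror image.
    "bearish": {"flat": 0.0, "long": -100.0, "short": 100.0},
}

def best_action(state):
    """Pick the action with the highest Q value in the given state."""
    actions = q_table[state]
    return max(actions, key=actions.get)

print(best_action("bullish"))  # long
print(best_action("bearish"))  # short
```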

If states are complex and hard to represent in a table, the Q function can be **approximated with a neural network** (this is what we will do). Once we have such a table or function, we can choose the most profitable action in every state (according to our Q function) and enjoy the profits! But how do we build such a Q function? This is where the Q-learning algorithm helps us; at its core is the celebrated Bellman equation:

Formula from https://towardsdatascience.com/deep-q-learning-tutorial-mindqn-2a4c855abffc
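For reference, the Q-learning update is usually written in the following form. This is the standard textbook version, reconstructed here rather than copied from the linked figure, so the notation there may differ slightly. Here $R_{t+1}$ is the reward received after taking action $A_t$ in state $S_t$:

$$
Q(S_t, A_t) \leftarrow (1 - \alpha)\, Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) \right]
$$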

As we can see, on every step t we iteratively update the value corresponding to a state S and an action A from two weighted parts:

— The **current value** of the Q function for this state and action

— The **reward of such a decision** + long-term reward from the future steps

**Alpha** here measures the trade-off between the current value and the new reward (i.e. the learning rate), while **gamma** weights the long-term rewards. Also, while iterating over our environment during training, we will sometimes act randomly, with some **probability epsilon**, to let our agent explore new actions and potentially even bigger rewards! With a neural network approximation, updating the Q function means fitting our neural network Q to the new value for a given action.
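To make the roles of alpha, gamma, and epsilon concrete, here is a minimal tabular sketch. The state names, the helpers `q_update` and `epsilon_greedy`, and the zero-initialized table are illustrative assumptions, not the code from the repository; alpha and gamma use the values quoted later in the article.

```python
import random

ACTIONS = ["long", "flat", "short"]

def q_update(q, state, action, reward, next_state, alpha=0.9, gamma=0.9):
    """One tabular Q-learning step: blend the old value with the new target."""
    old_value = q[state][action]
    # New target: immediate reward plus discounted best value of the next state
    target = reward + gamma * max(q[next_state].values())
    q[state][action] = (1 - alpha) * old_value + alpha * target
    return q[state][action]

def epsilon_greedy(q, state, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the best known action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(q[state], key=q[state].get)

# Hypothetical two-state table, initialized with zeros
q = {s: {a: 0.0 for a in ACTIONS} for s in ("up", "down")}
new_value = q_update(q, "up", "long", reward=100.0, next_state="up")
print(new_value)  # 90.0
```

With an all-zero table the target is just the reward, so the updated value is 0.1 · 0 + 0.9 · 100 = 90.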

Also, I recommend checking out the Jupyter Notebook in the GitHub repository: it contains classes for the simple Environment and the Agent, which will help you better understand the mechanics. For diving deeper into reinforcement learning theory and mathematics there are many great resources; I particularly like Richard Sutton and Andrew Barto’s “Reinforcement Learning: An Introduction”.

## Implementation notes

The implementation of the Environment and Agent classes is relatively straightforward, but I’d like to outline the training loop once again here:

— Iterate over N epochs, where each epoch is one full pass over the environment

— For each sample in the environment:

1. Get the **current state** at time t

2. Get the **value function** for all actions at this state (our neural network will output 3 values for us)

3. Take an **action** in this state (either the argmax of the outputs or a random action to explore)

4. Get the **reward** for this action from the environment (see the class)

5. Get the **next state** after the current one (for the future long-term rewards)

6. **Save a tuple** of the current state, next state, value function, and reward for experience replay

7. **Do the experience replay**: fit our Q neural network on some samples from the experience replay buffer so that the Q function better reflects which rewards we get for which actions
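The seven steps above can be sketched as a single loop. This is a simplified illustration under several assumptions, not the repository code: a linear model (`W`) stands in for the neural network, the buffer stores the chosen action instead of the full value function, `LEARNING_RATE` and the buffer size are made-up values, and epsilon is applied directly as the exploration probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "market" and hyperparameters (values from the article where given)
prices = np.cos(np.linspace(0.0, 10.0 * np.pi, 250))
STATE_LEN, N_EPOCHS, BATCH_SIZE = 5, 5, 16
EPSILON, GAMMA = 0.95, 0.9
LEARNING_RATE = 0.5                      # assumption: step size of the toy model
BUFFER_SIZE = 1000                       # assumption: replay buffer capacity
POSITIONS = np.array([1.0, 0.0, -1.0])   # long, flat, short

# Assumption: a linear Q model replaces the neural network (3 outputs)
W = np.zeros((STATE_LEN, 3))

def get_state(t):
    """State at time t: the last STATE_LEN price differences."""
    return np.diff(prices[t - STATE_LEN : t + 1])

def q_values(state):
    """Q estimates for all three actions in the given state."""
    return state @ W

buffer = []
for epoch in range(N_EPOCHS):
    for t in range(STATE_LEN, len(prices) - 1):
        state = get_state(t)                               # 1. current state
        q = q_values(state)                                # 2. values of all actions
        if rng.random() < EPSILON:                         # 3. explore or exploit
            action = int(rng.integers(3))
        else:
            action = int(np.argmax(q))
        market_move = prices[t + 1] - prices[t]
        reward = 100.0 * POSITIONS[action] * market_move   # 4. reward
        next_state = get_state(t + 1)                      # 5. next state
        buffer.append((state, action, reward, next_state)) # 6. store transition
        buffer = buffer[-BUFFER_SIZE:]
        # 7. experience replay on a random mini-batch
        k = min(BATCH_SIZE, len(buffer))
        for i in rng.choice(len(buffer), size=k, replace=False):
            s, a, r, s_next = buffer[int(i)]
            target = r + GAMMA * np.max(q_values(s_next))
            error = target - q_values(s)[a]
            W[:, a] += LEARNING_RATE * error * s           # gradient step on squared error

print(W.round(2))
```

In this sketch the gradient step size plays the role that alpha plays in the tabular update; a neural network version would replace the last update line with a call that fits the network to `target` for the chosen action.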

Wait, wait, but what is experience replay, you will ask? I recommend reading more here, but in a nutshell, it’s better to train on uncorrelated mini-batches of data than on highly correlated step-by-step observations: it helps generalization and convergence.

## Validating our trading strategy

Let’s see if within such a framework we can train our agent to ride a cosine wave with some profits! Let’s define some parameters that we will use to train our agent:

— Training will be done for **5 epochs**

— **Epsilon** = 0.95, **Gamma** = 0.9, **Alpha** = 0.9

— The **length of the environment** is 250 points, and every **state length** is 5 points

— Every state is normalized by **differencing the time series**

— The **reward** is updated every 1 point (i.e. at the **next observation**, t+1)

— We have **three actions (long, flat, short)**, rewarded with our “market return” multiplied by +100, 0, and -100 respectively

— For experience replay, we will use **16 samples from our buffer**
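As a small worked example of the state and reward definitions above, consider the sketch below. The price values and the helper names `make_state` and `rewards_for_all_actions` are made up for illustration, and the “market return” is assumed to be the plain one-step price difference, which may not match the repository exactly.

```python
import numpy as np

# Hypothetical price series, for illustration only
prices = np.array([1.00, 1.02, 1.05, 1.04, 1.01, 1.03, 1.06])

def make_state(prices, t, state_len=5):
    """Differenced window ending at time t (state_len differences)."""
    window = prices[t - state_len : t + 1]
    return np.diff(window)

def rewards_for_all_actions(prices, t, scale=100.0):
    """Reward of each action, realized at the next observation t+1."""
    market_return = prices[t + 1] - prices[t]  # assumption: simple difference
    return {
        "long": scale * market_return,
        "flat": 0.0,
        "short": -scale * market_return,
    }

state = make_state(prices, t=5)
rewards = rewards_for_all_actions(prices, t=5)
print(state)
print(rewards)
```

Here the price rises from 1.03 to 1.06 at the next step, so going long earns roughly +3, shorting roughly -3, and staying flat exactly 0.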

First, let’s check our **cosine function**, and it looks great! It seems we are long (green dots) exactly in the “bullish” parts of the cosine wave and do the opposite on the other side!

RL performance visualization on the cosine function | Illustration by the author

Let’s make the exercise a bit harder now and **add some Gaussian noise** to the time series without retraining the model. And it still works adequately! There are now some points of confusion, but on average the model still knows where the long-term trends of our noisy cosine function are.

RL performance visualization on the “noisy” cosine function | Illustration by the author
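A noisy version of the series can be produced as follows. The noise level `NOISE_STD` and the random seed are assumptions, since the article does not state which values were used.

```python
import numpy as np

rng = np.random.default_rng(42)

t = np.linspace(0.0, 10.0 * np.pi, 250)
clean = np.cos(t)

NOISE_STD = 0.1  # assumption: standard deviation of the Gaussian noise
noisy = clean + rng.normal(loc=0.0, scale=NOISE_STD, size=clean.shape)

print(noisy[:5])
```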

Let’s make life for our agent even more complicated: **let’s sum up 4 cosine functions with different frequencies** and try to trade the combined waves. The result is still great: the agent’s representation of the market clearly captures the trends, and even though our model was trained on another kind of data, it still knows what to do with a different kind of wave.

RL performance visualization on the combination of cosine functions | Illustration by the author
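Such a combined wave could be built like this. The four frequencies are placeholders chosen for illustration; the article does not list the ones actually used.

```python
import numpy as np

t = np.linspace(0.0, 10.0 * np.pi, 250)

# Assumption: four arbitrary relative frequencies
FREQUENCIES = [0.5, 1.0, 2.0, 3.0]
combined = np.sum([np.cos(f * t) for f in FREQUENCIES], axis=0)

print(combined[:5])
```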

If you want to try those experiments yourself, please feel free to refer to the code on my GitHub!

## Why is it closer to Atari than to the real market?

This kind of framework (with the bells and whistles of recent AI advances, of course) is behind the famous examples of reinforcement learning flawlessly playing Go and Atari games. Why does it work so well? Even though the environment itself is complex (computer graphics, opponent bots, many possible actions and situations), it is rather stable and predictable. The games have rules and logic, and they are not meant to mean different things at different times. Combined with the power of deep learning to learn different visual and textual representations, this gives amazing results for RL playing different games. Unfortunately, **we cannot say the same about playing financial games**:

— The variables that affect the rewards have a **very low signal-to-noise ratio** and actually can change over time (in video games killing an enemy is always a good thing)

**— Overfitting** is a much bigger problem — because the future markets can be completely different from the past ones, how can we estimate our risks of losing money in the future when the data about it is not available?

**— Backtesting** is a real issue as well — we cannot just play the video game again and again, because past market situations are rather limited and the future is unknown and cannot be simulated due to the uncertainty of the factors that will affect it

**— Interpretation** of the learned policy. What have we actually discovered? How do we explain the motivation behind the agent’s policy? Can it be explained by well-known economic theories? Is it really something novel or just a spurious correlation?

I recommend reading more about this in the celebrated book by Marcos Lopez de Prado to dive deeper into such details, because these issues are very different from the ones we encounter in classical ML exercises. In our example, we can be sure that we have just learned a simple trend-following strategy: if the price goes up, follow the upward trend and invest; if it goes in the opposite direction, short it. It works with a cosine function; maybe it can even work in the market. Sometimes 🙂

## What should we do next?

After having understood, implemented, and validated the basics based on the artificial dataset, it’s time to move on and expand this framework to be able to work with real financial data. We need to take into account the following things:

**— Data preparation** — states and rewards that will be as stable as possible in the constantly changing markets

**— Models and their validation** — advanced techniques for cross-validation and feature importance will be very handy here

**— Backtesting and probability of overfitting** — the final frontier that will actually allow us to say that our agent didn’t just overfit to some noise in the data, but actually has learned a profitable policy

I have partly discussed these issues in my previous blog posts [one, two, three], but none of them was specifically about reinforcement learning. I will be talking about this in a couple of weeks at the ODSC Europe conference, in the workshop “Hands-on RL in Finance: playing Atari vs Playing Markets”. I will go through the above-mentioned parts of the advanced framework, and we will see how RL can actually be applied to trade financial markets. I will be happy to meet you there and discuss it!

**P.S.**

If you found this content useful, you can support me on Bitclout. I am open to discussions and collaborations in the technology field; you can connect with me on Facebook or LinkedIn, where I regularly post AI-related articles or news opinions that are too short for Medium.

### About the author/ODSC Europe 2021 speaker on Reinforcement Learning:

Alex Honchar is a tech entrepreneur and educator. Currently, he is co-founder and ML director at Neurons Lab – a consulting firm specializing in healthcare, finance, and IoT. Also, he writes a popular blog on Medium about machine learning applications and leadership. Previously he worked as an independent consultant with SMBs and startups on rapid go-to-market ML solutions and taught machine learning courses at the University of Verona and Ukrainian Catholic University.

*Originally posted here. Reposted with permission.*