AI Crash Course book extract: Exploring the principles of reinforcement learning

Hadelin de Ponteves is the co-founder and director of technology at BlueLife AI, which leverages the power of cutting-edge Artificial Intelligence to empower businesses to make massive profits by optimizing processes, maximizing efficiency, and increasing profitability. Hadelin is also an online entrepreneur who has created 50+ top-rated educational e-courses on topics such as machine learning, deep learning, artificial intelligence, and blockchain, which have reached over 700,000 subscribers in 204 countries.

Editor’s note: This is an edited extract from AI Crash Course, by Hadelin de Ponteves, published by Packt. Find out more and buy a copy of the book by visiting here.

When people refer to AI today, some think of Machine Learning, while others think of Reinforcement Learning. I fall into the second category. I have always seen Machine Learning as statistical models that have the ability to learn correlations, from which they make predictions without being explicitly programmed. While this is, in some way, a form of AI, Machine Learning does not include the process of taking actions and interacting with an environment the way we humans do. Indeed, as intelligent human beings, what we constantly do is the following:

  1. We observe some input, whether it’s what we see with our eyes, what we hear with our ears, or what we remember in our memory.
  2. These inputs are then processed in our brain.
  3. Eventually, we make decisions and take actions.

This process of interacting with an environment is what we are trying to reproduce in terms of Artificial Intelligence. And to that extent, the branch of AI that works on this is Reinforcement Learning. This is the closest match to the way we think; the most advanced form of Artificial Intelligence, if we see AI as the science that tries to mimic (or surpass) human intelligence.

Reinforcement Learning also has the most impressive results in business applications of AI. For example, Alibaba leveraged Reinforcement Learning to increase its ROI in online advertising by 240% without increasing its advertising budget (see https://arxiv.org/pdf/1802.09756.pdf, page 9, Table 1, last row (DCMAB)).

The five principles of reinforcement learning

Let’s begin building the first pillars of your intuition into how Reinforcement Learning works. These are the fundamental principles of Reinforcement Learning, which will give you a solid foundation in AI.

Here are the five principles:

  • Principle #1: The input and output system
  • Principle #2: The reward
  • Principle #3: The AI environment
  • Principle #4: The Markov decision process
  • Principle #5: Training and inference

Principle #1 – The input and output system

The first step is to understand that today, all AI models are based on the common principle of inputs and outputs. Every single form of Artificial Intelligence, including Machine Learning models, ChatBots, recommender systems, robots, and of course Reinforcement Learning models, will take something as input, and will return another thing as output.

In Reinforcement Learning, these inputs and outputs have a specific name: the input is called the state, or input state. The output is the action performed by the AI. And in the middle, we have nothing other than a function that takes a state as input and returns an action as output. That function is called a policy. Remember the name, “policy,” because you will often see it in AI literature.

As an example, consider a self-driving car. Try to imagine what the input and output would be in that case.

The input would be what the embedded computer vision system sees, and the output would be the next move of the car: accelerate, slow down, turn left, turn right, or brake. Note that the output at any time (t) could very well be several actions performed at the same time. For instance, the self-driving car can accelerate while at the same time turning left. In the same way, the input at each time (t) can be composed of several elements: mainly the image observed by the computer vision system, but also some parameters of the car such as the current speed, the amount of gas remaining in the tank, and so on.

That’s the very first important principle in Artificial Intelligence: it is an intelligent system (a policy) that takes some elements as input, does its magic in the middle, and returns some actions to perform as output. Remember that the inputs are also called the states. The next important principle is the reward.
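To make this first principle concrete, here is a minimal sketch of a policy as a plain function from state to action. The state fields and action names are illustrative assumptions for a toy driving scenario, not from the book:

```python
def policy(state):
    """A toy driving policy: the state is a dict of sensor readings.

    The field names (obstacle_ahead, speed, speed_limit) and the action
    strings are hypothetical, purely for illustration.
    """
    if state["obstacle_ahead"]:
        return "brake"
    if state["speed"] < state["speed_limit"]:
        return "accelerate"
    return "maintain_speed"

# The policy maps an input state to an output action.
action = policy({"obstacle_ahead": False, "speed": 40, "speed_limit": 60})
print(action)  # accelerate
```

In a real Reinforcement Learning system the policy would of course be a learned function (often a neural network) rather than hand-written rules, but the interface is the same: state in, action out.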

Principle #2 – The reward

Every AI has its performance measured by a reward system. There’s nothing confusing about this; the reward is simply a metric that will tell the AI how well it does over time.

The simplest example is a binary reward: 0 or 1. Imagine an AI that has to guess an outcome. If the guess is right, the reward will be 1, and if the guess is wrong, the reward will be 0. This could very well be the reward system defined for an AI; it really can be as simple as that!
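The binary reward described above can be sketched in a few lines. This is only an illustrative toy, assuming the guessing setup from the paragraph:

```python
def binary_reward(guess, outcome):
    """Reward is 1 if the AI's guess matches the true outcome, else 0."""
    return 1 if guess == outcome else 0

# Accumulating the reward over a sequence of guesses (hypothetical data).
guesses  = [1, 0, 1, 1]
outcomes = [1, 1, 1, 0]
total = sum(binary_reward(g, o) for g, o in zip(guesses, outcomes))
print(total)  # 2
```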

A reward doesn’t have to be binary, however. It can be continuous. Consider the famous game of Breakout:

Imagine an AI playing this game. Try to work out what the reward would be in that case. It could simply be the score; more precisely, the score would be the accumulated reward over time in one game, and each individual reward could be defined as the change in that score from one time step to the next (its derivative).

This is one of the many ways we could define a reward system for that game. Different AIs will have different reward structures; we will build five reward systems for five different real-world applications in this book.

With that in mind, remember this as well: the ultimate goal of the AI will always be to maximize the accumulated reward over time.

Those are the first two basic, but fundamental, principles of Artificial Intelligence as it exists today; the input and output system, and the reward. The next thing to consider is the AI environment.

Principle #3 – The AI environment

The third principle is what we call an “AI environment.” It is a very simple framework where you define three things at each time (t):

  • The input (the state)
  • The output (the action)
  • The reward (the performance metric)

For every AI based on Reinforcement Learning that is built today, we always define an environment composed of the preceding elements. It is, however, important to understand that a given AI environment can contain more than these three elements.

For example, if you are building an AI to beat a car racing game, the environment will also contain the map and the gameplay of that game. Or, in the example of a self-driving car, the environment will also contain all the roads along which the AI is driving and the objects that surround those roads. But what you will always find in common when building any AI, are the three elements of state, action, and reward. The next principle, the Markov decision process, covers how they work in practice.
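The three elements of an environment can be sketched as a tiny class. This is a hypothetical toy (an agent guessing a hidden bit), not an environment from the book; the point is only the interface, where each step takes an action and returns the next state and a reward:

```python
class GuessEnv:
    """Toy AI environment: at each step the agent guesses a hidden bit."""

    def __init__(self, secret):
        self.secret = secret  # hidden part of the environment
        self.state = 0        # the observable input state (trivial here)

    def step(self, action):
        """Apply an action; return (next_state, reward)."""
        reward = 1 if action == self.secret else 0
        return self.state, reward

env = GuessEnv(secret=1)
_, reward = env.step(1)
print(reward)  # 1
```

Real environments (a racing game's map and gameplay, a self-driving car's roads) are far richer, but they expose the same three elements: state, action, and reward.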

Principle #4 – The Markov decision process

The Markov decision process, or MDP, is simply a process that models how the AI interacts with the environment over time. The process starts at t = 0, and then, at each next iteration, meaning at t = 1, t = 2, … t = n units of time (where the unit can be anything, for example, 1 second), the AI follows the same format of transition:

  1. The AI observes the current state, s_t.
  2. The AI performs the action, a_t.
  3. The AI receives the reward, r_t = R(s_t, a_t).
  4. The AI enters the next state, s_{t+1}.

The goal of the AI is always the same in Reinforcement Learning: it is to maximize the accumulated rewards over time, that is, the sum of all the rewards r_t = R(s_t, a_t) received at each transition.
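The four steps of the MDP transition can be sketched as a loop. This toy example, a hypothetical agent moving along a number line toward a goal position, is my own illustration and not from the book; the comments map each line back to the steps above:

```python
def transition(state, action):
    """Deterministic toy dynamics: move one step along a number line."""
    return state + (1 if action == "right" else -1)

def reward_fn(state, action):
    """R(s_t, a_t): reward 1 when the action reaches the goal position 5."""
    return 1 if transition(state, action) == 5 else 0

def run_episode(policy, start=0, steps=10):
    """Run one episode of the MDP and return the accumulated reward."""
    state, total = start, 0
    for t in range(steps):
        # 1. Observe the current state s_t (here, just `state`).
        action = policy(state)             # 2. Perform the action a_t.
        total += reward_fn(state, action)  # 3. Receive r_t = R(s_t, a_t).
        state = transition(state, action)  # 4. Enter the next state s_{t+1}.
    return total

# A fixed "always go right" policy reaches the goal once in 10 steps.
print(run_episode(lambda s: "right", start=0, steps=10))  # 1
```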

The following graphic will help you visualize and remember an MDP better, the basis of Reinforcement Learning models:

Now four essential pillars are already shaping your intuition of AI. Adding a last important one completes the foundation of your understanding of AI. The last principle is training and inference; in training, the AI learns, and in inference, it predicts.

Editor’s note: Find out about the last principle of Reinforcement Learning and much more by ordering a copy of AI Crash Course, available here.
