State-of-the-Art Reinforcement Learning Algorithms

Abstract: This paper brings together several strands of current research in the rapidly growing field of Reinforcement Learning (RL). It surveys the frameworks and learning algorithms used across different applications: Markov Decision Processes (MDPs), Temporal Difference (TD) Learning, Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic (A3C), Deep Q Networks (DQNs), Deep Deterministic Policy Gradient (DDPG) and Evolution Strategies (ES). The computations and procedures involved in these algorithms are briefly discussed. Reinforcement Learning can be used in almost every field for automation and advancement. Meta-Learning, Automated Machine Learning and Self-Learning Systems have recently become very popular. Meta-learning, which is an application of evolution strategies, is an exciting area of research that tackles the problem of learning to learn faster while generalizing to many tasks. Automated machine learning is the end-to-end automation of applying machine learning to real-world problems.


I. INTRODUCTION
Reinforcement Learning (RL) is an area of Machine Learning that is very dynamic in both theory and application. RL algorithms study the behavior of subjects in environments and learn to optimize that behavior [1]. RL algorithms can be classified as shown in Fig. 1.

The return $G_t$ is the sum of the rewards received after time step $t$:

$G_t = R_{t+1} + R_{t+2} + R_{t+3} + R_{t+4} + \dots + R_T$ (1)

where $T$ is the final time step.

Discounted Return: Here a discount rate $\gamma \in [0, 1]$ is applied to future rewards to determine their present value, so that more immediate rewards are given more importance. The discounted return is shown in (2).

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ (2)

Policy (π): the decision-making function that maps a given state to the probabilities of each possible action from that state.

Value function: A value function evaluates how good it is for an agent to be in a given state (state-value function, denoted $V_\pi$) or how good it is for an agent to perform a given action in a given state (action-value function, denoted $Q_\pi$). Both are defined in terms of the expected return $E_\pi$, as shown in (3) and (4). [2][3]

$V_\pi(s) = E_\pi[G_t \mid S_t = s]$ (3)

$Q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$ (4)
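As a small illustration of the undiscounted and discounted returns in (1) and (2), the sketch below computes $G_t$ for a finite reward sequence. The function name and the sample rewards are hypothetical and chosen only for the example.

```python
def discounted_return(rewards, gamma):
    """Return G_0 for a finite list of rewards [R_1, R_2, ..., R_T]."""
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]                       # made-up rewards R_1, R_2, R_3
undiscounted = discounted_return(rewards, 1.0)  # gamma = 1 recovers equation (1)
discounted = discounted_return(rewards, 0.5)    # later rewards count for less
```

With gamma = 0.5 the third reward contributes only 0.25 * 2.0, which is the "more immediate rewards are given more importance" effect described above.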

A. Markov Decision Processes
The Markov Decision Process (MDP) is the framework we use to describe RL problems. In an MDP, the agent and the environment interact continually, and the agent learns as it interacts, as shown in Fig. 2. When the agent takes an action, the environment transitions into a new state, and at that moment the agent receives a reward based on its action. A transition is represented by a tuple (s, a, r, s'), where s is the previous state, a is the action taken, r is the reward received for taking action a, and s' is the next state into which the environment transitions [4]. The probability of moving from state s to state s' with reward r after action a is shown in (5).

$P(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$ (5)

While interacting with the environment, the agent's main goal is to maximize the return; the optimal policy, the optimal state-value function V* and the optimal action-value function Q* are the ones that achieve this [4]. The Bellman optimality equation for calculating Q* proves to be immensely useful [4].
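To make the role of the Bellman optimality equation concrete, the following sketch repeatedly applies the backup Q(s, a) = r + γ max over a' of Q(s', a') on a tiny deterministic MDP until the values settle. The two-state MDP, its rewards and the discount rate are invented for illustration and do not come from the paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with deterministic transitions.
# P[s, a] is the next state and R[s, a] the reward for taking a in s.
P = np.array([[0, 1],
              [0, 1]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(500):
    # Q[P] has shape (2, 2, 2): entry [s, a, :] holds Q(s', .) for s' = P[s, a].
    Q = R + gamma * Q[P].max(axis=2)

# The greedy policy picks, in each state, the action with the largest Q-value.
greedy_policy = Q.argmax(axis=1)
```

In this toy problem the backup converges to Q(1, 1) = 20 and Q(0, 1) = 19, so the greedy policy chooses action 1 in both states.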

Fig. 2. Agent-Environment Interaction
Once we have Q*, we can determine the optimal policy: for any state s, an RL algorithm can pick the action a that maximizes Q*(s, a). In MDPs, the epsilon-greedy (ε-greedy) strategy is used to balance exploration and exploitation. Exploration is the act of exploring the environment to gather information about it, whereas exploitation is the act of using the information already known about the environment to maximize the return. The agent always starts by exploring, since it knows nothing about the environment at that point. Here ε is the exploration rate, which ranges from 0 to 1: ε = 1 means the agent only explores and ε = 0 means it only exploits the information it has. MDPs underlie TD Learning, DQNs, A2C, A3C and DDPG. [5][6]
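A minimal sketch of ε-greedy action selection is given below. The exponential decay schedule and its default constants are assumptions added for illustration; the text above only states that the agent starts by exploring (ε = 1) and that ε lies between 0 and 1.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


def decayed_epsilon(episode, eps_min=0.01, eps_max=1.0, decay=0.01):
    """Assumed schedule: decay epsilon exponentially from eps_max to eps_min."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay * episode)


q_values = [0.1, 0.9, 0.3]                      # made-up Q-values for one state
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
random_action = epsilon_greedy(q_values, epsilon=1.0)
```

With ε = 0 the call always returns the index of the largest Q-value, and with ε = 1 it always returns a uniformly random action.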

B. Temporal Difference Learning
Temporal difference (TD) learning lets an agent learn from an environment through episodes, without any prior information about the environment. TD Learning is often preferred over Monte Carlo (MC) methods: in TD Learning the agent learns at every step and updates its values, whereas in MC the values are updated only when an episode terminates. TD Learning is sometimes described as an unsupervised approach, because it learns from its own successive predictions rather than from labeled targets.
Value updates: In TD(0), instead of using $G_t$, we look only at the immediate reward $R_{t+1}$ plus the discounted estimated value of the state one step ahead, $\gamma V(S_{t+1})$. TD(λ) is used when we want to update values before the end of the episode and use more than one step ahead in the calculation. It has two views: the forward view and the backward view.
Forward View: Looks n steps ahead, with λ used to decay those future estimates. λ is the credit assignment variable.
Backward View: Updates values at each step, so after each step in an episode we update all prior steps [9]. $\delta_t$ is the TD error, as shown in (12). We also use Eligibility Traces (ET) to assign credit to prior steps appropriately. An eligibility trace keeps a record of how often and how recently each state has been visited, and can be calculated using (13). Credit is assigned to the states that were visited frequently and recently relative to the final state [9]. λ and the discount rate γ are the factors that decay those traces.
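The backward view can be sketched as follows for a tabular value function. This is a minimal illustration assuming accumulating eligibility traces and a step size `alpha`, neither of which is specified above; the three-state episode is made up. Setting `lam=0` reduces the update to TD(0).

```python
import numpy as np


def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.8):
    """Backward-view TD(lambda) with accumulating eligibility traces.

    V       : float array of state-value estimates, updated in place
    episode : list of (state, reward, next_state, done) transitions
    """
    E = np.zeros_like(V)                  # one eligibility trace per state
    for s, r, s_next, done in episode:
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]             # TD error for this step
        E *= gamma * lam                  # decay every trace
        E[s] += 1.0                       # mark the state just visited
        V += alpha * delta * E            # credit recently visited states
    return V


# Hypothetical episode: 0 -> 1 -> 2, with reward 1 on the final transition.
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 2, True)]
V = td_lambda_episode(np.zeros(3), episode)
V_td0 = td_lambda_episode(np.zeros(3), episode, lam=0.0)
```

With λ = 0.8 the final reward is credited to all three states in proportion to how recently they were visited, whereas with λ = 0 only the last state is updated.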
An environment can have an infinite number of states (i.e. a continuous state space), in which case the values cannot be stored in a table. If we use a neural network to approximate them, its weights θ are updated using the TD error. The TD error is also used in A2C, A3C and DDPG [10].
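For reference, one standard way to write such a weight update is the semi-gradient TD(0) rule below. It is offered as an assumption about the intended form, since the update itself is not reproduced above; the step size $\alpha$ and the notation $V(s; \theta)$ for the network's value estimate are introduced here for illustration.

```latex
\begin{align}
\delta_t &= R_{t+1} + \gamma V(S_{t+1}; \theta) - V(S_t; \theta) \\
\theta &\leftarrow \theta + \alpha \, \delta_t \, \nabla_\theta V(S_t; \theta)
\end{align}
```

The target $R_{t+1} + \gamma V(S_{t+1}; \theta)$ is treated as a constant when the gradient is taken, which is why the method is called semi-gradient.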

A. Deep Q Learning
In this algorithm, we use Deep Q Networks (DQNs), which consist of deep neural networks. It is a value-based RL algorithm. Each state in the environment is expressed as a set of pixels, and the agent can take distinct actions from each state. Rather than using value iteration, as in MDPs, to determine the Q-values and find the optimal Q-function, we use a function approximator, namely a deep neural network, to estimate the optimal Q-function. In Q-learning, the target depends on the prediction [11]. Q-learning is a semi-gradient, off-policy algorithm. We use a DQN, as shown in Fig. 3, to estimate the Q-values for each state-action pair in a given environment. The objective of this network is to approximate the optimal Q-function, which satisfies the Bellman equation. The loss is determined by comparing the Q-values output by the network with the target Q-values from the right-hand side of the Bellman equation. After the loss is calculated, the network updates its weights via stochastic gradient descent and backpropagation, and this is how the loss is minimized [12].
With Deep Q Networks, we often use a technique called Experience Replay, with a Replay Memory, during training. We store the agent's experiences $e_t$ at each time step in the replay memory. The network that is trained on randomly sampled experiences $e_t$ and outputs Q-values is called the policy network. In this network, the loss shown in (16) is backpropagated and minimized [13]. A Q-table is kept and updated at each time step during training. The new Q-value is equal to the weighted sum of the old Q-value and the learned value, as shown in (17).
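The two ideas in this paragraph, a replay memory and the weighted-sum Q-value update, can be sketched as below. The class and function names, the experience layout (s, a, r, s_next, done), and the learning rate `alpha` are assumptions made for the example; the learned value is taken to be the reward plus the discounted maximum Q-value of the next state, as in standard Q-learning.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of experiences e_t = (s, a, r, s_next, done)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences drop out

    def push(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.buffer), batch_size)


def q_update(q_old, reward, max_q_next, alpha=0.1, gamma=0.99):
    """New Q-value as a weighted sum of the old value and the learned value."""
    learned = reward + gamma * max_q_next
    return (1 - alpha) * q_old + alpha * learned


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.push((step, 0, 1.0, step + 1, False))   # made-up experiences
batch = memory.sample(2)
q_new = q_update(q_old=1.0, reward=1.0, max_q_next=2.0, alpha=0.5, gamma=0.9)
```

Because the buffer has capacity 3, only the three most recent experiences remain available for sampling after five pushes.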
The loss can also be written as

$\text{Loss} = (\text{Target} - \text{Prediction})^2$ (19)

If the same Q is used for both the target and the prediction, the target fluctuates along with the prediction; the two become dependent on each other, which makes training inefficient. To avoid this, we use a separate target network to compute the target values [14].

B. A2C and A3C Algorithms
A2C stands for Advantage Actor-Critic and A3C stands for Asynchronous Advantage Actor-Critic. Both are policy-based RL algorithms. Policy-based algorithms output policies rather than Q-values, and each policy distribution has different exploration estimations. Policy-based methods handle continuous action spaces easily, because the network outputs the parameters of the action distribution, which are finite in number [12][15]. In training a policy-based algorithm, instead of minimizing an error to find the optimal policy, the gradient of the policy objective is used. According to the Policy Gradient Theorem,

$\nabla_\theta J(\theta) = E[A(s, a) \nabla_\theta \log \pi(a \mid s)] \approx \frac{1}{N} \sum_{i=1}^{N} [A(s_i, a_i) \nabla_\theta \log \pi(a_i \mid s_i)]$ (20)

where the advantage is $A(s, a) = Q(s, a) - V(s)$, $\nabla_\theta$ is the gradient with respect to θ, and V(s) is the baseline. J(θ) is the objective whose gradient with respect to θ is found; its negative serves as the loss. The advantage function captures how preferable an action is compared to the others at a given state, while the value function captures how beneficial it is to be in that state. Both A2C and A3C are actor-critic algorithms. In both, we take N = 5 steps, collect all (state, action) pairs, calculate the N-step reward and advantage, and then follow the gradient to update the weights of the neural network.

In A3C, one master network intermittently copies its weights to the worker networks, as shown in Fig. 4. The worker nets are responsible for doing the rollouts, and the process is multi-threaded. Every 5 steps, each worker sends its gradients back to the master: instead of updating its own weights, the worker sends its gradients to the master net, and the master net updates its own weights. The master therefore holds the most up-to-date policy. A3C implements parallel training, in which multiple workers in parallel environments independently update a global value function. Each agent interacts with its own copy of the environment while the other agents interact with theirs. The reason this works better than a single agent (beyond the speedup of getting more work done) is that the experience of each agent does not depend on the experience of the others, so the overall experience available for training becomes more diverse.

In A2C, unlike A3C, the workers perform their steps synchronously. A2C is like A3C without the asynchronous part, and it can also be run as a single-worker variant. The critic estimates the value function, and the actor updates the policy distribution in the direction suggested by the critic, using policy gradients.
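The N-step reward and advantage computation described above can be sketched as follows. This is an illustrative reading, assuming the N-step return bootstraps from the critic's estimate of the state reached after the rollout; the function names and sample numbers are hypothetical.

```python
import numpy as np


def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """N-step returns and advantages for one rollout of N steps.

    rewards         : rewards collected by a worker over N steps
    values          : critic estimates V(s) for the N visited states
    bootstrap_value : critic estimate for the state after the rollout
    """
    returns = np.zeros(len(rewards))
    g = bootstrap_value
    for i in reversed(range(len(rewards))):
        g = rewards[i] + gamma * g          # discounted N-step return
        returns[i] = g
    advantages = returns - np.asarray(values)   # return minus baseline V(s)
    return returns, advantages


def policy_loss(log_probs, advantages):
    """Negative of the sampled objective in (20); minimizing it ascends J."""
    return -np.mean(np.asarray(log_probs) * advantages)


returns, advantages = n_step_advantages(
    rewards=[1.0, 1.0], values=[0.5, 0.5], bootstrap_value=2.0, gamma=0.5)
loss = policy_loss(log_probs=[-1.0, -1.0], advantages=advantages)
```

In a full implementation the log-probabilities would come from the actor network and the gradient of this loss would be taken by an automatic differentiation library; here they are plain numbers so that the arithmetic can be followed by hand.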
In A2C, we simultaneously optimize the value function and the policy [16]. We take N = 5 steps of an episode, collect the (state, action) pairs, calculate the N-step reward and advantage, and follow the gradient. The regularization here can be thought of as exploration. Equations (23) and (24) give the cost function and the loss function respectively.

$J = \sum_i (y_i - w^T x_i)^2 + \lambda \lVert \theta \rVert^2$ (23)

Here λ is called the regularization parameter and is used to penalize the weights.

Regularized loss = Policy loss + Penalty (24)

C. DDPG Algorithm
DDPG uses two networks. One is the policy network μ(s), which outputs a deterministic action for state s. The other is the Q network, which takes the action from μ(s) and the state s as its inputs. The output of the policy network μ(s) must pass through the Q network to obtain the loss, and the Q network outputs an action-value. When updating the weights $\theta^\mu$ of the μ-net, the weights $\theta^Q$ of the Q-net remain fixed, and the output of the Q-net is maximized by adjusting the weights of the μ-net [19]. For optimizing μ(s), the objective function of the μ-net and its gradient are shown in (25) and (26) respectively.

$J_\mu = E[Q(s, \mu(s))]$ (25)

$\nabla_{\theta^\mu} J_\mu = E[\nabla_\mu Q(s, \mu(s)) \cdot \nabla_{\theta^\mu} \mu(s)]$ (26)

We use this suboptimal approach: we calculate the gradient of the objective and try to maximize the sum of future rewards. DDPG updates the μ and Q nets alternately, with a separate loss for each [20], [21]. The loss function for the Q-net can be calculated as shown in (27), and we try to minimize it.

$J_Q = E[(r + \gamma Q_{targ}(s', \mu_{targ}(s')) - Q(s, a))^2]$ (27)

Here $Q_{targ}$ and $\mu_{targ}$ denote separate target networks. In DDPG, unlike DQN, we do a soft update of both the policy network and the Q-network: on every step we copy just a fraction of the weights from the main policy network and Q-network to the two target networks.

$\theta^\mu_{targ} \leftarrow \tau \theta^\mu + (1 - \tau) \theta^\mu_{targ}$ (28)

$\theta^Q_{targ} \leftarrow \tau \theta^Q + (1 - \tau) \theta^Q_{targ}$ (29)

where $0 < \tau \ll 1$ [22].

D. Evolution Strategies Algorithm
Evolution Strategies (ES) is a black-box optimization method. Both value-based and policy-based methods use gradient descent to minimize a loss, whereas ES takes a biologically inspired approach based on evolution. Evolution includes the concept of natural selection: as a general rule of nature, the fit survive and the weak die. The offspring that survive produce offspring for the next generation, which differ slightly from their parents. These beneficial changes compound, and after many generations the offspring are much stronger than their ancestors. Good changes are kept, and bad changes are discarded as their carriers die out [23].

Hill climbing is an optimization technique used to find a locally optimal solution to a computational problem. It starts with a solution that is poor compared to the optimal solution and iteratively improves from there. It does this by generating neighbor solutions to the current solution, picking the best one, and repeating the process until it can no longer find any improvement, at which point it has reached a local optimum.
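The hill climbing procedure just described can be sketched on a one-dimensional problem. The objective function, starting point and step size below are hypothetical and chosen only to keep the example small.

```python
def hill_climb(f, x, step=0.1, max_iters=1000):
    """Steepest-ascent hill climbing on a 1-D function f."""
    for _ in range(max_iters):
        neighbours = [x - step, x + step]   # candidate solutions near x
        best = max(neighbours, key=f)       # pick the best neighbour
        if f(best) <= f(x):                 # no improvement: local optimum
            break
        x = best
    return x


def objective(x):
    """Made-up objective with a single peak at x = 2."""
    return -(x - 2.0) ** 2


x_star = hill_climb(objective, x=0.0)
```

Starting from x = 0, the search moves right in steps of 0.1 and stops near x = 2, where neither neighbour improves the objective.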
Hill climbing can also proceed by adding random noise: we perturb the current point to try a new one, and if the new point has better fitness than the current point, it becomes the current point; if not, the noise is discarded and we try another random point [24]. Gradient descent is itself a specific kind of hill climbing algorithm. Let η be the learning rate, $\varepsilon_n$ the noise drawn from a normal distribution, σ the noise standard deviation, and θ(0) the initial policy parameters, where θ(n) denotes the policy parameters of the n-th policy.
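Using the symbols just defined (η, ε, σ and θ), one common form of the ES update can be sketched as below. This is an assumption about the intended update rule, not a reproduction of the paper's algorithm: each generation perturbs θ with Gaussian noise, scores the offspring, and moves θ towards the noise directions that scored well. The quadratic fitness function, the mean-centering of scores and all default values are illustrative choices.

```python
import numpy as np


def evolution_strategy(fitness, theta, eta=0.05, sigma=0.1,
                       population=50, generations=200, seed=0):
    """Gradient-free search: theta moves towards well-scoring perturbations."""
    rng = np.random.default_rng(seed)
    for _ in range(generations):
        # One noise vector per offspring, drawn from a standard normal.
        eps = rng.standard_normal((population, theta.size))
        scores = np.array([fitness(theta + sigma * e) for e in eps])
        scores = scores - scores.mean()     # centre scores to reduce variance
        # Weighted sum of noise directions, scaled by eta / (N * sigma).
        theta = theta + eta / (population * sigma) * eps.T @ scores
    return theta


target = np.array([0.5, -0.3, 0.1])         # made-up optimum


def fitness(w):
    """Hypothetical fitness: higher is better, maximal at w == target."""
    return -np.sum((w - target) ** 2)


theta = evolution_strategy(fitness, np.zeros(3))
```

No backpropagation is involved: the only information used is the fitness of each offspring, and the evaluations inside one generation are independent of one another, which is what makes them easy to run in parallel.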
We want to move in directions that are better than where we currently are. Parallelization is used to run multiple offspring at once. Unlike the previous algorithms, ES uses no backpropagation, MDPs, Bellman equation or value function [26][27]. ES shows better exploration behavior than other policy techniques. It has only a few hyperparameters: the learning rate, the population size (the number of offspring to create) and the noise standard deviation (how far offspring can move from the parent). Fig. 6 shows that, given an initial policy, we can always generate a population of similar policies around it by applying random changes to its weights. We then evaluate all these new policies and estimate the gradient, i.e. we check in which direction things look more promising. Finally, we update the policy parameters to move in that direction, and loop until we are satisfied with the outcome [26].

IV. DIFFERENCES BETWEEN THE RL ALGORITHMS
The main differences between the RL algorithms discussed above are as follows (see Table I).
Table I: The most valid differences between the RL algorithms

| Criterion | DQN | Advantage Actor-Critic | DDPG | ES |
|---|---|---|---|---|
| 1. Classification | It is a value-based RL algorithm [2] | A2C and A3C are policy-based RL algorithms | It is a combination of both value-based and policy-based RL algorithms [2] | It is different from value-based and policy-based RL algorithms [24] |
| 2. Training speed | Slowest | Fast | Slower than actor-critic but faster than DQNs […] | |
| Memory required | | Larger memory required, as it has many worker nets | Larger replay memory required, as it has two separate μ and Q target nets [20] | Very large memory required to store data for training |
| 5. Parallelization required | No | Yes, as many worker nets work in parallel to update the master net [16] | Does not support parallelization [21] | Yes, as many offspring models run in parallel [26] |
| 6. Backpropagation | Happens [11] | Happens | Happens [21], [22] | Does not happen |
| 7. | Contains only basic deep neural nets | Actor is π (stochastic) and critic is V (state-value function). Actors are the workers, which run in parallel with one critic, i.e. the master net [17] | Actor is μ (deterministic) and critic is Q (action-value function). There is just one actor and one critic | Multiple offspring run in parallel, following the concept of natural selection by following the best gradient direction |
| 8. Weights used | Weights of the main network are copied to the target nets [12] | Weights of the master net are copied to the worker nets | Weights are updated using soft updates to the μ and Q target nets separately [20] | Offspring nets have the same weights initially |

V. CONCLUSION
This paper has provided an overview of reinforcement learning algorithms, both those used in the past and those in use today. The comparison between the RL algorithms suggests that the Evolution Strategies algorithm is more efficient and faster than the other RL algorithms, with the drawback that the data used for its training requires a lot of memory. Reinforcement Learning is not limited to the algorithms discussed in this paper. Recent applications of this technology include Neural Scene Representation and Rendering, Brain-Computer Interfaces, Stock Prediction, Trading, Sports Betting, proving complex mathematical theorems, Health Care, Astronomy, Business, Manufacturing, Chatbots, Self-driving Cars, playing video games at superhuman levels and many more. Reinforcement Learning is widely seen as the future: researchers and scientists have predicted that humanoid robots with superhuman abilities will be built, much more intelligent and efficient than an average human. Such robots would also be able to innovate, learn by themselves and perform several tasks that a human cannot do at all.