 Open Access
 Authors : Nimisha Sunny
 Paper ID : IJERTCONV8IS04015
 Volume & Issue : NSDARM – 2020 (Volume 8 – Issue 04)
 Published (First Online): 17-03-2020
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Reinforcement Learning: Recent Threads
Nimisha Sunny
Abstract: Reinforcement learning is a method of training algorithms using reward and punishment feedback. A reinforcement learning agent interacts with its environment to extract information and learns from its experience through trial and error. The goal of reinforcement learning is to obtain a model that maximizes the total cumulative reward. Its notion of a policy resembles supervised learning in that both learn a mapping from inputs to outputs. This paper presents a detailed comparison and discussion of six reinforcement learning algorithms, their exploration and exploitation strategies, and their strengths and weaknesses. It also covers the background of reinforcement learning models, recent trends, advantages and future opportunities, and discusses state-of-the-art applications and achievements of reinforcement learning in various domains.
Index Terms: Reinforcement learning, trends, review, challenges, model-free, Q-learning, DDPG, SARSA, inverse reinforcement learning, actor-critic model.
I. INTRODUCTION
Reinforcement learning (RL) is a multidisciplinary machine learning technique and an active research topic in artificial intelligence. The last decade has seen success in both research and applications. Research on reinforcement learning began in the 1980s, with origins in statistics, game theory, control theory and animal psychology. Reinforcement learning is a goal-oriented method in which an agent interacts with its environment and learns from it through trial and error. In a dynamic environment, an RL agent extracts information from the current state and takes an appropriate action. The model is designed to choose the best-fit action so as to maximize reward and deliver maximum utility: actions are favoured according to the value of the reward function, and undesirable actions are suppressed through punishment feedback. Beyond single-agent reinforcement learning, detailed research is now being carried out on multi-agent systems and swarm intelligence. Genetic algorithms and hybrid approaches aim to surpass human cognitive ability and reach superhuman performance. The ability of reinforcement learning to work directly from raw pixel data has made it an integral part of computer vision and digital game environments.
An RL agent learns the best policy by exploring its environment. Reinforcement learning models solve complex computational problems by defining a reward mechanism and control policies, and they choose the policy that maximizes reward. The concept is inspired by and adapted from the natural learning process of animals, especially the learning and action behaviour of human beings. In reinforcement learning, data are generated through exploration of the environment. The agent is informed only of the starting state, and its behaviour is shaped by rewards and punishments. The reward feedback, a scalar objective function, measures the performance of each step. Reinforcement learning is closely related to optimal control theory: both seek a control policy that optimizes an objective function. The ability of RL agents to make decisions under uncertainty sets RL apart from other machine learning techniques. The agent is not told which move to make; it must decide on the action that maximizes long-term reward and execute it. The selected action then moves the environment from its current state to the next state.
The RL algorithms developed during the last two decades have improved considerably and have achieved good results in complex real-world applications. These advances in RL techniques and algorithms call for a comparative critical survey.
This paper discusses recent trends in RL in terms of applications and research achievements. It provides a general overview of RL techniques and explains and compares the major reinforcement learning algorithms. The paper is divided into six sections. After the introduction, Section 2 gives a brief history of RL, a general overview of RL algorithms and their classification, and a simple explanation of the reinforcement learning model architecture. The core of the paper is Section 3, which provides a detailed analysis and comparison of six reinforcement learning algorithms using various classification and performance measurement parameters. State-of-the-art achievements and applications of RL are presented in Section 4. Challenges and future research opportunities in reinforcement learning are discussed in Section 5.

II. REINFORCEMENT LEARNING: AN OVERVIEW

A. History of Reinforcement Learning
The concept of reinforcement learning originated from two separate threads. The first is the trial-and-error concept developed during the 1980s. The second concerns optimal control and its solution using value functions and dynamic programming. The concept of optimal control was introduced in the 1950s to describe the design of a controller that minimizes a measure of a dynamical system's behaviour over time. It was developed further after the introduction of the Bellman equation. Methods that use the Bellman equation to solve optimal control problems are generally called dynamic programming. The Markov decision process, known as the base model of reinforcement learning, was introduced by Bellman. In 1977 Paul Werbos proposed heuristic dynamic programming, an approximate approach to dynamic programming, and connected the concepts of dynamic programming and optimal control with computational learning.
The concept of trial-and-error learning can be traced back to Alexander Bain's notion of learning by "groping and experiment". Edward Thorndike presented trial and error as a learning principle, which he called the Law of Effect because it describes the effect of reinforcing events on the tendency to select actions. Trial-and-error learning was linked to artificial intelligence when Alan Turing proposed a design for a pleasure-pain system based on the Law of Effect (Turing 1996). Richard Sutton, known as the father of computational reinforcement learning, extended the concept of trial-and-error learning to reinforcement learning models. He introduced temporal difference learning and the policy gradient algorithm, two of the most fundamental reinforcement learning techniques. The Q-learning algorithm was developed by Chris Watkins by combining the concepts of temporal difference learning and optimal control.
Several reinforcement learning algorithms and techniques were developed and successfully validated in the last decade, during which reinforcement learning underwent various improvements and innovations. The next subsection discusses the classification of reinforcement learning methods and their evolution, along with a brief explanation of the general reinforcement learning model.

B. Classification & Model Architecture
RL methods are mainly classified into two types: model-based and model-free reinforcement learning. The model is the agent's perception of its environment: it maps state-action pairs to a probability distribution over next states. Model-based methods choose an optimal policy based on the learned model. In a model-based method, the agent has a clear understanding of the environment in which it acts, which may be fully or partially observable. Knowing the environment prepares the agent to choose the best action and hence maximize future rewards, because it can foresee what might happen when it picks a specific action from the range of possible ones. A model-free agent, by contrast, has no prior knowledge of the environment in which it acts. Its actions are tied to the current state, and it has no advance knowledge of the next state or the outcome of an action. The learning process of model-free algorithms relies mainly on experience: they follow a trial-and-error mechanism for each state-action pair to maximize rewards and arrive at an optimal policy. Model-free algorithms depend on the instantaneous rewards from specific state-action pairs to evaluate the utility of a state.
Model-free reinforcement learning methods are further divided into policy-based (on-policy) and value-based (off-policy) algorithms, although the two distinctions do not coincide exactly: SARSA, for example, is value-based yet on-policy. Policy-based algorithms learn the policy directly, without using a value function; the agent learns a policy function over state-action pairs. In these models the policy is written as π_θ(s, a), where θ denotes the weights of the policy, s is the state and a is the action. A policy-based model optimizes θ either by gradient ascent on the objective function or by maximizing local approximations of it. Policies are of two types. A deterministic policy, used in deterministic environments such as complex board games, maps each state to an action without any uncertainty. A stochastic policy gives a probability distribution over actions in each state. Value-based algorithms learn and act without an explicit policy: the action-value function Q(s, a) measures how good an action is in a particular state, and these methods commonly use an objective function defined by the Bellman equation. In the value-based model, optimization is off-policy, which means that the behaviour policy used to generate the training data is independent of the policy being estimated. Policy gradient algorithms and actor-critic methods are common examples of on-policy, model-free reinforcement learning. Q-learning, deep Q-learning and DDPG are off-policy methods, although DDPG uses both policy-based and value-based approaches. Model-based reinforcement learning places strong emphasis on the control function f(s, a). The performance of a model-based method is defined within a specific environment or task, which is considered a limitation of the approach. Learning in model-based reinforcement learning falls into two categories: "learn the model" and "given the model".
This paper focuses on model-free reinforcement learning methods. The major algorithms under the policy-based and value-based approaches (reinforcement learning, deep reinforcement learning and inverse reinforcement learning) are briefly discussed in the sections below.
A simple RL system consists of an agent and an environment; the agent interacts with the environment and explores its states by taking actions. Together with the agent and the environment, a reinforcement learning model has the following key elements.

Agent: The agent is the interacting part of the RL system. It is both the learner and the decision-maker.

Environment: RL agents are designed to interact with the environment and learn from it. The nature of the problem or application defines the environment.

Actions: The set of actions that the agent can perform. Taking an action leads the agent to explore new states.

States: A state provides complete information about an instance of the environment; no information is hidden from the state.

Policy: The policy defines the learning and action behaviour of the agent in each state and at each time. It is the decision-making process, a mapping from perceived states to actions. Depending on the environment, policies can be stochastic or deterministic.

Reward signal: A reward is a scalar value and it defines the goal of the RL models. Based on the actions performed the environment will send a reward to the agent. The reward function is defined using actions and states and the equation is given as r=R(s, a).

Model: The model is optional; not all RL methods use one. A model is the agent's perception of its environment and is used to develop an understanding of it.
Fig. 1: Reinforcement learning Model.
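The interaction between these elements can be sketched as a simple loop. The following is a minimal illustration only: the environment `CoinFlipEnv`, the function names and the reward values are hypothetical and are not taken from any algorithm discussed in this paper.

```python
import random


class CoinFlipEnv:
    """Hypothetical two-state environment, used only for illustration."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Reward signal r = R(s, a): +1 if the action matches the state, else -1.
        reward = 1.0 if action == self.state else -1.0
        # The environment then moves to a randomly chosen next state.
        self.state = self.rng.randint(0, 1)
        return self.state, reward


def random_policy(state, rng):
    """A stochastic policy that ignores the state (stand-in for a learned one)."""
    return rng.randint(0, 1)


def run_episode(env, policy, steps=10, seed=0):
    """Run the agent-environment loop and return the total reward."""
    rng = random.Random(seed)
    state = env.reset()
    total_reward = 0.0
    for _ in range(steps):
        action = policy(state, rng)        # the agent chooses an action
        state, reward = env.step(action)   # the environment returns state and reward
        total_reward += reward
    return total_reward
```

Calling `run_episode(CoinFlipEnv(), random_policy)` runs ten steps with the random policy; a policy that always returns the observed state would collect the maximum reward in this toy setting.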
Reinforcement learning is formalized using the Markov decision process (MDP). The MDP is important to discuss because reinforcement learning approaches assume that the underlying problem is an MDP. An MDP models sequential decision making, in which each action influences not only the immediate reward but also the next state and, through it, future rewards. An MDP is made up of a set of states (S), a set of possible actions (A), rewards (R) and state transition probabilities (P); in short, an MDP is a 4-tuple (S, A, R, P). In a reinforcement learning setting, P and R are unknown. A model-free method can discover rewards only through trial and error, so RL models use a value function to capture long-term reward. Dynamic programming is used to solve MDPs in which the model, including the reward function, is known. When the reward function is unknown, model-free or model-based RL algorithms must be used instead.
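To make the dynamic programming case concrete, the sketch below applies value iteration to a toy MDP whose P and R are fully known. The two states, two actions, transition probabilities and rewards are invented for illustration; value iteration is one standard dynamic programming method and is used here only as an example.

```python
# A toy MDP (S, A, R, P) with known dynamics; all numbers are hypothetical.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
# P[s][a] is a list of (probability, next_state) pairs.
P = {
    "s0": {"stay": [(1.0, "s0")], "move": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "move": [(1.0, "s0")]},
}
# R[s][a] is the immediate reward for taking action a in state s.
R = {
    "s0": {"stay": 0.0, "move": 0.0},
    "s1": {"stay": 1.0, "move": 0.0},
}


def action_value(V, s, a, gamma):
    """Expected return of taking a in s and then following the values in V."""
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])


def value_iteration(gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            best = max(action_value(V, s, a, gamma) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(V, gamma=0.9):
    """Pick, in every state, the action with the highest backed-up value."""
    return {s: max(ACTIONS, key=lambda a: action_value(V, s, a, gamma)) for s in STATES}
```

In this toy problem the greedy policy moves from `s0` to `s1` and then stays there, since `s1` is the only state that pays a reward.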
A value function defines what is good for an RL agent in terms of long-term reward, for a state or a state-action pair. The Monte Carlo method can be used to estimate the value function: it runs many trials from a state and takes the average of the observed returns as the value of that state. Monte Carlo estimation is an iterative approximation method and a popular RL technique. Temporal difference (TD) learning underlies many RL algorithms. It was introduced by Sutton, combining ideas from dynamic programming with the Monte Carlo approach, and is among the most popular reinforcement learning methods. It learns the value function by comparing temporally successive predictions (Sutton 1988).
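The difference between the two estimators can be sketched on a tiny example. The three-state episode, the discount factor and the step size below are assumptions made for illustration; the TD variant shown is the one-step TD(0) update.

```python
GAMMA = 0.9
# One episode under a fixed policy: (state, reward received on leaving it).
# The states and rewards are hypothetical.
EPISODE = [("A", 0.0), ("B", 0.0), ("C", 1.0)]


def monte_carlo_values(episodes):
    """Every-visit Monte Carlo: average the observed returns for each state."""
    returns = {}
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + GAMMA * g                 # return from this state onward
            returns.setdefault(state, []).append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


def td0_values(episodes, alpha=0.5):
    """TD(0): move each estimate towards reward + discounted next estimate."""
    V = {}
    for episode in episodes:
        for i, (state, reward) in enumerate(episode):
            is_last = i + 1 == len(episode)
            next_value = 0.0 if is_last else V.get(episode[i + 1][0], 0.0)
            v = V.get(state, 0.0)
            V[state] = v + alpha * (reward + GAMMA * next_value - v)
    return V
```

Monte Carlo has to wait for the end of an episode before it can compute a return, whereas TD(0) updates after every step using its own next-state prediction; on this deterministic example both converge to the same values.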


III. REINFORCEMENT LEARNING ALGORITHMS

A. Q-Learning
The best-known algorithm in reinforcement learning is Q-learning. It is widely used for implementing agents and robots (Kim and Cho 2015). Q-learning is used to predict the value of future reward signals. It has the structure of temporal difference learning and is both model-free and off-policy: in each state the agent selects an action, receives a reward and moves on to the next state. The purpose of Q-learning is to maximize the total reward. With s_t the current state, a_t the action taken, r_t the reward and s_{t+1} the next state, the Q-learning update under the greedy strategy can be written as:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
Q-learning exploits its environment by using the information from the current state to take the action that maximizes Q[s, a]. It explores its environment using an ε-greedy strategy in order to learn the best Q-function. The Q-learning update rule gives the best total reward obtainable for an action taken in a state under the greedy strategy. It involves a learning rate, a discount factor and initial conditions. The learning rate determines to what extent newly acquired information overrides old information. The discount factor determines the importance of future rewards. The initial conditions are the values that Q is assumed to have before the first update from interaction with the environment. High initial values encourage exploration, and the first reward received can also be used to reset the initial conditions. In any state, Q-learning selects the agent's next action so as to maximize the total reward.
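A minimal tabular sketch of this procedure is given below. The five-cell corridor environment, the hyperparameter values and the function names are assumptions made for illustration, not the setup of any study cited here.

```python
import random

N_STATES = 5          # hypothetical corridor with cells 0..4; cell 4 is the goal
ACTIONS = [-1, +1]    # move left or move right


def step(state, action):
    """Deterministic corridor dynamics with reward 1 on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Q-table with one row per state and one column per action, initialised to 0.
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))        # explore
            else:
                best = max(Q[state])                   # exploit, ties broken at random
                a = rng.choice([i for i, q in enumerate(Q[state]) if q == best])
            next_state, reward, done = step(state, ACTIONS[a])
            # Off-policy target: greedy value of the next state.
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][a] += alpha * (target - Q[state][a])
            state = next_state
    return Q
```

After training, the greedy action in every non-goal cell is "move right", and the learned values decay by the discount factor with distance from the goal.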
Q-learning has been reported to be a suitable reinforcement learning algorithm for tasks such as air combat, where the best action must be selected. It is recommended because it learns an action-value function that returns the value of taking an action in a state. By following a policy and using the Q-value of each action, the agent can evaluate states and actions even when the model of the environment is unknown, which is the main strength of Q-learning. Q-learning also forms the basis of a content caching scheme designed to store content reliably through collaboration between an edge server and the content provider. For shortest-path problems, Q-learning with a Q-table is an effective approach. Q-learning computes values based on the present state and expected future values. The Q-table holds the Q-values; once an ideal Q-table is known, the agent can move through the environment and choose, in each state, the action with the highest Q-value. A Q-table has dimensions n × m, where n is the number of agent actions and m is the total number of states. When no Q-table is available, a memory-based learning (MBL) system can be used to reduce the amount of trial and error needed to learn the Q-values. The MBL structure emulates the human cerebellum. It is a table look-up method for nonlinear functions, in which blocks are used to store data, which can be numerical. Blocks also accelerate learning, because an update can be spread to nearby blocks.
Q-learning also has weaknesses that affect its performance. First, the time needed to learn grows with the size of the state space, because the learning process requires enough time to visit every state-action pair. Second, Q-learning treats each state as distinct; locally weighted regression has been proposed to address this problem. Interesting results have been reported for a cube-merging task in which Q-learning is applied to agents that have incomplete information or limited observation of their domain and that must coordinate their movements to converge on a goal (Mourad et al. 2014). Q-learning has also been applied to urban traffic control. Conventional fixed-time traffic signal control performs poorly under difficult traffic conditions involving many intersections, and Q-learning-based signal control offers a workable solution.

B. SARSA
SARSA is a reinforcement learning algorithm based on on-policy TD control: its target policy and behaviour policy are the same ε-greedy policy. The major difference between SARSA and Q-learning is that SARSA updates using the action actually taken in the next state, whereas Q-learning updates using the action with the highest estimated value (Jiang et al. 2019). The SARSA update uses the five-tuple (s, a, r, s1, a1), where s is the present state, a is the action selected in the present state, r is the reward received after that action, and s1 and a1 are the next state and the next action. SARSA backs the reward value up one step, and in this respect it is similar to Q-learning. The Q-value is updated in SARSA using the following equation:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$
Here s_t is the present state at time step t, a_t is the action selected in that state, r_t is the reward, α is the learning rate (0 < α ≤ 1) and γ is the discount rate (0 < γ ≤ 1) (Park et al. n.d.). In SARSA, the agent must interact with the environment and take an action before the policy can be updated. As an on-policy method, SARSA also uses a Q-table, a matrix whose rows and columns indicate actions and states. SARSA supports various exploration policies, such as ε-greedy and softmax. Softmax exploration selects actions probabilistically, weighting each action by its estimated value according to the Boltzmann distribution:

$$\mathrm{Prob}(a_i) = \frac{e^{Q(s, a_i)/T}}{\sum_{j} e^{Q(s, a_j)/T}}$$
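The SARSA update and Boltzmann (softmax) action selection can be sketched as follows. This is a minimal illustration under assumed names and default hyperparameters; the Q-table is taken to be a list of rows indexed by state, and T is the temperature.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Prob(a_i) = exp(Q_i / T) / sum_j exp(Q_j / T), computed stably."""
    m = max(q_values)                        # subtract the max to avoid overflow
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, temperature, rng):
    """Draw an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """On-policy update: the target uses the action a_next actually chosen."""
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

In a training loop, `a_next` would be drawn with `sample_action` before calling `sarsa_update`, so that the same policy both generates behaviour and appears in the update target.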
The temperature T determines how strongly the Q-values influence action selection. A low temperature leads to nearly greedy action choice, whereas a high temperature gives all actions similar odds of being picked. As an alternative to customary strategies such as ε-greedy, a count of state-action visits has been incorporated into the Boltzmann distribution as an exploration strategy, and a count-based exploration bonus has been added to aid the agent's exploration during learning. A count is kept for each state-action pair (s, a): whenever the agent chooses action a in state s, the count for that pair is incremented. This count is then used in the Boltzmann distribution.
In the Boltzmann distribution, Prob(a1) is the likelihood of executing action a1 and T is the temperature. Randomness increases with T: the higher the value of T, the more random the action choice. If the agent relies heavily on the current value function in the initial stage of learning, the exploration-exploitation problem arises and the optimal solution may not be reached, because the probability of falling into a local optimum is high. To counter this, an exploration bonus is added to the reward.
The bonus is initially set to zero for all state-action pairs (s, a). The reward including the bonus is denoted R+, while performance is evaluated using the total reward without bonuses. In count-based SARSA, therefore, an action is not only selected according to the number of times the state-action pair has been visited; an exploration bonus that depends on the count is also added to the learning procedure.
Value-difference-based exploration (VDBE) is used in SARSA to reduce unnecessary exploration in episodic tasks, where states are explored unequally, by adapting exploration to the information gathered about each state. The main feature of VDBE is a state-dependent exploration probability in place of a single global parameter. Every time an action is taken in a state, the exploration probability of that state is updated. VDBE has two parameters, σ and δ. A high value of σ means that large value differences are required to trigger exploration, whereas with a low value even small changes lead to further exploration. δ determines the influence of a single action on the ε-value of a state. VDBE-Softmax is another exploration strategy for SARSA. Like VDBE, it keeps a state-dependent exploration probability, which is used to decide whether or not to explore, but exploratory actions are chosen with softmax behaviour.
Its purpose is to help in situations where a few actions produce strongly negative rewards; selecting exploratory actions according to their Q-values avoids an unnecessary amount of exploration of bad actions. SARSA has several advantages. It is well suited to problems with large negative rewards, and it avoids dangerous paths during exploration, which is an advantage over algorithms such as Q-learning. Its weakness is that, while exploring, it learns a near-optimal rather than an optimal policy, so it is not always the best reinforcement learning algorithm.
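The state-dependent update of the exploration probability in VDBE can be sketched as below. The functional form is an assumption based on the commonly used VDBE formulation (a Boltzmann-style function of the absolute value difference, blended with the previous probability); the parameter names `sigma` and `delta` and the function name are chosen for this sketch.

```python
import math


def vdbe_update(epsilon_s, td_error, alpha, sigma, delta):
    """Return the new exploration probability for one state.

    Assumed form:
        f   = (1 - exp(-|alpha * td_error| / sigma)) / (1 + exp(-|alpha * td_error| / sigma))
        eps = delta * f + (1 - delta) * eps
    """
    x = math.exp(-abs(alpha * td_error) / sigma)
    f = (1.0 - x) / (1.0 + x)                 # in [0, 1): large value change, f near 1
    return delta * f + (1.0 - delta) * epsilon_s
```

With a large `sigma`, only large value differences push the exploration probability up; with a small `sigma`, even small differences do. A zero value difference lets the probability decay by the factor `1 - delta`.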

C. Actor-Critic Method
The actor-critic method is an on-policy learning method. Actor-critic methods are temporal difference methods that use a separate memory structure to represent the policy explicitly, independent of the value function. The policy structure is called the actor because it chooses actions, and the estimated value function is called the critic because it criticizes the actions made by the actor. The critic must know, and critique, whatever policy the actor is currently following. The critique takes the form of a TD error, which is the output of the critic and drives the learning of both actor and critic.
Fig. 2: Actor-critic architecture.
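Typically, the evaluation takes place after each action has been carried out: the critic determines whether the outcome is better or worse than expected. This evaluation is the TD error:

$$\delta_t = r_t + \gamma V(x_{t+1}) - V(x_t)$$

where V is the critic's current value function and x_t is the state at time t.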
Actions are selected using the Gibbs softmax method, which is given as:

$$\pi_t(x, a) = \frac{e^{p(x, a)}}{\sum_{b} e^{p(x, b)}}$$
Here p(x, a) is the value at time t of the actor's preference parameter, indicating the tendency to choose action a in state x. After action a_t is taken in state x_t, the preference is updated using the TD error δ_t:

$$p(x_t, a_t) \leftarrow p(x_t, a_t) + \beta \delta_t$$

where β is a positive step-size parameter. Complete knowledge of the environment is difficult to obtain, so an agent acting in a real setting must balance exploration against exploitation. Popular exploration methods used in actor-critic include ε-greedy, Boltzmann and Gibbs softmax. Exploration can be divided into two kinds: undirected exploration, which relies on randomization, and directed exploration, which uses a strategy that maximizes the knowledge gained. Another exploration strategy is the hybrid Gibbs softmax method.
In the hybrid method, the positive weight on the directed-exploration term can be fixed at a larger value and reduced later during learning, and n_t(a) is the total number of times action a has been used up to time step t. Policy learning for task-completion dialogue systems can be improved using adversarial advantage actor-critic. Advantage actor-critic (A2C) is a popular policy-based approach that seeks a policy that maximizes the reward R, equivalently minimizing a loss J. The reward is accumulated over a dialogue of length T with a discount factor γ. The adversarial advantage actor-critic encourages the actor to select actions that are guided by a discriminator, in order to enhance exploration. Among action-selection policies, ε-greedy is the most commonly used because of its good exploration properties. While learning a skill, the agent does not take the greedy action towards its goal 100% of the time; a small percentage of its actions are random exploration. Where purely random exploration is problematic, the Boltzmann distribution is used instead.
The actor-critic method combines an actor and a critic. It has various capabilities, such as producing continuous actions. The critic evaluates the actor's policy by updating the value function, and the value function is in turn used to update the actor's policy to achieve good performance. Actor-critic methods are widely used in robotics, power control and finance because they can find an optimal policy using a low-variance gradient estimate, which speeds up learning. The variance is reduced because the critic can be learned with TD methods, which trade variance for bias. A further advantage is that a gradient update with a good critic makes actor-critic more sample-efficient, since a TD update is made at every step; this overcomes the high-variance gradients of actor-only methods. Actor-critic also has challenges. For instance, applying an actor-critic algorithm to a particular problem does not always yield a good control policy, even with prior experience of the problem. In addition, using the same features for the actor and the critic in experiments can affect the value function.
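One step of a tabular actor-critic learner, with the critic computing the TD error and the actor adjusting its action preferences, can be sketched as follows. The table layout, function names and default step sizes are assumptions made for illustration.

```python
import math


def gibbs_softmax(preferences):
    """Turn the actor's preferences p(x, .) into action probabilities."""
    m = max(preferences)                       # subtract the max for stability
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(V, p, x, a, r, x_next,
                      alpha=0.1, beta=0.1, gamma=0.9, done=False):
    """Apply one critic update and one actor update; return the TD error."""
    next_value = 0.0 if done else V[x_next]
    td_error = r + gamma * next_value - V[x]   # the critic's critique of action a
    V[x] += alpha * td_error                   # critic: improve the value estimate
    p[x][a] += beta * td_error                 # actor: strengthen or weaken action a
    return td_error
```

A positive TD error means the action turned out better than the critic expected, so its preference, and hence its selection probability under `gibbs_softmax`, increases.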

D. Deep Q-Learning
Deep Q-learning is a deep reinforcement learning technique. Deep reinforcement learning (DRL) integrates deep learning (DL) techniques with reinforcement learning (RL) methods. It combines the principles of DL and RL to produce algorithms that are effective in sectors such as video games, robotics, finance and healthcare. Implementing reinforcement learning algorithms with deep neural network architectures yields a powerful model. A deep reinforcement learning framework is designed to solve sequential decision-making problems by trial and error in a world that provides only occasional rewards. DRL trains deep neural networks to approximate the optimal policy and/or the optimal value functions V*, Q* and A*. Policy search methods fall into two main groups: gradient-free and gradient-based methods. Gradient-free algorithms avoid the commonly used backpropagation algorithm.
In deep Q-learning, the agent learns through repeated interaction with its environment. Each time the agent chooses an action from the set of available actions, the environment moves to a new state. A reward or punishment is attached to each action as feedback, and the agent's aim is to maximize the reward. The Q-learning update is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
Q(s_t, a_t) is the Q-value of the agent for taking action a_t in state s_t at time t, for which it receives reward r_t. The learning rate is α and the discount factor is γ. The learning rate determines how quickly newly acquired information overrides previous information; a high learning rate produces a larger change in the Q-value. The discount factor determines the importance of future rewards: the closer it is to 1, the more weight is given to future states.
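As a worked illustration of the roles of the learning rate α and the discount factor γ in the Q-learning update, suppose Q(s_t, a_t) = 2, r_t = 1, γ = 0.9 and the best next-state value is 5. These numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
\begin{align*}
\text{target} &= r_t + \gamma \max_{a} Q(s_{t+1}, a) = 1 + 0.9 \times 5 = 5.5 \\
\alpha = 0.1:\quad Q(s_t, a_t) &\leftarrow 2 + 0.1\,(5.5 - 2) = 2.35 \\
\alpha = 0.9:\quad Q(s_t, a_t) &\leftarrow 2 + 0.9\,(5.5 - 2) = 5.15
\end{align*}
```

The larger learning rate moves the estimate almost all the way to the target in a single step. With γ = 0.5 instead, the target itself would drop to 1 + 0.5 × 5 = 3.5, reflecting the lower weight given to future reward.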
This algorithm is an off-policy control method. It uses an update rule to learn the action-value function, applying the max operator to improve the policy greedily with respect to the action values. In its tabular form, it is used in small spaces where the values can be stored in a table, so a two-dimensional array represents the state space and action space, and a dynamic programming strategy is used to assign the values in the array. As in other reinforcement learning algorithms, there is a dilemma in choosing between exploration and exploitation.
On one side, the agent wants to try as many actions as possible to figure out the optimal strategy; this is called exploration. On the other side, the agent wants to select the action with the maximum Q-value to increase the reward; this is called exploitation. As the optimal strategy can only be determined through exploring, exploration is of high importance to learning. However, too much exploration can reduce performance. To overcome this, an action selection strategy that balances exploration and exploitation is used in the learning process. The $\epsilon$-greedy strategy keeps the system from getting stuck in a local optimum. It chooses a random action with probability $\epsilon \in [0,1]$ and the action with the maximum Q-value otherwise. The random selection of actions in the present state ensures that the whole state space is explored. By reducing $\epsilon$ over time, the agent gradually shifts towards exploitation.
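The tabular update and the decaying exploration rate described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the five-state chain environment, the function names, and all hyperparameter values (`alpha`, `gamma`, `decay`, `epsilon_min`) are assumptions chosen only to keep the example small.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties between equally good actions at random.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def step(state, action, n_states):
    """Hypothetical chain environment: action 0 moves left, action 1 moves right."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(n_states - 1, state + 1)
    done = next_state == n_states - 1  # the rightmost state is the goal
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(n_states=5, episodes=300, alpha=0.5, gamma=0.9,
          epsilon=1.0, epsilon_min=0.05, decay=0.99, seed=0):
    rng = random.Random(seed)
    # Two-dimensional table: one row per state, one column per action.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 100:
            action = epsilon_greedy(q[state], epsilon, rng)
            next_state, reward, done = step(state, action, n_states)
            # Off-policy target: uses the max over next actions,
            # regardless of which action is actually taken next.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            steps += 1
        # Reduce epsilon over time to shift from exploration to exploitation.
        epsilon = max(epsilon_min, epsilon * decay)
    return q


q_table = train()
# Greedy action in every non-terminal state after training.
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
```

In this toy setting the greedy policy read from `q_table` moves right in every non-terminal state, since that is the only way to reach the rewarded goal state.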
E. Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient (DDPG) is one of several deep reinforcement learning algorithms (Tuyen and Chung 2017). It can handle high-dimensional continuous action spaces, and it uses deep neural networks (DNN) to represent the policy. Its stability and strong learning ability are the reasons it is widely used (Pang and Gao 2019). The algorithm builds on traditional RL approaches such as actor-critic and policy gradient. It has two main neural networks, the actor network and the critic network. The actor network approximates the policy function, whereas the critic network approximates the value function. The input of the critic network is the state together with the action output by the actor network. The critic's output, a Q-value, evaluates the action performed, and the actor is updated accordingly. Other components of DDPG are the target network, experience replay, and the deterministic policy gradient theorem. The DDPG algorithm framework is given below.
Fig.3: Framework of the DDPG algorithm.
The main advantage of DDPG is that it does not need a probabilistic representation of the control policy, which makes it well suited to problems with a deterministic policy. Despite its advantages, DDPG has limitations: the actor depends heavily on the critic for its learning, so the training of DDPG is sensitive to the quality of the critic's learning.
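The actor's dependence on the critic can be read off the DDPG update rules. The following is a sketch in the notation of Lillicrap et al. (2015); the symbols $\mu$ (actor), $Q$ (critic), the primed target networks, the minibatch size $N$ and the soft-update rate $\tau$ are not defined in this paper and are introduced here as assumptions.

```latex
% Critic target, computed with the target networks Q' and \mu'
y_i = r_i + \gamma \, Q'\big(s_{i+1}, \mu'(s_{i+1})\big)

% Critic loss over a minibatch of N transitions from the replay buffer
L_{\text{critic}} = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i)\big)^2

% Actor update: the gradient flows through the critic's estimate of Q
\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i
    \nabla_a Q(s_i, a)\big|_{a = \mu(s_i)} \, \nabla_{\theta^\mu} \mu(s_i)

% Soft update of each target network's parameters
\theta' \leftarrow \tau \theta + (1 - \tau)\, \theta'
```

Because the actor gradient is built entirely from $\nabla_a Q$, errors in the critic propagate directly into the policy update.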
DDPG trains a deterministic policy in an off-policy way. Because the policy is deterministic, an agent exploring on-policy would not try a wide enough variety of actions to find useful learning signals at the beginning. Noise is therefore added to the actions during training to make DDPG policies explore better.
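A minimal sketch of action noise for a deterministic policy is given below. The linear `deterministic_policy` is a hypothetical stand-in for the actor network, and zero-mean Gaussian noise with clipping to `[-1, 1]` is an assumption made for simplicity; other noise processes can be substituted.

```python
import random


def deterministic_policy(state):
    """Hypothetical stand-in for the actor network: a fixed linear map."""
    return 0.5 * state


def noisy_action(state, noise_std, rng, low=-1.0, high=1.0):
    """Add zero-mean Gaussian noise to the deterministic action, then clip."""
    action = deterministic_policy(state) + rng.gauss(0.0, noise_std)
    return min(high, max(low, action))


rng = random.Random(0)
state = 0.8

# Without noise the policy returns the same action for the same state.
greedy = noisy_action(state, noise_std=0.0, rng=rng)

# With noise the agent samples a spread of actions around the greedy one.
explored = [noisy_action(state, noise_std=0.2, rng=rng) for _ in range(1000)]
```

The noise is only applied while collecting training data; at evaluation time the deterministic action would be used directly.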
Diversity-driven exploration is an effective way to motivate an agent to examine a richer set of states. It can be achieved by modifying the loss function as follows:

$$L_D = L - \mathbb{E}_{\pi' \in \Pi'} \left[ \alpha \, D(\pi, \pi') \right]$$

where $L$ is the loss function of the deep reinforcement learning algorithm, $\pi$ is the current policy, $\pi'$ is a policy sampled from a limited set $\Pi'$ of the most recent policies, $D$ is a measure of the distance between $\pi$ and $\pi'$, and $\alpha$ is the scaling factor for $D$. This equation makes an agent proactively try new policies, which increases its chances of visiting novel states even in the absence of reward signals from the environment $\mathcal{E}$. This property is also useful in sparse reward settings, where the reward is zero for most of the states in $S$. In addition, the distance measure $D$ motivates exploration by modifying the agent's current policy $\pi$ instead of randomly altering its behaviour. The equation also allows an agent to follow greedy policies while exploring in the training phase. With greedy policies, since $D$ requires the agent to adapt its policy after each update, the greedy action for a state may change accordingly, which directs the agent towards unseen states. There are various choices for $D$, such as KL-divergence, mean squared error (MSE) or the L2-norm.

F. Inverse Reinforcement Learning
The agent is the central component of reinforcement learning (RL): it solves problems using the experience it gains from interacting with a dynamic environment. Inverse reinforcement learning (IRL) was introduced to address a difficulty of RL, namely that the reward function must be designed in advance. IRL is a recently developed machine learning framework that addresses the inverse problem of reinforcement learning: instead of being given a reward function, it is given the observed behaviour of an expert. IRL has attracted researchers from fields such as machine learning, AI, psychology and control theory. It is appealing because recorded data can be used to build autonomous agents that model a task without intervening in its performance.
There are three main cases of IRL algorithms ("Algorithms for Inverse Reinforcement Learning"):
First, a finite state space where the optimal policy is known.
Second, an infinite state space where the optimal policy is also known.
Lastly, an infinite state space where the optimal policy is unknown.
Fig.4: IRL vs RL representation.
Of the above cases, only the last, an infinite state space with an unknown optimal policy, is close to the problems encountered in practice (Arora and Doshi 2018).
IRL has several limitations, and the key problems relate to the reward function. Any observed behaviour is consistent with many reward functions, and some of these solutions are degenerate, such as a reward that is zero in every state. IRL rests on the assumption that the expert's policy is optimal with respect to an unknown reward function. The apprentice's first goal is therefore to learn a reward function that explains the observed expert behaviour. Then, using reinforcement learning, it optimizes its policy according to this reward and ideally behaves as well as the expert. Learning a reward has several advantages over learning a policy directly. First, the reward can be analysed to better understand the expert's behaviour. Second, it allows adaptation to perturbations in the dynamics of the environment, and it can be transferred to other environments. However, the key problem is that an MDP must be solved to obtain the optimal policy for the reward.
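The reward ambiguity described above can be illustrated on a toy problem. The sketch below uses a hypothetical three-state deterministic chain and plain value iteration; it is only meant to show that an all-zero reward makes every action optimal, and that rescaling a reward leaves the optimal behaviour unchanged, so observed behaviour alone does not pin down a unique reward.

```python
def value_iteration(transitions, rewards, gamma=0.9, sweeps=200):
    """Return Q-values for a deterministic MDP.

    transitions[s][a] is the next state reached from state s with action a;
    rewards[s] is the reward received for entering state s.
    """
    n_states = len(transitions)
    n_actions = len(transitions[0])
    values = [0.0] * n_states
    for _ in range(sweeps):
        # Synchronous Bellman optimality backup over all states.
        values = [
            max(rewards[transitions[s][a]] + gamma * values[transitions[s][a]]
                for a in range(n_actions))
            for s in range(n_states)
        ]
    return [
        [rewards[transitions[s][a]] + gamma * values[transitions[s][a]]
         for a in range(n_actions)]
        for s in range(n_states)
    ]


def optimal_actions(q_values, tol=1e-9):
    """For each state, list every action whose Q-value ties for the maximum."""
    return [[a for a, q in enumerate(row) if max(row) - q <= tol]
            for row in q_values]


# Hypothetical 3-state chain: action 0 moves left, action 1 moves right.
transitions = [[0, 1], [0, 2], [1, 2]]

# A reward that favours the rightmost state.
expert_like = optimal_actions(value_iteration(transitions, [0.0, 0.0, 1.0]))
# The same reward scaled by 5 induces the same optimal actions.
scaled = optimal_actions(value_iteration(transitions, [0.0, 0.0, 5.0]))
# The all-zero reward makes every action optimal in every state.
degenerate = optimal_actions(value_iteration(transitions, [0.0, 0.0, 0.0]))
```

Here `expert_like` and `scaled` both select only the "move right" action in every state, while `degenerate` lists both actions everywhere, which is why the zero reward trivially "explains" any expert.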


IV. APPLICATIONS OF REINFORCEMENT LEARNING
The development of RL is directly related to research in computational game development. Researchers used dynamic programming and RL algorithms to solve complex games (card and board games). In the 21st century, RL is widely used for information retrieval and robotic control, and its application is growing beyond conventional boundaries. Google developed the Go-playing RL agent AlphaGo and the more advanced, self-taught AlphaZero using RL and other hybrid learning approaches. AlphaGo confirmed its superhuman proficiency by defeating the world champion Lee Sedol at the game of Go in 2016. In 2017 Google developed AlphaZero, the most sophisticated board-game AI agent, by integrating reinforcement learning with a general-purpose search algorithm. Charlesworth (2018) used proximal policy optimization, an extension of the policy gradient algorithm, in the game Big 2. Counterfactual regret minimization (CFR) has been used to develop AI agents for imperfect-information games (IIGs) such as poker. The memory required by CFR is linear in the number of information sets rather than the number of states. The application of RL algorithms to Atari 2600 games started the era of RL applications in the interactive digital game domain.
Reinforcement learning approaches are now used for computer resource allocation and for scheduling waiting tasks. Deep reinforcement learning techniques are used for resource allocation in large-scale online multi-resource cluster systems. Traffic signal control is another application of RL. To address traffic congestion, delay is minimized by maximizing the reward value using an approximately optimal controller. The developed model is highly scalable and robust, and it is appropriate for large, complicated intersections. Reinforcement learning has an inseparable history with robotics. RL methods are widely used to improve the behaviour learning capability of robots in dynamic environments. Carlucho et al. (2017) proposed an efficient incremental Q-learning strategy for multi-agent mobile robots. They improved the learning efficiency by defining a time memory in the learning process. The scope of applications of RL for drones and UAVs is wide open and demands further research. UAVs accomplish various critical tasks using autonomous coordination and flight strategies. The Q-learning method is used to develop efficient autopilot and flocking methods. Creating a stabilized autopilot system is an interesting challenge. A neural central model for RL was proposed by Zeng in 2017 to create an efficient flight system. The model is used to send important values into working memory, and it displayed better biological and cognitive properties.
Reinforcement learning can be used to enhance the transmission control protocol (TCP) in wired and wireless heterogeneous networks. An RL-based congestion control TCP was developed and used instead of the existing model, and it showed high transmission efficiency. Hawley and Bharadwaj (2018) proposed an RL-based model to detect and mitigate loss-of-separation events in the aviation industry. The proposed model detected airspace anomalies using the TCAS range tau metric, and the authors proposed a safety-event dataset created using self-separation criteria. The application of RL models to biological data deserves special attention. A set of different RL approaches has been effectively implemented for various biological data and disease diagnosis methods. Data from medical imaging devices, genetic data and bioimaging data are being classified and analysed using RL models. These days RL models and algorithms are widely used in personalized recommender systems, natural language processing, web and network configuration, computer vision, healthcare management and disease diagnosis, and many more.
The application of RL models in various domains is extensive, and expensive too. The practical real-world applications of RL have many shortcomings and vulnerabilities. The challenges and weaknesses of the various RL models are discussed in the next section of the paper.

V. LIMITATIONS & FUTURE RESEARCH OPPORTUNITIES

A. Limitations and Challenges
Reinforcement learning is used to solve complex tasks across various domains. Its learning mechanism, which resembles human cognitive abilities, makes it an integral part of AI. Even so, RL still has many shortcomings and limitations. The ability of RL agents to interact with their environment is far from perfect.
Existing exploration and exploitation strategies need improvement. Exploitation is the process of choosing the best-known policy and action to maximize the global reward, but the best-known solution is not necessarily the best solution in the model. The RL model follows an exploration strategy to extract more knowledge about the environment and to look for better policies that achieve optimal decisions. In continuous high-dimensional environments, RL algorithms still lack an efficient method of exploration. The most commonly used epsilon-greedy exploration policy treats all actions with the same priority. This uniform, unguided behaviour is inefficient at identifying promising actions. In on-policy algorithms, the reduced randomness caused by over-exploitation can trap the policy in a local optimum. This compromises the global objective and leads to the failure of the model. Most importantly, there are no standardized benchmarks for evaluating the performance of exploration and exploitation strategies.
Lack of real-world training data is another major concern in reinforcement learning models. The cost and time of the training and learning process are very high, especially in the health care and disease diagnosis domain. RL algorithms that are trained from scratch require large and unbiased training data. The lack of economically feasible advanced computational systems is a bottleneck in real-world applications of reinforcement learning models. Inverse reinforcement learning (IRL) was introduced for autonomous reward function declaration within the RL model. However, the existing IRL model is able to define the reward function only using human-supported assumptions. The real-world applications of IRL are limited and need more serious research. In real-world applications, the IRL agent has only limited access to the actual environment states. To solve this problem, the IRL model should be integrated with a partially observable Markov decision process (POMDP).
Applications of reinforcement learning in robotics still face severe challenges. The curse of dimensionality and the curse of the real world are the two general terms used to express the weakness of reinforcement learning in robotics. The curse of dimensionality arises when the size of the solution space of a problem grows exponentially as additional features of the states are explored. Enabling an RL model in a robot to function in a variety of real-world environments is a complex obstacle to tackle. Generalization is hard for RL models to achieve. The OpenSpiel framework released by Google achieved a limited generalization ability, but the framework is limited to some specific domains. Along with robust and scalable reinforcement learning algorithms, we need validated evaluation metrics and standardized experimental models to tackle these problems.

B. Future Scope and Opportunities

The future research opportunities in reinforcement learning are bright and wide open. Partial perception problems in non-Markov environments need detailed research and study. The existing learning algorithms for partial perception problems are based on the partially observable MDP, which lacks efficiency. Model-based reinforcement learning holds great future research opportunities. The success of AlphaZero proved the efficiency of model-based methods. RL agents still cannot infer generalized knowledge from different domain-specific tasks and use it in a new environment. RL algorithms still fall behind humans in terms of novel-skill development ability. Domain adaptation-based transfer learning could be used to learn new experience from a specific task and use it in others.
The application of RL is important from both a research and an application perspective. The coordination of multiple agents in a heterogeneous environment and the creation of a stable model is a challenging research task. Further research on the IRL model could yield an efficient method of reward function declaration without any human assumptions. Hybrid reinforcement learning methods, convolutional neural networks for computer vision and hierarchical reinforcement learning for the curse of dimensionality offer interesting research opportunities for future researchers.
VI. CONCLUSION
The last decade witnessed research and application-level success in reinforcement learning. The concept of reinforcement learning, which is inspired by animal behaviour, has now extended to inverse reinforcement learning. Having started with single-agent RL models, reinforcement learning is now expanding to multi-agent systems and swarm intelligence. This paper analysed various on-policy and off-policy algorithms, namely Q-learning, SARSA, Actor-Critic, DDPG, Deep Q-learning and inverse reinforcement learning, based on various classification and comparison features.
The quantitative performance comparison of various reinforcement learning algorithms is limited to specific research and experimental environments. We need validated evaluation metrics and generalized experimental models to provide an accurate cross-performance comparison of different algorithms. The curse of dimensionality and the curse of the real world are two major existing problems in reinforcement learning. Over the years, improved RL algorithms have met some of these objectives, although challenges remain, especially in real-world dynamic environments. RL algorithms still fall behind humans in terms of novel-skill development ability. In the author's view, the concept of inverse reinforcement learning could overcome this limitation in the near future.
VII. REFERENCES
Arel, I., Liu, C., Urbanik, T., Kohls, A.G. (2010) Reinforcement learning-based multi-agent system for network traffic signal control, IET Intelligent Transport Systems, 4(2), 128, available: https://digital-library.theiet.org/content/journals/10.1049/iet-its.2009.0070 [accessed 20 Oct 2019].
Bellman, R. (1957) Dynamic Programming, Princeton University Press: Princeton, NJ, USA.
Busoniu, L., Babuška, R., De Schutter, B. (2006) Multi-Agent Reinforcement Learning: A Survey, in 2006 9th International Conference on Control, Automation, Robotics and Vision, Presented at the 2006 9th International Conference on Control, Automation, Robotics and Vision, IEEE: Singapore, 1-6, available: http://ieeexplore.ieee.org/document/4150194/ [accessed 20 Oct 2019].
Carlucho, I., De Paula, M., Villar, S.A., Acosta, G.G. (2017) Incremental Q-learning strategy for adaptive PID control of mobile robots, Expert Systems with Applications, 80, 183-199, available: http://www.sciencedirect.com/science/article/pii/S0957417417301513 [accessed 20 Oct 2019].
Charlesworth, H. (2018) Application of Self-Play Reinforcement Learning to a Four-Player Game of Imperfect Information, arXiv:1808.10442 [cs, stat], available: http://arxiv.org/abs/1808.10442 [accessed 19 Oct 2019].
Choi, J., Kim, K.-E. (2009) Inverse Reinforcement Learning in Partially Observable Environments, in Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI'09, Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1028-1033, [accessed 20 Oct 2019].
Dayan, P., Niv, Y. (2008) Reinforcement learning: The Good, The Bad and The Ugly, Current Opinion in Neurobiology, 18, 185-96.
Hawley, M., Bharadwaj, R. (2018) Application of reinforcement learning to detect and mitigate airspace loss of separation events, in 2018 Integrated Communications, Navigation, Surveillance Conference (ICNS), Presented at the 2018 Integrated Communications, Navigation, Surveillance Conference (ICNS), IEEE: Herndon, VA, 4G1-1-4G1-10, available: https://ieeexplore.ieee.org/document/8384897/ [accessed 20 Oct 2019].
Hou, J., Li, H., Hu, J., Zhao, C., Guo, Y., Li, S., Pan, Q. (2017) A review of the applications and hotspots of reinforcement learning, in 2017 IEEE International Conference on Unmanned Systems (ICUS), Presented at the 2017 IEEE International Conference on Unmanned Systems (ICUS), 506-511.
Huys, Q.J.M., Cruickshank, A., Seriès, P. (2014) Reward-Based Learning, Model-Based and Model-Free, in Jaeger, D. and Jung, R., eds., Encyclopedia of Computational Neuroscience, Springer New York: New York, NY, 1-10, available: http://link.springer.com/10.1007/978-1-4614-7320-6_674-1 [accessed 19 Oct 2019].
Jansi Rani, S.V., Milton, R.S., Yamini, L., Shivaani, K. (2019) Reinforcement Learning Approach to Improve Transmission Control Protocol, in 2019 International Conference on Computational Intelligence in Data Science (ICCIDS), Presented at the 2019 International Conference on Computational Intelligence in Data Science (ICCIDS), IEEE: Chennai, India, 1-5, available: https://ieeexplore.ieee.org/document/8862007/ [accessed 20 Oct 2019].
Kalakrishnan, M., Theodorou, E., Schaal, S. (2010) Inverse Reinforcement Learning with PI².
Kober, J., Peters, J. (2014) Reinforcement Learning in Robotics: A Survey, in Kober, J. and Peters, J., eds., Learning Motor Skills: From Algorithms to Robot Experiments, Springer Tracts in Advanced Robotics, Springer International Publishing: Cham, 9-67, available: https://doi.org/10.1007/978-3-319-03194-1_2 [accessed 18 Oct 2019].
Lanctot, M., Lockhart, E., Lespiau, J.-B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., Hennes, D., Morrill, D., Muller, P., Ewalds, T., Faulkner, R., Kramár, J., De Vylder, B., Saeta, B., Bradbury, J., Ding, D., Borgeaud, S., Lai, M., Schrittwieser, J., Anthony, T., Hughes, E., Danihelka, I., Ryan-Davis, J. (2019) OpenSpiel: A Framework for Reinforcement Learning in Games, arXiv:1908.09453 [cs], available: http://arxiv.org/abs/1908.09453 [accessed 20 Oct 2019].
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D. (2015) Continuous control with deep reinforcement learning, arXiv:1509.02971 [cs, stat], available: http://arxiv.org/abs/1509.02971 [accessed 19 Oct 2019].
Liu, C., Xu, X., Hu, D. (2015) Multiobjective Reinforcement Learning: A Comprehensive Overview, IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3), 385-398.
Mahmud, M., Kaiser, M.S., Hussain, A., Vassanelli, S. (2018) Applications of Deep Learning and Reinforcement Learning to Biological Data, IEEE Transactions on Neural Networks and Learning Systems, 29(6), 2063-2079, available: https://ieeexplore.ieee.org/document/8277160/ [accessed 20 Oct 2019].
Mao, H., Alizadeh, M., Menache, I., Kandula, S. (2016) Resource Management with Deep Reinforcement Learning, in Proceedings of the 15th ACM Workshop on Hot Topics in Networks – HotNets '16, Presented at the 15th ACM Workshop, ACM Press: Atlanta, GA, USA, 50-56, available: http://dl.acm.org/citation.cfm?doid=3005745.3005750 [accessed 19 Oct 2019].
Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., Bowling, M. (2017) DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker, Science, 356(6337), 508-513, available: http://arxiv.org/abs/1701.01724 [accessed 19 Oct 2019].
Ng, A.Y., Russell, S. (2000) Algorithms for Inverse Reinforcement Learning, in Proc. 17th International Conf. on Machine Learning, Morgan Kaufmann, 663-670.
Nguyen, H., La, H. (2019) Review of Deep Reinforcement Learning for Robot Manipulation, in 2019 Third IEEE International Conference on Robotic Computing (IRC), Presented at the 2019 Third IEEE International Conference on Robotic Computing (IRC), IEEE: Naples, Italy, 590-595, available: https://ieeexplore.ieee.org/document/8675643/ [accessed 19 Oct 2019].
Oh, J., Guo, X., Lee, H., Lewis, R., Singh, S. (2015) Action-Conditional Video Prediction using Deep Networks in Atari Games, arXiv:1507.08750 [cs], available: http://arxiv.org/abs/1507.08750 [accessed 20 Oct 2019].
Powell, W.B. (2012) AI, OR and Control Theory: A
Rosetta