 Open Access
 Authors : Emmanuel G. S , P. Sujatha , P. Bharath
 Paper ID : IJERTV10IS080171
 Volume & Issue : Volume 10, Issue 08 (August 2021)
Published (First Online): 28-08-2021
ISSN (Online): 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Profit-based Units Scheduling of a GENCO in Pool Market using Deep Reinforcement Learning
Emmanuel G. S

M.Tech Student: Department of Electrical and Electronic Engineering, JNTUACEA
Andhra Pradesh, India
Prof. P. Sujatha
Professor: Department of Electrical and Electronic Engineering, JNTUACEA
Andhra Pradesh, India
Dr. P. Bharath
Assistant Professor:
Department of Electrical and Electronic Engineering, JNTUACEA
Andhra Pradesh, India
Abstract: The primary objective of power generation unit scheduling for a GENCO operating in a restructured power system is to maximize accumulated profit over the entire period of operation. When operating in the pool market, the GENCO's demand is the energy allocated by the spot market. Hence, prior to unit scheduling, the GENCO has to forecast the market clearing price and the spot market allocation for each hour of the day. Using these two market signals, the company can optimally schedule its generation to maximize profit. This paper explores the capability of Deep Reinforcement Learning (DRL), established using the Deep Deterministic Policy Gradient (DDPG) algorithm, to optimally schedule generating units so as to boost a GENCO's financial benefit in a deregulated electricity market environment. Simulations were carried out for a GENCO with six generating units, each with a different operation cost curve and generating capacity; the results reveal that the proposed method can be applied to solve the profit-based generation unit scheduling (PBUS) problem.
Keywords: DDPG, DRL, Market Clearing Price (MCP), Profit based unit scheduling (PBUS), Power system deregulation.
Abbreviations
C_i(P_{i,t})   cost of generating P_{i,t} amount of power at hour t by generating unit i
i              index of generating unit
N              total number of generating units
T              total number of scheduling hours
PF_t           total profit at hour t
PN_t           penalty at hour t
MCP_t          predicted market clearing price at hour t
P_{i,t}        power generated by unit i at hour t
PL_t           predicted market allocation at hour t
RV             revenue of the generating units
TC             total operation cost
r_t            overall reward at hour t
SU_{i,t}       start-up cost of unit i at time t
P_{i,max}      generator i maximum generated power
P_{i,min}      generator i minimum generated power

INTRODUCTION
The deregulation process in the energy sector is one of the most important transitions for the modern electricity industry. This transition enhances competition in the electricity market, so power prices are likely to descend, which favours electric power consumers [1],[2],[3]. With this in mind, there is a need to optimally schedule the generation units in a manner that generates more profit [4], because this type of market is based on competition, which affects the electricity price. In contrast to the vertically integrated power system, where utilities had an obligation to meet demand and reserve, in a deregulated power system the main objective of a GENCO is to maximize its profit [5],[6],[7]. That is, the GENCO has to schedule its generation in the pattern that maximizes total profit. On the other hand, the responsibility of the Independent System Operator (ISO) is to satisfy the system power demand in order to balance generation and load. The ISO neither owns nor operates any generating unit, but receives bids from different GENCOs and divides the energy demand among them on a cheapest-first basis [8].
The UC problem has been solved by several methods, each with its advantages and disadvantages: the priority list method [6],[9], dynamic programming [10],[11], Lagrangian relaxation, the genetic algorithm [12], grey wolf optimization [13], particle swarm optimization [14], the tabu search method, the fuzzy logic algorithm [15] and the evolutionary algorithm [16]. However, some criticisms have been directed at these methods: they are iterative and require an initialization step, which can trap the search process in a local optimal solution, and they may fail to handle the dynamic case subject to the above limitations. The market clearing price and load forecasts play an important part in strategizing optimal bidding in a day-ahead market [5],[6].
The reinforcement learning technique has been used to solve complex, high-dimensional problems in control systems [19], the delivery route problem [20] and robotics [21]. The aim of this study is to introduce deep reinforcement learning to the optimal scheduling of generating units in a deregulated power system environment in order to maximize the GENCO's profit. By analysing the predicted operation data (the market clearing prices and market allocations), a data-driven profit-based unit scheduling (PBUS) model is established. By means of the DDPG algorithm, the established model is trained to maximize the GENCO's profit; finally, the model is tested to show the effectiveness and accuracy of the proposed method.

PROBLEM FORMULATION
The objective of the PBUS problem is to formulate a scheduling pattern that maximizes the expected profit over the entire operation period. Therefore, the objective function is expressed as the difference between revenue generated and cost incurred [22],[23]. The optimization problem for PBUS can be formulated mathematically by the following equations:
Objective function
max PF = RV − TC   (1)

where

RV = Σ_{t=1}^{T} Σ_{i=1}^{N} (MCP_t × P_{i,t})   (2)

TC = Σ_{t=1}^{T} Σ_{i=1}^{N} (a_i + b_i P_{i,t} + c_i P_{i,t}² + SU_{i,t})   (3)

Constraints:

Σ_{i=1}^{N} P_{i,t} ≤ PL_t   (4)

P_{i,min} ≤ P_{i,t} ≤ P_{i,max}   (5)
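As a minimal illustration of equations (1)-(3) for a single hour, the following Python sketch computes revenue, cost and profit from the quadratic cost curves. This is not the paper's code; the function name and the example unit data and price are hypothetical.

```python
# Illustrative sketch (assumed names, hypothetical data): hourly profit of a
# schedule under the quadratic cost curve a + b*P + c*P^2 of Eq. (3).

def hourly_profit(mcp, powers, coeffs, startup=None):
    """mcp: market clearing price [INR/MWh]; powers: P_i per unit [MW];
    coeffs: (a, b, c) per unit; startup: optional start-up costs SU_i."""
    startup = startup or [0.0] * len(powers)
    revenue = sum(mcp * p for p in powers)                # Eq. (2), one hour
    cost = sum(a + b * p + c * p * p + su                 # Eq. (3), one hour
               for p, (a, b, c), su in zip(powers, coeffs, startup))
    return revenue - cost                                 # Eq. (1), one hour

# Hypothetical example: two units dispatched at a clearing price of 30 INR/MWh.
profit = hourly_profit(30.0, [100.0, 50.0],
                       [(128.32, 16.68, 0.0212), (13.56, 25.78, 0.0218)])
```

Summing this quantity over all T hours, subject to constraints (4) and (5), gives the objective of equation (1).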

THE PROPOSED METHOD

Reinforcement Learning
Reinforcement learning is a class of machine learning, based on trial-and-error, that is concerned with sequential decision making [24]. An RL agent exists in an environment, within which it can act, observe its state and receive rewards. These two discrete steps, action and observation, are repeated indefinitely, with the agent's goal being to make decisions that maximize its long-term reward.

Deep reinforcement learning
DRL utilizes deep neural networks as function approximators, which are especially valuable in reinforcement learning when the observation and/or action dimensions are too high to be enumerated completely [25],[26]. In deep reinforcement learning, a deep neural network is used to implement either a value function or a policy function, i.e. the networks can learn to estimate values for given states, or to map sets of observations to actions. Instead of the Q-table technique, which would be very expensive, one can train a neural network on a dataset of states and actions to estimate how significant they are relative to the target of the reinforcement learning task [27].
Like every neural network, coefficients are used to approximate the function relating inputs to outputs, and learning consists of finding the correct coefficients, or weights, by iteratively adjusting them along gradients that reduce the error. In reinforcement learning, convolutional networks can be used to perceive an agent's state when the input consists of visual images. Figure 1 shows the architecture used to design both the actor and target actor networks. The network model consists of a feature input layer, four fully connected layers (all feed-forward neural networks) and four activation functions.
Figure 1. Actor and target actor structure
Figure 2 shows the architecture of the critic network of the DRL used; the same applies to the target critic network. The critic network receives both the observations and the actions from the actor and outputs the Q values.
Figure 2. The critic and target critic structure

Designing of the state space, action space and reward function
The reward is defined as profit plus penalties; the agent is penalized when the sum of power generated exceeds the predicted market allocation (PL_t).
The first reward component is the profit at each time step, calculated as:

PF_t = Σ_{i=1}^{N} (MCP_t × P_{i,t} − C_i(P_{i,t}))   (6)
Whenever the agent violates the defined constraints, it is penalized as per equation (7):

PN_t = −(Σ_{i=1}^{N} P_{i,t} − PL_t)   if Σ_{i=1}^{N} P_{i,t} > PL_t
PN_t = 0   otherwise   (7)
The net reward at each time step is defined as the sum of penalty and profit as per equation (8):

r_t = PF_t + PN_t   (8)
Defining the agent's states
States = {s_1, s_2, …, s_T}, where

s_t = (MCP_t, PL_t)   (9)
Defining the agent's actions
Actions = {a_1, a_2, …, a_N}, where a_i is the power distribution coefficient commanded to generating unit i.
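The reward logic of equations (6)-(8) can be sketched as follows. The function name and the penalty weight k are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch (assumed names): reward of Eq. (8) as profit plus a
# penalty that fires when total generation exceeds the predicted market
# allocation PL_t, as in Eq. (7). k is an assumed penalty weight.

def step_reward(profit, total_power, allocation, k=1.0):
    """Return PF_t + PN_t for one time step."""
    excess = total_power - allocation
    penalty = -k * excess if excess > 0 else 0.0   # PN_t of Eq. (7)
    return profit + penalty                        # r_t of Eq. (8)

# Hypothetical example: the allocation is violated by 200 MW, so the
# 500-unit profit is reduced by the penalty.
r = step_reward(profit=500.0, total_power=1200.0, allocation=1000.0, k=2.0)
```

When the constraint holds, the penalty term is zero and the reward reduces to the profit alone.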

The DDPG Algorithm
The algorithm uses a total of four neural networks. The first network is called the actor, μ(s | θ^μ), where θ^μ denotes the network parameters. The actor part of the DDPG agent is classified as a policy search method.
The second network is called the critic, Q(s, a | θ^Q), where θ^Q denotes the network parameters. The critic part of the DDPG agent is classified as a value function method [28],[29].
DDPG uses the target network idea to implement two further neural networks, one for each of the actor and critic networks. The parameters of the actor and critic target networks are denoted θ^{μ′} and θ^{Q′} respectively. DDPG also makes use of DQN's experience replay buffer to store experience, which is randomly sampled during training [30].
The loss function for the critic network is similar to the DQN loss function, except that actions are selected by the actor network [31]. Using the standard Q-learning update and the mean squared error, the critic loss function is expressed as:
L(θ^Q) = E_{(s_t, a_t, r_t, s_{t+1}) ~ R} [ (r_t + γ Q′(s_{t+1}, μ′(s_{t+1} | θ^{μ′}) | θ^{Q′}) − Q(s_t, a_t | θ^Q))² ]   (10)
The actor network is updated using the deterministic policy gradient theorem [31]. The gradient update is given by:
∇_{θ^μ} J = E [ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=μ(s_t)} ∇_{θ^μ} μ(s | θ^μ) |_{s=s_t} ]   (11)
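A minimal sketch of the regression target inside the critic loss of equation (10), with the target actor and target critic stubbed out as placeholder functions. This is an illustration only, not the trained networks.

```python
# Illustrative sketch (toy stand-ins): the Bellman target y = r + gamma *
# Q'(s', mu'(s')) that the critic regresses toward in Eq. (10).
gamma = 0.99                                  # discount factor (Table 3)

def td_target(r, s_next, target_actor, target_critic):
    """Compute the critic's regression target for one transition."""
    a_next = target_actor(s_next)             # action from target actor mu'
    return r + gamma * target_critic(s_next, a_next)

# Stub target networks for demonstration only (constant outputs).
y = td_target(r=1.0, s_next=(0.5, 0.5),
              target_actor=lambda s: 0.2,
              target_critic=lambda s, a: 3.0)
```

The critic loss is then the mean squared difference between y and Q(s_t, a_t | θ^Q) over a sampled mini-batch.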
Table 1. Actor and critic parameter settings
| Parameter | Feature input layer | Fully connected layer 1 | Fully connected layer 2 | Fully connected layers 3 and 5 | Fully connected layer 4 |
|---|---|---|---|---|---|
| Input size | 6 | 6 | 100 | 100 | 100 |
| Output size | 6 | 100 | 100 | 100 | 6 (actor) / 1 (critic) |
| Number of hidden layers | n/a | 32 | 64 | 64 | 32 |
| Weight learning rate factor | n/a | 1 | 1 | 1 | 1 |
| Regularization factor for weights | n/a | 1 | 1 | 1 | 1 |
| Bias learning rate factor | n/a | 1 | 1 | 1 | 1 |
| Regularization factor for biases | n/a | 0 | 0 | 0 | 0 |
| Weight initializer | n/a | Glorot | Glorot | Glorot | Glorot |
| Bias initializer | n/a | Zeros | Zeros | Zeros | Zeros |
| Activation function | n/a | ReLU | ReLU | ReLU | tanh |
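Assuming the layer sizes of Table 1, the actor's forward pass can be sketched in Python with randomly initialized weights. The paper implements and trains the networks in MATLAB; this NumPy version is an illustration only, with the feature input layer treated as an identity mapping.

```python
# Illustrative sketch of the actor forward pass implied by Figure 1 and
# Table 1: 6 inputs, three ReLU fully connected layers of width 100, and a
# tanh output layer with 6 outputs (one action per unit). Weights here are
# random, not trained.
import numpy as np

rng = np.random.default_rng(0)
sizes = [6, 100, 100, 100, 6]                    # layer widths from Table 1
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def actor(obs):
    """Map a 6-dimensional observation to 6 bounded actions."""
    x = np.asarray(obs, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ W + b)           # ReLU hidden layers
    return np.tanh(x @ weights[-1] + biases[-1]) # actions bounded in [-1, 1]

a = actor(np.ones(6))
```

The tanh output keeps each action in [-1, 1], which suits bounded power distribution coefficients; the critic differs only in also receiving the action as input and producing a single Q value.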
Equations (10) and (11) are used with gradient descent and the backpropagation algorithm to update the actor and critic network weights during training. The algorithm is summarized in the flow chart of figure 3.

Randomly initialize critic Q(s, a | θ^Q) and actor μ(s | θ^μ) with weights θ^Q and θ^μ
Initialize target networks Q′ and μ′ with weights θ^{Q′} ← θ^Q, θ^{μ′} ← θ^μ
Initialize replay buffer R
for episode = 1 to M do
    Initialize a random process N for action exploration
    Receive initial observation state s_1
    for t = 1 to T do
        Select action a_t = μ(s_t | θ^μ) + N_t with exploration noise
        Execute action a_t; observe reward r_t and new state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in R
        Sample a random mini-batch of n transitions (s_i, a_i, r_i, s_{i+1}) from R
        Set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
        Update the critic by minimizing the loss: L = (1/n) Σ_i (y_i − Q(s_i, a_i | θ^Q))²
        Update the actor policy using the sampled policy gradient:
            ∇_{θ^μ} J ≈ (1/n) Σ_i ∇_a Q(s, a | θ^Q) |_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ) |_{s=s_i}
        Update the target networks:
            θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′}
            θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′}
    end for
end for

Figure 3. The DDPG Algorithm
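Two mechanics of the flow chart, uniform sampling from the replay buffer R and the soft (Polyak) update of the target weights, can be sketched as follows. These are minimal illustrative forms, not the paper's MATLAB implementation; the parameter values are taken from Table 3.

```python
# Illustrative sketch: replay buffer and soft target update used by DDPG.
import random
from collections import deque

buffer = deque(maxlen=1_000_000)             # experience buffer length, Table 3

def store(s, a, r, s_next):
    """Store one transition (s_t, a_t, r_t, s_{t+1}) in the buffer."""
    buffer.append((s, a, r, s_next))

def sample(batch_size=32):                   # mini-batch size, Table 3
    """Uniformly sample a mini-batch of transitions."""
    return random.sample(list(buffer), batch_size)

def soft_update(target, source, tau=0.001):  # target smooth factor, Table 3
    """Polyak update: theta' <- tau*theta + (1 - tau)*theta'."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(source, target)]

# Toy weight vectors: the target moves a small step toward the source.
new_target = soft_update(target=[0.0, 0.0], source=[1.0, 1.0])
```

The small smoothing factor τ makes the target networks change slowly, which stabilizes the moving regression target of equation (10).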

DDPG parameter setting
The algorithm needs to give an action command for each generating unit that satisfies all the constraints and meets the objective. The model is initialized with randomly selected power distribution coefficients for generating units one to six; table 2 below shows the initial operation parameters and table 3 shows the DDPG algorithm parameter settings.
Table 2. Generating units initial operation values
| Generating unit | Parameter | Value |
|---|---|---|
| 1 | Power distribution coefficient | 0.8 |
| 2 | Power distribution coefficient | 0.6 |
| 3 | Power distribution coefficient | 0.7 |
| 4 | Power distribution coefficient | 0.8 |
| 5 | Power distribution coefficient | 0.4 |
| 6 | Power distribution coefficient | 0.6 |
Table 3. DDPG algorithm parameter setting

| Parameter | Value |
|---|---|
| Target smooth factor | 0.001 |
| Experience buffer length | 1000000 |
| Discount factor | 0.99 |
| Mini-batch size | 32 |
| Actor learning rate | 0.0001 |
| Critic learning rate | 0.001 |


EXAMPLE ANALYSIS

Simulation Environment
The GENCO mathematical model was developed in the Simulink environment, consisting of an RL Agent block, a reward calculation subsystem and an observation subsystem. The deep reinforcement learning agent based on the deep deterministic policy gradient algorithm was created with the architectures explained in figure 1 and figure 2 for the actor and critic respectively. The implementation was done with the help of the Deep Network Designer app of MATLAB R2020b. The setting parameters for the neural network architecture are tabulated in table 1.
Data used for training and testing the model
During training, a random time-series data generator was formulated; this ensures good generalization of the final result and also serves the purpose of a large dataset. The standard IEEE 118-bus system data from [32] were used to test the trained RL agent. A single GENCO having six of the 54 thermal generating units in the IEEE 118-bus test system was considered, and the generating unit data are given in table 4.
The input data to the model were the 24-hour (day-ahead) time series of the predicted market clearing price and the predicted spot market allocation for the GENCO, plotted in figure 5 and figure 4 respectively. The GENCO had six generating units, each with different operation characteristics, shown in table 4.
Figure 4. Predicted spot market allocation for the GENCO
Figure 5. Predicted Market Clearing Price
Table 4. Generating units data

| Unit Code | Pmin [MW] | Pmax [MW] | Capacity [MW] | a [INR/h] (×73.12) | b [INR/MWh] (×73.12) | c [INR/MW²h] (×73.12) | MUT [hrs] | MDT [hrs] | RU [MW] | RD [MW] | HSC [INR/h] (×73.12) | CSC [INR/h] (×73.12) | CShr [hrs] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| g1 | 100 | 420 | 840 | 128.32 | 16.68 | 0.0212 | 10 | 10 | 210 | 210 | 250 | 500 | 20 |
| g2 | 100 | 300 | 2400 | 13.56 | 25.78 | 0.0218 | 8 | 8 | 150 | 150 | 110 | 110 | 16 |
| g3 | 50 | 250 | 500 | 56.00 | 24.66 | 0.0048 | 8 | 8 | 125 | 125 | 100 | 200 | 16 |
| g4 | 50 | 200 | 200 | 13.56 | 25.78 | 0.0218 | 8 | 8 | 100 | 100 | 400 | 800 | 16 |
| g5 | 25 | 100 | 300 | 20.30 | 35.64 | 0.0256 | 5 | 5 | 50 | 50 | 50 | 100 | 10 |
| g6 | 25 | 50 | 100 | 117.62 | 45.88 | 0.0195 | 2 | 2 | 25 | 25 | 45 | 90 | 4 |
| Total | | | 4340 | | | | | | | | | | |

Results and Discussion
RL Agent training results
According to the algorithm, the model was trained for 150 episodes, each with 2400 steps. Each step returned a reward value, which was summed to obtain an overall episode reward. Figure 6 shows a plot of episode reward against episode number. The training target was an average reward of at least 6300 for better results.
Testing the trained model
Figure 6. DDPG Average episode reward

Upon successful training of the agent, the trained agent was applied in offline simulation to verify its capability. In this scenario, the market clearing price and the allocated energy act as inputs to the agent, and the agent outputs the optimal schedule.
To justify the results, the trained agent was run in three cases:
Case 1: the model was trained to meet the expected/predicted spot market allocation.
Case 2: all the generating units were fixed to generate their maximum capacity, while generating unit two was optimized to minimize the operation cost.
Case 3: the model was trained to find the optimal bidding without fixing any of the generating units.
Table 6 shows the amount of power assigned to each generating unit for each hour under case 1, case 2 and case 3.
Table 5 summarizes the results obtained under the three cases. Case 1 had the highest operating cost compared with cases 2 and 3 because a large amount of energy was being generated, so some units were operating at a loss. The most optimal solution was case 3, in which the operating cost was reduced and more power was generated at the instants when the clearing price was higher, thus earning more revenue. Case 3 made 1.5 times the profit of case 1 and 1.09 times that of case 2. The profit generated increases with increased energy price, provided that the generator's operation cost has not reached its optimal value of operation. This was shown by unit g2: compared with the other units, its generation followed the MCP profile while the others were constant, i.e. higher generation was achieved at higher market clearing prices provided that the total generated power did not exceed the spot market allocation. This can be verified by considering figure 4, figure 5 and table 6. Between hours 7 and 19, the market clearing price was high, and this made the generated power increase (as shown in table 6) in a fashion similar to the market clearing price profile.
Table 5. Total generated profit

| Case No. | Operation cost (×10⁸) [INR] | Total revenue (×10⁸) [INR] | Profit (×10⁸) [INR] | Energy not supplied [MWh] |
|---|---|---|---|---|
| 1 | 2.792 | 3.580 | 0.788 | 0.00 |
| 2 | 1.588 | 2.677 | 1.089 | 19010.00 |
| 3 | 1.783 | 2.968 | 1.185 | 12820.00 |
Table 6. Amount of power generated by each generating unit at each time period of 24 hours (each unit's three columns give Case 1 / Case 2 / Case 3; values are multiplied by the unit factor: g1 ×840 MW, g2 ×2400 MW, g3 ×500 MW, g4 ×200 MW, g5 ×300 MW, g6 ×100 MW)

| Time [hrs] | g1 C1 | g1 C2 | g1 C3 | g2 C1 | g2 C2 | g2 C3 | g3 C1 | g3 C2 | g3 C3 | g4 C1 | g4 C2 | g4 C3 | g5 C1 | g5 C2 | g5 C3 | g6 C1 | g6 C2 | g6 C3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0.91 | 0.27 | 0.22 | 0.22 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.98 | 1 | 1 | 1 |
| 2 | 1 | 1 | 0.93 | 0.31 | 0.23 | 0.24 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.97 | 1 | 1 | 1 |
| 3 | 1 | 1 | 0.96 | 0.35 | 0.24 | 0.27 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.98 | 1 | 1 | 1 |
| 4 | 1 | 1 | 0.93 | 0.31 | 0.23 | 0.24 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.99 | 1 | 1 | 1 |
| 5 | 1 | 1 | 0.97 | 0.42 | 0.25 | 0.28 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.99 | 1 | 1 | 1 |
| 6 | 1 | 1 | 0.98 | 0.56 | 0.27 | 0.31 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.99 | 1 | 1 | 1 |
| 7 | 1 | 1 | 0.99 | 0.73 | 0.30 | 0.34 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 8 | 1 | 1 | 1 | 0.94 | 0.42 | 0.44 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 | 0.92 | 0.48 | 0.50 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 10 | 1 | 1 | 1 | 0.89 | 0.52 | 0.54 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 11 | 1 | 1 | 1 | 0.89 | 0.43 | 0.45 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 12 | 1 | 1 | 1 | 0.89 | 0.40 | 0.43 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 13 | 1 | 1 | 1 | 0.89 | 0.42 | 0.45 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 14 | 1 | 1 | 1 | 0.89 | 0.45 | 0.47 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 15 | 1 | 1 | 1 | 0.89 | 0.43 | 0.45 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 16 | 1 | 1 | 1 | 0.85 | 0.39 | 0.41 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 17 | 1 | 1 | 0.99 | 0.87 | 0.30 | 0.34 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.99 | 1 | 1 | 1 |
| 18 | 1 | 1 | 0.99 | 0.89 | 0.31 | 0.35 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.99 | 1 | 1 | 1 |
| 19 | 1 | 1 | 0.99 | 0.81 | 0.35 | 0.38 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.98 | 1 | 1 | 1 |
| 20 | 1 | 1 | 0.99 | 0.69 | 0.32 | 0.36 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.99 | 1 | 1 | 1 |
| 21 | 1 | 1 | 0.99 | 0.52 | 0.30 | 0.34 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.99 | 1 | 1 | 1 |
| 22 | 1 | 1 | 0.98 | 0.46 | 0.26 | 0.30 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.98 | 1 | 1 | 1 |
| 23 | 1 | 1 | 0.95 | 0.39 | 0.24 | 0.25 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.98 | 1 | 1 | 1 |
| 24 | 1 | 1 | 0.89 | 0.33 | 0.21 | 0.21 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.98 | 1 | 1 | 1 |


CONCLUSION

In this paper, deep reinforcement learning was used to find the optimal scheduling that solves the profit-based generation unit scheduling problem of a GENCO operating in a deregulated electricity market. The Deep Deterministic Policy Gradient algorithm is used to train the agent. The GENCO is assumed to operate under the pool market without bilateral contracts for power supply between the GENCO and consumers. The important input data are the predicted spot market allocation for the GENCO and the market clearing price 24 hours (day-ahead) in advance. The method can also be applied in very complicated scenarios with a larger number of constraints and many generating units.
REFERENCES

K. V. N. P. Kumar et al., Expanding the Ambit of Ancillary Services in India – Implementation and Challenges, 2018 20th Natl. Power Syst. Conf. NPSC 2018, 2018, doi: 10.1109/NPSC.2018.8771784.

D. K. Mishra, T. K. Panigrahi, A. Mohanty, P. K. Ray, and M. Viswavandya, Design and Analysis of Renewable Energy based Generation Control in a Restructured Power System, Proc. 2018 IEEE Int. Conf. Power Electron. Drives Energy Syst. PEDES 2018, 2018, doi: 10.1109/PEDES.2018.8707753.

Y. R. Prajapati, V. N. Kamat, J. J. Patel, and B. Parekh, Impact of grid connected solar power on load frequency control in restructured power system, 2017 Innov. Power Adv. Comput. Technol. iPACT 2017, vol. 2017Janua, pp. 15, 2017, doi: 10.1109/IPACT.2017.8245122.

J. Li, Z. Li, and Y. Wang, Optimal bidding strategy for day ahead power market, 2015 North Am. Power Symp. NAPS 2015, pp. 16, 2015, doi: 10.1109/NAPS.2015.7335133.

K. Choudhary, R. Kumar, D. Upadhyay, and B. Singh, Optimal Power Flow Based Economic Generation Scheduling in Day ahead Power Market, Int. J. Appl. Power Eng., vol. 6, no. 3, p. 124, 2017, doi: 10.11591/ijape.v6.i3.pp124134.

K. Lakshmi and S. Vasantharathna, A profit based unit commitment problem in deregulated power markets, 2009 Int. Conf. Power Syst. ICPS 09, no. 2, pp. 2530, 2009, doi: 10.1109/ICPWS.2009.5442772.

P. Che and L. Tang, Stochastic unit commitment with CO 2 emission trading in the deregulated market environment, Asia Pacific Power Energy Eng. Conf. APPEEC, no. 70728001, pp. 1013, 2010, doi: 10.1109/APPEEC.2010.5448754.

H. U. Ukwu and R. Sirjani, Unit Commitment in a Power Generation System Using a Modified Improved Dynamic Programming, pp. 103-107, 2016.

A. Bhardwaj, V. K. Kamboj, V. K. Shukla, B. Singh, and P. Khurana, Unit commitment in electrical power system – A literature review, 2012 IEEE Int. Power Eng. Optim. Conf. PEOCO 2012 – Conf. Proc., no. June, pp. 275280, 2012, doi: 10.1109/PEOCO.2012.6230874.

S. V. Tade, V. N. Ghate, S. Q. Mulla, and M. N. Kalgunde, Application of Dynamic Programming Algorithm for Thermal Unit Commitment with Wind Power, Proc. – 2018 IEEE Glob. Conf. Wirel. Comput. Networking, GCWCN 2018, no. 2, pp. 182 186, 2019, doi: 10.1109/GCWCN.2018.8668612.

C. Su, C. Cheng, and P. Wang, An MILP Model for ShortTerm Peak Shaving Operation of Cascaded Hydropower Plants Considering Unit Commitment, Proc. – 2018 IEEE Int. Conf. Environ. Electr. Eng. 2018 IEEE Ind. Commer. Power Syst. Eur. EEEIC/I CPS Eur. 2018, no. 978, pp. 16, 2018, doi: 10.1109/EEEIC.2018.8494460.

V. Arora and S. Chanana, Solution to unit commitment problem using Lagrangian relaxation and Mendels GA method, Int. Conf. Emerg. Trends Electr. Electron. Sustain. Energy Syst. ICETEESES 2016, pp. 126129, 2016, doi: 10.1109/ICETEESES.2016.7581372.

S. Siva Sakthi, R. K. Santhi, N. Murali Krishnan, S. Ganesan, and
S. Subramanian, Wind Integrated Thermal Unit Commitment Solution using Grey Wolf Optimizer, Int. J. Electr. Comput. Eng., vol. 7, no. 5, pp. 23092320, 2017, doi: 10.11591/ijece.v7i5.pp23092320.

Y. Zhang, A Novel hybrid immune particle swarm optimization algorithm for unit commitment considering the environmental cost, Proc. – 2019 Int. Conf. Smart Grid Electr. Autom. ICSGEA 2019, pp. 5458, 2019, doi: 10.1109/ICSGEA.2019.00021.

M. Jabri, H. Aloui, and H. A. Almuzaini, Fuzzy logic lagrangian relaxation selection method for the solution of unit commitment problem, 2019 8th Int. Conf. Model. Simul. Appl. Optim. ICMSAO 2019, pp. 20192022, 2019, doi: 10.1109/ICMSAO.2019.8880285.

B. Hu, Y. Gong, and C. Y. Chung, Flexible Robust Unit Commitment Considering Subhourly Wind Power Ramp Behaviors, 2019 IEEE Can. Conf. Electr. Comput. Eng. CCECE 2019, pp. 3033, 2019, doi: 10.1109/CCECE.2019.8861590.

Anamika and N. Kumar, Market Clearing Price prediction using ANN in Indian Electricity Markets, 2016 Int. Conf. Energy Effic. Technol. Sustain. ICEETS 2016, pp. 454458, 2016, doi: 10.1109/ICEETS.2016.7583797.

L. Ding and Q. Ge, Electricity Market Clearing Price Forecast Based on Adaptive Kalman Filter, ICCAIS 2018 – 7th Int. Conf. Control. Autom. Inf. Sci., no. Iccais, pp. 417421, 2018, doi: 10.1109/ICCAIS.2018.8570534.

H. Iwasaki and A. Okuyama, Development of a reference signal selforganizing control system based on deep reinforcement learning, 2021 IEEE Int. Conf. Mechatronics, ICM 2021, pp. 1 5, 2021, doi: 10.1109/ICM46511.2021.9385676.

E. Xing and B. Cai, Delivery Route Optimization Based on Deep Reinforcement Learning, Proc. – 2020 2nd Int. Conf. Mach. Learn. Big Data Bus. Intell. MLBDBI 2020, pp. 334338, 2020, doi: 10.1109/MLBDBI51377.2020.00071.

Y. Long and H. He, Robot path planning based on deep reinforcement learning, 2020 IEEE Conf. Telecommun. Opt. Comput. Sci. TOCS 2020, pp. 151154, 2020, doi: 10.1109/TOCS50858.2020.9339752.

H. Abdi, Profitbased unit commitment problem: A review of models, methods, challenges, and future directions, Renew. Sustain. Energy Rev., vol. 138, no. October, p. 110504, 2021, doi: 10.1016/j.rser.2020.110504.

A. K. Bikeri, C. M. Muriithi, and P. K. Kihato, A review of unit commitment in deregulated electricity markets, Proc. Sustain. Res. Innov. Conf., no. May, pp. 913, 2015, [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.840.1 938&rep=rep1&type=pdf.

E. F. Morales and J. H. Zaragoza, An introduction to reinforcement learning, Decis. Theory Model. Appl. Artif. Intell. Concepts Solut., pp. 63-80, 2011, doi: 10.4018/978-1-60960-165-2.ch004.

D. Arora, M. Garg, and M. Gupta, Diving deep in Deep Convolutional Neural Network, Proc. – IEEE 2020 2nd Int. Conf. Adv. Comput. Commun. Control Networking, ICACCCN 2020, pp. 749751, 2020, doi: 10.1109/ICACCCN51052.2020.9362907.

S. Kami and D. Goularas, Evaluation of Deep Learning Techniques in Sentiment Analysis from Twitter Data, Proc. – 2019 Int. Conf. Deep Learn. Mach. Learn. Emerg. Appl. Deep. 2019, pp. 1217, 2019, doi: 10.1109/DeepML.2019.00011.

J. C. Jesus, J. A. Bottega, M. A. S. L. Cuadros, and D. F. T. Gamarra, Deep Deterministic Policy Gradient for Navigation of Mobile Robots in Simulated Environments, 2019.

M. Zhang, Y. Zhang, Z. Gao, and X. He, An Improved DDPG and Its Application Based on the DoubleLayer BP Neural Network, vol. 8, 2020, doi: 10.1109/ACCESS.2020.3020590.

X. Guo et al., A Novel User Selection Massive MIMO Scheduling Algorithm via Real Time DDPG, 2020.

C. Chu, K. Takahashi, and M. Hashimoto, Comparison of Deep Reinforcement Learning Algorithms in a Robot Manipulator Control Application, pp. 284287, 2020, doi: 10.1109/IS3C50286.2020.00080.

D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, Deterministic policy gradient algorithms, 31st Int. Conf. Mach. Learn. ICML 2014, vol. 1, pp. 605-619, 2014.
