Effective use of Resources for Cost Minimization on Big Data in Distributed Data Centers

Download Full-Text PDF Cite this Publication

Text Only Version

Effective use of Resources for Cost Minimization on Big Data in Distributed Data Centers

Avinash K.P.

Postgraduate Student, Department of CSE.

P.E.S College of Engineering, Mandya.

Dr. Minavathi.

Professor & Head, Department of ISE.

      1. college of Engineering, Mandya.

        Abstract-The speed at which business moves today, combined with the sheer volume of data created in digitized world, requires new approaches to deriving value from data. Big data is high volume, high velocity, and includes a high variety of sources of information. If you living in 100% of data world 90% of data has been generated over past 2 years because accessing the information has become gradually increasing in number, hence there will be a huge amount of data gets generated. Now a days all the internet providers has their data centers and those data centers has number of servers,eg,Google has 13 data centers across 8 contries and facebook has 6 data centers. According to gartner predicts 71% of worldwide data center hardware spending will come from the big data processing, which surpass $126 billon. Hence it is vital importance to study cost minimization problem for big data processing in distributed data centers. As a result, four factors, i.e., how to place chunks to the server, how to distribute these chunks to the servers, task assignment and data movement, deeply influences operational expenditure of data centers. Hence we propose a two- dimensional Markov chain model to describe a task completion time, joint optimization of that four factors. Hence we propose an matrix approach for data placing in the data centers. we model the problem as a mixed-integer non-linear programming (MINLP) and propose an efficient solution to linearize it.

        Keywords-Big Data; Chunks placing; Matrix Approach; Task Assignment; Data Movement; ;Cost Minimization ;Distributed Data Cente ;

        1. INTRODUCTION

          According to McKinsey Big Data refers to datasets whose size are beyond the ability of typical database software tools to capture, store, manage and analyse[1].

          Big data analysis is one of the key challenges of current era. Big data is an one of the emerging hot research topic because its mostly used in data center application in human society, such as government, climate, finance, and science. Currently, most research work on big data falls in data mining, machine learning, and data analysis[2].

          Data volume is also growing exponentially due to the explosion of machine-generated data (data records, web-log files, sensor data) and from growing human engagement within the social networks. Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications[3].

          Data are now the equivalent of oil or gold. Big data has arrived. It is changing our lives and changing the way we do business. Some examples includes, Google uses big data to predict the next wave of Influenza. IBM uses data to optimize traffic flow in the city of Stockholm, and to get the best

          possible air quality. Dr.Jeffrey Brenner, a physician in New Jersey, uses medical billing data to map out hot spots where you can find his citys most complex and costly healthcare cases as part of a program to lower healthcare costs. The National Center for Academic Transformation is using data mining to help understand which college students are more likely to succeed in which courses[4].

          Amortized cost

          Components

          Sub- Components

          45%

          Servers

          CPU, Memory

          25%

          Infrastructure

          Power, Cooling

          15%

          Power draw

          Electrical Utility

          15%

          Network

          Link, equipment.

          Table 1[5].Cost of data centers.

          Big Data processing mainly depends on parallel programming models like MapReduce, as well as providing a cloud computing platform of Big Data services for the public. MapReduce is a batch-oriented parallel computing model. Hence it is very important to study cost minimization problem in data centers.

          Data center resizing (DCR) has been proposed to reduce the computation cost by adjusting the number of activated servers via task placement [6]. These data centers run hundreds of or thousands of servers, so it consumes megawatts of power with massive carbon footprint, and also incur electricity bills of millions of dollars [7].

          In the existing system there is no way balancing the load in the servers, some of the servers may sit idle. Hence there is waste of resources. Second, the links in the network vary on the transmission rate and costs according to their unique features [8] for e.g. the distances and physical optical fibre facilities between data centres. Due to storage and computation capacity constraints, all the tasks cannot be placed on the same server, on which their corresponding data resides hence here the transmission cost (e.g. energy) is nearly proportional to the number of links used while satisfying all the transmission requirements. Third, the Quality-of-Service (QoS) of the data tasks is not being considered in the existing work. The big data applications exhibit Service-Level-Agreement (SLA) Between service provider and the requesters. To observe this, a certain level of QoS is guaranteed and any cloud computing tasks QoS is

          firstly determined by where they are placed and how many computation resources are allocated to them. Like in [9], the general cloud computing task mainly focuses on computation capacity constraints by ignoring their transmission rates.

          To overcome the above weaknesses, To describe the rate-constrained computation and transmission in big data processing process, we propose a two dimensional Markov chain and derive the expected task completion time in closed form. Based on the closed-form expression, we formulate the cost minimization problem in a form of mixed integer nonlinear programming (MINLP) to answer the following questions: 1.How to place these data chunks in the servers. 2.How to distribute chunks onto the servers without violating the resource constraints. 3.Assigning task to the Server. 4.Data movement. Through extensive numerical studies, we show the high efficiency of our proposed joint-optimization based algorithm.

        2. RELATED WORK

          In this section we go through the related work of various methods such as sever cost minimization, multi-level power management, data placement and distributed load balancing.

          1. Server Cost Minimization

            As shown in Table 1, the greatest data center costs go to servers. For example, assuming 50,000 servers, a relatively aggressive price of $3000 per server, a 5% cost of money, and a 3 year amortization, the amortized cost of servers comes to $52.5 million dollars per year. With prices this high, achieving high utilization, i.e. useful work[10] accomplished per dollar invested, is an important goal. Unfortunately, utilization in the data center can turn out to be remarkably low; e.g., 10%. There are some structural reasons for this:

            Uneven Application fit: A server integrates CPU, memory, network and (often) storage components. It is often the case that the application fit in the server does not fully utilize one or more of these components.

            Uncertainty in demand forecasts: Cloud service demands can spike quickly, especially for new services, far beyond what coventional (say 95th percentile-based) forecasts would predict.

            Long provisioning time scales: Purchases, whether for upgrades or new builds, tend to be large, with components bought in bulk. Infrastructure is typically meant to last very long time periods; e.g., fifteen years. Servers are meant to last as long as 3-5 years, with increments ordered quarterly or yearly.

            2.A.1. Reducing These Costs

            Decreasing the power draw of each server is clearly has the largest impact on the power cost of a data center, and it would additionally benefit infrastructure cost by decreasing the need for infrastructure equipment. Those improvements are most likely to come from hardware innovation, including use of high efficiency power supplies and voltage regulation

            modules. Barroso and Hlzle introduced the term energy proportionality to refer to the desirable property that a server running at N% load should consume N% power [10]. Creating servers that are closer to implementing energy proportionality would improve efficiency. One area of innovation that is impacted by networking is the idea of running the data center hotterliterally reducing the amount of cooling to save money on cooling equipment and the power it consumes. Initial experiments show that equipment failure rates increase with temperature, so the research challenge becomes determining what and how to harden. For example, the network may have to become more resilient and more mesh-like.

          2. Multi-level Power Management

            The coordination problem has been seeked. There are two key contributions. First, a power management solution that coordinates different individual approaches has been proposed and validated. Using simulations based on 180 server traces from nine different real-world enterprises, demonstrate the correctness, stability, and efficiency advantages of solution. Second, using unified architecture as the base, a detailed quantitative sensitivity analysis has performed and draw conclusions about the impact of different architectures, implementations, workloads, and system design choices. Perform a detailed sensitivity analysis to evaluate several interesting variations in the architecture and implementation, and in the mechanisms and policies space is the main advantage. Power delivery, electricity consumption, and heat management are becoming key challenges in data center environments. There is individual solution to solve this problem no coordination between them were the demerits.

          3. Data Placement

            Shachnai et al. [11] investigated on the placement of Video-on-Demand (VoD) file copies on the servers and the amount of load capacity assigned to each file copy to minimize the communication cost while ensuring the user experience. Agarwal et al. [12] came up with an automated data placement mechanism named Volley. The cloud services use volley by submitting logs of the datacenter request. Volley analyses these logs using an iterative optimization algorithm based on the data access patterns and client locations. Cidon et al. [13] proposed the idea of Min Copysets, which is a data replication technique that decouples data distribution and data replication to improve data durability in the distributed data centres.

          4. Distributed Load Balancing

          The exploration of whether geographical load balancing can encourage use of green renewable energy and reduce use of brown fossil fuel energy has done. It makes two contributions. First, derive two distributed algorithms for achieving optimal geographical load balancing. Second, show that if electricity is dynamically priced in proportion to the instantaneous fraction of the total energy that is brown, then geographical load balancing

          provides significant reductions in brown energy use. Geographical load balancing provides a huge opportunity for environmental benefit as the penetration of green, renewable energy sources increases. Specifically, an enormous challenge facing the electric grid is that of incorporating intermittent, unpredictable renewable sources such as wind and solar. Geographical load balancing aims to reduce energy costs, but this can come at the expense of increased total energy usage. By routing to a data center farther from the request source to use cheaper energy, the data center may need to complete the job faster, and so use more service capacity, and thus energy, than if the request was served closer to the source.

        3. MODEL

          In this section we model the system and we declare the various constants and variables and then we make the network model to form data centre topology and task model for finding task for an job.

          TABLE 2(Notations)

          Constants

          Ji

          The set of servers in data center i

          mi

          The switch in data center i

          w(u,v)

          The weight of the link

          k

          The size of chunk k

          k

          The task arrival rate for data chunk k

          P

          The number of data chunk replicas

          D

          The maximum expected response time

          Pj

          The power consumption of server j

          (u,v)

          The transmission rate of link (u,v)

          Variables

          xj

          A binary variable indicating if server j is activated or not

          yjk

          A binary variable indicating if chunk k is placed on server j or not

          (u,v)

          zjk

          A binary variable indicating link (u,v) is used for flow for chunk k on server j

          k

          The request rate for chunk k on server j

          jk

          The CPU usage of chunk k on server j

          jk

          The CPU processing rate of chunk k on server j

          f (u,v) jk

          The flow for chunk k determined to server j through link (u,v)

            1. Network Model

              The model consists of all the servers of the same data centre (DC) that are connected to the local switches, and these data centers are connected through switches as shown in Fig 2. There are a set of D datacenters and each datacenter d D consists of a set Si of servers that are connected to a

              switch wi W with its transmission cost of CL. Here the transmission cost CT for inter data centre traffic is greater than CL, i.e., CT > CL. Here all the computation resource and storage capacity, both of which are normalized to a unit. We use S to denote the set of all servers i.e., S = S1 S2 S3. S|D|. The whole system is modelled as a directed graph G = (N, E). The vertex N = W S (refer Table 2) which is set W of all switches and the S is the set of all servers, and E is the edge set. All the switches are connected to their local switch by intra-data centre links while the switches are connected by inter-data centre links, which is identified by their physical connection. The weight of each link l(u,v) represents the corresponding communication cost, can be defined as:

              C

              C

              DC D

              DC

              Fig 1.Data center topology

            2. Task Model

          The data stored in the distributed data centres are divided into a set H of data chunks. Each chunk h H has the size of h (h 1) which is normalized the servers storage capacity. The idea here is that there are exactly P numbers of copies stored in the distributed file system for managing the fault tolerance. The tasks that arrive to the datacenters during a particular time period can be visualized as a Poisson process [14]. In particular, let h be the average task arrival rate requesting chunk h. These tasks will be distributed to the servers with a fixed probability and hence te task arrival in each server can also be considered as a Poisson process. So, here the average arrival rate of the task for chunk h on server S sh (sh 1).

        4. PROBLEM FORMULATION

          4.1 Constraints of Data and Task Placement

          We define a binary variable yjk to denote whether chunk k is placed on server j as follows,

          In the distributed file system, we maintain P copies for each chunk k K, which leads to the following constraint:

          (X1=x1, , ,Xn=xn)>0

          The possible values of Xi form a countable set S called the state space of the chain. n-dimensional Markov chain works depends upon the processing , loading and task distribution.

          4.3 Matrix Approach for Data Placement

          Genetic algorithm is an adaptive search and optimization algorithm based on the mechanics of natural selection and natural genetics [16][17].A population of candidate solutions (called individuals) to an optimization problem is evolved toward better solutions [18][19]. In

          jJ yjk = P, j K

          (3)

          genetic algorithm, the degree of adaptation of each individual to the environment is represented by fitness. Individual with

          Furthermore, the data stored in each server j J cannot exceed its storage capacity, i.e.,

          . 1,

          (4)

          As for task distribution, the sum rates of task assigned to each server should be equal to the overall rate,

          high fitness has a greater chance to survive. In each generation, the fitness value of each individual in the population is evaluated, individuals with higher fitness value are stochastically selected from the current population, then crossover and mutation operator are manipulated to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm.

          1.Encoding: In the issue of genetic algorithm-based data placement, the placement of datasets in data centers is

          =

          ,

          represented by matrix B, because of the string structure of

          (5)

          matrix, the placement matrix is directly manipulated as genotype in genetic algorithm. 2.Individual and population: An individual is a point in the searching space, the collection

          Finally, we define a binary variable xj to denote whether

          server j is activated, i.e.,

          A server shall be activated if there are data chunks placed onto it or tasks assigned to it. Therefore, we have

          of placement matrices is the searching space of the data placement. A population consists of several individuals and it is a subset of the whole searching space. 3.Fitness function Fitness function is the evaluation function to guide the search in genetic algorithm [20][21]. In the issue of genetic algorithm-based data placement, the objective function is denoted as (B), and the fitness function is the reciprocal of the objective function, that is = 1/(B). 4.Genetic operators:4a.Selection: Roulette wheel selection is a genetic operator used in genetic algorithms for selecting potentially useful solutions for recombination , chromosomes with

          +

          +

          higher fitness level are more likely to be selected. The steps of Roulette wheel selection are as follows: Step a1: Obtain

          4.2 Markov Model

          + ,

          , (7)

          the fitness value f(i) of each individual in a N size population. Step a2: Suppose there is an individual and its probability of being selected is p(k ) :

          A Markov is a mathematical system that undergoes transitions from one state to another on a state space. It is a random process usually characterized as memory less: the next state depends only on the current state and not on the sequence of events that preceded it. Memory lessness is called the Markov property. Markov chains have many applications as statistical models of real-world processes .A Markov chain is a sequence of random variables X1, X2, X3, with the Markov property, namely that, given the present state, the future and past states are independent[15]. Formally, if both conditional probabilities are well defined,

          i.e. if

          1

          () = ()/ (); = 1,2, ,

          =1

          Step a3:Suppose q(0)=0,q(k)=p(1)+p(2)++p(k);k=1,..0; Step a4: Generate a random number r (0 r < 1), if q(k1) < r < q(k), then individual k is selected.

          Step 4b: Crossover: In Genetic algorithm, crossover operator is used to vary the programming of a chromosome or chromosomes from one generation to the next. Before crossover, individuals in the population should be paired randomly and choose the crossover point and then exchange some genes. Step 4c: Mutation: Mutation alters one or more genes in a chromosome from its initial state. For a binary string, if the genome bit is 1, it is changed to 0 and vice versa.

          When the mutation operator is used in a placement matrix, a random number r1 (0 r1 < n) is generated, and the placement of dataset 1 r d is to be changed, then generate two random number r2, r3 (0 2 r3 < l), If r1r2 is 1, then change it from 1 to 0, and r1r3 from 0 to 1.

            1. Constraints on QoS Satisfaction

              jk

              jk

              The loading rate of jk is constrained by the rate on

              MINLP:

              min : (13);

              s.t. : (3) (5),(7) (9),(11),(12),

              ,,,, {0,1}, ,

              any link (u,v) denoted as (u,v), if a non-zero flow f

              (u,v) goes

              Note that the above formulation is based on the setting that

              jk

              jk

              through it. This condition can be described by a binary variable z (u,v) as

              (,) (,) (,), (, ) , , .

              the number of replicas for each data chunk is a predetermined constant. If it is a part of the optimization, denoted by an integer variable p, the total cost can be further minimized by the formulation below.

              (8)

              MINLP-2:

              Finally, the constraints on jk is given as

              (,). (,) + 1 (,), (. ) , , .

              min : (13);

              (9)

              . .: = , ,

              According to the Littles law, the expected delay djk of user requests for chunk k on server j is

              P 1;

              (4),(5),(7) (9),(11),(12),

              = = 1

              , ,

              ,

              ,

              ,

              {0,1},

              , .

              (10)

              According to the QoS requirement, i.e., djk D, we have

              Electricity cost:

              (11)

              , , .

              The electricity cost another burden in distributed data centers. Because more energy will be using in data centers . All the hardwares work without electricity. Proposed a novel, data-centric algorithm used to reduce

              The binary variable of ujk can be described by constraints:

              , ,

              (12)

              where A is an arbitrary large number, because of 0 < jk <1 and ujk {0,1}.

            2. An MINLP Formulation

          The total energy cost then can be calculated by summing up the cost on each server across all the28 geo- distributed data centers and the communication cost, i.e.,

          = . +

          (,) (,). (,), (13)

          where Pj is the cost of each activated server j.

          Our goal is to minimize the total cost by choosing the best

          energy costs and with the guarantee of thermal-reliability of the servers in geo distributed data centers. And also using the n-dimensional markov chain algorithm to reduce the electricity cost.

          Server cost:

          In distributed data centers hundreds of servers used. Because of this automatically the server cost will be increases. How to reduce the server cost means using communications and data placement and task assignment approach. Number of sever will be reduced meas at a mean time the energy cost also decrease. Server cost reduced using the joint optimization of these three factors such as task assignment , data placement and data routing via a n- dimensional markov chain. To efficiently manage the datacenter resizeing. Proposed the optimal workload and balancing of latency, electricity prices and the energy consumption.

        5. SYSTEM IMPLEMENTATION

          settings of xj, yjk, z (u,v), jk, jk, and f (u,v). By summarizing

          jk jk

          all constraints discussed above, we can formulate this cost minimization as a mixed-integer nonlinear programming (MINLP) problem as:

          This system output is a simulation based studies. It will be evaluated using NS2 tool. The evaluation base on the effect of the number of servers, on the effect of task arrival rate, on the effect of data size, on the effect of expected task

          completion delay, and on the effect of number replicas. Based on these we using joint optimization of task assignment, data placement and data routing via a n-dimensional markov chain algorithm. To reduce the over all server, electricity and data placement cost in distributed data centers.

        6. CONCLUSION

Thus the study of the data placement, task assignment, data center resizing and routing to minimize the overall operational cost in large-scale distributed data centers for big data applications has done. Therefore first characterize the data processing process using a two- dimensional Markov chain and derive the expected completion time in closed-form, based on which the joint optimization is formulated as an MINLP problem. To tackle the high computational complexity of solving MINLP, linearize it into an MILP problem.

REFERENCES

[1].James Manyika, et al. Big data: The next frontier for innovation, competition, and productivity. IEEE Network July/August 2014.

[2].Ovum. What is Big Data: The End Game. available from: http://ovum.com/research/what-is-big-data-the- end-game/ [Accessed 9th July 2012].

[3].Global information technology report 2014 world economic forum. [4].The Cost of a Cloud: Research Problems in Data enter Networks, Albert

Greenberg.

[5].James Hamilton, David A. Maltz, Praveen Patel. Microsoft Research

,WA,USA.

[6].L. Rao, X. Liu, L. Xie, and W. Liu, Minimizing Electricity Cost Optimization of Distributed Internet Data Centers in a Multi- Electricity Market Environment in Proceedings of the 29th International Conference on Computer Communications (INFOCOM). IEEE, 2010, pp. 19.

[7].GAO, P. X., CURTIS, A. R.WONG, B.ANDKESHAV, S. Its not easy

being green. In Proc ACMSIGCOMM(2012).

[8].I. Marshall and C. Roadknight, Linking cacheperformance to user behavior, Comput. Netw. ISDN Syst., vol. 30, no. 223, pp..2123- 2130, 1998.

[9].L. Rao, X. Liu, L. Xie, and W.Liu, Minimizing electricity cost: Optimization of distributed internet data centers in a multielectricity- market environment, in Proc 29th Int. Conf. Comput. Commn. 2010, pp1-9.

[10].The Cost of a Cloud: Research Problems in Data Center NetworksAlbert Greenberg, James Hamilton, David A. Maltz, Parveen Patel Microsoft Research, Redmond, WA, USA.

[11].X. Fan, W.-D. Weber, and L. A. Barroso, Power Provisioning for A Warehouse-sized Computer, in Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA). ACM, 2007, pp. 1323.

[12].K. Church, J. Hamilton, and A. Greenberg. On delivering embarassingly distributed cloud services. In Hotnets VII, October 2008.

[13].Cisco. Data center ethernet. http://www.cisco.com/en/US/- netsol/ns783/networking solutions package.html.

[14].P. X. Gao, A. R. Curtis, B. Wong, and S. Keshav, Its Not Easy Being Green, in Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM). ACM, 2012, pp. 211222.

[15].Z. Liu, Y. Chen, C. Bash, A. Wierman, D. Gmach, Z. Wang, M. Marwah, and C. Hyse Renewable and Cooling Aware Workload Management for Sustainable Data Centers, in Proceedings of International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). ACM, 2012, pp. 175186.

[16].Grant, K. (1995) An Introduction to Genetic Algorithm. C/C++ Users Journal, 13, 45-58.

[17]. Zhou, M. and Sun, S.D. (1999) Genetic Algorithms and Applications.

National Defense Industry Press, Beijing.

[18].Mitchell, M. (1996) Introduction to Genetic Algorithm. MIT Press, Cambridge, MA.

[19].Pan, W., Diao, H.Z. and Jing, Y.W. (2006) A Improved Real Adaptive Genetic Algorithm. Control and Decision, 21,792-795.

[20].Polgar, O., Fried, M., Lohner, T. and Barsony, I. (2000) Comparison of Algorithms.

[21]. Tan, B.C., et al. (2008) A Kind of Improved Genetic Algorithms Based on Robot Path Planning Method. Journal of Xian University of Technology, 28, 456-459.

Leave a Reply

Your email address will not be published. Required fields are marked *