A View on Big Data Processing

Ritesh Sanjay Mahajan; Raghava R Koushik

doi:10.17577/IJERTCONV3IS14033

NCACCT - 2015 (Volume 3 - Issue 14)

A View on Big Data Processing

DOI : 10.17577/IJERTCONV3IS14033

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 101
Total Downloads : 16
Authors : Ritesh Sanjay Mahajan, Raghava R Koushik
Paper ID : IJERTCONV3IS14033
Volume & Issue : NCACCT – 2015 (Volume 3 – Issue 14)
Published (First Online): 24-04-2018
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

A View on Big Data Processing

Ritesh Sanjay Mahajan

Dept of Computer Science and Engineering Sai Vidya Institute Of Technology Rajanukunte Bengaluru 560060

ABSTRACT

Big data processing relies today on complex middleware stacks, comprised of high-level languages, programming mod- els, execution engines, and storage engines. Among these, the execution engine is often built in-house, is tightly inte- grated or even monolithic, and makes coarse assumptions about the hardware. This limits how the future big data processing systems will be able to respond to technologi- cal advances and changes in user workloads. To advance the state-of-the-art, we propose a vision for an open big data execution engine. We identify four key requirements and design a modular, task-based architecture to address them. First, to address performance requirements, we envi- sion the use of multi-layer parallelism, which enables the effi- cient use of heterogeneous platforms (from GPUs to clouds). Second, to address elasticity, we propose the use of online provisioning and allocation of cloud-based resources, which enables economically feasible processing of many big data tasks. Third, to address predictability, we define a perfor- mance envelope, which enables early decisions for selecting and sizing execution platforms. Fourth, we characterize the interaction between our execution engine architecture and the other layers, to enable the efficient operation of the en- tire big data processing stack. We further identify and an- alyze eleven challenges in realizing our vision. Ultimately, these challenges will inspire and motivate a new generation of execution engines for big data processing.

Keywords

Large scale big data processing, Execution engine, Predictabil- ity, Elasticity, High-performance

INTRODUCTION

Driven by the steadily increasing volumes of data from many fields of science and society, ranging from biology to astron- omy, and from social networks to finances, big data process- ing is one of the immediate challenges of computing. Com- plex analysis of big data is of interest for many companies

Raghava R Koushik

Dept of Computer Science and Engineering Sai Vidya Institute Of Technology Rajanukunte Bengaluru 560060

and fields, being an important source of both (increased) revenue and scientific discovery [1, 2]. Timely analysis is as important, making big data applications a new field of high- performance computing (HPC) [3]. Although many big data processing stacks already exist, they are monolithic, highly integrated stacks. This lack of flexibility makes the reaction time to changes in workloads and/or platforms unacceptably high, especially for common users. To tackle this problem, this work describes our vision for a big data processing stack that is not monolithic.

Many of the companies that drive the data analytics busi- ness are small and medium enterprises (SMEs), which fo- cus mainly on the design and development of new models and analysis tools. They typically1 process average-sized datasets100s of GBs to TBs of databut cover a broader range of algorithms and domains [5]. This dynamic mix of data and computation makes it difficult to estimate the num- ber and type of permanent resources to cover the request, making dedicated supercomputers and/or clusters unfeasi- ble, both time- and money-wise.

In contrast to dedicated infrastructure, on-demand access to compute resources is much more attractive for these dy- namic enterprises, allowing control over the type and num- ber of resources to be used .Cloud computing has emerged in the past few years as an IT paradigm in which the infras- tructure, the platform, and even the software used by IT operations are outsourced services; Infrastructure as a Ser- vice (IaaS) clouds can lease compute and storage resources to their users. Services can be used flexibly (i.e., when and only for how long they are needed), and leased for prices charged in small increments according to the actual usage. Amazon, Microsoft, Google, and SalesForce are only four examples out of more than one hundred commercial cloud providers.

Commercial IaaS clouds offer a wide diversity of (virtual-

ized) resources, including CPUs of multiple generations and number of cores, GPUs, and even more exotic architectures. Combining these resources leads to an unprecedented mix of parallelism (at multiple-layers) and heterogeneity, which need to be addressed at both the application and the exe- cution engine level .Moreover, systems that lease re-

sources from clouds hold the promise of efficiency by being elastic, that is, by varying in scale over time. Therefore, for big data processing systems to achieve good performance, addressing multiple existing layers of parallelism (i.e., from very fine to very coarse-grained) and dynamic scaling (i.e., on-demand increase of resources) are essential.

Currently, big data processing relies on users implementing their applications in a high level programming model, mak- ing implicit choices about the execution model to be used, often hindering efficient resource usage. For example, when using a MapReduce-like model, users implicitly assume a specific way of dividing the application into homogeneous tasks of predefined sizes, which are then mapped, sched- uled, and executed according to a programming model pol- icy. System elasticity or hardware heterogeneity and multi- grain parallelism cannot be taken into account easily: in- stead, a new programming model would have to be chosen, and the development process restarted.

Achieving high performance is an important goal for SMEs. Although SMEs may also value cost and other non-functional attributes of their applications, these attributes may often be seen as a trade-off in relationship with performance. For example, an SME may aim to obtain the result of data pro- cessing within a deadline, for a certain cost limit; this is essentially a problem of high performance computing when the deadlines are tight (e.g., when working for a contractor, as many SMEs do) and when the cost model is simple (e.g., the per-hour charge of many commercial IaaS clouds). An another example, an SME may strive for a measure of effi- ciency expressed as the maximum amount of processing for a given budget; again, this problem requires achieving high performance.

We argue that, for SMEs, meeting the high performance re- quirements of a large diversity of big data workloads can be efficient only when the heterogeneity and layered parallelism of modern computational systems (from GPUs to clouds), is leveraged. Instead of addressing these quick-paced changes in the resources mix we are currently using for each program- ming/execution model, we envision a novel execution engine that allows for a flexible match between the application and the computational resources, without being directly bound to a programming model. In this article, we focus on the de- sign of this generic execution model, emphasizing its design requirements and restrictions, and discussing the main chal- lenges to be addressed for its performance and compatibility with the rest of the big data processing stack. Towards our vision, we identify and expose the potential bottlenecks in the design and prototype process of such a generic engine, allowing the community to contribute to their solving.

Our main contribution is therefore threefold: (1) We pro- pose a vision for an open, non-monolithic big data process- ing architecture, suitable for SMEs. (2) We identify the key requirements for designing such an architecture. (3) We enu-

merate the first key chalenges (below fifteen of them) that can and should be solved for this vision to become (closer to) reality.

The remainder of this article is structured as follows. Sec- tion 2 introduces the state-of-the-art in big data processing stacks. Section 3 presents the design requirements and the architecture of our execution engine, while Section 4 lists the most important challenges to be solved when implementing it. Section 5 concludes the article with a summary of our findings and contributions.
BACKGROUND

In this section we introduce the general architecture of a big data processing system and discuss the state-of-the-art, for which we identify high-level problems.

Big Data Processing Stack

Big data processing applications make use of increasingly more complex ecosystems of middleware. Figure 1 depicts

the big data ecosystem, where systems-of-systems are all similarly structured into a four-layer big-data processing stack. This ecosystem is based on a variety of open-source middleware, both commercial and freeware, of increasing maturity but with widely different features and performance characteristics. We discuss each of the four layers, in turn.

The High-Level Language allows non-expert programmers to

express complex queries and general data-processing pro- grams in a data manipulation language. A variety of lan- guages exist, many derived from the SQL92 standard (e.g.,

Pig Latin ) or domain-specific (e.g., Spark for

multi-pass algorithms, especially common in machine learn- ing). These languages are supported in the big data ecosys- tem by compiling tools that convert from these languages to specific programming models.

The Programming Model is currently the central layer of the big data stack. Although a variety of programming mod- els exist, several, for example MapReduce, have started to become more commonly used. Similarly to the case of gen- eral programming languages, it is common for the choice for a particular programming model to be made based on perceived popularity, rather than features and proven per- formance; for example, the use of the MapReduce program- ming model for graph processing has been shown to lead to very poor performance in practice, when compared with other data processing stacks, yet it is still widely used. Tools can compile programs expressed in the programming model to programs that can be executed on the available resources by an execution engine.

The Execution Engine provides the automatic, reliable, effi- cient use of computational resources to execute the jobs de- rived from programs expressed in a particular programming model; although more general execution engines may appear soon, the current generation offers native support only for one programming model. Numerous execution engines ex- ist; for the MapReduce programming model, Hadoop and its newer incarnation YARN are open-source, free-to-use execu- tion engines. Depending on the characteristics and quality- of-service requirements of the application, the execution en- gine may require close interaction with a storage engine.

High-level Language FLUME BigQuery SQL Meteor JAQL Hive Pig Sawzall scope DryadlInQ

programe model

PACT Map Reduce model pregel data flow

execution engine

+*'

Flume Dremel Tera Azure Engine service data Engine

Nephele holoop Hadoop/ Griaph

Yarn

MPI/ Dryad Erlang

Tree Engine

S3 GFS Tera Azure

data Data

Store Store

HDFS Voldemort Plus Zookeeper ,CDN ,etc.

L cosmosFS F

S

storage engine'

Figure 1: A view into the ecosystem of Big Data processing.

The Storage Engine provides hardware/software storage for big data applications. It typically provides a common file- system or key-value interface for storing data. The Hadoop Distributed File System is a commonly used

storage engine in the MapReduce stack.

These fours layers are not the only ones appearing in a big data processing stack. Necessarily, a resource substrate pro- vides the actual resources (physical or virtualized). An- other layer contains very diverse middleware, to provide state management services (e.g., Zookeeper), content man- agement, etc.
Problem: The Monolithic Approach

Despite the high diversity in the big data ecosystem, the state-of-the-art in stack deployments is based on monolithic, highly integrated stacks. Current programming models, which may be the first to be fixed in a decision to install a new data processing unit, include and implicitly hide all the details regarding execution. Annotations could help, but a diverse set of useful annotations and a comprehensive analysis of the usefulness of existing annotations do not currently ex- ist. The execution engine is typically deployed on a fixed set of resources, can sometimes execute no more than a few big data processing jobs concurrently, and can rarely coex- ist with other execution engines in the same environment. The storage engine exhibits similar issues to the execution engine, although for large volumes data, instead of for rela- tively small volumes of data and large amounts of computa- tion per deployed compute unit.

The monolithic approach has democratized big data pro- cessing, but at the expense of significant waste of compute and storage capability. Understanding the performance of a big data application, from the information exposed to the programmer, is severely hindered by the aspects abstracted away, such as the mapping and scheduling of applications to resources, provisioning of needed resources (and their re- lease), and the actual execution of the application.

To address these high-level problems that are inherent in current big data stacks, we propose in the following section a generic architecture that promises to address them at the level of the execution engine. Our focus on the execution engine is motivated by the variety of resources managed at this level, including in particular heterogeneous compute el- ements.

3.A GENERIC ARCHITECTURE FOR MANY-TASKS BIG DATA PROCESSING

In this section we discuss the requirements and the design of a generic execution engine for a many-tasks big data pro- cessing.

Requirements

We envision big data processing stacks that exploit afford- able computational capacity, flexible in scale and composi- tion. Such a set of machines, either leased for short peri-

ods of time from a commercial IaaS cloud or even common computers clustered together, is likely to be equipped with multi-core processors and GPU-like accelerators of different capabilities. Therefore, heterogeneity and a large variety of parallelism and performance capability must be expected.

For such computational infrastructures, much poorer in com- munication capacity and latency than high-end dedicated supercomputers, intensive data communication between com- putation tasks will greatly reduce performance. To avoid this bottleneck, data processing applications should be par- titioned into bags of many loosely-coupled or even indepen- dent tasks. Providing these tasks to the execution engine allows it to take into account both the available resources and the performance requirements to provide an efficient,

high-performance execution of the workload.

To design a generic architecture for a performance-aware execution engine with access to different kinds of computa- tional resources, we identify four driving requirements:

Special Issue – 2015

Figure 2: A generic architecture for many -task big- data applications.

International Journal of Engineering Research & Technology (IJERT)

ISSN: 2278-0181

an inter-operated sysNteCmA.CCTT-pi0r1d5lyC, obnyferecnocnesiPdreorcinegedinbgosth the tasks and the resources in the Prediction and classification component, this architecture makes provisions for advanced prediction and classification mechanisms, and in particular for mechanisms supporting tru resource heterogeneity (that is, from GPUs to clouds). Because applications can behave very differently on CPUs and GPUs, adequate mechanisms can offer important opportunities for improvement. By clas- sifying tasks, this component also makes provisions for self- tuning mechanisms, relieving the naÂ¨ive user from one of the burdensome tasks but without necessarily sacrificing per- formance. Fourthly, the provisioning mechanism can ex- tend the functionality of popular Execution Engines, such as Hadoops, with the ability to use an elastic set of resources; for example, it could enable the resource substrate to in- crease with resources leased from commercial IaaS clouds only when needed and for as long needed. These transient resources offer new opportunities for efficient execution mechanisms.

Our architecture addresses the requirements as follows:
- High Performance requirements (addressed in Section 4.2) are fulfilled through the individual behavior and through the interplay of several components, primarily Map- ping and Scheduling, Provisioning, and Execution.
  1. High Performance : the robust exploitation of the in- herent multi-layered parallelism of todays heteroge- neous computational infrastructures must be leveraged this requirement also enables higher-level
    
    trade-offs that may appear in the operation of an SME, such as performance-cost or performance-accuracy.
  2. Elasticity : the elastic (up- and down-scaling) yet ef- ficient provisioning and allocation of resources for a variety of workloads must be enabled.
  3. Predictability : the performance envelope of a given computational capacity, subject to different data sets, algorithms, and processing stacks, should be estimated.
  4. Compatibility : the integration of the execution engine with the high level programming model and the storage system must be provided.
Architecture

To address the four major requirements formulated earlier, we have designed a generic big-data processing architecture depicted in Figure 2. Responding to requirements, our archi- tecture assumes that data-processing applications can fulfill the many-task paradigm: the architecture is designed to ex- ecute many tasks with high throughput.

We will further focus on the Execution Engine, built with four distinctive features. Firstly, this architecture decou- ples the execution engine from the input constraints pro- vided by the programming model (that is, tasks) and by the resource substrate (that is, actual compute resources).

Secondly, by clearly exposing to the user, from the Exe- cution Engine, a set of components and their interconnec- tions, this architecture enables the design, tuning, and im- plementation of each component, either in isolation or as
- Elasticity requirements (Section 4.2) are fulfilled through the individual behavior and through the interplay of Provisioning and Execution.
- Predictability requirements (Section 4.3) are fulfilled through the individual behavior of Prediction and Clas- sification, which takes decisions based on the available resources and tasks to classify the tasks in terms of resources.
- Compatibility requirements (Section 4.4) are addressed by the interfaces offered by the Execution Engine: the back-end of the programming model and the resource specification.

OPEN CHALLENGES

In this section we discuss an open list of challenges that come with the architecture proposed in Section 3. We see each of this challenges as surmountable in the next decade.
The heterogeneous and dynamic nature of the computa- tional infrastructure we are considering poses significant pro- grammability challenges: tasks have to be able to run effi- ciently on different types of archtiectures, from distributed nodes to many-core CPUs and GPUs. Ideally, they should also be ready to address new, exotic architectures that might

Figure 3: Selected policy over the lifetime of a sy – tem with changing workload modes.

be incorporated in the near future (e.g., fused processors or domain-specific processors).

Developing all these different versions of software is a huge programmability challenge. As new workloads are being de- signed and developed for new types of big data analysis, this challenge needs to be addressed in a systematic way. There- fore, defining and using portable programming models is the key to solve this challenge efficiently. Portable versions of all tasks are necessary to address both the elasticity and hetero- geneity of the computational infrastructure. Implementing them in languages like OpenCL or OpenACC will provide a first level of functional portability, but more research is required to achieve acceptable levels of performance porta- bility.
The idea that social links between the users and between the providers of resources in a (data) processing system may influence the operation of the system dates from the early 1960s; for example, shared infrastructure generated off-line, socially guaranteed agreements between users, regarding eq- uitable usage hours. However, recent studies have identified new forms of social influence on resource usage and sharing. The use of data by large groups of scientists has been shown to exhibit filecules, that is, sharing patterns for socially close participants. Similarly, the use of computational re-sources in grids can be partially explained through social patterns A mid-2000s trend of sharing resources based on socially guaranteed contracts led to many resources being gifted, volunteered, or shared. An important challenge is building mechanisms that exploit these social elements to improve resource usage and sharing. The challenge due to data sharing are non-trivial: data replication and transfers, incentives to manage foreign data, privacy-preserving pro- cessing of sensitive data; early studies have not yet addressed many of them.
For an application to be described as a collection of tasks, the entire workload (i.e., algorithm and data) needs to be decomposed. While the migration of the computation de-

pending on the available resources is the responsibility of the execution engine, the similar migration of the data requires interacting with the storage system.

We believe there is much research to be put into optimiz- ing this interaction for two reasons: (1) in elastic and dy- namic systems, such migrations happen very often, and (2) the expected heterogeneity of the platforms will force mi- gration between different storage spaces, depending on the source and destination. They key challenges to address here are understanding the types of data migration and charac- terizing the performance penalty each type infers. Based on this classification, data migration rules can be enforced on the interface between the execution and storage engines, thus limiting the negative performance impact due to too often/too expensive migrations.
CONCLUSION

Big data processing has been increasing in popularity for the past five years. Smart and efficient data analytics be- come important revenue sources for many small and medium enterprises (SMEs), which have the expertise for interest- ing analysis, but no resources to fund expensive infrastruc- ture. We belive that despite relying on commodity hardware and/or cloud services, high performance remains achievable for these common scenarios, but it is currently hindered by the monolithic nature of existing big data processing solu- tions.

Thus, we proposed our vision of an open architecture for big data processing, focusing on its execution engine, the component with the highest performance impact. Our con- tributions are threefold: (1) describing a vision for an open,

fying its main design requirements, and (3) enumerating the most important eleven challenges that need to be addressed to transform this vision into a feasible solution.

Our own work addresses some of these challenges in more detail. For example, for the performance challenges, we are actively involved in designing and evaluating portable par- allel programming models for multi- and many-core archi- tectures as well as in addressing the impact of het- erogeneity on performance We are also working on the problem of predictability – especially on its modeling

and benchmarking components – in our work on large scale graph processing on multiple platforms Finally, our work on scheduling in large scale distributed environments and clouds is addressing the elasticity challenges

All our research efforts in this directions are, of course, only the tip of the iceberg, proving that solutions do exist for im- proving the state-of-the-art, but much more needs to be done to fully materialize our vision of an open big data processing architecture, with a flexible, performance-aware execution engine that will allow regular users (SMEs) to efficiently use the resources available to them (off-the-shelf or from a cloud, homogeneous or heterogeneous) to run their analytics.

Acknowledgment

The authors are grateful for the comments of the reviewers. This publication is supported under the guidance of Sreelatha mam .

dept of computer science

Sai Vidya institute of Technology (sreelatha.pk@saividya.ac.in)
REFERENCES

V. R. Borkar and M. J. Carey, Big data technologies circa 2012, in COMAD, 2012, pp. 12-14.
R. Ramakrishnan, Big data in 10 years, in IPDPS, 2013, p. 887.
V. R. Borkar, M. J. Carey, and C. Li, Big data platforms: whats next? ACM Crossroads, vol. 19, no. 1, pp. 44-49, 2012.
Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup,

C. Martella, and T. L. Willke, How well do graph-processing platforms perform? an empirical performance evaluation and analysis: Extended report, Delft University of Technology, Tech. Rep. PDS-2013-004, 2013. [Online]. Available:

http://www.pds.ewi.tudelft.nl/fileadmin/pds/reports/

2013/PDS-2013-004.pdf
M. Stonebraker and J. Robertson, Big data is buzzword du jour; cs academics have the best job, Commun. ACM, vol. 56, no. 9, pp. 10-11, 2013.

A View on Big Data Processing

Leave a Reply