A Study on Fault Tolerance Solution

DOI: 10.17577/IJERTCONV4IS21030


Dr. J Meena Kumari

Professor & Head, Computer Science Department,

Oxford College of Science, Bangalore, India.

Shaima'a, Ghamdan

Research Scholar, Computer Science Department, Jain University, Bangalore, India.

Abstract: Cloud computing has gained popularity in recent years. Node failure while applications run on servers is a major concern to be addressed; the ability to cope with such failures is called fault tolerance. The reliability and availability of nodes during the execution of critical service applications is a major concern, as failures can degrade the quality of service provided by cloud service providers.

In order to reduce the impact of failures when an application runs on the cloud, there should be mechanisms to anticipate failures so that they can be proactively addressed. This paper surveys the fault tolerance techniques used to deal with software faults in a virtualized cloud environment. Fault tolerance is an important issue in job scheduling on cloud data centers; one way of providing it is to schedule multiple copies of a task on different virtual machines. Replication is one of the most common fault tolerance methods, but it still faces many problems. This paper discusses the different fault tolerance techniques and highlights the problems encountered by two of them: proactive fault tolerance and replication.

Keywords: fault tolerance, real-time system, virtual machine, proactive fault tolerance, replication.

  1. INTRODUCTION:

    Computer network technologies have improved greatly over the last 20 years. Since the arrival of the Internet, the networking of computers has led to several novel advancements in computing technologies such as distributed computing and cloud computing. The terms distributed systems and cloud computing systems refer to different things; however, the underlying concept behind them is the same.

    Companies outsource their IT services to third-party providers due to the high cost of maintaining internal infrastructure. This trend led to the emergence of the so-called cloud computing approach, and these providers are called cloud service providers. One of the greatest advantages of cloud computing is that it allows customers to pay only for the amount of resources they use. The cloud provider is responsible for administering the cloud resources, which include hardware, virtual machines (VMs) and services, and for managing the capacity of the cloud. Cloud customers use the resources provided by the cloud to deploy and execute their applications.

    When customers' real-time applications run on cloud infrastructure, the chance of node failure is quite high, and many of the applications which run on the cloud are safety-critical systems. In general, a real-time system is one that must process information and produce a response within a scheduled time, or else risk severe consequences including failure [12]. Reliability therefore depends not only on the logical result, but also on the time of its delivery [26]. A failure to respond is as bad as a wrong response [17]. The two characteristics which determine the reliability of cloud systems are timeliness and fault tolerance.

    Fault tolerance is a concept widely used in computing to indicate that a system continues working in the presence of failures. It has gained more attention with the emergence of cloud computing: in the large-scale and dynamic environment of clouds, errors become more common and fault tolerance becomes a must. Unlike general-purpose systems, where missing a deadline is relatively tolerable, the consequences of missing a deadline while scheduling jobs in cloud data centers can be catastrophic.

    In cloud computing there are two types of fault tolerance: reactive and proactive. Reactive fault tolerance removes or recovers from a fault after it occurs; it basically reduces the effect of errors on application execution. Proactive fault tolerance refers to avoiding faults, errors and failures by predicting them in advance.

    To make execution succeed, different replicas of a task are run on different resources until the replicated task completes. Replication is one of the most commonly used techniques in clouds and is considered a reactive technique for reducing the impact of failures on the system. Proactive FT, in contrast, keeps an application alive by avoiding failures through preventative measures: pre-fault indicators, such as a significant increase in heat, can be used to avoid an imminent failure through anticipation and reconfiguration. A sketch of reactive task replication is given below.
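    As an illustration of the reactive replication idea described above, the following minimal Python sketch runs several replicas of the same task on different (simulated) resources and accepts the first replica that finishes successfully. The function names and the resource list are hypothetical stand-ins for real VM endpoints, not part of any actual cloud API.

import concurrent.futures
import random

def run_on_resource(task, resource):
    """Hypothetical execution of a task on one cloud resource (VM)."""
    if random.random() < 0.3:                # simulate a node/VM failure
        raise RuntimeError(f"{resource} failed while running {task}")
    return f"result of {task} from {resource}"

def replicated_execution(task, resources):
    """Reactive replication: launch one replica per resource and
    return the first result that completes without a failure."""
    with concurrent.futures.ThreadPoolExecutor(len(resources)) as pool:
        futures = [pool.submit(run_on_resource, task, r) for r in resources]
        errors = []
        for f in concurrent.futures.as_completed(futures):
            try:
                return f.result()            # first successful replica wins
            except RuntimeError as e:
                errors.append(e)             # this replica failed; wait for the others
        raise RuntimeError(f"all replicas failed: {errors}")

if __name__ == "__main__":
    print(replicated_execution("job-42", ["vm-1", "vm-2", "vm-3"]))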

  2. LITERATURE REVIEW

    Cloud computing differs from a standard distributed system in that it is a real-time system. Malik and Huet in [21] define a real-time system as any information processing system which has to respond to externally generated input stimuli within a finite and specified period of time [12]. Correctness therefore depends not only on the logical result, but also on the time at which it is delivered [26]; a failure to respond is as bad as a wrong response [15][17]. These systems have two main characteristics that separate them from other general-purpose systems: timeliness and fault tolerance [36]. By timeliness, we mean that each task in a real-time system has a time limit within which it has to finish its execution; by fault tolerance, we mean that the system should continue to operate in the presence of faults [6]. Cloud computing has gained popularity because it provides an efficient way to use shared computing resources. When several instances of an application run on several virtual machines, a problem that arises is the implementation of an autonomic fault tolerance technique that handles a server failure so as to assure system reliability and availability.

    Proactive FT keeps an application alive by avoiding failures through preventative measures. In contrast, reactive FT keeps an application running through recovery from experienced failures. More precisely, proactive techniques anticipate, while reactive mechanisms respond. Although reactive solutions may take preparatory measures, they do not avoid failures; instead, applications experience failures and then recover [34].

    In a dynamic cloud environment it is important that tasks complete within their deadlines even in the presence of a virtual machine failure. There are three identifiable layers in a cloud platform: resources, VMs and applications, and each of them can be affected by failures. Therefore, we identify three types of failure in a cloud platform: hardware failure, VM failure and application failure.

      1. FAULT TOLERANCE

        In general, a failure represents the condition in which the system deviates from fulfilling its intended functionality or expected behavior. A failure happens due to an error, that is, due to reaching an invalid system state. Fault tolerance, in turn, is the ability of the system to perform its function even in the presence of failures [14].

        Fault tolerance is a setup or configuration that prevents a computer or network device from failing in the event of an unexpected problem or error [13]. It helps to achieve system dependability. Dependability is related to some QoS aspects provided by the system and includes attributes such as reliability, safety and availability [4].

        According to R. Jhawar and V. Piuri [14], fault tolerance is the ability of the system to perform its function even in the presence of failures, while T. Nguyên et al. [25] define fault tolerance as a generic term that has long been used to name the ability of systems and applications to handle errors.

        X. Kong et al. [19] presented a model for virtual infrastructure performance and fault tolerance, but it is not well suited for fault tolerance of real-time cloud applications.

        A VM failure is detected when the VM does not respond to a ping request or when the hypervisor indicates an error state. The research work in [33] focuses on fault tolerance in cloud computing platforms, and more precisely on autonomic repair in case of faults; it discusses the implications of the split of responsibility between provider and customer for the implementation of fault tolerance. In most current approaches, fault tolerance is handled exclusively by either the provider or the customer, which leads to partial or inefficient solutions. Solutions which involve collaboration between the provider and the customer are more promising.

        In [14] the authors focus on characterizing the recurrent failures in a typical cloud computing environment, analyzing the effects of failures on users' applications, and surveying fault tolerance solutions corresponding to each class of failures. They discuss the perspective of offering fault tolerance as a service to users' applications as one of the effective means of addressing users' reliability and availability concerns. The most widely adopted methods in this approach achieve fault tolerance against crash faults and byzantine faults. Using the FTM framework, the notion of providing fault tolerance as a service can therefore be effectively realized for the cloud computing environment. The main advantages of using fault tolerance in cloud computing include failure recovery, lower costs, and improved performance standards [10], [1]. The notions of fault, error and failure are understood as a chain [14] with the following definitions [5], [2]:

        Fault => Error => Failure => Fault => Error => Failure

      2. PROACTIVE FAULT TOLERANCE

        The notion of proactive fault tolerance emerged in recent years. It is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating parts of an application (task, process, or virtual machine) away from nodes that are about to fail. Pre-fault indicators, such as a significant increase in heat, can be used to avoid an imminent failure through anticipation and reconfiguration. Since avoiding a failure through preemptive migration is significantly more efficient than recovering from failure via traditional reactive FT mechanisms, such as checkpoint/restart, HPC system utilization becomes more efficient [34].
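        As a minimal illustration of the preemptive-migration idea described above (a sketch, not the actual mechanism of [34]), the following Python snippet monitors a hypothetical per-node health indicator (temperature) and moves the VMs of any node whose reading crosses a threshold before that node actually fails. The node names, threshold and migrate() call are assumptions for illustration only.

# Proactive fault tolerance sketch: migrate VMs away from nodes that look about to fail.
TEMP_THRESHOLD_C = 85.0          # assumed pre-fault indicator threshold

nodes = {                        # hypothetical node -> (temperature, hosted VMs)
    "node-1": (72.0, ["vm-a"]),
    "node-2": (91.5, ["vm-b", "vm-c"]),   # running hot: likely to fail soon
    "node-3": (65.0, []),
}

def pick_target(nodes, exclude):
    """Choose the coolest healthy node as the migration target."""
    candidates = {n: t for n, (t, _) in nodes.items()
                  if n != exclude and t < TEMP_THRESHOLD_C}
    return min(candidates, key=candidates.get) if candidates else None

def migrate(vm, src, dst):
    # A real system would invoke the hypervisor's live-migration facility here.
    print(f"migrating {vm}: {src} -> {dst}")

for node, (temp, vms) in list(nodes.items()):
    if temp >= TEMP_THRESHOLD_C:             # pre-fault indicator fired
        target = pick_target(nodes, exclude=node)
        for vm in vms:
            if target:
                migrate(vm, node, target)    # avoid the failure instead of recovering from it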

        There is a need for standardized metrics and interfaces. Currently, each solution uses its own metrics for measuring and evaluating health and its own interfaces between components, which makes comparison and integration unnecessarily difficult. Another challenge is the need for reliable analysis that factors in performance parameters, as the goal is to improve time to solution. More extensive work on performability analysis for HPC is needed [34].

      3. FAULT TOLERANCE TECHNIQUES IN CLOUDS

        The existing fault tolerance techniques in cloud computing consider various parameters: throughput, response time, scalability, performance, availability, usability, reliability, security and the associated overhead, as mentioned in [3][14]. A large number of studies have been conducted on data fault-tolerant systems in recent years, resulting in new strategies for identifying the advantages and obstacles of fault-tolerant systems. In this section, we introduce the most recent fault tolerance techniques in cloud computing. Their main limitation is that we cannot tolerate what we don't expect.

        Fault masking: Fault masking is a structural redundancy technique that corrects faults immediately and applies to any kind of hardware redundancy. It completely masks faults within a set of redundant modules: a set of similar modules execute the same function and their outputs are voted on to remove the errors created by a faulty module (error voting). The drawback of this approach is that if faults are only hidden and fault detection is not used, the faulty components will not be detected; the available redundancy gradually decreases and the system is not aware of it.
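        For illustration, a minimal Python sketch of the error-voting step follows: three redundant modules compute the same function and a majority vote masks a single faulty output. The module functions are hypothetical stand-ins for redundant hardware or software units.

from collections import Counter

def module_a(x): return x * x          # healthy replica
def module_b(x): return x * x          # healthy replica
def module_c(x): return x * x + 1      # faulty replica producing a wrong value

def majority_vote(outputs):
    """Return the value produced by the majority of modules, masking a single fault."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: too many faulty modules")
    return value

outputs = [m(7) for m in (module_a, module_b, module_c)]
print(majority_vote(outputs))          # 49: the faulty module's output is masked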

        Reconfiguration: The ability of a system to alter the active interconnection among its modules has served different purposes and strategies, ranging from the relatively simple desire to formalize procedures that all processes have in common, to reconfiguration for improved fault tolerance, to reconfiguration for performance enhancement, either by simply maximizing system use or by sophisticated matching of the topology to the specific needs of a given process. The main drawback of this approach is that it discards the nominal controller completely in favor of a redesigned optimal controller; it is therefore limited to systems which use an optimal controller or at least an observer/state-feedback controller.

        Checkpointing: This is an efficient task-level fault tolerance technique for long-running and large applications. In this scenario, a checkpoint is taken after every significant change in the system state. When a task fails, rather than restarting from the beginning, the job is restarted from the most recently checkpointed state.
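        A minimal Python sketch of the checkpoint/restart idea follows: it periodically saves the state of a long-running loop to disk and, on restart, resumes from the last saved checkpoint instead of from the beginning. The file name and checkpoint interval are illustrative assumptions.

import os
import pickle

CHECKPOINT_FILE = "job.ckpt"     # hypothetical checkpoint location
CHECKPOINT_EVERY = 1000          # iterations between checkpoints

def load_checkpoint():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "partial_sum": 0}

def save_checkpoint(state):
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(state, f)

state = load_checkpoint()
for i in range(state["iteration"], 10_000):
    state["partial_sum"] += i            # the long-running computation
    state["iteration"] = i + 1
    if state["iteration"] % CHECKPOINT_EVERY == 0:
        save_checkpoint(state)           # a crash after this point loses at most
                                         # CHECKPOINT_EVERY iterations of work
print(state["partial_sum"])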

        Job migration: Sometimes, for some reason, a job cannot be completely executed on a particular machine. When a task fails, it can be migrated to another machine. Job migration can be implemented using HAProxy.

        Replication: This is one of the most significant fault tolerance techniques in storage centers and is widely used both in laboratory settings and in online service systems. Replication means to copy: tasks are replicated and run on different resources so that execution succeeds and results are optimal. Replication can be implemented with tools such as HAProxy, Hadoop and Amazon EC2.

        Self-healing: A large task can be divided into parts, and this multiplication of instances is done for better performance. When various instances of an application are running on various virtual machines, failures of individual application instances are handled automatically.

        Safety-bag checks: In this case, commands that do not meet the safety properties are blocked.

        S-Guard: This technique is less disruptive to normal stream processing and is based on rollback recovery. S-Guard can be implemented on Hadoop and Amazon EC2.

        Retry: In this case a task is executed again and again. It is the simplest technique: the failed task is retried on the same resource.

        Task resubmission: Whenever a failed task is detected, the task is resubmitted at runtime, either to the same resource or to a different resource, for execution. A small sketch contrasting retry and resubmission is given below.
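        The following minimal Python sketch illustrates, under assumed resource names and a simulated execution function, the difference between retry (re-running the failed task on the same resource) and task resubmission (re-running it on a possibly different resource).

import random

def execute(task, resource):
    """Hypothetical task execution that fails with some probability."""
    if random.random() < 0.4:
        raise RuntimeError(f"{task} failed on {resource}")
    return f"{task} done on {resource}"

def retry(task, resource, attempts=3):
    """Retry: always use the same resource."""
    for _ in range(attempts):
        try:
            return execute(task, resource)
        except RuntimeError:
            continue
    raise RuntimeError(f"{task} failed after {attempts} retries on {resource}")

def resubmit(task, resources):
    """Task resubmission: move to the next resource after each failure."""
    for resource in resources:
        try:
            return execute(task, resource)
        except RuntimeError:
            continue
    raise RuntimeError(f"{task} failed on all resources")

print(retry("job-1", "vm-1"))
print(resubmit("job-2", ["vm-1", "vm-2", "vm-3"]))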

        Nowadays, demands for high fault tolerance and high serviceability are becoming unprecedentedly strong, and building a cloud with high fault tolerance and high serviceability is a critical, challenging, and urgently required task [40], [38]. Cloud serviceability and fault tolerance are still far from perfect. Failures are the norm rather than the exception in cloud computing environments, due to the large scale and time-critical data support, and because cloud platforms often run on voluntarily contributed resources.

      4. REPLICATION AS FAULT TOLERANCE

        One of the most widely adopted methods in [14] [18] to achieve fault tolerance against crash faults is replication: critical system components are duplicated using additional hardware, software and network resources in such a way that a copy of the critical components remains available even after a failure happens. According to [10], replication is one of the most significant fault tolerance techniques in storage centers. Replication means to copy: tasks are replicated and run on different resources so that execution succeeds and results are optimal, and it can be implemented with tools such as HAProxy, Hadoop and Amazon EC2.

        The basic mechanism for achieving fault tolerance suggested by S. Malik and F. Huet [21] is replication or redundancy [6]. In [16], replication means that various replicas of a task are run on different resources so that execution succeeds as long as not all replicas have crashed.

        Due to replication, the cost of renting cloud resources increases, but it is necessary to avoid catastrophic loss [33].

        In order to achieve high fault tolerance in clouds [32], a replication-based fault tolerance strategy must solve three important problems: which data to replicate and when, how many new replicas to create and where to place them, and how to balance the trade-off between high fault tolerance and high cloud SLOs. As the number of new replicas increases, the system maintenance cost increases significantly, and too many replicas may not increase availability but instead cause unnecessary spending.
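        As a simple back-of-the-envelope illustration of this diminishing return (an assumption-laden aside, not a result from [32]), suppose each of $n$ replicas fails independently with probability $p$ during a job and that renting one replica costs $c$. Then

        \[ A(n) = 1 - p^{n}, \qquad \mathrm{Cost}(n) = n\,c . \]

        For $p = 0.1$, availability rises from $A(1) = 0.9$ to $A(2) = 0.99$ to $A(3) = 0.999$: each extra replica adds a full unit of rental cost but removes only a tenth of the remaining unavailability, so beyond a few replicas the additional spending buys almost no additional availability.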

        In [32], data replication mechanisms in distributed computing are classified into two groups: static mechanisms [8], [31], where the replication strategy is predetermined and well defined, and dynamic replication mechanisms [39], [35], [37], which automatically create and delete replicas according to changing access patterns. According to R. Jhawar and V. Piuri in [14], the major fault tolerance (FT) strategies can be divided into passive replication strategies and active replication strategies.

      5. REPLICATION TYPES

    There are two strategies for replication: active and passive. In active replication, all the replicas are invoked simultaneously and each replica processes the same request at the same time. This implies that all the replicas have the same system state at any given point in time, and the system can continue to deliver its service even if a single replica fails. In passive replication, only one processing unit (the primary replica) processes the requests, while the backup replicas only save the system state during normal execution periods. Backup replicas take over the execution process only when the primary replica fails.

    1. Primary-backup replication: In the primary-backup strategy [24], one of the replicas, called the primary, receives the invocations from the client process and sends the response back. Given an object x, its primary replica is denoted prim(x). The other replicas are called the backups; they interact with the primary and do not interact directly with the client process. (A minimal sketch of the primary-backup scheme is given after this list.)

    2. Active replication: In the active replication technique, also called the state-machine approach [7], all replicas play the same role: unlike the primary-backup technique, there is no centralized control.

    3. Passive replication: In passive replication there is only one server (the primary) that processes client requests. After processing a request, the primary updates the state on the other (backup) servers and sends the response back to the client. If the primary server fails, one of the backup servers takes its place. Passive replication may be used even for nondeterministic processes. The disadvantage of passive replication compared to active replication is that in case of failure the response is delayed.
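    The following Python sketch illustrates the primary-backup (passive) scheme described above in a highly simplified, single-process form: the primary handles requests and pushes its state to the backups, and a backup is promoted when the primary fails. The class and method names are illustrative assumptions, not an actual cloud API.

class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}          # replicated application state
        self.alive = True

    def handle(self, key, value):
        """Only the primary executes requests and updates its state."""
        self.state[key] = value
        return f"{self.name}: stored {key}={value}"

class PrimaryBackupGroup:
    def __init__(self, replicas):
        self.replicas = replicas

    def primary(self):
        # The first live replica acts as the primary; the rest are backups.
        return next(r for r in self.replicas if r.alive)

    def request(self, key, value):
        p = self.primary()
        response = p.handle(key, value)
        for backup in self.replicas:
            if backup is not p and backup.alive:
                backup.state = dict(p.state)   # push the state update to backups
        return response

group = PrimaryBackupGroup([Replica("primary"), Replica("backup-1"), Replica("backup-2")])
print(group.request("x", 1))
group.replicas[0].alive = False                 # the primary crashes
print(group.request("y", 2))                    # backup-1 takes over and already holds x=1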

    Virtual machine replication (VM replication) is a form of VM protection and disaster recovery for the virtual machines in an environment. It takes a copy of a VM as it is at that moment and copies it to another VM. The VM replica can be stored on the same host, on a different local host, or even on a remote host.

    3. ANALYSIS

      Most real-time applications require fault tolerance capability to be provided. A lot of work has been done on fault tolerance for standard real-time systems, but there is still plenty of room for research on fault tolerance for real-time applications running on cloud infrastructure.

      Cloud service providers usually host fewer physical servers and increase the number of virtual machines (VMs), and IT professionals must balance the VMs against the available resources. Virtualization saves costs by reducing the need for physical hardware systems: virtual machines use hardware more efficiently, which lowers the quantity of hardware and the associated maintenance costs and reduces power and cooling demand. They also ease management, because virtual hardware does not fail in the way physical hardware does. Administrators can take advantage of virtual environments to simplify backups, disaster recovery, new deployments and basic system administration tasks.

      Fault tolerance is concerned with all the techniques necessary to enable a system to tolerate the software faults remaining in the system after its development. Fault tolerance is one of the major challenges faced by cloud services for HPC applications. The fault tolerance approach used in high-performance and distributed systems is not directly applicable to the cloud because of the cloud's resource sharing strategy.

      The practice currently followed for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory, and I/O capabilities that do not grow proportionally, this method is becoming less efficient.
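      As a rough aside (not drawn from the surveyed papers), Young's first-order approximation for the optimal checkpoint interval $\tau_{\text{opt}}$ relates it to the time $C$ needed to write one checkpoint and the mean time between failures $M$:

      \[ \tau_{\text{opt}} \approx \sqrt{2\,C\,M} . \]

      As error rates grow, $M$ shrinks; as aggregate memory grows faster than I/O bandwidth, $C$ grows. Checkpoints must therefore be taken more often and each one takes longer, so an increasing fraction of machine time goes to fault tolerance rather than useful computation.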

      Another easy and cheap way of dealing with failures is to add some invulnerable stations that oversee the system, assigning work to normal stations and dealing with their failures. The main disadvantage of this solution, however, is the creation of a bottleneck: the scalability of such a system is bounded by the performance of these special stations. There are a number of fault tolerance models which provide different fault tolerance mechanisms to enhance system performance and reliability, but the existing models have some drawbacks.

      Replication is widely used to provide fault tolerance but still suffers from many problems, such as the cost of the resources used. The more replicas we use, the higher the cost of rented resources. A larger number of replicas also implies more problems in managing and handling these replicas, especially when some of them fail.

    4. CONCLUSION:

      The basic mechanisms for achieving fault tolerance are replication and proactively predicting faults. Due to replication, the cost of renting cloud resources increases, but it is necessary to avoid catastrophic loss. As the number of new replicas increases, the system maintenance cost increases significantly, and too many replicas may not increase availability but instead cause unnecessary spending. Although proactive techniques suffer from many drawbacks, they are used in most critical systems to avoid system crashes. Many enhancements are still needed in these areas.

    5. REFERENCES

  1. A. Bala and I. Chana, "Fault Tolerance- Challenges, Techniques and Implementation in Cloud Computing", IJCSI International Journal of Computer Science Issues, vol. 9, no. 1, (2012), pp. 1694- 0814.

  2. A. Heddaya, A. Helal, Reliability, Availability, Dependability and Performability: A User-Centered View, Boston, MA, USA, Tech. Rep., 1997

  3. Amin, Zeeshan, Harshpreet Singh, and Nisha Sethi. "Review on fault tolerance techniques in cloud computing." International Journal of Computer Applications 116.18 (2015).

  4. Avižienis, Algirdas, Jean-Claude Laprie, and Brian Randell. "Dependability and its threats: a taxonomy." Building the Information Society. Springer US, 2004. 91-120.

  5. B. Selic, Fault tolerance techniques for distributed systems: http://www.ibm.com/developerworks/nrational/library/114.html

  6. Coenen, J., and Jozef Hooman. "A formal approach to fault- tolerance in distributed real-time systems." Proceedings of the 4th workshop on ACM SIGOPS European workshop. ACM, 1990.

  7. F.B. Schneider, "Replication Management using the State-Machine Approach," in Sape Mullender, editor, Distributed Systems, pages 169-197, ACM Press, 1993.

  8. Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. Oper Syst Rev 37(5):29-43.

  9. Gray, Jim, et al. "The dangers of replication and a solution." ACM SIGMOD Record. Vol. 25. No. 2. ACM, 1996.

  10. Hosseini, Seyyed Mansur, and Mostafa Ghobaei Arani. "Fault- Tolerance Techniques in Cloud Storage: A Survey." International Journal of Database Theory and Application 8.4 (2015): 183-190.

  11. J. Stankovic, Misconceptions About Real-Time Computing, IEEE Computer, Vol. 21, No. 10, October 1988, pp. 10-19.

  12. Jhawar, Ravi, Vincenzo Piuri, and Marco Santambrogio. "A comprehensive conceptual system-level approach to fault tolerance in cloud computing." Systems Conference (SysCon), 2012 IEEE International. IEEE, 2012.

  13. Jhawar, Ravi, and Vincenzo Piuri. "Fault tolerance and resilience in cloud computing environments." Computer and Information Security Handbook, (2013): 125-141.

  14. K. H. Kim, Towards Integration of Major Design Techniques for Real-Time Fault-Tolerant Computer Systems, Society for Design & Process Science, USA, 2002.

  15. Kaur, Jasbir, and Supriya Kinger. "Analysis of Different Techniques Used For Fault Tolerance." IJCSIT: Int J Comput Technol 4.2 (2013): 737-741.

  16. Kim, K. H. "Toward Integration of Major Design Techniques for Real-Time Fault-Tolerant Computer Systems." Journal of Integrated Design & Process Science 6.1 (2002): 83-101.

  17. Kim, K. H., and Howard O. Welch. "Distributed execution of recovery blocks: An approach for uniform treatment of hardware and software faults in real-time applications." Computers, IEEE Transactions on 38.5 (1989): 626-636.

  18. Kong, Xiangzhen, et al. "Performance, fault-tolerance and scalability analysis of virtual infrastructure management system." Parallel and Distributed Processing with Applications 2009 IEEE International Symposium on. IEEE, 2009.

  19. Luckow A, Schnor B. Service replication in grids: ensuring consistency in a dynamic, failure-prone environment. In: Proceedings of IEEE International Symposium on Parallel and Distributed Processing, Miami, 2008. 17.

  20. Malik, Sheheryar, and Fabrice Huet. "Adaptive fault tolerance in real time cloud computing." Services (SERVICES), 2011 IEEE World Congress on. IEEE, 2011.

  21. Malik, Sheheryar, and Jaffar Rehman. "Time Stamped Fault Tolerance in Distributed Real Time Systems." (2005).

  22. Merideth M G, Iyengar A, Mikalsen T, et al. Thema: Byzantine fault-tolerant middleware for Web service applications. In: Proceedings of 24th IEEE Symposium on Reliable Distributed Systems, Orlando, 2005. 131-142.

  23. N. Budhiraja, K. Marzullo, F.B. Schneider, and S. Toueg. The Primary-Backup Approach. In Sape Mullender, editor, Distributed Systems, pages 199-216. ACM Press, 1993.

  24. Nguyên, Toàn, Jean-Antoine Désidéri, and Laurentiu Trifan. "Applications resilience on clouds." High Performance Computing and Simulation (HPCS), 2012 International Conference on. IEEE, 2012.

  26. O. S. Unsal, I. Koren, M. Krishna, Towards Energy-Aware Software-Based Fault Tolerance in Real-Time Systems, ISLPED '02, Monterey, California, USA, August 12-14, 2002.

  27. Patra, Prasenjit Kumar, Harshpreet Singh, and Gurpreet Singh. "Fault tolerance techniques and comparative implementation in cloud computing." International Journal of Computer Applications 64.14 (2013).

  28. Salatge N, Fabre J-C. Fault tolerance connectors for unreliable Web services. In: Proceedings of 37th International Conference on Dependable Systems and Networks, Edinburgh, 2007. 51-60.

  29. Santos G T, Lung L C, Montez C. FTWeb: a fault tolerant infrastructure for Web services. In: Proceedings of 9th IEEE International Conference on Enterprise Computing, Enschede, 2005. 95-105.

  29. Sheu G-W, Chang Y-S, Liang D, et al. A fault-tolerant object service on CORBA. In: Proceedings of 17th International Conference on Distributed Computing Systems, Baltimore, 1997. 393.

  31. Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proc 2010 IEEE 26th symposium on mass storage systems and technologies (MSST 2010), May 2010. IEEE Press, New York, pp 1-10.

  31. Sun, Dawei, et al. "Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments." The Journal of Supercomputing 66.1 (2013):193-228.

  32. Tchana, Alain, Laurent Broto, and Daniel Hagimont. "Fault Tolerant Approaches in Cloud Computing Infrastructures." Proceedings of the 8th International Conference on Autonomic and Autonomous Systems (ICAS12). 2012. (Models Comparison)

  34. Naughton, Thomas, Christian Engelmann, Geoffroy R. Vallee and Stephen L. Scott. "Proactive fault tolerance using preemptive migration." In Parallel, Distributed and Network-based Processing, 2009 17th Euromicro International Conference on, pp. 252-257. IEEE, 2009.

  35. Tu M, Li P, Yen IL, Thuraisingham BM, Khan L (2010) Secure data objects replication in data grid. IEEE Trans Dependable Secure Comput 7(1):50-64.

  35. W. T. Tsai, Q. Shao, X. Sun, J. Elston, Real Time Service-Oriented Cloud Computing, School of Computing,Informatics and Decision System Engineering Arizona State University USA, http://www.public.asu.edu/~qshao1/doc/RTSOA.pdf.

  37. Wang JY, Jea KF (2009) A near-optimal database allocation for reducing the average waiting time in the grid computing environment. Inf Sci 179(21):3772-3790.

  38. Wang SS, Yan KQ, Wang SC (2011) Achieving efficient agreement within a dual-failure Cloud computing environment. Expert Syst Appl 38(1):906-915.

  39. Yuan D, Yang Y, Liu X, Chen J (2010) A data placement strategy in scientific cloud workflows. Future Gener Comput Syst 26(8):1200-1214.

  40. Zheng Z, Zhou TC, Lyu MR, King I (2010) FTCloud: a component ranking framework for Fault-tolerant cloud applications. In: Proc 2010 IEEE 21st international symposium on software reliability engineering (ISSRE 2010), Nov. 2010. IEEE Press, New York, pp 398-407.

  41. Zhu Q, Agrawal G (2010) "Supporting fault-tolerance for time-critical events in distributed environments." Sci Program 18(1):51-76.
