An Efficient Deduplication Storage of Virtual Machine with Bloom Filter

DOI : 10.17577/IJERTCONV3IS22015


R. Amala, Computer Science and Engineering, Mookambigai College of Engineering, Pudukkotai, Tamil Nadu

Mrs. B. Jayanthi, Assistant Professor, CSE, Mookambigai College of Engineering, Pudukkotai, Tamil Nadu

      Abstract: In cloud computing, the virtual machine serves as a significant component. A virtual machine comprises a set of specification and configuration files and is backed by the physical resources of a host. Virtualization runs multiple operating systems and applications on a single machine. Existing systems have reduced the storage consumption of virtual machine files and images; storage area network clusters in particular reduce storage consumption through deduplication. However, storage area networks are very expensive, and it is difficult for them to satisfy the increasing demand of large-scale virtual machine deployment. In this paper, we propose an exact deduplication approach designed especially for high-demand virtual machine deployment. The design provides low storage consumption through deduplication, and instant virtual machine deployment through peer-to-peer data transfer.

      Keywords: Cloud computing, virtual machine, file system, virtualization, Bloom filters.


        Cloud computing is the delivery of computing as a service rather than a product: resources are provided to computers and other devices as a utility over a network. It is broadly considered to be possibly the next most important technology in the software industry.

        Security refers to confidentiality, integrity and availability, which pose major issues for cloud computing. The primary service models being deployed are commonly known as Software as a Service, Platform as a Service, and Infrastructure as a Service. The deployment models in cloud computing are public cloud, private cloud, community cloud, and hybrid cloud. A virtual machine is a software computer that, like a physical computer, runs an operating system and applications. It comprises a set of specification and configuration files and is backed by the physical resources of a host.

        A private cloud is cloud infrastructure operated solely for a single organization, whether managed internally or by a third party, and hosted either internally or externally. Undertaking a private cloud project requires a significant level and degree of engagement to virtualize the business environment, and requires the organization to reevaluate decisions about existing resources.

        When done right, it can improve business, but every step in the project raises security issues that must be addressed to prevent serious vulnerabilities. Self-run data centers are generally capital intensive. They have a significant physical footprint, requiring allocations of space, hardware, and environmental controls. These assets have to be refreshed periodically, resulting in additional capital expenditures.

        The virtual machine presents a great opportunity for parallel, cluster, grid, cloud and distributed computing. Virtualization technology benefits the computer and IT industries by enabling users to share expensive hardware resources by multiplexing virtual machines on the same set of hardware hosts, so that multiple operating systems and applications run on a single computer. The virtualization layer that makes this possible is known as the hypervisor or virtual machine monitor (VMM).

        This leads to the common practice of storing and sharing virtual machine images and files on network storage, such as a storage area network or network attached storage, while direct attached storage is used only for short-lived storage. Network storage systems cost several times more than direct attached storage.

        The storage consumption problem caused by a large number of virtual machine images and files can be addressed by deduplication techniques. Existing systems have endeavoured to address this problem on a storage area network by deduplication.

        Such a system operates in a decentralized fashion: deduplication is performed at the virtual machine hosts, so that no identical data blocks are stored in the storage area network. Nevertheless, storage area networks are expensive, and this approach has not reduced overall storage consumption, nor does it solve the bottleneck problem of the metadata server.

        In this paper, we propose efficient deduplication storage with a Bloom filter, built on a distributed file system. It is specifically designed to address the above problems at the same time. The client machine splits whole files into small data blocks, refers to them by their fingerprints, and uses deduplication to avoid storing data with the same content. The groups of fingerprints are saved to a metadata server and the deduplicated data blocks are saved to data servers.

        The client downloads its group of fingerprints from the hot backup metadata server and fetches the original data blocks from the data servers. The clients operate in a peer-to-peer fashion, which reduces the number of requests sent directly to the data servers.

        The system also provides fast virtual machine deployment features such as instant cloning and on-demand data block fetching.


    1. Basic Structure of Virtual Machine

      There are two basic types of virtual machine image. The first is the raw image format, which is simply a byte-by-byte copy of a physical disk's content into a regular file or group of regular files. The advantage of the raw image format is good input/output performance, because its byte-by-byte mapping is straightforward.

      Nevertheless, because raw images contain all the information on the disk, they are commonly very large. The other type is the sparse image format. Rather than a byte-by-byte copy, a sparse image maintains a more complex mapping between blocks on the physical disk and blocks of data in the virtual machine image. Its disadvantage is worse input/output performance than the raw image format.

    2. Deduplication Techniques

      Data deduplication eliminates redundant data. It eliminates only extra copies of data; none of the original data is lost. Its goal is to improve storage utilization. In the deduplication process, unique data blocks are usually identified by a fingerprint computed from their content.

      Whenever a fingerprint is calculated, it is compared with the fingerprints of the stored data blocks to check for a match. If an identical fingerprint is found, the data block is declared redundant. Redundant data blocks are removed rather than storing the same content multiple times.
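
      The matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 4096-byte block size, the choice of SHA-1, and the names `DedupStore`, `write_file` and `read_file` are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; the paper does not fix a value


def fingerprint(block: bytes) -> str:
    """Return the SHA-1 digest of a block's content as a hex string."""
    return hashlib.sha1(block).hexdigest()


class DedupStore:
    """Hypothetical store that keeps one copy of each distinct block."""

    def __init__(self) -> None:
        self.blocks = {}     # fingerprint -> block content
        self.refcount = {}   # fingerprint -> number of references

    def put(self, block: bytes) -> str:
        fp = fingerprint(block)
        if fp not in self.blocks:      # no match: this content is new
            self.blocks[fp] = block
        self.refcount[fp] = self.refcount.get(fp, 0) + 1
        return fp

    def write_file(self, data: bytes) -> list:
        """Split data into fixed size blocks and return its fingerprint list."""
        return [self.put(data[i:i + BLOCK_SIZE])
                for i in range(0, len(data), BLOCK_SIZE)]

    def read_file(self, fingerprints: list) -> bytes:
        """Rebuild the original bytes from a fingerprint list."""
        return b"".join(self.blocks[fp] for fp in fingerprints)


store = DedupStore()
image = b"A" * BLOCK_SIZE * 3 + b"B" * 100   # three identical blocks and a tail
recipe = store.write_file(image)
```

      Here the three identical blocks are stored once, and the file is still recoverable from its list of fingerprints.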

      Deduplication has two methods: fixed size chunking and variable size chunking. The fixed size chunking method breaks the original file into blocks of the same size, except for the last block. Simplicity is the advantage of fixed size chunking: because data blocks are all stored at the same size, the mapping from a file ID and file system offset to a data block can be done with a very simple calculation.

      With variable size chunking it is more difficult to calculate the fingerprints of data blocks. Compared with the variable size chunking method, the fixed size chunking method has better read and write performance. Fixed size chunking is also a good measure for deduplication.


    1. System Architecture

      [Figure: The architecture of the deduplication approach for virtual machines with Bloom filter. Labels: meta server, shadow of meta server, heartbeat, data servers, client file systems (each with a Bloom filter and hash function), peer-to-peer data sharing.]

      The system has three components: a single meta server with a shadow meta server, multiple data servers, and multiple clients.

      The deduplication process first breaks virtual machine files or images into fixed size data blocks. Every data block has a unique fingerprint, and a file is represented by the sequence of fingerprints of its data blocks.

      The meta server contains all file system information: the namespace, file IDs, file system offsets, index and extent information, and data block fingerprints. The fingerprints are mapped to data servers, and every data block has a reference count.

      The data servers are controlled by the meta server. The meta server assigns a part of the fingerprint space to each data server and checks the health of the data servers periodically.

      The client is a significant component because it is responsible for deduplication of virtual machine files and images, peer-to-peer sharing of data blocks, and instant cloning. When a new virtual machine is started, the client creates a record with a new virtual machine ID, name, size and data block count. Once the virtual machine has been created successfully, processing continues only if its status is active; if it is deactivated, it does not run.

      The client side of the file system fetches the metadata of virtual machine images and files from the meta server, and data blocks from the data servers and peer clients, and provides the image or file content. When a virtual machine shuts down, the client scans its files and images, uploads the metadata to the meta server, and pushes new data blocks to the data servers.

      The peer-to-peer data block sharing scheme is one advantage of the exact deduplication approach for virtual machines with Bloom filter.

      If the meta server crashes, the mirroring (shadow) meta server takes over control of the whole system. Data blocks are stored as exact copies on the data servers, so the crash of one data server does not damage the whole system.
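
      The meta server's assignment of fingerprint space to data servers could look roughly like the sketch below. The paper does not say how the space is divided; splitting a 16-bit fingerprint prefix into equal contiguous ranges, and the function names `assign_fingerprint_space` and `locate`, are assumptions made purely for illustration.

```python
def assign_fingerprint_space(servers: list) -> list:
    """Split the 16-bit fingerprint-prefix space into one contiguous range per server.

    Returns a list of (low, high, server) tuples covering [0, 65536).
    """
    space = 1 << 16
    n = len(servers)
    return [(i * space // n, (i + 1) * space // n, server)
            for i, server in enumerate(servers)]


def locate(fp_hex: str, ranges: list) -> str:
    """Return the data server responsible for a hex fingerprint."""
    prefix = int(fp_hex[:4], 16)   # first 16 bits of the fingerprint
    for low, high, server in ranges:
        if low <= prefix < high:
            return server
    raise LookupError("fingerprint prefix outside the assigned space")


ranges = assign_fingerprint_space(["ds1", "ds2", "ds3"])
first = locate("0000" + "0" * 36, ranges)
last = locate("ffff" + "0" * 36, ranges)
```

      With such a table, a client that knows a block's fingerprint can work out which data server to contact without asking the meta server for every block.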

    2. Fingerprint calculation

      Deduplication techniques usually compare the fingerprints of data blocks to find identical blocks. A fingerprint is a collision-resistant hash value calculated from the content of a data block. Two cryptographic hash functions, MD5 and SHA-1, are commonly used for this calculation.

      The cost of fingerprint calculation is extremely high, which is a major obstacle to real-time deduplication. To avoid this cost while virtual machine files or images are being modified, the system delays the calculation of fingerprints for recently modified data blocks and runs deduplication lazily, only when it is necessary.
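
      The effect of delaying fingerprint calculation can be illustrated with a small sketch. The class name `LazyBlock` and its `hash_calls` counter are hypothetical and exist only to show that many writes need not trigger many hash computations.

```python
import hashlib


class LazyBlock:
    """A data block whose SHA-1 fingerprint is computed only on request."""

    def __init__(self, data: bytes) -> None:
        self.data = bytearray(data)
        self._fingerprint = None   # None means stale: recompute when needed
        self.hash_calls = 0        # counts how often the hash actually ran

    def write(self, offset: int, payload: bytes) -> None:
        self.data[offset:offset + len(payload)] = payload
        self._fingerprint = None   # modifying content invalidates the fingerprint

    def fingerprint(self) -> str:
        if self._fingerprint is None:
            self._fingerprint = hashlib.sha1(bytes(self.data)).hexdigest()
            self.hash_calls += 1
        return self._fingerprint


block = LazyBlock(b"\x00" * 4096)
for i in range(100):               # 100 small writes, no hashing yet
    block.write(i, b"x")
fp = block.fingerprint()           # hashed once, when deduplication needs it
```

      An eager design would have hashed the 4096-byte block after every write; here the hash runs once for all 100 writes.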

    3. Caching techniques

      The client side of the file system contains two types of cache: a shared cache and a private cache. The shared cache holds the most recently accessed data blocks. These data blocks are read-only and are shared among all currently opened virtual machine images.

      When a client requests a data block, it first looks the block up in the shared cache. The shared cache provides a distributed, replicated cache to minimize load. It uses two or more servers in a farm and replicates all data within the cluster. The main advantage is simplicity: all cache nodes are present on all servers, so if one server is restarted, it automatically receives all nodes from its parent. If the requested data block is not in the shared cache, the client waits until the block has been loaded into it. When the shared cache is full, data blocks are replaced using the least recently used (LRU) policy. This mechanism improves read performance.
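
      A least recently used shared cache of the kind described here can be sketched in a few lines. The class name `SharedCache` and the `loader` callable (standing in for a fetch from a peer or data server) are assumptions for illustration.

```python
from collections import OrderedDict


class SharedCache:
    """Read-only block cache with least recently used (LRU) replacement."""

    def __init__(self, capacity: int, loader) -> None:
        self.capacity = capacity
        self.loader = loader            # hypothetical callable: fingerprint -> bytes
        self._blocks = OrderedDict()    # oldest entry first, newest last

    def get(self, fp: str) -> bytes:
        if fp in self._blocks:
            self._blocks.move_to_end(fp)       # hit: mark as most recently used
            return self._blocks[fp]
        data = self.loader(fp)                 # miss: load the block
        self._blocks[fp] = data
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)   # evict the least recently used block
        return data


backing = {"a": b"1", "b": b"2", "c": b"3"}
cache = SharedCache(capacity=2, loader=backing.__getitem__)
for name in ["a", "b", "a", "c"]:
    cache.get(name)
cached_now = list(cache._blocks)
```

      After this access sequence block "b" has been evicted, because "a" was touched more recently than "b" when "c" arrived.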

      The private cache holds content that is cacheable only on the client and not by shared (proxy server) caches. It contains modified data blocks and delays the calculation of their fingerprints: after a data block in the shared cache is modified, it is ejected from the shared cache and added to the private cache. Until the data block is flushed out of the private cache, the modified block is referred to by a private fingerprint.

      No collisions occur, because a private fingerprint differs from any normal fingerprint. This mechanism improves write performance, because it avoids repeatedly calculating fingerprints that would immediately become invalid.
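
      One way to keep private fingerprints distinct from normal ones is sketched below. The `private:` prefix and the class name `PrivateCache` are assumptions; the paper does not specify the format of a private fingerprint. A prefixed counter cannot collide with a 40-character hexadecimal SHA-1 digest.

```python
import hashlib
import itertools


class PrivateCache:
    """Holds modified blocks under temporary private fingerprints."""

    def __init__(self) -> None:
        self._blocks = {}
        self._ids = itertools.count(1)

    def add_modified(self, data: bytes) -> str:
        """Take over a block ejected from the shared cache; return its private fingerprint."""
        private_fp = f"private:{next(self._ids)}"   # assumed naming scheme
        self._blocks[private_fp] = bytearray(data)
        return private_fp

    def write(self, private_fp: str, offset: int, payload: bytes) -> None:
        self._blocks[private_fp][offset:offset + len(payload)] = payload

    def flush(self, private_fp: str) -> tuple:
        """Remove the block and compute its real fingerprint, once."""
        data = bytes(self._blocks.pop(private_fp))
        return hashlib.sha1(data).hexdigest(), data


private_cache = PrivateCache()
handle = private_cache.add_modified(b"hello")
private_cache.write(handle, 0, b"J")
real_fp, content = private_cache.flush(handle)
```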

    4. Data block storage

      Rather than storing each data block directly in the local file system as an individual file, data blocks are split into groups according to their fingerprints. Three files are created for each group: an extent file, an index file and a bitmap file.

      The extent file holds the data blocks themselves. The index file maps the fingerprint of each data block to its offset and reference count. The bitmap file indicates which parts of the extent file are valid.
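
      The relationship between the three files can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: the class `BlockGroup`, the use of slot numbers as offsets, and zero-padding of short blocks are illustrative choices, not details taken from the paper.

```python
class BlockGroup:
    """In-memory stand-in for one group's extent, index and bitmap files."""

    def __init__(self, block_size: int) -> None:
        self.block_size = block_size
        self.extent = bytearray()   # extent: block contents stored back to back
        self.index = {}             # index: fingerprint -> [slot, reference count]
        self.bitmap = []            # bitmap: True if the extent slot is valid

    def add(self, fp: str, block: bytes) -> None:
        if fp in self.index:
            self.index[fp][1] += 1                  # duplicate: bump reference count
            return
        slot = len(self.bitmap)
        self.extent += block.ljust(self.block_size, b"\x00")  # pad short blocks
        self.index[fp] = [slot, 1]
        self.bitmap.append(True)

    def release(self, fp: str) -> None:
        entry = self.index[fp]
        entry[1] -= 1
        if entry[1] == 0:                           # last reference gone
            self.bitmap[entry[0]] = False           # mark the slot invalid
            del self.index[fp]

    def read(self, fp: str) -> bytes:
        start = self.index[fp][0] * self.block_size
        return bytes(self.extent[start:start + self.block_size])


group = BlockGroup(block_size=4)
group.add("aa", b"abcd")
group.add("aa", b"abcd")    # same fingerprint: only the count changes
group.add("ab", b"wxyz")
```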


    1. Heartbeat message

      The meta server is in charge of managing all data servers. It regularly checks heartbeat messages from each data server to keep an up-to-date record of their health status.

      The meta server exchanges heartbeat messages with the data servers in a round-robin fashion. Because there are many data servers, this alone is slow to detect failed ones. To speed up failure detection, a data server that has a connection problem with another data server sends an error message to the meta server. The meta server then immediately dispatches a daemon thread to the reported data server to check, by heartbeat, whether it is alive or not. This is an effective approach to detecting failed data servers.
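
      The combination of a round-robin sweep and an immediate check on report can be sketched as follows. The class `MetaServer`, its methods, and the `probe` callable (standing in for a real heartbeat exchange) are hypothetical; the sketch is single-threaded and omits the daemon thread.

```python
import itertools


class MetaServer:
    """Tracks data server health by round-robin probing plus on-demand checks."""

    def __init__(self, servers: list, probe) -> None:
        self.probe = probe                        # hypothetical callable: name -> bool
        self.status = {s: True for s in servers}  # last known health of each server
        self._cycle = itertools.cycle(servers)

    def tick(self) -> str:
        """Regular sweep: probe the next data server in round-robin order."""
        server = next(self._cycle)
        self.status[server] = self.probe(server)
        return server

    def report_error(self, suspect: str) -> bool:
        """A data server could not reach `suspect`: check it right away."""
        self.status[suspect] = self.probe(suspect)
        return self.status[suspect]


alive = {"ds1": True, "ds2": True, "ds3": False}   # ds3 has silently failed
meta = MetaServer(list(alive), alive.get)
first_checked = meta.tick()                  # the sweep has only reached ds1
known_before = meta.status["ds3"]            # failure not yet noticed
known_after = meta.report_error("ds3")       # a peer's report triggers a check
```

      Without the report, the failure of ds3 would only be noticed when the sweep reaches it, which takes longer as the number of data servers grows.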

    2. Peer-to-Peer Fashion

      The peer-to-peer data block sharing scheme is one advantage of the exact deduplication approach for virtual machines with Bloom filter. All client nodes operate in a peer-to-peer fashion. Unlike the client/server model, in which the client makes a service request and the server fulfills it, the peer-to-peer model allows each node in the network to function as both a client and a server.

      Two computers are considered peers if they communicate with each other and play similar roles. This allows shared access to files and peripherals without the need for a central server.

      A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; thus a Bloom filter has a hundred percent recall rate.

      The basic Bloom filter supports two operations: test and add. The test operation checks whether a given element is in the set. If it returns false, the element is definitely not in the set; if it returns true, the element is probably in the set. The false positive rate is a function of the Bloom filter's size and of the number and independence of the hash functions used.

      The add operation simply adds an element to the set. Removal is impossible without introducing false negatives, but extensions to the Bloom filter that allow removal are possible, for example counting filters.
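
      The two operations can be sketched compactly. Deriving the k hash positions by salting SHA-1 with the hash index is an assumption made for this example; any k independent hash functions would serve.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter with add and test operations."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                   # number of bit positions
        self.k = k                   # number of hash functions
        self.bits = bytearray(m)     # one byte per position, for clarity

    def _positions(self, item: bytes):
        # Assumed scheme: hash function i is SHA-1 salted with the index i.
        for i in range(self.k):
            digest = hashlib.sha1(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def test(self, item: bytes) -> bool:
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter(m=1024, k=3)
bloom.add(b"fingerprint-1")
present = bloom.test(b"fingerprint-1")
```

      An added element always tests true, and an empty filter always answers false, which is why false negatives cannot occur.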

      The notation is as follows. A Bloom filter is an array of $m$ bits representing a set $S = \{x_1, x_2, \ldots, x_n\}$ of $n$ elements. All bits of the array are initially set to 0. The filter uses $k$ independent hash functions $h_1, \ldots, h_k$ with range $\{1, 2, \ldots, m\}$. Assume that each hash function maps each item in the universe to a random number uniformly distributed over the range $\{1, 2, \ldots, m\}$. For each element $x$ in $S$, the bit $h_i(x)$ in the array is set to 1, for $1 \le i \le k$. A bit in the array may be set to 1 multiple times for different elements.

      A false positive on a query of element $x$ occurs when all of the hash functions $h_1, \ldots, h_k$ applied to $x$ return a filter position that holds a 1. Assume the hash functions to be independent. Then the probability of a false positive is

      $$p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$$
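
      To get a feel for the numbers, the false positive probability for m bits, n elements and k hash functions, (1 - (1 - 1/m)^(kn))^k, and its approximation (1 - e^(-kn/m))^k can be evaluated directly. The function names and the example sizes below are chosen only for illustration; the rule k = (m/n) ln 2 for the number of hash functions that minimizes the approximate rate is a standard result and is not stated in this paper.

```python
import math


def false_positive_exact(m: int, n: int, k: int) -> float:
    """(1 - (1 - 1/m)^(k*n))^k for m bits, n elements, k hash functions."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k


def false_positive_approx(m: int, n: int, k: int) -> float:
    """The usual approximation (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> int:
    """Number of hash functions minimizing the approximate rate: (m/n) ln 2."""
    return max(1, round(m / n * math.log(2)))


m, n = 10_000, 1_000          # example: 10 bits per stored fingerprint
k = optimal_k(m, n)
exact = false_positive_exact(m, n, k)
approx = false_positive_approx(m, n, k)
```

      For this example the two expressions agree closely and give a false positive rate a little under one percent.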

      The Bloom filters of peer clients are consulted in random order. If a peer's Bloom filter contains the required fingerprint, the client tries to fetch the data block from that peer. If the data block is not on the peer, the client fetches it from the data servers. The scheme has low cost because of its simplicity.
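
      The lookup order, including the fallback when a Bloom filter gives a false positive, can be sketched as below. For brevity a plain Python set stands in for each peer's Bloom filter, with one deliberately wrong entry to imitate a false positive; the names `Peer` and `fetch_block` are hypothetical.

```python
class Peer:
    """A peer client with the blocks it holds and a summary of what it claims to hold."""

    def __init__(self, blocks: dict, advertised: set) -> None:
        self.blocks = blocks            # fingerprint -> bytes actually held
        self.advertised = advertised    # stand-in for the peer's Bloom filter

    def filter_test(self, fp: str) -> bool:
        return fp in self.advertised

    def get(self, fp: str):
        return self.blocks.get(fp)      # None if the filter was a false positive


def fetch_block(fp: str, local_cache: dict, peers: list, data_server: dict) -> tuple:
    """Return (data, source), trying local cache, then peers, then the data server."""
    if fp in local_cache:
        return local_cache[fp], "local"
    for peer in peers:
        if peer.filter_test(fp):        # filter says "probably present"
            data = peer.get(fp)
            if data is not None:
                local_cache[fp] = data
                return data, "peer"
    data = data_server[fp]              # fall back to the data server
    local_cache[fp] = data
    return data, "data server"


peer = Peer(blocks={"f1": b"one"}, advertised={"f1", "f2"})   # "f2" is a false positive
server = {"f1": b"one", "f2": b"two"}
cache = {}
r1 = fetch_block("f1", cache, [peer], server)
r2 = fetch_block("f2", cache, [peer], server)
r3 = fetch_block("f1", cache, [peer], server)
```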

    3. Copy-on-read Technique

      The copy-on-read technique brings data blocks from the data servers and peer clients into the local cache on demand, when they are first read. Read-copy-update, by comparison, is a mutual exclusion method in which reader threads that access but do not modify shared data can do so without acquiring any conventional lock.

    4. On-Demand Fetching

      A virtual machine can start booting even if not all of the data blocks in its files or images have been fetched into the local cache, which is a crucial speedup for the boot process. Because only the data blocks that are actually accessed are fetched, network bandwidth consumption is low.

      This is more efficient than fetching all data blocks into the local cache and then booting the virtual machine. The cost is lower read performance for blocks that have not yet been cached.
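
      On-demand fetching with copy-on-read can be sketched as an image object that pulls a block only when a read touches it. The class `OnDemandImage`, the tiny 4-byte block size and the `fetch` callable (standing in for a peer or data server request) are assumptions for illustration.

```python
class OnDemandImage:
    """Virtual machine image whose blocks are fetched only when first read."""

    def __init__(self, recipe: list, block_size: int, fetch) -> None:
        self.recipe = recipe          # fingerprints of the image's blocks, in order
        self.block_size = block_size
        self.fetch = fetch            # hypothetical callable: fingerprint -> bytes
        self.local = {}               # local cache: block index -> bytes
        self.fetched = 0              # number of blocks pulled over the network

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        end = offset + length
        while offset < end:
            # Fixed size blocks: the offset maps to a block by simple division.
            index, start = divmod(offset, self.block_size)
            if index not in self.local:               # copy-on-read
                self.local[index] = self.fetch(self.recipe[index])
                self.fetched += 1
            take = min(self.block_size - start, end - offset)
            out += self.local[index][start:start + take]
            offset += take
        return bytes(out)


remote = {f"fp{i}": bytes([65 + i]) * 4 for i in range(8)}   # blocks "AAAA".."HHHH"
image = OnDemandImage([f"fp{i}" for i in range(8)], 4, remote.__getitem__)
data = image.read(6, 4)       # spans the end of block 1 and the start of block 2
```

      Only the two blocks touched by the read are fetched; the other six stay remote until something needs them.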

    5. Instant cloning for virtual machine

Many virtual machine files and images are very large, several gigabytes or even terabytes in size. Copying those files or images byte by byte would take a long time. Instant cloning is an efficient solution to this problem.

A virtual machine is cloned by copying the metadata of its files or images and updating the data block storage. Modifications to the cloned virtual machine files and images do not affect the original files or images, as in copy-on-write. As a result, a virtual machine can be cloned in a few milliseconds from the user's point of view.
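
Cloning by metadata copy can be sketched as follows. The class `MetaStore` and its methods are hypothetical; the sketch shows only the fingerprint lists and reference counts, not the block data.

```python
class MetaStore:
    """Keeps each file's fingerprint list and a reference count per block."""

    def __init__(self) -> None:
        self.files = {}      # file name -> list of fingerprints
        self.refcount = {}   # fingerprint -> number of references

    def create(self, name: str, fingerprints: list) -> None:
        self.files[name] = list(fingerprints)
        for fp in fingerprints:
            self.refcount[fp] = self.refcount.get(fp, 0) + 1

    def clone(self, source: str, target: str) -> None:
        """Clone by copying the fingerprint list only; no block data is copied."""
        self.create(target, self.files[source])

    def write_block(self, name: str, index: int, new_fp: str) -> None:
        """Copy-on-write: repoint one block of one file to new content."""
        old_fp = self.files[name][index]
        self.refcount[old_fp] -= 1
        self.files[name][index] = new_fp
        self.refcount[new_fp] = self.refcount.get(new_fp, 0) + 1


metadata = MetaStore()
metadata.create("base.img", ["a", "b", "c"])
metadata.clone("base.img", "vm1.img")
metadata.write_block("vm1.img", 1, "x")    # modify the clone only
```

The clone costs one list copy regardless of image size, and the write to the clone leaves the original file's fingerprint list untouched.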


We have presented an exact deduplication file system with good input/output performance and a wide set of features. It provides good input and output performance while performing deduplication, and it avoids storing redundant data blocks. Caching keeps repeatedly accessed data blocks in memory, and deduplication work using the hash function runs only when it is required.

The system supports instant cloning of virtual machine files and images, and uses the copy-on-read technique and on-demand fetching over the network to make very fast virtual machine deployment possible. The peer-to-peer technique is highly scalable, and peer-to-peer data block sharing is one advantage of the exact deduplication approach with Bloom filter.

Virtual machine deduplication is highly effective. Caching avoids running costly deduplication algorithms repeatedly, and thus increases input and output performance.

In future work, we plan to develop a multi-dimensional vector Bloom filtering algorithm to improve data storage and deduplication efficiency in large file systems. These techniques are intended to adapt to operating systems of any bit width, with scheduling on large-scale file systems based on MapReduce.

