- Open Access
- Authors : Tamanda Mgwalima , Arthur Ndlovu
- Paper ID : IJERTV11IS120020
- Volume & Issue : Volume 11, Issue 12 (December 2022)
- Published (First Online): 12-12-2022
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Review of the Past, Present, and Future of Secure Data Deduplication
School of Information Science and Technology Department Harare Institute of Technology, Zimbabwe
School of Information Science and Technology Department Harare Institute of Technology, Zimbabwe
Abstract: In today's digital world, many enterprises, organizations and individuals have chosen to outsource their data to cloud storage providers to reduce the burden of maintaining enormous amounts of data. Storage optimization has become an essential requirement in cloud storage, and many cloud storage providers perform deduplication, which avoids storing duplicate copies of data uploaded by multiple users. Cloud subscribers do not rely on service providers for the security of their data. To ensure confidentiality, subscribers encrypt their data before outsourcing it to the cloud, and the problem is that deduplication cannot be applied directly to encrypted data. Encrypting the same data under different keys (held by different subscribers) produces different ciphertexts, which prevents the Cloud Service Provider (CSP) from carrying out deduplication. Performing deduplication over encrypted data securely in the cloud is therefore a challenging task. Various secure deduplication methods have been proposed to overcome this challenge, and in this paper we review these state-of-the-art methods. Finally, we outline future research directions for deduplication-based storage systems.
Keywords: Cloud Storage, Deduplication, Proof of Ownership, Convergent Encryption
According to NIST, cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
The cloud allows users to store enormous amounts of data, which can be retrieved as and when required. Google Drive, SugarSync, OpenDrive and Amazon S3 are a few examples of cloud storage offerings. In general, Cloud Service Providers (CSPs) store a single copy of identical data received from multiple sources to optimize space. However, CSPs cannot recognize identical data when clients upload it in encrypted form using different keys. Encryption is essential to ensure the confidentiality of data, while deduplication is essential for achieving optimized storage. Hence, deduplication and encryption need to work hand in hand to ensure both data confidentiality and optimized storage. This paper studies the techniques and approaches used for deduplication over encrypted data.
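The tension described above can be illustrated with a minimal sketch. The `DedupStore` class and the one-time-pad helper `otp_encrypt` are hypothetical names introduced here for illustration only; the pad simply stands in for "encryption under an independent key per user" and is not taken from any scheme reviewed in this paper.

```python
import hashlib
import os


class DedupStore:
    """Toy single-instance store: one physical copy per distinct content."""

    def __init__(self):
        self.blobs = {}   # fingerprint -> stored bytes
        self.owners = {}  # fingerprint -> set of user ids

    def upload(self, user, data):
        fingerprint = hashlib.sha256(data).hexdigest()
        is_duplicate = fingerprint in self.blobs
        if not is_duplicate:
            self.blobs[fingerprint] = data
        self.owners.setdefault(fingerprint, set()).add(user)
        return fingerprint, is_duplicate

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


def otp_encrypt(data):
    """Stand-in for per-user encryption: XOR with a fresh random pad."""
    pad = os.urandom(len(data))
    return bytes(a ^ b for a, b in zip(data, pad))


report = b"quarterly report" * 100

# Plaintext uploads: the second copy is detected and not stored again.
store = DedupStore()
_, dup_first = store.upload("alice", report)
_, dup_second = store.upload("bob", report)

# Independently encrypted uploads: the provider sees two unrelated blobs.
enc_store = DedupStore()
_, enc_dup_first = enc_store.upload("alice", otp_encrypt(report))
_, enc_dup_second = enc_store.upload("bob", otp_encrypt(report))
```

In this sketch the plaintext store keeps one copy for both users, whereas the store receiving independently encrypted uploads keeps two, which is the storage cost that secure deduplication schemes try to avoid.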
LITERATURE REVIEW OF SECURE DATA DEDUPLICATION METHODS
Douceur et al. first provided a solution for deduplication over encrypted data, aiming to achieve both data confidentiality and deduplication. Their scheme introduced convergent encryption, which derives the encryption key from the hash of the plaintext. Two users holding identical plaintexts therefore obtain identical ciphertexts, since the encryption key is the same; the cloud storage provider can then deduplicate such ciphertexts and keep a single instance in storage. However, because convergent encryption is deterministic, it is vulnerable to brute-force dictionary attacks when the file comes from a predictable plaintext space. Convergent encryption is also susceptible to other attacks, such as the confirmation-of-a-file attack and the learn-the-remaining-information attack. In a confirmation-of-a-file attack, an adversary who holds a file can confirm that another user possesses the same file. Since the proposal of convergent encryption, many other schemes for encrypted-data deduplication have been proposed in the literature.
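A minimal sketch of the convergent encryption idea and of the dictionary attack it permits is given below. It is an illustration under stated assumptions, not the construction of Douceur et al.: the SHA-256 counter-mode keystream is a toy stand-in for a real block cipher, and the function names (`convergent_encrypt`, `confirm_file`) are hypothetical.

```python
import hashlib


def keystream(key, length):
    """Toy SHA-256 counter-mode keystream (illustration only, not a vetted cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(plaintext):
    key = hashlib.sha256(plaintext).digest()       # key = H(plaintext)
    stream = keystream(key, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    tag = hashlib.sha256(ciphertext).hexdigest()   # what the provider compares
    return key, ciphertext, tag


def convergent_decrypt(key, ciphertext):
    stream = keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


def confirm_file(candidates, observed_tag):
    """Dictionary attack: encrypt each guess and compare tags."""
    for guess in candidates:
        if convergent_encrypt(guess)[2] == observed_tag:
            return guess
    return None


doc = b"Offer letter: annual salary is 52000 USD"
key_a, ct_a, tag_a = convergent_encrypt(doc)   # first user
key_b, ct_b, tag_b = convergent_encrypt(doc)   # second user, same file

# An observer who knows the template enumerates the predictable field.
guesses = [b"Offer letter: annual salary is %d USD" % s
           for s in range(50000, 55000, 500)]
recovered = confirm_file(guesses, tag_a)
```

Both users arrive at the same ciphertext and tag without coordinating, which is what enables deduplication; the same determinism lets `confirm_file` recover the document from a small candidate set.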
Bellare, Keelveedhi and Ristenpart formalized Message-Locked Encryption (MLE), a cryptographic primitive in which the key used for encryption and decryption is derived from the message itself. However, MLE schemes remain vulnerable to brute-force attacks. Bellare et al. then proposed DupLESS to resist these attacks. The scheme introduces a key server and applies a rate-limiting strategy to the key-derivation requests that clients send to it.
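The server-aided idea can be sketched with a blinded RSA evaluation, so that the key server contributes secret material without seeing the file hash. This is a simplified illustration and not the DupLESS protocol itself: the modulus is built from two well-known Mersenne primes purely so the example is self-contained (far too small and structured for real use), and `KeyServer`, `derive_key` and the per-client request counter are hypothetical names and choices.

```python
import hashlib
import math
import secrets

# Toy RSA parameters (assumption: illustration only, not secure).
P = 2**127 - 1
Q = 2**89 - 1
N = P * Q
E = 65537


class KeyServer:
    def __init__(self, max_requests=3):
        self._d = pow(E, -1, (P - 1) * (Q - 1))  # private exponent
        self._max = max_requests
        self._count = {}                          # client id -> requests served

    def sign_blinded(self, client_id, blinded):
        used = self._count.get(client_id, 0)
        if used >= self._max:                     # rate limiting
            raise PermissionError("rate limit exceeded")
        self._count[client_id] = used + 1
        return pow(blinded, self._d, N)


def hash_to_int(message):
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def derive_key(server, client_id, message):
    h = hash_to_int(message)
    while True:                                   # pick a blinding factor coprime to N
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            break
    blinded = (h * pow(r, E, N)) % N              # the server only ever sees this value
    signed = server.sign_blinded(client_id, blinded)
    unblinded = (signed * pow(r, -1, N)) % N      # equals h^d mod N
    if pow(unblinded, E, N) != h:
        raise ValueError("key server returned an invalid response")
    return hashlib.sha256(unblinded.to_bytes(32, "big")).digest()


server = KeyServer(max_requests=3)
key_alice = derive_key(server, "alice", b"same file contents")
key_bob = derive_key(server, "bob", b"same file contents")
```

Two clients holding the same file derive the same key even though each sends a differently blinded value, while an attacker testing guesses offline is blocked because every guess costs one rate-limited request to the server.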
Chouhan, Peddoju and Buyya proposed dualDup, a secure and reliable cloud storage framework that deduplicates both the encrypted data and the keys. It extends the DupLESS concept to eliminate duplicate encrypted data from multiple users, and securely distributes data and key fragments using an erasure coding scheme to achieve privacy and reliability.
Halevi, Harnik, Pinkas, and Shulman-Peleg addressed a security problem in client-side deduplication: when an entire file is represented by a hash value, an adversary who obtains this hash value can claim ownership of the file. They proposed the concept of Proofs of Ownership (PoW) based on Merkle trees. The cloud server computes the Merkle tree of the uploaded file and stores its root hash; clients then prove file ownership by computing the sibling paths that the cloud server requests. If a client computes the requested paths correctly, the cloud server accepts that the client possesses the file. However, this scheme has a large computational overhead. Following Halevi et al., Di Pietro and Sorniotti proposed a more efficient PoW scheme that uses the projection of a file onto randomly selected bit positions as the file proof. They addressed a major security risk in existing proof-of-ownership schemes, where an adversary in possession of only a fraction of the original file is able to claim possession of the entire file.
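The Merkle-tree challenge and response described above can be sketched as follows. This is a minimal illustration, not the exact construction of Halevi et al.: the 64-byte leaf size, the duplication of the last node on odd levels, and the function names are assumptions made for the example.

```python
import hashlib
import secrets

BLOCK = 64  # assumed leaf size in bytes


def h(data):
    return hashlib.sha256(data).digest()


def build_tree(data):
    """Return the tree levels, leaves first; the last level holds the root."""
    leaves = [h(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)] or [h(b"")]
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]  # duplicate the last node on odd levels
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def prove(data, index):
    """Client side: the challenged block plus sibling hashes up to the root."""
    levels = build_tree(data)
    path = []
    pos = index
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append(level[pos ^ 1])
        pos //= 2
    return data[index * BLOCK:(index + 1) * BLOCK], path


def verify(root, index, block, path):
    """Server side: recompute the root from the block and the sibling path."""
    node = h(block)
    pos = index
    for sibling in path:
        node = h(node + sibling) if pos % 2 == 0 else h(sibling + node)
        pos //= 2
    return node == root


file_data = bytes(range(256)) * 3                # 768 bytes -> 12 leaves
levels = build_tree(file_data)
root = levels[-1][0]                             # the only value the server keeps
challenge = secrets.randbelow(len(levels[0]))    # server picks a leaf index
block, path = prove(file_data, challenge)
accepted = verify(root, challenge, block, path)
forged = verify(root, challenge, b"guessed block", path)
```

A client that only knows the file hash cannot produce the challenged block, so `forged` is rejected, while the genuine owner is accepted. The cost noted in the text is visible here: the prover rebuilds the whole tree to answer a challenge.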
Blasco et al. proposed a flexible and scalable PoW scheme based on Bloom filters. This scheme is more efficient at the client side than the approach in  and more efficient at the server side than the scheme in .
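To make the Bloom filter idea concrete, the sketch below shows one way a server could check ownership against a compact filter instead of re-reading the file. It is a hedged illustration rather than the published scheme: the token derivation (an HMAC over block index and block under a per-file nonce), the filter parameters, and all names are assumptions.

```python
import hashlib
import hmac

BLOCK = 64  # assumed block size in bytes


class BloomFilter:
    def __init__(self, size_bits=8192, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def block_token(data, index, nonce):
    """Token for one block; binds the block content to its position."""
    block = data[index * BLOCK:(index + 1) * BLOCK]
    return hmac.new(nonce, index.to_bytes(4, "big") + block, hashlib.sha256).digest()


def server_setup(data, nonce):
    """Run once at first upload: insert one token per block into the filter."""
    bf = BloomFilter()
    blocks = (len(data) + BLOCK - 1) // BLOCK
    for i in range(blocks):
        bf.add(block_token(data, i, nonce))
    return bf, blocks


def server_verify(bf, tokens):
    return all(token in bf for token in tokens)


nonce = b"per-file-nonce"
original = bytes(range(256)) * 8                   # 2048 bytes -> 32 blocks
bf, n_blocks = server_setup(original, nonce)

challenge = [3, 17, 29]                            # block indices chosen by the server
proof = [block_token(original, i, nonce) for i in challenge]
owner_accepted = server_verify(bf, proof)

impostor_file = bytes(2048)                        # someone without the real content
fake_proof = [block_token(impostor_file, i, nonce) for i in challenge]
impostor_accepted = server_verify(bf, fake_proof)
```

The filter size and number of hash functions control the false-positive rate, which is the tunable trade-off between server memory and the chance of wrongly accepting a proof.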
Li et al. first addressed the problem of achieving efficient and reliable key management when implementing convergent encryption. They described a baseline approach in which a user encrypts a file with a convergent key and uses a unique, independent master key to encrypt the convergent keys. The encrypted data is then outsourced to the cloud server for storage. The number of keys in this approach grows with the number of users, and so does the key storage space required. Another weakness of the baseline approach is that if the master key is compromised, the stored data cannot be recovered; the master key is therefore a single point of failure. To overcome this, they proposed Dekey, a new construction based on the Ramp secret sharing scheme, in which users do not need to manage any keys on their own but instead securely distribute the convergent key shares across multiple servers. Singh et al. also proposed an approach to eliminate this single point of failure: the convergent keys are split into multiple random-looking shares using secret sharing based on the Chinese Remainder Theorem, and the shares are sent to multiple key management servers. The key can be recovered when a threshold number of shares is obtained from the key management servers after executing PoW protocols. In their scheme, the encrypted data is also distributed across multiple servers as random-looking shares, based on the permutation ordered binary number system.
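The idea of spreading a convergent key over several servers so that no single server is a point of failure can be sketched with threshold secret sharing. Note the substitution: the example uses Shamir's (k, n) scheme, not the Ramp scheme of Dekey or the CRT-based scheme of Singh et al., because it is the simplest to show in a few lines; ramp schemes additionally trade some secrecy for smaller shares. The field prime and function names are assumptions.

```python
import hashlib
import secrets

PRIME = 2**521 - 1  # Mersenne prime, larger than any 256-bit key


def split_key(key, n, k):
    """Split `key` into n shares; any k of them reconstruct it."""
    secret = int.from_bytes(key, "big")
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):        # Horner evaluation of the polynomial
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_key(shares, key_len=32):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = (num * -xm) % PRIME
                den = (den * (xj - xm)) % PRIME
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret.to_bytes(key_len, "big")


convergent_key = hashlib.sha256(b"file contents").digest()
shares = split_key(convergent_key, n=5, k=3)        # one share per key server
recovered = recover_key([shares[0], shares[2], shares[4]])
```

With five key servers and a threshold of three, the key survives the loss of any two servers, and fewer than three colluding servers learn nothing about it under Shamir's scheme.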
Puzio et al. proposed ClouDedup, which defines a metadata manager and an additional server alongside the basic storage provider. The server adds an extra encryption layer to prevent well-known attacks against convergent encryption and thus protects the confidentiality of the data. The metadata manager is responsible for key management, since block-level deduplication requires storing a huge number of keys.
PoW schemes work well when the file is in plaintext. However, the privacy of the clients' data may then be exposed to honest-but-curious adversaries. To deal with this issue, clients tend to encrypt files before outsourcing them to the cloud, which makes the existing PoW schemes inapplicable. Yang et al. first proposed a secure zero-knowledge based client-side deduplication scheme over encrypted files. The scheme achieves a high probability of detecting client misbehavior. They introduced a key distribution scheme based on proxy re-encryption, which ensures that the server learns nothing about the encryption key even though it acts as a proxy that helps distribute the file encryption key. It also enables clients who have gained ownership of a file to share the file using the generated encryption key without establishing secure channels among themselves. A client's private key cannot be recovered through collusion between the server and other clients during the key distribution phase.
Li, Xu and Zhang proposed a secure and efficient client-side encrypted deduplication scheme (CSED) built upon message-locked encryption (MLE), in which a dedicated key server assists clients in generating MLE keys without learning anything about the data to be encrypted. They integrated a Bloom filter-based proof of ownership (PoW) mechanism into CSED to resist illegal content distribution attacks.
Most secure deduplication schemes presume that all files need equal security. Stanek and Kencl, however, proposed a scheme that secures data according to its popularity. Data owned by many cloud subscribers is considered popular, and data owned by only a few subscribers is considered unpopular. Semantic security is provided for unpopular files, and convergent encryption is used for popular files. Puzio et al. proposed PerfectDedup, which counters the weaknesses of convergent encryption by taking the popularity of data segments into account. PerfectDedup takes advantage of the properties of perfect hashing to provide block-level deduplication and data confidentiality simultaneously.
Guo, Jiang, Choo and Jie proposed R-Dedup, a randomized, secure, client-side deduplication scheme that does not rely on an external third party or require assistance from other peer users. By sharing a random value, used to generate an encryption key, among users who hold the same copy of a file, R-Dedup can resist brute-force attacks from both malicious cloud servers and subscribers. Data verification in R-Dedup ensures data integrity and provides user authentication for the cloud server. Computation overhead at the client side is low because the complex calculations are handled by the cloud server. Yuan et al. proposed DedupDUM, a deduplication scheme with dynamic user management, which updates dynamic group users securely and prevents unauthorised cloud users from accessing sensitive data owned by valid users. Their scheme supports user revocation and the joining of new cloud users by exploiting re-encryption techniques, and does not require a fully trusted third party.
Geeta et al. proposed Secure Deduplication and Virtual Auditing of Data in the Cloud (SDVADC). SDVADC supports secure deduplication and effective virtual auditing of files during the download process. The approach relieves the data owner of the burden of auditing files personally or through a third-party auditor. They proposed an algorithm for file uploading and virtual auditing in which a user encrypts a file using randomized convergent encryption and outsources the ciphertext to the distributed server on the cloud service provider's premises. The metadata of the file is sent to the Virtual Auditing Entity (VAE), which holds the metadata of all files uploaded to the distributed server. The CSP accepts the file and checks for duplication. If the file is original, the CSP saves it in storage; if the file is a duplicate, the CSP runs a PoW protocol with the user. Once the user proves to be an authorized owner, the CSP provides a link to the file already in storage. During file download, the cloud user sends a file request query to the CSP. The CSP sends the requested file to the VAE. The virtual auditing framework audits the file and sends it to the user together with an auditing report that shows whether the file has been modified.
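The upload and download flow described above can be sketched as a small state machine. This is an illustration under several assumptions and not the SDVADC algorithm: the file identifier is taken to be a tag derived from the ciphertext, the PoW step is reduced to a placeholder digest check, the uploader is assumed to be the party that registers metadata with the VAE, and every class and function name is hypothetical.

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).hexdigest()


class AuditingEntity:
    """Stands in for the VAE: keeps one digest per file and audits downloads."""

    def __init__(self):
        self.metadata = {}

    def register(self, file_id, file_digest):
        self.metadata.setdefault(file_id, file_digest)  # first registration wins

    def audit(self, file_id, data):
        return self.metadata.get(file_id) == digest(data)


class CloudProvider:
    def __init__(self):
        self.files = {}   # file_id -> ciphertext
        self.links = {}   # file_id -> users holding a link

    def upload(self, user, file_id, ciphertext=None, pow_answer=None):
        if file_id not in self.files:                  # original file: store it
            if ciphertext is None:
                return "need_file"
            self.files[file_id] = ciphertext
            self.links[file_id] = {user}
            return "stored"
        # Duplicate: placeholder PoW check before handing out a link.
        if pow_answer != digest(b"pow-challenge" + self.files[file_id]):
            return "rejected"
        self.links[file_id].add(user)
        return "linked"

    def fetch(self, user, file_id):
        if user not in self.links.get(file_id, set()):
            raise PermissionError("user holds no link to this file")
        return self.files[file_id]


def download(csp, vae, user, file_id):
    """The CSP hands the file to the VAE, which attaches an audit result."""
    data = csp.fetch(user, file_id)
    return data, vae.audit(file_id, data)


vae, csp = AuditingEntity(), CloudProvider()
ciphertext = b"ciphertext produced by the uploader"
file_id = "tag-" + digest(ciphertext)[:16]

vae.register(file_id, digest(ciphertext))
first = csp.upload("alice", file_id, ciphertext)
second = csp.upload("bob", file_id, pow_answer=digest(b"pow-challenge" + ciphertext))
third = csp.upload("mallory", file_id, pow_answer="guess")

data, unmodified = download(csp, vae, "bob", file_id)
csp.files[file_id] = b"tampered"                       # simulate corruption at the CSP
_, unmodified_after = download(csp, vae, "bob", file_id)
```

The point of the separation is that the audit result does not depend on the CSP's own view of the file: after the simulated tampering, the download still succeeds but the report flags the modification.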
Miguel, Mi and Aung proposed HEDup, a scheme that performs deduplication on encrypted data with the aid of a key server deployed on the cloud service provider's premises. Subscribers obtain the data-encryption key from the key server through homomorphic searching operations. Zheng et al. proposed a cloud data deduplication scheme based on certificateless proxy re-encryption. Certificateless cryptography is applied to solve the key escrow problem and to avoid situations where a key generation centre impersonates a cloud subscriber to decrypt ciphertext. In this scheme, a convergent key is used to encrypt the plaintext, and the resulting ciphertext is encrypted again using certificateless proxy re-encryption and stored in the cloud server together with the re-encryption key. The re-encryption key is used to share the encrypted data with other cloud subscribers, who are authenticated by a PoW scheme based on certificateless signatures.
Kan et al. proposed an identity-based proxy re-encryption scheme for cloud data deduplication. This scheme integrates cloud deduplication with access control, unlike other solutions, which cannot support flexible data access control and require data users to remain online. It efficiently distributes ciphertexts and revokes data owners who delete their data from the cloud, without requiring the participation of the data owners, unlike certificateless proxy re-encryption.
Client-side deduplication has been applied in more of the existing deduplication schemes than server-side deduplication. Table 1 gives the advantages and limitations of a few selected deduplication schemes and the methodologies behind them.
Scheme: DupLESS: Server-Aided Encryption for Deduplicated Storage
Objective: To resist brute force attack on Message Locked Encryption.
Methodology: The scheme has the ability to obtain message-derived keys with the help of a key server shared amongst a group of clients. The clients interact with the key server by a protocol for oblivious PRFs, ensuring that the key server can cryptographically mix in secret material to the per-message keys while learning nothing about files stored by clients.
Advantages: It provides more security than convergent encryption.
Limitations: When there is significantly less deduplication across the corpus, DupLESS may introduce greater overhead.

Scheme: ClouDedup: Secure Deduplication with Encrypted Data for Cloud Storage
Objective: To prevent well known attacks against convergent [...]
Methodology: Secret keys can be generated in a hardware-dependent way by the device itself [...] Server encryption is applied on top of convergent encryption performed by user.
Advantages: It prevents curious cloud storage providers [...] original content of stored data by [...] patterns or accessing [...] It gains in terms of storage space are not affected by the overhead of metadata management, which is [...]
Limitations: If the third party server is comprised, it may lead a [...] to the users.

Scheme: [...]
Objective: Data can be differentiated based on popularity to determine the level of [...]
Methodology: Semantic security is provided for all the unpopular data whereas for popular data the security is slightly weaker; convergent encryption is provided for popular data.
Advantages: The users no longer need to manually classify sensitive files. The transition between unpopular and popular state is automatic and does not require active [...]
Limitations: Lower deduplication ratio. High computation cost.

Scheme: CSED: Client-Side encrypted deduplication scheme based on proofs of ownership for cloud storage
Objective: To construct a secure and efficient scheme that resists brute-force attacks and illegal content distribution attacks, where the adversary can distribute data to other users via the cloud server.
Methodology: Client-side encrypted deduplication scheme based on proofs of ownership (PoW) that resists brute-force attacks. A key server is employed to assist the clients in generating the MLE keys and adopt the rate-limiting strategy to prevent brute-force attacks. Bloom filter and hierarchical strategy on cloud storage to improve the efficiency.
Advantages: Reduced storage and [...] Secure against brute [...]
Limitations: No integrity auditing.

Scheme: HEDup: Secure Deduplication with Homomorphic Encryption
Objective: [...] encrypted data, with the aid of a key server deployed at the CSP premises. Client obtains the encryption key from the key server through [...] searching operations. Data owners maintain exclusive control of their data and cloud providers has no access to any of it.
Methodology: Allow deduplication on encrypted data with the aid of a key server deployed at cloud service. The subscriber encrypts data with data-encryption key obtained from key server via various key-management schemes, one of which uses homomorphic encryption. The key server deployed at cloud provider premises, it will not only deduplicate data from particular domain but also for the CSP's entire client base including public and different enterprise users.
Advantages: Data uploads and [...] HEDup have minor storage and latency [...] Data owners still [...] control of their data [...] keys, i.e. CSP has no access to any of it [...]
Limitations: Key server discussed in this approach may become a bottleneck when number of clients increase in case of large scale deployment, and a decentralized deployment of key server is supposed as a solution.

Scheme: [...]
Objective: A randomized, secure, [...] scheme that does not involve any third-party entity or require assistance from other users. In [...]
Methodology: ElGamal Encryption technique [...] Sharing a random value used to generate encryption key for [...]
Advantages: [...] overhead on the client side [...] integrity of data [...] without involving a [...]
Limitations: [...]

Scheme: Zero knowledge based client side deduplication for encrypted files of secure cloud storage in smart cities
Objective: A secure zero-knowledge based client side deduplication scheme over encrypted files is [...]
Methodology: Scheme enables a client to prove its file ownership via the original file without leaking any information to the server. Key distribution scheme is based on proxy re-encryption which realizes delegation of [...]
Advantages: [...] probability of clients [...]
Limitations: [...]

Scheme: [...]
Objective: To efficiently and reliably manage a huge number of convergent keys. Users do not need to manage any keys on their own but instead securely distribute the convergent key shares across multiple servers.
Methodology: Scheme implements Ramp secret sharing scheme in which users do not need to manage any keys on their own but instead securely distribute the convergent key shares across [...]
Advantages: It incurs small [...] overhead compared to [...] in the regular [...]
Limitations: No integrity auditing

Scheme: [...]
Objective: To design a scheme that can flexibly support [...] even when the data owner [...]
Methodology: The scheme saves only one copy of data in the cloud for the initial uploader and subsequent data owners who will have passed the proof of ownership will require conversion of this data into ciphertext that can be decrypted with their own private keys.
Advantages: [...] computation overhead of [...] distribution even when the data owner is [...]
Limitations: [...]

Scheme: Authorized Client-Side Deduplication Using CP-ABE in Cloud Storage
Objective: To allow only authorized users to access critical data. To provide control over access permissions in an encrypted deduplication [...]
Methodology: [...]
Advantages: Servers burden and [...] less storage overhead
Limitations: [...]

Scheme: SecDep: A User-Aware Efficient Fine-Grained Secure Deduplication Scheme with Multi-Level Key Management
Objective: To resist brute force attack [...] and convergent key space [...]
Methodology: User-Aware Convergent Encryption (UACE) and Multi-Level Key [...]
Advantages: [...]
Limitations: No integrity auditing and [...]
Table 1: Comparative Analysis of deduplication techniques
Various secure deduplication techniques for cloud storage have been discussed. Client-side deduplication offers comparatively more benefits than server-side deduplication and has therefore been applied in most existing deduplication tools. This paper reviewed secure data deduplication techniques and presented a comparative analysis of several existing client-side deduplication schemes. A future enhancement might be to design a scheme that can verify the integrity of data without downloading it from the server, with reduced computational complexity, and that supports insertion, deletion and update operations, private verifiability, public verifiability, batch auditing and searchable encryption.
REFERENCES
National Institute of Standards and Technology (NIST), The NIST Definition of Cloud Computing, govinfo.gov, Jan. 2011. Accessed: Oct. 25, 2022. [Online]. Available: https://www.govinfo.gov/app/details/GOVPUB-C13-74cdc274b1109a7e1ead7185dfec2ada
My Drive – Google Drive. https://drive.google.com/drive/my-drive (accessed Nov. 28, 2022).
Cloud File Sharing, File Sync & Online Backup From Any Device | SugarSync. https://www1.sugarsync.com/ (accessed Nov. 28, 2022).
G. Xu, M. Lai, J. Li, L. Sun, and X. Shi, A generic integrity verification algorithm of version files for cloud deduplication data storage, 2018.
Cloud Object Storage Amazon S3 Amazon Web Services. https://aws.amazon.com/s3/ (accessed Nov. 28, 2022).
J. R. Douceur, A. Adya, W. J. Bolosky, D. Simon, and M. Theimer, Reclaiming space from duplicate files in a serverless distributed file system, Proc. Int. Conf. Distrib. Comput. Syst., pp. 617-624, 2002, doi: 10.1109/ICDCS.2002.1022312.
M. Bellare, S. Keelveedhi, and T. Ristenpart, Message-Locked Encryption and Secure Deduplication, pp. 1-29, 2013.
M. Bellare, S. Keelveedhi, and T. Ristenpart, DupLESS: Server-Aided Encryption for Deduplicated Storage, 2013.
V. Chouhan, S. K. Peddoju, and R. Buyya, dualDup: A secure and reliable cloud storage framework to deduplicate the encrypted data and key, Journal of Information Security and Applications, vol. 69, no. July, 2022.
S. Halevi, D. Harnik, B. Pinkas, and A. Shulman-Peleg, Proofs of Ownership in Remote Storage Systems, pp. 1-13, 2013.
R. Di Pietro and A. Sorniotti, Boosting efficiency and security in proof of ownership for deduplication, 2012.
J. Blasco, A. Orfila, R. Di Pietro, and A. Sorniotti, A Tunable Proof of Ownership Scheme for Deduplication Using Bloom Filters, pp. 1-18, 2014.
J. Li, X. Chen, M. Li, J. Li, P. P. C. Lee, and W. Lou, Secure Deduplication with Efficient and Reliable Convergent Key Management, vol. 25, no. 6, pp. 1615-1625, 2014.
P. Singh, N. Agarwal, and B. Raman, Secure data deduplication using secret sharing schemes over cloud, Futur. Gener. Comput. Syst., vol. 88, pp. 156-167, 2018, doi: 10.1016/j.future.2018.04.097.
P. Puzio and S. Loureiro, ClouDedup: Secure Deduplication with Encrypted Data for Cloud Storage.
C. Yang, M. Zhang, Q. Jiang, J. Zhang, and D. Li, Zero knowledge based client side deduplication for encrypted files of secure cloud storage in smart cities, Pervasive Mob. Comput., vol. 41, pp. 243-258, 2017, doi: 10.1016/j.pmcj.2017.03.014.
S. Li, C. Xu, and Y. Zhang, CSED: Client-Side encrypted deduplication scheme based on proofs of ownership for cloud storage, Journal of Information Security and Applications, vol. 46, pp. 250-258, 2019, doi: 10.1016/j.jisa.2019.03.015.
D. Thesis and I. Technology, Secure Data Deduplication in Cloud Storage Services, Doctoral Thesis, no. March, 2018.
P. Puzio, R. Molva, M. Onen, and S. Loureiro, PerfectDedup: Secure Data Deduplication, vol. 2020, no. 644412, pp. 1-18.
C. Guo, X. Jiang, K. R. Choo, and Y. Jie, R-Dedup: Secure client-side deduplication for encrypted data without involving a third-party entity, J. Netw. Comput. Appl., vol. 162, no. April, p. 102664, 2020, doi: 10.1016/j.jnca.2020.102664.
H. Yuan, X. Chen, T. Jiang, X. Zhang, and Y. Xiang, DedupDUM: Secure and Scalable Data Deduplication With Dynamic User Management, Inf. Sci. (Ny)., vol. 456, 2018, doi: 10.1016/j.ins.2018.05.024.
C. M. Geeta, S. R. R. G, and K. R. Venugopal, SDVADC: Secure Deduplication and Virtual Auditing of Data in Cloud, Procedia Comput. Sci., vol. 171, no. 2019, pp. 2225-2234, 2020, doi: 10.1016/j.procs.2020.04.240.
R. Miguel, K. Mi, and M. Aung, HEDup: Secure Deduplication with Homomorphic Encryption, pp. 215-223, 2020.
X. Zheng, Y. Zhou, Y. Ye, and F. Li, A cloud data deduplication scheme based on certificateless proxy re-encryption, J. Syst. Arch., vol. 102, 2020.
G. Kan, C. Jin, H. Zhu, Y. Xu, and N. Liu, An identity-based proxy re-encryption for data deduplication in cloud, J. Syst. Archit., vol. 121, no. October, p. 102332, 2021, doi: 10.1016/j.sysarc.2021.102332.
T. Youn, N. Jho, K. H. Rhee, and S. U. Shin, Authorized Client-Side Deduplication Using CP-ABE in Cloud Storage, vol. 2019, 2019.
Y. Zhou et al., SecDep: A User-Aware Efficient Fine-Grained Secure Deduplication Scheme with Multi-Level Key Management, 2015. doi: 10.1109/MSST.2015.720897.