Enhancement Of Data Compression Using Incremental Encoding And Infix Substitution

Neeti Sangwan; Aanchal Nayyar; Meenakshi Kapoor

doi:10.17577/IJERTV2IS50639

Volume 02, Issue 05 (May 2013)

Enhancement Of Data Compression Using Incremental Encoding And Infix Substitution

DOI : 10.17577/IJERTV2IS50639

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 174
Total Downloads : 414
Authors : Neeti Sangwan, Aanchal Nayyar, Meenakshi Kapoor
Paper ID : IJERTV2IS50639
Volume & Issue : Volume 02, Issue 05 (May 2013)
Published (First Online): 23-05-2013
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Enhancement Of Data Compression Using Incremental Encoding And Infix Substitution

Neeti Sangwan Aanchal Nayyar Meenakshi Kapoor

Dept. of Computer Dept. of Computer Dept. of Computer Science, GGSIPU Science, GGSIPU Science, GGSIPU University, Delhi, India University, Delhi, India University, Delhi, India

Abstract

In todays world scenario data has been multiplied to very large volumes and also the number of data storage applications has increased manifolds. At the same time, the communication networks are having in massive transfer of data through communication channels. In this paper a methodology has been proposed to reduce storage and thus reducing communication costs and also helps increase the capacity of channels and storage medium. The purpose indeed can be met by optimizing Incremental encoding algorithm which is a data compression technique to increase its performance even further. In this redundant/similar data between data entries are replaced by shorter codes. Also the previous research of optimizing incremental encoding algorithm with RLE algorithm has been compared with our proposed technique which will work more efficiently than the previous research done.

Keywords Compression, Lossless compression, Lossy compression, Run length encoding, incremental encoding, Longest common substring, Pattern Matching

Introduction

When we have enormous amounts of data we would like to apply techniques to minimize the usage of space, such techniques are techniques of Data compression. It involves the development of a compact representation of information. Most representations of information contain large amounts of redundancy. Redundancy can be defined as the repetition of linguistic information inherent in the structure of a language, as singularity in the sentence it works. Therefore, one aspect of data compression is redundancy removal.

Data compression is an art of reducing the size of original data and data compression stores the same amount of data in few bits. This Paper specifies the study of efficient encoding and develops an algorithm to encode the data into as few bits as possible. The primary objective of data compression is to minimize the amount of data to be transmitted. The theme of this paper is to

present and analyze a data compression technique.

At this stage we make use of the fact that in a database where a particular string of data that is common in entries of that database can be encoded as shorter codes. In the coding step we use shorter code words to represent letters that occur more frequently, thus lowering the average number of bits required to represent each letter. The design of a compression algorithm involves understanding the types of redundancy present in the data and then developing strategies for exploiting these redundancies to obtain a compact representation of the data. Redundancy or similarity can be present within a single input or between 2 inputs or both.
Compression Techniques

The Compression techniques can be basically classified into two types depending upon whether data is lost or not during the process of compression. These are as follows:
1. Lossless Compression
  
  Lossless data compression technique use algorithms which usually exploit statistical redundancy to represent data more concisely without losing information. Lossless data compression is possible because most real- world data has statistical redundancy. Lossless compression techniques are used for transmitting textual data because we cannot afford losing any part of the text, because such loss can significantly change the meaning of the text, for example: Please
  
  do not attempt the first question of the paper, now let us say on compression the text becomes Please do attempt the first question of the paper, a loss of the not word changes the meaning of the original sentence completely, hence we use Lossless compression techniques for compressing textual data.
2. Lossy Compression
  
  Lossy data compression is the converse of lossless data compression. In these schemes, some loss of information is acceptable. Dropping nonessential detail from the data source can save storage space. Lossy data compression schemes are informed by research on how people perceive bits of information. There is a corresponding trade-off between information lost and the size reduction. A number of popular compression formats exploit these perceptual differences,
  
  including those used in music files, images, and video.

Review of Incremental Encoding

Incremental encoding algorithm is a data compression technique employed where we replace common prefixes with shorter codes to save memory and avoid duplication. That is why it is also known as front compression, back compression, or front coding,This algorithm is particularly well- suited for compressing sorted data, e.g., a list of words from a dictionary. Following is the example of such words:

Table 1: Incremental Encoding

Input	Common prefix	Compressed output
psxa	No preceding word	0 psxa
psxorhvta	psx	0 orhyta
psxorod	psxor	5 od
cab	No common prefix	0 cab
cabbed	cab	3 bed
cabbing	cabb	4 ing
cabit	cab	3 it
cabk	cab	3 k
cabob	cab	3 ob
50 bytes		35 bytes

The encoding used to store the common prefix length itself varies from application to application.

Related Work

Incremental encoding is a data compression technique where we save memory for storage of data by replacing longer words with shorter codes especially holds true for a dictionary encoder where the similarity between consecutive entries is exploited to save memory.

In the previous research paper incremental encoding algorithm has been enhanced by combining it with RLE algorithm. The methodology involved is 2-phased:
1. Phase 1
  
  In the first phase of the algorithm, RLE algorithm is applied where repetitive patterns are replaced by a 2-byte code {no of times the particular letter appears repetitively, the repeated letter itself}, which means FFFFF can be encoded as 5F.Let us understand the concept by taking a simple example of 2 inputs, the first being FFFFGGGCCCDDE and the second being FFFFGGGCD. We can encode the first output as 4F3G3CDDE and the second output as 4F3GCD.
2. Phase 2
  
  In the 2nd phase of encoding data is encoded using incremental encoding wherein we
  
  replace common prefixes by the common prefix length that is of 1 byte. Let us understand the concept by taking a same example of 2 inputs in the 1st step, in the 1st phase of encoding we observe that the incremental output for the 2nd entry can be encoded as 8D where 8 is the common prefix length and the common prefix being FFFFGGGC. Finally both incremental encoding and RLE algorithm have been combined to increase the efficiency even further in the published research paper

Proposed Technique

In our proposed technique an approach is used which comprises of 2 phases:-

Phase 1

The 1st phase involves the technique of encoding the data using incremental encoding algorithm where the common prefix between successive data entries is replaced by common prefix length in exactly the same way illustrated before in the 2nd phase of related work content.
Phase 2

The 2nd phase in the methodology proposed increases the efficiency by using concept of infix matching in any 2 successive data entries after the incremental encoding has been done.
Relative Analysis of Proposed Technique and Related Work

RLE algorithm works well where large amounts of repetitive character is available to us, whereas our methodology works more efficiently on record based datasets where repetitive pattern occurs. RLE algorithm does not cater to the needs of encoding real world data where similar patterns between successive data entries can be observed and exploited for saving more memory unlike our proposed methodology. The proposed Technique can be used to compress data of different organizations/firms/colleges.

Let us try to understand the comparison by illustrating the application of the proposed methodology on a dataset which is close to the kind of data we generally observe in the real scenario where in a firm we maintain the record of the computers in the firm where the nomenclature indicates the use of the computer and also its location:

CADC/AI/TPO/LG-11 CADC/ERP/TPO/LG-12

Taking up the above example, we can observe that applying the proposed methodology will lead to by far better results as compared to the RLE algorithm because in this case similar pattern is observed which can be treated as infix and can be encoded according to our proposed technique.

Implementation

The proposed technique is a 2-phased operation on data to be compressed. In the first phase we are using Incremental Encoding engine and in the second phase we

are using Infix Substitution engine, which replaces the longest common substring between two consecutive words by its code which is of the form (Starting position length), after the incremental coding has been done. The procedure for encoding matched infix between successive data entries is as follows:-

After the 1st phase, to encode a matched infix, we use an algorithm by which we get the length of the common substring/infix which is the longest from the 2 successive entries, that are being compared, can be retrieved. The length is only taken into account when it is greater than 2 bytes because only in that situation we will save at least 1 byte of memory for the minimum case and much more in case in the length of the matched substring is considerably greater than 2 bytes. In the 2nd phase of

encoding a pattern matching algorithm on the 2 data strings received as inputs has been applied by which the starting position of the matched substring/infix in the 1st word can be retrieved.

After the 2nd phase, using the above techniques the values for the starting position of the matched infix in the first word as well as its length have been already retrieved. Using these 2 values the final encoded form can be represented by replacing a matched infix by a combination of 2 bytes code that is [starting position, length of infix matched]. The form of encoding can be understood using the following example of let us say any 2 consecutive data entries in a database:

AAAEL127091310

AAAEL172709032

Using the proposed technique the final encoded output for the 2nd entry is represented as [6]7[7,4]032 where 6 represents the length of common prefix in the 2 entries,7 is written as it is since it is not similar to the 1st word, then 2709 which appears as infix in both the strings has been encoded by replacing 2709 as [7,4] where 8 indicates the starting position of the matched infix with respect to the 1st word and 4 corresponds to the length of the matched infix, also the technique takes into account that a particular matched infix will only be encoded if its length is greater than or equal to 3.

Figure 2 : Proposed Technique

Phase 1: Algorithm for Incremental Prefix Encoding

Step 1: Initialize Queues (Q1, Q2, Q3 )

and String s1

T.count <- count1 <- count2 <- 0 alphabet1 <- alphabet2 <- NIL

Step 2: Q1 <- NULL;

Q2 <- Q3 <- First Word

Step 3: if ( Q1 == EMPTY )

then Dequeue and print all element of Q2 and save in s1

Go to Step 5.

else Dequeue one element of Q1 and Q2

for i = 1, 2 if ( Qi(E) ==

Alphabet ) then if ( counti == 0 )

then counti <- 1 alphabeti <- Qi(E)

else counti <- Qi(E) Repeat Step 3.

Step 4: if ( alphabet1 == alphabet2 ) then T.count <- T.count +

count2

if (count1 == count2) then Go to Step 3. else print T.count

Dequeue all elements of Q2 and save in s1.

else Go to Step 5.

Step 5: Q1 <- Q3; Q2 <- Q3 <- Next

Word T.count <- count1 <- count2 <-

0

alphabet1 <- alphabet2 <- NULL

Step 6: if (Q2 == EMPTY )

then EXIT.

else Go to Step 3.

Phase 2: Algorithm for matched Infix Encoding

Step 1: Initialize String (s2,s3,s4,s5,s6) s2 < – First Word

s3 < – Next Word [1n]

s4 <- s5 <- s6 <- NULL

Step 2: function Infix(String1[1..m],

String2[1..n]) L <- array(1..m, 1..n) z <- 0

inf <- {}

for i <- 1..m for j <- 1..n

if s1[i] = s2[j] if i = 1 or j = 1 L[i,j] <- 1

else

L[i,j] <- L[i-1,j-1] + 1

if L[i,j] > z

z <- L[i,j]

inf <- {s1[i-z+1..i]}

if L[i,j] = z

inf <- inf {s1[i-z+1..i]}

else L[i,j]=0;

return inf

Step 3: function PatMatch(String1,String2)

j<- 1;

while j <- n-m+1 do begin i<- 1;

while (i<=m) and (String1[i]=String2[j]) do begin

i<- i+1;

j<- j+1

end;

if i<=m then j<- j-i+2 else return j-i+1; end.

Step 4: s4 <- Infix (s1,s2)

p <- s4.length

if(p > 2)

q <- PatMatch(s2,s4) r <- PatMatch(s3,s4)

else

Go to step 7.

Input Wor d	RLE Output	Incre ment al Outp ut	RLE + Incre menta l Outpu t	Propos ed Techni que Output
AAB	AABCE	AAB	AABC	AABC
CEA	4ACYB	CEA	E4AC	EAAA
AAA	BC	AAA	YBBC	ACYB
CYB		CYB		BC
BC		BC
AAB	AABCB	4BA	4B3A	4B7,7E
CBA	3ACYB	AAC	CYBB	A
AAC	BEA	YBB	EA
YBB		EA
EA
AAB	AABB3	3BA	3B3A	3.5,7D
BAA	ACYBD	AAC	CYBD	CBE
ACY	CBE	YBD	CBE
BDC		CBE
BE
AAB	AABB3	p>3CA	3C3A	3C5,8C
CAA	ACYBD	AAC	CYBD	D
ACY	CCD	YBD	CCD
BDC		CCD
CD
Takes 56 bytes	Takes 51 bytes	Takes 49 bytes	Takes 44 bytes	Takes 33 bytes
Com	9%	12.5	21.5%	41%
press		%
ion
%

Input Wor d	RLE Output	Incre ment al Outp ut	RLE + Incre menta l Outpu t	Propos ed Techni que Output
AAB	AABCE	AAB	AABC	AABC
CEA	4ACYB	CEA	E4AC	EAAA
AAA	BC	AAA	YBBC	ACYB
CYB		CYB		BC
BC		BC
AAB	AABCB	4BA	4B3A	4B7,7E
CBA	3ACYB	AAC	CYBB	A
AAC	BEA	YBB	EA
YBB		EA
EA
AAB	AABB3	3BA	3B3A	3.5,7D
BAA	ACYBD	AAC	CYBD	CBE
ACY	CBE	YBD	CBE
BDC		CBE
BE
AAB	AABB3	3CA	3C3A	3C5,8C
CAA	ACYBD	AAC	CYBD	D
ACY	CCD	YBD	CCD
BDC		CCD
CD
Takes 56 bytes	Takes 51 bytes	Takes 49 bytes	Takes 44 bytes	Takes 33 bytes
Com	9%	12.5	21.5%	41%
press		%
ion
%

Step 5: s5 < – s3(T.countr-1)

s6 <- s3(p+r-1.n)

Step 6: prints5 . q , p. s6

Step7: s2 <- NULL

s2 <- Next Word

s3 <- Next to Next Word s4 <- s5 <- s6 <- NULL

Step 8: if (s2== NULL)

EXIT

else

Go to step 2.

Comparison Analysis

Let us take an example to understand the proposed technique and its advantages in terms of compression ratio with RLE encoding, incremental encoding and combination of these two in related work. . is used to differentiate data from encode output. In this example we are taking 4 words as input and compressing the input words using RLE encoding, Incremental encoding, related work and our proposed technique. Below is the comparison table of the algorithm:

Table 2: Comparison Analysis Table We can see that uncompressed data takes 56

bytes of memory and RLE, Incremental and

combination of these two take 51 bytes, 49 bytes and 44 bytes of memory respectively while with the help of our proposed technique we are using only 33 bytes of memory to save data and thus the compression % is raised to 41 %. Hence it is

the best among all the four types of technique to compress real world data.
Key Features

The proposed technique has the following key features:
- It is a lossless compression technique.
- Increases the efficiency of Incremental encoding algorithm considerably.
- Increase the transmission rate because of high compression.
- Increase the disk storage capacity.
- Increases the capacity of input/output channels and communication channels.
- It is a 2-phased enhanced encoding procedure.

Conclusion

The proposed technique gives an excellent result of data compression among Lossless compressions. As already illustrated in the table it is evident that the proposed methodology saves 41% of memory which is the highest among all the 4 results that are from incremental encoding, RLE algorithm, Incremental encoding, combination of these two and our proposed methodology respectively.

Thus, we can conclude that our proposed theory saves 20 % more than that in the related work, in the example we have illustrated. Also by replacing infixes we have broadened the horizon of its

application beyond a dictionary to an organization based database which is primarily record-based that exhibits similar relationship among data in the database.
References

Data Compression Khalid Sayood university of Nebraska, Lincoln
Enhancement of Data Compression Using Incremental Encoding by Ajit Singh and Yogita Bhatnagar.
http://en.wikibooks.org/wiki/Data_ Compression
http://en.wikipedia.org/wiki/Data_ Compression
http://www.ics.uci.edu/~dan/pubs/ Data Compression.html
http://books.google.co.in/books? id=ChSOjgiY84YC&printsec=frontcover&s ource=gbs_ge_summary_r&cad=0#v=onepa ge&q&f=false
http://datacompression.info/ [8]http://www.data-compression.com/index. shtml

http://www.cs.cmu.edu/~guyb/ Realworld/compression.pdf
http://www.webopedia.com/TERM/D/ data_compression.html
Introduction To Data Compression by Khalid Sayood (http://books.google.co.in/ books?id=ChSOjgiY84YC&printsec=frontc over&source=gbs_ge_summary_r&cad=0#v

=onepage&q&f=false) [12]http://www.amazon.com/Data- Compression -Book MarkNelson/dp/ 1558514341#reader_1558514341

http://www.cs.cmu.edu/~guyb/real world/compress.html
http://en.wikipedia.org/wiki/Run-length

_encoding
http://en.wikipedia.org/wiki/Run_ Length Limited
http://www.fileformat.info/mirror/egff/ ch09_03.htm
http://www.datacompression.info/ Algorithms/RLE/index.htm
http://www.ijser.org/researchaper% 5CEnhancement-of-Data-Compression- Using-Incremental-Encoding.pdf

Enhancement Of Data Compression Using Incremental Encoding And Infix Substitution

Keywords Compression, Lossless compression, Lossy compression, Run length encoding, incremental encoding, Longest common substring, Pattern Matching

Phase 1: Algorithm for Incremental Prefix Encoding

Step 1: Initialize Queues (Q1, Q2, Q3 )

Step 2: Q1 <- NULL;

Step 3: if ( Q1 == EMPTY )

then Dequeue and print all element of Q2 and save in s1

Go to Step 5.

else Dequeue one element of Q1 and Q2

else counti <- Qi(E) Repeat Step 3.

Step 4: if ( alphabet1 == alphabet2 ) then T.count <- T.count +

if (count1 == count2) then Go to Step 3. else print T.count

Dequeue all elements of Q2 and save in s1.

else Go to Step 5.

Step 5: Q1 <- Q3; Q2 <- Q3 <- Next

Step 6: if (Q2 == EMPTY )

else Go to Step 3.

Phase 2: Algorithm for matched Infix Encoding

Step 1: Initialize String (s2,s3,s4,s5,s6) s2 < – First Word

Step 2: function Infix(String1[1..m],

if s1[i] = s2[j] if i = 1 or j = 1 L[i,j] <- 1

else

if L[i,j] > z

if L[i,j] = z

else L[i,j]=0;

return inf

Step 3: function PatMatch(String1,String2)

while j <- n-m+1 do begin i<- 1;

while (i<=m) and (String1[i]=String2[j]) do begin

end;

if i<=m then j<- j-i+2 else return j-i+1; end.

Step 4: s4 <- Infix (s1,s2)

if(p > 2)

else

Go to step 7.

Step 5: s5 < – s3(T.countr-1)

Step 6: prints5 . q , p. s6

Step7: s2 <- NULL

Step 8: if (s2== NULL)

else

Go to step 2.

Leave a Reply