 Open Access
 Total Downloads : 1093
 Authors : P.Yellamma, Dr.Narasimham Challa
 Paper ID : IJERTV1IS8175
 Volume & Issue : Volume 01, Issue 08 (October 2012)
 Published (First Online): 29102012
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Performance Analysis Of Different Data Compression Techniques On Text File
International Journal of Engineering Research & Technology (IJERT)
ISSN: 22780181
Vol. 1 Issue 8, October – 2012
P.Yellamma Dr.Narasimham Challa
Amrita sai institute of science and Amrita sai institute of science and Technology, India Technology India.
Abstract
Data Compression is the science and art of representing information in a compact form. Compression is the process of coding that will effectively reduce the total number of bits needed to represent certain information.Data compression has been one of the critical enabling technologies for the ongoing digital multimedia revolution .data compression also called as source coding or bit rate reduction. There are different compression algorithms which are available in different formats. Data compressions are generally lossless and lossy data compression. In this paper, we study different methods of lossless data compression algorithms and calculating the entropy on English text files: ShanonFano coding, Huffman Encoding, Run Length Encoding (RLE), LempelZivWelch (LZW). Keywords: lossless data compression, lossy data compression, Entropy, ShannonFano coding, Huffman encoding, RLE, LZW.

Introduction
Data compression refers to reducing the amount of space needed to store data or reducing the amount of time needed to transmit data. The size of data is reduced by removing the excessive information. Data compression can be lossless, only if it is possible to exactly reconstruct the original data from the compressed version [4]. To compress something means that you have a piece of data and you decrease its size.
There are different techniques who to do that and they all have their own advantages and disadvantages. Examples of such source data are medical images, text and images Preserved for legal reason, some computer executable files, etc.
The general principle of data compression algorithms on text files is to transform string of characters into a new string which contains the same information but with new length as small as possible. The efficient data compression algorithm is chosen according to some scales like: compression size, compression ratio, processing time or speed, and entropy. [1]

Lossless compression vs lossy compression:

Lossless compression: Reduces bits by identifying and eliminating statistical redundancy
.no information is lost in lossless compression. In these schemes before the compression after the compression data must be same. [13]

Lossy compression: Reduces bits by identifying marginally important information and
removing it. In these schemes some loss of information is acceptable depending upon the application [5].
Compression ratio=B1/B0*100%.
B0=no. of bits before compression. B1= no. of bits after compression



Shanonfano coding
In ShannonFano coding, the procedure is done by a more frequently occurring string which is encoded by a shorter encoding vector and a less frequently occurring string is encoded by a longer encoding vector. ShannonFano coding[1].
Relied on the occurrence of each character or symbol with their frequencies in a list and is also called as a variable length coding.
The ShannonFano Algorithm
This is a basic information theoretic algorithm [3]. A simple example will be used to illustrate the algorithm.
Example: Alice_has_sent_a_message_to_bob.

Algorithm
Encoding for the Shannonfano algorithm: It follows a topdown approach [9].
Step1: Create table providing frequencies / counts
Step2: sort symbols according to their frequencies/ Probabilities in descending order.(table1)
Step3: Recursively divided into two parts, each with
approx (binary) same number of counts.
Step4: Add a binary 0 to the code words of the upper part and a binary 1 to the lower part.
Step5: Search for the next part containing more than two symbols, repeat the step3 and step4.
Step6: Coding of the origination data according to code words in the table2.
Step7:Create the coding tree(figure1).
Step8: Transmit Codes instead of Tokens
Table1.Symbol table frequencies in descending order.
Table2. Shannonfano code words table.
Figure1.Shannonfano coding tree.
Table3.Result of performing Shannonfano.
Total number of used bits=116.
Before compression (B0) =32*8=256bits. After compression (B1) =116bits.
Compression ratio=B1/ B0 *100%
=116/256*100%=45.3% from the original size. It means that it saves 54.7% in space[1].
Generally, ShannonFano coding does not guarantee that an optimal code is generated. Shannon Fano algorithm is more efficient when the probabilities are closer to inverses of powers of 2.


Huffman Encoding
The Huffman coding algorithm [4] is named after its inventor, David Huffman, who developed this algorithm as a student in a class on information theory at MIT in1950. It is a more successful method used for text compression. It is a compression algorithm used for lossless data compression.
A more efficient approach is to use a variablelength representation, where each character can have a different number of bits to represent it. More specifically we first analyze the frequency of each character in the text, and then we create a binary tree (called Huffman tree) giving a shorter bit representation to the most used characters, so that they can be reached faster[3].

Algorithm
Step1: Compute the probability of each character. Step2: Sort the set of data in ASCENDING order.
Step3: Create a new node where the left child is the lowest in the sorted list and the right is the second lowest in the sorted list.
Step4: Chopoff those elements in the sorted list as they are now part of one node and add the probabilities. The result is the probabilities for the new node.
Step5: Perform insertion sort on the list with the new node.
Step6: Repeat steps 3, 4, 5 until, only have one node left.
Step7: Calculate Entropy
Example: Alice_has_sent_a_message_to_bob.
Table1.Symbol table frequencies in Ascending order.
It is based on building a full binary tree for the different symbols that are in the original file after calculating the probability for each symbol and put them in ascending order. After that, we derive the code words for each symbol from the binary tree, giving short code words for symbols with large probabilities and longer code words for symbols with small probabilities.
By applying Huffman algorithm on the given example. We get the probability table [2].
Table5. Ascending probabilities for symbols.
Figure 2: Huffman encoding tree.
Huffman Method of Code Generation and average code length per character computation is calculating in this example [7].
Table 6: Huffman code words symbols.
Average code word length=entropy=~3.6 bits/symbols. Huffman weighted path in this example is:
Weighted path= (no. of counts*no. of code words).[5]
Compression ratio=B1/B0*100%=116/256*100%
=45.3% from the original size. it means that it saves 54.7% in space.


Runlength encoding (RLE)
Run Length Encoding (RLE) is a simple and popular data compression algorithm. It is based on the idea to replace a long sequence of the same symbol by a shorter sequence and is a good introduction into the data compression field for necomers.RLE requires only a small amount of hardware and software resources. Therefore RLE was introduced very early and a large range of derivates have been developed up to now.
Runlength encoding is a data compression algorithm that is supported by most bitmap file formats, such as TIFF, BMP, and PCX.

General Principle of RLE [16]
Instead of the original data socalled runs will be stored. In the general form a run is a sequence of a certain length containing only one symbol. The length of the sequence is called run count and the symbol run value

Algorithm.
step1: Pick the character from source string. step2: Append the picked character to the destination string.
step3: Count the number of subsequent occurrences of the picked character and append the count to destination string.
step4: Pick the next character and repeat steps 2, 3 and 4 if end of string is NOT reached.
Example: Alice_has_sent_a_message_to_bob.
Table7: RLE data compression hypothetical scan line.
If we apply the runlength encoding (RLE) data compression algorithm to the above hypothetical scan line, we get the following:
6_4a4e4s2t2o2b1l1i1c1pn1m1g1. Compression ratio=B1/ B0 *100%
=30/32*100%=93.7%from the original size. It means that it saves 6.3% in space only. Does not work well for English text however, is almost always a part of a larger compression system.


LempelZivWelch (LZW)
LZW is a universal lossless data compression algorithm [10] created by Abraham Lempel, Jacob Ziv,and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement, and has the potential for very high throughput in hardware implementations [6].
LZW is referred to as a dictionarybased encoding algorithm. The algorithm builds a data dictionary (also called a translation table or string table) of data occurring in an uncompressed data stream. Patterns of data (substrings) are identified in the data stream and are matched to entries in the dictionary. If the substring is not present in the dictionary, a code phrase is created based on the data content of the substring, and it is stored in the dictionary. The phrase is then written to the compressed output stream [1].
Code table compression is the basis of the popular LWZ compression method. Encoding occurs by identifying sequences of bytes in the original file that exist in the code table. The 12 bit code representing the sequence is placed in the compressed file instead of the sequence. The first 256 entries in the table correspond to the single byte values, 0 to 255, while the remaining entries
correspond to sequences of bytes. The LZW algorithm is an efficient way of generating the code table based on the particular data being compressed.

LZW encoding algorithm:
Encoding input consists of the following steps:
Step 1. Initialize dictionary to contain one entry for each byte. Initialize the encoded string with the first byte of the input stream. Step 2. Read the next byte from the input stream. Step 3. If the byte is an EOF goto step 6. Step 4. If concatenating the byte to the encoded string produces a string that is in the dictionary:

Concatenate the the byte to the encoded string.

Go to step 2.
Step 5. If concatenating the byte to the encoded string produces a string that is not in the dictionary:

Add the new sting to the dictionary.

Write the code for the encoded string to the output stream.

Set the encoded string equal to the new byte.

Go to step 2.
Step 6. Write out code for encoded string and exit. Example: alice_has_sent_a_message_to_bo.
The LZW algorithm stores in a dictionary. The first 255 entries are used to contain the values for individual.
e_ but this has already been included in the dictionary in entry 260.this means we now set up the threecharacter string e_t.
By using LZW compression algorithm our examples is not suitable. Generally LZW is suitable for the images. In this example Compression ratio=31/32*100=96.8% from the original size. it means that it saves 3.2% in space only or storage of new file. [4]



Analysis
In this section, tests are made on the four compression techniques on same text file. The results are tabulated and analyzed in order to reach to the best technique, advantage and disadvantage for each one, and when each one is best to use. First compare the Shannonfano and Huffman coding, the compression ratio is almost same. By using these compression algorithms it saves the
54.7%space.Second compare the Run length encoding and Lempelzivwelch algorithms of the compression ratio is low as compare with the Huffman and Shannonfano algorithms. According to my observation Huffman encoding algorithm is best result for the text files.

Conclusion
Data compression is most consideration thing of the recent world. We have to compress a huge amount of data so as to carry from one place to other or in a storage format. These proposed compression technique are improved the efficiency of compression on text. Huffman encoding Algorithm is suitable for the given text.

References