
DOI: 10.17577/IJERTV14IS060202


The Compression Method of Complete Path Representation (CPR) and Postings for Efficient Retrieval XML Documents

Hsu-Kuang Chang

Department of Information Engineering I-Shou University Kaohsiung, Taiwan

Abstract: The CPR dictionary and the inverted index are presented as the central data structures in XML information retrieval [1]. The index structure can become very large, since the CPR for each tree is generated from many path combinations; for a larger and deeper tree, the index structure can grow exponentially. Therefore, the proposed method of representing XML documents with complete paths must be practical and scalable for use in web environments, in which XML structures can be complicated. In this paper, we employ a number of compression techniques for the CPR dictionary and inverted index that are important for efficient XML IR systems. There are two subtle benefits of compression. The first is increased use of caching. Search systems use some parts of the dictionary and the index much more than others. For example, if we cache the XML postings list of a frequently used query path p, then the computations necessary for responding to the path query p can be done entirely in memory. With compression, we can fit a lot more information into main memory: instead of having to expend a disk seek when processing a query with p, we access its XML postings in memory and decompress them. The second, more subtle advantage of compression is faster transfer of data from disk to memory. For instance, we can reduce input/output (I/O) time by loading much smaller compressed XML postings, even when the cost of decompression is added. So, in most cases, the retrieval system runs faster on compressed XML postings than on uncompressed postings lists.

Keywords: CPR, XML, posting, dictionary

  1. INTRODUCTION

    The amount of data in our world has been exploding as a result of the rapid development of the Internet and messaging-related technology, and a variety of different data formats has resulted. Unfortunately, these different formats cause significant difficulties in the exchange of information destined for data storage and for file formats used commercially or for business purposes. Therefore, the W3C [1] defined XML as a standard information description language that is widely used for information exchange and data storage. The increased amount of data transfer has also increased the demand for efficient XML query processing. The tree model for XML files has therefore become the core of query processing for satisfying user requirements. XML query languages, beginning with XPath [2] and XQuery [3], are based on path expressions. To process an XML query, all portions of the XML document matching the query tree pattern must be found. These patterns include the two most important query axes: parent-child (PC, / for short) and ancestor-descendant (AD, // for short). Some methods, such as [4] and [5], match all query results against the contents of the XML document, but this is a time-consuming task, especially considering the increasingly large amounts of data. Therefore, research is required to improve the efficiency of path expressions [6-16]. Path approaches use sub-paths as their features and represent each XML datum as a binary vector. An element in the binary vector denotes whether the datum involves a corresponding feature, where such features can be defined as tree nodes [7], two-node sub-paths (i.e., node pairs (NP)) [8-9], or whole paths (WP) [10-11]. To improve search efficiency, several modifications of path representation have been proposed. Yang et al. [12] used the content instead of the leaf node for node representation. Liu et al. [13] presented a hybrid definition that combines NP and WP for XML data description.

    The rest of the paper is organized as follows: first, it examines the XML CPR representation. Second, it presents compression methods for CPR paths and XML postings. Next, it shows experimental results of compressing postings using various approaches. Conclusions are drawn in the final section.

  2. XML DATA REPRESENTATION USING COMPLETE PATH ELEMENTS

    An example of the level definition is shown in Figure 1, where four CPE sets, CPL-1, CPL-2, CPL-3, and CPL-4, can be defined. The elements of the four CPE sets are shown in Table 1. Traditional WP and NP representations, with their lack of linking information, cannot serve queries such as (/B/*/*/*/*) and (/*/I/A/L), where * means "don't care" about the node name. For efficient query service, CPR describes XML data with the CPEs of all SSTs. The CPE set of a tree is defined as all the branches (i.e., sub-paths) starting from each level to the leaves. For convenience, let CPL-i denote the set of CPEs starting from the i-th level, where the root level is defined as one and levels increase toward the leaves.

    Essentially, this algorithm is an exhaustive search that is guaranteed to find all of the branches of an SST. The CPR for the description of SSTs can be defined as:

    $CPS = \bigcup_{i=1}^{L} \{\, CPL\text{-}i \mid CPL\text{-}i \text{ is the set of } i\text{-th level CPEs of all XML data} \,\}$,

    where $\bigcup$ denotes the union operation and $L$ is the maximum level number. Considering a database comprised of the three XML data shown in Figure 1, the CPR can be found in Table 1. In Figure 1, there are two I nodes in both DOC 1 and DOC 3. The two nodes, having different children, are distinct and cannot be merged. The two sub-paths /B/I/T in DOC 2 and /B/I/T/* in DOC 1 have the same path length of 3 but distinct distances from the leaf node. The same distinctions also exist between the two sub-path elements /M/I/* and /M/I/*/* in CPL-1.
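    As an illustration of this exhaustive CPE enumeration, the minimal Python sketch below collects the CPL-i sets of one document tree. The tuple-based tree encoding, function names, and node labels are our own assumptions, not the paper's implementation.

```python
def paths_to_leaves(node):
    """All root-to-leaf label sequences of the subtree rooted at `node`,
    where a node is a (label, [children]) tuple."""
    label, children = node
    if not children:
        return [[label]]
    return [[label] + p for c in children for p in paths_to_leaves(c)]

def cpe_sets(node, level=1, out=None):
    """Collect complete path elements: every sub-path that starts at some
    level i and runs down to a leaf. Returns {i: set of path strings}."""
    if out is None:
        out = {}
    for p in paths_to_leaves(node):
        out.setdefault(level, set()).add("/" + "/".join(p))
    _, children = node
    for c in children:
        cpe_sets(c, level + 1, out)
    return out

# A small document tree, loosely modeled on the paper's examples
# (labels B, I, T, D, A are illustrative; the actual Figure 1 trees differ).
doc = ("B", [("I", [("T", [("D", [])])]), ("I", [("A", [])])])
cpls = cpe_sets(doc)
for i in sorted(cpls):
    print(f"CPL-{i}: {sorted(cpls[i])}")
# CPL-1 holds paths from the root, CPL-2 from its children, and so on;
# the union of all CPL-i over every document forms the CPS defined above.
```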

  3. INDEXING COMPLETE PATH ELEMENTS

    A CPE with its tree characteristic is a high-dimensional feature. Traditional B-tree indexing based on node relationships is suitable for WP, NP and twig queries, but is inefficient for CPR, which regards each CPE as a feature element. In this section, a new index with a feature similarity structure (FSS) is presented for CPE management. The FSS provides fast template-based hierarchical indexing.

    The CPEs of Table 1 can be represented with a tree structure, as shown in Figure 2, where P_i denotes a CPE subset with path length equal to i. The CPEs in Figure 2 inherently carry the hierarchical information of path length (P_i) and level (CPL-l), which is available for inferring semantic relations, e.g., ancestor-descendant (AD), sibling (SB) and cousin (CN) relationships. A B-tree index with a key design can achieve a balanced binary tree structure for efficiently indexing the elements of NP and WP, but cannot provide hierarchical information. To facilitate the inference of semantic information, an inverted index structure with additional fields is applied for CPE indexing.

    The path elements with AD relationships can easily be obtained from the CPEs whose path length field is P_i for i >= 3, i.e., path length >= 3. For the example of Figure 2, there are two kinds of AD relationship, shown in Table 2, where A1 involves the path elements with one-generation AD, and A2 involves the path elements with two-generation AD. Note that these path elements are different from CPEs and are labeled with ~. SB and CN are relations among nodes, where the nodes have different descendants but share the same father and grandfather node, respectively. For SB, the father nodes can be found in the levels CPL-l. Furthermore, the search for CN nodes verifies whether their father nodes have an SB relationship. The hierarchical labeling templates of the SB and CN relations are shown in Table 3. The tree structure index, including semantic information, is illustrated in Figure 3, where SB and CN indexing requires fewer levels than the indexing of AD.
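    As an illustration of such an inverted index with level and path-length fields, here is a minimal Python sketch; the entry layout, names, and key scheme are our own assumptions rather than the paper's exact format.

```python
from dataclasses import dataclass, field

@dataclass
class CPEEntry:
    """One CPR-dictionary entry. The level (CPL-l) and path-length (P_i)
    fields carry the hierarchical information used to infer AD/SB/CN
    relations; `postings` lists the docIDs containing this path."""
    path: str            # e.g. "/B/I/T/D"
    level: int           # l in CPL-l (1 = root level)
    length: int          # i in P_i, the number of nodes on the path
    postings: list[int] = field(default_factory=list)

index: dict[str, CPEEntry] = {}

def add_cpe(path: str, level: int, doc_id: int) -> None:
    length = path.count("/")           # "/B/I/T/D" -> 4 nodes
    key = f"CPL-{level}:{path}"        # same path at different levels stays distinct
    entry = index.setdefault(key, CPEEntry(path, level, length))
    if not entry.postings or entry.postings[-1] != doc_id:
        entry.postings.append(doc_id)  # docIDs arrive sorted; avoid duplicates

add_cpe("/B/I/T/D", 1, 1)
add_cpe("/B/I/T/D", 1, 3)
# Candidates for AD relations are exactly the entries with P_i, i >= 3:
ad_candidates = [e for e in index.values() if e.length >= 3]
```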

  4. COMPRESSION METHODS OF CPR PATHS AND XML POSTINGS

    This section presents a series of CPR dictionary structures that achieve increasingly higher compression ratios. The compression techniques for the CPR dictionary are presented as fixed-width entries, the CPR dictionary as a string, block storage with size k, block & CPR coding, and CPR Huffman Coding (CPR HC).

      1. Compression Methods of CPR Paths

        CPR dictionary as fixed-width entries

        For the CPR dictionary of Figure 2 and Figure 3, this scheme needs M * (20+4+4) = 63*28 = 1764 bytes (B) to store the CPR dictionary, where M is the number of paths among the CPEs, ADs, SBs, and CNs, as shown in Table 4. Using fixed-width entries for the CPR dictionary is clearly wasteful.

        CPR block storage with size k

        We can further compress the CPR dictionary by grouping the paths in the string into blocks of size k and keeping a path pointer only for the first path of each block. We store the length of each path in the string as an additional byte at the beginning of the path. We thus eliminate k-1 path pointers but need an additional k bytes for storing the path lengths. For k = 4, we save (k-1)*3 = 9 bytes of path pointers but need an additional k = 4 bytes for the path lengths. The total space requirement for the CPE dictionary of CPR is thus reduced by 5 bytes per four-path block for the CPEs and the ADs respectively, bringing us down to 1117 bytes, as shown in Table 4. A minimal sketch of this blocked layout is given below, after the block & CPR coding discussion.

        Block & CPR coding

        We present a new method, blocking & CPR coding. With the CPEs dictionary, it needs in total $\sum_{i=1}^{L} i = 1+2+3+4 = 10$ block pointers, which can be denoted as $\sum_{i=1}^{L} \sum_{j=1}^{L-i+1} p_{i,j}$. The CPEs dictionary of this size uses 4 bytes for the frequency and the postings pointer, 8 bytes for the average path length, and 3 bytes for each block pointer. Similarly, for the ADs dictionary with blocking & CPR coding, it needs in total $\sum_{i=1}^{L-2} i = 1+2 = 3$ block pointers, which can be denoted as $\sum_{i=1}^{L-2} \sum_{j=1}^{L-i-1} p_{i,j}$. The CPR dictionary size needed for this scheme is down to 1062 B, as shown in Table 4.
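        The following minimal Python sketch illustrates the block storage scheme described above: paths are concatenated into one byte string with a one-byte length prefix each, and only the first path of every k-path block gets a pointer. The function names and sample paths are illustrative assumptions, not the paper's code.

```python
def pack_blocks(paths, k=4):
    """Concatenate sorted paths into one length-prefixed byte string,
    keeping one pointer per block of k paths (instead of k pointers)."""
    blob = bytearray()
    block_pointers = []
    for i, p in enumerate(sorted(paths)):
        if i % k == 0:
            block_pointers.append(len(blob))  # pointer to first path of block
        data = p.encode()
        blob.append(len(data))                # 1-byte length prefix
        blob += data
    return bytes(blob), block_pointers

def unpack_block(blob, block_pointers, b, k=4):
    """Decode the b-th block back into its (up to) k paths."""
    pos, out = block_pointers[b], []
    end = block_pointers[b + 1] if b + 1 < len(block_pointers) else len(blob)
    while pos < end and len(out) < k:
        n = blob[pos]
        out.append(blob[pos + 1 : pos + 1 + n].decode())
        pos += 1 + n
    return out

paths = ["/B/I/T/D", "/B/I/A", "/M/I/*", "/M/I/*/*", "/B/~/~/D"]
blob, ptrs = pack_blocks(paths, k=4)
print(unpack_block(blob, ptrs, 0))  # the first four paths, in sorted order
# With 3-byte pointers, each 4-path block trades (k-1)*3 = 9 pointer bytes
# for k = 4 length bytes: the 5-byte-per-block saving cited above.
```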

        CPR Huffman Coding

        We present a modified Huffman coding method for the CPE and AD paths of Figure 2 and Figure 3. The Huffman code algorithm for CPR is presented in Table 5. First, the frequency of each path element is derived as in Table 6; for example, according to the shown frequencies, the path element F is compressed to the Huffman code 11001 (5 bits), and I to 101 (3 bits). Second, the complete path elements (CPEs) of different lengths at each level are derived in Table 7. For example, the Huffman coding (HC) of /B/I/T/D (a path of length 4 in level 1: CPL-1/P4) can be represented as 1111 101 1110 11000. Third, the ancestor-descendant (AD) paths of different generations at each level are derived in Table 8. For example, the HC of /B/~/~/D (an AD path of 2 generations in level 1: ADL-1/A2) can be represented as 1111 0 0 11000. Fourth, the compression ratio (CR) of the complete path representation (CPR) is calculated as CR = bytes of (CPEs+ADs) * 8 bits / bits of Huffman coding (CPEs+ADs): CR = (62+62+50) * 8 bits / (184+147+119) bits = 3.09, as shown in Table 9. For example, the path /B/~/~/~ occupies 32 bits (4 bytes * 8 bits), while its Huffman coding is only 7 bits (1111 0 0 0).
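        For reference, the compact Python sketch below applies the classic Huffman construction to path elements. The element frequencies are hypothetical stand-ins for Table 6, so the resulting codewords will generally differ from the paper's (e.g., from 11001 for F).

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code table {symbol: bitstring} from frequencies."""
    tie = count()  # tie-breaker so heapq never compares the code dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def encode_path(path, codes):
    """Encode a CPR path such as /B/I/T/D element by element."""
    return " ".join(codes[e] for e in path.strip("/").split("/"))

# Hypothetical element frequencies standing in for Table 6.
freqs = {"~": 40, "I": 20, "B": 12, "M": 9, "T": 8, "A": 6, "D": 5, "F": 4}
codes = huffman_codes(freqs)
print(encode_path("/B/I/T/D", codes))  # codeword lengths track element rarity
```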

      2. Compression Methods of XML Document Postings

    Uncompressed postings: 20-bit docIDs

    Assume we have 800,000 XML documents, 200 paths per document, six characters per path, and 100,000,000 XML postings, where we define a posting as a docID in a postings list. XML document identifiers are log2(800,000) ≈ 20 bits long. Thus, the size of the collection is about 800,000 * 200 * 6 bytes = 960 MB, and the size of the uncompressed XML postings file is 100,000,000 * 32/8 = 400 MB (32-bit words) or 100,000,000 * 20/8 = 250 MB (20-bit words), respectively, as shown in Table 10.

    Table 10. Document postings compression for XML
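    These baseline sizes can be verified with a few lines of Python, using the collection statistics stated above:

```python
import math

N_DOCS, PATHS_PER_DOC, CHARS_PER_PATH = 800_000, 200, 6
N_POSTINGS = 100_000_000

docid_bits = math.ceil(math.log2(N_DOCS))                      # 20 bits
collection_mb = N_DOCS * PATHS_PER_DOC * CHARS_PER_PATH / 1e6  # 960.0 MB
postings_32bit_mb = N_POSTINGS * 32 / 8 / 1e6                  # 400.0 MB
postings_20bit_mb = N_POSTINGS * docid_bits / 8 / 1e6          # 250.0 MB
print(docid_bits, collection_mb, postings_32bit_mb, postings_20bit_mb)
```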

    To encode small numbers in less space than large numbers, we look at two types of methods: bytewise compression and bitwise compression. As the names suggest, these methods attempt to encode gaps with the minimum number of bytes and bits, respectively.

    Variable byte (VB) encoding

    Bytewise variable byte (VB) encoding uses an integral number of bytes to encode a gap. The last 7 bits of a byte are payload and encode part of the gap. The first bit of the byte is a continuation bit: it is set to 1 for the last byte of the encoded gap and to 0 otherwise. To decode a variable byte code, we read a sequence of bytes with continuation bit 0 terminated by a byte with continuation bit 1, then extract and concatenate the 7-bit parts. Note that VB encoding gives about a 60% reduction compared with the uncompressed 20-bit representation, as shown in Table 12.
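    A minimal Python sketch of this VB scheme (continuation bit 1 marking the last byte, exactly as described above) might look as follows; the sample gap values are illustrative only.

```python
def vb_encode(n: int) -> bytes:
    """Variable byte encode one gap: 7 payload bits per byte, with the
    high (continuation) bit set to 1 only on the final byte."""
    out = [n % 128]
    n //= 128
    while n:
        out.append(n % 128)
        n //= 128
    out.reverse()
    out[-1] |= 0x80          # mark the last byte of this gap
    return bytes(out)

def vb_decode(stream: bytes) -> list[int]:
    """Decode a concatenation of VB-encoded gaps back into integers."""
    gaps, n = [], 0
    for b in stream:
        n = (n << 7) | (b & 0x7F)   # accumulate 7 payload bits
        if b & 0x80:                # continuation bit 1 ends the gap
            gaps.append(n)
            n = 0
    return gaps

encoded = b"".join(vb_encode(g) for g in [824, 5, 214577])
print(vb_decode(encoded))           # [824, 5, 214577]
```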

    To devise a more efficient representation of the XML postings, one that uses fewer than 20 bits per XML document, we observe that the XML postings for frequent paths are close together. Imagine going through the documents of a collection one by one, looking for a frequent path like /B/I/T/D. We will find a document containing /B/I/T/D, then skip a few documents that do not contain it, then there is again a document with the path, and so on (see Table 11). The key idea is that the gaps between XML postings are short, requiring much less than 20 bits to store. In fact, gaps for the most frequent paths such as /B/~/~/~ are mostly equal to 1. But the gaps for a rare path that occurs only once or twice in a collection (e.g., /B/~/A/~ in Table 11) have the same order of magnitude as the docIDs and need 20 bits. For an economical representation of this distribution of gaps, we need a variable encoding method that uses fewer bits for short gaps.

    Table 11. Encoding gaps instead of document IDs.
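    Converting a sorted postings list to gaps, and back, takes one pass each way; the docIDs below are made up for illustration:

```python
def to_gaps(doc_ids: list[int]) -> list[int]:
    """Turn a sorted postings list into first-docID-then-gaps form."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps: list[int]) -> list[int]:
    """Invert to_gaps: running sums restore the original docIDs."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [283042, 283043, 283044, 283045]   # a frequent path: gaps of 1
print(to_gaps(postings))                      # [283042, 1, 1, 1]
assert from_gaps(to_gaps(postings)) == postings
```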

    Patched Frame-Of-Reference Delta (PForDelta)

    PForDelta uses a compressed structure consisting of a header, codes, and exceptions, which take 32 bits for the header, 8 bits per code (172 XML postings), and 4 bytes per exception, respectively. The CPEs for PForDelta come to 50*8 bits = 400 bits / 8 = 50 bytes, and the ADs to 19*8 = 152 bits / 8 = 19 bytes. The total size for PForDelta is 32B+50B+19B+8B+8B = 88 bytes. This PForDelta strategy saves space and provides good retrieval performance. The PForDelta layout for XML postings is shown in Table 12.
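    Since the exact header/code/exception layout is given only by its sizes above, the sketch below shows just the core PFor idea under our own simplifying assumptions: small gaps are packed into fixed b-bit slots, and the rare large values are patched back in from an exception list.

```python
def pfor_encode(block, b):
    """Simplified PForDelta-style encoding of one block of gaps: values
    that fit in b bits go into the slot array; larger values become
    exceptions stored verbatim on the side (a didactic sketch only)."""
    slots, exceptions = [], []
    limit = 1 << b
    for v in block:
        if v < limit:
            slots.append(v)
        else:
            exceptions.append((len(slots), v))  # remember position and value
            slots.append(0)                     # placeholder in the slot array
    return b, slots, exceptions

def pfor_decode(b, slots, exceptions):
    out = list(slots)
    for pos, v in exceptions:
        out[pos] = v   # patch the exceptions back in
    return out

gaps = [1, 1, 3, 1, 2, 900, 1, 1]   # mostly tiny gaps, one rare large one
b, slots, exc = pfor_encode(gaps, b=2)
assert pfor_decode(b, slots, exc) == gaps
# Packed size is about len(slots)*b bits plus one full-width entry per
# exception, which is why PForDelta wins when most gaps are small.
```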

    Table 12. Postings compression for the XML datasets of Figures 2-3.

  5. CONCLUSION

For efficiently serving versatile queries, a new XML data representation, referred to as the CPR dictionary with XML postings, has been presented in this paper. The proposed method of representing XML documents with complete paths is practical and scalable for use in web environments. We employ a number of compression techniques for the CPR dictionary and XML postings that are important for efficient XML IR systems, and they give satisfactory experimental results.

REFERENCES

  1. World Wide Web Consortium. The document object model. http://www.w3.org/DOM/

  2. Berglund A, Boag S and Chamberlin D. XML Path Language (XPath) 2.0. W3C Recommendation, http://www.w3.org/TR/xpath20/ (January 2007).

  3. Robie J. XML Processing and Data Integration with XQuery. IEEE Internet Computing 2007; 11: 62-67.

  4. Dalamagas T, et al. Clustering XML Documents Using Structural Summaries. EDBT Workshop on Clustering Information over the Web (ClustWeb) 2004.

  5. Nierman A and Jagadish HV. Evaluating Structural Similarity in XML Documents. Fifth International Workshop on the Web and Databases 2002.

  6. Flesca S, et al. Fast Detection of XML Structural Similarity. IEEE Transactions on Knowledge and Data Engineering 2004; 17: 160-175.

  7. Lian W, et al. An Efficient and Scalable Algorithm for Clustering XML Documents by Structure. IEEE Transactions on Knowledge and Data Engineering 2004; 16: 82-96.

  8. Kozielski M. Improving the Results and Performance of Clustering Bit-encoded XML Documents. Sixth IEEE International Conference on Data Mining-Workshop 2006.

  9. Yuan JS, Li XY and Ma LN. An Improved XML Document Clustering Using Path Feature. Fifth International Conference on Fuzzy Systems and Knowledge Discovery 2008; 2: 400-404.

  10. Leung HP, Chung FL, Chan SCF and Luk R. XML Document Clustering Using Common XPath. Proceedings of the International Workshop on Challenges in Web Information Retrieval and Integration 2005; 91-96.

  11. Termier A, Rousset MC and Sebag M. TreeFinder: A First Step towards XML Data Mining. Proceedings of the IEEE International Conference on Data Mining 2002; 450-457.

  12. Yang J, Cheung WK and Chen X. Learning the Kernel Matrix for XML Document Clustering. IEEE International Conference on e-Technology, e-Commerce and e-Service 2005; 353-358.

  13. Liu J, Wang JTL, Hsu W and Herbert KG. XML Clustering by Principal Component Analysis. Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence 2004; 658-662.

  14. Lee JW, Lee K and Kim W. Preparation for Semantic-Based XML Mining. IEEE International Conference on Data Mining 2001; 345-352.

  15. Qureshi MH and Samadzadeh MH. Determining the Complexity of XML Documents. Proceedings of the International Conference on Information Technology: Coding and Computing 2005; 2: 416-421.

  16. Li XY. Using Clustering Technology to Improve XML Semantic Search. Proceedings of the Seventh International Conference on Machine Learning and Cybernetics 2008; 5: 2635-2639.