XML Twig Pattern Matching Algorithms and Query Processing

DOI : 10.17577/IJERTV1IS3239

Download Full-Text PDF Cite this Publication

Text Only Version

XML Twig Pattern Matching Algorithms and Query Processing

D.BUJJI BABU 1 Dr. R.SIVA RAMA PRASAD 2 M.SANTHOSH 3

Associate Professor,Dept. o f CSE, Research Director, Dept. o f CSE, Dept. of CSE,

Prak asam Engg. College, Acharya Nagarjuna University, Prak asam Engg. College, Kandukur-523105, Guntur-522510, Kandukur-523105,

A.P., India A.P., India A.P., India

ABSTRACT

XM L has e merged as the wire-language of the internet.XM L can be used to structure the data and also provide meaning fo r data. An effective document structure helps convert data into useful in formation that can be processed quickly and e fficiently. The XM L data is e xchanged and generated in B2B1 applications. According to this point there is need for effic ient processing of queries on XML data. The research stream in XM L database is processing of XML tree pattern query (XTPQ) with effic ient answer (ca lled pattern matching).The XM L document can be converted into tree model by using DOM (Parser).The XM L query languages like XPath (Extensible path language), XQuery (Extensible Query language) represent queries on XML data as tree patterns (twigs).The major operation of XM L query processing is to find all the occurrences of twig 1 patterns efficiently on XM L database. In the past few years, many a lgorith ms have been proposed to match such tree patterns. This paper presents an overview of the state of the art in XTPQ processing. This overview shall start by providing some background in holistic approaches to process XTPQ and then introduce diffe rent algorith ms for twig pattern matching.

Keywords: – XML, Pattern Matching Algorith ms, XM L Tree Pattern, Query processing, XML Parsers.

  1. INTRODUCTION

    There is an increasing need of XM L data for data transporting application in businesses. The evaluation of XM L tree pattern queries (XTPQ) p roduces output as all matched patterns is called twig patterns (this can be called as pattern matching). The e mergence of this

    1 B2B means business to business

    point is need for effic ient pattern matching a lgorith ms on large volume of XM L data for evaluating tree patterns (twigs). The DOM parser represents the XML document as XML tree. The XM L trees again two types ordered (ancestor and left-to-right ordering among siblings relationships significant) and unordered (only ancestor relationship significant) XM L trees. Some a lgorith ms produces output as unordered XML trees and some produces output as ordered labeled XM L trees(twigs).The previous approaches considered XM L t ree as ordered labeled XM L tree(t wig). For e xa mple , when searching for a twig of the element student with the sub elements first name and last name (possibly with specific va lues), ordered matching would not consider the case where the order of the first name and the last name is reversed. However, this could e xactly be the student we are searching for. The way to solve this problem is to consider the query twig as an unordered tree where only the ancestor- descendant relationships are important the preceding- following, preceding-sibling and following-sibling relationships (a xes) are unimportant.

    With the rapidly increasing popularity of XM L for data representation, there is a lot of interest in query processing over data that conforms to a tree-structured data model. Since the data objects in a variety of languages (e.g. XPath [1], XQuery [2]) are typically trees, tree pattern matching (twig) is the central issue.

    For e xa mp le, the following query:

    Query=/ book

    Lorem ipsum dolor sit amet...

    //author [name ='Jane']

    can be represented as a twig (sma ll tree ) pattern. It matches author elements that has sub element na me as content the string value Jane, and are descendants of book ele ments that have a child title ele ment whose content is the string value XM L. In the above query "/" represents the relat ionship between parent and child

    "//" represents the relationship between ancestor and descendant.

    In practice, XM L data may be very large, co mple x and have nested elements. Thus, efficiently finding a ll twig patterns in an XML database is a ma jor operation of XM L query processing. In the past few years, many algorith ms ([3], [4]) have been proposed to match such twig patterns. These approaches

    • First develop a labeling scheme to capture the structural info rmation of XM L documents, and then

    • Perform tree pattern matching based on labels alone without traversing the original XM L documents.

      For solving the first sub- problem of designing a proper labeling scheme, the previous methods use a tree-traversal order or te xtual positions of start and end tags (e.g. reg ion encoding [5]) or path e xpressions(e.g. Dewey ID [6])or prime numbers (e.g. [7]). By applying these labeling schemes, one can determine the relationship (e.g. ancestor-descendant, parent-child) between two ele ments in XM L documents fro m their labels alone.

  2. XML Twig Patte rn Matching

    The XML databases like Lo re [8] and Timber [9] represents the XML query as small tree ca lled t wig. XM L data and its related issues of their storage as well as query processing using relational database systems have recently been considered in [6, 7]. The recent papers (e.g. [10, 11]) are proposed to effic iently process an XML twig pattern (XTPQ). In paper [10], a new holistic a lgorith m, called ordered, is proposed to process order-based XML tree query. In paper [11], an algorith m ca lled TwigStac kListNot is proposed to handle queries with negation function. Chen et al [12] proposed different data streaming schemes to boost the holism o f XM L tree pattern processing. They showed that larger optima l c lass can be achieved by refined data streaming schemes. In addition, Twig2Stack [13] is proposed for ans wering genera lized XM L tree pattern queries. Note the diffe rence between generalized XM L tree pattern and extended XM L tree pattern here. Genera lized XM L tree pattern is defined to include optional a xis which mode ls the e xpression in LET and RETURN clauses of XQuery statements. But e xtended XML tree pattern is defined to include some complicated conditions like negative function, wildcard and order restriction.

    We have other approaches to match an XML tree pattern are ViST[14] and PRIX[15] ,wh ich converts an XM L tree pattern match into sequence match. These two algorithms ma inly focus on ordered queries and it is non-trivia l to e xtend those methods to handle unordered queries. The paper [16] gives diffe rent XM L tree query processing algorithms (inc luding holistic match and sequence match) and concluded that the

    holistic tree pattern method is robust in nature also guarantees performance. Fro m the theoretical research about the optima lity of XM L tree pattern matching, Choi et a l. [20] developed theore ms to prove that it is impossible to devise a holistic algorith m to guarantee the optima lity fo r queries with any co mbination of Parent-Ch ild and Ancestor-Descendant relationships. Shale m et al. [21] researched the space comple xity of processing XM L t wig queries. Their paper showed that the upper bound of full-edge queries with Parent-Ch ild and Ancestor- Descendant edges are O(D), where D is the document size. In other words, their results also theoretically prove that there exists no algorithm to optima lly p rocess an arbitrary query Q/, //, *.

    The structural relat ionships are verified with the help of labeling scheme of XM L e le ments. The most commonly used labeling schemes are containment and prefi. The contain ment labeling scheme was introduced by Zhang et al. [17] for contain ment queries. The a xes like Pa rent-Ch ild and Ancestor- Descendant relationships have the same co mp le xity according to regional labeling. The e xa mple to represent prefix labeling scheme on XM L data is Dewey ID. It can be used to preserve the path informat ion during query processing. Recent work of Lu at e l. [14] utilizes the extended Dewey encoding

    [18] which encodes path informat ion including not only the ele ment IDs but also the ele ment names.

  3. Holistic Algorithms for XML Query Processing

    Here we propose two types of algorith ms to process an XM L twig query. They are

    Two-Phase holistic twig evaluation a lgorith ms One-Phase holistic twig evaluation algorith ms

    1. TwigStack Algorithm:

      Based on the containment labeling scheme [17], Bruno et a l. [5] proposed a novel holistic XM L t wig pattern matching method TwigStack which avoids storing intermed iate results unless they contribute to the nal results. The method, unlike the decomposition based method, avoids computing la rge redundant intermediate results. But the ma in limitation of TwigStack is that it may produce a la rge set of useless intermediate results when queries contain any parent-child re lationships. TwigStack has been proved to be I/O optima l in terms of output sizes for qu eries with only A-D edges, their algorithms still cannot control the size of intermed iate results for queries with parent-child (P-C) edges. TwigStack operates in t wo steps:

      • A list of intermediate path solutions is output as intermed iate results; and

      • The intermediate path solutions in the rst step are me rge-joined to produce the nal solutions.

        Algorithm for TwigStack (q):

        // Phase 1

        1: while notEnd (q) 2: qact = getNe xt(q)

        3: if (isNotRoot(qact)) the n

        4: cleanStack(pare nt(qact), ne xtL(qact)) 5: end if

        6: if (isRoot(qact) or isNotEmpty(Sparent(qact))) then 7: cleanStack(qact, ne xt(qact))

        8:moveStre amToStack(Tqact,Sqact

        ,poi ntertotop(Sparent (qact))) 9: if (isLeaf(qact)) then

        10: showSolutionsWithBlocking(S qact , 1) 11: pop (Sqact)

        12: end if

        13: else

        14: advance (Tqact)

        15: end if

        16: end while

        // Phase 2

        17: mergeAllPathSol utions()

        Algorith m TwigStack operates in two phases. In the first phase (lines 1-16), some (but not all) solutions to individual query root-to-leaf paths are computed. In the second phase (line 1 7), these solutions are merge-joined to co mpute the answers to the query twig pattern.

    2. TwigStackList Algorithm:

      Unlike the previous Algorithm TwigStack [5], our approach takes into account the level informat ion of ele ments and consequently output much less intermediate paths for query twig patterns with parent – child edges. We have analytically shown that when all edges below branching nodes (nodes that has more than one child) in the query pattern are ancestor-descendant relationships, the I/O cost of TwigStackList is only equal to the sum of sizes of the input and the nal output. In other words, TwigStackList [19] identies a larger query class to guarantee the I/O optima lity than TwigStack, wh ich only guarantee the optima lity for queries with entirely A-D relat ionships. Experimental results showed that our method achieves the similar performance with TwigStack for queries with only ancestor-descendant relationships, but is much more ecient than TwigStack for queries with parent-child relationships, especially for deep data sets with complicated recursive structure.

      Algorithm for TwigStack List: 1: while notEnd() do

      2: qact= getNext(r oot)

      3: if (isNotRoot(qact)) the n

      4: cleanParentStack(qact,getStar t(qact)) 5: end if

      6: if (isRoot(qact ) or isNotEm pty(Sparent(qact))) then

      7: clearSelfStack(qact,getEnd(qact)) 8:move ToStack (qact,Sqact,pointertotop(S parent(qact))) 9: if (isLeaf(qact)) then

      10: showSolutionsWithBlocking(Sqact,1) 11: pop (S qact )

      12: end if

      13: else

      14: procee d (qact) 15: end if

      16: end while

      17: mergeAllPathSol utions()

      First of a ll, line 2 ca lls getNe xt a lgorithm to identify the node qact to be processed. Line 4 and 7 re move partial answers fro m the stacks of parent(qact ) and qact that cannot be extended to total answer. If q act is not a leaf node, we push ele ment Cq into Sq (line 8); otherwise (line 10), a ll path solution involving Cq can be output. Note that path solutions should be output in root-leaf order so that they can be easily merged together to form na l twig matches (line 17).

    3. OrderedTJ Algorithm:

      Its an e xtension of TwigStack List. He re

      1. We introduce a new algorithm, called Or dere dTJ[10], for holistic ordered twig pattern processing. In Or dere dTJ, an ele ment contributes to nal results only if the order of its children accords with the order of corresponding query nodes.

      2. If we ca ll edges between branching nodes and their

      children as branching edges and denote the branching edge connecting to the nth child as the nth branching edge, we analytically de monstrate that when the ordered query contains only A-D relat ionship fro m the second branching edge, Ordere dTJ [10] is I/O optimal among a ll sequential a lgorith ms that read the entire input. In other words, the optimality of Or dere dTJ allo ws the e xistence of P-C edges in non-branching edges(a node that has only one child) and the rst branching edge(a node that has more than one child).

      Algorithm for Or dere dTJ: 1: while notEnd() do

      2: qact= getNext(r oot)

      3:if (isRoot(qact) or isNotEmpty(Sparent(qact))) the n 4: cleanStack(qact,getEnd(qact))

      5: end if

      6: moveStreamToStack(qact,S qact);

      7: if (isLe af(qact)) the n

      8: showPathSolutions(S q,getEle ment(qact)) 9: else

      10: procee d (qact)

      11: end if

      12: end while

      13: mergeAllPathSol utions ();

      Algorith m Or dere dTJ, wh ich co mputes answers to an ordered query twig, operates in two phases. In the rst phase (line 1-12), the individual query root-leaf paths are output. In the second phase (line 13), these solutions are me rged-jo ined to compute the answers to the whole query.

    4. TJFast Algorithm (a Fast Twig Join algorithm):

      The above three algorith ms are based on containment labeling scheme[17]. A ne w a lgorith m, name ly TJ Fast, which e xplo its the nice property of the e xtended Dewey labeling scheme [18] and e c iently evaluates XML twig queries. The containment labeling scheme is dicult to answer queries with wildcards in branching nodes(a node that has more than one child). For e xa mp le, consider an XPath: //a/*/ [b]/c. Where * denotes a wildcard symbol wh ich can match any single ele ment. The contain ment labels of a, b and c do not provide enough information to determine whether they match the query or not.

      Algorithm for TJ Fast:

      1: for eac h f leafNodes(r oot) 2: locate Matc he dLabel(f)

      3: end for

      4: while (notEnd(r oot)) do

      5: fact= getNext(topBr anc hingNode ) 6: outputSolutions(fact)

      7: advance (Tfact)

      8: locate Matc he dLabel(fact) 9: end while

      10: mergeAllPathSol utions()

      Algorith m TJ Fast, which computes answers to a query twig pattern Q, is presented in Algorith m 7. TJ Fast operates in two phases. In the rst phase (line 1-9), some solutions to individual root-lea f path patterns are computed. In the second phase (line10), these solutions are merge-joined to co mpute the answers to the whole query.

    5. TreeMatch Algorithm:

    This algorith m is proposed to achieve larger optima l query classes. It uses a concise encoding technique to match the results and also reduces the useless intermediate results.

    Algorithm Tree Mtch for class Q/, //, *. 1: locate Matc hLabel(Q);

    2: while (notEnd(r oot)) do

    3:fact= getNe xt(topBr anchi ngNode); 4: if (fact is a retur n node )

    5: addToOutputList(NAB(fact), cur (Tfact ));

    6: advance (Tfact ); // read the next ele ment in Tfact

    7: updateSet(fact); // update set enc oding

    8: loc ate MatchLabel (Q); // loc ate ne xt ele ment wi th matc hing path

    9: emptyAllSets (root);

    Now we go through Algorithm. Line 1 locates the first ele ments whose paths match the individual root-leaf path pattern. In each iterat ion, a lea f node fact is selected by getNe xt function (line 3). The purpose of lines 4 and 5 is to insert the potential matching ele ments to outputlist. Line 6 advances the list T fact and line 7 updates the set encoding. Line 8 locates the next matching ele ment to the individual path. Finally, when all data have been processed, we need to empty all sets in Procedure emptyAllSets (line 9) to guarantee the co mpleteness of output solutions.

    The below Fig. (a) shows the query and document illustrate the TreeMatch algorithm for class Q/,//,*

    Fig (a) Illustration to Algorithm TreeMatch for class Q/,//,*

    TABLE 1

    Set encoding for the example in above Fig (a)

    The above TABLE1 de monstrates the current access ele ments, the sets encoding and the corresponding output elements. There are two branching nodes in the query. First, B1, D1, and E1 are scanned. C1 and C2 are added into the set SC, but their bit Vectors is 10 and 01, which indicate that C1 and C2 have only one child, respectively. In this scenario, recall that TJ Fast may output path solutions A1/A2/C1/D1 and A1/A2/C1/C2/ E1, which might be useless to produce final results. Thus, our algorith m Tree Match dimin ishes the unnecessary I/O cost. Next, E2 is scanned and the bit Vector (C1) beco mes 11 as C1 has two children now. Similarly, the bit Vector (A1) is 11 too. In this mo ment, we guarantee that A1 matches the whole pattern tree, as all b its in bit Vector (A1) is 1. Finally, when B2 is scanned, A2 is added to set SA. Therefore, Tree match outputs two final results B1 and B2.

    Through this exa mp le, we illustrate two differences

    between TJFast and Tree Match.

    1. TJFast outputs one useless intermediate path A1/A2/C1/C2/ E1, but Tree Match uses the bitVector encoding to solve this proble m.

    2. TJFast outputs the path solution for all nodes in query, but TreeMatc h only outputs nodes for return nodes (i.e., node B in the query) to reduce I/O cost.

  4. EXPERIMENTAL STUD Y

    The e xperimental study verifies the effectiveness, in terms of accuracy and optimality, of various holistic twig pattern matching algorithms are shown in Fig (b) and Fig (c ).

    Fig (b) Execution time of Q/,//,* on random data

    Fig(c) Execution time of Q/,//,*,< on random data

  5. CONCLUSION

In this paper, we proposed the problem o f XM L twig pattern matching and surveyed some recent wo rks and algorith ms. Five algorith ms TwigStack [5], Twig StackList [19], OrderedTJ [10], TJFast [18] and TreeMatch are introduced. TreeMatch has an overall good performance in terms of running time and the ability to process generalized tree patterns. From the above algorithms we can observe one point that is first four twig pattern matching a lgorith ms (TwigStack, Twig StackList, OrderedTJ, and TJFast) works on two – phase query evaluation and TreeMatch works on one- phase query evaluation. Fro m this point we can say that TreeMatch twig pattern matching a lgorith m can answer complicated queries and has good performance.

References

  1. A. Berglund, S. Boag, and D. Chamberlin. XM L path language (XPath) 2.0. W3C Recommendation 23 January 2007 http://www.w3.org/TR/xpatp0/.

  2. S. Boag, D. Chamberlin, and M . F. Fernandez. Xquery 1.0: An XM L query language. W3C Working Draft 22 August 2003.

  3. H. Jiang, H. Lu, and W. Wang. Efficient processing of XM L twig queries with OR-predicates. In Proc. of SIGMOD Conference, pages 274-285, 2004.

  4. H. Jiang et al. Holistic twig joins on indexed XM L documents. In Proc. of VLDB, pages 273-284, 2003.

  5. N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XM L pattern matching. In Proc. of SIGMOD Conference, pages 310-321, 2002.

  6. I. Tatarinov, S. Viglas, K. S. Beyer, Shanmugasundaram,

    E. J. Shekita, and C. Zhang. Storing and querying ordered XM L using a relational database system. In Proc. of SIGMOD, pages 204-215, 2002.

  7. X. Wu, M . Lee, and W. Hsu. A prime number labeling scheme for dynamic ordered XM L trees. In Proc. of ICDE,pages 66-78, 2004.

  8. R. Goldman and J. Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. In Proc. of VLDB, pages 436-445, 1997.

  9. H. V. Jagadish and S. AL-Khalifa. Timber: A native XM L database. Technical report, University of M ichigan, 2002.

  10. J. Lu, T. W. Lin g, T. Yu, C. Li, and W. Ni. Effcient processing of ordered XM L twig pattern matching. In DEXA,pages 300-309, 2005.

  11. T. Yu, T. W. Ling, and J. Lu. Twigstacklistnot: A holistic twig join algorithm for twig query with not -predicates on xml data. In DASFAA, pages 249-263, 2006.

  12. T. Chen, J. Lu, and T. W. Ling. On boosting holism in

    xml twig pattern matching using structural indexing Techniques. In SIGMOD, pages 455-466, 2005.

  13. S. Chen, H.-G. Li, J. Tatemura, W.-P. Hsiung, D.Agrawal, and K. S. Candan.Twig2stack: Bottom-up processing of generalized-tree-pattern queries over xml document. In Proc. of VLDB Conference, pages 19-30, 2006.

  14. H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: A dynamic index method for querying XM L data by tree structures. In SIGMOD, pages 110-121, 2003.

  15. P. Rao and B. M oon. PRIX: Indexing and querying

    XM L using prufer sequences. In ICDE, pages 288-300, 2004.

  16. M . M oro, Z. Vagena, and V. J. Tsotras. Tree-pattern queries on a lightweight XM L processor. In VLDB, pages

    205-216, 2005.

  17. C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M . Lohman. On supporting containment queries in relational database management systems. In Proc. of SIGMOD Conference, pages 425-436, 2001.

  18. J. Lu, T. W. Ling, C. Chan, and T. Chen. From region encoding to extended dewey: On efcient processing of xml twig pattern matching.In VLDB, pages 193204, 2005.

  19. J. Lu, T. Chen, and T. W. Ling. Efcient processing of xml twig patterns with parent child edges: a look-ahead approach. In CIKM , pages 533542, 2004.

[20]B. Choi, M . M ahoui, and D. Wood. On the optimality of

the holistic twig join algorithms. In Proceeding of DEXA, pages 2837, 2003.

[21]M . Shalem and Z. Bar-Yossef. The space complexity of processing xml twig queries over indexed documents. In ICDE, 2008.

AUTHORS PROFILE

  1. D. Bujji Babu currently working as an Associate professor in the department of Computer Science and Engineering, at Prakasam Engineering College, Kandukur, A.P. India. He is having 4 years of research and 10 years of teaching experience. He is a research scholar in the department of CSE at Acharya Nagarjuna University , India.

    E-mail: bujji_bict@yahoo.com

  2. Dr. Siva Rama Prasad currently working as a head of the International Business Administration department at Acharya Nagarjuna University , India. He has published several research papers in various peer reviewed international journals. Authored seven books and also he is working as a research director in the dep artment of CSE.

    E-mail: raminenisivaram@yahoo.co.in

  3. Kum.M.S anthosh M .Tech (CSE) from Prakasam Engineering College, Kandukur, Prakasam (Dt.), Affiliated by JNTUK, Kakinada, A.P., India.

E-mail: santhosh.maddisetty@gmail.com

Leave a Reply