 Open Access
 Total Downloads : 236
 Authors : P. Z. Piah, P. O. Asagba, V. Ejiofor, K. T. Igulu
 Paper ID : IJERTV5IS010586
 Volume & Issue : Volume 05, Issue 01 (January 2016)
 Published (First Online): 27012016
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
An Efficient Point Data Indexing Structure for Multidimensional Range Queries
P. Z. Piah
Department of Computer Science Kenule Benson SaroWiwa Polytechnic BoriRivers State, Nigeria
V. Ejiofor Department of Computer Science
Nnamdi Azikiwe University Awka, Nigeria
P. O. Asagba Department of Computer Science
University of Port Harcourt Port Harcourt, Nigeria
K.T. Igulu
Department of Computer Science Kenule Benson SaroWiwa Polytechnic BoriRivers State, Nigeria
Abstract Since the main memory is expensive and volatile persistence data cannot be kept in main memory. Most databases utilize the secondary/tertiary storage. The main overhead of using a secondary storage is access time. This is usually high in multidimensional databases like OLTP that are characterized by recurrent updates and queries. The use of secondary/tertiary storage inherently suggests indexing. Indexing helps to retrieve/store the required data/ data segments faster than iterating through the table. Indexing enhances speed of querying. In multidimensional databases (like OLAP databases) the essential tool for accessing data is the range query (or window query). In extant databases, BTrees and its variants are the convention for indexing. But the major setback of Btrees is that they are single attribute index structures. Which implies that the record of such database are ordered by a particular attribute usually but not necessarily the primary key. This limits range query restriction on one particular attribute of the table. The use of multiple Btrees indexing for various attributes of a table is the convention for achieving range query for other attributes. The use of multiindexing is additive and poses so many drawbacks (additional space required and speed is hampered). With the exponential burst of data, there is need for a better data structure with efficient query algorithm that has high storage capacity and also considerably fast (algorithm) for multidimensional range queries. This paper discusses a multidimensional indexing structure that is fast and also consumes virtually equivalent space as though is a single attribute structure. Experiment show that the structure has multiplicative complexities and is immune to the curse of dimensionality.
Keywords Range query, Indexing, multidimensional, database, Btree, UBTree.

INTRODUCTION
Complex enterprise applications for instance SAP HANA [1], data mining (DM), data warehousing (DW) and nonstandard DBMS applications such as geographical information systems (GIS) and statistical databases have spawned a strong demand for efficiency in the processing of extremely complex queries on large databases. These complex queries of course set new requirements on the query processing and indexing method algorithms for DBMSs. The main purpose of indexing a table of a database is to expedite query
execution. This is achieved by utilizing the constraints imposed by a query in order to condense the number of disk accesses. In multidimensional databases (like OLAP databases) the essential tool for accessing data is the range query (or window query). In extant databases, Bitmap, B Trees and its variant are the convention for indexing. But the major setback of Btrees is that they are single attribute index structure. Which implies that the record of such database are ordered by a particular attribute usually but not necessarily the primary key. This limits range query restriction on one particular attribute of the table. The use of multiple Btrees indexing for various attributes of a table is the convention for achieving range query for other attributes.
This paper discusses an access method (indexing structure) for efficient manipulation of range queries called the UB Tree. It organizes and clusters any table of a database by a certain computation that depends on some or all the attributes of the table. By this, data points that are close by this computation are stored closely in the storage. In other words, spatial proximity is maintained based on all the attributes or some of the attributes of the table.

QUERIES
Relational queries are primarily expressed by operators of the relational algebra and they operate either with a single table or span multiple tables [2]. Single table queries restrict, re arrange or aggregate the tuples of one relation [3]. A query is a predicate (x) over the tuples of a relation R. The result set (RS) of a query is the subset of tuples of R sufficiently satisfy the query predicate in (1). The result set size is the cardinality of the result set as given in (2).
RS(R,) = {xR (x)} (1)
RS(R,) (2)
A restriction query is a predicate (x) on the tuples x of a relation R. A restriction query can be to a pointexact match query or on some dimensionpartial match or partial range query. A range query is a special case of query with restrictions on all the dimensions. Reference [4] categorizes queries into single table queries and multiple table queries. Multiple table queries specifically join single table queries of
different tables. Single table queries are further categorized into restriction queries and queries of rearrangement which can be sorting, projection , grouping and aggregation. Partial range query is a type that gives restriction on some dimensions of the query and some unrestricted. This can be further categorized as exactmatch query or range query.
A. Range Queries
Range queries give restriction on all dimensions (attributes) of the query. From the geometric perspective, let P be a set of n points in Ãk (i.e. k dimensional Ãk= {d1, d2, d3, , dk }) and let D be a family of subsets of Ãk ( i.e.D Ãk). Let ri represent a range of the ith dimension (ri di). We call ri an interval represented by its lower and upper bound (ri =(l,h)). Elements of D are called ranges (i.e. D ={r1 , r2, , rk}). Let be points subsumed by the range set D. Given the above, the requirement is to build appropriate data structure that supports range reporting, range count (number of points in the result set or the cardinality of the result set), emptiness query (determining if the result set size is not zero) given in (3), (4) and (5) respectively.
P (3)
P  (4)
P= or P=0 (5)
An exact point query shall be regarded as a special case of range query whose intervals are equal values on the various dimensions. i.e. l=h for all the k dimensions. No two points on the plane have the same address (no two points have the same xand ycoordinates for 2d space). Range queries are a fundamental problem for all database systems. A range query is specified by an interval for each dimension. If no restriction for a dimension it formally means that the interval is (, +) i.e. unbounded. The query is the Cartesian product of the intervals for all dimensions, called the query box (QB) or Query Volume of Q with the lower and upper bounds ql and qh. The answer to Q is the set of data points objects in Q. Subsequently, this set will be called Result Set or simply RS of Q.

THE UBTREE
The UBtree is due to Markl [2] in their work on Mistral [3]. The UBtree as described in [2] is a structure to index multi dimensional data with linear complexities i.e. using a structure that has linear complexities. The UBTree exploits the capabilities of Btree and Zcurve [7]. Each multi dimensional data tuple is transformed into an integer (Z address), which is inserted into the Btree. Each node is a pair of integer ([:]) denoting the lower bound () and the upper bound () of a region on the plne respectively. It suffices to note at this point that the entire plane is regarded as a Z region (SuperZregion). For consistency, all regions will be regarded as simply Zregion. A leaf of the UBtree which is mapped to a Zregion of the curve holds data (points) or link to the data. Usually, a region mapped to a disk block (or page). Range query can be handled by retrieval points in regions that are perfectly subsumed by the query box or that intersect the query box. Figure 1 shows (a) typical 2D 8by8 space with 6 Zregions (b) UBtreenodes corresponding to the zregions in the space. The inner nodes of UBtree
recursively divides the space, such that a hierarchy of nested Zregions is formed.
Figure 1: Zcurve with Zregions and UBTree

Data Types and Address Calculation
The values of the attributes specified in query operations by a user are over specific domains (data types). Some of these domains have natural lexicographical ordering. Domains with explicit lexicographical ordering are directly converted to binary. The interleaving algorithm is then invoked on the binary equivalent of the values. Some domains require transformation to achieve uniqueness and ordering of values over such domain. To achieve this, transformation algorithms are utilized. Then the transformed equivalent is used for interleaving operation. In such domain, bitlexicographic order on the binary representation of tuples does not correspond to any natural ordering. Transformation is bijective. For Cartesian coordinates, the inverse function to the transformation function is applied after bit extraction. Figures 2a and 2b depict this concept. Character string can be transformed to binary by using the ASCII lexicographical ordering of the character. This will produce the binary equivalent of the character string and interleaving invoked appropriately. In programming language, ENUM type defaults to Integer. For the integer default, the ENUM values are mapped to their integer equivalents are interleaved appropriate. ENUM of character string will of course undergo two levels of transformation i.e. mapping and transformation itself.
Figure 2: (L): Address Calculation. (R): Cartesian Calculation

Address Calculation and Interleaving Operation
The address computation of the UBTree determines its efficiency. The UBTree utilizes the bitinterleaving paradigm. If each attribute xi of a ddimensional tuple (x1, x2,
…, xd ) consists of 2r values, it can be regarded as a sequence of bits xi,r … xi,1. Bitinterleaving produces an rdimensional tuple from the ddimensional tuple by rearranging the bits of the tuple in the following way[6]:
Interleaved,r (x1,r … x1,1, x2,r…x2,1 xd,r …xd,1)
=x1,r x2,r … xd,r, x1,r1…xd,r1, x1,1… xd,1) (6) The value produced by this computation is henceforth known as Zvalue. Thus interleave3,4 (1110,1010,0111) = (110,101,111,001). If the result of interleavedr is considered as binary number in the place of an rdimensional tuple, incrementing this number by 1 yields the ZAddress (Zaddr) of a tuple. Thus:
Zaddr (x1, x2, …, xd)= interleavedr (x1, x2 , …, xd )+ 1 (7)
Therefore Zaddr(14,10,7) = Zaddr(1110, 1010, 0111) = interleave3,4 (1110,1010,0111) + 1 = (110,101,111,001) + 1 =
(110,101,111,010) = 6.5.7.2. The Zaddress can be computed by a function given in [2].
s1
Zaddr(P)=
j 0
n
i1
Pi, j
2 jni1
(8)
Figure 5: Bitinterleaving in 3d tuple [5,4,6]

Rapid Hyper Quadrant Jump Addressing Approach
P , where the binary representation of each coordinate pi is denoted as pi= pi,s1, pi,s2, , pi,o. This function does the same thing as bitinterleaving. The inverse function to interleaved,r can be computed basically in the same way. This function is called (interleaved,r)1. Therefore interleaved,r (o) = interleaver,d (Zaddr(o) 1). Only a slight modification of the interleave operation is necessary to support a universe where the domain of each dimension does not consist of the same number of bits r. The domains of this work are of equal cardinalities (2r). The algorithm of bitinterleaving has the CPUcomplexity of O(d*r), where r stands the length of each attribute in bits and d is the dimensions. The same holds for (interleave)1. Switching a tuple between Cartesian representation and address representation can therefore be performed very efficiently. Figures 2 and 3 show the bitinterleaving operations in 2d and 3d respectively. The bitinterleaving does not deteriorate by high dimensionality (immune to the curse of dimensionality).
Figure 4: Interleaving process of 2d tuple [5,6]
This new approach was introduced in [4]. Investigation show that the approach is just a new method but not different from the original idea. In geometry, the term quadrant depicts an exactly defined quarter of 2dimensional universe. Equally, 1dimensional space can be divided into two halves, and for 3dimensional space eight octants of space. Common to most geometric constructs is a constraint to partition the universe. Moreover, the partitioning is done using halving the space in all existing dimensions. Such information can be generalized by definition of the hyperquadrant of ndimensional space.
Let =Dn be a vector space: The hyperquadrant (hquad) HQ is a subspace in ; i.e. HQ; such that. HQ=HD1Ã—HD2Ã—HDn ; where each domain HDi is the lower or the upper half of the domain D, i.e. HDi = low(D) or HDi = up(D). i.e. the lower and upper subspace in each dimension. Each ndimensional vector space is formed by its 2n disjoint hquad. i.e. there are distinct 2n disjoint hquads. If the hquad are designated with the 2n distinct possible values, a unique code will be computed for each hquad. For instance in n=2, i.e. 2d, there will be four quadrants designated 00, 01, 10 and 11. For 3d, there be 000, 001, 010, 011, 100, 101, 110,
Figure 6: 2d ZOrder curve of 2Ã—2, 4Ã—4, 8Ã—8, 16Ã—16 respectively
Figure 6: 2d ZOrder curve of 2Ã—2, 4Ã—4, 8Ã—8, 16Ã—16 respectively
111. The bitlength equals the dimension. Recall the geometric representation of the ZOrder curve which expands its resolution by halfsplitting on all the n dimensions. Figure 6 shows the 2d Zorder curve and figures 7 gives the 3d hquad representation.
Figure 7: 3d case with Visualization of the various Hquads.
The ZOrder curve is a finite image of its self by recursively splitting along the various dimensions of the plane. Since the ZOrder is an image of itself, we can jump from one supper quadrant to its specializations given the bit string address of such quadrant. For Zcurve in 2d, the first two bits of the Z address represent the super quadrant where the point can be found. The second two bit string specifically identifies the second level of the splitting and so on. The same applies to d dimensional space. This approach combines the interleaving and the searching together and has a great performance impact for all the operations on the UBTree[4].

Insertion Algorithm
A point P to be inserted into the universe is specified by its Cartesian coordinates (x1 x2 , …, xd) with address Zvalue = Zaddr(x1 x2 , …, xd). P belongs to the unique region [:] fulfilling Zvalue . It should be noted that Zvalue must be computed only to a precision sufficient enough to determine the proper region. The addresses are linearly ordered by . P is inserted into the leafpage corresponding to such region, which is found by an exactmatch query. Since pages can store only a maximum number M of pointers or objects, pages may overflow and are split like in Btrees. [:] is split by a new area with address satisfying < < . The region [:] is divided by into [:] and [+1:]. The objects or object pointers in page ([:]) are distributed onto page([:]) and page([+1:]) accordingly. is created by increasing area() as follows: Add to area subcues from [:] in increasing order until the number of the objects in [:] is between Â½M and Â½M+ . If the subcube that follows in this process contains much objects, it undergoes a recursive subdivision until the condition can be met. The parameter is used to get shorter split addresses, which are favorable for the UBTree performance especially of the range query algorithm[2].

Split Point Algorithm
The constraint of SFCs, (Zorder precisely) is that regions are disjoint (regions dont overlap). The nonoverlapping of regions has direct performance impact on range query. Range query is succinctly finding those region that intersect the query box (QB). If splitting is not optimal, the general performance of the range query degrades. Another effect of region or page splitting is to avoid post filtering as much as possible. If the split point for a split is not optimally computed, there could be overlapping of pages and this directly impacts the access time during an operation that affects such region. Reference[2] uses a tree to keep track of split points. Reference [4] showed that the underlining one dimensional tree used in [2] is redundant and introduced a new optimal split point algorithm. The algorithm proposed in

ensures that;

The left region should be half full.

A perfect partition that creates at least the maximum number of points larger page/region for the left page. After the split, the left region has a restriction of the number that can be inserted. This avoids early split.



Point Query Algorithm
Point Queries are also called "exactmatch queries". They are specified by the Cartesian coordinates (x1, x2, …, xd ) of the point P. In OLAP these coordinates are usually called dimensions or dimension attributes. To find P, the address z value := Zaddr(x1, x2, …, xd) of P with sufficient precision to find the unique region [:] with the property a*b and fetch page([:]). This is realized by searching the UBtree with the address Zvalue as the search key. page([:]) must contain point P with the additional information or the identifier Id(P) that is used to reference P. P can be found in O(logkN) time, where N is the number of objects in the universe and k is the order of the tree. Since UBtrees are balanced and searched exactly like the Btree used as the underlying data structure for the UBtree. Thus the point query performance of a UBTree is similar to that of a BTree index. The additional address calculation overhead is negligible.

Range Query Alogorithm
To answer a range query, only those regions, which perfectly intersect the query box, must be fetched from the database and thus from the disk. Initially the range query algorithm computes and retrieves the first region that is overlapped by the querybox. Then the next intersecting region is computed and retrieved. This is repeated until a minimal cover for the query box has been constructed, i.e., the region that contains the ending point of the query box has been retrieved. At first, Zaddress of the query box lower and upper bounds are computed. Using this value, a leaf page of the UBtree is retrieved and searched for relevant data tuples. Next, the following queryintersected leaf is retrieved and so on. The algorithm will finish as soon as the of the actual Zregion gets greater than Zaddress of the query box upper bound, i.e.
when > Zvalueh . With the exception of cases where the query box degenerates to a hyperplane of the universe, the number of regions, which must be fetched from the database, is for sufficiently large databases proportional to the volume of Q and therefore proportional to the size of the answer to the range query.


CONCLUDING REMARKS
Extant commercial DBMS do either not support multidimensional indexes at all or only use them as an add ons. A widespread approach to handling multidimensional range queries consists of the successive application of such single key structures, one for each dimension. Concatenated or compound single structures are used to handle multi dimensional range queries. Unfortunately, this approach can be very inefficient. Since each index is accessed independently of the others. Its apparent that high selectivity in one dimension to narrow down the search in the remaining dimensions is not possible. Generally, there is no easy and apparent approach to expand single key structures to handle multidimensional data. Other shortcomings of existing multidimensional indexing structures for databases is that they dont guarantee for spatial proximity and they suffer of the curse of dimensionality. Their performance also degrade because of memory consumption for creating indexes for the various attributes in the case of compound indexing and post
filtering (matching of tuples to build the result set). For concatenated indexing, queries favor the first attributes. More also every query using such indexing must include the first group of attributes. Experience also show that existing indexing structures for complex range queries possess additive complexity whereas the approach discussed here possesses multiplicative complexity which impacts performance. The enforcement of tuple clustering which also translates to page clustering based on majority dimensions (attributes) of the tuple reduces disk accesses for complex range queries.
REFERENCES

SAPGroup, 2015. [Online]. Available: www.sap.com.

V. Markl, "Processing relational queries using a multidimensional acces technique," Dissertations in Database and Information Systems Infix,, vol. 59, 1999.

"MISTRAL Project," 1999. [Online]. Available:
http://mistral.informatik.tumuenchen.de.

P.Z. Piah, An Efficient Range Searching Algorithm in Complex Geometric Space, PhD thesis, University of Port Harcourt, 2015.

T. Skopala, M. Kratkyb, J. Pokornya and V. Snaselb, "A new range query algorithm for Universal Btrees," Information Systems, vol. 31, p. 489511, 2006.

R. Bayer, "The universal Btree for multidimensional indexing: general concepts," in International Conference on Worldwide Computing and Its Applications, Springer, Berlin, 1997.

J. Lewder, "The Application of Spacethe Storage and Retrieval of Multidimensional Data," London, 1999.