 Open Access
 Total Downloads : 13
 Authors : Ritika Singhal , N. Deepika
 Paper ID : IJERTCONV3IS21010
 Volume & Issue : NCAISE – 2015 (Volume 3 – Issue 21)
 Published (First Online): 24042018
 ISSN (Online) : 22780181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
TriLayer Resolution: A Coalesced Approach for Indexing Topk Queries
Ritika Singhal Computer Science and Engineering
New Horizon College of Engineering Bangalore, INDIA
N. Deepika
Sr. Asst. Professor Computer Science and Engineering New Horizon College of Engineering
Bangalore, INDIA
Abstract – In the world of changing scenarios and uncertainty, users sometimes specify values for certain attributes in a query, without asking for the exact match in return. The result is termed as a ranked query or a topk query. In this paper, we go for advancement in indexing such ranked queries on the basis of Partitions and Layered techniques. We employ horizontal partitioning first, thereby reducing size of datasets. This partitioning technique is well versed in Oracle and even in Hive. Then the dualresolution layer that consists of coarselevel and finelevel layers, is applied. So we propose a combinatorial technique to make the processing of topk queries faster and more efficient.
Keywords topk query, partitioning, skyline, convex skyline

INTRODUCTION
Approximation and Generalization are the keywords for the Ranked queries. The user specifies a ranking function or a scoring function. This functions value is calculated on every data row for a specified set of attributes and the output is that each data row has a rank associated with it. (Something similar to applying a hash function) Then the topk rank holders are returned as the desired output.
Example 1.1 Consider a user interested in finding a Hotel room in a city where combined cost of the room and the cost of travelling to the famous tourist attraction is minimum. The user is only interested in five such Hotels.
So here, the scoring function is:Min(room price, travel price). This can be depicted graphically as shown in Fig. 1. In the figure, w specifies the scoring function given by the user. The dots represent some hotels. The scoring function is calculated as an average of both the coordinates.
The naÃ¯ve way to answer this is to perform a join of rows with cheapest room rates from Hotel database and rows of least distance to tourist point from Tourism Database. But this is indeed costly. In general terms, to process a topk query, a naive method would calculate the score of each tuple according to the score function, andthen, finds the top k tuples by sorting the tuples based on their scores. This method, however, is not appropriate for a querywith a relatively small value of k over large databases because it incurs a significant overhead by reading even those tuples that cannot possibly be the results.
Fig. 1. Sample DataSet
There have been a number of methods proposed to efficiently answer topk queries by accessing only a subset of the database instead of unnecessarily accessing the entire one.The techniques formulated till now for such efficient processing of topk queries can be categorized into 3 approaches:

Layerbased approach:
This approach constructs a global index by dividing the records into consecutive layers. These layers are divided on the basis of all attributes. There exists oneto one mapping between a tuple and a record in certain layer. This mapping is the key of traversal. Any topk query can be answered by fetching records from at most klayers. ONION[2] and AppRI use this approach.
ONIONbuilds the index by making layers with the vertices of the convex hulls over the set of tuples represented as point objects in the multidimensional space. That is, it makes the first layer with the convex hull vertices over the entire set of tuples, and then, makes the second layer with the convex hull vertices over the set of remaining tuples, and so on. As a result, an outer layer geometrically encloses inner layers.
AppRI constructs the database in much finer layers than ONION does by precomputingthe domination relation for all the pairs of tuples in the database.

Listbased approach:
This approach uses sorted lists for each attribute. The result for the topk query is fetched by aggregating the
values from the lists searched according to the attributes given in the query. The search goes through the list in RoundRobin fashion. The lists are independent of each other. Also the efficiency of this approach decreases as the number of attributes mentioned in the queryincreases. The Threshold Algorithm (TA) uses this approach.TA and its variants are useful for a distributed database environment where exploiting the attribute correlation is difficult due to vertical partitioning. However, the performance of these methods are penalized due to lack of exploiting the attribute correlation

Viewbased approach:
This approach uses lots of space. The answers to a class of queries on every subset of attributes is pre computed and stored. Whenever the query demands the topk answers, the precomputed results are scanned and the query is answered. If there are no exact precomputed matches, then the nearest matching answers are returned. As a result this approach has no guarantee for efficiency. PREFER[6] and LPTA uses this approach.
Out of all the above approaches, the layered approach is the most efficient one. In this paper, we first focus on reducing the size of concerned dataset by partitioning. Then we go for an optimized layer based approach which guarantees that the first k layers include topk answers regardless of the scoring function. This approach focuses on the relationship between tuples in adjacent layers at a finer granularity.
So the TriLayer approach has following three layers:
(1) Partitioning the dataset on the basis of some partitioning columns or keys. (2) Coarselevel layers build on skylines which exploit dominance relationships. (3) Finelevel layers build on convex skylines which exploit relaxed dominance relationships.


PRELIMINARIES
We define some notations and definitions before we formally define the problem.
Let R be a relation containing tuples ti with cardinality n. The attributes Ai are defined over domains dom(Ai) and degree of R is d. A topk query also consists of a linear combination scoring function
=
combination function F. A convex skyline is a set of convex skyline tuples, denoted by CSKY(R).
Definition 5(Convex Hull)[8]: The convex hull of a set X of points is the smallest convex set that contains X. An object is convex or belongs to a convex set if for every pair of points within the object; every point on the straight line segment that joins the pair of points is also within the object.
TABLE I. Description of Notations
Symbols
Semantics
R
The target relation
n
Cardinality of R
d
The number of attributes in R or the degree of R
Ai
The ith attribute of R
ti
A tuple in R
w[i]
The weight vector over the ith attribute
Li or Li
The ith layer

COALESCING LAYERS
In this section, we study how partitioning can be done and then we move on to the twolevel layers[1], which can work on each partition. The result is an efficient retrieval of topk queries.

Partitioning
Partitioning allows tables and indexes to be subdivided into smaller pieces, enabling these database objects to be manges and accessed at a finer level of granularity. Each partition has its own name and some storage characteristics. Each row in a partitioned table is unambiguously assigned to a single partition. The partitioning key is comprised of one or more columns that determine the partition where each row will be stored.
Partitioning can be done via three techniques:

Range: in this the records are classified into partitions based on some ranges of values of the partitioning key. This is most common form of partitioning.

Hash: It maps data to partitions based on the hashing algorithm applied to the partitioning key. The hashing algorithm evenly distributes rows among partitions, giving partitions approximately the same size.

List: it enables the user to explicitly control the
Where wi is the userspecific
=1
weight function.
mapping of records into partitions by specifying a list of discrete values for the partitioning key in
Also, =1 = 1. Thus the scoring function F preserves
the monotonicity.
Definition 1(Topk Query): Given a scoring function F and the retrieval size k, a topk query returns an ordered set of tuples such that 1 ( ).
Defintion 2(Dominant tuple): A tuple t dominates another
tuple t if t is better than t in all the attributes.
Definition 3(Skyline)[1]: A tuple t is a skyline if any other tuple t does not dominate t on all attributes Ai. A skyline is a set of skyline tuples, denoted as SKY(R).
Definition 4(Convex Skyline)[1]: A tuple t is a convex skyline if it has the minimum score for any linear
the description of each partition.
Partitioning is well supported with Oracle. If we are dealing with Big Data, then Hive also lets the user to create partitions.
The main advantage of partitioning in topk query retrieval is that the dataset to be worked upon is reduced in size. Hence the access to the required records is faster. Further, each partition can be added, modified or dropped according to the partitioning key as per the trend of user queries.
Considering the same example, partitioning can be shown graphically as in Fig. 2.
Fig. 2. Partitioned DataSet
In Fig. 2. the dataset is divided in four different partitions. These partitions contain different tuples distributed based on some partitioning key. The selection of the partitioning key and the partitioning technique is dependent on the database administraor.
This partitioning forms the first layer in topk query traversal. The basic notion is based on the heuristics that the user queries will point to some common partitioning keys. Partitioning the dataset into layers, thus, involves some manual intervention also.


Twolevel Layer
The dualresolution layer is then built on every partition in concern. The dualresolution layer has the following two phases:

Coarselevel layer construction: in this the skyline layers are built sequentially.The first layer L1 is the Skyline SKY(R).Considering our example, one particular partition can have 3 layers as shown.
L1 = {k} L2={i,j}
L3={f,b}
Fig. 3. Skyline Layers in one partition Fig. 4.

Finelevel layer construction: We split the coarse level layer into consecutive convex layers at a finer granularity. Let Lij be the jth finelevel layer Li1 in Li coarselevel layer. The first finelevel layer Li1 in Li is the convex skyline CSKY(Li).

When these layers are built up, each tuple in R is associated with a certain layer. Then, given a scoring function F and retrieval size k, the topk answers are identified by traversing the layers through relationships between tuples. In this process, some tuples are selectively accessed as the candidate tuples of the topk answers. The relationships between the tuples in these layers hold an important place. To define these relationships, first, the dominance relationship for coarselevel layers can be used in the dominant graph, called DG[3].
Dominance between coarse layers:For any scoring function F, if t Li dominates tLi+1, the score F(t) of t is always smaller than the score F(t) of t. This relationship is also known as dominance.
Second, a relaxed dominance relationship between adjacent finelevel layers can be used for further selective access. Specifically, the coarselevel layer is divided into multiple finelevel layers. In this case, because no dominance relationship holds for adjacent finelevel layers, we adopt the relaxed dominance relationship for the fine level layers.
We introduce an dominance set (EDS). We consider a hyperplane H spanning a set of d tuples in ddimensional space. Given a hyperplane H and a tuple t, it is said that the tuple set on hyperplane H is an EDS of tuple t if H is closer to the origin than tuple t. It is said that every tuple t
EDS dominates t.
dominance between fine layers:For any scoring function F, if t Lijdominates tLi(j+1), the score F(t) of t is smaller than the score F(t) of t.
Basically, the dominance sets for Lijhave to encompass all tuples inLi(j+1), i.e., every tuple inLi(j+1)is connected by the dominance relationship for any tuple in Lij.
Considering our example, the dualresolution is constructed as shown in Fig. 4.
Fig. 5. Dual Resolution Layer Construction over one partition
The entire data in the concerned partition is divided into three coarselevel layers(shown by solid boxes. The relationships between these layers are shown by the solid arrows ie. dominance relationships. Each coarse level layer is then divided into multiple finelevel layers (light smaller boxes). Then dominance relationships between finelevel layers are shown by dotted arrows.
Topk query processing over the layerbased index can be viewed as a graph traversal problem. The tuples are then accessed on the basis of a filtering condition. Each tuple is now accessed on the basis of a status. This status is decided as follows.
Definition 6 (dominancefree). A tuple tLi+1is – dominancefree, if (1) t_ has no connected edge from t
Lior (2) every tuple t Li connected to tis in the top(k 1)answers.
Definition 7 (dominancefree). A tuple tLi(j+1) is – dominancefree, if (1) thas no connected edge from Lij or(2) any tuple t Lij connected to tis in the top(k 1)answers.
Filtering condition:A tuple t has to be only accessed if t is both dominancefree and – dominancefree.
In this way, the retrieval of topk queries is optimized and made more efficient with a threelayer resolution.


CONCLUSION
This paper has studied the problem of designing a layer based index for supporting scalable topk query computation.In particular, we focused on the optimization goal of making selective access to each layer in a specific partition. To pursue this optimization goal, we first reduced our dataset by choosing only one or more concerned partitions. Then we designed a twolevel layer, in which each coarselevel layer was further divided into multiple finelevel sub layers at a finer granularity.
ACKNOWLEDGMENT
This paper is written for the purpose of paper submission in IJERT. The authors thank the reviewers for their feedback and to all the people involved in this study.
REFERENCES

J.Lee, H.Cho,S.Lee and Swon Hwang, Towards ScalableIndexing for Topk Queries in IEEE Transactions on Knowledge and data engineering, Vol 26, No. 12, Dec 2014.

Y.C. Chang et al., The onion technique: Indexing for linearoptimization queries, in Proc. ACM SIGMOD, Dallas, TX, USA,pp. 391402, 2000.

L. Zou and L. Chen, Dominant graph: An efficient indexingstructure to answer topk queries, in Proc. IEEE 24th ICDE, 2008,pp. 536545.

J. Heo, J. Cho, and K. Whang, The hybridlayer index: Asynergic approach to answering topk queries in arbitrary subspaces,n Proc. IEEE 26th ICDE, Long Beach, CA, USA, 2010,pp. 445448.

L. Zou and L. Chen, Paretobased dominant graph: An efficientindexing structure to answer topk queries, IEEE Trans. Knowl.Data Eng., vol. 23, no. 5, pp. 727741, May 2011.

V. Hristidis, N. Koudas, and Y. Papakonstantinou, PREFER: Asystem for the efficient execution of multiparametric rankedqueries, in Proc. 2001 ACM SIGMOD, pp. 259270.

J.S. Heo, K.Y. Whang, M.S. Kim, Y.R. Kim, and I.Y. Song, Thepartitionedlayer index: Answering monotone topk queries usingthe convex skyline and partitioningmerging technique, Inf. Sci.,vol. 179, no. 19, pp. 32863308, 2009.

http://en.wikipedia.org/wiki/Convex_hull
Ritika Singhal is an M.Tech. student with New Horizon College of Engineering, Bangalore, India. She received her B.Tech. degree in Information Technology at Kurukshetra University, Kurukshetra, India. Her current research interests include database systems and query languages.
N. Deepika is a Senior Assistant Professor with Department of Computer Science and Engineering at New Horizon College of Engineering, Bangalore, India. She received the M.Tech. degree in computer science from JNTU, Hyderabad, India. Her current research areas include data mining, web mining and computer networks.