Identifying Peculiarities To Exploit Perceptibility Of Widgets

M. Vijaya Lakshmi; D. Uma Devi

doi:10.17577/IJERTV1IS7490

Volume 01, Issue 07 (September 2012)

Identifying Peculiarities To Exploit Perceptibility Of Widgets

DOI : 10.17577/IJERTV1IS7490

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 54
Total Downloads : 276
Authors : M. Vijaya Lakshmi, D. Uma Devi
Paper ID : IJERTV1IS7490
Volume & Issue : Volume 01, Issue 07 (September 2012)
Published (First Online): 26-09-2012
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Identifying Peculiarities To Exploit Perceptibility Of Widgets

Identifying Peculiarities to Exploit Perceptibility of Widgets

M. Vijaya Lakshmi1, D. Uma Devi2

1M.Tech II year Student, Department of CSE, Sri Mittapalli College of Engineering,

2Associate Professor, Department of CSE, Sri Mittapalli College of Engineering,

In recent years, there has been important concentration in the expansion of ranking functions and capable top-k rescue algorithms to help users in ad-hoc search and rescue in databases (e.g., buyers searching for products in a catalog). We introduce a complementary problem: how to guide a seller in picking the best peculiarities of a new tuple (e.g., a new product) to highlight so that it stands out in the crowd of existing competitive products and is widely visible to the pool of potential buyers. We develop several formulations of this problem. Although the problems are NP-complete, we give several exact and approximation algorithms that work well in practice. One type of exact algorithms is based on Integer Programming (IP) formulations of the problems. Another class of exact methods is based on maximal regular itemset mining algorithms. The approximation algorithms are based on greedy heuristics. A detailed performance study illustrates the benefits of our methods on real and synthetic data.

Key Words: Data mining, information and facts, engineering tools and procedures, advertising, withdrawal methods and algorithms, rescue designs.

In recent years, there has been significant interest in developing effective procedures for ad-hoc search and rescue in unstructured as well as structured data repositories, such as text collections and relational databases. In particular, a large number of emerging applications require exploratory querying on such databases; examples include users wishing to search databases and catalogs of products such as mobiles, cars, cameras, restaurants, or articles such as news and job ads. Users browsing these databases typically execute search queries via public front-end interfaces to these databases. Typical queries may specify sets of keywords in case of text databases, or the desired values of certain peculiarities in case of structured relational databases. The query-answering system answers such queries by either returning all data objects that satisfy the query conditions, or may rank and return the top-k data objects, or return the results that are on the querys skyline. If ranking is employed, the ranking may either be simplistic e.g., objects are

ranked by an attribute such as Price; or more sophisticated e.g., objects may be ranked by the degree of relevance to the query. While unranked rescue (also known as Boolean Rescue) is more common in traditional SQL-based database systems, ranked rescue (also known as Top-k Rescue) is more common in text databases, e.g. tf-idf ranking [20]. Recently there has been widespread interest in developing suitable top-k rescue procedures even for structured databases [1, 7, and 30]. Skyline rescue semantics is also investigated where a data point is retrieved by a query if it is not dominated by any other data point in all dimensions [4, 19, 22, 25, 29, and 31].

Screening Peculiarities for greatest Perceptibility: We distinguish between two types of users of these databases: users who search such databases trying to locate objects of interest, and users who insert new objects into these databases in the hope that they will be easily discovered by the first type of users. For example, in a database representing an e-marketplace the former type of users are potential buyers of products; while the latter type of users are sellers of products. Almost all of the prior research efforts on effective search and rescue procedures such as new top-k algorithms, new relevance measures, and so on have been designed with the first kind of user in mind (i.e., the buyer). In contrast, less research has been addressing procedures to help a seller/manufacturer insert a new product for sale in such databases that markets it in the best possible manner i.e., such that it stands out in a crowd of competitive products and is widely visible to the pool of potential buyers.

It is this latter problem that is the main focus of this paper. To understand it a little better, consider the following scenario: assume that we wish to insert a classified ad in an online newspaper to advertise about cars, mobiles and cameras for sale. Our products may have numerous peculiarities (it will have power window, four door, petrol, diesel, etc). However, due to the ad costs involved, it is not possible for us to describe all peculiarities in the ad. So we have to select, say the ten best peculiarities. Which ones should we select? Thus, one may view our effort as an attempt to build a recommendation system for sellers, unlike the more traditional recommendation systems for buyers. It may also be viewed as inverting a ranking function, i.e., identifying the argument of a ranking function that will lead to high ranking scores.

This general problem also arises in domains beyond ecommerce applications. For example, in the design of a new product, a manufacturer may be interested in picking the ten best features from a large wish-list of possible features e.g., new car can have the feature of high mileage. To define our problem more formally, we need to develop a few abstractions. Let D be the database of products already being advertised in the marketplace (i.e., the competition). In addition, let Q be the set of search queries that have been executed against this database in the recent past thus Q is the workload or query register. For a new product that needs to be inserted into this database, we assume that the seller has a complete ideal description of the product. Our problem can now be defined as follows:

Given a database D, a query register Q, a new tuple t, and an integer m, determine the best (i.e., top- m) peculiarities of t to retain such that if the shortened version of t is inserted into the database, the number of queries of Q that retrieve t is exploited.

We consider several variants, including Boolean, categorical, text and numeric data, and conjunctive and disjunctive query semantics. We also consider variants in which the budget, i.e., m, is not specified; in this case our objective is to determine the value of m such that the number of satisfied queries divided by m is exploited. Thus we seek to exploit the per dollar benefit. A special case of this no- budget variant is when ranking is performed using functions that are non-monotone on the number of specified peculiarities (keywords), such as the BM25 [18] scoring function used in Information Rescue.

We analyze the computational complexity of these problems, and show that most variants are NP- complete. We develop two types of methods yielding optimal solutions: (a) procedures based on Integer Programming (IP) and Integer Linear Programming (ILP) methods, which work well for moderate-sized problem instances, and (b) more scalable solutions based on novel adaptations of maximal regular set algorithms that also allow us to leverage several preprocessing opportunities.

Main Contributions:

The main contributions of this paper may be summarized as follows:
1. We introduce the problem of identifying peculiarities of a tuple for exploit perceptibility as a new data exploration problem. We consider several interesting variants of the problem as well as diverse application scenarios.
2. We analyze the computational complexity of the different variants ofthe problem and show that most of them are NP-complete.
3. We develop optimal Integer Programming (IP) and Integer Linear Programming (ILP) based algorithms to solve certain variants of the problem. These algorithms are effective for moderate-sized problem instances.
4. For certain problem variants, we also develop more scalable optimal solutions based on novel adaptations of maximal regular itemset algorithms. Furthermore, in contrast to ILP-based solutions, we can leverage preprocessing opportunities in these approaches.
5. We also develop fast greedy approximation algorithms that work well for all problem variants.
6. We perform detailed performance evaluations on both real as well as synthetic data to demonstrate the effectiveness of our developed algorithms.
First we provide some useful definitions.

Boolean Database: Let D = {t1tN} be a collection of Boolean tuples over the attribute set A = {a1aM}, where each tuple t is a bit-vector where a0 implies the absence of a feature and a1 implies the presence of a feature. A tuple t may also be considered as a subset of A, where an attribute belongs to t if its value in the bit- vector is 1.

Tuple Domination: Let t1 and t2 be two tuples such that for all peculiarities for which tuple t1 has value 1, tuple t2 also has value 1. In this case we say that t2 dominates t1.

Tuple Compression: Let t be a tuple and let t be a subset of t with m peculiarities. Thus t represents a compressed representation of t. equally, in the bit- vector representation of t, we retain only m 1s and convert the rest to 0s.

Query Register: Let Q = {q1qS} be collection of queries where each query q defines a subset of peculiarities.

In this section we formally define the main problem for Boolean data. In Section 3.1 we formally define the problem and in Section 4 we provide algorithms for this problem variant.

Conjunctive Boolean Rescue: We view each query as a conjunctive query. A tuple t satisfies a query q if q is a subset of t. For example, a query such as {a1, a3} is equivalent to return all tuples such that a1= 1 and a3 = 1. Alternatively, if we view q as a special type of tuple, then t dominates q. The set of returned tuples R (q) is the set of all tuples that satisfy q. In this problem variant as well as in most of the variants defined later, our task is to compress a new tuple t by retaining the best set of m peculiarities (i.e., top-m peculiarities) such that some criterion is optimized.

3.1 Problem Definition

Conjunctive Boolean – Query Register (CB-QR): Given a query register Q with Conjunctive Boolean Rescue semantics, a new tuple t, and an integer m, computes a compressed tuple t having m peculiarities such that the number of queries that retrieve t is exploited.

Intuitively, for buyers interested in browsing products of interest, we wish to ensure that the compressed version of the new product is visible to as many buyers as possible. This problem also has a per- attribute version where m is not specified; in this case we may wish to determine t such that the number of satisfied queries divided by |t| is exploited. Intuitively, if the number of peculiarities retained is a measure of the cost of advertising the new product, this problem seeks to exploit the number of potential buyers per advertising dollar.

CAM ID	Flash	Pixel	Battery	TV Kit	Memory
t1	0	1	0	1	0
t2	0	1	1	0	0
t3	1	0	0	1	1
t4	1	1	0	1	0
t5	1	1	0	0	0
t6	0	1	0	1	0
t7	0	0	1	1	0

DATABASE D

Query ID	Flash	Pixel	Battery	TV Kit	Memory
q1	1	0	1	1	1
q2	1	1	0	0	0
q3	0	1	1	0	0
q4	0	0	1	0	1
q5	1	1	0	1	0

Query Register Q

New

CAM

Flash

Pixel

Battery

TV

Kit

Memory

t

1

0

1

New tuple t to be inserted

Fig 1: Illustrating the Fundamentals

In this section we discuss our main algorithmic results for the main problem variant discussed in Section 3.
CB-QR has a per attribute variant, where m is not provided as an input, and we have to determine the best m such that number of satisfied queries divided by m is exploited. This variant can be simply solved by trying out values of m between 1 and M and making M calls to any of the algorithms discussed above, and picking the solution that exploits our objective. Since we adopt this general strategy for all per-attribute problem variants, we do not discuss such variants any further in this paper.

dimension of the Boolean database is enormous. Consequently, none of the optimal algorithms, either IP-based or regular itemset-based, are feasible for text data. Fortunately, the greedy heuristics we have developed scale very well with reasonable results, as described in the experiments section. The second issue is that some of the scoring function that are used in text data e.g., the BM25 scoring function that takes into account the document length (size of compressed tuple t) are non-monotonic on the number of keywords added. In particular, adding a query keyword to t may decrease its BM25 score if this keyword has very low inverse document frequency (idf). Consequently the per-attribute versions of our various problem variants are of interest.

Fig 5: Algorithm MaxFreqItemSets-CB-QR
5.1 Text Data Problem Definition

A text database consists of a collection of documents, where each document is modeled as a bag of words as is common in Information Rescue. Queries are sets of keywords, with top-k rescue via query- specific scoring functions, such as the tf-idf-based BM25 scoring function [18]. This problem arises in several applications, e.g., when we wish to post a classified ad in an online newspaper and need to specify important keywords that will enable the ad to be visible to the maximum number of potential buyers. A subtle point is that, due to the non-monotonicity of many IR ranking functions, it is possible that a top-m1 tuple compression is worse (fewer queries retrieve document t) than a top-m2 compression, where m1 > m2. The attribute selection problem for text data is also NP-complete as it can be converted into Boolean problem considering each keyword as a Boolean attribute. The problem is NP-hard for the case of monotone ranking function, as in CB-QR. Hence, it is also NP-hard for the more complex non-monotonic ranking functions.

5.2. Algorithms for Text Data

As discussed above, text data can be treated as Boolean data, and all the algorithms developed for Boolean data can be used for text data. There are two issues that we wish to highlight, however. One is that if we view each distinct keyword in the text corpus (or query register) as a distinct Boolean attribute, the
In this section we describe the experimental setting and the results. Our main performance indicators are (a) the time cost of the proposed optimal and greedy algorithms, and (b) the approximation quality of the greedy algorithms, for the CB-QR and text data problem variants presented in Sections 3 and 5 respectively.

System Configuration: We used Microsoft SQL Server 2000 RDBMS on a P4 3.2-GHZ PC with 1 GB of RAM

and 100 GB HDD for our experiments. Algorithms are implemented in C#, and connected to RDBMS through ADO.

Datasets: We used two datasets, a cars dataset for the Boolean data experiments (Section 6.1), and a publications titles dataset for the text data experiments (Section 6.2). In particular, we use an online used-cars dataset consisting of 15,191 cars for sale in the Dallas area extracted from autos.yahoo.com. In the synthetic workload, each query specifies 1 to 5 peculiarities chosen randomly distributed as follows: 1 attribute 20%, 2 peculiarities 30%, 3 peculiarities 30%, 4

peculiarities 10%, 5 peculiarities 10%. We assume that most of the users specify two or three peculiarities.
In this paper, we are describing how a query will be processed. According to that, some of the screens are given here for easy resemblance:

Fig 12: How the buyer is going to observe the peculiarities.

Fig 12 represents when a user logins into the online with the given registered userid and password. Then the user will get the screen with different widgets having the peculiarities.

Fig 13: Details of peculiarities for a CAR

Fig 13 represents the selection of a widget from the list. In the above list, it represents the peculiarities of the CAR widget. So, it gives the specifications of each and every widget.

Fig 14: Seller adding the details

Fig 14 represents the screen of the seller for adding the details of a widget. This is very important to seller for updating the peculiarities according to the widgets added.

Fig 15: Priorities according to the usage of peculiarities
A large corpus of work has tackled the problem of ranking the results of a query. In the documents world, the most popular procedures are tf-idf based [20] ranking functions, like BM25 [18], as well as link-

structure-based procedures like PageRank [2] if such links are present (e.g., the Web). In the database world, automatic ranking procedures for the results of structured queries have been proposed [1, 7, and 30]. Also there has been recent work [3] on ordering the displayed peculiarities of query results.

Both of these tuple and attribute ranking procedures are inapplicable to our problem. The former inputs a database and a query, and outputs a list of database tuples according to a ranking function, and the latter inputs the list of database results and selects a set of peculiarities that explain these results. In contrast, our problem inputs a database, a query register, and a new tuple, and computes a set of peculiarities that will rank the tuple high for as many queries in the query register as possible.

Although the problem of choosing peculiarities is seemingly related to the area of feature selection [5], our work differs from the work on feature selection because our goal is very specific to enable a tuple to be highly visible to the database users and not to reduce the cost of building a mining model such as classification or clustering.

Kleinberg et al. [12] present a set of microeconomic problems suitable for data mining procedures; however no specific solutions are presented. Their problem closer to our work is identifying the best parameters for a advertising strategy in order to exploit the attracted customers, given that the competitor independently also prepares a similar strategy. Our problem is different since we know the competition. Another area where boosting an item's rank has received attention is Web search, where the most popular procedures involve manipulating the link structure of the Web to achieve higher perceptibility [8].

Integer and linear programming optimization problems are extremely well studied problems in operations research, management science and many other areas of applicability (see recent book on this subject [19]). Integer programming is well-known to be NP-hard [6]; however carefully designed branch and bound algorithms can capability solve problems of moderate size.

Computing regular itemsets is a popular area of research in data mining and some of the best known algorithms include Apriori [1] and FP-Tree [9]. Several papers have also investigated the problem of computing maximal regular itemsets [3, 5, 12, 14, and 17]. Almost all the popular approaches are designed for sparse datasets and do not work well for our unique problem of dense datasets. Apriori [1] employs a bottom-up, breadth first search that enumerates every single regular itemset. In many applications (especially in dense data) with long regular patterns enumerating all possible subsets of an M length pattern (M can easily be 50 or 60 or longer) is computationally unfeasible. Also, we are not interested in mining all regular itemsets, but only maximal regular itemsets in our algorithm. A

known approach for mining maximal regular itemsets is the complete random walk [7], which is a bottom-up approach. To see this, consider a table with 50 peculiarities, and assume we need to determine a compressed tuple t with 10 peculiarities. Now, we need to know the itemset of ~Q (complemented query register which is a dense dataset) of size 40 with maximum frequency. Due to the dense nature of ~Q, the bottom-up approach will not be able to compute regular itemsets beyond a size of 5-10. Likewise, other approaches for mining maximal regular itemsets such as the Genetic Algorithm (GA) based approach [11] is also mainly intended for sparse dataset and does not work well for dense dataset. In contrast, our proposed method works well for dense dataset. The recent works
[14] and [13] are related to our work. The former tries to find out the dominant relationship between products and potential buyers where by analyzing such relationships, companies can position their products more effectively while remaining profitable. The latter introduces skyline query types taking into account not only min/max peculiarities (e.g., price, weight) but also spatial peculiarities and the relationships between these different attribute types. Their work aims at helping manufacturers choose the right specs for a new product, whereas our work to choose the peculiarities subset of an existing product for advertising purposes.

In previous work [16], we tackled the main variant of the problem with Boolean conjunctive query semantics where a tuple satisfies a query if all the peculiarities present in query are also present in the tuple (Section 3). We extend the idea in the current paper. We consider both the database (existing products) and query register with various query semantics (conjunctive, top-k, skyline, negations, etc.).

Several procedures have been proposed for capable skyline query processing [4, 19, 25, and 31]. There has been recent work on categorical skylines [21] and skyline computation over low cardinality domains [15] that also considers skyline or Boolean data as well. One main difference of our work with the existing works is that our goal is not to propose a method for processing or maintaining the skylines, instead we use skylines as a query semantic where a new tuple can be visible for maximum number of queries. Another related work is mining top-k regular itemsets without minimum support verge [10] which finds top-k closed regular itemsets. Also it is not proven that the top-k approach works well for dense dataset. The top-k approach without minimum support verge [10] finds top-k regular closed patterns of length no less than min_l, where min_l is the minimal length of each pattern.

In this work we introduced the problem of picking the best peculiarities of a new tuple, such that this tuple will be ranked highly, given a dataset, a query register,

or both, i.e., the tuple stands out in the crowd. We presented variants of the problem for Boolean, categorical, text and numeric data, and showed that even though the problem is NP-complete in most cases; optimal algorithms are feasible for small inputs. Furthermore, we present greedy algorithms, which are experimentally shown to produce good approximation ratios. After all, a query register is only an approximate surrogate of real user preferences, and moreover in some applications neither the database, nor the query register may be available for analysis; thus we have to make assumptions about the nature of the competition as well as about the user preferences. Finally, in all these problems our focus is on deciding what subset of peculiarities to retain of a product. We do not attempt to suggest what values to set for specific peculiarities, which is a problem tackled in advertising research, e.g., [17]. However, while we acknowledge that the scope of our problem definition is indeed limited in several ways, we do feel that our work takes an important first step towards developing principled approaches for attribute selection in a data exploration environment.

The authors gratefully acknowledge the contributions of the reviewers. Their comments have considerably improved the quality of this article. We, authors express gratitude to all the anonymous reviewers for their affirmative annotations among our paper.

R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In U.M. Fayyad, G. Piatetsky- Shapiro, P. Smyth, and R. Uthurusamy, (eds.), Advances in Information Discovery and Data Mining, pp. 307-

328. AAAI/MIT Press, 1996.
S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW Conference, 1998.
Gautam Das, Vagelis Hristidis, Nishant Kapoor, S. Sudarshan. Ordering the Peculiarities of Query Results. SIGMOD, 2006.
Good, I., The population frequencies of species and the estimation of population parameters, Biometrika, v. 40, 1953, pp. 237-264.
Isabelle Guyon and Andre Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(mar): 2003.
Michael R. Garey and David S. Johnson (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman. ISBN 0-7167-1045-5.
D. Gunopulos, R. Khardon, H. Mannila, S. Saluja,

H. Toivonen, R. S. Sharm: Discovering all most specific sentences. ACM TODS. 28(2): 2003
M. Gori and I. Witten. The bubble of web perceptibility. Commun. ACM 48, 3 (Mar. 2005), 115-117.
Jiawei Han, Jian Pei, Yiwen Yin: Mining Regular Patterns without Candidate Generation. SIGMOD 2000: 1-12.
Jiawei Han, Jianyong Wang, Ying Lu, and Petre Tzvetkov: Mining top-k regular closed patterns without minimum support, ICDM 2002.
Jen-peng Huang, Che-Tsung Yang, Chih-Hsiung Fu: A Genetic Algorithm Based Searching of Maximal Regular Itemsets. ICAI 2004.
J. Kleinberg, C. Papadimitriou and P. Raghavan.A Microeconomic View of Data Mining. Data Min. Knowl. Discov. 2, 4 (Dec. 1998).
Cuiping Li, Beng Chin Ooi, Anthony K. H. Tung, Shan Wang: DADA: a Data Cube for Dominant Relationship Analysis. SIGMOD 2006.
Cuiping Li, Anthony K. H. Tung, Wen Jin, Martin Ester: On Dominating Your Neighborhood Profitably. VLDB 2007: 818-829
Michael D. Morse, Jignesh M. Patel, H. V. Jagadish: Capable Skyline Computation over Low- Cardinality Domains. VLDB 2007.
Muhammed Miah, Gautam Das, Vagelis Hristidis, Heikki Mannila: Standing Out in a Crowd: Selecting Attributes for Maximum Visibility.

ICDE 2008: 356-365
Thomas T. Nagle, John Hogan. The Strategy and Tactics of Pricing: A Guide to Growing More Profitably (4th Edition), Prentice Hall, 2005.

[18]S E Robertson and S Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted rescue. SIGIR 1994.

[19] Alexander Schrijver: Theory of Linear and Integer Programming. John Wiley and Sons. 1998.

[20]G. Salton. Automatic Text Processing: The Transformation, Analysis, and Rescue of Information by Computer. Addison Wesley, 1989.

[21]Nikos Sarkas, Gautam Das, Nick Koudas, Anthony

K. H. Tung: Categorical skylines for streaming data. SIGMOD Conference 2008: 239-250

Identifying Peculiarities To Exploit Perceptibility Of Widgets

Leave a Reply