Preparation of Data Sets for Data Mining Analysis using Horizontal  Aggregation in SQL and Optimization of Horizontal Aggregation

Ms Vidya Bodhe; Prof Jyoti Mankar

doi:10.17577/IJERTV1IS10542

Volume 01, Issue 10 (December 2012)

Preparation of Data Sets for Data Mining Analysis using Horizontal Aggregation in SQL and Optimization of Horizontal Aggregation

DOI : 10.17577/IJERTV1IS10542

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 97
Total Downloads : 334
Authors : Ms Vidya Bodhe, Prof Jyoti Mankar
Paper ID : IJERTV1IS10542
Volume & Issue : Volume 01, Issue 10 (December 2012)
Published (First Online): 29-12-2012
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Preparation of Data Sets for Data Mining Analysis using Horizontal Aggregation in SQL and Optimization of Horizontal Aggregation

Ms Vidya Bodhe, Prof Jyoti Mankar

P.G.Scholar, Department of Computer Engineering KKWIEER,Nasik

Asst. Professor, Department of Computer Engineering KKWIEER,Nasik

ABSTRACT

Data aggregation is a process in which information is gathered and expressed in a summary form, and which is used for purposes such as statistical analysis. A common aggregation purpose is to get more information about particular groups based on specific variables such as age, name, phone number, address, profession, or income. Most data mining algorithms takes as input data set with a horizontal layout. Significant effort is required to prepare summary data set in a relational database with normalized tables. For preparing data sets suitable for data mining analysis, we have to write complex SQL queries, operation of joining tables and column aggregation. Horizontal aggregation can be performing by using operator, it can easily be implemented inside a query processor, much like a select, project and join. PIVOT operator on tabular data that exchange rows, enable data transformations useful in data modeling, data analysis, and data presentation

Two main ingredients in SQL code are joins and aggregations Standard aggregation returns one column per aggregated group and produce table with a vertical layout and Standard aggregations are hard to interpret when grouping attributes have high cardinalities. All these are limitations of standard aggregation. Because of these limitations, standard aggregation is not much suitable for preparation of data set for data mining analysis. Horizontal aggregation is a simple method which generates SQL code to return aggregated columns in a horizontal tabular layout and returns set of numbers instead of one number per row. This project is useful for building a suitable dataset for data mining analysis using horizontal aggregations in SQL. Three fundamental methods are used to evaluate horizontal aggregations: CASE, SPJ, and PIVOT. This project will evaluate horizontal aggregation using K-means algorithm for query optimization.

Keywords: CASE, data set, Horizontal Aggregation,, Optimization, SPJ, Pivot

INTRODUCTION

Preparation of data set for analyzing data in data mining project from relational database using standard aggregation function is time consuming task. Most data miming algorithm takes input a data set which is in horizontal layout i.e. in summarized form. Horizontal aggregations represent an extended form of traditional SQL aggregations, which return a set of values in a horizontal layout, instead of a single value per row. Horizontal aggregations are a new class of aggregations that have similar behavior to SQL standard aggregations, but which produce tables with horizontal layout. Horizontal aggregations have been evaluated using CASE, SPJ, and PIVOT method.

Motivation:-

The basic objective regarding this project is to prepare data set for data mining analysis using horizontal aggregation method and evaluate this horizontal aggregation method using CASE, SPJ, PIVOT and K-means algorithm for query optimization.

Existing systems:-

Datasets are prepared for data mining analysis using standard aggregation functions. Most data mining algorithm requires input a datasets with a horizontal layout with several records and one variable or dimensions per column. But data set prepared using standard aggregation produce dataset in vertical tabular layout. And converting vertical data set into summarized form requires writing long SQL statements or customizing SQL code if it is generated by some tool. Significant effort is required for computing aggregations using available functions and clauses in SQL to convert data set into cross tabular form suitable for data mining analysis.

Concept or seed idea:-

In this section, we suggest some alternatives which will make it possible to convert data sets into horizontal layout easily using horizontal aggregation function. The method focuses on minimizing effort and time that is spent in preparing and cleaning a data set for data mining algorithms in data mining project. A big part of this effort involves deriving metrics and coding categorical attributes from the data set in question and storing them in a tabular (observation, record) form for analysis so that they can be used by a data mining algorithm.

DEFINITIONS

Tables that will be used to explain SQL queries are defined here.

F(K,D1,..,Dp,A) F is a fact table with primary key K, D1,..,Dp are dimensions ,A is measure column that is passed to standard SQL aggregations.

F is a temporary table as shown in table 1

Table 1: Input Table, F

K

D1

D2

A

1

3

X

9

2

2

Y

6

3

1

Y

10

4

1

Y

0

5

2

X

1

6

1

X

NULL

7

3

X

8

8

2

X

7

F is used as input table. Here for explaining some examples of analyzing store database transationLine table is used. transactionLine table is a transaction table of store database. Table transactionLine has three dimensions namely product hierarchy, location and time .these dimensions are used for grouping rows. Table transactionLine also contains three measures- itemQty, salesAmt and costAmt are used to pass as arguments in aggregation function.

Table transaction Line is assumed to have star schema as shown in figure 2

Figure 1: transaction Line table

LITERATURE REVIEW

This chapter will introduce us with the previous system and its analysis. It also includes the comparison of existing system with proposed system. This will give us the detail idea about the need of proposed system.

Datasets are prepared for data mining analysis using standard aggregation functions. The most widely-known aggregation is the sum of a column over groups of rows. Some other aggregations return the average, maximum, minimum or row count over groups of rows. Using these aggregation functions datasets are prepared from input table F as shown in table 1.

Following query on table 1 gives result as shown in table 3 in vertical tabular form.

SELECT D1, D2, sum (A) FROM F

GROUP BY D1, D2 ORDER BY D1, D2;

A standard SQL aggregation (e.g. sum ()) with the GROUP BY clause, which returns results in a vertical layout as shown in table 2.

Table 2: Vertical Table, FV

D1	D2	A
1	X	NULL
1	Y	10
2	x	8
2	y	6
3	x	17

Analysis of Exiting System:

Analysis of system is given using examples with a store (retail) database that requires data mining analysis.

Store database contains following tables,

Product hierarchy (product ID, name, category)

Location (location id, city, state, country) Time (date, month, quarter, year) transactionline (product hierarchy, location,

time) and three measures itemQty, costAmt and salesAmt as shown in figure 1.

If we compute query summarizes sales for each store by each day of the week gives results as shown in table 4.

SELECT storeId, dayofweekNo, sum (salesAmt)

FROM TransactionLine

GROUP BY storeId, dayweekNo ORDER BY storeId, dayweekNo;

Table 3: Vertical Table

If there are 200 stores, 30 store departments and stores are open 7 days a week, the first query returns 1400 rows which may be time-consuming to compare with each other each day of the week to get trends. This query can be answered with standard SQL, but additional code needs to be written or generated to return results in horizontal tabular form as shown in table 3.

Horizontal aggregation just requires a small syntax expression to aggregate functions called in a SELECT statement and produce datasets in horizontal layout which is the desired layout for most data mining algorithms. They represent a template to generate SQL code from a data mining tool automates writing SQL queries, optimizing them and testing them for correctness. This method reduces manual work in the data preparation phase in a data mining project. The data set can be created entirely inside the DBMS. As horizontal aggregations produce tables with fewer rows and more columns, query optimization techniques used for standard aggregations are inappropriate for horizontal aggregations.

What is Horizontal aggregation?

Horizontal aggregation is a new class of aggregate functions that aggregate numeric expressions and transpose results to produce a data set with a horizontal layout. Functions that belongs to this class is called are called horizontal aggregations. Horizontal aggregations represent an extended form of traditional SQL aggregations, which return a set of values in a horizontal layout, instead of returning a single value per row. Horizontal aggregations are a new class of aggregations that have similar behavior to SQL standard aggregations, but which produce tables with a horizontal layout.

A new class of aggregations that have similar behavior to SQL standard aggregations, but which produce tables with a horizontal layout as shown in table 2. Horizontal aggregations just require a small syntax extension to aggregate functions called in a SELECT statement. Alternatively, horizontal aggregations can be used to generate SQL code from a data mining tool to build data sets for data mining analysis.

storeId	Dayofweekno	salesAmt
10	Mon	2345
10	Thu	5777
10	Wed	3445
10	Thu	6868
10	Fri	90
10	Sat	8089
10	Sun	4543
32	Mon	3343
.	.	.
.	.	.
.	.	.
32	Sun	6767
56	mon	32314

Table 4: Horizontal Table, FH

D1

D2X D2Y

1

2

3

NULL 10

8 6

17 NULL

If we want to summarize sales information with one store per row for one year of sales means we need the

sales amount broken down by day of the week, the number of transactions by store per month, the number of items sold by department and total sales. The query is

SELECT

storeId,

sum (salesAmt BY dayofweekName),

count (distinct transactionid BY salesMonth), sum (1 BY deptName),

sum (salesAmt) FROM transactionLine

, DimDayOfWeek, DimDepartment, DimMonth WHERE salesYear=2009

AND transactionLine.dayOfWeekNo

=DimDayOfWeek.dayOfWeekNo AND transactionLine.deptId

=DimDepartment.deptId

AND transactionLine.MonthId

=DimTime.MonthId GROUP BY storeId;

This query gives output as shown in table below. Table 5: Horizontal table for data mining algorithms

stor eId	SalesAmt Mon TueSun		countTransactios Jan Feb Dec	countItems dairy Meat .		Total salesAmt
10	120	111 200	2011 1807 4200	34	57	25025
32	7	65 98	802 912 1632	32	65	14022
56	77	92 99	555 1000 950	55	26	20500
.
.

SYSTEM ARCHITECTURE

Figure 2 System Architecture diagram

SELECT DISTINCT R1… Rk FROM F returns a table with d distinct rows and each row is used to define one column to store an aggregation for one specific combination of dimension values.

In a horizontal aggregation there are four input parameters to generate SQL code:

The input table F,
The list of GROUP BY columns L1. Lj,
The column to aggregate (A),
The list of transposing columns R1. . . Rk.

Syntax for horizontal aggregation is as follows. SELECT L1, L2Lj, H (A BY R1. Rk) FROM F

GROUP BY L1, L2 Lj;

SPJ method

In this SPJ method first we create one table with a vertical aggregation for each result column, and then join all those tables to produce FH. The d projected tables with d Select-Project-Join-Aggregation queries are aggregated from input table F. Each table FI corresponds to one sub grouping combination and has

{L1Lj} as primary key and an aggregation id done on A as the only non-key column. It is necessary to introduce an additional table F0 that will be outer joined with projected tables to get a complete result set.
Case Method

In this method thecase programming construct which is available in SQL is used. The case statement returns a value selected from a set of values based on Boolean expressions.

Horizontal aggregation queries can be evaluated by directly aggregating from F and transposing rows at the same time to produce FH. First, we get the unique combinations of R1. Rk, those define the matching Boolean expression for result columns.
Pivot Method

The PIVOT method uses the built-in PIVOT operator, which transforms rows to columns (e.g. transposing). PIVOT operator is a built-in operator in a commercial DBMS. The PIVOT method internally needs to determine how many columns are needed to store the transposed table and it can be combined with the GROUP BY clause.

3.	Venn diagram
	R1 S1	R is set of distinct rows
	R2 S2
	R3 S3	C is set of tables in H.A. using CASE
	R1 C1 R2 C2 R3 C3	P is set of tables in H.A. using PIVOT
	R1 P1	S is set of tables in H.A. using SPJ
	R2 P2
	R3 R3

MATHEMATICAL MODEL

Sr.No.	Description	UML Design Observations
1.	Problem Description
	s ={R,T,C,P,S\|Ã˜s} Where R=sets of records {r1,r2,r3\|Ã˜r} T=sets of tables {t1,t2,t3\|Ã˜t} C=sets of tables in horizontal tabular form using CASE method {c1,c2,c3\|Ã˜c} P=sets of tables in horizontal tabular form using PIVOT method {p1,p2,p3\|Ã˜p} S=sets of tables in horizontal tabular form using SPJ method {s1,s2,s3\|Ã˜s}	S holds list of actors
2	Activity1
	fRC(R) \| C fRP(R) \| P fRS(R) \| S fTC(R) \| C fTP(R) \| P fTS(R) \| S	fR is the rules to generate H.A. from the input record set. FT is the rules to generate H.A. from the input tables

Table 6: Table showing mathematical model

EXPERIMENTAL SETUP

For designing and implementation purpose C# language is require.

Also, require SQL Server V9 running on DBMS server running at 3.2GHz

Large synthetic datasets generated by the TPC-H generator is required.

Operating environment requires for implementation is as shown in table below

Table 7: Table showing experimental setup

Sr. No.	Name of content	Capacity
1	Operating System	Window XP
2	Processor	Dual core
3	Hard Disk	1TB
4	RAM	4 GB
5	Front End	C#
6	Back End	SQL Server V9 running on DBMS server

Test id	Test description	Expected input	Actual output
1.	Distinct Rows	Specific row from input table from which we want to summarize data for data mining analysis	Data set for data mining analysis
2.	SPJ method	Specific row from input table from which we want to summarize data for data mining analysis using SPJ method	Table in horizontal tabular form for data mining analysis
3.	CASE method	Specific row from input table from which we want to summarize data for data mining analysis using CASE method	Table in horizontal tabular form for data mining analysis
4	PIVOT method	Specific row from input table from which we want to summarize data for data mining analysis using PIVOT method	Table in horizontal tabular form for data mining analysis

controlling probabilistic distribution of column values. SPJ will be slowest method as compared to PIVOT and CASE.PIVOT and CASE will have similar time performance. Time will grow as n(size of data set) grows for all methods but much more for SPJ. When d (dimensionality) increases for SPJ, small time growth will occur. Main performance factor for PIVOT and CASE method is n (data set size).But both n and d will impact the SPJ method. Use of two probabilistic distributions, uniform and skewed will be used for query optimization. Using skewed distribution at high d, PIVOT shows a slightly higher impact than the CASE method.SPJ will be slowest and will show bigger impact at high d.

VIII. CONCLUSION

Horizontal aggregations help in preparing data sets for data mining and OLAP cube exploration. Horizontal aggregations are useful to create data sets with a horizontal layout, as commonly required by data mining algorithms A horizontal aggregation returns a set of numbers instead of a single number for each group, resembling a multi-dimensional vector. Horizontal aggregations can be used as a database method to automatically generate efficient SQL queries with three sets of parameters: grouping columns, sub grouping columns and aggregated column. In future, this work can be extended to develop a more formal model of evaluation methods to achieve better results. Also then we can be developing more complete I/O cost models.

TEST CASES

Table 8: table showing test cases
EXPECTED RESULT

Since TPC-H only creates uniformly distributed values, we will create a similar data generator for

IX. REFERENCES

Carlos Ordonez, Zhibo Chen Horizontal aggregations in SQL to prepare data sets for data mining analysis,

IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(4):678-691 , 2011, IEEE

Computer Society.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 1st edition, 2001
[3]C. Galindo-Legaria and A. Rosenthal. Outer join simplification and reordering for query optimization. ACM TODS, 22(1):4373, 1997.
[4]C.Cunningham, G.Graefe, and C.A. Galindo- Legaria. PIVOT and UNPIVOT: Optimization and execution strategies in an RDBMS. In Proc. VLDB Conference, pages 9981009, 2004
1. C.Galindo-Legaria and A. Rosenthal. Outer join simplification and reordering for query optimization. ACM TODS, 22(1):4373, 1997.
2. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation Operator generalizing group-by, cross-tab and subtotal. In ICDE Conference, pages 152159, 1996.
3. C. Ordonez. Data set preprocessing and transformation in a database system. Intelligent Data Analysis (IDA), 15(4), 2011
4. G. Graefe, U. Fayyad, and S. Chaudhuri. On the efficient gathering of sufficient statistics for classification from large SQL databases. In Proc. ACM KDD Conference, pages 204208, 1998.
5. C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22, 2010.

Preparation of Data Sets for Data Mining Analysis using Horizontal Aggregation in SQL and Optimization of Horizontal Aggregation

Leave a Reply