An Efficient Extension Of Conditional Functional Dependencies Using Nested Relational Database

Download Full-Text PDF Cite this Publication

Text Only Version

An Efficient Extension Of Conditional Functional Dependencies Using Nested Relational Database

Suchitra Reyya1, Sateesh Gudla2, M.Prasanna Shivani3

1Assistant Professor, Department of CSE,LIET,,Vizianagaram.

2Associate Professor, Department of CSE, LIET,Vizianagaram.

3Department of IT, LIET, Vizianagaram.

Abstract

This paper propose an efficient data cleaning by using extended Conditional Functional Dependencies (eCFDs), which is an extension of Conditional Functional Dependencies(CFDs). eCFDs intend to solve the multi-valued inconsistencies to trounce drawbacks of CFDs which use pattern tableau to hold individual tuples in a table for cleaning relational data by supporting only single valued attributes. SQL techniques are used to create patterns of semantically related values for detecting single tuple CFD violations. By introducing a query and an algorithm we provide better competence for eliminating data redundancy in multi-valued attributes using nested relational database. We experimentally analyze the efficiency and performance of these eCFD based techniques in improving data quality and displays the tentative results graphically.

Keywords: CFDs, eCFDs, Nested relational database.

  1. Introduction

    Data warehouse consists of huge amount of data integrated from different database sets where, the chances of data being redundant and inconsistent are highly possible. If these warehouses are not frequently cleaned, the inconsistent data can lead to mis conceptions there by making the cleaning process more complex [7]. Data redundancy implies that different tuples may represent the same data which causes anomalies and data corruption. Thus data cleaning plays a prominent role in preprocessing. One way of achieving this is by using functional dependencies. In FDs data redundancy is reduced by enforcing integrity constraints at schema level. But the extension of FDs

    called CFDs were developed which aim at capturing violations for individual tuples by using the concept of pattern tables [1].When these pattern tables are implied on the entire database the tuples which violate the pattern tables are detected and hence can be easily corrected. The downside with CFDs is that they hold for only single value attributes but not for multivalued attributes [2].

    This paper deals with eCFDs for improving data quality which are extended from CFDs to trounce the problem of multi-valued attributes by using nested relational database [5][6]. eCFDs are capable of capturing inconsistencies using nesting.

    This paper is organized into different sections where section 1 gives a brief introduction to data cleaning concept, section 2 thrash out related work, section 3 pioneer how to purge redundancy, section 4 explains methodology to reduce redundancy by introducing algorithm, section 5 displays experimental results and section 6 holds the conclusion.

  2. Related Work

    Relations that contain redundant information may potentially endure from update anomalies.Functional dependencies are used to detect the redundancies by enforcing integrity constraints at schema level [1]. The constraints hold for all the tuples in the table. The main idea behind FD is that the value of a particular attribute uniquely determine the value of some other attribute [8] A relation A B if for every valid instance of A, that value of A uniquely determines the value of B.

    Definition (Conditional Functional Dependencies): Refinement of FDs termed as CFDs capture inconsistencies that traditional FDs cannot detect. They are not expressible as traditional FDs because CFDs do not hold not entire relation, instead it holds on tuples matching the conditions specified in the pattern table [1].

    A CFD on R is a pair (R : A B, Tp), where (i) A, B are sets of attributes from relation (R), (ii) R : A B is a customary FD, rooted in ; and (iii) Tp is a tableau with all attributes in A and B , where for each X in A or B and each tuple t Tp, t[X] is either a constant a , or an unnamed variable – .

    Given an instance I of a relation schema R and a set of CFDs on R, it is to find all the tuples that violate some CFD in . The query below is an SQL technique for detecting violations of a CFD [2].

    Select distinct t.A

    from R t, Tp tp

    where t[A1] = tp[A1] AND . . . AND t[An] = tp[An]

    group by t.A having count(distinct B)> 1

    This query groups tuples with the same value on attribute A and it counts the number of distinct values in B. The query returns inconsistent tuples in the database if there is more than one value for B.

  3. Eliminating Redundancy using Nested Relational Database

    This paper proposes eCFDs which hold for multi value data. This is achieved using nested database [6]. The overall design of the system is shown below.

    DB

    Verification of eCFDs for multi value attributes using T

    Checking for fixed patterns using Tp

    Checking for fixed patterns using Tp

    Update violated tuples with correct data

    Figure 1: A model for achieving eCFds

    The model architecture describes the step by step flow of process carried to implement eCFDs. In the first step the data in the database is compared with table Tm to check for multi value inconsistencies. In the second step if the inconsistencies are found then it compares each tuple with matching fields from the database and table Tm with pattern table Tp. In the third step finally the incorrect tuple is updated with matching value from the multi tableau and saved in database. Defnition (eCFD): Let R be some relational schema given as (R:XY, Tp,Tm) where (1) X,Y,Tp,Tm attr( R ); (2) XY is an embedded FD; (3) Tp is a pattern tableau consisting of finite number of pattern tuples; (4) Tm is a multi tableau holding multiple values of Y for each X attribute.

    Example 1.1: Let us consider a schema cust(AC ,PN

    ,NM, STR, CT, ZIP). It specifies a customer relationship in terms of the customers phone (area code (AC), phone number (PN)), name (NM), and address (street (STR), city (CT), zip code (ZIP)). An instance D0 of cust is shown below.

    Table 1: Customer Database (CD)

    AC

    PN

    NM

    STR

    CT

    ZIP

    718

    1111111

    Mik

    TreeAve

    Albany

    12238

    518

    2222222

    Joe

    Elm Str

    Colonie

    12205

    100

    2222222

    Jim

    Oak Str

    Troy

    12181

    212

    3333333

    Ben

    5th Ave

    NYC

    10001

    646

    4444444

    Ian

    High St

    NYC

    10011

    0891

    2552705

    Vani

    Eenadu

    VSP

    530013

    08922

    2342345

    Rani

    Boypalm

    VSP

    530012

    345

    6784563

    Pete

    Troy

    Ohio

    23124

    0890

    1233211

    Raja

    Eenadu

    VSP

    530045

    In the above table we can observe that the city NYC and VSP has ifferent area codes (AC).

    1: CT {NYC} with AC {212, 646, 347, 917}.

    2: CT {VIZAG} with AC {0891, 08922}.

    Hence in 1 if the city is NYC then the area codes must be one of the following values. Similarly in 2 if the city is VIZAG the area code must be one of the above given values. Here CT is associated with disjunction of options rather than single value. But in the above example tuple t3 contains CT as NYC but AC is 100 which is not one of the area codes associated with NYC and tuple t9 has CT as VSP but AC is 0890. This problem is not overcome in CFDs as it does not hold multi-valued attributes. This multi-valued attribute problem can be solved with by less complexity by taking two tables Tm & Tp.

    TABLE Tm: Tm is a multi tableau holding all the tuples consisting of multi- value of Y for a single attribute X. The Tm table for the given customer (cust) instance is shown below.

    Table 2: multi tableau-Tm

    CT

    AC

    NYC

    {212,646,347,917}

    VSP

    {0891,08922}

    Albany

    {718}

    Now to correct the area code in tuple t4 appropriate area code must be selected from multiple values of AC in table Tm. So we use the pattern table Tp which consists of fields, street(STR) and area code(AC) to select the area code that match the fields in the tuple. The pattern tableau for the cust relation is shown

    from cust C, table1 Tm where (Tm.CT < > '@'

    AND Tm.AC[i.in] < > '@') AND (C.CT = Tm.CT

    AND C.AC < > Tm.AC[i..in]);

    Once the violations are displayed the area code must be corrected with the appropriate one from the multiple area codes.

    The below query is used to update the area code by considering customer relation table (C ), multi tableau (Tm ) and pattern tableau ( Tp) where the area code of Tm matches with that of Tp and the street in customer relation matches with that of Tp.

    update

    cust set

    C.AC = (select Tp.AC

    from table2 Tp, table1 Tm, cust C

    below.

    STR

    AC

    5th Ave

    212

    8th Ave

    212

    High.St

    646

    Eenadu

    0891

    Boypalem

    08922

    STR

    AC

    5th Ave

    212

    8th Ave

    212

    High.St

    646

    Eenadu

    0891

    Boypalem

    08922

    Table 3: Pattern Tableau-Tp

    where Tp.AC = Tm.AC[i.in]

    AND C.STR = Tp.STR);

    The following Algorithm can be used to explain the methodology to reduce redundancy using nesting.

    Algorithm : Eliminating inconsistent tuples in multi-

    valued attribute using nested relational database

    Input: Inconsistent multi-valued database (CD)

    Output: Clean database(CCD)

    Method: Solving multi-valued inconsistencies using Nested Relational database

    1. Begin

      From cust relation we observe that in the tuple t4, street is 8th AVE and city is NYC which uniquely determines area code as 212. Thus now the violation showed in tuple t4 can be eliminated by correcting it with area code as 212 using these two tables.

  4. Methodology to Reduce Redundancy

We introduce a query for solving multi value inconsistencies in a simplified way with less complexity.

This query shows the violated tuples by considering customer relation table(C) and multi tableau(Tm) where the citys area code (AC) does not match with the disjunction of options present in the multi tableau Tm.

select ALL

  1. Consider a table CD with different fields.

  2. Take a new table multi tableau(Tm) having multiple values for attribute X ie. X [i..in].

  3. For each tuple in table CD having attr X, compare it with X [i…in] of table Tm.

  4. If comparision successful then goto step

    15

  5. Else

  6. Display tuples not satisfying step 4.

  7. End for

  8. Consider a pattern tableau Tp having prefixed constants and unnamed variables.

  9. For updating the inconsistent values shown in step 7, for each tuple in table CD compare Tm.X [i.in] with Tp.X and CD.Y with Tp.Y respectively.

  10. If comparision successful then

  11. Update attr X with tp.X

  12. End if

14. End for

  1. Original database (CD) is updated as Clean Database (CCD)

  2. End

Now after applying the above algorithm on TABLE 1, the consistencies in the table will be replaced by the accurate data, thereby making the database clean and error free. The below is the clean database (TABLE 4) obtained as a result of the algorithm. The AC in tuple t4 is replaced with 212 and similarly tuple t9 with erroneous AC is replaced with appropriate area code (AC) as 0891.

Table 4: Cleaned Customer Database (CCD)

AC

PN

NM

STR

CT

ZIP

718

1111111

Mik

TreeAve

Albany

12238

518

2222222

Joe

Elm Str

Colonie

12205

212

2222222

Jim

Oak Str

Troy

12181

212

3333333

Ben

5th Ave

NYC

10001

646

4444444

Ian

High St

NYC

10011

0891

2552705

Vani

Eenadu

VSP

530013

08922

2342345

Rani

Boypalm

VSP

530012

345

6784563

Pete

Troy

Ohio

23124

0891

1233211

Raja

Eenadu

VSP

530045

  1. Experimental Analysis

    In this section, we present our findings about the performance of our models for reducing redundancies over multi-value databases.

    Setup: For the experiments, we used postgre SQL on Windows XP with 1.86 GHz Power PC dual CPU with 3GB of RAM.

    Efficiency of eCFDs: The efficiency of the eCFDs is graphically displayed by considering the multi-value database. When CFDs are posted on this database all tuples having multiple values for each attribute are changed to only one value and obtains duplicate data. But in eCFDs each attribute can hold multiple data thereby preserving the data.

    Figure 2: Graph displaying the efficiency of eCFDs

    Size of the Relation: In this experiment we have analyzed how the redundancy is reduced by considering the space occupied by the pattern table (TP) and multi tableau (Tm) in CFDs and eCFDs respectively. It is observed Tp can hold only one value for each fixed pattern where as Tm can efficiently hold multiple values for each attribute in the pattern thus reducing tuple size

    Table 6: Comparison of size of tuples in Tp (CFDs) & Tm( eCFDs)

    No. of multiple values

    Tp

    Tm

    Pattern 1

    1

    2

    Pattern 2

    3

    Pattern 3

    1

    4

    No.of multiple values

    No.of multiple values

    5

    4

    Table 5: Comparison of efficiency in CFDs & eCFDs

    No. of reduced tuples CFDs eCFDs

    Tuple variation1 7 10

    Tuple variation2 14 20

    Tuple variation3 27 30

    Tuple variation4 40 50

    Table 5: Comparison of efficiency in CFDs & eCFDs

    No. of reduced tuples CFDs eCFDs

    Tuple variation1 7 10

    Tuple variation2 14 20

    Tuple variation3 27 30

    Tuple variation4 40 50

    50

    Number of tuples

    Number of tuples

    40

    30

    20

    10 CFD

    eCFD

    0

    3

    2 T

    1 T

    0

    Pattern1 Pattern2 Pattern3

    No.of Patterns

    Number of reduced tuples

    Figure 3: Graph displaying size of relation

  2. CONCLUSION

    We presented that nesting based eCFDs proved to be better than CFDs in eliminating redundancy in nested relational database. eCFDs acquire no extra complexity in inconsistency detection. We have developed SQL based technique and algorithm for eCFD violation detection. Our experimental results for efficiency and size of the relation proved eCFDs to be more effective compared to CFDs.

  3. REFERENCES

  1. Philip Bohannon, Wenfei Fan, Floris Geerts and Xibei Jia3 Anastasios Kementsietsidis3: Conditional Functional Dependencies for Data Cleaning

  2. Wenfei Fan, Floris Geerts , Laks V.S. Lakshmanan and Ming Xiong : Discovering Conditional Functional Dependencies, IEEE conference on data engineering

  3. Sangeeta Viswanadham, V. Valli Kumari: Eliminating Data Redundancy using Nesting based wMVD, IEEE Conference Publicat ions, Elect ronics Computer Technology (ICECT), 2012 4th Internat ional Conference, April (2012).

  4. Sangeeta Viswanadham, V. Valli Kumari: Eliminating Data Redundancy, Cube (2012)

  5. Suchitra Reyya, Sangeeta Viswanadham: Eliminating Data Redundancy through Weak Multivalued Dependencies using Nesting

  6. Suchitra Reyya, M. Sundara Babu, Ratnam Dodda: An Efficient Storage of Spatial Database using Nested Relational Database , International Journal of Engineering Research & Technology (IJERT) Vol. 1 Issue 7, September 2012

  7. Erhard Rahm, Hong Hai Do: Data Cleaning Problems and Current Approaches, IEEE Techn. Bulletetin on Data Engineering, Dec. (2000)

  8. Korth, H., Roth, M.: Query languages for Nested Relational Databases, Nested Relations and Complex Objects in Databases, Springer, (1991).

Leave a Reply

Your email address will not be published. Required fields are marked *