Near Duplicate Detection In Relational Database

DOI : 10.17577/IJERTV2IS3230


Bhagyashri. A. Kelkar *, Prof. K. B. Manwade **, Prof. G. A. Patil ***

  • *ME (CSE) Research Scholar, D. Y. Patil college of Engg. & Tech., Kolhapur, Maharashtra, India.

    ** Head of Dept, Dept. of CSE, Ashokrao Mane College of Engg, Wathar, Dist Kolhapur, Maharashtra, India.

    *** Head of Dept, Dept. of CSE, D. Y. Patil College Of Engg. and Tech., Kolhapur, Maharashtra, India.

Abstract: Near-duplicate detection is an important process for many database-based applications. Accurately identifying duplicate entities across multiple data sources is a big challenge for organizations and researchers. Detecting the approximately duplicate records that refer to the same real-world entity is important to make the database more concrete and achieve higher data quality. In this process, ideally each record must be compared with every other record in the dataset to find duplicates. It is possible to reduce the search space for record comparisons by using the mutual exclusion property of tuples. In this paper we analyze two blocking algorithms, namely the adaptive sorted neighborhood method (ASNM) and iterative blocking, and their combination with the Jaro-Winkler distance for string matching. Experimental evaluation on real datasets shows that the adaptive sorted neighborhood method together with the Jaro-Winkler distance outperforms in terms of precision and recall, and requires far fewer comparisons than the iterative blocking method. The experiments also highlight that the string matching threshold gives optimal results for values in the range of 85% to 90%.

Keywords: Record linkage, near duplicate detection, iterative blocking, adaptive sorted neighborhood method (ASNM), Jaro-Winkler distance.

    1. INTRODUCTION

In database-based applications, there is often a need to consolidate data from different sources. Data integrated from multiple sources typically contains 1% to 5% duplicate records. We call two records near duplicates if they identify the same real-world entity. Many industries and systems depend on the accuracy of their databases for taking competitive, strategic and operational decisions. The quality of the information stored in a database can have significant cost implications for a system that relies on that information to function and conduct business. Low data quality results in incorrect reporting, an inability to create a complete view of customers across segments, and poor customer service. Poor data quality costs businesses billions of dollars in postage, printing, and staff overhead. Hence, data quality improvement is an essential step before establishing a data warehouse.

The presence of nearly duplicate records is the result of expressing an entity using different values, due to inconsistent expression habits, abbreviations, typing errors and different formats. Thus, data quality is often compromised by many factors, including data entry errors (e.g., "Microsft" instead of "Microsoft"), missing integrity constraints (e.g., allowing entries such as EmployeeAge = 234), and multiple conventions for recording information (i.e. lexical heterogeneity), e.g., an address recorded as "44 W. 4th St." versus "44 West Fourth Street".

    2. REDUCE SEARCH SPACE WITH BLOCKING

The naive method for finding nearly duplicate records is to compare all records in the database pairwise. Obviously, this method is practically infeasible due to its time complexity of O(n²). To lower the time complexity, various techniques have been proposed. Blocking [2] refers to the procedure of subdividing database records into a set of mutually exclusive subsets (blocks) under the assumption that no matches occur across different blocks. In this paper, we analyze the performance of iterative blocking [4] and the adaptive sorted neighborhood method [3].
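To make the idea concrete, here is a minimal Python sketch of standard blocking; the record layout and the `blocking_key` function are illustrative assumptions, not code from the paper:

```python
from collections import defaultdict
from itertools import combinations

def block_records(records, blocking_key):
    """Group records into mutually exclusive blocks by a key function."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks

def candidate_pairs(blocks):
    """Compare records only within each block, never across blocks."""
    for block in blocks.values():
        yield from combinations(block, 2)

# Hypothetical usage: block restaurant records on the city field.
# records = [{"name": ..., "address": ..., "city": ...}, ...]
# blocks = block_records(records, lambda r: r["city"].strip().lower())
# n_comparisons = sum(1 for _ in candidate_pairs(blocks))
```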

      1. Iterative Blocking:

Most blocking techniques process blocks separately and do not exploit the results of other blocks. In iterative blocking, the results of processed blocks are reflected in subsequently processed blocks, and blocks are iteratively processed until no block contains any more matching records. When two records match and are merged in one block, their composite may match records in other blocks. Conversely, since the same pair of records may occur in multiple blocks, once the pair has been compared in one block, comparing it again in other blocks can be avoided.
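The following is a simplified Python sketch of this idea, under the assumption that `records_match` and `merge` are supplied by the caller. The optimization of skipping pairs already compared in another block is omitted for brevity, so this is an approximation of Whang et al.'s algorithm [4], not a full implementation:

```python
def iterative_blocking(blocks, records_match, merge):
    """Process blocks until no block contains matching records. When two
    records merge, the composite replaces both originals in every block
    that contained either of them, so results of one block are reflected
    in the others."""
    changed = True
    while changed:
        changed = False
        for block in blocks.values():
            i = 0
            while i < len(block):
                merged_pair = None
                for j in range(i + 1, len(block)):
                    if records_match(block[i], block[j]):
                        merged_pair = (block[i], block[j])
                        break
                if merged_pair is None:
                    i += 1
                    continue
                a, b = merged_pair
                composite = merge(a, b)
                changed = True
                # Replace the pair with its composite in every block that
                # contained either record (identity comparison on purpose).
                for other in blocks.values():
                    had = any(r is a or r is b for r in other)
                    other[:] = [r for r in other if r is not a and r is not b]
                    if had:
                        other.append(composite)
                i = 0  # rescan this block, since it has changed
    return blocks

# Hypothetical usage with dict records and a similarity-based matcher:
# blocks = {"pass1:los angeles": [r1, r2, ...], ...}
# iterative_blocking(blocks,
#                    records_match=lambda a, b: record_sim(a, b) > 0.85,
#                    merge=lambda a, b: {**b, **a})
```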

      2. Adaptive Sorted Neighborhood Method:

In this blocking method, pairs of blocking key values that are adjacent to each other in the sorted order, yet significantly different from each other, are found using an appropriate string similarity measure. These boundary pairs of blocking keys are then used to form the blocks, i.e. they mark the positions in the sorted array where one window ends and a new one starts. This approach can therefore be seen as a combination of traditional blocking and the sorted neighborhood approach. Record pairs that were removed in the indexing step are classified as non-matches without being compared explicitly.
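A minimal sketch of this boundary-based window formation, assuming a key similarity function `key_sim` that returns values in [0, 1]; the `sort_key` helper and the 0.8 boundary threshold are illustrative assumptions:

```python
def adaptive_blocks(records, key_func, key_sim, boundary_threshold=0.8):
    """Sort records on a blocking key, then start a new block wherever two
    adjacent key values are significantly different (similarity below the
    threshold). These boundary pairs mark where one window ends and the
    next one starts."""
    if not records:
        return []
    recs = sorted(records, key=key_func)
    blocks, current = [], [recs[0]]
    for prev, cur in zip(recs, recs[1:]):
        if key_sim(key_func(prev), key_func(cur)) < boundary_threshold:
            blocks.append(current)  # boundary pair found: close the block
            current = []
        current.append(cur)
    blocks.append(current)
    return blocks

# Hypothetical sort key following the paper's first-pass scheme: the first
# two characters of each selected field, concatenated in order.
def sort_key(rec, fields=("name", "address", "city", "phone", "type")):
    return "".join(str(rec.get(f, ""))[:2] for f in fields)
```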

    3. FIELD SIMILARITY COMPUTATION

One approach to reducing the computational effort is to minimize the number of costly string comparisons that need to be made between records. Approximate string comparisons are carried out using methods such as edit distance, Jaccard similarity, cosine similarity and matching trees. The Jaro-Winkler algorithm [3] has been found effective for fields like names and address details, while comparison functions specific to dates, ages and numerical values are used for fields that contain such data. In the proposed system, the performance of the Jaro-Winkler metric is evaluated on the experimental datasets.
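For reference, a compact, self-contained implementation of the Jaro-Winkler measure (a sketch of the standard definition; in practice a vetted library implementation would typically be preferred):

```python
def jaro(s1, s2):
    """Jaro similarity: fraction of matching characters within a sliding
    window, adjusted by the number of transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    m1 = [False] * len1  # which characters of s1 are matched
    m2 = [False] * len2  # which characters of s2 are matched
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(i + window + 1, len2)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions between the two matched character sequences.
    t, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler: boost the Jaro score for a common prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)

# Example: jaro_winkler("MARTHA", "MARHTA") is approximately 0.961
```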

    4. RECORD SIMILARITY COMPUTATION

      1. Attribute selection:

A record is usually composed of many attributes, whose similarities decide the record similarity. It is vital to select the attributes that participate in the record match so that they represent the whole record. Attributes are selected by domain experts depending on their comprehension of the meaning of the data. If the cardinality of an attribute is approximately equal to the cardinality of the dataset, then that attribute cannot serve as a duplicate identifier; for example, title is not a duplicate identifier for a name database.
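A small sketch of this cardinality screen as stated above; the 0.95 near-uniqueness cutoff is an illustrative assumption, not a value from the paper:

```python
def screen_attributes(rows, fieldnames, near_unique_ratio=0.95):
    """Drop attributes whose cardinality is approximately equal to the
    dataset cardinality (near-unique fields), per the heuristic above;
    the remaining fields are candidates for the record match."""
    n = len(rows)
    keep = []
    for f in fieldnames:
        cardinality = len({row[f] for row in rows})
        if cardinality / n < near_unique_ratio:
            keep.append(f)
    return keep
```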

      2. Attribute weight allocation:

A record is usually composed of many attributes, whose contributions differ in deciding whether two records are approximate duplicates. Each attribute's weight is allotted according to the importance of its contribution in judging approximately duplicate records: the bigger the contribution an attribute makes, the larger the weight that should be allotted to it. In this paper, the effect of attribute weights on precision, recall and F-measure is analyzed.

The field matching step results in the formation of a vector containing the numerical similarity values (F_i) calculated for each selected attribute of a record pair. The record similarity (RS) between two records R1 and R2 is

    RS(R_1, R_2) = \sum_{i=1}^{n} W_i \cdot F_i

where W_i is the weight allocated to the i-th field and n fields contribute to the vector. If RS(R1, R2) is greater than a user-specified threshold, record R1 is marked as similar to R2. This stage classifies the compared candidate record pairs into matches and non-matches. In this paper, the effect of the threshold value on precision, recall, F-measure and the number of record comparisons is analyzed.
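A minimal sketch of this classification step; the field names, weights and threshold below mirror the restaurant experiment but are otherwise illustrative, and `jaro_winkler` is the string measure sketched earlier:

```python
def record_similarity(r1, r2, field_sims, weights):
    """RS(R1, R2) = sum_i W_i * F_i, where F_i is the similarity of the
    i-th selected field and the weights W_i sum to 1."""
    return sum(w * field_sims[f](r1.get(f, ""), r2.get(f, ""))
               for f, w in weights.items())

# Hypothetical configuration mirroring the restaurant experiment:
# weights = {"name": 0.30, "address": 0.30, "city": 0.20, "type": 0.20}
# field_sims = {f: jaro_winkler for f in weights}
# is_match = record_similarity(r1, r2, field_sims, weights) > 0.85
```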

    5. EXPERIMENTAL RESULTS

Evaluation metrics: The effectiveness of the blocking methods and string similarity algorithms is measured by the precision, recall and F-measure metrics.

True Positive (TP): the number of record pairs classified as matches that really are matches.

True Negative (TN): the number of record pairs classified as non-matches that really are non-matches.

False Positive (FP): the number of record pairs classified as matches that really are non-matches.

False Negative (FN): the number of record pairs classified as non-matches that really are matches.

Precision: the fraction of true matches over the total number of candidate pairs classified as matches.

Recall: the fraction of true matches correctly classified over the total number of true matches.

F-measure: the harmonic mean of the precision and recall values.

      Pairwise comparison count: This metric measures the number of pairwise comparisons performed by the algorithm. The lower the count, the more efficient the algorithm is.
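In formula form (with the F-measure taken as the balanced harmonic mean, i.e. F1):

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall}    = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
         {\mathrm{Precision} + \mathrm{Recall}}
```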

      Datasets

Restaurant dataset: The experiments were carried out on the real-world Restaurant dataset, a standard dataset used in several record linkage studies. It was created by merging information about restaurants from two sources: Zagat (331 non-duplicate restaurants) and Fodor's (533 non-duplicate restaurants). There are 864 records in this dataset, 112 of which are duplicates. Name, Address, City, Phone and Type of restaurant are the attributes of this dataset. Every record has five fields of the following form (RIDDLE, Restaurant data set):

      "arnie mortons of chicago", "435 s. la cienega blv.", "los angeles", "310/246-1501", "american", 0

      "arnie mortons of chicago", "435 s. la cienega blvd.", "los angeles", "310-246-1501", "steakhouses", 0

Implementation of iterative blocking on the restaurant dataset using Jaro-Winkler string similarity:

Blocks were formed on city in the first pass and on telephone number in the second pass. Similarity threshold values were varied from 70% to 91%. The contributing fields, i.e. name, address, city, phone and cuisine type, were assigned weights of name (30%), address (30%), city (20%) and cuisine type (20%) when calculating the record similarity index. The resulting precision, recall, F-measure and number of record comparisons are shown in Figure 1.

The results show that at an 85% similarity threshold, F-measure, precision and recall are at their optimal level. At a 90% threshold, precision is maximized but recall and F-measure drop as more records are rejected from comparison. As the threshold value increases, the number of record comparisons also increases.

Figure 1. PRECISION, F-MEASURE, RECALL OBTAINED WITH ITERATIVE BLOCKING, JARO-WINKLER SIMILARITY AND UNEQUAL WEIGHT COMBINATION ON RESTAURANT DATASET
[Chart; y-axis: output value; x-axis: threshold; series: Precision, F-measure, Recall.]

Threshold %   Precision   Recall   F-measure   #Comparisons
70            0.79        0.87     0.83        4258
90            0.97        0.76     0.85        4520
91            0.98        0.74     0.84        4544

The above results show that at an 85% similarity threshold, F-measure, precision and recall reach optimum values. At a 91% threshold, precision is maximized but recall and F-measure drop as more records are rejected from comparison. The adaptive approach results in a tremendous drop in the number of comparisons.

Figure 2. COMPARATIVE CHART OF ITERATIVE BLOCKING (IB) AND ADAPTIVE SORTED NEIGHBORHOOD (ASNM) IN COMBINATION WITH JARO-WINKLER (JW) AND SOUNDEX (SND) SIMILARITY
[Bar chart at the 85% threshold; y-axis: output value (0.50 to 1.00); groups: Precision, F-measure, Recall; series: IB+JW, IB+SND, ASNM+JW, ASNM+SND.]

Implementation of adaptive sorted neighborhood with Jaro-Winkler string similarity:

In the first pass, the sort key was created by concatenating the first two characters of each of the following fields in order: name, address, city, phone and cuisine type. In the second pass, the sort key was created by concatenating the first two characters of the fields city, phone, cuisine type, name and address in sequence. The data was then sorted on sort key 1 in the first pass and on sort key 2 in the second pass.

Results obtained by the combination of iterative blocking and the adaptive sorted neighborhood method with Jaro-Winkler and Soundex, at a threshold value of 0.85 and equal weights assigned to the contributing fields, are shown in Fig. 2. They show that Jaro-Winkler string similarity works better than Soundex for maximizing precision, recall and F-measure.

A fixed window size of 10 records is defined initially and changed adaptively when a match is found. Comparisons of records take place within the window only. The first record in the result set is selected as the base record and compared with the remaining records in the window one after the other. Jaro-Winkler string similarity is used to find the matching index of the records. If a matching record is found, i.e. the match index exceeds the threshold value, the remaining records in the window are bypassed and the record after the current base record is marked as the new base record. If the new record has already been matched with some other record, it need not be processed further, so it is bypassed and the next base record is selected for processing. This process continues until all records are exhausted. The resulting precision, recall, F-measure and number of comparisons required are shown in Table I below.
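A simplified Python sketch of this windowed scan; the adaptive growth of the window on a match is reduced here to a fixed window plus the skip rules described above, so this is an approximation of the procedure, not an exact transcription:

```python
def asnm_scan(sorted_records, rec_sim, threshold=0.85, window=10):
    """Scan sorted records with a sliding window: each base record is
    compared with the records that follow it inside the window; on a match
    the rest of the window is skipped, and records already matched are not
    used as base records. Returns matched index pairs and the number of
    comparisons performed."""
    matched = set()  # indices already matched to some record
    pairs, comparisons = [], 0
    for i in range(len(sorted_records)):
        if i in matched:
            continue  # already matched: bypass as base record
        for j in range(i + 1, min(i + window, len(sorted_records))):
            if j in matched:
                continue
            comparisons += 1
            if rec_sim(sorted_records[i], sorted_records[j]) > threshold:
                pairs.append((i, j))
                matched.update((i, j))
                break  # bypass the remaining records in this window
    return pairs, comparisons
```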

TABLE I. RESULTS OF ADAPTIVE SORTED NEIGHBORHOOD METHOD WITH JARO-WINKLER SIMILARITY AND EQUAL FIELD WEIGHTS

Threshold %   Precision   Recall   F-measure   #Comparisons
70            0.79        0.87     0.83        4258
80            0.85        0.88     0.86        4307
84            0.93        0.88     0.90        4382
85            0.96        0.88     0.92        4406
86            0.97        0.86     0.91        4430

Figure 3. RECORD COMPARISONS REQUIRED FOR ITERATIVE BLOCKING (IB) AND ADAPTIVE SORTED NEIGHBORHOOD (ASNM) IN COMBINATION WITH JARO-WINKLER (JW) AND SOUNDEX (SND) SIMILARITY
[Bar chart; y-axis: number of comparisons (0 to 80000); series: IB+JW, IB+SND, ASNM+JW, ASNM+SND.]

Fig. 3 shows that the adaptive sorted neighborhood method reduces the number of record comparisons by 93%.

Real customer dataset: A real dataset of 10644 customer records was loaded into a MySQL database. The data contains the following fields:

title, full name, address line 1, address line 2, address line 3, city, state code, pin code, date of birth, sex code, interest category, mother tongue

The dataset has 2167 duplicate blocks and a total of 7056 duplicate records.

      Implementation of iterative blocking using Jaro-Winkler string similarity:

The data was preprocessed to remove known wrong values in the city and title fields using update queries. No single field could be selected to form blocks, so the blocking key was formed by concatenating the first four characters of the full name and of address line 1. The first record in each block was marked as the base record of the block, and all other records were compared with it one by one to compute record similarity. Similarity threshold values were varied from 70% to 99%. All the contributing fields were assigned equal weights when calculating the record similarity index. The resulting precision, recall, F-measure and number of comparisons required are shown in the following tables.

TABLE II. RESULTS OF ITERATIVE BLOCKING WITH JARO-WINKLER SIMILARITY AND EQUAL FIELD WEIGHTS

Threshold %   Precision   Recall   F-measure   #Comparisons
70            0.70        0.47     0.56        3021245
80            0.50        0.50     0.50        3588053
84            0.78        0.54     0.64        4155076
85            0.80        0.55     0.65        4437091
86            0.81        0.56     0.66        4723404
90            0.87        0.64     0.74        6050503
91            0.89        0.65     0.75        6332849
93            0.90        0.68     0.78        6832347
95            0.91        0.69     0.79        7262777
97            0.92        0.69     0.79        7700536
99            0.93        0.67     0.78        8337269

TABLE III. RESULTS OF ITERATIVE BLOCKING USING JARO-WINKLER SIMILARITY AND UNEQUAL FIELD WEIGHTS

Threshold %   Precision   Recall   F-measure   #Comparisons
70            0.73        0.51     0.60        2381521
80            0.79        0.57     0.66        3418051
84            0.83        0.62     0.71        4235334
85            0.85        0.64     0.73        4713467
86            0.87        0.66     0.75        5280880
90            0.96        0.79     0.86        7218565
91            0.97        0.80     0.88        7599389
93            0.99        0.82     0.90        8149949
95            1.00        0.82     0.90        8546981
97            1.00        0.80     0.89        8944531
99            1.00        0.77     0.87        9349063

Comparison of Table II with Table III indicates that assigning unequal weights when forming the similarity index improves precision, recall and F-measure, at the cost of an increased number of record comparisons.

Implementation of adaptive sorted neighborhood with Jaro-Winkler string similarity:

The blocking key was formed by concatenating the first four characters of the full name and the first two characters of address line 1. The data was then sorted on sort key 1.

Results show that at a 93% similarity threshold, F-measure, precision and recall reach optimum values. At a 95% threshold, precision is maximized but recall and F-measure drop as more records are rejected from comparison. The adaptive approach results in a tremendous drop in the number of comparisons: the adaptive sorted neighborhood method reduces the number of record comparisons by 82%.

TABLE IV. RESULTS OF ADAPTIVE SORTED NEIGHBORHOOD METHOD WITH JARO-WINKLER SIMILARITY AND EQUAL FIELD WEIGHTS

Threshold %   Precision   Recall   F-measure   #Comparisons
70            0.81        0.57     0.67        53678
80            0.82        0.58     0.68        54999
84            0.85        0.62     0.71        57488
85            0.87        0.63     0.73        58650
86            0.88        0.64     0.74        59655
90            0.93        0.70     0.80        63683
91            0.93        0.71     0.81        64396
93            0.94        0.73     0.82        65653
95            0.95        0.72     0.82        66719
97            0.95        0.71     0.81        67856
99            0.95        0.69     0.80        69308

If the highest weightage is given to the name field, the following results are obtained.

TABLE V. RESULTS OF ADAPTIVE SORTED NEIGHBORHOOD METHOD USING JARO-WINKLER SIMILARITY AND UNEQUAL FIELD WEIGHTS

Threshold %   Precision   Recall   F-measure   #Comparisons
70            0.78        0.58     0.66        47351
80            0.80        0.60     0.69        46000
84            0.86        0.66     0.75        53992
85            0.88        0.68     0.77        55757
86            0.91        0.72     0.80        57717
90            0.97        0.81     0.88        64288
91            0.98        0.82     0.89        65485
93            0.99        0.83     0.90        67220
95            1.00        0.81     0.90        68503
97            1.00        0.80     0.89        69637
99            1.00        0.77     0.87        70838

Dbgen dataset: A synthetic dataset was generated using the data generator software dbgen; it has 10252 records and 501 duplicate blocks. The data contains the following fields:

social security number, first name, middle name, last name, street number, address, apartment number, city, state, zip code

The first blocking key selected was the state field and the second blocking key was the social security number field. Results show that, on the dbgen dataset, giving the highest weightage to the name field produces optimal results with a slight increase in the number of comparisons. Iterative blocking with Jaro-Winkler similarity achieves a highest precision of 1.00, a highest F-measure of 0.94 and a highest recall of 0.89, and requires only 0.13% of the roughly 10.5 crore (105 million) pairwise comparisons needed for one-to-one comparison, when the highest weight is given to the name and address fields. At a 62% similarity threshold, recall, F-measure and precision are maximum.

TABLE VI. RESULTS OF ITERATIVE BLOCKING WITH JARO-WINKLER SIMILARITY AND UNEQUAL FIELD WEIGHTS

Threshold %   Precision   Recall   F-measure   #Comparisons
70            1.00        0.77     0.77        248720
80            1.00        0.77     0.69        46000
84            1.00        0.77     0.75        53992
85            1.00        0.77     0.77        55757
86            1.00        0.77     0.80        57717
90            1.00        0.76     0.88        64288
91            1.00        0.82     0.89        65485
93            1.00        0.83     0.90        67220
95            1.00        0.81     0.90        68503
97            1.00        0.80     0.89        69637
99            1.00        0.77     0.87        70838

    6. CONCLUSION

The following points highlight the conclusions drawn from the work:

• The whole record need not be compared to identify duplicates. Fields that represent the whole record should be used for the similarity vector computation rather than comparing all fields; such fields can be identified as those having higher cardinality.

• The weights assigned to the contributing fields play an important role in the efficiency of the process. Higher weights must be assigned to the most discriminating fields in the record, with the highest weight assigned to the field having the lowest cardinality in the whole dataset. This weight combination yields higher precision, recall and F-measure than equal weights.

• The similarity threshold used to classify the similarity vector also affects the precision, recall and F-measure values. On a real dataset like 'Bank', a similarity threshold of 95% gives optimum results, whereas on the Restaurant dataset a threshold of 85% does; i.e., the similarity threshold yielding optimal results depends on the characteristics of the database. On average, a similarity threshold of 90% produces optimal results across datasets.

• The iterative blocking and adaptive sorted neighborhood methods perform nearly equally in terms of F-measure, precision and recall, but the adaptive sorted neighborhood method requires far fewer record comparisons than iterative blocking, which makes it useful for large datasets. Its overhead is finding a proper sort key.

• The Jaro-Winkler similarity function always outperforms Soundex similarity, and Jaro-Winkler similarity is best suited for name and address data.

References

1. Xiaochun Yang, Bin Wang, Guoren Wang, and Ge Yu. Enhancing Keyword Search in Relational Databases Using Nearly Duplicate Records. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2010.

2. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 1, January 2007.

3. Su Yan, Dongwon Lee, Min-Yen Kan, and Lee C. Giles. Adaptive Sorted Neighborhood Methods for Efficient Record Linkage. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2007.

4. Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. Entity Resolution with Iterative Blocking. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD '09), 2009, pp. 219-232.

5. A. E. Monge and C. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. In Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD), 1997.

6. M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive Name Matching in Information Integration. IEEE Intelligent Systems, 18(5):16-23, Sep/Oct 2003.

AUTHORS

Bhagyashri A. Kelkar received the B.E. in Computer Science & Engineering from Walchand College of Engineering, Sangli in 1995. She is currently pursuing the M.E. degree in Computer Science & Engineering.

Prof. K. B. Manwade received the M.Tech. degree in Computer Science from Shivaji University, Kolhapur. He is working as Head of the Computer Science Department in Ashokrao Mane Group of Institutes, Wathar.

Prof. G. A. Patil received the M.E. degree in Computer Science & Engineering from Walchand College, Sangli. He is working as Head of Department at D. Y. Patil College of Engineering & Technology, Kolhapur.
