A Review on Data Normalization Techniques

Bulk data is generated from various sources, and these sources may provide duplicate data with some representational changes. Mining such big data and creating representative data from it is a challenging task. The importance of data increases when it is linked with similar resources and similar data is fused into one source. A lot of research work has been done to provide a single representative record for each real-world entity by removing duplicate records. This task is called record normalization. This paper studies various existing data normalization techniques along with their advantages and limitations. Based on the study of existing systems, a new technique is proposed.
Index Terms: Record normalization, data clustering, data fusion, data linkage, data integration.


I. INTRODUCTION
Bulk data is generated on the World Wide Web. Based on the user's search parameters, data is collected from various sources. Structured data is stored in web warehouses containing web databases and web tables. Relevant data is collected from various warehouses such as Google and Bing Shopping; Google Scholar is another important mining domain. This process is known as web data integration. In web data integration, structured data coming from various web warehouses should be matched automatically. Records that point to the same entity should be grouped together as a standard record set.
The result set returned for a search engine query contains redundant results, with multiple entries of the same record coming from various sources. Such a result set, with its duplicate and unnecessary entries, is inconvenient for the end user to analyze.
Record normalization is important in a variety of domains. For example, in the research publication domain, CiteSeer and Google Scholar are important integrator websites that collect data from various sources using automatic data collection techniques. The data is displayed to the user based on the user query, and it should be clear and in normalized form. The search result should satisfy two requirements:
1. Best match search
2. Data should be de-duplicated
If an ad-hoc approach to data matching is followed, or if all matched records are displayed to the end user, it becomes very frustrating for the end user to sort and extract useful information from the generated result set. Ad-hoc extraction of records may also lead to records with missing values or incorrect data representation.
Record normalization is a challenging problem because various sources provide the same data in different formats. Data collected from various sources conflicts because of erroneous data, incomplete data, different data representations, or missing attribute values.
Consider an example: a user fires the search query "Data integration: the teenage years", and various records are fetched based on title matching. In the table of fetched records, the same author name is represented in various forms, and the venue and pages fields contain missing values or variations in the representation of the same data. By analyzing all the records, a single normalized record should be generated. For more precision, the values within each field should also be normalized.
In the following section the literature survey is discussed, followed by the problem formulation. Based on the analyzed problem, a new system is proposed in Section IV, followed by the conclusion.

II. LITERATURE SURVEY
Culotta et al. were the first to propose record normalization. The normalization technique is also called canonicalization: the process of converting data into one standard canonical form by analyzing various parameters. The authors propose a technique for record normalization in databases. It does not consider normalization at the value-component level, and hence the normalized record database contains many instances of repetitive data and unnecessary normalized records [2].
Swoosh treats data duplication as an entity resolution (ER) problem. The functions that match and merge records are treated as black boxes, and the ER algorithm is defined in terms of invoking these functions. The system generates de-duplicated records but does not generate normalized records. It also increases the complexity of the record matching problem [3].
Wick et al. propose a technique for data integration using schema matching. It also addresses coreference resolution and record canonicalization, and uses a discriminatively trained model for implementation. Because of these combined objectives, the system complexity increases. The paper deals only with field-level record matching and not with the value level, and hence the system does not generate completely normalized records [4].
Tejada et al. propose a technique for database record normalization called object normalization. The system collects data from various web sources and saves it collectively in a database. At search time, these database objects are normalized and duplicates are removed. The system ranks attributes, and ranks strings within an attribute, based on the user's confidence score [5].
Wang et al. work on a shopping dataset, which is normalized in terms of records. The approach covers data integration and data cleaning: it performs record matching and replaces missing values with the most relevant values. It also corrects data by comparing a record with the entries of other dataset records and choosing the best-suited value. It does not work at the value level; normalization is applied globally at the field level [6].
Chaturvedi et al. work on pattern discovery in records. This technique does not focus on data normalization or the removal of duplicate records; instead, it extracts patterns from duplicate records and finds the most important and prevalent patterns in the dataset. This approach can be applied to data normalization [7].
Dragut et al. work on automatic labeling, called label normalization. Label normalization is used for record normalization and for assigning meaningful labels to the elements of an integrated query interface. It works on field-level labeling and assigns a label to each attribute within the global interface [8].
S. Raunich et al. propose the ATOM system. The ATOM system works on ontology merging, which is essentially a form of record normalization. However, user involvement is required in the merging phase; the approach should be automated with less involvement of the end user [9].
Yongquan Dong et al. work on automatic record normalization. The normalization is performed at three levels: record level, field level, and value level. The normalization accuracy increases at each level of data pruning. Duplicate records are removed, and a single entry is created by analyzing the duplicate entries. However, related entries are not clubbed together; only a single representation of the record is created. For a more informative data representation, the data should be both normalized and linked together. In addition, the data is processed with string operation functions and no natural language processing (NLP) techniques are used. NLP techniques may produce more accurate results with less processing [1].

III. PROBLEM FORMULATION
Let E1 be a real-world entity, and let Re be the set of records collected from various sources that represent the same entity E1: Re = {R1, R2, ..., Rp}. Each record is a collection of fields, and each field holds string values. Let FS = {f1, f2, …, fq} be the set of fields, and let Ri[fj] be the value of field fj in record Ri. The problem is defined as a record normalization and linking problem: from the set Re, generate a new customized record that represents the entity E1 more accurately and descriptively, using natural language processing techniques.
Records of other entities similar to E1 should be linked together by matching their field-level and value-level components.

IV. PROPOSED METHODOLOGY
A. Preliminaries:
1. Frequency Ranker: The frequency ranker ranks the distinct units by how often each unit u occurs, so that the most frequently occurring unit is ranked first.
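The frequency ranker can be sketched as follows. This is a minimal illustration, not the implementation of the proposed system: the function name `frequency_rank` and the tie-breaking rule (units with equal counts keep their order of first appearance) are assumptions introduced here.

```python
from collections import Counter


def frequency_rank(units):
    """Return the distinct units ordered from most to least frequent.

    Tie-breaking by first appearance is an assumption made for
    determinism; the source does not specify how ties are resolved.
    """
    counts = Counter(units)
    first_seen = {}
    for position, unit in enumerate(units):
        first_seen.setdefault(unit, position)
    return sorted(counts, key=lambda u: (-counts[u], first_seen[u]))


# Illustrative venue strings taken from duplicate records of one entity.
venues = ["VLDB", "Very Large Data Bases", "VLDB", "VLDB Conf.", "VLDB"]
ranked = frequency_rank(venues)  # ranked[0] is the most frequent unit
```

The top-ranked unit is a candidate representative value for the field; the other rankers can then be applied to refine or override that choice.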

Edit-distance based similarity measure:
The edit distance is the number of edits required to transform one string into another. The edit-distance based similarity between two strings a and b is defined in terms of this distance and of |a| and |b|, the lengths of a and b respectively.
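A common way to turn the edit distance into a similarity in [0, 1] is sim(a, b) = 1 - ED(a, b) / max(|a|, |b|). The sketch below assumes this normalization, which may differ from the one intended here, and uses the Levenshtein distance (insertions, deletions, and substitutions each cost 1). The function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - ED(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

With this normalization, identical strings score 1.0 and strings that share no aligned characters score 0.0, which makes the measure comparable across fields of different lengths.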

Bigram similarity measure:
This measure is based on the 2-character substrings present in a string. The similarity between strings a and b is defined in terms of bigram(a) and bigram(b), the 2-grams of a and b respectively.
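One widely used bigram similarity is the Dice coefficient, 2 |bigram(a) ∩ bigram(b)| / (|bigram(a)| + |bigram(b)|). The sketch below assumes this form and treats the bigrams as sets rather than multisets; both choices are assumptions, since the exact formula is not given here.

```python
def bigrams(s: str) -> set:
    """Set of all 2-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over bigram sets (assumed form of the measure)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        # Strings shorter than 2 characters have no bigrams;
        # fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

Unlike the edit distance, this measure is insensitive to the position of the shared substrings, so it tolerates reordered tokens such as "Halevy A" versus "A Halevy" better.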
6. Feature-based rankers: Feature-based rankers are divided into two categories:
a. Strategy feature: This is a binary indicator of whether the unit is a representative unit ranked by some ranking criterion.
b. Text feature: This feature examines the properties of the string. It checks whether the string is an acronym or an abbreviation of a certain representative string. For example, conf is an abbreviation of conference, whereas VLDB is an acronym for Very Large Databases.
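The text feature can be approximated with simple string heuristics. The sketch below is illustrative only: the function names and matching rules are assumptions, and the proposed system may use different criteria. An abbreviation is taken to be a shorter string that starts with the same letter and whose characters appear in order in the full string; an acronym is taken to be an uppercase string that contains every word initial in order and is itself an in-order subsequence of the full string.

```python
def _is_subsequence(small: str, big: str) -> bool:
    """True if the characters of small appear in big in the same order."""
    remaining = iter(big)
    return all(ch in remaining for ch in small)


def is_abbreviation(short: str, full: str) -> bool:
    """Heuristic: 'conf' or 'conf.' abbreviates 'conference'."""
    s = short.rstrip(".").lower()
    f = full.lower()
    return 0 < len(s) < len(f) and s[0] == f[0] and _is_subsequence(s, f)


def is_acronym(short: str, full: str) -> bool:
    """Heuristic: 'VLDB' is an acronym for 'Very Large Databases'."""
    words = full.split()
    if not short.isupper() or len(words) < 2:
        return False
    initials = "".join(word[0].lower() for word in words)
    s = short.lower()
    return _is_subsequence(initials, s) and _is_subsequence(s, full.lower())
```

The acronym check deliberately allows extra letters taken from inside a word, so that "VLDB" still matches when "Databases" is written as a single word.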

7. Collocation:
A collocation is a sequence of consecutive terms whose inverse document frequency (idf) value is less than a given threshold. An n-collocation is a collocation of n consecutive terms.

8. Sub-collocation:
A sub-collocation is a substring of an n-collocation that consists of k consecutive terms. For example, "in the conference" is a sub-collocation of "in the conference of VLDB".
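The sub-collocation relation amounts to a contiguous match at the term level. A minimal sketch follows, assuming whitespace tokenization and allowing k to be at most n; both are assumptions, and the function name is illustrative.

```python
def is_sub_collocation(candidate: str, collocation: str) -> bool:
    """True if candidate's terms occur consecutively inside collocation."""
    sub = candidate.split()
    terms = collocation.split()
    k, n = len(sub), len(terms)
    if k == 0 or k > n:
        return False
    # Slide a window of k terms over the n-collocation.
    return any(terms[i:i + k] == sub for i in range(n - k + 1))
```

Matching on whole terms rather than raw characters prevents a partial word such as "con" from being accepted as a sub-collocation of "in the conference".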

9. Template collocation:
An n-collocation is called a template collocation if its inverse document frequency (idf) is greater than the given threshold.
10. Twin template collocation: The terms tc1 and tc2 are twin collocations if they satisfy the following conditions:
p(tc1, tc2) > p(tc1, tc), for all tc ∈ TC and tc1 ≠ tc2
p(tc1, tc2) / p(tc2) > threshold
B. System Architecture
A redundant record set is the input to the system. After processing, the system generates a non-redundant normalized record set along with data linking. The data processing is mainly categorized into 5 sections:
1. Record selection: A record is formed from a combination of various fields. Every field should have a value so that a complete, informative citation can be generated as the representative of all redundant data. This is the selection criterion for record-level data filtering. The selected records are further processed at the field and value levels.
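The record-level selection criterion can be sketched as a completeness filter. In this illustration records are plain dictionaries and the field names are hypothetical; the sample values are made up for the example and are not the records fetched in the paper.

```python
def select_complete_records(records, fields):
    """Keep only the records that have a non-empty value for every field."""
    def has_value(record, field):
        value = record.get(field)
        return value is not None and str(value).strip() != ""

    return [r for r in records if all(has_value(r, f) for f in fields)]


# Hypothetical field names and illustrative duplicate records.
FIELDS = ["author", "title", "venue"]
records = [
    {"author": "A. Halevy", "title": "Data integration: the teenage years", "venue": "VLDB"},
    {"author": "Halevy, A.", "title": "Data integration: the teenage years", "venue": ""},
    {"author": "Alon Halevy", "title": "Data integration: the teenage years"},
]
complete = select_complete_records(records, FIELDS)  # only the first record survives
```

Records that fail the filter are not necessarily discarded by a full system; here they are simply excluded from the set of candidates for the representative record.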

2. Field Selection:
The normalized record is generated by combining the most descriptive features of all fields. The data of each field is normalized across all records, and then a new record is generated. For record normalization, the frequency ranker, length ranker, centroid ranker, and feature-based rankers are used.

3. Value Selection:
The values of each field are extracted. Abbreviations and acronyms are replaced using the Mining Abbreviation-Definition Pairs algorithm. Afterwards, collocations, sub-collocations, and twin template collocations are identified using the Mining TemplateCollocation-SubCollocation Pairs (MTS) algorithm. A normalized record is then generated at the value level. For mining abbreviations, an NLP technique is used that helps to find n-gram nouns and their respective abbreviations.

4. Field based Clusters:
Based on the normalized value extracted for each field in a record, relevant records are linked according to their field values.
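Linking on normalized field values can be sketched as a grouping step. The example below assumes that records are dictionaries whose values have already been normalized, and it uses exact matching after case and whitespace folding as the linking rule; the function name and the matching rule are assumptions, and a fuller system could substitute one of the similarity measures from the preliminaries.

```python
from collections import defaultdict


def link_by_field(records, field):
    """Group record indices that share the same normalized value of a field.

    Returns only the clusters that contain at least two records,
    keyed by the folded field value.
    """
    clusters = defaultdict(list)
    for index, record in enumerate(records):
        value = record.get(field)
        if value is None or not str(value).strip():
            continue  # records without a value cannot be linked on this field
        key = " ".join(str(value).lower().split())
        clusters[key].append(index)
    return {key: members for key, members in clusters.items() if len(members) > 1}
```

Running the function once per field yields one set of clusters per field, for example all records that share a venue or all records that share an author.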

V. CONCLUSION
In this paper, a study of various existing systems is presented. Record normalization is carried out in a variety of ways, such as duplicate removal, record-level normalization, field-level normalization, and value-level normalization. Some techniques require user involvement in the normalization process to improve accuracy. These techniques use string processing functions to find record-level, field-level, and value-level duplicates and alternate text used in the same context. None of the existing techniques uses NLP for value-level normalization. The proposed system generates normalized records by removing duplicate entries that point to the same entity. Data normalization is applied at three levels: record level, field level, and value level. The precision of de-duplication increases from the record level to the value level. The values are normalized using an NLP n-gram technique. Along with duplicate removal, similar entities are grouped together using field-level and value-level data comparison. The grouped data is linked together to generate more representative data.