- Open Access
- Authors : Churukanti Ravinder Reddy, Pvs Srinivas
- Paper ID : IJERTV2IS90324
- Volume & Issue : Volume 02, Issue 09 (September 2013)
- Published (First Online): 16-09-2013
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Optimality of Training Cost and Evidence Results Identification Using Cutting Plane and Structural SVM Approach
Churukanti Ravinder Reddy 1 PVS Srinivas 2
1 PG Scholar, Department of Computer Science and Engineering, TKR College of Engineering & Technology, Medbowli, Meerpet, Saroor Nagar, Hyderabad 500 097.
2 Professor & Head, Department of Computer Science and Engineering, TKR College of Engineering & Technology, Medbowli, Meerpet, Hyderabad 500 097
ABSTRACT
Many e-commerce, digital library, and government databases contain noisy and redundant records and therefore cannot be considered quality databases. Organizations offer different approaches to remove duplicates and create quality databases; the existing approaches rely on statistics, ad-hoc domain knowledge, or training, and all of them are expensive at duplicate-detection time. Genetic programming approaches decompose the unique records but yield fewer evidence results: their training cost is high and their decomposition is not optimal. In this paper we propose a new training approach, the cutting plane approach, which detects all duplicates in less time and is also less expensive. The training step outputs the unique records, which are then decomposed by a structural SVM using a multi-way classification approach, and the results are folded and aligned into a tree-structured format. Compared with all previous approaches, the structural SVM shows better and more optimal results with good supporting evidence.
KEYWORDS: Structural SVM, cutting plane algorithm, de-duplication, DNA sequence alignment.
INTRODUCTION
Present-day databases take on new content every day and grow into very large databases. In large databases, extracting results takes more time because of duplicates. In this paper we propose to remove the duplicates and provide meaningful, useful records.
In existing approaches, data-entry problems affect the performance of duplicate recognition. Pattern-recognition approaches fail to detect duplicates and show a low duplicate ratio. Other approaches remove duplicates through training, but they have issues such as high training cost and less optimal results.
In this paper we propose a new training approach, the cutting plane approach, which detects duplicates in less time and with lower training cost. All unique records are then arranged into a DNA-structure format with meaningful and better evidence. The following sections describe the complete implementation of the proposed system.
RELATED WORK
Record duplication is a growing research topic in the database environment. Many problems arise when extracting results written in different styles: it is not possible to recognize all duplicate records in a database environment, so some duplicate information may be missed.
Similarity functions were later introduced into this environment for duplicate detection, but they gave inconsistent results during extraction. A newer concept, approximate query processing based on an edit-distance mechanism, makes duplicate detection possible within a limited distance specification.
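The edit-distance mechanism described above can be sketched in a few lines; the function names and the distance threshold `k` are illustrative assumptions, not details from the original paper:

```python
def edit_distance(a, b):
    # classic Levenshtein distance via a rolling-row dynamic program
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def near_duplicate(a, b, k=2):
    # two records count as duplicates when their edit distance stays
    # within the limited distance k (an illustrative threshold)
    return edit_distance(a, b) <= k
```

Here `near_duplicate` performs the "within the limited distance" check: only record pairs whose distance is at most `k` are flagged.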
Research then moved to machine-learning techniques. Some existing supervised clustering algorithms work with a training phase and can remove up to about 50% of the duplicates. A related variant, the semi-supervised clustering algorithm, works on a single-rule classification mechanism.
Probabilistic approaches report their results as statistical performance records. A duplicate weight is calculated for every record, and a threshold applied to these weights filters out the unique records as the optimal output. In the same duplicate-detection environment a boundary-detection process can be applied: for a selected record the duplicate weight is identified, the records below the boundary are output as unique, and the remaining records are output as duplicates. These techniques are fully supervised clustering techniques.
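The threshold filter over duplicate weights described above can be sketched as follows; the function and parameter names are illustrative, not taken from the paper:

```python
def split_by_weight(records, weight, boundary):
    # boundary-detection sketch: records whose duplicate weight falls
    # below the boundary are treated as unique; the rest are duplicates
    unique, duplicates = [], []
    for r in records:
        (unique if weight(r) < boundary else duplicates).append(r)
    return unique, duplicates
```

In practice `weight` would be a learned scoring function over records or record pairs; here any callable returning a number works.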
The supervised clustering techniques above do not provide effective duplicate detection in implementation. Unsupervised clustering techniques were therefore introduced for duplicate-record detection, improving the duplicate ratio. These techniques follow a number of rules to detect duplicates along all dimensions, which makes duplicate detection a potentially unbounded process in the implementation.
Finally, ranking techniques display results based on a decision-tree approach, and results can also be displayed based on a range approach. Compared with the approaches above, they provide better results in implementation, but they still do not provide an optimal solution.
Current genetic programming and generational evolutionary approaches provide effective record alignment with useful evidence, but the full evidence results require many training and testing iterations. More training iterations bring issues such as operational cost and overhead: extracting meaningful results involves many steps, each step consumes resources, and the more steps there are, the higher the resource utilization in the implementation.
In this paper we propose a new approach that extracts efficient, duplicate-free results using fewer training and testing iterations. The resulting unique records serve as meaningful records with better evidence, and all evidence results are shown with a lower operational cost.
PROPOSED SYSTEM
Figure 1 shows the proposed system architecture with the new training approach. The new training approach is applied in the searching task and displays the unique records with less operational cost. The structural SVM (structural support vector machine) is then applied directly to the unique records to produce the output, giving useful and meaningful pattern results with proof.

Fig1: Proposed system architecture with less training cost
The architecture contains the following steps for quality results:
- Display the unique records using the cutting plane algorithm
- Apply the structural SVM to the unique records
- Align the records in a DNA structure
- Calculate precision and recall
The input is a set of multiple web databases. An input query is entered as a topic and searched across the databases, and the results related to the query are displayed. These results contain some duplicates and amount to a huge set of data records.
Using cutting plane display unique records:
The cutting plane algorithm is introduced to remove duplicates from a huge set of database records. There is no need to search the whole database to remove the duplicates, so they are removed with a low training cost. This is a hyperplane-based approach that removes duplicates easily and with low overhead cost.
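The idea of avoiding a full pass over the database per solver step can be illustrated with a working-set sketch of cutting-plane training for a linear SVM over record-pair features. The subgradient solver and all parameter values below are our own illustrative choices, a minimal approximation of the cutting-plane scheme rather than the paper's exact algorithm:

```python
import numpy as np

def svm_subgradient(X, y, C=10.0, epochs=200, lr=0.1):
    # plain subgradient descent on 0.5*||w||^2 + (C/n) * sum of hinge losses
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        viol = margins < 1                       # margin-violating examples
        grad = w - (C / len(y)) * (y[viol, None] * X[viol]).sum(axis=0)
        w -= lr * grad
    return w

def cutting_plane_svm(X, y, C=10.0, rounds=10, batch=5):
    # working-set approximation of cutting-plane training: retrain on the
    # current working set, then add the most-violated examples to it, so
    # the solver never optimizes over the whole dataset at once
    work = list(range(min(batch, len(y))))
    w = np.zeros(X.shape[1])
    for _ in range(rounds):
        w = svm_subgradient(X[work], y[work], C=C)
        losses = np.maximum(0.0, 1.0 - y * (X @ w))
        order = [int(i) for i in np.argsort(-losses) if int(i) not in work]
        new = [i for i in order[:batch] if losses[i] > 1e-6]
        if not new:
            break                                # no violated constraints remain
        work.extend(new)
    return w
```

In a de-duplication setting, each row of `X` would be a similarity-feature vector for a candidate record pair and `y` would mark it as duplicate (+1) or non-duplicate (-1).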
Apply structural SVM in unique records:
All unique records are decomposed with the structural SVM, which performs a multi-way classification operation. All features are categorized in order to display effective, high-performance results in the output.
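Multi-way classification is a standard special case of the structural SVM: the joint feature map places the input vector into the block belonging to the candidate class (the Crammer-Singer encoding), and prediction is an argmax over classes. A minimal sketch with illustrative names, not the paper's exact formulation:

```python
import numpy as np

def joint_feature(x, y, n_classes):
    # Psi(x, y): copy x into the block of the feature vector for class y
    psi = np.zeros(n_classes * len(x))
    psi[y * len(x):(y + 1) * len(x)] = x
    return psi

def predict(w, x, n_classes):
    # structural SVM inference for multi-class: argmax_y  w . Psi(x, y)
    scores = [w @ joint_feature(x, y, n_classes) for y in range(n_classes)]
    return int(np.argmax(scores))
```

Training would choose `w` so that the correct class's block scores higher than every other class's by a margin; here we only show the inference step.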
Alignment of records in DNA structure:
After all records are categorized, everything is arranged into one structure format. The structure is nothing but a folding of the records into their correct positions; the data from the different folds are combined to display the final structure.
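The DNA-style alignment of record content can be illustrated with a classic Needleman-Wunsch global alignment score, the standard dynamic program used for DNA sequences, applied here to record field strings by analogy; the scoring values are illustrative assumptions:

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    # Needleman-Wunsch global alignment score between two sequences
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap                      # align prefix of a against gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap                      # align prefix of b against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,  # match / mismatch
                           dp[i - 1][j] + gap,    # gap in b
                           dp[i][j - 1] + gap)    # gap in a
    return dp[n][m]
```

A higher score means the two records fold into more closely matching positions; a full implementation would also backtrack through `dp` to recover the alignment itself.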
Calculate the precision and recall:
False-positive records are removed and the true records are identified as the best useful record content, so the final result contains only true-positive records.
Precision = Number of correctly identified duplicate pairs / Number of identified duplicate pairs
Recall = Number of correctly identified duplicate pairs / Number of true duplicate pairs
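The two measures can be computed directly from sets of record pairs; a minimal sketch:

```python
def precision_recall(predicted_pairs, true_pairs):
    # precision = correctly identified duplicate pairs / all identified pairs
    # recall    = correctly identified duplicate pairs / all true duplicate pairs
    predicted, true = set(predicted_pairs), set(true_pairs)
    correct = len(predicted & true)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(true) if true else 0.0
    return precision, recall
```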
EXPERIMENTAL RESULTS
We tested the different training approaches and measured their training cost. The training-cost performance of all the approaches is shown in the graph.
Fig2: Training cost comparison of the different approaches
CONCLUSION AND FUTURE WORK
All previous training approaches, including the standard SVM, consume more cost with high operational overhead, and they do not show optimal evidence patterns.
In this paper we have shown optimal results with a lower training cost using the cutting plane approach. Using the structural SVM, the results are also arranged into an effective pattern or structure.
In future work, other new training approaches may be introduced to further reduce the training cost, with the unique records decomposed in a structured manner with better evidence.
ACKNOWLEDGMENT
We are thankful to TKR College of Engineering & Technology, which provided good facilities for the completion of this research, and to my guide Dr. P. V. S. Srinivas for his great support in every aspect of this work.
REFERENCES
[1] T. Joachims, T. Finley, and C.-N. J. Yu, "Cutting-Plane Training of Structural SVMs," 2011.
[2] "Optimization Techniques to Record Deduplication," 2012.
[3] "Evolutionary Tuning for Distributed Database Performance," 2010.
[4] "On Evaluation and Training-Set Construction for Duplicate Detection," 2012.
[5] "A Language Independent Approach for Detecting Duplicated Code," 2009.
[6] "Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases," 2011.
[7] "A Genetic Programming Approach to Record De-duplication," 2011.
[8] "Learning Linkage Rules using Genetic Programming," 2009.
[9] "Adaptive Duplicate Detection Using Learnable String Similarity Measures," 2010.
[10] "Collection Statistics for Fast Duplicate Document Detection," 2009.
[11] "Near-Duplicate Detection by Instance-level Constrained Clustering," 2008.