Learning of Restriction for Partially Supervising Huddles

Download Full-Text PDF Cite this Publication

Text Only Version

Learning of Restriction for Partially Supervising Huddles

Sundari. P1, PG Student1, Kalaiselvan. R2, Asst. Professor, Department Of Computer Science and Engineering,

1PG Student, Parisutham Institute of Technology and Science, Thanjavur, Tamilnadu, India.

2Asst. Professor, Parisutham Institute of Technology and Science, Thanjavur, Tamilnadu, India.

Abstract Supervising huddles learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labelled data with a large amount of unlabeled data. The huddles learning falls between unsupervised learning (without any labelled training data) and supervised learning (with completely labelled training data). Many machine huddles researchers have found that unlabeled data, when used in accuracy with a small amount of labelled data, can produce considerable improvement in learning accuracy. The cost associated with the labelling process thus may render a fully labelled training set infeasible, whereas attainment of unlabeled data is relatively inexpensive. In such situations, supervising huddles learning can be of great practical value. Partially supervising huddles learning is also of theoretical interest in machine learning and as a model for human learning.

Keywords Active learning, clustering, machine learning

  1. INTRODUCTION

    Semi-supervised learning attempts to make use of this combined information to surpass the classification performance that could be obtained either by discarding the unlabeled data and doing supervised learning or by discarding the labels and doing unsupervised learning. Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data only. The goal of inductive learning is to infer the correct mapping from to .Intuitively; we can think of the learning problem as an exam and labelled data as the few example problems that the teacher solved in class. The teacher also provides a set of unsolved problems. In the transductive setting, these unsolved problems are a take-home exam and you want to do well on them in particular. In the inductive setting, these are practice problems of the sort you will encounter on the in-class exam. It is unnecessary (and, according to Vanes, imprudent) to perform transductive learning by way of inferring a classification rule over the entire input space; however, in practice, algorithms formally designed for transduction or induction are often used interchangeably.

    Some methods for semi-supervised learning are not intrinsically geared to learning from both unlabeled and labeled data, but instead make use of unlabeled data within a supervised learning framework. For instance, the labelled and unlabeled examples may inform a choice of representation, distance metric, or kernel for the data in an unsupervised first step. Then supervised learning proceeds from only the labelled examples.

    Self-training is a wrapper method for semi-supervised learning. First a supervised learning algorithm is used to select a classifier based on the labelled data only. This classifier is then applied to the unlabeled data to generate more labelled examples as input for another supervised learning problem. Generally only the labels the classifier is most confident of are added at each step.

    Co-training is an extension of self-training in which multiple classifiers are trained on different (ideally disjoint) sets of features and generate labelled examples for one another.

  2. BACKGROUND

Data mining an interdisciplinary subfield of computer science is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set [4] and transform it into an understandable structure for further use.

Traditional data mining algorithms when applied on these big data results in poor performance with respect to the computational part [1]. So, there is a need to parallelize the traditional data mining algorithms. There has been several research works carried out to handle and process the Big data. Google has developed a software framework called Map Reduce to support large distributed data sets on clusters of computers, which is effective to analyse large amounts of data. Followed by Googles work, many implementations of Map Reduce emerged and lots of traditional methods combined with Map Reduce have been presented such as Apache Hadoop , Phoenix, Mars, Twister . Apache Hadoop is a software framework that

helps constructing the reliable, scalable distributed systems. Hadoop enables users to store and process large volumes of data and analyse it in ways not previously possible with less scalable solutions or standard SQL-based approaches. In our work, we have discussed the various advantages of incorporating parallelism in existing mining algorithms and proposed a system for mining Big data.

The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amount of data, not the extraction of data itself. It also is a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support system, including artificial intelligence, machine learning, and business intelligence. The popular book "Data mining: Practical machine learning tools and techniques with Java"(which covers mostly machine learning material) was originally to be named just "Practical machine learning", and the term "data mining" was only added for marketing reasons. Often the more general terms "(large scale) data analysis", when referring to actual methods, artificial intelligence and machine learning are more appropriate.

The actual data mining task is the automatic or semi- automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records, unusual records and dependencies[5]This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting are part of the data mining step, but do belong to the overall KDD process as additional steps.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered[3]. These methods can, however, be used in creating new hypotheses to test against the larger data populations

A. Methods

Some methods for semi-supervised learning are not intrinsically geared to learning from both unlabeled and labelled data, but instead make use of unlabeled data within a supervised learning framework. For instance, the labelled and unlabeled examples may inform a choice of representation, distance metric for the data in an unsuperised first step. Then supervised learning proceeds from only the labelled examples.

Self-training is a wrapper method for semi-supervised learning. First a supervised learning algorithm is used to select a classifier based on the labelled data only. This

classifier is then applied to the unlabeled data to generate more labelled examples as input for another supervised learning problem. Generally only the labels the classifier is most confident of are added at each step.

Co-training is an extension of self-training in which multiple classifiers are trained on different (ideally disjoint) sets of features and generate labelled examples for one another.

    1. imitations

      • Faculty need to be expert in the content area.

      • May be difficult to organize active learning experiences.

      • Requires more time and energy and may be stressful for faculty.

      • Faculty may receive less favorable evaluations from students.

      • Students may be stressed because of the necessity to adapt to new ways of learning.

C.Techniques

Graph-based methods for semi-supervised learning use a graph representation of the data, with a node for each labelled and unlabeled example. The graph may be constructed using domain knowledge or similarity of examples; two common methods are to connect each data point to its K nearest neighbours or to examples within some distance .

  1. Clustering methods

    The cluster administrator specifies list of nodes to be decommissioned. Once a Data Node is marked for decommissioning, it will not be selected as the target of replica placement, but it will continue to serve read requests. The Name Node starts to schedule replication of its blocks to other Data Nodes[2]. Once the Name Node detects that all blocks on the decommissioning Data Node are replicated, the node enters the decommissioned state. Then it can be safely removed from the cluster without jeopardizing any data availability.

  2. Inter-cluster data

    When working with large datasets, copying data into and out of a HDFS cluster is daunting. HDFS provides a tool called Dist Cp for large inter/intra-cluster parallel copying. It is a Map Reduce job; each of the map tasks copies a portion of the source data into the destination file system. The Map Reduce framework automatically handles parallel task scheduling, error detection and recovery.

  3. Durability of Data

    Replication of data three times is a robust guard against loss of data due to uncorrelated node failures. So for the sample large cluster as described above, a node or two is lost each day. That same cluster will re-create the 60 000 block replicas hosted on a failed node in about two minutes: re-replication is fast because it is a parallel problem that scales with the size of the cluster the probability of several nodes.

    Fig. 1 Active learning system

  4. Interactive Image Retrieval

    Active learning methods have been considered with increased interest in the statistical learning community. Initially developed within a classification framework, a lot of extensions are now being proposed to handle multimedia applications. This paper provides algorithms within a statistical framework to extend active learning for online content-based image retrieval (CBIR). The classification framework is presented with experiments to compare several powerful classification techniques in this information retrieval context. Focusing on interactive methods, active learning strategy is then described. The limitations of this approach for CBIR are emphasized before presenting our new active selection process RETIN. First, as any active method is sensitive to the boundary estimation between classes, the RETIN strategy carries out a boundary correction to make the retrieval process more robust. Second, the criterion of generalization error to optimize the active learning selection is modified to better represent the CBIR objective of database ranking[9]. Third, a batch processing of images is proposed. Our strategy leads to a fast and efficient active learning scheme to retrieve sets of online images (query concept).

    Experiments on large databases show that the RETIN method performs well in comparison to several other active strategies.

  5. Graph Based Active Learning

    In many learning tasks, to obtain labeled instances is hard due to heavy cost while unlabeled instances can be easily collected.

    Active learners can significantly reduce labeling cost by only selecting the most informative instances for labeling. Graph- based learning methods are popular in machine learning in recent years because of clear mathematical framework and strong performance with suitable models. However, they suffer heavy computation when the whole graph is in huge size. In this paper, we propose a scalable algorithm for graph- based active learning[10]. The proposed method can be described as follows. In the beginning, a backbone graph is constructed instead of the whole graph. Then the instances in the backbone graph are chosen for labeling. Finally, the instances with the maximum expected information gain are sampled repeatedly based on the graph regularization model. The experiments show that the proposed method obtains smaller data utilization and average deficiency than other popular active learners on selected datasets from semi- supervised learning benchmarks.

    Service recommendation [6] based on the similar users or similar services would either lose its timeliness or could not be done at all. In addition, all services are considered when computing services rating similarities in traditional CF algorithms while most of them are different to the target service. The ratings of these dissimilar ones may affect the accuracy of predicted rating.

  6. Datamining And Bigdata

    Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources so existing knowledge particular area only handled. The most fundamental challenge for the Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions may Possible [5].To compute similarity between every

    pair of users or services may take too much time, even exceed the processing capability of current RSs. Service recommendation based on the similar users or similar services would either lose its timeliness or could not be done at all because decided also business people only. Existing neural networks-based clustering algorithm in e-commerce recommendation system. The cluster analysis gathers users with similar characteristics according to the web visiting message data. it is hard to say that a users preference on web visiting is relevant to preference on purchasing. The vectors were clustered using a refined fuzzy C-means algorithm.

    Our active learning framework assumes the availability of a constraint-based clustering algorithm. For this purpose, we use the well-known MPCK Means [3] algorithm, as im- plemented in the Weka UT package. We set the maximum number of iterations of MPCK means to 200, and used default values for other parameters. Note that the choice of this algorithm is not critical and our method can be used with any constraint-based clustering algorithm.

    When evaluating the performance of a particular method on a given data set D, we apply it to select up to 150 pair wise queries, starting from no constraint at all. The queries are answered based on the ground-truth class label for the data set. MPCK means is then applied to the data with the resulting constraints (and their transitive closures). To account for the randomness in both active learning and MPCK means, we repeat this process for 50 independent runs and report the average performance using evaluation criteria described below.

  7. Evaluation Based on Clustering Performance

    This set of results are very similar to what we observe when evaluating using NMI. When using onl 20 queries, the performance of the nonrandom methods often do not demonstrate statistically significant difference. However, as we increase the number of queries, our method begins to dominate all other methods.

  8. Experimental Results

    This section presents the experiment results, which com- pare our proposed method to the baseline methods. In the remaining discussion, we will refer to our method as the normalized point-based uncertainty (NPU) method.

  9. Performance evaluation

F-measure focuses on how accurately we can predict the pair wise relationship between any pair of instances. In, we show the F-measure values achieved by different methods with query sizes of 20, 40, 60, 80, and 100. For each query size, we compare different methods against each other using paired t- test at 95 percent significance level and the best performing method(s) are then highlighted in boldface. Finally, provides a summary of the win/tie/loss counts of the proposed method versus the other methods.

IV .CONCLUSION

We empirically evaluate the proposed method on the eight benchmark data sets against a number of competing methods. The evaluation results indicate that our method achieves consistent and substantial improvements over its competitors. There are a number of interesting directions to extend our work. The iterative framework requires repeated re clustering of the data with an incrementally growing constraint set. This can be computationally demanding for large data sets. To address this problem, it would be interesting to consider an incremental semi-supervised clustering method that updates the existing clustering solution based on the neighborhood assignment for the new point. An alternative way to lower the computational cost is to reduce the number of iterations by applying a batch approach that selects a set of points to query in each iteration. A naive batch active learning approach would be to select the top k points that have the highest normalized uncertainty to query their neighborhoods.

ACKNOWLEDGEMENT

I would like to thank our guide Prof. Mr.R.KalaiSelvan Asst. Prof., Computer Science and Engineering Department, Parisutham Institute of Technology and Science, thanjavur for his help and guidance to enable us to propose this system.

REFERENCES

  1. S. Basu, A. Banerjee, and R. Mooney, Active Semi-Supervision for Pairwise Constrained Clustering, Proc. SIAM Intl Conf. Data Mining, pp. 333-344, 2004.

  2. S. Basu, I. Davidson, and K. Wagstaff, Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall, 2008.

  3. M. Bilenko, S. Basu, and R. Mooney, Integrating Constraints and Metric Learning in Semi-Supervised Clustering, Proc. Intl Conf. Machine Learning, pp. 11-18, 2004.

  4. I. Davidson, K. Wagstaff, and S. Basu, Measuring Constraint-Set Utility for Partitional Clustering Algorithms, Proc. 10th European Conf. Principle and Practice of Knowledge Discovery in Databases,pp.115-126, 2006.

  5. D. Greene and P. Cunningham, Constraint Selection by Commit- tee: An Ensemble Approach to Identifying Informative Con- straints for Semi-Supervised Clustering, Proc. 18th European Conf. Machine Learning, pp. 140-151, 2007.

  6. D. Cohn, Z. Ghahramani, and M. Jordan, Active Learning with Statistical Models, J. Artificial Intelligence Research, vol. 4, pp. 129- 145, 1996.

  7. Y. Guo and D. Schuurmans, Discriminative Batch Mode Active Learning, Proc. Advances in Neural Information Processing Systems,pp.593-600, 2008.

  8. S. Hoi, R. Jin, J. Zhu, and M. Lyu, Batch Mode Active Learning and Its Application to Medical Image Classification, Proc. 23rd Intl Conf. Machine learning, pp. 417-424, 2006.

  9. S. Hoi, R. Jin, J. Zhu, and M. Lyu, Semi-Supervised SVM Batch Mode Active Learning for Image Retrieval, Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-7, 2008.

  10. S. Huang, R. Jin, and Z. Zhou, Active Learning by Querying Informative and Representative Examples, Proc. Advances in Neural Information Processing Systems, pp. 892-900, 2010.

AUTHOR DETAILS

P.SUNDARI received degree B.tech (IT) from periyar Maniammai University in 2012.she is currently persuading M.E-Computer science in parisutham institute of technology and science, thanjavur, Tamilnadu.

R.Kalaiselvan received B.E from Anjalaiammal Mahalingam engineering college in 2006,Tiruvarur.Received Master of Science from Staffordshire university in 2011,United Kingdom.

Leave a Reply

Your email address will not be published. Required fields are marked *