A MapReduce-Based SVM Ensemble with Stochastic Gradient Descent

DOI : 10.17577/IJERTV5IS120168




Zhao Jin

Key Lab. of Machine Learning and Computational Intelligence

College of Mathematics and Information Science, Hebei University

Baoding City, Hebei Province, 071002, China

Shuxia Lu *

Key Lab. of Machine Learning and Computational Intelligence

College of Mathematics and Information Science, Hebei University

Baoding City, Hebei Province, 071002, China

Mi Zhou

Key Lab. of Machine Learning and Computational Intelligence

College of Mathematics and Information Science, Hebei University

Baoding City, Hebei Province, 071002, China

Abstract: Stochastic Gradient Descent (SGD) is a simple and effective algorithm for training SVMs. It is particularly fast for linear classification and can also be adapted to non-linear classification with Mercer kernels. Its running time scales linearly with the number of iterations and does not depend on the size of the training set. In order to improve the convergence rate and classification accuracy on large data sets, this paper proposes a MapReduce-based SVM ensemble algorithm with SGD. We use the Hadoop Distributed File System to store the large training set and the MapReduce parallel computing model to train several SVMs as an ensemble. The results show that our method achieves a faster convergence rate than Pegasos, a traditional SGD algorithm.

Keywords: Stochastic Gradient Descent; Support Vector Machine; MapReduce; Voting Mechanism; Ensemble

  1. INTRODUCTION

The Support Vector Machine (SVM) [1] is a machine learning algorithm for classification with good generalization ability. It is based on Statistical Learning Theory; the VC dimension and structural risk minimization form its theoretical foundation.

Stochastic gradient descent is an effective approach for training SVMs, where the objective is taken in its primal form rather than its dual form. It proceeds by iteratively choosing a labeled example at random from the training set and updating the model weights through a gradient step on the corresponding instantaneous objective function.

T. Zhang [2] proved that SGD with a constant learning rate can numerically achieve good accuracy, enabling a running time on the order of $1/(\lambda\epsilon)^{2}$ for the linear kernel. The NORMA algorithm [3] suggests a learning rate proportional to $1/\sqrt{t}$. Shai Shalev-Shwartz et al. [4] presented the Pegasos algorithm, an effective stochastic sub-gradient descent algorithm for solving SVMs, which adopts a learning rate of $1/(\lambda t)$. Zhuang Wang et al. [5] presented a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training that have constant space and constant time complexity per update.

Krzysztof Sopyla et al. [6] presented the SGD-BB algorithm, which shows lower sensitivity to the choice of initial parameters.

According to machine learning theory, more training examples are not always better for recognizing a pattern. In fact, the SGD procedure does not use all the training examples but only a subset chosen at random from the training set during the training process. Therefore, when the training set is too large to load into a PC's memory, we can choose a subset of the training set to reach the optimal solution. Several methods exist for dealing with the big data problem. Nasullah Khalid Alham et al. [7] presented MRESVM, a MapReduce-based distributed SVM ensemble algorithm for scalable image annotation. Ferhat et al. [8] proposed a different MapReduce-based distributed SVM algorithm for binary classification. There are two differences between [7] and [8]. First, in [7] the subsets used to train each single SVM are selected by bootstrapping, whereas in [8] the training set is simply split into blocks that are treated as subsets. Second, [7] is an ensemble algorithm with a single MapReduce pass, while [8] is an iterative algorithm that runs the MapReduce process many times. Both [7] and [8] use Sequential Minimal Optimization (SMO) [9] to train the SVMs.

In this paper, we propose a MapReduce-based SVM ensemble using the SGD algorithm, called MR-SGD. In MR-SGD, we use the MapReduce programming model and the Hadoop Distributed File System to train several SVMs in parallel. Each SVM in MR-SGD is trained on a subset of the training set with the SGD algorithm. The proposed algorithm is evaluated against Pegasos [4]. The results show that MR-SGD achieves a faster convergence rate in almost all cases.

This paper is organized as follows: Section II discusses the basic stochastic gradient descent algorithm for SVMs, for both the linear kernel and non-linear kernels. Section III presents the proposed algorithm and its implementation details. Section IV reports the experimental results on several datasets, and the last section draws conclusions.

  2. STOCHASTIC GRADIENT DESCENT (SGD) FOR SVMS

    1. SGD for linear kernel

Consider a binary classification problem with data set $S = \{(x_i, y_i)\},\ i = 1, \dots, m$, where each instance $x_i \in \mathbb{R}^n$ is an $n$-dimensional input vector and $y_i \in \{-1, +1\}$ is the corresponding label of that instance. Training an SVM using $S$ is formulated as solving the following optimization problem:

$$\min_{W} P(W) = \frac{\lambda}{2}\|W\|^{2} + \frac{1}{m}\sum_{(x,y)\in S} l\big(W;(x,y)\big), \qquad (1)$$

where $\lambda > 0$ is a regularization parameter used to control model complexity and

$$l\big(W;(x,y)\big) = \max\{0,\ 1 - y\langle W, x\rangle\} \qquad (2)$$

is the hinge loss function, with $\langle W, x\rangle$ denoting the standard inner product between the vectors $W$ and $x$. The classifier is expressed as

$$f(x) = W^{T}x, \qquad (3)$$

where $W$ is a vector of weights associated with each input. We study the case where the bias is set to zero.

SGD operates as follows. Initially, we set $W_1$ to the zero vector. On iteration $t$, we first choose a training example $(x_{i_t}, y_{i_t})$ by picking an index $i_t \in \{1, \dots, m\}$ uniformly at random. We then replace the objective in Eq. (1) with an approximation based on the training example $(x_{i_t}, y_{i_t})$:

$$\min_{W} P(W) = \frac{\lambda}{2}\|W\|^{2} + l\big(W;(x_{i_t}, y_{i_t})\big). \qquad (4)$$

We consider the sub-gradient of the above approximate objective, given by

$$\nabla_t = \lambda W_t - \sigma_t\, y_{i_t} x_{i_t}, \qquad (5)$$

where

$$\sigma_t = \begin{cases} 1, & \text{if } y_{i_t}\langle W_t, x_{i_t}\rangle < 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$

Then we update the weight vector $W$ using the formula

$$W_{t+1} = W_t - \eta_t \nabla_t, \qquad (7)$$

where $\eta_t > 0$ is a learning rate. We can rewrite the update formula as

$$W_{t+1} = (1 - \eta_t \lambda)W_t + \eta_t \sigma_t\, y_{i_t} x_{i_t}. \qquad (8)$$

After a predetermined number $T$ of iterations, we obtain the SVM weight vector $W_{T+1}$. The pseudo-code of linear kernel SGD is given below.

Algorithm 1: linear kernel SGD
1. Input: $S$, $\lambda$, $T$
2. Initialize: $W_1 = 0$
3. For $t = 1, \dots, T$
4.   Choose $i_t \in \{1, \dots, |S|\}$ uniformly at random
5.   $\eta_t = \frac{1}{\lambda t}$
6.   $W_{t+1} = (1 - \eta_t\lambda)W_t$
7.   If $y_{i_t}\langle W_t, x_{i_t}\rangle < 1$
8.     $W_{t+1} = W_{t+1} + \eta_t y_{i_t} x_{i_t}$
9. Output: $W_{T+1}$
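To make Algorithm 1 concrete, the following is a minimal Python sketch of the linear-kernel SGD update; the function name, the NumPy representation of the data, and the random generator are illustrative choices rather than part of the original pseudo-code.

```python
import numpy as np

def linear_sgd_svm(X, y, lam, T, rng=None):
    """Algorithm 1: SGD for a linear-kernel SVM (Pegasos-style learning rate).

    X : (m, n) array of training instances, y : (m,) array of labels in {-1, +1},
    lam : regularization parameter lambda, T : number of iterations.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = X.shape
    w = np.zeros(n)                          # W_1 = 0
    for t in range(1, T + 1):
        i = rng.integers(m)                  # choose i_t uniformly at random
        eta = 1.0 / (lam * t)                # learning rate eta_t = 1 / (lambda * t)
        margin = y[i] * np.dot(w, X[i])      # margin uses W_t, before the update
        w = (1.0 - eta * lam) * w            # shrinking step of Eq. (8)
        if margin < 1:                       # hinge loss is active (sigma_t = 1)
            w += eta * y[i] * X[i]
    return w                                 # W_{T+1}; predict with sign(w . x)
```

With $\eta_t = 1/(\lambda t)$ this matches the Pegasos-style update used as the baseline in the experiments.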

    2. SGD for non-linear kernel

The Representer Theorem [10] implies that the optimal solution of Eq. (1) can be expressed as a linear combination of the training instances. Therefore SGD for SVM can be used to solve non-linear problems with Mercer kernels rather than with direct access to the feature vectors.

In this section we consider predictors that are linear functions of some implicit mapping $\phi(x)$ of the instances. The minimization problem becomes

$$\min_{W} P(W) = \frac{\lambda}{2}\|W\|^{2} + \frac{1}{m}\sum_{(x,y)\in S} l\big(W;(\phi(x),y)\big), \qquad (9)$$

where the loss function is

$$l\big(W;(\phi(x),y)\big) = \max\{0,\ 1 - y\langle W, \phi(x)\rangle\}. \qquad (10)$$

We then replace the objective in Eq. (9) with an approximation based on the training example $(x_{i_t}, y_{i_t})$, just as before:

$$\min_{W} P(W) = \frac{\lambda}{2}\|W\|^{2} + l\big(W;(\phi(x_{i_t}), y_{i_t})\big). \qquad (11)$$

Rather than explicitly calculating $W$ through $\phi(x)$, the algorithm stores training examples to implicitly represent $W$ and uses the kernel function $K(x, x') = \langle \phi(x), \phi(x')\rangle$ when calculating the prediction $W^{T}\phi(x)$.

The process is the same as in the linear kernel case. Set $W_1$ to the zero vector initially. With $\eta_t = \frac{1}{\lambda t}$, according to Eq. (8) we get

$$W_2 = \big(1 - \tfrac{1}{1}\big)W_1 + \tfrac{1}{\lambda}\sigma_1 y_1 \phi(x_1) = \tfrac{1}{\lambda}\sigma_1 y_1 \phi(x_1),$$

$$W_3 = \big(1 - \tfrac{1}{2}\big)W_2 + \tfrac{1}{2\lambda}\sigma_2 y_2 \phi(x_2) = \tfrac{1}{2\lambda}\big(\sigma_1 y_1 \phi(x_1) + \sigma_2 y_2 \phi(x_2)\big),$$

$$W_4 = \big(1 - \tfrac{1}{3}\big)W_3 + \tfrac{1}{3\lambda}\sigma_3 y_3 \phi(x_3) = \tfrac{1}{3\lambda}\big(\sigma_1 y_1 \phi(x_1) + \sigma_2 y_2 \phi(x_2) + \sigma_3 y_3 \phi(x_3)\big),$$

and so on. Apparently we can induce the representation of $W_{t+1}$:

$$W_{t+1} = \frac{1}{\lambda t}\sum_{j=1}^{t} \sigma_j y_j \phi(x_j). \qquad (12)$$

Note that the instance $x_j$, $j \in \{1, \dots, t\}$, is the one chosen uniformly at random at the $j$-th iteration. We can then write the predictor with a Mercer kernel function:

$$f(x) = W_{t+1}^{T}\phi(x) = \frac{1}{\lambda t}\sum_{j=1}^{t} \sigma_j y_j K(x_j, x). \qquad (13)$$

It is obvious from Eq. (12) that a training example can be ignored if its corresponding coefficient is zero, because it contributes nothing to the predictor. Let $W$ be a list in which each element records two pieces of information: a training example whose coefficient is not zero, and the number of times it has been selected so far (denoted $\alpha$). The pseudo-code of non-linear kernel SGD is given below.

Algorithm 2: non-linear kernel SGD
1. Input: $S$, $\lambda$, $T$
2. Initialize: $W_1$ = empty list
3. For $t = 1, \dots, T$
4.   Choose $i_t \in \{1, \dots, |S|\}$ uniformly at random
5.   If $y_{i_t}\,\frac{1}{\lambda t}\sum_{j} \alpha_j y_j K(x_j, x_{i_t}) < 1$
6.     If $x_{i_t}$ is already in $W_t$, increment its corresponding $\alpha$ by 1
7.     Else put $x_{i_t}$ into $W_t$ and set its corresponding $\alpha$ to 1
8. Output: $W_{T+1}$
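As a companion to Algorithm 2, here is a hedged Python sketch of the kernelized variant, where $W$ is kept implicitly as a dictionary mapping the index of each selected example to its selection count (the list described above); the Gaussian kernel helper and all names are illustrative choices.

```python
import numpy as np

def gaussian_kernel(x, z, gamma):
    """Mercer kernel K(x, z) = exp(-gamma * ||x - z||^2), as used in Section IV."""
    d = x - z
    return np.exp(-gamma * np.dot(d, d))

def kernel_sgd_svm(X, y, lam, T, kernel, rng=None):
    """Algorithm 2: SGD for a non-linear kernel SVM.

    W_t is represented implicitly: counts[j] is the number of times example j
    was selected while its hinge loss was active (the alpha_j of Eq. (12)-(13)).
    """
    rng = np.random.default_rng() if rng is None else rng
    m = X.shape[0]
    counts = {}
    for t in range(1, T + 1):
        i = rng.integers(m)                          # choose i_t uniformly at random
        # decision value of Eq. (13): (1 / (lambda * t)) * sum_j alpha_j y_j K(x_j, x_i)
        s = sum(a * y[j] * kernel(X[j], X[i]) for j, a in counts.items())
        if y[i] * s / (lam * t) < 1:                 # hinge loss active
            counts[i] = counts.get(i, 0) + 1         # add x_i to W or bump its count
    return counts

def kernel_predict(x, X, y, counts, lam, T, kernel):
    """Predictor of Eq. (13) evaluated at a new instance x."""
    s = sum(a * y[j] * kernel(X[j], x) for j, a in counts.items())
    return 1 if s / (lam * T) >= 0 else -1
```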

  3. A MAPREDUCE-BASED SVM ENSEMBLE WITH STOCHASTIC GRADIENT DESCENT

Since the run-time of the SGD algorithm does not depend directly on the size of the training set, it is especially suited for learning from large datasets. However, traditional algorithms run into difficulty when the training set file is too big to load into the memory of a single PC.

We propose a MapReduce-based SVM ensemble using SGD (MR-SGD). In MR-SGD, we utilize the Hadoop Distributed File System and the MapReduce programming model to handle the big data problem and to achieve a faster convergence rate than a single SVM.

    1. MapReduce Model

Here we introduce Apache Hadoop [11], which has two core components: the Hadoop Distributed File System (HDFS) [12] and the MapReduce parallel computation programming model [13].

HDFS is used to store large data. It splits a data file into multiple chunks, and each chunk is stored on a different data node. MapReduce is a parallel processing programming model that can handle the large data sets stored in HDFS. The basic function of the MapReduce model is to iterate over the input, compute key/value pairs from each part of the input, group all intermediate values by key, then iterate over the resulting groups and finally reduce each group. The model processes data in parallel. The Map task is an initial transformation step, in which individual input records are processed in parallel. The system shuffles and sorts the map outputs and transfers them to the reduce tasks. The Reduce task is a summarization step, in which all associated records are processed together by a single entity.
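The map, shuffle/sort, and reduce steps described above can be illustrated with a small local simulation in Python; this is only a sketch of the data flow, not the Hadoop API, and the record format is made up for the example.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Local simulation of the MapReduce flow: map each record to key/value
    pairs, group the values by key (shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for record in records:                       # Map: process records one by one
        for key, value in map_fn(record):
            grouped[key].append(value)           # shuffle/sort: group values by key
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}   # Reduce per key

# Usage: count how many instances carry each label in a toy training set
records = [("+1", [0.2, 0.7]), ("-1", [0.9, 0.1]), ("+1", [0.5, 0.5])]
label_counts = run_mapreduce(records,
                             map_fn=lambda rec: [(rec[0], 1)],
                             reduce_fn=lambda key, values: sum(values))
print(label_counts)   # {'+1': 2, '-1': 1}
```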

    2. MR-SGD

There are two properties of the SGD algorithm:

1. The run-time scales linearly with the number of iterations and does not depend on the size of the training set;
2. It proceeds by iteratively choosing a labeled example at random from the training set. If the number of iterations is less than the training set size, the instances chosen to train the classifier form only a subset of the training set.

Given the properties of SGD described above, we can utilize HDFS to store the training set and use the MapReduce programming model to select multiple subsets of the training set and to train an SVM on each subset in parallel on a cluster of computers. Finally, we aggregate all the trained SVMs using a voting mechanism to predict a testing instance. Fig. 1 shows the process of MR-SGD.

Fig. 1 Workflow of MR-SGD: the training set is sampled into subsets 1 to k, each subset trains one SVM with SGD, and the k SVMs together form the ensemble predictor.

The implementation details of MR-SGD are described below:

1. First, we draw k subsets from the training set, where k should be odd to support the voting mechanism. According to machine learning theory, more training examples are not always better. The subset size required to train an effective SVM ensemble differs from one training set to another.

Assume we have m training examples and want to randomly choose d examples as a subset. The Map task processes each instance in the training set: it loops k times to decide whether the instance should be put into the corresponding subset. In the i-th (i = 1, …, k) loop, it generates a random integer from 1 to m; if the random integer is less than d, the instance is put into the i-th subset to train an SVM in the Reduce task. A sketch of this sampling, and of the training and voting steps that follow, is given after this list.

2. The Reduce tasks then use the subsets chosen by the Map tasks to train SVMs with the SGD algorithm described in Section II. There are k SVMs, as set in the first step.

3. Finally, we aggregate the k SVMs using the voting mechanism to decide the labels of the test samples.
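A minimal local simulation of these three steps is sketched below, assuming the linear_sgd_svm function from the sketch in Section II is available; the subset-membership test mirrors the random-integer rule of step 1, and the majority vote implements step 3. All names are illustrative, and a real deployment would run the sampling in Hadoop Map tasks and the training in Reduce tasks.

```python
import numpy as np

def mr_sgd_train(X, y, k, d, lam, T, rng=None):
    """Steps 1-2 of MR-SGD, simulated locally.

    Map side: each of the m instances is tested k times; if a random integer
    drawn from 1..m is less than d, the instance joins the i-th subset.
    Reduce side: each subset trains one linear SVM with SGD (Algorithm 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    m = X.shape[0]
    subsets = [[] for _ in range(k)]                 # k should be odd for voting
    for idx in range(m):                             # Map task: decide memberships
        for i in range(k):
            if rng.integers(1, m + 1) < d:
                subsets[i].append(idx)
    models = []
    for s in subsets:                                # Reduce task: one SVM per subset
        s = np.asarray(s, dtype=int)
        models.append(linear_sgd_svm(X[s], y[s], lam, T, rng))
    return models

def mr_sgd_predict(models, x):
    """Step 3: majority vote of the k SVMs on a test instance x."""
    votes = sum(1 if np.dot(w, x) >= 0 else -1 for w in models)
    return 1 if votes >= 0 else -1
```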

  4. EXPERIMENTS

In this section we present experimental results that demonstrate the merits of our algorithm. The basic SGD algorithm used for comparison is Pegasos [4]. To evaluate the classification accuracy and convergence rate of Pegasos and MR-SGD, several benchmark datasets are used to illustrate both the linear kernel and the non-linear kernel situations. The MR-SGD experiments were carried out on a cluster of 6 machines. Each machine has four E5-2609 2.50 GHz processors and 4 GB of RAM. CentOS 6.4 was used as the operating system, and the version of Apache Hadoop is 2.4.1.

The experiments in this section were performed on four datasets downloaded from the LIBSVM web page [15]. The Usps and Mnist datasets were used for the task of classifying the digits 0, 1, 2, 3, 4 versus the rest of the classes. The labels of the original Letter dataset represent the 26 letters of the alphabet; we set the first 13 letters as the positive class and the rest as the negative class. We use both the linear kernel and the Gaussian kernel $K(x, y) = e^{-\gamma\|x - y\|^{2}}$ in our experiments. The parameters include the regularization parameter $\lambda_1$ for the linear kernel, $\lambda_2$ for the Gaussian kernel, and the parameter $\gamma$ controlling the width of the Gaussian kernel. The dataset characteristics and the parameters are shown in Table I and Table II, respectively.

TABLE I. DATASETS

Dataset   #Training   #Testing   #Features
Usps      7,291       2,007      256
Mnist     60,000      10,000     780
Letter    15,000      5,000      16
Adult     32,561      16,281     123

TABLE II. PARAMETERS

Parameter   Usps      Mnist     Letter    Adult
λ1          8E-3      2E-4      6E-4      6E-3
λ2          1.35E-5   1.67E-5   1.55E-6   3.07E-5
γ           0.1       0.2       40.0      0.05

For the experiments on the Mnist dataset, the size of the subset used to train each SVM is set to 30,000. However, on the Usps, Letter and Adult datasets, we used all the examples in the original training set, because these training sets are not large and it is better to use all of the training examples to complete the training process. The results are shown below.

    Fig. 2 Testing accuracy on Usps dataset with linear kernel

    Fig. 3 Testing accuracy on Mnist dataset with linear kernel

    Fig. 4 Testing accuracy on Letter dataset with linear kernel

    Fig. 5 Testing accuracy on Adult dataset with linear kernel

Figures 2 to 5 show the convergence of both Pegasos and MR-SGD as the number of iterations grows: MR-SGD with the linear kernel has a faster convergence rate than Pegasos on all four datasets, and its classification accuracy is slightly higher than that of Pegasos.

    Fig. 6 Testing accuracy on Usps dataset with Gaussian kernel

    Fig. 7 Testing accuracy on Mnist dataset with Gaussian kernel

    Fig. 8 Testing accuracy on Letter dataset with Gaussian kernel

    Fig. 9 Testing accuracy on Adult dataset with Gaussian kernel

Figures 6 to 9 show that MR-SGD with the non-linear kernel has a faster convergence rate than Pegasos on all four datasets, and again its classification accuracy is slightly higher than that of Pegasos. Comparing Figures 2 to 9, the classification accuracy of MR-SGD with the Gaussian kernel is about 11 percent higher than that of MR-SGD with the linear kernel on the Usps and Mnist datasets, and about 24 percent higher on the Letter dataset. However, on the Adult dataset the accuracy of MR-SGD with the Gaussian kernel is similar to that of MR-SGD with the linear kernel.

  5. CONCLUSIONS

We proposed a MapReduce-based SVM ensemble with the SGD algorithm, which we call MR-SGD. The idea of the proposed algorithm is to improve the convergence rate and classification accuracy of the basic SGD algorithm for SVM. The experiments show that MR-SGD achieves a faster convergence rate and higher classification accuracy in almost all cases of binary classification. One direction for future work is to find a better aggregation strategy for the SVM ensemble; another is to explore different sampling methods to further enhance the performance of MR-SGD.

ACKNOWLEDGMENT

This research is supported by the Natural Science Foundation of Hebei Province, No. F2015201185.

REFERENCES

  1. C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning, vol 20, 1995, pp. 273-297.

  2. Tong Zhang, Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms, Proceedings of the International Conference on Machine Learning (ICML), 2004, pp. 919-926.

  3. Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson, Online Learning with Kernels, IEEE Transactions on Signal Processing, vol 52, 2004, pp. 2165-2176.

  4. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, Pegasos: Primal Estimated Sub-gradient Solver for SVM, Mathematical Programming, vol 127, 2011, pp. 3-30.

  5. Zhuang Wang, Koby Crammer, Slobodan Vucetic, Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large- Scale SVM Training, Journal of Machine Learning Research, vol 13, 2012, pp. 3103-3131.

  6. Krzysztof Sopyla, Pawel Drozda, Stochastic Gradient Descent with Barzilai-Borwein update step for SVM, Information Sciences, vol 316, 2015, pp. 218-233.

  7. Nasullah Khalid Alham, Maozhen Li, Yang Liu, Man Qi, A MapReduce-based distributed SVM ensemble for scalable image classification and annotation, Computers and Mathematics with Applications, vol 66, 2013, pp. 1920-1934.

  8. Ferhat Ozgur Catak, Mehmet Erdal Balaban, A MapReduce-based distributed SVM algorithm for binary classification, Turkish Journal of Electrical Engineering & Computer Sciences, 2013, pp. 863-873.

  9. JC Platt, Fast Training of Support Vector Machines using Sequential Minimal Optimization, Advances in Kernel Methods, 1999, pp. 185- 208.

  10. G. Kimeldorf, G. Wahba, Some results on Tchebycheffian spline functions, Journal of Mathematical Analysis & Applications, vol 33, 1971, pp. 82-95.

  11. Apache Hadoop. http://hadoop.apache.org/, 2016.

  12. Pooja S. Honnutagi, The Hadoop Distributed File System, International Journal of Computer Science & Information Technologies, vol 5, 2014, pp. 6238-6243.

  13. J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM, vol 51, 2008, pp. 107-113.

  14. S. Sonnenburg, V. Franc, E.Y. Tov, M. Sebag, PASCAL large scale learning challenge, 2008.

  15. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2016.
