Flower Classification using Supervised Learning

Asmita Shukla

Computer Science and Engineering

Babu Banarasi Das National Institute of Technology and Management Lucknow, India

Ankita Agarwal

Computer Science and Engineering

Babu Banarasi Das National Institute of Technology and Management Lucknow, India

Hemlata Pant

Computer Science and Engineering

Babu Banarasi Das National Institute of Technology and Management Lucknow, India

Priyanka Mishra

Computer Science and Engineering

Babu Banarasi Das National Institute of Technology and Management Lucknow, India

Abstract: The biodiversity of the earth is very rich. About 360000 species create a healthy biome within the earth's environment. Some of them are nearly identical in physical appearance, such as shape, size and color, which makes individual species difficult to recognize. The Iris flower is one example: it has three subspecies, Setosa, Versicolor and Virginica. We use the Iris dataset because it is widely available. It contains 3 classes of 50 instances each. Machine Learning models trained on the Iris dataset can identify the subclass of an Iris flower. This paper focuses on how Machine Learning algorithms can automatically recognize the class of a flower with a high degree of accuracy rather than approximately. The approach is implemented in three phases: segmentation, feature extraction and classification. The classifiers used are Neural Network, Logistic Regression, Support Vector Machine and k-Nearest Neighbors.

Keywords: Iris dataset, k-nearest neighbors, Logistic Regression, Neural Network, Scikit-Learn, Support Vector Machine.

  1. INTRODUCTION

    Machine Learning refers to programs that learn from past data to perform better with experience. It provides tools and technology that we can use to answer questions with our data. Machine Learning works with two kinds of values: discrete and continuous. Its applications cover a wide area, including weather forecasting, spam detection, biometric attendance, computer vision, pattern recognition, sentiment analysis, detection of diseases in the human body and many more. There are three types of learning methods: supervised, unsupervised and reinforcement learning. In supervised learning, each instance of the training data set is composed of different input attributes and an expected output. Classification is a sub-part of supervised learning in which the computer program learns from the input given to it and uses this learning to classify new observations. There are various classification techniques, such as Decision Trees, Bayes Classifier, Nearest Neighbor, Support Vector Machine, Neural Networks and many more. Examples of classification tasks include classifying credit card transactions as legitimate or fraudulent, classifying secondary structures of proteins as alpha-helix, beta-sheet or random coil, and categorizing news stories as finance, weather, entertainment or sports.

    Python is a programming language that Guido van Rossum began developing in 1989 and first released in 1991. Python is an interpreted, object-oriented, dynamically typed, high-level programming language. Its style is simple, elegant and easy to use. Python comes with powerful libraries. Moreover, Python can easily be combined with other programming languages, such as C, C++ or Java.

    Scikit-learn is a machine learning toolkit built on Python's SciPy library. It was originally called "scikits.learn". It includes dataset loading, data manipulation and pre-processing, pipelines and metrics, along with a large collection of machine learning algorithms.

  2. REVIEW CRITERIA

    The Iris data set is available from the University of California, Irvine (UCI) Machine Learning Repository. The data set was first introduced by Edgar Anderson in 1935. Ronald Fisher used it in 1936 in his work on classification methods, which made it widely known, so it is also called Fisher's Iris data set. The data are multivariate and real-valued.

    1. David W. Corne and Ziauddin Ursani [1] proposed an evolutionary algorithm for training a nonlinear discriminant classifier. They noted that the method is not appropriate for learning tasks that depend on any single feature value on its own. They therefore tested it on two data sets, Iris Flower and Balance Scale, where class membership can only be decided collectively by the individual features of the flower.

    2. Detlef Nauck and Rudolf Kruse [2] proposed an approach that classifies data using fuzzy Neural Networks. They used the backpropagation algorithm to define another class of fuzzy perceptron. They concluded that increasing the number of hidden layers increases the number of training cycles needed and the number of incorrect results. Hence a better result can also be obtained using 3 hidden layers.

    3. To overcome the problems of data depth, large numbers of parameters, long training time and slow convergence of Neural Networks, Jing FENG, Zhiwen WANG, Min ZHA and Xinliang CAO [3] considered two other techniques for flower recognition: Transfer Learning and the Adam deep learning optimization algorithm. Transfer Learning was based on features in isomorphic spaces. They concluded that feeding the flower pictures into model training in batches speeds up the updating of parameter values and yields the best parameter values.

    4. Rong-Guo Huang, Sang-Hyeon Jin, Jung-Hyun Kim and Kwang-Seog Hong [4] focus on flower recognition using Difference Image Entropy (DIE), which is based on feature extraction. According to their research, the experimental results give an average recognition rate of 95%. The DIE-based approach takes the original image of a flower as input and applies pre-processing and DIE computation to produce the recognition result.

  3. PROPOSED ALGORITHM

    Segmentation is the process used to remove the irrelevant background and keep only the foreground object, that is, the flower. The main objective is to simplify the representation of the flower into something more meaningful and easier to analyze.

    In Feature Extraction we extract characteristics or information from the flower in the form of values such as floats, integers or binary values. The primary features used to quantify plants or flowers are color, shape and texture. We do not rely on a single feature vector, because the subspecies share many attributes and a single descriptor produces less effective results. We therefore describe the image by merging different feature descriptors, which identify it more effectively. The first five rows of the Iris dataset are shown in Table 1.

    Table 1. The first five rows of the Iris dataset [5]

    After extracting features and labels from the Iris dataset, we need to train the system. With the help of scikit-learn we create machine learning models that classify Iris flowers into their subspecies. Table 2 gives the descriptive statistics of the Iris dataset.

    Table 2. Descriptive statistics of the Iris dataset
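    The contents of the two tables are not reproduced in this text. As a minimal sketch, assuming the copy of the Iris data bundled with scikit-learn matches the UCI data set cited in [5], comparable tables can be regenerated as follows. The variable names (df, first_rows, summary, class_counts) are illustrative only.

```python
from sklearn.datasets import load_iris

# Load the Iris data bundled with scikit-learn as a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Replace the integer target column with the species names for readability.
df["species"] = df["target"].map(lambda i: iris.target_names[i])
df = df.drop(columns="target")

first_rows = df.head(5)                       # counterpart of Table 1
summary = df.describe()                       # counterpart of Table 2
class_counts = df["species"].value_counts()   # instances per class

print(first_rows)
print(summary)
print(class_counts)
```

    The summary covers only the four numeric feature columns, and the class counts confirm the 3 classes of 50 instances each described in the abstract.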

    Neural Network: The Iris species data have few features, so a multilayer perceptron is used as the neural network architecture to prevent overfitting. The multilayer perceptron model has one scaling layer, two perceptron layers and one probabilistic layer. The Iris dataset has four attributes, so the input layer consists of four variables: sepal_length, sepal_width, petal_length and petal_width. The graphs below show the relationships between SepalLength and SepalWidth (Fig1), PetalLength and PetalWidth (Fig2), SepalLength and PetalLength (Fig3), and SepalWidth and PetalWidth (Fig4).

    Fig1. Relationship graph of sepal length and width

    Fig2. Relationship graph of petal length and width

    Fig3. Relationship graph of sepal and petal length

    Fig4. Relationship graph of sepal and petal width

    The scaling layer normalizes the input values. We use the mean and standard deviation scaling method to standardize the sepal and petal lengths and widths. The first perceptron layer has 4 inputs and 3 neurons, and the second perceptron layer has 3 inputs and 3 neurons. Both perceptron layers use the logistic activation function. The probabilistic layer allows the outputs to be interpreted as probabilities by normalizing them to the range 0-1. The neural network produces three outputs, one for each Iris subspecies: Setosa, Versicolor and Virginica.

    Fig5. Neural Network in the first perceptron layer [10]
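    The architecture described above can be approximated in scikit-learn. This is a sketch under stated assumptions, not the configuration used for the reported result: StandardScaler stands in for the scaling layer, and MLPClassifier with one hidden layer of 3 logistic neurons stands in for the perceptron layers. MLPClassifier applies softmax directly at its 3-neuron output layer, so the second perceptron layer and the probabilistic layer are merged here. The split ratio, solver and random seed are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

mlp = make_pipeline(
    StandardScaler(),                     # mean / standard deviation scaling
    MLPClassifier(
        hidden_layer_sizes=(3,),          # 4 inputs -> 3 hidden neurons
        activation="logistic",            # logistic activation in the hidden layer
        solver="lbfgs",                   # assumption: suits very small data sets
        max_iter=2000,
        random_state=0,
    ),
)
mlp.fit(X_train, y_train)

# One probability per subspecies; each row sums to 1.
proba = mlp.predict_proba(X_test)
test_accuracy = mlp.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```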

    Logistic Regression falls under the category of classification algorithms in machine learning. It provides a baseline for any binary classification problem and is also used for multinomial classification, in which there are more than two classes. A link function transforms the linear model into a predictor of class probabilities. Regularization [12] is used to control overfitting and underfitting of the proposed model during training.

    Fig6. The univariate plots are represented in the form of histograms
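    A minimal sketch of multinomial logistic regression with regularization in scikit-learn follows. The values of C (the inverse regularization strength), the split ratio and the random seed are assumptions chosen for illustration; the text does not state the settings behind the reported accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Smaller C means stronger L2 regularization (risk of underfitting);
# larger C means weaker regularization (risk of overfitting).
scores = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_train, y_train)
    scores[C] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for C, (train_acc, test_acc) in scores.items():
    print(f"C={C}: train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

    Comparing training and test accuracy across values of C is one simple way to see whether a model is underfitting or overfitting.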

    Support Vector Machine (SVM): With SVM, dimensionality reduction techniques such as Principal Component Analysis (PCA) and scalers are used to classify the dataset efficiently. The first step towards implementing SVM is data exploration, which guides the initial configuration of hyperparameters such as the degree of the polynomial or the type of kernel. Here we use two variables x and y, where x is the feature matrix and y is the target vector. Dimensionality reduction reduces the number of features in the dataset, which in turn reduces the computation. The Iris dataset has four dimensions; with dimensionality reduction it is projected into a 3-dimensional space, where the number of features is 3. We split the transformed data into two parts: 80% for the training set and 20% for the test set.

    Fig7. The number of features in the new subspace is 3
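    The SVM workflow above can be sketched with a scikit-learn pipeline. Two points are assumptions rather than the authors' exact procedure: the data are split before scaling and PCA are fitted (the text transforms first and then splits), which keeps test data out of the fitted transforms, and the parameter grid and random seed are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # feature matrix and target vector

# 80% training set, 20% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),      # scaler
    ("pca", PCA(n_components=3)),     # project 4 features onto 3 components
    ("svc", SVC(kernel="linear")),    # linear SVM
])
pipe.fit(X_train, y_train)
n_features = pipe.named_steps["pca"].n_components_
print("features in the new subspace:", n_features)
print("linear SVM test accuracy:", pipe.score(X_test, y_test))

# Grid search over the kernel type and the C parameter.
grid = GridSearchCV(
    pipe,
    {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10, 100]},
    cv=5,
)
grid.fit(X_train, y_train)
print("best parameters:", grid.best_params_)
print("best estimator test accuracy:", grid.score(X_test, y_test))
```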

    k-Nearest Neighbors (KNN): KNN is used for both classification and regression. In KNN, we have a labeled dataset consisting of training observations (X, Y) and would like to establish the relationship between X and Y.

    When KNN is given an unseen observation, similarity is determined by a distance metric between two data points. The distance can be measured by the following methods [16]:

    where n is the number of dimensions, x is a data point from the dataset and y is the new data point to be predicted.
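    The distance formulas themselves are not shown in this text. Assuming they are the distance functions usually used with KNN (Euclidean, Manhattan and Minkowski), they can be written with the symbols defined above, where the index i and the Minkowski order q are notation introduced here:

```latex
\begin{aligned}
d_{\text{Euclidean}}(x, y) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \\
d_{\text{Manhattan}}(x, y) &= \sum_{i=1}^{n} \lvert x_i - y_i \rvert \\
d_{\text{Minkowski}}(x, y) &= \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{q} \right)^{1/q}
\end{aligned}
```

    Setting q = 1 in the Minkowski distance gives the Manhattan distance, and q = 2 gives the Euclidean distance.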

    The species names are string labels, so LabelEncoder is used to convert them into numbers: Iris Setosa, Iris Versicolor and Iris Virginica are represented by 0, 1 and 2 respectively. The Iris dataset is multivariate, so data visualization is done with several plotting methods such as Parallel Coordinates, Andrews Curves, Pairplot and Boxplot. Implementing KNN with scikit-learn involves three steps: making predictions, evaluating predictions and using cross-validation for parameter tuning.

    Fig8. For plotting multivariate data, Parallel Coordinates technique is used

    Fig9. Andrews Curves are used for visualizing the multivariate data

    Fig10. Pair Plot is used to visualize the distribution of the relationship between multiple variables separately

    Fig11. The representation of data in the form of Box Plots
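    The KNN steps described above (label encoding, making predictions, evaluating predictions and cross-validation over the number of neighbors) can be sketched as follows. The split ratio, the initial k = 3, the range of odd k values, the 10 folds and the random seed are assumptions for illustration, so the best k found here need not match the value reported in the results.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
X = iris.data
species = iris.target_names[iris.target]   # string labels such as "setosa"

# Convert the string labels to the integers 0, 1 and 2.
encoder = LabelEncoder()
y = encoder.fit_transform(species)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 1: making predictions.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Step 2: evaluating predictions.
accuracy = (y_pred == y_test).mean()
print("test accuracy:", accuracy)

# Step 3: cross-validation to tune the number of neighbors.
k_values = list(range(1, 31, 2))
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=10).mean()
    for k in k_values
]
best_k = k_values[cv_scores.index(max(cv_scores))]
print("best k by cross-validation:", best_k)
```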

  4. RESULT

  1. An accuracy of 96.667% is achieved using the Neural Network. The neural network learns from the existing features and, with the help of its weights and biases, predicts more accurate outcomes.

  2. The accuracy of logistic regression computed with scikit-learn is 96.6667%.

    Using sklearn model the accuracy is

  3. The dataset has 4 dimensions; PCA compresses the data and provides the number of features in the new subspace. Using linear SVM, the accuracy is 97% on the training set and 100% on the test set, while using non-linear SVM the accuracy is 99.17% on the training set and 100% on the test set. After tuning the C parameter, the accuracy of the best estimator is 96% on the training set and 100% on the test set.

    1. Using Linear SVC

    2. Using Grid Search

    3. Using Non-Linear SVC

  4. Using KNN classification, the accuracy of our model is 96.67%, and the search for the best k gives an optimal number of neighbors of 9 (Fig12).

Fig12. The above diagram represents the optimal number of neighbours which is 9.

Table3. The accuracy of Classification model

Table4. The accuracy of model using SVM

REFERENCES

  1. Ziauddin Ursani and David W. Corne, A Novel Nonlinear Discriminant Classifier Trained by an Evolutionary Algorithm. DOI: 10.1145/3195106.3195132

  2. Detlef Nauck and Rudolf Kruse, NEFCLASS - A Neuro-Fuzzy Approach for the Classification of Data. DOI: 10.1145/315891.316068

  3. Jing FENG, Zhiwen WANG, Min ZHA and Xinliang CAO, Flower Recognition Based on Transfer Learning and Adam Deep Learning Optimization Algorithm. DOI: 10.1145/3366194.3366301

  4. Roung Guo Huang, Sang-Hyeon Jin, Jung Hyun Kim and Kwang-Seck Hong, Flower Image Recognition Using Difference Image Entropy. DOI: 10.1145/1821748.1821868

  5. UCI Machine Learning Repository- IRIS DATASET

  6. Introduction of Machine Learning and scikit-learn, Available at- https://youtu.be/GwIo3gDZCVQ

  7. Introduction to Python Programming Language, Available at https://www.ritchieng.com/machine-learning-iris-dataset/ https://www.geeksforgeeks.org/python-language-advantages-applications/

    https://medium.com/gft-engineering/start-to-learn-machine-learning-with-the-iris-flower-classification-challenge- https://www.theseus.fi/bitstream/handle/10024/64785/yang_yu.pdf?sequence=1&isAllowed=y

  8. Image Classification using Python and Scikit-learn https://gogul.dev/software/image-classification-python

  9. Image Segmentation

    https://en.wikipedia.org/wiki/Image_segmentation

  10. Classification of iris flowers from sepal and petal dimensions using Neural Designer https://www.neuraldesigner.com/learning/examples/iris-flowers-classification

  11. Neural Network from Kaggle by Louisong available at https://www.kaggle.com/louisong97/neural-network-approach-to-iris-dataset

  12. Logistic Regression from pluralsight available at https://www.pluralsight.com/guides/designing-a-machine-learning-model

  13. Support Vector Machine from Kaggle by Moghazy available at https://www.kaggle.com/moghazy/classifying-the-iris-dataset-using-svms

  14. k-Nearest Neighbor from Kaggle by skalskip available at https://www.kaggle.com/skalskip/iris-data-visualization-and-knn-classification

  15. Exploratory Data Analysis of IRIS Data Set Using Python available at https://medium.com/@avulurivenkatasaireddy/exploratory-data-analysis-of-iris-data-set-using-python-823e54110d2d

  16. Distance functions:

https://www.saedsayad.com/k_nearest_neighbors.htm
