Role of Machine Learning in Image Classification

DOI : 10.17577/IJERTCONV10IS09008


A. Chandrasekhara

Assistant Professor,

Department of Computer Science Engineering, Sree Vidyanikethan Engineering College, Andhra Pradesh, India

T. Monisha

Department of Computer Science Engineering, Sree Vidyanikethan Engineering College, Andhra Pradesh, India

M. Hafeeza Taj

Department of Computer Science Engineering, Sree Vidyanikethan Engineering College, Andhra Pradesh, India

Abstract:- This paper gives an overview of the role of machine learning in image classification and of image classification efficacy. We introduce metrics of image classification efficacy, adapted from medicine and pharmacology, to overcome the limitations of accuracy metrics. We include a baseline classification to derive the metrics of image classification efficacy and apply real-world and hypothetical examples to examine their usefulness. Finally, we detail the procedures of classification efficacy assessment for image classification.

Key Terms- Accuracy, classification algorithms, classification assessment, image classification, machine learning, remote sensing.


    Machine learning, specifically deep learning, has been deployed in virtually every field that involves image classification [1]-[5]. Deep learning has transformed the way we classify images at any scale. At the micro scale, biomedical imaging can benefit from deep learning for a better understanding of irregular human body activities and early diagnosis of severe diseases [43]-[45]; at the macro scale, Earth surface characterization [6]-[8] and solid Earth geoscience [9] can be strengthened by applying deep learning. The main advantage of deep learning is that a well-trained neural network facilitates automated image classification and can be applied to many different image types. It is essential to assess the accuracy of classification outputs when a deep-learning classification algorithm is applied to new problems [10], [11]. As deep-learning classification methods continue to diversify and advance, rigorous assessment of neural networks becomes increasingly vital. More than a dozen metrics have been devised for evaluating pattern recognition and computer vision [12], [13]. With or without modification, these metrics are extensively applied in image classification-related fields, from molecular imaging to Earth observation.

    The existing accuracy metrics can be divided into three types:

    Type I: Accuracy metrics are directly derived from error matrices (also known as confusion matrices). For positive-negative binary classification, these metrics include accuracy (or overall accuracy) at the map level and sensitivity, specificity, positive precision, and negative precision at the class level. Earth resource remote sensing often involves multiple classes and traditionally uses producer's accuracy (equivalent to sensitivity and specificity) and user's accuracy (equivalent to positive and negative precision) [14]. Although these accuracy metrics are interpretable, they are affected by the size distributions of classes, so their values are often less informative than expected.
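As a sketch of how Type I metrics fall out of an error matrix, the snippet below computes overall, producer's, and user's accuracies for a hypothetical three-class matrix (all counts are invented for illustration):

```python
import numpy as np

# Hypothetical 3-class error matrix: rows = reference classes,
# columns = classified classes (counts invented for illustration).
m = np.array([[50,  3,  2],
              [ 4, 30,  1],
              [ 2,  1,  7]])

overall_accuracy = np.trace(m) / m.sum()         # map-level accuracy
producers_accuracy = np.diag(m) / m.sum(axis=1)  # per class, over reference totals
users_accuracy = np.diag(m) / m.sum(axis=0)      # per class, over classification totals
```

Note how the overall accuracy (0.87) hides the much weaker producer's accuracy of the smallest class (7/10 = 0.70), which is exactly the class-imbalance effect described above.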

    Type II: These accuracy metrics, which are immediate derivatives of Type I metrics, typically include balanced accuracy (the arithmetic mean of sensitivity and specificity) and the F1 score (the harmonic mean of positive precision and sensitivity). Balanced accuracy may reduce class-imbalance effects but blurs accuracy interpretation, whereas the F1 score may be interpretable but is still affected by class imbalance [16], [17]. In machine learning applications, the F1 score has become increasingly popular. However, the mean of two accuracy values may prevent the lower value from signaling a potential flaw in classification; for example, a binary classification output with a value of 0 for any single accuracy metric is useless.
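The masking problem mentioned above can be made concrete with a degenerate binary classifier that labels everything negative; the counts below are hypothetical:

```python
# Degenerate classifier: every case is labeled negative (hypothetical counts).
tp, fn, fp, tn = 0, 10, 0, 90

sensitivity = tp / (tp + fn)                         # 0.0: all positives missed
specificity = tn / (tn + fp)                         # 1.0
balanced_accuracy = (sensitivity + specificity) / 2  # 0.5, hiding sensitivity = 0

# Positive precision is undefined when tp + fp = 0; treat it as 0 here.
positive_precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = (2 * positive_precision * sensitivity / (positive_precision + sensitivity)
      if (positive_precision + sensitivity) else 0.0)  # 0.0 exposes the flaw
```

Balanced accuracy reports a seemingly moderate 0.5 while the classifier is useless for the positive class; the F1 score of 0 makes the flaw visible, though F1 remains sensitive to class imbalance.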

    Type III: This type of metric is rooted in statistics and was later introduced to image classification to assess performance. Such metrics include the Matthews correlation coefficient (MCC) and Cohen's Kappa coefficient (Kappa) [18]. Because they were developed in different contexts, these metrics are not directly interpretable for image classification accuracy assessment despite their popularity. One common misinterpretation is that when the MCC or Kappa equals 0, the classification method is believed to be similar to random guessing. Remote sensing researchers suggest rejecting the use of Kappa for image classification accuracy assessment [19], [20]. Medical imaging researchers suggest that the MCC provides a more truthful and informative result than other metrics for binary classification assessment, based on a series of studies [17].
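For reference, the MCC and Kappa can be computed from the same four binary counts; the matrix below is hypothetical:

```python
import math

# Hypothetical binary error matrix with a roughly 9:1 class-size ratio.
tp, fn, fp, tn = 85, 5, 3, 7
n = tp + fn + fp + tn

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Cohen's Kappa: observed agreement versus chance agreement.
p_observed = (tp + tn) / n
p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
kappa = (p_observed - p_chance) / (1 - p_chance)
```

Here overall accuracy is 0.92, yet MCC and Kappa are both around 0.59 because they discount the agreement expected by chance; as the text notes, their values indicate correlation and agreement rather than accuracy in a strict sense.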

    Among the three types of accuracy metrics, Type I metrics are the most commonly employed in research. If a single accuracy value is reported in Earth remote sensing, this value is most likely the overall accuracy [15]. Overall accuracy rates can sometimes be misleading. Accuracy metrics are regularly utilized for accuracy assessment of image classification, although the values of some metrics, such as the MCC and Kappa, do not strictly indicate accuracies. As their names suggest, MCC rates indicate correlation levels, whereas Kappa indicates extent of agreement. When accuracy metric values are compared between two image classifications, the classification efficacy is examined. Consequently, the word efficacy sometimes appears as a verbal description of the effectiveness of image classification approaches in various fields [26]-[34]. Such use of efficacy makes sense only when the different classifications use the same classification Scheme and address images with the same area Extent and the same data-acquisition Time (SET). In the medical fields, efficacy is a common term, and its values are computed by comparing illness rates between sampled people with a treatment and sampled people without it. Following the same concept, we generalize the evaluation of image classification methods with efficacy, which is quantified by referring to a standard baseline classification as a control to mitigate class-imbalance effects. The resulting image classification efficacy provides an alternative measure for assessing image classification.



      An error matrix is a table that displays the number or percentage of cases correctly and incorrectly classified (Table 1). In practice, an error matrix is composed from random samples. The reference values (also known as ground truth) are assumed to be true and to represent the actual population.

      Table 1. General error matrix (also known as a confusion table).

      An error matrix resembles a contingency table in statistics. The accuracy for each individual class is computed by using either the reference total or the classification total.

      Binary classification is conducted in many fields, and the two classes are commonly referred to as positive for class 1 and negative for class 2 [12], [13]. In this case, researchers tend to use different terminologies:

      • The true positive rate, which is also referred to as sensitivity in pharmacology and as recall in machine learning, is the percentage of positive objects that are classified correctly within the reference total of class positive.

      • The true negative rate, which is also referred to as specificity in pharmacology, is the percentage of negative objects that are correctly classified within the reference total of class negative.

      • Positive precision is the number of correctly classified positive cases over the total number of positive cases given by the classifier.

      • Negative precision is the number of correctly classified negative cases over the total number of negative cases given by the classifier.
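The four bullet definitions above translate directly into code; the TP, FN, FP, and TN counts below are hypothetical:

```python
# Hypothetical binary counts: positive = class 1, negative = class 2.
tp, fn = 45, 5    # reference positives: correctly classified / missed
tn, fp = 40, 10   # reference negatives: correctly classified / false positives

true_positive_rate = tp / (tp + fn)   # sensitivity (pharmacology), recall (ML)
true_negative_rate = tn / (tn + fp)   # specificity (pharmacology)
positive_precision = tp / (tp + fp)   # correct positives / classifier's positives
negative_precision = tn / (tn + fn)   # correct negatives / classifier's negatives
```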


    In the medical field, the efficacy of a drug is defined by comparing the drug's effects on the treatment group to those on a baseline (placebo) group. Vaccine efficacy (VE) [35] is defined as

    VE = (ARU - ARV) / ARU,

    where ARU is the attack rate in the unvaccinated population and ARV is the attack rate in the vaccinated population. The rates ARU and ARV are usually determined with a double-blind randomized placebo-controlled trial with persons susceptible to disease.
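As a quick numerical sketch of the VE formula (the attack rates are hypothetical):

```python
# Hypothetical attack rates illustrating the VE formula.
aru = 0.10   # attack rate in the unvaccinated population (ARU)
arv = 0.02   # attack rate in the vaccinated population (ARV)

ve = (aru - arv) / aru   # fraction of baseline risk removed by the vaccine
```

With these rates, the vaccine removes 80% of the baseline risk (VE = 0.8).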

    This approach is readily transferable to quantifying the effectiveness of image classification methods: a vaccine is equivalent to a classification method; an attack rate is comparable to a classification error; the use of a vaccine corresponds to the application of a classification method; and a randomized placebo control is similar to a random classification used as a baseline in image classification.

    Fig. 1 Changes in classification accuracy with binary class proportion or size ratios when MICE = 0

    For binary classifications, we refer to the terms in pharmacology and machine learning to name the following class-specific image classification efficacies: sensitivity efficacy (SeE), specificity efficacy (SpE), positive precision efficacy (PpE), and negative precision efficacy (NpE). Each of these efficacies assesses classification from a different perspective, in a way similar to sensitivity, specificity, positive precision, and negative precision.
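A minimal sketch of one class-level efficacy, assuming (this is our reading of the VE analogy, not a formula quoted from the paper) that efficacy compares the classifier's error rate against that of a random baseline which assigns labels in proportion to class sizes:

```python
# Sketch of sensitivity efficacy (SeE) under an assumed random baseline.
# Assumption: a proportional random classifier labels a case positive with
# probability equal to the positive-class proportion, so its expected
# sensitivity equals that proportion.
p_positive = 0.10        # hypothetical positive-class proportion
sensitivity = 0.85       # hypothetical classifier sensitivity

baseline_error = 1 - p_positive     # expected miss rate of the random baseline
classifier_error = 1 - sensitivity  # the classifier's miss rate
see = (baseline_error - classifier_error) / baseline_error
```

Like VE, the efficacy is the fraction of baseline error removed by the method; a value of 0 means no better than the random baseline.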



    We use seven classifications of image data with class-size ratios near 9:1 to illustrate the unique usefulness of image classification efficacies (Tables 2 and 3). The first three cases show that the MCC and Kappa can be quite sensitive to a slight change in the classification result for a minor class.

    Table 2. Results of six image classifications with positive and negative classes.
    Table 3. Comparison of map error derivatives of seven error matrices.
    Because the minimum effective classification accuracy depends on class-size ratios, it is important to use image classification efficacy to evaluate the performance of image classification. For example, the minimum effective accuracies are 0.58 and 0.82 when the class-size ratios are 70:30 and 90:10, respectively. Therefore, an accuracy of 0.80 is quite good for a binary classification with a class-size ratio of 70:30 but fails for a binary classification with a class-size ratio of 90:10. The proportion of burned area is 0.37% within the global mapping extent, and the overall accuracy is 99.5% on average among six global burned-area products [21]-[23]. In this case, the average MICE value is only 0.29, indicating that global burned-area classifications are closer to a random classification than to a perfect classification.
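The two minimum effective accuracies quoted above are reproduced by the expected accuracy of a proportional random baseline; the closed form below is inferred from those worked examples rather than quoted from the text:

```python
def minimum_effective_accuracy(p):
    """Expected overall accuracy of a random binary classification that
    assigns labels in proportion to the class sizes p and 1 - p."""
    return p * p + (1 - p) * (1 - p)

# Reproduces the values cited above (within floating-point rounding).
a_70_30 = minimum_effective_accuracy(0.70)   # ~0.58 for a 70:30 ratio
a_90_10 = minimum_effective_accuracy(0.90)   # ~0.82 for a 90:10 ratio
```

Any overall accuracy below this baseline corresponds to a negative efficacy, i.e., a classification worse than random assignment.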


    It is not surprising that binary classification usually has greater overall accuracy than multiclass classification [24], [25], [36]. This phenomenon is the classification-scheme effect, which makes overall accuracy incomparable between two classifications that involve different numbers of classes [15]. With the same overall accuracy, the MICE values increase with the number of map classes, as Fig. 2 shows.

    Fig. 2 Responses of MICE (%) to the number of balanced classes with the same overall accuracy rates.

    Such an increase in the MICE with the number of map classes makes sense, as it reflects the notion that it is more difficult to classify more classes than fewer. This result illustrates another advantage of the MICE over overall accuracy. The same MICE value (0.70) is obtained when overall accuracy = 0.85 for two classes, 0.80 for three classes, and 0.75 for six classes (Fig. 2). These three classifications have the same effectiveness although their overall accuracy values differ. For the global land-cover classification [25], the MICE = 0.63, although its overall accuracy is only 0.67. Such a relatively high MICE value suggests that the global land-cover classification is more effective than the global burned-area classifications (MICE = 0.29 on average) despite the latter's almost perfect overall accuracy (99.5% on average).
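The equal-MICE examples in this paragraph are consistent with treating the random baseline for k balanced classes as having expected accuracy 1/k; the helper below encodes that reading (an inference from the worked numbers, not a formula quoted from the text):

```python
def mice_balanced(overall_accuracy, k):
    """MICE for k equally sized classes, assuming the random baseline
    has expected overall accuracy 1/k."""
    baseline = 1 / k
    return (overall_accuracy - baseline) / (1 - baseline)

# All three cases from the text yield the same MICE of ~0.70.
m2 = mice_balanced(0.85, 2)
m3 = mice_balanced(0.80, 3)
m6 = mice_balanced(0.75, 6)
```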


Image classification is often performed by following a hierarchical classification system [36], [37], which allows lower-level classes to be aggregated into higher-level classes. Such aggregation ensures that overall accuracy cannot decrease; it remains unchanged only when the combined classes have no misclassification errors between them. If the misclassification error between them is relatively small, the overall accuracy can increase, but the MICE values may decrease, suggesting that such an aggregation does not improve the classification effectiveness (Fig. 3, left).

Fig. 3 Error matrices explaining the effectiveness of class aggregation from three to two classes in terms of image classification accuracy and efficacy.

When combining classes with substantial errors between them, the overall accuracy and MICE values can both increase (Fig. 3, right). This kind of effective aggregation is assumed to be the case when aggregation follows a hierarchical classification system. For example, the overall accuracies of the 2011 US National Land Cover Database (NLCD) at Classification Level II and Classification Level I were 82% and 88%, respectively [37]. The corresponding MICE values are 80% and 85%, respectively, confirming that class aggregation from Level II to Level I of the NLCD is effective.


The advantage of image classification efficacy is that it can mitigate the effects of class imbalance and classification schemes on classification assessment and thus emphasize the true effectiveness of classification methods (Fig. 4). Therefore, image classification efficacy can function as a general metric for comparing image classification methods with different class proportions. With rapid advancements in image classification techniques, periodic reviews are becoming increasingly important [2]-[4], [6], [8], [38], [39]. These reviews inevitably involve classification methods that have been tested with different data sources. Comprehensive reviews of image classification techniques can be strengthened by using image classification efficacies. The metrics of image classification efficacy are particularly useful for comparing classification methods; thus, their relative differences are more important than their absolute values.

Fig. 4 Diagram explaining the approach with image classification efficacy to evaluate the performance of image classification methods that involve different images and/or classification schemes.

This does not mean that the efficacy scores should not have a target. As previously discussed, a negative value of image classification efficacy means that the classification is unacceptable; that is the bottom line. The question is: how high is high enough? It is understandable if an image classification analyst considers an accuracy target. For example, Anderson [40]-[42] proposed an accuracy target of 85% for land use and land cover classification with satellite remote sensing data. Referring to a binary classification with a class-size ratio of 75:25, which is the median of the 50:50 and 100:0 ratios, the MICE equals 60% at an overall accuracy of 85%. Therefore, we can subjectively set the target of the image classification efficacy scores to 60%. We then divide the positive efficacy values into six levels: 0-0.19 indicates slight progress, 0.20-0.39 moderate progress, 0.40-0.59 barely satisfactory, 0.60-0.74 satisfactory, 0.75-0.89 extraordinary, and 0.90-0.99 almost perfect. By this scale, for example, the efficacies of the US NLCD datasets [37] are extraordinary at Classification Levels I and II.
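The 60% target and the six-level scale above can be sketched as follows; the proportional-baseline accuracy p^2 + (1 - p)^2 used here is our inference from the worked 75:25 example, not a formula quoted from the text:

```python
# Derive the ~60% target from the 75:25 example (assumed proportional baseline).
p = 0.75
baseline = p * p + (1 - p) * (1 - p)          # 0.625 for a 75:25 class ratio
target = (0.85 - baseline) / (1 - baseline)   # MICE at 85% overall accuracy

# The six qualitative levels for positive efficacy values.
LEVELS = [(0.90, "almost perfect"), (0.75, "extraordinary"),
          (0.60, "satisfactory"), (0.40, "barely satisfactory"),
          (0.20, "moderate progress"), (0.00, "slight progress")]

def efficacy_level(e):
    """Map a non-negative efficacy value to its qualitative level."""
    return next(label for lower, label in LEVELS if e >= lower)
```

With these assumptions, the 85% accuracy target at a 75:25 ratio lands exactly on the satisfactory threshold of 0.60.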

The introduction of image classification efficacy does not complicate existing classification assessment practices. Rather, the misuse of existing classification accuracy metrics can be avoided by employing image classification efficacy. To better conduct image classification efficacy assessment, we summarize the assessment procedures under different circumstances and for different purposes (Fig. 5).

Fig. 5 Flowchart of classification assessment with image classification accuracy and efficacy metrics. SET stands for classification Scheme, area extent, and data-acquisition time.

If the classification methods to be compared are executed with the same images and the same classification scheme, their comparative assessment can be made directly with Type I accuracy metrics. Otherwise, conventional accuracy assessment becomes risky; in that case, the MICE and class-level efficacy metrics should be used.


The derivation of image classification efficacies follows the broadly understood concept of vaccine efficacy. Image classification efficacy measures the effectiveness of an image classification relative to random assignment. The metrics of image classification efficacy are applicable to binary and multiclass classification and are suitable for both class-level and map-level efficacy assessments. More importantly, the values of image classification efficacy mitigate the effects of class proportions and classification schemes and thus are useful for comparing classification methods tested with different images. The introduction of image classification efficacy meets the critical need to rectify the strategy for assessing image classification performance as classification methods become more diversified. The metrics of image classification efficacy can be employed to assess image classifications in all relevant fields, from molecular imaging to Earth observation remote sensing. In any case, researchers are encouraged to provide image data, training data, and reference data when they report their classification progress so that image classification efficacies can be computed when needed.


[1] S. L. Goldenberg, G. Nir, and S. E. Salcudean, A new era: Artificial intelligence and machine learning in prostate cancer, Nature Rev. Urol., vol. 16, no. 7, pp. 391-403, 2019.

[2] T. Krishnan, S. Saravanan, A. S. Pillai, and P. Anguraj, Design of high-speed RCA based 2-D bypassing multiplier for FIR filter, Mater. Today Proc., Jul. 2020, doi: 10.1016/j.matpr.2020.05.803.

[3] A. Nunes et al., Using structural MRI to identify bipolar disorders: 13 site machine learning study in 3020 individuals from the ENIGMA bipolar disorders working group, Mol. Psychiatry, vol. 25, no. 9, pp. 2130-2143, 2020.

[4] P. Anguraj and T. Krishnan, Design and implementation of modified BCD digit multiplier for digit-by-digit decimal multiplier, Analog Integr. Circuits Signal Process., pp. 1-12, 2021.

[5] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, Attention gated networks: Learning to leverage salient regions in medical images, Med. Image Anal., vol. 53, pp. 197-207, 2019.

[6] T. Krishnan, S. Saravanan, P. Anguraj, and A. S. Pillai, Design and implementation of area efficient EAIC modulo adder, Mater. Today Proc., vol. 33, pp. 3751-3756, 2020.

[7] X. Liu, L. Faes, A. U. Kale, S. K. Wagner, D. J. Fu, A. Bruynseels, T. Mahendiran, G. Moraes, M. Shamdas, C. Kern, J. R. Ledsam, M. K. Schmid, K. Balaskas, E. J. Topol, L. M. Bachmann, P. A. Keane, and A. K. Denniston, A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: A systematic review and meta-analysis, Lancet Digit. Health, vol. 1, no. 6, pp. e271-e297, 2019.

[8] L. Li, Y. Chen, Z. Shen, X. Zhang, J. Sang, Y. Ding, X. Yang, J. Li, M. Chen, C. Jin, C. Chen, and C. Yu, Convolutional neural network for the diagnosis of early gastric cancer based on magnifying narrow band imaging, Gastric Cancer, vol. 23, no. 1, pp. 126-132, 2020.

[9] G. Grekousis, Artificial neural networks and deep learning in urban geography: A systematic review and meta-analysis, Comput., Environ. Urban Syst., vol. 74, pp. 244-256, 2019.

[10] L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. A. Johnson, Deep learning in remote sensing applications: A meta-analysis and review, ISPRS J. Photogramm. Remote Sens., vol. 152, pp. 166-177, 2019.

[11] T. Kattenborn, J. Leitloff, F. Schiefer, and S. Hinz, Review on convolutional neural networks (CNN) in vegetation remote sensing, ISPRS J. Photogramm. Remote Sens., vol. 173, pp. 24-49, 2021.

[12] K. J. Bergen, P. A. Johnson, M. V. de Hoop, and G. C. Beroza, Machine learning for data-driven discovery in solid Earth geoscience, Science, vol. 363, no. 6433, Mar. 2019, Art. no. 0323.

[13] A. E. Maxwell, T. A. Warner, and L. A. Guillén, Accuracy assessment in convolutional neural network-based deep learning remote sensing studies. Part 1: Literature review, Remote Sens., vol. 13, no. 13, p. 2450, 2021.

[14] A. E. Maxwell, T. A. Warner, and L. A. Guillén, Accuracy assessment in convolutional neural network-based deep learning remote sensing studies. Part 2: Recommendations and best practices, Remote Sens., vol. 13, no. 13, p. 2591, 2021.

[15] P. Baldi, S. Brunak, Y. Chauvin, C. A. F. Andersen, and H. Nielsen, Assessing the accuracy of prediction algorithms for classification: An overview, Bioinformatics, vol. 16, no. 5, pp. 412-424, 2000.

[16] D. M. W. Powers, Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation, arXiv:2010.16061, 2020.

[17] R. G. Congalton and K. Green, Assessing the Accuracy of Remotely Sensed Data: Principles and Practices, 3rd ed. Boca Raton, FL, USA: CRC Press, 2019, pp. 69-76.

[18] G. Shao, L. Tang, and J. Liao, Overselling overall map accuracy misinforms about research reliability, Landscape Ecology, vol. 34, no. 11, pp. 2487-2492, 2019.

[19] S. Stehman and J. Wickham, A guide for evaluating and reporting map data quality: Affirming Shao et al. Overselling overall map accuracy misinforms about research reliability, Landscape Ecol., vol. 35, pp. 1263-1267, 2020.

[20] D. Chicco and G. Jurman, The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics, vol. 21, no. 1, pp. 1-13, 2020.

[21] R. A. Fisher, Statistical Methods for Research Workers, 13th ed. New York, NY, USA: Hafner, 1958, pp. 150-183.

[22] R. G. Pontius, Jr., and M. Millones, Death to Kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment, Int.

[23] G. M. Foody, Explaining the unsuitability of the Kappa coefficient in the assessment and comparison of the accuracy of thematic maps obtained by image classification, Remote Sens. Environ., vol. 239, Mar. 2020, Art. no. 111630.

[24] D. Chicco, N. Tötsch, and G. Jurman, The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation, BioData Mining, vol. 14, no. 1, p. 13, 2021.

[25] D. Chicco, V. Starovoitov, and G. Jurman, The benefits of the Matthews correlation coefficient (MCC) over the diagnostic odds ratio (DOR) in binary classification assessment, IEEE Access, vol. 9, pp. 47112-47124, 2021.

[26] D. Chicco, M. J. Warrens, and G. Jurman, The Matthews correlation coefficient (MCC) is more informative than Cohen's Kappa and Brier score in binary classification assessment, IEEE Access, vol. 9, pp. 78368-78381, 2021.

[27] M. Padilla, S. V. Stehman, R. Ramo, D. Corti, S. Hantson, P. Oliva, I. Alonso-Canas, A. V. Bradley, K. Tansey, B. Mota, J. M. Pereira, and E. Chuvieco, Comparing the accuracies of remote sensing global burned area products using stratified random sampling and estimation, Remote Sens. Environ., vol. 160, pp. 114-121, 2015.

[28] J. Scepan, Thematic validation of high-resolution global land-cover data sets, Photogramm. Eng. Remote Sens., vol. 65, pp. 1051-1060, Sep. 1999.

[29] G. M. Foody, Harshness in image classification accuracy assessment, Int. J. Remote Sens., vol. 29, pp. 3137-3158, 2008.

[30] S. V. Stehman and G. M. Foody, Key issues in rigorous accuracy assessment of land cover products, Remote Sens. Environ., vol. 231, Sep. 2019, Art. no. 111199.

[31] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning, IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325-3337, 2015.

[32] O. Okwuashi and C. E. Ndehedehe, Deep support vector machine for hyperspectral image classification, Pattern Recognit., vol. 103, Jul. 2020, Art. no. 107298.

[33] J.-U. Hou, S. W. Park, S. M. Park, D. H. Park, C. H. Park, and S. Min, Efficacy of an artificial neural network algorithm based on thick-slab MRCP images for the automated diagnosis of common bile duct stones, J. Gastroenterol. Hepatol., Jun. 2021.

[34] T. J. Lark, I. H. Schelly, and H. K. Gibbs, Accuracy, bias, and improvements in mapping crops and cropland across the United States using the USDA Cropland Data Layer, Remote Sens., vol. 13, no. 5, p. 968, Mar. 2021.

[35] K. Jia, S. Li, Y. Wen, T. Liu, and D. Tao, Orthogonal deep neural networks, IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 4, pp. 1352-1368, 2021.

[36] A. Repici, M. Badalamenti, R. Maselli, L. Correale, F. Radaelli, E. Rondonotti, E. Ferrara, M. Spadaccini, A. Alkandari, A. Fugazza, A. Anderloni, P. A. Galtieri, G. Pellegatta, S. Carrara, M. Di Leo, V. Craviotto, L. Lamonaca, R. Lorenzetti, and C. Hassan, Efficacy of real-time computer-aided detection of colorectal neoplasia in a randomized trial, Gastroenterology, vol. 159, no. 2, pp. 512-520, 2020.

[37] M. S. Seo, J. Lee, J. Park, D. Kim, and D.-G. Choi, Sequential feature filtering classifier, IEEE Access, vol. 9, pp. 97068-97078, 2021.

[38] W. A. Orenstein, R. H. Bernier, T. J. Dondero, A. R. Hinman, J. S. Marks, K. J. Bart, and B. Sorokin, Field evaluation of vaccine efficacy, Bull. World Health Org., vol. 63, no. 6, p. 1055, 1985.

[39] K. Fenske, H. Feilhauer, M. Förster, M. Stellmes, and B. Waske, Hierarchical classification with subsequent aggregation of heathland habitats using an intra-annual RapidEye time-series, Int. J. Appl. Earth Observ. Geoinf., vol. 87, Art. no. 102036, 2020.

[40] J. Wickham, S. V. Stehman, L. Gass, J. A. Dewitz, D. G. Sorenson, B. J. Granneman, R. V. Poss, and L. A. Baer, Thematic accuracy assessment of the 2011 National Land Cover Database (NLCD), Remote Sens. Environ., vol. 191, pp. 328-341, 2017.

[41] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza, Deep learning classifiers for hyperspectral imaging: A review, ISPRS J. Photogramm. Remote Sens., vol. 158, pp. 279-317, 2019.

[42] G. Cheng, X. Xie, J. Han, L. Guo, and G.-S. Xia, Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities, IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 3735-3756, 2020.

[43] J. R. Anderson, A land use and land cover classification system for use with remote sensor data, U.S. Geological Survey Professional Paper 964, Washington, DC, USA, 1976.

[44] S. Jeyasudha, B. Geethalakshmi, K. Saravanan, R. Kumar, L. H. Son, and H. V. Long, A novel Z-source boost derived hybrid converter for PV applications, Analog Integr. Circuits Signal Process., vol. 109, no. 2, pp. 283-299, 2021.

[45] S. Jeyasudha and B. Geethalakshmi, Modeling and performance analysis of a novel switched capacitor boost derived hybrid converter for solar photovoltaic applications, Solar Energy, vol. 220, pp. 680-694, 2021.