Mass Classification of Mammogram Images using Selected Textural Features with SVM Classifier

DOI : 10.17577/IJERTV4IS051281

Download Full-Text PDF Cite this Publication

Text Only Version

Mass Classification of Mammogram Images using Selected Textural Features with SVM Classifier

Nita Chayengia

Department of Electronics and Telecommunication Engineering

SVERIs COE Pandharpur, Solapur, Maharashtra, India

Prof. Mrs. M. M. Pawar

Department of Electronics and Telecommunication Engineering

SVERIs COE Pandharpur, Solapur, Maharashtra, India

AbstractIn mammography diagnosis systems, high False Negative Rate (FNR) has always been a significant problem since a false negative answer may lead to a patients death. Development of a new Computer aided Diagnosis (CADx) system for the diagnosis of breast masses is the major objective of this paper. It aims at intensifying the performance of CADx algorithms as well as reducing the FNR by using Gray-Level Co- Occurrence Matrix (GLCM) for extraction of textural features. The input Regions of Interest (ROIs) are segmented manually and further subjected to feature extraction, selection and then classification. However from 322 MIAS database, 59 ROIs are taken into consideration for feature extraction. 19 texture features are extracted and 11 features are selected for feature classification. SVM classifier with Polynomial kernel and 80-20 train-test partition is used for classification. The Sensitivity, Specificity and Accuracy obtained by the selected features are 100%, 100%, and 100% respectively.

KeywordsMammogram, ROI, Texture,SVM

  1. INTRODUCTION

    Cancer is a group of diseases that cause cells in the body to change rapidly and grow out in an uncontrolled manner. Basically cancer cells sooner or later form a lump or mass called a tumor. Breast cancer is found to be the most common cancer occurring in women all over India and accounts for 25% to 31% of all cancers in Indian women. For the year 2012, GLOBOCAN (WHO), estimated 70218 dead cases of women in India due to breast cancer. India ranks first in this regard compared to any other country in the world (second: China – 47984 deaths and third: US – 43909 deaths). In India, many non-oncology medical professionals such as General Surgeons, Gynecologists etc. apt to treat breast cancer themselves, this leads to a lot of wrong decisions, unnecessary investigations, and painful surgeries. This directly has an effect on the outcome and longevity of the patient [8]. Breast cancer begins in the breast tissue called lobules, and also found in the ducts that connect the lobules to the nipple. Breast Cancer is commonly detected before or after symptoms are developed in a woman. Masses detected on a mammogram are basically benign or malignant. Most breast lumps turn out to be benign which are non-cancerous and are not life- threatening. But some masses turn out to be malignant; that is, they are cancerous and are life-threatening [9].

  2. LITERATURE SURVEY

    Liu et.al [1] proposed a new feature selection method, known as SVM-RFE with an NMIFS filter (SRN). They achieved good accuracy rate with the SVM classifier using the

    selected features. These are the F-score (88%), Relief (88%), SVM-RFE (90%), SVM-RFE (mRMR) (91%), and SRN

    methods (93%), with a tenfold cross-validation procedure, and 91%, 89%, 92%, 92%, and 94%, respectively, with a leave- one-out (LOO) scheme. Tahmasbi et.al [2] paper is directed towards the development of a novel Computer-aided Diagnosis (CADx) system for the diagnosis of breast masses. Their objective is to intensify the performance of CADx algorithms as well as reducing the FNR by utilizing Zernike moments as descriptors of shape and margin characteristics. The designed systems yield Az ¼ 0.976, representing fair sensitivity, and Az ¼ 0.975 demonstrating fair specificity. The best achieved FNR and FPR are 0.0% and 5.5%, respectively. Chen et.al [3] proposed a rough set (RS) based supporting vector machine classifier (RS_SVM) for breast cancer diagnosis. In the proposed method (RS_SVM), RS reduction algorithm is employed as a feature selection tool to remove the redundant features and further improve the diagnostic accuracy by SVM. The proposed RS_SVM not only attains very high classification accuracy but also easily detect accurate breast diagnosis with a combination of five informative features. Zheng et.al [4] objective of research is to diagnose breast cancer based on the extracted tumor features. A hybrid of K-means and support vector machine (K-SVM) algorithms is developed to extract convenient information and identify the tumor. Based on 10-fold cross validation, the proposed methodology achieves accuracy rate of 97.38%, when tested on the Wisconsin Diagnostic Breast Cancer (WDBC) data set from the University of California Irvine machine learning repository. Mousa et.al [5] introduced two methods based on wavelet analysis and fuzzy-neural approaches. These methods are mammography classifier based on globally processed image and on locally processed image (region of interest). This system classifies normal from abnormal masses and micro calcification. The estimation of the system is carried out on Mammography Image Analysis Society (MIAS) dataset. Guo et.al [6] proposed a novel method for breast cancer diagnosis using the feature generated by genetic programming (GP). A new feature extraction measure (modified Fisher linear discriminant analysis (MFLDA)) was developed to overcome the drawbacks of Fisher criterion. The capability of this technique is to alter information from high-dimensional feature space into one- dimensional space and automatically determine the relationship amongst data, to increase classification accuracy. Haralick et.al [7] calculated textural features based on gray tone spatial dependencies, and illustrates their application in category-identification tasks of three different kinds of image

    data. He used two kinds of decision rules i.e., piecewise linear decision rule and min-max decision rule. Accuracy of test dataset is 89, 82 and 83 percent for photomicrographs, aerial photographic imagery and satellite imagery respectively.

  3. METHODOLOGY

    Fig.1. a) Original Image, b) ROI Image

    A. Image database

    The mammogram images used in this experiment are taken from the mini mammography database of MIAS. The Mammographic Image Analysis Society, MIAS database contains all total 322 mammographic images in MLO which it contains 207 normal, 63 benign and 52 malignant cases. It has been found that images of MIAS database are basically stored in .pgm (Portable Gray Map) format. The original MIAS Database (digitized at 50 m pixel edge) has been reduced to 200 m pixel edge and clipped/padded so that every image is 1024 pixels×1024 pixels. All images are held as 8-bit gray level scale images with 256 different gray levels (0255) [1]. The Mammographic Image Analysis Society has provided the image database for the purpose of research. In our experiment we have considered breast tissues which are fatty, fatty- glandular, dense-glandular and the abnormalities like, well- defined / circumscribed masses, spiculated masses and ill- defined masses as shown in Table I.

    TABLE I. DISTRIBUTION OF MIAS MASSES

    1. Feature Extraction and Selection

      After extracting ROI, calculate a set of features that is related to texture of the boundary and its neighbor regions. Characteristic of benign tumor are round, smooth, and well- circumscribed boundary, whereas the boundary of a malignant tumor is usually spiculated, rough, and blurry [3]. Thus, we can use a boundary analysis to classify the masses into benign or malignant. Although these features have been used in different publications, we will use only texture features in this paper to get better perfomance.

      1. Texture Features from GLCM

    The texture information of the region surrounding the mass boundary contains important information to discriminate the benign and malignant masses. Thus, we have used the texture information for mass classification. The GLCM has been widely used in several applications, including the analysis of mammographic masses [1]. The features that are extracted from the GLCM are the autocorrelation, correlation, contrast, the cluster prominence, the cluster shade, the energy, entropy, homogeneity, the maximum probability, the sum of squares, the sum of average, the sum of variance, the sum of entropy, the difference in variance, the difference in entropy, the information measure of correlation (1), information measure of correlation (2), the inverse difference normalized, and the inverse difference moment normalized [6] as shown in Table 3. In the computation of the features, we scale the gray level to 16, and four GLCMs are constructed by scanning each mass ribbon at angles of 0, 45, 90, and 135, with the pixel distances set to 1. The GLCMs are averaged before the feature extraction. Thus, we obtain 19 texture features [1]. Because the offset is often expressed as an angle, the following Table II lists the offset values that specify common angles, given the pixel distance D.

    Angle

    Offset

    0

    [0 D]

    45

    [-D D]

    90

    [-D 0]

    135

    [-D D]

    TABLE II. OFFSET VALUES

    Class

    Benign

    Malignant

    Total

    Circumscribed masses Spiculated masses

    Ill defined masses

    19+2

    11

    7

    4

    8

    8

    25

    19

    15

    Total

    39

    20

    59

    1. Region of Interest

      A region of interest (ROI) is a portion of an image that is to be filtered or performed some other operation on. ROI creates a binary mask, and this binary image have the same size as that of the image which is to be processed with pixels that define the ROI set to 1 and all other pixels set to 0. In an image we can extract one or more ROI. The regions can be defined by a range of intensities. As shown in figure 2.a) Original image having 1024×1024 pixels are cropped into 256×256 pixels in figure 2.b) keeping in mind that region of interest is not affected.

      The figure 3 illustrates the array: offset = [0 1; -1 1; -1 0; -1 -1]

      900 [-1 0]

      1350 [-1 1]

      450 [-1 1]

      00 [0 1]

      1. b)

    Fig.2. a) Original Image, b) ROI Image

    Pixel of interest

    Fig.3. Gray-Level Co-occurrence Matrices (GLCM) showing offset values

    A number of texture features may be extracted from the GLCM [1]. We use the following notation:

    N g is the number of gray levels used.

    is the mean value of P.

    x , y , x and y are the means and standard deviations of Px

    and Py .

    Px i is the ith entry in the marginal-probability matrix obtained by summing the rows of Pi, j.

    From the obtained features we have selected 11 features for feature classification based on SVM. These are the cluster prominence, cluster shade, energy, entropy, the sum of squares, the sum of average, the sum of variance, the sum of entropy, the difference in variance, the difference in entropy, and the information measure of correlation. The comparison of classification accuracy of the nineteen features done by the SVM model as shown in Table III.

    Feature Index

    Feature

    Formula

    F1

    Autocorrelation

    F1 i jPi, j

    i j

    F2

    Correlation

    i jPi, j x y F 2 i j

    x y

    F3

    Contrast

    F3 n2 Pi, j

    n i j

    F4

    Cluster prominence

    F 4 i j x y 4Pi, j

    i j

    F5

    Cluster shade

    F5 i j x y 3Pi, j

    i j

    F6

    Energy

    F 6 Pi, j2

    i j

    F7

    Entropy

    F 7 Pi, jlogPi, j

    i j

    F8

    Homogeneity

    F8 1 Pi, j

    1 i j2

    i j

    F9

    Maximum probability

    F9 MaxPi, j i, j

    F10

    Sum of squares

    F10 i 2Pi, j

    i j

    F11

    Sum of average

    F11 2Ng i P i

    i2 x y

    F12

    Sum of square / variance

    F12 i 2Pi, j

    i j

    F13

    Sum of entropy

    2Ng

    F13 Px y ilogPx y i

    i2

    F14

    Difference in variance

    F14 VariancePx y

    TABLE III. TEXTURE FEATURES

    F15

    Difference in entropy

    Ng 1

    F15 Px y ilogPx y i i0

    F16

    Information measure of correlation

    F16 HXY HXY1

    maxHX ,HY

    F17

    Inverse difference moment normalized

    Ng 1Ng 1

    F17 pi, j

    1 i j2

    i0 j0

    1. Feature Classification

      To create this kind of feature-based classification, we should have some information of what features make noble analysis of class membership for the classes we are trying to differentiate [4]. The classification method with supervised learning involves two steps:

      1. Training this is where we discover what features are useful for classification by looking at pre-classified examples.

      2. Testing this is where we look at new examples and assign them to classes based on the features we have learned during training.

    Features are trained and tested before passing through SVM classifier as shown in Figure 4.

    During the process as shown in Figure 4, we tried to utilize an equal number of images taken from the MIAS database. Fifty nine ROIs were used for the experiments; among them, 39 were benign and 20 were malignant. In the process 12 masses were used as testing samples and 47 masses were used as the training samples. The SVM classifier was used as this classifier has six different kernel values here.

    Fig.4. Feature Classification

    1) SVM Classifier

    The Support vector machines (SVM) is originally developed by Boser et al. (1992) and Vapnik (1995) is based on the VapnikChervonenkis (VC) theory and structural risk minimization (SRM) principle [7]. This technique tries to find the tradeoff between reducing the training set error and increasing the margin, in the direction to achieve the best results. Performance of the classifier is measured by the true positive rate (TPR), the true negative rate (TNR), and the accuracy (ACC). The numbers of true positive in a classifier is represented as TP, false positive as FP, true negative as TN, and false negative number as FN. In order to assess SVM classifier prediction performance, we calculate the sensitivity, specificity and classification accuracy respectively as shown in Table IVand the equations are as given below:

    TPR

    TP

    TP FN

    (1)

    As shown in Table VI sensitivity, specificity and classification accuracy for 50-50, 70-30 and 80-20 train-test

    NR TN (2)

    TN FP

    ACC TP TN (3)

    TP TN FP FN

    TABLE IV. TEXTURE FEATURES

    Metrics

    Formula

    Description

    Percentage of

    abnormalities

    Sensitivity

    TPR TP 100%

    TP FN

    correctly detected /

    classified as abnormalities

    Specificity

    TNR TN 100%

    TN FP

    Percentage of normal structures

    correctly detected / classified as normal

    Accuracy

    ACC TP TN 100%

    TP FN TN FP

    Percentage of abnormalities and

    normal structures

    correctly detected /

    classified

  4. RESULTS

    For the purpose of classification, it is anticipated that the linear separability of the mapped samples is enhanced in the kernel feature space so that applying traditional linear algorithms in this space could result in better performance compared to those obtained in the original input space. Inappropriate selection of kernel can give worse classification performance than that of the linear one. Therefore, selecting a proper kernel with good class separability plays a vital role in kernel-based classification algorithms. Here the Polynomial function performs well to do the task. A feature extraction method for finding the most significant features is proposed and implemented to detect and classify breast cancer in mammogram masses. The method is based on a constructing the database samples in 50-50%, 70-30%, 80-20% training- testing partition. The classification accuracy rate achieved by the proposed method using 50-50, 70-30 and 80-20 train-test

    partition is 79.31%, 94.44% and 100% respectively for benign versus malignant masses as shown in Table V.

    TABLE V. TEXTURE FEATURES

    Metrics

    Classification Accuracy

    50-50 %

    traintest partition

    70-30 %

    traintest partition

    80-20 %

    traintest partition

    Sensitivity (%)

    83.33

    83.33

    100

    Specificity (%)

    78.26

    91.67

    100

    Accuracy (%)

    79.31

    94.44

    100

    increases respectively. We trust the proposed system can be very helpful in assisting the physicians to make the truthful diagnosis on the patients.

    TABLE VI. TEXTURE FEATURES

    Training- testing

    partition (%)

    Number of samples

    in set

    Train dataset

    Accuracy (%)

    Test dataset Accuracy (%)

    Training set

    Testing set

    50-50

    30

    29

    100

    79.31

    70-30

    41

    18

    100

    94.44

    80-20

    47

    12

    100

    100

  5. RESULTS

In this paper, we have studied and presented the results of the classification of breast masses with a data set of 322 images. For ROI extraction original image having 1024×1024 pixels are cropped into 256×256 pixels and we have considered 39 benign and 20 malignant masses of circumscribed, spiculated and ill-defined masses. After the ROI extraction, each mass was represented witp9 texture features. Before classification, feature selection was performed with 59 ROIs. The SVM classifier using a Polynomial kernel is employed to perform the classification tasks on 59 ROIs. With the SVM classifier, Sensitivity, Specificity and Accuracy achieved in this paper is 100%, 100%, and 100% respectively.

REFERENCES

  1. Amir Tahmasbi, Fatemeh Saki, Shahriar B.Shokouhi, Classification of benign and malignant masses based on Zernike moments Computers in Biology and Medicine 41 (2011) 726735.

  2. Hui-Ling Chen, Bo Yang, Jie Liu, Da-You Liu, A support vector machine classifier with rough set-based feature selection for breast cancer diagnosis Expert Systems with Applications 38 (2011) 9014 9022.

  3. Bichen Zheng, Sang Won Yoon, Sarah S. Lam, Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms Expert Systems with Applications 41 (2014) 14761482.

  4. Rafayah Mousa, Qutaishat Munib*, Abdallah Moussa, Breast cancer diagnosis system based on wavelet analysis and fuzzy-neural Expert Systems with Applications 28 (2005) 713723Tavel, P. 2007 Modeling and Simulation Design. AK Peters Ltd.

  5. Hong Guo, Asoke K. Nandi, Breast cancer diagnosis using genetic programming generated feature Pattern Recognition 39 (2006) 980 987.

  6. R. M. Haralick, K. Shanmugam, and I. Dinstein, Textural Features of Image Classification, IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-3, no. 6, Nov. 1973.

  7. Breast Cancer India : http://www.breastcancerindia.net/

  8. Breast cancer statistics : http://www.wcrf.org/int/cancer-facts- figures/data-specific-cancers/breast-cancer-statistics.

Leave a Reply