A Content based Mail Detection Technique

DOI : 10.17577/IJERTV11IS080019

Irani Hazarika

Dept of Computer Science Gauhati University Guwahati, India

Purabi Choudhury

Dept of Computer Science Gauhati University Guwahati, India

Abstract: Email spam is one of the major problems of today's internet. There are various ways to detect spam emails. In this paper, a content based mail detection technique is proposed. For this purpose, the content of each mail is considered as a text document. To measure its performance, the proposed method is applied to a real-life dataset and compared with the standard K-Nearest Neighbor (KNN) classification algorithm. It is seen that the proposed method gives better accuracy than the KNN algorithm.

Keywords: Classification, Spam mail, Vector space model, Content based mail detection

  1. INTRODUCTION

    The term spam is generally used to denote unsolicited commercial e-mail. Email spam is one of the major problems of today's internet. Various methods have been proposed to automatically classify messages as spam or legitimate, such as rule-based approaches, white and blacklists, collaborative spam filtering and challenge-response systems. Popular spam mail detection methods include rule based algorithms [1], Support Vector Machines [2], Naive Bayes classifiers [3] and Bayesian classifiers [7], which automatically detect and remove spam. A Bayesian classifier is a statistical classifier that computes probabilities under an independence assumption [7]. In [4], a non-content based mail filtering approach is proposed: an intelligent hybrid spam filtering method that analyzes only email headers for spam detection. The content based mail detection technique [5] [6] is an important and popular one. A content based method checks the text in the body of an email for classification. It performs the text classification task after applying some preprocessing to the text. In this paper, we propose a content based mail classification technique using the vector space model. To classify an input mail as ham or spam, we consider the content of each mail as a text document.

    The remainder of this paper is organized as follows. Section 2 gives the pre-processing steps. Section 3 describes the proposed method. Experimental results are shown in Section 4. Finally, Section 5 offers conclusions and outlines future work.

  2. STEPS OF PREPROCESSING

    As mentioned earlier, the aim of the proposed method is to classify a test mail as spam or ham by analyzing its content and the content of the available training mails. In the proposed method we consider two sets of training mails: one contains only ham mails and the other contains only spam mails.

    Thus, we have two training datasets, each containing one set of training mails, and a test dataset that contains an input test mail. Before applying the proposed method, we first have to apply some pre-processing tasks to the datasets. The pre-processing steps are described below:

    • Converting HTML files to TEXT files

      The mails in the datasets are in HTML form. These mails are therefore first converted to text files using the software Okdo HTML to TXT converter.

    • Stop word removal

      After converting the mails to text files, the non-textual content and stop words are removed from the text files using a standard stop word list. After removing the stop words, each text file in the datasets contains only a set of keywords.

    • Creating master wordlist

      After removing the stop words from the files, we create a master wordlist for each training mail set and for the test mail. The master wordlist contains every distinct keyword present in the dataset.
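The stop word removal and master wordlist steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the tiny `STOP_WORDS` set, the regular-expression tokenizer and the function names `extract_keywords` and `master_wordlist` are all assumptions made for the example. The wordlist is sorted only to fix one arbitrary order.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
# The paper uses a "standard" list that it does not name.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}


def extract_keywords(text):
    """Lowercase the text, keep alphabetic tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def master_wordlist(mails):
    """Return every distinct keyword found in the mails, in a fixed order."""
    words = set()
    for mail in mails:
        words.update(extract_keywords(mail))
    return sorted(words)
```

For example, `master_wordlist(["Win a free prize", "The meeting is at noon"])` yields the six keywords `at`, `free`, `meeting`, `noon`, `prize` and `win`.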

  3. PROPOSED METHOD

    In the proposed method, the mails present in the training datasets and the test dataset are represented as vector space models using the normalized term frequencies of the mails. Arithmetic mean vectors are then computed from the vector space models. Using these arithmetic mean vectors, the test mail is classified as spam or ham. The procedures are described below-

    1. Vector space model representation of mails

      After applying the pre-processing steps to the mails present in the datasets, each mail is represented as a fuzzy set using the normalized term frequencies of the keywords present in its master wordlist.

      Let W be the set of all distinct keywords appearing in the master wordlist of the dataset, and let |W| = n. The keywords are arranged in an arbitrary but fixed order, which gives a sequence of the form

      W = {w1, w2, w3, …, wn}

      Keeping this ordering in mind, we can represent each mail d as

      d = {o1, o2, o3, …, on}

      where oi indicates the term frequency of the word wi in the mail d, i.e. the number of occurrences of wi in d. If a word wi is not present in the mail, then oi = 0 for that mail.

      We can normalize the term frequency value of the word wi in mail d to [0, 1] as follows-

      ntfi = oi / n

      Thus, the fuzzy set representation of mail d is d = {ntf1, ntf2, ntf3, …, ntfn}
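As an illustration of this representation, the following Python sketch builds the term frequency vector and the normalized term frequency vector of one mail. It assumes the mail has already been reduced to a list of keywords and that n is the length of the master wordlist, as defined above. The function names are hypothetical.

```python
from collections import Counter


def term_frequency_vector(keywords, wordlist):
    """Return o_i, the number of occurrences of each wordlist entry in the mail."""
    counts = Counter(keywords)
    return [counts[w] for w in wordlist]


def normalized_tf_vector(keywords, wordlist):
    """Return ntf_i = o_i / n, following the text, with n = len(wordlist)."""
    n = len(wordlist)
    return [o / n for o in term_frequency_vector(keywords, wordlist)]
```

With the wordlist `["free", "meeting", "prize", "win"]` and the mail keywords `["win", "free", "prize", "prize"]`, the term frequency vector is `[1, 0, 2, 1]` and the normalized vector is `[0.25, 0.0, 0.5, 0.25]`.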

      Again, the vector space model is an algebraic model for representing text documents as vectors of identifiers. Here, the vector space model for the mails present in a dataset is created as follows-

      The vector space model Vm x n is built from the normalized term frequency vectors of the m mails present in the dataset and the n distinct words in its master wordlist, where entry Vij denotes the normalized term frequency (ntfij) of the jth keyword in the ith mail.

      In the proposed method, we create three vector space models: one for the training ham mail dataset, one for the training spam mail dataset and one for the test data.

    2. Calculation of arithmetic mean vector

      From each vector space model a new vector is created, which is called the arithmetic mean vector.

      Suppose A = {atf1, atf2, …, atfn} is the arithmetic mean vector for the vector space model Vm x n. Then atfj is calculated as follows-

      $$atf_j = \frac{1}{m} \sum_{i=1}^{m} V_{ij}$$

    3. Fuzzy similarity measure

      A fuzzy similarity function for a pair of mails is defined, based on fuzzy logic [8]. Let ntfm1 and ntfm2 be the normalized term frequency vectors of mails m1 and m2. Then the fuzzy similarity between m1 and m2 can be calculated using-

      $$sim(m_1, m_2) = \frac{|ntf_{m_1} \cap ntf_{m_2}|}{|ntf_{m_1} \cup ntf_{m_2}|}$$

    4. Proposed algorithm

    The proposed spam mail detection algorithm is as follows-

    Step 1:

    1. For the training dataset containing n ham mails

      1. Create a master ham word list LH which contains all distinct words (wham) from the ham mails.

      2. Create a vector space model Vn x wham for the n ham mails.

      3. Calculate the arithmetic mean vector Htrain for the vector space model Vn x wham.

    2. For the training dataset containing m spam mails

      1. Create a master spam word list LS which contains all distinct words (wspam) from the spam mails.

      2. Create a vector space model Vm x wspam for the m spam mails.

      3. Calculate the arithmetic mean vector Strain for the vector space model Vm x wspam.

    Step 2:

    For the test mail

      1. Create a word list LN with all distinct words from the test mail.

      2. Find the wordlists XH and XS, containing EH and ES extra distinct words respectively from LN. The words in XH are in LN but not in LH, and the words in XS are in LN but not in LS.

      3. Append EH and ES zeroes to the arithmetic mean vectors Htrain and Strain respectively.

      4. Calculate the vector space models V1 x wham1 and V1 x wspam1 for the test mail. To create V1 x wham1, the normalized term frequency vector of the test mail is calculated over the wordlist obtained by appending XH to LH (wham1: total number of words in XH and LH). To create V1 x wspam1, it is calculated over the wordlist obtained by appending XS to LS (wspam1: total number of words in XS and LS).

      5. Calculate the arithmetic mean vector Htest for V1 x wham1 and Stest for V1 x wspam1.

    Step 3:

      1. Calculate the fuzzy similarity fsimham between the arithmetic mean vectors Htrain and Htest (for V1 x wham1).

      2. Calculate the fuzzy similarity fsimspam between the arithmetic mean vectors Strain and Stest (for V1 x wspam1).

    Step 4: Assign the mail to the group of ham mails if the value of fsimham is greater than fsimspam; otherwise assign it to the group of spam mails.
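The four steps can be sketched in Python as follows. This is an illustrative reading of the algorithm under stated assumptions, not the authors' implementation. It assumes that each mail is already a list of keywords, that the fuzzy intersection and union are the element-wise minimum and maximum, and that the cardinality of a fuzzy set is the sum of its memberships. It also returns a similarity of 0.0 when both vectors are all zero. All function names are hypothetical.

```python
from collections import Counter


def ntf_vector(keywords, wordlist):
    """Normalized term frequency o_i / n over `wordlist`, with n = len(wordlist)."""
    counts = Counter(keywords)
    n = len(wordlist)
    return [counts[w] / n for w in wordlist]


def mean_vector(rows):
    """Column-wise arithmetic mean of the rows of a vector space model."""
    m = len(rows)
    return [sum(col) / m for col in zip(*rows)]


def fuzzy_similarity(a, b):
    """Sum of element-wise minima over sum of element-wise maxima (assumed operators)."""
    union = sum(max(x, y) for x, y in zip(a, b))
    if union == 0:
        return 0.0  # assumption: two all-zero vectors are treated as dissimilar
    return sum(min(x, y) for x, y in zip(a, b)) / union


def train(mails):
    """Step 1: master wordlist and arithmetic mean vector of one training set."""
    wordlist = sorted({w for mail in mails for w in mail})
    model = [ntf_vector(mail, wordlist) for mail in mails]
    return wordlist, mean_vector(model)


def similarity_to_class(test_keywords, wordlist, mean):
    """Steps 2 and 3 for one class (ham or spam)."""
    extra = sorted(set(test_keywords) - set(wordlist))  # XH or XS
    extended = wordlist + extra                         # LH + XH or LS + XS
    padded_mean = mean + [0.0] * len(extra)             # append EH or ES zeroes
    # The test model has a single row, so its mean vector is the row itself.
    test_vec = ntf_vector(test_keywords, extended)
    return fuzzy_similarity(padded_mean, test_vec)


def classify(test_keywords, ham_mails, spam_mails):
    """Step 4: ham if fsimham is greater than fsimspam, otherwise spam."""
    ham_words, h_train = train(ham_mails)
    spam_words, s_train = train(spam_mails)
    fsim_ham = similarity_to_class(test_keywords, ham_words, h_train)
    fsim_spam = similarity_to_class(test_keywords, spam_words, s_train)
    return "ham" if fsim_ham > fsim_spam else "spam"
```

Because only the two mean vectors are kept after training, classifying a test mail needs two similarity computations, regardless of how many training mails there are.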

  4. EXPERIMENTAL RESULTS AND DISCUSSIONS

    For experimental purposes, the proposed method has been applied to the real-life Enron mail dataset. The Enron email dataset was collected and prepared by the CALO project. The mails in the dataset are classified into two categories, ham and spam. The ham category contains a large set of ham mails and the spam category contains a large set of spam mails.

    The performance of the proposed method is compared with the standard KNN algorithm using classification accuracy. Both methods are implemented in C++.

    For convenience, the KNN algorithm is described below-

    Consider a training dataset that contains both ham and spam mails. Find the distinct word lists for the training mail set and the test mail separately. Then add the words that are present in the test mail list but not in the training mail list to the distinct word list of the training mail set. Suppose this new distinct word list is Lnew.

    Based on this list Lnew, find the vector space models for both the training mail set and the test mail using the normalized term frequencies of the mails. Both vector space models are then given as input to the KNN algorithm to classify the test mail.

    The steps of the KNN algorithm are-

    1. Calculate the fuzzy similarity value ci between the test mail and each training mail di (where i = 1, 2, 3, …, n) using the respective vector space models.

    2. Store each pair (ci, di) in a structure A as follows-

      A = [(c1, d1), (c2, d2), …, (cn, dn)]

    3. Convert the similarity values present in A to distances as-

      ci = 1 - ci

    4. Arrange the pairs (ci, di) present in A in ascending order of the value ci.

    5. Set Dmax = cn, the largest value of ci after sorting.

    6. Calculate the probability of mail di as-

      Pi = 1 - ci/Dmax, for i = 1, 2, 3, …, n

    7. Select the first K mails from A.

    8. Set Bj = 0 for j = 1 up to the number of classes.

    9. For each mail di among the first K mails in A, set Bj = Bj + Pi, where j is the class of di and Pi is its probability.

    10. Select j with maximum value of Bj

    11. Assign the test mail to class j
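The KNN steps above can be sketched in Python as follows. This is a hedged illustration rather than the authors' C++ code. It assumes the test mail and training mails are already normalized term frequency vectors over the same wordlist Lnew, and it takes the fuzzy similarity to be the sum of element-wise minima over the sum of element-wise maxima. The guard for Dmax = 0 is an added assumption that the text does not cover, and the function names are hypothetical.

```python
def fuzzy_similarity(a, b):
    """Sum of element-wise minima over sum of element-wise maxima (assumed operators)."""
    union = sum(max(x, y) for x, y in zip(a, b))
    return sum(min(x, y) for x, y in zip(a, b)) / union if union else 0.0


def knn_classify(test_vec, train_vecs, train_labels, k):
    """Weighted KNN vote following steps 1 to 11 of the text."""
    # Steps 1 to 3: similarity to each training mail, converted to 1 - c_i.
    pairs = [(1.0 - fuzzy_similarity(test_vec, vec), label)
             for vec, label in zip(train_vecs, train_labels)]
    # Step 4: ascending order of the converted value.
    pairs.sort(key=lambda pair: pair[0])
    # Step 5: Dmax is the largest value after sorting.
    d_max = pairs[-1][0]
    # Steps 6 to 9: accumulate P_i = 1 - c_i / Dmax per class over the first K mails.
    scores = {}
    for dist, label in pairs[:k]:
        # Assumption: if every distance is zero, give each neighbour full weight.
        p = 1.0 - dist / d_max if d_max > 0 else 1.0
        scores[label] = scores.get(label, 0.0) + p
    # Steps 10 and 11: the class with the largest accumulated value wins.
    return max(scores, key=scores.get)
```

Unlike the proposed method, this procedure compares the test mail with every training mail, so its cost grows with the size of the training set.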

    • Accuracy Measure

      Accuracy measures the percentage of predictions that are correct. The accuracy (Acc) of the proposed method has been calculated using the following measure-

      Acc = (TP+TN) / (TP+FP+FN+TN)

      where TP, FP, FN and TN are the numbers of true positives, false positives, false negatives and true negatives, with spam taken as the positive class:

      True Positive (TP): the number of spam mails correctly classified as spam.

      False Negative (FN): the number of spam mails incorrectly classified as ham.

      False Positive (FP): the number of ham mails incorrectly classified as spam.

      True Negative (TN): the number of ham mails correctly classified as ham.
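As a small worked example of the accuracy measure, the following Python sketch counts the four outcomes with spam as the positive class, consistent with the definition of TP, and applies the formula for Acc. The function name and the label strings are assumptions made for illustration.

```python
def accuracy(true_labels, predicted_labels, positive="spam"):
    """Return (TP + TN) / (TP + FP + FN + TN) with `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive:
            if truth == positive:
                tp += 1  # spam classified as spam
            else:
                fp += 1  # ham classified as spam
        else:
            if truth == positive:
                fn += 1  # spam classified as ham
            else:
                tn += 1  # ham classified as ham
    return (tp + tn) / (tp + fp + fn + tn)
```

For five test mails with true labels spam, spam, ham, ham, ham and predictions spam, ham, ham, spam, ham, the counts are TP = 1, FN = 1, TN = 2 and FP = 1, so Acc = 3/5 = 0.6.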

    • Results and discussion

      To measure the performance of the proposed method, we consider two training datasets (one for spam mails and one for ham mails), each containing 300 emails from the Enron mail dataset. Thus, for the KNN algorithm the training mail set contains 600 mails, as it contains both ham and spam mails. The accuracy of the proposed method and of the KNN algorithm is then measured for different numbers of test emails. The results are shown in Table 1. From Table 1, it is seen that the proposed method is at least as accurate as the KNN algorithm in all cases, and more accurate for 100 test mails.

      Table 1: Accuracy of the KNN algorithm and the proposed method on different numbers of test mails

      | No. of test mails | Accuracy of KNN | Accuracy of proposed method |
      |---|---|---|
      | 30 | 0.4 | 0.4 |
      | 50 | 0.78 | 0.78 |
      | 80 | 0.6 | 0.6 |
      | 100 | 0.68 | 0.78 |

  5. CONCLUSION AND FUTURE WORKS

    This paper presents a content based mail classification method. The evaluation demonstrates that the proposed method gives better results than the KNN algorithm in the classification of spam mails. In future, this method will be extended so that it can detect attachment based mails, and the results of the algorithm will be compared on further real datasets.

REFERENCES

[1] G. Santhi, S. Maria Wenisch, Dr. P. Sengutuvan, "Fuzzy Rule based Novel Approach to Spam Filtering", International Journal of Computer Applications, Vol. 71(14), pp. 0975-8887, May 2013.

[2] H. Drucker, V. Vapnik, D. Wu, "Support vector machines for spam categorization", IEEE Transactions on Neural Networks, Vol. 10(5), pp. 1048-1054, 1999.

[3] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, P. Stamatopoulos, "Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory based approach", Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), pp. 1-13, 2000.

[4] Yong Hu, Ce Guo, E.W.T. Ngai, Mei Liu, Shifeng Chen, "A scalable intelligent non-content-based spam-filtering framework", Expert Systems with Applications, Vol. 37, pp. 8557-8565, 2010.

[5] M. Basavaraju, Dr. R. Prabhakar, "A Novel Method of Spam Mail Detection using Text Based Clustering Approach", International Journal of Computer Applications, Vol. 5(4), pp. 0975-8887, August 2010.

[6] Dr. Sonia, "Spam Filter: VSM based Intelligent Fuzzy Decision Maker", IJCST, Vol. 1(1), September 2010.

[7] Sunil B. Rathod, Tareek M. Pattewar, "Content based spam detection in email using Bayesian classifier", International Conference on Communications and Signal Processing (ICCSP), pp. 1257-1261, 2015.