Email based Spam Detection

Abstract: Many people today rely on email and messages, including those sent by strangers. Because anybody can send an email or a message, spammers have a golden opportunity to write spam messages that target our different interests. Spam fills the inbox with numerous ridiculous emails, degrades internet speed to a great extent, and steals useful information such as the details in our contact lists. Identifying spammers and spam content is a hot research topic and a laborious task. Email spam is the sending of messages in bulk by mail. Since the expense of spam is borne mostly by the recipient, it is effectively postage-due advertising. Spam email is a form of commercial advertising that is economically viable because email is a very cost-effective medium for the sender. The proposed model classifies a given message as spam or not spam using Bayes' theorem and a Naive Bayes classifier, and it also detects the IP address of the sender.


I. INTRODUCTION
In recent years, the internet has become an integral part of life. With the increased use of the internet, the number of email users is growing day by day. This growth has created problems caused by unsolicited bulk email messages, commonly referred to as spam. Email has become one of the best channels for advertising, which is why spam emails are generated. Spam emails are emails that the receiver does not wish to receive; a large number of identical messages are sent to many recipients. Spam usually arises as a result of giving out an email address on an unauthorized or unscrupulous website. Spam has many effects: it fills the inbox with numerous ridiculous emails, degrades internet speed to a great extent, steals useful information such as the details in the contact list, and alters search results. Spam is a huge waste of everybody's time and can quickly become very frustrating for anyone who receives large amounts of it.

Identifying spammers and spam content is a laborious task. Even though an extensive number of studies have been done, the methods put forward so far still struggle to distinguish spam, and none of them demonstrates the contribution of each extracted feature type. Besides increasing network traffic and wasting a lot of memory space, spam messages are also used for attacks. Spam emails, also known as non-self, are unsolicited commercial or malicious emails sent to affect a single individual, a corporation or a group of people. Besides advertising, they may contain links to phishing or malware-hosting websites set up to steal confidential information. Various spam filtering techniques are used to solve this problem and to protect the mailbox from spam mails.
II. LITERATURE SURVEY
In paper [1], the authors highlight several features contained in the email header that can be used to identify and classify spam messages efficiently. These features are selected based on their performance in detecting spam messages. The paper also identifies the features common to Yahoo Mail, Gmail and Hotmail so that a generic spam detection mechanism can be proposed for all major email providers.
In paper [2], a new approach based on how frequently words are repeated is used. The key sentences of incoming emails, those containing the keywords, are tagged, and the grammatical roles of all the words in each sentence are determined. The results are then put together in a vector in order to measure the similarity between received emails. The K-means algorithm is used to classify the received emails, and vector determination is the method used to decide which category an email belongs to.
In paper [3], the authors describe cyber attacks. Phishers and malicious attackers frequently use email services to send false messages through which targeted users can lose money and social reputation. Such attacks allow attackers to obtain personal credentials such as credit card numbers, passwords and other confidential data. The authors use Bayesian classifiers, which consider every single word in the mail and constantly adapt to new forms of spam.
In paper [4], the proposed system attempts to use machine learning techniques to detect a pattern of repetitive keywords that are classified as spam. The system also proposes classifying emails based on various other parameters contained in their structure, such as Cc/Bcc, domain and header. Each parameter is treated as a feature when it is applied to the machine learning algorithm. The machine learning model is a pre-trained model with a feedback mechanism to distinguish between a proper output and an ambiguous output. This method provides an alternative architecture by which a spam filter can be implemented. The paper also takes into consideration the email body, with its commonly used keywords and punctuation.
In paper [5], the authors investigate the use of string matching algorithms for spam email detection. In particular, this work examines and compares the efficiency of six well-
III. PROPOSED SYSTEM
To solve the problem of spam, a spam classification system is created to distinguish spam from non-spam. Since spammers may send spam messages many times, it is difficult to identify them manually every time, so the proposed system uses several strategies to detect spam. The proposed solution not only identifies spam words but also identifies the IP address of the system from which the spam message was sent, so that the next time a message is sent from the same system, the proposed system directly identifies it as blacklisted based on the IP address.
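
The blacklisting behaviour described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name SenderBlacklist, the method check and the keyword_stub stand-in for the trained classifier are all hypothetical, and the IP addresses are documentation-range examples.

```python
import ipaddress
from typing import Callable, Set


class SenderBlacklist:
    """Hypothetical helper that remembers sender IPs which previously sent spam."""

    def __init__(self) -> None:
        self._blocked: Set[str] = set()

    @staticmethod
    def _normalize(ip: str) -> str:
        # ipaddress.ip_address raises ValueError for malformed addresses.
        return str(ipaddress.ip_address(ip.strip()))

    def check(self, ip: str, message: str, is_spam: Callable[[str], bool]) -> str:
        """Return 'blacklisted', 'spam' or 'ham' for a message from the given IP."""
        key = self._normalize(ip)
        if key in self._blocked:
            # Known spam source: skip classification entirely.
            return "blacklisted"
        if is_spam(message):
            # First spam from this source: remember the IP for next time.
            self._blocked.add(key)
            return "spam"
        return "ham"


def keyword_stub(message: str) -> bool:
    # Placeholder for the trained classifier (an assumption for this sketch).
    return "urgent" in message.lower()


blacklist = SenderBlacklist()
first = blacklist.check("203.0.113.7", "Urgent! Please call now", keyword_stub)
second = blacklist.check("203.0.113.7", "Hello again", keyword_stub)
third = blacklist.check("198.51.100.2", "Thanx", keyword_stub)
print(first, second, third)
```

Keeping the blacklist lookup ahead of the classifier means a repeat sender is rejected without the message text being examined again, which matches the "directly identifies it as blacklisted" behaviour described in the text.
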
In the proposed model, the web application is built using .NET and spam detection is done using machine learning. The web application consists of the following modules:

User Management
Registration: A first-time user must register on the website. Registration maintains a separate account for each user and is required before logging in.
Login: The user logs in to the main page with the registered name and password. After a successful login the authorized page is displayed; otherwise an error message is shown. Login is compulsory.

Compose
Input: The sender composes a new email, adding the address of the recipient, the subject and the message. Output: The email is sent to the address specified by the sender.

Inbox
This page stores all the mails received by the user, listed in order of date. Input: The inbox page accepts all incoming emails sent to the user. Output: The receiver can open and read the emails sent to their address.

Sent
This folder stores all the mails sent by the user. Input: The sender composes an email and sends it to the recipient. Output: The sent email can be read.

Trash
This folder stores all the mails deleted by the user.
Input: The user selects and deletes unwanted emails. Output: All the deleted emails are added to the trash bin.

Voice Message
Input: The sender sends the email as a text message. Output: The email is read out to the receiver as a voice note.

Offline notification
Input: The sender sends an email. Output: The receiver gets an offline notification in text format as an SMS.

Delete For everyone
Input: The sender deletes an email that they have sent. Output: The email is erased for both the sender and the receiver.

Read Message
Input: The receiver reads the email. Output: The sender gets a notification stating that the receiver has read the message.

When a message is received in the inbox, it is exported to the dataset. The message is then classified as spam or not spam using the Naive Bayes classifier.
Before a received message can be classified as spam or not, the model has to be trained, as explained below.

The dataset has many fields, and some of its columns are not required. These columns are removed and the remaining columns are renamed.

The proposed model uses the following packages: NLTK (Natural Language Toolkit) for text processing; Matplotlib for plotting graphs, histograms and bar plots; Word Cloud for presenting text data; pandas for data manipulation and analysis; and NumPy for mathematical and scientific operations.

Whenever a message arrives, the input is first preprocessed. All input characters are converted to lowercase. The text is then split into small pieces and the punctuation is removed; this tokenization step both splits the messages and removes the punctuation.
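
The column cleanup and preprocessing steps might look roughly like the sketch below. The column names (v1, v2, Unnamed: 2, label, message) and the two-row frame are assumptions made for illustration; the paper does not list the dataset's actual fields. Tokenization is done here with plain string methods as a stand-in for the NLTK tooling the paper mentions.

```python
import string
from typing import List

import pandas as pd

# Tiny stand-in for the exported dataset; the column names are assumptions.
raw = pd.DataFrame(
    {
        "v1": ["spam", "ham"],
        "v2": ["Urgent! Please call 09062703810", "Thanx"],
        "Unnamed: 2": [None, None],
    }
)

# Remove the column that is not required and give the rest descriptive names.
data = raw.drop(columns=["Unnamed: 2"]).rename(
    columns={"v1": "label", "v2": "message"}
)

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> List[str]:
    """Lowercase the text, strip punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


data["tokens"] = data["message"].apply(preprocess)
print(data[["label", "tokens"]])
```
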

The Porter stemming algorithm is used for stemming.
Stemming is the process of reducing words to their root form.
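
As a small illustration of this step, the sketch below applies NLTK's PorterStemmer to a few tokens. The token list is invented for the example, and the snippet assumes the nltk package is installed.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Example tokens chosen for illustration; they are not taken from the dataset.
tokens = ["calls", "called", "running"]

# Each inflected form is reduced to a shared stem.
stems = [stemmer.stem(token) for token in tokens]
print(stems)
```

Because different inflections collapse to one stem, the classifier counts them as the same word when estimating word probabilities.
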
Finally, we find the probability of each word in the spam and ham messages.
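
A minimal sketch of how these word probabilities can feed a Naive Bayes decision is given below. It scores each class as log P(class) plus the sum of log P(word | class) and picks the larger score. The training messages are a toy set invented for illustration (two of them reuse the example messages from the results section), and the add-one (Laplace) smoothing is an assumption; the paper does not state how unseen words are handled.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Sample = Tuple[List[str], str]

# Toy training data for illustration only; not the paper's dataset.
TRAIN: List[Sample] = [
    (["urgent", "please", "call", "09062703810"], "spam"),
    (["win", "cash", "call", "now"], "spam"),
    (["thanx"], "ham"),
    (["see", "you", "at", "lunch", "thanx"], "ham"),
]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed choice)."""

    def __init__(self, samples: List[Sample]) -> None:
        self.label_counts = Counter(label for _, label in samples)
        self.word_counts: Dict[str, Counter] = {
            label: Counter() for label in self.label_counts
        }
        for tokens, label in samples:
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total = sum(self.label_counts.values())

    def word_prob(self, word: str, label: str) -> float:
        """Smoothed estimate of P(word | label)."""
        counts = self.word_counts[label]
        return (counts[word] + 1) / (sum(counts.values()) + len(self.vocab))

    def classify(self, tokens: List[str]) -> str:
        """Return the label with the highest log posterior score."""
        scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / self.total)  # log prior
            score += sum(math.log(self.word_prob(t, label)) for t in tokens)
            scores[label] = score
        return max(scores, key=scores.get)


model = NaiveBayes(TRAIN)
print(model.classify(["urgent", "call"]), model.classify(["thanx"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together for longer messages.
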

V. RESULTS AND DISCUSSIONS
When a message is received in the inbox, it is exported to the dataset. The exported message is then classified as spam or ham using Bayes' theorem and the Naive Bayes classifier, following all the steps discussed above, including finding the probability of each word in spam and ham messages. The figures show messages that were detected as spam and as ham. If "Urgent! Please call 09062703810" is exported from the inbox to the dataset, it is detected as spam based on the trained dataset. If "Thanx" is exported from the inbox to the dataset, it is detected as ham. The IP address of the sender can also be detected.

Fig. 11. IP address of the sender

VI. CONCLUSION
Email is now one of the most important media of communication; through internet connectivity, any message can be delivered all over the world. More than 270 billion emails are exchanged daily, and about 57% of these are spam. Spam emails, also known as non-self, are undesired commercial or malicious emails that compromise personal information, such as banking and other financial details, or otherwise cause harm to a single individual, a corporation or a group of people. Besides advertising, they may contain links to phishing or malware-hosting websites set up to steal confidential information. Spam is a serious issue that is not just annoying to end users but also financially damaging and a security risk.
Hence, this system is designed to detect unsolicited and unwanted emails and block them, helping to reduce spam, which would be of great benefit to individuals as well as to companies. In the future, this system can be implemented using different algorithms, and more features can be added to the existing system.