Ceasing the Malicious Activities Based on Hadoop Technology

DOI : 10.17577/IJERTV2IS120640


Bhagwan Jadhav, Bhushan Chhapekar, Ravi Jadhav, Vishal Bangar, Prof. Pankaj Chandre.

B.E. Computer Engineering, Sharadchandra Pawar College of Engineering, Otur (Pune)

Abstract

Today a large number of people interact with online application systems, and their activities may be genuine or fraudulent; identifying which users are genuine is not straightforward. One of the largest and most popular of these applications is online purchasing. People no longer hesitate to make money transactions online, so new security measures must be put in place. There is always a countermeasure that threatens the security of such a system, for example fraudulent activity, and here clickstream analysis proves beneficial. Clickstream analysis works on users' clicking patterns. Logs of user activity are stored using Hadoop, which is effective for processing and analyzing big data and whose MapReduce feature categorizes data on the key values provided. This helps to identify normal and abnormal user behavior, which in turn helps to prevent malicious activities carried out on online applications. The technique can be used to avoid the losses such activities incur and to enhance security from a business perspective.

Keywords:

MapReduce, HDFS, Clickstream Data, Web Log Analytics, Fraud Users.

  1. Introduction

    With the advent of the internet, traditional activities are being digitized. This has made everything accessible with just a click. Computers, together with the internet, are the new gateway to the world: we have the privilege of having all the necessary things in one place. One of the inventions brought about by digitization is e-commerce. E-commerce covers transactions in which money is involved, and money can now be used virtually for almost everything we need. But with the great power of using money online comes the great responsibility of securing it.

    As the money came in, fraud followed, and it has become crucial to take effective measures to ensure the safety of the money involved. In an online application, fraud includes supplying invalid addresses, purchasing a commodity only to make it unavailable to other customers, and any other activity that hampers the application's performance. Many measures have been invented to secure such applications, but hackers are found to outsmart the developers each time. Clickstream analysis, in which clicks are used to determine user behavior, is a powerful new technique for restricting fraudulent activities. In clickstream analysis, the pattern of clicks is studied by generating logs for sessions of user activity. These logs are then used to differentiate between genuine users and fraudulent users, which helps alert the administrative authorities to malicious activity. Suspicious users are then made to go through further verification tests; if a user passes these tests the service is provided, otherwise the user is added to a black list, a list of fraud users. This list can later be used to avoid the losses incurred through these users.

  2. Existing System

    Existing Sybil defense schemes work by analyzing local community structure. SybilGuard exploits this property to bound the number of identities a malicious user can create, and its effectiveness has been shown analytically.

    1. Drawbacks:

      • A huge amount of distributed cache is required.

      • String searching is implemented with the Fuzzy K-Means algorithm, whose time complexity is high.

      • It has not yet been implemented for an online shopping system.

  3. Problem Statement

    Fraudulent activities involve disrupting the services of a particular online application. This includes providing invalid addresses, purchasing a commodity only to make it unavailable to other customers, and any other activity that hampers the application's performance. Many measures have been invented to secure such applications, but hackers are found to outsmart the developers each time. Inevitably, very large user lists are produced, so maintaining and accessing such lists and isolating the real users from the fraud users is inefficient for an administrator or database manager and is a time-consuming process.

  4. Proposed System

    Our system overcomes the drawbacks of the existing system. It offers advanced features for easily accessing, managing, isolating and storing the user lists, which is beneficial to administrators and service providers. This is made possible by clickstream analysis, a modern technique in which clicks are used to determine user behavior and thereby restrict fraudulent activities. In clickstream analysis, the pattern of clicks is analyzed by generating logs for sessions of user activity. These logs are then used to differentiate between genuine users and fraudulent users, which helps alert the administrative authorities to malicious activity. Suspicious users are then made to go through further verification tests; if a user passes these tests the service is provided, otherwise the user is added to a black list, a list of fraud users. The generated lists are finally provided to the administrator or product producer.
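    For illustration only, one session log entry produced by such clickstream tracking could be modeled as a simple Java record. The class and its field names below are assumptions made for this sketch, not part of the implemented system.

// Minimal sketch of one clickstream log entry (field names are illustrative assumptions).
public class ClickEvent {
    private final String userId;     // logged-in user or session identifier
    private final long timestamp;    // time of the click in milliseconds
    private final String page;       // page or tab the user clicked on
    private final String action;     // e.g. "view-product", "add-to-cart", "checkout"

    public ClickEvent(String userId, long timestamp, String page, String action) {
        this.userId = userId;
        this.timestamp = timestamp;
        this.page = page;
        this.action = action;
    }

    // Serialize to one tab-separated log line so it can be appended to a web log in HDFS.
    public String toLogLine() {
        return userId + "\t" + timestamp + "\t" + page + "\t" + action;
    }
}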

    1. Features:

      • Fraudulent activities can be decreased significantly.

      • Stores a large database and, at the same time, analyzes the data using the MapReduce algorithm.

      • Hadoop processes data quickly, which is very useful for real-time systems.

      • Clickstream analysis generates a large database, since a user can navigate anywhere through the web pages and for any length of time.

      • Provides very high detection accuracy on our click stream traces.

    2. Application:

      • Detects fraudulent activities through clickstream analysis so that effective measures can be taken to reduce the losses they incur.

      • Reduces cyber crime using Hadoop technology.

      • Provides security for online applications.

      • Identifies user behavior.

  5. System Requirement:

    1. Hardware Requirement.

    2. Software Requirement.

    1. Hardware Requirement:

      Table 1. Hardware Components

      System: Intel 2.40 GHz
      Hard Disk: 160 GB, expandable
      RAM: 4 GB
      Cache Memory: 4 MB

    2. Software Requirement:

    Table 2. Software Components

    Operating System: Windows XP, Windows 7, all Linux versions (with Apache Server version 2.2.6)
    Programming Environment: JDK, Hive, Flume
    Programming Language: Java, Pig Script
    Web Browser: Internet Explorer 6.0 or above

    1. JAVA

      Java is a general-purpose, concurrent, class-based, object-oriented programming language that is specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA), meaning that code that runs on one platform does not need to be recompiled to run on another. Java applications are typically compiled to bytecode (class files) that can run on any Java Virtual Machine (JVM) regardless of computer architecture.

      5.1.1. Features:

      • Object Oriented.

      • Platform Independent.

      • Portable.

      • Robustness.

        5.2. PIG Script:

        Pig is a high-level platform for creating Map Reduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java Map Reduce idiom into a notation which makes Map Reduce programming high level, similar to that of SQL for RDBMS systems. Pig Latin can be extended using UDF (User Defined Functions) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
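        To illustrate the UDF extension point, the following is a minimal sketch of an EvalFunc written in Java. The function name and its purpose (normalizing user names in web log records) are assumptions for this example rather than code from the system described here.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF that trims and lower-cases a user name so log records group consistently.
public class NormalizeUser extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null; // Pig propagates null for missing data
        }
        return ((String) input.get(0)).trim().toLowerCase();
    }
}

        After registering the containing jar with REGISTER, the function can be called directly from Pig Latin, for example inside a FOREACH ... GENERATE statement.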

  6. ALGORITHM

    1. Knuth-Morris-Pratt algorithm (KMP):

      The Knuth-Morris-Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

      1. Background:

        1. A string matching algorithm looks for the starting index m in string S[] that matches the search word W[].

        2. The most straightforward algorithm is to look for a character match at successive values of the index m, the position in the string being searched, i.e. S[m].

        3. If the index m reaches the end of the string then there is no match, in which case the search is said to "fail". At each position m the algorithm first checks for equality of the first character in the searched-for word, i.e. S[m] =? W[1].

        4. If a match is found, the algorithm tests the other characters in the searched-for word by checking successive values of the word position index i. The algorithm retrieves the character W[i] in the searched-for word and checks for equality of the expression S[m+i] =? W[i].

        5. If all successive characters of W match at position m, then a match is found at that position in the search string.

The algorithm requires extra space proportional to the pattern length, and its running time is proportional to the combined length of the pattern and the text.
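A compact Java version of this search is sketched below; the method name is our own, but the logic follows the standard KMP construction: a failure table built from the pattern, then a single left-to-right scan of the text that never re-examines characters that have already matched.

// Knuth-Morris-Pratt search: returns the index of the first occurrence of pattern in text, or -1.
public static int kmpSearch(String text, String pattern) {
    if (pattern.isEmpty()) return 0;
    // Failure table: fail[i] = length of the longest proper prefix of pattern
    // that is also a suffix of pattern.substring(0, i + 1).
    int[] fail = new int[pattern.length()];
    for (int i = 1, k = 0; i < pattern.length(); i++) {
        while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
        if (pattern.charAt(i) == pattern.charAt(k)) k++;
        fail[i] = k;
    }
    // Scan the text once; on a mismatch fall back using the failure table
    // instead of re-examining characters that already matched.
    for (int m = 0, k = 0; m < text.length(); m++) {
        while (k > 0 && text.charAt(m) != pattern.charAt(k)) k = fail[k - 1];
        if (text.charAt(m) == pattern.charAt(k)) k++;
        if (k == pattern.length()) return m - k + 1;   // match found at this starting index
    }
    return -1;                                         // the search "fails"
}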

  7. Technology Used

    7.1. HADOOP

      The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.

      7.2. Map/Reduce

      MapReduce provides a programming model that abstracts the problem away from disk reads and writes, transforming it into a computation over sets of keys and values. The approach taken by MapReduce may seem like a brute-force one, but it works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time.
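      As a minimal sketch of this model (assuming the hypothetical tab-separated log lines described in Section 4, not the exact job used in the experiments), a mapper/reducer pair that counts clicks per user could look like this; the two classes are shown together for brevity.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (userId, 1) for every web log line.
public class ClickCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text user = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");   // userId, timestamp, page, action
        if (fields.length > 0 && !fields[0].isEmpty()) {
            user.set(fields[0]);
            context.write(user, ONE);
        }
    }
}

// Reduce phase: sum the counts for each user key.
class ClickCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable v : values) total += v.get();
        context.write(key, new LongWritable(total));
    }
}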

      Fig.1. Hadoop Cluster

      In Hadoop there are two main components:

      • Hadoop Distributed File System (HDFS)

      • Map Reduce

    7.3. HDFS

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is part of the Apache Hadoop Core project.

HDFS follows a write-once, read-many access model. Its access speed is high, and it automatically maintains multiple copies of data and redeploys processing logic in the event of failure.

      1. HDFS Design Characteristics:

        HDFS is designed for:

        • Very large files.

        • Streaming data access.

        • Commodity hardware.

        It is not well suited for:

        • Low-latency data access.

        • Lots of small files.

        • Multiple writers and arbitrary file modifications.

      2. HDFS Cluster Node Types:

  1. Name Node (manages the file system namespace)

  2. Data Node (stores and serves the actual data blocks)

  3. Edge Node (communication link between clients and the cluster)
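For illustration, the write-once/read-many pattern described above can be exercised through the standard Hadoop FileSystem Java API; the HDFS path and the sample log line below are assumptions for this sketch.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WebLogStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path log = new Path("/weblogs/clicks.log");        // hypothetical HDFS path

        // Write once: create the log file and append one tab-separated click record.
        try (FSDataOutputStream out = fs.create(log, true)) {
            out.writeBytes("user42\t1386000000000\t/product/17\tadd-to-cart\n");
        }

        // Read many: stream the file back for analysis.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(log)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}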

Fig.2. Map/Reduce Process

      1. Map/Reduce Components:

        1. Job Tracker

        2. Task Tracker

      2. Map/Reduce operation:

  1. Splitting of Data.

  2. Sorting of Data.

  3. Merging of Data.

Fig.3. Map/Reduce Basic Operation.
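A minimal driver wiring the hypothetical mapper and reducer sketched in Section 7 into a job might look as follows (exact API calls vary slightly across Hadoop versions, and the input/output paths are placeholders). The splitting, sorting and merging steps above are then performed by the framework between the map and reduce phases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClickCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "click-count");
        job.setJarByClass(ClickCountJob.class);
        job.setMapperClass(ClickCountMapper.class);        // splits of the web log are mapped in parallel
        job.setReducerClass(ClickCountReducer.class);      // sorted, merged keys arrive at the reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("/weblogs"));       // hypothetical input directory
        FileOutputFormat.setOutputPath(job, new Path("/weblogs-out")); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}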

  8. Literature Survey

    Today, roughly 40-50% of the users interacting with online application systems do so fraudulently.

    Consider an online shopping system, for example. The number of fraudulent users grows rapidly, and they reserve more than one product with no specific intention of buying. It is therefore difficult to know which users in the resulting user list are real and which are fraudulent. A very large user list is produced, and maintaining and isolating its entries is a tedious, time-consuming process. Maintaining the huge volume of web log files and managing the blocks of memory they occupy requires the advanced features available in Hadoop technology.

    The overall survey of the papers shows that they use Hadoop technology for storing and accessing big data sets, but that this requires a large cache memory. Their clickstream analysis relies on a click-sequence model only. Large-scale data are distributed with the help of the map technique, but integrating and merging the different clusters afterwards is more complex.

  9. System Implementation

    1. Workflow of System

      Fig.4. Workflow of System.

      Many users interact with the online shopping system; they may be real or fraudulent. A user first visits the online shopping website, observes the products, selects specific products of interest and reads their specifications. If the user likes one of the products, he or she proceeds to purchase it and chooses a transaction method, for example cash on delivery. Every click on a tab or option automatically generates a web log pattern through clickstream analysis, from which the user's behavior can be identified. These patterns are analyzed using key values: a key value is simply a user name together with its parameters, which are mapped, compared and reduced to identify and isolate real and fraudulent user behavior. All of these web logs and the resulting files are stored in HDFS, which improves the speed of storing and retrieving them, while the Map/Reduce function improves the response time. Using Hadoop technology, all of this output data is provided to the administrator or product producer.
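      As a rough sketch of the final isolation step, the per-user aggregates coming out of the reduce phase could be partitioned into a black list with a simple threshold rule. The threshold value and the line layout below are assumptions made for illustration, not figures from this system.

import java.util.ArrayList;
import java.util.List;

public class UserIsolator {
    private static final long SUSPICIOUS_CLICKS_PER_SESSION = 200; // illustrative threshold only

    // Each input line: "<userId>\t<clickCount>", as produced by the reduce phase.
    public static List<String> buildBlackList(List<String> reducerOutput) {
        List<String> blackList = new ArrayList<>();
        for (String line : reducerOutput) {
            String[] fields = line.split("\t");
            long clicks = Long.parseLong(fields[1]);
            if (clicks > SUSPICIOUS_CLICKS_PER_SESSION) {
                blackList.add(fields[0]);      // flag the user for extra verification / blacklisting
            }
        }
        return blackList;
    }
}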

      Fig.5. Sequence Diagram.

  10. Conclusion

    • Fraudulent/Sybil user activity is detected through clickstream analysis controlled by the Hadoop framework.

    • User behavior is determined from clickstream patterns.

    • Huge volumes of web logs are easily managed, and real users and fraud users are identified.

    • Separate lists are made of real users and fraud users, and the fraud users are added to the black list.

    • The finally isolated data is provided to the administrator in a beneficial and time-saving manner.

  11. Future Scope

    • It can be extended to prepaid transactions in online shopping systems.

    • It can be used for hacking prevention.

    • Any online application system can use it to improve its security and performance.

  12. References

  1. Qi Chen, Cheng Liu, and Zhen Xiao, "Improving MapReduce Performance Using Smart Speculative Execution Strategy," IEEE, DOI 10.1109/TC.2013.15.

  2. Gang Wang, Tristan Konolige, Christo Wilson, Xiao Wang, Haitao Zheng, and Ben Y. Zhao, "You Are How You Click: Clickstream Analysis for Sybil Detection," 2013.

  3. Hongyong Yu and Deshuai Wang, "Mass Log Data Processing and Mining Based on Hadoop and Cloud Computing," The 7th International Conference on Computer Science and Education (ICCSE 2012), Melbourne, Australia, 2012.

  4. Natalia Danilova and David Stup, "Application of Natural Language Processing and Evidential Analysis to Web-Based Intelligence Information Acquisition," 2012.

  5. Jing Zhang, Gongqing Wu, Xuegang Hu, and Xindong Wu, "A Distributed Cache for Hadoop Distributed File System in Real-Time Cloud Services," The 13th ACM/IEEE International Conference on Grid Computing, 2012.

  6. Hadoop, http://hadoop.apache.org/, 2011.

  7. Wichian Premchaiswadi et al., "Extracting Weblog of Siam University for Learning User Behavior on Map-Reduce."

  8. S. Eknayake, J. Mitchell, Y. Sun, and J. Qiu, "Memcached Integration and Twister," http://salsahpc.org/CloudCom2010/EPoster/Cloudcom2010_submission_264.pdf.

  9. Randolph E. Bucklin and Catarina Sismeiro, "Click Here for Internet Insight: Advances in Clickstream Data Analysis in Marketing," Journal of Interactive Marketing 23 (2009).
