A Review Analysis of Preprocessing Techniques in Web usage Mining

Mr. Jitendra B. Upadhyay; Dr. S. V. Patel

doi:10.17577/IJERTV4IS041348

Volume 04, Issue 04 (April 2015)


Open Access
Authors : Mr. Jitendra B. Upadhyay, Dr. S. V. Patel
Paper ID : IJERTV4IS041348
Volume & Issue : Volume 04, Issue 04 (April 2015)
DOI : http://dx.doi.org/10.17577/IJERTV4IS041348
Published (First Online): 28-04-2015
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Mr. Jitendra B. Upadhyay Assistant Professor

Shrimad Rajchandra Institute of Management and Computer Application, UTU

Bardoli, India

Dr. S. V. Patel Professor

Department of Computer Science, Veer Narmad South Gujarat University Surat, India

Abstract: The Web has become a popular tool for meeting people's information needs. As the number of Web users grows day by day, there is a need to analyze user behavior in order to monitor and improve the performance and throughput of websites. Web usage mining is a data mining application that works on web log files to extract useful information. It has three phases: data preprocessing, pattern discovery and pattern analysis. Data preprocessing is the most crucial of these, because without good-quality data it is difficult to identify patterns in user behavior. This paper reviews data preprocessing methods, namely data collection, data cleaning, user identification, session identification and path completion, to help the community select one technique or a combination of techniques for efficient preprocessing and a reliable data mining outcome.

Keywords: Preprocessing, Web Usage Mining, Web Logs

INTRODUCTION

Nowadays, the World Wide Web is the main source of information on almost any subject, and it continues to grow rapidly as a gateway to information. Because web pages hold a huge amount of structured, semi-structured, unstructured, distributed, static and dynamic data, it is difficult to access relevant information quickly. Web mining has emerged as one of the most effective techniques for overcoming this problem [5]. Web mining is the application of data mining techniques to discover interesting and potentially useful patterns and implicit information from the Web. It applies data mining, artificial intelligence and related techniques to web data to extract users' usage patterns [8]. Based on the objects being mined, knowledge discovery on websites falls into three main categories: Web content mining, Web structure mining and Web usage mining.
1. Web content mining is the process of extracting knowledge by scanning and mining the text, pictures, graphs, etc. of web pages to identify relevant content.
2. Web structure mining is the process of inferring knowledge about the relationships between web pages from the structure of the code, the links between references and the organization of the World Wide Web.
3. Web usage mining is the process of extracting useful and interesting patterns from log files, which show the usage behavior of users. It applies data mining techniques to web data to identify users' usage or behavior patterns, which can be used to personalize the Web, to enhance the quality of commerce services, or to improve the web structure and web server performance [10]. Web usage mining consists of three major tasks: data preprocessing, pattern discovery and pattern analysis.
[Figure [1]: Types of Web Mining]
The mining process requires data, which is collected from log files. A log file is where the requests made by users on a website are recorded. Log files come from three different sources [1]: the web server, the proxy server and the browser. Because a log file records every user visit and every single click on the website, it holds a huge amount of data, some of it irrelevant, so the original log file cannot be used directly for pattern discovery. After data fusion, data cleaning, user identification, session identification and path completion, the information in the web log can be used as a transaction file for pattern discovery. Pattern discovery extracts user action patterns using data mining algorithms, statistics, machine learning or pattern recognition [18]. After preprocessing and pattern discovery, the resulting usage patterns are analyzed to filter out uninteresting information and extract the useful information.

Among these phases, data preprocessing is the most crucial, because without good-quality data it is difficult to identify patterns in user behavior. This paper reviews data preprocessing methods, namely data collection, data cleaning, user identification, session identification and path completion, to help the community select one technique or a combination of techniques for efficient preprocessing and a reliable data mining outcome.

The rest of the paper is organized as follows. Section 2 describes data preprocessing and its techniques. Section 3 reviews the literature on related work. Section 4 compares the preprocessing techniques used by researchers, and the conclusion is given in Section 5.
DATA PREPROCESSING

The three main stages of web usage mining are data preprocessing, pattern discovery and pattern analysis. Data preprocessing is an essential stage, which can be carried out through data collection, data cleaning, user identification, session identification, path completion, transaction identification and formatting [5]. Because a log file contains a large bulk of data, preprocessing reformats its entries into a form that can be used directly by the subsequent steps of the log analyzer [7].
[Figure [2] : Data Preprocessing Techniques]
1. Data Collection
  
  Web servers typically record a huge number of entries and may create separate log files per day or per month. At the start of data preprocessing, the records of all log files are therefore gathered into a single log file [17]. Figure 3 shows sample log file data.
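The merging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any cited paper: the function name `merge_logs`, the `*.log` file pattern, and the idea that file names sort chronologically are all hypothetical choices made for the example.

```python
from pathlib import Path


def merge_logs(log_dir, out_path, pattern="*.log"):
    """Concatenate every log file in log_dir into out_path.

    Assumptions (not from the paper): files match `pattern`, their names
    sort in chronological order, and lines starting with '#' are header
    directives that can be dropped. Returns the number of records written.
    """
    log_dir = Path(log_dir)
    out_path = Path(out_path)
    written = 0
    with out_path.open("w", encoding="utf-8") as out:
        for path in sorted(log_dir.glob(pattern)):
            # Never read the output file back into itself.
            if path.resolve() == out_path.resolve():
                continue
            with path.open(encoding="utf-8", errors="replace") as src:
                for line in src:
                    if line.startswith("#") or not line.strip():
                        continue  # skip header directives and blank lines
                    out.write(line.rstrip("\r\n") + "\n")
                    written += 1
    return written
```

Calling `merge_logs("logs", "logs/merged.log")` on a folder of daily files would produce one combined file for the later preprocessing steps.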
2. Web Logs
  
  When a user requests a particular page on a website, an entry is written to a file on the server called the log file. It is considered a reliable source for predicting user behavior. There are three main data sources for log files: the web server, the proxy server and the client browser.
  1. The web server is the most common source of data. There are four types of server logs [3]:
    1. Access log file
    2. Error log file
    3. Agent log file
    4. Referrer log file
      
      The access log file records all the information that the server provides to the client. The error log file contains a list of server errors. The agent log file provides information about the user's browser, its version and the operating system. The referrer log file allows websites and web servers to identify where visitors are coming from, for promotional or security purposes. Of these, the first two are the most common and the most useful for extracting the required information about user behavior. The server log does not record visits to cached pages, because cached pages are served from the local storage of the browser or the proxy server. Different web servers use different log file formats, such as Common Log Format, IIS standard/extended log file format, Combined/Extended Common Log Format and Log Markup Language. Figure 3 shows a sample server log file.
      [Figure [3] : Server Log file Sample from srimca.edu.in website]
      These server log files contain fields such as date, time, remote server name and IP address, request method, URL requested by the client, server port number, client IP address, user agent and status codes.
  2. Proxy Server Log: A proxy server acts as an intermediary between the client and the web servers. When the web server receives a client's request through a proxy server, the client's request entries are maintained in a separate log file at the proxy server.
    [Figure of Proxy Server]
  3. Client/Browser Log: This type of log can be made to reside in the client's browser itself. Client-side logs are useful for handling web page caching and session reconstruction problems. HTTP cookies can also be used for this purpose.
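To make the log fields concrete, the following Python sketch parses one entry in the Common/Combined Log Format mentioned above. It is an assumption-laden illustration: the regular expression `CLF` and the function `parse_line` are hypothetical names, and the IIS-style sample of Figure 3 uses a different field layout that this sketch does not cover.

```python
import re
from datetime import datetime

# Common Log Format, with the two optional Combined-format fields
# (referrer and user agent) at the end.
CLF = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec
```

Each parsed record exposes the fields that later steps rely on: client IP address, timestamp, requested URL, status code, referrer and user agent.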
3. Data Cleaning
  
  Data cleaning is the process of removing irrelevant information, fields and records that are not required for mining. Figure 1 shows the steps for data cleaning. Analyzing a huge amount of data that may be irrelevant is cumbersome, so cleaning is necessary at the initial stage. When a user requests a page, embedded image files such as .gif and .JPEG are downloaded as well; these entries are not useful for further analysis and are eliminated. Unsuccessful HTTP requests also need to be removed from the log files, as do entries generated by automated programs such as web robots, spiders and crawlers.
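The cleaning rules above can be expressed as a simple record filter. The Python sketch below is illustrative only: the record layout (a dict with `url`, `status` and `agent` keys), the extension list beyond .gif and .jpeg, the user-agent keywords and the choice to keep only 2xx responses are assumptions made for the example rather than rules taken from the reviewed papers.

```python
from urllib.parse import urlsplit

# Assumed lists: only .gif and .jpeg come from the text, the rest are typical.
IGNORED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
ROBOT_KEYWORDS = ("bot", "crawler", "spider")


def is_relevant(rec):
    """Return True if a parsed log record should be kept for mining."""
    path = urlsplit(rec["url"]).path.lower()
    if path.endswith(IGNORED_EXTENSIONS):
        return False  # embedded resource, not a page view
    if not 200 <= rec["status"] < 300:
        return False  # unsuccessful HTTP request (assumed: keep 2xx only)
    if path == "/robots.txt":
        return False  # request typical of a web robot
    agent = (rec.get("agent") or "").lower()
    if any(word in agent for word in ROBOT_KEYWORDS):
        return False  # self-declared robot, spider or crawler
    return True


def clean(records):
    """Filter a list of parsed records down to the relevant page views."""
    return [rec for rec in records if is_relevant(rec)]
```

Keyword matching on the user agent only catches robots that identify themselves; the reviewed papers also describe other robot-cleaning heuristics.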
4. User Identification
  
  User identification means identifying individual users by observing their IP addresses. The following rules are used to identify unique users: 1) a new IP address indicates a new client; 2) if the IP address is the same but the operating system or browsing software differs, it is reasonable to assume that each different agent type for an IP address represents a different client; 3) if the IP address, operating system and browser are all the same, the request is attributed to a new client when the requested page cannot be reached from the pages already accessed, according to the topology of the site.
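The three rules can be sketched as follows in Python. This is a minimal illustration, assuming each record is a dict with `ip`, `agent` and `url` keys and that the site topology is available as a hypothetical `links` mapping from a page to the set of pages it links to; the function name `identify_users` is also an assumption.

```python
from collections import defaultdict


def identify_users(records, links=None):
    """Assign a user id to each record (records are assumed time-ordered).

    Rules 1 and 2: a new (IP, agent) pair means a new user.
    Rule 3 (only if `links` is given): for a known (IP, agent) pair, a page
    that is not reachable from any page that user already visited is
    attributed to a new user.
    """
    candidates = defaultdict(list)  # (ip, agent) -> [(user_id, visited pages)]
    user_ids = []
    next_id = 1
    for rec in records:
        key = (rec["ip"], rec["agent"])
        page = rec["url"]
        match = None
        for uid, visited in candidates[key]:
            reachable = (
                links is None
                or page in visited
                or any(page in links.get(p, ()) for p in visited)
            )
            if reachable:
                match = (uid, visited)
                break
        if match is None:
            match = (next_id, set())
            candidates[key].append(match)
            next_id += 1
        match[1].add(page)
        user_ids.append(match[0])
    return user_ids
```

Without a topology (`links=None`) the sketch falls back to rules 1 and 2 only, so every request with the same IP address and agent is treated as the same user.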
5. Session Identification
  
  A session is the period of time a user spends on the web pages of a site. Sessions can be identified by noting the user IDs of those who visited the site and traversed its links.
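One common way to turn this into a procedure is a time-out heuristic, such as the 30-minute expiration time used by several of the papers reviewed below. The Python sketch assumes requests are given as `(user_id, timestamp, url)` tuples and that a gap longer than the time-out between two consecutive requests of the same user starts a new session; both the tuple layout and the gap-based interpretation are assumptions for illustration.

```python
from datetime import timedelta


def split_sessions(requests, timeout=timedelta(minutes=30)):
    """Group each user's requests into sessions.

    requests: iterable of (user_id, timestamp, url) tuples.
    Returns {user_id: [[url, ...], ...]}, one inner list per session.
    """
    sessions = {}
    last_seen = {}
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        # First request of a user, or inactivity longer than the time-out.
        if user not in sessions or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions
```

A fixed time-out is easy to apply but arbitrary; some of the reviewed work replaces it with a threshold derived from the average time spent on pages.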
6. Path Completion
After session identification, each session must be checked for pages that are missing from the user's traversal. Path completion is the process of adding important page accesses that are missing from the access log because of browser and proxy server caching [13]. After path completion, the user session file contains paths made up of page references, including repeated page accesses made by a user.
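A referrer-based heuristic is one way to sketch path completion. The Python example below assumes each session is a time-ordered list of `(url, referrer)` pairs; when a request's referrer is not the previously visited page but appears earlier in the path, the sketch assumes the user went back through cached pages and re-inserts them. The function name and data layout are hypothetical, and real implementations may use the site topology instead.

```python
def complete_path(session):
    """Re-insert cached page views that are missing from one session.

    session: list of (url, referrer) pairs in time order.
    Returns the completed list of page URLs.
    """
    path = []
    for url, referrer in session:
        if path and referrer and referrer != path[-1] and referrer in path:
            # Index of the most recent occurrence of the referrer.
            idx = len(path) - 1 - path[::-1].index(referrer)
            # Pages the user stepped back through, most recent first.
            path.extend(path[idx:len(path) - 1][::-1])
        path.append(url)
    return path
```

For a logged sequence A, B, C, D in which D was reached from A, the completed path is A, B, C, B, A, D: the two backward steps were served from cache and never reached the server log.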
LITERATURE REVIEW

Sanjay Bapu Thakare et al. [2] described effective and complete preprocessing of the access stream, to be carried out before the actual mining process. They suggested an improvised merging algorithm for merging log files and implemented field extraction in core Java. They also suggested an improvised TransLog algorithm for converting the log file into a database. They did not discuss user identification or the later processes of web usage mining. Their implementation used the IIS log file format.

Amit Dipchandji Kasliwal et al. [9] proposed a web usage mining method for predicting access behavior using the well-known RapidMiner tool. They took a log file from the KDD repository and applied web usage mining techniques to it. Using RapidMiner they carried out data preprocessing to obtain an ARFF file, and then applied association rules to the ARFF file using MATLAB to identify results on web page visits.

Navin Tyagi et al. [21] surveyed data preprocessing activities such as data cleaning and data reduction, together with related algorithms. They presented algorithms for data cleaning and data reduction based on the CERN (Common Log Format) log file format. They did not cover the further preprocessing procedures.

K. R. Suneetha et al. [23] focused on data preprocessing techniques including web log structure, data cleaning and user identification. They used data from NASA web server log files. To improve efficiency, they removed unnecessary data from the web log files and analyzed the result, generating various reports. They also analyzed the system errors found most often during visits to the website. They did not apply any data mining algorithms for pattern discovery.

Jaideep Srivastava et al. [22] provide a detailed taxonomy of work in the area of web mining, covering research as well as commercial offerings, together with an up-to-date survey of existing web usage mining research projects and products. They describe applications of web usage mining such as personalization, system improvement, site modification, business intelligence and usage characterization. In addition, they give an overview of the WebSIFT system, which is designed to perform web usage mining from server logs in the extended NCSA format.

Chitraa et al. [6] proposed a new technique for identifying sessions for the extraction of user patterns. Their experimental results show that the proposed session identification technique constructs sessions accurately. In their method, a matrix is constructed from which sessions are identified using MATLAB. They proposed a session construction algorithm based on browsing time. Their main focus is preprocessing web data through data cleaning, user identification and session identification.

Sheetal A. Raiyani et al. [25] introduced the DUI (Distinct User Identification) technique, based on IP address, agent and referred pages within a desired session time. They discussed all preprocessing methods, including web log formats. They used eight months of R K University's library web log, from Jan 1, 2012 to Aug 3, 2012, and implemented the preprocessing techniques on it. As a result, they obtained the number of distinct users for each month of data.

Vellingiri J. et al. [26] focus on techniques for better data cleaning and transaction identification from the web log. Their data preprocessing methods include data cleaning by removing unnecessary data and robot entries; user identification using reference length, where reference length is the time taken by the user to view a particular page; session identification; path completion; and transaction identification using reference length. They concentrated on two algorithms, Maximal Forward References (MFR) and Reference Length (RL). Using these two algorithms, the authors determine only the relevant log entries that the user is interested in.
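For readers unfamiliar with maximal forward references, the idea can be sketched in Python. The sketch follows the way MFR is commonly described in the web mining literature: a revisit to a page already on the current path counts as a backward move, and the forward path that ends there is emitted as one transaction. It is not claimed to be the exact variant used by the cited authors, and the function name is an assumption.

```python
def maximal_forward_references(pages):
    """Split one session's page sequence into maximal forward references."""
    path = []        # current forward path
    result = []      # emitted maximal forward references
    forward = False  # True if the previous move extended the path
    for page in pages:
        if page in path:
            # Backward reference: emit the path if we were moving forward,
            # then truncate the path back to the revisited page.
            if forward:
                result.append(list(path))
            del path[path.index(page) + 1:]
            forward = False
        else:
            path.append(page)
            forward = True
    if forward and path:
        result.append(list(path))
    return result
```

For the traversal A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V this yields ABCD, ABEGH, ABEGW, AOU and AOV, so pages visited only while backtracking do not form transactions of their own.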

G. T. Raju et al. [27] proposed a complete preprocessing methodology that allows the analyst to transform any collection of web server log files into a structured collection of tables in a relational database model. They compared the preprocessing techniques used by researchers, including data source, data cleaning, and data formatting and structuring. Their experimental results include the reduction in log file size after preprocessing, day-wise unique visitors and user session identification.

P. Nithya et al. [28] proposed a novel preprocessing technique that removes local and global noise and web robots. Their data cleaning phase helps determine only the relevant log entries that the user is interested in. They used the Anonymous Microsoft Web Dataset and MSNBC.com log files. They did not mention other preprocessing methods such as user identification, session identification and path completion.

Vellingiri J. et al. [7] described three phases of web usage mining for user navigation discovery: preprocessing, identification of user behavior and classification of user behavior. In the preprocessing phase, data cleaning includes the removal of graphics, video, status code entries and robot entries. In the second phase, a set of clusters of similar data items is built from user behavior and navigation patterns using Weighted Fuzzy-Possibilistic C-Means (WFPCM), for use in pattern discovery. In the third phase, user behavior is classified for analysis using the Adaptive Neuro-Fuzzy Inference System with Subtractive Algorithm (ANFIS-SA).

Ashwin Raiyani et al. [8] focused on a complete preprocessing approach comprising data cleaning and user and session identification to improve the quality of data. They introduced the DUI (Distinct User Identification) algorithm, based on IP address, agent, session time and referred pages within a desired session time. They did not use pattern discovery or pattern analysis methods.

Renáta Iváncsy et al. [10] provided a novel approach that uses a complex cookie-based method to identify web users. They developed an implementation called the Web Activity Tracking (WAT) system, which aims at a more precise distinction of web users based on log data. Using WAT, they presented several useful results.

Arvind Dangi et al. [11] proposed a new three-phase method for web data preprocessing. In the first phase, some websites are selected and accessed from different locations. Java tools and methods are then applied to find the IP addresses, session usage time and navigation for those websites. In the final phase, these are combined into a framework that may help investigate the usage behavior of web users.

S. Prince Mary et al. [12] describe the preprocessing methods and steps involved in retrieving the required information effectively. They cover data collection; data cleaning through removal of local and global noise, graphics records, HTTP failure status records, etc.; and user and session identification with path completion.

Shaily Langhnoja et al. [17] give a detailed description of how preprocessing is done on a web log file before it is passed to the next stages of web usage mining. They applied algorithms for data cleaning and for user and session identification to web log files and showed the results.

Zidrina Pabarskaite [19] presented two new techniques for enhancing web log mining. The first is a novel framework for advanced web log data cleaning, and the second is visualization of data mining results.

C. E. Dinuca [24] proposes a new method for identifying sessions based on the average time spent visiting web pages, because the use of fixed values causes errors in identifying sessions. The method was implemented in Java using the NetBeans IDE, with two algorithms for identifying sessions: one uses a 30-minute limit to indicate the end of a session, and the other uses the average time users spend on a page. The results show that the new approach does not change the complexity of the classic algorithm.

R. Suguna et al. [29] discuss the basics of web log preprocessing, existing preprocessing techniques, the proposed User Interest Level based Preprocessing (UILP) algorithm, and the performance of UILP compared with existing algorithms for identifying user interest level. They treat session and frequency values as the key to identifying user interest level.

SUMMARY AND ANALYSIS OF LITERATURE REVIEW


Authors	Preprocessing Techniques					Discovery of Pattern	Analysis of Pattern	Remarks
Authors	Data Fusion	Data Cleaning	User Identification	Session Identification	Path Completion	Discovery of Pattern	Analysis of Pattern	Remarks
Sanjay Bapu Thakare et al. [2]	Yes [Improvised Merging Algorithm]	Yes	NA	NA	NA	NA	NA	Implemented field extraction in Java source code and converted the log file into a database using the Improvised TransLog algorithm
Amit Dipchandji Kasliwal et al. [9]	NA	Yes	Only discussion	30 min expiration time for user session	NA	Association Rule	NA	Used MATLAB 7.0 for data cleaning and filtering; implemented association rules using the RapidMiner tool; used a log file from the KDD repository; modeled the pages users frequently accessed during a specific period
Navin Tyagi et al. [21]	NA	Yes [Algorithm]	NA	NA	NA	NA	NA	Gave data cleaning and data reduction algorithms based on the CERN (Common Log Format) format; other preprocessing techniques missing
K. R. Suneetha et al. [23]	NA	Yes	Yes	NA	NA	NA	NA	Implemented data cleaning and user identification on a NASA web server log file and displayed results for improving website performance
Jaideep Srivastava et al. [22]	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Used the WebSIFT system to perform WUM from server logs in the extended NCSA format (includes referrer and agent fields)
V. Chitraa et al. [6]	Yes	Yes [Including robot cleaning]	Yes [using Reference Length]	Yes [using Reference Length]	Yes [using Reference Length]	NA	NA	Provided results of data cleaning and path completion after experiment
Sheetal A. Raiyani et al. [25]	NA	Yes	Yes [Distinct User Identification]					Described standard user identification and its problems, and proposed the Distinct User Identification algorithm
Vellingiri J. et al. [26]	NA	Yes [Including robot cleaning]	Yes [using Reference Length]	Yes [using Reference Length]	Yes [using Maximal Forward Reference]	NA	NA	Computed reference length and applied the maximal forward reference algorithm to carry out path completion
G. T. Raju et al. [27]	Yes [Algorithm with pseudo code]	Yes	Yes	Yes [Pseudo code]	Yes	NA	NA	Implemented all pseudo code on NASA and academic website log files
P. Nithya et al. [28]	Yes	Yes [Including local and global noise removal, robot cleaning]						Implemented on the Microsoft Web Dataset and MSNBC.com website log files
Vellingiri J. et al. [7]	Yes	Yes [Including local and global noise removal, robot cleaning]	NA	NA	NA	Yes [Weighted Fuzzy-Possibilistic C-Means Algorithm]	Yes [ANFIS with Subtractive Algorithm]	Implemented data cleaning in preprocessing, including robot cleaning; applied the Weighted Fuzzy-Possibilistic C-Means algorithm for pattern discovery and the Adaptive Neuro-Fuzzy Inference System with Subtractive Algorithm for pattern analysis; included results of pattern discovery
Ashwin Raiyani et al. [8]	NA	Yes	Yes [Distinct User Identification Algorithm]	NA	NA	NA	NA	Introduced the DUI (Distinct User Identification) technique based on IP address, agent, session time and referred pages within a desired session time
Renáta Iváncsy et al. [10]	Yes	Yes	Yes [Cookie-based user identification]	NA	NA	NA	NA	Developed the Web Activity Tracking (WAT) system, which distinguishes web users based on log data; focuses more on cookie-based analysis; implemented and showed results
Arvind Dangi et al. [11]	NA	Yes	Yes [Using Java coding]	Yes [Using Java coding]	NA	NA	NA	Provided a framework to identify users and sessions, with each step of data preprocessing
S. Prince Mary et al. [12]	Yes	Yes [Common Algorithm]	Yes [Based on IP and Agent]	Yes [Time-oriented steps]	Yes [Using Graph Model]	NA	NA	Applied data preprocessing algorithms to web logs and showed results of user and session identification
Shaily Langhnoja et al. [17]	Yes	Yes [Common Algorithm]	Yes [Common Algorithm]	Yes [Common Algorithm]	NA	NA	NA	Applied algorithms for data cleaning and user and session identification to web log files and showed results
Zidrina Pabarskaite [19]	NA	Yes	Yes	NA	NA	NA	NA	Introduced novel advanced cleaning that improves web log mining results; improved filtering removes pages with no links from other pages
C. E. Dinuca [24]	NA	Yes	NA	Yes [Modified basic algorithm using mean value of sessions]	NA	NA	NA	Implemented two session identification algorithms in Java using the NetBeans IDE and showed results
R. Suguna et al. [29]	Yes	Yes [UILP algorithm]	Yes [UILP algorithm]	Yes [UILP algorithm]	Yes [UILP algorithm]	NA	NA	Provided the User Interest Level Processing algorithm to improve the quality of preprocessing

[Table [1]: Summary of Literature Review of Preprocessing Techniques]

Analysis and Issues: The review shows that many common data preprocessing techniques are applied across various types of log files. Some authors used common cleaning techniques such as removing graphics records and HTTP failure status records. Only three or four authors included robot cleaning in the data cleaning step, which helps when log files are collected from a proxy server or when a proxy server sits between server and client. To identify users, researchers used different methods: based on relevant data and inferred IP address relationships, based on cookies, based on reference length, etc. Researchers implemented different algorithms for identifying sessions, based on a 30-minute expiration time, the average time spent by users, etc.

Further, it is found that there is no algorithm focusing on multiple clients accessing website through the same IP address. It can be considered an area of further research.

CONCLUSION

A web server holds a huge amount of data in many fields, and this data does not directly reflect the pages users visit during their interaction with the website. Because web logs contain data irrelevant to users' visiting activity, users and sessions cannot be identified reliably from the raw logs. Web log files therefore need to be preprocessed before mining processes such as discovering user behavior patterns. Data preprocessing is a required phase of the web usage mining process. Various heuristic algorithms and methods are used to remove irrelevant data from web log files and to identify users and sessions. This paper has reviewed the data preprocessing methods applied by researchers, with their advantages and disadvantages. Once data preprocessing is complete, data mining algorithms such as clustering, association and classification can be applied for web usage mining to identify the behavior of web users.

REFERENCES:

[1] L. K. Joshila Grace, V. Maheshwari, Dhinaharan Nagamalai, "Analysis of Web Logs and Web User in Web Mining", International Journal of Network Security & Its Applications (IJNSA), Vol. 3, No. 1, January 2011
[2] Sanjay Bapu Thakare, Prof. Sangram Z Gawali, "A Effective and Complete Preprocessing for Web Usage Mining", International Journal on Computer Science and Engineering, Vol. 02, No. 03, pp. 848-851, 2010
[3] Neha Jagannath, Sidharth Samat, "A Survey on Cleaning of Web Pages Before Web Mining", International Journal of Innovations & Advancement in Computer Science, ISSN 2347-8616, Vol. 3, Issue 8, October 2014
[4] Vijayashri Losarwar, Dr. Madhuri Joshi, "Data Preprocessing in Web Usage Mining", International Conference on Artificial Intelligence and Embedded Systems, July 15-16, 2012, Singapore
[5] Mitali Srivastava, Rakhi Garg, P. K. Mishra, "Preprocessing Techniques in Web Usage Mining: A Survey", International Journal of Computer Applications (0975-8887), Vol. 97, No. 18, July 2014
[6] V. Chitraa, Dr. Antony Selvadoss Thanamani, "Web Log Data Cleaning for Enhancing Mining Process", International Journal of Communication and Computer Technologies, Vol. 01, No. 11, Issue 03, December 2012
[7] Vellingiri J., S. Kaliraj, S. Satheeshkumar and T. Parhtiban, "A Novel Approach for User Navigation Pattern Discovery and Analysis for Web Usage Mining", Journal of Computer Science 11(2): 372-382, 2015
[8] Ashwin G. Raiyani, Sheetal S. Pandya, "Discovering user identification mining techniques for preprocessed web log data", Journal of Information, Knowledge and Research in Computer Engineering, ISSN: 0975-6760, Vol. 2, Issue 2, pp. 477-482, October 2013
[9] Amit Dipchandji Kasliwal, Dr. Girish S. Katkar, "Web Usage mining for predicting User Access Behavior", International Journal of Computer Science and Information Technology, Vol. 6 (1), 2015, pp. 201-204
[10] Renata I., Sandor J., "Analysis of Web User Identification Methods", World Academy of Science, Engineering and Technology, 2007
[11] Arvindkumar Dangi, Sunita Sangwan, "A new approach for user identification in web usage mining preprocessing", IOSR Journal of Computer Engineering, e-ISSN: 2278-0661, p-ISSN: 2287-8727, Vol. 11, Issue 3, May-June 2013, pp. 57-61
[12] S. Princy Mary, E. Baburaj, "An efficient approach performs pre-processing", Indian Journal of Computer Science and Engineering, ISSN: 0976-5166, Vol. 4, No. 5, Oct-Nov 2013
[13] C. P. Sumathi, R. Padmaja Valli, T. Santhanam, "An overview of preprocessing of web log files for web usage mining", Journal of Theoretical and Applied Information Technology, ISSN: 1992-8645, e-ISSN: 1817-3195, Vol. 34, No. 1, December 2011
[14] Dr. Sanjay Dhawan, Mamta Lathwal, "Study of preprocessing Methods in Web Server Logs", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Vol. 3, Issue 5, May 2013
[15] C. Oostuizen, J Wesson, C Cilliers, "Visual Web Mining of Organizational Web Sites", Proceeding of the Information Visualization
[16] A. Jameela, P. Revathy, "Comparisons of Decision and Random tree algorithms on A web log data for finding frequent patterns", International Journal of Research in Engineering and Technology, e-ISSN: 2319-1163, p-ISSN: 2321-7308, Vol. 3, Issue 7, May 2014
[17] Shaily Langhnoja, Mehul Barot, Darshak Mehta, "Pre-processing: Procedure on Web Log File for Web Usage Mining", International Journal of Emerging Technology and Advanced Engineering, ISSN: 2250-2459, ISO 9001:2008 Certified Journal, Vol. 2, Issue 12, December 2012
[18] Naga Lakshmi, Raja Sekhara Rao, Sai Satyanarayan Reddy, "An overview of Preprocessing on Web Log Data for Web Usage Analysis", International Journal of Innovative Technology and Exploring Engineering, ISSN: 2278-3075, Vol. 2, Issue 4, March 2013
[19] Pabarskaite Z. (2002), "Implementing advanced cleaning and end-user interpretability technologies in web log mining", 24th International Conference on Information Technology Interfaces (ITI), Vol. 1, pp. 109-113
[20] D. Tanasa, B. Trousse (2004), "Advanced Data Preprocessing for Intersites Web Usage Mining", IEEE Intelligent Systems, Vol. 19, Issue 2, pp. 59-65
[21] Navin Kumar Tyagi, A. K. Solanki, Sanjay Tyagi, "An Algorithmic approach to data preprocessing in Web Usage Mining", International Journal of Information Technology and Knowledge Management, Vol. 2, No. 2, pp. 279-283, December 2010
[22] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", SIGKDD Explorations, Vol. 1, Issue 2, January 2000
[23] K. R. Suneetha, Dr. R. Krishnamoorthi, "Indentifying User Behavior by analyzing Web Server Access Log File", International Journal of Computer Science and Network Security, Vol. 9, No. 4, April 2009
[24] C. E. Dinuca, D. Ciobanu, "Improving the session identification using the mean time", International Journal of Mathematical Models and Methods in Applied Sciences, Vol. 6, Issue 2, 2012
[25] Sheetal A. Raiyani, Shailendra Jain, Ashwin G. Raiyani, "Advanced Preprocessing using Distinct User Identification in web log usage data", International Journal of Advanced Research in Computer and Communication Engineering, Vol. 1, Issue 6, August 2012
[26] Vellingiri J. and S. Chenthur Pandian, "A Novel Technique for Web Log mining with Better Data Cleaning and Transaction Identification", Journal of Computer Science 7(5): 683-689, ISSN: 1549-3636, 2011
[27] G T Raju and P S Satyanarayana, "Knowledge Discovery from Web Usage Data: Complete Preprocessing Methodology", International Journal of Computer Science and Network Security, Vol. 8, No. 1, January 2008
[28] P. Nithya and Dr. P. Sumathi, "Novel Pre-Processing Technique for Web Log Mining by Removing Global Noise, Cookies and Web Robots", International Journal of Computer Applications, Vol. 53, No. 17, September 2012
