A Machine Learning Model to Predict Fake Review using Classifier on Yelp Dataset


1. Gouri Patil

Associate Professor, Dept. of CSE, GND Engineering College, Bidar

2. Swathi Raje

Student, 4th Sem M.Tech, GND Engineering College, Bidar

Abstract: This article presents an overview of our study to build a machine learning model that detects whether reviews in the Yelp dataset are genuine or fake. Specifically, we applied and compared different machine learning classification techniques to find out which one gives the best result. Brief descriptions of each classification technique are provided to explain why some methods perform better than others in some cases. The best result was achieved with the XGBoost classification technique, which reached an F-1 score of 0.99 in prediction.

Keywords: XGBoost, Sentiment Analysis, Machine Learning, Data Analysis, Web Analytics.

  1. INTRODUCTION

    With the growth of online information today, people tend to read reviews first for the places they want to visit, such as restaurants, hotels, or other businesses, or before they buy a product. Yelp is an advertising service and a public review forum that people use to post reviews of businesses. Statistics show that by the end of 2018 there were more than 177 million reviews on the Yelp website. This benefits both consumers and businesses. Business owners get free advertising from people who give useful and positive reviews of their business. Unfortunately, a problem arises when a small portion of irresponsible business owners try to boost their market by hiring people to write fake reviews about their business on the Yelp website.

    Yelp realizes that this potential threat creates misleading information for its users. To overcome this problem, Yelp provides a review policy for business owners. In addition, Yelp has implemented recommendation software that aims to automatically filter all reviews determined to be problematic. To keep its information helpful and dependable, Yelp tries not to promote reviews from users it does not know much about, or reviews that may be biased because they were solicited from family, friends, or favorite clients. Reviews are evaluated based on quality, reliability, and user activity [1]. Currently, about 75 percent of all reviews on the Yelp website are recommended.

  2. EXISTING SYSTEM

    Yelp realizes that this potential threat creates misleading information for its users. To overcome this problem, Yelp provides a review policy for business owners. In addition, Yelp has implemented recommendation software that aims to automatically filter all reviews determined to be problematic. To keep its content helpful and credible, Yelp does not want to highlight reviews from users it does not know much about, or reviews that are biased because they were requested from family members, friends, or favorite customers. Reviews are evaluated based on quality, reliability, and user activity [1]. Currently, about 75 percent of all reviews on the Yelp website are recommended. However, no system or method can be truly foolproof. Machine learning can be very useful in improving the accuracy of identifying fake reviews. In particular, machine learning classification techniques can learn from data and then be applied to separate truthful reviews from fake ones.

    The remainder of the paper is organized as follows. Section II examines important literature that sets the stage and forms the basis of our investigation; in particular, it surveys four popular machine learning classification approaches. Section III explains our method. Section IV presents preliminary results of our method. Finally, Section V concludes the paper.

  3. PROPOSED SYSTEM

    The extreme rating ratio of the reviewer [10], [11] is also an interesting feature. A fake reviewer typically gives an extreme rating (1 or 5 stars) to convince people of their opinion. Accordingly, we calculated the extreme-rating (1 star or 5 stars) ratio for every reviewer and used it as one feature of every review. For each unique reviewer, the ratio is the number of extreme ratings (1 or 5 stars) given by the reviewer divided by the reviewer's total number of reviews. We computed this value for all unique reviewers and attached it to every review written by the corresponding reviewer.
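    The extreme rating ratio described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `extreme_rating_ratio`, the field names `user_id` and `stars`, the output key `extreme_ratio`, and the toy reviews are all chosen for the example.

```python
from collections import defaultdict


def extreme_rating_ratio(reviews, extreme=(1, 5)):
    """Return a copy of each review with its reviewer's extreme-rating ratio.

    The ratio for a reviewer is (number of 1- or 5-star reviews) divided by
    (total number of reviews by that reviewer).
    Field names 'user_id' and 'stars' are assumptions for this sketch.
    """
    total = defaultdict(int)
    extreme_count = defaultdict(int)

    # First pass: count all reviews and extreme reviews per reviewer.
    for review in reviews:
        total[review["user_id"]] += 1
        if review["stars"] in extreme:
            extreme_count[review["user_id"]] += 1

    ratio = {user: extreme_count[user] / total[user] for user in total}

    # Second pass: attach the reviewer-level ratio to every review.
    return [{**review, "extreme_ratio": ratio[review["user_id"]]}
            for review in reviews]


# Toy data (hypothetical): u1 gives 3 extreme ratings out of 4, u2 gives 1 out of 2.
reviews = [
    {"user_id": "u1", "stars": 5},
    {"user_id": "u1", "stars": 1},
    {"user_id": "u1", "stars": 3},
    {"user_id": "u1", "stars": 5},
    {"user_id": "u2", "stars": 4},
    {"user_id": "u2", "stars": 5},
]
featured = extreme_rating_ratio(reviews)
for row in featured:
    print(row)
```

    Because the ratio is a reviewer-level quantity, every review by the same reviewer carries the same value, which matches the description of feeding the value back to each review.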

    Fig 1: Architectural block diagram

  4. OBJECTIVES
    1. The design process turns a user-oriented description of the input into a computer-based system. This design is crucial for preventing data-entry errors and for guiding the correct operation of the system so that it receives the right information.
    2. User-friendly data-entry screens are used to handle large volumes of data. The objective of input design is to make data entry easy and free of errors. The data-entry screen is designed so that all data handling can be performed from it. It also offers record-viewing facilities.
    3. When data is entered, its validity is checked. Data can be entered with the help of screens. Appropriate messages are shown when needed so that the user is not left confused. The goal of input design is thus an input layout that is simple to follow.
  5. RESULTS

    Input design: Input design is the link between the information system and the user. It involves developing the specifications and procedures for data preparation, and the steps needed to put transaction data into a form suitable for processing, either by having the computer read the data from a written or printed document or by entering it directly into the system. Input design focuses on controlling the amount of input required, controlling errors, avoiding delays, avoiding extra steps, and keeping the process simple. The input is designed so that it provides security and ease of use while preserving privacy. Input design considered the following things:
    • What data should be given as input?
    • How should the data be arranged or coded?
    • The dialogue to guide the operating staff in providing input.
    • Methods for validating input and the steps to follow when an error occurs.
  6. CONCLUSIONS AND FUTURE WORK

This paper has reviewed four popular machine learning classification methods for finding fake Yelp reviews. Review ratings such as useful, cool, and funny are acquired only by non-filtered reviews: as soon as a review is filtered by Yelp, it is hidden, so it cannot be rated by others.

The experimental results showed a very high prediction score when using XGBoost. There are still many features that we could not implement because of limitations in the dataset, such as a user trust factor based on user friendships and the user profile (join date, photo, etc.).
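For reference, the F-1 score cited in the abstract is the harmonic mean of precision and recall. The sketch below is a minimal plain-Python illustration, not the authors' evaluation code; the function name `f1_from_labels`, the label convention (1 = fake, 0 = genuine), and the toy labels are assumptions made for the example.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the F-1 score of the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # With no true positives, precision and recall are zero or undefined;
    # this sketch returns 0.0 in that case.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = fake review, 0 = genuine review.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 3))
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F-1 score is 2/3 as well.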

The imbalance in the dataset needs to be handled, because an imbalanced dataset gave poor results in our experiment. While running the experiment, we found that SVM took the longest time to train, and Gaussian Naïve Bayes gave the lowest score on average.
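The text does not say how the imbalance should be handled. One common option, shown here only as a hedged sketch, is to reweight the classes so that the minority class counts for more during training. The weight formula n_samples / (n_classes * class_count) is the same heuristic scikit-learn uses for `class_weight='balanced'`, and the negative-to-positive ratio is the value usually suggested for XGBoost's `scale_pos_weight` parameter. The function names, the label convention (1 = fake), and the toy labels are assumptions for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


def negative_to_positive_ratio(labels, positive=1):
    """Ratio of negative to positive examples in a binary label list."""
    n_positive = sum(1 for label in labels if label == positive)
    n_negative = len(labels) - n_positive
    return n_negative / n_positive


# Toy labels (hypothetical): 6 genuine reviews (0) and 2 fake reviews (1),
# roughly mirroring the 75 percent recommended share mentioned earlier.
labels = [0] * 6 + [1] * 2
weights = balanced_class_weights(labels)
ratio = negative_to_positive_ratio(labels)
print(weights)
print(ratio)
```

Resampling (oversampling the minority class or undersampling the majority class) is an alternative with the same aim; which approach works best on the Yelp data would need to be checked experimentally.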

In our opinion, we cannot say that reviews filtered by the Yelp recommendation system are 100% fake, because there are still other factors that may lead machine learning into false predictions. Another potentially reliable technique for filtering reviews is the verified-buyer method, which some crowdsourced websites have used.

REFERENCES

  1. Ott, M., Cardie, C., Hancock, J.: Estimating the prevalence of deception in online review communities. In: Proceedings of the 21st International Conference on World Wide Web, pp. 201-210. ACM (2012)
  2. Wang, Z.: Anonymity, social image, and the competition for volunteers: a case study of the online market for reviews. B.E. J. Econ. Anal. Policy 10(1), 1-33 (2010)
  3. Ott, M., Choi, Y., Cardie, C., Hancock, J.T.: Finding deceptive opinion spam by any stretch of the imagination. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 309-319 (2011)
  4. Ott, M., Choi, Y., Cardie, C., Hancock, J.T.: Finding deceptive opinion spam by any stretch of the imagination. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 309-319 (2011)
  5. Heydari, A., Tavakoli, M., Salim, N.: Detection of fake opinions using time series. Expert Syst. Appl. 58, 83-92 (2016)
  6. Lim, E.-P., Nguyen, V.-A., Jindal, N., Liu, B., Lauw, H.W.: Detecting product review spammers using rating behaviors. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 939-948. ACM (2010)
  7. Xie, S., Wang, G., Lin, S., Yu, P.S.: Review spam detection via temporal pattern discovery. In: Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining, pp. 823-831. ACM (2012)
  8. Ye, J., Kumar, S., Akoglu, L.: Temporal opinion spam detection by multivariate indicative signals. In: ICWSM, pp. 743-746 (2016)

Output design: A quality output is one that meets the requirements of the end user and presents the information clearly. The results of all system processing are communicated to users and to other systems through outputs. Output design defines how information is displayed for immediate use and how hard-copy output is produced. It is the most important and most direct source of information for the user. Efficient and intelligent output design improves the system's relationship with the user and supports decision-making.

  1. Computer output should be designed in an organized and carefully prepared way. The right output must be produced, with each output element designed so that people find the system easy and efficient to use. When analyzing computer output, the specific output needed to meet the requirements should be identified.
  2. Select information presentation techniques.
  3. Create documents, reports, or other formats that contain information generated by the system.

The output of the information system should achieve one or more of the following goals:

    • Provide information on historical events, present conditions, or future projections.
    • Report major events, opportunities, issues, or warnings.
    • Trigger an action.
    • Confirm an action.
