Online Plagiarism Detection and Result Evaluation using Data Mining and NLP

DOI : 10.17577/IJERTCONV10IS04031

Download Full-Text PDF Cite this Publication

Text Only Version

Online Plagiarism Detection and Result Evaluation using Data Mining and NLP

Aparna Babu

B-tech student Computer Science and Engineering Mangalam College of Engineering

(Affiliated by A P J Abdul Kalam University) Kerala, India

Anju S

B-tech student Computer Science and Engineering Mangalam College of Engineering

(Affiliated by A P J Abdul Kalam University) Kerala, India

Archana P Udayan

B-tech student Computer Science and Engineering Mangalam College of Engineering

(Affiliated by A P J Abdul Kalam University) Kerala, India

Giny Mary John

B-tech student Computer Science and Engineering Mangalam College of Engineering

(Affiliated by A P J Abdul Kalam University) Kerala, India

Divya S B

Assistant professor Computer Science and Engineering Mangalam College of Engineering

(Affiliated by A P J Abdul Kalam University) Kerala, India

AbstractPlagiarism is the process of copying a document from other document and this project is to find not only the plagiarism being done in the paper but to find the part also where the plagiarism has been applied so it can be easily understandable. The percentage and the sentence which are being copied from other document can be detected from this application so we can ensure if anyone is copying the content from any document. The most important factor most of the plagiarism detection is on website but this is an application so it can be found easily. It ensures that no actor of the system is copied by others. Hence coordinating, maintaining and making sure that all activities in the application is synchronized. A review on plagiarism in assignments, its drawbacks, and the solution to overcome the drawbacks are presented in this paper.

KeywordsSupply Natural Language Processing, TF- IDFVectorizer, Rabin-Karp Algorithm, KMP Algorithm


Plagiarism is the representation of another author's language, thoughts or ideas as one's own original work.Distinguish literary theft has turned into a wide exploration region to uncover its sorts thus as to keep understudies from duplicate right encroachment and to work on the educational level. Copyright infringement is a vital subject that must be managed since, in such a case that it isn't individuals will quit thinking carefully and simply depend on what inventive thoughts others consider. Copyright infringement is finished by reworded works and the likenesses among watchwords and word for word covers, change of sentences from one structure to other structure, which could be recognised utilizing WordNet and so forth. This copyright infringement indicator

estimates the comparative text that matches and distinguishes literary theft.

Theft of sentences can be identified by using several methods. In this software, student can upload assignments by login into their accounts by using personal login id and password. Software will take one assignment and check it with the other assignments uploaded to detect plagiarism.Plagiarism detector estimates the comparative text that matches and identify copyright infringement. As well as semantical checking will be likewise finished regarding task. Likewise, understudies can see the historical backdrop oftheir past reports. Teachers likewise ready to check the syntax botches on the substance and symantical literary theft. Objectives include 1. To contrast the task and any remaining submitted task for plagiarism. Model If the clump having 100 understudies, then a solitary task is checked with any remaining 99 tasks.2.To check with linguistic and semantical approach.3. Uncommon changes like graph and tables will be checked for plagiarism.4.Plagiarism recognition report will be produced.5.To add missing references or modify your text.


A.Online assignment plagiarism checking using data mining and NLP[1] system plagiarism detector estimates the comparable text that matches and distinguishes counterfeiting.

Figure.1 : Online assignment plagiarism detection system.

Too symantical checking will be likewise finished concerning assignment. For identifying the counterfeiting, we will utilize data mining algorithm and natural language processing. Working Flow include: Collection of assignment, Pre- processing, Classification, Text examination, Similarity measures, Clustering the copied information, Similarity score.

Show the percentage of similarity. Check grammar mistake.

  • Doesnt show area of similarity.

B.Software metrics and plagiarism detection

The relentless nature of forging acknowledgmentsystems, which endeavour to recognize similarprojects in tremendous peoples, is essentially likely to thechoice of program depiction. Programming estimationsgenerally used as depictions are portrayed, and thelimitations of estimations changed from programmingmultifaceted design measures are outlined. An application expressestimation is proposed[2], one that addresses thedevelopment of a program as a variable-length profile. Itsconstituent terms, each recording the control structures in aprogram piece, are mentioned for viable connection. Theprevalent presentation of the duplicating acknowledgmentsystem considering this profile is represented, and gettingmultifaceted nature measures from the profile is analyzed. Programmingestimations have beenmade to evaluate programmingquality and coordinating the item headway process. Anrepresentation of the wide-going usage of estimations is inthe area of scholarly robbery in student programming assignments.An estimation is the eventual outcome of applying a change limitto an article. By virtue of programming estimations, thearticle being alluded to is a program (or a programprogression process), and the change is applied by an itemanalyser. The extent of conceivably important changes is incrediblycolossal; in any case, most are planned to decrease the thing to alittle plan of pictures that address some huge partof the article. The viewpoint highlighted might be program size,data use, stream of control, or the speed of misstepsoccurring during progress.

In all cases, necessities foressential assessment of the multifaceted nature of programming modulesreally inclines toward the use of estimations that includea lone numeric sum. Forging disclosure attempts torecognize tantamount ventures from a gigantic social occasionexecuting a comparative task. Motorized systems for pickingrelative student sections rely upon connectionsbetween diminished depictions of the activities. The depictionsare modifying estimations; but they are likelygoing to be one of a kind explanation relatives of single-regardedmultifaceted nature measures. Figure 1 addresses the three hugecycles that make up a summarized duplicating areastructure. The estimations made by the item

analyser are facilitated to convey an overview of programcoordinates with an extent of the qualifications between their

metric depictions. The last stage orders the once-over on measurementresemblance and channels out those sets showing colossal measurementcontrasts. The sections recorded as possible copyrightencroachment are finally recuperated for manual evaluation andportrayal.

Figure.2 : A plagiarism detection system.

Application specific matric. Represents structure of a program as a variable length profile. Cannot display the portion where plagiarism occurs.

C.Supply Semantic Text Similarity Using Corpus-Based Word Similarity and String Similarity

The technique for estimating the semantic similitude of texts utilizing a corpus-based measure of semantic word comparability [3]nd a standardized and altered adaptation of the Longest Common Subsequence (LCS) string matching calculation. Existing techniques for processing text likeness have zeroed in for the most part on either enormous reports or individual words. We centre around registering the closeness between two sentences or two short passages. The proposed technique can be taken advantage of in an assortment of uses including text-based information portrayal and information disclosure. Assessment results on two distinct informational indexes show that our strategy outflanks a few contending techniques. Similitude is an intricate idea which has been generally talked about in the etymological, philosophical, and data hypothesis networks. Frawley [1992] examines all semantic composing with regards to two components: the discovery of likenesses and contrasts. A successful strategy to figure the likeness between short messages or sentences has numerous applications in regular language handling and related regions, for example, data recovery to be probably the best procedure for further developing recovery viability and in picture recovery from the Web, the utilization of short message encompassing the pictures can accomplish a higher recovery accuracy than the utilization of the entire report where the picture is inserted. In data sets, message comparability can be utilized in diagram matching to tackle semantic heterogeneity, a vital issue in any information

sharing framework whether it is a combined data set, an information reconciliation framework, a message passing framework, a web administration, or a distributed information the executives framework. It can likewise be utilized in text closeness join administrator that joins two relations assuming their join credits are literarily like one another, and it has an assortment of utilization spaces including reconciliation and questioning of information from heterogeneous assets; purging of information; and mining of information.

D.Plagiarism Detection using Semantic Analysis

Recognize Plagiarism has turned into a wide examination region to uncover its sorts thus as to forestall the infringement of privileges, particularly in instruction to keep understudies from copyright encroachment and to work on the instructive level. Plagiarism is unsatisfactory utilization of crafted by another creator either as an exact duplicate, or change it a smidgen. The easiest portrayal of a copyright infringement is either a 'reorder' for a text regardless of whether the source was referred to or an adjustment of certain words by taking the importance without referring to the source, where deciding the significance is the hardest and most complex errand. The rising rate of literary theft in the advanced education area, which is viewed as adequate way of behaving by some, since counterfeiting saves time and exertion, and gives improved outcomes, turned into a major issue looked by instructive organizations. Semantic literary theft is an adjustment of the importance of words by taking equivalents of it, while holding the places of the words. There is a ton of hypotheses in the field of recognition of literary theft for the messages that contain huge changes in punctuation and in

significance however for the most part deficient and wasteful, and this addresses the greatest test in the discovery of these changes, since it requires investigation of messages that convey comparative implications and going with a choice regardless of whether there is a counterfeiting. The primary target of this research[4] is to track down an appropriate method for identifying semantic counterfeiting which happens on the importance and utilizing equivalents and supplant it rather than the first words. This examination points likewise to apply a pre-handling for the expressions of exploration by utilizing tokenization and stop word eliminating processes, then tried regardless of whether the examination enter under the specialization of software engineering, where just such examination will expose to semantic literary theft location by utilizing WordNet. This paper, portrays a way to deal with identify semantic plagiarism which happens in investigates by utilizing WordNet. In this methodology, WordNet has demonstrated as a powerful method for recognizing the semantic plagiarism by given the equivalents of words in the record then distinguish the plagiarism, Then, is to know the adjustment of the words areas that have been changed by other interchangeable words. Assuming there is no adjustment of words areas as it exists in the information base of specific words and remembered for WordNet to be used to identify the semantic plagiarism.

Figure 3: Semantic Plagiarism Detection

Capacity of storage is 500000 documents. Runtime required to get matching result is 6 seconds. Risky activity

  1. A New Approach for Calculating Semantic Similarity between Words Using WordNet and Set Theory

    Semantic Similarity is a task in the area of Natural Language Processing (NLP) that scores the relationship between texts or documents using a defined metric. Working out semantic likeness between words is a difficult undertaking. In this segment, we will introduce the various strides of our work and how we ascertain the semantic likenessbetween two words W1 and W2. The information sources are two English words, and the result is a semantic similarity score. The proposed strategy involves benefits of both synset and gleams for expanding the similitude score. WordNet is a lexical word reference adroitly coordinated, where every idea has a few attributes: Synsets and Glosses. Synset address sets of equivalents of a given word and Glosses are a short portrayal. The proposed strategy depends on set hypothesis' ideas and WordNet properties, by ascertaining the relatedness between the synsets' what's more, shines' of the two ideas.

  2. Machine Learning Models for ParaphraseIdentification and its Applications on Plagiarism Detection

    Paraphrase identification is concerned with the ability of identifying alternative linguistic expressions of the same meaning at different textual levels. Paraphrase of a sentence conveys a similar importance yet its construction and the grouping of words changes. This identification requiresaddressing a text in some structure bringing its setting into thought and forming a measurement to communicate the similarity between a couple of texts. Thissystem[5] attempt to perform rework recognizable proof utilizing different AI models and make a performance examination among these models. In particular, we made the models utilizing Strategic Regression, Support Vector Machines, and unique structures of Neural Networks. Among the thought about models, true to form, Recurrent Neural Network (RNN) is the most ideal for our summary ID task. This model can be utilized to foster a literary theft location framework where a basic revamp of a text will in be hailed as counterfeited. The paper suggest that Paraphrase

    Identificationcan be carried out for copyright infringement discovery actually and additionally fostered a straightforward application for the exhibition reason.

  3. A Hybrid Approach for Detection of Plagiarism using Natural Language Processing

Recognition of plagiarism in research papers chose for meeting distributions and diaries, or in tasks gave over for assessment is vital. It guarantees the work submitted is free from any duplicated content. While many software that detect plagiarism in documents are available commercially; they suffer from certain drawbacks. Research has been carried out in this direction with an attempt to improve the efficiency and correctness of the underlying programs/algorithms. Proposed a hybrid approach that consolidates the basics of regular language handling and text mining to distinguish literary theft in a record. This approach was viewed as exceptionally powerful in distinguishing equivalent words and change in the plan of words utilized in a sentence. This venture ndeavours to work on the viability of counterfeiting identification instruments by utilizing the ideas of normal language handling and text mining to guarantee that these apparatuses are not tricked by the previously mentioned changes made in the semantics of the language utilized in the paper. It proposes a structure for discovery of copyright infringement that not just investigations the sentences framing the archive yet additionally its construction and semantics.


From the study we have done on the papers listed in references, few prominent drawbacks in plagiarism detection are:

  1. Doesnt show area of similarity[1]: Existing systems are nothighlight the portion of similarity.

  2. Doesnt check grammar [2]: System doesnt help user by checking the grammar.

  3. Cannot completely find the mistakes it can help up to some extent. [3]: It check plagiarism between two document or file. doesnt compare with many documents so cant tell that it completelyfinds the mistakes

  4. Destroyed professional and academic reputation[5]: Legal and Monetary repercussion.


    To solve the problems listed and to integrate the activities, we have designed system to detect the plagiarism in the academic assignment which will help to stop copying the assignment of other student and will improve the quality of education and also will help to improve personal skills of student . In this system plagiarism detector measures the similar text that matches and detects plagiarism. For detecting the plagiarism we will use data mining algorithm and natural language processing.

    In this paperwe are using django for web application development . It contains 2 modules : i)student ii)admin.The Students have an option to register and login. When they login , they can upload there assignment. And there is a feature to check plagiarism of that particular student assignment with the other students assignments.The Admin can login and admin have an option to check plagiarism of each students with other all students. The other feature of admin is to generate a plagiarism report . That means the Admin can see the plagiarized contents/copied contents with the help of string searching algorithms.

    Plagiarism Checking :- The initial step is to preprocess the all uploaded documents. Then we need to extract the features of this uploaded assignment documents using TF- IDF vectorizer. Finally we find the plagiarism based on the cosine similarity value.

    Figure 4: Architectural diagram

    The system using string searching algorithms for searching the matching contents of plagiarized document. The algorithm used is Rabin-Karp Algorithm.The proposed

    framework is carried out as a web application utilizing Python Django outline work. The framework comprising of two elements, student and administrator. Understudy can enlist and login. Student have two choices, can upload the assignment and actually take a look at the literary theft. That is student can really take a look at his task with others tasks to check whether it have comparability with others. Administrator can likewise really take a look at the copyright infringement. Administrator can choose every student and can check the cosine similitude between the task he chose with different assignment. Generate plagiarism report: The report shows the matching substance in this record. Here a string-matching calculation is utilized to show the matching substance. Tokenization: report will be isolated into tokens it will be switched over completely to more modest case. Stop word removal: rehashed words get eliminated. TFIDF: each word will be changed over completely to comparing vector structure. Then see as the cosine closeness of one record with different reports. utilizing the string looking through calculation we can show the matching substance and produce report which shows the area of similarity.


In this paper, we have discussed various aspects related to online plagiarism checking. Plagiarism checking is important in academics to improving the quality of education. plagiarism could erase all the chances of building a successful career. The proposed system describes an approach to detect plagiarism using data mining and NLP. The system generates a report which shows the matching content in assignments of school and college students. The student and admin can check the plagiarism. The system will generate the report showing area of matching content. Here we designed a simple method which assist us with the detection of instances of plagiarism and semantical checking. By using data mining algorithm and NLP it will provides straightforward documentation.


[1] Taresh Bokade, Tejas Chede, Dhanashri Kuwar, Prof. Rasika ShintreOnline Assignment Plagiarism Checking Using Data Mining and NLP2021.

[2] Software metrics and plagiarism detection, J. Syst. Software, vol.

13, pp. 131 138, 1990.

[3] M. J. Wise, Detection of similarities in student programs: YAPing may be preferable to Plagueing, ACM SIGCSE Bull., vol. 24, no. 1, pp. 268271, 1992.

[4] A. Islam and D. Inkpen, Semantic text similarity using corpus-based word similarity and string similarity, ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, pp. 125, Jan. 2008.

[5] U. Bandara and G. Wijayarathna, A Machine Learning Based Tool for Source Code Plagiarism Detection, International Journal of Machine Learning and Computing, pp. 337 343, 2011.

[6] Eman Salih Al-Shamery and Hadeel Qasem Gheni. Plagiarism detection using semantic analysis.Indian Journal of Science Tech.

[7] Alexander Budanitsky and Graeme Hirst. Semantic distance in wordnet: An experimental, Application – oriented evaluation of five measures. University of Toronto- Toronto, Ontario, Canada.

[8] A. Anguita, A. Beghelli, and W. Creixell, Automatic cross-language plagiarism detection, 2011 7th International Conference on Natural Language Processing and Knowledge Engineering, 2011.

Leave a Reply