Abstract
With the growth of content available on the Web, any piece of information can be plagiarized. Plagiarism is the use of another person's ideas without naming the source. Plagiarism detection is therefore necessary, but it is complicated by the large amount of material on the World Wide Web and by the limited access to a substantial part of it. In this paper, we present a novel plagiarism detection method for French documents. The proposed method combines intrinsic and extrinsic plagiarism detection. In our first tests, the extrinsic method achieved an accuracy of 62%, and the intrinsic method achieved an F-score of 0.328.
Introduction
Figure 1. Categorization of some of the proposed approaches to plagiarism detection.
The expansion of the media, including the Internet, has made it possible to access a vast amount of data [11]. Researchers around the world have access to a wide range of information via the Internet, which offers a much easier and faster way to acquire knowledge [16]. However, this ease of access is also a threat to the integrity of information and to copyright. Plagiarism is the reuse of others’ ideas or text without giving proper credit [8]. Plagiarism detection approaches have a long history of attempts to improve their performance in detecting text misuse [21]. Plagiarism detection is an active research topic in Natural Language Processing (NLP): it has attracted broad interest, and multiple international competitions have been held since 2009. The task aims to detect the reuse, reproduction and/or modification of text from one document to another [36]. Plagiarism is considered a major problem in the modern world since it affects many domains, including education and research [13], and plagiarism detection systems are therefore becoming a necessity. Plagiarism in academic research is the most critical form and requires more attention to identify [34]. Consequently, detecting plagiarism is a continuing concern within academia, and the last two decades have witnessed remarkable advances in automatic plagiarism detection tools [18].
While the concept of plagiarism is not new, the way that individuals plagiarize has changed [5]. We can distinguish multiple forms of plagiarism [15, 6]. First, copy/paste is the act of copying part of a text word for word without proper citation of the author. Second, in paraphrasing, the copied segment is modified but the idea and some of the words stay the same. Third, in idea plagiarism, the same idea is expressed using different words or a different language. Finally, authorship plagiarism is the case of putting one’s name to someone else’s work. The rapid growth of information content has made the field of scientific research particularly vulnerable to plagiarism [38], which has strong negative impacts on academia and the public [37].
We can single out two types of detection. Extrinsic plagiarism detection compares a suspect document with a collection of source documents; it is the process of identifying whether some parts of a given document have been copied from other documents [1]. Intrinsic plagiarism detection analyses the suspect document alone, with the aim of discovering internal evidence of plagiarism [18]; to do so, it uses stylistic features such as the author’s writing style, the structure of paragraphs, the formulation of sections, etc. [39].
Several studies have investigated the best ways to detect plagiarism. However, few have tackled a hybrid method. In this paper, we present a hybrid method that combines extrinsic and intrinsic plagiarism detection for the French language. We conduct our investigation using an artificial corpus built from Project Gutenberg novels. We first present our intrinsic method, followed by our extrinsic method, which is based on word embeddings.
The remainder of this paper is organized as follows. Section 2 presents related work on extrinsic and intrinsic plagiarism detection. Section 3 describes our proposed method, and Section 4 presents our experiments and results. Finally, Section 5 gives some concluding remarks and directions for future work.
Related work
In this section, we present a variety of plagiarism detection methods for both approaches, extrinsic and intrinsic. Figure 1 gives an overview of the existing methods proposed in the literature.
Extrinsic plagiarism detection
Several works have addressed the issue of plagiarism identification. Belyy and Dubova [4] proposed an extrinsic plagiarism detection method for the Russian language. They begin with a preprocessing step in which the dataset is tokenized, lemmatized and augmented with POS tags. Afterwards, the authors join short sentences (fewer than 3 tokens) with adjacent longer sentences. Next, they perform sentence-to-sentence matching, exploring both parallel matching and pairwise matching. They also experimented with negative sampling: they selected from their dataset a collection of short texts that they marked as negative cases of plagiarism to help train their classifier. Moreover, the authors performed a granularity reduction; to this end, they explored the evaluation metrics proposed in [27], namely macro precision and recall, granularity, and plagDet. The latter is the main measure for evaluating plagiarism detection systems; it is especially dedicated to manually paraphrased plagiarism datasets.
Asghari et al. [16] proposed a cross-language plagiarism detection method for pairs of documents in English and Persian. For this purpose, they constructed an English-Persian bilingual plagiarism detection corpus. Since Persian is a less-resourced language than English, the authors focused on finding cases of plagiarism in a given Persian document by searching for a corresponding source document in English. They evaluated existing state-of-the-art methods, such as CL-ESA [32] and CL-LSA [40], on their corpus. Their results show that word embedding methods outperform the other methods on heavily paraphrased passages.
Zaher et al. [29] proposed a plagiarism detection method for documents written in Arabic. The authors proposed an unsupervised model to identify plagiarism cases in ASTAP (Abstract Syntax Tree Arabic Plagiarism) documents, a set of handwritten Arabic documents. First, they experimented with OCR (Optical Character Recognition) implemented with the GWO (Grey Wolf Optimization) algorithm [28] for the purpose of optimizing character feature selection. Next, they used an AST (Abstract Syntax Tree) [2, 25] to detect the similarity between documents. The authors also performed a preprocessing step in which they tokenized the document and removed the stop words, followed by a stemming step to decrease the number of types in the text group. Additionally, they performed a synonym replacement step in which words are replaced with their most common alternative. Afterwards, they used a query submission module to search the Web for possibly similar documents. Finally, Zaher et al. computed the similarity between the input document and the retrieved documents and, based on the results, produced a report on the plagiarized documents including the corresponding sources and their URLs.
Menon et al. [34] proposed a method for plagiarism detection using sparse dictionary learning. The basic idea is to use sparse representation techniques to identify the similarity between large amounts of data. The method reduces each document in the corpus to a vector of real numbers (ratios of counts, also referred to as term frequencies). First, they tokenized their corpus and removed the stop words. Second, they performed lemmatization. Finally, they built a TF-IDF matrix. After preprocessing the corpus, the authors applied Orthogonal Matching Pursuit (OMP), which represents the match between documents in matrix form. Afterwards, they used Singular Value Decomposition (SVD) to factorize the obtained matrix. The similarity between the suspect document and the source documents is identified by decomposing the source documents using the OMP and SVD algorithms and matching them with the suspect document. Then, based on a threshold, they determine whether a suspect document is plagiarized.
Oberreuter et al. [14] proposed an extrinsic plagiarism detection method at the PAN-11 competition. The authors used the PAN-09 and PAN-10 corpora provided by the PAN@CLEF competition, and the final tests were performed on the PAN-09 and PAN-11 corpora. Their method has two main steps. First, it removes stop words and extracts word 4-grams: if two documents share at least two word 4-grams close enough to be in the same paragraph, the pair moves on to the next step; otherwise, the pair is rejected. Second, it performs an exhaustive search to find the plagiarized passages, using word trigrams without discarding stop words. Their system obtained fifth place in the plagiarism detection task, with a precision of 0.85 and a recall of 0.48.
Meuschke et al. [31] proposed a novel plagiarism detection approach that combines a variety of aspects: mathematical expressions, images, citations and text. Their method follows a multistage detection process consisting of candidate retrieval, detailed comparison, and human inspection [7]. For math similarity, the authors performed a pairwise similarity assessment of formulae using three similarity measures [33]. For image similarity, they employed perceptual hashing based on the discrete cosine transform. For citation similarity, they explored four citation-based similarity measures: bibliographic coupling, longest common citation sequence, greedy citation tiling and citation chunking. Finally, for text similarity, the authors applied text retrieval methods. For the detailed comparison between documents, there are two options: full string matching, or the Encoplot algorithm [9], which is an efficient character 16-gram comparison.
Intrinsic plagiarism detection
Zlatkova et al. [12] proposed a supervised method for style breach detection, also known as intrinsic plagiarism detection. The authors combined a TF-IDF representation of the documents with features specifically engineered for identifying the writing style of a given document. They used a stacking technique to combine an ensemble of diverse learners, including SVM, Random Forest, AdaBoost, MLP and LightGBM. In addition, they experimented with deep learning techniques, more specifically Convolutional Neural Networks (CNN). First, the authors preprocessed their corpus. Second, they segmented each document into three equal sets of words and computed a feature vector for each segment. The authors exploited a variety of features such as vocabulary richness, frequent words, the average number of occurrences of n-grams (n
Ben Salem et al. [18] proposed an intrinsic plagiarism detection method relying solely on n-grams to detect plagiarism cases. The authors experimented with character n-grams and worked on English and Arabic texts. Unlike in their previous work [17], they chose not to weight the n-grams with their frequencies. They performed a feature extraction step in which they introduced the proportions of the n-gram frequency classes in a given fragment (the NFCP features). For plagiarism identification, the authors decide for each part of the suspect document whether it is original or plagiarized. For this purpose, Ben Salem et al. used Naïve Bayes, a supervised classification algorithm, to classify the different parts of the document into two classes.
Another method is the one proposed by Al-Sallal et al. [21], who experimented on a Corpus of English Novels (CEN) containing 292 novels. The authors used Bag of Words (BOW), Latent Semantic Analysis (LSA) and stylometry, with the aim of identifying the writing practices of a given author. The stylometric features include, for instance, the use of unique words and vocabulary richness. In addition to BOW and LSA, the authors applied the Multi-Layer Perceptron (MLP) classification algorithm to classify the parts of the suspect document into two classes: plagiarized and non-plagiarized.
Karaś et al. [10] proposed a method for intrinsic plagiarism detection as part of the PAN-17 task. First, they segmented the document into paragraphs, assuming that two blank lines separate consecutive paragraphs. Else, if no blank lines are detected, the authors designated
Kuznetsov et al. [26] used stylometric features such as character n-grams, word n-grams, punctuation marks and pronoun counts. They proposed an approach for author diarization, a task in the PAN-16 event. This task was divided into three subtasks: intrinsic plagiarism detection, author diarization with a known number of authors, and unrestricted diarization. For the intrinsic plagiarism task, they used a per-sentence approach [30], which constructs disjoint segments of varying lengths and then detects plagiarism at the sentence level. The authors labelled their sentences using the following rule: if more than half of the characters of a given sentence “s” are plagiarized, the sentence is labelled as plagiarized; otherwise, it is labelled as non-plagiarized. As features, the authors used punctuation marks and POS tags (VERB, NOUN, ADJ, ADV, etc.). For the unrestricted diarization, they estimated the number of authors by computing an averaged t-statistic for all pairs of author segments. They actually iterated through the number n (n
Polydouri et al. [3] proposed a method for intrinsic plagiarism detection that relies entirely on machine learning. The authors tested their method on the corpora provided by the PAN@CLEF competition for the intrinsic plagiarism detection task, which was the centre of interest in the PAN-09 and PAN-11 competitions, and compared their results with the highest scores obtained in both events. Their approach comprised five steps: first, preprocessing (de-capitalization, removal of alphanumeric and special characters, part-of-speech (POS) tagging, stemming, etc.); second, text segmentation (using sliding windows with 15 as a threshold and 5 sentences as the window’s step); third, style analysis, followed by feature extraction (11 stylistic and semantic attributes: mean sentence length, mean syllable count per sentence, Flesch-Kincaid grade [41], etc.); and finally, outlier detection and post-processing of the output. For the outlier detection step, the authors experimented with two training algorithms: SVM and decision trees. Polydouri et al. obtained an F-score of 0.419 on the PAN-09 corpus and an F-score of 0.328 on the PAN-11 corpus.
Figure 2. Main steps of our proposed method.
Elamine et al. [22] proposed an intrinsic plagiarism detection method based on statistical text segmentation. The authors tested their method on the PAN@CLEF corpora for the style breach task (PAN-16 and PAN-17), in addition to a corpus collected from various news articles and children’s stories found on the Internet. Their method comprises two major steps: preprocessing and similarity detection. First, they preprocessed the suspect document: they de-capitalized it, removed the stop words and then performed a text cleaning step. Second, they segmented the document into paragraphs. Afterwards, using the term frequency technique, the authors transformed the obtained segments into vector representations. Next, they computed the Cosine similarity between the obtained vectors. Finally, they used the unsupervised k-means algorithm to group the obtained values into two classes. The authors achieved an F-score of 0.309.
Multiple plagiarism detection methods have been investigated in previous work. Recently, neural network approaches have received considerably more attention than others for extrinsic plagiarism detection. In this work, we explore the use of neural networks by exploiting multidimensional document representations [20] for extrinsic plagiarism detection in French documents. To our knowledge, no work in the literature proposes a hybrid plagiarism detection method combining the intrinsic and extrinsic aspects; in this paper, we propose such a hybrid method.
Proposed method
In this section, we present our proposed hybrid method for plagiarism detection. We apply two plagiarism detection approaches, one extrinsic and one intrinsic, and then compare the results obtained by both methods. The combined result of the two methods is the result of plagiarism detection for a given suspect document.
As shown in Fig. 2, each suspect document is preprocessed for both approaches, extrinsic and intrinsic. For the extrinsic method, the document is then converted to a vector representation using a trained Doc2Vec model. Next, a similarity score is computed, using the Cosine measure, between the obtained vector and the pre-computed vector representations of all the source documents. Afterwards, we compute the similarity between the vector representations of the sentences to obtain the detected plagiarized sentences. For the intrinsic method, after preprocessing the suspect document, we segment it into paragraphs of fixed length. Next, we identify the writing style of each segment by extracting its stylistic features. Finally, we identify the plagiarized parts by comparing the detected writing styles. As a final step, we compare the results obtained by the extrinsic and intrinsic methods. Our result is the combination of the parts detected by both methods for the same suspect document.
Figure 3. Example of a segmented document.
Intrinsic method
In this section, we present our intrinsic method. We recall that, in intrinsic approaches, a segment is considered plagiarized if it deviates from the dominant writing style identified in the suspect document. The task can therefore be considered one of style breach detection. Consequently, our aim is to analyse a given suspect document and identify the writing style of the author.
Preprocessing
The collected corpus contains a lot of noise, including special characters and words in other languages. Given the nature of the raw data, we did some cleaning before the preprocessing step. We manually removed the comments written entirely in other languages (mostly English), deleted the names of authors and editors, and deleted the existing URLs. Moreover, all documents are lowercased, and stop words, numbers and special characters such as “$, #, &, (,)” are removed. This was done using the NLTK library and regular expressions. In what follows, we present an example of a sentence before and after the preprocessing step:
Before preprocessing
“M. Phlipon, père de madame Roland, était graveur à Paris. Elle-même y est née en 1754, et fut l’objet constant des soins de sa mère, pour qui elle avait non pas une tendresse filiale, mais un de ces sentiments passionnés qui longtemps isolent de tout ce qui nous reste à donner de notre âme.”
After preprocessing, this sentence becomes:
[‘m’, ‘phlipon’, ‘père’, ‘madame’, ‘roland’, ‘était’, ‘graveur’, ‘paris’, ‘elle même’, ‘y’, ‘née’, ‘fut’, “l’objet”, ‘constant’, ‘soins’, ‘mère’, ‘avait’, ‘tendresse’, ‘filiale’, ‘sentiments’, ‘passionnés’, ‘isolent’, ‘reste’, ‘donner’, ‘âme’]
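The cleaning steps described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name preprocess, the two regular expressions and the small stop-word set are assumptions made for the example. The paper relies on NLTK, whose French stop-word list is much larger, so the output will not match the example above token for token.

```python
import re

# Assumption: a tiny illustrative stop-word set. The paper relies on NLTK,
# whose French stop-word list is much larger.
FRENCH_STOP_WORDS = {
    "à", "de", "des", "elle", "en", "est", "et", "la", "le", "les",
    "pas", "pour", "qui", "un", "une",
}

# Hypothetical patterns chosen for this sketch.
URL_RE = re.compile(r"https?://\S+")
# A token is a run of letters, optionally followed by an apostrophe and more
# letters (so "l'objet" stays in one piece). Digits and symbols never match.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)?")


def preprocess(text, stop_words=FRENCH_STOP_WORDS):
    """Lowercase the text, drop URLs, numbers and special characters,
    then remove stop words. Returns a list of tokens."""
    text = URL_RE.sub(" ", text.lower())
    tokens = TOKEN_RE.findall(text)
    return [tok for tok in tokens if tok not in stop_words]


if __name__ == "__main__":
    print(preprocess("Elle est née en 1754 à Paris, #vrai!"))
```

Keeping the stop-word set as a parameter makes it easy to substitute a fuller list, such as the one shipped with NLTK, without touching the tokenization logic.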
Feature detection
Once the suspect document is preprocessed, we analyse its writing style. To do so, we use linguistic and statistical features to identify the writing style of each segment. We use three main features: the frequency of spelling errors, the Flesch-Kincaid readability test [41], and the similarity between the vector representations of the segments, computed with the Cosine measure. We experimented with three similarity measures (Jaccard, Cosine and Dice), and the Cosine measure proved to be the best of the three. Figure 3 presents an example of a segmented document ready to be analysed for writing style detection.
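For reference, the Flesch-Kincaid grade level is usually defined for English text by the formula below. The paper does not state which variant or which coefficients were applied to the French segments, so the formula is given only as the standard definition and not as the exact computation used here.

```latex
\mathrm{FKGL} = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}}
              + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}}
              - 15.59
```

Higher values correspond to text that is harder to read, so the score can be compared from one segment to the next.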
Similarity detection
In this step, we compare the identified writing styles with each other to see whether the different segments have the same writing style. For each segment, we observe the frequency of spelling errors, the readability score given by the Flesch-Kincaid test, and the similarity of the segment to the other segments. More precisely, we compare the style identified for a given segment against the whole document except the segment itself. Then, we analyse the suspect segments that have been identified to obtain the list of suspect parts. A segment whose style differs from that of the other segments is marked as plagiarized.
Extrinsic method
Training the Doc2Vec model
The vast majority of available datasets for plagiarism detection are dedicated to the English language or to a cross-lingual setting including English. Recently, plagiarism detection has started to be applied to other languages such as Russian [4] and Persian [16]. However, to our knowledge, there are no plagiarism detection datasets dedicated to the French language. To overcome this lack, we collected French novels from Project Gutenberg. The collected documents vary in length, ranging from long literary novels to short poems. In addition, the collected dataset contains works from different domains including, but not limited to, religious, political and romantic writings. Overall, we collected over 500 documents with an average of 44,128 words per document. In addition to this dataset, we also used online French dictionaries to enrich our vocabulary while training the model. We freely distribute all the resources created in the context of this work. Figure 4 illustrates an example of the description of a fake document created for our experiments.
In order to develop and evaluate our plagiarism detection system, we divided the collected data into training, development and test sets, used respectively for training, tuning and testing our system. Table 1 gives details about the number of documents and the number of lines in each set.
Table 1. Training, development and test statistics for the extrinsic approach
Table 2. Distribution of each suspect document over the test set documents
Figure 4. Example of the description of a created fake document.
The creation of the development and test sets was inspired by the creation of the PAN-09 corpus provided by the PAN@CLEF competition. We artificially created a plagiarized document using the documents of the development and test sets. The plagiarized document is the result of copy/paste: no obfuscation or alteration was applied to the copied parts. Next, we ran the plagiarism detection process and evaluated our system using the artificially plagiarized document, which we also refer to as the suspect document.
Figure 5. Training of the Sent2Vec model.
Each suspect document is annotated with a detailed description of the start and end of each plagiarized passage and of the corresponding source documents. Overall, we created 5 suspect documents using the 100 documents of the development set, and we did the same using the 100 documents of the test set. Table 2 presents the number of documents used to create each suspect document from the 100 documents of the test set. For instance, the first document (Doc1) is a copy/paste from only 16 test set documents.
To sum up, the plagiarism detection system is evaluated using the accuracy measure on the suspect documents created from the development and test sets.
In this step, we compare the vector representation of the suspect document with those of the collection of source documents. Our aim is to retrieve, from the large collection, a set of source documents that match the suspect document at hand. To do so, we use the Cosine measure to compute the similarity between the vectors. The obtained similarity scores are compared to a predefined threshold to decide whether a document is plagiarized. The threshold is identified empirically using the development set.
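The retrieval step can be illustrated with a short sketch. It assumes that document vectors are already available as plain lists of floats (in the paper they come from the trained Doc2Vec model, which is not reproduced here); the function names, the dictionary layout and the use of "greater than or equal to" for the threshold test are assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if one is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_candidate_sources(suspect_vec, source_vecs, threshold):
    """Return (doc_id, score) pairs whose score reaches the threshold,
    sorted from most to least similar.

    source_vecs is a hypothetical mapping {doc_id: vector}.
    """
    scored = (
        (doc_id, cosine_similarity(suspect_vec, vec))
        for doc_id, vec in source_vecs.items()
    )
    hits = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-dimensional vectors, for illustration only.
    sources = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(retrieve_candidate_sources([1.0, 0.0], sources, threshold=0.49))
```

In practice the threshold would be the value tuned on the development set; the 0.49 in the toy call is only a placeholder for such a value.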
Training the Sent2Vec model
For the creation of this model, we used the same training corpus presented in the previous section. However, instead of working with document vectors, we trained our model on the sentences of the training corpus. Figure 5 illustrates the training phase of the Sent2Vec model. As indicated in Table 1, the model was trained on a set of 4,258,449 sentences.
Compute similarity
With the sentence vectors at hand, we compare each sentence of the suspect document with the sentences of the source documents, using the Cosine measure to compute the similarity between the vectors. The obtained similarity scores are compared to a predefined threshold to decide whether a sentence in the suspect document is plagiarized. The threshold is identified empirically using the development set. Our result is the set of plagiarized sentences together with an indication of their origin in the corresponding source documents.
Hybridization
In our method, we consider 3 types of hybridization. The first performs the extrinsic method on the suspect document and then the intrinsic process. The second performs intrinsic plagiarism detection and then forwards the detected plagiarized segments to the extrinsic method to search for a set of possible source documents. The third performs each method on its own for a given suspect document and then compares the obtained results; the final result of the plagiarism detection is the combination of the results of both methods. We plan to experiment with all 3 types of hybridization. We started with the third type; the other two will be considered in future work.
Table 3. Intrinsic plagiarism detection results
Extrinsic to intrinsic
In this approach, we first perform the extrinsic plagiarism detection method. However, the content that search engines can reach does not represent the totality of the content existing on the Internet; the remainder is what is referred to as “the invisible Web”. The invisible Web is estimated to be thousands of times larger than the content found with general search engine queries. According to some statistics, search engines have access to less than half of 1% of all Web pages available on the Internet. Consequently, it may not be possible to retrieve a set of matching source documents for a given suspect document. In this case, intrinsic plagiarism detection becomes necessary to enhance the result of the extrinsic method.
Intrinsic to extrinsic
We recall that, in an intrinsic plagiarism detection method, a segment is considered plagiarized if a deviation in its writing style is detected. In this type of approach, we therefore first perform the intrinsic process to identify a set of plagiarized segments in the suspect document. Afterwards, rather than working on the whole suspect document, we search for source documents matching the suspect segments using the extrinsic process. Such a method reduces the run time of the plagiarism detection process, given that only parts of the suspect document, and not the whole document, are analysed with the extrinsic approach.
Comparison of the two methods
In this type of approach, two plagiarism detection processes are executed on the same suspect document: it is checked by the extrinsic and intrinsic methods, each on its own. Afterwards, we compare the obtained results. If both methods identify the same parts as plagiarized, these parts are the final result of the method. Otherwise, the intersection of the two results is taken as the final result.
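The decision rule of this third type of hybridization can be sketched as follows. The sketch assumes that each method reports the plagiarized parts as a collection of comparable identifiers, for instance sentence indices; this representation and the function name combine_detections are assumptions, since the paper does not specify how the detected parts are encoded.

```python
def combine_detections(extrinsic_parts, intrinsic_parts):
    """Combine the parts flagged by the two methods.

    Both arguments are iterables of hashable identifiers (for example
    sentence indices). If the two methods agree, their common result is
    returned; otherwise only the parts flagged by both are kept.
    """
    extrinsic = set(extrinsic_parts)
    intrinsic = set(intrinsic_parts)
    if extrinsic == intrinsic:
        # Full agreement: the shared result is the final result.
        return sorted(extrinsic)
    # Disagreement: keep the intersection of the two results.
    return sorted(extrinsic & intrinsic)


if __name__ == "__main__":
    print(combine_detections({1, 2, 3}, {2, 3, 4}))
```

When the two results are identical, their intersection is that same set, so the two branches could be merged into a single intersection; they are kept separate here only to mirror the wording of the rule.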
Experiments and results
In this section, we present our experiments and results for the intrinsic and extrinsic plagiarism detection methods respectively.
Intrinsic method
To detect intrinsic plagiarism, we employed linguistic and statistical features. First, after preprocessing the suspect document, we segmented it into paragraphs of fixed length. Second, using the Flesch-Kincaid test, we computed the readability of each segment. If we detect a sudden change in the readability score, from easy to difficult or vice versa, the corresponding segment is marked as plagiarized. Next, we computed the frequency of spelling errors within the different segments: if the number of spelling errors in a segment exceeds twice the number of sentences in that segment, the segment is marked as plagiarized. Afterwards, we used the term frequency technique to transform the segments into vector representations and computed the similarity between the different vectors using the Cosine measure. More precisely, we compute the similarity of each segment against the whole document except itself: a segment with a low score compared to the others is marked as plagiarized. Finally, we analysed the suspect segments identified by our features to obtain the list of suspect parts and mark them as plagiarized.
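Two of these rules, the spelling-error rule and the leave-one-out Cosine comparison on term-frequency vectors, can be sketched as follows. The paper does not say how low a score must be to count as "low compared to the others"; the cut-off used below (more than k standard deviations below the mean score) is a hypothetical choice made for the example, as are all function names. Counting spelling errors requires a spell checker, so the error count is taken as an input.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def too_many_spelling_errors(n_errors, n_sentences):
    """Rule from the text: flag a segment whose number of spelling errors
    exceeds twice its number of sentences."""
    return n_errors > 2 * n_sentences


def cosine(tf_a, tf_b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * tf_b.get(term, 0) for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leave_one_out_scores(segments):
    """For each segment (a list of tokens), compute its similarity to the
    rest of the document, i.e. all the other segments taken together."""
    counts = [Counter(seg) for seg in segments]
    total = sum(counts, Counter())
    # Counter subtraction removes the segment's own terms from the total.
    return [cosine(tf, total - tf) for tf in counts]


def flag_low_scores(scores, k=1.0):
    """Hypothetical cut-off: flag scores below mean - k * standard deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    return [i for i, score in enumerate(scores) if score < mu - k * sigma]


if __name__ == "__main__":
    segments = [
        ["le", "chat", "dort"],
        ["le", "chat", "mange"],
        ["le", "chat", "joue"],
        ["quantum", "flux", "tensor"],
    ]
    scores = leave_one_out_scores(segments)
    print(scores, flag_low_scores(scores))
```

Comparing each segment with the rest of the document rather than with every other segment individually keeps the number of comparisons linear in the number of segments.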
We tested our method on two different corpora in two languages, English and French (both corpora were artificially built for the purpose of intrinsic plagiarism detection). To our knowledge, no intrinsic plagiarism detection method has been proposed for the French language, so we compared our results on the English corpus with the results of our previous work [22]. Table 3 presents our results on the English corpus.
We tested our method on the PAN-16 and PAN-17 corpora for the style breach task, in addition to a corpus collected from various news articles and children’s stories found on the Internet. In our corpus, each document consists of 30% plagiarized text and 70% original text.
As shown in Table 3, our method outperforms the results reported in [22]. Unlike our previous work, we combined linguistic and statistical features, and this combination proved to outperform a purely statistical method [22]. An overview of the system created for plagiarism detection is given in the Appendix.
Extrinsic method
The Doc2Vec and Sent2Vec embeddings are trained using the gensim [35] library. For the first model, we created a CBOW model trained for 160 epochs with a vector size of 300, a window size of 8, and an alpha of 0.025. The development set is used to tune the threshold value, which is fixed to 0.49. To validate the results of our system, we compared them to a baseline system based on term frequency vectors and the Cosine similarity measure. Throughout the rest of the paper, we refer to this baseline as the TF-IDF system. We also used the development set to tune the threshold value of the TF-IDF system, which is fixed to 0.96.
For both systems, each suspect document is compared to all source documents and a Cosine score is calculated. Table 4 gives an overview of the Cosine scores obtained for one suspect document, selected from the test set, compared to a subset of source documents, for both the baseline system and our proposed method. The plus sign (
Cosine score between a suspect and several source documents [23]
As Table 4 shows, the Cosine scores calculated using the Doc2Vec vector representations are more widely spread, whereas the scores of the TF-IDF system are very close for all source documents. Taking into account the tuned threshold of each system (0.96 for TF-IDF and 0.49 for our system), 5 out of 8 documents are correctly detected as sources of the suspect document with our method, whereas only 3 are detected with the baseline system. Overall, we achieved an accuracy of 62% with our method, compared to 52% with the TF-IDF system. Since the embeddings proved superior to the baseline method, we intend to continue our experiments with them [24, 23].
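The decision rule applied to these scores can be illustrated with a short sketch. The scores below are invented for illustration and are not the values of Table 4; they are only chosen so that the Doc2Vec scores are spread out while the TF-IDF scores cluster near the threshold. The function name and the strict inequality are likewise assumptions.

```python
def detected_sources(scores, threshold):
    """Return the ids of source documents whose Cosine score exceeds the threshold."""
    return [doc_id for doc_id, score in scores.items() if score > threshold]


# Hypothetical scores for one suspect document (NOT the values of Table 4).
doc2vec_scores = {"src1": 0.71, "src2": 0.52, "src3": 0.47, "src4": 0.63,
                  "src5": 0.30, "src6": 0.55, "src7": 0.12, "src8": 0.50}
tfidf_scores = {"src1": 0.97, "src2": 0.95, "src3": 0.96, "src4": 0.98,
                "src5": 0.94, "src6": 0.97, "src7": 0.95, "src8": 0.95}

# Each system uses its own tuned threshold.
found_doc2vec = detected_sources(doc2vec_scores, threshold=0.49)
found_tfidf = detected_sources(tfidf_scores, threshold=0.96)
```

When scores cluster tightly, as in the TF-IDF case, a small shift of the threshold changes many decisions at once, which is one reason a wider spread of scores is easier to work with.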
For the Sent2Vec model, we also created a CBOW model, trained for 40 epochs with a vector size of 10, a window size of 8, and an alpha of 0.025. The development set is used to tune the threshold value, which is fixed to 0.30. In our experiments, we created a set of fake documents (as presented in Table 2); in addition to copy/paste cases, we also created a corpus containing paraphrase cases. The experiments are still in progress, so we do not have final results yet; however, the preliminary results are promising.
In this paper, we presented a hybrid plagiarism detection method. For the intrinsic plagiarism detection process, we proposed a method combining statistical and linguistic features, which improved the performance of the plagiarism detection system: we achieved an F-score of 0.328. For the extrinsic plagiarism detection process, we proposed a system based on document and sentence representations, employing Doc2Vec and Sent2Vec models. For the document representation, we compared our method with a baseline plagiarism detection system in which each document is represented by its TF-IDF vector. Our system is trained and evaluated using a data set created from the Gutenberg Project. Our results show that the Doc2Vec document representation outperforms the representation based on conventional TF-IDF vectors. Overall, we obtained a plagiarism detection accuracy of 62%. We also worked on a sentence vector representation, and the preliminary results are promising. As future work, we intend to continue our experiments with the Sent2Vec model and to experiment with different approaches to hybridization, in order to see which is the most efficient.
Appendix
In this appendix, we present an overview of the plagiarism detection system. Figs 6 and 7 briefly present the process of our intrinsic method, whereas Figs 8 and 9 present our extrinsic method.
Screen for opening a document.
Overview on the process of intrinsic plagiarism detection in a selected document.
Result of the source retrieval step for the extrinsic process.
Results of the extrinsic plagiarism detection.
Figure 6 gives an overview of the interface for selecting and opening a file from the corpus. For the English corpus, the files are extracted from the PAN@CLEF corpus and from children’s storybooks found on the Web, whereas the French corpus was created using documents from the Gutenberg Project.
Figure 7 presents an example of intrinsic plagiarism detection in a selected suspect document. After loading the document, we perform the intrinsic plagiarism detection process and obtain as a result the detected plagiarized segments. Our system colours the detected parts in red.
Figure 8 gives an overview of the result returned by the source retrieval step. Based on the value of the threshold, our system labels each source document as a source of plagiarism or not. The retrieved list is passed to the next steps of our extrinsic plagiarism detection method.
Figure 9 illustrates a sample of the result returned by the extrinsic method. The figure shows the decision attributed to each comparison based on the threshold, which is equal to 0.30. The output can be interpreted as follows: for example, the line highlighted in red gives the Cosine similarity (0.392539) between the second line of the suspect document and the third line of the first source document in our collection of retrieved matching documents. Our system labelled it as plagiarized since the value is higher than the fixed threshold.
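The line-by-line comparison behind this output can be sketched as follows. The sketch assumes that sentence vectors are already available (in our system they would come from the Sent2Vec model; here they are plain lists of numbers), and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_sentences(suspect_vecs, source_vecs, threshold=0.30):
    """List (suspect line, source line, score) for every pair above the threshold."""
    matches = []
    for i, s in enumerate(suspect_vecs):
        for j, t in enumerate(source_vecs):
            score = cosine(s, t)
            if score > threshold:  # labelled as plagiarized
                matches.append((i, j, score))
    return matches


# Toy two-dimensional vectors standing in for sentence embeddings.
suspect = [[1.0, 0.0], [0.0, 1.0]]
source = [[1.0, 0.0], [1.0, 1.0], [0.0, -1.0]]
pairs = match_sentences(suspect, source)
```

Each returned triple corresponds to one labelled line in the output: the indices identify the suspect and source lines, and the score is the value compared against the threshold.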
