On the use of word embedding for cross language plagiarism detection

Abstract

Cross language plagiarism is the unacknowledged reuse of text across language pairs. It occurs when a passage of text is translated from a source language to a target language and no proper citation is provided. Although various methods have been developed for detecting cross language plagiarism, less attention has been paid to measuring and comparing their performance, especially when tackling different types of paraphrasing through translation. In this paper, we investigate various approaches to cross language plagiarism detection. Moreover, we present a novel approach to cross language plagiarism detection using word embedding methods and explore its performance against other state-of-the-art plagiarism detection algorithms. In order to evaluate the methods, we have constructed an English-Persian bilingual plagiarism detection corpus (referred to as HAMTA-CL) comprising seven types of obfuscation. The results show that the word embedding approach outperforms the other approaches with respect to recall when encountering heavily paraphrased passages. On the other hand, the translation based approach performs well when precision is the main consideration of the cross language plagiarism detection system.

Keywords

Cross-language plagiarism detection, low resource languages, distant language pairs, text re-use

1. Introduction

Plagiarism is the unacknowledged reuse of others’ ideas or text without giving proper credit [37]. In recent years, researchers have enjoyed easy access to a wide range of information via the Internet, especially across languages. Unfortunately, this also makes plagiarism easier to commit. There have been many attempts to detect plagiarism, especially across languages. Stein et al. [38] proposed a generic three-step retrieval process for a plagiarism detection (PD) system; the retrieval process for external plagiarism detection is depicted in Fig. 1. In the candidate retrieval step, a heuristic search retrieves potential source documents. In the text alignment step, the suspicious document is exhaustively compared against the selected source documents. In the final step, knowledge-based post-processing, detected fragments with proper citation are discarded because they are not plagiarized. The result is presented to a human expert, who takes the final decision. It should be noted that textual similarity detection methods are not exactly plagiarism detection methods. Plagiarism occurs when someone deliberately copies a passage of text without proper attribution, while these methods only detect textual similarities. Therefore, it is not enough to recognize text similarities and treat them as plagiarized passages [11].

Figure 1.

Generic retrieval process for external plagiarism detection.

The problem of detecting plagiarism does not end at language boundaries. When plagiarism is done through translation, it is known as cross-language plagiarism. Nowadays, a vast amount of knowledge is created in rich resource languages like English, and students and researchers, especially in low resource languages, are motivated to bring that knowledge to their language through translation. Cross Language Plagiarism Detection (CLPD) systems try to find plagiarism cases between language pairs: given suspicious documents in one language $L_1$ and possible source documents in another language $L_2$, the task is to identify the reused text. Text reuse detection across languages is even harder between distant language pairs [5].

In this paper we have focused on English-Persian language pairs as two distant languages, in which English is a rich resource language, while Persian is a low resource one. Therefore, the task of a CLPD system is to find the source document(s) in English for the given suspicious document in Persian in order to detect possible re-use of text.

Applying current algorithms to CLPD in Persian has some drawbacks. In this research we address the following problems:

•

Persian is a less-resourced language with a low degree of representation on the Web and few NLP algorithms and techniques. Such languages are often referred to as low profile languages. Because of the shortage of resources and tools in Persian, machine translation tools do not work well, especially when dealing with heavily paraphrased passages.

•

Persian and English are distant languages, so some approaches such as ordinary cross-lingual character n-gram (CL-CNG) algorithm cannot be applied in an English-Persian text reuse detection system.

•

Persian is an Arabic-script based language. Basic preprocessing tasks in this language, such as normalization, stemming and recognizing word and multi-token word boundaries, pose many problems [10].

•

Various types of paraphrasing can occur when a text passage is translated from one language into another. In other words, translation and paraphrasing can be considered two connected natural language processing tasks. To investigate the effect of paraphrasing via translation, we compiled a plagiarism detection corpus with different types of paraphrasing, such as summarizing, splitting a sentence into two or more sentences, merging two or more sentences into one, and heavy paraphrasing of the sentence in the target language. Since CLPD systems perform differently on different types of paraphrasing, the proposed corpus allows different approaches to be evaluated properly. To the best of our knowledge, no previous work has considered these differing types of paraphrasing.

Word embedding methods showed their effectiveness in text similarity in recent years [17]. In this paper, we present a word embedding based approach to cross language plagiarism detection and compare its performance against five different categories of algorithms for the task of CLPD.

The rest of the paper is organized as follows. Section 2 describes some of the recent works in the field of cross-language plagiarism analysis. Section 3 presents our methodology for investigating various CLPD algorithms with a broad range of parameters. The data preparation and corpus construction for evaluating the algorithms are also described in this section. Section 4 gives a detailed description of the experiments carried out in our work. Finally, Section 5 includes the discussion and the future work.

2. Related work

In this section, we present some of the previous methods on cross-language plagiarism detection. Figure 2 depicts the taxonomy of various approaches on CLPD. As shown in the figure, there are five main categories for the task of CLPD. Moreover, some of the recent works have tried to combine different approaches to benefit from advantages of two or more methods. In the following subsections, we describe the recent approaches based on the above mentioned taxonomy.

Figure 2.

Taxonomy of approaches to CLPD.

2.1 Lexical based approaches

Lexical based approaches try to compare multilingual documents without using translation systems or any multilingual data resources. They analyze cross-lingual similarity by considering the structural and lexical similarity between languages. The Cross-Language Character N-Gram model (CL-CNG), which uses overlapping character n-gram tokens, was proposed in [21]. The method relies on the lexical similarity between languages that share a similar syntactic structure (e.g., related European language pairs). The results obtained for European languages show accuracy competitive with language-specific approaches. However, because distant languages differ in lexicon and writing alphabet, this method cannot detect similarity between them [41].
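The limitation for distant language pairs can be illustrated with a minimal sketch of character n-gram comparison. This is not the CL-CNG implementation of [21]; the normalization, the choice of n = 3, and the three example phrases are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after a simple normalization."""
    # Lowercase, replace non-alphanumeric characters by spaces, collapse spaces.
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    cleaned = " ".join(cleaned.split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Illustrative phrases (assumed examples, not taken from any corpus).
english = "international plagiarism detection"
spanish = "detección de plagio internacional"
persian = "تشخیص سرقت علمی"

# Related languages written in the same alphabet share many trigrams.
related = cosine(char_ngrams(english), char_ngrams(spanish))
# Different scripts share no trigram at all, so the score collapses to zero.
distant = cosine(char_ngrams(english), char_ngrams(persian))
print(related, distant)
```

Because every trigram of the Persian phrase contains at least one Arabic-script character, the English-Persian score is zero regardless of meaning, which is the behaviour described above.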

2.2 Thesaurus based approaches

Thesaurus based approaches use multilingual resources to transform passages in different languages into a single language-independent form. BabelNet and EuroWordNet are the most popular resources for cross language tasks including CLPD. BabelNet is a very large, wide-coverage multilingual semantic network that is automatically constructed by integrating lexicographic and encyclopedic knowledge from WordNet and Wikipedia [24]. BabelNet version 3.7 covers more than 270 languages and is made up of about 14 million entries, called Babel synsets.¹

¹
http://www.babelnet.org/stats.

The EuroWordNet is a multilingual database of words and their relations for most European languages (i.e. English, Danish, Italian, Spanish, German, French and Czech) and contains sets of synonyms and relations between them [9].

The MLPlag system proposed by Ceska et al. [9] is based on the analysis of word positions for plagiarism detection across languages. The approach uses the EuroWordNet thesaurus to transform words into a language-independent form. For ambiguous words, two words from different languages are considered plagiarized if one sense of a word matches a sense of the word in the other language. They also compared the influence of multilingual pre-processing and of two similarity measures, symmetric and asymmetric, on the performance of their plagiarism detection system [9].

An approach to identify very similar documents among a collection of candidate documents has been proposed in [35]. The proposed method is based on representing the document contents by a vector of thesaurus terms from a multilingual thesaurus, and measuring the similarity between vectors. In their proposed method, they used a “Length Factor” based on the observation of differences between the lengths of original and translated texts in Spanish, French and English. They found that the variation of the length difference approximately follows a normal distribution and considered it as a factor for computing the similarity between documents. The proposed length factor has also been used as a separated score for measuring cross-lingual text similarity for CLPD in [18].
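A length factor of this kind can be sketched as a Gaussian-shaped score over the ratio of the two text lengths. The functional form below and the values of `MU` and `SIGMA` are assumptions for illustration; they are not taken from [35] or [18], where the parameters are estimated per language pair.

```python
from math import exp


def length_factor(src_len: int, tgt_len: int, mu: float, sigma: float) -> float:
    """Score how plausible tgt_len is as the length of a translation of src_len.

    mu is the expected target/source length ratio and sigma its standard
    deviation; both are assumed to be estimated beforehand from parallel text.
    """
    if src_len <= 0:
        raise ValueError("src_len must be positive")
    ratio = tgt_len / src_len
    return exp(-0.5 * ((ratio - mu) / sigma) ** 2)


# Hypothetical parameters: translations are on average 10% longer than the source.
MU, SIGMA = 1.1, 0.3

plausible = length_factor(1000, 1100, MU, SIGMA)    # ratio equals MU
implausible = length_factor(1000, 3000, MU, SIGMA)  # far too long to be a translation
print(plausible, implausible)
```

The score is close to 1 when the length ratio matches the expected one and decays quickly as the ratio moves away from it, so it can be multiplied into, or reported alongside, a content-based similarity.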

The CL-CTS method proposed in [18] measures cross-lingual similarity based on a conceptual thesaurus, representing documents in the conceptual space of the domain-specific Eurovoc thesaurus. The model represents documents as vectors after filtering stop words, stemming, and applying a term frequency weighting scheme. In the final step, the similarity between vectors is computed using the cosine similarity measure, in addition to named entity matching and the “Length Model” similarity.

A knowledge graph-based approach is proposed in [13], using BabelNet to obtain and compare context models of document fragments in different languages. To build the knowledge graph, a set of concepts is first extracted from each fragment of text in the different languages. In the next step, a set of paths (P) is obtained by searching BabelNet for paths between each pair of concepts. The knowledge graph is constructed by joining the paths from P. The concepts and relations are then weighted based on their degree of relatedness. Finally, to compare pairs of fragments in different languages, the resulting graphs are compared using a conceptual graph similarity algorithm. Their results show that the proposed model performs better than other lexical and distributional models [13].

2.3 Translation based approaches

Translation based models use dictionaries and machine translation systems to translate the suspicious document into the source language and then perform a monolingual analysis. Automatic machine translation tools have been used by many researchers for detecting cases of cross-lingual plagiarism [27, 25, 26, 20]. In the translation plus monolingual analysis (T+MA) approach, the suspicious documents are translated into the same language as the source documents, and monolingual PD methods are applied to find plagiarized cases. Different monolingual PD methods (e.g. word n-gram similarity and vector space model similarity detection) can be applied to compare the resulting monolingual documents. However, the accuracy of CLPD systems based on these approaches is constrained by the availability and quality of translators; in other words, it is upper bounded by the quality of the machine translation tools [19].

2.4 Corpora based approaches

Corpora based approaches use different multi-lingual resources to train similarity detection models. Although most methods use sentence aligned parallel corpora, some approaches have been proposed based on comparable resources.

The cross-language explicit semantic analysis (CL-ESA) retrieval model was proposed in [33] for cross-language similarity analysis. It extends the earlier explicit semantic analysis (ESA) model. ESA uses a document collection $D$ of $n$ documents and measures the cosine similarity of a target document $d$ with each document in $D$. Each document is thus represented by an $n$-dimensional vector $V$, where the $i$th entry of $V$ is the cosine similarity between $d$ and the $i$th document of $D$. The similarity between two documents under the ESA model is defined as the similarity (e.g. cosine similarity) between the resulting vectors. CL-ESA uses the same principle to compare documents in different languages. For this purpose, a collection $D$ of comparable Wikipedia documents in different languages is used: a document $d$ in language $L_1$ is compared, by cosine similarity, with the part of $D$ written in the same language. As in ESA, the similarity between documents is then the cosine similarity between the resulting vectors [33].

Cross-lingual latent semantic analysis (CL-LSA) has been proposed by Rehder et al. [36] to construct a multilingual semantic space. LSA creates a reduced-dimension feature space by applying singular value decomposition (SVD) on word-document matrix, in which words that occur in similar contexts are near to each other. The proposed CL-LSA method uses manually or automatically translated documents to create a set of bilingual training documents. Based on the structure of training documents that contain terms from both languages, the resulting LSA model is a bilingual vector space.

An approach for cross lingual plagiarism detection using statistical bilingual dictionaries based on the IBM-1 alignment model has been proposed in [7, 28]. Given the suspicious and source texts $x$ and $y$ (written in different languages), the goal is to answer the question “Is $x$ plagiarized (and translated) from $y$?”. To this end, documents are divided into fragments, and the objective is to decide whether a suspicious fragment $x$ is a plagiarism case of one of the source fragments $y$. To determine whether $x$ is plagiarized from any fragment $y$, the probability $p(y|x)$ is calculated for each pair of fragments.
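As a rough illustration of how $p(y|x)$ can be scored with a statistical bilingual dictionary, the sketch below uses the standard IBM Model 1 likelihood (a product over the words of $y$ of the summed lexical translation probabilities over the words of $x$ plus a NULL word). It is not the implementation of [7, 28]; the toy dictionary `T`, the transliterated Persian tokens and the probability floor are hypothetical.

```python
from math import log


def ibm1_log_prob(x_tokens, y_tokens, t, floor=1e-9):
    """Return log p(y|x) under IBM Model 1.

    t maps (y_word, x_word) pairs to lexical translation probabilities.
    A NULL word (None) is prepended to x, as in the standard model.
    floor avoids log(0) for words the dictionary cannot explain (an assumption).
    """
    x_with_null = [None] + list(x_tokens)
    # Uniform alignment term: 1 / (|x| + 1)^|y|
    logp = -len(y_tokens) * log(len(x_with_null))
    for y_word in y_tokens:
        total = sum(t.get((y_word, x_word), 0.0) for x_word in x_with_null)
        logp += log(max(total, floor))
    return logp


# Hypothetical toy dictionary: (English word, transliterated Persian word) -> probability.
T = {
    ("book", "ketab"): 0.9,
    ("read", "khandan"): 0.8,
}

SUSPICIOUS = ["ketab", "khandan"]   # fragment x (toy example)
SOURCE_MATCH = ["read", "book"]     # candidate source fragment y
SOURCE_OTHER = ["green", "tree"]    # unrelated source fragment

print(ibm1_log_prob(SUSPICIOUS, SOURCE_MATCH, T))
print(ibm1_log_prob(SUSPICIOUS, SOURCE_OTHER, T))
```

Ranking the source fragments by this score for each suspicious fragment gives the candidate plagiarism cases; working in log space keeps the product numerically stable for long fragments.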

This model has been tested on a mini-corpus of original and plagiarized text pairs. Moreover, in [28] the statistical approach based on IBM-1 was evaluated on different cross-lingual NLP tasks such as bilingual text classification, cross-language information retrieval and cross-language plagiarism detection. In contrast to approaches that ignore or do not take full advantage of multilinguality, the aim of the approach was to capture word correlation across languages. The results obtained in the different tasks show the benefits of the IBM-1 model and the advantages of learning cross-lingual information directly from cross-lingual resources.

2.5 Word Embedding approaches

Word Embedding (WE) methods, which map words or phrases to vectors of real numbers, have shown tremendous success in numerous NLP tasks in recent years. Owing to their good performance, some of the more traditional distributional representation models have been fully replaced by these novel approaches. Cross-lingual word embedding models try to learn features (embeddings) for each word in such a way that similar words within each language are assigned similar embeddings (meeting the monolingual objective function) and similar words across languages also have similar representations (meeting the cross-lingual objective function) [17]. To achieve this goal, different approaches have used different bilingual resources (i.e. parallel corpora, word aligned corpora or comparable corpora).

Word embedding methods have been used for monolingual plagiarism detection by Gharavi et al. [16]. In this paper we investigate the performance of cross-lingual word embedding in the task of CLPD.

2.6 Hybrid approaches

In addition to the approaches mentioned above for measuring cross-lingual similarity and detecting cross-language plagiarism, which use different resources to train their models, some recent hybrid methods try to combine the benefits of different approaches to improve accuracy.

A new model based on knowledge graph and continuous space representation of words has been proposed in [15, 14]. The presented method basically follows the previously proposed CL-KGA model [13]. For weighting the obtained BabelNet semantic relations, instead of using the BabelNet’s relation weights, the continuous Skip-gram and SenVec models have been used in this approach [15]. Moreover, in this research the impact of relevant aspects of the model has been studied for the task of CLPD which includes word sense disambiguation (WSD), vocabulary expansion, language independence and representation by similarities with a collection of concepts. The obtained results show the importance of WSD for improving the model’s performance in the task of cross-language plagiarism detection [15].

Ferrero et al. presented different syntax-based, dictionary-based, context-based and MT-based methods and a hybrid method by combining some of these approaches for the task of cross-lingual textual similarity in SemEval-2017, named as CompiLIG system [12]. Among all of their runs, the Cross-Language Conceptual Thesaurus-based Similarity (CL-CTS) achieved the best result, which consists of representing texts as bag of words (or concepts) to compare them [12]. As a hybrid method, the most similar words from the embedding space have been added to the main concepts of the sentences from a multi-lingual semantic network. In other words, they use word embedding methods to enrich the basically extracted concept from the thesaurus [12].

In this paper, we have investigated various approaches that have proved effective in CLPD and compared them with each other. It should be noted that a comprehensive investigation and comparison of monolingual plagiarism detection algorithms in Persian has been done in a PAN-FIRE shared task on plagiarism detection [2]. In this paper, we focus on English-Persian CLPD. In short, our contributions are as follows:

•
Benchmarking the state-of-the-art CLPD and cross-lingual text similarity detection approaches
•
Investigating the performance of CLPD approaches applied to low resource languages (e.g. Persian)
•
Applying the above mentioned approaches on the HAMTA-CL corpus with various types of obfuscation (paraphrasing)

Moreover, in this research we have investigated a cross-lingual word embedding method for the task of plagiarism detection and compared it to previously proposed approaches. To compare their performance, we have applied all of the above mentioned approaches to the HAMTA-CL corpus with its various types of obfuscation.

3. Our approach

In this section, the selected methods and the evaluation framework for measuring the performance of algorithms are described in detail. Moreover, the proposed approach on CLPD based on cross-lingual word embedding is also described.

3.1 Choosing the algorithms

In order to compare the performance of the word embedding method against the other algorithms, a collection of state-of-the-art approaches including CL-ESA, CL-KGA, CL-LSA and T+MA were selected and applied to the proposed HAMTA-CL corpus. Due to the lexical and syntactic differences between Persian and English as two distant languages, the CL-CNG method cannot detect cases of similarity, so we excluded this method.

CL-ESA: Cross-language explicit semantic analysis (CL-ESA) was proposed in [33] as a cross-lingual retrieval model. In this model, a document $d$ in language $l$ is represented as an ESA vector $\mathbf{d}$ by computing its cosine similarity with the index collection $D$ in the corresponding language $l$. Likewise, a document $d'$ in language $l'$ is represented as a vector $\mathbf{d}'$ by computing the cosine similarity of $d'$ with the index collection $D'$ in language $l'$. The similarity between two documents under the ESA model is defined as the similarity (e.g. cosine similarity) between the resulting vectors.

As mentioned in [33], the collection D should contain documents from a broad range of domains, and each index document should be of “reasonable” length. Since a subset of the Wikipedia documents can fulfill both properties, we used a collection of 20000 comparable Wikipedia articles for the training phase. The selected articles cover a broad range of topics, have both Persian and English Wikipedia pages, and are more than 500 words long in both languages.

In our experiments, the suspicious and corresponding source documents have been split into sentences. We embed each sentence in the source and suspicious documents into vectors using CL-ESA. For this purpose, each sentence has been compared with a collection of 20000 documents under cosine similarity measure. The Persian sentences have been compared with Persian Wikipedia pages and English ones have been compared against equivalent English pages. To detect cases of plagiarism between documents, the cosine similarity between derived vectors in two documents has been computed.
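The sentence-level CL-ESA comparison described above can be sketched as follows. This is a toy illustration, not our experimental code: the two three-document index collections stand in for the 20000 aligned Wikipedia articles, the Persian side is written with transliterated placeholder tokens, and raw term counts are an assumed weighting.

```python
from collections import Counter
from math import sqrt


def sparse_cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def dense_cosine(u, v) -> float:
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = sqrt(sum(p * p for p in u)) * sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0


def bow(text: str) -> Counter:
    return Counter(text.lower().split())


def esa_vector(sentence: str, index_docs) -> list:
    """Represent a sentence by its similarity to every index document."""
    s = bow(sentence)
    return [sparse_cosine(s, bow(doc)) for doc in index_docs]


# Toy aligned index collections: INDEX_EN[i] and INDEX_FA[i] cover the same topic.
INDEX_EN = ["book library reading author", "tree forest leaf wood", "sea fish water wave"]
INDEX_FA = ["ketab ketabkhaneh khandan nevisandeh", "derakht jangal barg choob", "darya mahi ab moj"]

source = esa_vector("the author wrote a book", INDEX_EN)
suspicious_same = esa_vector("nevisandeh yek ketab nevesht", INDEX_FA)
suspicious_other = esa_vector("mahi dar darya", INDEX_FA)

same_topic = dense_cosine(source, suspicious_same)
other_topic = dense_cosine(source, suspicious_other)
print(same_topic, other_topic)
```

The two sentences are never compared word by word; they become comparable only because position i of each ESA vector refers to the same topic in both index collections.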

CL-LSA: The goal of cross-lingual latent semantic analysis (CL-LSA) is to construct a multilingual semantic space. The CL-LSA method uses manually or automatically translated documents to create a set of bilingual training documents. Because the training documents contain terms from both languages, the resulting LSA model is a bilingual vector space. In our experiment, we train a CL-LSA model on sentence aligned parallel corpora. To detect cases of text similarity between source and suspicious documents, both documents are split into sentences. The trained LSA model is used to map each sentence into the low-dimensional LSA space. The resulting sentence vectors are compared using the cosine similarity measure to detect cases of similarity between source and suspicious documents.

T+MA: In the Translation plus Mono-lingual Analysis approach, the suspicious documents are translated from Persian into English using the Google Translate API. The Vector Space Model (VSM) is used to convert the sentences of the resulting English documents and of the source documents into vectors. As in the previous models, the resulting sentence vectors are compared using the cosine similarity measure to detect cases of similarity between source and suspicious documents.
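The monolingual half of T+MA can be sketched as below. The translation step is deliberately left out: the list `translated` is assumed to already hold the machine-translated suspicious sentences, and raw term frequency is an assumed weighting, since the VSM weighting scheme is not fixed by the description above.

```python
from collections import Counter
from math import sqrt


def tf_vector(sentence: str) -> Counter:
    """Bag-of-words term frequency vector of one sentence."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_matches(translated_suspicious, source_sentences):
    """For each translated suspicious sentence, return (best source index, score)."""
    results = []
    for sentence in translated_suspicious:
        vec = tf_vector(sentence)
        scores = [cosine(vec, tf_vector(src)) for src in source_sentences]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((best, scores[best]))
    return results


# Toy data: English source sentences and (assumed) machine translations
# of Persian suspicious sentences.
source = [
    "the river flows through the old city",
    "plagiarism detection compares suspicious and source texts",
]
translated = [
    "the river passes through the ancient city",
    "football is popular",
]

matches = best_matches(translated, source)
print(matches)
```

The first translated sentence still matches its source despite two substituted words, while heavier paraphrasing would remove more of the shared vocabulary; this is the dependence on translation quality and lexical overlap noted for translation based approaches.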

CL-KGA: The goal of CL-KGA is to exploit explicit semantics for the representation of documents [13]. CL-KGA provides a context model by generating knowledge graphs that expand and relate the original concepts from suspicious and source paragraphs. Finally, the similarity is measured in a semantic graph space [13].

Given a source document and a suspicious document, we compare document fragments in the following steps, as described in [13]. We segment the input document into a set of fragments using a 5-sentence sliding window with a 2-sentence step. The fragments are lemmatized and tagged according to their grammatical category. Knowledge graphs are built from the tagged fragments using BabelNet. Finally, we compare these graphs to measure the similarity between fragments.
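The segmentation step can be written down directly. The sketch below covers only the 5-sentence window with a 2-sentence step; how the tail of a document is handled (here the last window simply stops at the final sentence) is an assumption, and the graph construction with BabelNet is not shown.

```python
def sliding_fragments(sentences, window=5, step=2):
    """Split a list of sentences into overlapping fragments.

    Each fragment holds up to `window` consecutive sentences and consecutive
    fragments start `step` sentences apart. Iteration stops as soon as a
    fragment reaches the end of the document (assumed tail handling).
    """
    fragments = []
    for start in range(0, len(sentences), step):
        fragments.append(sentences[start:start + window])
        if start + window >= len(sentences):
            break
    return fragments


document = [f"s{i}" for i in range(1, 10)]  # nine placeholder sentences s1..s9
for fragment in sliding_fragments(document):
    print(fragment)
```

With nine sentences this yields three fragments (s1 to s5, s3 to s7, s5 to s9), so every sentence boundary falls inside at least one fragment together with its neighbours.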

Cross-lingual word embedding: In this paper we investigate BILBOWA (Bilingual Bag-of-Words without Alignments) as one of the state-of-the-art cross lingual word embedding methods. This algorithm is a fast and simple method for learning distributed representation of bilingual words [17]. BILBOWA does not rely on word aligned parallel data, and this makes the algorithm appropriate for less resourced languages (e.g. Persian). The model tries to learn both mono and cross lingual word embedding using joint optimization. In order to train mono-lingual vectors, the well-known mono-lingual word embedding methods (e.g. CBOW and Skip-Gram [22]) have been used.

In our experiments, the source and suspicious documents are split into sentences. BILBOWA is used to convert the constituent words into 200-dimensional vectors. We use a simple averaging approach to combine word vectors into sentence vectors. It has been shown that the averaging approach has the best performance for the task of sentence embedding for semantic similarity detection [39]. The resulting sentence vectors are compared using the cosine similarity measure to detect cases of similarity between source and suspicious documents.
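The averaging and comparison steps can be sketched as follows, assuming the bilingual embeddings have already been trained and loaded into a dictionary. The 3-dimensional vectors in `EMB` are made-up stand-ins for the 200-dimensional BILBOWA vectors, the Persian words are transliterated placeholders, and skipping out-of-vocabulary words is an assumption.

```python
from math import sqrt


def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens; zero vector if none is known."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    norm = sqrt(sum(p * p for p in u)) * sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0


# Hypothetical shared bilingual space: translation pairs lie close together.
EMB = {
    "book": [0.90, 0.10, 0.00], "ketab": [0.88, 0.12, 0.02],
    "read": [0.70, 0.30, 0.10], "khandan": [0.72, 0.28, 0.08],
    "sea": [0.00, 0.20, 0.90], "darya": [0.05, 0.18, 0.92],
}

english = sentence_vector(["read", "book"], EMB, dim=3)
persian_same = sentence_vector(["ketab", "khandan"], EMB, dim=3)
persian_other = sentence_vector(["darya"], EMB, dim=3)

print(cosine(english, persian_same))   # high: same meaning, different word order
print(cosine(english, persian_other))  # low: unrelated content
```

Averaging discards word order, which is convenient here because English and Persian order their constituents differently; a sentence pair is flagged when its cosine similarity exceeds a chosen threshold.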

3.2 Evaluation framework

For investigating the performance of the CLPD algorithms, an evaluation framework is required. The framework is comprised of an evaluation corpus along with evaluation measures. In the following subsections we will thoroughly describe the construction of the HAMTA-CL English-Persian corpus and the measures that have been employed for evaluation of the methods.

3.2.1 Corpus construction

In order to compare the performance of different algorithms on English-Persian plagiarism detection, a CLPD evaluation corpus should be constructed. In this section, we first review some of the recently developed corpora and then describe our methodology for building an English-Persian plagiarism detection corpus.

– Previous works on CLPD corpus construction: The PAN plagiarism detection corpus PAN-PC-09 includes a set of cross-language plagiarism cases across two language pairs [34]. Among the different types of obfuscation, cross-language cases account for more than 10%; they consist of plagiarized fragments automatically translated from German and Spanish to English. The subsequent PAN-PC-10 [29] and PAN-PC-11 [31] corpora contain 14% and 11% cross-language cases of plagiarism, respectively. Moreover, to improve the quality of the cross-language corpus, 1% of the automatically translated fragments of PAN-PC-11 have been manually corrected.

A cross-language plagiarism detection corpus has been constructed by Ceska et al. [9] for evaluating CLPD methods using the JRC-EU and Fairy-tale multilingual corpora. The corpus consists of 200 English reports from JRC-EU and 27 English documents from Fairy-tale as source documents, and the same number of documents in Czech as the suspicious ones.

A cross-language PD corpus has been compiled in [30] using 23,000 JRC-Acquis parallel corpus documents and 45,000 Wikipedia documents, in which 10,000 aligned documents have been used to test the algorithms.

The first English-Persian corpus was proposed by Asghari et al. [1] in the PAN 2015 text alignment corpus construction shared task. An English-Persian sentence aligned parallel corpus was used to compile cases of plagiarism across the two languages. Plagiarized fragments in the suspicious documents were constructed from Persian sentences, and the corresponding source fragments from English sentences. To account for the degree of obfuscation in plagiarized fragments, a combination of sentences with different similarity scores was chosen. The number of sentences in a fragment and their similarity scores determine one of four degrees of obfuscation.

A Hindi-English corpus which includes 5032 English documents from Wikipedia and 388 Hindi documents has been used for the CL!TR task on cross-language text re-use detection [6]. To generate cases of plagiarism, the participants are asked to write a short answer to a set of questions either by re-using the source documents or by using learning materials. To simulate real cases of plagiarism, they asked participants to answer questions with 4 different levels of obfuscation including: near copy, light revision, heavy revision and no-plagiarism [6].

A multi-style multi-granularity corpus for cross-language textual similarity detection has been proposed by Ferrero et al. [11]. The proposed corpus is in French, English and Spanish and is based on a parallel corpus along with a comparable corpus. Both human translated texts from multiple types of authors and also machine translated texts have been used for constructing the corpus. They have prepared different granularities in document-level, sentence level and chunk-level (noun chunks).

– HAMTA-CL corpus construction: In constructing a plagiarism detection corpus, some text fragments from the source documents should be inserted into the suspicious documents in order to simulate plagiarism. To obtain more realistic cases of plagiarism, the text fragments should be paraphrased (obfuscated). Paraphrases are alternative ways of conveying the same information [3], so plagiarists use paraphrasing to change the word forms while keeping the same meaning. In constructing the HAMTA-CL corpus, we have focused on creating various types of paraphrasing. None of the above mentioned corpora have considered such versatile types of paraphrasing in creating bilingual PD corpora. It should be mentioned that translation is inherently a paraphrasing mechanism. In [3] it has been shown that the task of generating paraphrases can be accomplished using bilingual parallel corpora. The authors also defined a paraphrase probability derived from a phrase-based statistical machine translation (SMT) approach, which allows paraphrases to be ranked by translation probabilities. Bosma and Callison-Burch [8] have investigated how paraphrasing can be accomplished via translation. In this paper, in order to incorporate different kinds of paraphrasing techniques into a bilingual CLPD corpus, we have considered the following obfuscation approaches:

•
Simple Translation (STR): Creating plagiarized passages by combining topically related sentences from a parallel corpus.
•
Artificial (ART): Creating plagiarized passages by combining topically related sentences from a parallel corpus, along with artificial obfuscation in the target language.
•
Paraphrasing (PAR): Creating plagiarized passages by combining topically related sentences from a parallel corpus, along with human aided paraphrasing in the target language. In this type of obfuscation, a monolingual paraphrasing is done in the target language regardless of the source language.
•
Summarization (SUM): Translation plus human aided summarization of the passage in the target language.
•
Circular Translation (CTR): Translation from the source language $L_1$ to a different language $L_3$, followed by translation into the target language $L_2$.
•
Split (SPL): Translation plus dividing the sentence in the target language into two or more sentences.
•
Merge (MRG): Translation plus combining two or more sentences in the target language into one sentence.

Figure 3 shows the flow diagram for constructing the cross-language PD corpus with the above mentioned paraphrasing techniques. Wikipedia articles are used as the primary resource for creating the HAMTA-CL corpus. Because of its scale, context and open accessibility, Wikipedia is the best available resource for compiling such a corpus. Because document length matters in compiling a realistic PD corpus, 1904 Wikipedia documents with a variety of lengths were selected to compile the proposed corpus. The Parsivar pre-processing toolkit was used to normalize the Wikipedia documents [23]. The document statistics are presented in Table 1.

Table 1
Corpus statistics

Document purpose	Number of documents	1904
	% of source documents (English)	59%
	% of suspicious documents (Persian)	41%
Document length	Short (1–400 words)	67%
	Medium (400–2000 words)	28%
	Long (2000–17000 words)	5%
	Average number of words per document	482
	Average number of sentences per document	23
	Smallest document (by words)	55
	Largest document (by words)	16685

Since real plagiarism occurs at different lengths, a broad range of lengths is considered when creating potential plagiarized fragments. Fragment lengths are distributed between 20 and 300 words to simulate all types of plagiarism; the distribution of fragment lengths is shown in Table 2 and Fig. 4. Moreover, the statistics of the different types of obfuscation in the proposed corpus are presented in Table 3.

Table 2
Plagiarism case statistics

Case length	Short (20–50 words)	36%
	Medium (50–100 words)	42%
	Long (100–300 words)	22%

Figure 3. Flow diagram of bilingual PD corpus construction.

As shown in Fig. 4, the Merge (MRG) fragments have the shortest average length among all the passages, while the Summarization (SUM) fragments have the longest. This is because we selected long passages from the source documents so that the summarization process would be easier for crowd-workers. Moreover, the mean length of plagiarized passages is almost the same across all obfuscation types except summarization.

Table 3
Statistics of different types of obfuscation

Obfuscation	Number of fragments	% of fragments
Simple translation (STR)	498	29%
Artificial (ART)	495	29%
Paraphrasing (PAR)	185	10%
Summarization (SUM)	134	8%
Circular translation (CTR)	187	10%
Split (SPL)	144	9%
Merge (MRG)	58	5%

Table 4
Ratio of plagiarism fragments in documents

Plagiarism per document	Ratio
Hardly (5%–10%)	50%
Medium (11%–25%)	17%
Much (26%–60%)	33%

Figure 4. Length distribution of the fragments.

We also cover different situations concerning the ratio of plagiarized fragments per suspicious document. To this end, a wide range of plagiarism ratios is considered, from hardly (i.e., a low ratio of plagiarized fragments per suspicious document) to much (i.e., most of the document is plagiarized), as shown in Table 4.

– Corpus evaluation and validation: Some efforts have been made in previous work to evaluate plagiarism detection corpora. Potthast et al. proposed automatic and manual methods to evaluate and validate the corpora submitted to the first shared task on plagiarism detection data submission [32]. Manual and automatic measures for evaluating PD corpora have also been proposed in [40].

The corpus was validated automatically by considering the ratio of the length of plagiarized passages to the length of the documents, and the distribution of plagiarized passages across the documents. Moreover, the quality of the plagiarized fragments was checked manually. The constructed corpus is freely available to the research community.²
²
http://www.ictrc.ac.ir/corpus/HAMTA-CL.rar.
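
The automatic length-ratio check described above can be sketched as follows. This is only an illustration: measuring lengths in words, treating the Table 4 ranges as contiguous (so that, for example, 10.5% counts as Medium) and the function names are all assumptions, not details given by the corpus construction procedure.

```python
from typing import Sequence


def plagiarism_ratio(doc_length: int, fragment_lengths: Sequence[int]) -> float:
    """Fraction of a suspicious document covered by inserted fragments.

    Lengths are assumed to be word counts (an assumption of this sketch).
    """
    if doc_length <= 0:
        raise ValueError("doc_length must be positive")
    return sum(fragment_lengths) / doc_length


def ratio_bucket(ratio: float) -> str:
    """Map a ratio to the Table 4 labels; values outside 5%-60% are flagged."""
    pct = ratio * 100
    if 5 <= pct <= 10:
        return "Hardly"
    if 10 < pct <= 25:
        return "Medium"
    if 25 < pct <= 60:
        return "Much"
    return "out of range"


# A 1000-word suspicious document containing fragments of 50 and 30 words.
example_bucket = ratio_bucket(plagiarism_ratio(1000, [50, 30]))
```

Documents that fall into the "out of range" bucket would be the ones to inspect or regenerate during validation.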

3.2.2 Evaluation measure

The usual measures for evaluating the performance of NLP algorithms are precision, recall and F-measure. In plagiarism detection tasks, character-level precision and recall are used. Besides these measures, another measure that characterizes the quality of a detection algorithm has been defined in [34, 4]: whether a plagiarism case is detected as a whole or in several pieces. Granularity quantifies whether the contiguity between plagiarized text passages is properly recognized. A low granularity simplifies both the human inspection of algorithmically detected passages and an algorithmic analysis within a potential post-process [34]. To capture this characteristic, the granularity of R under S is defined as follows:

$\displaystyle \mathit{gran}(S,R)=\frac{1}{|S_{R}|}\sum_{s\in S_{R}}|R_{s}|$ (1)

Here $S_{R}\subseteq S$ denotes the cases detected by detections in $R$, and $R_{s}\subseteq R$ denotes the detections of a given case $s$ [34]. The value of $\mathit{gran}(S,R)$ lies in $[1,|R|]$, with 1 indicating the desired one-to-one correspondence and $|R|$ indicating the worst case. Precision and recall (both at character level) and granularity are combined into an overall score based on the following equations:

$\displaystyle \mathit{Precision}(S,R)=\frac{1}{|R|}\sum_{r\in R}\frac{\left|\bigcup_{s\in S}(s\sqcap r)\right|}{|r|}$ (2)

$\displaystyle \mathit{Recall}(S,R)=\frac{1}{|S|}\sum_{s\in S}\frac{\left|\bigcup_{r\in R}(s\sqcap r)\right|}{|s|}$ (3)

$\displaystyle F_{1}=2\cdot\frac{\mathit{Precision}\cdot\mathit{Recall}}{\mathit{Precision}+\mathit{Recall}}$ (4)

The three measures can be applied in isolation, but they can also be combined into a single, overall performance score as follows [34]:

$\displaystyle \mathit{Plagdet}(S,R)=\frac{F_{1}}{\log_{2}(1+\mathit{gran}(S,R))}$ (5)

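where $S$ denotes the set of plagiarism cases in the suspicious documents of the corpus, $R$ denotes the set of plagiarism cases reported by the detector for these documents, $s\sqcap r$ denotes the characters shared by $s$ and $r$ when $r$ detects $s$ (and the empty set otherwise), and $F_{1}$ denotes the $F$-measure.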
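
To make the measures concrete, the following minimal sketch computes character-level precision, recall, granularity and Plagdet for a single suspicious document. It is a simplification under stated assumptions: cases and detections are represented as non-empty half-open character intervals, a detection is taken to detect a case whenever the two share at least one character, and the function names are illustrative. The combined score follows the definition in [34], which divides $F_{1}$ by $\log_{2}(1+\mathit{gran})$, so that a granularity of 1 leaves $F_{1}$ unchanged (as in the rows of Table 5 with granularity 1).

```python
from math import log2
from typing import List, Set, Tuple

Span = Tuple[int, int]  # non-empty half-open character interval [start, end)


def chars(span: Span) -> Set[int]:
    """Character offsets covered by one span."""
    return set(range(span[0], span[1]))


def covered(spans: List[Span]) -> Set[int]:
    """Union of the character offsets of all spans."""
    return set().union(*(chars(x) for x in spans))


def precision(S: List[Span], R: List[Span]) -> float:
    if not R:
        return 0.0
    plagiarized = covered(S)
    return sum(len(chars(r) & plagiarized) / len(chars(r)) for r in R) / len(R)


def recall(S: List[Span], R: List[Span]) -> float:
    if not S:
        return 0.0
    detected = covered(R)
    return sum(len(chars(s) & detected) / len(chars(s)) for s in S) / len(S)


def granularity(S: List[Span], R: List[Span]) -> float:
    # Number of detections overlapping each case; undetected cases are skipped.
    counts = [sum(1 for r in R if chars(s) & chars(r)) for s in S]
    hits = [c for c in counts if c > 0]
    return sum(hits) / len(hits) if hits else 1.0


def plagdet(S: List[Span], R: List[Span]) -> float:
    p, r = precision(S, R), recall(S, R)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return f1 / log2(1 + granularity(S, R))


# Two true cases; the first is reported in two pieces, the second only partly.
S = [(0, 100), (200, 300)]
R = [(0, 50), (60, 100), (250, 350)]
score = plagdet(S, R)
```

In this example the first case is split across two detections, so the granularity rises above 1 and the Plagdet score is lower than $F_{1}$ alone.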

4. Experiments

We investigate the performance of CL-LSA, CL-ESA, BILBOWA, CL-KGA and T+MA on the task of English-Persian plagiarism detection in the following experiments.

4.1 Experiment 1

In this experiment, we measure the performance of the CL-ESA, CL-LSA, BILBOWA, CL-KGA and T+MA algorithms in detecting cases of plagiarism on the whole HAMTA-CL corpus. Our goal is to measure the performance of existing CLPD methods on the proposed corpus, which contains various types of obfuscation, and to compare them against BILBOWA.

The graphs of precision, recall and F1 measure versus different similarity thresholds are depicted in Fig. 5.

Figure 5. Performance of algorithms on the HAMTA-CL corpus in terms of cosine similarity between sentence pairs.

As shown in the figure, the T+MA algorithm obtains the best F1 and precision on the whole corpus. Moreover, BILBOWA outperforms the other approaches with respect to recall over different ranges of the cosine similarity threshold. CL-ESA obtains the worst results among all the algorithms, especially in terms of precision. In the F1 graph, T+MA performs well when the threshold is below 0.15; above this threshold, its performance decreases monotonically. CL-LSA shows a similar trend, except that its best performance is achieved at a threshold of about 0.35. In contrast, the performance of BILBOWA and CL-ESA remains nearly constant over most of the threshold range. The best performance of BILBOWA is achieved at a threshold of about 0.98, where it obtains an F1 of 0.61.
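
The threshold analysis above can be illustrated with a short sketch that scores sentence pairs by cosine similarity and sweeps a decision threshold. It is a simplified, pair-level illustration rather than the character-level evaluation used in the experiments; the toy scores, the gold labels and the function names are assumptions introduced only for this example.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sweep(scored_pairs: List[Tuple[float, bool]],
          thresholds: Sequence[float]) -> List[Tuple[float, float, float, float]]:
    """Return (threshold, precision, recall, F1) for each threshold.

    scored_pairs holds (similarity, is_plagiarized) for each sentence pair;
    a pair is flagged when its similarity is at or above the threshold.
    """
    total_positive = sum(1 for _, gold in scored_pairs if gold)
    rows = []
    for t in thresholds:
        flagged = [gold for sim, gold in scored_pairs if sim >= t]
        tp = sum(flagged)
        p = tp / len(flagged) if flagged else 0.0
        r = tp / total_positive if total_positive else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        rows.append((t, p, r, f1))
    return rows


# Toy similarity scores with gold labels (illustrative values only).
pairs = [(0.9, True), (0.8, False), (0.4, True), (0.1, False)]
curve = sweep(pairs, [0.0, 0.5, 0.95])
```

Raising the threshold flags fewer pairs, so recall can only fall or stay the same while precision may move in either direction; the best F1 lies at the threshold where the two are best balanced for a given method.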

4.2 Experiment 2

In the second experiment, the performance of the CLPD methods is computed on separate sub-corpora containing different types of paraphrasing. Our purpose is to evaluate the capability of the methods in detecting different types of obfuscation.

Figure 6. Performance measure for different types of obfuscation.

The performance of the methods in recognizing cases of plagiarism with different types of paraphrasing is presented in Fig. 6. As shown in the figure, the T+MA method outperforms the other approaches in F1 and precision, while BILBOWA achieves the best recall for different types of obfuscation. Although T+MA performs better than the other methods overall, BILBOWA comes close to T+MA on more complicated types of paraphrasing (e.g. Merge and Split) and outperforms T+MA in detecting summarization obfuscation. Table 5 shows the Plagdet and granularity obtained by each method. We discuss the performance of the methods on the sub-corpora from three perspectives as follows.

In the first perspective, we focus on the performance of the algorithms over the sub-corpora. Since T+MA is a precision-oriented method, it performs best when the obfuscation complexity is low. For instance, the performance of T+MA on the STR sub-corpus is the highest among all the sub-corpora. BILBOWA is a semantic-based, recall-oriented method, so when the obfuscation complexity is high (as in summarization) it outperforms the other algorithms. For CL-LSA, precision increases with the threshold until the threshold reaches about 0.7, after which it decreases. Since recall is very low above this threshold, the change in F1 is barely perceptible. CL-ESA is a recall-oriented semantic approach, and its precision is very low and does not improve as the threshold increases. This method also has the lowest recall among the algorithms, which decreases rapidly at a threshold of about 0.7. Since CL-ESA is inherently a semantic approach, it works better on fragments with complex obfuscation.

Table 5

Comparison of performance measures in each method vs. different types of obfuscation

Obfuscation	CL-ESA				CL-LSA				BILBOWA				T+MA				CL-KGA
	Recall	Precision	Granularity	Plagdet	Recall	Precision	Granularity	Plagdet	Recall	Precision	Granularity	Plagdet	Recall	Precision	Granularity	Plagdet	Recall	Precision	Granularity	Plagdet
Whole corpus	0.41	0.21	1	0.28	0.63	0.49	1	0.55	0.64	0.55	1	0.59	0.83	0.83	1	0.83	0.55	0.70	1	0.61
STR	0.40	0.21	1	0.28	0.61	0.64	1	0.62	0.62	0.59	1	0.60	0.93	0.84	1	0.88	0.59	0.70	1	0.64
ART	0.52	0.16	1	0.24	0.62	0.47	1	0.54	0.60	0.43	1	0.50	0.90	0.80	1	0.84	0.62	0.73	1	0.67
PAR	0.39	0.18	1	0.25	0.57	0.45	1	0.50	0.52	0.49	1.001	0.50	0.84	0.80	1	0.82	0.55	0.50	1	0.52
SUM	0.64	0.39	1.004	0.48	0.68	0.60	1.07	0.61	0.80	0.79	1.008	0.79	0.71	0.90	1.03	0.78	0.68	0.56	1.03	0.60
CTR	0.45	0.16	1	0.23	0.55	0.51	1	0.53	0.45	0.43	1	0.44	0.71	0.79	1	0.75	0.51	0.62	1	0.56
SPL	0.39	0.09	1.001	0.15	0.52	0.30	1.01	0.38	0.59	0.47	1.02	0.52	0.57	0.75	1.05	0.62	0.62	0.24	1.04	0.34
MRG	0.66	0.16	1.01	0.26	0.45	0.46	1.002	0.46	0.65	0.59	1.066	0.59	0.62	0.84	1.004	0.71	0.82	0.25	1.09	0.36

In the second perspective, we focus on the effect of obfuscation complexity on the performance of the above mentioned methods. In general, the performance of the methods is expected to decrease as the obfuscation complexity increases. In simple translation (STR), since there is no obfuscation in the fragments, all of the methods have their best results. In artificial obfuscation (ART), the fragments are constructed automatically, while the fragments in paraphrase obfuscation (PAR) have been changed manually by humans. As a result, the PAR fragments are more complex than the ART fragments. As shown in Fig. 6, the algorithms perform better on artificial obfuscation than on paraphrase obfuscation. Merge (MRG) and Split (SPL) obfuscation disrupt the structure of sentences, whereas all of the methods work on a sequence of individual sentences to detect cases of plagiarism. Therefore, the worst performance of the methods occurs with the MRG and SPL types of obfuscation. Moreover, the performance of all of the methods on SPL obfuscation is lower than on MRG. Finally, summarization (SUM) appears to be the most complex obfuscation, but since the summarized passages are relatively long (compared with the merge and split passages), the performance of the methods on SUM is better than on MRG and SPL.

In the third perspective, we consider the sensitivity of the methods on each sub-corpus with respect to the whole HAMTA-CL corpus, as shown in Table 6. The table shows that, of all the methods, BILBOWA changes the least on Merge (MRG) and Split (SPL) obfuscation with respect to the whole corpus. In other words, MRG and SPL obfuscation degrade the performance of the other algorithms more than that of BILBOWA. On the other hand, BILBOWA shows the largest change in performance among all of the approaches in the case of Artificial (ART) and Paraphrase (PAR) obfuscation. Finally, the changes under simple translation (STR) are comparatively small for all methods, while summarization (SUM) yields the largest gains, notably for CL-ESA and BILBOWA.

Table 6

Changes in the performance of algorithms with respect to the whole corpus

Obfuscation	CL-ESA	CL-LSA	BILBOWA	T+MA	CL-KGA
Simple translation (STR)	0	+7.4	+1	+5.1	+2.5
Artificial (ART)	−3.6	−1.4	−9.2	+1.5	+5.5
Paraphrasing (PAR)	−2.5	−4.7	−9.1	−0.9	−9.1
Summarization (SUM)	+20.7	+5.6	+20.0	−5.1	−1.3
Circular translation (CTR)	−4.4	−2.1	−15.1	−7.8	−5.6
Split (SPL)	−12.3	−17.5	−7.6	−20.8	−27.6
Merge (MRG)	−1.9	−9.5	−0.3	−12.2	−25.8
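
The entries of Table 6 appear to correspond to the difference between the Plagdet score on a sub-corpus and the Plagdet score on the whole corpus, expressed in percentage points. This reading is an inference from Tables 5 and 6 rather than a stated definition. Under that assumption, the following sketch recomputes the changes from the rounded Plagdet values of Table 5; because those values are rounded to two decimals, the results match Table 6 only approximately. The dictionary layout and the function name are illustrative.

```python
from typing import Dict

# Plagdet values copied from Table 5 (rounded to two decimals in the source).
PLAGDET: Dict[str, Dict[str, float]] = {
    "CL-ESA":  {"Whole": 0.28, "STR": 0.28, "ART": 0.24, "PAR": 0.25,
                "SUM": 0.48, "CTR": 0.23, "SPL": 0.15, "MRG": 0.26},
    "CL-LSA":  {"Whole": 0.55, "STR": 0.62, "ART": 0.54, "PAR": 0.50,
                "SUM": 0.61, "CTR": 0.53, "SPL": 0.38, "MRG": 0.46},
    "BILBOWA": {"Whole": 0.59, "STR": 0.60, "ART": 0.50, "PAR": 0.50,
                "SUM": 0.79, "CTR": 0.44, "SPL": 0.52, "MRG": 0.59},
    "T+MA":    {"Whole": 0.83, "STR": 0.88, "ART": 0.84, "PAR": 0.82,
                "SUM": 0.78, "CTR": 0.75, "SPL": 0.62, "MRG": 0.71},
    "CL-KGA":  {"Whole": 0.61, "STR": 0.64, "ART": 0.67, "PAR": 0.52,
                "SUM": 0.60, "CTR": 0.56, "SPL": 0.34, "MRG": 0.36},
}


def delta_points(plagdet_by_corpus: Dict[str, float]) -> Dict[str, float]:
    """Change of each sub-corpus score relative to the whole corpus,
    in percentage points (assumed interpretation of Table 6)."""
    whole = plagdet_by_corpus["Whole"]
    return {name: round((score - whole) * 100, 1)
            for name, score in plagdet_by_corpus.items() if name != "Whole"}


changes = {method: delta_points(scores) for method, scores in PLAGDET.items()}
```

For example, the recomputed change for BILBOWA on SUM is 20 points and the change for T+MA on SPL is 21 points in the negative direction, close to the +20.0 and −20.8 reported in Table 6.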

5. Conclusion and future works

In this paper, we presented a novel approach to cross language plagiarism detection using word embedding methods and explored its performance against other state-of-the-art plagiarism detection algorithms. Moreover, we investigated various algorithms on the task of cross language plagiarism detection. We categorized the methods, described their pros and cons and compared them on the task of CLPD, focusing on English and Persian as two distant languages. We also investigated the performance of CLPD approaches applied to Persian as a low resource language.

To investigate the performance of the algorithms, a corpus comprising seven different types of obfuscation was constructed. The simulated cases of plagiarism were compiled by graduate crowd workers, while the artificial ones were compiled automatically. To validate the corpus, it was checked automatically with respect to the ratio of the length of plagiarized passages to the length of the documents and the distribution of plagiarized passages across the documents. Moreover, to evaluate the corpus, the quality of the plagiarized fragments was checked manually. We compared the performance of the algorithms on the whole corpus and also on separate sub-corpora containing different types of paraphrasing.

To compare the methods on CLPD, we implemented five algorithms and evaluated them using the constructed corpus. The performance of the algorithms in detecting cases of plagiarism with different types of paraphrasing showed that the T+MA method outperforms the other approaches in F1 and precision, while BILBOWA achieved the best recall for different types of obfuscation. The results also show that BILBOWA can detect more complicated types of plagiarism (e.g. Merge, Split and Summarization).

As future work, we plan to improve the performance of the above mentioned algorithms by taking Persian-specific features into account. Another direction for future research is bilingual plagiarism detection when the source and target languages are both less resourced.

References

1. Asghari, Khoshnava, Fatemi and Faili, Developing bilingual plagiarism detection corpus using sentence aligned parallel corpus: Notebook for PAN at CLEF 2015, In Cappellato, Ferro, Jones G.J.F. and SanJuan, editors, Working Notes of CLEF 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of CEUR Workshop Proceedings, CEUR-WS.org, 2015.

2. Asghari, Mohtaj, Fatemi, Faili, Rosso and Potthast, Algorithms and corpora for persian plagiarism detection: overview of PAN at FIRE 2016, In Majumder, Mitra, Mehta, Sankhavara and Ghosh, editors, Working notes of FIRE 2016 – Forum for Information Retrieval Evaluation, December 7–10, 2016, volume 1737 of CEUR Workshop Proceedings, CEUR-WS.org, 2016, pp. 135–144.

3. Bannard C.J. and Callison-Burch, Paraphrasing with bilingual parallel corpora, In Knight, Ng H.T. and Oflazer, editors, ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25–30 June 2005, University of Michigan, USA, The Association for Computer Linguistics, 2005, pp. 597–604.

4. Barrón-Cedeño, Potthast, Rosso and Stein, Corpus and evaluation measures for automatic plagiarism detection, In Calzolari, Choukri, Maegaard, Mariani, Odijk, Piperidis, Rosner and Tapias, editors, Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17–23 May 2010, Valletta, Malta, European Language Resources Association, 2010.

5. Barrón-Cedeño, Rosso, Agirre and Labaka, Plagiarism detection across distant language pairs, In Huang C.-R. and Jurafsky, editors, COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 23–27 August 2010, Beijing, China, Tsinghua University Press, 2010, pp. 37–45.

6. Barrón-Cedeño, Rosso, Devi S.L., Clough P.D. and Stevenson, PAN@FIRE: Overview of the cross-language !ndian text re-use detection competition, In Majumder, Mitra, Bhattacharyya, Subramaniam L.V., Contractor and Rosso, editors, Multilingual Information Access in South Asian Languages – Second International Workshop, FIRE 2010, Gandhinagar, India, February 19–21, 2010 and Third International Workshop, FIRE 2011, Bombay, India, December 2–4, 2011, Revised Selected Papers, volume 7536 of Lecture Notes in Computer Science, Springer, 2011, pp. 59–70.

7. Barrón-Cedeño, Rosso, Pinto and Juan, On cross-lingual plagiarism analysis using a statistical model, In Stein, Stamatatos and Koppel, editors, Proceedings of the ECAI’08 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, Patras, Greece, July 22, 2008, volume 377 of CEUR Workshop Proceedings, CEUR-WS.org, 2008.

8. Bosma and Callison-Burch, Paraphrase substitution for recognizing textual entailment, In Peters, Clough P.D., Gey F.C., Karlgren, Magnini, Oard D.W., de Rijke and Stempfhuber, editors, Evaluation of Multilingual and Multi-modal Information Retrieval, 7th Workshop of the Cross-Language Evaluation Forum, CLEF 2006, Alicante, Spain, September 20–22, 2006, Revised Selected Papers, volume 4730 of Lecture Notes in Computer Science, Springer, 2006, pp. 502–509.

9. Ceska, Toman and Jezek, Multilingual plagiarism detection, In Dochev, Pistore and Traverso, editors, Artificial Intelligence: Methodology, Systems, and Applications, 13th International Conference, AIMSA 2008, Varna, Bulgaria, September 4–6, 2008, Proceedings, volume 5253 of Lecture Notes in Computer Science, Springer, 2008, pp. 83–92.

10. Farghaly and Shaalan K.F., Arabic natural language processing: challenges and solutions, ACM Trans. Asian Lang. Inf. Process. 8(4) (2009), 14:1–14:22.

11. Ferrero, Agnès, Besacier and Schwab, A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection, In Calzolari, Choukri, Declerck, Goggi, Grobelnik, Maegaard, Mariani, Mazo, Moreno, Odijk and Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, May 23–28, 2016, European Language Resources Association (ELRA), 2016.

12. Ferrero, Besacier, Schwab and Agnès, CompiLIG at SemEval-2017 Task 1: Cross-language plagiarism detection methods for semantic textual similarity, In Bethard, Carpuat, Apidianaki, Mohammad S.M., Cer D.M. and Jurgens, editors, Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3–4, 2017, Association for Computational Linguistics, 2017, pp. 109–114.

13. Franco-Salvador, Gupta and Rosso, Knowledge graphs as context models: Improving the detection of cross-language plagiarism with paraphrasing, In Ferro, editor, Bridging Between Information Retrieval and Databases – PROMISE Winter School 2013, Bressanone, Italy, February 4–8, 2013, Revised Tutorial Lectures, volume 8173 of Lecture Notes in Computer Science, Springer, 2013, pp. 227–236.

14. Franco-Salvador, Gupta, Rosso and Banchs R.E., Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language, Knowl.-Based Syst. 111 (2016), 87–99.

15. Franco-Salvador, Rosso and Montes-y-Gómez, A systematic study of knowledge graph analysis for cross-language plagiarism detection, Inf. Process. Manage. 52(4) (2016), 550–570.

16. Gharavi, Bijari, Zahirnia and Veisi, A deep learning approach to persian plagiarism detection, In Majumder, Mitra, Mehta, Sankhavara and Ghosh, editors, Working notes of FIRE 2016 – Forum for Information Retrieval Evaluation, Kolkata, India, December 7–10, 2016, volume 1737 of CEUR Workshop Proceedings, CEUR-WS.org, 2016, pp. 154–159.

17. Gouws, Bengio and Corrado, BilBOWA: Fast bilingual distributed representations without word alignments, In Bach F.R. and Blei D.M., editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015, pp. 748–756.

18. Gupta, Barrón-Cedeño and Rosso, Cross-language high similarity search using a conceptual thesaurus, In Catarci, Forner, Hiemstra, Peñas and Santucci, editors, Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics – Third International Conference of the CLEF Initiative, CLEF 2012, Rome, Italy, September 17–20, 2012, Proceedings, volume 7488 of Lecture Notes in Computer Science, Springer, 2012, pp. 67–75.

19. Gupta and Singhal, Mapping hindi-english text re-use document pairs, In Majumder, Mitra, Bhattacharyya, Subramaniam L.V., Contractor and Rosso

20. Kent C.K. and Salim, Web based cross language plagiarism detection, CoRR, abs/0912.3, 2009.

21. McNamee and Mayfield, Character N-gram tokenization for european language text retrieval, Inf. Retr. 7(1-2) (2004), 73–97.

22. Mikolov, Chen, Corrado and Dean, Efficient estimation of word representations in vector space, CoRR, abs/1301.3, 2013.

23. Mohtaj, Roshanfekr, Zafarian and Asghari, Parsivar: A language processing toolkit for persian, In Calzolari, Choukri, Cieri, Declerck, Goggi, Hasida, Isahara, Maegaard, Mariani, Mazo, Moreno, Odijk, Piperidis and Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7–12, 2018, European Language Resources Association (ELRA), 2018.

24. Navigli and Ponzetto S.P., BabelNet: Building a very large multilingual semantic network, In Hajic, Carberry and Clark, editors, ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11–16, 2010, Uppsala, Sweden, The Association for Computer Linguistics, 2010, pp. 216–225.

25. Nawab R.M.A., Stevenson and Clough P.D., University of Sheffield – Lab Report for PAN at CLEF 2010, In Braschler, Harman and Pianta, editors, CLEF 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy, volume 1176 of CEUR Workshop Proceedings, CEUR-WS.org, 2010.

26. Oberreuter, L’Huillier, Rios S.A. and Velásquez J.D., Approaches for intrinsic and external plagiarism detection – Notebook for PAN at CLEF 2011, In Petras, Forner and Clough P.D., editors, CLEF 2011 Labs and Workshop, Notebook Papers, 19–22 September 2011, Amsterdam, The Netherlands, volume 1177 of CEUR Workshop Proceedings, CEUR-WS.org, 2011.

27. Pereira R.C., Moreira V.P. and Galante, A new approach for cross-language plagiarism analysis, In Agosti, Ferro, Peters, de Rijke and Smeaton A.F., editors, Multilingual and Multimodal Information Access Evaluation, International Conference of the Cross-Language Evaluation Forum, CLEF 2010, Padua, Italy, September 20–23, 2010, Proceedings, volume 6360 of Lecture Notes in Computer Science, Springer, 2010, pp. 15–26.

28. Pinto, Civera, Barrón-Cedeño, Juan and Rosso, A statistical approach to crosslingual natural language tasks, J. Algorithms 64(1) (2009), 51–60.

29. Potthast, Barrón-Cedeño, Eiselt, Stein and Rosso, Overview of the 2nd international competition on plagiarism detection, In Braschler, Harman and Pianta, editors, CLEF 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy, volume 1176 of CEUR Workshop Proceedings, CEUR-WS.org, 2010.

30. Potthast, Barrón-Cedeño, Stein and Rosso, Cross-language plagiarism detection, Language Resources and Evaluation 45(1) (2011), 45–62.

31. Potthast, Eiselt, Barrón-Cedeño, Stein and Rosso, Overview of the 3rd international competition on plagiarism detection, In Petras, Forner and Clough P.D., editors, CLEF 2011 Labs and Workshop, Notebook Papers, 19–22 September 2011, Amsterdam, The Netherlands, volume 1177 of CEUR Workshop Proceedings, CEUR-WS.org, 2011.

32. Potthast, Goering, Rosso and Stein, Towards data submissions for shared tasks: First experiences for the task of text alignment, In Cappellato, Ferro, Jones G.J.F. and SanJuan, editors, Working Notes of CLEF 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of CEUR Workshop Proceedings, CEUR-WS.org, 2015.

33. Potthast, Stein and Anderka, A wikipedia-based multilingual retrieval model, In Macdonald, Ounis, Plachouras, Ruthven and White R.W., editors, Advances in Information Retrieval, 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30–April 3, 2008, Proceedings, volume 4956 of Lecture Notes in Computer Science, Springer, 2008, pp. 522–530.

34. Potthast, Stein, Barrón-Cedeño and Rosso, An evaluation framework for plagiarism detection, In Huang C.-R. and Jurafsky, editors, COLING 2010, 23rd International Conference on Computational Linguistics, Posters Volume, 23–27 August 2010, Beijing, China, Chinese Information Processing Society of China, 2010, pp. 997–1005.

35. Pouliquen, Steinberger and Ignat, Automatic identification of document translations in large multilingual document collections, CoRR, abs/cs/060, 2006.

36. Rehder, Littman M.L., Dumais S.T. and Landauer T.K., Automatic 3-language cross-language information retrieval with latent semantic indexing, In Voorhees E.M. and Harman D.K., editors, Proceedings of The Sixth Text REtrieval Conference, TREC 1997, Gaithersburg, Maryland, USA, November 19–21, 1997, volume Special Pu, National Institute of Standards and Technology (NIST), 1997, pp. 233–239.

37. Stein, Stamatatos and Koppel, Proceedings of the ECAI’08 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, Patras, Greece, July 22, 2008, volume 377 of CEUR Workshop Proceedings, CEUR-WS.org, 2008.

38. Stein, zu Eissen S.M. and Potthast, Strategies for retrieving plagiarized documents, In Kraaij, de Vries A.P., Clarke C.L.A., Fuhr and Kando, editors, SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23–27, 2007, ACM, 2007, pp. 825–826.

39. Wieting, Bansal, Gimpel and Livescu, Towards universal paraphrastic sentence embeddings, CoRR, abs/1511.0, 2015.

40. Zarrabi, Rafiei, Khoshnava, Asghari and Mohtaj, Evaluation of text reuse corpora for text alignment task of plagiarism detection, In Cappellato, Ferro, Jones G.J.F. and SanJuan, editors, Working Notes of CLEF 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of CEUR Workshop Proceedings, CEUR-WS.org, 2015.

41. Barrón-Cedeño, Gupta and Rosso, Methods for cross-language plagiarism detection, Knowl.-Based Syst. 50 (2013), 211–217.


¹
http://www.babelnet.org/stats.