Abstract
The paper presents a Collaborative Adversarial Network (CAN) model for paraphrase identification, in which a collaborative network containing a generator is pitted against an adversarial network called the discriminator. A great deal of research has been done on sentence similarity modeling. Learning and identifying common features, particularly across different domains, is the main focus of paraphrase identification. The proposed model captures the common features between two sentences and adds collaborative learning on top of traditional adversarial learning for common feature extraction. The model outperforms the MaLSTM baseline and is comparable to many state-of-the-art techniques.
Introduction
Paraphrase identification is the task of determining whether two sentences written in natural language have the same semantic meaning. If two sentences have the same meaning, they are called paraphrases of each other. This task is fundamental to various data mining techniques. It can be a vital component of applications in numerous fields, such as plagiarism detection, machine translation, and others [1].
Detection of paraphrases typically follows one of two methodologies: a supervised methodology based mainly on machine learning (ML) and deep learning algorithms, and an unsupervised methodology based on text similarity calculations. ML algorithms treat paraphrase identification as a standard classification problem over syntactic or linguistic features.
Most generic techniques for measuring sentence similarity rely on hand-crafted features and linguistic tools, such as dependency parsing, which are expensive and error-prone [2, 3]. One well-known architecture is the Siamese recurrent neural network (RNN) [4], which uses long short-term memory (LSTM) models [5] to obtain sequential hidden states for each word.
GAN (Generative Adversarial Network) is an unsupervised deep learning technique proposed by Ian Goodfellow and colleagues, including Yoshua Bengio, in 2014. It is an adversarial framework in which the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample comes from the model distribution or from the data distribution. Hence the name Generative Adversarial Network. In a GAN, two models are trained simultaneously: a generative model that captures the data distribution and a discriminative model that estimates the probability that a sample came from the training data rather than from the generative model. The training procedure for the generative model is to maximize the probability of the discriminative model making a mistake [6].
CAN (Collaborative Adversarial Network) is a modification of GAN that is generally used for text data rather than images. In CAN, a collaborative network containing the generator is pitted against an adversarial network called the discriminator. Both the generator and the discriminator are multilayer perceptrons (MLPs). The generator's goal is to produce data that is indistinguishable from the real training data, so that the discriminator is tricked into identifying it as real. The discriminator's objective is to identify whether the data is real or fake.
This work proposes to address the paraphrase identification problem using CAN. CAN includes a common feature extractor that strengthens the correlation between sentences in the recurrent neural network model. The extractor extracts similar characteristics within a sentence pair. The integration of adversarial networks with collaborative learning mainly refers to the collaboration between the generator and the discriminator. More specifically, the generator attempts to produce features based on one of the two sentences, and the discriminator tries to tell which sentence the features were produced from (that is, the generator's source). In addition to conventional adversarial learning, this work introduces collaborative training for common feature extraction, unlike previous work, which used only the adversarial approach to learn common features across several domains.
The rest of this paper is organized as follows. Section 2 reviews previous work related to this topic. Section 3 describes the problem addressed, and Section 4 explains the proposed framework. The experimental setup and results are discussed in Sections 5 and 6, respectively.
Related work
A great deal of research has been done on sentence similarity modeling. This section mainly reviews the work most closely related to our research, together with a brief introduction to the relevant work on Generative Adversarial Networks. Since the Collaborative Adversarial Network incorporates collaborative learning into the GAN framework for common feature extraction, we focus on the most relevant research in this area.
Srivastav and Govikar [7], in 2017, presented a survey of paraphrase identification techniques for Indian languages based on semantic analysis of two text entities. Because of the importance of the task, various approaches to paraphrase detection have been proposed for the English language, such as machine learning, string similarity, syntactic similarity, logic-based, vector-based, and, most importantly, machine translation approaches. After analyzing these approaches, the vector-based approach (Bi-CNN-MI) and the machine translation approach (SimMat Metric, Sumo Metric) were considered the best techniques implemented for the English language. After reviewing the work of past decades, the authors concluded that most of the work on Indian languages covers languages such as Malayalam, Hindi, and Kannada. Even so, little research has been done for the Hindi language, and almost nothing has been done for the Marathi language.
Pawar and Mago [8], in 2019, proposed an effective algorithm for assessing semantic similarity. It relies mainly on a lexical database and can incorporate interchangeable domain-specific corpora alongside standard language. The approach was tested on recent benchmark datasets that contain standard results as well as mean human similarity scores.
Ferreira et al. [9], in 2018, presented a paraphrase identification system, noting that traditional techniques rely only on lexical and syntactic measures. Their work integrates semantic, syntactic, and lexical measures. The authors also proposed three new sentence similarity measures, which aim to improve results by combining different levels of information about the sentence. Another contribution of this work is the evaluation of various machine learning algorithms for classifying sentence pairs as paraphrases or not.
Sharma et al. [10], in 2019, investigated duplicate question detection on the Quora dataset using linear and tree-based models. A surprising finding was that CBOW outperformed the more complicated recurrent models with attention. Not only did CBOW outperform the other models, it did so by more than a percentage point.
Jianquan et al. [11], in 2019, proposed a distant supervision method for learning improved word representations by additionally incorporating synonym resources in paradigmatic relations. The results show that the proposed models not only capture the semantic information of antonyms effectively but also achieve significant improvements in both intrinsic and extrinsic evaluation tasks.
Mohammad et al. [12], in 2017, used Support Vector Regression (SVR) for paraphrase identification and semantic text similarity analysis, using lexical, syntactic, and semantic features, mainly on Arabic news and tweets. Identifying paraphrases and analyzing semantic similarity has become necessary to avoid receiving the same news post several times.
Kubal and Nimkar [13], in 2018, proposed a hybrid deep learning architecture, combining a CNN and an LSTM, that aims to capture as many features as possible from input sentences in natural language.
Li et al. [14], in 2015, proposed a similarity algorithm based on syntactic structure, building on an analysis of existing sentence similarity algorithms. The authors suggested that adding semantic similarity computation on top of syntactic similarity can greatly improve retrieval effectiveness. Their work computes word similarity with the help of the “HowNet” platform [15], a common-sense knowledge base whose basic content describes concepts represented by Chinese and English words and reveals the relationships among concepts and among their attributes [16].
Aggarwal et al. [17], in 2018, presented a robust and generic paraphrase detection model based on a deep neural network that performs well on both user-generated noisy short texts, such as tweets, and high-quality clean texts. Their work introduced a pair-wise word similarity technique that captures fine-grained semantic correspondence between each pair of words in the given sentences. The resulting model combines sentence modeling with a pair-wise word similarity matching model.
Wei et al. [18], in 2015, presented evaluations on Twitter data for two related tasks, Paraphrase Identification (PI) and Semantic Textual Similarity (SS). Given a pair of sentences, a system produces either a binary yes/no judgment or a graded score that measures their semantic equivalence.
Quan et al. [19], in 2019, proposed a new method for sentence similarity modeling. The central idea is to combine syntactic information, semantic features, and an attention weight mechanism, absorbing the advantages of each technique. The main advantages of the model are that it can be used as a general framework and that, unlike most sentence-embedding-based models, it needs no time-consuming training when word embeddings are available.
Ding et al. [20], in 2019, proposed a multi-domain adversarial neural network for text classification tasks. The architecture takes advantage of both domain adaptation and multi-task learning. A large-scale microblog corpus was collected from Sina Weibo to pre-train the initial word embeddings. An adversarial training strategy and orthogonality constraints are used to ensure that the private and shared features do not collide with one another, which improves performance on both the source domains and the target domain.
Thenmozhi et al. [21], in 2016, used an approach that determines the meaning of clauses in complex sentences by resolving conjunctions, which identifies hidden triples in the text. The approach extracts clause-based similarity features, namely concept score, relation score, and proposition score, from the texts. To measure the similarity score, the clause-based similarity features WS, RS, CS, and PS are computed by introducing a similarity extraction operator that is a variation of the Jaccard similarity coefficient. The method is evaluated on paraphrase similarity using the Microsoft Research corpus.
Problem description
Paraphrase identification here aims to predict which of the provided pairs of questions contain two questions with the same meaning. More than 100 million people visit Quora every month, so it is no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question and can make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers and offer more value to both of these groups in the long term. Since this is a binary classification problem, the target value 1 indicates that the pair of questions have essentially the same meaning or intent, and the target value 0 indicates that the pair of questions have different meanings.
Methodology
Proposed architecture
The proposed architecture is displayed in Fig. 1.

The proposed architecture.
The dataset is loaded into a data frame using the Pandas library. The data frame contains four columns, namely “id”, “question1”, “question2”, and “is_duplicate”. The “id” column contains the ids of the question pairs, the “question1” and “question2” columns contain the two questions of each pair, and the “is_duplicate” column contains the binary labels 1 and 0, which specify whether the questions are similar or not, respectively. First, the data frame is preprocessed using re, Python's built-in package for working with regular expressions. The preprocessing step converts the text to a list of words. For this purpose, the text is first cleaned using regular expressions. Stop words are then removed using the NLTK library. NLTK [22] is a popular natural language toolkit for performing various tasks in natural language processing. It supports numerous preprocessing tasks such as lemmatization, stemming, parsing, and others.
Furthermore, the sequences are iterated, and by removing the unwanted words, the final data frame is prepared.
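The cleaning step described above can be sketched as follows. This is a minimal illustration rather than the authors' actual code: the regular expression, the function name text_to_word_list, and the tiny stop-word set are assumptions made for the example (the paper uses the NLTK stop-word list and a Pandas data frame; plain dictionaries stand in for rows here).

```python
import re

# Illustrative stand-in for the NLTK English stop-word list used in the paper.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "how", "do", "i", "to", "of", "in"}


def text_to_word_list(text):
    """Lowercase a question, strip punctuation, and drop stop words."""
    text = str(text).lower()
    # Assumed cleaning rule: keep only letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [word for word in text.split() if word not in STOPWORDS]


# One hypothetical row with the four columns named in the text.
rows = [
    {
        "id": 0,
        "question1": "How do I learn Python?",
        "question2": "What is the best way to learn Python?",
        "is_duplicate": 1,
    },
]

for row in rows:
    row["q1_words"] = text_to_word_list(row["question1"])
    row["q2_words"] = text_to_word_list(row["question2"])
```

With a Pandas data frame, the same function could be applied to the “question1” and “question2” columns instead of looping over dictionaries.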
Components and layers
Word Embeddings
To map words to a vector space, pre-trained GloVe word embeddings are adopted [23]. The Common Crawl version of the GloVe embeddings is used, which was trained on 42 billion tokens and has a vocabulary of 1.9 million words. These pre-trained uncased embeddings are 300-dimensional vectors. GloVe is an unsupervised algorithm that obtains vector representations for words without the use of neural networks. It builds a co-occurrence matrix by estimating how often each word co-occurs with other words. This matrix is built for the entire corpus and is then factorized to give the word and context vectors. GloVe therefore does not rely only on local statistics but takes the global co-occurrence of words into account to obtain the word vectors. The GloVe loss function penalizes the difference between the dot product of a word vector and a context vector and the logarithm of their co-occurrence count.
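For reference, the GloVe objective described above is usually written as the weighted least-squares loss below. The notation is not taken from this paper and follows the common GloVe formulation: X_{ij} is the number of times word j occurs in the context of word i, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms, V is the vocabulary size, and f is a weighting function that limits the influence of very frequent co-occurrences.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```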
MaLSTM or Manhattan LSTM
The Manhattan LSTM (MaLSTM) model [24, 25] is a Siamese deep neural network that uses the Manhattan distance to calculate the similarity between sequences (words or sentences). The Manhattan distance [26] between two points x and y in a two-dimensional plane is given by Equation 1:
$$\mathrm{MD}(x, y) = |a - c| + |b - d| \tag{1}$$
Here (a,b) are the coordinates of the point x and (c,d) are the coordinates of point y.
The Siamese MaLSTM includes two or more related subnetworks beginning from the word embedding matrix to the outermost layer of LSTM. The architecture of MaLSTM is given in Fig. 2.

The architecture of MaLSTM.
In Fig. 2, the inputs are fixed-length sequences of word indices, in which each non-zero value uniquely identifies a word. The embedding layer uses the pre-trained GloVe embeddings to build an embedding matrix. This gives two embedded sequences, one for each sentence, which are passed through the LSTM subnetworks and then compared by the similarity function given in Equation 2:
$$\mathrm{sim}\left(h^{(l)}, h^{(r)}\right) = \exp\left(-\left\lVert h^{(l)} - h^{(r)} \right\rVert_1\right) \tag{2}$$
Here, h^{(l)} and h^{(r)} are the final hidden states of the two subnetworks, where l denotes left and r denotes right. After this step, the model predicts whether the pair is a paraphrase. In this paper, MaLSTM is used as the baseline model.
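The Manhattan distance and the MaLSTM similarity can be sketched in a few lines. The sketch assumes the usual MaLSTM similarity, the exponential of the negative L1 distance between the two final hidden states. The function names and the toy vectors are illustrative only.

```python
import math


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def malstm_similarity(h_left, h_right):
    """Similarity in (0, 1]: exp of the negative L1 distance between hidden states."""
    return math.exp(-manhattan_distance(h_left, h_right))


# Two-dimensional example with points x = (a, b) and y = (c, d).
distance_xy = manhattan_distance((1, 2), (4, 6))

# Hypothetical final hidden states for the left and right sentences.
score = malstm_similarity([0.2, -0.1, 0.4], [0.1, 0.0, 0.4])
```

Identical hidden states give a similarity of exactly 1, and the score decays toward 0 as the states move apart, which makes it convenient to compare against the binary paraphrase label.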
The CAN architecture builds on the common features of both sentences, which are explicitly extracted by the generator and fed to the discriminator. The similarity function of the baseline MaLSTM model is concatenated with the common features, which further improves the performance of our model. Furthermore, the hidden state of the MaLSTM model is fed into the RNN model, as it helps mitigate the vanishing gradient problem. The same learned hidden state is also fed as input to the common feature extractor, which explicitly extracts the common features between both sentences. The hidden state contains the contextual information from the first word up to the word currently under consideration. The sentences are then represented by pooling over these hidden states, as shown in Fig. 3. Finally, when the similarity score is calculated using the similarity function, the common features of both sentences are also included, and the probabilities are then calculated using the softmax function.

Architecture for Collaborative and Adversarial Network.
Let A and B be two sentences that are to be identified as paraphrases or not. The generator first generates new features based on the individual features of sentence B. These new features are calculated by applying the tanh function to the individual features of B, transformed by a weight matrix that is taken from the learned hidden state of the baseline model and that carries the common features of A and B. This is given by Equation 3:
$$F_n = \tanh\left(W_m F_B + S_n\right) \tag{3}$$
Here, F_n denotes the new features calculated by the generator, W_m is the weight matrix, F_B denotes the individual features of sentence B, and S_n is the bias.
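A minimal sketch of the generator's feature computation is given below, assuming the form F_n = tanh(W_m F_B + S_n) with W_m as a matrix (list of rows) and F_B and S_n as vectors. The numeric values are made up for illustration and are not taken from the paper.

```python
import math


def generator_features(W_m, F_B, S_n):
    """Compute F_n = tanh(W_m * F_B + S_n), one output per row of W_m."""
    return [
        math.tanh(sum(w * f for w, f in zip(row, F_B)) + bias)
        for row, bias in zip(W_m, S_n)
    ]


# Hypothetical weight matrix, sentence-B features, and bias.
W_m = [[0.5, -0.25], [1.0, 1.0]]
F_B = [1.0, 2.0]
S_n = [0.1, -3.0]

F_n = generator_features(W_m, F_B, S_n)
```

Because tanh squashes every component into the interval (-1, 1), the generated features stay on the same scale regardless of the magnitude of the weights.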
Now, the discriminator takes the new features and calculates the probability that these features belong to sentence A or to sentence B.
The generator behaves adversarially or collaboratively depending on two cases. First, if the two sentences are paraphrases, the generator aims to make the new features F_n closer to the individual features F_A in order to confuse the discriminator. This is the adversarial case. Second, if the sentences are not paraphrases, that is, if they are different, the new features F_n produced by the generator are made far from the individual features F_A, which helps the discriminator distinguish between the two sentences easily. This is the collaborative case.
Hence, the Collaborative Adversarial Network works by playing a collaborative and adversarial game between the generator and the discriminator.
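One possible way to express the two cases as a generator objective is sketched below. This is an interpretation under stated assumptions, not the paper's loss: the discriminator is assumed to output p_from_A, the probability that the generated features came from sentence A, while the features are in fact built from sentence B. The function name and the cross-entropy form are hypothetical.

```python
import math


def generator_loss(p_from_A, is_paraphrase, eps=1e-12):
    """Cross-entropy target for the generator in the two cases.

    p_from_A: assumed discriminator probability that F_n came from sentence A.
    is_paraphrase: True if the sentence pair is labeled as a paraphrase.
    """
    if is_paraphrase:
        # Adversarial case: reward the generator when the discriminator is
        # fooled into attributing the features to sentence A.
        return -math.log(p_from_A + eps)
    # Collaborative case: reward the generator when the discriminator
    # correctly attributes the features to sentence B.
    return -math.log(1.0 - p_from_A + eps)


# For a paraphrase pair, a confused discriminator (high p_from_A) gives low loss.
loss_fooled = generator_loss(0.9, is_paraphrase=True)
loss_not_fooled = generator_loss(0.1, is_paraphrase=True)
```

The design choice here is that the same discriminator output serves both games, and only the target label flips with the paraphrase label.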
CAN includes a common feature extractor that strengthens the correlation between sentences in the recurrent neural network model. The extractor extracts similar characteristics within a sentence pair. It uses Latent Semantic Analysis (LSA), a technique for analyzing large collections of text. LSA uses a latent structure to describe the relationship between terms and the whole text by reducing the dimensionality of the text vectors. The common features are extracted by mapping the high-dimensional text data from both questions to a lower-dimensional latent space through Singular Value Decomposition (SVD) of the high-dimensional matrix.
The generator sends the common features of both questions to the discriminator, which makes it harder for the discriminator to tell which question a feature was obtained from.
A recurrent neural network (RNN) is used to process the sequential data. The main principle of an RNN is that the output of the previous step is fed as an input to the current step. The key component of the RNN is its memory, which stores information about the sequence seen so far in the hidden state. The network uses the same parameters at every step and performs the same computation on each input to obtain the output, which reduces the number of parameters. Hence, the RNN is used in the model so that the output of previous steps can be used as input for the next.
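The recurrence described above can be illustrated with a scalar toy RNN. This is a generic sketch of a simple (Elman-style) recurrence with a tanh activation, not the network used in the paper. The parameter names w_x, w_h, and b are illustrative.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar RNN over a sequence and return every hidden state.

    The same parameters (w_x, w_h, b) are reused at each step, and the
    previous hidden state is fed back in as part of the current input.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# Hypothetical three-step input sequence.
hidden_states = rnn_forward([1.0, 0.5, -1.0], w_x=0.8, w_h=0.5, b=0.0)
```

Only three parameters are needed no matter how long the sequence is, which is the parameter sharing that the paragraph refers to.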
Collaborative adversarial network loss
The principal function of the discriminator is to differentiate between the feature labels, namely X and Y. The discriminator minimizes the cross-entropy of the predicted labels, as shown in Equation 4:
Where Θ represents parameters in D.
In previous work, the generator only plays against the discriminator, producing indistinguishable representations that the discriminator cannot recognize. Our generator differs from these earlier generators. Here, we introduce a generator that plays both collaborative and adversarial games according to sentence similarity. For similar sentences, it makes the feature label hard to discriminate. For dissimilar sentences, it generates feature values far from those of the other sentence, playing a collaborative game that makes feature discrimination simpler. In this way, our generator attempts to maximize the feature label prediction conditioned on the sentence similarity, as in Equation 5:
Where Θ d represents all the parameters in generator G.
Joint learning combines the losses of the discriminator and the generator with the sentence similarity loss, as in Equation 6:
Where
The aim is to optimize the objective function in the above equation. In each epoch and for each batch, the sentence similarity predictor and the feature generator are trained first, followed by the feature label discriminator, as shown in Algorithm 1, where B denotes the b samples in a batch, and n_epoch and n_batch are the numbers of epochs and batches. These iterations are repeated until the objective reaches its optimal value. The parameters are trained by maximizing the likelihood of the true labels over the whole corpus.
Algorithm 1
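The alternating schedule described above can be sketched as a plain training loop. This is only a skeleton under stated assumptions: the two update callbacks are hypothetical placeholders for one optimization step each, and the real update rules follow the losses in Equations 4 to 6, which are not reproduced here.

```python
def train(batches, n_epoch, update_similarity_and_generator, update_discriminator):
    """Alternate the two update steps over every batch of every epoch.

    Both update functions are hypothetical callbacks that take one batch
    and perform a single optimization step.
    """
    for _ in range(n_epoch):
        for batch in batches:
            # Step 1: train the sentence similarity predictor and the generator.
            update_similarity_and_generator(batch)
            # Step 2: train the feature label discriminator on the same batch.
            update_discriminator(batch)


# Usage with logging callbacks in place of real optimization steps.
call_log = []
train(
    batches=["batch_0", "batch_1"],
    n_epoch=2,
    update_similarity_and_generator=lambda b: call_log.append(("sim_gen", b)),
    update_discriminator=lambda b: call_log.append(("disc", b)),
)
```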
Dataset
The model is evaluated on the Quora Question Pairs dataset [27], taken from the Kaggle website, where it was used for a competition of the same name. The dataset contains 404290 pairs of Quora questions. 60% of the records, which accounts for 242574 question pairs, are used for training. The remaining 40%, or 161716 question pairs, are used as the test set. Note that although the classes are not perfectly balanced, the data is used as is, on the assumption that the imbalance does not materially affect the performance of our model.
Figure 4 shows the number of question pairs (y-axis) for each binary class, 0 and 1 (x-axis). The figure shows that approximately 240000 question pairs have an is_duplicate value of 0, which means that the questions are not similar. In contrast, approximately 160000 question pairs have an is_duplicate value of 1, which means that the questions are similar.

The number of similar and non-similar question pairs, labeled 1 and 0, respectively.
As the majority of state-of-the-art methods use accuracy as the metric for comparing results, we use the same metric for comparison with them.
The baseline MaLSTM model is compared with our model in terms of accuracy, F1 score, and Mean Reciprocal Rank (MRR).
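Accuracy is computed with Equation 7, which divides the number of correct predictions by the total number of predictions, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$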
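The F1 score combines recall and precision into a single value, which makes the balance between them easier to understand. The formula for the F1 score is given in Equation 8:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}$$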
The mean reciprocal rank (MRR) averages the reciprocal ranks of the results obtained. The reciprocal rank of a query is the multiplicative inverse of the rank of the topmost correct answer. It is given by the following formula:
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$
Here, rank_i is the rank position of the first relevant document for the i-th query, and |Q| is the number of queries.
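The three metrics can be computed directly from their definitions, as in the sketch below. The function names and the example counts and ranks are made up for illustration and are not results from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries; ranks start at 1."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Hypothetical values for illustration only.
example_accuracy = accuracy(tp=40, tn=50, fp=5, fn=5)
example_f1 = f1_score(precision=0.5, recall=0.5)
example_mrr = mean_reciprocal_rank([1, 2, 4])
```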
The model is trained on a GTX 1060 GPU, which has approximately 1280 CUDA cores and 6 GB of dedicated memory. The model is trained for 50 epochs, a number chosen to limit overfitting. The training set contains 242574 question pairs, and the batch size is 150. The test set contains 161716 question pairs, which is enough to provide a valid result for the model. The maximum input length fed to the model at a time is set to 179.
The dropout of the final layer is set to 0.5, with Adadelta [28, 29] as the optimizer. Adadelta is a more robust extension of the Adagrad optimizer, which is based on the stochastic gradient descent algorithm. Instead of accumulating all past squared gradients, it restricts the accumulation to a moving window, which allows it to keep learning even after many updates have been made. Adadelta also does not require a manually set learning rate. The PyTorch deep learning framework is used, an open-source framework that supports the path from research prototyping to production deployment. It also enables parallel computation for training networks on GPUs. CUDA version 10.1 is used.
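To make the moving-window idea concrete, the following is a small sketch of the Adadelta update rule applied to a one-dimensional toy problem. It is written from the standard description of the algorithm and is not the PyTorch implementation. The decay rate rho, the constant eps, and the toy objective f(x) = x^2 are assumptions chosen for the example.

```python
import math


def adadelta_minimize(grad_fn, x, steps, rho=0.95, eps=1e-6):
    """Minimize a 1-D function with Adadelta, given its gradient function."""
    avg_sq_grad = 0.0   # decaying average of squared gradients
    avg_sq_delta = 0.0  # decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # The step size comes from the ratio of the two running averages,
        # so no learning rate has to be chosen by hand.
        delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x += delta
    return x


# Toy objective f(x) = x^2 with gradient 2x, starting from x = 1.
x_final = adadelta_minimize(lambda x: 2 * x, x=1.0, steps=50)
```

The decaying averages play the role of the moving window: old gradients fade out instead of accumulating forever, which is what keeps the effective step size from shrinking to zero as it does in Adagrad.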
Result and analysis
The confusion matrix for the test data with the collaborative adversarial network is shown as a heatmap in Fig. 5. A heatmap is a graphical representation of matrix data, in which the values of a confusion matrix or of any other two-dimensional data are encoded as colors. The heatmap is built with the Seaborn Python library, which is built on top of Matplotlib. Seaborn provides a high-level interface whose output is rendered by Matplotlib to produce high-quality graphics.

The heatmap representing the confusion matrix.
Figure 5 displays the heatmap with the predicted labels on the horizontal axis and the actual labels on the vertical axis. It can be inferred from the heatmap that the model correctly identified 2632 true negatives and 926 true positives, along with 1141 false positives and 1281 false negatives. These four counts sum to 5980 predictions, of which 3558 are correct.
The performance of the model can be measured by various error metrics such as accuracy, precision, recall, and F1 score.
Accuracy tells us how often the model predicts the correct label, while precision measures how many of the pairs predicted as paraphrases are actually paraphrases.
Figure 6 depicts the intermediate accuracy for the training and validation for 50 epochs. It can be noted from the figure that the accuracy of both the training as well as validation increases with the increase in epochs.

Training and validation accuracy.
It can be seen from Fig. 7 that the training loss decreases more steeply than the validation loss as the number of epochs increases. The validation loss is nearly steady after the first 30 epochs, whereas the training loss keeps decreasing.

Training and validation loss.
Precision is the number of true positives divided by the sum of the true positives and the false positives. The formula for precision is given in Equation 9:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{9}$$
Here, TP refers to true positives, TN refers to true negatives, FP refers to false positives, and FN refers to false negatives.
Another metric for the error analysis is recall. It is the number of true positives divided by the sum of the true positives and the false negatives, as given in Equation 10:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{10}$$
According to the above-given formulas, the precision comes out to be 0.891, and the recall comes out to be 0.907.
Precision gives the fraction of question pairs predicted as paraphrased that are actually paraphrased. Recall, in turn, gives the fraction of actually paraphrased question pairs that the model predicted as paraphrased.
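The difference between the two metrics is easiest to see on a small worked example. The counts below are hypothetical and chosen only to make the arithmetic easy to follow. They are not the values from the confusion matrix in this paper.

```python
def precision(tp, fp):
    """Fraction of predicted paraphrases that are actual paraphrases."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual paraphrases that were predicted as paraphrases."""
    return tp / (tp + fn)


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
example_precision = precision(tp=90, fp=10)
example_recall = recall(tp=90, fn=30)
```

Here the model is right about 90 of the 100 pairs it flags as paraphrases, but it finds only 90 of the 120 paraphrase pairs that exist, so its precision is higher than its recall.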
The baseline, which is considered in this paper, used the MaLSTM model. It is a Manhattan LSTM model that uses the manhattan distance and LSTM model [30, 31] to calculate the similarity between the sequences. The following comparison table shows the comparison of the baseline model with the collaborative adversarial (CAN) model in terms of accuracy, F1 score, and the Mean Reciprocal Rank (MRR) [32].
Table 1 shows that the collaborative adversarial model performs far better than MaLSTM on the Quora Question Pairs dataset in terms of accuracy, F1 score, and mean reciprocal rank, obtaining an accuracy of 88.9%, an F1 score of 0.903, and an MRR of 0.73.
Comparison of the Collaborative Adversarial Network model with the MaLSTM baseline model on the Quora Question Pairs dataset. The best performance is highlighted in bold.
Table 2 shows that the collaborative adversarial model, while not the best, is comparable with other state-of-the-art methods. The MT-DNN model delivers the best performance of all, with an accuracy of 89.6%. The collaborative feature helps the model outperform the GenSan, BiMPM, pt-DecAtt, and Bi-CAS-LSTM models, with an accuracy of 88.9%.
Comparison of the Collaborative Adversarial Network model with other state-of-the-art methods on the Quora Question Pairs dataset. The best performance is highlighted in bold.
Two other models, RE2 and DIIN, also perform better than the collaborative adversarial network model, with accuracies of 89.2% and 89.06%, respectively.
Fig. 8 compares the performance of our model with other state-of-the-art methods. As can be seen, the accuracy of our model, CAN, is comparable with that of the other methods.

Comparison of accuracy with other state-of-the-art methods.
An accuracy of 88.9% means that in approximately 89 out of 100 cases, our model correctly predicts whether a question pair is paraphrased. Our model uses adversarial networks with common feature extraction, feeding the common features of both questions to the discriminator so as to confuse it further, which is intended to overcome the shortcomings of the other models and make our model more effective. The three models MT-DNN, RE2, and DIIN perform better than the collaborative adversarial network model, with accuracies of 89.6%, 89.2%, and 89.06%, respectively.
In this paper, we propose a paraphrase identification model based on the Collaborative and Adversarial Network (CAN), a modified version of GAN that works on text. In a nutshell, our CAN model identifies similar and dissimilar sentences using collaborative and adversarial learning. In comparison with previous work in this area, the model is shown to be comparable to, if not better than, other approaches to the same task. The results also show the potential of CAN for paraphrase identification. Future applications of paraphrase identification can be seen in machine translation, question answering, information retrieval, and summarization.
