Adam Adadelta Optimization based bidirectional encoder representations from transformers model for fake news detection on social media

Abstract

Social platforms disseminate news rapidly and have become an important news source for many people worldwide because they are easier to access and cheaper than traditional news organizations. Fake news is news deliberately written to manipulate the original content, and its rapid dissemination may mislead people in society. It is therefore critical to investigate the veracity of the information spread via social media platforms, yet the reliability of information reported on these platforms remains doubtful and is still a significant obstacle. This study proposes a technique for identifying fake information on social media called Adam Adadelta Optimization based Deep Long Short-Term Memory (Deep LSTM). Tokenization is carried out with the Bidirectional Encoder Representations from Transformers (BERT) approach. The dimensionality of the features is reduced with Kernel Linear Discriminant Analysis (LDA) and Singular Value Decomposition (SVD), and the top-N attributes are chosen by employing Renyi joint entropy. The Deep LSTM is then applied to identify false information on social media and is trained with Adam Adadelta Optimization, which combines Adam Optimization and Adadelta Optimization. The Deep LSTM based on Adam Adadelta Optimization achieved a maximum accuracy, sensitivity, and specificity of 0.936, 0.942, and 0.925, respectively.

Keywords

Fake news detection, tokenization, singular value decomposition, Adam optimizer, kernel linear discriminant analysis

1. Introduction

In recent times, most people spend much of their lives interacting online through social media, and more and more of them seek and collect news from social media instead of traditional media. The inherent nature of social media sites is the driving force behind this massive shift in news consumption. Gathering news from social platforms is generally considered time-effective and cost-effective compared with conventional news organizations such as television and newspapers. Social platforms also make it easy to share, comment on, and discuss trending information with other readers [14]. Because of these inherent properties, social media offers its audience rapid distribution at very low cost and is easily accessible by everyone at any time. The major reason for the rapid growth in users' engagement with online news is the attractive characteristics of social platforms, such as fast dissemination and user-friendliness. Although social media has gained attention among people around the world, the quality of the news distributed through such platforms is very low compared with the news distributed by conventional news organizations. This is because social media has no governing authority, which tends to cause the widespread dissemination of fake news. A large amount of substandard, highly fake news spreads rapidly online [28, 15]. Social platforms have become a fundamental source of news consumption in recent days and thus play an effective role in allowing individuals to post or comment on news.

Fake news has been defined as fabricated information that mimics the content of news media but does not follow the guidance of any organization, and it overlaps with other information disorders such as false or misleading information intentionally created to mislead people. One consequence of false information on social platforms is that it can reduce or increase the effectiveness of programs and campaigns focused on citizens' awareness, health, and well-being [17]. Fake news is false information that spreads rapidly and is deliberately created to mislead people, affecting individuals and society as a whole. The major impacts of the dissemination of fake news are as follows. First, it may disrupt the authenticity balance of the news ecosystem. Second, false information persuades the majority of citizens to believe false narratives. Finally, fake news may have a considerable impact on real-world events [16, 31, 33, 34]. The recognition of false information has received increasing attention because social media enables its large-scale dissemination and because of its harmful effects. However, detecting false reports from the news content alone is usually insufficient, as pieces of fake information are written to imitate genuine news [18].

False information has emerged as one of the key concerns in both industry and academia in recent times, and human fact-checking is one solution to this issue. However, human fact-checkers may have very low throughput, and manual fact-checking is expensive and laborious. Hence, to cope with such limitations, Deep Learning and Machine Learning have been introduced to automate this mechanism [35, 36]. For the past few years, Internet of Things (IoT) researchers have been trying to develop effective strategies for online false report detection that distinguish fake news from original content, and different hierarchical classification techniques can be employed for fake news detection [22, 37, 38]. The identification of fake news is nevertheless a difficult process, because it requires systems that review the news and compare it with the original news in order to categorize it as fake [19, 39, 40]. Fake news detection shares similarities with Natural Language Processing (NLP) tasks in which deep learning models are utilized. In [21], the Siamese Manhattan LSTM (MaLSTM) is used to estimate the semantic similarity of question pairs. The deep neural network converts a text sequence into a fixed-length vector representation, which is then employed to determine the similarity of two textual sequences [20].

The main aim of this work is to develop an efficient approach for fake news identification on social media by presenting the Adam Adadelta Optimization (AO) based Deep Long Short-Term Memory (Deep LSTM). The contributions are as follows: an effective approach is developed for detecting false information on social media using the Adam AO based Deep LSTM, and the top-level features are selected by employing the Renyi joint entropy measure, which yields data with reduced dimension.

The remaining sections of the paper are organized as follows: Section 2 reviews the existing strategies for identifying fake news. Section 3 describes the Adam AO based Deep LSTM for fake information identification, and Section 4 explains the Adam AO training algorithm. The findings and discussion of the Adam AO based Deep LSTM scheme are presented in Section 5, and the research is concluded in Section 6.

2. Literature review

Various existing methods of fake news detection on social media are discussed as follows. For effective prediction of fake short-text tweets, Lu and Li [1] developed a neural network based model named Graph-aware Co-Attention Networks (GCAN). This problem setting was more challenging than those of other conventional studies. The developed GCAN model provided early detection of fake news with high accuracy. Moreover, GCAN was developed not only for identifying false information on social media but could also be utilized for short-text classification tasks on social media, such as hate speech detection, sentiment detection, and tweet popularity prediction. However, the method did not include a generalization model to eliminate event-specific features, which would improve the overall system performance. Han et al. [2] modeled a propagation-based approach for fake news detection, which uses a Graph Neural Network (GNN) to discriminate between the propagation patterns of real and fake news on social platforms. It did not depend on any text details, yet it offered better results and higher performance. The developed model avoided re-training on the complete data, which decreases the training time. The primary drawback of this approach was that it did not handle different graph structures. For false report detection, Kaliyar et al. [3] devised a deep convolutional neural network for fake news detection (FNDNet). Here, the deep network extracted various features at each layer. The cross-entropy loss was very low, and the developed approach obtained high classification accuracy. However, the method did not consider user-profile-based features and context-specific datasets, which would help detect fake news articles effectively. A deep neural network was introduced by Kaliyar et al. [4] for the classification task, where the context of the news article and the presence of echo chambers on the social media platform were considered for the identification of false reports. The developed method offered precision without causing overfitting on the training data. However, the model was unable to conduct text-based classification of news articles in real time.

Monti et al. [5] presented an automatic fake news detection model based on geometric deep learning. The developed method incorporated heterogeneous data associated with user activity and profiles, news dissemination patterns, social network topology, and social content. The major advantage of this detection model was its capability to learn task-specific features from the information. The developed model achieved very high accuracy in various settings, including large-scale real data, but its performance was not satisfactory for related tasks such as news topic classification and virality prediction. Zhang et al. [6] presented the concept of dual emotion features, in which the emotions of the publisher and the social emotions were taken into account for false news identification. The developed model was compatible with existing fake news detectors. However, it failed to leverage multi-modal information to capture the emotions more precisely and did not use more advanced techniques for dual emotion representation. An efficient deep neural network that considers both the existence of echo chambers in the social network and the content of the news article was designed by Kaliyar et al. [7]. The developed model provided valuable information about user-based interaction. However, the classification accuracy could be enhanced by including temporal information. Wang et al. [8] proposed an end-to-end framework called Event Adversarial Neural Network (EANN). The developed model comprised three components. The system captured features shared across all events, and this representation remained discriminative on newly arriving events. However, the method failed to handle complex feature representations.

3. Proposed adam Adadelta Optimizer (AO) based deep long short term memory (Deep LSTM) for fake news identification

Fake news is intentionally written to imitate real news, and its dissemination worldwide misleads people. At first, the input review data is taken from the review database specified in [10, 27] and is subjected to the tokenization phase, which is carried out using the Bidirectional Encoder Representations from Transformers (BERT) process [11]. In the feature extraction phase, features such as Term Frequency-Inverse Document Frequency (TF-IDF), sentence-level features (punctuation, elongated words, capitalized words, emoticons, and numerical words), and SentiWordNet scores (positive and negative) are extracted. Once the features are extracted, dimensionality reduction is performed using Kernel Linear Discriminant Analysis (LDA) [24] and Singular Value Decomposition (SVD) [25] individually, and then the top-N features are selected using Renyi joint entropy. Finally, the chosen features are given to the Deep LSTM [9] for false information identification, where the network classifier is trained by means of the Adam Adadelta Optimizer (AO). The Adam AO is derived by combining Adam optimization [23] and Adadelta optimization; the sigmoid activation function is used in the output layer and the hyperbolic tangent activation function in the hidden layer. Figure 1 shows the schematic view of the Adam AO based Deep LSTM.

Figure 1.

Block diagram of adam Adadelta Optimizer (AO) based deep long short term memory (Deep LSTM).

3.1 Attainment of input review data

The public reviews on social media platforms reveal the emotions, opinions, sentiments, and expressions of people all over the world, which makes them very valuable. Hence, the review data acquired from the FakeNewsNet database specified in [10, 27] is considered as the input data for the subsequent process.

Let $D$ be the dataset with $n$ sample reviews, represented as,

$\displaystyle D=\left\{{R_{1},\,R_{2},\ldots R_{m},\ldots R_{n}}\right\}$ (1)

where $R_{m}$ denotes the $m$-th input review and $n$ is the total number of reviews.

3.2 Tokenization by bidirectional encoder representations from transformers (BERT) technique

Tokenization refers to the technique of separating a statement into individual words, and it is an important step in processing review text. Before being fed into the framework, the phrases are split into smaller units known as tokens. Each pretrained model is paired with its own tokenizer, trained on a massive raw corpus. Here, $R_{m}$ is the input review data given to the tokenization step, which is performed using the BERT technique, and $T_{m}$ is the output of the tokenization.
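To make the tokenization step concrete, the following is a minimal sketch of the greedy longest-match WordPiece splitting that BERT tokenizers are based on. It is an illustration only, not the tokenizer used in this work: the vocabulary `TOY_VOCAB` and the function name `wordpiece_tokenize` are hypothetical, and a real BERT tokenizer additionally normalizes text, splits punctuation, and uses a vocabulary of tens of thousands of pieces.

```python
def wordpiece_tokenize(text, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a whitespace-separated text."""
    tokens = []
    for word in text.lower().split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            # Try the longest remaining substring first, then shrink it.
            while start < end:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # continuation pieces carry a ## prefix
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                pieces = [unk]  # no piece matched: the whole word becomes unknown
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    # BERT-style inputs are wrapped in the special [CLS] and [SEP] tokens.
    return ["[CLS]"] + tokens + ["[SEP]"]


# Hypothetical toy vocabulary, chosen only for this example.
TOY_VOCAB = {"fake", "news", "spread", "##s", "fast", "##er"}
tokens = wordpiece_tokenize("Fake news spreads faster", TOY_VOCAB)
```

In this sketch the review $R_{m}$ corresponds to the input string and $T_{m}$ to the returned token list; words that cannot be covered by vocabulary pieces are mapped to `[UNK]`.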

3.3 Feature extraction

The result of the tokenization procedure is passed to the feature extraction step. This is the procedure of extracting the relevant characteristics that are used to train a machine learning classifier. The fundamental aim of feature extraction is to characterize the input text and make it suitable for further processing. The choice of suitable features can enhance the accuracy of the framework and also reduce the cost of the training phase, whereas a poor choice of attributes can lengthen the system's training time. As a result, fewer but more effective features are extracted, as described below:

3.3.1 Term frequency-inverse document frequency (TF-IDF)

TF-IDF [13] uses the frequency count of every term present in the document to characterize the input text. These frequencies are used to compute a value that represents the importance of a term in the whole document; equivalently, every term in the document is represented by its corresponding weight. Most feature extraction processes include TF-IDF [22], because the raw term frequency is not a reliable way to characterize the terms in a document, especially when the collection contains a huge number of documents. TF-IDF is an information retrieval measure whose value increases if the token occurs often in the document and decreases if the token occurs frequently across the corpus, thereby yielding a more informative metric. The TF-IDF score therefore highlights the most distinctive words. TF-IDF can be expressed as,

$\displaystyle f_{1}=\textit{tf}_{u,v}\,\times\,\log\,\left({\frac{M}{\textit{df}_{u}}}\right)$ (2)

where $M$ is the total number of documents, $\textit{df}_{u}$ indicates the number of documents containing the term $u$ , and $\textit{tf}_{u,v}$ represents the number of occurrences of $u$ in document $v$ .
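Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration assuming whitespace tokenization and the natural logarithm (the base of the logarithm is not stated above); the function name `tf_idf` and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, following Eq. (2)."""
    tokenized = [doc.lower().split() for doc in docs]
    M = len(tokenized)  # total number of documents
    # df[u]: number of documents that contain term u at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf[u]: occurrences of u in this document
        weights.append({u: tf[u] * math.log(M / df[u]) for u in tf})
    return weights


docs = ["fake news spreads fast", "real news spreads slowly", "fake fake claims"]
w = tf_idf(docs)
```

A term that appears in every document receives weight zero, while a term that is frequent in one document but rare in the corpus (such as "fake" in the third sample document) receives a comparatively high weight.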

3.3.2 SentiWordNet

SentiWordNet [12] is employed to extract the scores and counts of positive, negative, and neutral words in tweets. In addition, SentiWordNet is used to determine the total score, which is important for categorizing the tweets into multiple classes according to their sentiment value. Here, the polarity score triple is employed to calculate the semantic orientation of a word by comparing the positivity and negativity values of each term. The score used for the positive or negative grouping of review texts is specified as,

$\displaystyle f_{2}(d)=\textit{PMI}(d,\,\textit{pos})-\textit{PMI}\,(d,\,\textit{neg})$ (3)

where $d$ is a lexicon term, $\textit{PMI}\,(d,\,\textit{pos})$ is the Point-wise Mutual Information (PMI) score between $d$ and the positive class, and $\textit{PMI}\,(d,\,\textit{neg})$ is the PMI score between $d$ and the negative class.
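As an illustration of Eq. (3), the sketch below estimates the PMI terms from simple co-occurrence counts. The counting scheme, the base-2 logarithm, and the function names `pmi` and `polarity_score` are assumptions made for this example; the text above does not specify how the probabilities are estimated, and all counts must be positive for the logarithms to be defined.

```python
import math


def pmi(n_term_and_class, n_term, n_class, n_total):
    """PMI(d, c) = log2( P(d, c) / (P(d) * P(c)) ), estimated from counts."""
    return math.log2((n_term_and_class * n_total) / (n_term * n_class))


def polarity_score(n_d_pos, n_d_neg, n_pos, n_neg):
    """Eq. (3): PMI(d, pos) - PMI(d, neg) for a lexicon term d.

    n_d_pos / n_d_neg: texts of each class that contain d (assumed > 0).
    n_pos / n_neg:     total number of texts in each class.
    """
    n_total = n_pos + n_neg
    n_d = n_d_pos + n_d_neg
    return pmi(n_d_pos, n_d, n_pos, n_total) - pmi(n_d_neg, n_d, n_neg, n_total)


# Illustrative counts: the term occurs in 30 of 100 positive texts
# and in 10 of 100 negative texts.
score = polarity_score(30, 10, 100, 100)
```

A positive score indicates that the term is more strongly associated with the positive class, and a negative score indicates the opposite.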

3.3.3 Punctuation

Emotions in social tweets are often intensified using exclamation and question marks, and punctuation is essential to the grammatical structure of a text. This feature counts the number of question marks and the number of exclamation marks in the tweet message. Such features provide useful information about the sentiment of a tweet message. The punctuation feature is expressed as $f_{3}$ .

3.3.4 Elongated words

This feature counts the number of words in the document in which a character is repeated two, three, or four times, e.g., happy. The result obtained from this feature is expressed as $f_{4}$ .

3.3.5 Capitalization

The capitalization feature counts the number of words in a document that consist entirely of upper-case characters. The output obtained from this feature is denoted as $f_{5}$ .

3.3.6 Numerical words

This feature counts the numerical values present in a sentence, and the extracted feature is denoted as $f_{6}$ .

3.3.7 Emoticons

Emoticons are pictorial representations of facial expressions and are frequently used on social platforms to convey emotions. Each emoticon is substituted with its sentiment polarity, and the result of this feature is denoted as $f_{7}$ .

Therefore, the output obtained from the method of feature extraction is expressed in the form of,

$\displaystyle F_{m}=\left\{{f_{1},\,f_{2}(d),\,f_{3},\,f_{4},\,f_{5},\,f_{6},\,f_{7}}\right\}$ (4)
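The sentence-level features $f_{3}$ to $f_{7}$ can be sketched as simple counters. The code below is one possible reading of the descriptions above rather than the exact extraction rules of this work: the regular expressions, the `min_repeat` threshold for elongated words, and the small emoticon lexicon `EMOTICON_POLARITY` are all assumptions introduced for illustration.

```python
import re

# Hypothetical emoticon lexicon; a real system would use a much larger one.
EMOTICON_POLARITY = {":)": 1, ":-)": 1, ":D": 1, ":(": -1, ":-(": -1}


def surface_features(text, min_repeat=2):
    """Count-based sketch of the punctuation, elongation, capitalization,
    numerical-word and emoticon features (f3 to f7)."""
    # A letter followed by at least (min_repeat - 1) copies of itself.
    elongated = re.compile(r"([A-Za-z])\1{%d,}" % (min_repeat - 1))
    words = text.split()
    return {
        "f3": text.count("!") + text.count("?"),
        "f4": sum(1 for w in words if elongated.search(w)),
        "f5": sum(1 for w in words if len(w) > 1 and w.isalpha() and w.isupper()),
        "f6": sum(1 for w in words if re.fullmatch(r"\d+(\.\d+)?", w)),
        "f7": sum(EMOTICON_POLARITY.get(w, 0) for w in words),
    }


sample = "BREAKING news!!! This is sooo shocking :( 100 people agree?"
features = surface_features(sample)
```

With `min_repeat=2` ordinary double letters (as in "agree") are counted as elongation, which follows the wording of the elongated-words description; raising the threshold to 3 restricts the count to clearly lengthened words such as "sooo".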

3.4 Dimension reduction using kernel linear discriminant analysis (LDA) and singular value decomposition (SVD)

The extracted feature vector $F_{m}$ is passed to the dimensionality reduction step to reduce the feature dimension. A score is generally assigned to every feature using a general criterion during feature extraction, and the top-N features are then picked from the extracted features. Here, dimensionality reduction is done using Kernel LDA and SVD, which are briefly discussed below.

3.4.1 Kernel linear discriminant analysis (LDA)

Kernel LDA is a standard tool for classification problems and is based on transforming the input space into a new one. By exploiting kernel functions, LDA [24] is generalized to the case in which the principal components in the transformed space are nonlinearly related to the input variables. The kernel function $k$ is used to construct a nonlinear separating function in the input space, which corresponds to a linear separating function in the feature space $F$ . The steps are given as follows:

Step 1: calculate the matrices

For given classes $x$ and $y$ , the kernel function between the $g$-th element $W_{xg}$ of class $x$ and the $h$-th element $W_{yh}$ of class $y$ is computed as follows, where $\varpi$ maps an input into the feature space $F$ and the superscript $r$ denotes the transpose,

$\displaystyle\left({k_{gh}}\right)_{xy}=\varpi^{r}\,\left({W_{xg}}\right)\,\varpi\,\left({W_{yh}}\right)$ (5)

Let $K$ be an $\left({N\,\times\,N}\right)$ matrix defined on the class elements by $\left({K_{xy}}\right)_{\begin{subarray}{c}x=1,\ldots,E\\ y=1,\ldots,E\end{subarray}}$ , where $K_{xy}$ is a matrix formed by dot products in the feature space $F$ and $E$ is the number of classes.

$\displaystyle K=\left({K_{xy}}\right)_{\begin{subarray}{c}x=1,\ldots,E\\ y=1,\ldots,E\end{subarray}}$ (6)

$\displaystyle K_{xy}=\left({k_{gh}}\right)_{\begin{subarray}{c}g=1,\ldots,e_{x}\\ h=1,\ldots,e_{y}\end{subarray}}$ (7)

where $K_{xy}$ is an $\left({e_{x}\,\times\,e_{y}}\right)$ matrix and $K$ is an $\left({N\,\times\,N}\right)$ symmetric matrix, such that $K_{xy}^{r}=K_{yx}$ .

$\displaystyle H=\left({H_{a}}\right)_{a=1,\ldots,E}$ (8)

where $H_{a}$ is a $\left({e_{a}\,\times\,e_{a}}\right)$ matrix with terms all equal to $\frac{1}{e_{a}}$ . $H$ is a $(N\times N)$ block diagonal matrix.
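Step 1 can be illustrated with a short sketch that builds the kernel matrix $K$ of Eqs. (5) to (7) and the block-diagonal matrix $H$ of Eq. (8) for class-sorted samples. The RBF kernel and its `gamma` value are assumptions made for this example, since the kernel is not specified above, and the function names are hypothetical.

```python
import math


def rbf(a, b, gamma=0.5):
    """Assumed kernel: k(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def kernel_and_block_matrices(classes, kernel=rbf):
    """Build the N x N kernel matrix K and the block-diagonal matrix H.

    `classes` is a list of E classes, each a list of e_a sample vectors.
    """
    samples = [w for cls in classes for w in cls]  # samples sorted by class
    N = len(samples)
    K = [[kernel(samples[i], samples[j]) for j in range(N)] for i in range(N)]
    H = [[0.0] * N for _ in range(N)]
    offset = 0
    for cls in classes:
        e_a = len(cls)
        # Each diagonal block H_a is e_a x e_a with all entries 1 / e_a.
        for i in range(offset, offset + e_a):
            for j in range(offset, offset + e_a):
                H[i][j] = 1.0 / e_a
        offset += e_a
    return K, H


classes = [[(0.0, 0.0), (0.0, 1.0)], [(3.0, 3.0)]]
K, H = kernel_and_block_matrices(classes)
```

The eigen-decomposition in Steps 2 to 4 would then operate on these two matrices; in practice a numerical linear algebra library would be used for that part.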

Step 2: Decompose ${\bm{K}}$ using Eigen vectors decomposition

The eigenvector decomposition of the matrix $K$ is written as,

$\displaystyle K=G\,\Gamma G^{r}$ (9)

where $\Gamma$ is the diagonal matrix of non-zero eigenvalues and $G$ is the matrix of normalized eigenvectors corresponding to $\Gamma$ .

Step 3: Compute Eigen vector and Eigen values

Since $G$ is orthonormal, the solutions are determined by maximizing $\lambda$ .

$\displaystyle\lambda\,\alpha=G^{r}\,H\,G\,\alpha$ (10)

Step 4: Compute Eigen vectors and normalize them

As the eigenvectors are linear combinations of the elements of $F$ , there exist coefficients $\beta_{xy}$ $(x=1,\ldots,E;\ y=1,\ldots,e_{x})$ such that

$\displaystyle\eta=\sum\limits_{x=1}^{E}{\sum\limits_{y=1}^{e_{x}}{\beta_{xy}\,\varpi\,\left({W_{xy}}\right)}}$ (11)

All solutions $\eta$ lie in the span of $\varpi\,\left({W_{gh}}\right)$ . The coefficients $\beta$ are normalized by requiring that the corresponding vector $\eta$ be normalized in $F$ .

$\displaystyle\eta^{r}\,\eta=1$ (12)

$\displaystyle\eta^{r}\,\eta=\sum\limits_{x=1}^{E}\sum\limits_{y=1}^{e_{x}}\sum\limits_{a=1}^{E}\sum\limits_{b=1}^{e_{a}}\beta_{xy}\,\beta_{ab}\,\varpi^{r}\,\left({W_{xy}}\right)\,\varpi\,\left({W_{ab}}\right)=1$ (13)

$\displaystyle\eta^{r}\,\eta=\sum\limits_{x=1}^{E}\sum\limits_{a=1}^{E}\beta_{x}^{r}\,K_{xa}\,\beta_{a}=1$ (14)

$\displaystyle\eta^{r}\,\eta=\beta^{r}\,K\,\beta=1$ (15)

The coefficients of $\eta$ are divided by $\sqrt{\beta^{r}\,K\,\beta}$ to attain the normalized vectors $\eta$ .

Step 5: calculate projections of test points into the Eigen vectors ${\bm{\eta}}$

Knowing the normalized vectors $\eta$ , the projection of a test point $s$ is computed as,

$\displaystyle\eta^{r}\,\varpi\,(s)=\sum\limits_{x=1}^{E}{\sum\limits_{y=1}^{e_{x}}{\beta_{xy}\,k\,\left({W_{xy},\,s}\right)}}$ (16)

The result achieved through the process of Kernel LDA is expressed as $X$ .

3.4.2 Singular value decomposition (SVD)

SVD [25] is a general method for the analysis of multivariate data, and it can detect and extract small signals from noisy data. SVD offers another way to factorize a matrix, into singular values and singular vectors. Moreover, SVD reveals some of the same kind of information as the eigendecomposition.

Let $Y$ denote an $i\,\times\,j$ matrix of real-valued data. The SVD is computed as,

$\displaystyle Y=C\,I\,O^{t}$ (17)

where $C$ is an $i\times j$ matrix, $I$ is a $j\times j$ diagonal matrix, and $O^{t}$ is also a $j\times j$ matrix. The columns of $C$ are known as the left singular vectors; $\left\{{c_{o}}\right\}$ form an orthonormal basis for the expression profiles, such that $c_{mm}\cdot c_{nn}=1$ for $mm=nn$ and $c_{mm}\cdot c_{nn}=0$ otherwise. The rows of $O^{t}$ are called the right singular vectors; $\left\{{w_{o}}\right\}$ form an orthonormal basis for the gene transcriptional responses. The elements of $I$ are called the singular values, and they are non-zero only on the diagonal. Hence, $I=\textit{diag}\,\left({ii_{1},\ldots,ii_{j}}\right)$ . Moreover, $ii_{o}>0$ for $1\leqslant o\leqslant l$ and $ii_{o}=0$ for $(l+1)\leqslant o\leqslant j$ , where $l$ is the rank of $Y$ . One of the significant outcomes of the SVD of $Y$ is the rank-$ab$ approximation, expressed as,

$\displaystyle Y^{(ab)}=\sum\limits_{o=1}^{ab}{c_{o}\,ii_{o}\,w_{o}^{t}}$ (18)

Let the dimension of the extracted feature vector be $\left[{1\,\times 1007}\right]$ . The output $X$ obtained through Kernel LDA has a size of $[1\,\times\,10]$ , and the SVD likewise generates the output $Y$ with a dimension of $[1\,\times 10]$ .
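The truncated sum in Eq. (18) can be illustrated for the simplest case of a single term. The sketch below uses power iteration to estimate the leading singular triplet and forms the rank-one approximation from it; power iteration is a choice made for this self-contained example (a library SVD routine would normally be used), the function names are hypothetical, and the sketch assumes a non-zero matrix whose leading singular vector is not orthogonal to the uniform starting vector.

```python
import math


def leading_singular_triplet(Y, iters=200):
    """Estimate (c_1, ii_1, w_1) of Y by power iteration on Y^t Y."""
    rows, cols = len(Y), len(Y[0])
    w = [1.0 / math.sqrt(cols)] * cols  # uniform unit starting vector
    for _ in range(iters):
        Yw = [sum(Y[r][k] * w[k] for k in range(cols)) for r in range(rows)]
        YtYw = [sum(Y[r][k] * Yw[r] for r in range(rows)) for k in range(cols)]
        norm = math.sqrt(sum(v * v for v in YtYw))
        w = [v / norm for v in YtYw]  # right singular vector estimate
    Yw = [sum(Y[r][k] * w[k] for k in range(cols)) for r in range(rows)]
    sigma = math.sqrt(sum(v * v for v in Yw))  # singular value ii_1
    c = [v / sigma for v in Yw]  # left singular vector c_1
    return c, sigma, w


def rank_one_approximation(Y):
    """First term of Eq. (18): c_1 * ii_1 * w_1^t."""
    c, sigma, w = leading_singular_triplet(Y)
    return [[sigma * c_r * w_k for w_k in w] for c_r in c]


Y = [[3.0, 0.0], [0.0, 1.0]]
c, sigma, w = leading_singular_triplet(Y)
Y1 = rank_one_approximation(Y)
```

For the diagonal example matrix the leading singular value is 3, and the rank-one approximation keeps only the dominant entry of $Y$.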

3.4.3 Fusion based on Renyi joint entropy

Once the dimension of the features is reduced by means of Kernel LDA and SVD, the top-N features are selected using Renyi joint entropy. The equation for Renyi joint entropy is represented as,

$\displaystyle Q_{\theta}\,\left({X,\,Y}\right)=\frac{1}{1-\theta}\,\log\,\left({\sum\limits_{x\,\in\,X}{\sum\limits_{y\,\in\,Y}{P\,(x,y)}^{\theta}}}\right)$ (19)

where $x$ denotes a particular value of the Kernel LDA output $X$ with class label, $y$ denotes a particular value of the SVD output $Y$ with class label, and $P\,(x,\,y)$ is the joint probability of these values occurring together. Here, $\theta\geqslant 0$ and $\theta\neq 1$ . The Renyi joint entropy selects the top-N features with a dimension of $[1\,\times\,20]$ . The result is specified as $J_{q}$ .
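Eq. (19) can be evaluated directly once the joint probabilities are available. The sketch below estimates $P(x, y)$ from paired, already discretized observations by relative frequency; the discretization, the default $\theta = 2$, and the function name `renyi_joint_entropy` are assumptions for illustration, and the sketch does not cover how the entropy values are subsequently used to rank features, which is not detailed above.

```python
import math
from collections import Counter


def renyi_joint_entropy(xs, ys, theta=2.0):
    """Renyi joint entropy of order theta (Eq. 19) for paired discrete samples."""
    if theta < 0 or theta == 1:
        raise ValueError("theta must satisfy theta >= 0 and theta != 1")
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # empirical counts of each (x, y) pair
    power_sum = sum((count / n) ** theta for count in joint.values())
    return math.log(power_sum) / (1.0 - theta)


# Four equally likely (x, y) pairs: the entropy equals log(4) for any valid theta.
q_uniform = renyi_joint_entropy([0, 0, 1, 1], [0, 1, 0, 1])
# A single repeated pair carries no uncertainty.
q_constant = renyi_joint_entropy([0, 0], [1, 1])
```

The two calls bracket the possible behaviour: a uniform joint distribution gives the maximum entropy for its support size, while a deterministic pairing gives zero.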

3.5 False news detection using deep long short term memory (Deep LSTM)

The output obtained from the fusion of features using Renyi joint entropy $J_{q}$ is given to the Adam AO trained Deep LSTM to detect the fake news from the tweets.

3.5.1 Structure of Deep Long Short Term Memory (Deep LSTM)

Deep neural systems of this kind are recurrent networks, which are mainly used to extract and learn features that are deeply embedded in the data. The output of the Deep LSTM [9] network depends on the state of its memory cells, and this property makes the network suitable for detection, since the task requires the historical context of the inputs rather than only the last input. The operating principle is based on memory cells, each of which contains several subunits with different purposes. Figure 2 shows the structure of the Deep LSTM.

Figure 2.

Structure of deep long short term memory (Deep LSTM).

The input node $L_{q}$ receives the input $J_{q}$ from the Deep LSTM input layer and the preceding hidden state $\textit{Hh}_{q-1}$ of the node at time step $q$ . The sigmoid is used as the gating function [26], which helps to enhance the prediction accuracy, while $\tanh$ is used as the output activation function. The weighted sum of $J_{q}$ and $\textit{Hh}_{q-1}$ is passed through the $\tanh$ function, which is expressed as,

$\displaystyle L_{q}=\tanh\,\left({J_{q}\cdot V_{\textit{LJ}}+\textit{Hh}_{q-1}\cdot V_{\textit{LHh}}+\textit{bias}_{\textit{input\,node}}}\right)$ (20)

The input gate $i_{q}$ takes the same inputs as the input node, but applies the sigmoid function $\gamma$ :

$\displaystyle i_{q}=\gamma\,\left({J_{q}\cdot V_{\textit{LJ}}+\textit{Hh}_{q-1}\cdot V_{\textit{LHh}}+\textit{bias}_{\textit{inputgate}}}\right)$ (21)

The internal state $s_{q}$ , where $\Theta$ denotes point-wise multiplication, is expressed as,

$\displaystyle s_{q}=i_{q}\,\Theta\,L_{q}+s_{q-1}$ (22)

The forget gate $f_{q}$ , which is used to reset the internal state of the memory cell, is computed as,

$\displaystyle f_{q}=\gamma\,\left({J_{q}\cdot V_{\textit{fJ}}+\textit{Hh}_{q-1}\cdot V_{\textit{fHh}}+\textit{bias}_{\textit{forget}}}\right)$ (23)

The output gate is computed as,

$\displaystyle O_{q}=\gamma\,\left({J_{q}\cdot V_{\textit{OJ}}+\textit{Hh}_{q-1}\cdot V_{\textit{OHh}}+\textit{bias}_{\textit{output\,gate}}}\right)$ (24)

The final output of the memory cell and the internal state including the forget gate are represented as,

$\displaystyle\textit{Hh}_{q}=\tanh\,(s_{q})\,\Theta\,O_{q}$ (25)

$\displaystyle s_{q}=L_{q}\,\Theta\,i_{q}+\,s_{q-1}\,\Theta\,f_{q}$ (26)
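The gate equations above can be traced with a scalar sketch of a single memory-cell step. This is an illustration under simplifying assumptions rather than the network used in this work: inputs, states, and weights are scalars instead of vectors and matrices, each gate is given its own weights, and the parameter names in the dictionary (`V_iJ`, `b_L`, and so on) are hypothetical labels chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_step(J_q, Hh_prev, s_prev, p):
    """One scalar LSTM memory-cell step; returns (Hh_q, s_q)."""
    L_q = math.tanh(J_q * p["V_LJ"] + Hh_prev * p["V_LHh"] + p["b_L"])  # input node, Eq. (20)
    i_q = sigmoid(J_q * p["V_iJ"] + Hh_prev * p["V_iHh"] + p["b_i"])    # input gate, Eq. (21)
    f_q = sigmoid(J_q * p["V_fJ"] + Hh_prev * p["V_fHh"] + p["b_f"])    # forget gate, Eq. (23)
    O_q = sigmoid(J_q * p["V_OJ"] + Hh_prev * p["V_OHh"] + p["b_O"])    # output gate, Eq. (24)
    s_q = L_q * i_q + s_prev * f_q                                      # internal state, Eq. (26)
    Hh_q = math.tanh(s_q) * O_q                                         # cell output, Eq. (25)
    return Hh_q, s_q


PARAM_NAMES = ("V_LJ", "V_LHh", "b_L", "V_iJ", "V_iHh", "b_i",
               "V_fJ", "V_fHh", "b_f", "V_OJ", "V_OHh", "b_O")
zero_params = {name: 0.0 for name in PARAM_NAMES}

# With all weights zero every gate equals 0.5 and the input node equals 0.
Hh, s = lstm_cell_step(J_q=1.0, Hh_prev=0.0, s_prev=1.0, p=zero_params)
```

With zero weights the step simply halves the previous internal state, which makes the role of the forget gate in Eq. (26) easy to see; processing a sequence amounts to calling the function repeatedly and feeding the returned state back in.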

4. Training procedure of adam Adadelta Optimizer (AO) algorithm

The Adam AO is derived by combining Adam Optimization [23] and Adadelta Optimization, and it is employed to train the Deep-LSTM network classifier [9] in order to reach the optimum solution and effectively detect fake news on social media. Adam requires only first-order gradients and has a small memory requirement. It computes individual adaptive learning rates for the different parameters from estimates of the first and second moments of the gradients; the name Adam is derived from adaptive moment estimation. The benefits of Adam optimization are that the magnitudes of the parameter updates are invariant to rescaling of the gradient, its step sizes are approximately bounded by the step-size hyper-parameter, and it does not require a stationary objective. Moreover, it works well with sparse gradients and naturally performs a form of step-size annealing. Adadelta optimization is a stochastic gradient descent technique that adapts the learning rate per dimension to address two issues: i) the continuous decay of learning rates during training; and ii) the need for a manually selected global learning rate. Manual tuning of a learning rate is not needed in this technique, which also appears to be robust to various model architecture choices, data modalities, noisy gradient information, and hyper-parameter selection.

Step 1: Initialization

Initialize the parameters of the optimization, namely $\sigma$ , $\nu_{1}$ , $\nu_{2}$ , and $\phi_{0}$ . Here, $\sigma$ is the step size, $\nu_{1}$ and $\nu_{2}$ are the hyper-parameters, and $\phi_{0}$ is the initial parameter vector.

Step 2: Determine fitness

The fitness function measures how well the classifier detects false information from the social media platform, and it is expressed as,

$\displaystyle\textit{Fitness function}=\frac{1}{\omega}\,\sum\limits_{m=1}^{\omega}{\left[{\textit{TA}_{m}-\textit{Hh}_{q}}\right]^{2}}$ (27)

where $\textit{TA}_{m}$ denotes the target output, $\textit{Hh}_{q}$ is the output obtained from the Deep LSTM classifier, and $\omega$ is the number of samples.
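Eq. (27) is a mean squared error between targets and classifier outputs, which can be sketched as follows. The function name `fitness` and the sample values are illustrative assumptions.

```python
def fitness(targets, outputs):
    """Mean squared error between target labels and classifier outputs (Eq. 27)."""
    if len(targets) != len(outputs) or not targets:
        raise ValueError("targets and outputs must be non-empty and of equal length")
    omega = len(targets)  # number of samples
    return sum((ta - hh) ** 2 for ta, hh in zip(targets, outputs)) / omega


# Illustrative targets (1 = fake, 0 = real) and classifier outputs.
error = fitness([1.0, 0.0, 1.0], [0.9, 0.2, 0.6])
```

A lower value indicates that the classifier outputs are closer to the targets, so training seeks parameters that minimize this quantity.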

Step 3: Determine the moment vector

Let the first and second moment vectors be initialized as $S_{0}$ and $U_{0}$ , respectively. The Adam algorithm updates exponential moving averages of the gradient $(S_{T})$ and the squared gradient $(U_{T})$ , where the hyper-parameters $\nu_{1},\,\nu_{2}\in\,[0,\,1)$ control the exponential decay rates of these moving averages. The biased first and second moment estimates are updated as follows,

$\displaystyle S_{T}=\nu_{1}\cdot S_{T-1}+\left({1-\nu_{1}}\right)\cdot\textit{Gg}_{T}$ (28)

$\displaystyle U_{T}=\nu_{2}\cdot U_{T-1}+\left({1-\nu_{2}}\right)\cdot\textit{Gg}_{T}^{2}$ (29)

Step 4: Compute bias-corrected moment vector

Let $\textit{Gg}$ be the gradient of the stochastic objective $Z$ , and let $\textit{Gg}_{1},\ldots,\textit{Gg}_{T}$ be the gradients at successive time steps, expressed as,

$\displaystyle\textit{Gg}_{T}=\nabla_{\phi}\,Z_{T}\,\left({\phi_{T-1}}\right)$ (30)

The bias-corrected first moment estimate is expressed as,

$\displaystyle\overset{\frown}{S}_{T}=\frac{S_{T}}{\left({1-\nu_{1}^{T}}\right)}$ (31)

The bias-corrected second moment estimate is,

$\displaystyle\overset{\frown}{U}_{T}=\frac{U_{T}}{\left({1-\nu_{2}^{T}}\right)}$ (32)
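As a small worked check of Eqs. (28) and (31), assume the first moment vector starts at zero ($S_{0}=0$, the usual choice for Adam, which is not stated explicitly above) and take the illustrative values $\nu_{1}=0.9$ and $\textit{Gg}_{1}=2$. The first step then gives:

```latex
S_{1} = \nu_{1}\,S_{0} + (1-\nu_{1})\,\textit{Gg}_{1} = 0.9 \cdot 0 + 0.1 \cdot 2 = 0.2,
\qquad
\overset{\frown}{S}_{1} = \frac{S_{1}}{1-\nu_{1}^{1}} = \frac{0.2}{0.1} = 2 = \textit{Gg}_{1}
```

Under this assumption the biased estimate $S_{1}$ is shrunk toward zero by the factor $(1-\nu_{1})$, and the correction in Eq. (31) restores it to the scale of the gradient. The same reasoning applies to the second moment in Eq. (32).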

Step 5: Update parameters

In the Rider Deep LSTM, the update solution is derived by integrating the Rider optimization algorithm with Adam optimization [30]. Following the same approach, the parameter update of the proposed algorithm is derived by hybridizing Adam optimization [23] with Adadelta optimization [29]. The Adam update is given by,

$\displaystyle\phi_{T}=\phi_{T-1}-\frac{\sigma}{\sqrt{\overset{\frown}{U}_{T}+\xi}}\cdot\overset{\frown}{S}_{T}$ (33)

Combining the Adam and Adadelta updates with equal weights gives,

$\displaystyle\phi_{T+1}=0.5\,\phi_{T+1}^{\textit{Adam}}+0.5\,\phi_{T+1}^{\textit{Adadelta}}$ (34)

$\displaystyle\phi_{T+1}=0.5\,\left[{\phi_{T}-\frac{\sigma}{\sqrt{\overset{\frown}{U}_{T}+\xi}}\cdot\overset{\frown}{S}_{T}}\right]+0.5\,\left[{\phi_{T}-\frac{\textit{RMS}\,\left[{\Delta\phi}\right]_{T-1}}{\textit{RMS}\,\left[{\textit{Gg}_{T}}\right]}\,\textit{Gg}_{T}}\right]$ (35)

$\displaystyle\phi_{T+1}=\phi_{T}-0.5\,\left[{\frac{\sigma}{\sqrt{\overset{\frown}{U}_{T}+\xi}}\cdot\overset{\frown}{S}_{T}+\frac{\textit{RMS}\,\left[{\Delta\phi}\right]_{T-1}}{\textit{RMS}\,\left[{\textit{Gg}_{T}}\right]}\,\textit{Gg}_{T}}\right]$ (36)

$\displaystyle\phi_{T+1}=\phi_{T}-\frac{1}{2}\,\left[{\frac{\sigma}{\sqrt{\overset{\frown}{U}_{T}+\xi}}\cdot\overset{\frown}{S}_{T}+\frac{\textit{RMS}\,\left[{\Delta\phi}\right]_{T-1}}{\textit{RMS}\,\left[{\textit{Gg}_{T}}\right]}\,\textit{Gg}_{T}}\right]$ (37)

where $\sigma$ represents the learning rate, $\textit{RMS}\,\left[{\textit{Gg}_{T}}\right]$ is the root mean square of the gradient, $\textit{RMS}\,\left[{\Delta\phi}\right]_{T-1}$ denotes the root mean square of the parameter updates, and $\xi$ is the smoothing term.

Step 6: Termination

The process is repeated until the optimal solution is attained.

Algorithm 1 presents the pseudocode of Adam AO with the Deep-LSTM classifier.

Algorithm 1. Adam AO algorithm with Deep-LSTM classifier
1	Initialize step size $\sigma$
2	Set exponential decay rates for the moment estimates $\nu_{1},\nu_{2}\in[0,\,1)$
3	Initialize $Z(\phi)\to$ stochastic objective function with parameters $\phi$
4	Assume initial parameter vector $\phi_{0}$
5	Use the Deep-LSTM output in the fitness function to detect false information from the social media platform
6	Initialize first moment vector $S_{0}$
7	Initialize second moment vector $U_{0}$
8	While $\phi_{T}$ not converged do
9	$T=T+1$
10	Get gradient of the stochastic objective at time step $T$ using Eq. (30)
11	Update biased first moment using Eq. (28)
12	Update biased second moment using Eq. (29)
13	Compute bias-corrected first moment using Eq. (31)
14	Compute bias-corrected second moment using Eq. (32)
15	Update the parameter using Eq. (37)
16	end while
17	return $\phi_{T}$
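
The loop in Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the moment updates of Eqs (28) and (29) are assumed to take the standard Adam form, the Adadelta running averages and their decay rate `rho` follow the usual Adadelta formulation [29], and the function names, state layout, and default hyperparameters are hypothetical. The square root is taken over the bias-corrected second moment plus $\zeta$, as written in Eq. (37).

```python
import math


def init_state(n):
    """Zero-initialised optimizer state for n parameters (hypothetical layout)."""
    return {
        "T": 0,
        "S": [0.0] * n,       # biased first moment
        "U": [0.0] * n,       # biased second moment
        "Eg2": [0.0] * n,     # Adadelta running average of squared gradients
        "Edphi2": [0.0] * n,  # Adadelta running average of squared updates
    }


def adam_adadelta_step(phi, grad, state, sigma=0.05, nu1=0.9, nu2=0.999,
                       rho=0.95, zeta=1e-8):
    """Apply one hybrid update in the form of Eq. (37); returns new parameters."""
    state["T"] += 1
    T = state["T"]
    new_phi = []
    for i, (p, g) in enumerate(zip(phi, grad)):
        # Biased moments (assumed standard Adam form for Eqs (28) and (29)).
        state["S"][i] = nu1 * state["S"][i] + (1.0 - nu1) * g
        state["U"][i] = nu2 * state["U"][i] + (1.0 - nu2) * g * g
        # Bias correction, Eqs (31) and (32).
        s_hat = state["S"][i] / (1.0 - nu1 ** T)
        u_hat = state["U"][i] / (1.0 - nu2 ** T)
        adam_term = sigma / math.sqrt(u_hat + zeta) * s_hat
        # Adadelta term: RMS of previous updates over RMS of gradients.
        state["Eg2"][i] = rho * state["Eg2"][i] + (1.0 - rho) * g * g
        rms_grad = math.sqrt(state["Eg2"][i] + zeta)
        rms_dphi = math.sqrt(state["Edphi2"][i] + zeta)
        adadelta_term = rms_dphi / rms_grad * g
        # Assumption: the accumulator tracks the Adadelta part of the step.
        state["Edphi2"][i] = (rho * state["Edphi2"][i]
                              + (1.0 - rho) * adadelta_term ** 2)
        # Eq. (37): equal-weight average of the two updates.
        new_phi.append(p - 0.5 * (adam_term + adadelta_term))
    return new_phi


def minimize(grad_fn, phi0, steps=200):
    """Run a fixed number of steps as a stand-in for the convergence test."""
    phi, state = list(phi0), init_state(len(phi0))
    for _ in range(steps):
        phi = adam_adadelta_step(phi, grad_fn(phi), state)
    return phi
```

For example, `minimize(lambda p: [2.0 * (p[0] - 3.0)], [0.0])` moves a single parameter from 0 toward the minimizer of the toy objective $(\phi-3)^{2}$. In the setting of this paper, the gradient would instead come from the Deep-LSTM loss.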

5. Results and discussion

This section presents the results and discussion of the Adam AO based Deep LSTM in terms of the evaluation metrics.

5.1 Experimental setup

The proposed technique is implemented in Python. The performance of the Adam AO based Deep LSTM is evaluated in terms of accuracy, sensitivity, and specificity. The databases used are the FakeNewsNet database (Dataset #1) [10] and the BuzzFeedNews database (Dataset #2) [27].

BuzzFeedNews database: The BuzzFeedNews database is a collection of Facebook posts consisting of text, videos, photos, and links; only the text data is considered in this work. Some of the posts are mostly true, while others are a mixture of true and false content.

FakeNewsNet database: The FakeNewsNet database is similar to the BuzzFeedNews database and consists of gossip and fake news about political events, among other topics.

Table 1 lists the abbreviations used in this paper and their definitions.

Table 1
List of abbreviations

AO	Adadelta optimizer
BERT	Bidirectional Encoder Representations from Transformers
CNN	Convolutional Neural Network
Deep LSTM	Deep Long Short-Term Memory
DUAL	Deep Unified Attention Model with Latent Relation Representations
EANN	Event Adversarial Neural Network
FNDNet	Deep convolutional neural network for fake news detection
GCAN	Graph-aware Co-Attention Networks
GNN	Graph Neural Network
IoT	Internet of Things
LDA	Linear Discriminant Analysis
LSTM	Long Short-Term Memory
MaLSTM	Manhattan LSTM
NLP	Natural Language Processing
PMI	Point-wise Mutual Information
SVD	Singular Value Decomposition
TF-IDF	Term Frequency-Inverse Document Frequency

5.2 Analysis using Dataset #1 and Dataset #2

This section evaluates the performance of the proposed model on Dataset #1 and Dataset #2 using the evaluation metrics.

Table 2 presents the confusion matrix for Dataset #1. With 60% training data, the accuracy of the proposed method for feature sizes of 5, 10, 15, and 20 is 0.801, 0.805, 0.807, and 0.810, respectively. With 90% training data, the sensitivity of the Adam AO based Deep LSTM for feature sizes of 5, 10, 15, and 20 is 0.927, 0.930, 0.934, and 0.94, respectively. With 60% training data, the specificity of the proposed approach for the same feature sizes is 0.787, 0.791, 0.795, and 0.798, respectively.

Table 2
Confusion matrix for Dataset #1

	Predicted positive	Predicted negative
Actual positive	3164.0	298.0
Actual negative	299.0	6328.0

Table 3 presents the confusion matrix for Dataset #2. With 90% training data, the accuracy attained by the proposed scheme for feature sizes of 5, 10, 15, and 20 is 0.929, 0.930, 0.932, and 0.936, respectively. With 60% training data, the sensitivity achieved by the Adam AO based Deep LSTM for feature sizes of 5, 10, 15, and 20 is 0.846, 0.850, 0.854, and 0.861, respectively. With 60% training data, the specificity of the Adam AO based Deep LSTM for the same feature sizes is 0.792, 0.798, 0.802, and 0.807, respectively.

Table 3

Confusion matrix for Dataset #2

	Predicted positive	Predicted negative
Actual positive	193.0	12.0
Actual negative	13.0	385.0
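
Accuracy, sensitivity, and specificity follow directly from the four cells of a confusion matrix. The sketch below assumes the standard definitions (the function name is hypothetical) and applies them to the counts in Tables 2 and 3. Because each table reflects a single configuration, the resulting values need not coincide with the figures quoted in the text for particular feature sizes or training percentages.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


# Counts read from Table 2 (Dataset #1) and Table 3 (Dataset #2);
# rows are actual classes, columns are predicted classes.
dataset1 = confusion_metrics(tp=3164, fn=298, fp=299, tn=6328)
dataset2 = confusion_metrics(tp=193, fn=12, fp=13, tn=385)
```

Each result is a tuple ordered as (accuracy, sensitivity, specificity), so `dataset2[1]` is the sensitivity implied by Table 3.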

The performance of the Adam AO based Deep LSTM is analyzed and compared with existing methods, namely GNN [2], Deep Unified Attention Model with Latent Relation Representations (DUAL) [32], and Deep Fake [4].

Table 4

Comparative discussion

Dataset	Metrics/methods	GNN	DUAL	Deep Fake	Proposed Adam AO based Deep LSTM
Dataset #1	Accuracy (%)	82.20	83.85	84.77	92.41
	Sensitivity (%)	86.37	87.04	88.19	93.96
	Specificity (%)	78.80	79.56	82.88	91.37
Dataset #2	Accuracy (%)	83.49	84.55	85.51	93.55
	Sensitivity (%)	87.74	88.46	89.91	94.19
	Specificity (%)	79.89	81.30	83.25	92.46

Figure 3.

Comparative analysis using Dataset #1.

Figure 4.

Comparative analysis using Dataset #2.

GNN: Here, a propagation based technique is used for fake news detection, which utilizes GNN to differentiate the propagation patterns of real news and fake news over social networks.

DUAL: Here, a DUAL model is used for detecting fake news. This method integrates two scopes of features for detection and concurrently discovers the hidden representation of these two features.

Deep Fake: Here, the content of the news article and the presence of echo chambers in the social network are considered for fake news detection. A tensor representing social context is formed by integrating the user, news, and community information. The news content is amalgamated with the tensor, and coupled matrix-tensor factorization is engaged for obtaining a representation of both social context and news content.

Fig. 3 shows the comparative analysis of the proposed scheme on Dataset #1. With 90% training data, the accuracy of the proposed method is 0.924, whereas GNN, DUAL, and Deep Fake attain 0.822, 0.839, and 0.848, respectively. The sensitivity of the proposed method is 0.939, an improvement of 8.68%, 7.931%, and 6.463% over GNN, DUAL, and Deep Fake, respectively, for 90% training data. The specificity of the proposed method is 0.914, whereas the conventional schemes attain 0.788 for GNN, 0.796 for DUAL, and 0.829 for Deep Fake with 90% training data.

Fig. 4 shows the comparative analysis of the Adam AO based Deep LSTM on Dataset #2. With 90% training data, the accuracy of the proposed method is 0.936, an improvement of 12.049% over GNN, 10.646% over DUAL, and 9.399% over Deep Fake. The sensitivity of the proposed method is 0.942, compared with 0.877 for GNN, 0.885 for DUAL, and 0.899 for Deep Fake with 90% training data. With 90% training data, the specificity of the proposed method is 0.925, an improvement of 15.723% over GNN, 13.723% over DUAL, and 11.052% over Deep Fake.
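
The improvement percentages quoted for Dataset #2 appear to be gains measured relative to each baseline. That reading is an assumption, but it reproduces the accuracy improvements to within rounding when applied to the Dataset #2 accuracy values in Table 4. A minimal sketch, with hypothetical function and variable names:

```python
def relative_improvement(proposed, baseline):
    """Percentage gain of the proposed score over a baseline score."""
    return (proposed - baseline) / baseline * 100.0


# Dataset #2 accuracy values (%) taken from Table 4.
proposed_accuracy = 93.55
baselines = {"GNN": 83.49, "DUAL": 84.55, "Deep Fake": 85.51}

# Assumed definition: gain expressed as a fraction of the baseline score.
gains = {name: relative_improvement(proposed_accuracy, value)
         for name, value in baselines.items()}
```

The same function can be applied to the sensitivity and specificity rows of Table 4.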

Table 4 summarizes the comparison of the Adam AO based Deep LSTM with the existing methods based on the best performance. From the analysis, the Adam AO based Deep LSTM achieved a maximum accuracy, sensitivity, and specificity of 93.55%, 94.19%, and 92.46%, respectively, on Dataset #2.

6. Conclusion

Fake news is purposely spread by malicious writers who manipulate the original content of the news. In this research, an Adam AO based Deep LSTM is devised to identify fake information in social media. Here, tokenization is performed using the BERT technique and the features are extracted. The dimensionality of the features is reduced using Kernel LDA and SVD, and the features are fused using Renyi joint entropy. Once the top-N features are selected, fake news detection is performed by the Deep LSTM, which is trained using the proposed Adam AO algorithm. The Adam AO based Deep LSTM achieved a maximum accuracy, sensitivity, and specificity of 0.935512, 0.941913, and 0.924554, respectively. Future work will aim to improve detection accuracy using other novel optimization algorithms with different classifiers, and to extend the approach to other types of news content such as images and videos.

Author’s Bios

Steni Mol T S is pursuing a PhD in the Computer Science Department of the Hindustan Institute of Technology and Science, India. Her research involves the challenges and solutions of Text Mining, Natural Language Processing and Artificial Intelligence. She holds a Bachelor's Degree in Computer Science (2011) from Nesamony Memorial Christian College, Manonmaniam Sundaranar University, India, and a Master's Degree in Computer Applications (2014) from the School of Communication & Management Studies, Cochin, Mahatma Gandhi University, India.

Sreeja P S is an Assistant Professor (SG) in the Department of Computer Applications, HITS, Padur. She completed her MCA at Bharathidasan University, and her M.Phil. (Computer Applications) and Ph.D. at the College of Engineering, Anna University, Chennai. She obtained the first rank in her M.Phil. and received a UGC-BSR fellowship for her research. She received Honorary Rosalind Membership from London Journal Press for her significant research work. She has many international journal and conference publications to her credit and has received 60+ citations. Her research interests include Artificial Intelligence, Text Mining, Natural Language Processing, and Cognitive Poetics.

References

1. Y.J. and Li C.T., GCAN: Graph-aware co-attention networks for explainable fake news detection on social media, In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, Association for Computational Linguistics (2020), 505–514, arXiv preprint arXiv:2004.11648.

2. Han, Karunasekera and Leckie, Graph neural networks with continual learning for fake news detection from social media, arXiv preprint arXiv:2007.03316 (2020), doi: 10.48550/arXiv.2007.03316.

3. Kaliyar R.K., Goswami, Narang and Sinha, FNDNet – a deep convolutional neural network for fake news detection, Cognitive Systems Research 61 (2020), 32–44.

4. Kaliyar R.K., Goswami and Narang, DeepFakE: improving fake news detection using tensor decomposition based deep neural network, The Journal of Supercomputing 77(2) (2021), 1015–1037.

5. Monti, Frasca, Eynard, Mannion and Bronstein M.M., Fake news detection on social media using geometric deep learning, arXiv preprint arXiv:1902.06673 (2019).

6. Zhang, Cao, Sheng, Zhong and Shu, Mining Dual Emotion for Fake News Detection, In Proceedings of the Web Conference, Ljubljana, Slovenia (2021), 3465–3476.

7. Kaliyar R.K., Goswami and Narang, EchoFakeD: improving fake news detection in social media with an efficient deep neural network, Neural Computing and Applications 33 (2021), 8597–8613.

8. Wang, Jin, Yuan, Xun, Jha and Gao, EANN: Event adversarial neural networks for multi-modal fake news detection, In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, United Kingdom (2018), 849–857.

9. Majhi, Naidu, Mishra A.P. and Satapathy S.C., Improved prediction of daily pan evaporation using Deep-LSTM model, Neural Computing and Applications 32(12) (2020), 7823–7838.

10. FakeNewsNet database, https://github.com/KaiDMML/FakeNewsNet/tree/master/dataset, Accessed on August 2021.

11. Devlin, Chang M.W., Lee and Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

12. Jianqiang and Xueliang, Combining semantic and prior polarity for boosting Twitter sentiment analysis, In IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), Chengdu, China (2015), 832–837.

13. Agarwal and Dixit, Fake news detection: an ensemble learning approach, In IEEE 4th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India (2020), 1178–1183.

14. Shu, Sliva, Wang, Tang and Liu, Fake news detection on social media: A data mining perspective, ACM SIGKDD Explorations Newsletter 19(1) (2017), 22–36.

15. Shu, Wang and Liu, Understanding user profiles on social media for fake news detection, In IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), IEEE, Miami, FL, USA (2018), 430–435.

16. Shu, Mahudeswaran, Wang, Lee and Liu, FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media, Big Data 8(3) (2020), 171–188.

17. Pulido C.M., Ruiz-Eugenio, Redondo-Sama and Villarejo-Carballido, A new application of social impact in social media for overcoming fake news in health, International Journal of Environmental Research and Public Health 17(7) (2020), 2430.

18. Shu, Mahudeswaran, Wang and Liu, Hierarchical propagation networks for fake news detection: Investigation and exploitation, In Proceedings of the International AAAI Conference on Web and Social Media 14 (2020), 626–637.

19. Thota, Tilak, Ahluwalia and Lohia, Fake news detection: A deep learning approach, SMU Data Science Review 1(3) (2018), 10.

20. Umer, Imtiaz, Ullah, Mehmood, Choi G.S. and On B.W., Fake news stance detection using deep learning architecture (CNN-LSTM), IEEE Access 8 (2020), 156695–156706.

21. Imtiaz, Umer, Ahmad, Ullah, Choi G.S. and Mehmood, Duplicate questions pair detection using Siamese MaLSTM, IEEE Access 8 (2020), 21932–21942.

22. Jiang, J.P., Haq A.U., Saboor and Ali, A Novel Stacking Approach for Accurate Detection of Fake News, IEEE Access 9 (2021), 22626–22639.

23. Kingma D.P. and Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).

24. Baudat and Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12(10) (2000), 2385–2404.

25. Wall M.E., Rechtsteiner and Rocha L.M., Singular value decomposition and principal component analysis, In A Practical Approach to Microarray Data Analysis, Springer, Boston, MA (2003), 91–109.

26. Vijayaprabakaran and Sathiyamurthy, Towards activation function search for long short-term model network: A differential evolution based approach, Journal of King Saud University-Computer and Information Sciences 34(6) (2020), 2637–2650.

27. BuzzFeedNews database, https://github.com/BuzzFeedNews/2016-10-facebook-fact-check/blob/master/data/facebook-fact-check.csv, Accessed on August 2021.

28. Darekar R.V. and Dhande A.P., Emotion Recognition from Speech Signals Using DCNN with Hybrid GA-GWO Algorithm, Multimedia Research 2(4) (2019), 12–22.

29. Zeiler M.D., Adadelta: an adaptive learning rate method, arXiv preprint arXiv:1212.5701 (2012).

30. Binu and Kariyappa B.S., Rider Deep LSTM Network for Hybrid Distance Score based Fault Prediction in Analog Circuits, IEEE Transactions on Industrial Electronics 68(10) (2020), 10097–10106.

31. Cristin, Cyril Raj and Marimuthu, Face Image Forgery Detection by Weight Optimized Neural Network Model, Multimedia Research 2(2) (2019), 19–27.

32. Dong, Yao, Wang, Benatallah, Sheng Q.Z. and Huang, DUAL: A deep unified attention model with latent relation representations for fake news detection, In International Conference on Web Information Systems Engineering, Springer, Cham, November (2018), 199–209.

33. Zheng, Qin, Shao and Hou, A novel objective image quality metric for image fusion based on Renyi entropy, Inf. Technol. J 7(6) (2008), 930–935.

34. Kirmani and Wahid, Revised Use Case Point (Re-UCP) Model for Software Effort Estimation, International Journal of Advanced Computer Science and Applications 6(3) (2015), 65–71.

35. Srivastava P.P., Goyal and Kumar, Analysis of Various NoSql Database, In Proceedings of the International Conference on Green Computing and Internet of Things (ICGCIoT), IEEE, Greater Noida, India (2015).

36. Weets J.F., Kakhani M.K. and Kumar, Limitations and challenges of HDFS and MapReduce, In Proceedings of the International Conference on Green Computing and Internet of Things (ICGCIoT), IEEE, Greater Noida, India (2015).

37. Manouchehri, Taghipour, Ghavami, Ebadi, Homaei and Latifnejad Roudsari, Night-shift work duration and breast cancer risk: an updated systematic review and meta-analysis, BMC Women's Health 21(1) (2021), 1–16.

38. Taghipour, Glaa and Zoghlami, Network coordination with minimum risk of information sharing, In Proceedings of the International Conference on Advanced Logistics and Transport (ICALT), IEEE, Hammamet, Tunisia (2014), 184–188.

39. Meher R.K., Hybrid Grasshopper Optimization and Bat Algorithm based DBN for Intrusion Detection in Cloud, Multimedia Research 4(4) (2021), 31–38.

40. Padma, Intrusion Detection using Naive Bayes Ant Colony Optimization Algorithm in a Wireless Communication Network, Journal of Networking and Communication Systems 5(1) (2022), 21–29.
