CapsuleNet-enhanced multi-modal model for false news detection: A qualitative analysis

Abstract

The spread of false news harms both individual practitioners and the media. To improve the efficiency of false news detection, this study constructs a multi-modal news detection model. The model comprises a text encoding module, a contextual semantic encoder, a news propagation encoder, and a false news detection model that integrates semantic features and image recognition. In the experiments, the multi-modal model achieved markedly higher accuracy and F1 scores than the comparison models. Its accuracy and F1 score improved by an average of 7.57% and 7.34% on the POL dataset and by 7.20% and 6.38% on the WEIBO dataset. In addition, hyperparameter analysis showed that model performance reached its optimum at r = 16 and k = 0.1. The ablation experiment further validated the importance of the channel attention mechanism and the graph contrastive learning method in improving model performance. The results indicate that multi-modal models have clear advantages in detecting false news and can effectively use information from different modalities to improve detection accuracy. This study is meaningful for evaluating the reliability of news information and the credibility of the media in society. The research still has limitations. For example, the model may generalize poorly when tested on specific datasets, and its complexity may make deployment difficult in resource-constrained environments. Future work will explore simplified versions of the model and test it on more diverse datasets to improve its generalization ability and practicability.

Keywords

multi-modal news detection model, capsule network, false news detection, channel attention

Introduction

Advances in information technology have made the Internet the main channel for information dissemination, enabling information to spread rapidly and widely.^1,2 The other side of this phenomenon is the proliferation of false news, which not only damages the authenticity of news and the credibility of the media but also poses a serious threat to social order and public psychological safety.^3,4 It is therefore urgent to detect and curb the spread of false news effectively. Several scholars have studied false news detection. Altheneyan et al. developed a false news recognition model that integrates big data technology and machine learning, focusing on the widespread dissemination of false news on Twitter. The F1 score of this model was 92.45%, better than the baseline method.^5 Capuano et al. designed a content-based False News Detection (FND) model to address the shortcomings of manual checks in mitigating the spread of false news. This model could identify and classify false news accurately.^6 Abualigah et al. proposed a model that fuses a Convolutional Neural Network with a Long Short-Term Memory Network (CNN-LSTM) to counter the accelerated spread of false information on the Internet. The fusion model classified false news with an accuracy of 98.974%.^7 Mohawesh et al. proposed a recognition model based on relational variables to deal with the difficulty of detecting false news caused by language complexity. The model improved the language conversion rate by 3.97% and the accuracy of detecting false information by 2.88%.^8 However, these FND studies remain limited in dataset generalization, in the timeliness of social context features, and in the fusion of multi-modal features.
To address these limitations, this study introduces the Multi-Head Self-Attention (MHSA) mechanism and proposes an FND model that combines a Multimodal News Detection Model (MNDM) with Capsule Networks (CapsNet). The innovation lies in a multi-modal detection model that integrates text, image, and social network information while using CapsNet to strengthen the model’s capacity to capture and represent entities and their attributes in images. By constructing MNDM, this study aims to identify false news accurately and reduce its harm to society.

Compared with existing detection methods, the novelty of the research model lies first in its modality fusion. Methods such as SAFE, SpotFake, and HMCAN all take the form of text + image, while the proposed model uses text + image + social network. Second, the proposed model extracts features with CapsNet + channel attention + graph contrastive learning, rather than the traditional CNN and attention mechanisms used by other methods. The contribution of this study lies in four aspects. (1) Through the dynamic routing mechanism of CapsNet, the model captures the entities and their attributes in an image more effectively, which strengthens the representation of image features and thereby improves the accuracy of FND. (2) The channel attention mechanism lets the model focus on important feature channels, while graph contrastive learning improves the processing of social network information. The combination of the two further improves the model’s performance. (3) Experiments on multiple datasets demonstrate the stability and superiority of the multi-modal model in different scenarios, providing a more reliable solution for FND. (4) Hyperparameter analysis determines the optimal parameter configuration of the model, and ablation experiments verify the importance of the channel attention mechanism and graph contrastive learning in improving model performance, providing a reference for subsequent research.

Methods and materials

Construction based on MNDM

The harm of false news is multifaceted. It not only damages the authenticity of the news and the credibility of the media but also poses a threat to social order and the psychological safety of the people. Strengthening the detection of false news has therefore become particularly important.^9,10 On this basis, this study constructs MNDM. In the text encoding module, the text a user writes when first forwarding a news item (the copy) is encoded as a copy vector, and the encoded news text is the news vector. All news texts are preprocessed with word segmentation and stop-word removal. A single news text is denoted $q_{i}$ . Encoding the news text yields a sequence set, as given by formula (1).

T^{q_{i}} = \{T_{1}^{q_{i}}, T_{2}^{q_{i}}, T_{3}^{q_{i}}, \dots, T_{n}^{q_{i}}\}

(1)

In formula (1), $T^{q_{i}}$ is the set of text sequences and $n$ is the number of words. The set of copy from all users who forward the news text is $D^{q_{i}}$ . After a user’s copy is selected, segmented, and stripped of stop words, a corresponding set of text sequences is obtained. The encoder then converts the text sequence into word vectors, and the copy vector is obtained by averaging them. Every user’s copy is processed in the same way, and all copy vectors are combined into a set of copy vectors. Each user’s node information is defined as that user’s copy vector. For news texts, a semantic encoder is used to enhance semantic features; it mainly consists of a shared network, a channel attention calculator, Fully Connected Layers (FCLs), and pooling layers. Its structure is shown in Figure 1.

Figure 1.

Semantic encoder structure diagram.

Figure 1 shows the structure of the semantic encoder, which processes a collection of text sequences. The max pooling layer keeps the most important feature information by selecting the maximum activation value, while the average pooling layer also accounts for less prominent features. The encoder therefore uses a max pooling layer to obtain local features and an average pooling layer to obtain global features. The FCL integrates the word vectors and grammatical features of the news and then combines all features into a news text matrix. The shared network is mainly composed of hidden layers and perceptrons, which map the local and global features to achieve dimensionality reduction. By assigning weights to the channel space of the news text and summing the results, a news text channel space with better semantic features is obtained, as shown in formula (2).

G^{M_{i}} = σ\left(W_{1}(W_{0}(G_{\max}^{M_{i}})) + W_{1}(W_{0}(G_{avg}^{M_{i}}))\right)

(2)

In formula (2), $G^{M_{i}}$ is the news text channel space, $σ$ is the Sigmoid function, $W_{0}$ and $W_{1}$ are both weight parameters, $G_{\max}^{M_{i}}$ is the local feature, and $G_{avg}^{M_{i}}$ is the global feature. $G^{M_{i}}$ is then fed into the channel attention calculation to obtain the contextual semantic features of the news text. The channel attention is shown in formula (3).

f (G^{M_{i}}) = \frac{1}{1 + \exp (- G^{M_{i}})}

(3)
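The pooling, shared mapping, and sigmoid gating of formulas (2) and (3) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it treats formula (3) as the definition of the sigmoid applied in formula (2), applies no nonlinearity between $W_{0}$ and $W_{1}$ because formula (2) shows none, and the function names and toy weights are hypothetical.

```python
import math

def sigmoid(x):
    # Formula (3): f(G) = 1 / (1 + exp(-G)).
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def channel_attention_weights(features, W0, W1):
    """features holds one list of activations per channel."""
    g_max = [max(ch) for ch in features]             # max pooling per channel
    g_avg = [sum(ch) / len(ch) for ch in features]   # average pooling per channel
    shared_max = matvec(W1, matvec(W0, g_max))       # W1(W0(G_max))
    shared_avg = matvec(W1, matvec(W0, g_avg))       # W1(W0(G_avg))
    return [sigmoid(a + b) for a, b in zip(shared_max, shared_avg)]

# Toy input: two channels, identity weights (assumed values for illustration).
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = channel_attention_weights([[1.0, 3.0], [0.0, 0.0]], identity, identity)
```

Each returned weight lies in (0, 1) and would scale its channel before the features are passed on; a channel whose pooled values are all zero receives the neutral weight 0.5.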

The news dissemination module mainly includes graph data augmentation, Graph Sample and Aggregation (GSA), and contrastive loss calculation, as shown in Figure 2.

Figure 2.

News communication module structure diagram.

Figure 2 shows the structure of the news dissemination module, which analyzes and processes the sequence of tweet text. Graph data augmentation improves the classification and prediction ability of neural networks. It removes edges with features from the user propagation network and finally transforms the propagation network into a subgraph. The core of GSA is to update the representation of a target node by sampling its neighboring nodes and aggregating them. The model generally consists of several aggregation layers, and a single aggregation layer contains multiple aggregation functions. The workflow first randomly samples node information in the aggregation layer and then fuses the features of the central node and the sampled nodes with the aggregation function. Sampling and fusion can be repeated across layers, enabling the network to capture distant nodes.^11,12 The relevant calculation is shown in formula (4).

x^{\sim} = σ(\mathrm{mean}(x + \mathrm{Nei}(x)))

(4)
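A sample-and-aggregate step in the spirit of formula (4) might look like the sketch below. It assumes that the mean is taken element-wise over the central node's vector together with the vectors of its sampled neighbors, and that no learned weight matrix is applied, since formula (4) shows none. The function name, `sample_size` parameter, and toy graph are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_and_aggregate(node, features, neighbors, sample_size, rng):
    """Return the updated representation of `node` (one aggregation layer)."""
    candidates = neighbors[node]
    # Randomly sample at most `sample_size` neighboring nodes.
    sampled = rng.sample(candidates, min(sample_size, len(candidates)))
    vectors = [features[node]] + [features[n] for n in sampled]
    dim = len(features[node])
    # Element-wise mean over the central node and its sampled neighbors.
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [sigmoid(m) for m in mean]

# Toy propagation graph: node 0 is forwarded by nodes 1 and 2 (assumed data).
features = {0: [1.0], 1: [3.0], 2: [5.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
updated = sample_and_aggregate(0, features, neighbors, 2, random.Random(0))
```

Stacking this step over several layers lets information from nodes several hops away reach the central node, which is the behavior the paragraph above describes.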

In formula (4), $\mathrm{Nei}(\cdot)$ returns the neighbor nodes of a single node. The subgraphs output by GSA are used as input to calculate the contrastive loss between subgraphs, which increases the similarity between nodes of the same class and reduces the similarity between nodes of different classes. Each subgraph is combined with its node information, and the root node sample is compared with a set of nodes related to other users to obtain the corresponding intra-group comparison samples. The contrastive loss calculation is shown in formula (5).

O (M_{i}^{1}, M_{i}^{2}) = \log \frac{\exp \frac{γ (M_{i}^{1}, M_{i}^{2})}{k}}{\exp \frac{γ (M_{i}^{1}, M_{i}^{2})}{k} + \sum_{j = 1}^{0} \exp \frac{γ (M_{i}^{1}, u_{j}^{q_{i}})}{k} + \sum_{j = 1}^{0} \exp \frac{γ (M_{i}^{1}, v_{i}^{q_{i}})}{k}}

(5)
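The log-ratio structure of formula (5) can be illustrated with the sketch below. Several points are assumptions rather than statements from the source: cosine similarity stands in for $γ$, the two sums are taken over two lists of comparison samples, and all names are hypothetical.

```python
import math

def cosine(a, b):
    # Assumed similarity function standing in for γ in formula (5).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_objective(m1, m2, samples_u, samples_v, k=0.1):
    """Log-ratio of the positive pair against all comparison samples."""
    positive = math.exp(cosine(m1, m2) / k)
    others = sum(math.exp(cosine(m1, s) / k)
                 for s in list(samples_u) + list(samples_v))
    return math.log(positive / (positive + others))

# A comparison sample identical to the positive halves the ratio;
# a dissimilar one barely changes it.
hard = contrastive_objective([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]], [])
easy = contrastive_objective([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]], [])
```

Because the argument of the logarithm never exceeds 1, the value is at most 0; a training loop would usually minimize its negative so that the positive pair is pulled together and the comparison samples are pushed apart.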

In formula (5), $k$ is the temperature parameter and $γ(\cdot, \cdot)$ scores the similarity of a pair of samples. $(M_{i}^{1}, u_{j}^{q_{i}})$ and $(M_{i}^{1}, v_{i}^{q_{i}})$ are comparative samples within the group. $\exp$ is the exponential function, which maps any numerical value to a positive value. Finally, the semantic features and graph structure features obtained from the above modules are fed into the false news discriminator. By calculating cross entropy, the authenticity of the news text can be predicted, as shown in formula (6).

R = \mathrm{softmax}(M_{u}^{1}, M_{i}^{2}) + f(G^{M_{i}})

(6)

In formula (6), $R$ is the predicted result, $M_{u}^{1}$ is the user copy vector, $M_{i}^{2}$ is the news text vector, and $f (G^{M_{i}})$ is the graph structure feature. On this basis, binary classification of the authenticity of news texts can be achieved.

Construction of FND model integrating semantic features and image recognition

Model overview and structure

This study proposes an FND model based on semantic features and image recognition by combining the obtained news text features with image recognition. The model mainly consists of four modules, and the model structure is shown in Figure 3.

Figure 3.

The structure of FND model combining semantic feature and image recognition.

Figure 3 shows the structure of the FND model that combines semantic features and image recognition. In the visual enhancement encoder module, the input is the collected news images. The module contains two networks. The prediction network represents the multiple augmented views of an image, and the target network defines those representations as the objects to be predicted. Connecting and training the two networks through a loss function improves the ability to represent the target object. The module uses image enhancement to improve image quality and highlight the feature information of key graphics, facilitating subsequent network training.^13,14 After the error between the projection vector of the prediction layer and the target vector is calculated, the contrastive loss between the target network and the prediction network is obtained, as shown in formula (7).

O_{γ, ε} \triangleq \left\| \bar{r_{γ}} (z_{γ}) - \bar{z_{ε}^{\sim}} \right\|_{2}^{2} = 2 - 2 \cdot \frac{\langle r_{γ} (z_{γ}), \bar{z_{ε}^{\sim}} \rangle}{\left\| r_{γ} (z_{γ}) \right\|_{2} \cdot \left\| \bar{z_{ε}^{\sim}} \right\|_{2}}

(7)

In formula (7), $O_{γ, ε}$ is the contrastive loss, $\bar{r_{γ}} (z_{γ})$ is the normalized projection vector, and $z_{ε}^{\sim}$ is the target vector. $\bar{z_{ε}^{\sim}}$ represents the feature vector output by the target network. The training process of the model adopts an optimizer and combines it with the attenuation rate of the target to update the parameters $γ$ and $ε$ of the model, as shown in formula (8).

\begin{cases} γ = \mathrm{optim}(γ, \nabla_{γ} O_{γ, ε}, r) \\ ε = η ε + (1 - η) γ \end{cases}

(8)

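In formula (8), $r$ is the learning rate, $\mathrm{optim}(\cdot)$ is the optimizer function, and $η$ is the attenuation rate of the target network. $\nabla_{γ}$ denotes the gradient with respect to $γ$ , and $ε$ denotes the target parameters, which are updated as a moving average of $γ$ .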
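The two sides of formula (7) and the moving-average update in formula (8) can be checked numerically with the sketch below. It uses plain Python lists in place of network outputs and parameters; the function names and toy values are hypothetical, and the optimizer step for $γ$ is omitted.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    n = norm(v)
    return [x / n for x in v]

def squared_distance_form(prediction, target):
    # Left-hand side of formula (7): squared distance of normalized vectors.
    p, t = normalize(prediction), normalize(target)
    return sum((a - b) ** 2 for a, b in zip(p, t))

def cosine_form(prediction, target):
    # Right-hand side of formula (7): 2 - 2 * cosine similarity.
    dot = sum(a * b for a, b in zip(prediction, target))
    return 2.0 - 2.0 * dot / (norm(prediction) * norm(target))

def update_target(target_params, online_params, eta):
    # Second line of formula (8): ε = η ε + (1 - η) γ.
    return [eta * e + (1.0 - eta) * g
            for e, g in zip(target_params, online_params)]

lhs = squared_distance_form([3.0, 4.0], [4.0, 3.0])
rhs = cosine_form([3.0, 4.0], [4.0, 3.0])
new_target = update_target([1.0], [0.0], eta=0.9)
```

Both forms give the same loss value, and a value of $η$ close to 1 makes the target parameters follow the trained parameters only slowly.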

Semantic encoder and CapsNet

The visual enhancement encoder is followed by the semantic encoder of the context. This module mainly consists of two encoders and a CapsNet. To accurately obtain global features, this study first uses a BERT encoder. This encoder does not adopt the structure of a Recurrent Neural Network (RNN). Therefore, this study adopts Positional Encoding (PE) to preserve the positional information of individual phrases in news samples, and the positional information $a$ of the PE vector is shown in formula (9).

\begin{cases} PE_{(a, 2b)} = \sin\left(\frac{a}{10000^{\frac{2b}{d}}}\right) \\ PE_{(a, 2b + 1)} = \cos\left(\frac{a}{10000^{\frac{2b}{d}}}\right) \end{cases}

(9)
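The sinusoidal encoding of formula (9) can be sketched as follows, assuming that the numerator of each argument is the position index $a$ and that the feature dimension $d$ is even. The function name is hypothetical.

```python
import math

def positional_encoding(num_positions, d):
    """Return a num_positions x d table of sinusoidal position vectors."""
    table = [[0.0] * d for _ in range(num_positions)]
    for a in range(num_positions):          # position of the token
        for b in range(d // 2):             # index of the (even, odd) pair
            angle = a / (10000 ** (2 * b / d))
            table[a][2 * b] = math.sin(angle)       # even dimension 2b
            table[a][2 * b + 1] = math.cos(angle)   # odd dimension 2b + 1
    return table

pe = positional_encoding(num_positions=3, d=4)
```

Each row of the table is added to the embedding of the token at that position, so the encoder input carries word order even though it has no recurrent structure.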

In formula (9), $2 b$ and $2 b + 1$ are the even and odd dimensions, and $d$ is the feature dimension of the sequence. After the news sample is segmented with a tokenizer, the PE vector is summed with the word sequence to obtain a combined vector. The combined vector is fed into the Transformer encoder in BERT. The BERT encoder relies on attention, using the MHSA mechanism to characterize the interactions within the combined vectors and thereby enrich the semantic diversity of the sample subspace.^15,16 During the calculation, MHSA computes the attention distribution between words from the vectors $Q$ , $K$ , and $V$ , as shown in formula (10).

\begin{cases} Q = W^{(Q)} S^{T i} \\ K = W^{(K)} S^{T i} \\ V = W^{(V)} S^{T i} \end{cases}

(10)

In formula (10), $W^{(Q)}$ , $W^{(K)}$ , and $W^{(V)}$ are weight matrices for queries, keys, and values. The $Q$ , $K$ , and $V$ vectors are assigned to different self-attention mechanisms. After combining all the attention matrices, dimensionality reduction can be performed to obtain the attention distribution matrix based on the news sample model. Based on formula (11), a single self-attention head in MHSA can be calculated.

H_{i} = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V

(11)

By inputting all the outputs of the self-attention heads into MHSA, a text vector based on the global context of the news sample can be obtained, as shown in formula (12).

E = \mathrm{MultiHead}(Q, K, V) = [H_{1}, H_{2}, H_{3}, \dots, H_{n}] W

(12)

In formula (12), $n$ is the number of self-attention heads.
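Formulas (10) to (12) can be traced end to end with the small pure-Python sketch below. It is only an illustration: it uses the row-vector convention (each token is a row of the input and is multiplied by the weight matrix on the right), and the helper names and toy weights are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def softmax(row):
    peak = max(row)                          # subtract the max for stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(S, Wq, Wk, Wv):
    # Formula (10): project the input into queries, keys, and values.
    Q, K, V = matmul(S, Wq), matmul(S, Wk), matmul(S, Wv)
    d = len(Q[0])
    # Formula (11): softmax(Q K^T / sqrt(d)) V.
    scores = matmul(Q, transpose(K))
    weights = [softmax([x / math.sqrt(d) for x in row]) for row in scores]
    return matmul(weights, V)

def multi_head(S, heads, Wo):
    # Formula (12): concatenate the head outputs and apply the matrix W.
    outputs = [attention_head(S, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(S))]
    return matmul(concat, Wo)

identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]            # two toy token vectors
heads = [(identity, identity, identity)] * 2
Wo = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
E = multi_head(tokens, heads, Wo)
```

With these toy weights each token attends most strongly to itself, and the output keeps one row per token.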

Multi-modal fusion and model evaluation

In addition to the BERT encoder, this study uses the CNN-BiLSTM encoder to obtain local contextual features. In CNN, convolutional kernels share weights, so kernels of the same class have the same parameter values, which significantly lowers the number of parameters in the network. Because Bi-LSTM has both forward and backward hidden layers, the outputs of the CNN convolution are fed into Bi-LSTM to obtain the local contextual representation, as shown in formula (13).

E_{A} = \mathrm{Bi\text{-}LSTM}([h_{1}, h_{2}, \dots, h_{m}])

(13)

In formula (13), $E_{A}$ is the local contextual feature of the news sample, and $h$ is the output of the CNN layer. Finally, the outputs of the BERT and CNN-BiLSTM encoders are combined and fed into CapsNet. CapsNet can capture and represent entities and their attributes in images more effectively, thereby improving the model’s robustness. The relevant calculations are shown in formula (14).

\begin{cases} D^{M} = [E - E_{A}] \\ R^{M} = \mathrm{res}(W_{n} D^{M} + z_{n}) \end{cases}

(14)

In formula (14), $D^{M}$ is the fused representation after data processing, $E$ is the original feature, and $E_{A}$ is the benchmark feature. “ $-$ ” denotes the output fusion process. $W_{n}$ and $z_{n}$ are the weight matrix and bias vector of the hidden layer. $\mathrm{res}$ transforms the fused features into hidden features. The structure based on CapsNet is shown in Figure 4.

Figure 4.

Schematic diagram of CapsNet connection.

Figure 4 shows a schematic diagram of the connections between capsules in the CapsNet. CapsNet is a new type of neural network structure, aiming to improve the model’s understanding and processing ability of spatial hierarchical relationships. The dynamic routing mechanism is a key innovation of CapsNet, used to dynamically allocate connection weights between capsules at different levels. Through repeated iterations, the routing mechanism can determine the optimal connection method between low-level capsules and high-level capsules, effectively conveying entity information. In CapsNet, shallow capsules are typically obtained by multiplying a weight matrix with low-level capsules, and the weight matrix is a shared matrix between intermediate features and low-level capsules.^17,18 When training the capsule model, the capsule activates shallow capsules with a certain probability. The expression of probability is shown in formula (15).

α_{ij} = \mathrm{softmax}(β_{ij}) = \frac{\exp(β_{ij})}{\sum_{k} \exp(β_{ik})}

(15)

In formula (15), $α_{i j}$ is the activation probability and $β_{i j}$ is the weight value of the connection. $\mathrm{softmax}$ normalizes the connection weights into probabilities. Based on formula (15), the capsule output of each layer can be obtained. After dimensionality reduction of the activation vectors of the low-level capsules, all output vectors can be normalized, as shown in formula (16).

t_{i} = \frac{{‖ s_{i} ‖}^{2}}{1 + {‖ s_{i} ‖}^{2}} \times \frac{s_{i}}{‖ s_{i} ‖}

(16)
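The routing softmax of formula (15) and the normalization of formula (16) fit together in the standard dynamic routing loop, sketched below in plain Python. This is an illustration under assumptions: it follows the commonly used routing-by-agreement procedure, the prediction vectors `u_hat` are given rather than computed from a weight matrix, and all names are hypothetical.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def squash(s):
    # Formula (16): keep the direction of s and map its length into [0, 1).
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the vector low-level capsule i predicts for capsule j."""
    n_low, n_high = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    beta = [[0.0] * n_high for _ in range(n_low)]    # connection weights
    outputs = []
    for _ in range(iterations):
        alpha = [softmax(row) for row in beta]       # formula (15)
        outputs = []
        for j in range(n_high):
            s_j = [sum(alpha[i][j] * u_hat[i][j][d] for i in range(n_low))
                   for d in range(dim)]
            outputs.append(squash(s_j))
        # Raise the weight of connections whose prediction agrees with the output.
        for i in range(n_low):
            for j in range(n_high):
                beta[i][j] += sum(a * b for a, b in zip(u_hat[i][j], outputs[j]))
    return outputs

squashed = squash([3.0, 4.0])
routed = dynamic_routing([[[1.0, 0.0]], [[1.0, 0.0]]])
```

The length of each output vector stays below 1, so it can be read as the probability that the entity represented by that capsule is present.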

In formula (16), $s_{i}$ is the capsule output of layer $i$ , $t_{i}$ is the normalized output vector, and $‖ s_{i} ‖$ is the norm of $s_{i}$ . The normalized output vector is then used to update the connection weight values, which in turn update the activation probability. After multiple iterations of updating, the high-level capsules are obtained. After the output of the dynamic routing layer is completed, ResNet is used to optimize the context vector in the news samples. The final news text features are shown in formula (17).

NT_{t} = H(\mathrm{ADD}(D^{M}, T))

(17)

In formula (17), $N T_{t}$ is the feature of the output news text. The final function for multi-modal fusion attention is shown in formula (18).

\mathrm{Multi}_{t} = \mathrm{softmax}\left(\frac{\mathrm{Fully}_{1}(NT_{t}) \times {\mathrm{Fully}_{2}(NT_{t})}^{T}}{\sqrt{e}}\right)

(18)

In formula (18), $\mathrm{Fully}_{1}$ and $\mathrm{Fully}_{2}$ are two FCLs, and $e$ is the number of iterations. $\mathrm{Multi}_{t}$ is the importance of the word. Subsequently, based on affinity matrix learning and semantic image feature fusion, the multi-modal representation of the final news text is obtained, as shown in formula (19).

MN = \mathrm{lay\_norm}(MN_{O} + NT_{o}(MN_{O}))

(19)

In formula (19), $M N$ is the multi-modal representation of news, $N T_{o}$ is composed of two FCLs, and $M N_{O}$ is the contextual semantic enhancement feature. After obtaining the multi-modal news representation, it is input into the FCL to determine the authenticity of the news and obtain the corresponding loss. To demonstrate the practical application value of the multi-modal FND model, the study selects a well-known case of false news in China: “The exorbitant price of 1.76 million yuan for clam powder.” This piece of news spread rapidly on social media in 2023 and was later confirmed to be false. The study analyzed the news article using the constructed model. The news text and related images are input into the model. The model first encodes the text with BERT, then encodes the images with ResNet, and finally fuses these features through the CapsNet. The final model predicts that the news is false news with an accuracy rate of 92% and a confidence score of 0.85. The model identifies certain key words in the text and inconsistencies in the image, and these features match the known features of false news. This case demonstrates how the proposed model can help social media users and news platforms quickly identify and label false news, thereby reducing its spread and impact.

Finally, in the process of developing and deploying the FND model, ethical issues, potential biases of the dataset, and the challenges of technical implementation must be carefully considered. First of all, ethical issues involve user privacy and data protection. The model needs to ensure compliance with relevant privacy regulations and ethical standards when processing personal data to prevent data leakage or abuse. Secondly, the bias of the dataset may affect the fairness and accuracy of the model. For example, if certain groups or viewpoints in the training dataset are over-represented or under-represented, the model may learn these biases and reflect them in the predictions. Therefore, it is necessary to ensure the diversity and representativeness of the dataset and introduce fairness considerations in the model design. The deployment challenges of the model include how to ensure its robustness and reliability in different environments, as well as how to enable non-technical users to understand and trust the predicted results of the model. The solution to these problems requires not only technological progress but also interdisciplinary cooperation, including the joint efforts of experts in fields such as law, sociology, and ethics.

Results

Performance testing based on MNDM

In the performance testing of the news detection model, the POL and GOS datasets from FakeNewsNet are selected. The experimental environment is as follows: the processor is an Intel Xeon E5-2698 v4, the memory is 30 GB, the graphics card is an NVIDIA GeForce RTX 3050, the programming language is Python 3.8, and the deep learning framework is PyTorch 1.11.0. The learning rate is 0.05, the temperature parameter is 0.1, and the dimension of the hidden layer is 128. Before model training, the data are preprocessed. (1) Text cleaning: Unrelated characters and punctuation are removed, words are extracted from the text, and stemming or lemmatization is performed. (2) Image preprocessing: The input image is resized, normalized, and converted to grayscale to meet the input requirements of the model. (3) Data annotation: All data samples are checked to ensure they are correctly labeled as true or false news. (4) Data augmentation: The diversity of the data, especially the image data, is increased through methods such as rotation, flipping, and cropping. (5) Text vectorization: Text is converted into numerical vectors using word embedding techniques. (6) Image feature extraction: Pre-trained CNN models are used to extract image features. The model is trained for 50 epochs. The dataset is divided into training, validation, and test sets in a ratio of 6:2:2. The POL dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). The GOS dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This study introduces four common news detection models for comparative experiments, and the performance evaluation indicators are compared in Figure 5.

Figure 5.

Comparison of network performance indicators based on POL and GOS datasets. (a) Comparison of network performance indicators of POL data sets (b) Comparison of network performance indicators of GOS data sets.

In the POL dataset of Figure 5(a), the accuracy (87.93%) and F1 score (87.96%) of MNDM are significantly higher than those of the comparison models, with average improvements of 7.57% and 7.34%, respectively. In the GOS dataset of Figure 5(b), the accuracy and F1 score of MNDM are 97.64% and 97.61%, which are significantly higher than those of the other models. Further hyperparameter analysis is conducted on the parameters r and k of the news detection model, and the results based on the POL dataset are shown in Figure 6.

Figure 6.

MNDM performance for parameters k and r on the POL dataset. (a) Performance of the news detection model for parameter k. (b) Performance of the news detection model for parameter r.

Figure 6(a) shows the accuracy of the network model for different k values. When k = 1, the news dissemination network receives more attention and the model overfits. When k = 0.1, the accuracy of the model gradually increases, and the model detects false news more effectively than with the other values. Figure 6(b) compares the model accuracy for different r values. When r = 14, the model underfits, while when r = 18, the model overfits. When r = 16, the model performs best in detecting false news. Overall, the news detection model achieves optimal performance at r = 16 and k = 0.1. This study further conducts ablation experiments on the network structure of the news detection model to verify its performance, as listed in Table 1.

Table 1.

Comparison of ablation results of MNDM.

	POL		GOS
Evaluation index	Accuracy	F1	Accuracy	F1
Without graph contrastive learning and channel attention	84.34^c	84.68^c	97.12^a	97.1^a
Without channel attention mechanism	86.91^b	86.72^b	97.34^a	97.38^a
Without graph contrastive learning	85.81^b	86.21^b	97.57^a	97.56^a
Multi-modal news detection model (full)	87.96^a	87.99^a	97.65^a	97.62^a

In Table 1, the full model has the highest accuracy and F1 score on the POL dataset, at 87.96% and 87.99%, respectively. These values are marked with the letter “a,” while the ablated variants are marked “b” or “c,” indicating a significant difference from the full model. The variant without graph contrastive learning and channel attention performs the worst on the POL dataset, with accuracy and F1 score of 84.34% and 84.68%, respectively, marked as “c,” significantly lower than the full multi-modal news detection model. On the GOS dataset, the full model also performs the best, with an accuracy and F1 score of 97.65% and 97.62%, respectively, but every variant is marked as “a”: the variant without graph contrastive learning and channel attention reaches 97.12% and 97.1%, which is not significantly different from the other models. Finally, the accuracy and F1 scores of the model across folds are compared through cross-validation, and the results are shown in Table 2.

Table 2.

The experimental results of cross-validation.

Data set	POL dataset	GOS dataset
Fold 1 accuracy	0.871	0.972
Fold 1 F1 score	0.870	0.973
Fold 2 accuracy	0.877	0.975
Fold 2 F1 score	0.879	0.976
Fold 3 accuracy	0.872	0.974
Fold 3 F1 score	0.874	0.975
Fold 4 accuracy	0.876	0.973
Fold 4 F1 score	0.878	0.974
Fold 5 accuracy	0.873	0.976
Fold 5 F1 score	0.875	0.977
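The fold scores in Table 2 can be summarized with a short script such as the one below, which computes the mean and range per dataset and metric. The values are copied from Table 2; the variable and function names are hypothetical.

```python
from statistics import mean

# Fold 1 to fold 5 scores copied from Table 2.
folds = {
    "POL": {
        "accuracy": [0.871, 0.877, 0.872, 0.876, 0.873],
        "f1": [0.870, 0.879, 0.874, 0.878, 0.875],
    },
    "GOS": {
        "accuracy": [0.972, 0.975, 0.974, 0.973, 0.976],
        "f1": [0.973, 0.976, 0.975, 0.974, 0.977],
    },
}

def summarize(scores):
    # Mean and spread of one metric across the five folds.
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}

summary = {
    dataset: {metric: summarize(scores) for metric, scores in metrics.items()}
    for dataset, metrics in folds.items()
}

for dataset, metrics in summary.items():
    for metric, stats in metrics.items():
        print(f"{dataset} {metric}: mean={stats['mean']:.4f} "
              f"range=[{stats['min']:.3f}, {stats['max']:.3f}]")
```

A narrow range across folds is what supports the stability claim, so reporting the spread alongside the mean makes that claim easier to verify.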

In Table 2, the accuracy and F1 score of the model on the POL dataset fluctuate between 0.870 and 0.879, indicating that the model is stable and predicts well on this dataset. On the GOS dataset, the model performs even better, with both the accuracy and F1 score ranging from 0.972 to 0.977. This indicates that the model captures the features of this dataset more accurately. Across the five folds of cross-validation, the model maintains high performance on different data subsets, demonstrating good generalization ability. The study further compares the errors of the models, and the results are shown in Table 3.

Table 3.

Analysis of model error results.

Model	Dataset	MSE	MAE
UPDF	POL	0.15	0.12
UPDF	GOS	0.08	0.06
PPC	POL	0.14	0.11
PPC	GOS	0.07	0.05
GTUT	POL	0.13	0.1
GTUT	GOS	0.06	0.04
CrossFake	POL	0.16	0.13
CrossFake	GOS	0.09	0.07
NT-UP	POL	0.10	0.08
NT-UP	GOS	0.05	0.03

In Table 3, the error indicators of the NT-UP model on both datasets are superior to those of other models. The MSE of the NT-UP model is 0.10 on the POL dataset and 0.05 on the GOS dataset, indicating that NT-UP has smaller errors in predicting false news and the prediction results are closer to the true values. The MAE of NT-UP is 0.08 on the POL dataset and 0.03 on the GOS dataset, further demonstrating the advantage of NT-UP in prediction accuracy.

Performance detection of FND model integrating semantic features and image recognition

For single-modal and multi-modal news detection, this study selects four typical datasets to evaluate the FND model that integrates semantic features and image recognition. The single-modal performance comparison on the WEIBO and TWITTER datasets is shown in Figure 7.

Figure 7.

Comparison of accuracy of single-modal models based on WEIBO and TWITTER datasets. (a) Accuracy comparison of single-modal model in WEIBO data set. (b) Accuracy comparison of single-modal model in TWITTER data set.

Figure 7(a) shows the performance comparison of five single-modal models on the WEIBO dataset. BERT has the highest accuracy, at 77.37%, an average improvement of 7.11% over the other models. Figure 7(b) shows the comparison on the TWITTER dataset. The CapsuleNet model has the highest accuracy in identifying false news, at 79.92%, while BERT reaches 78.26%, which is 1.66 percentage points lower than the CapsuleNet model. The five single-modal models are further compared on the PHEME and THUCNews datasets, as shown in Figure 8.

Figure 8.

Comparison of accuracy of single-modal models based on PHEME and THUCNews datasets. (a) Accuracy comparison of single-modal model in PHEME data set. (b) Accuracy comparison of single-modal model in THUCNews data set.

On the PHEME dataset in Figure 8(a), the CapsuleNet model achieves the highest accuracy and F1 score. The corresponding values for the BERT model are 76.38% and 71.25%, indicating that it identifies false news on PHEME less accurately. On the THUCNews dataset in Figure 8(b), the BERT model has the highest accuracy and F1 score, at 81.24% and 82.61%, respectively. This study further validates the performance of multi-modal models on the WEIBO and TWITTER datasets, as shown in Figure 9.
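The two evaluation indicators used throughout these comparisons can be sketched as follows. This is an illustrative implementation that assumes binary labels with false news as the positive class (1); the study does not specify whether its F1 score is binary, macro-averaged, or weighted, and the sample labels are hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the target class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return accuracy, f1


# Hypothetical ground truth and predictions (1 = false news, 0 = real news).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2%}, F1={f1:.2%}")
```

Accuracy counts every correct decision equally, whereas F1 focuses on the positive class, so the two can diverge when real and false news are imbalanced in a dataset.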

Figure 9.

Performance comparison of multi-modal models in WEIBO and TWITTER datasets. (a) Performance comparison of multi-modal models in WEIBO data set. (b) Performance comparison of multi-modal models in TWITTER data set.

Figure 9(a) shows the performance comparison on the WEIBO dataset. The accuracy and F1 score of the proposed MNDM, which integrates semantic features and image recognition, are higher than those of the other models, at 90.62% and 91.25%, an average improvement of 7.20% and 6.38%. On the TWITTER dataset in Figure 9(b), the research model again performs best in accuracy and F1 score, at 91.63% and 94.45%, an average improvement of 11.03% and 13.85%. This indicates that the MNDM outperforms the comparison models on both the WEIBO and TWITTER datasets. Figure 10 compares the performance on the PHEME and THUCNews datasets.
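One plausible reading of the "average improvement" figures is the mean of the percentage-point differences between the proposed model and each comparison model. The sketch below works through that arithmetic under this assumption. The baseline accuracies are hypothetical placeholders, not values from the study, so the printed gain is not expected to reproduce the reported 7.20%.

```python
def average_improvement(proposed, baselines):
    """Mean percentage-point gain of `proposed` over each baseline score."""
    if not baselines:
        raise ValueError("at least one baseline score is required")
    return sum(proposed - b for b in baselines) / len(baselines)


# Accuracy of the proposed model on WEIBO, as reported in the text.
proposed_accuracy = 90.62

# Hypothetical baseline accuracies (NOT from the study), for illustration only.
baseline_accuracies = [82.0, 84.5, 83.0, 85.0]

gain = average_improvement(proposed_accuracy, baseline_accuracies)
print(f"average improvement: {gain:.2f} percentage points")
```

Under this definition the result equals the proposed score minus the mean baseline score, so a single weak baseline can noticeably inflate the average gain.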

Figure 10.

Performance comparison of multi-modal models in PHEME and THUCNews datasets. (a) Performance comparison of multi-modal models in PHEME data set. (b) Performance comparison of multi-modal models in THUCNews data set.

In Figure 10(a), the accuracy (90.42%) and F1 score (86.18%) of the research model are higher than those of the comparison models on the PHEME dataset, with an average improvement of 6.53% and 9.62%. In Figure 10(b), on the THUCNews dataset, the accuracy (92.54%) and F1 score (94.12%) of the research model are also the best, with an average improvement of 12.42% and 15.92%. Overall, the MNDM outperforms the other models, and the multi-modal model performs better overall than the single-modal models. This indicates that the multi-modal model considers more types of news features, which improves its detection of false news. Finally, this study conducts ablation experiments on the multi-modal model. Considering that the F1 score on the PHEME dataset is comparatively low (86.18%), this study selects the WEIBO, TWITTER, and THUCNews datasets for these experiments. Table 4 shows the specific data.

Table 4.

Results of multi-modal model ablation experiment.

	WEIBO		TWITTER		THUCNews
Evaluation index	Accuracy	F1 score	Accuracy	F1 score	Accuracy	F1 score
Without multi-modal fusion attention	90.03	91.17	90.35	94.12	89.24	85.82
Without CNN-BiLSTM encoder and CapsNet	90.66	92.64	90.71	94.02	89.29	83.98
Without visual enhancement encoder	90.12	90.88	91.26	93.25	89.63	83.62
Research model	91.62	91.26	91.65	94.45	90.43	86.17

In Table 4, using the visual enhancement encoder, CNN-BiLSTM, or CapsNet improves model performance, that is, it enhances the ability to identify false news. Without the encoder and CapsNet, the average accuracy and F1 score are 89.88% and 90.37%. With the visual enhancement encoder, CNN-BiLSTM, and CapsNet, the average accuracy and F1 score rise to 91.57% and 90.62%, an average improvement of 1.69 and 0.25 percentage points. Finally, the distribution of attention weights for the sentence “The quick brown fox jumps over the lazy dog” is analyzed. The higher the weight, the more important the model considers the word to be for judging the authenticity of the news. The result is shown in Figure 11.

Figure 11.

Attention heat map of news sentences.

Figure 11 shows the weighted attention heat map of each word. The results show that the attention weight of the word “fox” is the highest, which is 0.30. This means that when the model processes this sentence, it considers “fox” to be the most important for judging the authenticity of the news. Other words such as “quick” and “brown” also receive relatively high weights, which are 0.20 and 0.15, respectively.
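The per-word weights behind a heat map of this kind can be illustrated with a short sketch. It assumes that the weights are obtained by softmax-normalizing one raw attention score per token, which the study does not state explicitly. The raw scores below are hypothetical and are chosen only so that "fox" ranks highest, so the resulting weights do not reproduce the values in Figure 11.

```python
import math


def softmax(scores):
    """Normalize raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = "The quick brown fox jumps over the lazy dog".split()

# Hypothetical raw attention scores, one per token (NOT from the study).
raw_scores = [0.1, 1.6, 1.3, 2.0, 0.4, 0.0, 0.1, 0.3, 0.5]

weights = softmax(raw_scores)
top_token = max(zip(tokens, weights), key=lambda pair: pair[1])[0]

for token, weight in zip(tokens, weights):
    print(f"{token:>6}: {weight:.2f}")
print("most attended token:", top_token)
```

Since the weights sum to 1, each value can be read as the share of attention the model assigns to that word within the sentence.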

Discussion

In the study, on the POL dataset, the accuracy of the news detection model was 87.93% and the F1 score was 87.96%, with an average improvement of 7.57% and 7.34%. On the GOS dataset, the accuracy was 97.64% and the F1 score was 97.61%, significantly higher than those of the other models. In the performance testing of FND models that integrate semantic features and image recognition, multi-modal models outperformed single-modal models on the WEIBO, TWITTER, PHEME, and THUCNews datasets. For example, on the WEIBO dataset, the accuracy of the multi-modal model was 90.62% and the F1 score was 91.25%, an average improvement of 7.20% and 6.38%. In the ablation experiment of the news detection model, removing the channel attention mechanism and the graph comparison method decreased the accuracy by 3.62% and 0.53%, respectively, indicating that these components are crucial for model performance. In the ablation experiment of the multi-modal model, the average accuracy and F1 score without the encoder and CapsNet were 89.88% and 90.37%. With these components, the average accuracy and F1 score increased to 91.57% and 90.62%, an average improvement of 1.69 and 0.25 percentage points.

Tufchi et al. constructed a GAN for detecting false news, using TWITTER and news on Facebook as research subjects. Similar to that work, this study also uses accuracy and F1 score as evaluation indicators and introduces a comparable encoder. Their improved model performed better and identified false news more reliably.¹⁹ Comito et al. developed a deep learning-based MNDM to address the increasingly widespread dissemination of false news. After a multi-modal mechanism was introduced, the performance of their model improved significantly, as did its ability to detect false news.²⁰

However, in some failure cases, the proposed model performed poorly in identifying certain specific types of news, such as satirical news or news containing puns. Such news often requires deeper semantic understanding and cultural background knowledge, and the proposed model has limited capabilities in this regard. In addition, in some cases the model relies too heavily on image information and ignores the text content. For example, when the image is irrelevant to or misleading about the content of the news text, the model may make wrong predictions. In summary, the proposed MNDM and the FND model that integrates semantic features and image recognition have demonstrated superior performance on multiple datasets, providing effective technical means for detecting false news.

Conclusion

This study explores the detection of false news by constructing an MNDM and an FND model that integrates semantic features and image recognition. The results indicate that multi-modal models perform better than single-modal models on multiple datasets, highlighting the importance of using information from different modalities to improve FND accuracy. Introducing encoders and CapsNet into the model improves its ability to detect false news. However, the research has some limitations. First, the generalization ability of the model has not yet been verified on a broader range of datasets, which may affect its applicability in different linguistic and cultural contexts. Second, the ethical impacts of the model, such as privacy protection and algorithmic bias, need to be explored further. Future work will focus on evaluating the generalization ability of the model on more diverse datasets. In addition, the model structure can be simplified to suit resource-constrained environments. Finally, in-depth research will be conducted on the ethical implications of the model to ensure its fairness and transparency.

Footnotes

ORCID iD

Changyue Li

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Yang, Zhang, Cheng, et al. Exploring information dissemination effect on social media: an empirical investigation. Personal Ubiquitous Comput 2023; 27(4): 1469–1482.

2. Choudhuri, Adeniye, Sen. Distribution alignment using complement entropy objective and adaptive consensus-based label refinement for partial domain adaptation. Artif Intell Appl 2023; 1(1): 43–51.

3. Monteith, Glenn, Geddes, et al. Artificial intelligence and increasing misinformation. Br J Psychiatry 2024; 224(2): 33–35.

4. Aïmeur, Amri, Brassard. Fake news, disinformation and misinformation in social media: a review. Soc Netw Anal Min 2023; 13(1): 30–42.

5. Altheneyan, Alhadlaq. Big data ML-based fake news detection using distributed learning. IEEE Access 2023; 11(2): 29447–29463.

6. Capuano, Fenza, Loia, et al. Content-based fake news detection with machine and deep learning: a systematic review. Neurocomputing 2023; 530(5): 91–103.

7. Abualigah, Al-Ajlouni, Daoud, et al. Fake news detection using recurrent neural network based on bidirectional LSTM and GloVe. Soc Netw Anal Min 2024; 14(1): 40–51.

8. Mohawesh, Maqsood, Althebyan. Multilingual deep learning framework for fake news detection using capsule neural network. J Intell Inf Syst 2023; 60(3): 655–671.

9. Gupta, Dennehy, Parra, et al. Fake news believability: the effects of political beliefs and espoused cultural values. Inf Manag 2023; 60(2): 103745–103758.

10. Horner, Galletta, Crawford. Emotions: the unexplored fuel of fake news on social media. Fake News on the Internet 2023; 7(5): 147–174.

11. Yang, Zhou, Cao, et al. LightingNet: an integrated learning method for low-light image enhancement. IEEE Trans Comput Imaging 2023; 9(5): 29–42.

12. Peng, Zhu, Bian. U-shape transformer for underwater image enhancement. IEEE Trans Image Process 2023; 32(2): 3066–3079.

13. Hossain, Lin. Efficient stereo depth estimation for pseudo-LiDAR: a self-supervised approach based on multi-input ResNet encoder. Sensors 2023; 23(3): 1650–1662.

14. Hou, Lian, Chu. Bearing fault diagnosis method using the joint feature extraction of Transformer and ResNet. Meas Sci Technol 2023; 34(7): 75108–75119.

15. Vasanthi, Mohan. Multi-Head-Self-Attention based YOLOv5X-transformer for multi-scale object detection. Multimed Tool Appl 2024; 83(12): 36491–36517.

16. Zhang, Liu, et al. DeepCAC: a deep learning approach on DNA transcription factors classification based on multi-head self-attention and concatenate convolutional neural network. BMC Bioinf 2023; 24(1): 345–359.

17. Andrushia, Neebha, Patricia, et al. Capsule network-based disease classification for Vitis Vinifera leaves. Neural Comput Appl 2024; 36(2): 757–772.

18. Bushara, Vinod Kumar, Kumar. LCD-capsule network for the detection and classification of lung cancer on computed tomography images. Multimed Tool Appl 2023; 82(24): 37573–37592.

19. Tufchi, Yadav, Ahmed. A comprehensive survey of multimodal fake news detection techniques: advances, challenges, and opportunities. Int J Multimed Inf Retr 2023; 12(2): 28–41.

20. Comito, Caroprese, Zumpano. Multimodal fake news detection on social media: a survey of deep learning techniques. Soc Netw Anal Min 2023; 13(1): 101–118.