Utilization of transformer model in multimodal data fusion learning: Cross-modal knowledge transfer in the new generation learning space

Abstract

In response to the difficulties in integrating multimodal data and insufficient model generalization ability in traditional cross-modal knowledge transfer, this article used the Transformer model to explore it in the new generation learning space. Firstly, the article analyzed the processing methods of data and models in cross-modal knowledge transfer, and explored the application of Transformer models in the learning space. This model used natural language processing to represent and extract textual features, Mel Frequency Cepstral Coefficients (MFCCs) to represent and extract audio features, and Faster R-CNN (Faster Region-based Convolutional Neural Network) to represent and extract image features. The article also discussed the implementation process of the Transformer model functionality. The experiment used data from four datasets, including Quora Question Pairs, to test the performance of the model’s cross-modal knowledge transfer through intelligent question answering and task analysis. In single type data testing, the accuracy and recall of the model in this article were better than the comparison model in the three types of data. The highest accuracy and recall in the test set were 91% and 93%, respectively. In the most challenging multimodal intelligent question answering test, the speech-image question answering method achieved an accuracy rate of 89% in answering open questions, indicating that the model had good multimodal data fusion ability. In the analysis experiment of 6 homework prone knowledge points on images with text annotations, the induction accuracy reached 85%, indicating that the model had strong generalization ability. The experimental results showed that the Transformer model had good cross-modal knowledge transfer performance, providing a reference for subsequent research on cross-modal knowledge transfer in the new generation learning space.

Keywords

Cross-modal knowledge transfer transformer model new generation learning space Mel frequency cepstral coefficients faster R-CNN

1. Introduction

With the development of society, various forms of information are becoming more and more complex, and the way of knowledge transmission is constantly improving. From the initial transmission of language to text, and then to images and videos, the ways in which human knowledge is transmitted are becoming more and more diverse, and various forms of knowledge are constantly merging. However, the traditional cross-modal knowledge transfer methods face difficulties in integrating multi-modal data and model generalization ability, which greatly limits the efficiency and learning effect of knowledge transfer. To address these challenges, this study introduces the Transformer model to explore its cross-modal knowledge transfer capability in the next-generation learning space.

In the context of cross-modal knowledge transfer, the processing of text, audio and image features is crucial. In this article, text features are represented and extracted by natural language processing, audio features are represented and extracted by MFCC, and image features are represented and extracted by fast regional CNN. Through these techniques, this article aims to solve the problems of difficult integration of multi-modal data and insufficient model generalization ability, so as to improve the efficiency and learning effect of knowledge transfer.

The experimental part uses four datasets, including Quora question pairs, to test the model’s performance in intelligent question answering and task analysis. In the single type data test, the accuracy and recall rate of the model in the three types of data are superior to the comparison model, and the highest accuracy and recall rate in the test set are 91% and 93%, respectively. In the most challenging multimodal intelligence question answering test, the voice-image question answering method achieved 89% accuracy in answering open questions, showing the model’s good ability in multimodal data fusion. In six knowledge point analysis experiments on images containing text annotations, the induction accuracy reaches 85%, which indicates that the model has strong generalization ability. The experimental results show that Transformer model has good performance in cross-modal knowledge transfer, which provides a reference for subsequent research on cross-modal knowledge transfer in a new generation of learning space. Main contribution of this paper are as below.

(1)
Introduction of Transformer model: The Transformer model is introduced for the first time in cross-modal knowledge transfer, and its advantages in processing sequence data and multi-head attention mechanism are utilized to improve the generalization ability and knowledge fusion efficiency of the model.
(2)
Multi-modal feature processing: Text, audio and image features are successfully represented and extracted through natural language processing, Mel Frequency Cepstral Coefficients and fast regional convolutional neural network technology, providing an effective solution for the integration of multi-modal data.
(3)
Experimental verification: Four data sets including Quora question pairs were used to verify the performance of the model in intelligent question answering and task analysis, showing the superiority in single type data and multi-modal data fusion.
(4)
Generalization ability analysis: The strong generalization ability of the model is proved in the experiment of knowledge point analysis on the image containing text comments, which provides a reliable basis for practical application.

2. Related work

In order to improve the efficiency of cross-modal knowledge transfer, many scholars have conducted their own research. The first step in cross-modal knowledge transfer is to integrate multimodal data, and some scholars have explored multimodal data fusion [1, 2, 3]. Sleeman IV WC [4] proposed a new classification method to describe and compare multimodal classification models, analyzed the challenges of current multimodal data sets, and discussed future research directions. Windsor E [5] proposed a long and short-term memory (LSTM) model based on multi-modal fusion, which realized efficient prediction of USD/RMB exchange rate by combining market indicators and investor sentiment. Shi Q [6] developed a soft robot sensing system that combines ultrasonic sensors and flexible triboelectric sensors to achieve efficient remote target location and recognition through multi-modal data fusion. Peng B [7] proposed a rotating machinery fault diagnosis method based on deep residual neural network (DRNN) and multi-modal data fusion, which significantly improved the effects of feature learning, model training, anti-noise, fault tolerance and fault diagnosis. Chen L [8] et al. proposed cross-modal emotion extraction to train models that utilize the fundamental frequency of electroglottal signals, achieving a balance between model complexity and performance. Wang Y [9] proposed a high-dimensional sparse hash framework for cross-modal retrieval, which successfully improved the accuracy and efficiency advantages of search frameworks. Although scholars have conducted research on various aspects of cross-modal knowledge transfer, the difficulty of integrating multimodal data has not been resolved.

The Transformer model is a sequence generation model based on multi-head attention mechanism. The initial Transformer model was applied in machine translation tasks. As scholars deepen their understanding of it, it has played an important role in many fields such as image, video, and audio [10, 11, 12]. Zhang Q [13] proposed a transformer that achieves better performance than baseline and representative models in multiple computer vision tasks by introducing multi-scale features and local inductive bias. Chu Y [14] proposed the TransMut framework to predict the binding of polypeptides to HLA through the TransPHLA model and automatically optimize mutant peptides using the AOMP program, thus improving the efficiency and accuracy of vaccine design. Lu X [15] used the GeT model based on Ghost convolution and Transformer network for the diagnosis of grape leaf pests and diseases, realizing efficient deep semantic feature extraction, significantly improving the accuracy and processing speed, and providing an effective benchmark for grape leaf diagnosis in the field. Baid G [16] improved the accuracy of DNA sequence correction with its gap-aware Transformer encoder, significantly reducing read errors compared to traditional methods, and improving the yield and genome assembly quality of PacBio HiFi reads. Chen Ke [17] et al. proposed a text sentiment analysis method based on sentiment lexicons and Transformers, which can fully utilize the feature information of sentiment lexicons to help better encode sentiment words. Mohammadi Farsani R [18] et al. proposed a Transformer neural network based on self attention, which had special ability in predicting time series problems. Widmann T [19] et al. compared the performance of different models and highlighted the advantages of the new Transformer based model. Wang Chencheng [20] et al. used a Transformer model based on multi-head attention mechanism as the error correction model and proposed a dynamic residual structure that dynamically combined the outputs of different neural modules to enhance the model’s ability to capture semantic information.

3. Application of transformer model

3.1 Cross-modal knowledge transfer

For different modalities of information, using the Transformer model for knowledge transfer requires a series of processing steps. Figure 1 summarizes the way the Transformer model handles data and models in cross-modal knowledge transfer.

Figure 1.

Processing method of data and model.

Figure 1 shows the processing method of cross-modal knowledge transfer. For multimodal information data, such as text data, image data, and audio data, it is necessary to preprocess them and convert them into a format that can be uniformly processed by the Transformer model, and then extract their features in different ways. The attention mechanism in the Transformer model can be applied to the fusion of different modal features, which is to enhance the understanding of information between various modalities during model training. In cross-modal information transfer, the multi-head attention mechanism of the Transformer model can enable the model to learn the semantic correlation between different modal information, which is also the main process of knowledge transfer. For model processing, it is first necessary to model according to the requirements of the task. Common tasks include text-image search, which requires the fusion of features from both text and image data for modeling, or the judgment of emotions expressed in a video. This task requires the fusion of image and audio features for modeling. After completing the training of the model, it is necessary to evaluate it. If it meets the set requirements, such as accuracy and recall, and performs well, it can be applied.

The Transformer model faces challenges in practical applications. High computing cost, processing long sequence data, difficulty in processing sparse data and lack of domain knowledge are the limitations. Taking these factors into consideration, it is crucial to select the right scenario and customize the model.

3.2 Application of models in learning space

In the new generation of learning space, the Transformer model has a wide range of ways to achieve knowledge transfer. For example, for personalized learning recommendations, the Transformer model analyzes students’ notes, grades, study habits, and so on. Based on the analysis results, more suitable and interesting learning materials are recommended to students; based on students’ written or spoken questions, the Transformer model constructs natural language understanding and generation, which can intelligently answer students’ questions, provide corresponding reference materials and answer difficult questions; based on the homework submitted by students, errors in the homework are analyzed in the form of text or images, and knowledge points related to easy and frequent mistakes are summarized, providing teaching direction for the teacher. In the above applications, the Transformer model processes multimodal data such as text, images, and speech according to different requirements, providing students with an efficient and accurate cross-modal knowledge transfer process.

The adaptability of the model is reflected in its ability to be customized according to the needs of different tasks. In personalized learning recommendation, the model analyzes students’ notes, grades and study habits to recommend appropriate learning materials. In intelligent question answering, the model builds natural language understanding and generation capabilities to answer students’ questions; In homework analysis, the model analyzes the homework submitted by students, identifies errors and summarizes error-prone knowledge points, and provides teaching guidance for teachers.

3.3 Data representation and feature extraction

This article refines the data representation and feature extraction process of cross-modal knowledge transfer in Transformer model through the following steps: advanced natural language processing technology is used to conduct in-depth text analysis, and pre-trained word embedding model is used to extract text features. The audio data were preprocessed and converted in frequency domain, and MFCC characteristics were obtained by using Mel filter banks and cepstrum analysis. Faster R-CNN was used to extract regional features from image data. Then, using the multi-head attention mechanism of the Transformer model, features of different modes are integrated. Finally, the model is optimized to improve the accuracy and generalization ability of the model.

This article uses Natural Language Processing (NLP) to extract the main parts of text. NLP [21, 22] is a branch of artificial intelligence primarily used for processing human language, which plays an important role in text feature extraction. By capturing the structural relationship between words, Transformer model can better understand the semantics and structure of text, and improve the accuracy and efficiency of text feature extraction. Figure 2is a schematic diagram of the results of NLP analysis of a paragraph structure.

Figure 2.

Schematic diagram of NLP analysis structure.

Figure 2 shows the structural analysis results of NLP for a paragraph of text. The original text is: “A handsome man is playing football on the playground”. After NLP analysis, the main content of this paragraph has been simplified to “man playing football”. All other text is used to modify the main content of this paragraph. This processing helps the model accelerate its understanding of textual information. The main content of the text is used as the text embedded in the Transformer model. Assuming ${W}$ is the set of all embedded texts, then ${W}$ is represented as:

$\displaystyle{W}=\left[{{w}_{C},{w}_{1},{w}_{2},...,{w}_{N},{w}_{S}}\right]$ (1)

Among them, ${w}_{C}$ represents the embedding of classification markers, which is used for text classification of the entire set. N represents the total number of text for all embedded models. ${w}_{S}$ represents the embedding of paragraph end markers, used to divide the end of a sentence and prevent ambiguity when two paragraphs of text are connected to form a new sentence.

The Transformer model extracts features from ${W}$ and obtains the feature set ${T}$ of the text, where ${T}$ is represented as:

$\displaystyle{T}=\left[{{T}_{C},{T}_{1},{T}_{2},\ldots,{T}_{N},{T}_{S}}\right]$ (2)

All feature vectors in the feature set ${T}$ are transformed from the corresponding elements in the text set ${W}$ .

Speech features are usually divided into prosodic features and sound quality features. Rhythmic features are non personality traits that exist in each person’s way of expression. The prosodic features include elements such as energy, resonance peaks, pitch, and speech duration, reflecting the dynamic expression characteristics of a speech. These elements change based on the emotions of the speaker at the time, the environment they are in, and their subjective consciousness. When the speaker is in a joyful state, the energy in the speech would increase, and the resonance peak may be relatively stable; when emotionally excited, the pace of speech would increase and the voice would become loud. When the speaker is in a noisy environment, he can express himself by increasing his volume. In theatrical performance, actors achieve expressive effects that conform to the plot by changing the rhythm of language, which is a subjective manifestation of phonology.

The sound quality characteristics are personalized features that are related to each individual’s physiological structure. Everyone’s vocal cords, mouth, throat, and other parts are unique, and the production and resonance of sound are different. Sound quality features are also the main features of personalized recognition, with specific applications including voiceprint locks, voice authentication, etc. In the process of knowledge transfer, sound quality features can effectively distinguish the specific identity of students and provide more targeted answers.

To describe speech data, frequency domain analysis is generally used, and Fourier transform [23, 24] is used to perform spectral analysis on speech. A segment of speech is decomposed into several single frequency waveforms. The spectral features are mainly composed of linear prediction coefficients and cepstral analysis, and the feature representation method used in this article is Mel Frequency Cepstral Coefficients (MFCC).

MFCC [25] utilizes the non-linear correspondence between Mel frequency and Hertz frequency, and calculates their correspondence. The calculation method for MKCC is:

$\displaystyle\textit{MFCC}\left({{t},{x}}\right)=\sqrt{\frac{1}{{y}}}\mathop{% \sum}\limits_{{x}=1}^{y}{\lg}\left[{E}\left({{t},{x}}\right)\right]{{\cos}}% \left[{x}\left({{y}-1}\right)\frac{{\pi}}{{y}}\right]$ (3)

Among them, ${E}\left({{t},{x}}\right)$ is the output of the x-th Mel filter at time $t$ , and $y$ is the number of filters. Equation (3) reflects the process of MFCC calculating the logarithmic amplitude spectrum, Mel filtering, and discrete cosine transform of the input signal. Figure 3 shows the effect of MFCC processing a segment of audio.

Figure 3.

Audio and Mel spectrogram.

The left figure in Fig. 3 shows the waveform of a certain audio signal, showing the image of the amplitude of this audio signal changing over time. The right image is the Mel spectrogram, with the y-axis representing the index of the Mel filter. It represents the number of the filter bank on a specific Mel frequency scale and does not have a unit itself. Each color band represents a separate Mel filter, which is distributed at intervals throughout the entire image. The Mel spectrogram can effectively extract features from speech data. MFCC simulates human ear perception to extract audio features critical to speech recognition, which has good computational efficiency and noise robustness. However, they are limited in high-frequency resolution and fast audio change capture, and are sensitive to parameter selection.

Selecting the right MFCC parameters is crucial for extracting effective audio features. The key parameters include the number of filters, window length, frame shift and gauss coefficient. The number of filters determines the frequency resolution, while the window length and frame shift affect the time resolution and the coverage of the audio signal. Gauss processing can enhance the stability of features.

For feature extraction of image data, this article chooses to use Faster R-CNN to perform preliminary feature region selection on the image. Faster R-CNN [26, 27] was chosen for its high accuracy in target detection and robustness to multi-scale targets. Compared to YOLO (You Only Look Once), it is more accurate. Compared with R-CNN (Region-based Convolutional Neural Network), it significantly improves speed through regional proposal networks. These characteristics make it an ideal choice for image feature extraction in multimodal learning space. Figure 4 shows some examples of Faster R-CNN detection applications.

Figure 4.

Faster R-CNN detection results.

The images in Fig. 4 are all sourced from the Roboflow Universe platform, displaying the results of vehicle recognition, building frontal recognition, cherry leaf recognition, and other detection methods. For images that require feature extraction, the Faster R-CNN is first used for region feature partitioning, and then feature vectors are extracted. The feature vectors are placed in the embedding layer for processing, and in the embedding layer, the feature vectors are dimensionally reduced from high dimensions to 512 dimensions. ReLU [28, 29] (Rectified Linear Unit) function is used to enhance the nonlinear expression ability of the feature vector after dimensionality reduction, and its calculation formula is as follows:

$\displaystyle{F}_{\textit{ReLU}}={\textit{Max}}\left({0,{x}}\right)$ (4)

The ReLU function is used to reduce the risk of gradient disappearance, and the Dropout layer reduces overfitting and improves the performance of the Transformer model. Finally, the generated vectors are used as input vectors in the Transformer encoder for training.

3.4 Fusion and information transmission of multimodal features

After extracting the features of text, speech, and image, they all enter the encoder of the Transformer model for unified training. All features need to be fused before training. Feature fusion mainly utilizes the multi-head attention mechanism of each layer of the Transformer model encoder to weight and fuse feature vectors of different modalities. The query vector, key vector, and value vector of the feature vector are set to ${C}_{i}$ , ${J}_{i}$ , and ${Z}_{i}$ , respectively, and the set of all feature vectors for the input encoder is ${X}_{i}$ . Among them, $i$ is the total number of feature vectors, then:

$\displaystyle{\theta}_{\textit{MN}}=\frac{{X}\cdot\left({{C}_{i}\cdot{J}_{i}^{% 2}}\right)}{\sqrt{{J}_{k}}}$ (5)

Among them, ${\theta}_{\textit{MN}}$ represents a weight matrix of size ${M}\cdot{N}$ , and ${J}_{k}$ represents the dimension of the key vector. The formula for calculating the weight value of multi-head attention is:

$\displaystyle{\textit{Attention}}\left({{C}_{0},{J}_{0}}\right)={\textit{% Softmax}}\left({{\theta}_{{\textit{MN}}}}\right)\cdot{Z}_{i}$ (6)

When calculating the value of multi-head attention, it is normalized by Softmax and then multiplied by the value vector to obtain the weight value of multi-head attention. Finally, the obtained weight values are weighted and fused with the corresponding feature vectors, resulting in the fusion vector ${\overrightarrow{X}}$ , which is:

$\displaystyle\overrightarrow{X}=\sum^{i}_{i=1}(\textit{Attention)}_{i}(C_{i},J% _{i})\cdot X_{i}$ (7)

In the feature weighted fusion process through multi-head attention mechanism, the model pays attention to the relationships between various modalities of data, generates semantic connections, and improves the model’s ability to express data, which is information transmission.

4. Experimental application of transformer model

4.1 Datasets

According to the task requirements, this experiment used the natural language processing dataset Quora Question Pairs, annotation information dataset AI2D, multimodal dataset Visual Question Answering, and speech dataset VQA-Voice. Data sets are selected based on their relevance and diversity to the experimental target. The Quora Question Pairs dataset is chosen because it contains rich question pairs and is close to real-world application scenarios to train the model to understand and answer the student’s questions. AI2D datasets focus on job analysis, helping models identify errors and error-prone knowledge points. Visual Question Answering data sets enhance a model’s ability to process multimodal information by answering questions in combination with text and images. The VQA-Voice dataset focuses on voice question answering, providing the model with the ability to process voice input. The comprehensive use of these data sets ensures the comprehensiveness and practicality of the model in the multimodal knowledge transfer task.

Figure 5.

Performance test result.

4.2 Experimental environment and parameter settings

This experiment was conducted on the Windows 10 operating system, with a deep learning framework of Pytorch, a Central Processing Unit (CPU) model of Intel i9-9900k, a graphics card model of Nvidia 2080Ti, and a running memory of 32GB.

In the training parameter settings, the learning rate was set to 0.0005; the number of iterations was set to 20; the Dropout value was set to 0.2; the optimization function used the Adam function [30].

4.3 Experimental process and results

Firstly, the basic performance of the model was tested, and three algorithms, Support Vector Machine (SVM) [31], Naive Bayes Classifier, and Random Forest [32], were introduced for comparison in the experiment. Due to the fact that some of the algorithm models mentioned above do not have the ability to handle cross-modal data, the three types of data, namely text, speech, and image, were tested separately. The experiment selected 5000 text Q&A, 2000 speech Q&A, and 3000 image Q&A in four datasets. The training and testing sets were tested in a 1:1 ratio. The obtained results are shown in Fig. 5.

Figure 6.

Questioning test results.

Figure 7.

Job analysis test results.

Figure 5 shows the performance of each model in the training and testing sets. It can be observed that the overall performance of each model in the test set was better than that in the training set, both in terms of accuracy and recall. Among the three types of data, the Transformer model performed the best, with accuracy rates of 0.91, 0.87, and 0.88 for text, speech, and image in the test set, and recall rates of 0.91, 0.90, and 0.93, respectively. The Naive Bayes classifier had poor overall performance, with no data exceeding 0.8 in the test set. The results show that the Transformer model performs well on various types of data, while naive Bayes classifiers generally perform poorly. In addition, differences in the model’s performance on the training set and the test set were also observed, indicating the importance of the model’s generalization performance. Further study of this performance volatility may help optimize model performance and address challenges in different data modes.

In order to test the Transformer model’s ability to integrate multimodal data, the experiment asked questions about the model. There were two types of questioning methods: text questioning and speech questioning, and the model was explicitly required to answer in the form of text or images. There were three types of problems: factual problems, reasoning problems, and open-ended problems. Fact questions are questions with clear answers, such as “what is the standard gravitational acceleration” or “who was the first person to manufacture an airplane”; inference questions are questions that require explanation, such as “why ice floats on the water surface”; open questions are a type of question that tests the creativity and understanding of a model, such as “what is the biggest challenge you may encounter in the next year of learning?”. Three types of questions were labeled as A, B, and C, and the accuracy and time of the model’s answer were tested. The obtained results are shown in Fig. 6.

Figure 6 shows the results of questioning the model. Among them, subgraphs (a), (b), (c), and (d) represent the question answering methods of text-text, text-image, speech-text, and speech-image, respectively. In the text-text question answering method, the model had a very high accuracy rate for three types of questions, which were 99%, 98%, and 99%, respectively. In the text-image question answering method, the response times for the three types of questions were 1.87s, 2.14s, and 3.64s, respectively. Among the four questioning methods, the most challenging one was the speech-image question answering method, which greatly tested the model’s ability to integrate multimodal information. In this experiment, the accuracy of the model in answering three questions was 91%, 91%, and 89%, with response times of 2.36s, 3.64s, and 6.54s, respectively. The experimental data showed a decrease compared to other questioning methods, but the accuracy remained around 90%, indicating that the model has good multimodal data fusion ability and has good answer accuracy for questioning data of different modalities, effectively completing knowledge transfer.

In order to test the generalization ability of the model, different types of assignments were inputted into the model as data, including six types: mathematics, physics, chemistry, biology, history, and art. The data of text, images, and images with text annotations were tested separately and labeled as E, F, and G. The accuracy of the model’s induction of error prone knowledge points was calculated, and the accuracy was determined by the teacher manually judging whether the results are accurate. The obtained results are shown in Fig. 7.

Figure 7 shows the accuracy of the model’s induction of error prone knowledge points in task analysis for three modal data as the number of task types increases. Overall, with the increase of homework types, the accuracy of induction showed varying degrees of decline. In textual modal data, the induction accuracy of one task type reached 99%, but when it increased to six types, it decreased to 96%. The image modality decreased from 98% to 91%. In the image mode of text annotation, when the number of task types increased to 5, the induction accuracy was already below 90%, and when there were 6, it dropped to 85%. Although the accuracy was less than 90%, 85% was an excellent number for multimodal task analysis. According to the experimental results, as the types of tasks increase, it shows strong generalization ability and adaptability. This shows that the model has excellent performance and potential in multimodal task analysis, which can effectively process new types of job data and promote students’ knowledge transfer.

5. Conclusions

In the new generation of learning space, cross-modal knowledge transfer is of great significance in improving student performance and helping teachers summarize knowledge points. This article used the Transformer model, combined with natural language processing and Faster R-CNN technologies, to achieve knowledge transfer to students through intelligent question answering and homework analysis. The experimental results showed that the model proposed in this article not only performed better than other models in testing with single-mode data, but also had good accuracy in answering multimodal questions. In the analysis of multiple types of homework, the accuracy of knowledge point induction is good, reflecting the model’s good generalization ability. The innovation point of this article is that it proposes an innovative method of multi-modal data fusion, and demonstrates the efficiency and effectiveness of Transformer model in processing comprehensive information, which is an extension of its application field. The limitation of the experiment is that in the analysis of multi type tasks with text annotated images, the accuracy of some experiments did not reach over 90%. It is hoped to improve the results here to the ideal level through training on more types of datasets in the future.

Data availability statement

The data of this paper can be obtained through the email to the authors.

Funding

There is no funding information for the work in this paper.

Footnotes

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this work.

References

Steyaert

Pizurica

Nagaraj

Khandelwal

Boussard

Gentles

, et al. Multimodal data fusion for cancer biomarker discovery with deep learning. Nature Machine Intelligence. 2023; 5(4): 351-362.

Diwakar

Singh

Shankar

Nayak

Vimal

Singh

Sisodia

. Directive clustering contrast-based multi-modality medical image fusion for smart healthcare system. Netw. Model. Anal. Health Informatics Bioinform. 2022; 11(1): 15.

Acosta

Falcone

Rajpurkar

Topol

. Multimodal biomedical AI. Nature Medicine. 2022; 28(9): 1773-1784.

Sleeman IV

Kapoor

Ghosh

. Multimodal classification: Current landscape, taxonomy and future directions. ACM Computing Surveys. 2022; 55(7): 1-31.

Windsor

Cao

. Improving exchange rate forecasting via a new deep multimodal fusion model. Applied Intelligence. 2022; 52(14): 16701-16717.

Shi

Sun

Jin

Cheng

. Soft robotic perception system with ultrasonic auto-positioning and multimodal sensory intelligence. ACS Nano. 2023; 17(5): 4985-4998.

Peng

Xia

Nyarko

Zhu

Liu

, et al. An intelligent fault diagnosis method for rotating machinery based on data fusion and deep residual neural network. Applied Intelligence. 2022; 52(3): 3051-3065.

Chen

Ren

Mao

Zhao

. Electroglottograph-based speech emotion recognition via cross-modal distillation. Applied Sciences. 2022; 12(9): 4338.

Wang

Chen

Luo

. A high-dimensional sparse hashing framework for cross-modal retrieval. IEEE Transactions on Circuits and Systems for Video Technology. 2022; 32(12): 8822-8836.

10.

Nassiri

Akhloufi

. Transformer models used for text-based question answering systems. Applied Intelligence. 2023; 53(9): 10602-10635.

11.

Zhaoxia

. Improvement of Computer Adaptive Multistage Testing Algorithm Based on Adaptive Genetic Algorithm. International Journal of Intelligent Information Technologies (IJIIT). 20(1): 1-19.

12.

Zhao

Cong

Shi

Miao

. Queryformer: A tree transformer model for query plan representation. Proceedings of the VLDB Endowment. 2022; 15(8): 1658-1670.

13.

Zhang

Tao

. Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. International Journal of Computer Vision. 2023; 131(5): 1141-1162.

14.

Chu

Zhang

Wang

Zhang

Wang

, et al. A transformer-based model to predict peptide – HLA class I binding and optimize mutated peptides for vaccine design. Nature Machine Intelligence. 2022; 4(3): 300-311.

15.

Yang

Zhou

Jiao

Liu

, et al. A hybrid model of ghost-convolution enlightened transformer for effective diagnosis of grape leaf disease and pest. Journal of King Saud University-Computer and Information Sciences. 2022; 34(5): 1755-1767.

16.

Baid

Cook

Shafin

Yun

López

Berthet

, et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nature Biotechnology. 2023; 41(2): 232-238.

17.

Xingtong

. Research on sentiment analysis algorithms based on sentiment dictionaries and Transformer models. Journal of Nanjing University of Posts and Telecommunications: Natural Science Edition. 2020; 40(1): 55-62.

18.

Mohammadi Farsani

Pazouki

. A transformer self-attention model for time series forecasting. Journal of Electrical and Computer Engineering Innovations (JECEI). 2020; 9(1): 1-10.

19.

Widmann

Wich

. Creating and comparing dictionary, word embedding, and transformer-based models to measure discrete emotions in german political text. Political Analysis. 2023; 31(4): 626-641.

20.

Chencheng

Liner

Yingying

Yongping

Erhong

. Chinese grammar correction method based on Transformer enhanced architecture. Chinese Journal of Information Science. 2020; 34(6): 106-114.

21.

Kang

Cai

Tan

Huang

Liu

. Natural language processing (NLP) in management research: A literature review. Journal of Management Analytics. 2020; 7(2): 139-172.

22.

Khurana

Koli

Khatter

Singh

. Natural language processing: State of the art, current trends and challenges. Multimedia Tools and Applications. 2023; 82(3): 3713-3744.

23.

Caushaj

Sugumaran

. Classification and security assessment of android apps. Discov Internet Things. 2023; 3: 15.

24.

Singh

Diwakar

Cheng

Shankar

. A new wavelet-based multi-focus image fusion technique using method noise and anisotropic diffusion for real-time surveillance application. J. Real Time Image Process. 2021; 18(4): 1051-1068.

25.

Haq

Nasrun

Setianingsih

Murti

. Speech recognition implementation using MFCC and DTW algorithm for home automation. Proceeding of The Electrical Engineering Computer Science and Informatics. 2020; 7(2): 78-85.

26.

Mai

Zhang

Jia

Meng

MQH

. Faster R-CNN with classifier fusion for automatic detection of small fruits. IEEE Transactions on Automation Science and Engineering. 2020; 17(3): 1555-1569.

27.

Chen

Wang

Sakaridis

Dai

Gool

. Scale-aware domain adaptive faster r-cnn. International Journal of Computer Vision. 2021; 129(7): 2223-2243.

28.

Vershynin

. Memory capacity of neural networks with threshold and rectified linear unit activations. SIAM Journal on Mathematics of Data Science. 2020; 2(4): 1004-1033.

29.

Zhao

Zhong

Tang

Dong

Pecht

. Deep residual networks with adaptively parametric rectifier linear units for fault diagnosis. IEEE Transactions on Industrial Electronics. 2020; 68(3): 2587-2597.

30.

Zhouyu

. Current status of research on magnetic confinement fusion and superconducting tokamak devices. Procedia Computer Science, Volume 228; 2023, pp. 163-170.

31.

Kurani

Doshi

Vakharia

Shah

. A comprehensive comparative study of artificial neural network (ANN) and support vector machines (SVM) on stock forecasting. Annals of Data Science. 2023; 10(1): 183-208.

32.

Schonlau

Zou

. The random forest algorithm for statistical learning. The Stata Journal. 2020; 20(1): 3-29.