Abstract
Introduction:
Multiple sclerosis (MS) is a neurological disorder in which the immune system mistakenly attacks the central nervous system, and it can significantly impair bodily and cognitive functions. Assessing cognitive workload in MS patients is crucial for understanding their condition, and one effective approach combines an n-back working memory task with magnetoencephalography (MEG) recordings, which capture brain activity through the magnetic fields it generates.
Methods:
This study proposes an automated framework to detect mental workload levels in MS patients using MEG data. The EEGNet model is employed to assess mental workload, with transfer learning techniques used for fine-tuning to enhance model performance. Additionally, traditional machine learning models are evaluated to compare their performance with the deep learning-based approach.
Results:
Experimental results indicate that the proposed model achieves an accuracy of 51.68 ± 11.92% for healthy subjects and 51.77 ± 13.29% for MS patients across the three workload levels (0-, 1-, and 2-back), significantly outperforming baseline methods.
Conclusions:
Deep learning-based end-to-end models can effectively assess mental workload in MS patients, achieving competitive performance without requiring explicit feature extraction or dimensionality reduction steps typically used in conventional classification pipelines.
Introduction
Empirical research has long examined the limited capacity of human working memory (WM), initially estimated at approximately seven discrete “chunks” of information, 1 but later refined to an upper bound closer to three or four items. 2 These inherent constraints play a central role in mental workload regulation and cognitive performance. Excessive workload can impair decision-making, increase error rates, and compromise safety in high-stakes environments, whereas insufficient workload may lead to disengagement and reduced vigilance.3,4 Effective mental workload management is therefore essential for maintaining performance, safety, and cognitive engagement across diverse operational contexts.5,6 In this work, we use the term mental workload to refer to task-induced cognitive demands, which are closely related to concepts such as cognitive load and WM load.
To quantify mental workload, prior research has relied on four main methodological approaches: analytical, subjective, performance-based, and neurophysiological methods. 7 While analytical and subjective approaches provide useful theoretical and self-reported estimates, and performance-based metrics capture behavioral outcomes, neurophysiological measures offer objective, time-resolved insights into underlying brain processes by leveraging signals such as neural activity, heart rate, and eye movements. 8 Owing to their sensitivity to dynamic changes in cognitive state and reduced susceptibility to reporting bias, neurophysiological techniques have become increasingly prominent in workload assessment. 9
Neurophysiological assessments of mental workload have employed a range of modalities, including electroencephalography (EEG), magnetoencephalography (MEG), functional magnetic resonance imaging, functional near-infrared spectroscopy, electrooculogram (EOG), and electrocardiogram (ECG).10–12 Among these, EEG is widely used due to its portability and relatively low cost,13,14 whereas MEG measures the magnetic fields generated by neuronal activity and offers improved spatial resolution with reduced sensitivity to muscle artifacts.15,16 Although MEG requires specialized and costly instrumentation, its signal fidelity and spatial precision make it particularly valuable for studying the neural correlates of cognitive workload.
The continuous n-back task is a well-established paradigm for manipulating mental workload and cognitive demand. 17 By requiring participants to identify stimuli that match those presented n steps earlier, the task systematically engages WM, executive control, and information processing speed (IPS). These cognitive domains are frequently impaired in multiple sclerosis (MS), a chronic autoimmune disease affecting the central nervous system and a leading cause of neurological disability.18–20 Prior studies have demonstrated that individuals with MS exhibit marked difficulties in retaining visual information in WM, supporting the use of the n-back task as a sensitive probe of mental workload in this population. 21
Despite the growing use of neurophysiological data for mental workload assessment, several important gaps remain in the existing literature.22,23 Most prior studies rely on handcrafted features—such as band-limited power or connectivity metrics—followed by conventional machine learning classifiers, which can limit the ability to capture the full spatiotemporal structure of high-dimensional MEG signals and hinder scalability across tasks and populations.24–26 Moreover, while deep learning approaches have increasingly been applied to EEG-based workload decoding, end-to-end deep learning frameworks operating directly on raw MEG data remain relatively underexplored, particularly in the context of WM tasks.27–29 An additional challenge is limited cross-subject generalizability: models trained at the group level often fail to adapt effectively to unseen individuals, a limitation that is especially pronounced in heterogeneous or clinical populations such as MS. 30 Together, these factors underscore the need for robust end-to-end MEG-based models that can decode cognitive workload from raw signals while explicitly addressing inter-individual variability.
In neurophysiological research, MEG has increasingly been used to study mental workload and WM, with prior work spanning statistical feature extraction and machine learning-based analyses. For example, Syrjälä et al. examined phase-coupling patterns across multiple frequency bands in MEG, while Bhalerao et al. combined EEG and MEG signals for cognitive visual object detection using pattern recognition techniques.31,32 Using n-back task data, Rossi et al. applied a hidden Markov model to reveal altered event-related network dynamics in MS patients compared with healthy controls (HCs) and further employed data-driven network decomposition to characterize temporal, spatial, and spectral disruptions in WM processing.33,34
Despite MEG’s demonstrated potential, EEG remains the dominant modality for workload assessment due to its accessibility and lower cost. Traditional analysis pipelines typically rely on handcrafted features such as power spectral density, followed by classifiers including support vector machines (SVMs) or random forests (RFs). More recently, deep learning approaches have gained traction by enabling end-to-end analysis directly from raw neurophysiological signals, removing the need for explicit feature extraction and dimensionality reduction.35,36 Among these models, EEGNet has emerged as a compact and effective convolutional neural network for electrophysiological data, leveraging depthwise separable convolutions and principled spatial–temporal filtering to achieve strong performance and generalization across diverse brain–computer interface (BCI) paradigms. 27
In this study, we investigate mental workload during a WM task using raw MEG signals and the EEGNet model. Our primary objective is to decode multilevel n-back workload (0-, 1-, and 2-back) directly from MEG data, leveraging MEG’s millisecond temporal resolution to capture neural dynamics that may not be evident from behavioral measures alone. We include individuals with MS, not to perform clinical classification but to assess the generalizability of workload decoding and to examine whether neural representations of mental workload differ from those of HCs. Given that WM and sustained attention deficits are well documented in MS, the n-back task provides a well-established and sensitive probe of mental workload in this population.
Participants, including HCs and MS patients, completed a visual n-back task with three difficulty levels to induce graded cognitive load. Raw MEG recordings, represented as Channel × Time matrices, were used as direct input to EEGNet for workload classification. To mitigate inter-subject variability, we employed transfer learning by fine-tuning the model using a subset of subject-specific data.37,38 EEGNet performance was evaluated against traditional machine learning approaches trained on power spectral density features. The remainder of the article is organized as follows: the “Data description” section describes the dataset, participants, and task design; the “Methodology” section details the model architecture and training procedures; the “Results” section presents the experimental results and performance comparisons; and the “Discussion” section concludes with key findings and future directions.
Data Description
Participants
MEG data were collected from 114 participants who performed a visual–verbal n-back task, including 36 HCs and 78 individuals with MS (pwMS), aged 18–65 years. Demographic and clinical characteristics of the study cohort, including age, sex, education level, disease duration, and disability scores, are summarized in Table 1 and Figure 1. All pwMS were recruited from the National MS Center Melsbroek in Belgium and met the diagnostic criteria for MS based on the revised McDonald criteria. 39 Participants had an Expanded Disability Status Scale score of 6 or lower, 40 indicating no greater than moderate disability. Exclusion criteria included a history of relapses or corticosteroid treatment within the 6 weeks preceding participation. Additional exclusion criteria were the presence of pacemakers, dental wires, major psychiatric disorders, or epilepsy, as these conditions could interfere with MEG recordings or compromise study outcomes.

Age distributions of HC and pwMS participants. HC, healthy control; MS, multiple sclerosis.
Subjects Description
EDSS, expanded disability status scale; HC, healthy control; MS, multiple sclerosis.
MEG data acquisition
MEG data acquisition was conducted at CUB Hôpital Erasme in Brussels, Belgium, using an Elekta Neuromag system. The system was equipped with a sensor array consisting of 102 triple sensors, each comprising one magnetometer and two orthogonal planar gradiometers. To minimize external magnetic interference, the system was housed within a lightweight magnetically shielded room (Maxshield™; Elekta Oy, Helsinki, Finland). Prior to MEG recording, four head position indicator coils were attached to the participants’ left and right mastoid processes and forehead to enable accurate tracking of head movements throughout the session. In addition, the positions of these coils, along with at least 400 head-surface points (including the nose, face, and scalp), were recorded relative to anatomical fiducial points (nasion and left and right preauricular points) using an electromagnetic tracking system (Fastrak; Polhemus, Colchester, Vermont, USA). MEG data were acquired at a sampling rate of 1000 Hz and band-pass filtered between 0.1 and 330 Hz. Participants were instructed to remain as still as possible during acquisition and were seated in an upright position with their heads oriented toward the back of the MEG helmet to ensure consistency and minimize motion artifacts. In addition to MEG signals, EOG and ECG signals were simultaneously recorded. These physiological signals were subsequently used for offline artifact rejection, thereby enhancing the integrity and accuracy of the MEG data.
Task paradigm
During MEG acquisition, all participants performed a visual–verbal n-back task 41 designed to assess WM load across three conditions: 0-back, 1-back, and 2-back. Participants were instructed to press a response button with their right hand under specific conditions. In the 0-back condition, they responded when the letter “X” appeared on the screen. In the 1-back condition, a response was required when the current letter matched the one immediately preceding it. In the 2-back condition, participants responded when the current letter matched the one presented two trials earlier. Figure 2 illustrates the n-back paradigm used in this study. Letter stimuli (6 × 6.5 cm) were projected onto a screen positioned 72 cm in front of the MEG helmet. Each stimulus was displayed for 1 s, followed by an intertrial interval of 2.8 s. At the beginning of each block, instructions corresponding to the specific WM condition were presented for 15 s to ensure participants fully understood the task requirements. A photodiode was used to accurately detect the onset of each visual stimulus, allowing reaction time (RT) to be calculated as the interval between stimulus onset and the participant’s button press. The task consisted of 12 blocks, with 4 blocks assigned to each WM level (0-, 1-, and 2-back). Blocks were presented in a randomized order to minimize order effects. Each block contained 20 stimuli, resulting in a total of 240 stimuli across the entire task. The number of target trials varied slightly between conditions, with 25 targets in the 0-back condition, 23 in the 1-back condition, and 28 in the 2-back condition. Nontarget (distractor) trials were also included to ensure a comprehensive evaluation of task performance. Trials containing incorrect button responses were excluded from subsequent analyses to maintain data integrity. 
This structured design enabled a robust evaluation of mental workload across increasing levels of cognitive demand, facilitating accurate detection and analysis of workload-related neural activity using MEG data.
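The target rules of the three conditions can be made concrete with a short sketch. The function name `nback_targets` and the example letter sequence are illustrative assumptions and are not taken from the study's stimulus files; only the target rules and the block arithmetic follow the description above.

```python
from typing import List


def nback_targets(letters: List[str], n: int) -> List[bool]:
    """Mark target trials in one block of an n-back task.

    n == 0: a trial is a target whenever the letter is "X".
    n >= 1: a trial is a target when its letter equals the one shown n trials earlier.
    """
    if n == 0:
        return [letter == "X" for letter in letters]
    # The first n trials have no stimulus n steps back, so they cannot be targets.
    return [i >= n and letters[i] == letters[i - n] for i in range(len(letters))]


# Hypothetical letter sequence, used only to illustrate the rules.
block = ["A", "X", "X", "B", "X", "B", "C"]
targets_0back = nback_targets(block, 0)
targets_1back = nback_targets(block, 1)
targets_2back = nback_targets(block, 2)

# Design arithmetic stated in the text: 12 blocks of 20 stimuli, 4 blocks per level.
BLOCKS, STIMULI_PER_BLOCK, LEVELS = 12, 20, 3
total_stimuli = BLOCKS * STIMULI_PER_BLOCK
stimuli_per_level = (BLOCKS // LEVELS) * STIMULI_PER_BLOCK
```

With this sequence, the same letters yield different targets under each rule, which is what allows the paradigm to vary memory load while keeping the visual input comparable.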
Layer Specifications with Parameters, Outputs, and Activations for the EEGNet-Based Model
Methodology
In this study, we employed the EEGNet model in combination with a fine-tuning strategy to classify mental workload from MEG signals. To establish comparative baselines, we also implemented traditional machine learning approaches, including SVMs, 42 RF, 43 and logistic regression (LR). 44 These classical models served as reference benchmarks for evaluating the performance and effectiveness of the proposed deep learning-based framework.
MEG data preprocessing and parcellation
First, the temporal extension of the signal space separation algorithm (MaxFilter™; Elekta Oy, Helsinki, Finland, version 2.2, default parameters) was applied to remove external magnetic interference and correct for head movements. 45 Subsequent preprocessing was conducted using the Oxford Software Library (OSL) pipeline, which is built upon FSL, SPM12 (Wellcome Trust Centre for Neuroimaging, University College London), and FieldTrip. 46 The initial step in this pipeline involved applying a finite impulse response antialiasing low-pass filter, after which the data were downsampled to 250 Hz. Using OSL’s RHINO algorithm, the MEG data were automatically registered with each subject’s T1-weighted anatomical image. This procedure aligned the digitized head shape points to the scalp surface, which was extracted using FSL’s BETSURF and FLIRT tools,47,48 and subsequently transformed into a common MNI152 space. 49 The data were further band-pass filtered between 0.1 and 70 Hz. To suppress power-line noise, a fifth-order Butterworth notch filter was applied between 49.5 and 50.5 Hz. Artifact removal was performed using a semiautomated independent component analysis approach, in which ocular and cardiac components were visually identified and removed based on correlations with simultaneously recorded EOG and ECG signals, respectively. Components exhibiting a correlation coefficient >0.90 with EOG or ECG channels were flagged as candidate artifacts. All candidate components were subsequently visually inspected to confirm their artifactual nature before removal, ensuring that neural components were not inadvertently rejected. This combined threshold-based and visual inspection procedure was adopted to balance objectivity with expert validation and to minimize both false positives and false negatives in artifact rejection. 
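The threshold-based part of the artifact screening can be sketched as follows. This is a minimal illustration, not the OSL implementation: the function names and toy signals are hypothetical, and taking the absolute value of the correlation is an assumption made here because the sign of an independent component is arbitrary. In the study, flagged components were additionally confirmed by visual inspection before removal.

```python
import math
from typing import Dict, List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length time series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def flag_candidate_artifacts(
    components: List[Sequence[float]],
    references: Dict[str, Sequence[float]],
    threshold: float = 0.90,
) -> List[int]:
    """Return indices of components whose |r| with any reference exceeds the threshold.

    Assumption: the absolute correlation is used, since ICA component sign is arbitrary.
    """
    flagged = []
    for index, component in enumerate(components):
        if any(abs(pearson_r(component, ref)) > threshold for ref in references.values()):
            flagged.append(index)
    return flagged


# Toy signals (hypothetical): one ocular-like, one neural-like, one cardiac-like component.
eog = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
ecg = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
components = [
    [2.0 * v for v in eog],                       # scaled copy of the EOG trace
    [0.3, -0.2, 0.5, 0.1, -0.4, 0.2, 0.0, -0.1],  # weakly related to both references
    [-v for v in ecg],                            # sign-flipped copy of the ECG trace
]
candidates = flag_candidate_artifacts(components, {"EOG": eog, "ECG": ecg})
```

The returned indices are only candidates; a reviewer would still inspect each one, as described above, before discarding it.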
Following artifact correction, a linearly constrained minimum variance beamformer was employed to project the MEG data into source space.46,50,51 The source-reconstructed data were then segmented using a parcellation atlas comprising 42 cortical parcels. 52 For each parcel, the first principal component of the corresponding time series was extracted and used as the representative signal for that region.
EEGNet
EEGNet 27 is a deep learning architecture specifically designed for processing neurophysiological data and achieves robust performance by leveraging depthwise and separable convolutions. This compact model offers several advantages, including efficient spatial filtering, effective filter-bank construction, and a substantial reduction in the number of trainable parameters compared with conventional deep learning architectures. These properties make EEGNet particularly well suited for applications involving relatively small datasets, as it enables effective learning while extracting interpretable features from neurophysiological signals. The EEGNet architecture comprises three distinct blocks, each responsible for a specific stage of feature extraction and classification:
Block 1
This block consists of two sequential convolutional layers. The first layer applies F1 2D convolutional filters along the time (horizontal) axis to generate feature maps corresponding to different band-pass frequencies. This is followed by depthwise convolutions of size (C, 1), where C is the number of input channels, which learn spatial filters specific to each temporal feature map. These convolutions are not fully connected to all preceding feature maps, thereby reducing the number of trainable parameters. Batch normalization 53 is applied to stabilize the learning process, followed by the ReLU activation function. 54 Dropout 55 is then used to mitigate overfitting, and average pooling reduces the dimensions of the feature maps.
Block 2
This block incorporates a separable convolution layer composed of a depthwise convolution followed by F2 pointwise (1 × 1) convolutions. The depthwise convolution applies an individual kernel to each input feature map, producing separate feature maps, while the pointwise convolutions learn how to combine these maps. This design further reduces the number of trainable parameters while preserving classification performance.
Block 3 (classification block)
The final block processes the extracted features through a dense layer followed by a softmax activation function with three output units, corresponding to the three mental workload classes (0-, 1-, and 2-back).
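The parameter savings attributed to depthwise and separable convolutions can be checked with simple arithmetic. The sketch below counts weights only (biases are ignored), and the hyperparameters F1 = 8, depth multiplier D = 2, F2 = 16, and a kernel length of 16 are assumed illustrative values rather than the settings of this study; C = 42 follows the number of cortical parcels described earlier.

```python
def standard_conv_weights(in_maps: int, out_maps: int, kernel_len: int) -> int:
    """Weights of an ordinary convolution: every output map sees every input map."""
    return in_maps * out_maps * kernel_len


def separable_conv_weights(in_maps: int, out_maps: int, kernel_len: int) -> int:
    """Weights of a depthwise convolution followed by 1 x 1 pointwise convolutions."""
    depthwise = in_maps * kernel_len  # one temporal kernel per input map
    pointwise = in_maps * out_maps    # 1 x 1 mixing across maps
    return depthwise + pointwise


def depthwise_spatial_weights(f1: int, depth_multiplier: int, n_channels: int) -> int:
    """Weights of the (C, 1) depthwise spatial filters in Block 1."""
    return f1 * depth_multiplier * n_channels


# Assumed illustrative hyperparameters (not confirmed by the text).
F1, D, F2, KERNEL_LEN, C = 8, 2, 16, 16, 42

block1_depthwise = depthwise_spatial_weights(F1, D, C)
block1_fully_connected = standard_conv_weights(F1, F1 * D, C)
block2_separable = separable_conv_weights(F1 * D, F2, KERNEL_LEN)
block2_standard = standard_conv_weights(F1 * D, F2, KERNEL_LEN)
```

Under these assumed values, both restricted layers use one-eighth of the weights of their fully connected counterparts, which illustrates why the architecture suits relatively small datasets.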
The detailed layer specifications and overall EEGNet architecture used in this study are presented in Table 2 and illustrated in Figure 3. To address intersubject variability, we adopted a subject-specific adaptation setting in which a pretrained EEGNet model was fine-tuned using a small subset of data from the target subject. Model performance was evaluated on the remaining held-out portion of the subject’s data that were not used during fine-tuning. This protocol reflects a transductive/online adaptation scenario rather than a conventional inductive learning setting and is intended to assess the potential benefit of limited subject-specific calibration. The details of EEGNet fine-tuning are described in Figure 4.

Illustration of the visual–verbal n-back paradigm. Each intertrial interval lasted 2.8 s, with stimuli displayed on the screen for 1 s.

Overall structure of EEGNet. The effect of the first two layers is decomposed into temporal and spatial filters. Separable depthwise convolution is also illustrated.

EEGNet-based fine-tuning approaches. First approach (scenario 1) freezes the earlier layers and SeparableConv2D layers during fine-tuning, updating only the fully connected layer. The second approach (scenario 2) expands fine-tuning to include both SeparableConv2D and fully connected layers while freezing earlier layers. Both scenarios involve a two-step process with initial full-network training, followed by fine-tuning using a subset of test data (either 25% or 50%) to adapt to the classification task.
In both scenarios, two test data proportions were used for fine-tuning: 25% of the test data in the first condition and 50% in the second condition. The overall fine-tuning strategy is summarized in Figure 4. To account for potential differences in class-specific learning dynamics, class weights were assigned during training: C0 = 1.0, C1 = 1.16, and C2 = 0.875. The 1-back class (C1) was given the highest weight, as it represents an intermediate mental workload between the simpler 0-back and more demanding 2-back conditions, thereby serving as a critical indicator of model sensitivity to workload variation. These class weights were empirically determined through an iterative trial-and-error process to optimize performance across all classes. All EEGNet-based experiments were conducted on NVIDIA TITAN RTX GPUs using TensorFlow and the Keras API. 56
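The two fine-tuning scenarios and the 25%/50% split of the held-out subject's data can be summarized in a framework-independent sketch. The layer names, the helper functions, and the random shuffling of trials are assumptions made for illustration; the text does not state how the fine-tuning subset was selected, and the actual layer list is given in Table 2.

```python
import random
from typing import Dict, Iterable, List, Tuple

# Hypothetical layer names standing in for the EEGNet blocks described above.
LAYERS = ["temporal_conv", "depthwise_conv", "separable_conv", "dense"]

# Class weights reported in the text for the 0-, 1-, and 2-back classes.
CLASS_WEIGHTS = {0: 1.0, 1: 1.16, 2: 0.875}


def trainable_flags(scenario: int) -> Dict[str, bool]:
    """Scenario 1 updates only the dense layer; scenario 2 also updates the separable conv."""
    if scenario == 1:
        unfrozen = {"dense"}
    elif scenario == 2:
        unfrozen = {"separable_conv", "dense"}
    else:
        raise ValueError("scenario must be 1 or 2")
    return {name: name in unfrozen for name in LAYERS}


def split_for_finetuning(
    trial_ids: Iterable[int], fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split one held-out subject's trials into fine-tuning and evaluation subsets.

    Assumption: trials are shuffled at random; a chronological or stratified
    split would be an equally plausible reading of the protocol.
    """
    ids = list(trial_ids)
    random.Random(seed).shuffle(ids)
    n_tune = int(round(fraction * len(ids)))
    return ids[:n_tune], ids[n_tune:]


flags_scenario_2 = trainable_flags(2)
tune_ids, eval_ids = split_for_finetuning(range(40), fraction=0.25)
```

In Keras, flags of this kind would typically be applied through each layer's `trainable` attribute before recompiling, and the class weights passed through the `class_weight` argument of `fit`. The key property of the protocol is that the evaluation trials never overlap with the trials used for fine-tuning.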
Classical machine learning
In addition to EEGNet, baseline machine learning models were implemented to establish performance benchmarks for mental workload classification. These included SVMs, RF, and LR. Each model was evaluated across all workload classes (C0, C1, and C2) for all subjects, with classification accuracy used as the primary performance metric.
The comparative analysis of these models enabled an objective assessment of the relative advantages offered by the deep learning approach over traditional machine learning techniques. All classical models were implemented and evaluated on a CPU-based system equipped with a 12th Generation Intel® Core™ i7-12700H processor, using Python version 3.9.12.
Behavioral analysis
Accuracy (proportion correct) and RT were summarized for each participant across the three WM conditions (0-, 1-, and 2-back). Accuracy was calculated as the proportion of correct responses across all trials within each condition, while RT was computed using only correctly performed trials and summarized across participants as mean ± standard deviation (SD).
Between-group differences between HC and individuals with MS were evaluated separately for each workload condition using Welch’s two-sample t-tests (two-tailed, unequal variances; α = 0.05). For each comparison, the t-statistics, degrees of freedom, and corresponding p-values were reported. More comprehensive behavioral analyses of this dataset have been previously described by Costers et al. 57
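For reference, the Welch statistic and its Welch–Satterthwaite degrees of freedom can be computed as in the sketch below. The reaction-time values are hypothetical and serve only to exercise the formula; the two-tailed p-value would then be read from a t-distribution with the returned degrees of freedom (for example, via a statistics library), which is omitted here.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's two-sample t-statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Squared standard errors of each group mean (sample variance / group size).
    se2_a = variance(a) / n_a
    se2_b = variance(b) / n_b
    t_stat = (mean(a) - mean(b)) / (se2_a + se2_b) ** 0.5
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical reaction times in seconds (not data from this study).
rt_hc = [0.50, 0.55, 0.60, 0.65]
rt_ms = [0.60, 0.70, 0.80, 0.90, 1.00]
t_stat, dof = welch_t(rt_hc, rt_ms)
```

Because the group variances are not pooled, the degrees of freedom are generally non-integer and smaller than n_a + n_b − 2, which makes the test more conservative when group sizes and variances differ, as with 36 HCs and 78 pwMS.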
Results
Classification performance
In this study, EEGNet was implemented using fine-tuning strategies in which network weights were updated under two distinct conditions to optimize classification performance. Model efficacy was evaluated using leave-one-subject-out cross-validation, a rigorous validation framework in which data from one subject were iteratively held out as the test set while the model was trained on data from all remaining subjects. This procedure was repeated until each subject had served once as the test set, ensuring a robust assessment that accounts for intersubject variability and enhances generalizability. To comprehensively evaluate classification performance, multiple evaluation metrics were employed, including accuracy, F1-score, and the area under the receiver operating characteristic curve (AUC). Accuracy quantifies the proportion of correctly classified samples across all classes, while the F1-score provides a balanced measure of precision and recall, making it particularly informative in the presence of class imbalance. Although the dataset was relatively balanced, the F1-score remains valuable as it captures potential disparities in model performance across workload classes. AUC assesses the model’s ability to discriminate between classes across all possible decision thresholds, offering a nuanced measure of overall classification capability. Given the balanced dataset, AUC serves as a reliable indicator of global classification effectiveness.
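The validation scheme and two of the metrics can be sketched in plain Python. The function names and toy labels are illustrative assumptions, and macro averaging of the F1-score is also an assumption, since the text does not specify the averaging scheme.

```python
from typing import Iterator, List, Sequence, Tuple


def loso_splits(subject_ids: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training subjects, held-out subject) pairs, one per unique subject."""
    subjects = sorted(set(subject_ids))
    for held_out in subjects:
        yield [s for s in subjects if s != held_out], held_out


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Proportion of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], classes=(0, 1, 2)) -> float:
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: three subjects and six trials with hypothetical labels.
splits = list(loso_splits(["s1", "s2", "s3", "s1"]))
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

For a three-class problem with roughly balanced classes, chance-level accuracy is about 33.3%, which is the natural reference point for the accuracies reported below.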
Table 3 presents a detailed comparison of the mean accuracy, F1-score, and AUC values (mean ± SD) for EEGNet and the three classical models (SVM, LR, and RF). EEGNet was configured according to Scenario 2 and utilized 50% of the data for fine-tuning, a choice justified in the “Fine-tuning comparison” section. The results demonstrate that EEGNet consistently outperformed the classical models across both the HC and MS groups. For healthy participants, EEGNet achieved an average classification accuracy of 51.68%, exceeding the performance of SVM (41.12%), LR (48.22%), and RF (41.64%). Similarly, in the MS group, EEGNet attained an accuracy of 51.77%, compared with 40.33% for SVM, 44.56% for LR, and 41.76% for RF. Furthermore, EEGNet achieved an F1-score of 0.51 ± 0.11 and an AUC of 0.70 ± 0.08. These findings highlight EEGNet’s robustness and its superior ability to maintain balanced and reliable classification performance, as reflected by consistently higher accuracy, F1-score, and AUC values across both healthy and MS populations.
Performance Metrics for the Healthy and MS Groups (Value ± SD)
Bold values indicate the best performance for each model.
ACC, accuracy; AUC, area under the receiver operating characteristic curve; LR, logistic regression; RF, random forest; SD, standard deviation; SVM, support vector machine.
Statistical analysis (machine learning models)
To evaluate the statistical significance of the observed performance differences, paired t-tests were conducted comparing the classification accuracy of EEGNet with that of each of the three classical models (SVM, LR, and RF). The resulting p-values, presented in Table 4, show that for most workload classes, as well as for overall accuracy, the differences between EEGNet and the classical models were statistically significant (p < 0.05). These findings indicate that the accuracy gains achieved by EEGNet over traditional machine learning approaches are unlikely to be attributable to random variation, supporting the effectiveness of EEGNet for MEG-based mental workload classification.
Statistical Significance of EEGNet Compared with Other Methods
Fine-tuning comparison
The two fine-tuning scenarios and the two fine-tuning data proportions (25% and 50%) are compared in Table 5 for the HC group and in Table 6 for the MS group, in terms of the reported performance metrics.
Performance Metrics for the Healthy Group in Two Scenarios (Value ± SD)
Performance Metrics for the MS Group in Two Scenarios (Value ± SD)
Across all configurations, EEGNet consistently outperformed the traditional classification models, including SVM, LR, and RF, with particularly notable improvements observed under Scenario 2. In this extended fine-tuning configuration, EEGNet achieved superior performance across all evaluation metrics, underscoring its robustness and effectiveness for both the HC and MS populations.
The fine-tuning process was conducted under two conditions: (1) reupdating the trainable weights using 25% of the test data and (2) reupdating the trainable weights using 50% of the test data. The results demonstrated that EEGNet consistently achieved higher classification performance under the 50% fine-tuning condition for both groups. This improvement was especially pronounced in Scenario 2, where the inclusion of trainable SeparableConv2D layers enabled more refined weight adjustments, thereby enhancing the model’s adaptability and generalization capability.
Furthermore, the observed performance gains under the 50% fine-tuning condition highlight the importance of using a sufficiently large subset of test data during the reupdating phase. This strategy allows the model to more accurately capture the inherent variability and complexity associated with mental workload classification. Notably, within the MS group, the extended fine-tuning configuration using 50% of the test data yielded the highest accuracy, reinforcing EEGNet’s robustness in handling intergroup differences and task-related variability.
Visualization
EEGNet-based visualization
WM tasks in individuals with MS are frequently associated with alterations in key brain frequency bands. In particular, reduced alpha activity has been consistently observed during such tasks, 58 as demonstrated in the topographic maps shown in Figure 5. This figure illustrates the spatial distribution of weights derived from four distinct spatial filters in the EEGNet model, with panel (a) representing MS patients and panel (b) representing HCs. The reduction in alpha power in MS patients is indicative of impaired cognitive function and may reflect compromised neural connectivity.

Topographic maps representing the spatial distribution of EEGNet model weights during mental workload classification for (a) MS patients and (b) HCs.
The delta rhythm also plays a critical role in facilitating long-range communication across brain regions, and its disruption may indicate impaired neural signaling necessary for effective WM processing. 59 Delta activity is predominantly observed over frontal regions, which are implicated in executive functions and WM. Figure 5a highlights decreased delta activity in MS patients compared with healthy subjects (Fig. 5b).
Theta band activity also plays an important role in WM. Prior studies have shown reduced theta power in MS patients during n-back tasks, consistent with altered engagement of regions typically involved in WM: the task-related increase in theta power is attenuated in MS relative to HCs, and lower theta activity has been associated with slower RTs. 57 Consistent with these findings, the topographies illustrate that MS patients show reduced theta activity over frontal and frontocentral regions during WM tasks, compared with the more balanced frontoparietal distribution observed in HCs. Furthermore, alterations in both phase-locked and nonphase-locked MEG activity have been reported in MS, suggesting difficulties in sustaining attentional focus and highlighting broader cognitive challenges associated with the disease. 60
Behavioral-based information visualization
Accuracy and RT were summarized for each workload condition (0-, 1-, and 2-back) and visualized using grouped bar plots. RT is presented as mean ± SD (Fig. 6), while accuracy is shown as mean ± 95% confidence interval (Fig. 7).

Reaction time (HC vs. MS) across 0-, 1-, and 2-back.

Accuracy (HC vs. MS) across 0-, 1-, and 2-back.
A clear performance pattern emerged across groups (Tables 7 and 8). Participants with MS performed significantly worse than HC in the 1-back (Δmean = +0.054, p_Holm = 0.015, d ≈ 0.47) and 2-back conditions (Δmean = +0.109, p_Holm = 0.015, d ≈ 0.52). Both effects were of moderate magnitude, indicating meaningful workload-dependent impairment. In contrast, no significant group difference was observed in the 0-back condition (Δmean = +0.007, p = 0.405), consistent with the minimal WM demands of this vigilance task.
Welch Independent t-Tests (HC vs. MS)—Reaction Time
Welch Independent t-Tests (HC vs. MS)—Accuracy
MS participants were also significantly slower than HC at higher workload levels. Specifically, RT was significantly longer in the MS group for the 1-back (Δmean = −0.044 s, p_Holm = 0.038, d ≈ −0.51) and 2-back conditions (Δmean = −0.060 s, p_Holm = 0.007, d ≈ −0.65). The RT difference for the 0-back condition did not reach statistical significance (Δmean = −0.029 s, p = 0.062).
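The Holm-adjusted p-values (p_Holm) reported above correspond to a step-down correction for multiple comparisons. The sketch below shows the standard Holm–Bonferroni adjustment; the function name and the raw p-values are hypothetical and are not the values underlying Tables 7 and 8.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[index])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons (e.g., 0-, 1-, and 2-back).
raw_p = [0.01, 0.04, 0.03]
adjusted_p = holm_adjust(raw_p)
```

The monotonicity step explains why two comparisons can share the same adjusted value, as seen for the 1-back and 2-back accuracy contrasts.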
These results demonstrate that increasing task complexity acts as a cognitive stressor, with performance differences between HC and MS becoming more pronounced as WM demands rise. While simple vigilance (0-back) yields comparable performance across groups, higher cognitive loads (1- and 2-back) reveal consistent impairments in both accuracy and speed among MS participants. This pattern underscores the sensitivity of multi-load n-back paradigms in detecting WM dysfunction in MS and supports the use of MEG-based decoding as a complementary, time-resolved neural measure of these workload-dependent effects.
Discussion
This study investigated two fine-tuning scenarios for the EEGNet architecture to enhance mental workload classification performance using MEG signals. In the first scenario, only the fully connected layers were trainable, whereas in the second scenario, both the fully connected layers and the depthwise separable convolution layer were trainable. For each scenario, two fine-tuning data proportions (25% and 50%) were evaluated. The results indicate modest improvements in classification performance for the HC group under the second scenario compared with the first, with slight increases in accuracy, F1-score, and AUC. In the MS group, however, performance metrics remained essentially unchanged across the two scenarios. These findings suggest that making the depthwise separable convolution layer trainable provided limited benefit for the MS group while offering some advantage for the HC group, possibly by better capturing spatial and temporal features of the MEG signals.
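The difference between the two fine-tuning scenarios can be expressed as a set of per-block trainability flags. The sketch below is framework-agnostic and purely illustrative: the block names in `EEGNET_BLOCKS` are hypothetical labels for an EEGNet-style network, and the mapping in `SCENARIOS` is one reading of the description above, not the authors' code.

```python
# Hypothetical block names for an EEGNet-style network, in forward order.
EEGNET_BLOCKS = ("temporal_conv", "depthwise_conv", "separable_conv", "classifier")

# Blocks left trainable in each fine-tuning scenario (an interpretation of the text).
SCENARIOS = {
    1: {"classifier"},                    # only the fully connected layers
    2: {"separable_conv", "classifier"},  # plus the depthwise separable convolution
}


def trainable_flags(scenario):
    """Map each block name to True (fine-tuned) or False (frozen)."""
    unfrozen = SCENARIOS[scenario]
    return {name: name in unfrozen for name in EEGNET_BLOCKS}


flags_scenario_1 = trainable_flags(1)
flags_scenario_2 = trainable_flags(2)
```

In a deep learning framework, such flags would typically be applied by setting `requires_grad` on the parameters of each block (PyTorch) or the `trainable` attribute of each layer (Keras) before fine-tuning.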
Our findings also underscore the influence of fine-tuning data volume on model performance. Fine-tuning with 50% of the training data consistently yielded better results than fine-tuning with 25%, particularly in the HC group. This improvement can be attributed to the model’s greater exposure to subject-specific characteristics, which improved its robustness. Notably, distinct differences in classification performance emerged between the HC and MS groups under specific fine-tuning conditions. For example, in the first scenario, the HC group outperformed the MS group when 50% of the training data was used, whereas no significant performance gap was observed with 25% of the data. In the second scenario, the HC group consistently achieved better performance, particularly when fine-tuned with 25% of the training data. This result highlights the potential advantage of a trainable depthwise separable convolution layer for the HC group, as it likely improved the model’s ability to capture relevant spatial and temporal features in the MEG signals.
A limitation of this study is that subject-specific fine-tuning requires access to a small portion of data from the target subject, which differs from a fully independent test setting. Although we evaluated performance on data not used for fine-tuning, this adaptation setup may lead to slightly optimistic performance estimates compared with a setting with no access to target-subject data. Future work will explore validation-based and nested cross-validation strategies to obtain more unbiased performance estimates while preserving the benefits of subject-specific adaptation. Another limitation is generalizability across MEG systems. Our models were trained on data from a single MEG device and sensor layout, and performance may differ with other manufacturers, sensor types, or newer technologies such as optically pumped magnetometers. Differences in sensor geometry and noise characteristics can affect the learned features. Future work will test cross-system robustness and explore harmonization strategies to improve transferability across MEG platforms.
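The subject-specific adaptation setup discussed in the first limitation above can be sketched as a split of the target subject's trials into a fine-tuning portion and a disjoint held-out portion. This is a minimal illustration under stated assumptions: the function name `split_target_subject` is hypothetical, the split is stratified by workload label (the text does not say whether stratification was used), and the proportion is applied to the target subject's trials.

```python
import random
from collections import defaultdict


def split_target_subject(labels, proportion, seed=0):
    """Stratified split of one subject's trial indices into fine-tune and test sets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    finetune, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * proportion)
        finetune.extend(indices[:cut])  # used to adapt the pretrained model
        test.extend(indices[cut:])      # never seen during fine-tuning
    return sorted(finetune), sorted(test)


# 24 illustrative trials, 8 per workload level (0-, 1-, 2-back); not study data.
example_labels = [0, 1, 2] * 8
ft_25, test_25 = split_target_subject(example_labels, 0.25)
ft_50, test_50 = split_target_subject(example_labels, 0.50)
```

Because the held-out trials come from the same subject and session as the fine-tuning trials, accuracy on them can still be optimistic relative to a setting with no target-subject data, which is the concern raised above.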
The performance discrepancies between the HC and MS groups are likely attributable to inherent differences in cognitive processing and WM capacity. Research has consistently shown that individuals with MS often exhibit deficits in cognitive processing speed and WM, resulting from neurological changes such as demyelination and neurodegeneration. For example, studies have highlighted that impairments in IPS and WM are among the most prevalent cognitive deficits in MS, affecting up to 36% of individuals with the condition. 61 These impairments often manifest early in the progression of the disease, further highlighting their link to the underlying pathology of MS. 62 Addressing these disparities may require domain adaptation techniques that mitigate variability in MEG signals recorded from MS patients. Aligning the feature space of MEG signals between the HC and MS groups could improve classification performance for the MS group. Furthermore, personalized fine-tuning strategies tailored to individual subjects, particularly those with MS, represent a promising approach to improving model robustness. These strategies have the potential to narrow intergroup disparities and increase the utility of MEG-based models in clinical applications.
Detecting mental workload, particularly in individuals with MS, poses significant challenges due to the substantial variability in MEG signals across subjects. This variability arises from differences in cognitive abilities, task engagement, physiological states, and the inherently nonstationary nature of MEG signals. In MS patients, these challenges are further amplified by disease-related heterogeneity, adding complexity to the analysis. As a result, models trained on data from multiple subjects often struggle to generalize effectively to unseen subjects. The variability observed in this study highlights the critical need for innovative approaches to overcome these limitations and enhance the robustness of the workload detection models for this population. Future research could focus on developing personalized models or incorporating subject-specific fine-tuning strategies to address the heterogeneity in MEG signals. Additionally, domain generalization techniques could be explored to design models capable of adapting to new, unseen subjects without requiring individual customization. These approaches offer promising avenues for improving classification performance and advancing the application of deep learning in mental workload detection for diverse populations.
Consistent with prior literature, behavioral performance showed relatively preserved accuracy at low workload and increasing difficulty-related impairments in pwMS at higher n-back loads. This pattern supports the interpretation that group differences in MEG-based decoding reflect load-dependent alterations in WM processes rather than general vigilance differences. The behavioral findings therefore provide important context for the neural decoding results as a complementary, time-resolved index of workload-related dysfunction.
Conclusion
This study investigated mental workload detection in both healthy individuals and MS patients using MEG signals. To the best of our knowledge, this is the first study to explore mental workload detection in MS patients. We employed a fine-tuned implementation of the EEGNet deep learning model to assess mental workload during an n-back WM task. EEGNet was chosen for its compatibility with MEG signals, given its success in BCI systems using EEG data. To enhance performance, we applied two fine-tuning strategies within a transfer learning framework.
To validate the effectiveness of the EEGNet model, we conducted extensive experiments. Unlike traditional machine learning models that require separate stages for preprocessing, feature extraction, feature selection, and classification, deep learning approaches offer a unified, end-to-end learning solution. Building on the success of this approach in detecting workload in MS patients through transfer learning, our future work will focus on exploring advanced techniques such as distant domain transfer learning. 63 This technique aims to leverage domain-specific information from individual subjects to improve performance for others, thereby addressing intersubject variability. Additionally, we aim to develop new transfer learning models specifically tailored to workload detection during WM tasks using MEG signals.
Footnotes
Acknowledgments
The authors would like to thank all participants and collaborators who contributed to data collection and system support. In addition, the authors acknowledge the use of ChatGPT (OpenAI) for assistance with language editing and improving clarity. All scientific content, interpretations, and conclusions are solely those of the authors.
Informed Patient Consent
Informed consent was obtained from all participants prior to their participation in the study. The privacy rights of all human subjects were strictly observed, and all data were anonymized to ensure confidentiality.
Ethical Statement
This study was approved by the local ethics committees of the University Hospital Brussels (Commissie Medische Ethiek UZ Brussel; B.U.N. 143201423263, approval November 2015) and the National MS Center Melsbroek (approval February 12, 2015). All experimental procedures involving human participants were conducted in accordance with the ethical standards of the institutional research committees and with the 1964 Declaration of Helsinki and its later amendments.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This study was partially supported by the NSF award (IIS 2045848, United States) and the Brussels-Capital Region—Innoviris (Brussels Public Organization for Research and Innovation, Belgium). G.N. was supported by a personal grant (1805620N) from Fonds Wetenschappelijk Onderzoek–Flanders in Belgium.
