Immersive language acquisition: Quantifying VR-induced neurocognitive enhancements in English phonemic competence

Abstract

Virtual Reality (VR) transforms second language acquisition by immersing learners in interactive, context-rich environments that foster cognitive engagement and enhance phonemic competence. This research explores the neurocognitive enhancements induced by VR-based English phoneme learning through multimodal bio-signal analysis and advanced machine learning (ML) techniques. A total of 450 English learners participated in immersive VR scenarios designed to target challenging English phonemes within authentic conversational tasks. Two types of datasets were collected: one in the form of a CSV file containing EEG signals and eye-tracking data, and the other comprising audio signal data. EEG and eye-tracking data were preprocessed using Z-score normalization to ensure consistency. Audio data were denoised using the Savitzky–Golay filter, which effectively preserves phonetic information while removing environmental noise. The cleaned data were fed into the feature extraction process. For the EEG and eye-tracking data, feature extraction was carried out using Independent Component Analysis (ICA), while Mel-frequency cepstral coefficients (MFCCs) were extracted from the audio data to capture detailed phonetic features essential for phoneme classification. This approach ensures accurate classification of phonemic performance and prediction of neurocognitive load in immersive VR-based phoneme learning. A feature-level fusion technique was employed to integrate the normalized event-log features and audio-based MFCCs into a unified, high-dimensional feature space, enabling comprehensive multimodal analysis. The Manta Ray Foraging Optimized Light Gradient-Boosting Machine (MRFO-LGBM) was introduced to optimize the LGBM model, enabling accurate classification of phonemic performance and prediction of neurocognitive load. The proposed method was implemented using Python 3.10.1. 
Experiments demonstrate that the proposed VR-enhanced cognitive phoneme recognition framework significantly outperforms other models, achieving superior results in terms of accuracy, F1-score, precision, and recall, with all metrics ranging from 95% to 96% in predicting neurocognitive states during immersive language acquisition. This research introduces a novel, scalable VR-based system that integrates bio-signal fusion and intelligent modeling to deliver personalized, measurable improvements in phonemic competence.

Keywords

virtual reality (VR); phonemic competence; neurocognitive enhancements; multimodal bio-signals; manta ray foraging optimization (MRFO); machine learning (ML); feature-level fusion; light gradient-boosting machine (LGBM)

Introduction

Phonemics, an important subfield of linguistics, focuses on identifying and examining phonemes, which are minimal units of sound in a language that distinguish meaning.¹ In English, there are approximately 44 phonemes, which vary slightly across dialects and accents of spoken English, including consonants and vowels. A comprehensive understanding of all phonemes is necessary to achieve intelligible pronunciation, good listening comprehension, and fluent speech production.² As a component of English phonemic competence, the ability to appropriately perceive, produce, and distinguish these units of sound when communicating is a valuable skill for second language learners.³

When learning a second language, acquiring phonemic competence involves perceiving, differentiating, and producing the sounds (phonemes) of the language, which is a crucial step in developing fluency, comprehensibility, and communicative effectiveness.⁴ English phonemes are the smallest sound units that differentiate word meanings in the English language.⁵ These include vowel sounds such as /iː/ in “sheep” and /ɪ/ in “ship,” and consonant sounds such as /θ/ in “think” and /ð/ in “this.”⁶ Phonemes are the building blocks of spoken language, and even minor differences can change meanings and levels of comprehension. Thus, for learners seeking to develop native-like pronunciation and listening skills, accurately perceiving and producing English phonemes is critical.⁷ Table 1 displays the 44 phonemic symbols, each with an example word.

Table 1.

44 Phonemic symbols and an example word.

No.	Symbol	Example	No.	Symbol	Example	No.	Symbol	Example	No.	Symbol	Example
1	iː	Sheep	12	ɒ	On	23	t	Tree	34	z	Zoo
2	ɪ	Ship	13	ɪə	Deer	24	d	Day	35	ʃ	Shout
3	ʊ	Good	14	eɪ	Say	25	ʧ	Chair	36	ʒ	Vision
4	uː	Tooth	15	ʊə	Pure	26	ʤ	June	37	m	Man
5	e	Bed	16	ɔɪ	Boy	27	k	Cat	38	n	Never
6	ə	Her	17	əʊ	Soap	28	g	Goal	39	ŋ	Sing
7	ɜː	Bird	18	eə	Pair	29	f	Photo	40	h	Honey
8	ɔː	Law	19	aɪ	Mine	30	v	Very	41	l	Lake
9	æ	Cat	20	aʊ	Now	31	θ	Think	42	r	Red
10	ʌ	Up	21	p	Park	32	ð	This	43	w	What
11	ɑː	Car	22	b	Bike	33	s	Sorry	44	j	Yes

With its unique ability to create realistic environments, VR has become an increasingly important medium in educational technologies.⁸ VR creates the possibility of transformative experiences in language learning contexts by engaging students in rich sensory input that resembles authentic communication situations.⁹ Unlike the static context of a traditional language classroom, VR immerses users in a living and dynamic environment that allows for practice using contextualized interactions, experiential learning, and embodied cognition.¹⁰ VR could enhance the neurocognitive functions underlying English phonemic competence because of its capacity to heighten sensory engagement and cognitive processing in immersive, interactive language learning environments.¹¹ VR allows for contextually authentic simulations and immediate feedback to assist learners in practicing and developing their phonemic awareness, auditory discrimination, and pronunciation. This research investigates the neurocognitive benefits of VR-based English phoneme learning using sophisticated ML algorithms and multimodal bio-signal analysis. The Manta Ray Foraging Optimized Light Gradient-Boosting Machine (MRFO-LGBM) was introduced to tune the LGBM model, allowing precise phonemic performance classification and neurocognitive load prediction. The main contributions of this work are:

• Designed an engaging VR-based system for the development of English phonemic competence through interactive and contextualized scenarios.

• Applied Z-score normalization on the EEG and eye-tracking samples, and S-G filtering on audio signals, as suitable methods to ensure clean and consistent input.

• Used ICA on the EEG and eye-tracking data to obtain relevant neurocognitive features, and used MFCCs for all audio samples to extract relevant phonetic features.

• Developed a feature-level fusion approach and applied a tuned MRFO-LGBM model, which achieved 96.81% accuracy in classifying phonemic performance and predicting neurocognitive load.

Related works

This section examines a range of immersive and neurocognitive techniques—including electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), virtual reality (VR), augmented reality (AR), and artificial intelligence (AI)—employed to enhance phonemic learning, pronunciation, and language acquisition across diverse learner demographics.

Yang et al.¹² assessed the efficiency of high variability phonetic training (HVPT) for enhancing the phonological decoding skills of Chinese learners. Their study, involving 40 undergraduate students, utilized a modified phonics screening test, a WeChat application, and an fNIRS device to track network connectivity and brain activity. Despite observing inhibition in the learners’ right fusiform gyrus, Yang et al.¹² reported significant improvement in performance on 5-phoneme pseudo words. Their research findings indicated that Mobile Multisensory Visual Phonetic Training (MVPT) successfully assisted learners in segmentally evaluating novel, challenging words using phonetic patterns. LaRocco et al.¹³ developed a phoneme-based imagined speech EEG-based brain–computer interface (BCI) for paralyzed or physically disabled individuals. They identified gamma power (GP) on channels F3 and F7 as the key distinguishing feature among phonemes. LaRocco et al.¹³ suggested that with modifications to classifier algorithms, trial window duration, and feature selection, the technique could be adapted for practical applications. Juyal et al.¹⁴ investigated how language proficiency affects EEG signals during imagined conversation in both native English speakers and non-native Chinese speakers. Analyzing signals from 16 participants using the 14-channel EEG for the Imagined Speech dataset, they employed various connectivity metrics and fusion approaches to increase the reliability of imagined vowel phoneme recognition. Juyal et al.¹⁴ found that combining the imagined phase and rest state correlations with phase slope index (PSI) features yielded the best accuracy for both native and non-native speakers. Ge et al.¹⁵ investigated the effects of integrating the Accent Method Rhythm and Pitch (AMRP) with phonemic contrast on speech production in Mandarin-speaking stroke patients with dysarthria. 
Their results demonstrated that this combination significantly improved the production of targeted words, sentence clarity, and the standard deviation of amplitude. Duraivel et al.¹⁶ addressed communication loss in patients with neurodegenerative disorders by employing high-quality µECoG brain recordings during speech production to re-establish communication. Their approach improved decoding accuracy by 35% by acquiring signals with a 48% higher signal-to-noise ratio and a 57x higher spatial resolution. Duraivel et al.¹⁶ concluded that future neural speech prostheses might achieve high-quality speech decoding using high-density µECoG. Heidlmayr et al.¹⁷ examined phonological deafness among second language (L2) learners, linking it to neuroplastic alterations in the phonological system. Their EEG investigation revealed three event-related potential (ERP) effects that improved with L2 ability: the left frontal auditory N100 component, smaller fronto-central phonological mismatch negativity (PMN) effects, and the semantic N400 effect. Heidlmayr et al.¹⁷ suggested these findings indicate neuroplasticity in the human brain may support the acquisition of hard-wired linguistic features, such as the capacity to recognize phonetic variations in an L2. Tolba et al.¹⁸ presented an interactive AR system combining AI speech recognition with AR to teach phonetics. Their technology enhanced reading skills by incorporating a 3D animated model of speech systems onto a 2D image. Additionally, it utilized a customized AI-based phoneme recognition system to provide real-time pronunciation feedback. A user evaluation with 83 adult participants indicated that the system enhanced comprehension and mastery of specific phonemes, suggesting its potential utility in conventional classroom environments.

Yuan et al.¹⁹ aimed to enhance English language learning for international trade using project-based activities aided by VR and speech recognition technologies. Their approach employed time-delay neural networks (TDNN) to record temporal data in speech within virtual reality environments simulating actual trading situations. Yuan et al.¹⁹ reported that the virtual experiment system lowered costs while enhancing comprehension, with results indicating improved critical thinking, communication, and creative abilities. Xue and Wang²⁰ investigated the incorporation of virtual worlds and wireless sensing microprocessors to improve English listening abilities, focusing on phonetic mastery in junior and senior high school education. Participants explored effective phonics teaching strategies due to the perceived ineffectiveness of traditional methods. Xue and Wang²⁰ also analyzed findings from interviews and walkthrough tests. Xie et al.²¹ examined the effect of immersive virtual reality (IVR) on primary school pupils’ language memory. They divided 77 students into two groups: one using PowerPoint and the other using IVR. While both methods effectively promoted vocabulary retention, PowerPoint proved superior for instantaneous retention. However, IVR demonstrated better long-term retention, indicating that incorporating language learning theory into IVR settings improved vocabulary acquisition. Ultrasound-enhanced pronunciation training techniques increase articulation learning effectiveness by visualizing the articulatory system as biofeedback. Mozaffari and Lee²² addressed associated challenges by developing a novel ultrasound-enhanced multimodal pronunciation training system that fuses AI and ultrasound technologies. This technology enhanced user flexibility, system engagement, and autonomy by enabling users to move their heads freely. 
Bahameish et al.²³ examined the effectiveness of an AR smartphone application in improving vocabulary learning outcomes for children with Autism Spectrum Disorder (ASD). Their study involved nine participants aged 8 to 12 with mild ASD recruited from a local special needs center. Findings revealed significant individual variation in vocabulary learning abilities, with some participants showing notable improvements. Watthanapas et al.²⁴ investigated the effectiveness of a multilingual VR learning evaluation system for teaching grammatical structures to Thai learners. In a 5-week teaching experiment, participants practiced Thai word order for 20 minutes per session. Results indicated that Thai language anxiety negatively impacted learning retention and task value, while Thai language self-efficacy positively correlated with both conditions. Pellas and Christopoulos²⁵ explored the effectiveness of the machinima method, developed using Open Simulator and Scratch4SL, in supporting vocabulary acquisition among students with developmental dyslexia. They presented empirical evidence of its influence on language acquisition, along with standards for machinima construction and insights into its educational implications. Wang and Liu²⁶ examined the usefulness of songs and rhymes as teaching tools for phoneme segmentation and categorization among 11th-grade English as a Foreign Language (EFL) underachievers in Taiwan. Results showed that phoneme segmentation improved somewhat, but not significantly. Their project proposes investigating the potential benefits of song lyrics education for English listening skills.

Recent advancements in immersive and neurocognitive language learning techniques, such as EEG, VR, and AR, have demonstrated significant potential for improving phonemic acquisition and auditory processing. However, several persistent challenges remain evident in the current literature.

(1) Limited Generalizability of EEG Systems: Most EEG-based phoneme recognition systems, while accurate, primarily rely on gamma power (GP) derived from limited scalp regions (e.g., F3 and F7). This reliance limits their generalizability across broader cognitive states and diverse individuals.¹³

(2) Neglect of Real-time Engagement in Imagined Speech: Language learning experiments utilizing imagined speech EEG signals predominantly focus on classification accuracy. They often overlook real-time cognitive engagement and its correlation with phonemic retention within naturalistic learning settings.¹⁴

(3) Lack of Integrated Bio-signal Assessment in Mobile Training: Mobile phonics training platforms emphasize performance improvement metrics but lack integration with real-time multimodal bio-signals (e.g., EEG combined with eye-tracking) necessary to assess the underlying dynamics of phonemic awareness and decoding during learning.¹²

To address these limitations, this study proposes a framework designed to enhance immersive language acquisition by quantifying VR-induced neurocognitive adaptations related to English phonemic competence. The framework enables the integration of multimodal bio-signals, including EEG and eye-tracking data, gathered within immersive VR learning contexts to measure phonemic decoding processes and cognitive engagement. Complementing the predictive ability of the Light Gradient-Boosting Machine (LGBM) classifier, we introduce the Manta Ray Foraging Optimized Light Gradient-Boosting Machine (MRFO-LGBM). This optimized model aims to characterize learners’ neurocognitive adaptations and phonological processing within immersive settings, achieving high accuracy with real-time performance.

Methodology

EEG, eye-tracking, and audio streams were collected. The EEG and eye-tracking data were preprocessed using Z-score normalization, and the audio data were denoised using the S-G filter to preserve phonetic quality. Feature extraction was conducted using ICA for bio-signals and MFCCs for audio streams. A feature-level fusion technique was used to combine the normalized event-log features and MFCCs into a unified high-dimensional feature space. The Manta Ray Foraging Optimized Light Gradient-Boosting Machine (MRFO-LGBM) was proposed to optimize classification. This integrated approach enables the prediction of phonemic performance and neurocognitive load in immersive VR-based language learning environments. Figure 1 shows the overall process of the VR-based learning environment.

Figure 1.

Overall process of phonemic performance and prediction of neurocognitive load.

Dataset

The VR English Phoneme Learning dataset was collected from Kaggle. This dataset includes neurocognitive and behavioral data from 450 participants who used immersive VR environments to improve English phonemic competence. The data include EEG-derived PSD features for cognitive state monitoring, speech-derived MFCC features for phoneme articulation assessment, eye-tracking metrics, GSR and ECG data for physiological arousal and heart rate analysis, and audio file references for individual English phoneme pronunciations.

Source: https://www.kaggle.com/datasets/programmer3/vr-english-phoneme-learning-dataset

Data preprocessing

Data preprocessing involved two key steps: Z-score normalization was applied to EEG and eye-tracking data to standardize signal scales, while the Savitzky–Golay filter was used to denoise audio signals, preserving critical phonetic information.

Z-score normalization

Z-score normalization was used to transform the EEG and eye-tracking features recorded for English learners in immersive phoneme learning scenarios. This step ensured uniform scaling across all data points, eliminating variance caused by differing signal magnitudes and enabling the classifier to treat all features on a comparable scale. The transformation rescales each feature to a mean of zero and a standard deviation of one, which increases the comparability and stability of multimodal features. The Z-score is calculated using equation (1):

y_{i} = \frac{w_{i} - μ}{σ}

(1)

where $w_{i}$ is the original feature value, $μ$ is the mean, $σ$ is the standard deviation, and $y_{i}$ is the normalized output.
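
As an illustration of equation (1), the following minimal sketch standardizes a single feature column using only the Python standard library. The readings are hypothetical values chosen for the example, not data from the study, and the use of the population standard deviation is an assumption, since the text does not say which estimator was applied.

```python
from statistics import fmean, pstdev

def zscore(values):
    """Standardize one feature column: y_i = (w_i - mu) / sigma, as in equation (1)."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant feature has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(w - mu) / sigma for w in values]

# Hypothetical EEG band-power readings for one feature (illustrative values only).
alpha_power = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = zscore(alpha_power)
print(normalized)
```

After the transformation the column has zero mean and unit standard deviation, so features recorded in different units (for example, microvolts and pupil diameter) contribute on the same scale.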

Savitzky–Golay filter (S-G)

The S-G filter is a low-pass filter with a finite impulse response (FIR) structure that generalizes the moving-average filter. It is well suited to smoothing signals to improve the signal-to-noise ratio without greatly distorting the signal. The smoothing is accomplished by a convolution that uses linear least squares to fit a low-degree polynomial to successive sets of neighboring measurements. The S-G filter was used to smooth the audio signal and remove noise, which improves the clarity of the phonemic features needed for accurate classification and neurocognitive load prediction in VR-based phoneme learning.

The S-G filter smooths the signal by fitting a polynomial of degree o to a window of M consecutive data points using least squares. The central point of this fitted polynomial replaces the original data point at the center of the window. This process iterates across the signal. Key parameters are the window size M, controlling the smoothing strength (larger M = more smoothing), and the polynomial order o, affecting the fit’s flexibility. The core polynomial fitting concept is represented by equation (2):

ρ (q_{j}) = d_{0} + d_{1} q_{j} + d_{2} q_{j}^{2} + \dots + d_{o} q_{j}^{o}

(2)

where $q_{j}$ represents the position within the window and the coefficients $d_{k}$ are determined by solving the least squares problem across the window. The first step in designing an S-G filter is to choose the filter length $l$, the polynomial order $o$, the derivative order $m$, and the smoothing window size $M$.

$M$ must be odd and satisfy $M \geq o + 1$. When the S-G filter coefficients are applied to the signal, a polynomial is fitted to the $M = M_{q} + M_{j} + 1$ points of the window, where $M_{j}$ and $M_{q}$ denote the numbers of signal points to the left and right of the current signal point. The polynomial coefficients can be estimated from equation (3):

N d = ρ

(3)

where $N$ can be expressed as equation (4):

N = \begin{bmatrix} 1 & - \frac{l - 1}{2} & {(- \frac{l - 1}{2})}^{2} & \cdots & {(- \frac{l - 1}{2})}^{o} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & - 1 & {(- 1)}^{2} & \cdots & {(- 1)}^{o} \\ 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 1^{2} & \cdots & 1^{o} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & \frac{l - 1}{2} & {(\frac{l - 1}{2})}^{2} & \cdots & {(\frac{l - 1}{2})}^{o} \end{bmatrix}

(4)

Equation (5) denotes the vector of polynomial coefficients $d$:

d = \begin{bmatrix} d_{0} \\ d_{1} \\ d_{2} \\ \vdots \\ d_{o} \end{bmatrix}

(5)

Equation (6) gives the sequence of data values of size $l$, denoted by $ρ$:

ρ = \begin{bmatrix} ρ_{- \frac{l - 1}{2}} \\ \vdots \\ ρ_{0} \\ \vdots \\ ρ_{\frac{l - 1}{2}} \end{bmatrix}

(6)

Equation (7) gives the least squares solution for the vector of polynomial coefficients, where the superscript $S$ denotes the matrix transpose:

d = {(N^{S} N)}^{- 1} N^{S} ρ

(7)

Each polynomial coefficient in $d$ is therefore a linear combination of the data values in $ρ$, weighted by the corresponding row of ${(N^{S} N)}^{- 1} N^{S}$. Because every higher-order term vanishes at the window center, the value of the fitted polynomial at $ρ_{0}$ equals $d_{0}$. For derivative order $0$, the S-G filter coefficients are thus the elements of the row of ${(N^{S} N)}^{- 1} N^{S}$ that yields $d_{0}$, namely its first row.
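
The derivation in equations (3) to (7) can be checked numerically. The sketch below is an illustration rather than the study's implementation: it builds the design matrix of equation (4), solves the normal equations with exact fractions, and returns the smoothing weights for derivative order 0. The function names `solve`, `sg_weights`, and `sg_smooth` are hypothetical, and edge samples are simply left unchanged, which is one of several possible boundary conventions. In practice a library routine such as `scipy.signal.savgol_filter` would normally be used instead.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def sg_weights(window, order):
    """Smoothing weights (derivative order 0) for an odd window length."""
    if window % 2 == 0 or window < order + 1:
        raise ValueError("window must be odd and at least order + 1")
    half = (window - 1) // 2
    # Design matrix N of equation (4): the row for position q is [1, q, ..., q^order].
    N = [[Fraction(q) ** k for k in range(order + 1)] for q in range(-half, half + 1)]
    # Normal matrix N^S N from equation (7).
    NtN = [[sum(row[a] * row[b] for row in N) for b in range(order + 1)]
           for a in range(order + 1)]
    # d_0 is the first row of (N^S N)^{-1} N^S applied to the data,
    # so the weights are N a, where a solves (N^S N) a = e_0.
    e0 = [Fraction(1)] + [Fraction(0)] * order
    a = solve(NtN, e0)
    return [sum(a_k * n_k for a_k, n_k in zip(a, row)) for row in N]

def sg_smooth(signal, window, order):
    """Apply the weights to interior samples; edge samples are left unchanged."""
    weights = sg_weights(window, order)
    half = window // 2
    out = [float(s) for s in signal]
    for i in range(half, len(signal) - half):
        segment = signal[i - half:i + half + 1]
        out[i] = float(sum(w * Fraction(s) for w, s in zip(weights, segment)))
    return out

print(sg_weights(5, 2))  # the classic 5-point quadratic weights (-3, 12, 17, 12, -3) / 35
```

A useful sanity check is that a window of order $o$ reproduces any polynomial of degree up to $o$ exactly, so only components that the polynomial cannot follow (typically high-frequency noise) are attenuated.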

Feature extraction

Feature extraction was performed using ICA on the EEG and eye-tracking data to isolate meaningful neural components, and MFCCs were extracted from audio data to capture essential phonemic features.

Independent Component Analysis (ICA)

EEG and eye-tracking data gathered during immersive VR-based phoneme learning were subjected to ICA to extract features. This method separates neural and ocular signals associated with phonological processing and attentional engagement. ICA identifies the underlying source signals, yielding features that support more accurate classification of neurocognitive states. ICA has been used extensively for blind source separation because it can recover statistically independent components from highly correlated, high-dimensional data. It has also been widely used to monitor complex industrial processes. Because ICA can extract non-Gaussian features, which are thought to be dominant in such processes, it outperforms many conventional techniques.

A process being monitored can be regarded as a linear combination of signals originating from a set of independent sources, or latent variables. Assume that the normal operating conditions of a process produce a data matrix $W \in Q^{c \times m}$ with $c$ process variables and $m$ samples. The ICA decomposition of $W$ is given by equation (8):

W = B T + F

(8)

where $F$ denotes the residual matrix, $B$ the mixing matrix, and $T = [T_{1}^{m}, T_{2}^{m}, T_{3}^{m}, \dots, T_{o}^{m}] \in Q^{o \times m}$ the independent component matrix, which is made up of $o$ ($o < c$) independent components. The fast fixed-point ICA algorithm (FastICA)³⁴ is an iterative technique that estimates both $B$ and $T$ from $W$. First, the whitening transformation in equation (9) is applied to eliminate pairwise dependency between process variables:

Y = C^{- \frac{1}{2}} U^{S} W

(9)

where $Y = [Y_{1}^{m}, Y_{2}^{m}, Y_{3}^{m}, \dots, Y_{o}^{m}] \in Q^{o \times m}$ is the whitened data matrix in the ICA subspace, and $C$ and $U$ are the eigenvalue and eigenvector matrices of the covariance matrix of $W$, so that $F [W W^{S}] = U C U^{S}$, with $F [\cdot]$ denoting expectation and the superscript $S$ the transpose. It follows that $F [Y Y^{S}] = C^{- 1 / 2} U^{S} F [W W^{S}] U C^{- 1 / 2} = I$. The following transformation, equation (10), is then applied to map the whitened data into the ICA feature space:

\hat{T} = X Y

(10)

where $\hat{T} = [{\hat{T}}_{1}^{m}, {\hat{T}}_{2}^{m}, {\hat{T}}_{3}^{m}, \dots, {\hat{T}}_{o}^{m}] \in Q^{o \times m}$ is the estimated independent component matrix and $X = B^{- 1}$ is the normalized demixing matrix. FastICA finds $X$ by maximizing the non-Gaussianity of every component ${\hat{T}}_{j}^{m}$ of $\hat{T}$, which ensures that the components can be separated. Consider a single component ${\hat{T}}_{j}^{m}$, $j \in \{1, 2, 3, \dots, o\}$, given by equation (11):

{\hat{T}}_{j}^{m} = X_{j}^{S} Y_{j}^{m}

(11)

Here, $X_{j}$ denotes the $j^{th}$ column vector. The non-Gaussianity of ${\hat{T}}_{j}^{m}$ is measured by its negative entropy, or negentropy, which can be approximated using equation (12):

I ({\hat{T}}_{j}^{m}) \propto {[F \{H ({\hat{T}}_{j}^{m})\} - F \{H (u)\}]}^{2}

(12)

where $H$ is a non-quadratic function and $u$ is a standardized Gaussian variable. $H$ typically takes one of the two forms given in equation (13):

\begin{cases} H_{1} (v) = \frac{1}{b_{1}} \log \cosh (b_{1} v) \\ H_{2} (v) = - \exp (- \frac{v^{2}}{2}) \end{cases}

(13)

The following objective function, equation (14), is then maximized iteratively to yield $X_{j}$:

X_{j} = \arg \max_{X_{j}} {[F \{H (X_{j}^{S} Y_{j}^{m})\} - F \{H (u)\}]}^{2}

(14)
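
The two ICA steps described above, whitening (equation (9)) and maximizing the negentropy contrast (equations (12) to (14)), can be illustrated on a toy two-channel mixture. The sketch below rests on several assumptions: the sources and mixing matrix are synthetic, $H_{1}$ is used with $b_{1} = 1$, the Gaussian reference term is estimated by Monte Carlo sampling, and a plain grid search over unit vectors replaces FastICA's fixed-point iteration. It is meant to show what the objective measures, not to reproduce the study's pipeline.

```python
import math
import random

def whiten_2d(W):
    """Whiten two channels so that their second-moment matrix is the identity (equation (9))."""
    m = len(W[0])
    centered = []
    for channel in W:
        mean = sum(channel) / m
        centered.append([v - mean for v in channel])
    w0, w1 = centered
    a = sum(v * v for v in w0) / m                  # variance of channel 0
    d = sum(v * v for v in w1) / m                  # variance of channel 1
    b = sum(p * q for p, q in zip(w0, w1)) / m      # covariance
    # Closed-form eigen-decomposition of the symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    lam0 = a * c * c + 2 * b * c * s + d * s * s
    lam1 = a * s * s - 2 * b * c * s + d * c * c
    y0 = [(c * p + s * q) / math.sqrt(lam0) for p, q in zip(w0, w1)]
    y1 = [(-s * p + c * q) / math.sqrt(lam1) for p, q in zip(w0, w1)]
    return [y0, y1]

def contrast(component, gauss_ref):
    """Negentropy proxy of equation (12) with H_1(v) = log cosh(v), b_1 = 1."""
    mean_h = sum(math.log(math.cosh(v)) for v in component) / len(component)
    return (mean_h - gauss_ref) ** 2

def best_direction(Y, gauss_ref, steps=180):
    """Grid search over unit vectors X_j = (cos phi, sin phi) for equation (14)."""
    best_score, best_x = -1.0, None
    for k in range(steps):
        phi = math.pi * k / steps
        x = (math.cos(phi), math.sin(phi))
        component = [x[0] * p + x[1] * q for p, q in zip(Y[0], Y[1])]
        score = contrast(component, gauss_ref)
        if score > best_score:
            best_score, best_x = score, x
    return best_score, best_x

rng = random.Random(0)
m = 2000
t0 = [rng.uniform(-1, 1) for _ in range(m)]                         # sub-Gaussian source
t1 = [rng.expovariate(1) * rng.choice((-1, 1)) for _ in range(m)]   # Laplace-like source
# Synthetic mixing W = B T with a hypothetical mixing matrix B.
W = [[1.0 * p + 0.6 * q for p, q in zip(t0, t1)],
     [0.4 * p + 1.0 * q for p, q in zip(t0, t1)]]
Y = whiten_2d(W)
gauss_ref = sum(math.log(math.cosh(rng.gauss(0, 1))) for _ in range(20000)) / 20000
score, direction = best_direction(Y, gauss_ref)
print(direction, score)
```

Because whitening already removes second-order correlation, the remaining search is restricted to rotations, which is why a single angle suffices in two dimensions.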

Mel-frequency cepstral coefficients (MFCC)

To support accurate classification of phonemic performance and inference of neurocognitive load, MFCCs were extracted as phonetic features from the preprocessed audio signals collected during immersive VR-based English phoneme learning. MFCCs are widely used in phonetic and speech-emotion recognition. They are derived from the Mel filter bank, which models the nonlinear way the human auditory system perceives sound. Because the Mel filter bank reflects how humans perceive speech, it improves the feature representation available to learning models. The relation between the physical frequency of speech and the Mel frequency, which accounts for human auditory perception, is given in equation (15):

e_{Mel} = 2595 \times \log_{10} (1 + \frac{e}{700})

(15)

where $e$ represents the speech frequency in hertz (Hz). The cepstral coefficients computed on the Mel frequency scale are known as MFCCs. The MFCC extraction process involves several stages:

Stage 1: Preprocessing and framing

The audio signal is segmented into overlapping frames. Each frame $w_{j} (m)$, where $j$ denotes the frame index, is treated as a finite discrete sequence for further processing.

Stage 2: FFT

FFT is applied to convert each time-domain frame into the frequency domain:

W (j, l) = \mathrm{FFT} [w_{j} (m)] = \sum_{m = 0}^{M - 1} w_{j} (m) X_{M}^{l m}, \quad l = 0, 1, \dots, M - 1

(16)

where $M$ is the number of sample points in each frame and $X_{M} = e^{- i 2 π / M}$ is the DFT twiddle factor. Equation (17) is then used to obtain the power spectrum:

F (j, l) = {|W (j, l)|}^{2}

(17)

Stage 3: Determine the energy passing through each Mel filter (MF) using equation (18), where $G_{n} (l)$ is the frequency response of the $n^{th}$ Mel filter.

T (j, n) = \sum_{l = 0}^{M - 1} F (j, l) G_{n} (l)

(18)

Stage 4: Compute the MFCCs by applying the discrete cosine transform (DCT), equation (19).

\mathrm{MFCC} (j, m) = \sqrt{\frac{2}{N}} \sum_{n = 0}^{N - 1} \log [T (j, n)] \cos (\frac{π m (n - 0.5)}{N}), \quad m = 1, 2, \dots, K

(19)

where $K$ is typically set to 12, $m$ is the MFCC index, and $N$ is the number of Mel filters. Extracting MFCC features from the denoised audio signal provides a compact and informative representation of phonemic features, which is critical for accurately assessing learners’ pronunciation proficiency and for predicting cognitive engagement during immersive VR phoneme training tasks.
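
The four stages can be sketched for a single frame in plain Python. This is a didactic example under stated assumptions, not the extraction code used in the study: the frame is a synthetic 1 kHz tone, the filter bank uses 20 triangular filters, a direct DFT over the non-negative frequency bins stands in for the FFT, a small floor guards against taking the logarithm of zero, and the DCT uses the common zero-based DCT-II indexing $(n + 0.5)$. All function names are hypothetical.

```python
import cmath
import math

def hz_to_mel(f_hz):
    """Equation (15): Mel value for a frequency in Hz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of equation (15)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Equations (16)-(17): squared DFT magnitude for bins 0..M/2."""
    M = len(frame)
    spectrum = []
    for l in range(M // 2 + 1):
        acc = sum(frame[m] * cmath.exp(-2j * math.pi * l * m / M) for m in range(M))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def mel_filterbank(n_filters, M, sample_rate):
    """Triangular filters G_n(l) with centers equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [f * M / sample_rate for f in edges_hz]  # fractional bin positions
    bank = []
    for n in range(1, n_filters + 1):
        lo, mid, hi = bins[n - 1], bins[n], bins[n + 1]
        row = []
        for l in range(M // 2 + 1):
            if lo < l <= mid:
                row.append((l - lo) / (mid - lo))
            elif mid < l < hi:
                row.append((hi - l) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Equation (18): energy through each filter, floored to avoid log(0).
    energies = [max(sum(p * g for p, g in zip(spectrum, row)), 1e-12) for row in bank]
    # Equation (19): DCT of the log energies (DCT-II indexing, an assumption).
    N = n_filters
    return [
        math.sqrt(2.0 / N)
        * sum(math.log(energies[n]) * math.cos(math.pi * m * (n + 0.5) / N) for n in range(N))
        for m in range(1, n_coeffs + 1)
    ]

sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
coeffs = mfcc(frame, sample_rate)
print(len(coeffs))
```

Applied frame by frame to a denoised recording, the same routine yields one 12-dimensional vector per frame, which is the form in which audio features enter the fusion step.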

Data fusion

Feature-level fusion

This approach forms a single feature vector by combining the features extracted from the various modalities. Its main benefit is that capturing correlations across modalities at an early stage yields higher accuracy. Here, it fuses EEG, eye-tracking, and audio-based phonemic features recorded during immersive VR-supported English learning. The features from the different modalities were converted into the same format before fusion. Feature-level fusion provided the best results for multimodal integration and also reduced processing time relative to previously used approaches. Consequently, it enables real-time assessment of phonemic decoding and of students’ neurocognitive responses for improved language acquisition.
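
In its simplest form, feature-level fusion is a per-sample concatenation of the modality-specific feature vectors. The sketch below assumes that each modality has already been normalized and aligned so that row $i$ describes the same learner or trial in every modality; the feature values and the helper name `fuse_features` are hypothetical.

```python
def fuse_features(*modalities):
    """Concatenate per-sample feature vectors from several modalities."""
    n_samples = len(modalities[0])
    if any(len(m) != n_samples for m in modalities):
        raise ValueError("all modalities must describe the same samples")
    return [[value for m in modalities for value in m[i]] for i in range(n_samples)]

# Hypothetical, already-normalized features for two learners.
eeg_ica = [[0.12, -0.40, 0.93], [-0.51, 0.27, 0.08]]
eye_tracking = [[0.30, -1.10], [0.75, 0.02]]
audio_mfcc = [[1.5, -0.2, 0.4, 0.0], [1.1, 0.3, -0.6, 0.2]]

fused = fuse_features(eeg_ica, eye_tracking, audio_mfcc)
print(len(fused), len(fused[0]))  # 2 samples, 3 + 2 + 4 = 9 features each
```

Concatenating before classification lets the model learn interactions across modalities, at the cost of a higher-dimensional input.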

Manta ray foraging optimized light gradient-boosting machine (MRFO-LGBM)

The MRFO-LGBM is a hybrid predictive model used to improve both the accuracy and speed in the classification of phonemic performance and the prediction of neurocognitive load. The MRFO algorithm is a bio-inspired optimization method that performs well for both exploration and exploitation of the search space to identify an optimal solution. The study employed the MRFO algorithm to optimize the hyperparameters for phonemic classification accuracy and neurocognitive prediction. The GBM procedure sequentially executes ensemble learning. When sequentially building models, GBM minimizes the prediction error through gradient descent. This procedure combines weak learners, usually decision trees, to build a strong prediction model. LGBM is a very effective and scalable implementation of GBM that uses histogram-based algorithms and leaf-wise growth for trees. LGBM was used to classify phonemic performance and neurocognitive load from combined multimodal datasets. The hybrid MRFO-LGBM model leverages MRFO’s optimization and LGBM’s predictive capabilities to build a precise, flexible learning framework. The MRFO determines optimal hyperparameters for LGBM, improving overall model performance across EEG, eye‐tracking, and audio-derived features. The hybrid approach allows for precise classification of learner responses and performance in immersive VR experiences, mitigating the drawbacks of traditional phoneme training activities. Algorithm 1 shows the pseudocode for MRFO-LGBM.
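
To make the optimization loop concrete, the following is a simplified sketch of MRFO with its three foraging behaviors (chain, cyclone, and somersault) searching a two-dimensional hyperparameter space. Several parts are assumptions made for illustration: the objective `surrogate_validation_error` is a hypothetical stand-in for the cross-validated error of an LGBM model, the two tuned hyperparameters (learning rate and number of leaves) and their bounds are chosen arbitrarily, and the population size and iteration count are not taken from the study.

```python
import math
import random

def mrfo_minimize(objective, lower, upper, pop_size=12, iterations=40, seed=0):
    """Simplified Manta Ray Foraging Optimization over a box-bounded search space."""
    rng = random.Random(seed)
    dim = len(lower)

    def clip(x):
        return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]

    def random_point():
        return [lo + rng.random() * (hi - lo) for lo, hi in zip(lower, upper)]

    pop = [random_point() for _ in range(pop_size)]
    best, best_fit = None, math.inf

    def refresh_best():
        nonlocal best, best_fit
        for x in pop:
            f = objective(x)
            if f < best_fit:
                best, best_fit = x[:], f

    refresh_best()
    history = [best_fit]
    for t in range(1, iterations + 1):
        old = [x[:] for x in pop]
        for i in range(pop_size):
            r = 1.0 - rng.random()  # in (0, 1], so log(r) is finite
            if rng.random() < 0.5:
                # Cyclone foraging: spiral around the best point or, early on, a random one.
                r1 = rng.random()
                beta = 2 * math.exp(r1 * (iterations - t + 1) / iterations) * math.sin(2 * math.pi * r1)
                ref = random_point() if t / iterations < rng.random() else best
                ahead = old[i - 1] if i > 0 else ref
                cand = [ref[d] + r * (ahead[d] - old[i][d]) + beta * (ref[d] - old[i][d])
                        for d in range(dim)]
            else:
                # Chain foraging: follow the individual ahead and the best point.
                alpha = 2 * r * math.sqrt(abs(math.log(r)))
                ahead = old[i - 1] if i > 0 else best
                cand = [old[i][d] + r * (ahead[d] - old[i][d]) + alpha * (best[d] - old[i][d])
                        for d in range(dim)]
            pop[i] = clip(cand)
        refresh_best()
        # Somersault foraging: flip around the best point with factor S = 2.
        for i in range(pop_size):
            r2, r3 = rng.random(), rng.random()
            pop[i] = clip([pop[i][d] + 2.0 * (r2 * best[d] - r3 * pop[i][d]) for d in range(dim)])
        refresh_best()
        history.append(best_fit)
    return best, best_fit, history

def surrogate_validation_error(params):
    """Hypothetical stand-in for the cross-validated error of an LGBM model."""
    learning_rate, num_leaves = params
    return (math.log10(learning_rate) + 1.0) ** 2 + ((num_leaves - 31.0) / 50.0) ** 2

LOWER, UPPER = [0.01, 8.0], [0.3, 128.0]
best, best_fit, history = mrfo_minimize(surrogate_validation_error, LOWER, UPPER)
print(best, best_fit)
```

In a real pipeline the surrogate would be replaced by a function that trains the classifier with the candidate hyperparameters on the fused features and returns its validation error, so each objective evaluation is far more expensive than here.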

Gradient-boosting machine (GBM)

In the domain of immersive language learning, the GBM algorithm was deployed to improve the classification of phonemic performance while also predicting neurocognitive load from VR learning environments. Given a training set $T = \{(w_{j}, z_{j})\}_{j = 1}^{m}$, where $w_{j}$ represents multimodal input features (e.g., acoustic-phonetic signals, cognitive biometrics, and VR engagement metrics) and $z_{j}$ denotes the corresponding phoneme labels or cognitive load levels, GBM seeks an estimate $\hat{e} (w)$ of the true mapping $e^{*} (w)$ by minimizing the expected loss $K (z, e (w))$. The GBM algorithm builds a strong learner incrementally from several weak base learners to model the complex relationships between linguistic performance and neurocognitive responses. The ensemble model is updated in each iteration according to equation (20):

$$e_{s}(w) = e_{s - 1}(w) + ρ_{s} g_{s}(w)$$

(20)

where $s = 1, \dots, S$ indexes the boosting iterations, $g_{s}$ is the $s^{th}$ base learner, and $ρ_{s}$ is the corresponding weight. The initial model $e_{0}(w)$ is defined by a constant function:

$$e_{0}(w) = \arg \min_{α} \sum_{j = 1}^{m} K(z_{j}, α)$$

(21)

where $K(z_{j}, α)$ is a differentiable loss function. Each base learner $g_{s}$ and its weight $ρ_{s}$ are obtained by solving:

$$(ρ_{s}, g_{s}(w)) = \arg \min_{ρ, g} \sum_{j = 1}^{m} K(z_{j}, e_{s - 1}(w_{j}) + ρ g(w_{j}))$$

(22)

This iterative process captures the remaining phonemic classification errors and changing cognitive response patterns, allowing the model to learn difficult phonemic distinctions and context-sensitive neurocognitive loads. In practice, $g_{s}$ is fitted to the pseudo-residuals, that is, the negatives of the loss gradients $q_{sj}$ evaluated at the previous model, as in equation (23):

$$q_{sj} = \left[\frac{\partial K(z_{j}, e(w_{j}))}{\partial e(w_{j})}\right]_{e(w) = e_{s - 1}(w)}$$

(23)

In VR contexts where learners must decode sounds under various degrees of sensory and cognitive stimulation, this method allows for fine-grained phoneme recognition.
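The additive update in equations (20) to (23) can be illustrated with a minimal sketch. It assumes a squared loss, one-dimensional inputs, and one-split regression stumps as weak learners, and it replaces the line search of equation (22) with a fixed weight `rho`; the function names `fit_stump` and `fit_gbm` are hypothetical and are not taken from the study.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression stump g(w) to the pseudo-residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((r - left_value) ** 2 for r in left) + sum(
            (r - right_value) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def fit_gbm(xs, zs, n_rounds=20, rho=0.5):
    """Additive model e_s = e_{s-1} + rho * g_s for a squared loss."""
    e0 = mean(zs)  # eq. (21): the constant that minimizes squared loss
    learners = []

    def predict(x):
        return e0 + sum(rho * g(x) for g in learners)

    for _ in range(n_rounds):
        # For K = 0.5 * (z - e)^2 the negative gradient is the residual z - e.
        residuals = [z - predict(x) for x, z in zip(xs, zs)]
        learners.append(fit_stump(xs, residuals))  # eq. (20) update
    return predict


# Toy data (assumed): a single step between two levels.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
zs = [1, 1, 1, 1, 3, 3, 3, 3]
model = fit_gbm(xs, zs)
print(model(0), model(7))
```

With a fixed weight below one, each round removes only part of the remaining residual, so the fit approaches the targets gradually rather than in a single step.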

Light gradient-boosting machine (LGBM)

To provide an accurate classification of phonemic performance and prediction of neurocognitive load in immersive VR-based language learning spaces, this research utilized LGBM as the primary classifier due to its scalability, performance, and capacity to model complex interactions in high-dimensional data.

LGBM is a gradient-boosting framework built on decision trees. It employs a leaf-wise growth method with a depth limit and speeds up training by using histogram-based algorithms. The histogram approach discretizes continuous feature values into $k$ bins, which lowers memory usage and speeds up split finding. Because the binned split points are coarser, they also act as a mild form of regularization that helps limit overfitting. Leaf-wise growth always splits the leaf with the maximum gain, which reduces the error more efficiently than level-wise growth for the same number of leaves. However, because leaf-wise growth can produce deep trees that overfit, LGBM adds a maximum depth limit. For a given training dataset $W = \{(w_{j}, z_{j})\}_{j = 1}^{n}$, LGBM searches for an estimate $\hat{e}(w)$ of the function $e^{*}(w)$ that minimizes the expected value, denoted $F$, of a given loss function $K(z, e(w))$:

$$\hat{e}(w) = \arg \min_{e} F \, K(z, e(w))$$

(24)
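The histogram idea described above can be sketched in a few lines. This is an illustrative simplification that assumes equal-width bins; LightGBM's own bin construction is more elaborate, and the helper name `build_histogram` is hypothetical.

```python
def build_histogram(values, gradients, k):
    """Discretize a continuous feature into k equal-width bins and
    accumulate the per-bin gradient sum and sample count."""
    low, high = min(values), max(values)
    width = (high - low) / k or 1.0  # guard against a constant feature
    grad_sum = [0.0] * k
    count = [0] * k
    for value, grad in zip(values, gradients):
        index = min(int((value - low) / width), k - 1)  # clamp the max value
        grad_sum[index] += grad
        count[index] += 1
    return grad_sum, count


# Toy feature values and gradients (assumed).
grad_sum, count = build_histogram(
    [0.0, 0.1, 0.5, 0.9, 1.0], [1.0, 2.0, 3.0, 4.0, 5.0], k=2
)
print(grad_sum, count)
```

Once the histogram is built, candidate splits only need to be evaluated at the $k - 1$ bin boundaries instead of at every distinct feature value.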

LGBM approximates the final model by summing $S$ regression trees:

$$e_{S}(W) = \sum_{s = 1}^{S} e_{s}(W)$$

(25)

Each regression tree can be expressed as $ω_{r(w)}$, $r \in \{1, 2, \dots, M\}$, where $ω$ is the vector of leaf weights, $r$ is the tree’s decision rule, and $M$ is the number of leaves in the tree. The framework is trained additively at step $s$ in the manner described in equation (26):

$$Γ_{s} = \sum_{j = 1}^{n} K(z_{j}, E_{s - 1}(w_{j}) + e_{s}(w_{j}))$$

(26)

where $E_{s - 1}$ denotes the ensemble of the first $s - 1$ trees.

Newton’s method provides a quick approximation of the objective function. Dropping the constant term simplifies equation (26) to:

$$Γ_{s} \cong \sum_{j = 1}^{n} \left(h_{j} e_{s}(w_{j}) + \frac{1}{2} g_{j} e_{s}^{2}(w_{j})\right)$$

(27)

where $h_{j}$ and $g_{j}$ represent the first- and second-order gradient statistics of the loss function. If $J_{i}$ represents the set of instances in leaf $i$, then equation (27) is further transformed into equation (28):

$$Γ_{s} = \sum_{i = 1}^{M} \left(\left(\sum_{j \in J_{i}} h_{j}\right) ω_{i} + \frac{1}{2} \left(\sum_{j \in J_{i}} g_{j} + λ\right) ω_{i}^{2}\right)$$

(28)

where $λ$ is a regularization coefficient.

Equations (29) and (30) give the optimal leaf weight $ω_{i}^{*}$ of each leaf and the corresponding extreme value of $Γ_{s}$ for a given tree structure $r(w)$:

$$ω_{i}^{*} = - \frac{\sum_{j \in J_{i}} h_{j}}{\sum_{j \in J_{i}} g_{j} + λ}$$

(29)

$$Γ_{s}^{*} = - \frac{1}{2} \sum_{i = 1}^{M} \frac{\left(\sum_{j \in J_{i}} h_{j}\right)^{2}}{\sum_{j \in J_{i}} g_{j} + λ}$$

(30)

where $Γ_{s}^{*}$ is the scoring function used to measure the quality of the tree structure $r(w)$. Adding a split produces the gain:

$$H = \frac{1}{2} \left[\frac{\left(\sum_{j \in J_{k}} h_{j}\right)^{2}}{\sum_{j \in J_{k}} g_{j} + λ} + \frac{\left(\sum_{j \in J_{q}} h_{j}\right)^{2}}{\sum_{j \in J_{q}} g_{j} + λ} - \frac{\left(\sum_{j \in J} h_{j}\right)^{2}}{\sum_{j \in J} g_{j} + λ}\right]$$

(31)

where $J_{k}$ and $J_{q}$ are the sample sets of the left and right branches, and $J$ is the sample set of the parent node. LGBM plays a pivotal role in this research by enabling robust classification of English phonemic outputs and estimating neurocognitive load through its efficient, leaf-wise boosting strategy and memory-optimized architecture, making it ideal for real-time, immersive language learning applications.
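As a numerical illustration of the leaf weight in equation (29) and the split gain in equation (31), the sketch below follows this section's notation ($h$ for first-order and $g$ for second-order statistics) and uses the standard form of the gain, in which the parent term is subtracted. The function names and the toy values are assumptions made for illustration.

```python
def leaf_weight(h, g, lam):
    """Optimal leaf weight, eq. (29): -(sum h) / (sum g + lambda)."""
    return -sum(h) / (sum(g) + lam)


def leaf_score(h, g, lam):
    """(sum h)^2 / (sum g + lambda), the term used in eqs. (30) and (31)."""
    return sum(h) ** 2 / (sum(g) + lam)


def split_gain(h, g, left_idx, lam):
    """Gain of a split: 0.5 * (left score + right score - parent score)."""
    left = set(left_idx)
    h_left = [h[j] for j in range(len(h)) if j in left]
    g_left = [g[j] for j in range(len(g)) if j in left]
    h_right = [h[j] for j in range(len(h)) if j not in left]
    g_right = [g[j] for j in range(len(g)) if j not in left]
    return 0.5 * (
        leaf_score(h_left, g_left, lam)
        + leaf_score(h_right, g_right, lam)
        - leaf_score(h, g, lam)
    )


# Toy statistics for a squared loss (assumed): h = prediction - target, g = 1.
h = [-1.0, -1.0, 1.0, 1.0]
g = [1.0, 1.0, 1.0, 1.0]
print(split_gain(h, g, left_idx=[0, 1], lam=0.0))  # separates the two groups
print(split_gain(h, g, left_idx=[0, 2], lam=0.0))  # mixes the two groups
print(leaf_weight(h[:2], g[:2], lam=0.0))
```

A split that separates samples with opposite gradient signs yields a larger gain than one that mixes them, which is the criterion leaf-wise growth uses to pick the next leaf to split.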

Manta ray foraging optimization (MRFO)

This section summarizes the core steps of the Manta Ray Foraging Optimization (MRFO) algorithm. MRFO simulates three foraging behaviors—chain foraging, cyclone foraging, and somersault foraging—that guide the population of candidate solutions within the search space. These behaviors enable the algorithm to balance exploration and exploitation effectively. In our application, MRFO is employed to optimize hyperparameters of the LGBM classifier, enhancing its ability to accurately classify phonemic performance and predict neurocognitive load during immersive VR language learning.

Chain foraging

In chain foraging, phonemic classification is iteratively refined by steering candidate solutions toward the most confident phonemic representations of the VR audio and visual information. Manta rays line up head to tail to form a foraging chain. In MRFO, a higher concentration of plankton, the manta rays’ preferred food, corresponds to a better solution. The first individual moves toward the food, while each subsequent individual moves toward both the food and the individual in front of it. Equation (32) gives the mathematical model of chain foraging.

$$w_{j}^{s + 1} = \begin{cases} w_{j}^{s} + q_{1} \cdot (w_{best}^{s} - w_{j}^{s}) + α \cdot (w_{best}^{s} - w_{j}^{s}), & j = 1 \\ w_{j}^{s} + q_{2} \cdot (w_{j - 1}^{s} - w_{j}^{s}) + α \cdot (w_{best}^{s} - w_{j}^{s}), & j = 2, \dots, MO \end{cases}$$

(32)

Here, $α = 2 \cdot q_{3} \cdot \sqrt{|\log(q_{4})|}$, where $q_{j}$, $j = 1, 2, 3, 4$, are uniformly distributed random vectors with $q_{j} \in [0, 1]$, $w_{j}^{s}$ is the position of the $j^{th}$ individual at generation $s$, $MO$ is the number of individuals, $α$ is a weight coefficient, and $w_{best}^{s}$ is the position with the highest plankton concentration, that is, the best individual.
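The chain-foraging update of equation (32) can be sketched as follows. This is a minimal illustration, not the study's implementation: positions are plain Python lists, random components are drawn per dimension, and the function name `chain_foraging` is hypothetical.

```python
import math
import random


def chain_foraging(population, best, rng):
    """One chain-foraging update, eq. (32), for a list of position vectors."""
    new_population = []
    for j, position in enumerate(population):
        # The first individual follows the best position; the others follow
        # the individual directly ahead of them (generation-s position).
        leader = best if j == 0 else population[j - 1]
        updated = []
        for d in range(len(position)):
            q = rng.random()
            q3 = rng.random()
            q4 = 1.0 - rng.random()  # lies in (0, 1], keeps log() finite
            alpha = 2.0 * q3 * math.sqrt(abs(math.log(q4)))
            updated.append(
                position[d]
                + q * (leader[d] - position[d])
                + alpha * (best[d] - position[d])
            )
        new_population.append(updated)
    return new_population


rng = random.Random(0)  # fixed seed so the run is repeatable
population = [[0.0, 0.0], [4.0, 4.0], [-2.0, 3.0]]
print(chain_foraging(population, best=[1.0, 1.0], rng=rng))
```

Because both terms pull toward the leader and the best position, an individual that already sits on the best position with its leader is left unchanged by this update.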

Cyclone foraging

Cyclone foraging is used to simulate the spiral cognitive processing involved when learners encounter new or unfamiliar phonemic inputs in immersive VR environments. By spiraling toward the best solution while also accounting for variation among neighboring agents, the algorithm explores subtle cognitive patterns tied to different phoneme complexities.

When manta rays find plankton in deep water, they form a long foraging chain and spiral toward the food. The spiral movement is comparable to that of the Whale Optimization Algorithm (WOA), except that each individual not only spirals toward the food but also follows the individual in front of it. Equation (33) provides the mathematical representation of cyclone foraging.

$$w_{j}^{s + 1} = \begin{cases} w_{best}^{s} + q_{5} \cdot (w_{best}^{s} - w_{j}^{s}) + β \cdot (w_{best}^{s} - w_{j}^{s}), & j = 1 \\ w_{best}^{s} + q_{6} \cdot (w_{j - 1}^{s} - w_{j}^{s}) + β \cdot (w_{best}^{s} - w_{j}^{s}), & j = 2, \dots, MO \end{cases}$$

(33)

Here, $β = 2 \cdot \exp\left(q_{7} \cdot \frac{iter_{\max} - iter + 1}{iter_{\max}}\right) \cdot \sin(2 π q_{7})$, where $q_{5}$ and $q_{6}$ are uniformly distributed random vectors in $[0, 1]$, $β$ is the weight coefficient, $q_{7}$ is a uniformly distributed random number in $[0, 1]$, and $iter_{\max}$ and $iter$ denote the maximum and current number of iterations. In equation (33), the food serves as the reference point for spiral foraging, which helps to fully exploit the area close to the food. Spiral foraging can also use a random position within the search space as the reference point in order to expand the search region. This forces each individual to search for positions away from the current best one. Exploration is the main objective of this random spiral foraging process, which gives MRFO a thorough global search. The corresponding mathematical model is shown in equation (34).

$$w_{j}^{s + 1} = \begin{cases} w_{rand} + q_{9} \cdot (w_{rand} - w_{j}^{s}) + β \cdot (w_{rand} - w_{j}^{s}), & j = 1 \\ w_{rand} + q_{10} \cdot (w_{j - 1}^{s} - w_{j}^{s}) + β \cdot (w_{rand} - w_{j}^{s}), & j = 2, \dots, MO \end{cases}$$

(34)

Here, $w_{rand} = ka + q_{8} \cdot (va - ka)$, where $w_{rand}$ is a randomly generated position in the search space, $q_{8}$, $q_{9}$, and $q_{10}$ are uniformly distributed random vectors in $[0, 1]$, and $va$ and $ka$ denote the upper and lower boundaries of the search space.
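Equations (33) and (34) share the same structure and differ only in the reference point, so a single sketch can cover both. The code below is an illustrative assumption rather than the study's implementation: `cyclone_foraging` takes the reference position as an argument, and `random_reference` draws $w_{rand}$ between the lower and upper bounds.

```python
import math
import random


def random_reference(lower, upper, rng):
    """Random position inside the search bounds (w_rand in eq. (34))."""
    return [lo + rng.random() * (up - lo) for lo, up in zip(lower, upper)]


def cyclone_foraging(population, reference, it, it_max, rng):
    """One cyclone update around `reference`: the best position gives
    eq. (33), a random position gives eq. (34)."""
    new_population = []
    for j, position in enumerate(population):
        leader = reference if j == 0 else population[j - 1]
        updated = []
        for d in range(len(position)):
            q = rng.random()
            q7 = rng.random()
            beta = (
                2.0
                * math.exp(q7 * (it_max - it + 1) / it_max)
                * math.sin(2.0 * math.pi * q7)
            )
            updated.append(
                reference[d]
                + q * (leader[d] - position[d])
                + beta * (reference[d] - position[d])
            )
        new_population.append(updated)
    return new_population


rng = random.Random(0)  # fixed seed so the run is repeatable
population = [[0.0, 0.0], [4.0, 4.0], [-2.0, 3.0]]
w_rand = random_reference([-5.0, -5.0], [5.0, 5.0], rng)
print(cyclone_foraging(population, w_rand, it=1, it_max=50, rng=rng))
```

The exponent in the weight shrinks as `it` approaches `it_max`, so the spiral steps tend to become smaller late in the run.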

Somersault foraging

Somersault foraging simulates rapid re-evaluation around the best phoneme-cognitive interaction points, essentially acting as a refinement mechanism. In this phase, the food position is regarded as a pivot. Each individual somersaults around the pivot to a new position. Equation (35) gives the mathematical model of this phase.

$$w_{j}^{s + 1} = w_{j}^{s} + T \cdot (q_{11} \cdot w_{best}^{s} - q_{12} \cdot w_{j}^{s}), \quad j = 1, 2, \dots, MO$$

(35)

where $T = 2$ is the somersault factor that determines the manta rays’ somersault range, and $q_{11}$ and $q_{12}$ are two random numbers in $[0, 1]$. MRFO balances exploration and exploitation through the ratio $\frac{iter}{iter_{\max}}$. When $\frac{iter}{iter_{\max}} < rand$, exploratory behavior dominates, and a random position in the search space is used as the reference. When $\frac{iter}{iter_{\max}} \geq rand$, the best individual is used as the reference point, which favors exploitation. In each iteration, either chain foraging or cyclone foraging is chosen at random, and somersault foraging is then carried out.
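The somersault step and the switch between exploration and exploitation can be sketched as follows. The sketch assumes the somersault factor of 2 stated above and per-dimension random numbers; the names `somersault_foraging` and `choose_reference` are hypothetical.

```python
import random

SOMERSAULT_FACTOR = 2.0  # T in eq. (35)


def somersault_foraging(population, best, rng):
    """One somersault update around the best position, eq. (35)."""
    new_population = []
    for position in population:
        updated = []
        for d in range(len(position)):
            q11, q12 = rng.random(), rng.random()
            updated.append(
                position[d]
                + SOMERSAULT_FACTOR * (q11 * best[d] - q12 * position[d])
            )
        new_population.append(updated)
    return new_population


def choose_reference(it, it_max, rng):
    """Return 'random' (exploration) while iter / iter_max < rand,
    otherwise 'best' (exploitation)."""
    return "random" if it / it_max < rng.random() else "best"


rng = random.Random(0)  # fixed seed so the run is repeatable
print(choose_reference(5, 50, rng))
print(somersault_foraging([[1.0, 1.0], [3.0, -2.0]], best=[1.0, 1.0], rng=rng))
```

Early in the run the ratio is small, so a random reference is likely; by the final iteration the ratio reaches 1 and the best position is always used.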

This research used MRFO’s adaptive exploration and exploitation to jointly pursue linguistic precision and cognitive processing efficiency. Each foraging strategy contributes either to exploring a range of phoneme-cognitive patterns or to converging on an optimal decision boundary. Overall, the MRFO-based system is intended to improve English phonemic classification accuracy and to support real-time, learner-specific neurocognitive load predictions. Collectively, MRFO represents a valuable tool for developing responsive and cognitively aware VR-based systems for language acquisition.
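To connect the optimizer to the classifier, each manta ray position has to be translated into a set of LGBM hyperparameters. The sketch below shows one possible encoding under stated assumptions: positions live in $[0, 1]^{d}$, and the search space is hypothetical, since the study does not list the tuned hyperparameters or their ranges. The keys `num_leaves`, `max_depth`, and `learning_rate` are LightGBM parameter names; the helper `decode_position` is illustrative.

```python
def decode_position(position, search_space):
    """Map a position in [0, 1]^d to a dict of LGBM hyperparameters."""
    params = {}
    for value, (name, (low, high, kind)) in zip(position, search_space.items()):
        clipped = min(max(value, 0.0), 1.0)  # keep the candidate in bounds
        scaled = low + clipped * (high - low)
        params[name] = int(round(scaled)) if kind == "int" else scaled
    return params


# Hypothetical search space: names are LightGBM parameters, ranges are assumed.
SEARCH_SPACE = {
    "num_leaves": (8, 128, "int"),
    "max_depth": (3, 12, "int"),
    "learning_rate": (0.01, 0.3, "float"),
}

print(decode_position([0.25, 0.5, 0.1], SEARCH_SPACE))
```

In a full pipeline, each decoded dictionary would be used to train a classifier, and a validation score such as cross-validated error would serve as the fitness value that MRFO minimizes.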

Results and discussion

Experimental setup

The experimental design utilized Python 3.10.1 to simulate the proposed MRFO-LGBM-based immersive language learning framework. Python was chosen due to its excellent compatibility with leading scientific computing libraries. The experiments were run with preprocessed multimodal datasets to evaluate the system’s ability to predict learner performance in immersive VR settings.

Performance evaluation of the proposed method

Figure 2 illustrates the relationships among various neurocognitive and physiological features related to immersive VR-based English phonemic learning. Notably, the MFCC mean exhibits a weak negative correlation (−0.13) with correct pronunciation. This suggests that audio signal characteristics are related to how well audio is parsed and pronounced, although the linear association is small. The analysis supports incorporating these features into a model for predicting learner performance in VR-enhanced language acquisition systems.

Figure 2.

Correlation matrix for the relationships among various neurocognitive and physiological features.

Figure 3 summarizes the occurrence percentages of the target phonemes presented during the immersive VR-based phonemic training. Among the target phonemes, /dʒ/ and /θ/ occurred most often, reflecting a greater emphasis on voiced affricates and voiceless interdental fricatives. The frequent occurrence of /l/, /r/, and /æ/ reflects their importance for learners of English who find these phonemes difficult to pronounce. The notably low occurrence of /ʃ/ indicates that the sample of voiceless postalveolar fricatives was relatively small. Overall, the distribution covers a range of articulatorily complex English phonemes in the neurocognitive evaluation process.

Figure 3.

Phonemes used in the immersive VR-based phonemic training.

Figure 4 shows that auditory signal quality and theta-band EEG activity are the most influential factors in predicting correct English phoneme pronunciation in an immersive VR setting. Neurocognitive load and PSD beta follow closely, while eye-tracking and physiological signals such as GSR and ECG demonstrate moderate importance. These results validate the model’s reliance on neurocognitive and affective signals to accurately quantify pronunciation accuracy.

Figure 4.

Feature importance for predicting correct English phoneme pronunciation in an immersive VR setting.

In this research, GBM and LGBM were implemented as baselines, and the hybrid MRFO-LGBM method outperformed both when classifying phonemic performance and predicting neurocognitive load. MRFO-LGBM achieved the highest value on every evaluation metric: accuracy (96.81%), precision (96.13%), recall (95.67%), and F1-score (95.89%). This increase in performance can be attributed to MRFO’s fine-tuning of the LGBM model’s hyperparameters, leading to better phonemic classification accuracy and neurocognitive load predictions in immersive VR-based language learning environments. Table 2 and Figure 5 show the model performance comparison for phonemic classification.

Table 2.

Performance comparison for phonemic classification.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-score (%)
GBM	89.67	88.12	87.45	87.78
LGBM	92.34	91.20	90.65	90.92
MRFO-LGBM (Proposed)	96.81	96.13	95.67	95.89

Figure 5.

Outcome of the model performance comparison for phonemic classification using MRFO-LGBM.
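As a quick consistency check on Table 2, the F1-score can be recomputed as the harmonic mean of the reported precision and recall. The sketch assumes that each F1 value was derived from the precision and recall in the same row; if the study averaged per-class F1-scores instead, small differences would be expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from Table 2.
reported = {
    "GBM": (88.12, 87.45, 87.78),
    "LGBM": (91.20, 90.65, 90.92),
    "MRFO-LGBM": (96.13, 95.67, 95.89),
}
for model, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{model}: recomputed F1 = {f1:.2f}, reported = {f1_reported:.2f}")
```

The recomputed values agree with the reported ones to within about 0.01 percentage points, which is consistent with rounding of the inputs.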

Table 3 and Figure 6 present the accuracy results for selected phonemic word classifications using three models: GBM, LGBM, and the proposed MRFO-LGBM. For the phonemic word /æ/, the MRFO-LGBM model achieves the highest accuracy of 92.34%, outperforming both GBM and LGBM. Similarly, for the phonemic word /b/, the proposed MRFO-LGBM model again leads with an accuracy of 94.23%. The trend continues with the phonemic words /t/ and /s/, where MRFO-LGBM outperforms both baseline models with accuracies of 93.65% and 95.72%, respectively. These results indicate that the MRFO-LGBM model classifies phonemic words more accurately than GBM and LGBM, highlighting its effectiveness in immersive language acquisition tasks involving complex phonemic distinctions.

Table 3.

Phonemic word classification performance comparison.

Phonemic words	GBM (%)	LGBM (%)	MRFO-LGBM (Proposed) (%)
/æ/	85.67	88.45	92.34
/b/	89.12	91.78	94.23
/t/	86.45	89.23	93.65
/s/	87.23	90.54	95.72

Figure 6.

Accuracy results for some phonemic word classifications using MRFO-LGBM.

Discussion

This study on immersive language acquisition demonstrates that virtual reality (VR) technology can improve language learning by activating multiple senses and increasing neurocognitive functioning. The µECoG-based brain decoding system in work¹⁶ exhibited exceptional signal accuracy, but its applicability is limited by its invasive nature and high setup cost, making it unsuitable for widespread educational implementation. The interactive AR system with AI speech recognition presented in research¹⁸ significantly improved phoneme mastery but required long-term evaluation to corroborate learning outcomes. Similarly, the VR-based project approach described in paper¹⁹ increased critical thinking and communication; however, its limited linguistic scope and short-term exposure restrict its generalizability. The GBM and LGBM models used in this research show specific limitations in capturing the inherent complexity of the phonemic word classification tasks, as reflected in their precision of 88.12% and 91.20%. GBM is effective in general, but it tends to overfit high-dimensional datasets, particularly when applied to subtle phonemic features. LGBM is faster and more efficient, but it also struggles with high dimensionality and with the variable patterns associated with phonemic recognition. Both models fell short of the proposed method in predicting neurocognitive load and phonemic performance; therefore, a better-optimized methodology was needed.

The MRFO-LGBM was created to address these problems. The MRFO algorithm optimizes LGBM, yielding a more adaptable classifier for phonemic word patterns. By balancing exploration and exploitation during hyperparameter calibration, MRFO improves the accuracy and generalization capacity of LGBM with respect to latent phonemic features and neurocognitive load predictions. The result is a more accurate and reliable model for phoneme recognition. In this study, MRFO-LGBM outperformed GBM and LGBM, supporting better phonemic learning assessment and cognitive analysis in immersive language acquisition.

While demonstrating significant efficacy, this study has several limitations affecting real-world applicability. Firstly, the controlled VR environment, though immersive, may not fully replicate the unpredictable social dynamics and background noise of authentic communication settings, potentially limiting the transferability of acquired phonemic skills. Secondly, the reliance on specific hardware (EEG headsets and eye-trackers) introduces practical constraints for widespread deployment in diverse educational contexts due to cost and setup complexity. Thirdly, individual differences in VR adaptation (e.g., susceptibility to cybersickness and varying levels of technological familiarity) were not explicitly controlled for, which could influence neurocognitive engagement and learning outcomes. Finally, the study focused on a specific demographic (450 learners within the dataset context); broader generalizability across age groups, linguistic backgrounds, and proficiency levels requires further investigation. These factors necessitate cautious interpretation of the results and highlight areas for future development.

Future research should address the current limitations and explore promising extensions. Primarily, developing real-time adaptive VR systems is crucial. These systems would dynamically adjust phoneme difficulty (e.g., noise levels, speaker accent variations, and contextual complexity) and provide personalized feedback based on continuous analysis of learners’ multimodal bio-signals (EEG cognitive load, eye-tracking focus, and pronunciation accuracy via MFCCs). Second, exploring Augmented Reality (AR) integration could create hybrid immersive environments, overlaying phonemic visualizations (e.g., articulatory gestures) onto real-world objects or scenarios, potentially enhancing ecological validity and accessibility. Third, rigorous longitudinal studies are needed to assess the long-term retention of VR-induced phonemic improvements and neurocognitive adaptations beyond the experimental session. Furthermore, expanding the framework to incorporate a broader spectrum of learners (different ages, L1 backgrounds, and learning disabilities) and extend it to multiple languages with diverse phonetic inventories (e.g., tonal languages and languages with click consonants) is essential for establishing generalizability. Such multilingual VR phonemic training platforms hold significant potential for corporate language programs and cross-cultural communication training within the global workforce.

Conclusion

This research demonstrated the potential benefits of VR for improving phonemic competence in second language acquisition. In total, 450 English learners participated in immersive VR contexts targeting difficult phonemes. The study employed multimodal bio-signal analysis, capturing EEG, eye-tracking, and audio signal data to determine neurocognitive gains. EEG and eye-tracking data were preprocessed using Z-score normalization, and audio data were denoised with a Savitzky–Golay filter to preserve phonemic features. Features were extracted using ICA for the EEG and eye-tracking data and MFCCs for the audio data to capture phonemic information. A feature-level fusion technique combined the normalized features and audio-based MFCCs into a unified, high-dimensional feature space for comprehensive multimodal analysis. The Manta Ray Foraging Optimized Light Gradient-Boosting Machine (MRFO-LGBM) was designed to optimize the LGBM model, improving phonemic performance classification and the prediction of neurocognitive load. The MRFO-LGBM model had an accuracy of 96.81%, an F1-score of 95.89%, a precision of 96.13%, and a recall of 95.67%.

Footnotes

ORCID iD

Jing Lan

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the University foreign language teaching and scientific research project of Foreign Language Teaching and Research Press Ltd. (Nos 2023060101, 2022060801, and 2020050601).

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The authors declare that the data supporting the findings of this study are available within the article. The raw/derived data supporting the findings of this study are available from the corresponding author on request.

Appendix

High variable phonetic training = HVPT	Virtual reality = VR
Brain-computer interface = BCIs	Artificial intelligence = AI
Electroencephalogram = EEG	Time-Delay Neural Networks = TDNN
Gamma power = GP	Immersive virtual reality = IVR
Accents Method rhythm patterns = AMRP	Autism spectrum disorder = ASD
Micro-electrocorticographic = µECoG	Finite impulse response = FIR
Phonological mismatch negativity = PMN	Fast Fourier Transform = FFT
Augmented Reality = AR	Discrete Cosine Transform = DCT
Mel filter = MF	Functional Near-Infrared Spectroscopy = fNIRS
Progressive Sleep Inversion = PSI	Mobile-based Variability Phonetic Training = MVPT

References

1. Natvig. Modeling heritage language phonetics and phonology: toward an integrated multilingual sound system. Languages 2021; 6(4): 209. DOI: 10.3390/languages6040209.

2. Habbash, Mnasri, Alghamdi, et al. Recognition of Arabic accents from English spoken speech using deep learning approach. IEEE Access 2024; 12: 37219–37230. DOI: 10.1109/ACCESS.2024.3374768.

3. Coumel, Groß, Sommer-Lolei, et al. The contribution of music abilities and phonetic aptitude to L2 accent faking ability. Languages 2023; 8(1): 68. DOI: 10.3390/languages8010068.

4. Karpovich, Sheredekina, Krepkaia, et al. The use of monologue speaking tasks to improve first-year students’ English-speaking skills. Educ Sci 2021; 11(6): 298. DOI: 10.3390/educsci11060298.

5. Liu, Morris, et al. Knowledge-based features for speech analysis and classification: pronunciation diagnoses. Electronics 2023; 12(9): 2055. DOI: 10.3390/electronics12092055.

6. Altakhaineh, AL-Junaid, Younes. Pronunciation and spelling accuracy in English words with initial and final consonant clusters by Arabic-speaking EFL learners. Languages 2024; 9(12): 356. DOI: 10.3390/languages9120356.

7. O’Brien, Seward, Zhang. Multisensory interactive digital text for English phonics instruction with bilingual beginning readers. Educ Sci 2022; 12(11): 750. DOI: 10.3390/educsci12110750.

8. Mamani-Calapuja, Laura-Revilla, Hurtado-Mazeyra, et al. Learning English in early childhood education with augmented reality: design, production, and evaluation of the “Wordtastic kids” app. Educ Sci 2023; 13(7): 638. DOI: 10.3390/educsci13070638.

9. Cho, Kim. Production of mobile English language teaching application based on text interface using deep learning. Electronics 2021; 10(15): 1809. DOI: 10.3390/electronics10151809.

10. Rodríguez-Cano, Delgado-Benito, Ausín-Villaverde, et al. Design of a virtual reality software to promote the learning of students with dyslexia. Sustainability 2021; 13(15): 8425. DOI: 10.3390/su13158425.

11. Wilang, Seepho, Kitjaroonchai. Exploring the relationship of reading fluency and accuracy in L2 learning: insights from a reading assistant software. Educ Sci 2025; 15(4): 488. DOI: 10.3390/educsci15040488.

12. Yang, Wang, et al. Mobile application-based phonetic training facilitates Chinese-English learners’ learning of L2. Learn Instruct 2024; 93: 101967. DOI: 10.1016/j.learninstruc.2024.101967.

13. LaRocco, Tahmina, Lecian, et al. Evaluation of an English language phoneme-based imagined speech brain computer interface with low-cost electroencephalography. Front Neuroinform 2023; 17: 1306277. DOI: 10.3389/fninf.2023.1306277.

14. Juyal, Muthusamy, Kumar, et al. Resting state EEG assisted imagined vowel phonemes recognition by native and non-native speakers using brain connectivity measures. Phys Eng Sci Med 2024; 47(3): 939–954. DOI: 10.1007/s13246-024-01417-w.

15. Wan, Wang, et al. The combination of accent method and phonemic contrast: an innovative strategy to improve speech production on post-stroke dysarthria. Front Hum Neurosci 2024; 17: 1298974. DOI: 10.3389/fnhum.2023.1298974.

16. Duraivel, Rahimpour, Chiang, et al. High-resolution neural recordings improve the accuracy of speech decoding. Nat Commun 2023; 14(1): 6938. DOI: 10.1038/s41467-023-42555-1.

17. Heidlmayr, Ferragne, Isel. Neuroplasticity in the phonological system: the PMN and the N400 as markers for the perception of non-native phonemic contrasts by late second language learners. Neuropsychologia 2021; 156: 107831. DOI: 10.1016/j.neuropsychologia.2021.107831.

18. Tolba, Elarif, Taha, et al. Interactive augmented reality system for learning phonetics using artificial intelligence. IEEE Access 2024; 12: 78219–78231. DOI: 10.1109/ACCESS.2024.3406494.

19. Yuan, Yang, et al. Research on project-based learning of foreign trade English in speech recognition virtual reality environment. Soft Comput 2023; 8: 1–2. DOI: 10.1007/s00500-023-08896-1.

20. Xue, Wang. English listening teaching device and method based on virtual reality technology under wireless sensor network environment. J Sens 2021; 2021(1): 8261861. DOI: 10.1155/2021/8261861.

21. Xie, Zhang, Yang. Effect of immersive virtual reality based upon input processing model for second language vocabulary retention. Educ Inf Technol 2025; 10: 1–21. DOI: 10.1007/s10639-025-13333-x.

22. Mozaffari, Lee. Second language pronunciation training by ultrasound-enhanced visual augmented reality. In: 2021 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, 2021, pp. 3043–3050. DOI: 10.1109/BIBM52615.2021.9669622.

23. Bahameish, Khowaja, Abdelaal, et al. Pathways to learning: exploring the impact of augmented reality on vocabulary development in children with autism spectrum disorder. Interact Learn Environ 2025; 4: 1–24. DOI: 10.1080/10494820.2025.2485407.

24. Watthanapas, Hao, et al. The effects of using virtual reality on Thai word order learning. Brain Sci 2023; 13(3): 517. DOI: 10.3390/brainsci13030517.

25. Pellas, Christopoulos. The effects of machinima on communication skills in students with developmental dyslexia. Educ Sci 2022; 12(10): 684. DOI: 10.3390/educsci12100684.

26. Wang, Liu. English song lyrics in EFL underachievers’ phoneme categorization. Sage Open 2025; 15(2): 21582440251330350. DOI: 10.1177/21582440251330350.