Leveraging machine learning techniques for grading the difficulty of English vocabulary learning

Abstract

As globalization accelerates, the significance of English in international communication becomes increasingly prominent, making the effective learning of English vocabulary a pivotal aspect of language acquisition. This study aims to explore a personalized grading method for English vocabulary learning difficulty through machine learning technology, facilitating learners in more efficiently mastering English vocabulary. Initially, the paper analyzes the limitations of traditional methods for grading the difficulty of English vocabulary learning, highlighting the lack of dynamic adaptation to the differences among learners. Subsequently, it introduces a novel approach to predicting the difficulty of learning English vocabulary using machine learning, particularly through the application of transfer learning techniques at the edge. This method adjusts the prediction model based on the learner’s background knowledge and learning history, thus enhancing the accuracy and applicability of predictions. Finally, by predicting the English vocabulary learning difficulty levels of students at different stages (beginner, intermediate, and advanced), this study validates the effectiveness of the proposed method. The results indicate that the machine learning model employing transfer learning demonstrates significant advantages in grading the difficulty of English vocabulary learning, offering more personalized learning guidance to learners of varying levels. The findings of this research not only provide a new perspective for the field of English education but also offer technical support for designing personalized learning paths.

Keywords

English vocabulary learning, machine learning, transfer learning, difficulty grading, personalized learning

Introduction

In the context of globalization and informatization, English, as the main language of international communication, plays an undeniably crucial role in learning.^1–3 With the development of educational technology, the application of machine learning techniques in the field of language learning has become increasingly widespread, especially in the personalized guidance and difficulty grading of English vocabulary learning, showing great potential.^4–6 English vocabulary, as the foundation of language learning, directly impacts the mastery of listening, speaking, reading, and writing skills. Therefore, researching how to grade the difficulty of English vocabulary learning using machine learning techniques is significant for improving the efficiency and quality of English learning.^7,8

Although English vocabulary learning occupies an important position in language education, traditional methods for grading vocabulary difficulty often rely on teachers’ experience or static vocabulary lists, lacking dynamic adaptability to the specific needs of different learners.^9–11 With the development of artificial intelligence technology, utilizing machine learning techniques for the personalized prediction and grading of English vocabulary learning difficulty can better meet learners’ personalized learning needs, enhancing efficiency and motivation.^12–15 Hence, exploring the application of machine learning techniques in grading the difficulty of English vocabulary learning holds significant research importance for promoting the personalized development of education.

However, existing machine learning-based methods for grading the difficulty of English vocabulary learning mostly focus on the application of generic algorithms, with less consideration for the differences in learners’ background knowledge and learning history, which leaves room for improvement in the precision and practicality of difficulty predictions.^16–18 Additionally, a few studies have attempted to incorporate learner characteristics, but still face challenges in efficiently integrating these features with algorithm optimization. Therefore, researching how to optimize machine learning models to better adapt to the English vocabulary learning difficulty prediction for learners with different backgrounds is a pressing issue that needs to be addressed.^19–21

This paper comprises two main parts. The first presents a machine learning-based method for estimating English vocabulary learning difficulty, in particular using edge transfer learning techniques to improve the model’s adaptability to the background knowledge of different learners. The second predicts and verifies the English vocabulary learning difficulty levels of students at the beginner, intermediate, and advanced stages, demonstrating the applicability and effectiveness of the model at different learning stages. Together, these two parts improve the accuracy and adaptability of English vocabulary learning difficulty prediction and provide a new personalized teaching strategy for the field of English education, which has both theoretical and practical value.

English vocabulary learning difficulty estimation method based on machine learning

In practice, grading the difficulty of English vocabulary learning involves student groups with different levels of knowledge and cultural backgrounds, and traditional estimation methods often fail to reflect the specific needs of each group. Therefore, this paper proposes an English vocabulary learning difficulty estimation method based on domain generalization. The method takes into account the differences in English vocabulary mastery among student groups and aims to provide an estimate of “relative difficulty,” which reflects the degree to which a specific student group has mastered a given vocabulary item while avoiding the overfitting caused by sparse data. Unlike traditional absolute difficulty estimation, relative difficulty estimation gives teachers more targeted information. For example, from a high relative difficulty value for a word, a teacher can infer that only a small portion of their student group is likely to understand and use that word correctly. The domain generalization approach thus makes better use of computational resources, reduces the risk of overfitting on sparse datasets, and adapts better to the learning needs of different student groups, providing them with more personalized learning resources and guidance and thereby enhancing overall teaching effectiveness and learning efficiency.

This machine learning-based technology enhances the model’s adaptability to learners from different backgrounds by analyzing a large amount of vocabulary usage data and using edge transfer learning techniques. Specifically, the model first learns the vocabulary difficulty feature curves from the training dataset, and then uses these features to estimate difficulty across different types of vocabulary and learning stages. By comparing vocabulary difficulty estimation performance, the model can accurately identify the relative learning difficulty of various vocabularies and verify its effectiveness among learners at different stages: beginner, intermediate, and advanced. The core of this technology lies in its adaptive ability and precise difficulty estimation, providing scientific basis and personalized guidance for English vocabulary teaching and self-directed learning.

The innovation of this method is based on a core assumption: student groups with similar learning backgrounds and capabilities exhibit similar learning behaviors and understanding levels when facing English vocabulary of different difficulties. Therefore, by encoding the English vocabulary text and mapping each vocabulary to a feature space, this framework first pre-trains the encoder on a global dataset to capture the absolute difficulty characteristics of English vocabulary, ensuring the model can adapt to the uniqueness of English as a specific subject area. Then, by aggregating the edge distribution data of various student groups to form a new group feature representation, integrating labeled English vocabulary usage data, group learning characteristics, and vocabulary text features, a linear regression model capable of predicting the relative difficulty of English vocabulary is trained. Once the model is trained, it can estimate the relative difficulty of existing or newly added English vocabulary in the vocabulary library without needing to retrain for each new student group, thereby achieving fast and accurate personalized estimation of English vocabulary learning difficulty for different learner groups.

Scenario description and data analysis

In this study, relying on the English vocabulary usage dataset from multiple schools, an in-depth analysis of the data characteristics of student groups in English vocabulary learning was conducted. The analysis revealed that for the collected data samples, only a few frequently occurring vocabulary difficulty labels were reliable enough. Most vocabulary appeared at a moderate frequency, but this included some potentially harmful data that could affect the accuracy of difficulty estimation, which needed to be identified and corrected. Meanwhile, other vocabularies that appeared only a few times were often treated as unlabeled samples. Through this analysis, the following characteristics can be summarized.

(1) In the data of English vocabulary learning, the distribution of vocabulary quantities that different groups are exposed to is extremely unbalanced. A minority of groups may learn most of the vocabulary, while the majority may only learn a few. At the same time, the number of interactions between students and vocabulary also follows a long-tail distribution, leading to many data that may not be conducive to difficulty estimation.

(2) By comparing the mastery levels of the same vocabulary among different groups, it was found that even vocabularies with similar difficulty values had different rankings among groups. To reduce the impact of this ranking difference on the difficulty estimation model training, we screened for vocabulary samples that have high consistency across different groups through sampling.

(3) To explore the transferability of edge distribution characteristics between different groups, this study plotted heat scatter plots, using the number of overlaps and Jaccard similarity to represent the capability differences between groups. The results showed that groups with similar edge distributions had smaller differences in learning capabilities. Figure 1 shows the distribution curve of student English vocabulary learning data.

Figure 1.

Distribution curve of student English vocabulary learning data.
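The overlap count and Jaccard similarity mentioned in characteristic (3) can be computed directly from the sets of vocabulary that two groups have interacted with. The following is a minimal sketch; the function names and the toy groups are hypothetical and are not taken from the study’s dataset.

```python
def overlap_count(vocab_a, vocab_b):
    """Number of vocabulary items that both groups have interacted with."""
    return len(set(vocab_a) & set(vocab_b))


def jaccard_similarity(vocab_a, vocab_b):
    """Jaccard similarity |A intersect B| / |A union B| of two vocabulary sets."""
    set_a, set_b = set(vocab_a), set(vocab_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty groups
    return len(set_a & set_b) / len(union)


# Hypothetical toy groups, for illustration only.
group_vocab = {
    "group_1": ["apple", "run", "happy", "because"],
    "group_2": ["apple", "run", "analyze", "because"],
    "group_3": ["hypothesis", "analyze"],
}

# Compare every unordered pair of groups once.
names = sorted(group_vocab)
pairs = {}
for i, first in enumerate(names):
    for second in names[i + 1:]:
        pairs[(first, second)] = (
            overlap_count(group_vocab[first], group_vocab[second]),
            jaccard_similarity(group_vocab[first], group_vocab[second]),
        )
```

Each entry of `pairs` holds the two quantities that a heat scatter plot of group similarity would display for one pair of groups.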

Problem definition

The core issue addressed in this paper is how to accurately estimate the learning difficulty of English vocabulary for different student groups. Unlike the general estimation of absolute difficulty, which considers only the vocabulary itself and its overall difficulty, this paper customizes the estimation for each specific student group h. Training samples are observed for each of V training student groups {h_u}_{1≤u≤V}. The challenge is that the reliability of the sample data follows a long-tail distribution, meaning that data is very limited for some vocabulary, and that the samples within a group are arranged randomly, without time-series information. Assuming the number of labeled samples of group h_u is represented by v_u and the number of unlabeled samples by v′_u, with a_uk denoting a vocabulary item and b_uk its relative difficulty label, the training sample set for group h_u is:

F_{h_{u}}^{TR} = \{(a_{uk}, b_{uk})\}_{1 \leq k \leq v_{u}} \cup \{(a_{uk})\}_{v_{u} + 1 \leq k \leq v_{u} + v_{u}^{'}}

(1)

The goal of this paper is to extract a universally applicable rule from these training data that can be applied to existing or previously unseen unlabeled datasets and that yields an estimator for new estimation tasks. This estimator can be applied to any English vocabulary item a, even one beyond the scope of the entire dataset. In other words, the model built in this paper outputs a function d:H×A→E, which maps a student group from H and an English vocabulary item from A to a learning difficulty in E. Figure 2 provides a more intuitive illustration of the English vocabulary learning difficulty estimation problem.

Figure 2.

Schematic diagram of the English vocabulary learning difficulty estimation problem.

Student group modeling under edge transfer learning

Figure 3 shows the schematic diagram of the network structure. In the framework of this study for estimating the difficulty of learning English vocabulary, the given set of English vocabularies {q₁,q₂,…, q_V} is processed first: each word or token q_u is initialized through f₀-dimensional word embeddings, giving w = {n₁,n₂,…, n_V}, where each n_u is the vector representation of q_u. This step constructs an initial mathematical representation for each English vocabulary item, converting natural language into a form that can be processed by machines and laying the foundation for the subsequent extraction of semantic information. Specifically, a text encoder with parameters ϕ_s is executed to generate the semantic representation g from the content w:

g = d_{φ_{s}} (\{n_{1}, n_{2}, \dots, n_{V}\})

(2)

Figure 3.

Schematic diagram of network structure.

Next, once the textual representation of the vocabulary is obtained, the dataset {(g_uk,b_uk)}_{1≤k≤v_u}∪{(g_uk)}_{v_u+1≤k≤v_u+v′_u} of each student group is processed further: the semantic information of all vocabulary associated with the group is combined through an aggregation operation, yielding the group’s representation h^∼ = Aggr ({g_uk}_{1≤k≤v_u+v′_u}). The choice of aggregation function is crucial for capturing the diversity of learning behaviors within the group. This study chooses element-wise max pooling as the aggregation function because it does not depend on the order of the samples and is computationally efficient, and it effectively extracts key features from group learning activities. Let σ denote the sigmoid activation function, let Q_PO and y denote the learnable matrix and bias, and let ϕ_h = {Q_PO, y}. The operation expression is:

\tilde{h} = \mathrm{MAX} (σ (Q_{PO} \times g_{uk} + y)), k \in [1, v_{u} + v_{u}^{'}]

(3)
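Equation (3) can be illustrated with a small pure-Python sketch. It assumes that each vocabulary representation is a plain list of floats and that the pooling runs over all vocabulary representations of one group; the function names and toy values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written sigma in Equation (3)."""
    return 1.0 / (1.0 + math.exp(-x))


def project(q_po, y, g):
    """Compute sigma(Q_PO g + y) for one vocabulary representation g."""
    return [sigmoid(sum(w * x for w, x in zip(row, g)) + bias)
            for row, bias in zip(q_po, y)]


def aggregate_group(q_po, y, vocab_reprs):
    """Element-wise max pooling over all vocabulary representations of a group."""
    if not vocab_reprs:
        raise ValueError("a group needs at least one vocabulary representation")
    projected = [project(q_po, y, g) for g in vocab_reprs]
    # zip(*projected) iterates over feature dimensions; max pools across samples.
    return [max(dimension) for dimension in zip(*projected)]


# Toy example with hypothetical values: identity projection and zero bias.
q_po = [[1.0, 0.0], [0.0, 1.0]]
y = [0.0, 0.0]
group = [[0.0, 2.0], [1.0, -1.0]]
h_tilde = aggregate_group(q_po, y, group)
```

Because the maximum is taken independently in every dimension, reordering the samples of a group leaves `h_tilde` unchanged, which suits data without time-series information.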

In the model’s final layer, the goal is to output the estimated learning difficulty value for each English vocabulary. To maintain the model’s simplicity and effectiveness, this study employs a linear model with parameters ϕ_o as the interaction function, directly calculating the difficulty value from the aggregated representation of the group and vocabulary. This design leverages the interpretability and computational efficiency of linear models in handling complex issues, aiming to provide an accurate difficulty estimation for each English vocabulary. Based on enhanced features, the model output is:

\tilde{b} = d_{φ_{o}} (\mathrm{concat} (\tilde{h}, g))

(4)

Finally, this study employs the Adam optimizer to minimize the loss function, optimizing the parameters of the entire English vocabulary learning difficulty estimation model. The Adam optimizer is widely used in machine learning tasks because of its adaptive learning rates, which help accelerate convergence during training while also improving the accuracy of difficulty estimation. Through this series of steps, this study establishes an English vocabulary learning difficulty estimation model that considers both the characteristics of the vocabulary itself and the learning behaviors of student groups, aiming to provide more precise and personalized difficulty estimates. With m denoting the per-sample loss, the overall training loss is:

M = \frac{1}{V} \sum_{u = 1}^{V} M (F_{h_{u}}^{TR}, d) = \frac{1}{V} \sum_{u = 1}^{V} \frac{1}{v_{u}} \sum_{k = 1}^{v_{u}} m (\tilde{b}_{uk}, b_{uk})

(5)
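The prediction layer of Equation (4) and the group-averaged loss of Equation (5) can be sketched as follows. This is an illustration under stated assumptions rather than the study’s implementation: the per-sample loss m is not specified in the text, so squared error is assumed here, and all function names and the data layout are hypothetical.

```python
def predict_difficulty(weights, bias, group_repr, vocab_repr):
    """Linear model applied to concat(h~, g), as in Equation (4)."""
    features = list(group_repr) + list(vocab_repr)
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def squared_error(prediction, label):
    """Assumed per-sample loss m; the source does not fix this choice."""
    return (prediction - label) ** 2


def overall_loss(groups, weights, bias, sample_loss=squared_error):
    """Average over groups of the mean labeled-sample loss, as in Equation (5).

    groups: list of (group_repr, labeled_samples), where labeled_samples
    is a non-empty list of (vocab_repr, label) pairs.
    """
    group_losses = []
    for group_repr, labeled_samples in groups:
        if not labeled_samples:
            raise ValueError("each group needs at least one labeled sample")
        total = sum(
            sample_loss(predict_difficulty(weights, bias, group_repr, g), b)
            for g, b in labeled_samples
        )
        group_losses.append(total / len(labeled_samples))
    return sum(group_losses) / len(group_losses)
```

Averaging within each group first gives every group the same weight in the final loss, regardless of how many labeled samples it contributes.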

Encoder training

In the dataset for grading the difficulty of English vocabulary learning, the frequency with which each vocabulary item appears is unbalanced: some items are very common, while others are relatively rare. This imbalance makes it difficult for the model to learn sufficient semantic information from a few high-frequency items alone. Therefore, in this study, the text encoder is pre-trained with an absolute difficulty estimation task. The pre-trained encoder can exploit the wide coverage of the overall dataset to learn rich semantic representations, providing more accurate and robust vocabulary features for subsequent difficulty estimation tasks. At the same time, because the encoder is not re-trained for each difficulty estimation task, a significant amount of computational resources is saved and the model’s training and prediction processes are accelerated. Specifically, suppose the number of student interactions between group h_u and vocabulary a_k is represented by z_uk, and the relative difficulty is represented by b_uk. Given V training groups, the absolute difficulty b̄_k of vocabulary a_k can be obtained as the interaction-weighted average:

{\bar{b}}_{k} = \frac{1}{\sum_{u = 1}^{V} z_{u k}} \times \sum_{u = 1}^{V} z_{u k} b_{u k}

(6)
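Equation (6) is a weighted mean of the group-level relative difficulties, with the interaction counts as weights. A minimal sketch, with a hypothetical function name and toy values:

```python
def absolute_difficulty(observations):
    """Interaction-weighted mean of relative difficulties, as in Equation (6).

    observations: list of (z_uk, b_uk) pairs for one vocabulary item,
    one pair per training group.
    """
    total_interactions = sum(z for z, _ in observations)
    if total_interactions == 0:
        raise ValueError("the vocabulary item has no recorded interactions")
    return sum(z * b for z, b in observations) / total_interactions


# Hypothetical example: one group with 1 interaction, another with 3.
example = absolute_difficulty([(1, 0.25), (3, 0.75)])
```

Groups with many interactions dominate the result, so a label from a group that rarely met the word has little effect on its absolute difficulty.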

Further, a linear model d_slo is used to obtain the estimated value of absolute difficulty. Assuming an absolute difficulty dataset of size v' is represented by F^TR_ALL, then the loss function to be minimized can be expressed as:

M^{'} (F_{ALL}^{TR}, d) = \frac{1}{v^{'}} \sum_{u = 1}^{v^{'}} m (d_{slo} (d_{φ_{s}} (a_{u})), {\bar{b}}_{u})

(7)

One of the main challenges encountered in estimating the difficulty of learning English vocabulary is noise in the training data, especially noise arising from differences and inconsistencies in interactions among student groups. Such noise might prevent the difficulty estimation model from accurately learning vocabulary characteristics, thus affecting the model’s performance in practical applications. To address this issue, the concept of the influence function is introduced to evaluate how small perturbations in the training samples affect the accuracy of model predictions. By analyzing the results of the influence function, researchers can identify which training samples are most likely to be noisy and the specific impact of these noisy samples on model predictive performance. Subsequently, corresponding measures, such as reweighting or removing these high-impact noisy samples, can be taken to optimize the training process.

Assuming the inverse Hessian matrix is represented by G⁻¹_ϕ and the trainable parameters are represented by ϕ, let the input vector to the prediction layer d_ϕo be the enhanced feature a^∼ = CO(h^∼,g), and let the output be the estimated relative difficulty b^∼. For a training sample with loss m_u(ϕ), we then have the formula:

Ψ_{φ} (\tilde{a}, b) = - G_{φ}^{- 1} \nabla_{φ} m_{u} (φ)

(8)

Traditional pointwise sample evaluation sets may not fully reveal the subtle changes in difficulty differences between different vocabularies, especially when the evaluation and training data follow the same distribution, leading to biased assessment results. Therefore, to meet the specific needs of English vocabulary learning difficulty estimation and enhance the unbiased nature of the evaluation data, this paper chooses to construct evaluation sets using paired samples. By constructing paired samples c^o=(a^∼₁,a^∼₂,b₁-b₂), where the difficulty difference between two vocabularies directly serves as the label, the model’s ability to recognize difficulty differences can be captured more accurately, thereby enhancing the effectiveness and accuracy of the assessment. Assuming the traditional pointwise sample evaluation set is represented by c^z=(a^∼,b), then we have the loss function expression as:

m_{k}^{o} (φ) = m (d_{φ} ({\tilde{a}}_{1}) - d_{φ} ({\tilde{a}}_{2}), b_{1} - b_{2})

(9)

Figure 4 displays the process of constructing the model evaluation set. Paired samples are built under two constraints: |b₁-b₂|>β, where β is a threshold parameter, and both (a₁,b₁) and (a₂,b₂) must come from the same group. Assuming the resulting evaluation dataset is represented by F^o_EV = {c^o_k}_{1≤k≤v_n}, the change in the estimated result for c^o_k caused by a training sample c_u can be estimated by the following equation:

Φ_{φ} (c_{u}, c_{k}^{o}) = - \nabla_{φ} m_{k}^{o} (φ) G_{φ}^{- 1} \nabla_{φ} m_{u} (φ)

(10)

Figure 4.

Model evaluation set construction process.
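The two constraints on paired evaluation samples (same group, label gap above β) can be sketched as below. The data layout, a dictionary from group identifier to a list of (feature, label) pairs, and the function name are assumptions made for illustration.

```python
from itertools import combinations


def build_paired_evaluation_set(group_samples, beta):
    """Build paired samples (a1, a2, b1 - b2) within each group.

    A pair is kept only if both samples belong to the same group and
    the absolute label difference exceeds the threshold beta.
    """
    pairs = []
    for samples in group_samples.values():
        # combinations() never mixes samples from different groups here,
        # because it is called separately on each group's sample list.
        for (a1, b1), (a2, b2) in combinations(samples, 2):
            if abs(b1 - b2) > beta:
                pairs.append((a1, a2, b1 - b2))
    return pairs


# Hypothetical toy data: vocabulary identifiers stand in for features.
toy_groups = {
    "group_a": [("w1", 0.9), ("w2", 0.2), ("w3", 0.25)],
    "group_b": [("w4", 0.1), ("w5", 0.8)],
}
toy_pairs = build_paired_evaluation_set(toy_groups, beta=0.3)
```

Pairs whose labels are nearly equal, such as `w2` and `w3` in the toy data, are dropped because their ordering is too sensitive to label noise to serve as a reliable evaluation signal.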

Assuming the inverse Hessian vector product is represented by G⁻¹_ϕ∇_ϕm_u(ϕ), the influence of c_u on the whole evaluation set F^o_EV is obtained by summing over all paired samples:

Φ_{φ} (c_{u}) = \sum_{k = 1}^{v_{n}} Φ_{φ} (c_{u}, c_{k}^{o}) = \sum_{k = 1}^{v_{n}} - \nabla_{φ} m_{k}^{o} (φ) G_{φ}^{- 1} \nabla_{φ} m_{u} (φ)

(11)

If the linear model ϕ is a fully connected layer, the Hessian matrix is:

G_{φ} = \frac{1}{V} \sum {\tilde{a}}^{\top} \tilde{a}

(12)
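For a final linear layer, Equations (8) to (12) reduce to small dense linear algebra. The sketch below is a pure-Python illustration under several assumptions that are not stated in the text: the per-sample loss is half the squared error, the layer has no separate bias term, a small damping value is added to the Hessian diagonal so that it is invertible, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def solve(matrix, rhs):
    """Solve matrix x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def hessian(features, damping=1e-3):
    """Mean outer product of the features (Equation (12)) plus assumed damping."""
    n, dim = len(features), len(features[0])
    g = [[sum(a[i] * a[j] for a in features) / n for j in range(dim)]
         for i in range(dim)]
    for i in range(dim):
        g[i][i] += damping
    return g


def train_gradient(phi, a, b):
    """Gradient of 0.5 * (phi . a - b)^2 with respect to phi."""
    residual = dot(phi, a) - b
    return [residual * x for x in a]


def pair_gradient(phi, a1, a2, diff):
    """Gradient of 0.5 * (phi . a1 - phi . a2 - diff)^2, as in Equation (9)."""
    delta = [x - y for x, y in zip(a1, a2)]
    residual = dot(phi, delta) - diff
    return [residual * x for x in delta]


def influence_scores(phi, train, eval_pairs, damping=1e-3):
    """Score of each training sample on the paired evaluation set (Equation (11))."""
    g = hessian([a for a, _ in train], damping)
    # The sum over evaluation pairs can be taken before the dot product.
    eval_grad = [sum(col) for col in
                 zip(*(pair_gradient(phi, *pair) for pair in eval_pairs))]
    scores = []
    for a, b in train:
        inverse_hvp = solve(g, train_gradient(phi, a, b))
        scores.append(-dot(eval_grad, inverse_hvp))
    return scores
```

Under the usual reading of influence functions, a positive score means that upweighting the training sample is estimated to increase the evaluation loss, so such samples are candidates for noise handling. Solving a linear system for each sample avoids forming the inverse Hessian explicitly.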

For the noise samples identified through the influence function, this study defines three strategies to optimize the training process and preserve the accuracy and effectiveness of the difficulty estimation. The first is the “Discard” strategy, which removes the samples identified as harmful from the training set to avoid their negative impact on model training. The second is the “Downweight” strategy, which reduces the impact of specific samples by adjusting their weights in the loss function. For example, the original weight ε is adjusted to b' = 0.5ab, and based on the results of the influence function, the weights of the downweighted samples are progressively reduced in order, such as reducing to 0.9 times the original weight, thus diminishing the impact of noise samples on the model. The third is the “Repair” strategy, which corrects the sample label to the average (b^∼_u + b^{^}_u)/2 if the estimated relative difficulty b^∼_u is within an acceptable range (i.e., close to its true value b_u); if it is not within this range, the sample is discarded. These three strategies provide flexible options for dealing with noise in the training data, so that the English vocabulary learning difficulty estimation model maintains high accuracy and robustness under various data conditions. By precisely adjusting the noisy data in the training set, these strategies help to enhance the performance of the final model, thus providing learners with more accurate learning difficulty estimations. Figure 5 presents the model learning process diagram.

Figure 5.

Model learning process diagram.
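The three noise-handling strategies can be sketched as a single filtering step. The threshold on the influence score, the downweighting factor, the repair tolerance, and the reading of “Repair” as averaging the estimate with the recorded label are all assumptions chosen for illustration; the function and parameter names are hypothetical.

```python
def handle_noisy_samples(samples, scores, predictions, strategy,
                         threshold=0.0, factor=0.5, tolerance=0.1):
    """Apply "discard", "downweight", or "repair" to flagged samples.

    samples: list of (feature, label); scores: influence scores;
    predictions: estimated relative difficulties for the same samples.
    Returns a list of (feature, label, weight) for the cleaned training set.
    """
    if strategy not in ("discard", "downweight", "repair"):
        raise ValueError("unknown strategy: " + str(strategy))
    cleaned = []
    for (feature, label), score, prediction in zip(samples, scores, predictions):
        if score <= threshold:
            # Not flagged as harmful: keep unchanged with full weight.
            cleaned.append((feature, label, 1.0))
        elif strategy == "discard":
            continue
        elif strategy == "downweight":
            cleaned.append((feature, label, factor))
        elif abs(prediction - label) <= tolerance:
            # Repair: move the label halfway toward the model estimate.
            cleaned.append((feature, (prediction + label) / 2, 1.0))
        # Repair outside the tolerance falls through, i.e. the sample is dropped.
    return cleaned
```

Keeping the weight as an explicit third field lets the same training loop consume the output of all three strategies.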

The decision to perform local computation of the influence function in the model’s final linear layer is driven by several factors. Firstly, calculating the influence function is a resource-intensive process, especially for models with a large number of parameters, such as when using deep learning models like BERT, where computational efficiency significantly decreases. To address this challenge, freezing the text encoder and locally calculating the influence function in the final linear layer with fewer parameters can greatly reduce computational complexity and time consumption. Secondly, the calculation of the influence function is primarily used for identifying and handling noise labels in the training data, a step critical to the overall accuracy and robustness of the model. By focusing this calculation in the linear layer, we can effectively manage computational resources while maintaining model performance. Considering that the parameter volume O of the linear layer is usually much smaller than the training data volume V, this approach significantly improves computational efficiency while maintaining the accuracy of the influence function calculation. Moreover, this strategy also allows the model to quickly adapt to the needs of estimating the difficulty of learning English vocabulary, especially in educational scenarios where frequent model updates are required to respond to new data. Therefore, performing local computation of the influence function in the final linear layer is not only for computational efficiency but also to ensure the practicality and flexibility of the model when handling large-scale English vocabulary data.

English vocabulary learning difficulty level prediction validation

To validate the prediction of English vocabulary learning difficulty levels, this paper specifically addresses the complexity and diversity of English vocabulary learning through the collection and selection of corpora. Initially, texts suitable for English vocabulary learning were filtered from large corpora, ensuring they were primarily written by native English speakers to guarantee the quality and authenticity of the corpus. Subsequently, the preliminarily downloaded corpora underwent further screening, especially seeking instances that contain high-frequency learning vocabulary and their contexts. This includes sentences both with and without specific vocabularies to facilitate the analysis of the usage and understanding difficulty of vocabulary in different contexts. Lastly, focus was placed on selecting sentences with complex structures, such as conditional clauses, which are usually more challenging for English learners and thus highly suitable for the study of vocabulary learning difficulty level prediction. After this series of stringent selection steps, 260 high-quality corpora particularly suited for the purposes of this study were meticulously chosen from an initial pool of over 800.

Furthermore, the principle for determining the vocabulary learning difficulty levels of English learners adopted a multidimensional approach, ensuring the comprehensiveness and accuracy of the evaluation. The first step involves a preliminary determination based on the analysis of the frequency and complexity of specific vocabulary usage by learners, combined with the course content and vocabulary expression complexity at the learners’ current stage of study. For example, learners at the beginner stage might primarily use basic and common vocabulary, while advanced learners would be able to employ more complex and specialized vocabulary. The second step considers the type of course, such as advanced English writing or literary analysis courses, which often involve higher level vocabulary use and can serve as an important basis for determining the difficulty level of vocabulary mastery. Moreover, the assessment of vocabulary usage difficulty also references the level classification of specific vocabulary in standardized English proficiency tests; the occurrence of frequently used advanced vocabulary can infer a higher language proficiency of the author. The third step includes the article’s style and subject as auxiliary criteria; complex argumentative and analytical texts generally require more advanced vocabulary knowledge and linguistic expression capabilities.

Based on reliable data support, this paper can proceed with the prediction validation of vocabulary learning difficulty levels for students at the beginner, intermediate, and advanced stages.

(1) For beginner-stage students, the basic principle of vocabulary learning difficulty level prediction validation focuses on the mastery of basic vocabulary and daily expressions. Beginners are usually at the introductory stage of English learning, so the emphasis is on assessing their recognition and use of the most common and basic English vocabulary. The prediction validation process focuses on understanding simple sentences, correct use of basic tenses, and whether these vocabularies can be correctly used in daily communication. By analyzing learners’ vocabulary usage frequency and accuracy in specific contexts, their vocabulary learning difficulty level can be effectively predicted, thus providing them with corresponding learning resources and guidance.

(2) For intermediate-stage students, vocabulary learning difficulty level prediction validation shifts towards more complex contexts and vocabulary usage. Students at this stage have a certain foundation in English, so the focus of prediction validation is whether learners can understand and use more complex vocabulary and expressions, including idioms, idiomatic expressions, and technical terms. Moreover, for intermediate learners, the ability to accurately understand and use these vocabularies in complex contexts and whether they can express thoughts and viewpoints accurately through these vocabularies are key to assessment. By considering learners’ performance in reading comprehension, writing, and oral communication, their vocabulary learning difficulty level can be comprehensively assessed.

(3) For advanced-stage students, the core of vocabulary learning difficulty level prediction validation lies in assessing learners’ mastery of advanced vocabulary and their ability to flexibly use these vocabularies in complex and abstract topics. Advanced learners should be able to understand and use complex vocabularies in professional or academic fields, effectively conducting critical thinking and in-depth linguistic expression. Additionally, the assessment at the advanced stage also includes learners' understanding of the nuances of vocabulary and their ability to accurately select and use appropriate vocabularies in different contexts. By simulating real scenarios or academic discussions, evaluating learners’ adaptability and language application capabilities in advanced English usage environments can accurately predict their vocabulary learning difficulty level.

Through this series of comprehensive judgments, the vocabulary learning difficulty levels of different English learners can be accurately assessed and differentiated, providing targeted teaching suggestions for English education.

Experimental results and analysis

To train the model, this study collected a large amount of data on learners’ vocabulary usage, covering students at different learning stages and backgrounds. The feature selection of the model focuses on the difficulty feature curves of vocabulary, including factors such as usage frequency, context dependence, and spelling complexity. Using edge transfer learning techniques, the model can transfer features learned from the source domain to the target domain, thereby enhancing its adaptability to learners from different backgrounds. In terms of algorithms, the study employed a multi-layer neural network structure combined with adaptive optimization methods to ensure the model can fully capture subtle differences in vocabulary difficulty during training and maintain high accuracy in difficulty estimation across different types of vocabulary and learning stages. In the experimental setup, the study first collected a large amount of English learners' vocabulary usage data from various sources, covering students at beginner, intermediate, and advanced stages. The large sample size ensured data diversity and representativeness. Background variables of learners, such as native language, study time, and educational environment, were controlled in the experiment to eliminate external factors’ influence. The model’s performance was quantified using various evaluation metrics, including the goodness of fit of vocabulary difficulty feature curves, accuracy and precision of relative vocabulary difficulty estimation, and performance across different types of vocabulary and learning stages. Specific metrics included Mean Squared Error (MSE), Mean Absolute Error (MAE), precision, recall, and others, which collectively verified the model’s applicability and effectiveness under different conditions.

When determining learner classification criteria, constructing the training dataset, and selecting different influence function strategies, the study classified learners based on their background knowledge and learning stage to ensure that the model can estimate vocabulary learning difficulty for learners at different levels. The training dataset was constructed based on a large amount of real learning data, including information such as students’ vocabulary usage rates and accuracy, reflecting the characteristics of different learning stages. When selecting influence function strategies, the study considered the difficulty features of vocabulary, the relative difficulty within the vocabulary set, and the characteristics of different types of vocabulary to ensure the model’s applicability and effectiveness in various learning contexts. The comprehensive application of these strategies ensures the model’s precision and broad applicability, providing a scientific basis for English vocabulary teaching and self-directed learning.

Figure 6 shows the relationship between the similarity pairs and difference pairs of vocabulary. As the vocabulary index increases, the similarity pair values rise slowly at first and then rapidly, from 0 to 1. The rise is sharpest between indexes 400 and 500, where the values jump from 0.33 to 1, indicating a marked acceleration in the increase of vocabulary difficulty within this range. The difference pair values first increase and then decrease, with the values reported between indexes 400 and 500 lying between 0.5 and 0.98, suggesting that in the range of higher vocabulary difficulty the differences between words begin to diminish. This phenomenon may reflect that as vocabulary difficulty increases, the similarity among high-difficulty words increases while their differences decrease, so learners require a more refined ability to discriminate between them. These results indicate that the proposed method, by capturing the similarity and difference features among words, can depict the trends in vocabulary difficulty in detail and provide targeted learning suggestions to learners. In addition, the use of edge transfer learning technology enables the model to adapt to the background knowledge of different learners, improving its generality and accuracy. In particular, for high-difficulty vocabulary, the model can provide more precise difficulty level predictions for advanced students by analyzing the subtle differences between words.

Figure 6.

Difficulty characteristic curve of vocabulary within the set.

As shown in Table 1, the model presented in this paper outperforms the compared methods in both the existing and new vocabulary scenarios. In the existing vocabulary scenario, the model achieved scores of 0.1030, 0.6259, and 0.7523 on three metrics: Mean Absolute Error (MAE), Pearson Correlation Coefficient (PCC), and Decision Consistency Index (DOA), outperforming RoBERTa, Transformer, QuesNet, TextRNN, MAML, XLNet, ERNIE, ELECTRA, and Reptile on all three. The margin is largest on DOA, where the model exceeds the best baseline (Reptile, 0.7154), and smallest on PCC (Reptile, 0.6235). In the new vocabulary scenario, the model also performed better than the other methods, especially on PCC and DOA, reaching 0.5326 and 0.6698, respectively, indicating that the model not only handles existing vocabulary well but also generalizes to predicting the difficulty levels of new vocabulary.
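To make the two ranking-oriented metrics concrete, the sketch below computes PCC from its standard definition and a pairwise ordering agreement score. The DOA formula is not spelled out in this section, so the pairwise form used here (the fraction of word pairs whose predicted order matches the reference order) is an assumption, and the example difficulty values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_agreement(y_true, y_pred):
    """Assumed DOA-style score: share of word pairs ranked in the same order.

    Pairs with equal reference difficulty are skipped because they carry
    no ordering information.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(y_true)), 2):
        if y_true[i] == y_true[j]:
            continue
        total += 1
        if (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j]) > 0:
            agree += 1
    return agree / total if total else 0.0


# Hypothetical reference and predicted difficulty for four words (assumption).
reference = [0.1, 0.3, 0.5, 0.9]
predicted = [0.2, 0.1, 0.6, 0.8]
pcc = pearson_correlation(reference, predicted)
doa = pairwise_agreement(reference, predicted)
```

In this toy example only the first two words are ranked in the wrong order, so five of the six pairs agree.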

Table 1.

Comparative performance of English vocabulary relative difficulty prediction.

Scenario	Existing vocabulary			New vocabulary
Method	MAE	PCC	DOA	MAE	PCC	DOA
RoBERTa	0.1325	0.5269	0.6784	0.1756	0.2784	0.5848
Transformer	0.1457	0.2547	0.5746	0.1623	0.1236	0.5236
QuesNet	0.1236	0.4659	0.6451	0.1879	0.1245	0.5487
TextRNN	0.1358	0.4587	0.6789	0.2326	0.1865	0.2145
MAML	0.1269	0.5126	0.6652	0.1456	0.2147	0.5569
XLNet	0.1247	0.5326	0.6894	0.1659	0.4123	0.6235
ERNIE	0.1125	0.5487	0.6894	0.1754	0.3456	0.6248
ELECTRA	0.1236	0.5896	0.6784	0.1623	0.4189	0.6358
Reptile	0.1247	0.6235	0.7154	0.1548	0.4625	0.6489
The proposed model	0.1030	0.6259	0.7523	0.1369	0.5326	0.6698
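The margins between the proposed model and the strongest baseline can be recomputed directly from Table 1. The short Python sketch below does this arithmetic; the values are transcribed from the table, and the new-vocabulary DOA for QuesNet is read as 0.5487, which is an assumed correction of a garbled entry. The variable and function names are illustrative only.

```python
METRICS = ["MAE", "PCC", "DOA"]
LOWER_IS_BETTER = {"MAE"}  # for PCC and DOA, higher is better

# Baseline scores from Table 1 as (MAE, PCC, DOA) per scenario.
table1 = {
    "existing": {
        "RoBERTa": (0.1325, 0.5269, 0.6784),
        "Transformer": (0.1457, 0.2547, 0.5746),
        "QuesNet": (0.1236, 0.4659, 0.6451),
        "TextRNN": (0.1358, 0.4587, 0.6789),
        "MAML": (0.1269, 0.5126, 0.6652),
        "XLNet": (0.1247, 0.5326, 0.6894),
        "ERNIE": (0.1125, 0.5487, 0.6894),
        "ELECTRA": (0.1236, 0.5896, 0.6784),
        "Reptile": (0.1247, 0.6235, 0.7154),
    },
    "new": {
        "RoBERTa": (0.1756, 0.2784, 0.5848),
        "Transformer": (0.1623, 0.1236, 0.5236),
        "QuesNet": (0.1879, 0.1245, 0.5487),  # assumed reading of garbled cell
        "TextRNN": (0.2326, 0.1865, 0.2145),
        "MAML": (0.1456, 0.2147, 0.5569),
        "XLNet": (0.1659, 0.4123, 0.6235),
        "ERNIE": (0.1754, 0.3456, 0.6248),
        "ELECTRA": (0.1623, 0.4189, 0.6358),
        "Reptile": (0.1548, 0.4625, 0.6489),
    },
}
proposed = {
    "existing": (0.1030, 0.6259, 0.7523),
    "new": (0.1369, 0.5326, 0.6698),
}


def best_baseline(scenario, metric_index):
    """Return (name, score) of the strongest baseline for one metric."""
    pick = min if METRICS[metric_index] in LOWER_IS_BETTER else max
    scores = table1[scenario]
    name = pick(scores, key=lambda m: scores[m][metric_index])
    return name, scores[name][metric_index]


# Positive margin means the proposed model is better than the best baseline.
margins = {}
for scenario in table1:
    for i, metric in enumerate(METRICS):
        name, value = best_baseline(scenario, i)
        ours = proposed[scenario][i]
        gain = value - ours if metric in LOWER_IS_BETTER else ours - value
        margins[(scenario, metric)] = (name, round(gain, 4))
```

On these figures the strongest baseline is Reptile for every metric except MAE, where it is ERNIE (existing vocabulary) and MAML (new vocabulary). The largest margin is on PCC for new vocabulary (0.0701) and the smallest is on PCC for existing vocabulary (0.0024).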

These experimental results support the effectiveness of the machine learning-based method for predicting the difficulty of learning English vocabulary, and in particular of the edge transfer learning technique employed in this paper. Comparing the methods across the existing and new vocabulary scenarios shows that the model not only predicts the difficulty of known vocabulary more accurately but also adapts better to, and is more accurate on, unseen vocabulary. This improvement in performance is attributed to the application of edge transfer learning technology, which enables the model to reuse existing knowledge when predicting new vocabulary and thereby adapt to the background knowledge of different learners.
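The general idea of reusing source-domain knowledge in a target domain can be illustrated with a deliberately small example. The sketch below is not the multi-layer network or the edge transfer learning procedure used in this paper: it assumes a one-feature linear model, fits it on a hypothetical source group, and then adapts only the intercept to a hypothetical target group with two samples. All data values and function names are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b on the source domain."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


def adapt_bias(w, xs, ys):
    """Keep the transferred slope w frozen; refit only the intercept."""
    return sum(y - w * x for x, y in zip(xs, ys)) / len(xs)


# Hypothetical source learners: difficulty falls as word frequency rises.
source_freq = [0.1, 0.3, 0.5, 0.7, 0.9]
source_difficulty = [0.85, 0.65, 0.45, 0.25, 0.05]
w_source, b_source = fit_line(source_freq, source_difficulty)

# Hypothetical target learners: same trend, but every word is 0.1 harder.
target_freq = [0.2, 0.8]
target_difficulty = [0.85, 0.25]
b_target = adapt_bias(w_source, target_freq, target_difficulty)


def predict_target(freq):
    """Predict target-domain difficulty with transferred slope, adapted bias."""
    return w_source * freq + b_target
```

Because the slope is carried over from the source group, two target samples are enough to recover the shifted intercept in this toy setting. A full model would transfer learned feature layers in place of a single coefficient.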

Further, this paper compares the performance of several transfer-learning-based methods in predicting relative vocabulary difficulty across different types of English vocabulary (Figure 7). According to the data shown in Figure 7, the model in this paper performs best across all types of English vocabulary (basic vocabulary, academic vocabulary, professional vocabulary, informal vocabulary). On the MAE metric, the model shows the lowest error values across all vocabulary types, indicating higher precision in predicting vocabulary difficulty. On the PCC and DOA metrics, the model also performs best, especially in the PCC values for academic and professional vocabulary, where it significantly outperforms the other models, showing its effectiveness in understanding and predicting more complex vocabulary structures. These results emphasize the effectiveness and adaptability of the proposed model for predicting the difficulty of different types of vocabulary. The model can accurately predict the difficulty of vocabulary across different fields and styles, which is crucial for designing personalized and phased English learning programs. Through edge transfer learning technology, the model can not only learn efficiently from existing data but also adapt to new vocabulary and learning environments, providing precise learning difficulty estimates for learners with different background knowledge, thereby optimizing the allocation of learning resources and enhancing learning efficiency and effectiveness.

Figure 7.

Comparative performance of vocabulary relative difficulty prediction across different types of English vocabulary.

Further, this paper validates the effectiveness of the proposed prediction method for vocabulary learning difficulty levels among beginner, intermediate, and advanced-level students. According to the data shown in Figure 8, beginner-level students differ across vocabulary categories in both usage and correctness rates. The usage rate of everyday language vocabulary is the highest at 70%, but its correctness rate is only 50%, indicating that although beginners frequently use this type of vocabulary, their mastery has not reached an ideal standard. The usage rate of greetings and basic communicative expressions is 20%, with a correctness rate of 20%, so for this less frequently used category learners’ accuracy matches their usage frequency. Numbers and colors have the lowest usage rate at 10%, but their correctness rate reaches 20%, suggesting comparatively better mastery relative to how rarely this category is used. These data reflect significant differences in the usage and mastery of different types of vocabulary among beginner-level students.

Figure 8.

Usage and correctness rates of English vocabulary by beginner-level students.

The data shown in Figure 9 reflects the usage and correctness rates of different categories of English vocabulary among intermediate-level students. The data indicates that the usage rate of beginner-level vocabulary is 31%, with a correctness rate of 67%, showing that students have a good grasp of this type of basic vocabulary. The usage rate of descriptive adjectives and adverbs is relatively low at 14%, with a correctness rate of only 34%, which may indicate that this type of vocabulary poses more difficulty for intermediate students and requires further reinforcement. The usage rates of simple conjunctions and prepositions, and of time and dates are both 16%, but their correctness rates vary significantly, at 59% and 77%, respectively, showing that intermediate-level students have a relatively good mastery of vocabulary related to time and dates. The usage rate of common verb phrases is 23%, with a correctness rate of 65%, indicating that students are also proficient in this category of vocabulary.

Figure 9.

Usage and correctness rates of English vocabulary by intermediate-level students.
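The intermediate-level figures quoted above can be summarized with a few lines of Python. The sketch below ranks the categories by correctness rate to flag the one most in need of reinforcement and computes a usage-weighted correctness rate. The percentages are those stated in the text; the weighted aggregate and the function names are illustrative additions, not quantities reported in this study.

```python
# (usage rate %, correctness rate %) for intermediate-level students,
# as quoted in the text describing Figure 9.
intermediate = {
    "beginner-level vocabulary": (31, 67),
    "descriptive adjectives and adverbs": (14, 34),
    "simple conjunctions and prepositions": (16, 59),
    "time and dates": (16, 77),
    "common verb phrases": (23, 65),
}


def rank_by_correctness(stats):
    """Category names sorted from lowest to highest correctness rate."""
    return sorted(stats, key=lambda category: stats[category][1])


def usage_weighted_correctness(stats):
    """Correctness averaged over categories, weighted by usage share."""
    total_usage = sum(usage for usage, _ in stats.values())
    weighted = sum(usage * correct for usage, correct in stats.values())
    return weighted / total_usage


ranking = rank_by_correctness(intermediate)
weakest_category = ranking[0]
strongest_category = ranking[-1]
overall = usage_weighted_correctness(intermediate)
```

On these figures the weakest category is descriptive adjectives and adverbs, the strongest is time and dates, and the usage-weighted correctness rate is 62.24%.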

Figure 10 shows the usage and correctness rates of different types of English vocabulary among advanced-level students. The usage rates of academic vocabulary and industry-specific terminologies are relatively high, at 34% and 22%, respectively, and the correctness rates of these two categories are also relatively high, at 76% and 79%, indicating that advanced students have a good mastery and application ability of these more complex and specialized vocabularies. The usage and correctness rates of idioms and phrases, and complex phrases also show higher levels, especially the correctness rate of complex phrases at 79%, demonstrating that students can proficiently use more complex language structures. In contrast, the usage rates of argumentative expressions and slang are lower, at 4% and 2%, although their correctness rates still remain above 65%, suggesting that students have less practical experience in these areas. Notably, the usage rate of rhetorical devices is 0%, yet its assumed correctness rate is 55%, which may be based on theoretical knowledge rather than practical application, hinting at a potential learning growth point.

Figure 10.

Usage and correctness rates of English vocabulary by advanced-level students.

The data above emphasize that the in-depth vocabulary learning analysis proposed in this paper provides valuable information for teachers, helping them identify students’ strengths and potential learning challenges, thereby enabling the design of more personalized and targeted teaching plans. For example, given the extremely low usage rate of rhetorical devices, teachers can specifically increase related teaching content and exercises to improve students' abilities in this area. Therefore, the model presented in this paper not only excels in predicting students’ vocabulary learning difficulties but also offers practical guidance strategies for advanced-stage English teaching, helping to enhance students' comprehensive language application skills.

Conclusion

This paper demonstrates the development and application of a machine learning-based method for estimating the difficulty of learning English vocabulary, using edge transfer learning technology to enhance the model’s adaptability to the background knowledge of different learners. The two core parts of the research, namely the development of a method for estimating the difficulty of learning English vocabulary and the validation of the applicability and effectiveness of this method at different learning stages, provide valuable guidance for English vocabulary teaching and self-learning.

The results show that the model proposed in this paper can accurately estimate the difficulty of vocabulary learning, which is fully validated through experiments such as the difficulty characteristic curves within the vocabulary set, comparative performance of English vocabulary relative difficulty prediction, and comparative performance of vocabulary relative difficulty prediction across different types of English vocabulary. Additionally, application research on the usage and correctness rates of English vocabulary by students at the beginner, intermediate, and advanced stages further confirms its applicability and effectiveness at different learning stages. These research findings not only highlight the potential application of machine learning technology in the field of language learning but also provide specific guidance strategies for English education practice, helping teachers and learners to more effectively address the challenges of vocabulary learning.

Despite the achievements of this research, there are still some limitations. For instance, the prediction performance of the model largely depends on the quality and diversity of the training data, and future research needs to explore how to expand and optimize the dataset to further improve the model’s generalization ability. Additionally, the model’s adaptability to certain specific types of vocabulary or specific learner backgrounds needs to be strengthened.

Although the study demonstrated the effectiveness of the machine learning-based method for estimating English vocabulary learning difficulty, potential limitations include the model’s limited adaptability to different cultural backgrounds, especially for non-native English speakers, whose vocabulary learning habits and difficulty perceptions may differ significantly. Additionally, since the model is primarily trained on existing data, it may not fully capture dynamic changes in the language learning process. Future research directions should include expanding the dataset to cover more cultural backgrounds and language environments, exploring model optimizations for non-native English speakers, and developing adaptive learning models that can adjust in real-time to learners’ dynamic changes. This would help to further enhance the model’s generalizability and practical application effectiveness.

Future research could explore the application of this model in multilingual environments, assessing its ability to estimate vocabulary learning difficulty across different language backgrounds. Additionally, a personalized learning path recommendation system could be developed based on this model, dynamically adjusting learning content according to the specific needs and progress of learners to enhance learning efficiency and effectiveness. This would further expand the model’s applicability and increase its practicality and impact in the global language learning field.

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Vamsi, Al Bataineh, Doppala. Lexical based reordering models for English to Telugu machine translation. Rev Intell Artif 2023; 37(5): 1109–1120. DOI: 10.18280/ria.370503.

2. Lin. The impact of English learning motivation and attitude on well-being: cram school students. Future Internet 2020; 12(8): 131. DOI: 10.3390/fi12080131.

3. Nagaraj, Ravikumar, Kasyap, et al. Kannada to English machine translation using deep neural network. Ing Syst Inf 2021; 26(1): 123–127. DOI: 10.18280/isi.260113.

4. Wong, Michaels. Transfer learning for radio frequency machine learning: a taxonomy and survey. Sensors 2022; 22(4): 1416. DOI: 10.3390/s22041416.

5. Suleymanov, Rustamov. Automated news categorization using machine learning methods. IOP Conf Ser Mater Sci Eng 2018; 459(1): 012006. DOI: 10.1088/1757-899X/459/1/012006.

6. Deore. Human behavior identification based on graphology using artificial neural network. Acadlore Trans Mach Learn 2022; 1(2): 101–108. DOI: 10.56578/ataiml010204.

7. Hassan. Rough set machine translation using deep structure and transfer learning. J Intell Fuzzy Syst 2018; 34(6): 4149–4159. DOI: 10.3233/JIFS-171742.

8. Schmaltz, Beam. Sharpening the resolution on data matters: a brief roadmap for understanding deep learning for medical data. Spine J 2021; 21(10): 1606–1609. DOI: 10.1016/j.spinee.2020.08.012.

9. Lam, Chen. The crossover effects of morphological awareness on vocabulary development among children in French immersion. Read Writ 2018; 31(8): 1893–1921. DOI: 10.1007/s11145-017-9809-2.

10. Uchikoshi, Yang, Liu. Role of narrative skills on reading comprehension: Spanish-English and Cantonese-English dual language learners. Read Writ 2018; 31(2): 381–404. DOI: 10.1007/s11145-017-9790-9.

11. O'Connor, Sanchez, Widaman, et al. Systematic CHAAOS: teaching vocabulary in English/Language Arts special education classes in middle school. J Learn Disabil 2021; 54(3): 187–202. DOI: 10.1177/0022219420922839.

12. Avalos, Bengochea, Massey. Building on ELA vocabulary instruction to develop language resources. Read Teach 2021; 75(3): 305–315. DOI: 10.1002/trtr.2061.

13. Lawson-Adams, Dickinson. Sound stories: using nonverbal sound effects to support English word learning in first-grade music classrooms. Reading Res Q 2020; 55(3): 419–441. DOI: 10.1002/rrq.280.

14. Wang. Detecting pronunciation errors in spoken English tests based on multifeature fusion algorithm. Complexity 2021; 2021: 6623885. DOI: 10.1155/2021/6623885.

15. Mpofu. The implementation of English across the curriculum: an exploratory study of how South African educators teach writing in history lessons. TESOL J 2024; 15(1): e748. DOI: 10.1002/tesj.748.

16. Zhu, Niu. Image captioning with word gate and adaptive self-critical learning. Appl Sci-Basel 2018; 8(6): 909. DOI: 10.3390/app8060909.

17. Shang, Tang, Guo, et al. Accurate identification of bacteriophages from metagenomic data using Transformer. Brief Bioinform 2022; 23(4): bbac258. DOI: 10.1093/bib/bbac258.

18. Lee, Song. Understanding recurrent neural network for texts using English-Korean corpora. Commun Stat Appl Methods 2020; 27(3): 313–326. DOI: 10.29220/CSAM.2020.27.3.313.

19. Crosson, Lei, McKeown, et al. The curious role of morphological family size in language minority learners' problem solving of unfamiliar words. Sci Stud Read 2020; 24(6): 445–461. DOI: 10.1080/10888438.2019.1701475.

20. Hashimoto, Egbert. More than frequency? Exploring predictors of word difficulty for second language learners. Lang Learn 2019; 69(4): 839–872. DOI: 10.1111/lang.12353.

21. Geva, Gottardo, et al. Exploring sources of poor reading comprehension in English language learners. Ann Dyslexia 2021; 71(2): 299–321. DOI: 10.1007/s11881-021-00214-4.