Preschool language development prediction model based on deep learning: Cross-cultural validity evaluation

Abstract

Traditional research on preschool language development often fails to capture the complex nonlinear relationships and high-dimensional characteristics of language growth, leading to low prediction accuracy and poor cross-cultural applicability. This paper introduces a novel BERT (Bidirectional Encoder Representations from Transformers)-based model to predict preschool language development and evaluate its cross-cultural effectiveness. Text data from preschool children’s language datasets across multiple cultural backgrounds is collected, cleaned, and preprocessed to create suitable training samples. Special attention is given to the unique grammatical structures and cultural expressions in each language to ensure compatibility with the model. The BERT model is used to encode the processed text, leveraging its bidirectional self-attention mechanism to extract contextual information and generate deep feature representations essential for understanding preschool language development. The model combines both grammatical and semantic features for meaningful representations in subsequent predictions. Fine-tuning the pre-trained BERT model using the Adam optimizer enhances prediction accuracy, while cross-validation and hyperparameter tuning further improve its performance. Culturally specific annotations and vocabularies are incorporated to ensure the model’s effective prediction of language development across different regions. Experimental results show that the BERT model achieves an MAE (Mean Absolute Error) between 0.20 and 0.25, an MSE (Mean Squared Error) between 0.05 and 0.08, and an average R² value of 0.84 across English, Chinese, Spanish, and Japanese. These results demonstrate the model’s high accuracy and strong cross-cultural stability in predicting preschool language development.

Keywords

preschool language, language development prediction, prediction model, deep learning, bidirectional encoder representations from transformers

Introduction

Preschool language development is the basis of children’s cognitive and social adaptability, and it can directly affect their future learning and living abilities.^1,2 Preschool children’s language development in different cultural backgrounds has its own characteristics. How to accurately assess these differences is an important topic in the current education field.^3,4 Language development is influenced by culture, social environment, and education methods. Effective cross-cultural assessment is very important for understanding and promoting the development of children’s language ability all over the world. In order to achieve this, it is necessary to combine advanced deep learning technology in building a cross-culturally applicable language development prediction model.

In current research on preschool language development prediction, feature extraction relies on manual work and the models rely on rule-based methods, which are poorly suited to the complexity and nonlinearity of preschool children’s language development. Traditional methods depend on a small number of language features, which are too simple to reflect the development of children’s language ability.^5,6 A single traditional method cannot capture the diversity of, and changes in, children’s complex, multi-dimensional language ability, so prediction accuracy is correspondingly low.^7,8 Most existing methods adopt statistical models such as regression analysis or simple classifiers. The expressive power of such models is weak, and they cannot effectively mine deep semantic and contextual information from large amounts of language data.^9,10 Existing methods also ignore the contextual relationships in language when processing language data, which makes it difficult to capture subtle differences and latent patterns.^11,12 Studies on preschool children’s language development tend to ignore the impact of cultural background on language learning.^13,14 Each language has unique expressions, grammatical structures, and vocabulary usage, which are deeply rooted in its cultural background.^15,16 Previous language development prediction work focused on a single language or cultural background, lacked cross-cultural flexibility and adaptability,^17,18 and could not provide accurate predictions for diverse language and cultural groups. In multilingual and multicultural settings, traditional methods struggle to identify and adapt to the characteristics of each culture and language, which greatly limits both the applicability of the models and their prediction accuracy.^19,20 These problems seriously hinder the effectiveness and practicality of traditional prediction methods in preschool language research.

This paper proposes a BERT-based model for predicting preschool children’s language development, aiming to assess its effectiveness across different cultural backgrounds. Existing methods for studying language development often fall short in addressing the multifaceted characteristics and complexities of preschool language growth. By leveraging BERT’s bidirectional self-attention mechanism, this study delves deeper into the intrinsic features of preschool children’s language data, effectively processing language information from diverse cultural contexts. This approach enables accurate extraction of grammatical structures, semantic characteristics, and expression patterns unique to different languages and cultures, overcoming the limitations of traditional methods and better handling the complexity of multilingual and cross-cultural language data.

The model undergoes fine-tuning with hyperparameter optimization, coupled with cross-validation, to ensure its stability and robustness across various cultural backgrounds. The ultimate aim of this research is to create a universal tool for language development prediction that is adaptable to different languages and cultural contexts. By providing more scientific and precise insights into language development, this model supports global preschool education, assisting educators in better understanding and fostering children’s language growth. Through this work, the study contributes to a more effective, cross-culturally relevant approach to language development research and offers valuable support for early childhood education worldwide.

Related work

Studies on predicting preschool language development are growing in number. Many have explored the impact of different language features on language development by analyzing data from children’s language acquisition. Some studies have proposed neural-network-based methods^21,22 to process different levels of children’s language expression and capture the regularities of language development more accurately. To overcome the limited predictive power of traditional methods, some scholars have adopted deep learning methods^23,24 that extract language features automatically, improving prediction accuracy with some success.^25,26 Oh B D used deep learning and statistics to analyze Korean transcription data and, with a multi-core deep learning model, determined the speaker’s age group and language development level with high average accuracy.^27 However, these studies focused on a single language environment: they succeeded within that scope but ignored the differences between languages and cultural backgrounds, so the applicability of their models in cross-cultural settings is limited. The culture-specific elements of a language, together with differences in syntactic structure, vocabulary choice, and expression, largely prevent conventional deep learning methods from handling diverse language features effectively, so existing results cannot fully solve the problem of predicting language development in a cross-cultural context.

To address the insufficient prediction accuracy of traditional research, newer deep learning methods have been widely applied to language development prediction. The introduction of the BERT^28,29 model makes the processing of text data more accurate. BERT is based on a bidirectional self-attention mechanism and can capture deep semantic information from context, reflecting the complexity and diversity of language. Compared with traditional methods, BERT automatically extracts and learns grammatical and semantic features, avoiding the limitations of manual feature selection, and it performs well in many language processing tasks. Some studies have shown that BERT and its derivative models not only achieve excellent performance in single-language tasks but also show potential in multilingual and cross-cultural tasks.^30,31 Acs J used multilingual datasets to probe morphological information in language models and found that the pre-trained BERT model performed strongly in these tasks; using masked-context and Shapley-value methods, the study also found that the preceding text contains more prediction-relevant information than the following text.^32 These studies provide theoretical support for the cross-cultural language development prediction model proposed in this paper. However, these methods usually lack a systematic evaluation of cross-cultural effectiveness and do not consider the diversity of language features across cultural backgrounds. The BERT-based preschool language development prediction model proposed in this paper therefore aims to address these problems and to evaluate cross-cultural effectiveness through targeted adjustments and optimizations.

Methods

Data collection and processing

Text data are collected from preschool children’s language datasets covering multiple cultural backgrounds, then cleaned and denoised to generate training samples suitable for deep learning. The special grammatical structures and cultural expressions of each language are processed so that they fit the model input.

Data collection

Text data were collected from a variety of preschool children’s language datasets across multiple cultural contexts, encompassing sources such as oral records, children’s books, parent–child conversations, and audio transcriptions. To ensure cross-cultural adaptability, the datasets include language samples from different languages and cultural backgrounds, reflecting the unique linguistic characteristics of each culture. In addition to maintaining high data quality, the data collection process prioritizes diversity and representativeness, ensuring that the samples span a wide range of language development levels across children of various ages. This approach provides a comprehensive and culturally inclusive foundation for the model, enhancing its ability to predict language development across diverse populations.

Each text sample is annotated in detail, covering the child’s vocabulary usage, syntactic structure, and culture-specific expressions. The annotation standards follow grammatical annotation methods commonly used in multilingual research. Because languages differ across cultures, a single consistent annotation standard is adopted so that the model can process all samples uniformly.

Table 1 summarizes the basic statistics of the datasets from different cultural backgrounds, including language, sample size, age range, text type, and main linguistic features. It covers four languages: English, Chinese, Spanish, and Japanese. All samples come from preschool children aged 3–6. The linguistic features column shows how the languages differ in grammatical structure, vocabulary usage, and related aspects. This information helps in understanding the differences in language data across cultural backgrounds and provides the basis for subsequent model processing and cross-cultural adaptability adjustments.

Table 1.

Basic statistical information of datasets under different cultural backgrounds.

Language background	Sample size	Age range (years)	Text type	Linguistic features
English	1257	3–6	Child speech and parent dialogues	Verb tense and noun plurals
Chinese	1226	3–6	Parent–child interaction and children’s stories	Measure words and sentence structures
Spanish	970	3–6	Child speech and extra-curricular reading	Noun gender and verb conjugation
Japanese	1307	3–6	Child dialogues and daily conversation	Sentence endings and honorifics

Data processing

During the data processing stage, the text data was cleaned and denoised: irrelevant punctuation, extra spaces, special characters, and non-semantic content were removed. To further ensure data quality, regular-expression cleaning and text standardization were applied. Let the original text be $T = \{t_{1}, t_{2}, \ldots, t_{n}\}$, where each $t_{i}$ is a text unit. The cleaning operation can be expressed as follows:

$$T_{cleaned} = \{t_{1}^{'}, t_{2}^{'}, \ldots, t_{n}^{'}\} \tag{1}$$

where $t_{i}^{'}$ is a cleaned text unit with irrelevant content and noise removed. The cleaned text is then split into sentences and part-of-speech tagged.
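The cleaning step of equation (1) can be illustrated with a minimal Python sketch. The function names `clean_unit` and `clean_corpus` and the exact set of characters kept are assumptions made for illustration; the paper does not specify its cleaning rules.

```python
import re

def clean_unit(text: str) -> str:
    """Map a raw text unit t_i to a cleaned unit t_i'."""
    # Keep word characters (Unicode-aware), whitespace, and basic Latin
    # punctuation; everything else is treated as noise. This character set
    # is an illustrative assumption, not the rule set used in the paper.
    text = re.sub(r"[^\w\s.,!?']", " ", text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def clean_corpus(units):
    """Apply clean_unit to every unit, so T_cleaned has the same length as T."""
    return [clean_unit(t) for t in units]

raw = ["  I   want ## the ball!! ", "Mommy, look * doggy"]
print(clean_corpus(raw))
```

A production pipeline would need language-specific rules, for example to keep full-width punctuation in Chinese and Japanese text.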

To handle special grammatical structures in multilingual texts, language-specific processing methods are used. English and Chinese, for example, differ markedly in syntax: English texts receive dedicated handling of tenses and noun plurals, while Chinese texts receive standardized processing of quantifiers and word order. A symbolic mapping function $f_{lang}(t_{i})$ adjusts the text format according to language features:

$$f_{lang}(t_{i}) = \begin{cases} \mathrm{standardize}(t_{i}) \\ \mathrm{normalize}(t_{i}) \end{cases} \tag{2}$$

After the text data is standardized with this function, it better matches the input requirements of the model. A text denoising step based on TF-IDF (Term Frequency-Inverse Document Frequency)^33,34 keyword extraction is then applied to remove common stop words and reduce the data dimensionality while retaining core information. The goal of denoising is to maximize the signal-to-noise ratio, so that the data used for training is more informative.
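As an illustration of TF-IDF-based denoising, the sketch below scores each term and keeps only the highest-scoring terms per document. The helper names `tfidf` and `keep_top_terms`, the unsmoothed IDF variant, and the cutoff `k` are assumptions; the paper does not state which variant or threshold was used.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

def keep_top_terms(doc, doc_scores, k):
    """Keep only tokens whose term is among the k highest-scoring terms."""
    top = set(sorted(doc_scores, key=doc_scores.get, reverse=True)[:k])
    return [tok for tok in doc if tok in top]

docs = [
    "the dog runs".split(),
    "the cat sleeps".split(),
    "the dog sleeps".split(),
]
scores = tfidf(docs)
filtered = [keep_top_terms(d, s, k=2) for d, s in zip(docs, scores)]
print(filtered)
```

Because "the" occurs in every document, its inverse document frequency is zero, so it is the first term to be dropped. This matches the stop-word removal described above.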

Cultural differences are handled adaptively: taking culture-specific expressions and syntactic structures into account, a syntax tree adjustment method unifies the data formats of the different languages. Let $G(T_{cleaned})$ denote the syntax tree adjustment operation; the processed text data $T_{adjusted}$ is then

$$T_{adjusted} = G(T_{cleaned}) \tag{3}$$

This adjustment enables the model to handle grammatical differences and expression habits in different cultural backgrounds in the input data, improving the model’s adaptability and cross-cultural prediction capabilities.

Data augmentation was used to enrich the data: operations such as synonym replacement and sentence structure rearrangement expand the training set and help prevent overfitting. Denoting the augmentation operation by $\mathrm{augment}$, a more diverse set of training samples is obtained:

$$T_{augmented} = \{\mathrm{augment}(T_{cleaned})\} \tag{4}$$

The enhanced dataset expands the sample size and provides better coverage of diverse language development contexts, ensuring greater diversity in the training process and improving the model’s robustness. By incorporating techniques such as text cleaning, denoising, grammar structure adjustments, and data augmentation, the training samples are not only of high quality but also exhibit strong cross-cultural adaptability. These preprocessing steps lay a solid foundation for subsequent deep learning model training, enabling the model to more effectively capture the complex features of children’s language development.
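Synonym replacement, one of the augmentation operations named above, might look like the following sketch. The synonym table, the replacement probability `p`, and the number of copies are hypothetical; a real table would be curated separately for each language.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"big": ["large", "huge"], "happy": ["glad"]}

def synonym_replace(tokens, rng, p=0.5):
    """Replace each token that has synonyms with probability p."""
    out = []
    for tok in tokens:
        if tok in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out

def augment(corpus, n_copies=2, seed=0):
    """Return the original samples followed by n_copies variants of each."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    augmented = [list(tokens) for tokens in corpus]
    for tokens in corpus:
        for _ in range(n_copies):
            augmented.append(synonym_replace(tokens, rng))
    return augmented

corpus = [["a", "big", "dog"], ["I", "am", "happy"]]
print(augment(corpus))
```

Keeping the original samples in the output means the augmented set is a superset of the cleaned data, which is one reasonable reading of equation (4).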

Table 2 shows the changes in feature dimensions of different language datasets before and after processing. The feature dimensions of the dataset before processing are high, containing a large number of stop words, low-frequency words, and redundant information; the cleaning and normalization steps are used to remove stop words, standardize tenses, and adjust grammatical structures. After preprocessing, the feature dimension was significantly reduced; the core purpose of this process is to remove irrelevant features, retain key information that is meaningful for predicting the language development of preschool children, and improve the training efficiency and prediction accuracy of the model. Different languages were adaptively adjusted according to their respective language characteristics during the processing process to ensure the effective integration of cross-cultural data.

Table 2.

Changes in feature dimensions before and after preprocessing of different language datasets.

Language	Feature dimensions before preprocessing	Feature dimensions after preprocessing	Key changes
English	52	31	Removal of stop words and irrelevant features, and retention of key features
Chinese	67	34	Removal of low-frequency words and retention of core syntax information
Spanish	55	32	Standardization of tenses and vocabulary adjustments
Japanese	65	37	Syntax tree adjustments and formal language normalization

Feature extraction and modeling

Feature extraction

In the feature extraction stage, the cleaned text data is encoded with the BERT model, whose bidirectional self-attention mechanism captures the contextual information in the text. The text is split into tokens using WordPiece word segmentation. For each input text, $c_{i}^{'}$ denotes the $i$-th unit after word segmentation, which is fed into BERT for encoding. BERT’s embedding layer generates the initial embedding representation of each token as follows:

$$E_{i} = W_{token} \cdot c_{i}^{'} + W_{position} \cdot p_{i} + W_{segment} \cdot s_{i} \tag{5}$$

where $W_{token}$, $W_{position}$, and $W_{segment}$ are the word embedding matrix, position encoding matrix, and segment embedding matrix, respectively; $c_{i}^{'}$ is the unit after word segmentation, $p_{i}$ is the position information of the token, and $s_{i}$ is the identifier of the sentence in which the token is located.
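Equation (5) can be made concrete with a small pure-Python sketch. It assumes that $c_{i}^{'}$, $p_{i}$, and $s_{i}$ act as one-hot indicators, so each matrix product reduces to a row lookup. The function name `embed` and the toy 2-dimensional tables are illustrative only.

```python
def embed(token_ids, segment_ids, W_token, W_position, W_segment):
    """E_i = W_token[c_i] + W_position[i] + W_segment[s_i] for each position i."""
    embeddings = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows = (W_token[tok], W_position[pos], W_segment[seg])
        # Element-wise sum of the three looked-up rows.
        embeddings.append([sum(vals) for vals in zip(*rows)])
    return embeddings

# Toy tables: 3 vocabulary entries, 2 positions, 2 segments, dimension 2.
W_token = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_position = [[0.25, 0.25], [0.5, 0.5]]
W_segment = [[0.0, 0.0], [1.0, 1.0]]

E = embed([2, 0], [0, 1], W_token, W_position, W_segment)
print(E)  # [[0.75, 0.75], [2.5, 1.5]]
```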

The BERT model passes these embeddings through the Transformer encoder and uses the self-attention mechanism to capture the dependencies between tokens in the input sequence. In each attention layer, the attention output is computed from the query, key, and value matrices:

$$A(C, J, Z) = \mathrm{softmax}\left(\frac{C J^{T}}{\sqrt{d_{k}}}\right) Z \tag{6}$$

where $C$, $J$, and $Z$ are the matrix representations of the query, key, and value, respectively, and $d_{k}$ is the dimension of the key vectors. This allows BERT to capture long-distance dependencies in the text and generate a deep contextual representation of each token. After processing by the multi-layer Transformer encoder, the contextual representations $H = \{h_{1}, h_{2}, \ldots, h_{n}\}$ of the tokens are obtained, where $h_{i}$ is the final representation of the $i$-th token in the text. These representations contain rich grammatical and semantic information for subsequent feature analysis and language development prediction.
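The scaled dot-product attention of equation (6) is sketched below for a single head, using plain lists of lists so that each step is visible. The names `softmax` and `attention` are illustrative. Real BERT layers add learned projections and multiple heads, which are omitted here.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(C, J, Z):
    """Compute softmax(C J^T / sqrt(d_k)) Z; returns (output, weights)."""
    d_k = len(J[0])
    weights = []
    for q in C:
        # Scaled dot product of one query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in J]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value rows.
    output = [
        [sum(w * v[col] for w, v in zip(row, Z)) for col in range(len(Z[0]))]
        for row in weights
    ]
    return output, weights

C = [[1.0, 0.0], [0.0, 1.0]]   # queries
J = [[1.0, 0.0], [0.0, 1.0]]   # keys
Z = [[1.0, 2.0], [3.0, 4.0]]   # values
out, w = attention(C, J, Z)
print(w)
```

Each row of `w` sums to 1, and each query attends most strongly to the key it matches.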

Feature modeling

In the feature modeling stage, the token representations generated by BERT are further processed to extract deep features related to the language development of preschool children. To integrate the grammatical and semantic features of the text into the prediction model, feature aggregation combines the token representations of each sentence: a pooling operation, either max pooling or average pooling, produces a sentence-level representation:

$$H_{sentence} = \mathrm{pooling}(H) \tag{7}$$

This pooling operation compresses all token representations in a sentence into a vector of fixed dimension, so that the vector can contain the grammatical and semantic features of the entire sentence.
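Both pooling variants of equation (7) are short enough to write out directly. The function names are illustrative, and `H` is assumed to be a list of equally sized token vectors.

```python
def mean_pool(H):
    """Average pooling: per-dimension mean over all token vectors."""
    n = len(H)
    return [sum(col) / n for col in zip(*H)]

def max_pool(H):
    """Max pooling: per-dimension maximum over all token vectors."""
    return [max(col) for col in zip(*H)]

H = [[1.0, 4.0], [3.0, 2.0]]   # two tokens, dimension 2
print(mean_pool(H))  # [2.0, 3.0]
print(max_pool(H))   # [3.0, 4.0]
```

Either way, a sentence of any length is reduced to a single vector with the same dimension as one token representation.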

The grammatical features of the text are then refined and modeled in light of cultural background and language characteristics. For key structural units in the sentence, such as verb phrases and noun phrases, an attention-based weighted summation fuses the unit representations into the final sentence feature representation $h_{S}$:

$$h_{S} = \sum_{i = 1}^{m} \alpha_{i} \cdot h_{i} \tag{8}$$

where $m$ is the number of structural units and $\alpha_{i}$ is a weighting coefficient that dynamically adjusts the contribution of each structural unit to the sentence feature representation.

Features are input into the fully connected layer for further processing to obtain the deep feature representation required for the prediction of language development in preschool children:

$$f = \sigma(W_{fc} \cdot h_{S} + b) \tag{9}$$

where $W_{fc}$ is the weight matrix of the fully connected layer, $b$ is the bias term, and $\sigma$ is the activation function. This approach combines the grammatical features, semantic features, and cultural adaptability of the text to generate an effective feature representation for subsequent language development prediction.
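Equations (8) and (9) can be chained in a short sketch. Two assumptions are made that the paper does not state: the weights $\alpha_{i}$ are obtained by a softmax over per-unit relevance scores, and $\sigma$ is the logistic sigmoid. The names `weighted_sum` and `dense` are illustrative.

```python
import math

def weighted_sum(units, scores):
    """Eq. (8): h_S = sum_i alpha_i * h_i, with alpha = softmax(scores) (assumed)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(units[0])
    h_S = [sum(a * h[d] for a, h in zip(alphas, units)) for d in range(dim)]
    return h_S, alphas

def dense(h_S, W_fc, b):
    """Eq. (9): f = sigma(W_fc h_S + b), with a logistic sigma (assumed)."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, h_S)) + bias)))
        for row, bias in zip(W_fc, b)
    ]

units = [[1.0, 0.0], [0.0, 1.0]]        # e.g. a verb phrase and a noun phrase
h_S, alphas = weighted_sum(units, [0.0, 0.0])
f = dense(h_S, W_fc=[[1.0, -1.0]], b=[0.0])
print(h_S, alphas, f)
```

With equal scores the two units contribute equally; raising one score shifts weight toward that unit.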

The deep features extracted by the BERT model through the bidirectional self-attention mechanism effectively obtain the contextual information in the text and also combine the language features of different cultural backgrounds, so that the model can better understand the complex patterns of language development in preschool children. These feature representations provide strong support for subsequent prediction models, improving the accuracy and cross-cultural adaptability of language development predictions.

Figure 1 shows the structure of the BERT model. The input text is converted into multiple tokens through the word segmentation process, and each token represents a vocabulary unit. The input token is processed by the embedding layer and position encoding together with the special [CLS] tag and then sent to multiple Transformer encoder layers for processing. Each Transformer layer obtains the relationship between each word in the text through the self-attention mechanism. The output of each layer represents the context information of each token in the text. Using the pooling operation, the model extracts sentence-level feature representations to provide deep semantic features for subsequent tasks. Figure 1 shows the overall structure from the input layer to the final output, highlighting how BERT gradually learns the complex features of text through the nested Transformer mechanism when processing text.

Figure 1.

Structure of the BERT model.

Model training and optimization

The pre-trained BERT model is fine-tuned on the preschool language dataset, with model parameters adjusted using optimization algorithms such as the Adam optimizer to enhance the accuracy of language development predictions. The model’s stability and predictive performance are further improved through cross-validation and hyperparameter tuning, ensuring more reliable and precise results.

Model training

In the model training phase, the pre-trained BERT model is loaded and fine-tuned on the preschool language dataset to adapt it to the characteristics of preschool children’s language data. All layers of the BERT model are initialized from the pre-trained weights and trained on the dataset; minimizing the loss function optimizes the model parameters and improves prediction accuracy.

The study adopted a loss function for the prediction error calculation in the regression task. Given the predicted value and the true value, the mean square error is calculated as follows:

$$L = \frac{1}{N} \sum_{i = 1}^{N} (y_{i} - \hat{y}_{i})^{2} \tag{10}$$

where $N$ is the number of samples, $y_{i}$ is the true label of the $i$-th sample, and $\hat{y}_{i}$ is the model’s predicted value for the sample. To optimize this loss function, the model continuously adjusts its parameters during training to minimize the error between the predicted value and the true label.

The study uses the Adam optimizer to update the model parameters. Adam combines the advantages of the momentum method and the RMSProp optimizer, adaptively adjusting the learning rate of each parameter. Its update rule is as follows:

$$\theta_{t + 1} = \theta_{t} - \eta \cdot \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon} \tag{11}$$

where $\theta_{t}$ is the value of the parameter at the $t$-th iteration, $\eta$ is the learning rate, $\hat{m}_{t}$ and $\hat{v}_{t}$ are the bias-corrected estimates of the first moment (mean) and second moment (uncentered variance) of the gradient, respectively, and $\epsilon$ is a small smoothing constant that prevents division by zero. This lets Adam adapt the effective step size efficiently, so the model converges quickly during training.
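To make the update rule of equation (11) concrete, the sketch below applies Adam to a one-parameter toy problem. The defaults for `beta1`, `beta2`, and `eps` are the commonly used ones, and the learning rate and toy objective are illustrative. None of these values are taken from the paper.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of scalar parameters; t starts at 1."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Toy objective (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, m, v = [0.0], [0.0], [0.0]
for t in range(1, 501):
    grad = [2.0 * (theta[0] - 3.0)]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)
```

On the first step the bias-corrected ratio has magnitude close to 1, so the parameter moves by roughly the learning rate regardless of the gradient scale.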

The training also uses an early stopping strategy to avoid overfitting. The loss value of the validation set is monitored. If the validation set loss does not decrease within several consecutive epochs, the training is stopped. This strategy effectively prevents the model from overfitting the training data and improves its adaptability.
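The early stopping rule described above can be sketched as a function over a recorded sequence of validation losses. The `patience` and `min_delta` parameters are hypothetical names; the paper only says training stops after "several consecutive epochs" without improvement.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch index at which training would stop."""
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    # Patience never ran out: training runs through the last epoch.
    return len(val_losses) - 1

losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
print(early_stopping_epoch(losses, patience=3))  # 5
```

In this example the later loss of 0.6 is never reached, which shows the trade-off that a small patience value makes.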

Hyperparameter tuning and cross-validation

Tuning the model’s hyperparameters further improves its stability and predictive ability. The main hyperparameters tuned are the learning rate, the batch size, and the number of training epochs. Grid search or random search is used to train the model under different hyperparameter combinations and select the best-performing one.
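A grid search over these hyperparameters can be sketched as follows. The grid values and the stand-in `evaluate` function are hypothetical. In practice `evaluate` would fine-tune the model with the given settings and return its validation loss.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every combination in grid; return (best_params, best_loss)."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = evaluate(params)  # lower is better
        if best is None or loss < best[1]:
            best = (params, loss)
    return best

# Hypothetical search space; the paper does not list the values it tried.
grid = {"lr": [1e-5, 2e-5, 5e-5], "batch_size": [16, 32], "epochs": [3, 5]}

def evaluate(params):
    # Stand-in for a real training run, for demonstration only.
    return abs(params["lr"] - 2e-5) * 1e5 + abs(params["batch_size"] - 32) / 32

best_params, best_loss = grid_search(grid, evaluate)
print(best_params, best_loss)
```

Random search would sample combinations from the same space instead of enumerating all of them, which scales better when the grid is large.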

The learning rate is a key factor affecting the convergence speed and stability of training. If it is too large, the model can overshoot the optimal solution and convergence becomes unstable; if it is too small, the model converges slowly and may not reach the best result within the available training time. To balance the two, a learning rate decay strategy is adopted. With initial learning rate $\eta_{0}$, the learning rate decays during training according to the following formula:

$$\eta_{t} = \eta_{0} \cdot \frac{1}{1 + \lambda t} \tag{12}$$

where $\lambda$ is the decay rate and $t$ is the current iteration number. This strategy lowers the learning rate in the later stages of training, which stabilizes convergence and avoids oscillation.
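The decay schedule of equation (12) is a one-line function. The initial rate and decay rate below are hypothetical values chosen to make the arithmetic easy to follow.

```python
def decayed_lr(eta0, decay, t):
    """Inverse-time decay: eta_t = eta0 / (1 + decay * t)."""
    return eta0 / (1.0 + decay * t)

# Hypothetical values: eta0 = 0.1, decay rate = 0.5.
schedule = [decayed_lr(0.1, 0.5, t) for t in range(5)]
print(schedule)
```

With these values the rate halves by iteration 2 and falls to one third of its initial value by iteration 4, so the decay is fastest early on and then flattens.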

The batch size determines the number of training samples used for each parameter update. A smaller batch size may cause unstable gradient estimates, while a larger batch size may waste computing resources. The experiment selects a batch size that balances training efficiency and stability.

The number of training epochs is selected by cross-validation. In k-fold cross-validation, the dataset is divided into k subsets; each subset in turn serves as the validation set while the remaining subsets form the training set. The model is thus trained k times, and the performance indicators are averaged over the k runs. Cross-validation effectively evaluates the adaptability of the model and guards against performance fluctuations caused by a particular data partition. Stability is evaluated by the variance of the k validation results: the smaller the variance, the more stable the model is across data partitions.
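The k-fold procedure, together with the mean and variance used to judge stability, can be sketched with the standard library alone. The scoring callback `train_and_score` is a hypothetical placeholder for a real fine-tuning run, and the splitter does not shuffle, which a real experiment normally would.

```python
from statistics import mean, pvariance

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def cross_validate(train_and_score, n_samples, k):
    """Return (mean score, variance of scores) over the k folds."""
    scores = [train_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return mean(scores), pvariance(scores)

# Placeholder scorer: pretends the validation error depends on the fold size.
avg, var = cross_validate(lambda tr, va: 1.0 / len(va), n_samples=10, k=3)
print(avg, var)
```

A small variance across folds is the stability signal described above.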

Once the optimal hyperparameter combination has been obtained through the steps above, the model undergoes final training. Cross-validation helps select appropriate hyperparameters and further verifies the model’s performance on different data partitions, allowing the model to adapt to diverse language data and supporting its cross-cultural prediction capability.

Cross-cultural adaptability adjustment

Culture-specific input adjustment

The input is adjusted to the characteristics of preschool children’s language data from each cultural background. The text data of each cultural background receives language-specific processing to meet the requirements of the model: for the vocabulary and grammatical structure of each language, corresponding processing methods, culture-related vocabularies, and word segmentation rules are adopted so that the input data meets the specific needs of that language.

Certain languages feature unique grammatical structures or special punctuation marks, requiring the use of regular expressions and custom word segmentation tools for text processing. For languages like Chinese and Japanese, which lack clear word boundaries, the word segmentation process must be adjusted according to contextual information. In contrast, for languages with standard space separators, such as English, word segmentation follows conventional methods, with custom algorithms designed to accommodate the specific characteristics of each language, ensuring the model can accurately process input in all languages.

To address vocabulary differences across cultural contexts, culture-specific vocabularies are incorporated into the input data, enabling the model to recognize high-frequency words and idioms unique to each language and culture. For example, there are significant differences between Chinese and English in expressing concepts like time, numbers, and emotions. These differences are addressed through cultural adaptation in data processing. For instance, while “suishu” in Chinese and “age” in English convey the same meaning in context, they differ in word form and expression. Vocabulary mapping is employed to standardize these culturally distinct terms. This approach ensures that the model can effectively extract relevant features while accounting for cultural differences, mitigating the impact of cultural bias on model performance when processing multilingual data.
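One way to picture the vocabulary mapping is a per-language lookup that replaces culture-specific surface forms with a shared concept tag. The table contents, the `<AGE>` tag, and the language codes below are hypothetical. The paper gives only the "suishu"/"age" example and does not describe the mapping format.

```python
# Hypothetical mapping from surface forms to shared concept tags.
CONCEPT_MAP = {
    "en": {"age": "<AGE>"},
    "zh": {"suishu": "<AGE>"},
}

def map_vocabulary(tokens, lang):
    """Replace mapped tokens with their concept tag; leave the rest unchanged."""
    table = CONCEPT_MAP.get(lang, {})
    return [table.get(tok.lower(), tok) for tok in tokens]

print(map_vocabulary(["what", "is", "your", "age"], "en"))
print(map_vocabulary(["ni", "de", "suishu"], "zh"))
```

Both sentences end up containing the same `<AGE>` tag, so downstream features can treat the two expressions as the same concept.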

Adjustment of model input and feature encoding

The adjusted culture-specific input data is passed to the BERT model through the encoding layer. Using vocabularies and word embeddings for the different languages and cultural backgrounds, BERT’s encoding capability captures the semantic differences between cultures. Processing culture-specific words also involves complex features such as emotional tone and syntactic differences. To express culture-specific information fully, a weighted word embedding technique is used: culture-specific words receive a larger weight on their embedding vectors, so they have greater influence during model learning. The weighted word embeddings are expressed as follows:

$$e_{w}^{'} = \alpha_{w} \cdot e_{w} \tag{13}$$

where $e_{w}$ is the original embedding of word $w$, $\alpha_{w}$ is its weight, and $e_{w}^{'}$ is the weighted word embedding. In addition, a cultural adaptability layer is added to improve cross-cultural adaptability. This layer dynamically adjusts the model’s weights according to the cultural context of the input: the cultural background of the input data determines the weight given to each layer, so that inputs from different cultural backgrounds yield outputs appropriate to that background. Introducing this layer enables the model to extract more representative features from the input data of different cultures and to generate language development predictions suited to each cultural background.
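Equation (13) amounts to scaling selected embedding vectors. In the sketch below, the weight table `culture_weights` and the default weight of 1.0 for all other words are assumptions; the paper does not say how the weights are chosen.

```python
def weight_embeddings(tokens, embeddings, culture_weights, default=1.0):
    """Apply e_w' = alpha_w * e_w; words without an entry keep weight `default`."""
    return [
        [culture_weights.get(tok, default) * x for x in vec]
        for tok, vec in zip(tokens, embeddings)
    ]

tokens = ["my", "suishu"]
embeddings = [[1.0, 2.0], [3.0, 4.0]]
# Hypothetical weight that emphasizes a culture-specific word.
weighted = weight_embeddings(tokens, embeddings, {"suishu": 1.5})
print(weighted)  # [[1.0, 2.0], [4.5, 6.0]]
```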

Figure 2 shows how the input is adjusted and the data processed in the cross-cultural preschool language development prediction model. Data is collected from multiple cultural backgrounds and processed in a culture-specific manner: language-specific word segmentation and culture-specific vocabulary mapping are applied so that the unique expressions of each language are handled properly. The adjusted data is passed into the BERT model, which encodes it with its bidirectional self-attention mechanism and extracts contextual information and deep semantic features from the text. The model then outputs the predicted language development results, achieving cross-cultural language prediction. The entire process takes different cultural backgrounds and language characteristics into account, with the aim of stable and accurate predictions in a multicultural environment.

Figure 2.

Data processing flow of the cross-cultural preschool children’s language development prediction model.
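The first two stages of this flow, language-specific segmentation followed by culturally specific vocabulary mapping, might be sketched as follows. This is only an illustration under stated assumptions: the language codes, the whitespace and character-level segmenters (crude stand-ins for real segmentation tools), and the example vocabulary map are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List


def segment_whitespace(text: str) -> List[str]:
    """Lowercase and split on whitespace; stand-in for space-delimited languages."""
    return text.lower().split()


def segment_characters(text: str) -> List[str]:
    """Split into single characters; crude stand-in for a Chinese/Japanese segmenter."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical mapping from language code to segmentation function.
SEGMENTERS: Dict[str, Callable[[str], List[str]]] = {
    "en": segment_whitespace,
    "es": segment_whitespace,
    "zh": segment_characters,
    "ja": segment_characters,
}


def preprocess(text: str, lang: str, vocab_map: Dict[str, str]) -> List[str]:
    """Segment text for its language, then map culture-specific tokens."""
    if lang not in SEGMENTERS:
        raise ValueError(f"unsupported language code: {lang}")
    tokens = SEGMENTERS[lang](text)
    # Tokens without an entry in the map are passed through unchanged.
    return [vocab_map.get(token, token) for token in tokens]


# Hypothetical child-specific word mapped to a standard form.
english_tokens = preprocess("I want my blankie", "en", {"blankie": "blanket"})
chinese_tokens = preprocess("我要 喝水", "zh", {})
print(english_tokens)  # ['i', 'want', 'my', 'blanket']
print(chinese_tokens)  # ['我', '要', '喝', '水']
```

The resulting token lists would then be converted to model inputs by the tokenizer of the chosen pre-trained BERT variant, a step this sketch omits.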

Method effect evaluation

Self-attention weight analysis

Figure 3 shows the attention weight distribution in the self-attention mechanism of the BERT model, reflecting the relationships between word pairs in the text. Each cell represents the strength of attention between two words, and the color depth corresponds to the weight. Light-colored areas indicate a stronger attention relationship, meaning that the two words are closely related in the model; dark-colored areas indicate a weaker relationship, meaning that the two words are loosely related. The model allocates stronger attention when processing common words such as “the” and “lazy” and assigns lower attention weights to less significant word pairs. This visualization shows how the BERT model uses contextual information to dynamically adjust the focus of each word and capture grammatical and semantic information in the text. This mechanism is crucial for language understanding tasks: it enables the model to transfer information flexibly between words and strengthens its ability to process complex language structures.

Figure 3.

BERT model self-attention weight heat map.
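The kind of matrix visualized in such a heat map can be sketched with standard scaled dot-product attention, in which each row of weights is a softmax over query-key scores. The sketch below is a simplification under stated assumptions: it uses a single head, takes toy 2-dimensional vectors directly as both queries and keys, and omits the learned projections and multiple heads of a real BERT layer. The vectors are illustrative and are not values from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries: Matrix, keys: Matrix) -> Matrix:
    """Return softmax(Q K^T / sqrt(d_k)) row by row."""
    scale = math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    return weights


tokens = ["the", "lazy", "dog"]
# Toy vectors used as both queries and keys (an assumption for illustration).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = attention_weights(vectors, vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Each row sums to 1, so a cell can be read as the share of attention that the row token gives to the column token.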

Prediction accuracy and comparison

MSE and MAE are selected as the main regression indicators for evaluating the prediction accuracy of the model, since both measure the deviation between the model’s predictions and the true values. MSE penalizes large errors more heavily, while MAE provides an intuitive measure of the average error per sample.
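The difference between the two indicators can be seen in a small sketch using their standard definitions. The score values below are toy numbers chosen for illustration, not data from the study: both prediction sets have the same MAE, but the one that concentrates its error in a single sample has a much larger MSE.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |y_i - y_hat_i| over all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of (y_i - y_hat_i)^2 over all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
spread_errors = [1.5, 2.5, 3.5, 4.5]  # every prediction is off by 0.5
single_error = [1.0, 2.0, 3.0, 6.0]   # one prediction is off by 2.0

print(mean_absolute_error(y_true, spread_errors))  # 0.5
print(mean_absolute_error(y_true, single_error))   # 0.5
print(mean_squared_error(y_true, spread_errors))   # 0.25
print(mean_squared_error(y_true, single_error))    # 1.0
```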

The study compares BERT with the traditional BiLSTM (Bi-directional Long Short-Term Memory) and GRU (Gated Recurrent Unit) models to verify BERT’s advantages in predicting preschool language development. The MSE and MAE values of each model are calculated, and their averages are used to evaluate overall performance.
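Since the results are reported over 10 folds, the evaluation loop presumably resembles k-fold cross-validation. The following sketch shows one way to generate the fold indices and to average a per-fold metric; how the folds were actually constructed (for example, whether the data was shuffled or stratified by language) is not stated in the paper, so the contiguous, unshuffled split here is an assumption, as are the function names.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) for k contiguous, unshuffled folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def average_over_folds(per_fold_scores: Sequence[float]) -> float:
    """Mean of a metric (for example MAE or MSE) computed once per fold."""
    if len(per_fold_scores) == 0:
        raise ValueError("at least one fold score is required")
    return sum(per_fold_scores) / len(per_fold_scores)


folds = k_fold_indices(n_samples=25, k=10)
print(len(folds))                         # 10
print([len(test) for _, test in folds])   # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

Each model would be trained on the training indices of a fold and scored on its test indices, and the per-fold scores would then be passed to `average_over_folds`.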

The horizontal axis of Figure 4 has 10 marks, numbered 1 to 10, one for each fold. The three bars above each mark give the MAE values of the BERT, BiLSTM, and GRU models. The smaller the MAE value, the more accurate the model’s prediction.

Figure 4.

Comparison of MAE of different models.

The data shows that the MAE of the BERT model lies between 0.20 and 0.25, and that BERT consistently has a lower error than BiLSTM and GRU. In the first fold, BERT’s MAE is 0.25, while those of BiLSTM and GRU are 0.35 and 0.38, respectively. Across all folds, BERT outperforms the other two models in prediction accuracy. These results indicate that the BERT model captures language features and contextual information more effectively than the BiLSTM and GRU models and predicts the language development of preschool children more accurately.

Figure 5 has the same layout as the MAE bar chart: each fold mark corresponds to the MSE values of the BERT, BiLSTM, and GRU models. The MSE values range between 0 and 0.15.

Figure 5.

MSE comparison of different models.

The MSE of BERT in the first fold is 0.08, while those of BiLSTM and GRU are 0.12 and 0.14, respectively, so BERT shows a smaller prediction error. BERT’s MSE remains low across all 10 folds, with a minimum of 0.05, showing its superiority in preschool language prediction tasks. The MSE values of BiLSTM and GRU are always higher than that of BERT, which again confirms BERT’s advantage in accuracy and stability when processing language data. The MSE trend reflects BERT’s stability across folds: it maintains high prediction accuracy in different cultural and language backgrounds, whereas the larger MSE values of BiLSTM and GRU indicate a degree of instability.

Cross-cultural stability

The cross-cultural stability of the model is evaluated using R², which measures how well the model fits the data from each culture. The R² value reflects the proportion of variability in the observed data that the model can explain; the closer the value is to 1, the better the fit. The model is trained and tested on the language data of preschool children from different cultural backgrounds, and the R² value of each cultural sample is calculated to evaluate the model’s cross-cultural adaptability comprehensively.

Comparing the R² values of the different cultural samples shows how stable the model is across languages, grammars, and cultural expressions. If the model has a high R² value on every cultural dataset, it adapts stably to the language characteristics of multicultural backgrounds and has strong cross-cultural generalization capability. If the R² value varies widely from one culture to another, the model may have a cultural bias and cannot effectively handle data features from different cultural backgrounds. This process verifies the stability of the model in a multicultural environment and provides guidance for its further optimization and adjustment.
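The per-culture comparison described above can be sketched with the standard definition of R² (one minus the ratio of the residual sum of squares to the total sum of squares). The culture labels and score values below are hypothetical toy data, and using the gap between the highest and lowest R² as a stability indicator is an assumption made for illustration, not a measure defined in the paper.

```python
from typing import Dict, Sequence, Tuple


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


def r2_by_culture(
    samples: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> Dict[str, float]:
    """Compute R^2 separately for each culture's (y_true, y_pred) pair."""
    return {name: r_squared(y_true, y_pred) for name, (y_true, y_pred) in samples.items()}


def r2_spread(scores: Dict[str, float]) -> float:
    """Gap between the best and worst culture; a large gap hints at cultural bias."""
    return max(scores.values()) - min(scores.values())


# Hypothetical toy data for two cultural samples.
samples = {
    "culture_a": ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]),
    "culture_b": ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]),
}

scores = r2_by_culture(samples)
for name, value in scores.items():
    print(name, round(value, 3))
print("spread", round(r2_spread(scores), 3))
```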

As can be seen from Figure 6, the BERT model achieves a better fit than the BiLSTM and GRU models in all languages, with an average R² value of 0.84. BERT captures the deep features and contextual information of each language more effectively and therefore fits the data of these languages more closely. The R² values for Chinese and Japanese, 0.83 and 0.80 respectively, are relatively low, but they still demonstrate BERT’s strong cross-cultural adaptability and its ability to process preschool children’s language data from different cultural backgrounds. The BiLSTM and GRU models perform relatively poorly, with a low fit in Chinese and Japanese. These two models adapt poorly to languages from different cultures and cannot fully capture their complex grammatical structures and cultural expressions. The stability and cross-cultural generalization ability of the BERT model are clearly better than those of the BiLSTM and GRU models, and it shows wide applicability and high predictive ability on multicultural datasets.

Figure 6.

R² comparison of different models.

Conclusions

This paper proposes a preschool language development prediction method based on the BERT model and evaluates its cross-cultural effectiveness. By processing language datasets of preschool children from multiple cultural backgrounds, a deep learning model was constructed that can capture the complex nonlinear relationships in language development. In feature extraction and modeling, the BERT model extracted deep features of the language through its bidirectional self-attention mechanism and combined them with grammatical and semantic information to provide meaningful input for language development prediction. The pre-trained BERT model was fine-tuned, and cross-validation and hyperparameter tuning were used to further improve the accuracy of language prediction.

In terms of cross-cultural adaptability, the model’s input has been carefully adjusted to account for the unique language characteristics of different cultural backgrounds. This enables it to effectively handle the special grammatical structures and cultural expressions inherent in various languages, ensuring the model’s robustness across multicultural datasets. The evaluation of the BERT model’s prediction accuracy demonstrated its superiority over traditional BiLSTM and GRU models, highlighting its strengths in predicting preschool language development. Furthermore, the model’s cross-cultural stability, as measured by the R² value, shows strong adaptability and stability across diverse cultural contexts, indicating its significant potential for cross-cultural application.

The BERT-based preschool language development prediction model not only enhances prediction accuracy but also successfully addresses the complexity of cross-cultural data. This marks a substantial advancement in preschool language education research, offering novel approaches for better understanding and supporting language development in young children from various cultural backgrounds.

Looking ahead, future research could explore integrating additional deep learning techniques, such as reinforcement learning or transfer learning, to further enhance the model’s ability to adapt to evolving linguistic trends and real-time data. Expanding the dataset to include more languages and diverse cultural contexts could improve the model’s generalizability and make it a more universal tool for global preschool language education. Additionally, enhancing the model’s interpretability will be critical for providing educators with clear insights into the factors influencing language development, ultimately supporting more personalized and effective teaching strategies.

Statements and declarations

Conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Jiasong Huo
