Abstract
Class imbalance is a persistent challenge in deep learning, often leading to suboptimal performance for underrepresented classes. This challenge is particularly pronounced in natural language processing (NLP) tasks, such as named entity recognition (NER), where traditional oversampling methods may overlook important linguistic nuances. In this study, we introduce a novel synonym-based oversampling technique that employs pre-trained Word2Vec embeddings to generate semantically coherent examples. This approach augments minority classes using contextually appropriate synonyms. Experiments on an imbalanced social media NER dataset demonstrate enhanced model performance, with improved recognition of named entities across diverse categories. By generating synthetic samples that closely mirror the original data’s semantic characteristics, our method offers a compelling solution to data imbalance in semantically driven NLP tasks. This research highlights the potential of semantic-based oversampling in enhancing the generalization capabilities of deep learning models for NER challenges.
Introduction
Deep learning techniques have revolutionized the fields of artificial intelligence and data science by enabling the development of intelligent models that recognize highly complex patterns beyond the scope of traditional programming methods. 1 However, a primary challenge to deep learning’s success is the need for labeled data. 2 Structuring data for better neural network understanding and pattern identification requires effective data pre-processing. 3 Imbalances in class representation can significantly hinder classifier performance in fields like machine learning, data mining, and deep learning. 4 Addressing this challenge involves balancing data representation across different dataset classes. 5 In binary classification problems, imbalanced data typically consists of an over-represented majority class and an under-represented minority class. 6 Multi-class imbalance issues, also known as long-tailed distribution challenges, arise when datasets contain multiple classes with varying levels of representation. 7
To address class imbalance, two prevailing strategies can be distinguished. The first is undersampling, 8 which focuses on reducing instances in the majority class. This approach requires careful instance selection to ensure model coherence and accuracy while avoiding overfitting. Although various studies have explored this method,4,9–15 it might be insufficient for intricate problems like NLP, where diverse datasets are essential for effective generalization in large-scale models.
The second strategy, oversampling, 8 addresses the minority class imbalance by generating synthetic data. A key challenge here is creating synthetic data that accurately mirrors the original dataset’s characteristics. Various studies have aimed to tackle this effectively.16–23
While numerous data balancing techniques exist, their application to NLP and NER remains challenging due to the critical role of semantics and lexical relationships. Preserving linguistic structure is imperative to retain the authentic meaning within textual data. This intricacy underlines the need for specialized data-balancing approaches tailored to the nuances of NLP and NER tasks. When applied to NER datasets, conventional oversampling techniques may introduce issues such as semantic drift or loss of contextual information critical for precise entity labeling. Such oversampled data may also exhibit grammatical inconsistencies and insufficient diversity, hampering model training and generalization.
Our study introduces a novel data balancing technique leveraging Word2Vec models 24 trained on reputable corpora, including Google News 25 and GloVe Wiki-Gigaword. 26 This approach addresses the data imbalance challenge in NER datasets while preserving the semantic essence and contextual significance of sentences. By incorporating Word2Vec models, we can capture word associations and contextual nuances to address the limitations of conventional data balancing methods with NER datasets. To validate the effectiveness of the proposed methodology, experiments were conducted using SocialNER, a dataset tailored for NER in social media content from previous work. 27 Empirical results are promising: class balance improves without compromising the semantic integrity of the processed data, which in turn boosts performance in identifying named entities across diverse categories.
This proposal’s key contributions include:
The introduction of a novel method leveraging pre-trained Word2Vec model embeddings to find and incorporate synonyms for words in underrepresented classes, generating synthetic data that remains contextually relevant.
The evaluation of this synonym-based oversampling approach on the imbalanced SocialNER dataset.
A comparative analysis between pre-trained BERT transformer-based models, 28 assessing the effectiveness of the newly introduced technique. This comparison delves into two main research questions:
The remainder of the paper is organized as follows: Section 2 explores the background and related works; Section 3 elucidates the proposed algorithm; Section 4 presents the experimental results and discussion; and Section 5 concludes, shedding light on the findings, suggesting avenues for future research, and addressing challenges encountered.
Background
In this section, we lay the foundation for addressing imbalanced data by exploring various approaches. We begin by diving into resampling techniques as the fundamental starting point.
Traditional resampling techniques
Resampling techniques encompass three primary approaches: undersampling, oversampling, and hybrid techniques.
Undersampling techniques
Undersampling is a valuable technique for tackling data imbalance by strategically reducing instances of the dataset’s majority class(es). This creates a balanced class distribution, aiding effective model learning from both classes. In this context, we find:
Edited Nearest Neighbors (ENN) undersampling 29 examines each instance, removing majority class samples that differ from most of their k-nearest neighbors. ENN aims to eliminate noisy majority class instances to enhance class separability and improve model performance. It offers simplicity and computational efficiency, making it suitable for large-scale datasets; however, it has limitations. Removing instances from the majority class may lead to information loss, affecting the model’s grasp on majority class distribution and patterns. Additionally, ENN may not be effective when there’s a significant overlap between the majority and minority classes.
Tomek Links 30 are pairs of closely situated instances that belong to different classes. This technique removes the majority class instance from each Tomek Link, creating clearer class boundaries. By eliminating instances near the decision boundary, Tomek Links help reduce overlapping regions between classes, potentially improving the model’s ability to discriminate between them. One of the significant advantages of Tomek Links is its ability to remove noisy majority class instances that the model may misclassify. By removing such instances, the model is less likely to be influenced by noisy data points, which can enhance its generalization performance. However, there are also limitations to consider when using Tomek Links. In some cases, removing instances from the majority class might lead to an overly aggressive reduction, causing the loss of valuable information and negatively impacting the model’s performance. Moreover, Tomek Links may not be as effective in cases where the class overlap is more complex and not well-separated.
Random Undersampling 31 involves removing instances from the majority class until a balanced distribution is achieved. It offers simplicity and ease of implementation, making the model focus more on the minority class, potentially improving its performance in classifying the underrepresented class. However, it has limitations. One primary concern is the loss of valuable information from the majority class, which can reduce data diversity and richness, potentially affecting the model’s ability to generalize effectively to unseen data. Moreover, when the majority class already has limited instances, random undersampling can result in an excessively small training set, leading to subpar model performance.
Cluster Centroids undersampling 32 involves clustering the majority class instances and selecting cluster centroids to represent the majority class. By selecting cluster centroids as representatives, the technique ensures that the majority class instances retained in the training set are diverse and representative of the overall majority class distribution. Cluster Centroids are computationally efficient, suitable for larger datasets, and appreciated for their simplicity and ease of implementation. While this method reduces the majority class instances, it aims to preserve the underlying distribution. However, it is essential to recognize the limitations of Cluster Centroids as well. The technique may lead to information loss, as many majority class instances are removed from the training data, potentially reducing the diversity of the majority class. This can negatively impact the model’s ability to generalize well, especially when dealing with complex and diverse datasets. Furthermore, in cases with significant class overlaps, the technique may not effectively address the imbalance.
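To make the Tomek Links description above concrete, the following is a minimal sketch rather than a reference implementation. It assumes Euclidean distance over small numeric feature vectors, and the function names (`tomek_links`, `remove_majority_in_links`) and the toy data are illustrative choices that do not come from the cited works.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_majority_in_links(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {
        k
        for pair in tomek_links(points, labels)
        for k in pair
        if labels[k] == majority_label
    }
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Toy data (hypothetical): points 0 and 1 are mutual nearest neighbors
# with different labels, so they form the only Tomek link.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (3.0, 3.0)]
labels = ["maj", "min", "maj", "maj", "maj"]
links = tomek_links(points, labels)
kept_points, kept_labels = remove_majority_in_links(points, labels, "maj")
```

On this toy input only the majority point at index 0 is removed, which illustrates why the method cleans boundaries but rarely balances a dataset on its own.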
Oversampling techniques
Oversampling is another technique used to address the issue of data imbalance in machine learning tasks. In contrast to undersampling, oversampling focuses on augmenting the instances of the minority class(es) to achieve a balanced dataset.
Random Oversampling: this method involves randomly duplicating instances from the minority class to increase its representation in the dataset. While it is simple to implement, it may lead to overfitting and potentially cause the model to memorize the duplicated data points.
Synthetic Minority Oversampling Technique (SMOTE) 33 is a widely-used technique that generates synthetic examples for the minority class. It selects a data point from the minority class and finds its k-nearest neighbors. It then synthesizes new samples by interpolating between the chosen data point and its neighbors. However, one limitation of SMOTE is its potential to generate synthetic samples that mirror existing minority class instances too closely. By interpolating between neighboring instances, SMOTE may not fully capture the true distribution and variability of the minority class, leading to an over-optimistic representation. This can result in overfitting, reduced model generalization, and poorer performance on unseen data.
Adaptive Synthetic Sampling (ADASYN) 34 is an extension of SMOTE that introduces additional randomness to the synthetic sample generation process. While it is designed to emphasize challenging instances by attributing higher weights, its pronounced sensitivity to noisy data can sometimes generate unrepresentative synthetic samples. Moreover, its adaptive weighting scheme, though effective, can be computationally intensive, especially on large datasets.
Designed to address ambiguous cases, Borderline-SMOTE 35 specifically targets instances situated near the decision boundary. A major weakness of Borderline-SMOTE is its vulnerability to misclassifying borderline instances near the decision boundary between the minority and majority classes. Due to their proximity to the boundary, these instances are challenging to classify accurately, leading to uncertainty in their labels. When Borderline-SMOTE generates synthetic samples for these borderline instances, it may inadvertently include misclassified or noisy samples in the new data. Consequently, the synthetic samples may not truly represent the minority class distribution, potentially resulting in decreased model performance and generalization.
Another approach, Safe-Level-SMOTE, 36 focuses on adjusting oversampling based on the perceived noise in a dataset. It computes a ‘safe level’ for every minority instance to guide the generation of synthetic samples. A major weakness of Safe-Level-SMOTE is its sensitivity to class imbalance and the distribution of instances within the minority class. In datasets with extreme class imbalance or complex minority class distributions, Safe-Level-SMOTE may struggle to generate representative synthetic samples effectively. The method’s reliance on an additional parameter, the safe level, also poses a challenge. Selecting an appropriate safe level value is crucial to balance the risk of overfitting and underfitting. Furthermore, Safe-Level-SMOTE may not perform optimally when the decision boundary between classes is intricate, as it uses a fixed distance metric to identify safe instances. This can lead to the creation of synthetic samples that do not adequately represent the true minority class distribution.
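The interpolation step that SMOTE and its variants share can be sketched as follows. This is a simplified illustration under stated assumptions (Euclidean distance, tuple-valued feature vectors, a hypothetical `smote_sample` helper), not the implementation of any cited method.

```python
import math
import random


def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point on the segment between a minority
    point and one of its k nearest minority neighbors."""
    rng = rng or random.Random()
    i = rng.randrange(len(minority))
    j = rng.choice(k_nearest(i, minority, k))
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))


# Hypothetical minority-class points on the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(20)]
```

Because every synthetic point lies on a segment between two existing minority points, the new samples never leave the region already covered by the minority class. For token sequences, such interpolation has no direct counterpart, which is one reason these methods transfer poorly to text.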
Despite significant efforts in oversampling textual data, a critical research gap persists regarding the preservation of semantics. The unique characteristics of labeled textual data pose several challenges that hinder the effective generation of semantically coherent synthetic instances:
Semantic drift: Oversampling can introduce synthetic instances that fail to align with the original context and semantics of the text data, leading to a divergence from the true language distribution.
Loss of contextual information: Oversampling may struggle to preserve the intricate relationships between entities and their surrounding context, which is crucial for accurate NER performance.
Grammatical inconsistency: The simple replication of instances during oversampling can result in ungrammatical or semantically inconsistent sentences that hinder the training of models tasked with understanding and generating coherent text.
Lack of diversity: Oversampling can overemphasize specific phrases or patterns found in the original minority class, limiting the model’s generalization to diverse sentence structures.
Named entity type imbalance: Oversampling can disproportionately amplify certain named entity types, disrupting the balanced representation of these categories and influencing the model’s overall performance.
Hybrid techniques
Hybrid techniques encompass a fusion of oversampling and undersampling methods aimed at mitigating the limitations inherent in each approach.
In Wong et al. 37 the contribution is twofold: firstly, it introduces a hybrid re-sampling method labeled SMOTE+CHC, combining oversampling and undersampling techniques. This novel approach addresses the potential over-generalization issue of using SMOTE alone by incorporating the CHC method over the synthetic minority and majority class samples. Secondly, the proposed SMOTE+CHC method is rigorously compared to various other re-sampling techniques, including RUS, TL, ROS, SMOTE, and SMOTE+TL, across 22 datasets. The evaluation is conducted using the C4.5 decision tree classifier. The findings reveal that the performance of all oversampling and hybrid methods surpasses undersampling approaches, demonstrating the efficacy of these techniques in practice. Furthermore, SMOTE+CHC emerges as a standout candidate with consistently lower oversampling rates than the alternatives, offering an appealing trade-off between enhanced performance and minimal expansion of training set size.
One-Sided Selection (OSS) 38 combines undersampling with creating synthetic samples for the minority class. Initially, it applies a neighborhood cleaning rule to remove noisy majority class instances. Subsequently, it generates synthetic samples from majority and minority class instances. However, OSS’s removal of majority instances might lead to valuable information loss, impacting class representation and generalization. Moreover, OSS might prove less effective for highly imbalanced datasets, as it only removes a subset of majority class instances, potentially maintaining an imbalance that hampers learning.
The CUSBoost technique 38 is a fusion of cluster-based sampling and the AdaBoost algorithm. It stands out by introducing cluster-based sampling from the majority class. This method separates majority and minority instances, organizing the majority class into k clusters using k-means clustering. The value of parameter k is determined through hyper-parameter optimization. Then, random undersampling is performed on each cluster, retaining around 50% of instances (adjustable based on the specific domain) and discarding the rest. As clustering precedes sampling, CUSBoost performs well when clear clusters exist. This results in representative samples that, when combined with minority class instances, yield balanced datasets. The algorithm’s strength lies in considering instances from all subspace clusters of the majority class, a consequence of how k-means clustering assigns each instance to a cluster. In contrast, analogous methods often face difficulties in accurately representing the majority class.
Shi et al. 39 introduced an innovative hybrid resampling approach that combines undersampling and oversampling techniques to address learning challenges from imbalanced datasets, particularly in sparse contexts. The undersampling component of the approach is based on a technique called “safe double screening,” which swiftly identifies and removes non-informative instances and features from the dataset. This ensures that only boundary instances and informative features are retained for classification, leveraging the sparse structure of the data. Through experimentation, the hybrid sampling method demonstrates its effectiveness in reducing the number of instances and features while accurately reflecting the true imbalance ratio of the dataset. Furthermore, classifiers built using this hybrid technique showcase improved classification performance, particularly for data belonging to the minority class. This approach is well-suited for constructing classifiers that rely on decision boundaries within sparse, imbalanced datasets characterized by many features.
In Lin et al., 40 the authors’ main contribution is identifying the best order for combining undersampling and oversampling techniques. The study involves extensive testing across 44 datasets from diverse domains, incorporating three undersampling (instance selection) approaches (IB3, DROP3, and GA) and three oversampling techniques (SMOTE, CTGAN, and TAN). These investigations yield insightful outcomes. Notably, when the undersampling method is chosen carefully (specifically, IB3), adding oversampling might not result in significant performance enhancements. Furthermore, beginning with instance selection and then applying oversampling proves more effective than the reverse order, especially with the IB3 method. This configuration results in the highest AUC rate achieved by the random forest classifier. These discoveries offer practical guidance for refining data resampling strategies. Having provided an overview of hybrid techniques, we now turn to related works, specifically focusing on oversampling techniques applied to text datasets.
Related works
In this section, we explore more advanced methodologies, specifically embedding replacement, also referred to as synonym replacement. These techniques are particularly well-suited for datasets where the semantic integrity of words is a key point. Embedding replacement is an oversampling technique that addresses class imbalance by substituting or augmenting traditional text representations, such as bag-of-words or TF-IDF, with dense vector representations called embeddings. These embeddings map words, sentences, or documents to a continuous vector space, capturing semantic relationships and enabling more advanced analysis. 41 The key challenge in embedding replacement is creating synthetic data that accurately mirrors the original dataset’s characteristics while preserving the semantic essence and contextual significance of the text. Various studies have explored methods for generating contextually relevant synthetic samples using embeddings.41–47
Mosolova et al. 42 explored text augmentation techniques for neural networks, focusing on synonym replacement to enhance small text datasets. Their method involves substituting words with their synonyms to generate synthetic examples while preserving the original semantic meaning. To find synonyms, the authors utilized WordNet, 48 a large lexical database of English. WordNet organizes words into sets of cognitive synonyms called synsets, which represent specific concepts. By leveraging these synsets, the authors could identify appropriate synonyms for words in the text. This approach was tested using the Toxic Comment Classification Challenge dataset, demonstrating significant improvements in model performance. By increasing the diversity of training data, the technique helps mitigate overfitting and enhances model robustness, making it particularly useful for low-resource settings.
Similarly, Bayer et al. 41 discuss various synonym replacement techniques, emphasizing the use of pre-trained word embeddings such as Word2Vec 24 and GloVe. 26 These embeddings help identify semantically similar words by calculating cosine similarity between word vectors. For instance, Wang and Yang 43 utilized Word2Vec embeddings to replace words with their nearest neighbors, improving performance in text classification tasks. Feng et al. 44 proposed a tailored text augmentation technique for sentiment analysis, which includes probabilistic word sampling for synonym replacement. This method enhances the coverage of discriminative words and applies contextual replacement to words irrelevant to sentiment, thereby improving the model’s generalization capability. The technique was shown to yield notable improvements in sentiment analysis tasks, particularly in low-data regimes.
Combining these approaches, Perçin et al. 45 introduced a method that merges the use of WordNet with GloVe word embeddings for legal text augmentation. Their method involves selecting candidate words from WordNet synsets and choosing replacements based on their similarity in the GloVe embedding space. This hybrid approach was evaluated by legal experts and demonstrated superior performance compared to other augmentation methods.
Additionally, Wei and Zou 46 introduced Easy Data Augmentation (EDA) techniques, which include synonym replacement, random insertion, random swap, and random deletion. These simple yet effective methods have been shown to improve text classification performance across various datasets.
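Two of the four EDA operations, random swap and random deletion, can be sketched in a few lines. The code below is an illustrative approximation written for this discussion, not the authors' released implementation; the function names and the rule of never returning an empty sentence are assumptions.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Delete each token independently with probability p,
    keeping at least one token for non-empty input."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(42)
sentence = "the new phone launches in berlin next week".split()
swapped = random_swap(sentence, n_swaps=2, rng=rng)
shortened = random_deletion(sentence, p=0.2, rng=rng)
```

For token-labeled tasks such as NER, any swap or deletion would have to be applied to the tag sequence in lockstep, and deleting an entity token changes the label distribution itself. This helps explain why synonym replacement is the EDA operation most readily adapted to NER.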
Furthermore, the study 47 introduced contextual augmentation, which replaces words with contextually appropriate synonyms generated by a pre-trained language model. This method leverages the context of the surrounding words to ensure that the replacements are semantically coherent, leading to improved performance in text classification tasks.
Despite advances in the field of synonym replacement, there remains a challenge in balancing datasets for tasks such as NER. One of the major concerns involves the correct labeling of entities: because the contextual relationships between entities and their tags are crucial for accurate NER, oversampling can lead to incorrect tagging and reduced model performance, which calls for task-specific algorithms. Our proposed methodology addresses the aforementioned constraints by harnessing the power of word embeddings to generate synthetic samples that more effectively maintain semantic fidelity and contextual associations within NER datasets. We empirically validate the efficacy of our approach through comprehensive experimentation on the SocialNER corpus.
Proposed approach
Class imbalance presents a significant challenge in deep learning, leading to overfitting and poor generalization. To address this challenge, we propose a novel algorithm specifically designed for NER datasets using Word2Vec embeddings. While it is optimized for NER, our algorithm is versatile enough to be adapted to various textual datasets.
The proposed approach relies on pre-trained Word2Vec models that provide rich semantic representations for identifying synonyms. These models have been trained on extensive corpora, including Google News and GloVe Wiki, allowing them to capture the semantic relationships between words via vector proximity. Consequently, words with similar vector representations, defined as
Initially, our algorithm enumerates the occurrences
Subsequently, the “OversampleNERTag” function augments the minority tokens by generating synonyms through Word2Vec.
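The counting step that precedes oversampling might look like the sketch below. It assumes that occurrences are counted per NER tag, that the non-entity tag `O` is ignored, and that each tag's deficit is measured against the most frequent entity tag. None of these details is confirmed by the text, and the helper names are hypothetical.

```python
from collections import Counter


def count_tags(sentences):
    """Count tag occurrences; each sentence is a list of (token, tag) pairs."""
    return Counter(tag for sentence in sentences for _, tag in sentence)


def minority_deficits(tag_counts, ignore=("O",)):
    """Map each under-represented tag to the number of extra
    occurrences needed to match the most frequent entity tag."""
    counts = {t: c for t, c in tag_counts.items() if t not in ignore}
    if not counts:
        return {}
    target = max(counts.values())
    return {t: target - c for t, c in counts.items() if c < target}


# Toy corpus (hypothetical), not taken from SocialNER.
corpus = [
    [("Alice", "PER"), ("visited", "O"), ("Paris", "LOC")],
    [("Bob", "PER"), ("met", "O"), ("Carol", "PER")],
]
tag_counts = count_tags(corpus)
deficits = minority_deficits(tag_counts)
```

Here `PER` occurs three times and `LOC` once, so `deficits` reports that two additional `LOC` occurrences would be needed to balance the entity tags.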
To ensure the quality and relevance of the generated synonyms, we conducted preliminary experiments to establish the optimal similarity threshold. We found that lower thresholds, like 0.5, led to the adoption of more distant synonyms, undermining the semantic integrity of the data. On the other hand, higher thresholds, such as 0.9, resulted in a negligible synonym contribution, thereby diminishing the desired effect of data augmentation. The chosen threshold of 0.7 provides an optimal equilibrium, offering synonyms sufficiently similar to maintain the intended meaning while introducing a beneficial degree of variation.
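The effect of the threshold can be illustrated with a small sketch. It assumes that “ComputeSimilarity” is cosine similarity, which the text does not state explicitly. The three-dimensional vectors are made-up toy values rather than real Word2Vec embeddings, and `synonyms_above_threshold` is a hypothetical helper.

```python
import math


def compute_similarity(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synonyms_above_threshold(word, vectors, threshold=0.7):
    """Words whose similarity to `word` is at least `threshold`, most similar first."""
    target = vectors[word]
    scored = [
        (other, compute_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, score in scored if score >= threshold]


# Toy vectors chosen so that the similarities to "city" are roughly
# 0.99 (town), 0.8 (metropolis), 0.6 (village) and 0.0 (banana).
toy_vectors = {
    "city": (1.0, 0.0, 0.0),
    "town": (0.9, 0.1, 0.0),
    "metropolis": (0.8, 0.6, 0.0),
    "village": (0.6, 0.8, 0.0),
    "banana": (0.0, 0.0, 1.0),
}

loose = synonyms_above_threshold("city", toy_vectors, threshold=0.5)
chosen = synonyms_above_threshold("city", toy_vectors, threshold=0.7)
strict = synonyms_above_threshold("city", toy_vectors, threshold=0.9)
```

With these toy values the 0.5 threshold admits the more distant "village", 0.9 leaves a single candidate, and 0.7 sits between the two, mirroring the trade-off described above. Because cosine similarity is symmetric, the score for a pair of words is the same in either order.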
It is important to note that the “ComputeSimilarity” function is symmetric, meaning that the similarity between two words does not depend on the order in which they are compared.
This approach generates a spectrum of token variations, thereby enhancing the robustness of the data (Figure 1).

Figure 1. Extracting synonyms from Word2Vec to balance the dataset.
Algorithm 1 outlines the process of data preprocessing and synthesis for NER tasks, aiming to rectify the class imbalance and fortify the dataset for more effective NER performance.
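Since Algorithm 1 itself is not reproduced here, the sketch below shows only one plausible reading of the synthesis step: sentences containing a minority tag are copied, tokens carrying that tag are replaced by a synonym, and every tag is left untouched. The function `oversample_ner_tag`, the synonym dictionary, and the toy sentence are illustrative assumptions and should not be read as the authors' implementation of “OversampleNERTag”.

```python
import random


def oversample_ner_tag(sentences, target_tag, synonyms, n_new, rng=None):
    """Return n_new synthetic sentences in which tokens tagged with
    target_tag are replaced by a synonym; tags are preserved as-is."""
    rng = rng or random.Random()

    def replaceable(token, tag):
        return tag == target_tag and bool(synonyms.get(token.lower()))

    # Only sentences with at least one replaceable token can be varied.
    candidates = [
        s for s in sentences if any(replaceable(tok, tag) for tok, tag in s)
    ]
    if not candidates:
        return []

    augmented = []
    for _ in range(n_new):
        source = rng.choice(candidates)
        augmented.append([
            (rng.choice(synonyms[tok.lower()]), tag) if replaceable(tok, tag) else (tok, tag)
            for tok, tag in source
        ])
    return augmented


# Toy input (hypothetical); in practice the synonym lists would come
# from embedding neighbors that pass the similarity threshold.
corpus = [[("we", "O"), ("reached", "O"), ("the", "O"), ("city", "LOC")]]
synonyms = {"city": ["town", "metropolis"]}
new_sentences = oversample_ner_tag(corpus, "LOC", synonyms, n_new=3, rng=random.Random(0))
```

Keeping the tag sequence fixed while varying only the surface token is what lets the synthetic sentences reuse the original annotations without relabeling.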
Word2Vec is a widely recognized and impactful NLP technique that generates word embeddings, representing words in a dense and continuous vector space. These embeddings capture semantic relationships, allowing machine learning models to leverage contextual information effectively. Two notable sets of pre-trained embeddings, “GloVe-Wiki-Gigaword-100” and “Google-News-300,” have gained substantial attention due to their impressive capabilities; the former was trained with the GloVe algorithm rather than Word2Vec, but both can be loaded and queried through the same interface.
The “GloVe-Wiki-Gigaword-100” model originates from the Global Vectors for Word Representation (GloVe) project, designed to capture word meanings by analyzing global word co-occurrence patterns across extensive text corpora. This model comprises 100-dimensional word embeddings, with word semantics distributed across the dimensions. These embeddings are pre-trained on a large corpus combining Wikipedia articles and Gigaword newswire text, covering diverse topics and linguistic contexts. By incorporating global co-occurrence statistics during training, the model effectively captures rich semantic nuances, rendering it suitable for various NLP tasks, including, but not limited to, named entity recognition.
In contrast, the “Google-News-300” model, developed by Google, stands out for its extensive coverage of a diverse vocabulary and concepts. Trained with Word2Vec on a very large collection of Google News articles, this model provides 300-dimensional word embeddings. The training data encompasses various topics and writing styles, contributing to its broad coverage of word semantics. The embeddings capture relationships between words and their contextual usage, making the model a valuable resource for applications that require a grasp of nuanced meanings.
The “GloVe-Wiki-Gigaword-100” and “Google-News-300” models are readily available through libraries like Gensim, providing pre-trained embeddings that integrate into downstream NLP models. These embeddings showcase the power of transfer learning: having been trained on large and diverse textual sources, they supply high-quality vectors for a wide array of words, which enhances the effectiveness of NLP tasks that depend on semantics and context.
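A synonym lookup against such a model can be sketched as below. To keep the example runnable without downloading any vectors, it uses a small stand-in class that mimics the `most_similar(word, topn=...)` call and the `in` membership test offered by Gensim's `KeyedVectors`. The stand-in, its scores, and the `lookup_synonyms` helper are assumptions made for illustration.

```python
class StubVectors:
    """Stand-in mimicking the subset of the KeyedVectors interface used here."""

    def __init__(self, neighbors):
        self._neighbors = neighbors

    def __contains__(self, word):
        return word in self._neighbors

    def most_similar(self, word, topn=10):
        # Returns (word, similarity) pairs, most similar first.
        return self._neighbors[word][:topn]


def lookup_synonyms(model, word, threshold=0.7, topn=10):
    """Neighbors of `word` whose similarity is at least `threshold`;
    out-of-vocabulary words yield an empty list."""
    if word not in model:
        return []
    return [w for w, score in model.most_similar(word, topn=topn) if score >= threshold]


# With Gensim installed, a real model could be obtained roughly like this
# (the vectors are downloaded on first use):
#   import gensim.downloader as api
#   model = api.load("glove-wiki-gigaword-100")
model = StubVectors({
    "happy": [("glad", 0.82), ("joyful", 0.75), ("content", 0.61)],
})
synonyms_for_happy = lookup_synonyms(model, "happy")
synonyms_for_unknown = lookup_synonyms(model, "zzzz")
```

Social media text contains many misspellings and neologisms, so handling out-of-vocabulary tokens explicitly, as the membership check does here, is likely to matter in practice.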
Implementation tools
The described experiments were implemented in Python 3.9, a widely used programming language in data science and NLP tasks. We leveraged key libraries, including NLTK 3.7 for essential text processing tasks such as tokenization and string handling, and Gensim 4.2.0 for accessing pre-trained Word2Vec models. The pre-trained BERT model used for fine-tuning was based on PyTorch 1.13.1, which provided efficient GPU-accelerated training of neural networks. The system used for experimentation featured an Intel Core i7 12700F CPU with a frequency of 2.10 GHz, 64 GB of RAM, and an Nvidia GeForce RTX 3090 Ti GPU with 24 GB of video memory.
The BERT models were trained locally using this system, with a batch size of 32 examples and the Adam optimizer for gradient-based optimization. The GPU acceleration allowed for faster iteration through epochs during the fine-tuning process.
In summary, combining Python’s extensive libraries, a powerful multi-core CPU, and a high-memory GPU enabled efficient implementation and rigorous experimentation to validate our proposed method. Using open-source tools and a local machine provided the flexibility and performance needed for this NLP research.
Evaluation metrics
In assessing model performance on multi-class labeled datasets, a comprehensive set of evaluation metrics is pivotal to gaining an in-depth understanding of model capabilities.
True Positive Rate (TPR) and True Negative Rate (TNR) gauge the model’s competence in correctly identifying positive and negative instances. Meanwhile, Positive Predictive Value (PPV) and Negative Predictive Value (NPV) offer valuable insights into the precision of positive and negative predictions. False Negative Rate (FNR) and False Positive Rate (FPR) capture the model’s propensity to misclassify positive and negative instances, respectively. Additionally, False Discovery Rate (FDR) and False Omission Rate (FOR) give the proportions of positive and negative predictions that are incorrect.
The F1-Score is the harmonic mean of precision and recall, delivering a single overall efficacy measure. Critical Success Index (CSI) extends the evaluation by relating true positives to all instances that are either predicted or actually positive, thereby disregarding true negatives. Accuracy (ACC) presents a high-level overview of overall correctness, computed as the ratio of correctly classified instances to the total. Balanced Accuracy (BA), the mean of TPR and TNR, takes class imbalance into account, ensuring a fair evaluation across different classes.
The Matthews Correlation Coefficient (MCC) is a balanced metric that considers all four values within the confusion matrix. Bookmaker Informedness (BM), computed as TPR + TNR - 1, measures the model’s predictive capability while accounting for chance. Lastly, Markedness (MK), computed as PPV + NPV - 1, quantifies how reliably the model’s positive and negative predictions correspond to actual positive and negative instances.
Collectively, using these metrics culminates in a multifaceted evaluation of model performance, empowering well-informed decisions and critical insights that are instrumental in advancing the objectives of this research endeavor.
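For reference, the metrics above can all be derived from the four confusion-matrix counts of a single class. The sketch below uses the standard definitions; the function name, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions made for illustration and are not results from this study.

```python
import math


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics for one class (one-vs-rest)."""
    tpr = _ratio(tp, tp + fn)  # recall / sensitivity
    tnr = _ratio(tn, tn + fp)  # specificity
    ppv = _ratio(tp, tp + fp)  # precision
    npv = _ratio(tn, tn + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TPR": tpr,
        "TNR": tnr,
        "PPV": ppv,
        "NPV": npv,
        "FNR": _ratio(fn, tp + fn),
        "FPR": _ratio(fp, fp + tn),
        "FDR": _ratio(fp, tp + fp),
        "FOR": _ratio(fn, fn + tn),
        "F1": _ratio(2 * tp, 2 * tp + fp + fn),
        "CSI": _ratio(tp, tp + fn + fp),
        "ACC": _ratio(tp + tn, tp + fp + fn + tn),
        "BA": (tpr + tnr) / 2,
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
        "BM": tpr + tnr - 1,
        "MK": ppv + npv - 1,
    }


# Hypothetical counts for a rare class: accuracy looks high (0.96)
# although only 80% of the positives are found.
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=88)
```

The example shows why accuracy alone is misleading under imbalance: the abundant true negatives dominate ACC, whereas F1, CSI, and MCC remain noticeably lower.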
Experimentation
Training phase
To evaluate the effectiveness of our oversampling method, we first fine-tuned a pre-trained BERT model 28 on the SocialNER imbalanced dataset from previous work. 49 We implemented traditional oversampling techniques, such as [SMOTE, ADASYN …], along with our novel method on the same dataset and fine-tuned the same BERT model for comparison. The choice of BERT is motivated by its bidirectional training strategy, which is instrumental in comprehending context within sentences, a key factor for advancing NER tasks with innovative semantic oversampling techniques and ensuring accurate entity recognition based on contextual cues. Subsequently, we charted the learning curves for the original and oversampled datasets (Figure 2), providing visual insights into the model’s performance across differently balanced datasets. To compare the different learning curves of models effectively, we normalized the error rates for each model. Normalization was done using the formula:

Training curve of pre-trained BERT on different datasets.
During the training phase, we also computed various evaluation metrics. The excessively high metric values observed with traditional oversampling methods, coupled with very low or absent error metrics, suggest possible overfitting, likely because these methods tend to replicate minority class examples, causing the model to memorize rather than generalize. Conversely, as shown in Table 1, the metrics improved when moving from the imbalanced SocialNER dataset to the Word2Vec-oversampled dataset. This suggests that our method may present minority class patterns more effectively without leading to overfitting, indicating a potential advance in handling imbalanced datasets in NER tasks.
Training phase evaluation metrics results on oversampled datasets.
To validate the effectiveness of our approach, we conducted tests using pre-trained BERT models. These models were first fine-tuned on the imbalanced SocialNER dataset and subsequently on the balanced SocialNER dataset, built with our proposed method and with classical oversampling techniques. The tests were conducted on the WNUT-17 benchmark dataset, and the models show clear differences in F1 score on this unseen data. Table 2 presents these results with a detailed breakdown across the different classes, with the best values highlighted in bold. We observed a substantial (
Results scores for BERT fine-tuned on imbalanced and balanced datasets across different classes.
To provide a more detailed analysis of our model’s performance across different classes, we present the confusion matrices for each oversampling technique. These matrices offer insights into the classification accuracy for individual classes and highlight areas where the model excels or struggles. The Word2Vec oversampling technique (Table 3) shows strong performance across most classes, with minimal misclassifications. It particularly excels in recognizing the majority class and several minority classes. SMOTE (Table 4) demonstrates improved performance for some minority classes but still shows significant misclassifications, particularly for the majority class. Borderline-SMOTE (Table 5) shows improvements in recognizing some minority classes but still faces challenges with others, as evidenced by misclassifications. ADASYN (Table 6) provides a more balanced performance across classes but still encounters difficulties with some minority classes. Random oversampling (Table 7) shows some improvement in minority class recognition but continues to struggle with certain classes, as indicated by misclassifications.
Confusion matrix for Word2Vec oversampling.
Confusion matrix for SMOTE.
Confusion matrix for borderline-SMOTE.
Confusion matrix for ADASYN.
Confusion matrix for random oversampling.
Overall, these confusion matrices provide valuable insights into the performance of each oversampling technique across different classes. The Word2Vec oversampling method demonstrates the most balanced and accurate performance, particularly for minority classes, while traditional methods like SMOTE and ADASYN show improvements but still face challenges with certain classes.
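To make the link between a confusion matrix and the reported per-class and macro F1 scores concrete, the following sketch derives both from a small matrix. The function names and the 3-class matrix are hypothetical and are not taken from Tables 3 to 7; rows are assumed to be true classes and columns predicted classes.

```python
def per_class_f1(confusion):
    """Per-class F1 from a square confusion matrix (rows: true class, columns: predicted class)."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # true k, predicted as another class
        fp = sum(confusion[r][k] for r in range(n)) - tp   # predicted k, actually another class
        denominator = 2 * tp + fp + fn
        # A class that is never predicted correctly gets F1 = 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return scores

def macro_f1(confusion):
    """Unweighted mean of per-class F1, so every class counts equally regardless of size."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)

# Hypothetical matrix: a majority class, a detected minority class, an undetected one.
toy_matrix = [
    [90, 5, 5],
    [10, 30, 0],
    [8, 2, 0],
]
print(per_class_f1(toy_matrix))
print(macro_f1(toy_matrix))
```

Because the macro average weights all classes equally, a single undetected minority class pulls the score down sharply, which is why this measure is sensitive to the kind of improvement oversampling aims for.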
Our study aimed to address Research Question 1 (RQ1) concerning the effectiveness of using synonyms to improve imbalanced textual datasets. The results are promising, with a substantial 29.55% increase in the macro F1-score, indicating significant enhancement in overall performance. This suggests that employing synonyms for data augmentation effectively captures the semantics of the training data, generating synthetic data that aligns well with the original context.
Regarding Research Question 2 (RQ2), we assessed the advancement of semantic-based oversampling techniques compared to traditional methods. Our approach, which uses Word2Vec embeddings to generate synonyms and diversify the dataset, addresses class imbalance and enhances model performance. This is particularly critical in NLP tasks, where token context is crucial for accurate predictions. The improvement is reflected in the substantial increase in F1 scores for various classes, as highlighted in our comparison table.
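The idea of synonym-based oversampling can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the tiny `EMBEDDINGS` table stands in for pre-trained Word2Vec vectors, and the helper names, the `B-location` tag, and the choice of two neighbors per token are all hypothetical.

```python
import math
import random

# Toy 2-d vectors standing in for pre-trained Word2Vec embeddings (hypothetical values).
EMBEDDINGS = {
    "london": (0.90, 0.10),
    "paris": (0.88, 0.15),
    "berlin": (0.85, 0.05),
    "visited": (0.10, 0.90),
    "i": (0.50, 0.50),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_neighbors(word, top_n=2):
    """Return the top_n vocabulary words most similar to `word`, excluding the word itself."""
    if word not in EMBEDDINGS:
        return []
    others = [w for w in EMBEDDINGS if w != word]
    others.sort(key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]), reverse=True)
    return others[:top_n]

def augment(tokens, tags, minority_tag, rng):
    """Copy a sentence, replacing each minority-tagged token with a near neighbor; tags are kept."""
    new_tokens = []
    for token, tag in zip(tokens, tags):
        candidates = nearest_neighbors(token.lower()) if tag == minority_tag else []
        new_tokens.append(rng.choice(candidates) if candidates else token)
    return new_tokens, list(tags)

tokens = ["I", "visited", "London"]
tags = ["O", "O", "B-location"]
new_tokens, new_tags = augment(tokens, tags, "B-location", random.Random(0))
print(new_tokens, new_tags)
```

With real vectors, gensim's `KeyedVectors.most_similar(word, topn=...)` could play the role of `nearest_neighbors`. The point of the design is that the surrounding sentence and the tag sequence are left intact, so the synthetic example keeps the context the tagger learns from.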
Upon closer examination, we observed interesting patterns in class detection and performance improvement. Initially, the model failed to detect some minority classes. However, the semantic-based oversampling technique enabled their successful identification, suggesting that our approach can rectify limitations in class detection that traditional techniques might overlook. Some classes exhibited enhanced performance, while others remained consistent, indicating the varying impact of our approach on different categories.
Despite compelling results, especially on NLP text datasets, our approach has limitations with classes that account for an extremely small share of tokens (around 1-5%). For instance, the class “MISC” remained undetected even after employing our approach. Addressing this issue may require additional efforts, potentially involving hybrid methodologies.
Traditional oversampling techniques like SMOTE, Borderline-SMOTE, ADASYN, and random oversampling showed limited impact on improving multi-class NLP models. Although theoretically sound, these methods often lead to overfitting and limited generalization because the sentences generated during oversampling are ungrammatical.
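One way to see why interpolation-based methods can yield text that no longer reads as a sentence is to apply SMOTE's core step to encoded sentences. The sketch below assumes sentences are encoded as fixed-length lists of integer token IDs (hypothetical values) and that oversampling interpolates directly between two such encodings. How features were encoded in the experiments is not stated here, so this is an illustration rather than a reproduction.

```python
import random

def smote_sample(x, neighbor, rng):
    """SMOTE's core step: a random point on the segment between x and a minority-class neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

# Hypothetical token-ID encodings of two minority-class sentences.
sentence_a = [101, 2054, 7, 102]
sentence_b = [101, 48, 3300, 102]

synthetic = smote_sample(sentence_a, sentence_b, random.Random(0))
print(synthetic)

# Positions where the two sentences differ receive in-between values that
# generally do not correspond to either original token.
changed = [i for i, (a, s) in enumerate(zip(sentence_a, synthetic)) if a != s]
print(changed)
```

Token IDs are categorical labels rather than quantities, so a value lying between two IDs has no linguistic meaning; rounding it to the nearest ID would select an arbitrary vocabulary entry. Replacing a token with an embedding neighbor avoids this because every substituted token is a real word.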
Our study highlights the significance of capturing semantic nuances in synthetic data generation. By applying synonyms guided by Word2Vec embeddings, we achieved improved model performance and a better understanding of underlying semantics. The increase in macro F1-score underscores the relevance and potential of semantic-based oversampling techniques in advancing multi-class text-labeled dataset benchmarks.
Conclusion
In this study, we introduced a novel semantic-based oversampling technique to address the critical challenge of class imbalance in NER datasets. Our key innovation was leveraging pre-trained Word2Vec embeddings to generate synthetically augmented data that preserves linguistic semantics. The empirical results demonstrated that using contextually relevant synonyms for underrepresented tokens significantly improved model performance, with a macro F1-score increase of 29.55% compared to the imbalanced baseline. Our approach outperformed conventional oversampling techniques, which struggled to preserve the structural coherence of sentences when generating synthetic samples: although the tags were correctly retained, the structure of the original sentences was lost, compromising the coherence and meaning of the synthetic data. By enabling the creation of richer training data that preserves contextual relationships, our oversampling algorithm takes a significant step toward building more robust NER models, and the semantic knowledge carried by Word2Vec embeddings emerges as a promising means of addressing data imbalance. As the field of NLP continues to evolve, our approach contributes a solution tailored to NER for a critical data preprocessing challenge and holds promise for enhancing NER models across various applications.
In future work, we aim to explore hybrid oversampling methods to address challenges posed by extremely small classes. Semantic-aware oversampling offers immense potential to train deep learning models that generalize accurately across textual datasets.
Footnotes
Acknowledgments
We sincerely thank Dr CHEHILI Hamza, who was Vice Chancellor of Research at Frères Mentouri University Constantine 1, for providing us with a high-performance cluster that was instrumental in conducting our learning and performance tests. This generous support greatly facilitated our research and contributed significantly to the successful execution of our experiments. We also acknowledge the valuable resources and assistance provided by the university, which played a crucial role in advancing our work.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
