Abstract
This study addresses the issue of semantic similarity in sentences using the BERT model through various aggregation techniques, such as max-pooling, mean-pooling, and an LSTM network applied to the output of the BERT model. Subsequently, the linguistic interpretability of the BERT-Base transformer model is analyzed through the unsupervised learning approach, specifically through dimensionality reduction using autoencoders and clustering algorithms, utilizing the representation of the classification token CLS.
The results highlight that the CLS classification token achieves better abstractions than the proposed methods. In terms of interpretability, it is observed that sequence length is relevant in the early layers, with a gradual decrease across the layers. Additionally, attention to semantic similarity is concentrated in the intermediate and upper layers, especially in layers 6, 8, 9, and 10. All these findings were obtained by addressing the semantic similarity task using the STS-Benchmark dataset.
Introduction
Semantic similarity is defined as the degree to which two expressions share similar meanings. Transformer models, such as BERT, have proven to be highly effective in addressing this task, which poses a challenge due to the inherent complexities of language, such as diversity and ambiguity [16]. Semantics explores meaning in language through elements like words or sentences, playing a crucial role in various natural language processing tasks, including information retrieval, question answering, sentiment analysis, machine translation, and text summarization, among others [17, 18].
Considerable effort has gone into improving scores on various tasks with transformer models like BERT, and numerous variants of these models have emerged. However, the understanding of the model, specifically what aspects of language are abstracted by a complex model like BERT and in which layers this abstraction occurs, has been addressed to a lesser extent. Although transformer models are complex and often challenging to interpret, exploring them could provide a deeper understanding of why they work, allowing us to identify vulnerabilities, areas for improvement, and optimization opportunities.
Therefore, it is relevant to study and analyze the BERT model from the perspective of semantic similarity in sentences, focusing on key components such as the classification token [CLS] and attention mechanisms. There are studies that have addressed sentence semantic similarity, mostly using the BERT transformer model through techniques such as pretraining with language inference datasets [1] or data augmentation techniques [2].
Regarding interpretability, most studies have focused on supervised approaches using probes or performance and attention classifiers, relying on the output representations of the BERT model, as observed in [10–13] and [14]. To a lesser extent, research has been conducted on analyzing the attention heads of the transformer, as done by [3] and [14]. Additionally, few works, such as [15], have used visual tools to provide a more detailed insight into what happens at each layer of the model.
This study asks whether the [CLS] token effectively abstracts linguistic aspects of a sequence (sentence pair) to tackle the task of semantic similarity, or whether it would be more beneficial to use the final layer’s output representation of the BERT model for each token in the sequence, applying aggregation methods or even incorporating a recurrent network to obtain a more effective abstraction or representation.
Considering the intrinsic importance of semantics in language, questions are also explored regarding how the BERT model addresses semantic similarity through its self-attention mechanisms. In particular, the aim is to understand which linguistic aspects are abstracted in the self-attention representations at each layer of the model when facing the challenge of solving semantic similarity. The question is formulated around whether these self-attention representations can reveal patterns linked to the degree of semantic similarity between sentences and if they also contain information about superficial aspects such as sequence length and grammatical structure.
The quest for linguistic interpretability poses a challenge, and the inherent complexity of the problem led to the development of a visualization tool that automates certain processes and facilitates exploratory analysis of the BERT model.
Background and related work
The transformer model, created by [4], excelled in natural language processing tasks due to its ability to perform parallel processing and handle lengthy sequences.
BERT (Bidirectional Encoder Representations from Transformers), developed by [5], is based on the architecture of the transformer encoder block proposed by [4]. It incorporates the idea of ELMo (Embeddings from Language Models) from [6] to generate contextualized word representations and adopts the transfer learning approach of the GPT (Generative Pre-Training) model introduced by [7].
The effectiveness of BERT lies in the implementation of the Masked Language Model (MLM) during its pretraining phase, where it predicts masked tokens based on their context. Additionally, it can generate contextualized representations for sentence pairs (next sentence prediction) and adapt to generate representations for more than two sentences. BERT’s architecture also incorporates specific tokens like [CLS], placed at the beginning of the sequence to obtain a contextualized representation of the entire sequence, which is beneficial in classification tasks.
Semantic similarity resolution using BERT
A study highlighting the superiority of the BERT model over its predecessors in the transformer family was conducted by [22]. Additionally, various works have implemented specialized strategies to enhance semantic similarity scores. For instance, in the investigation by [1], training was carried out using a language inference corpus, incorporating a Siamese network and aggregation methods applied to sentences individually. Other approaches, such as the one proposed by [2], achieved positive results by exploring data augmentation techniques.
BERT model interpretability
The exploration of BERT’s interpretability has predominantly been approached through supervised learning methods employing classifiers, although the outcomes vary. In particular, the study by [10] suggests that the upper layers of BERT specialize in semantics, the intermediate layers in syntax, and the lower layers in surface-level information. In the work of [11], it is asserted that BERT performs natural language processing tasks, such as part-of-speech tagging, constituent identification, syntactic dependencies, semantic roles, and coreference resolution, in an interpretable and adaptable manner. From their perspective, syntax manifests in the early layers, while semantic aspects emerge in the upper layers. On the other hand, the study by [12] contends that semantics and syntax develop in intermediate and upper layers, while superficial tasks occur in the lower layers. They argue that semantic tasks achieve optimal performance in intermediate layers (layers 6-9), and individual layers of BERT do not encapsulate the alleged processing pipeline proposed by [11].
The analysis conducted by [13] focuses on examining the geometric structure of embeddings in the BERT model. The results indicate that word meanings form distinguishable clusters in the vector space, revealing the existence of both syntactic and semantic subspaces. Furthermore, the study by [19] uncovers how BERT intertwines semantics with aspects such as syntax and sentiment.
In the research by [14], attention heads in BERT were thoroughly analyzed. It was observed that some attention heads focused on specific linguistic aspects, such as attention to the previous and next token, coreference resolution, among others. However, a substantial number of heads did not exhibit a clear focus on a particular feature or relationship. It was noted that in the early layers of BERT, attention distributions displayed high entropy, which decreased in later layers. In the work by [20], the self-attentions of the model are also examined, revealing the presence of attention patterns that are repeated across different heads. They observe that deactivating attention in certain heads leads to an improvement in the model’s performance.
Methodology
To carry out a linguistic and interpretative analysis of the BERT model, the base version of BERT was used, which consists of 12 encoder blocks, 12 attention heads in each layer, and a hidden dimension of 768. It is important to note that this analysis is conducted under the approach of semantic similarity and is based on the observation of two elements of the model: the classification token [CLS] and the attention heads.
To carry out the analysis of both components of the model (CLS and attentions), the STS-Benchmark dataset (Semantic Textual Similarity Benchmark) [21] was employed. This dataset focuses on the task of semantic similarity for pairs of sentences in English. The dataset labels similarity on a continuous scale from 0 to 5, where 0 indicates no similarity and 5 represents perfect similarity.
The abstraction provided by the [CLS] token is effective in capturing linguistic aspects of a sequence. However, the question arises as to whether it would be more beneficial to use the final output representation of the last layer for each token in a sequence and apply aggregation methods or even subject it to an evaluation process through a recurrent network.
In this approach, a series of tests were conducted using the [CLS] token directly. Two aggregation methods were proposed: one based on the mean-pooling and another on the max-pooling of the last representation generated by the BERT model. Additionally, a bidirectional LSTM-type recurrent network was implemented. The results of these aggregation methods, along with the representation of the [CLS] token, were used as input for a linear regressor with the aim of obtaining the actual value of semantic similarity.
In the case of the LSTM recurrent network, the output of the BERT model was used directly as input, and the final hidden state of the recurrent network was taken as the output, which was then fed into the linear regressor.
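The three non-recurrent representations described above can be sketched as follows. This is a minimal illustration in plain Python rather than the PyTorch code used in the study; the function names, the toy values, and the use of an attention mask to skip padding positions are assumptions made for the example.

```python
from typing import List

Vector = List[float]


def cls_pooling(hidden: List[Vector]) -> Vector:
    """Return the vector at position 0, where BERT places [CLS]."""
    return list(hidden[0])


def mean_pooling(hidden: List[Vector], mask: List[int]) -> Vector:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    dim = len(hidden[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def max_pooling(hidden: List[Vector], mask: List[int]) -> Vector:
    """Take the element-wise maximum over the non-padding token vectors."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    dim = len(hidden[0])
    return [max(vec[d] for vec in kept) for d in range(dim)]


# Toy last-layer output: 4 positions and a hidden size of 3
# (BERT-Base would use a hidden size of 768).
hidden_states = [
    [0.5, -1.0, 2.0],  # [CLS]
    [1.0, 0.0, -2.0],
    [3.0, 2.0, 0.0],
    [9.0, 9.0, 9.0],   # padding, ignored through the mask
]
attention_mask = [1, 1, 1, 0]

cls_vec = cls_pooling(hidden_states)
mean_vec = mean_pooling(hidden_states, attention_mask)
max_vec = max_pooling(hidden_states, attention_mask)
```

Each resulting vector has the hidden dimension of the model and would be passed to the linear regressor to predict the similarity score.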
The PyTorch library was used for developing machine learning algorithms, and Hugging Face’s Transformers library was used for data processing and fine-tuning. The Optuna framework was used for hyperparameter optimization of the four configurations. The configuration for each of the proposed methods can be seen in Table 1.
Hyperparameters of the proposed methods
Model performance was measured using the Pearson correlation coefficient ($\rho_p$), the Spearman correlation coefficient ($\rho_s$), and the mean squared error (MSE).
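For reference, the three metrics can be computed as in the sketch below. It uses only the Python standard library; the helper names and the toy scores are hypothetical, and Spearman's coefficient is obtained as the Pearson correlation of average ranks, which is the usual definition when ties are present.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mse(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean squared error between predictions and gold labels."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(pred)


# Toy gold labels on the 0 to 5 STS scale and invented model predictions.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
pred = [0.4, 1.0, 3.5, 3.9, 4.8]

rho_p = pearson(pred, gold)
rho_s = spearman(pred, gold)
error = mse(pred, gold)
```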
Dimensionality reduction was performed to explore and analyze patterns in the self-attentions of the semantic similarity sequences in a two-dimensional space.
The attentions of every head in every layer of the base BERT model were extracted after fine-tuning the model on the semantic similarity task with the classification token [CLS], for each sequence in the test set of the STS-Benchmark dataset. These attentions were treated as representations for analysis. For the representation of a specific layer, the attentions of its 12 heads were flattened. In Fig. 1, the first subscript of the attention matrix A refers to the layer l, the second subscript represents the head number h, and the third subscript refers to the position of a specific attention weight (i, j), that is, $A_{l,h,(i,j)}$. In addition, n represents the number of tokens in a sequence.
Subsequently, we utilized a dimensionality reduction technique employing a recurrent LSTM network-based autoencoder to reduce the dimensions to just 2, aiming to simplify the visualization and analysis of the model’s abstractions. The batch size was set to 12, corresponding to the number of layers in the base BERT model. The sequence dimension was defined according to the length of the flattened attention for each head, while the number of features expected at each time step of the LSTM network is equal to the number of heads in a layer of the base BERT model, which is 12. The structure of the input data is illustrated in Fig. 2.

Stacked head attentions for each layer of the BERT model. Each layer contains the 12 heads with flattened attention matrices.

Structure of the representations used as input for the autoencoder model.
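The arrangement of the attentions into the autoencoder input can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, each n × n attention matrix is flattened row by row, and the toy values simply encode their own indices so that the layout can be checked. They are not real attention weights.

```python
from typing import List

Matrix = List[List[float]]


def flatten_head(matrix: Matrix) -> List[float]:
    """Flatten one n x n attention matrix row by row into n*n values."""
    return [weight for row in matrix for weight in row]


def layer_to_timesteps(layer_attn: List[Matrix]) -> List[List[float]]:
    """[heads][n][n] -> [n*n][heads]: one time step per attention position."""
    flat_heads = [flatten_head(head) for head in layer_attn]
    return [list(step) for step in zip(*flat_heads)]


def build_autoencoder_input(attentions: List[List[Matrix]]) -> List[List[List[float]]]:
    """[layers][heads][n][n] -> [layers][n*n][heads] (batch, sequence, features)."""
    return [layer_to_timesteps(layer) for layer in attentions]


NUM_LAYERS, NUM_HEADS, N_TOKENS = 12, 12, 3

# Placeholder tensor: the value at (l, h, i, j) is l*1000 + h*100 + i*10 + j.
attentions = [
    [
        [[float(l * 1000 + h * 100 + i * 10 + j) for j in range(N_TOKENS)]
         for i in range(N_TOKENS)]
        for h in range(NUM_HEADS)
    ]
    for l in range(NUM_LAYERS)
]

batch = build_autoencoder_input(attentions)
shape = (len(batch), len(batch[0]), len(batch[0][0]))
```

For one sequence, the result has a batch size of 12 (layers), a sequence dimension of n² (flattened attention positions), and 12 features per time step (heads), matching the description above.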
The autoencoder model was trained using the mean squared error (MSE) loss function, the Adam optimizer, and a learning rate of 1e-3.
After the dimensional reduction, we proceeded to perform a clustering analysis using scatter plots. This analysis was applied to the vectors of all sequences in the dataset for each layer. Given the large number of vectors, the complexity of clustering strategies, the diversity of algorithms and similarity or dissimilarity metrics, as well as the hyperparameters of each algorithm, a visual tool was developed to automate the execution of tests and hyperparameter configurations.
This tool facilitated testing with various clustering algorithms, such as k-means, DBSCAN, hierarchical clustering, spectral clustering, and gaussian mixture. It also included a visual tool that allowed defining value ranges and highlighting them in the scatter plots, improving the understanding of data behavior.
To assess whether there were differences or similarities in the distribution of features among data groups and to understand the overall distribution, box plots and statistical tests such as Kruskal-Wallis and Mann-Whitney U were used. The null hypothesis considered no significant differences between groups, in contrast to the alternative hypothesis suggesting the presence of differences between groups.
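As an illustration of the first of these tests, the Kruskal-Wallis H statistic can be computed from the pooled ranks of the groups. The sketch below is a simplified standard-library version with hypothetical function names and toy data; it omits the correction factor for ties, and the p-value (not computed here) would be obtained from a chi-square distribution with k - 1 degrees of freedom, where k is the number of groups.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """H = 12 / (N (N + 1)) * sum(R_g^2 / n_g) - 3 (N + 1), without tie correction."""
    pooled = [value for group in groups for value in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)


# Toy feature values (for example, similarity labels) grouped by cluster.
separated = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
interleaved = [[1.0, 4.0, 7.0], [2.0, 5.0, 8.0], [3.0, 6.0, 9.0]]

h_separated = kruskal_wallis_h(separated)
h_interleaved = kruskal_wallis_h(interleaved)
```

A larger H indicates stronger evidence against the null hypothesis that all groups share the same distribution; the well-separated toy clusters yield a much larger H than the interleaved ones.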
To evaluate clustering quality, intrinsic metrics such as the silhouette coefficient, the Davies-Bouldin index, and the Calinski-Harabasz index were used, which were helpful in determining the optimal number of clusters. Determining the optimal number of clusters involved running executions for different values of K. The decision on the optimal number of clusters was based on the consensus of the metrics and the context of data analysis. More details about the algorithms and hyperparameters used can be found in Table 2.
In the hyperparameters column of Table 2, the distance metric used is defined for the k-means algorithm, the distance and linkage type are specified for the agglomerative algorithm, and the type of covariance used is defined for the gaussian mixture algorithm.
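To make one of these intrinsic metrics concrete, the sketch below computes the Calinski-Harabasz index, the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. It is a plain-Python illustration with hypothetical function names and invented two-dimensional points, not the implementation used in the study.

```python
from typing import Dict, List, Sequence

Point = Sequence[float]


def centroid(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a set of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points: Sequence[Point], labels: Sequence[int]) -> float:
    """CH = (B / (k - 1)) / (W / (n - k)); higher values mean better separation."""
    clusters: Dict[int, List[Point]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between, within = 0.0, 0.0
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * squared_distance(center, overall)
        within += sum(squared_distance(p, center) for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Toy 2-D embeddings forming two compact, well-separated groups.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]
poor_labels = [0, 1, 0, 1, 0, 1, 0, 1]

ch_good = calinski_harabasz(points, good_labels)
ch_poor = calinski_harabasz(points, poor_labels)
```

Running such a metric for several values of K and comparing it with the other indices is one way to reach the consensus described above.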
The choice of the clustering algorithm was based on the evaluation of the five available algorithms. Results were compared using intrinsic metrics, selecting the algorithm with the best performance in these metrics and the ability to generate easily interpretable clusters. The visual tool allowed the definition of user-defined value ranges and the simultaneous visualization of clusters generated by each algorithm. This approach provided additional guidance for selecting the algorithm with greater interpretability.
In general, in the layers where DBSCAN was applied, intrinsic metrics yielded poor results. The visual tool also indicated that the clusters generated by DBSCAN were not optimal, as it identified as outliers elements considered relevant in the analysis. It’s worth noting that the spectral algorithm was not included in the results in Table 2, as the evaluations of intrinsic metrics and interpretability of clusters in the other three algorithms (k-means, agglomerative, and gaussian mixture) were superior.
Clustering algorithms used for each layer analysis
The first analysis aimed to identify patterns generated by the model based on the semantic similarity of the samples. The data distribution, labeled according to their semantic similarity, is shown in the box plot in Fig. 3. It can be observed that the distribution of labels does not have outliers in either direction but exhibits a slight left skew, indicating that the labels are concentrated toward the higher semantic similarity values.

Distribution of the STS-Benchmark test set with respect to semantic similarity labels.
In addition to box plots, the visual tool designed for clustering analysis also allows for the input of ranges of semantic similarity. It highlights the respective samples and provides the capability to contrast them with the groups identified by clustering algorithms, thereby contributing to a better understanding of data behavior.
The length of each sequence in the STS-Benchmark test set was obtained from its tokenization with BERT’s WordPiece. The data distribution according to sequence length is presented in the box plot in Fig. 4. It can be observed that the dataset exhibits positive skewness due to the presence of very long sequences.
To conduct this analysis, we developed a visual tool that allows the visualization of sequence lengths within their respective groups. This tool facilitates the identification of minimum and maximum sequence length values and enables the definition of sequence length ranges specified by the user, which are highlighted in the scatter plots.

Distribution of the STS-Benchmark test set with respect to sequence length.
For this analysis, the grammatical structures present in all sequences from the test set were identified. The cosine similarity of the most frequent grammatical structure with respect to the rest of the structures was calculated, and these values were used as an analysis variable.
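One possible way to obtain such a similarity value is sketched below. How the grammatical structures were vectorized is not stated here, so the sketch assumes a bag-of-labels representation in which each structure is the count of its dependency labels. The label sequences are invented for illustration and are not the ones from the study.

```python
from collections import Counter
from math import sqrt
from typing import List


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def structure_vector(dependency_labels: List[str]) -> Counter:
    """Assumed representation: how often each dependency label occurs."""
    return Counter(dependency_labels)


# Hypothetical dependency-label sequences (illustrative only).
reference = structure_vector(["det", "nsubj", "aux", "ROOT", "det", "dobj"])
identical = structure_vector(["det", "nsubj", "aux", "ROOT", "det", "dobj"])
different = structure_vector(["nsubj", "ROOT", "prep", "det", "pobj"])

sim_identical = cosine_similarity(reference, identical)
sim_different = cosine_similarity(reference, different)
```

Under this assumption, a value close to 1 means that a sequence uses nearly the same mix of dependency labels as the most frequent structure. A count-based representation ignores label order, so an order-sensitive encoding could give different values.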
The most frequent grammatical structure is shared by 32 sequences in the test set. This structure consists of two sentences evaluated in terms of their semantic similarity according to the STS-Benchmark dataset. Both the first and the second sentence share the same dependency labels, which include:
Analysis and results
Classification token CLS
Unlike the approaches used by [1] and [2], our work is not based on highly specialized techniques to improve semantic similarity scores. Our experimental tests focused on evaluating aggregation methods and a BiLSTM network, aiming to determine if any of these methods could surpass the abstraction capability of the CLS token.
None of the methods implemented during the fine-tuning process, as detailed in Table 1, managed to surpass the abstraction capacity provided by the classification token [CLS] when evaluating semantic similarity. In fact, it was observed that the abstraction provided by the [CLS] token was slightly superior to that obtained by other methods, as shown in Table 3.
To explain the results showing the slight superiority of the [CLS] token compared to other methods, let’s consider how BERT evaluates the relationships and relevance between words. The dot product in self-attentions provides a measure of these relationships within the sequence. The [CLS] token measures the context’s global relationship with all the words. As information is abstracted and propagates in the higher layers, BERT captures more complex and contextual relationships between words through the [CLS] token.
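The role of the dot product can be illustrated with a single attention head, as in the sketch below. It is a simplified, standard-library version of scaled dot-product attention for the [CLS] position only: the query, key, and value vectors are toy numbers given directly, whereas BERT derives them from learned projections, and the function names are hypothetical.

```python
from math import exp, sqrt
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(query: Sequence[float], keys: Sequence[Sequence[float]]) -> List[float]:
    """Attention weights of one query over all keys: softmax(q . k / sqrt(d))."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / sqrt(d) for key in keys]
    return softmax(scores)


def attend(query: Sequence[float],
           keys: Sequence[Sequence[float]],
           values: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of the value vectors using the attention weights."""
    weights = attention_row(query, keys)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Toy head with dimension 2 over three positions: [CLS] and two tokens.
cls_query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cls_weights = attention_row(cls_query, keys)
cls_context = attend(cls_query, keys, values)
```

The row of weights for [CLS] sums to 1 and distributes attention over every position in the sequence, so the resulting context vector mixes information from all tokens in proportion to their dot-product relevance.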
Results of the evaluation of the STS-Benchmark dataset test set using the aggregation methods and LSTM network used in fine-tuning the BERT model
It’s important to note that the relationships between words can vary in their level of relevance. When applying a mean-pooling operation to the model’s output, we are considering word relationships with the same level of relevance, which may not be appropriate due to the presence of stop words that can establish less relevant relationships in semantic analysis. On the other hand, the max-pooling method may lose significant contextual information. Additionally, it should be considered that BERT already provides high-quality abstraction, limiting the ability of a recurrent LSTM network to extract additional useful information in this context.
The dimensional reduction by the bidirectional LSTM autoencoder resulted in a training loss of 0.0447. In Fig. 5, the scatter plots and their respective clusters are visible. It’s observed that the data tends to have more dispersion in the earlier layers, while the last two layers exhibit more compact and well-defined clusters. These observations are supported by intrinsic metrics that evaluate the clustering quality, as shown in Table 4.

Scatter plots with clustering for all layers of the BERT model.
Results of the intrinsic cluster quality evaluation metrics
In Fig. 6, box plots for each layer of the model with respect to the semantic similarity label of the data are presented. It is worth noting that in layer 9, the groups exhibit different interquartile ranges, suggesting distinct groupings in terms of their central tendencies. In the case of layer 10, cluster 5 shows significantly high values of semantic similarity, while cluster 2 concentrates over 50% of its data within a range of values between 0.4 and 2.2. The other clusters display intermediate values with partial similarity between groups. These findings suggest that in layers 9 and 10, semantic similarity becomes relevant.

Box plot for semantic similarity analysis.
Table 5 presents the probability (p) values derived from Kruskal-Wallis and Mann-Whitney U analyses concerning semantic similarity. From these results, it can be observed that layers 6, 8, 9, and 10 exhibit a clear clustering trend in terms of semantic similarity, as evidenced by the extremely low p-values. This suggests strong evidence against the null hypothesis for these layers. These findings corroborate the observations made in the box plot, particularly in the case of layers 9 and 10.
In Fig. 8, scatterplots for layers 2, 7, 10, and 12 are presented, with semantic similarities of 0 and 5. The dispersion of samples in layers 2 and 12 is evident, while in layer 10, samples with similarity 5 predominate in the lower part, and labels with zero similarity are found in the upper part.

Box plot for sequence length.

Scatterplots highlighting the samples with semantic similarities of 0 and 5.
In Fig. 7, box plots are presented in relation to the sequence lengths. Although there are 15 clusters in layer 1, it is observed that some of them intersect in the same interquartile range, while others show clear differences in values. The overlapping interquartile ranges in the box plots are quite close to each other. For example, group 10 contains very long sentences (with more than 60 tokens), and the adjacent group, group 4, concentrates its data between 45 and 60 tokens. On the other hand, groups 3 and 14, which are contiguous, exhibit the lowest values, while the remaining groups, which are also contiguous, show values in the center. This trend is more noticeable in layer 3. In layers 9, 11, and 12, by contrast, the interquartile ranges overlap, suggesting similarities in the clusters and a decrease in the model’s attention to sentence length.
The results of the p-value from Kruskal-Wallis and Mann-Whitney U tests described in Table 5 again reflect agreement with the box plots; the lower layers are particularly fixated on sequence length. However, a gradual loss of the model’s attention to sequence length can be observed.
Results of the p-value for Kruskal-Wallis and Mann-Whitney U for semantic similarity, sequence length, and grammatical structure
In Fig. 9, scatter plots for layers 1, 4, 8, and 12 are presented. These plots, generated with the visual tool, color-code the samples with sequence lengths in the ranges of 9–15, 20–22, and 60–81 tokens. It is evident that layer 1 of the model maintains a clear separation between sequence lengths; in later layers, short and medium-length sequences cluster together, while long sequences tend to remain separate.

Scatter plots of layers 1, 4, 8, and 12 indicating sequence length ranges.
The analysis of grammatical structure can be conducted based on the box plot shown in Fig. 10, which reveals greater variability in most of the layers. In most cases, the interquartile ranges overlap or show slight variations. For example, in layers 3, 7, and 8, it is evident that the groups contain similar values, with a concentration around 0.6 similarity. On the other hand, layers 1, 2, 4, 5, 6, 9, and 10 also exhibit high variability but with partial overlap, indicating that the groups are somewhat less similar to each other. As for layers 11 and 12, they show groups with more than 50% of their values different from each other.

Box plot to analyze the structural similarity of the most frequent sequence with respect to the remaining structures.
The results of the Kruskal-Wallis and Mann-Whitney U p-values shown in Table 5 reveal that the later layers of the model have a more pronounced focus on grammatical structure, while the other layers show existing but less defined attention to this structure.
The visual tool reveals that samples with similar structures tend to stay close to each other across all layers of the model, as can be seen in Fig. 11. Although a pattern or trend is observed in samples with similar grammatical structures, this pattern is not clearly defined and sometimes presents challenges when attempting to group them coherently.

Scatter plots of layers 2, 3, 8 and 12, highlighting the most frequent sequence and those with similar structures that reach a similarity level of 0.95 or higher.
The CLS token demonstrates the generation of high-quality abstractions that are challenging to surpass, even for a recurrent LSTM network. Additionally, the results suggest that aggregation methods can, in some cases, lead to information loss by assuming that all relationships between words captured by the model have the same relevance, whether by averaging them or selecting the best one.
These findings align with previous observations, as mentioned in [14], where increased dispersion in self-attentions is identified in the early layers of the model, in contrast to decreased dispersion in the later layers. This is reflected in the quality of the achievable clusters.
This approach to linguistic interpretation through unsupervised learning, rather than relying on specialized probes and corpora, offers valuable insights for research. Similar to [10], we have identified that aspects such as sequence length are prominent in the early layers of the model and gradually lose significance in the later layers. Concerning semantic similarity, there is greater activity in layers 9 and 10 of the model, with hints of a trend in layers 6 and 8, supporting previous claims by [12]. As for syntax, these findings indicate that it is not entirely clear which layers of the model focus on grammar. However, they do reveal a tendency towards clustering similar grammatical structures across multiple layers. Conducting a more detailed analysis and exploring various approaches would be valuable to determine whether this pattern persists in grammatical structure.
Furthermore, it would be beneficial to conduct a deeper semantic analysis to ascertain whether the clusters are abstracting concepts and if the model can discern homonyms and synonyms based on context.
By confirming that the results of this study align with previous state-of-the-art research, it can be concluded that both the method used to generate abstractions from BERT model attentions and the autoencoder model design achieved effective abstraction, despite the substantial dimensionality reduction. However, conducting tests and cluster evaluations with higher dimensionality would be an interesting avenue for future research.
