Exploring the interpretability of the BERT model for semantic similarity

Abstract

This study addresses the issue of semantic similarity in sentences using the BERT model through various aggregation techniques, such as max-pooling, mean-pooling, and an LSTM network applied to the output of the BERT model. Subsequently, the linguistic interpretability of the BERT-Base transformer model is analyzed through the unsupervised learning approach, specifically through dimensionality reduction using autoencoders and clustering algorithms, utilizing the representation of the classification token CLS.

The results highlight that the CLS classification token achieves better abstractions than the proposed methods. In terms of interpretability, it is observed that sequence length is relevant in the early layers, with a gradual decrease across the layers. Additionally, attention to semantic similarity is concentrated in the intermediate and upper layers, especially in layers 6, 8, 9, and 10. All these findings were obtained by addressing the semantic similarity task using the STS-Benchmark dataset.

Keywords

Linguistic interpretability, aggregation methods, unsupervised learning, attention mechanisms, token CLS

1 Introduction

Semantic similarity is defined as the degree to which two expressions share similar meanings. Transformer models, such as BERT, have proven to be highly effective in addressing this task, which poses a challenge due to the inherent complexities of language, such as diversity and ambiguity [16]. Semantics explores meaning in language through elements like words or sentences, playing a crucial role in various natural language processing tasks, including information retrieval, question answering, sentiment analysis, machine translation, and text summarization, among others [17, 18].

Considerable effort has gone into improving scores on various tasks with transformer models like BERT, and numerous variants of these models have emerged. However, the understanding of the model, specifically what aspects of language are abstracted by a complex model like BERT and in which layers this abstraction occurs, has been addressed to a lesser extent. Although transformer models are complex and often challenging to interpret, exploring them could provide a deeper understanding of why they work, allowing us to identify vulnerabilities, areas for improvement, and optimization opportunities.

Therefore, it is relevant to study and analyze the BERT model from the perspective of semantic similarity in sentences, focusing on key components such as the classification token [CLS] and attention mechanisms. There are studies that have addressed sentence semantic similarity, mostly using the BERT transformer model through techniques such as pretraining with language inference datasets [1] or data augmentation techniques [2].

Regarding interpretability, most studies have focused on supervised approaches using probes or performance and attention classifiers, relying on the output representations of the BERT model, as observed in [10–14]. To a lesser extent, research has been conducted on analyzing the attention heads of the transformer, as done by [3] and [14]. Additionally, few works have sought, through visual tools like [15], to provide a more detailed insight into what happens at each layer of the model.

In this study, the question is posed of whether the [CLS] token effectively abstracts linguistic aspects of a sequence (sentence pair) to tackle the task of semantic similarity, or if it would be more beneficial to use the final layer’s output representation of the BERT model for each token in the sequence, applying aggregation methods or even incorporating a recurrent network to obtain a more effective abstraction or representation.

Considering the intrinsic importance of semantics in language, questions are also explored regarding how the BERT model addresses semantic similarity through its self-attention mechanisms. In particular, the aim is to understand which linguistic aspects are abstracted in the self-attention representations at each layer of the model when facing the challenge of solving semantic similarity. The question is formulated around whether these self-attention representations can reveal patterns linked to the degree of semantic similarity between sentences and if they also contain information about superficial aspects such as sequence length and grammatical structure.

The quest for linguistic interpretability poses a challenge, and the inherent complexity of the problem led to the development of a visualization tool that automates certain processes and facilitates exploratory analysis of the BERT model.

2 Background and related work

The transformer model, created by [4], excelled in natural language processing tasks due to its ability to perform parallel processing and handle lengthy sequences.

BERT (Bidirectional Encoder Representations from Transformers), developed by [5], is based on the architecture of the transformer encoder block proposed by [4]. It incorporates the idea of ELMo (Embeddings from Language Models) from [6] to generate contextualized word representations and adopts the transfer learning approach of the GPT (Generative Pre-Training) model introduced by [7].

The effectiveness of BERT lies in the implementation of the Masked Language Model (MLM) during its pretraining phase, where it predicts masked tokens based on their context. Additionally, it can generate contextualized representations for sentence pairs (next sentence prediction) and adapt to generate representations for more than two sentences. BERT’s architecture also incorporates specific tokens like [CLS], placed at the beginning of the sequence to obtain a contextualized representation of the entire sequence, which is beneficial in classification tasks.

2.1 Semantic similarity resolution using BERT

A study highlighting the superiority of the BERT model over its predecessors in the transformer family was conducted by [22]. Additionally, various works have implemented specialized strategies to enhance semantic similarity scores. For instance, in the investigation by [1], training was carried out using a language inference corpus, incorporating a Siamese network and aggregation methods applied to sentences individually. Other approaches, such as the one proposed by [2], achieved positive results by exploring data augmentation techniques.

2.2 BERT model interpretability

The exploration of BERT’s interpretability has predominantly been approached through supervised learning methods employing classifiers, although the outcomes vary. In particular, the study by [10] suggests that the upper layers of BERT specialize in semantics, the intermediate layers in syntax, and the lower layers in surface-level information. In the work of [11], it is asserted that BERT performs natural language processing tasks, such as part-of-speech tagging, constituent identification, syntactic dependencies, semantic roles, and coreference resolution, in an interpretable and adaptable manner. From their perspective, syntax manifests in the early layers, while semantic aspects emerge in the upper layers. On the other hand, the study by [12] contends that semantics and syntax develop in intermediate and upper layers, while superficial tasks occur in the lower layers. They argue that semantic tasks achieve optimal performance in intermediate layers (layers 6-9), and individual layers of BERT do not encapsulate the alleged processing pipeline proposed by [11].

The analysis conducted by [13] focuses on examining the geometric structure of embeddings in the BERT model. The results indicate that word meanings form distinguishable clusters in the vector space, revealing the existence of both syntactic and semantic subspaces. Furthermore, the study by [19] uncovers how BERT intertwines semantics with aspects such as syntax and sentiment.

In the research by [14], attention heads in BERT were thoroughly analyzed. It was observed that some attention heads focused on specific linguistic aspects, such as attention to the previous and next token, coreference resolution, among others. However, a substantial number of heads did not exhibit a clear focus on a particular feature or relationship. It was noted that in the early layers of BERT, attention distributions displayed high entropy, which decreased in later layers. In the work by [20], the self-attentions of the model are also examined, revealing the presence of attention patterns that are repeated across different heads. They observe that deactivating attention in certain heads leads to an improvement in the model’s performance.

3 Methodology

In order to carry out a linguistic and interpretative analysis of the BERT model, the base version of BERT was used, which consists of 12 encoder blocks, 12 attention heads in each layer, and a hidden dimension of 768. It is important to note that this analysis is conducted under the approach of semantic similarity and is based on the observations of the following elements of the model:

- Classification token [CLS]
- Attention heads

To carry out the analysis of both components of the model (CLS and attentions), the STS-Benchmark dataset (Semantic Textual Similarity Benchmark) [21] was employed. This dataset focuses on the task of semantic similarity for pairs of sentences in English. The dataset labels similarity on a continuous scale from 0 to 5, where 0 indicates no similarity and 5 represents perfect similarity.

3.1 Comparison of the [CLS] token with other abstraction methods

The abstraction provided by the [CLS] token is effective in capturing linguistic aspects of a sequence. However, the question arises as to whether it would be more beneficial to use the final output representation of the last layer for each token in a sequence and apply aggregation methods or even subject it to an evaluation process through a recurrent network.

In this approach, a series of tests were conducted using the [CLS] token directly. Two aggregation methods were proposed: one based on the mean-pooling and another on the max-pooling of the last representation generated by the BERT model. Additionally, a bidirectional LSTM-type recurrent network was implemented. The results of these aggregation methods, along with the representation of the [CLS] token, were used as input for a linear regressor with the aim of obtaining the actual value of semantic similarity.

In the case of the LSTM recurrent network, the output of the BERT model was used directly as input, and the final hidden state of the recurrent network was taken as the output, which was then fed into the linear regressor.
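The three non-recurrent representations compared here can be illustrated with a minimal NumPy sketch. It is not the authors' fine-tuning code: the function name `pool_last_layer`, the toy matrix, and the choice to exclude padding positions from the pooling are assumptions made only for illustration.

```python
import numpy as np

def pool_last_layer(hidden, mask):
    """Return the [CLS], mean-pooled and max-pooled vectors of one sequence.

    hidden: (seq_len, dim) last-layer token representations.
    mask:   (seq_len,) 1 for real tokens, 0 for padding (assumed convention).
    """
    hidden = np.asarray(hidden, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    cls_vec = hidden[0]                    # [CLS] is the first token of the sequence
    mean_vec = hidden[mask].mean(axis=0)   # average over real tokens only
    max_vec = hidden[mask].max(axis=0)     # element-wise maximum over real tokens
    return cls_vec, mean_vec, max_vec

# Toy last-layer output: 3 real tokens followed by 1 padding position, dim = 2.
hidden = np.array([[1.0, 0.0], [3.0, 4.0], [5.0, -2.0], [9.0, 9.0]])
mask = np.array([1, 1, 1, 0])
cls_vec, mean_vec, max_vec = pool_last_layer(hidden, mask)
```

Each of the three vectors would then be passed to the linear regressor; the sketch stops before that step.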

The PyTorch library was used for developing machine learning algorithms, and the Hugging Face’s Transformers library was used for data processing and fine-tuning. The Optuna framework was used for hyperparameter optimization of the four configurations. The configuration for each of the proposed methods can be seen in Table 1.

Table 1
Hyperparameters of the proposed methods

Parameter	CLS	MEAN	MAX	LSTM
Learning rate	2.4e-5	3.1e-5	2.8e-5	3.6e-5
Dropout	0.068	0.06	0.04	0.8
Dropout (LSTM)	N/A	N/A	N/A	0.6
Dense layers	1 (all methods)
Batch	32 (all methods)
Sentence length	128 (all methods)
Epochs	5 (all methods)
Optuna runs	150 (all methods)
Loss function	Mean Squared Error (MSELoss) (all methods)
Optimizer	Root Mean Square Propagation (RMSprop) (all methods)

Performance measurement of the model was done using the Pearson correlation coefficient (ρ_p), the Spearman correlation coefficient (ρ_s), and the mean squared error (MSE).
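For reference, the three measures can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not the evaluation script used in the study; the `gold` and `pred` values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mse(pred, gold):
    """Mean squared error between predictions and gold labels."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)

# Invented similarity labels on the 0-5 STS scale and invented predictions.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
pred = [0.4, 1.1, 3.3, 3.9, 4.8]
rho_p, rho_s, err = pearson(pred, gold), spearman(pred, gold), mse(pred, gold)
```

Because the invented predictions preserve the ordering of the labels, the Spearman coefficient is 1 here even though the Pearson coefficient and the MSE still register the numeric deviations.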

3.2 Dimensionality reduction of self-attentions

Dimensionality reduction was performed to explore and analyze patterns in the self-attentions of the semantic similarity sequences in a two-dimensional space.

The attentions of each head and all layers of the base BERT model were extracted by addressing the semantic similarity task through fine-tuning. This was done using the classification token CLS for each sequence in the test set of the STS-Benchmark dataset. These attentions were treated as representations for analysis. For the representation of a specific layer, the attentions of its 12 heads were flattened. In Fig. 1, the first subscript of the attention matrix A refers to the layer l, the second subscript represents the head number h, and the third subscript refers to the position of a specific attention weight (i, j), that is, A_l,h,(i,j). In addition, n represents the number of tokens in a sequence.

Subsequently, we utilized a dimensionality reduction technique employing a recurrent LSTM network-based autoencoder to reduce the dimensions to just 2, aiming to simplify the visualization and analysis of the model’s abstractions. The batch size was set to 12, corresponding to the number of layers in the base BERT model. The sequence dimension was defined according to the length of the flattened attention for each head, while the number of features expected at each time step of the LSTM network is equal to the number of heads in a layer of the base BERT model, which is 12. The structure of the input data is illustrated in Fig. 2.
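The input layout described above can be sketched with NumPy as follows. The array names, the random stand-in for real attention weights, and the row-major flattening order are assumptions; the text does not specify how sequences of different lengths were handled, so the sketch uses a single sequence of n tokens.

```python
import numpy as np

NUM_LAYERS, NUM_HEADS = 12, 12   # BERT-Base

def layer_input(attn_layer):
    """Flatten each head of one layer: (heads, n, n) -> (n*n, heads).

    Rows are the flattened positions (i, j), i.e. the LSTM time steps;
    columns are the heads, i.e. the features at each time step.
    """
    heads, n, _ = attn_layer.shape
    return attn_layer.reshape(heads, n * n).T

def autoencoder_batch(attn_all):
    """Stack all layers of one sequence: (layers, heads, n, n) -> (layers, n*n, heads)."""
    return np.stack([layer_input(layer) for layer in attn_all])

n = 5                                                     # tokens in the toy sequence
rng = np.random.default_rng(0)
scores = rng.random((NUM_LAYERS, NUM_HEADS, n, n))
attn_all = scores / scores.sum(axis=-1, keepdims=True)    # rows sum to 1, like softmax output
batch = autoencoder_batch(attn_all)                       # shape (12, 25, 12)
```

With this layout, the entry `batch[l, i * n + j, h]` holds the weight A_l,h,(i,j), so the batch dimension indexes layers and the feature dimension indexes heads, as in the description above.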

Fig. 1

Stacked head attentions for each layer of the BERT model. Each layer contains the 12 heads with flattened attention matrices.

Fig. 2

Structure of the representations used as input for the autoencoder model.

The autoencoder model was trained using the mean squared error (MSE) loss function, the Adam optimizer, and a learning rate of 1e-3.

3.3 Attention analysis through clustering

After the dimensional reduction, we proceeded to perform a clustering analysis using scatter plots. This analysis was applied to the vectors of all sequences in the dataset for each layer. Given the large number of vectors, the complexity of clustering strategies, the diversity of algorithms and similarity or dissimilarity metrics, as well as the hyperparameters of each algorithm, a visual tool was developed to automate the execution of tests and hyperparameter configurations.

This tool facilitated testing with various clustering algorithms, such as k-means, DBSCAN, hierarchical clustering, spectral clustering, and gaussian mixture. It also included a visual tool that allowed defining value ranges and highlighting them in the scatter plots, improving the understanding of data behavior.

To assess whether there were differences or similarities in the distribution of features among data groups and to understand the overall distribution, box plots and statistical tests such as Kruskal-Wallis and Mann-Whitney U were used. The null hypothesis considered no significant differences between groups, in contrast to the alternative hypothesis suggesting the presence of differences between groups.
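As an illustration of the first of these tests, the Kruskal-Wallis H statistic can be computed in plain Python as sketched below. The function name and the toy groups are assumptions. The p-value, which comes from a chi-square distribution with k - 1 degrees of freedom for k groups, is left out; in practice a library routine such as `scipy.stats.kruskal` returns both values.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard tie correction.

    groups: list of lists, one list of feature values per cluster.
    Assumes the pooled values are not all identical (otherwise the
    tie correction divides by zero).
    """
    pooled = [(v, g) for g, values in enumerate(groups) for v in values]
    pooled.sort(key=lambda pair: pair[0])
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j + 1 < n_total and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1           # average 1-based rank of the tied block
        t = j - i + 1
        tie_term += t ** 3 - t
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += avg_rank
        i = j + 1
    h = 12 / (n_total * (n_total + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    return h / (1 - tie_term / (n_total ** 3 - n_total))

# Toy example: three clusters whose sequence lengths do not overlap at all.
h_separated = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Toy example: two clusters with identical values.
h_identical = kruskal_wallis_h([[1, 2], [1, 2]])
```

A large H, as in the first toy example, corresponds to a small p-value and therefore to evidence against the null hypothesis of no difference between groups.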

To evaluate clustering quality, intrinsic metrics such as the silhouette coefficient, the Davies-Bouldin index, and the Calinski-Harabasz index were used, which were helpful in determining the optimal number of clusters. Determining the optimal number of clusters involved running executions for different values of K. The decision on the optimal number of clusters was based on the consensus of the metrics and the context of data analysis. More details about the algorithms and hyperparameters used can be found in Table 2.
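To make the first of these metrics concrete, the sketch below computes the mean silhouette coefficient with NumPy using Euclidean distances. It is an illustrative reimplementation under those assumptions, not the code of the visual tool; library versions such as `sklearn.metrics.silhouette_score` would normally be used instead.

```python
import numpy as np

def silhouette_score(points, labels):
    """Mean silhouette coefficient (Euclidean distance, at least two clusters)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)               # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy 2-D points, standing in for the autoencoder's 2-D codes.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_score(points, [0, 0, 1, 1])   # clusters match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])    # clusters cut across it
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.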

In the hyperparameters column of Table 2, the distance metric used is defined for the k-means algorithm, the distance and linkage type are specified for the agglomerative algorithm, and the type of covariance used is defined for the gaussian mixture algorithm.

The choice of the clustering algorithm was based on the evaluation of the five available algorithms. Results were compared using intrinsic metrics, selecting the algorithm with the best performance in these metrics and the ability to generate easily interpretable clusters. The visual tool allowed the definition of user-defined value ranges and the simultaneous visualization of clusters generated by each algorithm. This approach provided additional guidance for selecting the algorithm with greater interpretability.

In general, in the layers where DBSCAN was applied, intrinsic metrics yielded poor results. The visual tool also indicated that the clusters generated by DBSCAN were not optimal, as it identified as outliers elements considered relevant in the analysis. It’s worth noting that the spectral algorithm was not included in the results in Table 2, as the evaluations of intrinsic metrics and interpretability of clusters in the other three algorithms (k-means, agglomerative, and gaussian mixture) were superior.

Table 2
Clustering algorithms used for each layer analysis

Layer	Algorithm	Hyperparameters	K
1	K-means	Euclidean	15
2	Gaussian Mixture	Spherical	6
3	Agglomerative	Euclidean/ward	3
4	Agglomerative	Euclidean/ward	4
5	K-means	Euclidean	13
6	K-means	Euclidean	6
7	Gaussian Mixture	Tied	2
8	Gaussian Mixture	Diagonal	3
9	Agglomerative	Euclidean/full	2
10	Gaussian Mixture	Spherical	6
11	Agglomerative	Euclidean/full	2
12	Gaussian Mixture	Spherical	2

3.3.1 Semantic similarity

The first analysis aimed to identify patterns generated by the model based on the semantic similarity of the samples. The data distribution, labeled according to their semantic similarity, is shown in the box plot in Fig. 3. It can be observed that the distribution of labels does not have outliers in either direction but exhibits a slight left skew, indicating that there are labels with higher semantic similarity values in the upper half of the interquartile range.

Fig. 3

Distribution of the STS-Benchmark test dataset with respect to semantic similarity labels.

In addition to box plots, the visual tool designed for clustering analysis also allows for the input of ranges of semantic similarity. It highlights the respective samples and provides the capability to contrast them with the groups identified by clustering algorithms, thereby contributing to a better understanding of data behavior.

3.3.2 Sequence length

The length of each sequence from the STS-Benchmark test set was extracted as the number of tokens produced by BERT’s WordPiece tokenizer. The data distribution according to sequence length is presented in the box plot in Fig. 4. It can be observed that the dataset exhibits positive skewness due to the presence of a few very long sequences.

To conduct this analysis, we developed a visual tool that allows the visualization of sequence lengths within their respective groups. This tool facilitates the identification of minimum and maximum sequence length values and enables the definition of sequence length ranges specified by the user, which are highlighted in the scatter plots.

Fig. 4

Distribution of the STS-Benchmark test dataset with respect to sequence length.

3.3.3 Grammatical structure

For this analysis, the grammatical structures present in all sequences from the test set were identified. The cosine similarity of the most frequent grammatical structure with respect to the rest of the structures was calculated, and these values were used as an analysis variable.

The most frequent grammatical structure is shared by 32 sequences in the test set. This structure consists of two sentences evaluated in terms of their semantic similarity according to the STS-Benchmark dataset. Both the first and the second sentence share the same dependency labels, which include:

['dep', 'det', 'nsubj', 'aux', 'ROOT', 'det', 'dobj', 'punct']

As in the previous sections, the visual tool facilitates the analysis of grammatical structure. This tool aimed to identify the most frequent structures and subsequently allowed the visualization of structures that were more similar to a particular structure, based on a user-adjustable similarity threshold.
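The text does not state how a grammatical structure was turned into a vector before computing the cosine similarity. The sketch below assumes a simple bag-of-labels encoding (counts of each dependency label), which is only one possible choice; the second structure is hypothetical and serves purely as an example.

```python
import math
from collections import Counter

def cosine_similarity(labels_a, labels_b):
    """Cosine similarity between two structures encoded as label-count vectors.

    Assumption: each structure is represented by the counts of its
    dependency labels, ignoring their order.
    """
    ca, cb = Counter(labels_a), Counter(labels_b)
    dot = sum(ca[k] * cb[k] for k in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Most frequent structure reported for the test set.
most_frequent = ['dep', 'det', 'nsubj', 'aux', 'ROOT', 'det', 'dobj', 'punct']
# Hypothetical structure used only for comparison.
other = ['det', 'nsubj', 'ROOT', 'prep', 'det', 'pobj', 'punct']

sim_self = cosine_similarity(most_frequent, most_frequent)
sim_other = cosine_similarity(most_frequent, other)
```

An order-insensitive encoding like this one cannot distinguish structures that contain the same labels in a different order, so a sequence-aware encoding may be closer to what the analysis requires.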

4 Analysis and results

4.1 Classification token CLS

Unlike the approaches used by [1] and [2], our work is not based on highly specialized techniques to improve semantic similarity scores. Our experimental tests focused on evaluating aggregation methods and a BiLSTM network, aiming to determine if any of these methods could surpass the abstraction capability of the CLS token.

None of the methods implemented during the fine-tuning process, as detailed in Table 1, managed to surpass the abstraction capacity provided by the classification token [CLS] when evaluating semantic similarity. In fact, it was observed that the abstraction provided by the [CLS] token was slightly superior to that obtained by other methods, as shown in Table 3.

To explain the results showing the slight superiority of the [CLS] token compared to other methods, let’s consider how BERT evaluates the relationships and relevance between words. The dot product in self-attentions provides a measure of these relationships within the sequence. The [CLS] token measures the context’s global relationship with all the words. As information is abstracted and propagates in the higher layers, BERT captures more complex and contextual relationships between words through the [CLS] token.
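The role of the dot product can be illustrated with the scaled dot-product attention of [4], restricted to the row of the [CLS] token. The sketch is a single-head simplification with toy query and key vectors and without the learned projection matrices; it is not extracted from the fine-tuned model.

```python
import numpy as np

def cls_attention(queries, keys):
    """Attention weights of the first token ([CLS]) over all tokens.

    queries, keys: (seq_len, d_k) arrays for a single head.
    Returns a (seq_len,) vector that sums to 1.
    """
    d_k = queries.shape[-1]
    scores = queries[0] @ keys.T / np.sqrt(d_k)   # scaled dot products with every key
    scores = scores - scores.max()                # numerical stability before the softmax
    weights = np.exp(scores)
    return weights / weights.sum()

# Toy vectors: the key of token 1 is the most aligned with the [CLS] query.
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
keys = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
weights = cls_attention(queries, keys)
```

The token whose key has the largest dot product with the [CLS] query receives the largest weight, which is the sense in which the [CLS] row summarizes the relevance of every token in the sequence.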

Table 3
Results of the evaluation of the STS-Benchmark dataset test set using the aggregation methods and LSTM network used in fine-tuning the BERT model

Method	Spearman	Pearson
CLS [1]	84.3	–
CLS	84.8	86.1
MAX	84.2	84.2
MEAN	84.4	85.7
LSTM	84	85.4

It’s important to note that the relationships between words can vary in their level of relevance. When applying a mean-pooling operation to the model’s output, we are considering word relationships with the same level of relevance, which may not be appropriate due to the presence of stop words that can establish less relevant relationships in semantic analysis. On the other hand, the max-pooling method may lose significant contextual information. Additionally, it should be considered that BERT already provides high-quality abstraction, limiting the ability of a recurrent LSTM network to extract additional useful information in this context.

4.2 Dimensionality reduction and clustering

The dimensional reduction by the bidirectional LSTM autoencoder resulted in a training loss of 0.0447. In Fig. 5, the scatter plots and their respective clusters are visible. It’s observed that the data tends to have more dispersion in the earlier layers, while the last two layers exhibit more compact and well-defined clusters. These observations are supported by intrinsic metrics that evaluate the clustering quality, as shown in Table 4.

Fig. 5

Scatter plots with clustering for all layers of the BERT model.

Table 4

Results of the intrinsic cluster quality evaluation metrics

Layer	Silhouette	Bouldin	Calinski
1	0.387	0.762	1093
2	0.357	0.784	932
3	0.435	0.695	1394
4	0.431	0.674	1909
5	0.347	0.792	1072
6	0.349	0.845	1076
7	0.548	0.594	2057
8	0.502	0.618	2917
9	0.535	0.629	2428
10	0.377	0.843	1704
11	0.678	0.331	3760
12	0.784	0.201	5959

4.2.1 Semantic similarity

In Fig. 6, box plots for each layer of the model with respect to the semantic similarity label of the data are presented. It is worth noting that in layer 9, the groups exhibit different interquartile ranges, suggesting distinct groupings in terms of their central tendencies. In the case of layer 10, cluster 5 shows significantly high values of semantic similarity, while cluster 2 concentrates over 50% of its data within a range of values between 0.4 and 2.2. The other clusters display intermediate values with partial similarity between groups. These findings suggest that in layers 9 and 10, semantic similarity becomes relevant.

Fig. 6

Box plot for semantic similarity analysis.

Table 5 presents the probability (p) values derived from Kruskal-Wallis and Mann-Whitney U analyses concerning semantic similarity. From these results, it can be observed that layers 6, 8, 9, and 10 exhibit a clear clustering trend in terms of semantic similarity, as evidenced by the extremely low p-values. This suggests strong evidence against the null hypothesis for these layers. These findings corroborate the observations made in the box plot, particularly in the case of layers 9 and 10.

In Fig. 8, scatterplots for layers 2, 7, 10, and 12 are presented, with semantic similarities of 0 and 5. The dispersion of samples in layers 2 and 12 is evident, while in layer 10, samples with similarity 5 predominate in the lower part, and labels with zero similarity are found in the upper part.

Fig. 7

Box plot for sequence length.

Fig. 8

Scatterplots highlighting the samples with semantic similarities of 0 and 5.

4.2.2 Sequence length

In Fig. 7, box plots are presented in relation to the sequence lengths. Although there are 15 clusters in layer 1, it is observed that some of them intersect in the same interquartile range, while others show clear differences in values. The overlapping interquartile ranges in the box plots are quite close to each other. For example, group 10 contains very long sentences (with more than 60 tokens), and the adjacent group, group 4, concentrates its data between 45 and 60 tokens. On the other hand, groups 3 and 14, which are contiguous, exhibit the lowest values. In contrast, the remaining groups, which are also contiguous, show values in the center. This trend is more noticeable in layer 3. In contrast, in layers 9, 11, and 12, the interquartile ranges overlap, suggesting similarities in the clusters and a decrease in the model’s attention to sentence length.

The results of the p-value from Kruskal-Wallis and Mann-Whitney U tests described in Table 5 again reflect agreement with the box plots; the lower layers are particularly fixated on sequence length. However, a gradual loss of the model’s attention to sequence length can be observed.

Table 5
Results of the p-value for Kruskal-Wallis and Mann-Whitney U for semantic similarity, sequence length, and grammatical structure

Layer	Semantic (p)	Length (p)	Structure (p)
1	0.00188	3.36e-243	3.38e-46
2	0.00239	1.06e-206	4.62e-85
3	0.00058	9.769e-168	1.079e-08
4	0.00086	7.57e-121	1.63e-34
5	0.00101	6.207e-120	9.22e-64
6	2.31e-47	2.96e-105	3.75e-24
7	0.1377	1.34e-140	1.58e-12
8	4.545e-34	1.32e-162	1.99e-09
9	1.21e-107	3.69e-20	1.23e-22
10	1.94e-113	3.94e-113	2.48e-47
11	0.00075	8.87e-29	8.82e-95
12	0.0353	5.06e-26	8.82e-93

In Fig. 9, scatter plots for layers 1, 4, 8, and 12 are presented. These plots color-code the samples with sequence lengths in the ranges of 9–15, 20–22, and 60–81 tokens. They were generated using the visual tool in the context of sequence length. In these figures, it is evident that layer 1 of the model maintains a clear separation between sequence lengths, but in later layers short and medium-length sequences cluster together, while long sequences tend to remain separate.

Fig. 9

Scatter plots of layers 1, 4, 8, and 12 indicating sequence length ranges.

4.2.3 Grammatical structure

The analysis of grammatical structure can be conducted based on the box plot shown in Fig. 10, which reveals greater variability in most of the layers. In most cases, the interquartile ranges overlap or show slight variations. For example, in layers 3, 7, and 8, it is evident that the groups contain similar values, with a concentration around 0.6 similarity. On the other hand, layers 1, 2, 4, 5, 6, 9, and 10 also exhibit high variability but with partial overlap, indicating that the groups are somewhat less similar to each other. As for layers 11 and 12, they show groups with more than 50% of their values different from each other.

Fig. 10

Box plot to analyze the structural similarity of the most frequent sequence with respect to the rest of the structures.

The results of the Kruskal-Wallis and Mann-Whitney U p-values shown in Table 5 reveal that the later layers of the model have a more pronounced focus on grammatical structure, while the other layers show existing but less defined attention to this structure.

The visual tool reveals that samples with similar structures tend to stay close to each other across all layers of the model, as can be seen in Fig. 11. Although a pattern or trend is observed in samples with similar grammatical structures, this pattern is not clearly defined and sometimes presents challenges when attempting to group them coherently.

Fig. 11

Scatter plots of layers 2, 3, 8 and 12, highlighting the most frequent sequence and those with similar structures that reach a similarity level of 0.95 or higher.

5 Conclusions

The CLS token demonstrates the generation of high-quality abstractions that are challenging to surpass, even for a recurrent LSTM network. Additionally, the results suggest that aggregation methods can, in some cases, lead to information loss by assuming that all relationships between words captured by the model have the same relevance, whether by averaging them or selecting the best one.

These findings align with previous observations, as mentioned in [14], where increased dispersion in self-attentions is identified in the early layers of the model, in contrast to decreased dispersion in the later layers. This is reflected in the quality of the achievable clusters.

This approach to linguistic interpretation through unsupervised learning, rather than relying on specialized probes and corpora, offers valuable insights for research. Similar to [10], we have identified that aspects such as sequence size are prominent in the early layers of the model and gradually lose significance in the later layers. Concerning semantic similarity, there is greater activity in layers 9 and 10 of the model, with hints of a trend in layers 6 and 8, supporting previous claims by [12]. As for syntax, these findings indicate that it is not entirely clear which layers of the model focus on grammar. However, they do reveal a tendency towards clustering similar grammatical structures across multiple layers. Conducting a more detailed analysis and exploring various approaches would be valuable to determine whether this pattern persists in grammatical structure.

Furthermore, it would be beneficial to conduct a deeper semantic analysis to ascertain whether the clusters are abstracting concepts and if the model can discern homonyms and synonyms based on context.

By confirming that the results of this study align with previous state-of-the-art research, it can be concluded that both the method used to generate abstractions from BERT model attentions and the autoencoder model design achieved effective abstraction, despite the substantial dimensionality reduction. However, conducting tests and cluster evaluations with higher dimensionality would be an interesting avenue for future research.

References

1. Reimers, Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks. ArXiv Preprint ArXiv:1908.10084 (2019).
2. Wang, Liu, Verspoor, Baldwin, Evaluating the utility of model configurations and data augmentation on clinical semantic textual similarity, Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing (2020), 105–111.
3. Voita, Talbot, Moiseev, Sennrich, Titov, Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. ArXiv Preprint ArXiv:1905.09418 (2019).
4. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser Ł., Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
5. Devlin, Chang, Lee, Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv Preprint ArXiv:1810.04805 (2018).
6. Peters, Neumann, Iyyer, Gardner, Clark, Lee, Zettlemoyer, Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237 (June 2018), https://aclanthology.org/N18-1202.
7. Radford, Narasimhan, Salimans, Sutskever, et al., Improving language understanding by generative pre-training. (OpenAI, 2018).
8. Michel, Levy, Neubig, Are sixteen heads really better than one?, Advances in Neural Information Processing Systems 32 (2019).
9. Gordon, Duh, Andrews, Compressing BERT: Studying the effects of weight pruning on transfer learning. ArXiv Preprint ArXiv:2002.08307 (2020).
10. Jawahar, Sagot, Seddah, What does BERT learn about the structure of language? ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics (2019).
11. Tenney, Das, Pavlick, BERT rediscovers the classical NLP pipeline. ArXiv Preprint ArXiv:1905.05950 (2019).
12. Niu, Penn, Does BERT rediscover a classical NLP pipeline?, Proceedings of the 29th International Conference on Computational Linguistics (2022), 3143–3153.
13. Reif, Yuan, Wattenberg, Viegas, Coenen, Pearce, Kim, Visualizing and measuring the geometry of BERT, Advances in Neural Information Processing Systems 32 (2019).
14. Clark, Khandelwal, Levy, Manning, What does BERT look at? An analysis of BERT's attention. ArXiv Preprint ArXiv:1906.04341 (2019).
15. Vig, BertViz: A tool for visualizing multihead self-attention in the BERT model, ICLR Workshop: Debugging Machine Learning Models 23 (2019).
16. Kolesnikova, Exposition of the Natural Language Processing Laboratory. (Center for Computing Research, 2022).
17. Han, Zhang, Yuan, Jiang, Yun, Gao, A survey on the techniques, applications, and performance of short text semantic similarity, Concurrency and Computation: Practice and Experience 33 (2021), e5971.
18. Chandrasekaran, Mago, Evolution of semantic similarity – a survey, ACM Computing Surveys (CSUR) 54 (2021), 1–37.
19. Yenicelik, Schmidt, Kilcher, How does BERT capture semantics? A closer look at polysemous words, Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (2020), 156–162.
20. Kovaleva, Romanov, Rogers, Rumshisky, Revealing the dark secrets of BERT. ArXiv Preprint ArXiv:1908.08593 (2019).
21. Cer, Diab, Agirre, Lopez-Gazpio, Specia, SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. ArXiv Preprint ArXiv:1708.00055 (2017).
22. Han, Zhang, Yuan, Jiang, Yun, Gao, A survey on the techniques, applications, and performance of short text semantic similarity, Concurrency and Computation: Practice and Experience 33 (2021), e5971.

Table 3. Results of the evaluation of the STS-Benchmark dataset test set using the aggregation methods and LSTM network used in fine-tuning the BERT model

Method	Spearman	Pearson
CLS [1]	84.3	–
CLS	84.8	86.1
MAX	84.2	84.2
MEAN	84.4	85.7
LSTM	84	85.4
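For readers who want to reproduce the two metrics in Table 3, the sketch below shows how Pearson and Spearman correlations between predicted and gold similarity scores can be computed. It is an illustration, not the evaluation code used in this study. The function names and the sample scores are hypothetical, and we assume, as is common for STS-Benchmark, that the table reports correlations multiplied by 100.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation: covariance normalized by both standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tie = values == v
        ranks[tie] = ranks[tie].mean()
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical gold and predicted similarity scores on the 0-5 STS scale.
gold = [0.0, 1.2, 2.5, 3.0, 4.4, 5.0]
predicted = [0.3, 1.0, 2.9, 2.7, 4.1, 4.8]
print(f"Pearson x100:  {100 * pearson(gold, predicted):.1f}")
print(f"Spearman x100: {100 * spearman(gold, predicted):.1f}")
```

Spearman depends only on the ordering of the scores, so it is unaffected by any monotonic rescaling of the model output, whereas Pearson measures linear agreement. This is one reason the two columns in Table 3 can differ for the same method.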