Abstract
In the speaker verification task based on Gaussian Mixture Model-Universal Background Model (GMM-UBM), by constructing the UBM as a tree structure, the kernel Gaussians suitable for different speakers can be quickly selected, which speeds up the modeling of speaker acoustic space by GMM. The tree-based kernel selection algorithm (TBKS) introduces a beam-width, which increases the candidate range of kernels and improves the kernel selection accuracy. In this paper, we improve the TBKS algorithm by introducing a recall rate to adjust the number of nodes recalled in each layer of the tree structure. This adjustment refines the quantity and resolution of Gaussian distributions in various subspaces within the acoustic space, compensating for the loss caused by discarding some significant Gaussians erroneously. Speaker verification experiments are carried out based on the Aishell2 dataset. The results reveal that the modified TBKS algorithm reduces EER by 7.5% relatively and increses computational reduction factor to 42.93, enhancing both recognition accuracy and speed. In addition, the test speech is spliced into different lengths and common environmental noise is added to verify the universality of the improved algorithm.
Introduction
The Gaussian Mixture Model (GMM) is commonly used to model data distributions and finds wide applications in fields such as cluster analysis, pattern recognition, and anomaly detection [1, 2]. In the task of speaker verification, the Gaussian Mixture Model-Universal Background Model (GMM-UBM) [3, 4] is used to model the acoustic space for general background and specific speakers. GMM-UBM first trains the UBM, and then the speaker-specific GMM is obtained based on the UBM and speaker-specific speech data. Acoustic space modeling via GMM eliminates the need for massive training data or long-term training processes. In addition, GMM-UBM requires less computational resources, making it possible to realize real-time applications on resource-constrained embedded devices [5, 6].
However, in the verification phase, it takes up most of the operation time to calculate the likelihood values of all Gaussian components in UBM and the speaker model [7]. In order to reduce the computation time in the validation phase, the following three perspectives are usually considered: the number of feature vectors to be scored in a test utterance (downsampling) [8], the dimension of the feature vector, and the number of Gaussian components in the UBM. Typically, only Gaussians near the feature vectors are selected as the kernels for likelihood calculations. This approach reduces the number of Gaussians that need to be considered, as these chosen kernels contribute significantly to the likelihood value.
In order to achieve fast search of kernels, researchers have proposed various methods. Yu et al. introduced the Sorted GMM [9], a method that arranges Gaussian components in a sorted manner and subsequently identifies the optimal Gaussian index through a codebook search applied to the feature vectors. Roland et al. analyzed three hash GMMs [10]. For each Hash GMM component, a candidate list of indexes was generated containing the indexes of the Gaussian components used for validation. Rahim et al. combined Gaussian selection with frame selection [11]. Xiang et al. proposed Structural GMM [12], in which each Gaussian component in the UBM are used as leaf nodes in a tree structure to narrow down the selection of kernels through a top-down hierarchical search. This process of selecting kernels has also been effectively utilized in GMM-SVM models [13] as well as Ivector-based models [14–16]. The introduction of the aforementioned fast selection methods has led to a substantial enhancement in scoring speed. However, since it’s not possible to precisely identify the top Gaussians with the highest global likelihood values, a compromise must be made within a tolerable range, sacrificing some recognition accuracy.
Enhancing recognition accuracy while maintaining fast scoring has become a key focus in the practical application of speaker verification systems in real-world scenarios. Building upon their work in reference [12], Xiang et al. employed a Multi-Layer Perceptron (MLP) as a speaker identity discriminator after computing the likelihood values. However, the additional MLP requires a significant amount of training data, which in turn increases computational load. Xiong et al. proposed the TBKS (tree-based kernel selection) algorithm [19], which directly optimizes the strategy for selecting kernel distributions. By introducing a beam-width, the TBKS algorithm expands the local search range, thus reducing the likelihood of selecting core distributions incorrectly. The TBKS algorithm balances recognition accuracy and speed. With just a 1% increase in EER, the number of Gaussians that need to be calculated can be reduced by a factor of 15.2, and the algorithm remains robust to additive background noise [20].
However, the TBKS algorithm sacrifices the acoustic information contained in the discarded Gaussian components to ensure recognition speed. To address this issue, this paper proposes an improved approach aimed at recalling some of the discarded but relatively important Gaussian components to optimize the kernel selection process based on TBKS. The structure of this paper is organized as follows. Firstly, hierarchical clustering is performed on each single Gaussian component of the UBM to create a tree-like structure. Secondly, the matching degree between each candidate node at each layer of the tree structure and the speech frames is computed, serving as the basis for selecting kernels. Finally, while determining the search range for each layer, some nodes outside the beam-width are recalled, and their aggregate nodes are added to the kernel group. The effectiveness of the improved TBKS algorithm is validated through speaker verification experiments on the Aishell2 dataset.

Adaptation process of the speaker model.
The Gaussian mixture model (GMM) is a linear combination of several Gaussian probability density functions. It can be used to represent the characteristic distribution of any speech feature vector by the model parameters Λ = {w
i
, μ
i
, Σ
i
}, which are the weight, mean vector, and diagonal covariance matrix respectively for the i-th component, where i = 1, 2, ⋯ , M. The speaker-independent universal background model (UBM) [21] and speaker models are commonly represented by GMM, in which speaker models are adapted from UBM through maximum a posteriori (MAP). A set of D-dimensional feature vectors X = {x1, x2, ⋯ , x
T
} are extracted from overlapping windows (the so-called frames) of speech, where T is the number of frames. The likelihood value of the feature vector on GMM indicates the fitness of the feature vector on GMM. Speaker verification amounts to computing the log-likelihood ratio of sequence between the UBM and speaker model, which determines the correction of the claimed speaker identity. To avoid arithmetic underflow, the likelihood ratio is usually converted to the difference of the log-likelihoods and is called the likelihood score:
According to Eqs. (2) and (3), the computation load of likelihood values on all M Gaussian components occupies most of the scoring time. However, a side benefit of MAP adaptation is that it keeps the correspondence between the Gaussians of the target-speaker model and the UBM. This allows the kernels selected from UBM to be directly applied to the speaker model, thereby reducing the time to calculate the likelihood score.
Figure 1 illustrates the adaptation process of a speaker model. For a feature vector x t of the speaker, only a few nearby Gaussians update the parameters and get closer to x t . The more feature vectors are input, the more Gaussians are updated, and the more accurate the model expresses speaker acoustic information. As shown in the figure:
(1) When calculating the likelihood value of x t on GMM, only a few Gaussians near it contribute the most to the likelihood, while the Gaussians far away from it almost have no contribution.
(2) If x t fits well with a Gaussian in UBM, it will also fit well with the corresponding Gaussian in the target speaker model, and vice versa.
In this paper, only the mean vectors are updated during MAP adaptation. Firstly, calculate the posterior probability of x
t
on the i-th Gaussian component:
For each input feature vector, top C highest-scoring components are selected as kernels after the likelihood values are calculated on all M Gaussians in UBM and applied to the speaker model. C is typically set to 5 [4]. At this time, the number of Gaussians to be calculated is reduced from 2M to M + C. According to Eq. (2), since C is usually far less than M, the computational cost of calculating the likelihood value is reduced by almost a half. However, the traversal search requires a calculation of all Gaussian components, which is still not efficient enough. It becomes essential to select kernels quickly and accurately for system performance improvement.
To use as few Gaussians as possible to more accurately represent the local acoustic space where the feature vector is located, this paper builds an L-layer tree to perform hierarchical clustering of all Gaussians in UBM based on the TBKS algorithm. When selecting the kernels, search from top to bottom. The candidate nodes of the current layer are arranged in descending order according to the likelihood values. Some nodes enter the next layer to search, some nodes are reserved as part of the kernel set, and the remaining nodes are pruned. Thus, an adaptive resolution GMM is constructed for the local acoustic space.
Construction of the tree structure
In the tree structure, the leaf nodes (M in total) correspond exactly to the components in UBM. The nodes generated in the first L - 1 layers are the merging of corresponding child nodes, and so on, the root node is the merging of M leaf nodes. Before constructing the tree structure, it is necessary to define the distance between two single Gaussians and the combined Gaussian model parameters of multiple Gaussians.
As one of the common gaussian distance measures, symmetrized Kullback–Leibler (KL) divergence [17] represents the sum of the relative entropy of two Gaussians. When diagonal covariance matrices are assumed, the distance d
KL
(m, n) between two Gaussian components G
m
∼ N (μ
m
, Σ
m
) and G
n
∼ N (μ
n
, Σ
n
) can be simplified to
After determining the number of layers L and the number of branches of all nodes in the upper L - 1 layers, the tree is constructed as follows [17].

Construction of the tree structure.
According to the construction process above, the number of branches in the last non-leaf layer is unknown, and the number of nodes in the last layer is equal to the total Gaussian number of UBM. Figure 2 shows the distribution of tree-based Gaussian nodes in the acoustic space. Each layer of the tree represents a more precise or higher-resolution GMM for the acoustic space than the layers near the root node.
After the tree structure is established, the local optimal search for kernels can be performed from top to bottom. Normally, Gaussians near the boundary of two clusters will be hard classified into one of them. They may be discarded by mistake if these two clusters have similar likelihood values. The TBKS algorithm introduces beam-width K to search for C representative kernels. As shown in Fig. 3, the search process is as follows [22].

Construction of the TBKS algorithm.
The introduction of beam-width K balances accuracy and scoring computation. The number of candidate nodes in each layer increases with the increase of K, which leads to an increase in the computing load of the likelihood value. However, the accuracy of the local optimal search is improved due to the consideration of the Gaussian components near the decision boundary.
Although the TBKS algorithm expands the search range of the kernels, the accuracy of the speaker verification system still drops a bit. First, the number of branches in each layer is fixed, which results in some nodes being hard classified to a farther cluster. Second, one of the two nodes with similar likelihood values may still have to be discarded near the beam-width. Thus, some representative Gaussians will be missed, and the acoustic space where each feature vector is located cannot be fully characterized.
For the nodes near the beam-width, although their representation ability is lower than the nodes in the beam width, they still contain some useful information. If K keeps increasing, the child nodes of such ambiguous nodes will be included in the set of candidate nodes at the next layer. However, these child nodes are likely to be removed at the next level due to their fairly low representational capability. Therefore, this paper directly retains the nodes near the beam-width as part of the kernels and uses the lower-resolution merged Gaussians in the upper L - 2 layers (excluding the root node layer) to roughly represent the acoustic space that is far away from the feature vector and have lower significance.

Construction of the improved TBKS algorithm.
The construction process is shown in Fig. 4(a). The search starts from the root node. In the beam-width, all the child nodes of selected nodes are scored and sorted in descending order layer by layer. The first K child nodes are treated as candidate nodes in the next layer. Some lower-scoring child nodes are directly retained as components of the kernel set, while the rest of the lowest-scoring child nodes are pruned.
In this paper, the recall rate ξ is introduced to adjust the proportion of retained child nodes. ξ = 1 means that all child nodes outside the beam-width are recalled. When ξ equals to 1, it is the TBKS algorithm, in which only K child nodes are retained. The kernel set of the feature vector is composed of the retained merged gaussian nodes (such as node 3,4,5,6) in upper L - 2 layers and the original Gaussian nodes (such as node 1,2) in the last layer. Figure 4(b) shows the distribution of the kernel set in the acoustic space. Compared with TBKS, the computational cost of selecting the kernels is constant. Instead of discarding nodes outside the beam-width directly, this paper retains some useful nodes to compensate for the loss caused by the tree structure and the local searching. The implementation steps are as follows:
In the scoring stage, the Gaussians used in the UBM and the speaker model need to be corresponding for a feature vector. Thus, the tree structure of the speaker model and the UBM should be kept consistent. Since the Gaussian distribution of non-leaf nodes is required, the index sets of nodes at all layers of the speaker model’s tree structures should also be matched with the UBM’s. When enrolling the speaker model, M leaf nodes are obtained by MAP adaption from UBM, and then the model parameters of nodes in each layer are calculated according to Eq. (9).
In section 3.3, the resolution and number of Gaussians in each subspace of the speaker are adjusted according to the local acoustic space where the feature vector is located. The original GMM is replaced by a few kernels. In this paper, the computational reduction factor [18] is chosen to represent the speed up of a speaker verification system. Taking the traversal search as the baseline, the ratio of the Gaussian number required is calculated before and after using the fast selection algorithm. Without considering the Gaussian sorting or searching time, the expression is as follows:
Let the branch number set of the first L - 1 layer be R = {R1, R2, ⋯ , RL-1}, and C original Gaussians are selected in the last layer. The number of kernels Csm is given by
In this section, trees with several structures are built according to the process in section 3.1. All the components in UBM are hierarchically clustered. As described in section 3.3, the search of kernels is carried out from top to bottom. The candidate nodes will be retained, discarded, or further subdivided according to their importance on the likelihood values. The kernel set is composed of the nodes retained in each layer. Based on the traversal search, the effects of different tree structures, beam-widths, and recall rates on accuracy and recognition speed are discussed. Considering the application of speaker verification in real scenes, the robustness of the system under different audio durations and two common kind of noises is also explored.
Processing of the speech data
In this paper, the experiments are based on Aishell2 [23], a Mandarin Chinese dataset recorded in a quiet indoor environment. The duration of each record is about 3∼5 seconds. 300 people are selected to train UBM with 1,024 mixtures. Use records in the dataset to splice 30s audio for each person. Another 50 people are selected for enrollment and test, with a balanced ratio of male to female. After the voice activity detection, 13-dimensional Mel-Frequency cepstral coefficients (MFCC) are extracted from speech. Typically, the frame length is 200ms and the neighboring frame is overlapped by 80ms. Then, 13 delta-cepstral coefficients and 13 delta-delta-cepstral coefficients are concatenated to the cepstral vector to generate a 39-dimensional vector. Finally, these vectors are processed with CMVN [24] to eliminate convolution noise.
Configurations of the tree structure
Before the speaker verification experiments, we investigate the effect of the kernel number C in traversal search and the most suitable C is selected for the baseline. As shown in Fig. 5, as C increases, the EER obtained through traversal search gradually rises and tends to be stable. This indicates that most Gaussians contribute a little to the accuracy. Hence, this paper takes the traversal search with C = 1 as the baseline, EER = 2.1633.

Influence of kernel number on the performance of the traversal search.
Note that, although the best accuracy can be obtained using only one kernel, the fast scoring algorithms cannot accurately select the best Gaussian component. Only by increasing the number of kernels can the loss caused by the wrong selection be minimized.
Set the layer number of the tree to be L, and the set of branch number in upper L - 1 layers to be R = {R1, R2, ⋯ , RL-1}. The TBKS algorithm indicates that the computational reduction factor can reach the maximum when R1 = K × R2 = ⋯ = K × RL-1, and maintain the value when L > 5. Hence, 6 types of tree structures are selected to discuss the influence of beam-width K and recall rate ξ on fast verification system. The number of leaf nodes with the highest likelihood values of being selected as part of the kernels in the last layer is 4.

Performance under different tree structures.
Figure 6 shows the relationship between EER and actual computational reduction factor F′ when the beam-width K is set to {1, 2, 3} and the recall rate ξ is set to {0, 0.05, 0.1, 0.2, 0.3}. When ξ equals to 0, it is the TBKS algorithm. As shown in Fig. 6(a), L5 : {1, 16, 4, 4, *} indicates that the tree has 5 layers, and the branch number in the first L - 2 layers is {16, 4, 4}. ’*’ indicates that the branch number in the penultimate layer is uncertain. It can be seen from the figures above:
(1) Among the six tree structures, there are parameter configurations that achieve higher recognition rates than the baseline. Except for structure L3 : {1, 16, *}, the other tree structures all can achieve the best performance, where EER is 2, and relatively improves 7.5% compared to the baseline system.
(2) As shown in Fig. 6(e), among all tree configurations that achieve the optimal recognition performance with EER = 2, structure L5 : {1, 8, 4, 4, *} experiences the greatest reduction in the actual computational reduction factor, with F′ = 42.93. At this point, the beam-width K = 1 and the recall rate ξ = 0.1.
(3) Within the same tree configuration, when ξ = 0.05, the number of nodes recalled per layer is 1 ∼ 2. In comparison to xi = 0, there is a significant improvement in recognition performance. As illustrated in Fig. 6(d), when ξ = 0.05, the EER can decrease by up to around 50%, while the reduction in Gaussian number is only around 6.2%.
(4) After recalling a portion of the nodes, when ξ > =0.1, most tree structures can achieve the best accuracy, and the EER remains unchanged, indicating that the recognition performance tends to be stable.
In this section, the robustness of the speaker verification system under different speech length t = {30, 15, 9, 3}s is discussed. To ensure the consistency and integrity of the speech, all the test audios are spliced from the records in the dataset. The enroll audio duration of each speaker remains 30s.
According to the experimental results in the previous section, the TBKS algorithm achieved the best performance under L4 : {1, 32, 8, *}, K = 2. For the improved TBKS algorithm, three configurations with better performance are selected according to different heights of the tree: L5 : {1, 8, 4, 4, *}, L4 : {1, 32, 8, *}, L3 : {1, 32, *}, where K = 1, ξ = 0.1.

Performance under different audio durations.
As shown in Fig. 7, the improved TBKS algorithm leads to a decrease in EER for various tree structures. For any speech length, there is at least one optimal tree configuration with EER lower than that of the traversal search or the original TBKS algorithm. Notably, the structure L4 : {1, 32, 8, *} maintains the best recognition performance across different speech lengths. When t = 3s, the accuracy of all algorithms decreases greatly, indicating that short test audio is a big bottleneck in the application of speaker verification applications.
In this section, we investigate the noise robustness of TBKS after the introduction of recall rate in two common noisy environments. Two sets of testing conditions with babble noise and factory noise from the NoiseX-92 database is evaluated at 15dB, 30dB, 50dB SNR. Also, the duration of speech has been considered (i.e. 30s, 15s, 9s, and 3s). According to the experimental results in Section 4.3, the TBKS algorithm is still set as L4 : {1, 32, 8, *}, K = 2, and the improved TBKS algorithm is set as L4 : {1, 32, 8, *}, K = 1, ξ = 0.1.

Performance under different noises and audio durations.
The performance is shown in Fig. 8, and it can be seen that:
(1) When the speech length is t = 30s, under different SNRs and noise types, the improved TBKS algorithm demonstrates a notable decrease in EER compared to both traversal search and the original TBKS algorithm.
(2) In a total of 24 testing conditions, the TBKS algorithm with recall rate has the highest accuracy for 19 times, and the lowest accuracy for 2 times, which indicates that the improved algorithm can also remain robust in complex environments.
(3) Compared with Factory noise, the improved TBKS algorithm is more robust to Babble noise and maintains the best accuracy.
Based on the tree-structured GMM-UBM speaker verification system, we optimize the TBKS kernel selection algorithm to enhance recognition performance. During the kernel selection process at each layer of the tree structure, a portion of important nodes outside the beam width are recalled, resulting in aggregated nodes used as kernels, thereby improving the modeling of speaker acoustic space. The improved algorithm is validated on the Aishell2 dataset, investigating the influence of node information within the tree structure on recognition speed and accuracy. Furthermore, experiments are conducted under various speech lengths, noise types, and signal-to-noise ratios to examine the robustness of the enhanced TBKS algorithm.
With the introduction of recall rate, a smaller number of lower-resolution nodes are utilized to represent the less crucial acoustic space, ensuring that kernels cover the local acoustic space where speech frames are located as much as possible. When the recall rate is set to 0, it corresponds to the TBKS algorithm. While the introduction of recall rate slightly increases the number of kernels, it significantly improves recognition accuracy. For instance, in the tree structure L4 : {1, 32, 8, *}, when the recall rate is set to 0.05, compared to the TBKS algorithm, the number of kernels increases by 1-2, while the EER decreases by around 50%. Additionally, in comparison to traditional traversal search, the improved TBKS algorithm notably accelerates the kernel selection process, resulting in a maximum reduction factor of gaussian components up to 42.93.
The improved TBKS algorithm also enhances the robustness of the speaker verification system. Across various combinations of speech lengths (t = 30, 15, 9, 3s), noise types (Babble and Factory), and Signal-to-Noise Ratios (SNR = 50, 30, 15), we compared the recognition accuracy of traversale search, TBKS, and the improved TBKS algorithm. In most scenarios, the improved TBKS algorithm exhibited the highest recognition accuracy. When t = 30s, the improved TBKS algorithm showcased significant performance improvement. However, for t = 3s, the performance of all three algorithms declined, indicating that short speech remains a challenge in speaker verification tasks.
