Abstract
In recent times, Dynamic Time Warping (DTW) based template matching systems have again come to the forefront in the field of text-dependent speaker verification. Their integration with recent technology, such as i-vector/Probabilistic Linear Discriminant Analysis (PLDA) and Deep Neural Networks (DNN), has resulted in significant improvement in system performance. The DTW algorithm time-aligns two templates and gives a similarity score based on the optimal warping path. However, it weighs all the local distances along the optimal path equally. In this paper, we propose complementing DTW based text-dependent speaker verification systems with local scores derived from the vicinity of speaker-identity-rich regions. The vowel regions are used to determine the portions along the warping path that carry more speaker-discriminating information. Two systems, namely the DTW/Mel-frequency Cepstral Coefficients (MFCC) system and the online i-vector/PLDA/DTW system, have been extended to incorporate the knowledge of these regions of interest. The results have been evaluated on Part 1 of the RSR2015 database. Relative improvements of up to 11.85% and 49.41% are observed for the extended systems based on MFCC and i-vector respectively.
Introduction
Voice biometry is the task of recognizing a speaker from his/her voice print. It has two main types: speaker identification and speaker verification. In speaker identification, the system identifies an unknown speaker from a set of known speakers. In the case of speaker verification, the system task is to authenticate the claim of an unknown speaker against a particular identity. Speaker recognition can also be divided into text-independent and text-dependent modes. In text-independent mode, the speaker is free to use any text, while in the text-dependent mode, there lies a constraint on the text content of what the speaker speaks to the system [1, 2].
Text-dependent speaker verification is highly relevant for authentication applications in which a speaker tries to gain access to a certain resource or piece of information by uttering a specified piece of text (e.g. a password or pass-phrase). One of the most traditional techniques for text-dependent speaker verification is template matching of streams of acoustic features, particularly MFCCs, using the DTW algorithm [3, 4]. Over the last decade, many advancements have been made in the field of text-independent speaker recognition owing to the annual NIST Speaker Evaluation. Many of the advanced techniques introduced for the text-independent mode of speaker verification have been adopted for text-dependent speaker verification. One such technique is i-vector/PLDA [5–7]. In [5, 6], to suit the text-dependent mode better, it has been proposed to train the PLDA model with speaker-phrase labelled data, with a speaker-phrase definition of classes. Paper [5] also proposes a Hidden Markov Model (HMM) based Hierarchical Acoustic Model (HiLAM), which is able to model the temporal information of the pass-phrase as well. By virtue of this additional information, the model is able to outperform the i-vector/PLDA framework. Also, Joint Factor Analysis (JFA), initially introduced as a channel compensation technique, has been found more effective than i-vectors in the case of very short utterances [8]. Replacing the backend classifier PLDA with LDA and WCCN in the i-vector framework has also been shown to provide very good performance for text-dependent speaker verification [9]. In recent times, DNN based approaches have gained a lot of popularity among researchers in this field [10, 11]. DNNs have been used broadly in two ways. In one approach, a phonetically aware DNN is used to replace the Universal Background Model (UBM) used in the i-vector extraction process, resulting in the DNN/i-vector framework [12]. The other approach has been to derive speaker-phrase discriminative deep features from a DNN, followed by cosine scoring or a backend classifier such as PLDA [10–13].
Lately, DTW techniques have again come to the forefront through their integration with newer technology [14–16]. In fact, even the traditional DTW/MFCC framework outperforms most of the latest techniques for the particular trial scenario of text content mismatch [17]. In recent times, DTW has been integrated with DNN and online i-vector frameworks, and the resulting systems have shown significant improvement in performance. The DTW algorithm takes as inputs two streams of acoustic features, be they MFCCs, i-vectors or DNN posteriors, and time-aligns them. It then provides a similarity score corresponding to an optimal path, which is calculated on a minimum distance criterion using the local distances between the frames of the two streams. When computing the final score, it weighs all the local distances on the optimal path equally. However, previous works in the field of text-independent speaker recognition have shown that speaker information is not distributed uniformly across the whole utterance. In [18], the authors have shown that features from vowel-like regions are more effective for speaker discrimination in the text-independent scenario. In the text-dependent mode, however, in addition to speaker identity information, it is important to model the specific text information. Even so, it is conjectured that complementing the DTW global score with local scores from specific regions of interest having more prominent speaker identity information would give the system a performance boost.
In this work, we extend two DTW based baseline systems, namely the traditional MFCC/DTW system and the lately introduced online i-vector/PLDA/DTW system [17]. The DTW global score is complemented with local scores, corresponding to vowel regions, which have more prominent speaker identity information.
The paper is organized as follows: Section 2 gives the relevant theory, while Section 3 describes the methodology. Section 4 describes the experimental setup, Section 5 presents results and discussion and, finally, the work is concluded in Section 6.
Theory
This section describes DTW based text-dependent speaker verification frameworks that have been considered in this work. MFCCs have been used as the acoustic features and DTW algorithm has been used for template matching purpose.
MFCCs
MFCCs are the most widely used features in the domain of speech processing [19]. They are modeled on the human perception system. They were first introduced in the field of speech recognition but have since been used successfully in various other speech processing applications, and they have become the de facto standard in the field of speaker recognition. They represent vocal tract information and are computed using time-frequency analysis and a filter-bank method.
The speech signal is first normalized and then pre-emphasized to boost high frequencies. The impulse response of the pre-emphasis filter is given by
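The filter equation itself is not reproduced in this text. As a hedged illustration, a widely used first-order pre-emphasis filter has impulse response h[n] = δ[n] − α·δ[n−1], i.e. y[n] = x[n] − α·x[n−1]. The sketch below assumes this form; the function name `pre_emphasize` and the default α = 0.97 are illustrative assumptions, not values stated in the paper.

```python
def pre_emphasize(signal, alpha=0.97):
    """Apply first-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    alpha=0.97 is an assumed, commonly used value; the source does not state it.
    The first sample is passed through unchanged.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out
```

A constant input is attenuated after the first sample, while rapid sample-to-sample changes are preserved, which is the high-frequency boost described above.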
DTW
DTW is a dynamic programming technique that is used to find the similarity between two time series, which are speech utterances in our case [20]. It finds the optimal alignment by non-linearly warping the time series to match corresponding frames. At first, all possible local distances between the frames of the two utterances are calculated. Let us suppose two speech utterances, A and B, of length m and n respectively, given by $A = (a_1, a_2, \ldots, a_m)$ and $B = (b_1, b_2, \ldots, b_n)$.
The algorithm first forms an $m \times n$ matrix whose $(i, j)$th element is the Euclidean distance between $a_i$ and $b_j$, denoted by $d(a_i, b_j)$.
Once all local distances are available, the score is accumulated according to:
The algorithm backtracks along an optimal path based on a minimum distance criterion and the global score gives the similarity measure between the two sequences. This score becomes the basis of deciding whether the two sequences are same or different.
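The accumulation and backtracking steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact recursion: it assumes the common symmetric step pattern D(i, j) = d(a_i, b_j) + min{D(i−1, j), D(i, j−1), D(i−1, j−1)} with unit weights and no path-length normalization. The function names `euclidean` and `dtw` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(A, B, dist=euclidean):
    """Return (global_score, path) for feature sequences A and B.

    Assumed step pattern: moves from (i-1, j), (i, j-1) or (i-1, j-1),
    all with unit weight. path is a list of (i, j) frame-index pairs.
    """
    m, n = len(A), len(B)
    if m == 0 or n == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # Local distance matrix d and accumulated distance matrix D.
    d = [[dist(A[i], B[j]) for j in range(n)] for i in range(m)]
    D = [[inf] * n for _ in range(m)]
    D[0][0] = d[0][0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best = min(
                D[i - 1][j] if i > 0 else inf,
                D[i][j - 1] if j > 0 else inf,
                D[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            D[i][j] = d[i][j] + best
    # Backtrack from the end point to (0, 0) along minimum accumulated cost.
    i, j = m - 1, n - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((D[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((D[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((D[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return D[m - 1][n - 1], path
```

For example, `dtw([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])` aligns the repeated middle frame of the second sequence to the single middle frame of the first, giving a global score of 0.0.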
A speaker verification system involves two phases, namely the training phase and the trial phase. In the training phase of an MFCC/DTW [21] system, speech utterances are first recorded from speakers. These utterances are featurized into streams of MFCC feature vectors and stored as enrollment templates. In the trial phase, when a speaker claims a certain identity and speaks a pre-defined pass-phrase, his/her voice print is recorded and then featurized into MFCCs. This trial template is then time-registered against the stored enrollment template of the speaker of the claimed identity, using the DTW algorithm. The DTW algorithm time-aligns the two templates and gives a similarity measure in the form of a distance score. Figure 1 shows the block diagram of a typical MFCC/DTW system. The pre-processing module performs pre-emphasis and Voice Activity Detection (VAD). VAD is used to find the end points of the speech utterance. The signal is then passed through the featurization and normalization module to obtain the MFCC template, which is aligned with another template using the DTW algorithm to find their similarity.

Figure 1. MFCC/DTW template matching system.
The decision making can be based on either hard thresholding or a cohort-based method [23]. In the hard thresholding method, either a global threshold is calculated from a development set of speakers or a speaker-specific threshold is calculated offline using the multiple enrollment templates. In the cohort-based method, additional template matching is performed with enrollment templates of some speakers who are imposters to the speaker of the claimed identity. The scores corresponding to the target speaker and its imposters are compared and a soft thresholding criterion is used to make the decision.
In recent times, online i-vector features have been successfully used for speaker diarization and also for speaker adaptation in speech recognition. In [17], online i-vectors have been integrated with DTW technique for text-dependent speaker verification and it has been shown to outperform many of the latest techniques.
An i-vector is a low dimensional representation of acoustic features. It uses a set of total variability factors, where each factor represents an eigen dimension of Total Variability Space. An i-vector may be represented as
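The equation is not reproduced in this text. As a hedged illustration, the total variability model is usually written in the following form; the symbols are assumptions chosen here for illustration and may differ from the authors' notation.

```latex
% Assumed standard total-variability form (symbols are illustrative):
%   M : speaker- and session-dependent GMM mean supervector
%   m : speaker- and session-independent (UBM) mean supervector
%   T : low-rank total variability matrix
%   w : the i-vector (total variability factors)
M = m + T w
```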
Online i-vectors are extracted from a short window of MFCC features. They are calculated frame-wise, taking L frames from both the left and the right context of each frame. For the ith frame of the utterance, the window from the (i − L)th to the (i + L)th frame is used for i-vector extraction. Figure 2 illustrates the online i-vector extraction process.

Figure 2. Online i-vector extraction.
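The frame-wise windowing can be sketched as below. Only the selection of context frames is shown, not the i-vector extraction itself. How the window is handled at utterance boundaries is not stated in the text, so clipping to the valid frame range is an assumption, and the function name `context_windows` is hypothetical.

```python
def context_windows(num_frames, L):
    """Return one (i, start, end) triple per frame i.

    The window covers frames start..end inclusive, nominally (i - L)..(i + L).
    Clipping at the utterance boundaries is an assumption of this sketch.
    """
    windows = []
    for i in range(num_frames):
        start = max(0, i - L)
        end = min(num_frames - 1, i + L)
        windows.append((i, start, end))
    return windows
```

With L = 10, an interior frame gets a window of 21 frames, from which one online i-vector would be extracted.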
The online i-vectors are then projected on a PLDA subspace to obtain PLDA projection vectors [17]. The PLDA is trained on development data with a speaker-phrase definition of classes. It is trained on similar short segments of speech utterances.
PLDA is a generative model used as a backend in the i-vector framework. It models i-vectors according to the equation
Then, PLDA projection vectors (μ′) are used in place of MFCC features in the DTW template matching framework. DTW algorithm, in this case, uses cosine distance score for finding the local distances between frames of the two templates.
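A local distance of this kind can be sketched as follows. The exact definition used by the authors is not given in the text; this sketch assumes the common form 1 − cos(u, v), and the function name `cosine_distance` is hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - (u . v) / (|u| |v|) between two vectors.

    Assumed definition; returns 1.0 if either vector has zero norm.
    """
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Such a function can be passed as the local distance to any DTW routine that accepts a pluggable frame-level distance, in place of the Euclidean distance used for MFCC templates.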
Vowel regions are regions of relatively high signal-to-noise ratio (SNR). These regions have been found to be more significant in terms of speaker-discriminative information. In [18], it has been found that features extracted from such regions are more robust to noise degradation and lead to significant improvement in performance. The authors used the GMM-UBM technique in the text-independent scenario to evaluate their work. Vowel regions are identified using the vowel onset point (VOP). Speech spanning a certain number of frames starting from the VOP roughly corresponds to a vowel sound. These portions of speech are associated with impulse-like excitation and are therefore more informative of the speaker's vocal tract configuration.
Methodology
In the proposed method, we complement the global DTW score with local scores from specific regions of the speech segments. The regions are determined based on their vowel characteristics. Vocal tract information is manifested more prominently in such regions, as they correspond to impulse-like excitation. A VOP detection algorithm is used to detect the onset points of vowel regions in an utterance. Then, corresponding to each such point, a vowel is approximated by considering a speech span of 100 ms to its right. These vowel regions in an enrollment template are mapped to the DTW warping path and the corresponding local distance scores are considered. Figure 3 shows the DTW warping path obtained from template matching between an enrollment template and an imposter template of the same pass-phrase. It marks the regions, corresponding to VOPs, that are chosen to obtain the local scores related to speaker-identity-rich regions.

Figure 3. DTW warping path marked with speaker-identity-rich regions anchored by VOPs.
In this work, we consider pass-phrases of an average duration of 3 s. Three instances of a pass-phrase are considered as enrollment data for a particular speaker. In the case of the MFCC/DTW system, every trial utterance is time-aligned with each of the three enrollment utterances. A global similarity score based on the optimal path is obtained for every enrollment template. Thereafter, the three scores are averaged to give the final similarity score. This is the basis of the first baseline system considered in this work.
In the proposed system, a VOP detection algorithm is run on each of the three enrollment templates. Then, vowel regions in the utterances are determined, as explained earlier in this section. The frame indices corresponding to each such region are stored in a look-up table. In the trial phase, after alignment of the trial utterance with an enrollment template, the look-up table is used to map each vowel region to the warping path. The local distances corresponding to the mapped portion of the warping path are summed to obtain a score for each individual vowel. An utterance-level local score is then obtained by summing the scores of all the vowels in the utterance. This score is used in weighted combination with the global score corresponding to the same enrollment template, obtained in the conventional MFCC/DTW system. The combined scores with respect to all three enrollment templates are then averaged to obtain the modified similarity score. Experiments have been performed with different weight combinations to analyze the impact of each score component on the performance of the system.
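The look-up table, the local score and the weighted combination described above can be sketched as follows. This is a minimal illustration with several assumptions: VOP positions are taken as given frame indices (the detection algorithm is not shown), warping path pairs are ordered (enrollment frame, trial frame), a 100 ms region is taken as 10 frames on the assumption of a 10 ms frame shift, and no score normalization is applied. All function names are hypothetical.

```python
def vowel_regions(vop_frames, num_frames, frames_per_region=10):
    """Build the look-up table: one (start, end) inclusive range per VOP.

    frames_per_region=10 assumes a 10 ms frame shift (about 100 ms per vowel).
    Regions are clipped at the end of the utterance.
    """
    table = []
    for vop in vop_frames:
        end = min(vop + frames_per_region - 1, num_frames - 1)
        table.append((vop, end))
    return table


def local_score(path, local_dist, regions):
    """Utterance-level local score from the vowel regions.

    For each vowel region, sum the local distances at warping-path points
    whose enrollment frame index lies in the region, then sum over vowels.
    path holds (enrollment_frame, trial_frame) pairs (assumed axis order).
    """
    per_vowel = []
    for start, end in regions:
        per_vowel.append(
            sum(local_dist[i][j] for i, j in path if start <= i <= end)
        )
    return sum(per_vowel)


def combined_score(global_score, local, w_global, w_local):
    """Weighted combination of the overall (O) and local (L) scores."""
    return w_global * global_score + w_local * local


def average(scores):
    """Average the combined scores over the enrollment templates."""
    return sum(scores) / len(scores)
```

In use, `combined_score` would be evaluated once per enrollment template and the three results passed to `average`, with weights such as 0.6 and 0.4 chosen experimentally.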
Extended online i-vector/PLDA/DTW system
For the baseline online i-vector/PLDA/DTW system, online i-vectors are first computed and then PLDA projection vectors are obtained as per the procedure explained in Section 2.4. A context of 10 frames to the left and 10 to the right is considered when computing the online i-vector for a particular frame. As in the case of the MFCC/DTW system, three enrollment utterances per speaker are considered for this system as well. The stream of PLDA projection vectors for each of the three utterances is computed and stored as an enrollment template. At the time of trial, the trial utterance is subjected to the same featurization procedure. The resulting stream of projection vectors is time-aligned with each of the enrollment templates using the DTW algorithm. As in the case of the earlier system, the DTW scores corresponding to the three enrollment templates are averaged to give the final similarity score.
This system is extended in the same way as the MFCC/DTW system. The VOP detection algorithm is run on each of the three enrollment templates. Accordingly, vowel regions, each of 100 ms duration, are located for each of the templates and the frame indices are stored in a look-up table. The trial utterance is first time-aligned with each of the templates to obtain an optimal path. The portions of the optimal path corresponding to vowels in the enrollment template are selected using the look-up table. The local distances of the individual portions are summed to obtain a score for every vowel. The scores of all vowels in the template are summed to give an utterance-level local score. This score is combined, with a chosen weight, with the score obtained against the same template using the conventional system. The combined weighted scores with respect to all three templates are averaged to give the modified similarity score.
Experimental setup
The systems have been evaluated on Part 1 of the RSR2015 database [5]. It consists of 30 pass-phrases of an average duration of 3 s and a pool of 300 speakers. Nine sessions are recorded for each of the pass-phrases by each speaker. The database is divided into three sets: a background set, a development set and an evaluation set. The background set has 50 males and 47 females, the development set has 50 males and 47 females, and the evaluation set has 57 males and 49 females. The results have been obtained for the evaluation set of the database. Following the setup used in [17], performance has been evaluated on four types of imposter conditions. Cond1 consists of trials in which a different phrase is involved. Cond2 refers to trials in which the same phrase is pronounced by different speakers, while in Cond3 both speaker and phrase are different. Finally, Cond4 comprises all the trials from the first three conditions.
Features and normalization
The speech signal is downsampled to 8 kHz to allow easy comparison with contemporary work. It is framed into blocks of 20 ms with 50% overlap. Speech and non-speech frames are distinguished using energy-based voice activity detection. 39-dimensional MFCC features (including delta and double-delta features) are extracted from the speech frames using the HTK toolkit [22]. For channel compensation, Cepstral Mean Subtraction (CMS) is applied one utterance at a time.
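The framing and per-utterance CMS steps can be sketched as follows. At 8 kHz, a 20 ms frame is 160 samples and 50% overlap gives a shift of 80 samples (10 ms). This is an illustrative sketch only; the actual features in this work are extracted with HTK, and the function names here are hypothetical. Dropping a trailing partial frame is an assumption.

```python
def frame_signal(signal, sample_rate=8000, frame_ms=20, overlap=0.5):
    """Split a signal into overlapping frames.

    With the defaults, frames are 160 samples long with an 80-sample shift.
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def cepstral_mean_subtraction(features):
    """Subtract the per-dimension mean computed over one utterance.

    features is a list of equal-length feature vectors for that utterance.
    """
    if not features:
        return []
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[k] for f in features) / n for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in features]
```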
i-vector extractor and PLDA model
Two gender-dependent GMM-UBMs, each with 512 Gaussians, are trained using the background set data of the RSR2015 database. i-vector extractors are trained on the same data to give 400-dimensional features. The i-vectors are then centered and whitened, using a mean vector and a whitening matrix respectively. These parameters are obtained using the development set data. The development set data is also used for training the PLDA model. The classes are defined to be speaker-phrase specific. The PLDA subspace dimension is set to 200. The MSR Identity Toolkit has been used for training the Total Variability and PLDA subspaces [24].
Score normalization
The scores are T-normalized [23] before calculating the Equal Error Rate (EER) of the system. The development data set is used to find the parameters for T-normalization. The parameters are obtained separately for the male and the female sets.
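The normalization and the EER computation can be sketched as follows. This is a simplified illustration, not the authors' evaluation code: `t_normalize` standardizes a score with the mean and standard deviation of a set of cohort scores, and `equal_error_rate` scans the observed scores as thresholds. The EER function assumes that a higher score means a more likely target, so DTW distance scores would be negated before use. Both function names are hypothetical.

```python
import statistics


def t_normalize(score, cohort_scores):
    """Standardize a trial score using cohort score statistics.

    Uses the population standard deviation (an assumption); if the cohort
    scores have zero spread, only the mean is subtracted.
    """
    mu = statistics.mean(cohort_scores)
    sigma = statistics.pstdev(cohort_scores)
    if sigma == 0:
        return score - mu
    return (score - mu) / sigma


def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER as a fraction in [0, 1].

    Assumes higher scores indicate targets. Each observed score is tried as
    a threshold; the EER is taken where false acceptance and false rejection
    rates are closest, as the average of the two rates.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```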
Results and discussions
In this section, we discuss the results obtained with the two enhanced DTW based systems. Table 1 shows the results for the MFCC/DTW system with respect to different weighting schemes, on the male evaluation set of Part 1 of the RSR2015 database. The EERs tabulated in the row against ‘O’ represent the performance of the baseline system. ‘L’ refers to the case where only the local scores are considered. The remaining rows give EERs for various weighting schemes. The average local score, L, alone is found to outperform the overall score, O, in the case of Cond2. In the other cases, the overall score, O, is found to be more discriminating. This behavior is along expected lines, since in Cond2 the text content of the imposter trials is the same as that of the enrollment template. As a result, the vowel regions, which are believed to be rich in speaker identity information, are more likely to be aligned against their corresponding segments by the DTW algorithm than in the other cases. Thus, the corresponding vowel regions of the two templates, which have the same text content but are pronounced by different speakers, are scored one to one. These scores are more indicative of speaker variability, because vowel regions corresponding to the same text but pronounced by different speakers carry more discernible speaker-discriminating information. However, the experimental results suggest that the two scores, O and L, carry complementary information and lead to a reduction in EER of 8.49% relative to the baseline system for the weighting schemes 0.3xO+0.7xL and 0.4xO+0.6xL.
Table 1. Performance of the extended MFCC/DTW system with respect to different weighting schemes for male speakers
In the case of Cond1, the overall score is more discriminating than the local score. Nevertheless, even in this case, the experimental results indicate that the two scores carry complementary information. The weighting scheme 0.6xO+0.4xL leads to a relative improvement of 20% over the baseline system. This improvement may be attributed to the text-characteristic information that is captured through the alignment of vowel regions corresponding to different texts. This suggests that, in addition to speaker information, text information is also more easily discernible through vowel regions.
Cond3, where both text contents and speakers are different, shows 32.10% relative improvement after inclusion of local scores, for the same weighting scheme. Cond4, which considers all the three types of trials, shows relative performance improvement of 11.85% with respect to the baseline system, for equal weightage of O and L.
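The relative improvements quoted in this section are understood here in the usual sense of the EER reduction expressed as a percentage of the baseline EER; this definition is an assumption, since the text does not state it explicitly. A minimal sketch follows, with a hypothetical function name and purely illustrative numbers that are not taken from the paper's tables.

```python
def relative_improvement(baseline_eer, extended_eer):
    """Relative EER reduction in percent: 100 * (baseline - extended) / baseline."""
    return 100.0 * (baseline_eer - extended_eer) / baseline_eer


# Illustrative values only (not results from this work):
# a drop from 2.0% EER to 1.5% EER is a 25% relative improvement.
example = relative_improvement(2.0, 1.5)
```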
Table 2 shows the results for the female set of speakers of Part 1 of the RSR2015 database. Here too, the weighted combinations of O and L have been found to be more discriminating, resulting in relative improvements of up to 16.32%, 9.01%, 16.12% and 9.54% for Cond1, Cond2, Cond3 and Cond4 respectively. The best weighting scheme for Cond4 is found to be 0.3xO+0.7xL.
Table 2. Performance of the extended MFCC/DTW system with respect to different weighting schemes for female speakers
Tables 3 and 4 give the results obtained for the online i-vector/PLDA/DTW system for the male and the female evaluation sets respectively. The proposed extended system has been found to give improved performance relative to the baseline system. In this case, the local score, L, alone proves more discriminating than the overall score, O, across all four conditions. Relative improvements of up to 67.74%, 12.04%, 23.52% and 46.15% have been observed for Cond1, Cond2, Cond3 and Cond4 respectively for the male dataset. The EERs obtained for the female evaluation set follow a similar trend. For the weighting scheme 0.7xO+0.3xL, a relative improvement of 49.41% has been observed for Cond4.
Table 3. Performance of the extended online i-vector/PLDA/DTW/MFCC system with respect to different weighting schemes for male speakers
Table 4. Performance of the extended online i-vector/PLDA/DTW/MFCC system with respect to different weighting schemes for female speakers
Figure 4 illustrates the relative EER readings of the baseline systems and their extended versions. Here, S1 stands for MFCC/DTW system and S1E for its extended version. Similarly, S2 and S2E stand for online i-vector/PLDA/DTW system and its extended version respectively.

Figure 4. EERs for baseline and extended systems.
DTW techniques are known to model temporal information, which is critical in the text-dependent scenario of speaker verification. On the other hand, other widely used techniques, like i-vector/PLDA and JFA, do not model the temporal structure of the pass-phrase [17]. These techniques instead average the phonetic and speaker information across the whole utterance. Table 5 gives the performance comparison of the proposed extended systems (male), taking their best-yielding weighting combinations, with the rest of the techniques considered in this work. The MFCC/DTW system and its extended version outperform i-vector/PLDA and JFA with respect to test conditions Cond1 and Cond3. The temporal information modelled by the DTW technique gives the systems a distinct advantage in these cases. The integration of the DTW technique with the i-vector/PLDA framework has helped achieve improved performance, and complementing it with knowledge of specific regions of interest, as proposed in this work, has further reduced the EER of the system. It may be observed that the proposed extension has reduced the EER values across all four testing conditions with respect to the baseline systems. The results confirm the importance of modelling the temporal information and also the usefulness of the complementary information in the vowel regions.
Table 5. Performance comparison of the proposed system with some of the existing techniques in terms of EER (%)
b: Corresponding to the weighting scheme that results in the lowest EER for Cond4.
Conclusions
Experimental results confirm the importance of vowel regions for text-dependent speaker verification. Extending the baseline systems to incorporate the knowledge of vowel regions has helped achieve a significant reduction in EER. The extended systems outperform the baseline systems across all conditions and evaluation sets. The EERs corresponding to Cond4 may be considered to represent the overall performance of the system. The extended MFCC/DTW system shows relative improvements of up to 11.85% and 9.54% for the male and female evaluation sets respectively. Extending the online i-vector/PLDA/DTW system with the proposed method results in relative improvements of 46.15% and 49.41% for the male and female sets respectively. It may be inferred that vowels are important for both text and speaker discrimination. Of the two proposed systems, the extended i-vector/PLDA/DTW system performs better, by 75.43% and 76.11% in relative terms for the male and female evaluation sets respectively.
It is observed, however, that the weighting scheme that results in the minimum EER differs across the two systems and across conditions. For practical system building, it may be helpful to have a pre-calculated optimum weighting scheme. The task of finding such an optimum weighting scheme can be explored in future work.
Acknowledgments
The authors would like to thank the Speech and Image Processing Laboratory of the National Institute of Technology Silchar for supporting the research work.
