Abstract
This work describes techniques for collecting and organizing a speech database under different scenarios at the institute, and the structuring of a text-independent speaker identification database in the Indian context. To obtain the multi-scenario dataset, each speaker recorded multiple sessions in reading style, in English and Hindi, using the same passages under different conditions. The work analyzes how different scenarios affect the performance of a speaker recognition system when it is tested under conditions that differ from training. Four scenarios are considered: sensor and environment, language, aging, and health. To study the effect of sensor, language and environment on the performance of an ASR system, a database of 200 speakers was created. Under different environmental conditions, four types of sensors in a parallel configuration were used to study sensor mismatch between the training and testing phases. The database contains speech samples of each individual in English and Hindi, in read-speech style, in two environments: a controlled recording chamber and a library. To study the aging effect, an aging NSIT speaker database (AG-NSIT-SD) of 53 famous personalities was collected from online sources, with recordings spanning a period of 10–20 years. To study the effect of health, a cough and cold NSIT speaker database (CC-NSIT-SD) of 38 speakers was also collected. In addition, the effect of different noise types on speaker identification was studied for the different sensors.
Introduction
Over the past several years, the speaker recognition field has progressed rapidly thanks to the availability of databases [1–3]. Shared databases speed up the comparison of different techniques on the same data, making it easier to decide which technique is worth pursuing. Standard speech databases also help establish state-of-the-art performance under given experimental conditions and highlight issues that require additional research. The speaker recognition database described here is designed to support the exploration of new ideas in speaker recognition and the development of advanced technology that merges these schemes. All of the datasets will be available for use in technology development, research and linguistic education, for speech-related studies such as the effect of noise, age, health, channel mismatch, reverberation and sensor mismatch.
Survey of databases used in speaker recognition systems
Sufficient speech to train and test an ASR system can be obtained through the use of a prerecorded speech database. Standard speech databases are very important for the development and evaluation of speaker recognition systems because they enable formal assessment and comparison of systems, and hence the measurement of progress. When an available standard database is used, results can be compared directly with those of existing research. Speech databases are of many types, depending on the type of speech, language (linguistic morphology), size, number of sessions, number of speakers, input devices, acoustic environment and mismatched conditions. On the basis of linguistic morphology, there are isolating languages (corpora of this kind include English and Hindi) and agglutinative languages (Hungarian [4], Turkish, Finnish); hence databases are language specific. Depending on the type of speech, databases can contain read speech, spontaneous speech, or telephone speech or conversation. The TIMIT Acoustic-Phonetic continuous speech corpus [5] comprises recordings of read speech from 630 individuals (438 male and 192 female), with one recording session per speaker. The YOHO Speaker-Verification corpus [6] consists of prompted read speech recorded in office environments, with low-level office and computer noise in the data. The NYNEX Telephone Version of the TIMIT Corpus (NTIMIT) [7] consists of telephone speech. The Switchboard corpus [8] is one of the largest collections of conversational telephone speech recordings. The ATIS database [9] consists of spontaneous conversation. On the basis of sessions, speech databases are also classified as single-session (TIMIT) and multi-session (Digit-SPL [10]).
Digit-SPL [10] is a multi-session database consisting of 68 male and 19 female speakers with relatively clean speech (an average SNR of 41.6 dB), recorded in three sessions separated by approximately 4–8 weeks. On the basis of input devices, speech databases can contain microphone speech or telephone speech (over local or long-distance telephone lines). NTIMIT is derived from TIMIT: it contains the same speech at telephone bandwidth, which allows a general analysis of the degradation caused by the telephone line. On the basis of acoustic environment, databases are recorded either in a noise-free environment, such as a sound booth, or with office/home noise (Aurora). The Aurora database [11] contains noisy data in which noise is artificially added to the clean data at SNRs of 20 dB, 15 dB, 10 dB, 5 dB, 0 dB and -5 dB. The added noise signals are chosen to reflect realistic environmental conditions. In total, eight noise types are used: subway, babble, car, exhibition hall, restaurant, street, airport and train station. The training set consists of 8440 different utterances, containing recordings of 55 male and 55 female speakers. The YOHO Speaker-Verification corpus is a high-quality speech database for text-dependent verification with a limited vocabulary [6], recorded in office environments with low-level office and computer noise. The corpus comprises recordings of 138 subjects: 106 male and 32 female. The Switchboard (I and II) corpora contain conversational telephone speech recordings affected by echo or crosstalk in the telephone circuit, background noise (e.g. a baby crying, television, radio) and distortion (echo and other recording problems). There are 543 and 657 speakers in the Switchboard I and II corpora respectively.
Different Speaker Recognition Evaluation (SRE) corpora were derived from Switchboard to allow assessment and comparison of different systems. The POLYPHONE database, a multilingual project [12, 13], consists of 5000 speakers recorded over the telephone line. The POLYCOST corpus consists of 133 speakers (74 male and 59 female) [14]. Each speaker provided more than 5 sentences, with an intersession interval ranging from days to weeks. The speech samples include fixed and prompted digit strings, read sentences and free monologue. The recordings were made using various telephone handsets over digital ISDN channels in a home/office acoustic environment. The speakers used non-native English as well as various European languages. SIVA is an Italian speech corpus [15] consisting of four speaker categories (male users, female users, male impostors, female impostors) recorded by fluent Italian speakers. The speech was recorded over Public Switched Telephone Network (PSTN) channels in a home/office acoustic environment. The major drawback of this corpus is that it contains only three sessions, and not for all the speakers. Each session contains a list of isolated words in the form of digits, a dialogue in which personal information is asked for, and a read passage, for a total of about 180 seconds. Further details are given in [15]. The ELSDSR corpus [6] consists of voice messages from 22 speakers (10 female and 12 male) aged from 24 to 63: long read English sentences recorded by non-native speakers, namely 20 Danes, one Icelander and one Canadian. The POLYVAR speech corpus [2], specifically designed to evaluate intra-speaker variability in French, contains 143 speakers (85 male and 58 female). In total there are 3600 sessions, with each speaker recorded in 1 to 229 sessions and an intersession interval ranging from days to months. The speech samples include read digits, words and sentences, and spontaneous speech.
The speech was recorded using different telephone handsets over PSTN channels in a home/office acoustic environment. The King corpus consists of 51 male speakers; each speaker was recorded over 10 sessions, providing speech data at intervals ranging from weeks to a month [16]. The speech was recorded in a sound booth using a wideband microphone and an electret handset over clean and PSTN channels. The National Institute of Standards and Technology (NIST) has been coordinating Speaker Recognition Evaluations since 1996. Details of the new releases of NIST corpora can be found in [19].
The paper is organized as follows. Section 1 introduces the work, and Section 1.1 surveys the databases used in speaker recognition systems. Section 2 discusses the need for data collection. Section 3 describes the speech data collection and the device details. Section 4 presents the experimental results. Finally, conclusions are given in Section 5.
Need for data collection
Studying how different handsets and telephone transmission channels affect the performance of an ASR system motivates the need for data collection. The speaker databases were collected to cover the effects of the following prototypes.
Prototypes affecting the database
Sensor type and environment effects: With the advancement of cellular technologies, mobile phones, cordless phones and cellular phones are omnipresent, and each type of phone has a different speakerphone and headset. Speakers may use different telephone handsets in different training and test combinations, which may affect speaker recognition performance because of the transmission channel mismatch caused by the different microphones.
Language effects: The effect of language differences on the performance of ASR systems has been a subject of great interest, but little research has been conducted in this field, possibly because of the lack of data in multiple languages and, in particular, of data in which the same speaker is recorded in multiple languages.
Aging effects: A change in a person's voice over a period of 20 years is clearly audible to the human ear, but whether an ASR system can still find the similarity between speech samples of the same person separated by such an age difference is a question of interest that has not yet been studied. No data is available for this study, so data of some known personalities, covering a period of 20 years, was captured from online sources [17, 18].
Health effects: Speech data of a speaker is normally recorded when the throat is in a healthy condition, but if the speaker is tested while suffering from cough and cold, the performance of the ASR system may be affected. For this purpose, cough and cold data was collected and analyzed.
Speech data collection
One of the hostel rooms (size
The following recording sensors were used:
Headset microphone: A typical headset microphone of the kind used with personal computers (PCs) was used as a sensor to capture the data. Because the headset microphone has a wideband flat response up to 16 kHz, we were able to collect speech data at 16 kHz. To capture the cleanest possible speech data, the headset microphone was mounted closer to the speaker than any other sensor.
Laptop built-in microphone: The built-in (internal) microphone of a laptop was used as another sensor, with a sampling frequency of 16 kHz.
Mobile phones in offline mode: Two mobile phones, an iPhone 4 and an HTC Explorer, were used for recording the data. Both phones recorded at a sampling frequency of 8 kHz; one stored data in .m4a format and the other in .amr format. The phones were placed on the table at a distance of 2–3 feet from the speaker. Before the data was stored on the PC, the files in both default formats were converted to WAV format.
About 3–4 minutes of reading-style speech data was first collected using an English passage, followed by passage reading in Hindi, with all the sensors placed close to the speaker and recording simultaneously. The data was then converted to .wav format with the help of an audio converter and stored on the laptop. The phones were then synchronized with the computer, and the phone recordings were also stored on the computer. Two kinds of environments were selected for data collection: the institute library and a recording chamber.
In the library, the database was recorded under real working conditions, in the presence of students and staff and with all the air-conditioners and other electrical equipment switched on. This adds ambient noise to the speech database. The library chamber was of around (
Device details
The device details are tabulated in Table 1.
Hi-tech microphone: It consists of a microphone attached to headphones and is plugged into a laptop while recording. The sampling rate of such microphones is 16 kHz and the recording format is .wma.
iPhone 4: It is a cellular mobile phone whose embedded voice recording application was used to record the speaker's voice. The audio frequency response is 20 Hz to 20,000 Hz. The following audio formats are supported: AAC (8 to 320 Kbps), Protected AAC (from iTunes Store), HE-AAC, MP3 (8 to 320 Kbps), MP3 VBR, Audible (formats 2, 3, 4, Audible Enhanced Audio, AAX, and AAX+), Apple Lossless, AIFF, and WAV.
HTC Explorer: It is a cellular mobile phone whose embedded voice recording application was used to record the speaker's voice. The following audio formats are supported: WAV, AAC, AMR, MP3, and WMA.
Toshiba L650 laptop: It is a computer whose embedded microphone was used for recording. The voice was recorded in .wma format.
Device Details
Multi scenario NSIT speaker database (MS-NSIT-SD): After the speech samples were assimilated and converted to the same format, each sample was tagged with a unique identification code constructed from different flags.
Nomenclature: 〈Speaker ID〉 〈Environment flag〉 〈Language flag〉
〈Environment flag〉: This flag represents the environment in which the speech sample was recorded. It is a single-digit flag, 2 or 3: 2 denotes the recording chamber and 3 the library.
〈Speaker ID〉: SP2- This flag represents the speaker ID, which is unique to every speaker.
〈Language flag〉: This flag represents the language of the recorded speech sample. It is an alphabetic flag: En stands for English and Hn for Hindi.
SP200En represents a sample recorded at the hostel by the speaker with identity SP2001 in English. The nomenclature of the speaker data collected in the hostel rooms and the library is shown in Table 2. The two recording scenarios, the recording chamber and the college library, are shown in Figs. 1 and 2 respectively.
Cough and cold NSIT speaker database (CC-NSIT-SD):
Nomenclature: 〈Speaker ID〉 〈Health flag〉
〈Health flag〉: This flag represents the health condition in which the speech sample was recorded. It is a single-digit flag, 4 or 5: 4 denotes a clear throat and 5 a cough and cold condition (infected throat).
〈Speaker ID〉: SP5- This flag represents the speaker ID of a speaker infected by cough and cold.
Aging NSIT speaker database (AG-NSIT-SD): This database contains speech samples of 53 famous personalities, with recordings ranging from 1990–95 to 2005–15. Speech samples of 3–4 min duration from famous personalities such as Bill Clinton and Narendra Modi were collected by downloading videos of their speeches at different points in their lifetime from youtube.com and other online sites. The audio was extracted from the videos in .wav format using all video to audio converter software. Three speech samples were collected for each celebrity. With the help of this dataset, we have tried to study the effect of long-term aging of human speech on a speaker recognition system.
Nomenclature: 〈Speaker ID〉 〈Age difference flag〉
〈Age difference flag〉: This flag represents the period in which the speech sample was recorded. It is a single-letter flag, O or L: O denotes old speech from 1990–95 and L the latest speech from 2005–15.
〈Speaker ID〉: SPO- This flag represents the speaker ID of old speech.
Nomenclature

Setup in the recording chamber.

Setup in the library environment.
Results of multi scenario NSIT speaker database (MS-NSIT-SD) collected with different sensors
In this section, the results are reported and analyzed for the library and the recording chamber across the training and testing phases. Speech data sampled at 8 kHz was used, with a Hamming window of length 25 ms, a frame shift of 10 ms and a pre-emphasis factor of 0.97. The speech data is parametrized using 13-dimensional static MFCC vectors, Cf1 to Cf13 [20]. To reduce channel mismatch, cepstral mean and variance normalization (CMVN) was performed. For this setup, the speaker models were trained using speech utterances of 2 min, and utterances of 50–60 s duration were used for testing. After the features are extracted from the speech samples, the coefficients are used for speaker identification. This process helps in estimating how likely speakers are to be matched across the different sensors. The naming of the samples used in the identification process is given in Table 3, and the results for matching of different combinations are shown in Table 4. For verification, a GMM-UBM was used with 1024 Gaussian mixture components and diagonal covariance matrices. The average log-likelihood score was calculated over the test data features with the claimed speaker's model.
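The front end and scoring described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pre_emphasize`, `frame_signal`, `cmvn`, `average_loglik`) are hypothetical, CMVN is assumed to be applied per utterance, the MFCC computation itself is omitted, and the GMM parameters are assumed to come from a separately trained model.

```python
import math

def pre_emphasize(signal, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming(length):
    """Hamming window of the given length (length >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_signal(signal, sample_rate=8000, win_ms=25, shift_ms=10):
    """Cut the signal into overlapping Hamming-windowed frames."""
    win_len = sample_rate * win_ms // 1000   # 200 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000   # 80 samples at 8 kHz
    window = hamming(win_len)
    return [[signal[start + i] * window[i] for i in range(win_len)]
            for start in range(0, len(signal) - win_len + 1, shift)]

def cmvn(features):
    """Normalize each feature dimension to zero mean and unit variance
    over one utterance (assumption: per-utterance statistics)."""
    n, dims = len(features), len(features[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        column = [frame[d] for frame in features]
        mean = sum(column) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n) or 1.0
        for t in range(n):
            out[t][d] = (features[t][d] - mean) / std
    return out

def gmm_frame_loglik(x, weights, means, variances):
    """log p(x) under a diagonal-covariance GMM, using log-sum-exp."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xd, md, vd in zip(x, mu, var):
            ll -= 0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
        comps.append(ll)
    peak = max(comps)
    return peak + math.log(sum(math.exp(c - peak) for c in comps))

def average_loglik(features, gmm):
    """Average per-frame log-likelihood; gmm = (weights, means, variances)."""
    return sum(gmm_frame_loglik(x, *gmm) for x in features) / len(features)
```

In a GMM-UBM system the score is commonly reported as the difference between `average_loglik` under the claimed speaker's model and under the UBM; whether that normalization was applied here is not stated in the text, so it is left out of the sketch.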
Samples used in the identification process
Results for matching of different combinations
To study the effect of different sensors under different environmental conditions on the performance of the system, we used samples from four sensors: sample 1, sample 2, sample 3 and sample 4, as listed in Table 3. The training data contains 200 speech utterances of 2 minutes from each sample. The test data set contains 4 speech utterances from each sensor, with lengths varying from 50 to 60 seconds. Various combinations of data from different sensors were tried for training and testing. In the first experiment, 13 MFCC features with mean and variance normalization were used on reading-style speech recorded in the uncontrolled library environment. The performance in terms of equal error rates (EERs) is shown in Table 5. In the second experiment, 13 MFCC features with mean and variance normalization were used on reading-style speech recorded in the controlled recording chamber environment. The performance in terms of EERs is shown in Table 6. The controlled recording chamber environment with mean and variance normalization gives the best EER in most cases.
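For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below estimates it by sweeping a threshold over the observed scores; the function name `equal_error_rate` and the toy scores are hypothetical, and the text does not say how the reported EERs were actually computed.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER (as a fraction) from genuine and impostor scores.

    A trial is accepted when its score is >= the threshold. The threshold
    is swept over all observed scores, and the point where the false
    acceptance and false rejection rates are closest is returned.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        frr = sum(s < th for s in genuine) / len(genuine)     # true speakers rejected
        far = sum(s >= th for s in impostor) / len(impostor)  # impostors accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy example: one genuine score falls below one impostor score.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With only a handful of trials the estimate is coarse; with many trials the sweep approaches the crossing point of the two error curves.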
EERs for different sensors with 13 MFCC features with mean and variance normalization, environment: Library, style of speech: reading
EERs for different sensors with 13 MFCC features with mean and variance normalization, environment: Recording chamber, style of speech: reading
The analysis showed that 94% of the speakers were correctly identified. The performance on CC-NSIT-SD in terms of EERs is shown in Table 7.
EERs of health effect using Hitech microphone
In this step, the effect of different noises and different signal-to-noise ratios (SNRs) on the speaker identification system was studied. The model was trained with clean speech samples from one sensor and then tested with speech samples from the same sensor superimposed with various noises. The results are shown in Table 8. Of the three noises used, across the different SNR values, the destroyer engine noise had the worst effect on speaker identification and the Volvo noise had the least effect.
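Superimposing noise at a chosen SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the target. The following is a minimal sketch of that scaling, assuming both signals are lists of samples at the same sampling rate; the names `mix_at_snr` and `measured_snr_db` are hypothetical and the text does not describe the tool actually used.

```python
import math

def _power(samples):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the resulting SNR equals snr_db.

    The noise is looped if it is shorter than the speech.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    noise_power = _power(noise)
    if noise_power == 0:
        raise ValueError("noise has zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, noisy):
    """SNR of a noisy signal given the clean speech it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(_power(speech) / _power(residual))
```

Calling `mix_at_snr(clean, noise, 5)` for each noise type and SNR value would produce the kind of test material described above, with the trained model left untouched.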
Effect of adding babble, volvo and destroyer engine noise in test samples of same sensor
Next, the speech samples of the 53 famous personalities were analyzed. There were 3 samples for each speaker, recorded 10–20 years apart, to examine the effect of age on a person's voice. The analysis showed that 93 percent of the speakers were correctly identified. The 7 percent that were mismatched were samples with poor audio quality. The results reflect the change in the voice characteristics of a speaker over a period of 20 years; the corresponding EERs are shown in Table 9.
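An identification rate of this kind can be read as closed-set accuracy: each test sample is scored against every enrolled speaker model and assigned to the best-scoring one. The sketch below assumes that interpretation; the function name `identification_rate`, the speaker labels and the scores are hypothetical, not data from this study.

```python
def identification_rate(scores, truth):
    """Percentage of test samples whose best-scoring model is the true speaker.

    scores: {test_id: {model_id: score}}, higher score = better match.
    truth:  {test_id: model_id of the true speaker}.
    """
    correct = 0
    for test_id, per_model in scores.items():
        predicted = max(per_model, key=per_model.get)
        if predicted == truth[test_id]:
            correct += 1
    return 100.0 * correct / len(scores)

# Hypothetical scores for four test samples against three enrolled models.
scores = {
    "t1": {"A": -41.0, "B": -45.2, "C": -47.9},
    "t2": {"A": -44.1, "B": -40.3, "C": -46.0},
    "t3": {"A": -43.0, "B": -42.5, "C": -42.9},  # true speaker C is missed
    "t4": {"A": -48.7, "B": -46.1, "C": -39.8},
}
truth = {"t1": "A", "t2": "B", "t3": "C", "t4": "C"}
print(identification_rate(scores, truth))
```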
EERs of online data
To study the effect of language variation, models trained on Hitech microphone samples in English were tested with samples in Hindi, and vice versa. The analysis showed that 97 percent of the speakers were correctly identified, owing to the similar environment and sensor conditions; the corresponding EERs are shown in Table 10. The results indicate that, when training and testing on data collected from different sensors, higher performance is obtained in the sensor-matched case than in the mismatched case, irrespective of the quality of the speech data. Hence it is concluded that performance degrades when the training and testing sensors are mismatched. Finally, the identification results show that for the same sensor the identification accuracy goes up to 90%, independent of the language spoken (English or Hindi); for different sensors the speaker matching ranges from 40% to 60%; and for clean samples the probability of identification increases to 80–90%. Speaker identification performance was also evaluated under noisy conditions. Of the three noises used, across the different SNR values, the destroyer engine noise had the worst effect on speaker identification and the Volvo noise had the least effect.
EERs of Language effect using Hitech microphone
In this paper, speech data for the development of a speaker identification system was collected by different methodologies in different scenarios. Four scenarios were considered: sensor and environment, language, aging, and health. Techniques for collecting and organizing the speech database were described. To date, about 20 hours of data involving more than 200 unique speakers have been collected in the Multi-Scenario NSIT speaker database (MS-NSIT-SD). To study the aging effect, an aging NSIT speaker database (AG-NSIT-SD) of 53 famous personalities was collected from online sources, with recordings spanning a period of 10–20 years. In addition, a cough and cold NSIT speaker database (CC-NSIT-SD) of 38 speakers was collected to study the performance of a speaker recognition system (SRS). A survey of the databases used in most SRSs was also presented. All the data described herein is, or will shortly be, available for research and technology development.
Our study showed that, when training and testing on data collected from different sensors, higher performance is obtained in the sensor-matched case than in the mismatched case, irrespective of the quality of the speech data. It is therefore concluded that performance degrades when the training and testing sensors are mismatched. The identification results show that for the same sensor the identification accuracy goes up to 90%, independent of the language spoken (English or Hindi); for different sensors the speaker matching ranges from 40% to 60%; and for clean samples the probability of identification increases to 80–90%. Speaker identification performance was also evaluated under noisy conditions. Of the three noises used, across the different SNR values, the destroyer engine noise had the worst effect on speaker identification and the Volvo noise had the least effect. Additional research is needed to improve the model so that identification improves in cases where the speaker uses different sensors.
