Abstract
Speech analysis for extracting attributes such as the speaker, gender, accent and the like has been a field of great interest and has been widely studied. This paper presents a novel architecture for accent identification using a cascade of two deep-learning architectures. We design and test the proposed architecture on the Common Voice dataset. The architecture consists of a cascade of a Convolutional Neural Network (CNN) and a Convolutional Recurrent Neural Network (CRNN), trained on Mel-spectrograms of the audio. We consider five of the most popular English accent groups in this study, namely India, Australia, US, England and Canada. The proposed model achieves an accuracy of 78.48% using the CNN and 83.21% using the CRNN.
Introduction
Traditionally, people have been categorized into groups based on the similarities they share, such as language, culture or gene pool. These groups are often identified as a race or ethnicity. Each race generally exhibits a typical accent in its speech, which can be used to distinguish one race from another even when they speak the same language. Accent identification is the process of using the audio features exhibited by an individual to predict his or her accent. Knowing a person's accent can reveal useful insights about their interests, culture, psychology and similar high-level attributes. A prevailing challenge in research on speech and audio processing technology is recognizing and modeling specific differences in speech or language. Different people have different dialects and accents, and therefore their speaking styles also vary. One of the significant factors behind these differences is socioeconomic background.
The above-mentioned differences introduce modeling difficulties for speaker-independent systems.
In accent classification, researchers are mostly concerned with the acoustic patterns in a speaker's voice. These acoustic patterns vary from person to person. The variations are more pronounced between speakers from different countries, and this serves as the basis for the prediction model. The evolution of Deep Neural Networks (DNNs) in recent years has made it possible to capture different features of the audio more efficiently, resulting in more accurate predictions.
Literature review
Deep learning, when applied to various audio features, can yield precise models for accent classification. In the past, researchers have used Mel Frequency Cepstral Coefficients (MFCCs), Perceptual Linear Predictions (PLPs) and formant frequencies for accent recognition and identification with classical machine learning models such as Support Vector Machines (SVMs) and Gaussian Mixture Models (GMMs) [1, 4]. The accent identification problem has also been addressed by approaches that use temporal features and Hidden Markov Models (HMMs) [5]. In recent years, with the evolution of DNNs, researchers have tried fusing classical approaches such as SVMs with DNNs, which raised test set accuracy to around 50% in identifying the native languages of non-native English speakers from eleven countries [6]. Multilayer perceptrons have also been applied to accent recognition. Artificial Neural Networks (ANNs) with competitive learning, backpropagation and counter propagation offered a promising method for accent identification [7]. Subsequently, to improve accuracy and gain more insight, HMMs were frequently used along with Long Short-Term Memory (LSTM) networks and SVMs. More recently, with the evolution of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), researchers have combined the representational power of CNNs with the ability of RNNs to model temporal features [8, 9]. CNNs have been used extensively with LSTMs in the Language Identification (LID) problem. Christian Bartz proposed an LID system that solves the problem in the image domain rather than the audio domain, using a hybrid CRNN architecture that takes spectrogram images of the audio samples as input [10]. In the related area of music classification, CRNNs showed excellent performance with respect to the number of parameters and training time, indicating the power of their hybrid structure in feature extraction and feature summarization [11, 12].
However, in the area of accent recognition, there have been only limited studies using DNNs [7, 13]. Scott Novich and Andrea Trevino showed, using neural networks, that vowel formant analysis provides accurate information on a person's overall speech and accent [14]. A recent study used DNNs with RNNs on long- and short-term features for accent identification [15]. That work suggests that DNNs combined with RNNs result in more robust models. Neural network methods have improved accent recognition, yet the use of CNNs and CRNNs has been limited to date.
In particular, CNNs and CRNNs exploit spatially local correlations across input data to improve the performance of audio processing tasks such as speech recognition, accent identification and acoustic scene classification [16]. Inspired by these works, this paper presents CNN and CRNN architectures that use Mel-spectrogram features of audio as input for accent recognition. The work shows that CNN and CRNN architectures trained on the Common Voice dataset yield adequate performance for accent identification.
Dataset description
The study uses the Common Voice corpus of speech data read by users on the Common Voice website (http://voice.mozilla.org/). The corpus consists of speech data from 18 languages, comprising 1087 validated hours of recordings in MP3 audio format (sampled at 22.05 kHz). The dataset also includes demographic metadata such as age, sex and accent. The training and testing sets have been made speaker independent.
Audio samples from five different countries (US, England, India, Canada and Australia) have been used. All the audio clips are English speech in the respective accents. A total of 39,000 audio samples have been selected from the dataset, in equal proportion for all the accents. The dataset has been split into a training set of 28,675 samples, a validation set of 7,560 samples and a test set of 2,765 samples, each with equal proportions of all the accents (Table 1).
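The split sizes quoted above can be checked with a few lines of arithmetic. The sketch below only restates the numbers from the text; the variable names are illustrative and not taken from the authors' code.

```python
# Split sizes as reported in the text.
splits = {"train": 28_675, "validation": 7_560, "test": 2_765}
NUM_ACCENTS = 5  # US, England, India, Canada, Australia

total = sum(splits.values())

# Share of the whole selection that falls in each split.
fractions = {name: size / total for name, size in splits.items()}

# Samples per accent in each split, given equal proportions.
per_accent = {name: size // NUM_ACCENTS for name, size in splits.items()}

print(total)       # 39000
print(per_accent)  # {'train': 5735, 'validation': 1512, 'test': 553}
```

The three splits add up to the 39,000 selected samples, and each split divides evenly by five, which is consistent with the stated equal proportions (roughly 73.5% training, 19.4% validation and 7.1% test).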
Dataset description
The audio clips in the dataset vary in length. However, to use this data in neural networks, all samples must be of equal length. In this work, we trimmed the longer clips and padded the shorter clips to the median clip length of the dataset, which was calculated to be 3.62 seconds. Longer audio samples were therefore trimmed to 3.62 seconds, and shorter clips were padded with zeros, since zero padding adds no new information to the audio [17]. Although many length-normalization techniques exist, we chose this approach because it is intuitive and easy to implement. The resulting Mel-spectrograms were then fed to the architectures as inputs.
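A minimal sketch of this length normalization is given below, assuming the clip is available as a sequence of samples at 22.05 kHz and that the zeros are appended at the end of short clips (the text does not say where the padding is placed). The function name `fix_length` is hypothetical.

```python
SAMPLE_RATE = 22_050    # Hz, as stated for the corpus
TARGET_SECONDS = 3.62   # median clip length reported in the text
TARGET_LEN = round(SAMPLE_RATE * TARGET_SECONDS)  # 79,821 samples

def fix_length(samples, target_len=TARGET_LEN):
    """Trim a clip to target_len samples, or right-pad it with zeros."""
    if len(samples) >= target_len:
        return list(samples[:target_len])
    # Assumption: padding goes at the end of the clip.
    return list(samples) + [0.0] * (target_len - len(samples))

print(fix_length([0.1, 0.2, 0.3, 0.4, 0.5], target_len=3))  # [0.1, 0.2, 0.3]
print(fix_length([0.1, 0.2], target_len=4))                 # [0.1, 0.2, 0.0, 0.0]
```

After this step every clip has the same number of samples, so every spectrogram has the same shape.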
Model architecture
In this paper, two DNN architectures, namely CNN and CRNN, have been used.
Convolutional neural network (CNN)
CNNs have been at the core of recent strides in the field of deep learning. CNNs are made up of neurons with learnable parameters such as weights and biases. A CNN uses the convolution operation to search for features of a higher degree of abstraction. Dimensionality reduction is performed in the subsequent pooling layer. CNN architectures mainly contain three types of layers: convolution layers, pooling layers and fully connected layers. The CNN architecture proposed in this paper consists of 4 convolutional layers with 4 max-pooling layers and batch normalization. Inputs to the network are Mel-spectrogram representations of the audio samples. A Mel-spectrogram with 276 Mel bands is computed with a window size of 25 ms (audio sampled at 22.05 kHz) and a window overlap of 10 ms. As discussed in section 3.1, the excerpts in the dataset are of unequal length, so the audio length has been fixed to 3.62 seconds (the median length).
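The framing parameters above translate into sample counts as follows. This is a sketch under two assumptions: the 10 ms figure is the overlap between consecutive windows (so the step is 15 ms), and only windows that fit entirely inside the clip are counted. The authors' actual spectrogram tooling is not stated.

```python
SAMPLE_RATE = 22_050   # Hz, as stated in the text
WINDOW_MS = 25         # analysis window length from the text
OVERLAP_MS = 10        # overlap between consecutive windows from the text
CLIP_SECONDS = 3.62    # fixed clip length (median of the dataset)

hop_ms = WINDOW_MS - OVERLAP_MS                      # 15 ms step between windows
win_length = round(SAMPLE_RATE * WINDOW_MS / 1000)   # samples per window
hop_length = round(SAMPLE_RATE * hop_ms / 1000)      # samples per step
n_samples = round(SAMPLE_RATE * CLIP_SECONDS)        # samples per clip

# Frames that fit entirely inside the clip (assumption: no edge padding).
n_frames = 1 + (n_samples - win_length) // hop_length
overlap_ratio = OVERLAP_MS / WINDOW_MS

print(win_length, hop_length, n_frames, overlap_ratio)  # 551 331 240 0.4
```

Under these assumptions each clip yields a spectrogram of 276 Mel bands by about 240 frames, and the 10 ms overlap corresponds to 40% of the 25 ms window.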
In the CNN model shown in Fig. 1 the following architecture has been used:
Layer 1: 32 filters, each of size (5,5), with a stride of (1). This is followed by batch normalization and max pooling with a (6,6) filter and a stride of (1). Lastly, a rectified linear unit (ReLU) activation function f(x) = max(0, x) is applied. An L2 regularizer with a regularization parameter of 0.1 is used.
Layer 2: 64 filters, each of size (5,5), with a stride of (1), followed by batch normalization and max pooling with a (2,2) filter and a stride of (1). Lastly, the same activation function as in the first layer is applied, with a regularization parameter of 0.01.
Layer 3: 64 filters, each of size (5,5), followed by batch normalization, max pooling with a (2,2) filter and a stride of (1), and the ReLU activation function.
Layer 4: 96 filters, each of size (5,5), followed by batch normalization, max pooling with a (2,1) filter, and the ReLU activation function.
Fully connected layer: 128 units followed by ReLU activation and a dropout of 0.4, connected to 5 output units with a SoftMax output function.
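To give a sense of the size of the convolutional stack, the sketch below counts the trainable convolution parameters implied by the filter sizes and counts listed above. It assumes a single-channel spectrogram input and standard 2-D convolutions with one bias per filter; batch-normalization and fully connected parameters are left out because they depend on details the text does not give.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a standard 2-D convolution."""
    kh, kw = kernel
    return (kh * kw * in_channels + 1) * filters

# (kernel size, number of filters) for the four convolutional layers in the text.
cnn_layers = [((5, 5), 32), ((5, 5), 64), ((5, 5), 64), ((5, 5), 96)]

in_channels = 1  # assumption: single-channel Mel-spectrogram input
counts = []
for kernel, filters in cnn_layers:
    counts.append(conv2d_params(kernel, in_channels, filters))
    in_channels = filters  # the next layer sees one channel per filter

print(counts)       # [832, 51264, 102464, 153696]
print(sum(counts))  # 308256
```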

CNN Architecture.
The data comprise 28,675 training samples and 7,560 validation samples. The samples have been divided into 64 mini-batches of randomly selected data points. The cross-entropy loss function has been used to learn the parameters, along with the AdaDelta optimizer, during training. Dropout with a probability of 0.25 is applied to the input of the first three convolutional layers. The validation set has been used to fine-tune the hyperparameters of the network. Early stopping has been used to monitor whether the model is overfitting or underfitting, and checkpoints save the current best model after every epoch.
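The interplay of checkpointing and early stopping described above can be sketched as follows. The `patience` value and the loss sequence are hypothetical, since the text does not report the stopping criterion; the function only tracks which epoch would hold the saved best model and when training would halt.

```python
def early_stopping_trace(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    best_epoch is the epoch whose checkpoint would be kept;
    stop_epoch is the epoch at which training halts.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: this is where a checkpoint is saved.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation losses, for illustration only.
print(early_stopping_trace([1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6]))  # (2, 5)
```

In the example, the loss stops improving after epoch 2, so training halts at epoch 5 and the epoch-2 checkpoint is the one evaluated.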
Convolutional recurrent neural network (CRNN)
A Convolutional Recurrent Neural Network (CRNN) utilizes the learning power of both CNNs and RNNs. A CNN learns the spatial features in the data, but it does not capture temporal features. In an audio signal, however, there is a time dependency between different excerpts of the audio. Therefore, RNNs are used, which capture temporal dependencies through recurrent connections.
Nevertheless, RNNs suffer from the major problem of vanishing and exploding gradients. To overcome this limitation, a variant of the RNN known as the Gated Recurrent Unit (GRU) has been used in this work.
GRUs use a gating mechanism to address the problem of vanishing gradients. They use the concept of selective read (update gate) and selective write (reset gate) through gates. During forward propagation, the gates control the flow of information; similarly, during backward propagation, they control the flow of gradients. GRUs are computationally more efficient than their popular counterpart, Long Short-Term Memory (LSTM), as they have fewer parameters.
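As an illustration of the gating mechanism, the sketch below implements one step of a single-unit GRU in one common textbook formulation, together with a parameter count that shows why a GRU is smaller than an LSTM of the same width. The parameter names, the interpolation convention and the input dimension of 96 are assumptions for illustration; deep-learning libraries may use slightly different variants.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One time step of a single-unit GRU; p maps parameter names to floats."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate scales how much of the past is used.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate interpolates between the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand

def recurrent_params(weight_blocks, input_dim, units):
    """Parameters of a gated recurrent layer with the given number of weight blocks."""
    return weight_blocks * (units * (input_dim + units) + units)

gru = recurrent_params(3, input_dim=96, units=512)   # update, reset, candidate
lstm = recurrent_params(4, input_dim=96, units=512)  # input, forget, output, candidate
print(gru / lstm)  # 0.75
```

With all parameters set to zero, both gates equal 0.5 and the candidate is 0, so the new state is simply half of the previous one. The 3:4 ratio of weight blocks holds for any input size and width in this formulation.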
In the proposed CRNN model, 4 convolutional layers with 4 max-pooling layers and batch normalization have been used. As with the CNN, the input to the network is a Mel-spectrogram with 276 Mel bands, computed using a window size of 25 ms with 40% overlap. In addition, two GRU layers have been used, followed by a max-pooling layer after the last GRU layer.
In the CRNN model shown in Fig. 2 the following architecture has been used:
Layer 1: 64 filters, each of size (7,1), followed by batch normalization and max pooling with a (2,1) filter.
Layer 2: 64 filters, each of size (5,1), convolved with the input and followed by batch normalization and max pooling with a (3,1) filter.
Layer 3: as in layer 2, 64 filters, but each of size (4,1), followed by batch normalization and max pooling with a (4,1) filter.
Layer 4: 96 filters (feature detectors), each of size (3,1), followed by batch normalization and max pooling with a (4,1) filter.
Layer 5: a GRU layer with 512 GRU units and a recurrent-state dropout probability of 10%.
Layer 6: as in the previous layer, a GRU layer with 512 units and recurrent-state dropout, along with an additional 10% dropout.
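Because every pooling filter above has the form (k,1), pooling shrinks only one axis of the feature map. The sketch below traces the size of that axis through the four pooling layers. It assumes that the pooled axis starts with 276 entries (the Mel bands), that the convolutions preserve the size, and that each pool uses a stride equal to its size; none of these details is stated in the text.

```python
def trace_pooling(size, pool_sizes):
    """Sizes along the pooled axis after each non-overlapping max-pool."""
    sizes = []
    for pool in pool_sizes:
        size //= pool  # floor division: leftover rows are dropped
        sizes.append(size)
    return sizes

crnn_pools = [2, 3, 4, 4]  # first component of the (k,1) pool sizes in the text
print(trace_pooling(276, crnn_pools))  # [138, 46, 11, 2]
```

Under these assumptions the pooled axis shrinks by a cumulative factor of about 96, while the other axis keeps its length and can be read by the GRU layers as a sequence.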

CRNN Architecture.
During training, the cross-entropy objective function has been used along with the AdaDelta optimizer. As with the CNN model, the whole training data are divided into 64 mini-batches. Dropout with probabilities of 25% and 20% is used in convolutional layers 1, 2 and 3, 4, respectively. L2 regularization with a parameter of 0.0001 has been used in the last two convolutional layers and the GRU layers. The output of the last GRU layer is temporally max-pooled with a kernel of (128,1) and passed into the SoftMax output function.
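The output stage and training objective can be illustrated with plain Python. The sketch below shows temporal max pooling over a sequence of feature vectors, the SoftMax function and the cross-entropy loss for one sample. The function names and the toy inputs are illustrative, and the mapping from the pooled GRU features to the five class scores is not specified in the text, so it is not modeled here.

```python
import math

def temporal_max_pool(sequence):
    """Element-wise maximum over time for a list of equal-length feature vectors."""
    return [max(column) for column in zip(*sequence)]

def softmax(logits):
    """Numerically stable SoftMax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_index])

print(temporal_max_pool([[1, 5], [3, 2]]))  # [3, 5]
probs = softmax([0.0] * 5)                  # five equally scored accent classes
print(cross_entropy(probs, 0))              # log(5), about 1.609
```

A model that scores all five accents equally incurs a loss of log(5); training pushes the loss below this chance-level value.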
Results of our models
The model with the lowest validation loss has been evaluated on the test dataset, which contains 2,765 audio samples. Each audio sample has been pre-processed using the same methodology as described earlier. Table 3 presents a comparison (validation and test set accuracy) of the CNN and CRNN architectures.
It is evident from Table 2 that the CRNN has learnt a richer representation of the audio features than the CNN alone. This is consistent with its strong performance in sequence-learning problems. Table 4 and Table 5 present the confusion matrices of the CNN and CRNN models, respectively. With the CNN, the model achieved accuracies of 70.7%, 73.9%, 83.1%, 84.9% and 79.5% for the US, England, India, Australia and Canada accent classes, respectively; with the CRNN, it achieved 72.5%, 74.5%, 89.8%, 92.4% and 86.7% for the same classes. Table 2 reports the precision and recall for each class to support the analysis of the results.
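For reference, per-class precision, recall and F1 score follow directly from a confusion matrix. The sketch below uses a small two-class matrix with made-up counts purely for illustration; it does not reproduce the values of Tables 2, 4 or 5.

```python
def per_class_metrics(cm):
    """Precision, recall and F1 per class; cm[i][j] counts true class i predicted as j."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Toy confusion matrix (hypothetical counts, not the paper's results).
toy_cm = [[8, 2],
          [1, 9]]
for precision, recall, f1 in per_class_metrics(toy_cm):
    print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Recall is the diagonal entry divided by its row sum and precision is the diagonal entry divided by its column sum, so errors spread across many predicted classes lower recall for the true class and precision for each class that absorbs them.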
Precision, Recall and F1 score for each accent for each class of prediction
Validation and test set accuracy of each model
Confusion matrix for CNN
Confusion matrix for CRNN
We compare the proposed method with deep learning approaches in the literature to put the experimental results reported in this paper in context. The first approach uses a fusion of deep ANNs and RNNs, and the second uses a convolutional network for speech accent detection in video games. In the first approach, for each audio segment, the authors trained DNNs on long-term statistical features and RNNs on short-term acoustic features. The overall accuracy attained by this system is 52.48% [15]. In the second approach, the authors used AlexNet with small changes and trained it on the Speech Accent Archive dataset. They tested the model on audio files captured from Dragon Age: Origins. The accuracy reported in that work is 52.7% [19].
Analyzing the results
Tables 4 and 5 show that the errors in classifying the US and England accents are spread across almost all the class labels. This can be explained by the fact that these two varieties of English are the most influential and comprehensive of all [18]. It is also observable that the model often confuses the Canada accent with the US and England accents. This is attributed to the fact that the Canadian accent is greatly influenced by the British and US accents [20]. Nevertheless, accent differences are acoustic manifestations of differences in phonetics, pitch and intonation pattern.
In particular, British speakers exhibit a much steeper pitch rise-and-fall pattern and a lower average pitch in most vowels. Since phonemes and their properties are better observed in the spectrogram, the proposed model performs well in classifying these accents despite their similarities. Table 2 shows that, on moving from CNN to CRNN, almost all the accents except US and England reach an F1 score ≥ 0.80. It is also noticeable that the Indian and Australian accents show better results than the others. This can be attributed to the fact that these accents vary considerably in the regional creativity and style with which people speak [21].
Conclusion
The study has shown that audio, which largely consists of time-dependent input features, is modeled effectively by architectures that account for temporal features along with contextual dependencies. The analysis also gave insights into the variations among accents that confused the model in predicting the true labels, as explained in section 5. Notwithstanding these confusions, the study has offered a new approach to accent recognition.
In future work, one could look for architectures that capture more abstract features and can be employed in more powerful DNNs.
Acknowledgment
The authors sincerely acknowledge TEQIP-III NIT Silchar for the financial support.
