Speaker anonymization using generative adversarial networks

Abstract

The widespread use of smart devices has enabled the emergence of many applications that facilitate user interaction through speech. However, speech reveals private and sensitive information about the user’s identity, posing several security risks. For example, a speaker’s speech can be acquired and used in speech synthesis systems to generate fake speech recordings that can be used to attack that speaker’s verification system. One solution is to anonymize the speaker’s identity from speech before using it. Existing anonymization schemes rely on a pool of real speakers’ identities for anonymization, which may result in associating a speaker’s speech with an existing speaker. Hence, this paper investigates the use of Generative Adversarial Networks (GANs) to generate a pool of fake identities that are used for anonymization. Several GAN types were considered for this purpose, and the Conditional Tabular GAN (CTGAN) showed the best performance among all GAN types according to different metrics that measure the naturalness of the anonymized speech and its linguistic content.

Keywords

Speaker anonymization, voice privacy, generative adversarial networks, CTGAN, x-vector

1 Introduction

In the past decade, digitization has revolutionized the world by bringing new possibilities and innovations that ease daily activities. Among these innovations are the applications that understand natural language voice commands and complete tasks for the user. These products are called Smart Personal Assistants (SPA) [13] or Virtual Assistants. A typical virtual assistant system can interpret human speech, deduce the intent, perform an action and/or respond via synthesized speech. Actions might include asking questions, controlling home devices, or managing tasks such as writing an email, playing a song, calling someone, or setting a reminder.

SPA systems rely heavily on emerging technologies such as artificial intelligence, natural language processing, and speech recognition, which have witnessed unprecedented development lately, facilitating the offering of many reliable commercial SPAs such as Google Assistant and Amazon’s Alexa. These SPAs are being widely adopted by users. According to Google, 27% of the global online population is using voice search on mobile phones [14]. Moreover, TechCrunch reports that over two billion devices had voice assistants installed in 2020 [17].

The operation of SPAs in different systems such as smartphones, watches, televisions, and smart cars involves the use of different modules such as automatic speech recognition (ASR), natural language understanding (NLU), dialog manager (DM), and text-to-speech (TTS). These modules are usually deployed on the cloud, which implies that the user’s speech, which embeds their private information and identity, is transferred to the cloud via the network. Hence, there is a great risk of exposing the private content of the speech data and the speaker’s identity. Accordingly, adequate privacy preservation is essential to ensure that sensitive biometric data, including voice recordings or speech data, are properly protected from misuse. This is a major concern, especially with the recent EU General Data Protection Regulation [26] that demands giving individuals control over their personal data.

Generally, speech privacy includes the privacy of the linguistic content and the speaker’s identity. Protecting the speaker’s identity is crucial as it can be used for verification and authentication. Hence, speaker anonymization or de-identification techniques are usually employed. These techniques refer to the process of suppressing the speaker’s identity (timbre, pitch, speaking rate, and speaking style) by modifying the original speech signal to make it sound like an anonymous speaker while maintaining linguistic content, speech quality, and naturalness.

The baseline for such technology was first provided by the INTERSPEECH 2020 VoicePrivacy Challenge [6], which aimed to gather effort in the research community to facilitate the development of voice privacy preservation technology that suppresses speaker-discriminative information, promote effective anonymization techniques, provide benchmarking techniques, and investigate proper evaluation metrics for anonymization. The challenge provided a baseline system for concealing speaker identity by separating the linguistic content from the speaker’s identifiable information. The baseline performs anonymization by replacing the speaker identity, which is modeled as an x-vector (a neural network-based latent representation), with a pseudo-identity that is derived from a pool of real speakers’ identities. The problem with this approach is the use of real speaker identities in the anonymization process: it can be argued that the anonymized speech protects the identity of one user at the expense of exposing another identity and/or associating it with a different speaker.

The work presented in this paper addresses the speaker anonymization problem formulated by the VoicePrivacy Challenge by proposing a novel speaker anonymization model that uses Generative Adversarial Networks (GANs) to generate a pool of synthesized human identities that can be used in the anonymization pipeline provided by the challenge. We believe this is the first attempt to synthesize human speaker identities using GANs. Several GAN types were investigated for this purpose, and the conditional tabular GAN (CTGAN) showed the best performance among all GAN types based on different metrics. Experimental results demonstrated the ability of the proposed anonymization approach to produce diverse synthesized speaker identities (an average of 0.75 for cross-cosine similarity) that are close to real identities (an average of 0.49 for the Kolmogorov-Smirnov measure). Also, the proposed approach resulted in the lowest word error rate of 6.39% when assessed using an external ASR system.

The rest of the paper is organized as follows. In Section 2, anonymization technology and some related work including the baseline system provided by the VoicePrivacy Challenge are presented. Section 3 discusses the proposed anonymization model. Section 4 outlines the process of GAN selection, training, and the performance metrics used to evaluate different GAN designs. In Section 5, the findings of the conducted experiments are discussed. Finally, Section 6 summarizes the main findings of this paper and discusses some of the possible future directions.

2 Related work

As mentioned earlier, speaker anonymization is the process of masking the identity of an original speech signal while preserving the linguistic content and speech quality. Speaker anonymization differs from speech anonymization, which targets hiding the linguistic content of speech. Speaker anonymization can be performed physically by introducing some sort of disturbance or noise to the original speech signal [9, 21]. Alternatively, speaker anonymization can be performed logically by manipulating the speaker identity [11]. The anonymization technique presented in this paper falls in the logical category.

Alegre et al. [1] investigated the so-called speaker evasion and hiding using voice conversion techniques to avoid surveillance systems. The system evaluated only how the approach could degrade ASV performance and did not consider degradation to speech quality.

Jin et al. [11] proposed a voice transformation (VT) system to change the speaker identity into a special speaker. In [15] and [18], the authors improved the convenience of the VT-based method to enable the user to select an approximate transformation from a pool of pre-trained VT models for speaker anonymization. A convolutional neural network was also employed as a VT system in [3] where different transformations were averaged to obtain the anonymized speech. Justin et al. [12] performed speaker anonymization by first recognizing the diphones in the input speech using an ASR system and then synthesizing speech from the recognized diphone sequence. The synthesized speech differs from the original one in speaker identity because the synthesizer is speaker-dependent and was trained using the data of different speakers.

To foster efforts toward developing speech privacy solutions in the form of a competitive challenge, the VoicePrivacy Challenge was introduced as part of the Interspeech 2020 special sessions and challenges [24]. In this challenge, participants are required to anonymize speech datasets while preserving the linguistic content and naturalness of the speech.

The VoicePrivacy challenge provided a baseline system for anonymization. The main assumption in the baseline system is that a speech waveform can be decomposed into two sets of features. The first set contains the sequence of spoken words while the second set includes the acoustic features that represent the speaker identity, which are the target to be modified in an anonymization process.

The baseline system is shown in Fig. 1 [6]. It includes three blocks to extract the two sets of features mentioned previously. The first block (F0 Extractor) extracts the fundamental frequency (F0) or pitch of the speech. F0 is a widely adopted feature in ASR systems as it determines the perceived pitch of speech and plays an important role in perceived para-linguistic and prosodic information in speech [5]. The second block is an automatic speech recognition (ASR) system that is based on deep neural networks (DNN) and is responsible for extracting the phonetic posteriorgram (PPG) that encodes the linguistic content of speech.

Fig. 1

Speaker anonymization pipeline [6].

The third block (x-vector Extractor) is used to extract the x-vector of the speaker, which is a speaker embedding that has recently been widely used to model the speaker identity irrespective of the speech content. x-vectors have shown impressive performance in speaker verification and recognition applications compared to other types of features such as the d-vectors and i-vectors [23]. The x-vector extractor is a deep neural network (DNN) that performs speaker recognition by aggregating frame-level information into utterance-level speaker labels. The general structure of the x-vector network is shown in Fig. 2. In this network, time-delay layers are first used to extract frame-level embeddings of an utterance. Afterward, the mean and standard deviation of the frame-level embeddings are concatenated as a segment-level feature using a statistics pooling layer. The generated segment-level features are then used to train a feedforward network that classifies the speaker. All the layers are trained jointly. Once the network is trained, the x-vector is defined as the speaker-level embedding produced by the second to last hidden layer, i.e., embedding a that is shown in Fig. 2, which is a 512-dimensional vector. The trained model can then generate the x-vector for the input utterances.

Fig. 2

x-vector network [23].

In addition to the feature extraction blocks, the baseline has an anonymization block that consists of a pool of real x-vectors, i.e., x-vectors for real speaker identities, and an anonymization scheme that uses a similarity metric to select a subset of the pool and then computes its average to create a new synthetic x-vector representing a new speaker identity.
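The selection-and-averaging scheme can be sketched as follows. This is a minimal illustration rather than the challenge's implementation: the function name `anonymize_xvector`, the use of cosine similarity, the preference for the least similar candidates, and the values of `n_farthest` and `n_avg` are all assumptions, and the pool is random placeholder data.

```python
import numpy as np

def anonymize_xvector(source, pool, n_farthest=200, n_avg=100, rng=None):
    """Build a pseudo x-vector by averaging a subset of pool x-vectors.

    Assumptions (not taken from the text): cosine similarity is the
    similarity metric, the least similar candidates are preferred, and
    a random subset of them is averaged.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Cosine similarity between the source and every pool x-vector.
    pool_norm = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    src_norm = source / np.linalg.norm(source)
    similarity = pool_norm @ src_norm
    # Keep the n_farthest least similar candidates.
    candidates = np.argsort(similarity)[:n_farthest]
    # Average a random subset of those candidates.
    chosen = rng.choice(candidates, size=n_avg, replace=False)
    return pool[chosen].mean(axis=0)

rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 512))   # placeholder pool of x-vectors
source = rng.normal(size=512)         # placeholder source x-vector
pseudo = anonymize_xvector(source, pool, rng=rng)
print(pseudo.shape)  # (512,)
```

Averaging several pool entries yields an x-vector that does not coincide with any single pool speaker, but it is still derived from real identities.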

With these features (F0, PPGs, and anonymized x-vector) in hand, the baseline model synthesizes the speech waveform using an acoustic model and a neural waveform model. The baseline system suffers from poor linguistic preservation highlighted in the high word error rate (WER) for largely distant anonymized x-vectors, in addition to the fact that it uses real identities in anonymization, which may result in privacy issues, as discussed earlier. The work in this paper attempts to mitigate such issues by generating fake identities that can be used in anonymization.

In [25], Turner et al. modified the anonymization process of the baseline by using a distribution-preserving voice anonymization technique to generate x-vectors that approximate the original distribution of x-vectors and their intra-similarities using Gaussian Mixture Models (GMM). To anonymize an utterance, the GMM is randomly sampled to obtain a fake x-vector. The distance between the original x-vector and the fake vector is calculated to exclude the fake vector if the distance is less than a predefined threshold. This is an optional step that can be used to mitigate the risk of the fake x-vector being similar to the original one. The proposed model resulted in higher word error rates (WER) compared to the baseline. Additionally, the proposed model did not produce consistent equal error rate (EER) values and failed to outperform the baseline model for some datasets.

Champion et al. [5] modified the baseline by including the pitch (F0) in the anonymization process. This was based on their analysis of the impact of this modification across gender which showed that it can always improve the anonymization process. They linearly transformed the log-scaled version of the F0 using

$F = \mu_{y} + \frac{\sigma_{y}}{\sigma_{x}} (x_{t} - \mu_{x})$ (1)

where F represents the transformed log-scaled F0 at frame t, x_t represents the log-scaled F0 of the source speaker at frame t, μ_x and σ_x represent the mean and standard deviation of the log-scaled F0 for the source speaker, and μ_y and σ_y represent the mean and standard deviation of the log-scaled F0 for the pseudo-speaker. This approach showed that scaling the pitch and picking an x-vector from the opposite gender achieves better privacy protection while reducing the WER.
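As an illustration of Eq. (1), the sketch below applies the linear transformation to a toy F0 contour. The function name `transform_log_f0`, the convention that zero marks unvoiced frames, the use of the natural logarithm, and the example values are assumptions made for this sketch.

```python
import numpy as np

def transform_log_f0(f0_source, mu_y, sigma_y):
    """Apply Eq. (1) to the voiced frames of an F0 contour given in Hz.

    mu_y and sigma_y are the log-F0 mean and standard deviation of the
    pseudo-speaker. Treating zeros as unvoiced frames and using the
    natural logarithm are assumptions of this sketch.
    """
    f0_source = np.asarray(f0_source, dtype=float)
    voiced = f0_source > 0
    log_f0 = np.log(f0_source[voiced])
    mu_x, sigma_x = log_f0.mean(), log_f0.std()
    # Eq. (1): match the source log-F0 statistics to the pseudo-speaker.
    converted = mu_y + (sigma_y / sigma_x) * (log_f0 - mu_x)
    f0_out = np.zeros_like(f0_source)
    f0_out[voiced] = np.exp(converted)
    return f0_out

f0 = np.array([0.0, 110.0, 120.0, 0.0, 130.0, 125.0])  # toy contour in Hz
f0_new = transform_log_f0(f0, mu_y=np.log(210.0), sigma_y=0.1)
```

After the transformation, the voiced frames have the log-F0 mean and standard deviation of the pseudo-speaker, while unvoiced frames are left untouched.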

3 Proposed anonymization model

As discussed in Section 2, the baseline system provided by the VoicePrivacy Challenge uses a pool of real x-vectors to anonymize a given speech signal. The proposed anonymization system is similar to the baseline system except that the pool of real x-vectors is replaced with a pool of fake x-vectors that are generated using generative adversarial networks (GANs). In this section, a brief overview of GANs is presented. Then the architecture of the proposed anonymization model and the process of GAN selection are discussed.

3.1 Generative adversarial networks

Generative adversarial networks were first proposed in [7] as a framework to estimate and approximate generative models via an adversarial process. The basic GAN architecture comprises two networks: the generator network (G) and the discriminator network (D). As the names imply, the generator network generates data following a desired distribution from some latent representation (usually noise). On the other hand, the discriminator network attempts to distinguish between the data produced by the generator, i.e., fake data, and the data sampled from the real distribution. The training objective of the generator is to fool the discriminator network into misidentifying fake data as real, which indicates that the generator network has learned the distribution of the real data. Backpropagation is applied in both networks so that the generator produces better samples while the discriminator becomes more skilled at flagging synthetic samples.
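For reference, the adversarial process in [7] is commonly written as the minimax objective below, where the symbols p_data (the real data distribution) and p_z (the latent noise distribution) follow common usage rather than notation defined in this paper.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x) \right] +
  \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator maximizes V by assigning high scores to real samples and low scores to generated ones, while the generator minimizes it by producing samples that the discriminator scores as real.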

The original GAN [7] is known as the vanilla GAN, and it was benchmarked in imaging applications using relatively simple image data sets. Several GAN architectures have been proposed in the literature such as the conditional GAN (CGAN) [16], the Wasserstein GAN (WGAN) [2] and the conditional tabular GANs (CTGAN) [28]. Different GAN types attempted to address issues related to the training of GANs and their accuracy and stability in approximating real data distributions for different data types such as images and tabular data. Despite the theoretical existence of unique solutions in the training of GANs, the training of GANs is challenging as the generator network may converge to a state where it is generating the same data for different inputs; a phenomenon known as mode collapse in GANs [22]. In general, different GAN types were used in image applications such as de-noising, image generation, and style transfer. This paper is concerned with investigating the use of GANs in generating speaker identities that can be used in the anonymization baseline in [6] instead of the real x-vectors.

3.2 Anonymization model architecture

The architecture of the proposed anonymization model is essentially similar to the baseline system depicted in Fig. 1 in terms of having the feature extraction blocks for F0, PPG, and x-vector of the input utterance, as well as the speech synthesis block that generates the anonymized speech waveform out of F0, PPG, and the anonymized x-vector. The main difference is in the nature of the x-vector pool used for anonymization. The proposed system uses fake x-vectors generated using GANs, which is the major contribution of the paper. In general, we seek to design and train a suitable GAN that can generate pseudo identities in the form of x-vectors. When these fake identities are used to anonymize some input speech waveform, they should produce an anonymized speech that is not only natural but also has a minimal change in the linguistic content of the original speech waveform.

The general architecture of the GAN to be designed and trained is shown in Fig. 3. In this network, the generator network (G) is trained to generate fake 512-dimensional x-vectors based on 100-dimensional input noise vectors Z derived from a uniform distribution. The training of the two networks is performed in an alternating fashion such that the discriminator (D) network is trained on real and fake samples to maximize its classification accuracy in terms of identifying real and fake samples. The gradients of the loss function of the discriminator are calculated and backpropagated through the D network only. Afterward, the G network is trained to generate a new set of x-vectors that are fed into D along with real samples in an attempt to minimize the classification accuracy of D. The gradients of the loss function of G are then calculated through the D and G networks, and only the weights of G are updated.

This process is repeated for a number of epochs during which the generator performance is expected to improve in terms of its ability to generate better fake samples that the discriminator cannot identify as fake. Meanwhile, the discriminator performance is expected to improve to create an environment of adversarial training. More details on GAN training and loss functions can be found in [23].
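The alternating update scheme can be sketched with a deliberately tiny model. This is not the architecture used in this paper: the linear generator, the logistic discriminator, the non-saturating generator loss, the learning rate, and the random placeholder "real" data are all assumptions chosen so that the gradients can be written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_DIM, XVEC_DIM, BATCH, LR = 100, 512, 64, 0.001

# Toy linear generator and logistic discriminator (stand-ins for G and D).
Wg = rng.normal(scale=0.01, size=(NOISE_DIM, XVEC_DIM))
bg = np.zeros(XVEC_DIM)
wd = rng.normal(scale=0.01, size=XVEC_DIM)
bd = 0.0

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def generate(z):
    return z @ Wg + bg

def discriminate(x):
    return sigmoid(x @ wd + bd)

def d_step(real):
    """Update D only: binary cross-entropy on real (1) and fake (0) samples."""
    global wd, bd
    fake = generate(rng.uniform(-1, 1, size=(len(real), NOISE_DIM)))
    # Gradients of the loss with respect to the logits of D.
    g_real = (discriminate(real) - 1.0) / len(real)
    g_fake = discriminate(fake) / len(fake)
    wd -= LR * (real.T @ g_real + fake.T @ g_fake)
    bd -= LR * (g_real.sum() + g_fake.sum())

def g_step(batch):
    """Update G only: gradients flow through D, but D stays fixed."""
    global Wg, bg
    z = rng.uniform(-1, 1, size=(batch, NOISE_DIM))
    # Non-saturating generator loss: -mean(log D(G(z))).
    g_logit = (discriminate(generate(z)) - 1.0) / batch
    g_fake = np.outer(g_logit, wd)  # backpropagate through D to the samples
    Wg -= LR * (z.T @ g_fake)
    bg -= LR * g_fake.sum(axis=0)

real_pool = rng.normal(loc=0.5, size=(1024, XVEC_DIM))  # placeholder data
for step in range(100):
    idx = rng.integers(0, len(real_pool), size=BATCH)
    d_step(real_pool[idx])  # discriminator update
    g_step(BATCH)           # generator update
```

The key point is the separation of updates: `d_step` changes only the discriminator parameters, while `g_step` uses the discriminator to compute gradients but changes only the generator parameters.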

Fig. 3

GAN-Based architecture for generating anonymized x-vectors

3.3 GAN selection

As mentioned in Section 3.1, there are many types of GANs in the literature. In this paper, the performance of the vanilla GAN, CGAN, WGAN, and CTGAN in generating fake x-vectors is investigated. The first three types of these GANs are popular in image applications while the CTGAN is suitable for tabular data, which is the case of x-vectors. Table 1 presents the architecture and hyperparameters used in implementing and evaluating each of the four GAN designs in this paper.

Table 1
The four GAN architectures and their hyperparameters

Model	Generator	Discriminator	Batch size	Epochs
GAN	Dense(128, activation='relu'); BatchNormalization; Dense(512, activation='relu'); BatchNormalization; Dense(512, activation='relu'); BatchNormalization	Dense(1, activation='relu'); BatchNormalization; Dense(1, activation='sigmoid')	128	30000
WGAN	Dense(512, activation='relu'); Dense(256, activation='relu'); Dense(512, activation='relu'); BatchNormalization	Dense(512, activation='relu'); Dense(128, activation='relu'); Dense(1)	128	30000
CGAN	Dense(256); LeakyReLU(alpha=0.2); BatchNormalization; Dense(512); LeakyReLU(alpha=0.2); BatchNormalization	Dense(512); LeakyReLU(alpha=0.2); Dense(1)	500	20000
CTGAN	Dense(256); LeakyReLU(alpha=0.2); BatchNormalization; Dense(256); LeakyReLU(alpha=0.2); BatchNormalization; Dense(512, activation='tanh'); Dense(512, activation='gumbel'); Dense(512, activation='gumbel')	Dense(256); LeakyReLU(alpha=0.2); Dropout; Dense(256); LeakyReLU(alpha=0.2); Dropout; Dense(1)	100	1000

Different architectures and parameters for the four GAN types were investigated and evaluated in terms of their ability to generate a suitable pool of fake x-vectors for speech anonymization. Grid search was used in the evaluation to find the best architecture and hyperparameters. For each of the four GAN types, the architecture and hyperparameters that generalized best, i.e., generated x-vectors with a distribution most similar to that of real x-vectors, were selected. Several evaluation metrics were used for this purpose as discussed in Section 5. Table 1 shows the architecture and hyperparameters selected for implementing and evaluating each of the four GAN designs in anonymization.

It is worth mentioning here that the architectures of the vanilla GAN, CGAN, and WGAN are adaptations of these networks to work with the 512-dimensional x-vectors, as they were originally tailored to work with images. This included changes in the network size, activation functions, and the type of layers. For example, the use of convolutional layers degraded the performance significantly as the discriminator network dominated the generator. Instead, dense layers are used in the different architectures, which showed better performance as they are more suitable for approximating and modeling structured data such as x-vectors.
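To make the shape flow of such a dense generator concrete, the sketch below runs a forward pass through the layer sequence listed for the vanilla GAN generator in Table 1. The weights are random and untrained, batch normalization is applied in training mode without learned scale and shift, and the noise range of [-1, 1] is an assumption; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, units):
    """Fully connected layer with randomly initialised (untrained) weights."""
    w = rng.normal(scale=np.sqrt(2.0 / x.shape[1]), size=(x.shape[1], units))
    return x @ w  # bias omitted: it would start at zero

def batch_norm(x, eps=1e-3):
    """Training-mode batch normalisation without learned scale and shift."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def vanilla_generator(z):
    """Dense(128) -> BN -> Dense(512) -> BN -> Dense(512) -> BN with ReLU."""
    h = batch_norm(relu(dense(z, 128)))
    h = batch_norm(relu(dense(h, 512)))
    return batch_norm(relu(dense(h, 512)))

z = rng.uniform(-1.0, 1.0, size=(128, 100))  # batch of 128 noise vectors
fake_xvectors = vanilla_generator(z)
print(fake_xvectors.shape)  # (128, 512)
```

Each 100-dimensional noise vector is mapped to one 512-dimensional output, which is the size of an x-vector.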

4 Training and evaluation methods

In this section, the evaluation strategy and the performance metrics used to select the best GAN are discussed. The selection process was performed in two phases. In the first phase, the ability of the considered GANs to generate x-vectors with a distribution similar to that of real x-vectors is evaluated (subsection 4.1). In the second phase, the GANs that produced the best distribution of x-vectors were used to generate pools of fake x-vectors used in the anonymization system. The GAN that provides the best anonymized speech quality is selected as the generator for the pool of fake x-vectors in the proposed system (subsection 4.2). Also, a brief description of the datasets used in the evaluation is presented in the last subsection.

4.1 Phase I: Generative model evaluation

As discussed earlier, most GANs were employed in image applications where the evaluation of the GAN output is relatively straightforward through assessing the quality of the generated images. In this case, the GAN is used to generate the 512-dimensional x-vectors. Evaluating the quality of the generated x-vectors requires using them to synthesize the anonymized speech waveform. However, this approach involves several cascaded models, which may introduce cascaded errors that compromise the whole evaluation process. Moreover, one of the most common problems in training GANs is mode collapse, where the generator gets stuck in producing one or a few distinct samples rather than capturing the training data distribution and producing diverse samples within that distribution.

As a first phase for GAN selection and to eliminate possible errors in the synthesis system, the ability of the considered GANs to generate x-vectors with the proper distribution is evaluated. For this purpose, two performance metrics are used to quantify the generation quality of the GANs. The first metric is the cross-cosine similarity which is defined by

$\cos(\theta) = \frac{\sum_{i} a_{i} b_{i}}{\sqrt{\sum_{i} a_{i}^{2}} \sqrt{\sum_{i} b_{i}^{2}}}$ (2)

where a and b are two vectors. This metric is used to examine the nature of the distribution of the generated x-vectors and the presence of mode collapse. Effectively, the generated x-vectors are shuffled, and the cross-cosine similarity is calculated between pairs to obtain a similarity distribution among the data. A distribution centered above 0.9 indicates that most samples are similar to each other, i.e., they represent the same speaker, which implies that the GAN is stuck in mode collapse. The threshold value of 0.9 is derived from evaluating the cosine similarity for the same speaker's x-vector in different acoustic conditions for a random subset of the LibriSpeech dataset. In general, this implies that the GAN is producing only a few distinct speaker identities (x-vectors), which is not suitable for anonymization purposes.
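The mode-collapse check based on Eq. (2) can be sketched as follows. The pairing scheme (shuffling and splitting into two disjoint halves), the use of the median as the centre of the distribution, and the function names are assumptions; the data are random placeholders.

```python
import numpy as np

def cross_cosine_similarities(xvectors, rng=None):
    """Cosine similarity (Eq. 2) between randomly paired x-vectors."""
    rng = np.random.default_rng() if rng is None else rng
    shuffled = xvectors[rng.permutation(len(xvectors))]
    half = len(shuffled) // 2
    a, b = shuffled[:half], shuffled[half:2 * half]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

def looks_collapsed(xvectors, threshold=0.9, rng=None):
    """Flag mode collapse when the similarity distribution centres above 0.9."""
    sims = cross_cosine_similarities(xvectors, rng)
    return float(np.median(sims)) > threshold

rng = np.random.default_rng(0)
diverse = rng.normal(size=(1000, 512))  # placeholder: unrelated vectors
collapsed = np.ones((1000, 512)) + 0.01 * rng.normal(size=(1000, 512))
print(looks_collapsed(diverse, rng=rng), looks_collapsed(collapsed, rng=rng))
# False True
```

The second pool consists of near-copies of one vector, so its pairwise similarities cluster close to 1 and the check flags it.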

The other metric that is used to evaluate the generative capability of the GANs is the Kolmogorov-Smirnov (KS) score, which measures the similarity between two data distributions. In this case, it measures the similarity between the distributions of real x-vectors and the generated x-vectors. A higher KS score implies that the GAN is unable to model the distribution of the real data, i.e., its synthesized identities do not follow the distribution of the real identities. Such a GAN is excluded from the experiments conducted in the second phase. More details about this evaluation are provided in Section 5.
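A minimal sketch of a KS-based comparison is given below. It computes the standard two-sample KS statistic (the largest gap between two empirical distribution functions) for each x-vector dimension and averages the result; applying the test per dimension and averaging is an assumption of this sketch, and the data are random placeholders.

```python
import numpy as np

def ks_statistic(real, fake):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    real, fake = np.sort(real), np.sort(fake)
    grid = np.concatenate([real, fake])
    cdf_real = np.searchsorted(real, grid, side="right") / len(real)
    cdf_fake = np.searchsorted(fake, grid, side="right") / len(fake)
    return float(np.max(np.abs(cdf_real - cdf_fake)))

def mean_ks(real_xvectors, fake_xvectors):
    """Average the per-dimension KS statistic over all dimensions."""
    dims = real_xvectors.shape[1]
    scores = [ks_statistic(real_xvectors[:, d], fake_xvectors[:, d])
              for d in range(dims)]
    return float(np.mean(scores))

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 16))            # placeholder "real" vectors
close = rng.normal(size=(500, 16))           # same distribution
far = rng.normal(loc=3.0, size=(500, 16))    # shifted distribution
ks_close, ks_far = mean_ks(real, close), mean_ks(real, far)
```

Here `ks_close` is small because both samples come from the same distribution, while `ks_far` approaches 1 because the distributions barely overlap.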

4.2 Phase II: Speaker anonymization evaluation

Following the first phase of selection, the GANs that showed the best performance in terms of generating reasonable distributions for x-vectors are evaluated in terms of the quality of the anonymized speech. Effectively, these GANs are used to generate anonymized speech utterances, which are then evaluated using an automatic speaker verification system (ASV_eval) and an automatic speech recognition system (ASR_eval). Both systems are trained on the LibriSpeech-train-clean-360 dataset using the Kaldi speech recognition toolkit [19] as described in the challenge’s evaluation plan [24].

The ASV_eval system is used to test whether the anonymized identity matches the original identity of the speaker. Effectively, the ASV_eval system is used to verify multiple anonymized and real speakers. In addition, the rejection rates are calculated. The equal error rate (EER) metric measures the verification ability. In anonymization, an EER of 50% is desired as it indicates a perfect anonymization scenario where the speaker’s identity is truly hidden. In addition to the EER metric, the log-likelihood-ratio cost function (C_llr) that is proposed in [4] is used as an application-independent evaluation objective. The C_llr metric can be decomposed into (1) a discrimination loss ($C_{llr}^{\min}$), i.e., how well the two classes are separated for any threshold, and (2) a calibration loss ($C_{llr} - C_{llr}^{\min}$), which represents log-likelihood ratios (LLRs) in the case of C_llr being less than 1.
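Both metrics can be computed directly from verification scores, as in the sketch below. It is not the challenge's evaluation code: the threshold sweep used for the EER, the function names, and the toy scores are assumptions, and the C_llr function follows the commonly used definition for scores expressed as natural-log likelihood ratios.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: operating point where false acceptance equals false rejection."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2.0)

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost for scores given as natural-log LLRs."""
    tar = np.mean(np.logaddexp(0.0, -target_llrs)) / np.log(2.0)
    non = np.mean(np.logaddexp(0.0, nontarget_llrs)) / np.log(2.0)
    return float(0.5 * (tar + non))

rng = np.random.default_rng(0)
# Well separated scores: the speaker is easy to verify (poor anonymization).
eer_separated = equal_error_rate(np.array([2.0, 3.0, 4.0]),
                                 np.array([-1.0, 0.0, 1.0]))
# Indistinguishable scores: verification is at chance level.
eer_chance = equal_error_rate(rng.normal(size=2000), rng.normal(size=2000))
cllr_uninformative = cllr(np.zeros(5), np.zeros(5))
```

With well separated scores the EER is 0, with indistinguishable scores it is close to 0.5, and a system that always outputs an LLR of zero has a C_llr of 1.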

The ASV_eval system is based on state-of-the-art x-vector speaker embeddings with a probabilistic linear discriminant analysis (PLDA) backend [10]. Experiments conducted in this work followed the VoicePrivacy Challenge plan such that [24]:

1. Compute PLDA (LLR) scores for (a) clean enrollment data and (b) anonymized trial data.

2. Compute PLDA (LLR) scores for (a) anonymized enrollment data and (b) anonymized trial data.

3. For steps 1 and 2, calculate the equal error rate (EER) and the log-likelihood-ratio cost function (C_llr).

On the other hand, the ASR_eval system is used to measure the quality of the linguistic content in the anonymized utterance using the word error rate (WER) metric, which is the ratio of the wrong words in the anonymized utterance to the total number of words in the original utterance. Lower WER values indicate better matching in the linguistic content between the original and anonymized utterances.
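In practice, the number of wrong words is usually counted as the minimum number of substitutions, deletions, and insertions needed to turn the reference transcript into the recognized one. The sketch below computes the WER this way; the function name and the example sentences are illustrative, and a non-empty reference is assumed.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("set a reminder for nine", "set the reminder for nine")
print(wer)  # 0.2
```

One substituted word out of five reference words gives a WER of 20%.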

4.3 Datasets

A number of publicly available datasets are used for training, validation, and testing of the proposed speaker anonymization system. A detailed description of the data used for training, validation, and testing in our experiments is given in the following subsections. They generally comprise subsets from the corpora as in Table 2.

Table 2
Dataset statistics

Dataset	Size (hr)	Speakers	Language	Sample Rate
LibriSpeech	1000	2484	English	16 kHz
vctk	44	109	English	48 kHz
LibriTTS	585	2456	English	24 kHz
VoxCeleb-1,2	2770	7360	English	16 kHz

4.3.1 Training data

The training in the proposed system refers to GAN training only since all other models in the system are pre-trained based on the settings given in Table 3. The VoxCeleb-1,2 and LibriTTS train-other-500 corpora were used to train the anonymization model, i.e., the GAN model. A more detailed description of the data is provided in Table 4. It can be seen that these two datasets together contain more than 3,000 hours and close to 1.5 million utterances. The utterances are slightly biased toward male voices, especially in the VoxCeleb dataset.

Table 3
Training settings for the pre-trained models in the proposed anonymization architecture

#	Model	Architecture	Input	Output	Training Data
1	ASR AM	TDNN-F; 7 TDNN-F hidden layers; softmax: 6032 triphone ids; LF-MMI and CE criteria	MFCC⁴⁰ + i-vectors¹⁰⁰	BN²⁵⁶ features	LibriSpeech: train-clean-100, train-other-500
2	x-vector Extractor	TDNN; 7 hidden layers + 1 stats pooling layer; 7232 speaker ids; CE criterion	MFCC³⁰	Speaker x-vectors⁵¹²	VoxCeleb: 1, 2
3	Speech Synthesis AM	Autoregressive (AR) network; FF * 2 + BLSTM + AR + LSTM * 2 + highway-postnet; MSE criterion	F0¹ + BN²⁵⁶ + x-vectors⁵¹²	Mel-filterbanks⁸⁰	LibriTTS: train-clean-100
4	NSF model	sinc1-h-NSF in [27]; STFT criterion	F0¹ + Mel-fbanks⁸⁰ + x-vectors⁵¹²	Speech waveform	LibriTTS: train-clean-100
5	x-vector GAN	Conditional Tabular Generative Adversarial Network (CTGAN)	Noise vector¹⁰⁰	Pool of speaker x-vectors	LibriTTS train-other-500, VoxCeleb-1,2

4.3.2 Validation data

A validation set is usually used to evaluate the design of different models and to fine-tune the hyperparameters of the model. As stated by the challenge organizers in their evaluation plan [24], anonymized utterances are referred to as trial utterances. In contrast, enrollment utterances are several utterances for each speaker, which may or may not have been anonymized.

Table 5 summarizes the validation datasets. The vctk-dev dataset has roughly five times as many utterances as the LibriSpeech-dev dataset. For the LibriSpeech-dev dataset, the speakers in the enrollment set are a subset of those in the trial set. For the vctk-dev dataset, two subsets of trial utterances were created, denoted the common part and the different part. Both include trials from the same set of speakers but from disjoint subsets of utterances. The common part of the trials is composed of utterances #1-24 in the vctk corpus, which are identical for all speakers: the elicitation paragraph⁶ (utterances #1-5) and the rainbow passage⁷ (utterances #6-24). The enrollment subset and the different part of the trials are composed of distinct utterances for all speakers (utterances with indexes ≥25).

Table 4
Statistics of the training datasets for the proposed anonymization architecture

Subset	Size (hr)	Speakers (Female)	Speakers (Male)	Speakers (All)	Number of Utterances
VoxCeleb-1,2	2,794	2,912	4,451	7,363	1,281,762
LibriTTS train-other-500	310	560	600	1,160	205,044

Table 5

Statistics of the validation datasets [24]

Dataset	Subset	Female	Male	Total
LibriSpeech-dev	Speakers in enrollment	15	14	29
	Speakers in trials	20	20	40
	Enrollment utterances	167	176	343
	Trial utterances	1,018	960	1,978
vctk-dev	Speakers (same in enrollment and trials)	15	15	30
	Enrollment utterances	300	300	600
	Trial utterances (common part)	344	351	695
	Trial utterances (different part)	5,422	5,255	10,677

4.3.3 Test data

Similar to the validation data, test subsets from two different corpora (LibriSpeech and vctk) are used for testing. These datasets are split into enrollment and trial subsets, as summarized in Table 6. The number of trial utterances is roughly balanced between genders in both datasets, whereas the enrollment speakers and enrollment utterances in the LibriSpeech test dataset are skewed slightly toward female speakers.

Table 6
Statistics of the test datasets [24]

Dataset	Subset	Female	Male	Total
LibriSpeech-test	Speakers in enrollment	16	13	29
	Speakers in trials	20	20	40
	Enrollment utterances	254	184	438
	Trial utterances	734	762	1496
vctk-test	Speakers (same in enrollment and trials)	15	15	30
	Enrollment utterances	300	300	600
	Trial utterances (common part)	346	354	700
	Trial utterances (different part)	5,328	5,420	10,748

5 Experimental results

In this section, the experimental setup and the results for evaluating both the generative models and the anonymization quality are presented and discussed.

5.1 Experimental setup

Experiments were performed on a Lenovo IdeaPad L340 Gaming laptop with a 9th Gen Intel Core i7 processor (6 cores, 4.5 GHz, 12 MB cache), 8 GB RAM, a 256 GB SSD plus a 1 TB HDD, and an Nvidia GTX 1650 GPU (4 GB), running the Ubuntu 20.04 operating system.

All experiments were based on the challenge's publicly available baseline [6]. The software packages involved were Python 3.6 for training the GAN models and the Kaldi toolkit [19] for the pre-trained models. The GAN, CGAN, and WGAN were implemented using the Keras Python framework, whereas the CTGAN was implemented using the PyTorch framework. The training, validation, and test sets are those discussed in Section 4.3. The pool of external speakers on which x-vectors are computed to train the GAN model is composed of LibriTTS train-other-500 and VoxCeleb-1,2. Additional information on the number of speakers and the gender distributions can be found in Table 4.

5.2 Generative model evaluation results

This is the first phase of GAN selection, in which the GANs are assessed in terms of their ability to generate fake x-vectors. Figure 4 shows the KS values for the different types of GANs for both genders. The KS metric measures the approximate distance between the synthesized (fake) and real distributions of x-vectors. High KS values (the maximum is 1) indicate dissimilar distributions. In other words, they indicate that the synthesized speakers' distribution is far from the real speakers' distribution, which is undesired. Based on the results depicted in Fig. 4, the CGAN has the highest KS statistic score for both genders; therefore, it can be ruled out. However, the KS statistic scores are inconclusive for the GAN, WGAN, and CTGAN.

Fig. 4

KS score results for different GAN designs
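As a rough illustration of the KS metric discussed above, the following sketch computes the two-sample KS statistic as the largest gap between two empirical cumulative distribution functions. How the statistic is aggregated over the dimensions of an x-vector is not stated here, so averaging the per-dimension values in `mean_ks` is an assumption made for illustration, and both function names are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, fake):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted, fake_sorted = sorted(real), sorted(fake)
    gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in set(real_sorted) | set(fake_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_fake = bisect_right(fake_sorted, x) / len(fake_sorted)
        gap = max(gap, abs(cdf_real - cdf_fake))
    return gap


def mean_ks(real_vectors, fake_vectors):
    """Assumption: average the per-dimension KS statistic over all dimensions."""
    dims = len(real_vectors[0])
    scores = [
        ks_statistic([v[d] for v in real_vectors], [v[d] for v in fake_vectors])
        for d in range(dims)
    ]
    return sum(scores) / dims
```

A value of 0 means the two samples have identical empirical distributions, while a value of 1 means they do not overlap at all, which matches the interpretation given above.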

To further assess the GANs, the diversity of the generated x-vectors was measured by computing the cross-cosine similarity between x-vector pairs. Figure 5 compares the cross-cosine similarity distribution for the four GAN designs. Compared to the original data distribution for both males and females, the x-vectors generated by the vanilla GAN appear to have a broadly similar distribution but with some undesired density above 0.8. On the other hand, the CTGAN's data distribution is good enough to indicate diverse identities, while in the case of the WGAN and CGAN, the results show a dominating state of mode collapse. Hence, this leaves the GAN and CTGAN as the potential candidates that learned the distribution of real x-vectors. The choice between the GAN and CTGAN is made by considering their effect on the quality of the anonymized speech, as discussed in the following subsection.

Fig. 5

Cross-cosine similarity distribution for the four GAN designs
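The cross-cosine similarity check described above can be sketched as follows. This is a minimal illustration that assumes the score is the cosine similarity of every unordered pair of distinct x-vectors in a pool; the function names and the helper that measures the share of scores above 0.8 are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_cosine_scores(vectors):
    """Cosine similarity for every unordered pair of distinct vectors."""
    return [cosine_similarity(u, v) for u, v in combinations(vectors, 2)]


def fraction_above(scores, threshold=0.8):
    """Share of pairwise scores above the threshold (a crude collapse indicator)."""
    return sum(s > threshold for s in scores) / len(scores)
```

If most pairwise scores pile up near 1, the generated identities are nearly identical, which is the mode-collapse pattern reported for the WGAN and CGAN.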

5.3 Speaker anonymization evaluation results

As a second phase of GAN selection, the performance of the GAN and CTGAN is assessed in terms of anonymization quality. The evaluation considered varying the pool size as well as the distance measure. As mentioned in Section 4, an ASV_eval system is used to evaluate the speaker verifiability of the speech produced by the proposed anonymization system. Similar to the evaluation carried out for the baseline [6], the ASV system in [20] is used, as implemented in the Kaldi speech recognition toolkit [19]. It is trained on the VoxCeleb dataset and then adapted to the vctk domain using 2580 utterances from 20 unused speakers in the vctk corpus. To assess the quality of speech after anonymization, DeepSpeech [8], pre-trained on clean external data, is used as the ASR_eval system.

5.3.1 GAN pool evaluation

Table 7 shows the anonymization evaluation results for the GAN-generated pool in terms of WER when the pool size is 1K. The GAN evidently scored an extremely high WER. Effectively, more than 90% of the words transcribed from the anonymized utterances are wrong compared to the same non-anonymized utterances. This indicates the poor quality of the anonymized waveform and its failure to maintain linguistic content.

Table 7
ASR results for vanilla GAN using validation data

Dataset	Data	WER (%), small LM	WER (%), large LM
LibriSpeech-dev	a	91.03	90.23
vctk-dev	a	91.44	90.53
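For readers unfamiliar with the metric, the sketch below shows one common way to compute WER: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the ASR output, divided by the number of reference words. It is an illustrative implementation that assumes whitespace tokenization and a non-empty reference, not the scoring tool used in the experiments.

```python
def word_error_rate(reference, hypothesis):
    """WER (%) = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because insertions are counted, the WER can exceed 100%; a value above 90%, as in Table 7, means that almost none of the reference words were recovered.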

Note that a language model (LM) is a probability distribution over sequences of words, learned from a large text corpus so that it captures the proper sequencing of words in a spoken language. The LM is used to enhance the output of the ASR model by correcting its mistakes according to the learned contextual probabilities of the language. The size of the corpus used to train the LM contributes strongly to the model's accuracy. In this work, two LMs, small and large, are used, where the name refers to the size of the LM training corpus.

In addition, a subjective evaluation, carried out by listening to a large subset of the utterances anonymized using the GAN pool, revealed significant corruption in the anonymized voice, which had a buzzing, bee-like quality. This violates the objective of anonymization, which is to produce a human-like voice. Based on this, the use of the vanilla GAN for x-vector generation was dropped, and no further experiments were conducted with it.

5.3.2 CTGAN pool evaluation

The evaluation conducted in the previous two subsections makes the CTGAN the network of choice for generating the pool of fake x-vectors in the proposed model. In this experiment, the effect of varying the pool size of speaker identities in the anonymization process is studied for the CTGAN. Results are reported in terms of the speaker verifiability metrics (EER, C_llr) and the speech quality metric (WER) for various pool sizes (1K, 3K, 5K), for both genders, and on multiple test datasets.

In Fig. 6, steady performance of the CTGAN is observed over various pool sizes and multiple datasets in terms of EER and C_llr. This indicates that the speaker verifiability metrics do not depend on the size of the anonymization pool. The EER and C_llr values are also lower in the (a-a) case, i.e., when both utterances are anonymized. This results from the anonymization projecting the utterances onto relatively similar speaker identities. In Fig. 7, there is minimal difference in WER values due to varying the pool size for the different datasets. However, the results with the large language model tend to be better, as the larger LM contributes more to correcting the mistakes in the ASR output. As a result of the varying pool size experiment, the smallest pool size (1K) was used throughout the rest of the experiments to shrink the size of the whole anonymization system.

Fig. 6

ASV results (EER) for CTGAN-generated pool on Test Sets, with C_llr score on top of each bar.

Fig. 7

ASR results for CTGAN-generated pool with different pool sizes. Small/Large refers to the language model size.

As discussed earlier, the x-vector selection process from the pool relies on calculating a distance measure, which could be the PLDA distance or the cross-cosine score. The following experiment studies the effect of the choice of distance metric on anonymization performance. Figure 8 shows the EER and C_llr scores for the anonymization system when the PLDA and cross-cosine similarity distance metrics are used to select the x-vector from the synthesized pool. The average EER over both test datasets was 42.2% and 43.03% for the PLDA and cross-cosine similarity, respectively. The PLDA and cross-cosine distance metrics scored an average C_llr of 96.47 and 95.1, respectively. These results imply that the choice of distance metric has no significant impact on the performance of the anonymization system in terms of speaker verification.

Regarding the effect of the distance metric on the quality of anonymized speech, Fig. 9 shows that both distance metrics (cosine and PLDA) result in comparable WER scores for different datasets and different language models. Hence, it can be concluded that the two distance metrics do not differ significantly in their effect on the performance of the anonymization system.

Fig. 8

ASV Results using PLDA and Cross-Cosine distances on test sets. The C_llr score is shown on top of each bar.

Fig. 9

ASR results using the PLDA and Cross-Cosine distances on test sets. Small/Large refers to the language model size.
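To make the role of the distance metric concrete, here is a minimal sketch of a distance-based selection from the pool using cosine distance. It assumes that the pseudo-speaker is formed by ranking pool candidates by their distance from the source x-vector, keeping the farthest ones, and averaging a random subset of them. The function names, the candidate counts, and the fixed seed are illustrative assumptions rather than the settings used in the experiments, and PLDA scoring is not shown.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity, so larger values mean less similar vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def select_pseudo_xvector(source, pool, n_farthest=3, n_average=2, seed=0):
    """Average a random subset of the pool candidates farthest from the source."""
    rng = random.Random(seed)
    # Rank the pool from most to least distant from the source speaker.
    ranked = sorted(pool, key=lambda cand: cosine_distance(source, cand),
                    reverse=True)
    chosen = rng.sample(ranked[:n_farthest], n_average)
    dims = len(source)
    return [sum(vec[d] for vec in chosen) / n_average for d in range(dims)]
```

Swapping `cosine_distance` for another scoring function changes only the ranking step, which is consistent with the observation above that the choice of metric has little effect on the final results.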

5.4 Comparison with previous work

In this section, the performance of the proposed anonymization system is compared with the baseline anonymization system [6] and the enhanced system in [25] which uses Gaussian Mixture Models (GMM) to generate the pool of x-vectors.

In terms of ASV performance, Fig. 10 shows the EER (y-axis) and C_llr (bar tops) values for the three systems under consideration on the vctk-test and LibriSpeech-test subsets. A state-of-the-art ASV system with a cosine back-end has an EER of around 10% [29]. For ASV, this value means that the false acceptance and false rejection rates are both equal to 10%, which matches the objective of a general ASV system: to verify speakers accurately, in other words, to minimize the false rejection and false acceptance rates. In the anonymization case, however, values around 50% indicate that the system misidentifies a given speaker as another speaker half of the time, as if it were tossing a coin. The ASV system is thus completely fooled and fails to verify the identity of the speaker given an original sample and an anonymized one. This indicates that the two compared speakers are different from each other, which is exactly the goal of anonymization. Furthermore, as displayed in the figure, the average EER for all systems in the (original-anonymized) case tends to be larger than in the (anonymized-anonymized) case, because in the latter case all speakers are sampled from the same distribution and synthesized artificially and are thus more similar to one another than in the former case.

The $C_{llr}^{\min}$, which is the number on top of each bar, represents how well the two classes are separated from each other. The higher the value, the better the distinction between classes. As shown in Fig. 10, all values fall within a range consistent with the anonymization objective.

Fig. 10

ASV results on testing datasets using CTGAN, GMM [25], and Baseline [6].
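The EER interpretation above can be illustrated with a small sketch that sweeps a decision threshold over ASV scores and reports the point where the false acceptance and false rejection rates are closest. It is a simplified approximation that assumes higher scores mean "same speaker"; the function name and the toy score lists are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER (%) by sweeping the threshold over all observed scores."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return 100.0 * eer
```

With perfectly separated score lists the function returns 0, whereas identical target and non-target score distributions give 50, the coin-toss regime that anonymization aims for.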

As for the performance of the three systems in terms of ASR, Fig. 11 shows the WER scores, where the proposed anonymization system clearly achieves the best performance (the lowest WER). This implies that the proposed solution better preserves the linguistic content of the anonymized speech and makes it more intelligible. This is an essential aspect of anonymization, as the anonymized data is needed for training and personalization purposes; therefore, good speech quality is critical for the data to remain valuable.

Fig. 11

ASR results on testing datasets using CTGAN, GMM [25], and Baseline [6].

In summary, the proposed anonymization system achieved similar speaker verifiability metrics (EER, C_llr) and a relatively lower WER compared to the other two systems. Moreover, it achieved this using fake speaker identities, whereas the other two systems relied on real identities; the use of fake identities is the main contribution of this work.

6 Conclusion

The speaker anonymization problem aims to suppress the personally identifiable information in a speech signal, leaving all other aspects intact. This issue has been raised by the Interspeech VoicePrivacy Challenge [6] which served as an initiative to spread the effort of developing speech privacy-preserving solutions amongst the research community. The challenge provided a baseline model for the anonymization task that targets anonymizing speech utterances using a pool of real x-vectors. This poses a security risk as anonymized speech using these real speaker identities in the form of x-vectors violates the privacy of the real speaker.

In this paper, we attempted to address this issue by examining the use of generative adversarial networks (GANs) to generate a pool of synthetic x-vectors that can be used instead of the real pool in the baseline system. Because the pseudo-identities are synthesized artificially, this approach makes it extremely difficult to trace anonymized utterances back to the original speaker, even with sophisticated attacker systems. It thus adds an extra anonymization layer to the whole process while ensuring better quality for the anonymized utterances, making them closer to natural speech.

We evaluated four different types of GANs for generating the x-vectors. The performance of these four alternatives was assessed using metrics that measure their ability to generate a distribution close to the distribution of real x-vectors, in addition to speaker verifiability and speech intelligibility and quality. The conditional tabular GAN (CTGAN) showed the best performance in this regard. Specifically, the CTGAN achieved a WER of 6.27% and 6.5% on the LibriSpeech dev and test datasets, respectively, with EER and C_llr values comparable to those of the baseline. These results suggest that artificially synthesized speaker identities can perform better than original x-vectors.

We believe this is the first attempt at synthesizing human speaker identities using GANs. The scope of use for such a model extends beyond anonymization, for example, to text-to-speech systems and to data augmentation for creating speech corpora to train different models.

Footnotes

Conflict of interest

The authors do not have any conflict of interest to declare related to any material discussed in this article.

Data availability statement

The datasets used during the current study for training and evaluation are available in the public domain. The links to these datasets are available in the references section.

The data generated by this work (Synthesized Identities) can be made available upon request.

References

1. Alegre Federico, Soldi Giovanni and Evans Nicholas, Evasion and obfuscation in automatic speaker verification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 749–753. IEEE, 2014.

2. Arjovsky Martin, Chintala Soumith and Bottou Léon, Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.

3. Bahmaninezhad Fahimeh, Zhang Chunlei and Hansen John HL, Convolutional neural network based speaker deidentification. In Odyssey, pages 255–260, 2018.

4. Brümmer and Du Preez, Application-independent evaluation of speaker detection, Computer Speech & Language 20 (2006), 230–275.

5. Champion Pierre, Jouvet Denis and Larcher Anthony, A study of f0 modification for x-vector based speech pseudonymization across gender, arXiv preprint arXiv:2101.08478, 2021.

6. Fang Fuming, Wang Xin, Yamagishi Junichi, Echizen Isao, Todisco Massimiliano, Evans Nicholas and Bonastre Jean-Francois, Speaker anonymization using x-vector and neural waveform models, arXiv preprint arXiv:1905.13561, 2019.

7. Goodfellow Ian J, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron and Bengio Yoshua, Generative adversarial networks, arXiv preprint arXiv:1406.2661, 2014.

8. Hannun Awni, Case Carl, Casper Jared, Catanzaro Bryan, Diamos Greg, Elsen Erich, Prenger Ryan, Satheesh Sanjeev, Sengupta Shubho, Coates Adam, et al. Deep speech: Scaling up end-to-end speech recognition, arXiv preprint arXiv:1412.5567, 2014.

9. Hashimoto Kei, Yamagishi Junichi and Echizen Isao, Privacy preserving sound to degrade automatic speaker verification performance. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5500–5504. IEEE, 2016.

10. Ioffe, Probabilistic linear discriminant analysis, Computer Vision–ECCV 3954 (2006), 531–542.

11. Jin Qin, Toth Arthur R, Schultz Tanja and Black Alan W, Speaker de-identification via voice transformation. In 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, pages 529–533. IEEE, 2009.

12. Justin Tadej, Štruc Vitomir, Dobrišek Simon, Vesnicer Boštjan, Ipšić Ivo and Mihelič France, Speaker de-identification using diphone recognition and speech synthesis. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 4, pages 1–7. IEEE, 2015.

13. Knote Robin, Janson Andreas, Eigenbrod Laura and Söllner Matthias, The what and how of smart personal assistants: principles and application domains for IS research, Multikonferenz Wirtschaftsinformatik (MKWI), 2018.

14. Google LLC, Voice search mobile use statistics. https://www.thinkwithgoogle.com/marketing-strategies/search/voice-search-mobile-use-statistics/, May 2021.

15. Magarinos Carmen, Lopez-Otero Paula, Docio-Fernandez Laura, Rodriguez-Banga Eduardo, Erro Daniel and Garcia-Mateo Carmen, Reversible speaker de-identification using pre-trained transformation functions, Computer Speech & Language 46 (2017), 36–52.

16. Mirza Mehdi and Osindero Simon, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784, 2014.

17. Persaud Mark, Where is voice tech going? https://techcrunch.com/2020/07/29/voice-tech-in-2020/, July 2020.

18. Pobar Miran and Ipšić Ivo, Online speaker de-identification using voice transformation. In 2014 37th International convention on information and communication technology, electronics and microelectronics (MIPRO), pages 1264–1267. IEEE, 2014.

19. Povey Daniel, Ghoshal Arnab, Boulianne Gilles, Burget Lukas, Glembek Ondrej, Goel Nagendra, Hannemann Mirko, Motlicek Petr, Qian Yanmin, Schwarz Petr, et al. The kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.

20. Prince Simon JD and Elder James H, Probabilistic linear discriminant analysis for inferences about identity. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.

21. Qian Jianwei, Du Haohua, Hou Jiahui, Chen Linlin, Jung Taeho, Li Xiang-Yang, Wang and Deng Yanbo, Voicemask: Anonymize and sanitize voice input on mobile devices, arXiv preprint arXiv:1711.11460, 2017.

22. Radford Alec, Metz Luke and Chintala Soumith, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434, 2015.

23. Snyder David, Garcia-Romero Daniel, Sell Gregory, Povey Daniel and Khudanpur Sanjeev, X-vectors: Robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329–5333. IEEE, 2018.

24. Tomashenko Natalia, Srivastava Brij Mohan Lal, Wang Xin, Vincent Emmanuel, Nautsch Andreas, Yamagishi Junichi, Evans Nicholas, Patino Jose, Bonastre Jean-Francois, Noé Paul-Gauthier, et al. The voiceprivacy 2020 challenge plan, 2020.

25. Turner Henry, Lovisotto Giulio and Martinovic Ivan, Speaker anonymization with distribution-preserving x-vector generation for the voiceprivacy challenge 2020, arXiv preprint arXiv:2010.13457, 2020.

26. Voigt Paul and Von dem Bussche Axel, The eu general data protection regulation (gdpr), A Practical Guide, 1st Ed., Cham: Springer International Publishing 10 (2017), 3152676.

27. Wang Xin and Yamagishi Junichi, Neural harmonic-plus-noise waveform model with trainable maximum voice frequency for text-to-speech synthesis, arXiv preprint arXiv:1908.10256, 2019.

28. Lei, Skoularidou Maria, Cuesta-Infante Alfredo and Veeramachaneni Kalyan, Modeling tabular data using conditional gan, arXiv preprint arXiv:1907.00503, 2019.

29. Zeng Chang, Wang Xin, Cooper Erica, Miao Xiaoxiao and Yamagishi Junichi, Attention back-end for automatic speaker verification with multiple enrollment utterances, ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
