Emotional privacy-preserving of speech based on generative adversarial networks

Abstract

Consumer electronic devices with voice assistants are becoming increasingly popular in modern intelligent homes. Nevertheless, directly uploading unprocessed speech data, which may contain sensitive attributes, to a cloud server poses a significant risk to user privacy. To address this privacy issue, this paper proposes a privacy-enhancing model to protect speech emotions based on generative adversarial networks (PSEGAN). The model aims to prevent the inference of emotional attributes while maintaining the accuracy and utility of speech features. PSEGAN benefits from three modules: (1) A pre-trained speaker matcher imposes generative constraints on the model during the training phase to ensure that the generated speech retains the essential information needed for speaker recognition. (2) Attribute adversarial networks generate perturbed speech that transforms emotional attributes while preserving the utility of the speech. (3) Gated Residual Networks (GRN) handle the long-short term dependencies of speech signals. The PSEGAN model addresses the utility loss found in traditional speech privacy preservation methods based on generative adversarial networks (GAN). Experimental results show that on the RAVDESS dataset, PSEGAN reduces emotion recognition accuracy by 80.7%, while speaker recognition accuracy only decreases by 1.1%. These findings demonstrate that PSEGAN effectively mitigates the leakage of emotional attributes while maintaining high utility.

Keywords

Speech emotion privacy, voice assistants, attribute recognition, generative adversarial networks

1. Introduction

The rapid development of the intelligent consumer Internet of Things has popularized consumer electronic devices with voice assistants. The widespread use of voice technology in various fields greatly enhances our daily lives, especially in areas such as user authentication, system authorization, and recommendation services.^1–8 For example, short video platforms and shopping platforms use voice data and advanced big data analytics to provide personalized recommendations, thereby increasing user engagement and retention.^9–11 Consumer electronic devices with voice assistants represented by Amazon Echo and Google Home have surged recently and become an important auxiliary tool for supporting the intelligent life of modern people. Consumer electronic devices with voice assistants deployed in users’ homes collect speech data and upload them to cloud servers for storage and processing. These devices use speech recognition algorithms to analyze the data and generate corresponding speech feedback. A third party performs speech enhancement analysis of the speech data on the cloud server, which can yield valuable user information.^12–14 Digital healthcare company CompanionMx can predict emotional states and mental disorders, such as depression and anxiety through subtle changes in a patient’s tone of speech. Speech analysis can serve as a model for evaluating the probability of customer default in the financial industry and for verifying the authenticity of a candidate’s experience in human resources.

Although speech technology brings convenience to our daily lives, it also raises serious privacy concerns due to the sensitive information contained in speech data, such as emotions, gender, and health conditions.^15–17 When speech data containing sensitive information is stored on third-party cloud servers, it becomes vulnerable to theft by attackers. When emotional attributes are compromised, an attacker can use the compromised emotional data to conduct personalized psychological attacks or manipulation to influence an individual’s decisions and behavior. Figure 1 illustrates the process of attribute inference attacks by a malicious entity on unprotected speech data in a speech system. In this scenario, the malicious entity extracts the user’s sensitive attributes through an attribute inference classifier, exposing the user to privacy risks. To protect user data, many countries have proposed data privacy protection policies and regulations, such as the General Data Protection Regulation (GDPR),¹⁸ which emphasize data minimization principles. The development of attribute privacy-enhancing models is crucial for the security of consumer electronic devices with voice assistants. By protecting sensitive emotional attributes in speech data from malicious attacks while preserving other attributes, these models can maintain user trust and encourage the wider adoption of voice technology. Our research addresses these privacy concerns by proposing a novel model that effectively balances privacy protection with the utility of speech data. This model safeguards user privacy by protecting emotional attributes while retaining other valuable attributes, thereby enhancing the functionality and reliability of voice-assisted consumer devices.

A straightforward method for protecting speech privacy is speech de-identification, which disrupts the association between biometric features and specific individuals.^19–22 For example, noise is often added to obscure original speech and protect user privacy. However, this approach compromises the verification utility of the speech. Aloufi et al.²³ provided a privacy-preserving technology to sanitize speech input directly at edge devices, but their method always converts the original emotion to neutral, which can be detected by third parties. Pascual et al.²⁴ and Ericsson et al.²⁵ utilize adversarial learning to protect speech privacy, but existing GAN-based methods compromise the utility of generated speech. In summary, existing methods primarily have two limitations: (1) The utility of protected speech is significantly reduced while achieving privacy protection. (2) Privacy protection performance is inadequate: deep neural networks can still infer sensitive attributes from protected speech data with high accuracy, thus posing a risk of privacy leakage.

Figure 1.

Example of inferred threat model for soft biometric attributes in speech recognition system.

To address the aforementioned limitations, we propose the PSEGAN model, which can accurately identify and transform the sensitive attribute of emotion in speech data. Differing from traditional methods of attribute deletion and binary attribute fusion, PSEGAN randomly transforms emotion attributes while preserving the utility of speech data. This model prevents the typical reduction in utility and authenticity caused by partial information loss after transformation. Additionally, by modifying the traditional GAN²⁶ architecture and introducing new modules, PSEGAN not only overcomes the significant utility degradation after privacy protection found in traditional speech privacy models but also enhances the quality of generated speech. The transformed speech retains the original content and speaker identification, achieving a trade-off between the utility of data sharing and personal privacy protection. After preprocessing the original speech input and extracting speech features, the network architecture of the model learns the features that need to be transformed and generates corresponding feature representations. The model is trained using a multi-task loss function, focused on both privacy preservation and utility retention. These features are then inversely transformed using a pre-trained MelGAN²⁷ model to generate desensitized speech. The PSEGAN model aims to learn biometric speech features via advanced GAN technology, transforming emotional attributes to maximize speech utility without disclosing sensitive information. The following are the major contributions of this work:

We proposed a generalized privacy-enhancing model for speech data stored in speech recognition systems that maintains the utility of speech while preserving the privacy of user emotional attributes.

We incorporated GRN²⁸ to learn the long-short term dependencies present in the speech data, thus enabling the model to more accurately recognize the speaker’s intention, emotional state, and other relevant information.

By constraining the generator with the output of a pre-trained speaker matcher, the speaker recognition accuracy of the generated speech is improved.

The experimental results demonstrate that the proposed method outperforms other existing models in both privacy protection and speech utility, validating the effectiveness and superiority of the PSEGAN model.

The remainder of this paper is organized as follows: Section 2 offers a brief overview of the related literature. Section 3 defines the privacy problem. Section 4 presents a detailed explanation of the proposed model. Section 5 showcases the experimental results and subsequent discussion. Finally, Section 6 summarizes the study and explores future research directions.

2. Related work

2.1. Intermediate speech representations

When modeling speech data, most studies use spectrograms to shift features toward low-dimensional domains because of the complex interaction between the high temporal resolution of the original waveforms and their long-term and short-term dependencies. Two prevalent intermediate representations are aligned linguistic features²⁹ and Mel-Spectrograms. Aligned linguistic features require a complex model architecture and are not flexible enough to handle the diversity and irregularity of natural language, which may leave subtle differences in spoken language uncaptured. The Mel-Spectrogram uses the Mel scale, a nonlinear frequency scale on which equal steps are perceived as equally spaced by human listeners, to reflect the response of the human ear. It accentuates low-frequency differences, which are information-rich, and gives less weight to high-frequency differences. Kumar et al.²⁷ addressed the challenge of non-invertible spectrograms by introducing MelGAN, a fully convolutional model designed to convert Mel-Spectrograms back into waveforms.
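To make the nonlinearity of the Mel scale concrete, the following minimal sketch spaces filter edges evenly in mels and maps them back to Hz. The conversion formula (2595 * log10(1 + f/700)) is one widely used variant and is an assumption here, as are the function names and the 0 to 8000 Hz range; the text does not state which variant PSEGAN uses.

```python
import math

def hz_to_mel(f_hz):
    """Convert Hz to mels using the common 2595*log10(1 + f/700) variant (assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min, f_max):
    """Edges (in Hz) of n_bands triangular filters spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_bands + 2)]

# 80 bands up to the Nyquist frequency of 16 kHz audio (illustrative values).
edges = mel_band_edges(80, 0.0, 8000.0)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

The spacing between consecutive edges grows with frequency, so low frequencies are resolved finely and high frequencies coarsely.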

2.2. Speech morphing technology

Speech morphing technology has broad applications in the speech field. It is used in the entertainment industry for character dubbing and speech editing and can protect sensitive information by adjusting acoustic properties such as pitch and volume.³⁰ Although it is effective in protecting privacy, it can degrade the naturalness and clarity of speech communication, alter intonation, and reduce the expressiveness and diversity of the speaker’s speech.

2.3. Speech anonymization techniques

The Google team³¹ introduced the d-vector method, which marked a significant advancement in speech biometric technology through deep representation learning. By using speaker identities as labels for speech frames during training, the d-vector method improved the efficiency of speech biometric recognition. However, its reliance on speaker identity labels may result in poor performance with unknown speakers. Building on this approach, Snyder et al.³² developed the x-vector model, which enhances model performance by merging frame-level features into higher-level segmental features through a pooling layer. Despite these improvements in feature extraction, the x-vector method still carries risks of identity leakage in high-dimensional feature spaces and can reduce the clarity and naturalness of converted speech. Srivastava et al.³³ extended the x-vector approach to speaker anonymization by converting speech to that of a random pseudo-speaker. The effectiveness of this anonymization depends on factors such as the distance metric between speakers, the selected region of speaker space, gender, and allocation strategy. Perero et al.³⁴ proposed a speech anonymization method using autoencoders (AE) and adversarial training. This method involves extracting x-vectors from discourse, converting them to new x-vectors through AE, suppressing speaker, gender, and stress information via adversarial training, and generating anonymous speech using a Neural Speech Synthesizer (NSS) with the anonymous x-vectors, fundamental frequency, and phoneme information. Nevertheless, this approach may compromise the naturalness and intelligibility of speech during the conversion process. Yao et al.³⁵ proposed a system comprising four main components: a feature extractor, an acoustic model, an anonymization model, and a neural vocoder. This setup effectively generates anonymous speaker vectors that do not correspond to any real speaker.

2.4. The vector quantized variational autoencoder (VQ-VAE)

The Vector Quantized Variational Autoencoder (VQ-VAE) effectively conceals speaker identity by quantizing linguistic content into a discrete latent space using learned codebooks and reconstructing speech waveforms during decoding through a combination of encoded attributes. Stoidis et al.³⁶ utilized VQ-VAE to separate biometric information, such as gender and identity, from speech content, thereby enhancing privacy while preserving utility. However, this method faces challenges, including reduced performance with complex speech features and bottlenecks when processing high-dimensional biometric data.

2.5. Adversarial representation learning for privacy

Adversarial learning plays a crucial role in privacy protection strategies by using adversarial methods to balance identity utility and soft biometric privacy. For example, GenGAN,³⁷ PCMelGAN,²⁵ and Double Anon³⁸ undergo adversarial training within the GAN model, aiming to minimize utility distortion in generated speech and maximize privacy protection for sensitive data. Training models to learn ambiguous sensitive attributes effectively reduces the risk of privacy inference. GenGAN is a model that synthesizes speech using generative adversarial networks. Through adversarial training within the GAN model, it generates synthetic speech that resembles real speech but contains ambiguous identity information. This method aims to reduce the recognizability of the speaker’s identity while maintaining the content and intelligibility of the speech, thus protecting the speaker’s privacy. PCMelGAN is a generative adversarial network method focused on reconstructing Mel-spectrograms. Through adversarial training, it can generate high-quality speech synthesis and learn speech features with ambiguous gender and identity information, thereby enhancing the effectiveness of speech privacy protection. PCMelGAN ensures the naturalness and intelligibility of speech by accurately reconstructing Mel-spectrograms while concealing sensitive personal features. Double Anon is a speaker anonymization system based on CycleGAN for protecting speech data privacy. This method modifies the gender and accent information of the speaker in the original speech signal, generating more natural-sounding anonymized speech and achieving de-identification of the speaker.

In light of the shortcomings of the above approaches, we address the emotional privacy-preserving of speech by using GRN to learn the long and short-term associations in speech data, enabling more accurate categorization of speech attributes for sentiment transformation. Additionally, we use a speaker matcher with consistency loss to retain the valid information in speech, thus improving utility.

3. Problem formulation

Given a speech sample $x$ containing multiple attribute labels and a randomly selected target vector $s^{'}$ and noise $z$ , the overall goal is to train a GAN model $g$ , which transforms the speech $x$ into the target speech $x^{'}$ = $g (x, s^{'}, z)$ . This transformed speech should exhibit the following characteristics:

3.1. Attribute privacy preservation

We define emotion as the sensitive attribute that needs to be protected. After the speech passes through the transformation by $g$ , the accuracy of inferring its emotional attribute should significantly decrease, thereby reducing the individual’s privacy risk. Hence, we define privacy protection as successful when there is an inconsistency between the original attribute label and the inferred attribute label. This can be expressed as:

$$F_{emo}(x) \neq F_{emo}(x^{'}) \tag{1}$$

where $F_{emo}$ is the trained emotional attribute classifier.

3.2. Speech utility protection

Under the premise of achieving attribute privacy protection, the biometric utility of the transformed speech must be considered. In our work, our goal is to preserve the availability of attributes other than emotional information, that is, the accuracy of inference, to realize the utility of the speech. This means that for a given speech $x$ , the transformed speech $x^{'}$ should have the same output as $x$ for all attributes other than emotion, as determined by the trained attribute classifiers. This can be represented as:

$$F_{*}(x) = F_{*}(x^{'}) \tag{2}$$

where $F_{*}$ is a set of other attribute classifiers (e.g., identity, gender, content). This implies that the transformed speech should be able to match the original speech in terms of speech utility.
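Criteria (1) and (2) can be checked empirically by comparing classifier outputs before and after transformation. The sketch below is illustrative only: the function names and the toy label lists are hypothetical stand-ins for the outputs of $F_{emo}$ and one classifier from $F_{*}$ (here, speaker identity).

```python
def privacy_success_rate(orig_emotions, transformed_emotions):
    """Fraction of samples for which F_emo(x) != F_emo(x'), as in Eq. (1)."""
    pairs = list(zip(orig_emotions, transformed_emotions))
    return sum(a != b for a, b in pairs) / len(pairs)

def utility_retention_rate(orig_attrs, transformed_attrs):
    """Fraction of samples for which F_*(x) == F_*(x'), as in Eq. (2)."""
    pairs = list(zip(orig_attrs, transformed_attrs))
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical classifier outputs for four utterances.
emo_before = [0, 3, 5, 7]
emo_after = [2, 3, 1, 4]
spk_before = ["a01", "a02", "a03", "a04"]
spk_after = ["a01", "a02", "a03", "a09"]

privacy = privacy_success_rate(emo_before, emo_after)    # 3 of 4 emotion labels changed
utility = utility_retention_rate(spk_before, spk_after)  # 3 of 4 speaker labels kept
```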

3.3. Realism

Furthermore, both the original and transformed speech should remain auditorily realistic, which has two benefits: (1) the speech samples remain suitable for existing computational speech system tasks; (2) there is auditory consistency between the original and transformed speech.

4. The proposed model

4.1. Overview

For the original speech, our goal is to adaptively generate privacy-enhanced speech through a deep learning model. This speech will differ in emotional attributes from the original speech but will retain identity information, all while maintaining the speech’s intelligibility and usability. Figure 2 shows the process of converting original speech into privacy-enhanced speech using the PSEGAN model. Initially, the MFCC extraction module processes the original voice through several steps: pre-emphasis, framing, windowing, Fast Fourier Transform (FFT), Mel filtering, logarithmic transformation, and Discrete Cosine Transform (DCT). These steps extract the necessary speech features. These features are then input into the generator, which transforms them into desensitized features. Subsequently, the transformed features are processed by the pre-trained MelGAN module to generate the final desensitized speech. This resulting speech is privacy-enhanced, ensuring that user privacy is safeguarded while preserving the speech’s utility.
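The first three steps of the extraction pipeline (pre-emphasis, framing, and windowing) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the pre-emphasis coefficient 0.97, the Hamming window, and the hop size of 256 are assumed values, while the frame length of 1024 follows the preprocessing settings reported in Section 5.2. The FFT, Mel filtering, logarithm, and DCT steps are omitted.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; alpha = 0.97 is an assumed, commonly used value."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames, dropping the incomplete tail."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return [signal[i * hop:i * hop + frame_len] for i in range(n_frames)]

def hamming(frame_len):
    """Hamming window coefficients (assumed window type)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)) for n in range(frame_len)]

def windowed_frames(signal, frame_len=1024, hop=256):
    """Pre-emphasize, frame, and window a signal; the hop size is an assumption."""
    window = hamming(frame_len)
    emphasized = pre_emphasis(signal)
    return [[s * w for s, w in zip(frame, window)]
            for frame in frame_signal(emphasized, frame_len, hop)]

# Toy input: a 440 Hz tone sampled at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(4096)]
frames = windowed_frames(tone)
```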

Figure 2.

A block diagram of a model for speech privacy protection is presented. This model involves converting original speech data into feature representations through feature extraction. These representations are then fed into an adversarial generative network to generate transformed features. Subsequently, these transformed features are converted into privacy-protected speech using a pre-trained MelGAN model. The bottleneck layer incorporates a modified multi-layer gated residual network.

4.2. PSEGAN model

The Privacy-Conditional Generative Adversarial Network (PCGAN)²⁶ architecture integrates the generator within the filtering model of the Generative Adversarial Privacy (GAP)³⁹ model, enhancing the protection of sensitive attributes while maintaining the practical utility of the data. The network architecture proposed in this paper is based on PCGAN and has been modified according to privacy and utility objectives. The task of the generator $g$ is to produce new data endowed with synthetic attributes $s^{'}$ that are independent of the original attribute $s$ . The corresponding discriminator $D$ takes either real speech $x$ or speech $x^{'}$ produced by generator $g$ as input. When dealing with real speech data, the discriminator $D$ is trained to predict the original attribute $s$ . For the generated data, it is trained to recognize these inputs as generated data, thereby accurately distinguishing between real and generated speech. Additionally, a pre-trained speaker recognition model is incorporated into the system to enhance the accuracy of this critical utility metric.

In this adversarial setting, the discriminator $D$ aims to distinguish between the sensitive attributes of the real and generated speech, ensuring that the attribute $s^{'}$ of the generated speech is independent of the original attribute $s$ . The goal of the generator $g$ is to minimize the difference between the generated speech and the real speech, while being constrained by a pre-trained speaker recognition model to preserve the necessary speaker characteristics.

Specifically, we operate on the input waveforms, which are converted into 80-band Mel-Spectrograms. In Figure 3, we present the architecture of the entire model and the composition of the related loss functions. We use a generator $g$ , a discriminator $D$ , and a pre-trained speaker matcher classifier, training them amidst the adversarial objectives of privacy and utility. The generator $g$ is a model modified based on the U-Net⁴⁰ architecture, comprising a contracting path, a modified bottleneck layer, and an expanding path. The discriminator $D$ is constructed with a ResNet18⁴¹ architecture. The classifier is composed of a simple AlexNet⁴² architecture. The goal of $g$ is to produce speech data to deceive $D$ , whereas $D$ ’s task is to distinguish between original and synthetic speech emotion. We maximize utility by minimizing the distortion in the generated speech and minimize the risk of inferring sensitive attributes by maximizing adversarial loss. $D$ learns to distinguish between real and generated emotional data, and our objective is to perform emotional conversion to protect the sensitive attribute of emotion while still preserving the usability of the remaining speech characteristic representation.

Figure 3.

Schematic of the PSEGAN architecture, which generates perturbations to obfuscate emotion attribute classifiers while preserving the usability of the speech data. (A) Different components of PSEGAN: generator, discriminator (emotion classification), and speaker matcher. (B) Transforming the input original speech label into a target label and processing it through the speaker matcher. The time-domain signal waveform is first converted into a Mel-Spectrogram $M$ , and then input to the generator along with the noise vector $z \sim N(0, 1)$ and a randomly selected target attribute $s^{'}$ . Key terms - $x$ : original speech, $x^{'}$ : transformed speech, $M$ : original Mel-Spectrogram, $M^{'}$ : transformed Mel-Spectrogram, $s$ : real emotion label, $L_{d}$ : distortion loss, $L_{a}$ : adversarial loss, $L_{real}$ : real loss, $L_{fake}$ : fake loss, $L_{id}$ : speaker matching loss.

The input to the generator $g$ includes a pre-processed Mel-Spectrogram $M$ , a random noise vector $z$ that follows a normal distribution, and a target attribute $s^{'}$ for conversion. Before training begins, we uniformly sample a batch of speech signals $x$ carrying the sensitive attribute (emotion) and their corresponding labels $s$ from the dataset $D$ :

$$(x_{1}, s_{1}), \dots, (x_{n}, s_{n}) \sim D \tag{3}$$

Each speech signal $x_{i}$ is transformed into a Mel-Spectrogram $m_{i}$ belonging to $M$ ($i = 1, \dots, n$) and is normalized such that the amplitude is constrained within the range [0, 1], as detailed below:

$$m_{i} = \mathrm{STFT}(x_{i}) \tag{4}$$

where $\mathrm{STFT}(\cdot)$ is the Short-Time Fourier Transform.

For multi-class emotion encoding, we uniformly and randomly select emotion categories from a range of 0 to 7, representing eight distinct emotional labels:

$$s_{1}^{'}, \dots, s_{m}^{'} \in \{0, 1, \dots, 7\} \tag{5}$$

The goal of this method is to enable the generator $g$ to learn and master the feature differences among various emotional labels in different spectrogram representations. By using $g$ to modify features, the transformation of emotional characteristics is achieved. Our model samples target emotions from a uniform distribution rather than a normal distribution so that every emotion is equally likely to be selected, thus avoiding bias based on the actual distribution of emotions. Through this method, we can effectively transform emotional states while safeguarding the sensitive nature of emotional attributes.
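The uniform selection of target emotions can be illustrated with a short sketch. The helper name and the seeded generator are hypothetical and are used only to make the example reproducible; the label range 0 to 7 follows Eq. (5).

```python
import random

NUM_EMOTIONS = 8  # labels 0..7, as in Eq. (5)

def sample_target_emotions(batch_size, rng):
    """Draw one target emotion per sample, each label with probability 1/8."""
    return [rng.randrange(NUM_EMOTIONS) for _ in range(batch_size)]

rng = random.Random(0)  # fixed seed for reproducibility (illustrative)
targets = sample_target_emotions(8000, rng)
counts = [targets.count(k) for k in range(NUM_EMOTIONS)]
```

With a large batch, the entries of `counts` are close to one another, reflecting the equal selection probability. A purely uniform draw can occasionally coincide with the original label; the text does not state whether PSEGAN excludes that case.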

The noise vector $z$ follows a standard normal distribution $N (0, 1)$ . This means that each element of $z$ is independently drawn from a normal distribution with mean $0$ and standard deviation $1$ . This can be expressed as:

$$z \sim N(0, I) \tag{6}$$

where $I$ represents the identity matrix, whose dimensions match those of $z$ , ensuring that each element has an independent standard normal distribution. This notation emphasizes the independence and distribution characteristics of each element in $z$ . This vector is inserted into the bottleneck of generator $g$ , introducing a certain degree of randomness at the transition between the contraction and expansion paths. This ensures that the synthesized speech is distinguishable from the original and increases the variability of the reconstruction. The introduction of $z$ into generator $g$ enhances its nonlinear processing capabilities and the richness of speech representation, making the generated speech samples more varied and diverse in terms of gender characteristics, and also improves the model’s generalizability.

GRN²⁸ is designed based on residual blocks, incorporating time-dilated convolutions and Gated Linear Units (GLUs) into traditional bottleneck residual blocks, effectively enhancing the network’s capacity to handle complex time-series data. In the field of speech enhancement, GRN utilizes its extensive receptive field to deeply model the Time-Frequency (TF) representation of the input, allowing the network to not only capture transient features in speech data but also learn long-short term dependencies, thus more accurately recognizing and preserving speaker and other attribute features. Furthermore, the structure of GRN is particularly suited to processing dynamic variations in speech, as time-dilated convolutions cover a wider temporal span without increasing the computational burden. This is crucial for capturing fluctuations in intonation, rhythm, and emotions in speech, which are highly variable over time and essential for understanding linguistic content and speaker intentions. In this paper, by integrating multiple layers of GRN into the bottleneck layer of the generator, we significantly enhance the model’s learning and expressive capacity for speech features. This not only improves the quality of speech reproduction but also optimizes the model’s understanding of long-short term features in speech, making the generated speech more natural and realistic, and achieving better results in learning emotional features, thereby enhancing the model’s broad applicability and performance.
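Two properties mentioned above can be illustrated numerically: dilation widens the temporal span covered by a stack of convolutions without adding weights, and a GLU gates one signal with the sigmoid of another. In the sketch below, the kernel size of 3 and the dilation schedule are assumed values chosen for illustration; they are not taken from the PSEGAN configuration.

```python
import math

def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked stride-1 dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def glu(a, b):
    """Gated linear unit: elementwise a * sigmoid(b)."""
    return [x / (1.0 + math.exp(-g)) for x, g in zip(a, b)]

# Six layers with the same kernel size, hence the same number of weights per layer.
plain = receptive_field(3, [1, 1, 1, 1, 1, 1])
dilated = receptive_field(3, [1, 2, 4, 8, 16, 32])

# A gate input of 0 passes half of the signal; a large gate input passes almost all of it.
gated = glu([2.0, 2.0], [0.0, 50.0])
```

Under these assumptions the undilated stack covers 13 frames while the dilated stack covers 127, which is the sense in which dilation extends the temporal span at the same per-layer cost.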

4.3. Training objectives

The generator total loss $L_{g}$ is designed to facilitate utility by minimizing distortion loss while maximizing adversarial loss to protect the privacy of the data. We use the Mean Squared Error (MSE) between $M$ and $M^{'}$ as the distortion loss $L_{d}$ . Compared to L1 loss, MSE can yield smoother outputs. Adversarial loss $L_{a}$ is based on the cross-entropy loss⁴³ between the discriminator’s emotion prediction for the generated speech and the target emotion. The speaker matcher loss $L_{id}$ is based on the cross-entropy loss between the label id prediction from the real data and the label id prediction from the generated data, ensuring the utility of speaker identification. The total loss function of the discriminator $L_{D}$ consists of two components: real loss $L_{real}$ and fake loss $L_{fake}$ . In this setting, $L_{real}$ is defined as the cross-entropy loss between the discriminator’s prediction for the original data’s Mel-spectrogram and the true emotional label. $L_{fake}$ is the cross-entropy loss between the discriminator’s prediction for the generated Mel-spectrogram and the target emotion. The purpose of this configuration is to optimize the discriminator to accurately distinguish between actual Mel-spectrograms and those created by the generator, while ensuring that the generated spectrograms meet the expected transformation objectives in terms of emotional expression.

In the following, we describe the details of each loss.

(1) Distortion Loss: The distortion loss is utilized between $M$ and $M^{'} = g (M)$ to enhance the clarity and auditory quality of the generated results. Dynamic parameters are employed to improve the quality of the training process. The distortion loss can be expressed as:

$$L_{d} = E_{x \sim p(x)} \left[ \| g(x_{i}, s_{i}^{'}, z_{i}) - x_{i} \|^{2} \right] \tag{7}$$

where $p(x)$ represents the marginal probability distribution of the sample $x$ .

(2) Adversarial Loss: Cross-entropy loss is used between the emotional label of the generated speech $D (g (x, s^{'}, z))$ and the target emotion $s^{'}$ to achieve the transformation of emotions, thus ensuring that the emotional classification result of the transformed speech matches the target emotion $s^{'}$ . We define the adversarial loss as:

$$L_{adv}(g) = - E_{x, s^{'} \sim p(x, s^{'})} [\log D(g(x, s^{'}, z), s^{'})] \tag{8}$$

and

$$\begin{aligned} L_{adv}(D) = & - E_{x, s \sim p(x, s)} [\log D(x, s)] \\ & - E_{x, s^{'} \sim p(x, s^{'})} [\log (1 - D(g(x, s^{'}, z)))] \end{aligned} \tag{9}$$

where $p(x, s)$ is the joint distribution of the sample $x$ and its corresponding label $s$ .

(3) Identity Loss: To preserve the critical attribute of speaker identity, we incorporate a pre-trained speaker matcher (SM) before training the model. The identity loss adds to the generator loss the cross-entropy between the speaker identification of the original speech, SM $(x)$ , and that of the transformed speech, SM $(g (x, s^{'}, z))$ . The identity loss can be defined as:

$$L_{id} = - E_{x, s \sim p(x, s)} [\log p_{sm}(s_{i} | g(x_{i}, s_{i}^{'}, z_{i}))] \tag{10}$$

where $p_{sm}(s_{i} | g(x_{i}, s_{i}^{'}, z_{i}))$ is the SM’s predicted probability that the $i$th sample belongs to the true label $s_{i}$ .

(4) Full objective: The full objective is written as:

$$\max_{g} \min_{D} \; L_{d} + L_{adv}(g) + L_{adv}(D) + L_{id} \tag{11}$$
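As a rough illustration of how the generator-side terms combine, the following sketch computes the distortion, adversarial, and identity losses for a single toy sample. It is a simplified stand-in, not the training code: the function names and the toy probabilities are hypothetical, the distortion term uses the mean rather than the sum of squared errors (following the MSE description), and the three terms are summed with equal weights as an assumption.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(probs, label):
    """Negative log-probability assigned to the given class index."""
    return -math.log(probs[label])

def generator_loss(mel, mel_gen, d_probs_gen, target_emotion, sm_probs_gen, speaker_id):
    """Unweighted sum of distortion, adversarial, and identity terms (assumed weighting)."""
    l_d = mse(mel_gen, mel)                             # distortion between M' and M
    l_adv = cross_entropy(d_probs_gen, target_emotion)  # D's prediction vs target emotion
    l_id = cross_entropy(sm_probs_gen, speaker_id)      # SM's prediction vs true speaker
    return l_d + l_adv + l_id

# Toy values: a two-bin "spectrogram", two emotion classes, two speakers.
total = generator_loss(
    mel=[0.0, 1.0],
    mel_gen=[0.0, 0.5],
    d_probs_gen=[0.5, 0.5],
    target_emotion=0,
    sm_probs_gen=[0.25, 0.75],
    speaker_id=1,
)
```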

Finally, this paper utilizes a pre-trained MelGAN vocoder, a non-autoregressive conditional waveform synthesis model, to convert the generated spectrogram $M^{'}$ back into a waveform. Algorithm ?? shows the whole process of network training.

5. Experiment

5.1. Dataset

In our experiments, we used the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)⁴⁴ and the Berlin Database of Emotional Speech (EmoDB)⁴⁵. RAVDESS contains speech recordings of 24 actors (12 male, 12 female) expressing 8 different emotions (calm, happy, sad, angry, fearful, surprised, disgusted, and neutral) with two intensity levels (normal and strong) for each emotion. The total duration of the recordings is approximately 15 hours. EmoDB contains speech recordings of 10 actors (5 male, 5 female) expressing 7 different emotions (pleased, sad, angry, fearful, disgusted, surprised, and neutral), with multiple samples for each emotion. The total duration of the recordings is approximately 2.5 hours.

In this paper, we selected 1200 samples from the RAVDESS dataset as the training set and 120 samples as the test set. We uniformly selected these samples to ensure that the training and test sets include performances by different actors and expressions of various emotions. All recordings are labeled with the actor’s name, emotion, gender, and other information. In our experiments, these speech files were downsampled from the original sampling rate of 48 kHz to 16 kHz. To keep the speech segment length consistent, we used zero-padding to make the length of each recording equal to 16.7 seconds multiplied by the sampling rate (i.e., 16.7 * 16000 samples). Similarly, for the EmoDB dataset, we selected 100 samples as the training set and 400 samples as the test set for experimentation and testing of the model.
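The length normalization can be sketched as follows. The helper name is hypothetical, and truncating clips that exceed the target length is an assumption, since the text only describes zero-padding.

```python
SAMPLE_RATE = 16000
# round() avoids a floating-point truncation error in 16.7 * 16000.
TARGET_LEN = round(16.7 * SAMPLE_RATE)

def pad_or_trim(samples, target_len=TARGET_LEN):
    """Right-pad with zeros up to target_len; longer clips are truncated (assumption)."""
    if len(samples) >= target_len:
        return list(samples[:target_len])
    return list(samples) + [0.0] * (target_len - len(samples))

# A short toy clip of 1000 samples is extended to the fixed length.
padded = pad_or_trim([0.1] * 1000)
```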

5.2. Implementation and training details

This study adopts an end-to-end training strategy. All experiments were conducted on an NVIDIA RTX A4000 GPU with 16 GB of memory. The model is trained with the Adam optimizer, with the learning rate set to 0.0001 for the generator and 0.0002 for the discriminator, and the betas set to (0.5, 0.9). For spectrogram preprocessing, the sampling rate was fixed at 16000 Hz, the frame length was 1024, 80 Mel frequency channels and 32 Mel filters were used, and the STFT window size and length were both set to 1024 to capture audio features accurately. We preprocess the speech signal by normalizing the entire spectrogram to the [0, 1] range, which preserves the overall structure of the Mel-spectrogram while standardizing the input data and, in turn, improves the model’s performance on emotion recognition tasks. The spectrogram is then transformed into a logarithmic Mel-spectrogram for use in the model.
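The normalization and logarithmic compression described above can be illustrated with a small sketch. The helper names `minmax_normalize` and `log_compress`, the toy input values, and the floor `eps` that avoids taking the logarithm of zero are all assumptions introduced for illustration; the paper does not specify these details.

```python
import math


def minmax_normalize(spec):
    """Scale every value of a 2-D spectrogram (list of rows) into [0, 1]
    using the global minimum and maximum of the whole spectrogram."""
    flat = [v for row in spec for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:  # constant input: map everything to 0 (assumption)
        return [[0.0 for _ in row] for row in spec]
    return [[(v - lo) / span for v in row] for row in spec]


def log_compress(spec, eps=1e-5):
    """Natural-log compression; eps is an assumed floor to avoid log(0)."""
    return [[math.log(v + eps) for v in row] for row in spec]


# Toy 2 x 3 "Mel-spectrogram" with made-up magnitudes.
mel = [[0.0, 2.0, 4.0],
       [1.0, 3.0, 8.0]]
norm = minmax_normalize(mel)
log_mel = log_compress(norm)
print(norm[0])  # [0.0, 0.25, 0.5]
```

Normalizing with a single global minimum and maximum, rather than per frequency band, keeps the relative energy between bands intact, which matches the stated goal of maintaining the overall features of the Mel-spectrogram.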

5.3. Evaluation

We compared PSEGAN with two leading models designed for voice privacy protection, GenGAN and PCMelGAN. GenGAN generates gender-ambiguous voices to safeguard gender privacy, while PCMelGAN employs filtering and generation modules to replace sensitive information. For a fair comparison, we optimized GenGAN and PCMelGAN on the same datasets, RAVDESS and EmoDB, to achieve their best performance. GenGAN was chosen because it focuses on protecting a predefined attribute, which we set to emotion in this study by configuring it to produce voices with ambiguous emotional traits. Its generator uses a U-Net architecture and an adversarial loss to balance signal distortion with privacy protection. We evaluated GenGAN by assessing the privacy and utility of the emotion-ambiguous voices it generated. PCMelGAN creates privacy-enhanced voices and can protect various attributes; in our experiments, its sensitive attribute was set to emotion. This setup allowed us to investigate whether spectrum-based privacy models leak soft biometric information during voice verification.

Table 1 presents the performance of the three models under the same testing conditions, all evaluated with a pre-trained classification model. We conducted tests on both the RAVDESS and EmoDB datasets. On the original speech, the pre-trained classification model achieved speaker recognition accuracies of 97.6% and 95.6% on RAVDESS and EmoDB, respectively, along with an emotion recognition accuracy of 94.9%. A model must balance privacy protection against utility. We therefore used a low emotion recognition accuracy as the indicator of privacy, and speaker, gender, and content recognition accuracy as the indicators of utility. On RAVDESS, PSEGAN desensitized unprotected speech without significantly reducing speaker recognition accuracy (96.5% versus 97.6%), and lowered the emotion recognition accuracy to 14.2%, close to the ideal 12.5% that corresponds to guessing uniformly among eight emotions. On EmoDB, although content and gender recognition accuracy declined noticeably, PSEGAN still performed well in privacy protection and speaker recognition. PCMelGAN preserves the other speech attributes better on EmoDB, but it protects emotion poorly (43.5% emotion recognition accuracy). We attribute the weaker results on EmoDB to its smaller number of samples compared with RAVDESS, which prevents the models from fully learning the data features. In summary, PSEGAN strikes a better balance between privacy protection and utility than the other two schemes.

Table 1.
Comparing the recognition accuracy of different models.

Dataset	Method	Speaker recognition	Emotion recognition	Gender recognition	Content recognition
RAVDESS⁴⁴	Original	97.6%	94.9%	96.6%	96.5%
	GenGAN	79.3%	14.6%	79.3%	62.9%
	PCMelGAN	88.7%	16.4%	83.6%	77.6%
	PSEGAN (ours)	96.5%	14.2%	89.8%	89.7%
EmoDB⁴⁵	Original	95.6%	94.9%	96.0%	95.1%
	GenGAN	85.5%	13.5%	45.5%	8.9%
	PCMelGAN	81.5%	43.5%	88.5%	82.0%
	PSEGAN (ours)	93.7%	17.19%	60.8%	10.1%
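As a quick arithmetic check, the absolute accuracy drops reported in the abstract (80.7% for emotion recognition and 1.1% for speaker recognition) and the 12.5% chance level can be recomputed from the RAVDESS rows of Table 1. The sketch below uses hypothetical helper names and simply copies the tabulated values.

```python
# Accuracies (%) copied from the RAVDESS rows of Table 1.
original = {"speaker": 97.6, "emotion": 94.9, "gender": 96.6, "content": 96.5}
psegan = {"speaker": 96.5, "emotion": 14.2, "gender": 89.8, "content": 89.7}


def chance_level(num_classes):
    """Accuracy (%) of guessing uniformly among num_classes labels."""
    return 100.0 / num_classes


def accuracy_drop(before, after):
    """Absolute drop in percentage points for every task, rounded to 0.1."""
    return {task: round(before[task] - after[task], 1) for task in before}


drops = accuracy_drop(original, psegan)
print(drops["emotion"])  # 80.7
print(drops["speaker"])  # 1.1
print(chance_level(8))   # 12.5
```

The drops are differences in percentage points, not relative reductions, which is how the abstract's figures line up with the table.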

In Table 2, we evaluate the overall performance of the models in terms of privacy protection and utility maintenance by calculating the Equal Error Rate (EER) of the original speech and of the speech processed by each model. We use the EER of speaker recognition ($EER_{sr}$), gender recognition ($EER_{gr}$), and content recognition ($EER_{cr}$) as utility indicators, and the EER of emotion recognition ($EER_{er}$) as the privacy indicator. For the utility indicators, a lower EER represents better utility maintenance; for the privacy indicator, an EER close to 50% indicates better privacy protection. To facilitate comparison, we normalize these values, with all EERs expressed in percent. The normalized value $eer_{*}$ for the utility indicators is defined as $100 - EER_{*}$, and the normalized value $eer_{er}$ for the privacy indicator is defined as $100 - 2 \times |EER_{er} - 50|$. Under this normalization, a value of 100 represents the best privacy protection or utility maintenance. On RAVDESS, $eer_{er}$ shows that the proposed scheme is broadly comparable to the two comparison schemes in privacy protection, while it has a clear advantage in maintaining the utility of speech attributes. On EmoDB, although PCMelGAN preserves utility better, its privacy protection is significantly weaker than that of the other two models. Overall, our proposal strikes a better balance between utility and privacy protection than the comparison models.

Table 2.

Comparing the equal error rates of different models.

Dataset	Method	$EER_{sr}$	$EER_{er}$	$EER_{gr}$	$EER_{cr}$	$eer_{sr}$	$eer_{er}$	$eer_{gr}$	$eer_{cr}$
RAVDESS⁴⁴	Original	1.4%	11.1%	5.1%	5.5%	98.6%	22.2%	94.9%	94.5%
	GenGAN	17.2%	61.5%	79.6%	71.7%	82.8%	77.0%	20.4%	28.3%
	PCMelGAN	18.1%	61.1%	13.2%	16.9%	81.9%	77.8%	86.8%	83.1%
	PSEGAN (ours)	9.4%	68.3%	9.4%	10.1%	90.6%	63.4%	90.6%	89.9%
EmoDB⁴⁵	Original	1.1%	0.1%	0.5%	0.2%	98.9%	0.2%	99.5%	99.8%
	GenGAN	28.4%	44.1%	50.1%	40.3%	71.6%	88.2%	49.9%	59.7%
	PCMelGAN	8.4%	17.9%	4.1%	5.6%	91.6%	35.8%	95.9%	94.4%
	PSEGAN (ours)	5.9%	43.1%	18.4%	43.5%	94.1%	86.2%	81.6%	56.5%
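The normalized columns of Table 2 can be reproduced from the raw EER columns with the short sketch below. It assumes that all EER values are expressed in percent, so that the utility normalization reads 100 minus the EER, which is what the tabulated values follow; the function names are hypothetical.

```python
def normalize_utility_eer(eer_percent):
    """Utility score: a lower EER is better, so 100 is the ideal."""
    return 100.0 - eer_percent


def normalize_privacy_eer(eer_percent):
    """Privacy score: an EER of 50% (chance) is ideal and maps to 100."""
    return 100.0 - 2.0 * abs(eer_percent - 50.0)


# RAVDESS, PSEGAN row of Table 2 (raw EERs in percent).
print(round(normalize_utility_eer(9.4), 1))   # 90.6
print(round(normalize_privacy_eer(68.3), 1))  # 63.4

# RAVDESS, original speech: emotion is easy to recognize, so privacy is low.
print(round(normalize_privacy_eer(11.1), 1))  # 22.2
```

The privacy score is symmetric around 50%, so an emotion EER of 68.3% is penalized in the same way as one of 31.7%: both are equally far from chance.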

Subfigures (a) and (c) in Figure 4 compare the waveforms of the original and transformed speech samples. These waveforms are largely similar, with the main differences arising from noise introduced during the training process. This similarity indicates that the waveforms maintain consistent characteristics in terms of content, demonstrating effective utility preservation. In contrast, the spectrograms in subfigures (b) and (d) show some frequency differences, but they also retain a certain degree of similarity. This suggests that only a small portion of the content has changed, specifically in terms of emotional characteristics, while most of the original content has been preserved.

Figure 4.

The waveforms and spectrograms of the original and synthesized speech. Subfigures (a) and (b) illustrate the waveform and spectrogram of the original speech, respectively. Subfigures (c) and (d) showcase the waveform and spectrogram of the synthesized speech, respectively.

In Figure 5, we employ the t-SNE⁴⁶ dimensionality reduction technique to visualize, on a two-dimensional plane, the emotional attributes of the original speech data and of the speech data transformed by PSEGAN. In Figure 5(a) (original speech), sample points with the same label form clearly separated clusters. This indicates that the emotional attribute features in the speech data are distinguishable, so emotional information can be inferred from the features. In Figure 5(b), by contrast, sample points of all categories are mixed together, indicating that the emotional attribute features in the transformed data have been effectively concealed. Emotional information can therefore no longer be inferred from the features, which protects the emotional attributes.

Figure 5.

Data distribution of speech emotion representations based on the t-SNE⁴⁶ dimensionality reduction technique. Subfigure (a) presents the dimensionality reduction results of the emotional attributes of the original speech data, while subfigure (b) shows the dimensionality reduction results of the emotional attributes of the speech data after transformation by PSEGAN. Each point in the figure represents a speech sample extracted from the dataset.

Figure 6 presents the ROC curves for anonymized speech in terms of speaker and emotion recognition. Subfigure (a) shows that, for emotion recognition, PSEGAN achieves the smallest area under the curve (AUC), indicating robust privacy protection. Conversely, subfigure (b) shows that, for speaker recognition, PSEGAN’s AUC is the largest, so speaker recognition capability is effectively preserved. Together, these findings highlight PSEGAN’s strong performance on both metrics.

Figure 6.

ROC curves for desensitized speech with respect to speaker recognition and emotion recognition.
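For readers unfamiliar with how the ROC curve, the AUC, and the EER relate, the following sketch computes all three from binary labels and classifier scores. It is a generic illustration on made-up scores, not the evaluation code used in this paper; the function names are hypothetical, and the EER is approximated by the ROC point at which the false positive rate and the false negative rate are closest.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over scores.

    labels: 1 for the target class, 0 otherwise; a higher score means
    the sample looks more like the target class.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(zip(scores, labels), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        s = order[i][0]
        # Tied scores are processed together and yield one ROC point.
        while i < len(order) and order[i][0] == s:
            if order[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))


def equal_error_rate(fpr, tpr):
    """Approximate EER: ROC point where FPR and FNR (1 - TPR) are closest."""
    k = min(range(len(fpr)), key=lambda j: abs(fpr[j] - (1 - tpr[j])))
    return (fpr[k] + (1 - tpr[k])) / 2


# Made-up scores for six trials (three target, three non-target).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(labels, scores)
print(round(auc(fpr, tpr), 4))               # 0.8889
print(round(equal_error_rate(fpr, tpr), 4))  # 0.3333
```

An AUC near 0.5 corresponds to a classifier that does no better than chance, which is the desired outcome for emotion recognition on desensitized speech, whereas an AUC near 1 is the desired outcome for speaker recognition.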

In Table 3, we ablate individual components of our method to study the effect of each one. Removing the speaker matcher (SM) leads to a significant decrease in speaker recognition accuracy (from 96.5% to 72.4%), showing that speech utility is no longer preserved. Removing the GRN module results in a marked increase in emotion recognition accuracy (from 14.2% to 44.82%), indicating a substantial reduction in the effectiveness of emotion privacy protection.

Table 3.

Ablation experiment results.

Method	Speaker recognition	Emotion recognition
PSEGAN	96.5%	14.2%
without SM	72.4%	24.14%
without GRN	88.79%	44.82%

6. Conclusion

In this paper, a privacy protection mechanism, abbreviated as PSEGAN, was designed to safeguard the emotional privacy of speech data in consumer electronic devices with voice assistants. By employing adversarial learning strategies, PSEGAN effectively removes sensitive information from original speech data and replaces it with realistic new information, significantly enhancing privacy protection. Experimental results demonstrate that PSEGAN strikes a favorable balance between maintaining speech utility and preserving privacy, underscoring the innovative nature of our model. However, this research still has some limitations. The model proposed in this paper focuses on single-attribute protection, which may limit its applicability in more complex scenarios. Furthermore, the reliance on specific datasets could impact the generalizability of our findings. Future research should aim to expand PSEGAN’s capabilities to protect multiple sensitive attributes simultaneously, enhancing its overall performance. Additionally, incorporating diverse and large-scale datasets will improve the model’s robustness. Optimizing computational efficiency will be crucial for the practical deployment of PSEGAN in consumer speech devices.

Footnotes

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 62272103 and 61872090, and by the Natural Science Foundation of Fujian Province under Grant 2023J01531.

ORCID iD

Jinsen Lin

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. Huang. A novel residual shrinkage block-based convolutional neural network for improving the recognition of motor imagery EEG signals. Int J Intell Comput Cybern 2022; 16: 420–442.

2. Zhang, Lin, Pan, et al. Cache reallocation-based page-level flash translation layer for smartphones. IEEE Trans Consum Electron 2023; 69: 671–679.

3. Huang. Manifold embedded global and local discriminative features selection for single-shot multi-categories clothing recognition and retrieval. Int J Intell Comput Cybern 2023; 17: 363–394. DOI: 10.1108/IJICC-10-2023-0302.

4. Kumar, Dhanalakshmi. EYE-YOLO: A multi-spatial pyramid pooling and focal-EIOU loss inspired tiny YOLOv7 for fundus eye disease detection. Int J Intell Comput Cybern 2024. DOI: 10.1108/IJICC-02-2024-0077.

5. Chen, Lin, Liu, et al. NT-DPTC: A non-negative temporal dimension preserved tensor completion model for missing traffic data imputation. Inf Sci (Ny) 2023; 653: 119797.

6. Chen, Lin, et al. Consistency and dependence-guided knowledge distillation for object detection in remote sensing images. Expert Syst Appl 2023; 229: 120519.

7. Zhong, Lin. Dynamic multi-scale topological representation for enhancing network intrusion detection. Comput Secur 2023; 135: 103516.

8. Lin, Pan, Feng, et al. MTDB: An LSM-tree-based key-value store using a multi-tree structure to improve read performance. J Supercomput 2024. DOI: 10.1007/s11227-024-06382-5.

9. Zhong, Lin, Zhang, et al. A survey on graph neural networks for intrusion detection systems: Methods, trends and challenges. Comput Secur 2024; 141. DOI: 10.1016/j.cose.2024.103821.

10. Zhao, Al-Dubai, et al. Routing schemes in software-defined vehicular networks: Design, open issues and challenges. IEEE Intell Transp Syst Mag 2021; 13: 217–226.

11. Zhao, Qian, Hawbani, et al. Overtaking feasibility prediction for mixed connected and connectionless vehicles. IEEE Trans Intell Transp Syst 2024; 1–16. DOI: 10.1109/TITS.2024.3398602.

12. Zhao, Yang, Tan, et al. Vehicular computation offloading for industrial mobile edge computing. IEEE Trans Ind Inf 2021; 17: 7871–7881.

13. Zhao, Al-Dubai, et al. A novel prediction-based temporal graph routing algorithm for software-defined vehicular networks. IEEE Trans Intell Transp Syst 2021; 23: 13275–13290.

14. Zhao, Han, et al. Intelligent digital twin-based software-defined vehicular networks. IEEE Netw 2020; 34: 178–184.

15. Rubinstein. Big data: The end of privacy or A new beginning? Int Data Privacy Law 2013; 3: 74.

16. Hadian, Altuwaiyan, Liang, et al. Efficient and privacy-preserving voice-based search over mHealth data. In: International conference on connected health: applications, systems and engineering technologies, 2017, pp.96–101. DOI: 10.1109/CHASE.2017.66.

17. Zhao, Kumar, et al. Introduction to the special section on intelligence-empowered collaboration among space, air, ground, and sea mobile networks towards B5G. IEEE Trans Network Sci Eng 2021; 8: 2719–2721.

18. Viorescu, et al. 2018 reform of EU data protection rules. Eur J Law Public Adm 2017; 4: 27–39. https://eur-lex.europa.eu/eli/reg/2016/679/oj.

19. Chen, et al. A non-intrusive and adaptive speaker de-identification scheme using adversarial examples. In: Annual international conference on mobile computing and networking, 2022, pp.853–855. DOI: 10.1145/3495243.3558260.

20. Chen, Wang, et al. voiceCloak: Adversarial example enabled Voice de-identification with balanced privacy and utility. Proc ACM Interact Mobile Wearable and Ubiquitous Technol 2023; 7: 1–21.

21. Tavi, Kinnunen, Hautamäki. Improving speaker de-identification with functional data analysis of F0 trajectories. Speech Commun 2022; 140: 1–10.

22. Liu, Zheng, et al. Cross-domain sentiment aware word embeddings for review sentiment analysis. Int J Mach Learn Cybern 2021; 12: 343–354.

23. Aloufi, Haddadi, Boyle. Privacy preserving speech analysis using emotion filtering at the edge. In: Conference on embedded networked sensor systems, 2019, pp.426–427. DOI: 10.1145/3356250.3361947.

24. Pascual, Bonafonte, Serra. SEGAN: speech enhancement generative adversarial network. In: Interspeech, 2017, pp.3642–3646. DOI: 10.21437/Interspeech.2017-1428.

25. Ericsson, Östberg, Zec, et al. Adversarial representation learning for private speech generation. In: International conference on machine learning, 2020. https://icml.cc/virtual/2020/7189.

26. Martinsson, Zec, Gillblad, et al. Adversarial representation learning for synthetic replacement of private attributes. In: IEEE international conference on big data, 2021, pp.1291–1299. DOI: 10.1109/BigData52589.2021.9671802.

27. Kumar, de Boissiere, et al. MelGAN: generative adversarial networks for conditional waveform synthesis. In: International conference on neural information processing systems, 2019, p.12. DOI: 10.1109/radarconf2043947.2020.9266709.

28. Tan, Chen, Wang. Gated residual networks with dilated convolutions for monaural speech enhancement. IEEE/ACM Trans Audio Speech Lang Process 2019; 27: 189–198.

29. Bińkowski, Donahue, Dieleman, et al. High fidelity speech synthesis with adversarial networks. In: International conference on learning representations, 2020. https://openreview.net/forum?id=r1gfQgSFDr.

30. Lavner, Porat. Voice morphing using 3D waveform interpolation surfaces and lossless tube area functions. EURASIP J Adv Signal Process 2005; 2005: 142638.

31. Variani, Lei, McDermott, et al. Deep neural networks for small footprint text-dependent speaker verification. In: International conference on acoustics, speech and signal processing, 2014, pp.4052–4056. DOI: 10.1109/ICASSP.2014.6854363.

32. Snyder, Garcia-Romero, Sell, et al. X-vectors: Robust DNN embeddings for speaker recognition. In: International conference on acoustics, speech and signal processing, 2018, pp.5329–5333. DOI: 10.1109/ICASSP.2018.8461375.

33. Srivastava BML, Maouche, Sahidullah, et al. Privacy and utility of X-vector based speaker anonymization. IEEE/ACM Trans Audio Speech Lang Process 2022; 30: 2383–2395.

34. Perero-Codosero, Espinoza-Cuadros, Hernández-Gómez. X-vector anonymization using autoencoders and adversarial training for preserving speech privacy. Comput Speech Lang 2022; 74: 101351.

35. Yao, Wang, Zhang, et al. NWPU-ASLP system for the voiceprivacy 2022 challenge. VoicePrivacy 2022 Challenge, 2022. DOI: 10.48550/ARXIV.2209.11969.

36. Stoidis, Cavallaro. Protecting gender and identity with disentangled speech representations. In: Interspeech, 2021, pp.1699–1703. DOI: 10.21437/Interspeech.2021-2163.

37. Stoidis, Cavallaro. Generating gender-ambiguous voices for privacy-preserving speech recognition. In: Interspeech, 2022, pp.4237–4241. DOI: 10.21437/interspeech.2022-11322.

38. Prajapati, Singh, Amin, et al. Voice privacy through X-vector and CycleGAN-based anonymization. In: Interspeech, 2021, pp.1684–1688. DOI: 10.21437/INTERSPEECH.2021-1573.

39. Huang, Kairouz, Sankar. Generative adversarial privacy: a data-driven approach to information-theoretic privacy. In: Asilomar conference on signals, systems, and computers, 2018, pp.2162–2166. DOI: 10.1109/ACSSC.2018.8645532.

40. Ronneberger, Fischer, Brox. U-net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention, 2015, pp.234–241. DOI: 10.1007/978-3-319-24574-4-28.

41. Zhang, Ren, et al. Identity mappings in deep residual networks. In: European conference on computer vision, Vol. 9908, 2016, pp.630–645. DOI: 10.1007/978-3-319-46493-0-38.

42. Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. Commun ACM 2017; 60: 84–90.

43. Zhang, Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In: International conference on neural information processing systems, Vol. 31, 2018. https://proceedings.neurips.cc/paper/2018/hash/f2925f97bc13ad2852a7a551802feea0-Abstract.html.

44. Livingstone, Russo. The ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in north American English. PLoS ONE 2018; 13: e0196391.

45. Burkhardt, Paeschke, Rolfes, et al. A database of German emotional speech. In: Interspeech, Vol. 5, 2005, pp.1517–1520. DOI: 10.21437/INTERSPEECH.2005-446.

46. van der Maaten, Hinton. Visualizing data using T-SNE. J Mach Learn Res 2008; 9: 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html.