P VALL-E: An efficient multilingual speech synthesis system based on the performer architecture

Abstract

Text-to-Speech (TTS) technology converts text into human-like speech, aiding the visually impaired, providing voice assistants, and enabling automated news broadcasting. This study proposes P VALL-E, an efficient speech synthesis system enhancing Microsoft’s VALL-E model by replacing its Transformer architecture with Performer structures through a layer-wise strategy. The proposed mechanism improves processing speed for long texts and reduces parameter count, making the model suitable for resource-constrained environments. Pre-trained with English and Simplified Chinese speech data, P VALL-E leverages multilingual training to transfer knowledge to less-resourced Taiwanese speech data, improving performance across languages. A language embedding mechanism is also incorporated for accent control and personalized synthesis. Experimental results show P VALL-E matches the original VALL-E in accuracy while boosting generation speed by approximately 20%. Even with limited data, the proposed architecture performs well in multilingual settings. In addition, an Android app was developed; because of the model’s high computational requirements, the model runs on a server and the results are transmitted to users’ devices.

Keywords

Accent control, Performer, speech synthesis, text-to-speech, VALL-E

1. Introduction

In recent years, Text-to-Speech (TTS) synthesis has become a crucial aspect of modern technology, converting written text into spoken words with natural intonation and emotion. TTS systems find applications in various fields such as navigation systems, smart homes, educational resources, customer service robots, and reading articles like newspapers and magazines. Traditional TTS systems primarily utilized concatenative or parametric methods for voice generation, which often resulted in mechanical-sounding and less natural speech. The advent of deep learning has revolutionized TTS with models based on Convolutional Neural Networks (CNNs),¹ Long Short-Term Memory networks (LSTMs),² and Transformers.³ These models have significantly improved the naturalness and expressiveness of synthesized speech by learning from large datasets.

Among the cutting-edge TTS systems, the VALL-E model leverages the in-context learning capabilities of large language models (LLMs) and over 60,000 hours of English speech data, enabling zero-shot performance.⁴ VALL-E can accurately replicate a speaker’s tone and emotion with just a three-second reference, while also capturing the speaking environment. This capability allows VALL-E to generate realistic speech by understanding the context and nuances of the input text. However, the VALL-E model’s reliance on the Transformer architecture, with its large number of parameters and computational resources, poses efficiency challenges, especially when processing long sequences. Additionally, VALL-E’s training predominantly on single-language datasets limits its ability to fully exploit the contextual learning capabilities of language models. Expanding VALL-E’s capabilities to other languages, particularly those with less training data, remains an important area of research.

To address these challenges, this study proposes the use of the Performer architecture, an improved version of the Transformer, which enhances the efficiency of processing long sequences and reduces computational resource requirements. Furthermore, this study aims to utilize transfer learning to leverage extensive English and Chinese datasets, fine-tuning the model for multilingual training, thereby improving performance in languages with less available data. In summary, the proposed architecture enhances the state-of-the-art zero-shot TTS model VALL-E by replacing its Transformer architecture with the Performer model, resulting in a 20% reduction in parameter count and a 20% increase in execution speed. Additionally, the proposed method introduces multilingual training and transfer learning techniques to improve the performance of languages with limited data, and incorporates accent control features. Finally, the development of an Android application, deploying the P VALL-E model on a server for efficient speech generation, demonstrates the practical application of these advancements.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the model architecture. Section 4 describes the experiments conducted in this study. Finally, Section 5 draws conclusions.

2. Related work

2.1. Text-to-speech (TTS)

Text-to-Speech (TTS) technology converts text into speech and is used in various applications, such as assistive devices and voice assistants. Traditional TTS systems include concatenative synthesis, which uses pre-recorded speech segments, and parametric synthesis, which generates speech algorithmically. While concatenative methods yield natural speech, they require large datasets and may have unnatural transitions. Parametric methods, though flexible, often lack naturalness. Recent deep learning models, such as WaveNet,⁵ have significantly improved TTS quality. Models like Tacotron,⁶ FastSpeech,⁷ and NaturalSpeech⁸ have further enhanced naturalness and fluency. Modern TTS systems also incorporate features like accent and emotion control to personalize and enrich the speech output.⁹,¹⁰

2.2. Transformer

The Transformer³ has become a cornerstone in sequence processing, particularly in NLP. It surpasses traditional models like RNNs and CNNs by effectively handling long-range dependencies through self-attention, which calculates relationships within a sequence. This mechanism enhances both accuracy and efficiency. Transformers have also shown success in image processing (e.g., ViT¹¹) and speech recognition (e.g., Conformer¹²).

Self-attention is key to the Transformer’s capability, computing the importance of sequence elements. It uses query, key, and value vectors to generate attention weights and outputs, effectively capturing global sequence information. However, its $O (L^{2} d)$ complexity can be computationally demanding for long sequences. The Performer¹³ reduces the computational complexity of self-attention from $O (L^{2} d)$ to $O (L d^{2})$ using a kernelized attention mechanism, making it suitable for processing long sequences efficiently while maintaining performance, particularly in text and speech tasks.

2.3. Language models

Language models have evolved from statistical approaches to deep learning architectures. The Transformer introduced a major shift with its self-attention mechanism, significantly improving long-range dependency capture in NLP tasks. Models like BERT¹⁴ and GPT¹⁵ have set new standards in NLP. Recently, large language models have also been applied in speech processing, with models like Whisper¹⁶ and AudioPaLM¹⁷ showing promising results. VALL-E⁴ marks a key development in TTS, utilizing a large language model for zero-shot speech synthesis.

2.4. VALL-E

VALL-E⁴ is a pioneering TTS model that employs a large language model architecture for zero-shot capabilities. Unlike traditional TTS systems that require extensive data, VALL-E uses a GPT-like Transformer to generate speech for unseen speakers, learning text-speech relationships from a vast pre-training dataset. Its architecture includes an audio encoder, Transformer, and speech decoder, enabling efficient and natural-sounding speech synthesis across a wide range of speakers.

3. Proposed architecture

3.1. System architecture

VALL-E leverages the Large Language Model (LLM) architecture to enable context-aware learning, achieving exceptional speech synthesis quality. However, this comes at the cost of substantial model parameters and computational resources. The original VALL-E model uses a Transformer architecture, which has a computational complexity that scales quadratically with input length, making it slow for processing long sequences. This is problematic in practical applications, such as when generating speech for long articles, as it leads to inefficiencies and poor user experience. Additionally, VALL-E is trained exclusively on English, limiting its support for multilingual speech synthesis and accent control, and failing to fully utilize the advantages of LLM training on multiple languages. This study addresses these issues by modifying the VALL-E model. The system, named P VALL-E, builds upon the VALL-E model by replacing the Transformer decoder with the more efficient Performer architecture, as shown in Figure 1. The Performer is an optimized version of the Transformer, reducing the number of model parameters and increasing processing speed, especially for long sequences. To introduce accent control, a language ID is added to the input, enabling the model to learn language embeddings and distinguish between different language characteristics.

Figure 1.

P VALL-E system architecture.

3.2. Efficient transformers

Due to the quadratic computational complexity of the self-attention mechanism, the Transformer model struggles with long sequences. In recent years, various efficient Transformer variants have been proposed to address this issue. Evaluating different models can be challenging due to inconsistent benchmarks across tasks and datasets. To address this, Tay et al. introduced the Long Range Arena (LRA) benchmark,¹⁸ specifically designed to evaluate model performance on long sequences. As shown in Figure 2, the Big Bird model¹⁹ achieves the highest scores on the LRA benchmark by incorporating multiple improved attention mechanisms, demonstrating its strong performance on long text sequences. However, its speed improvements over the original Transformer are not significant. Performer and Linformer, on the other hand, offer a good balance between speed and accuracy, with both models being approximately five times faster than the Transformer, albeit with slightly lower accuracy. To reduce model parameters and computational complexity while maintaining accuracy, this study replaces the VALL-E Transformer layers with Performer layers. The Linformer was also experimented with, given its comparable performance to the Performer.

Figure 2.

Long range arena benchmark.¹⁸

3.2.1. Linformer

The Linformer, proposed by Facebook AI Research, improves the efficiency of Transformers by addressing the computational and memory inefficiencies encountered when processing long sequences. The traditional Transformer model has a self-attention mechanism with a time and space complexity of $O (L^{2})$ , where $L$ is the sequence length. This complexity makes the model inefficient for very long sequences. Linformer reduces this complexity by using linear projections to decrease the dimensionality of the key ( $K$ ) and value ( $V$ ) matrices. Linformer’s key innovation is the dimensionality reduction of the key and value matrices in self-attention. In traditional Transformers, the dimensions of $K$ and $V$ are $L \times d$ , where $d$ is the feature dimension. Linformer reduces these dimensions to $r \times d$ , where $r$ is a constant much smaller than $L$ . This is achieved by learning linear projection matrices $P_{K}$ and $P_{V}$ , which project the original matrices as follows:

$$K_{reduced} = P_{K} \cdot K, \quad V_{reduced} = P_{V} \cdot V \tag{1}$$

where $P_{K}$ and $P_{V}$ are matrices with dimensions $r \times L$. This reduces the sequence length while keeping the feature dimension unchanged.

Next, the query vector $Q$ (with unchanged dimensions of $L \times d$ ) is multiplied with the reduced key vector $K_{r e d u c e d}$ , followed by a Softmax function to normalize and compute the attention scores:

$$\mathrm{Attention}(Q, K_{reduced}, V_{reduced}) = \mathrm{softmax}\left(\frac{Q \cdot K_{reduced}^{T}}{\sqrt{d_{k}}}\right) \cdot V_{reduced} \tag{2}$$

By reducing the dimensions of the key and value matrices, Linformer lowers the original $O(L^{2})$ computational complexity to $O(L r d)$, which is near-linear, $O(L)$, in the sequence length when $r$ and $d$ are fixed. This significantly improves the efficiency of processing long sequences.
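The projection and attention steps in Equations (1) and (2) can be illustrated with a small NumPy sketch. It is not the implementation used in this study: it covers a single head, omits the causal masking an autoregressive decoder would need, and uses random matrices in place of the learned projections $P_{K}$ and $P_{V}$. The function names and toy sizes are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def linformer_attention(Q, K, V, P_K, P_V):
    """Single-head Linformer-style attention.

    Q, K, V: (L, d) arrays; P_K, P_V: (r, L) projection matrices.
    Returns an (L, d) array.
    """
    d_k = Q.shape[-1]
    K_reduced = P_K @ K                          # (r, d), Eq. (1)
    V_reduced = P_V @ V                          # (r, d), Eq. (1)
    scores = Q @ K_reduced.T / np.sqrt(d_k)      # (L, r) instead of (L, L)
    return softmax(scores, axis=-1) @ V_reduced  # (L, d), Eq. (2)


# Toy sizes for illustration only, not the configuration used in the paper.
L, d, r = 64, 16, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
# Linformer learns these projections; random values stand in for them here.
P_K = rng.standard_normal((r, L)) / np.sqrt(L)
P_V = rng.standard_normal((r, L)) / np.sqrt(L)
out = linformer_attention(Q, K, V, P_K, P_V)
```

The score matrix has shape $L \times r$ rather than $L \times L$, which is where the saving comes from. Setting both projections to the $L \times L$ identity recovers standard softmax attention.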

3.2.2. Performer

To address the high time and space complexity associated with long sequence processing, the Performer architecture¹³ was proposed by Google’s AI research team. The Performer retains the basic Transformer structure while introducing innovations in the calculation process, particularly in the self-attention mechanism. The Performer reduces computational complexity by reordering matrix multiplications, performing the key ( $K$ ) and value ( $V$ ) matrix multiplications first, followed by multiplication with the query ( $Q$ ).

This reordering, along with the use of an approximate softmax kernel to preprocess $Q$ and $K$ , allows for the operation:

$$Q' \cdot K'^{T} \approx \mathrm{softmax}(Q K^{T}) \tag{3}$$

where $Q'$ and $K'$ are the transformed matrices after applying the softmax kernel. As shown in Figure 3, this approach reduces the self-attention complexity from quadratic $O(L^{2} d)$ to linear $O(L r d)$, where $L$ is the sequence length, $d$ is the input dimension, and $r$ is a constant much smaller than $L$. This allows the Performer to handle long sequences efficiently while significantly reducing computational resource consumption.

Figure 3.

Performer attention mechanism.
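A minimal NumPy sketch of the reordering is given below. It makes several simplifying assumptions and is not the implementation used in this study: the feature map is the positive random feature construction with independent Gaussian projections (the Performer paper also describes orthogonal projections), the attention is non-causal and single-head, and the names `positive_random_features`, `performer_attention`, and the toy sizes are chosen for illustration. The key point is that $K'^{T} V$ is computed first, so no $L \times L$ matrix is ever formed.

```python
import numpy as np


def positive_random_features(X, W):
    """Map rows of X (L, d) to positive features (L, r).

    With rows of W drawn from N(0, I), phi(q) @ phi(k) equals
    exp(q @ k) in expectation.
    """
    r = W.shape[0]
    half_sq_norm = 0.5 * np.sum(X * X, axis=-1, keepdims=True)  # (L, 1)
    return np.exp(X @ W.T - half_sq_norm) / np.sqrt(r)


def performer_attention(Q, K, V, W):
    """Linear-time approximation of softmax attention (non-causal)."""
    d = Q.shape[-1]
    scale = d ** -0.25  # splits the usual 1/sqrt(d) scaling between Q and K
    Q_prime = positive_random_features(Q * scale, W)  # (L, r)
    K_prime = positive_random_features(K * scale, W)  # (L, r)
    KV = K_prime.T @ V                                # (r, d): keys and values first
    normalizer = Q_prime @ K_prime.sum(axis=0)        # (L,), softmax denominator
    return (Q_prime @ KV) / normalizer[:, None]       # (L, d)


def softmax_attention(Q, K, V):
    """Standard quadratic attention, used here only as a reference."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V


# Toy comparison; sizes and input scales are illustrative assumptions.
rng = np.random.default_rng(0)
L, d, r = 32, 8, 2048
Q = 0.3 * rng.standard_normal((L, d))
K = 0.3 * rng.standard_normal((L, d))
V = rng.standard_normal((L, d))
W = rng.standard_normal((r, d))
approx = performer_attention(Q, K, V, W)
exact = softmax_attention(Q, K, V)
mean_abs_error = float(np.mean(np.abs(approx - exact)))
```

In this toy comparison $r$ is larger than $L$ so that the approximation error is small and easy to check; the efficiency gain described above arises in the opposite regime, where $L$ is much larger than $r$.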

3.3. Accent control

To enable accent control in the speech synthesis system, this study proposes a method to set specific accents, allowing the model to simulate the pronunciation characteristics of speakers from different languages. For instance, the system can generate speech with a Simplified Chinese accent, British accent, or Taiwanese accent when reading Chinese sentences. Inspired by the YourTTS model,²⁰ a language ID is added to the input to specify the desired accent in the generated speech. To accurately identify and produce specific language characteristics, language embeddings from YourTTS are utilized, converting language features into vector representations. Each language ID corresponds to a specific embedding vector, enabling the model to learn and simulate various language accents. Language embeddings capture language-specific features such as phonetics, intonation, and pronunciation patterns, providing the model with essential linguistic information. For example, English and Chinese language embeddings differ due to significant differences in pronunciation rules and intonation between the two languages.

In Table 1, an example is shown using the sentence “[EN]The truth must be told at all costs.[EN]” to illustrate how language cues are added at the beginning and end of a sentence. The [EN] tag indicates that the sentence should be synthesized with an English accent. By adding language cues to the sentence, the model can recognize the desired accent for synthesis, ensuring the speech output matches the expected accent. During inference, accent control can be achieved by setting the language ID to the desired language.

Table 1.
Language ID label example.

Language ID	Sentence
ZH	[ZH]simplified Chinese context.[ZH]
EN	[EN]English context.[EN]
TW	[TW] traditional Chinese context.[TW]
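The tagging scheme in Table 1 and the idea of one embedding vector per language ID can be sketched as follows. This is an illustrative sketch only: the integer IDs, the embedding dimension, the random initial values, and the choice to add the language vector to every token vector are assumptions, since the text does not specify how the embedding is combined with the input.

```python
import random

# Tags follow Table 1; the integer IDs are assumed for illustration.
LANGUAGE_IDS = {"ZH": 0, "EN": 1, "TW": 2}


def tag_sentence(text, lang):
    """Wrap a sentence in language cues, e.g. "[EN]...[EN]"."""
    if lang not in LANGUAGE_IDS:
        raise ValueError(f"unknown language ID: {lang}")
    return f"[{lang}]{text}[{lang}]"


def make_language_embeddings(dim, seed=0):
    """Create one vector per language ID.

    In a trained model these vectors are learned parameters;
    random values are used here as placeholders.
    """
    rng = random.Random(seed)
    return {lang: [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for lang in LANGUAGE_IDS}


def add_language_embedding(token_vectors, lang, table):
    """Add the language vector to every token vector (assumed scheme)."""
    lang_vec = table[lang]
    return [[t + l for t, l in zip(vec, lang_vec)] for vec in token_vectors]


tagged = tag_sentence("The truth must be told at all costs.", "EN")
table = make_language_embeddings(dim=4)
tokens = [[0.0] * 4 for _ in range(3)]  # three dummy token vectors
conditioned = add_language_embedding(tokens, "TW", table)
```

Under this scheme, switching the accent at inference time amounts to passing a different key, for example `"TW"` instead of `"EN"`, while the text stays the same.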

3.4. Performer layer replacement strategy

The goal of this study is to replace the Transformer decoders in VALL-E with Performer decoders to reduce the number of parameters and improve execution efficiency. Rather than replacing all Transformer layers at once, this study adopts a layer-by-layer replacement strategy that retains the pre-trained weights of the remaining Transformer layers. The core idea is to use the retained pre-trained Transformer weights to guide the newly added Performer layers. This approach helps the new layers quickly learn the model’s existing knowledge, reducing parameters and enhancing computational efficiency while maintaining stable model performance. The training process is outlined in Algorithm 1.

Figure 4 shows the Performer replacement process. Following this method, the Transformer layers of the original VALL-E are replaced with Performer layers one at a time, for a total of $N$ replacement operations (assuming VALL-E contains $N$ Transformer layers). After this step-by-step replacement, the final P VALL-E model consists entirely of Performer layers. The experimental results in Section 4 show that this gradual replacement provides higher accuracy than replacing all Transformer layers at once.

Figure 4.

Performer layer replacement process.
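The control flow of a layer-by-layer replacement can be sketched in plain Python. This is a schematic illustration under stated assumptions and should not be read as a reproduction of Algorithm 1: layers are represented by string labels rather than real modules, and `fine_tune` is a hypothetical callback standing in for whatever training is performed after each swap. The `order` argument covers both directions of replacement, starting from the input side or from the output side.

```python
def replacement_schedule(num_layers, order="end"):
    """Yield (index, layer_types) after each replacement step.

    order="end" starts from the layer closest to the output;
    order="front" starts from the layer closest to the input.
    """
    if order not in ("front", "end"):
        raise ValueError("order must be 'front' or 'end'")
    layers = ["transformer"] * num_layers
    if order == "front":
        indices = range(num_layers)
    else:
        indices = range(num_layers - 1, -1, -1)
    for idx in indices:
        layers[idx] = "performer"
        yield idx, list(layers)


def progressive_replace(num_layers, fine_tune, order="end"):
    """Swap one layer per step and call fine_tune after every swap.

    fine_tune(layer_types, new_index) is a hypothetical hook; the
    retained Transformer layers keep their pre-trained weights.
    """
    history = []
    for idx, layer_types in replacement_schedule(num_layers, order):
        fine_tune(layer_types, idx)
        history.append(layer_types)
    return history


# Example with a 12-layer stack and a callback that only records the order.
swapped = []
history = progressive_replace(12, lambda layers, idx: swapped.append(idx))
```

After the $N$-th step every entry is `"performer"`, matching the description of the final model as consisting entirely of Performer layers.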

4. Experiments

The experiments were conducted on a system running Ubuntu 22.04. The training was performed using an NVIDIA RTX 3080Ti GPU with 12 GB of memory, which meets the minimum memory requirement for the model.

4.1. Datasets

The superior performance of VALL-E is attributed to training on a large-scale semi-supervised English speech dataset, including automatically generated text labels. Due to resource constraints, we could not use a dataset of the same scale as VALL-E. Instead, we selected nine high-quality supervised Chinese and English speech datasets as substitutes. These datasets provide accurate manual annotations and cover a wide range of scenarios, including recordings made with different devices (such as studio equipment and smartphones) and under various environmental conditions. This diversity ensures that the model learns a broad spectrum of variations, improving its robustness. Thus, even with limited resources, the speech model’s performance can be effectively enhanced.

The English speech datasets used in this study are as follows:

LibriTTS²¹: A large-scale English speech dataset designed for TTS applications, derived from LibriSpeech, including public domain audiobook recordings from LibriVox, providing 585 hours of high-quality speech recordings.

VCTK²²: An English speech dataset designed for speech synthesis and recognition, containing 44 hours of high-quality speech from 110 speakers with various accents.

The Chinese speech datasets include:

AISHELL-1²³: An open-source Mandarin speech corpus provided by Beijing Shell Shell Technology Co., Ltd., containing 180 hours of high-quality speech from 400 speakers with various accents across China.

AISHELL-3²⁴: A high-quality multi-speaker Mandarin speech corpus designed for multi-speaker TTS systems, containing 85 hours of emotion-neutral recordings from 218 speakers.

Aidatatang²⁵: A free Mandarin speech corpus provided by Beijing Datatang Technology Co., Ltd., containing 200 hours of speech data recorded in quiet indoor environments using smartphones.

Primewords²⁶: A Mandarin speech corpus released by Shanghai Primewords Information Technology Co., Ltd., containing 100 hours of speech from 296 native speakers, all recorded using smartphones.

THCHS-30²⁷: An open-source Mandarin speech corpus released by Tsinghua University’s Speech and Language Technology Center, containing 34 hours of speech from 40 speakers.

ST Chinese²⁸: A Mandarin speech corpus released by Surfingtech, containing 110 hours of speech from 855 speakers, recorded in quiet indoor environments using smartphones.

The Taiwanese speech dataset used is:

Common Voice²⁹: An open-source speech corpus initiated by Mozilla, including 120 hours of Taiwanese speech recordings from over 2000 speakers.

The summary of the datasets, including language, duration, and number of speakers, is provided in Table 2.

Table 2.
Summary of datasets.

	English (ENG)	Mandarin (ZH)	Taiwanese (TW)
Duration (hours)	629	779	120
Number of Speakers	2000 $+$	3000 $+$	2000 $+$
Number of Sentences	371911	608902	81819
Average Length (seconds)	4.96	4.42	3.36

4.2. Data preprocessing

4.2.1. Speech data preprocessing

To ensure consistency in the input data for the model, all speech data were resampled to 24 kHz to meet the requirements of the original VALL-E model. These resampled audio files were then processed using the Encodec encoder to obtain corresponding discrete codes, which were stored for training. During training, the model directly reads these discrete codes, eliminating the need to reload the original speech data. This allows the training process to focus solely on the Transformer components of the VALL-E model, significantly improving training efficiency.

4.2.2. Text data preprocessing

To train VALL-E, paired speech and corresponding text data are required. The datasets used in this study mainly include labeled speech data. For speech data without text labels, a semi-supervised learning approach was adopted, utilizing OpenAI’s Whisper system¹⁶ to automatically generate text labels. These automatically generated labels, known as pseudo labels, provide a cost-effective way to obtain a large, relatively high-quality labeled dataset, supporting the training of the speech processing model. Phonemes are the smallest distinguishable sound units in a language, capable of differentiating word meanings. In this study, the text data were converted into phoneme sequences compatible with the VALL-E model. The phonemes were then encoded as numeric vectors, allowing them to be used directly for training and generating speech.
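The last step, turning phoneme sequences into numeric form, can be illustrated with a small sketch. The phoneme symbols below are hypothetical examples (an ARPAbet-like sequence and a pinyin-like sequence) and are not the phoneme inventory used in this study; the padding token and the fixed output length are likewise assumptions.

```python
PAD = "<pad>"


def build_phoneme_vocab(sequences):
    """Assign an integer ID to each distinct phoneme; 0 is reserved for padding."""
    vocab = {PAD: 0}
    for seq in sequences:
        for phoneme in seq:
            vocab.setdefault(phoneme, len(vocab))
    return vocab


def encode(seq, vocab, length=None):
    """Convert a phoneme sequence to IDs, optionally right-padded to `length`."""
    ids = [vocab[p] for p in seq]
    if length is not None:
        if len(ids) > length:
            raise ValueError("sequence is longer than the requested length")
        ids += [vocab[PAD]] * (length - len(ids))
    return ids


# Hypothetical phoneme sequences for illustration.
corpus = [
    ["HH", "AH", "L", "OW"],   # an English word in ARPAbet-like symbols
    ["n", "i3", "h", "ao3"],   # a Mandarin phrase in pinyin-like symbols
]
vocab = build_phoneme_vocab(corpus)
encoded = [encode(seq, vocab, length=6) for seq in corpus]
```

In a neural model, such integer IDs would typically index an embedding table to produce the vectors that the network consumes.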

4.3. Evaluation methodology

In this study, the performance of the VALL-E system was evaluated using the Top 10 accuracy metric, which differs from the traditional Mean Opinion Score (MOS) that relies on subjective human auditory evaluation. The Top 10 accuracy metric assesses the similarity between the discrete codes generated by VALL-E and the target speech’s Encodec discrete codes.

Top 10 accuracy measures the model’s prediction accuracy by checking whether the correct answer is among the top ten predictions.

$$\mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Samples}} \tag{4}$$
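As an illustration of Equation (4) under the Top 10 criterion, the sketch below counts a prediction as correct when the target code index is among the $k$ highest-scoring entries. The function name and the toy scores are assumptions for illustration; in the setting described above, each row would hold the model’s scores over the Encodec codebook entries at one position.

```python
def top_k_accuracy(scores, targets, k=10):
    """Fraction of positions whose target index is among the k top-scoring entries."""
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            correct += 1
    return correct / len(targets)


# Toy example with three positions and a three-entry codebook.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_targets = [1, 1, 0]
acc_top1 = top_k_accuracy(toy_scores, toy_targets, k=1)
acc_top2 = top_k_accuracy(toy_scores, toy_targets, k=2)
```

With $k = 1$ the metric reduces to ordinary accuracy; larger $k$ tolerates near misses, which suits a setting where several discrete codes may decode to perceptually similar audio.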

4.4. Experimental results

The evaluation was conducted using 20% of the data from various datasets, including LibriTTS, VCTK, AISHELL-1, AISHELL-3, Aidatatang, Primewords, THCHS-30, ST Chinese, and Common Voice, totaling approximately 300 hours. The validation data was carefully selected to ensure no overlap with the training speakers, maintaining the evaluation’s fairness and effectiveness. To facilitate comparison, the datasets were grouped by language into English (ENG), Simplified Chinese (ZH), and Traditional Chinese (TW), as shown in Table 3.

Table 3.
Dataset language comparison.

ENG	ZH	TW
LibriTTS	AISHELL-1	Common Voice
VCTK	AISHELL-3
	Aidatatang
	Primewords
	THCHS-30
	ST Chinese

4.5. Comparison of model accuracy

Four modified architectures were tested against the VALL-E baseline in this study. Given the similar performance of Linformer and Performer in the Long Range Arena benchmark, we also experimented with replacing the Transformer in VALL-E with Linformer to create Lin VALL-E. Additionally, P VALL-E$_{v2}$ represents a model trained using the layer replacement strategy discussed earlier. Table 4 presents the results of training on approximately 1200 hours of English, Simplified Chinese, and Traditional Chinese speech data. The performance was evaluated on a 300-hour test set across the three languages. Experimental results show that P VALL-E outperforms Lin VALL-E by approximately 2 to 3 percentage points in accuracy while using 5M fewer parameters. Consequently, we chose to replace the original Transformer architecture of VALL-E with Performer. Furthermore, the results indicate that with the replacement strategy, P VALL-E$_{v2}$ achieves about 7 percentage points higher accuracy than P VALL-E, further validating the effectiveness of this approach. The proposed model’s accuracy is close to that of the original VALL-E, with a decrease of only about 0.5 percentage points, while its parameter count falls by 87M (from 377M to 290M), approximately 23%, which substantially reduces hardware requirements. These results emphasize the success of our strategy in maintaining reasonable accuracy while pursuing high computational efficiency.

Table 4.
Comparison of model accuracy.

Model	Params (M)	ENG (%)	ZH (%)	TW (%)
VALL-E (baseline)	377	67.38	68.50	70.84
Lin VALL-E	295	57.15	59.86	61.19
Lin VALL-E$_{v2}$	295	63.33	65.10	66.51
P VALL-E	290	59.97	61.69	63.39
P VALL-E$_{v2}$	290	66.91	68.08	70.33

4.6. Training speed comparison

This section analyzes the training speed of VALL-E and P VALL-E models and explores the impact of different model architectures on training time. The experimental results in Table 5 compare the training time per epoch for the original VALL-E and P VALL-E models using Performer. The results show that the VALL-E model requires 10.2 hours per epoch, while the P VALL-E model requires 8.3 hours per epoch, indicating that P VALL-E has a significant speed advantage, being 1.23 times faster. This is mainly due to the linear computational complexity of the Performer mechanism, compared to the quadratic complexity of the traditional Transformer, which allows more efficient processing of large-scale datasets, reducing computational resources and time consumption. This result is significant for speech generation tasks requiring long-term, large-scale training. P VALL-E not only excels in generation speed but also demonstrates higher efficiency during training, shortening the overall development cycle.

Table 5.
Training speed comparison.

Model	Params (M)	Training time (Hours/Epoch)
VALL-E (baseline)	377	10.2
P VALL-E	290	8.3

4.7. Generation speed comparison

This section compares the generation speeds of the VALL-E and P VALL-E models and examines the impact of different target speech lengths on the generation speed of these models. The experimental results in Table 6 compare the generation speeds of the original VALL-E and P VALL-E models for target speech lengths of 5, 10, and 20 seconds. The results indicate that when the target speech length is 5 seconds, the generation speed of P VALL-E is slightly lower than VALL-E. However, when the target length is 10 seconds, the generation speed of P VALL-E is significantly higher, being 1.21 times faster. This demonstrates the linear computational complexity advantage of Performer when processing longer sequences, compared to the quadratic complexity of the traditional Transformer. As the target speech length increases to 20 seconds, P VALL-E maintains high performance, being 1.69 times faster than VALL-E. This indicates that P VALL-E is more efficient in long-sequence speech generation tasks, particularly when processing longer sequences, where the linear complexity advantage of the Performer mechanism becomes more apparent.

Table 6.
Generation speed comparison.

		Audio Length
Model	Params (M)	5 s	10 s	20 s
VALL-E (baseline)	377	3.5 s	9.1 s	22.2 s
P VALL-E	290	3.6 s	7.5 s	13.1 s
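The speed-up figures quoted for Table 6 follow directly from the measured generation times, as the short calculation below shows. The real-time factor (generation time divided by audio length) is an additional derived quantity included here for illustration; it is computed from the same table values and is not reported separately in this study.

```python
# Generation time in seconds for each target audio length, from Table 6.
generation_time_s = {
    "VALL-E": {5: 3.5, 10: 9.1, 20: 22.2},
    "P VALL-E": {5: 3.6, 10: 7.5, 20: 13.1},
}


def speedup(baseline, candidate):
    """Ratio of baseline time to candidate time for each audio length."""
    return {length: baseline[length] / candidate[length] for length in baseline}


def real_time_factor(times):
    """Generation time divided by audio length; below 1 is faster than real time."""
    return {length: t / length for length, t in times.items()}


ratios = speedup(generation_time_s["VALL-E"], generation_time_s["P VALL-E"])
rtf = {model: real_time_factor(times) for model, times in generation_time_s.items()}

for length in sorted(ratios):
    print(f"{length:>2} s audio: speed-up {ratios[length]:.2f}x, "
          f"RTF VALL-E {rtf['VALL-E'][length]:.2f}, "
          f"RTF P VALL-E {rtf['P VALL-E'][length]:.2f}")
```

The ratio is slightly below 1 at 5 seconds and grows with the target length, which is consistent with the advantage of linear-complexity attention appearing mainly on longer sequences.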

4.8. Impact of multilingual training

This section analyzes the impact of multilingual training on different languages by comparing the results of single-language training and multilingual training, as shown in Table 7. The results indicate that multilingual training outperforms single-language training in all three languages. This is because multilingual training leverages the contextual capabilities of the language model, transferring knowledge across different languages and allowing the model to perform better in each of them. For example, if the model has learned to pronounce words correctly, then even if it has encountered only limited types of sounds in one language, it can use the learned linguistic knowledge to improve the similarity of generated sounds. Multilingual training improves accuracy by 0.68 percentage points for English and 0.95 percentage points for Simplified Chinese. The effect is less pronounced because these two languages already have abundant data. However, for languages with less data, such as Traditional Chinese, the improvement is 2.18 percentage points.

Table 7.
Impact of multilingual training.

Dataset	ENG (%)	ZH (%)	TW (%)
ENG	66.23	–	–
ZH	–	67.13	–
TW	–	–	68.15
ZH $+$ ENG $+$ TW	66.91	68.08	70.33

4.9. Impact of multilingual training on low-resource languages

To explore the impact of multilingual training on low-resource languages, such as Taiwanese-accented speech, we used the 120 hours of recordings from the Common Voice dataset, randomly selecting 1 hour as the training set and validating on an additional 20 hours of recordings. The data in Table 8 show that when the model is trained only on this 1 hour of Taiwanese-accented data, it overfits: training accuracy reaches 75.74%, while accuracy on unseen test data is only 42.84%. However, combining this data with other languages that have large amounts of training data significantly improves prediction accuracy on the Taiwanese-accented test data, even though the structure of English is markedly different from that of Taiwanese-accented speech. This suggests that P VALL-E uses the language model’s strong contextual understanding ability to transfer multilingual knowledge effectively to languages with less data.

Table 8.
Impact of multilingual training on low-resource languages.

Dataset	Train (%)	Test (%)
TW	75.74	42.84
TW $+$ ENG	65.90	64.39
TW $+$ ZH	66.45	66.42
TW $+$ ENG $+$ ZH	66.80	67.01

Finally, our experiments explore the effect of training on a small amount of Taiwanese-accented data together with both Simplified Chinese and English. The results show that when the three languages are trained together, speech synthesis performance for a low-resource language such as Taiwanese-accented speech improves significantly. This multilingual training strategy leverages the correlations between languages and the generality of language models, and it gives the best test accuracy for Taiwanese-accented speech among the configurations in Table 8. The experimental results confirm that multilingual training not only enhances the model’s adaptability to diverse languages but also significantly benefits languages with less data (such as Taiwanese-accented speech).

4.10. Ablation studies

4.10.1. Impact of language ID

Table 9 shows the impact of adding Language ID to the P VALL-E system’s accuracy. The addition of Language ID improves accent handling and provides important linguistic context, enabling the model to recognize and distinguish the language of the input text and generate corresponding speech vectors. The experimental results show that adding Language ID improves overall system accuracy by approximately 0.5%, with English improving by 0.67%, Simplified Chinese by 0.43%, and Traditional Chinese by 0.5%.

Table 9.
Impact of language ID.

Lang ID	ENG (%)	ZH (%)	TW (%)
Without	66.24	67.65	69.83
With	66.91	68.08	70.33

4.10.2. Impact of transformer replacement order

This section explores the impact of different replacement orders when replacing Transformer layers with Performer layers on model accuracy. Table 10 shows the impact of replacement order on accuracy across three languages (ENG, ZH, TW) and the average accuracy across the three languages (Avg.). In the table, “Front” indicates starting the replacement from the first Transformer layer (closest to the input layer), while “End” indicates starting from the last Transformer layer (closest to the output layer). The results in Table 10 show that different replacement orders have varying effects on model accuracy when replacing Transformer layers. Based on the average accuracy (Avg.), the strategy of starting replacement from the last layer (“End”) has a smaller impact on accuracy. Starting replacement from the first layer (“Front”) leads to a significant drop in accuracy across all languages, as the Transformer layers close to the input layer are responsible for initial feature extraction and representation, and replacing these layers affects the model’s basic understanding of the input data. Starting replacement from the last layer (“End”) results in a smaller drop in accuracy, as the Transformer layers close to the output layer mainly integrate high-level features into the final output.

Table 10.
Impact of replacement order on accuracy.

Order	Transformer layers	Performer layers	ENG (%)	ZH (%)	TW (%)	Avg. (%)
Front	12	0	67.38	68.50	70.84	68.91
Front	11	1	66.15	67.63	68.44	67.41
Front	10	2	64.97	66.72	67.50	66.40
Front	9	3	64.04	65.87	66.99	65.63
Front	0	12	60.58	62.39	64.34	62.44
End	12	0	67.38	68.50	70.84	68.91
End	11	1	67.29	68.61	70.77	68.89
End	10	2	67.33	68.59	70.63	68.85
End	9	3	67.28	68.67	70.79	68.91
End	0	12	66.91	68.08	70.33	68.44
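
The two replacement orders compared in Table 10 can be pictured as a per-layer plan. The sketch below is illustrative only: `replacement_plan` and the string labels are hypothetical names introduced here, and the code reflects only the “Front” and “End” definitions given in this section, not the authors’ implementation.

```python
def replacement_plan(num_layers=12, num_performer=0, order="end"):
    """Return the block type of each layer, listed from input side to output side.

    order="front" swaps layers starting at the input side,
    order="end" swaps layers starting at the output side.
    """
    if not 0 <= num_performer <= num_layers:
        raise ValueError("num_performer must be between 0 and num_layers")
    if order not in ("front", "end"):
        raise ValueError("order must be 'front' or 'end'")
    kept = ["transformer"] * (num_layers - num_performer)
    swapped = ["performer"] * num_performer
    # Front: Performer blocks come first; End: Performer blocks come last.
    return swapped + kept if order == "front" else kept + swapped


# Row "End, 9 Transformer layers, 3 Performer layers" of Table 10.
plan = replacement_plan(num_layers=12, num_performer=3, order="end")
print(plan)
```

A plan like this could drive the construction of the layer stack, so that each row of Table 10 corresponds to one call with a different `num_performer` and `order`.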

5. Conclusion

In this study, the proposed architecture improved the VALL-E model by replacing its Transformer structure with Performer, reducing the parameter count and increasing processing speed. The enhanced model, trained extensively on English and Simplified Chinese and adapted for Taiwanese Mandarin, shows better synthesis quality and accent control.

The P VALL-E model presented in this paper has made significant progress in multilingual support and computational efficiency, but several limitations remain. Support for low-resource languages is still challenging, as performance in data-scarce languages may not match that in high-resource languages. The model’s inference speed and computational resource requirements also need further optimization, especially for applications on mobile devices and lower-end hardware. The current accent control mechanism provides only preliminary control, and further improvements are needed to handle complex dialects and subtle speech differences.

Future research could focus on improving performance for low-resource languages, optimizing computational resource demands, enhancing the precision and flexibility of accent control, and expanding the model’s applicability to multilingual and multicultural contexts. These improvements would increase the model’s practicality and its potential for widespread application. Future work will also aim to improve model accuracy through novel architectural or training advancements. Accent control for low-resource languages such as Taiwanese Mandarin remains a challenge and will require more diverse data and refined techniques. Finally, exploring non-autoregressive models could further accelerate speech generation while maintaining quality.

Footnotes

Acknowledgements

This research was financially supported by the National Science and Technology Council of Taiwan (under grant No. 113-2221-E-992-116-).

ORCID iD

Shih-Hsiung Lee

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Krizhevsky, Sutskever, Hinton. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 2012; 25.
2. Hochreiter, Schmidhuber. Long short-term memory. Neural Comput 1997; 9: 1735–1780.
3. Vaswani, Shazeer, Parmar, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30.
4. Wang, Chen, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111 2023.
5. Oord Avd, Dieleman, Zen, et al. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 2016.
6. Wang, Skerry-Ryan, Stanton, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135 2017.
7. Ren, Ruan, Tan, et al. Fastspeech: Fast, robust and controllable text to speech. Adv Neural Inf Process Syst 2019; 32.
8. Tan, Chen, Liu, et al. Naturalspeech: End-to-end text-to-speech synthesis with human-level quality. IEEE Trans Pattern Anal Mach Intell 2024; 46: 4234–4245.
9. Zhou, Zhang, Zhou, et al. Accented text-to-speech synthesis with limited data. IEEE/ACM Trans Audio Speech Lang Process 2024; 32: 1699–1711.
10. Guo, Chen, et al. Emodiff: Intensity controllable emotional text-to-speech with soft-label guidance. In: ICASSP 2023-2023 IEEE International conference on acoustics, speech and signal processing (ICASSP), pp.1–5. IEEE.
11. Dosovitskiy, Beyer, Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 2020.
12. Gulati, Qin, Chiu, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100 2020.
13. Choromanski, Likhosherstov, Dohan, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794 2022.
14. Devlin, Chang, Lee, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 2018.
15. Achiam, Adler, Agarwal, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 2023.
16. Radford, Kim, et al. Robust speech recognition via large-scale weak supervision. In: International conference on machine learning, pp.28492–28518. PMLR.
17. Rubenstein, Asawaroengchai, Nguyen, et al. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925 2023.
18. Tay, Dehghani, Abnar, et al. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006 2020.
19. Zaheer, Guruganesh, Dubey, et al. Big bird: Transformers for longer sequences. Adv Neural Inf Process Syst 2020; 33: 17283–17297.
20. Casanova, Weber, Shulby, et al. Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone. In: International conference on machine learning, pp.2709–2720. PMLR.
21. Zen, Dang, Clark, et al. A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882 2019.
22. Veaux, Yamagishi, MacDonald, et al. Superseded-cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit, 2016.
23. et al. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In: 2017 20th Conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA), pp.1–5. IEEE.
24. Shi, et al. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. arXiv preprint arXiv:2010.11567 2020.
25. Beijing DataTang Technology Co, Ltd. Aidatatang, a free chinese mandarin speech corpus. Online, n.d. http://www.datatang.com.
26. Primewords Information Technology Co L. Primewords chinese corpus set 1, 2018. https://www.primewords.cn.
27. Wang, Zhang. Thchs-30: A free chinese speech corpus, 2015. http://arxiv.org/abs/1512.01882.
28. SpeechOcean. St-cmds-20170001_1, free st chinese mandarin corpus. Online, 2017. http://www.speechocean.com.
29. Ardila, Branson, Davis, et al. Common voice: A massively-multilingual speech corpus. In: Proceedings of the 12th conference on language resources and evaluation (LREC 2020), pp.4211–4215.


