Abstract
The recently introduced Convolution-augmented Transformer (Conformer) has attained state-of-the-art (SOTA) results in Automatic Speech Recognition (ASR). In this paper, a series of systematic investigations shows that the Conformer’s design decisions may not be the most efficient choices under a limited computational budget. After re-evaluating the design choices of the Conformer architecture, we propose Sampleformer, which reduces the complexity of the Conformer architecture and delivers more robust performance. We introduce downsampling into the Conformer encoder and, to exploit the information in the speech features, incorporate an additional downsampling module that improves the efficiency and accuracy of the model. Additionally, we propose a novel and adaptable attention mechanism called multi-group attention, effectively reducing the attention complexity from
Introduction
End-to-end neural network models have achieved substantial advances in numerous automatic speech recognition tasks and have become the established standard for SOTA approaches. Convolutional neural networks (CNN) [1, 2, 3, 4] and Transformers [5, 6, 7, 8, 9] are popular backbone architectures for ASR models, but both have limitations. Generally, Transformers involve prohibitive computing and memory overhead, whereas CNN models have limited ability to capture global context. To address these shortcomings, existing work has proposed techniques such as pruning [10], quantization [11], optimized architecture design [12], and knowledge distillation [13]. These methods can reduce the complexity of the model, whether used individually or together. In addition, the availability of extensive, manually annotated datasets has enabled the training of robust deep neural networks for ASR. However, a significant drawback of employing a powerful deep neural network in practical applications is the associated resource cost: achieving high performance demands a substantial training budget and a considerable number of GPUs [1, 2].
Recently, Conformer [14] proposed a novel convolution-augmented Transformer architecture that combines CNN and self-attention. Because it captures both global and local dependencies within audio signals, the Conformer architecture has achieved SOTA results on various end-to-end ASR tasks [15, 16]. Despite its significance as a foundational architecture for ASR, the Conformer still has limitations that warrant further improvement. First, its efficiency on long sequences is limited by the quadratic complexity of the attention mechanism [17]. Second, the Conformer architecture integrates the Macaron structure [14, 18], several activation functions, and multiple normalization schemes, which makes it more complex than the conventional Transformer structure employed in other domains [8, 9, 19]. Related research has addressed the difficulty of efficiently deploying the model and running inference on dedicated hardware platforms [20, 21]. [22] demonstrates the effective incorporation of progressive downsampling into convolution-augmented Transformer networks. [23] simplifies the Conformer into a more straightforward block structure reminiscent of standard Transformer blocks [8, 9] and employs a temporal U-Net structure to reduce computational complexity. Adopting designs such as downsampling to reduce computation is an effective approach. More importantly, this raises the question of whether the Conformer’s design choices are necessary and optimal for achieving good performance on ASR tasks.
In this paper, we propose a more efficient hybrid convolution-attention architecture named Sampleformer. Temporal redundancy has been found to increase with network depth. To address this issue, we introduce a novel downsampling layer that reduces redundancy between neighbouring features. Additionally, an SE module is added to the convolutional downsampling module to selectively enhance informative features. We also propose a novel and flexible attention mechanism called multi-group attention to resolve the computational asymmetry caused by the introduction of downsampling. Our primary objective is to design an efficient architecture that reduces the complexity of the Conformer and achieves a lower CER within a specific computational budget. In particular, our proposed Sampleformer model makes the following contributions:
We find that, in Conformer training, temporally close speech frames are mapped to very similar internal representations, especially deeper in the network. To address this, we propose a novel structure inspired by Efficient Conformer [22]. In this structure, two downsampling layers halve the sampling rate at the network’s midpoint, and the Conformer blocks between the two downsampling layers adopt a residual connection.
We redesign the downsampling architecture based on our observation that convolutional downsampling cannot effectively model local and global dependencies and also degrades performance on ASR tasks. We design a novel downsampling layer that combines squeeze-and-excitation networks (SE-Net) [24] with the hybrid attention-convolution architecture [14]. This design aims to improve accuracy, although it requires some additional computational resources.
We closely examine the architecture of the network. The complexity of attention grows quadratically with the sequence length, and we find that adopting downsampling methods introduces computational asymmetry into attention-based architectures and leads to a bottleneck in processing time. One solution is to create an efficient self-attention mechanism [25, 26, 27], whereas another alternative to conventional attention is local attention [28, 29]. We draw inspiration from these methods and introduce a novel and adaptable attention mechanism, which we term “multi-group attention”. Along the feature dimension, multi-group attention repeatedly groups adjacent time elements within the sequence; the grouping results in a reduction of attention complexity from
We substantiate the final model architecture with an ablation study that gives a deeper understanding of the improvement contributed by each method.
(Left) The Conformer architecture and (Right) the Sampleformer architecture which comprises several Conformer blocks using multi-group attention, the hybrid convolution-attention-style block structure. The encoded sequence undergoes multiple downsampling operations before being projected into wider feature dimensions.
Typically, modern end-to-end ASR models consist of an encoder and a decoder. The encoder’s architecture largely determines the representational capacity of an ASR model and its ability to extract acoustic features, which is crucial for overall performance. CNNs are widely favored as the foundational model architecture for capturing the time-frequency characteristics of speech. In the hybrid CTC-Attention neural architecture [30], the VGG network, consisting of a 4-layer CNN and a 2-layer MaxPool, remains a fundamental component of the encoder. Deep CNN models were first explored in end-to-end neural systems [4, 31], and further improved by introducing SE-Net in ContextNet [3] and depth-wise separable convolution (DSC) [32, 33] in QuartzNet [1]. However, CNNs often fail to capture the global context.
A novel model architecture named Conformer [14] has achieved good performance in ASR tasks by effectively modeling both global and local dependencies through the convolution-augmented Transformer. However, the attention layers require considerably more computation for longer sequences, since their complexity is quadratic in the sequence length. Several approaches have been proposed to decrease the computational demands of multi-head attention (MHA) in ASR [17, 34, 35, 36]. Efficient Conformer [22] introduces a progressive downsampling scheme to reduce the training cost of the Conformer model and additionally incorporates grouped attention to reduce inference time. Squeezeformer [23] introduces a scheme similar to progressive downsampling but also adds an upsampling mechanism. This design concept takes inspiration from both the U-Time [37] architecture used in sleep signal analysis and the U-Net [38] architecture in computer vision. Nextformer [39] incorporates an extra downsampling module aimed at extracting fine-grained time-frequency speech features; the module also replaces the subsampling component within a Conformer-based attention-based encoder-decoder (AED). Although downsampling methods can reduce the training and inference costs of the model, temporal downsampling may lead to training instability and divergence, which can weaken the model’s representational capabilities. Our work incorporates a similar progressive downsampling scheme and redesigns the downsampling layers, which allows us to achieve better performance and faster decoding. Our primary objective is to design an efficient architecture that reduces the complexity of the Conformer. Moreover, we propose a novel and flexible attention mechanism called multi-group attention to resolve the computational asymmetry caused by the introduction of downsampling.
Methods
The speech community has commonly adopted the Conformer architecture as a foundational framework for a variety of ASR tasks. At the macro level, the Conformer is constructed by stacking multiple blocks with a Macaron structure [14, 18], each block comprising four modules, as shown in Fig. 1 (left). In this paper, we systematically reconsider the design choices within the Conformer by incorporating downsampling and designing novel downsampling layers. These enhancements enable us to attain better performance and faster decoding. Moreover, to improve the efficiency of the early self-attention layers, multi-group attention is employed to reduce the model’s overall complexity without sacrificing accuracy. Conformer-CTC-S is selected as the baseline model for our case study, and we compare the architectures by their CER on the Aishell-1 dataset [40].
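Since CER is the metric used throughout the comparisons, a minimal sketch of how it is conventionally computed may help: the character-level edit distance between hypothesis and reference, summed over the corpus and divided by the total number of reference characters. The function names `edit_distance` and `cer` are illustrative and are not taken from the paper or from any particular toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters here)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER in percent: total edits / total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return 100.0 * edits / chars
```

For example, a single substituted character in a four-character reference gives `cer(["abcd"], ["abed"]) == 25.0`.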
Macro-architecture design
Although the hybrid attention-convolution structure captures both local and global dependencies, the attention operation has quadratic complexity in the input sequence length. Within the Conformer model, convolutional subsampling blocks reduce the input sampling rate from 10 ms to 40 ms, and all subsequent attention and convolution operations keep a constant temporal scale. Inspired by recent ASR work [2, 3, 22] that reduces CNN computation for faster training and inference by applying progressive subsampling, we experimentally introduce downsampling into the Conformer encoder; the proposed Sampleformer encoder is shown in Fig. 1 (right). In particular, [23] reveals that temporal redundancy increases with network depth. We conduct this analysis with the Conformer model on the feature embeddings of each speech frame: fifty audio signals are randomly sampled from the Aishell-1 dev set, and the activations of the Conformer blocks are recorded. We calculate the average cosine similarity between adjacent embedding vectors, with the results depicted in Fig. 5. We observe that at the topmost network layer (Layer 16), the cosine similarity between embedding vectors for adjacent speech frames at a distance of 1 or 2 exceeds 90%, and even the similarity across four speech frames is more than 65%. This indicates that temporal redundancy grows with network depth, especially for architectures built by stacking many Conformer blocks. We hypothesize that eliminating the redundancy in the feature embedding vectors can reduce unnecessary computational overhead without compromising accuracy.
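The redundancy measurement described above can be sketched as follows. This is a minimal illustration, assuming the activations of one block for one utterance are available as a list of per-frame embedding vectors; the function names and the `distance` parameter are illustrative, and non-zero vectors are assumed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_adjacent_similarity(frames, distance=1):
    """Average cosine similarity between frames[t] and frames[t + distance]."""
    pairs = [(frames[t], frames[t + distance])
             for t in range(len(frames) - distance)]
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Averaging this quantity over the sampled utterances, separately for each block and each distance, would yield one curve per distance of the kind the text describes.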
In our macro-architecture improvement step, audio features are first downsampled using convolutional downsampling and then fed into the encoder, which consists of multiple Conformer blocks with the same feature dimension. We keep the sampling rate at 20 ms up to the 5th block, then reduce the sampling rate of each input sequence to 40 ms by replacing the original convolutional module with a convolutional downsampling module, and reduce it again to 80 ms at the 10th block. Similar to spatial downsampling in computer vision models, this temporal downsampling technique is commonly used to save computational resources and to generate hierarchical features [41, 42, 43, 44].
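To make the effect of this schedule concrete, the sketch below counts how many frames each block processes for an utterance of a given duration. It assumes that blocks 1 to 5 run at 20 ms, blocks 6 to 10 at 40 ms, and later blocks at 80 ms; the exact block at which each halving takes effect and the total number of blocks are assumptions made for illustration, as are the function names.

```python
import math


def frame_periods(num_blocks, base_ms=20, downsample_after=(5, 10)):
    """Frame period (ms) seen by each block; the period doubles after the given blocks."""
    periods, period = [], base_ms
    for block in range(1, num_blocks + 1):
        periods.append(period)
        if block in downsample_after:
            period *= 2
    return periods


def frames_per_block(duration_ms, periods):
    """Number of frames each block processes for an utterance of duration_ms."""
    return [math.ceil(duration_ms / p) for p in periods]
```

For a 10-second utterance and a hypothetical 12-block encoder, `frames_per_block(10000, frame_periods(12))` gives 500 frames for the first five blocks, 250 for the next five, and 125 for the last two, so the later attention layers operate on much shorter sequences.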
However, temporal downsampling on its own results in unstable and divergent training behavior. One possible reason is that after the rate is downsampled to 80 ms, the decoder lacks sufficient resolution to decode the full sequence. Taking inspiration from successful computer vision architectures such as ResNet [42], U-Net [38], and U-Time [37], we modify the model’s architecture as illustrated in Fig. 1 (right). This allows the model to train stably and gives better results, as our experiments show.
Convolution downsampling module. The residual module contains a 1D scSE-Net and the pointwise projection layer, where the pointwise projection projects the input and output to the same dimension. Sequence downsampling is achieved through the application of a strided depthwise convolution [22].
Up to this point, we have established Sampleformer’s macro-structure. In this subsection, our attention shifts to optimizing the micro-structure of each module to enhance efficiency and performance.
Convolutional downsampling module
We experimentally introduce downsampling into the Conformer encoder by replacing the original convolutional module with the convolutional downsampling module illustrated in Fig. 2. The module incorporates a “squeeze-and-excite” (SE) unit [24, 45], whose structure is depicted in Fig. 3. scSE blocks recalibrate the input both channel-wise and spatially at the same time, learning to use global information to selectively enhance informative features while suppressing less valuable ones. Mathematically, for an input
where,
In Eq. (1), where
SE networks can be created by simply stacking a set of SE blocks or by substituting them for the original blocks at any depth in the architecture. The benefits of feature recalibration from SE blocks can accumulate across the entire network. We add an SE block to the convolutional downsampling module, as shown in Fig. 2. The deeper Conformer blocks then selectively learn informative features, and the redundancy in the feature embedding can be further reduced without loss of accuracy. As shown in Fig. 5, adding the scSE block to Sampleformer reduces the redundancy in the feature embedding, and our experiments show that it also improves accuracy.
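As an illustration of the recalibration idea, the following is a minimal pure-Python sketch of a 1D scSE block acting on a sequence of frames. It is not the formulation used in the paper, whose equations are not reproduced here: the weight shapes, the ReLU bottleneck, and the choice to combine the channel and spatial branches by element-wise addition are all assumptions, and the names `scse_1d`, `w_reduce`, `w_expand`, and `w_spatial` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def scse_1d(x, w_reduce, w_expand, w_spatial):
    """x: list of T frames, each a list of C channel values."""
    num_frames, num_channels = len(x), len(x[0])
    # Channel branch (cSE): squeeze over time, excite with a two-layer bottleneck.
    squeezed = [sum(frame[c] for frame in x) / num_frames
                for c in range(num_channels)]
    hidden = [max(0.0, h) for h in matvec(w_reduce, squeezed)]     # ReLU
    channel_gate = [sigmoid(g) for g in matvec(w_expand, hidden)]  # one gate per channel
    # Spatial branch (sSE): project each frame to a scalar, one gate per frame.
    frame_gate = [sigmoid(sum(w * v for w, v in zip(w_spatial, frame)))
                  for frame in x]
    # Combine both recalibrated copies by element-wise addition (assumed choice).
    return [[v * (channel_gate[c] + frame_gate[t]) for c, v in enumerate(frame)]
            for t, frame in enumerate(x)]
```

With all weights set to zero, every gate equals 0.5 and the block returns its input unchanged, which is a convenient sanity check; trained weights would instead scale informative channels and frames up and others down.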

In general, downsampling is achieved through strided convolution or pooling operations. In Efficient Conformer [22], multi-head self-attention (MHSA) is employed as a global downsampling operation by substituting strided MHSA layers for the standard MHSA layers within each downsampling block. By employing relative sinusoidal position encoding, the self-attention module generalizes better across different input lengths. For relative multi-head self-attention, the attention mechanism is computed separately for multiple heads
Equation (3) is the encoder output after the MHSA Module. where
Multi-group attention. Multi-group attention reduces attention complexity from 
However, applying downsampling to an attention-based architecture has the primary drawback of introducing computational asymmetry into the network. By adjusting the feature dimensions of the blocks, the downsampling stages at different network depths can be given similar per-hidden-layer complexity. However, the time bottleneck still exists. We propose to address this issue with a novel and adaptable attention mechanism, referred to as multi-group attention and depicted in Fig. 4. Multi-group attention reduces attention complexity from
Equation (4) is the encoder output after the multi-group attention. And concatenated multi-group attention output
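The grouping step can be illustrated with a small sketch. It assumes the grouped-attention idea described for Efficient Conformer [22], in which each run of adjacent frames is folded into one longer vector along the feature dimension before attention is computed; how multi-group attention differs in detail is not reproduced here. The function names and the multiply-add count, which covers only the query-key score matrix, are illustrative assumptions.

```python
def group_frames(x, group_size):
    """Fold each run of `group_size` adjacent frames into one longer vector."""
    assert len(x) % group_size == 0, "pad the sequence to a multiple of group_size"
    return [sum(x[t:t + group_size], [])
            for t in range(0, len(x), group_size)]


def score_multiply_adds(num_frames, dim, group_size=1):
    """Multiply-adds needed for the query-key score matrix after grouping."""
    grouped_len = num_frames // group_size
    grouped_dim = dim * group_size
    return grouped_len * grouped_len * grouped_dim
```

Under this assumption, grouping 400 frames of a hypothetical dimension 144 in pairs yields 200 vectors of dimension 288, and the score computation needs half as many multiply-adds as the ungrouped case; larger groups shrink it proportionally, at the cost of coarser attention over time.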
Data and training setup
Data
In this paper, we conduct model training and evaluation on the publicly available AISHELL-1 corpus [40] for Mandarin speech recognition. AISHELL-1 consists of approximately 178 hours of labeled speech recordings from 400 speakers captured using high-fidelity microphones. The speakers come from different accent regions of China, and the speech content covers smart homes, industrial production, drones, and other areas. For data augmentation, we apply SpecAugment [46, 47] with specific parameters to prevent overfitting during training, including frequency mask size parameters (
Training setup
Following the procedure detailed in Section 3, we construct Sampleformer variants of varying sizes. Specifically, we apply the corresponding macro and micro-architecture adjustments to create Sampleformer-S and M from Conformer-S and M, maintaining the model size. We conduct experiments using CTC models with 14M and 33M parameters, and Table 1 provides an overview of our model’s hyper-parameters.
Detailed architecture configurations for Sampleformer and Conformer-CTC (baseline)
We train CTC models on the AISHELL-1 dataset for 400 epochs, utilizing a batch size of 32. This training is executed on a single Nvidia RTX 3090 GPU. We utilize the Adam Optimizer [49] and implement a transformer learning rate schedule [9], featuring
The baseline dataloader sorts and packs utterances by frame length and then selects the packed batches at random to pass to the model, which improves training efficiency and maximizes GPU memory utilization. In many previous works [5, 51, 52, 53], the decoders are enhanced with an external language model (LM) that re-scores the outputs to improve the final WER, so the use of an external LM can obscure differences between architectures during evaluation. We therefore compare the representational capabilities of the model architectures without external LMs, with the performance metrics of all ASR models derived from our own replication of their best results.
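The sort-and-pack strategy can be sketched as follows. This is a simplified illustration rather than the dataloader used in the experiments: it groups utterance indices of similar frame length into fixed-size batches and then shuffles the order of the batches. The function name and the `seed` parameter are hypothetical.

```python
import random


def length_bucketed_batches(lengths, batch_size, seed=0):
    """Return batches of utterance indices with similar frame lengths."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size]
               for i in range(0, len(order), batch_size)]
    # Shuffle the batch order only; lengths inside each batch stay similar.
    random.Random(seed).shuffle(batches)
    return batches
```

Because utterances within a batch have similar lengths, little of each padded batch is wasted on padding, which is where the efficiency gain comes from. A call such as `length_bucketed_batches(lengths, 32)` would correspond to the batch size stated above.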
Table 2 provides a CER comparison between Sampleformer, Conformer-CTC, and other SOTA CTC-based ASR models. The results suggest that Sampleformer improves over previously published systems. Our Sampleformer-S model achieves a competitive CER of 5.46/5.71 without an LM using only 13.3M parameters. Furthermore, we achieve results similar to the version without multi-group attention while realizing a 27% reduction in training time by employing multi-group attention with parameters
Comparison of AISHELL-1 CER (%) with recently published CTC models. At 13.3M parameters, Sampleformer-S outperforms the baseline model Conformer-S by 3.0%/2.6% on the dev/test sets of AISHELL-1. At 33.4M parameters, the model still achieves good results. The performance metrics for all models come from our own reproduction of the best achievable results.
Cosine similarity of embedding vectors between adjacent speech frames in the Conformer blocks. Temporal redundancy grows with network depth. In the Sampleformer structure, downsampling is applied to the temporal dimension after the 5th block and again after the 10th block.
Ablation studies for the design choices made in Sampleformer, including macro-architecture, Convolutional downsampling module, pointwise projection, and multi-group attention
Our Sampleformer model achieves satisfactory results: the CTC model converges in fewer epochs and reaches a lower greedy CER. However, it still falls short of the original work, which benefited from training with larger batch sizes and more extensive resources.
In this section, we will provide supplementary ablation studies analyzing the design decisions related to individual architectural components. We will utilize Sampleformer-S as the baseline model to better understand the improvements resulting from different approaches to composing the Sampleformer.
We first investigate the effects of employing downsampling. The downsampled architecture attains improved accuracy in CTC experiments, along with fewer multiply-add operations and shorter training time, as shown in the second major row of Table 3. Moreover, without skip connections, the performance of the model deteriorates. As shown in the third major row of Table 3, using the scSE block in the convolution module improves accuracy. We experimented with two downsampling methods in the pointwise projection layer, convolutional downsampling and pooling downsampling, and found that convolutional downsampling performs better. The results also show that the SE block allows MHSA to be applied more effectively to global downsampling operations.
Ablation study on group size parameters
CTC models inference time for processing long sequences. The Sampleformer model shows an increase in inference speed with longer input sequences compared to the CTC Conformer baseline.
We incrementally expand the attention group size to analyze how multi-group attention affects model complexity and recognition performance. The results in Table 4 show that it reduces model complexity without compromising recognition performance. By introducing multi-group attention in the early layers, the training speed is improved by 27% and the average inference time is accelerated by 30%, as shown in Fig. 6. By using different attention group sizes at different depths of the network, the training-speed improvement can be raised further to 41%.
In this work, we have systematically examined the Conformer architecture and introduced a series of techniques that reduce its complexity and obtain better performance within a given computational budget. We introduce a new downsampling layer to reduce the temporal redundancy between adjacent features and to obtain better recognition performance with reduced computation. Moreover, we propose a novel and flexible attention mechanism called multi-group attention to resolve the computational asymmetry caused by the introduction of downsampling. Finally, we establish the efficacy of our approach through an ablation analysis conducted on the Aishell-1 dataset. Trained on a limited computational budget of a single GPU, our 13.3M-parameter Sampleformer model brings a 3.0%/2.6% improvement over the baseline model on dev/test, while achieving 30% faster inference and 27% shorter training time compared to our CTC Conformer baseline.
In the future, we intend to implement additional complementary complexity reduction methods, including distillation, quantization, and weight pruning to further decrease computational requirements.
Acknowledgments
This research was funded by the Science and Technology Key Projects of Guangxi Province, grant number 2020AA21077007; the Innovation Project of Guangxi Graduate Education, grant number YCSW2023019; the Guangxi New Engineering Research and Practice Project, grant number XGK2022003. The authors would like to express appreciation to Zhaohui Bu from the School of Foreign Languages of Guangxi University for her valuable comments and suggestions.
