Abstract
In recent years, research on automatic music transcription has made significant progress, as deep learning techniques have demonstrated strong performance on complex data. Although existing work is promising, it relies on domain-specific knowledge to design model architectures and training schemes for each task. In addition, the noise introduced while collecting data for automatic music transcription cannot be ignored, and it degrades the results of existing methods. To address these issues, we propose an end-to-end framework based on the Transformer. Using an encoder-decoder structure, we directly convert the spectrogram of recorded piano audio into MIDI output. Further, to reduce the impact of environmental noise on transcription quality, we design a training mechanism that mixes in white noise to improve the robustness of the proposed model. Our experiments on classic piano transcription datasets show that the proposed method can greatly improve the quality of automatic music transcription.
Introduction
Automatic Music Transcription (AMT) is the process of converting an input digital music signal into a sequence of musical symbols, and it is regarded as one of the key technologies of digital music signal processing. This technology has broad applications, ranging from music education to the music industry, algorithmic composition, and beyond. AMT includes multiple sub-tasks such as multi-pitch detection, note onset and offset detection, and melody extraction [1]. In this paper, we define AMT as the problem of transcribing a given piano recording into notes, each described by its onset and duration.
A classic paradigm in piano transcription research is to use carefully designed model structures that build in domain-specific knowledge. While existing work has demonstrated that domain-specific models can perform satisfactorily [2–4], their performance holds only for the task they were designed for and does not generalize to other music information retrieval problems. Besides, since automatic piano transcription is subject to environmental noise, removing that noise to obtain better representations and transcription results remains a challenge in the field. Meanwhile, the Transformer model [5] has proven reliable in natural language processing [6], computer vision [7], bioinformatics [8], and other fields. The Transformer’s self-attention mechanism enables the model to adaptively learn hidden correlations in the input sequence, thereby extracting important patterns in a way similar to humans and improving the interpretability of the model. At the same time, because the Transformer operates on sequences, it can be applied to different fields simply by changing the form of its input and output. In this paper, we design a general framework for automatic piano transcription based on the Transformer. Without requiring domain knowledge, we use a simple encoding strategy to encode the raw spectrogram and a decoder to decode it into MIDI-like events that represent notes with their onsets and durations.
Further, our proposed model is able to adapt to different AMT domains. Overall, our work demonstrates the ability of the Transformer model to achieve adaptive audio transcription that is not based on domain knowledge. Meanwhile, we discuss the potential of applying this method to other music information retrieval tasks.
Related work
Piano automatic transcription
Research on automatic music transcription has a long history and falls mainly into methods based on traditional spectrum analysis and methods based on deep learning. Work based on spectrum analysis includes the following. Su et al. [9] used signal processing to analyze the frequency and period of music signals, and performed fundamental frequency detection based on the consistent relationship between harmonic series. Peeling et al. [10] proposed a Bayesian probabilistic generative model of the time-frequency coefficients of audio signals, and performed automatic music transcription by modeling and analyzing spectral features. Duan et al. [11] used probabilistic latent component analysis to estimate the fundamental frequency by analyzing the probabilistic components of the signal. Smaragdis et al. [12] used non-negative matrix factorization, combining a linear basis transformation with the factorization to estimate spectral lines and obtain the timing of each note. The sparse matrix factorization method of Rizzi et al. [13] adds a sparsity constraint to the factorization to improve transcription quality. These methods involve relatively complicated processing, and their transcription quality leaves considerable room for improvement.
Deep learning-based algorithms have made significant progress in automatic piano transcription. Boulanger-Lewandowski et al. [14] trained an RNN-based transcription model to output a binary piano roll. Böck and Schedl [15] trained a comparable RNN-based model for piano only. Hawthorne et al. [4] improved transcription accuracy by using separate convolutional stacks to detect note onsets, note presence, and note velocity. More recent developments in piano transcription have introduced additional domain-specific deep neural network components and modified the decoding process. In most cases, this additional complexity is added to improve transcription quality. Kong et al. [16] employed a network design similar to that of Hawthorne et al. [4], but additionally regressed precise onset and offset times, resulting in improved transcription accuracy. Kim & Bello [17] use an adversarial loss on the transcription output to encourage the model to produce more plausible piano rolls. In contrast, our sequence-based technique models this dependency between outputs explicitly, through a decoder trained together with an encoder that learns relevant audio features. Rather than employing independent stacks for onsets, frames, and offsets, Quan et al. [18] use a language model to model the note states of each pitch. The decoding process, however, is rather complicated, especially when handling combinations of several pitches. Elowsson [19] implements a hierarchical model that learns fundamental frequency contours from spectrograms and uses these contours to predict note onsets and offsets, a highly domain-specific treatment. Although an intermediate representation such as that of Engel et al. [20] is appropriate for many applications, in this paper we treat polyphonic transcription from audio to discrete notes as the problem. Ou et al. [21] explored the potential of Transformers for automatic piano transcription and reported promising results. Xiao et al. [22] presented a novel approach based on Graph Convolutional Networks for polyphonic piano transcription.
Transformers
Since they were first introduced [5], Transformers have been used to address sequence-related problems across different domains, replacing previously employed task-specific structures. They have been applied to sequence generation in natural language processing [6], object detection [23], pose reconstruction [24], and audio-specific tasks such as speech synthesis [25], speech recognition [26], and other audio tasks [27]. The Transformer abandons traditional convolutional and recurrent neural networks; its network structure is built entirely around attention. To be precise, the Transformer consists only of self-attention and feed-forward neural networks. To capture the order of sequence data, the Transformer uses a positional encoding that represents the timing information in the input. Transformers can process input data in parallel, which effectively handles long-term dependencies and significantly reduces training and inference time. In the self-attention mechanism, a set of keys and values records the learned information, and the attention output is obtained by querying them. In many instances, the Transformer is first pre-trained on a large amount of unlabeled data with a self-supervised objective. While such a pre-training stage may be beneficial for music transcription, we investigate a simpler scenario in which the Transformer architecture is trained from scratch on labeled transcriptions in a traditional supervised manner. Toyama et al. [28] proposed a hierarchical frequency-time Transformer for automatic piano transcription and achieved state-of-the-art performance on benchmark datasets.
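The query, key, and value mechanism described above can be illustrated with a minimal sketch of scaled dot-product attention. This is a single-head, pure-Python version written for readability, not the implementation used in our experiments; the function names and the optional `causal` flag (which hides later positions, as Transformer decoders commonly do) are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors.

    Each query is compared with every key; the resulting weights
    mix the value vectors into one output vector per query.
    """
    dim = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        if causal:
            # Hide positions after i so no future information is used.
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

frames = [[1.0, 0.0], [0.0, 1.0]]
print(attention(frames, frames, frames, causal=True)[0])  # [1.0, 0.0]
```

With the causal flag set, the first position can attend only to itself, so its output equals its own value vector.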
Methodology
Overall structure
In our method, both the encoder and the decoder are part of a generic Transformer structure. Each input position holds one spectral frame, and each output position holds one MIDI-like vocabulary event. The overall structure of the model and our input and output settings are shown in Fig. 1. Through a series of self-attention layers, the Transformer adaptively extracts the temporal dependencies in the input sequence of spectral frames. The embedding layer transforms the input into a sequence of embeddings of the same length as the original data, so that the output of the embedding layer captures the implicit patterns in the spectral frames. The decoder layers then apply causally masked self-attention to the decoder outputs, as well as cross-attention to the outputs of the entire encoder stack. The causal mask ensures that future sequence information is not leaked into the current time step. Importantly, this allows the length of the symbolic output to vary, depending only on the number of tokens needed to describe the input audio.

Fig. 1. The structure of our proposed generic Transformer.
As input, we employ spectrograms. We end the input sequence with a trainable EOS embedding to match the Transformer’s settings. At each step, the model output is a softmax distribution over a vocabulary of discrete events, described below. This vocabulary is largely inspired by the messages defined in the MIDI standard. Using events rather than a piano roll matrix as the output representation has the benefit of being sparser, as output is only needed when an event happens, rather than requiring an annotation for every frame. The vocabulary contains the following token types. Note: a note-on or note-off event for one of the 128 MIDI pitches. We include the entire MIDI pitch range for versatility, but only the 88 pitches corresponding to the piano keys are used in these experiments. Velocity: a change in velocity that applies to all subsequent Note events. There are 128 velocity values, including 0, which causes subsequent Note events to be interpreted as note-off events. Time: the absolute time within the segment, quantized into 10 ms bins. This time applies to all subsequent Note events. We define the vocabulary for times up to 80 seconds for flexibility, but in practice we use only the first few hundred events of this type because time resets for each segment. EOS: the end of the sequence.
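To make the event vocabulary concrete, the following sketch converts a short list of notes into Time, Velocity, Note, and EOS events. It is an illustration under stated assumptions rather than our actual preprocessing code: the tuple representation of events, the function name `notes_to_events`, and the choice to emit note-offs before note-ons at the same time bin are all hypothetical, and the bin size is taken to be 10 ms.

```python
TIME_BIN_S = 0.01  # assumed time resolution of one Time token (10 ms)

def notes_to_events(notes, segment_start=0.0):
    """Convert (onset_s, offset_s, pitch, velocity) notes to events.

    Times are absolute within the segment, counted in bins from
    segment_start. A Velocity of 0 marks the following Note events
    as note-offs.
    """
    raw = []
    for onset, offset, pitch, velocity in notes:
        raw.append((round((onset - segment_start) / TIME_BIN_S), velocity, pitch))
        raw.append((round((offset - segment_start) / TIME_BIN_S), 0, pitch))
    raw.sort()  # by time bin; velocity 0 (note-off) sorts first within a bin

    events = []
    current_time = None
    current_velocity = None
    for time_bin, velocity, pitch in raw:
        if time_bin != current_time:
            events.append(("time", time_bin))
            current_time = time_bin
        if velocity != current_velocity:
            events.append(("velocity", velocity))
            current_velocity = velocity
        events.append(("note", pitch))
    events.append(("EOS",))
    return events

example = [(0.0, 0.5, 60, 80), (0.25, 0.5, 64, 70)]
for event in notes_to_events(example):
    print(event)
```

Because Time and Velocity tokens apply to all subsequent Note events, the two note-offs at 0.5 s share a single Time token and a single Velocity token.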
Previous studies that used this MIDI-like event vocabulary represented time as relative shifts between events, each indicating the time elapsed since the previous shift. However, in a sequence model, a single relative time error early in the output makes the times of all subsequent steps inaccurate, and these inaccuracies compound as the sequence length grows. To compensate for this drift, the Transformer would have to learn to sum all prior time shifts to compute the current time. Instead, we use absolute time, as shown in Fig. 1, where each Time event represents the time elapsed since the segment began. This allows the model to compute each timestamp independently; we empirically evaluate this option in Section 4.4 and show that using absolute time rather than relative shifts produces superior results.
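The difference between the two time encodings can be seen in a small numerical example. The sketch below injects the same single error of three bins into a relative encoding and into an absolute encoding; the event times are made up for illustration.

```python
from itertools import accumulate

def count_wrong(predicted, truth):
    """Number of positions where the predicted time differs from the truth."""
    return sum(p != t for p, t in zip(predicted, truth))

true_times = [10, 25, 40, 70, 90]  # absolute times in bins (illustrative)

# Relative encoding: each value is the shift since the previous event.
true_shifts = [true_times[0]] + [
    later - earlier for earlier, later in zip(true_times, true_times[1:])
]

# Same mistake in both encodings: the second event is off by +3 bins.
noisy_shifts = list(true_shifts)
noisy_shifts[1] += 3
decoded_relative = list(accumulate(noisy_shifts))

decoded_absolute = list(true_times)
decoded_absolute[1] += 3

print(count_wrong(decoded_relative, true_times))  # 4
print(count_wrong(decoded_absolute, true_times))  # 1
```

With relative shifts, the single error moves every later event; with absolute times, only the affected event is wrong.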
For Time events, we select a resolution of 10 ms because several studies have suggested that this interval is near the limit of human perception. With a finer event resolution, we may be able to improve our results even further.
During inference, a basic greedy autoregressive method is used to decode the model output. At each step, we choose the event with the highest probability and feed it back into the network as the predicted event for that step. We repeat this until the model predicts the EOS token.
Using a sequence of events as the training objective rather than a piano roll matrix provides much more flexibility. In Section 4.4, we show how the same model setup with the same inputs can be trained to predict only onsets, or onsets, offsets, and velocities. The only change necessary is to use a different set of tokens as the training objective. Previously, predicting new features required adding output heads, designing losses for these outputs, and changing the decoding method to merge all model outputs into the final representation.
With the sequence-based technique, our model builds its representations by adaptively modeling audio features in a fully end-to-end training setting. Changing the task definition or adding a new output feature affects only the tokens that describe the target output.
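The greedy decoding loop can be sketched as follows. The interface is hypothetical: `step_fn` stands in for a trained model that, given the tokens decoded so far, returns one score per vocabulary entry, and `max_len` is an assumed safety limit on the output length.

```python
def greedy_decode(step_fn, eos_token, max_len=1024):
    """Greedy autoregressive decoding.

    step_fn(prefix) -> list of scores, one per vocabulary entry
    (a hypothetical stand-in for the trained decoder).
    """
    tokens = []
    while len(tokens) < max_len:
        scores = step_fn(tokens)
        # Pick the most probable token and feed it back on the next step.
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == eos_token:
            break
        tokens.append(next_token)
    return tokens

# Toy "model" that always follows a fixed script, ending with EOS (token 0).
SCRIPT = [3, 1, 2, 0]

def toy_step(prefix):
    scores = [0.0] * 4
    scores[SCRIPT[len(prefix)]] = 1.0
    return scores

print(greedy_decode(toy_step, eos_token=0))  # [3, 1, 2]
```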
Sequential setting
At each layer, the Transformer can attend to all tokens in a sequence, which is especially useful for transcription tasks that require fine-grained information about each event’s pitch and timing. However, the space complexity of this attention mechanism is O(n^2) with respect to the sequence length n. As a result, most audio sequences used for transcription are too long to fit in memory. To overcome this problem, we divide the audio and its corresponding symbolic representation into smaller segments during training and inference.
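A rough calculation shows why segmentation helps. The sketch below counts the entries of the attention score matrix for a whole recording and for the same recording split into segments of 512 positions, the input length used in our experiments. The recording length of 60,000 frames is a hypothetical figure (ten minutes at an assumed 100 frames per second), and the segmented count is an upper bound because the last segment is treated as full.

```python
import math

def attention_entries(n):
    """Entries in one n-by-n self-attention score matrix."""
    return n * n

def segmented_entries(total_frames, segment_frames):
    """Upper bound on total entries when the sequence is split into segments."""
    num_segments = math.ceil(total_frames / segment_frames)
    return num_segments * attention_entries(segment_frames)

FULL_PIECE_FRAMES = 60_000  # hypothetical recording length in frames
SEGMENT_FRAMES = 512        # input length in positions

print(attention_entries(FULL_PIECE_FRAMES))                   # 3600000000
print(segmented_entries(FULL_PIECE_FRAMES, SEGMENT_FRAMES))   # 30932992
```

Under these assumptions, segmenting reduces the number of score entries by roughly two orders of magnitude, and only one segment needs to be held in memory at a time.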
The training process is as follows: (1) Choose a random audio segment from the entire sequence as the model input. The length of the segment can range from a single input frame up to the maximum input length, and the starting point is chosen randomly. (2) Select the segment of the symbolic training target that corresponds to the chosen audio segment. Since a note can begin in one segment and end in another, the model learns to predict note-off events for which no note-on event was observed. (3) Compute the spectrogram of the chosen audio and map the symbolic sequence into our vocabulary. Absolute times are computed within the segment, with time 0 at the segment’s start. (4) Provide the continuous spectrogram input and the encoded MIDI-like events as a training example for the Transformer architecture.
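The segment sampling and target alignment steps can be sketched as below. This is an illustration only: the function names, the use of integer frame indices as the time unit, and the `("on" | "off")` event labels are assumptions made for the example, not our actual data pipeline.

```python
import random

def sample_training_segment(num_frames, max_frames, rng):
    """Pick a random segment: length in [1, max_frames], random start."""
    length = rng.randint(1, min(max_frames, num_frames))
    start = rng.randint(0, num_frames - length)
    return start, start + length

def events_in_segment(notes, seg_start, seg_end):
    """Collect note events inside [seg_start, seg_end).

    notes: list of (onset, offset, pitch) in frame units.
    Returned times are relative to the segment start (time 0).
    """
    events = []
    for onset, offset, pitch in notes:
        if seg_start <= onset < seg_end:
            events.append((onset - seg_start, "on", pitch))
        if seg_start <= offset < seg_end:
            events.append((offset - seg_start, "off", pitch))
    return sorted(events)

rng = random.Random(0)
start, end = sample_training_segment(num_frames=1000, max_frames=512, rng=rng)
print(start, end)

notes = [(5, 30, 60), (40, 55, 64)]
print(events_in_segment(notes, 20, 50))  # [(10, 'off', 60), (20, 'on', 64)]
```

In the second example, pitch 60 started before the segment, so the target contains its note-off without a note-on, and pitch 64 ends after the segment, so its note-off is absent.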
The inference process is as follows: (1) Using the maximum input length possible, split the audio into non-overlapping segments and compute the spectrogram of each. (2) For each segment, provide the spectrogram as the Transformer model’s input, then greedily select the most probable token at each decoding step until an EOS token is predicted. Any tokens that come after a Time event longer than the audio segment’s length are discarded. (3) Concatenate the decoded events of all segments into a single sequence. After concatenation, there may be note-off events that have no corresponding note-on; we remove these. If we encounter a note-on event for a pitch that is already active, we end the current note and start a new one. At the end of the sequence, we end any active notes that lack note-off events.
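The clean-up rules applied after concatenation can be sketched as follows. The data layout is an assumption made for this example: each segment is a list of already decoded `(time, pitch, velocity)` triples with times relative to the segment start, a velocity of 0 marks a note-off, and the function name `events_to_notes` is hypothetical.

```python
def events_to_notes(segments, segment_len, total_len):
    """Merge per-segment events into notes (onset, offset, pitch, velocity).

    segments: list of event lists, one per non-overlapping segment.
    segment_len: segment length, in the same time unit as the events.
    total_len: length of the whole recording.
    """
    active = {}  # pitch -> (onset, velocity) of the currently sounding note
    notes = []
    for index, events in enumerate(segments):
        base = index * segment_len
        for time, pitch, velocity in events:
            if time > segment_len:
                break  # discard everything after a time beyond the segment
            absolute = base + time
            if velocity == 0:
                # Note-off: close the note, or drop it if nothing is active.
                if pitch in active:
                    onset, onset_velocity = active.pop(pitch)
                    notes.append((onset, absolute, pitch, onset_velocity))
            else:
                # Note-on for an active pitch: end the old note first.
                if pitch in active:
                    onset, onset_velocity = active.pop(pitch)
                    notes.append((onset, absolute, pitch, onset_velocity))
                active[pitch] = (absolute, velocity)
    # End any notes still active at the end of the recording.
    for pitch, (onset, onset_velocity) in active.items():
        notes.append((onset, total_len, pitch, onset_velocity))
    return sorted(notes)

decoded = [
    [(10, 60, 80), (50, 64, 70)],
    [(5, 60, 0), (20, 62, 0), (30, 64, 90), (120, 65, 50)],
]
print(events_to_notes(decoded, segment_len=100, total_len=200))
# [(10, 105, 60, 80), (50, 130, 64, 70), (130, 200, 64, 90)]
```

In this example, the note-off for pitch 62 has no note-on and is dropped, the second note-on for pitch 64 ends the first one, the event at time 120 exceeds the segment length and is discarded, and the final note on pitch 64 is closed at the end of the recording.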
As seen in Fig. 2, the model is surprisingly good at predicting note-on or note-off events even when the matching event falls in a different segment. The Onset, Offset, and Velocity F1 scores reported in Section 4.2 confirm this capacity empirically.

Fig. 2. The visualization results of MAESTRO V1.0 validation set.
Experiment setup and datasets
We used an audio sampling rate of 15,000 Hz, a hop width of 256 samples, and an FFT length of 1024 samples. The maximum segment length is 4.066 seconds, and the input and output sequences are limited to 512 and 1024 positions, respectively. We train our model with the Adam optimizer [29], using a batch size of 64, a learning rate of 1e-4, and a dropout rate of 0.2 on sublayer outputs and embedded inputs. The learning rate and dropout value are identical to those used in the fine-tuning exercise. Training uses 24 TPUv2 cores with four examples per core. Other batch sizes we examined in our initial studies had no effect on final performance, so we chose this batch size to maximize training throughput.
We use three piano datasets for evaluation. We use the MAESTRO dataset [3], which includes around 200 hours of piano performances with tight alignment between audio and ground-truth note annotations, to assess our model’s performance on the piano transcription task. We also use the full musical pieces from the MAPS dataset [30], which consists of CD-quality recordings and corresponding annotations of isolated notes, chords, and complete piano pieces. We train on MAESTRO V1.0 and MAPS for comparison with earlier transcription work, but we use MAESTRO V3.0 for the other investigations since it contains an extra 90 performances with 25 hours of data. MAESTRO V3.0 also includes sostenuto and una corda pedal events, although our model and evaluation do not use them. We also do not explicitly model sustain pedal events; instead, like Hawthorne et al. [3], we extend note durations while the sustain pedal is pressed.
Evaluation methods
We use the precision, recall, and F1 score of individual notes to evaluate the performance of a piano transcription system, where the note F1 score is the harmonic mean of precision and recall. Each predicted note is matched to a distinct ground-truth note based on its onset time, pitch, and, optionally, offset time. In addition, the onset velocity can be used to reject matches between notes with different velocities. We mostly report the F1 score that considers onset, offset, and velocity. We also provide results that consider only onsets, or only onsets and offsets. We compute these standard transcription metrics with the mir_eval library. Because the piano is a percussive instrument, identifying note onsets is typically easier than identifying offsets. We use mir_eval’s default match tolerances: 50 ms for onsets and, for offsets, the greater of 50 ms or 20% of the note duration.
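For readers unfamiliar with note-level metrics, the following sketch computes an onset-only precision, recall, and F1 score. It is a simplified illustration and not the mir_eval implementation: it uses a greedy first-match strategy and exact MIDI pitch equality, whereas mir_eval finds an optimal one-to-one matching and compares pitches with a tolerance. The function name `onset_f1` is hypothetical.

```python
def onset_f1(reference, estimated, onset_tolerance=0.05):
    """Onset-only note metrics with greedy one-to-one matching.

    reference, estimated: lists of (onset_seconds, midi_pitch).
    An estimated note matches an unused reference note with the same
    pitch whose onset lies within onset_tolerance seconds.
    """
    unmatched = list(reference)
    matched = 0
    for onset, pitch in estimated:
        for ref in unmatched:
            if ref[1] == pitch and abs(ref[0] - onset) <= onset_tolerance:
                unmatched.remove(ref)  # each reference note is used once
                matched += 1
                break
    precision = matched / len(estimated) if estimated else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = [(0.00, 60), (0.50, 64), (1.00, 67)]
estimated = [(0.02, 60), (0.58, 64), (1.01, 67), (1.50, 72)]
print(onset_f1(reference, estimated))
```

Here two of the four estimated notes match (the note at 0.58 s is 80 ms late and the note on pitch 72 has no counterpart), giving a precision of 0.5, a recall of 2/3, and an F1 score of 4/7.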
Experimental results
Table 1. MAESTRO V1.0 test set results.
Table 2. MAPS test set results.
Table 1 compares the scores reported by prior piano transcription work on MAESTRO V1.0 with those of our sequence-to-sequence technique. Using a generic architecture, decoding algorithm, and standard representations, our solution achieves F1 scores competitive with the existing state-of-the-art methods while being conceptually very simple.
Table 2 presents the same comparison on the MAPS dataset. Here too, our solution achieves competitive F1 scores while remaining conceptually simple.
Using MAESTRO V3.0, we carry out ablation studies on several elements of our model (Fig. 3). First, we test the architecture’s ability to describe the input audio using different sets of output events. We begin by restricting the symbolic data to Note, Time, and EOS events, which yields an onset-only task. On this onset-only task, the model trains successfully and achieves good F1 scores.

Fig. 3. MAESTRO V3.0 test set results.
Following that, we look into various input representations. For “STFT,” we drop the log-mel scaling after the FFT computation, so the dense layer projects an input frame size of 1024 to a model embedding size of 256. For “raw samples,” we split the audio directly into frames according to the spectrogram’s hop width and use them as input, again projected to the embedding size through a feed-forward layer. All configurations train successfully, but the log-mel input yields the best results. We believe this is because mel scaling creates relevant features that the model would otherwise have to extract using a portion of its capacity.
Finally, we train a model with relative time shifts to show that absolute time is better suited to this architecture. The relative model did not perform as well. Furthermore, we found that throughout training, its note-based evaluation metrics on the validation set fluctuated substantially, with onset F1 scores varying by up to 15 points between adjacent validation steps. The reason may be that small errors in the predicted relative time shifts are amplified when the shifts are summed along the sequence to obtain the absolute times needed for metric computation; in other words, relative time shifts cause the generated transcription to drift out of alignment with the audio.
Conclusion
In this paper, we show that our proposed general Transformer architecture can map spectrograms to MIDI output events in the absence of domain knowledge. Experimental results show that our method achieves excellent performance on automatic piano transcription tasks. It is important to note that we achieve results competitive with existing models that are carefully designed for specific transcription tasks, while simply using the standard structure of the Transformer. Our results also suggest that the proposed general framework could be applied to other music information retrieval tasks, such as beat tracking and chord estimation. Future work can proceed in two directions: (1) designing a denoising attention mechanism to further address the effect of noise in the original audio on automatic piano transcription; (2) exploring the potential of the Transformer on a wider range of audio information retrieval problems, in particular by comparing the general framework with models designed specifically for those problem domains.
