Abstract
Visual detection of trumpet fingering is a topic of growing interest in music research. Recognizing and tracking a trumpet player’s finger movements during the performance of a musical piece can provide valuable information for analyzing and improving instrumental technique. However, the task remains largely unexplored, as most works focus on audio quality rather than on fingering technique. Techniques for identifying finger positions on a musical instrument are important because poor fingering can harm performance. In this work, we propose the visual detection of trumpet fingering using convolutional neural networks trained on a dataset that we created for this purpose. Additionally, to improve the results and focus on the essential parts of the instrument, we add self-attention mechanisms that extract the relevant features automatically.
Keywords
Introduction
Fingering detection on the trumpet is a topic of great interest in music and technology. Fingering is crucial in trumpet performance, as correct valve actuation directly affects the tone and quality of the sound produced. Current methods for detecting trumpet fingering range from sensors placed on the mouthpiece or valves to analysis of the audio signal generated by the instrument. However, these methods are limited in their accuracy and in their ability to detect complex fingering patterns.
Visual fingering detection refers to the ability of an image recognition system to identify and record the movements of the musician’s fingers while playing the trumpet. It can be a valuable tool for learning and teaching the trumpet, as well as for real-time performance evaluation. One of the teacher’s roles is to assist students in their daily practice by detecting errors and providing feedback on their progress. However, a human teacher is not always available when a student needs one.
Performing a musical sequence is a complex task because it falls under the open domain category, as mentioned in [1]; this means that there is no one correct way to perform it, and an artificial agent analyzing it must consider factors such as the performer’s skill level (beginner, intermediate, or advanced) and the complexity of the sequence.
This article describes a methodology for visual fingering detection on the trumpet using image processing techniques, Convolutional Neural Networks (CNNs), and self-attention mechanisms added as additional layers of the CNN model. This research can significantly impact the teaching and practice of the trumpet and support the creation of real-time feedback systems for musicians of all levels.
The layout of this work is presented as follows: Section 2 describes the work on which this research is based, while Section 3 outlines the proposed methodology. Our results are presented in Section 4, and the conclusions of this work are finally presented in Section 5.
State of the art
Visual fingering detection on the trumpet has been a little-addressed topic, as most of the work focuses on obtaining fingerings through audio signals or focuses on other types of instruments.
Hofmann and his colleagues [2] studied the interaction between finger and tongue movements in portato playing. They used a sensor-equipped alto saxophone: sensors attached to three saxophone keys measured the finger forces of the left hand, and a strain gauge glued onto a synthetic saxophone reed measured the reed bending. In educational settings, instrument learners benefit significantly from watching demonstrations by professional musicians, where the visual presentation provides deeper insight into specific instrument-technical aspects of the performance (e.g., fingering or choice of strings). Duan and his colleagues [3] performed an analysis on the flute, in which they describe the relevant information that needs to be extracted from the recorded video signals and coordinated with the recorded audio. As a consequence, there has recently been growing interest in the visual analysis of musical performances.
Han and Lee [4] present a method that attempts to learn acoustic features that are more appropriate than conventional features such as Mel Frequency Cepstral Coefficients (MFCCs) for detecting the fingering from a flute sound using unsupervised feature learning.
Almeida [5] measured the individual finger movements of a group of amateur and professional flutists as they played an original piece unseen before the experiment. They played a modified flute with a position detector mounted below each key. The detectors, via an interface and computer, gave the timing and speed of each key, as reported in an earlier study.
Moryossef and colleagues [6] use a methodology to detect the performer’s hands on the piano, extracting features from the performers’ hands. For this purpose, they use the hand pose estimation method [7] to extract key points from the hand and thus identify the notes played on the instrument through Generative Adversarial Networks [8].
In addition, through music information retrieval, Knight [9] proposes classification algorithms based on training support vector machines. Although these algorithms do not identify fingering on the trumpet, they are capable of identifying different levels of tone quality on this instrument. For this purpose, test subjects with different levels of expertise were selected, and recordings of the same musical notes were obtained from them under the same conditions, achieving an accuracy of 72%.
The synchronization of finger movement has been widely studied, especially in piano performance [2]. This is mainly due to the fact that MIDI keyboards provide precise information regarding the time at which each note is played [10]. Likewise, systems have been used to capture movement, to obtain additional information such as the trajectory of the fingers, as well as their acceleration curves [11].
Finally, visual detection methods have also been used to identify finger positions in chord formation on the guitar [12]. In this work, videos are analyzed with the aim of identifying the individual action of each of the fingers involved. As can be seen, fingering analysis has been approached using different methodologies, whether in wind or string instruments. However, there is little work related to the identification of fingerings performed on the trumpet from a strictly connectionist approach.
Methodology
To tackle the challenge presented in this project, we decided to construct our own dataset, because we found no existing dataset that focuses on visually identifying fingering on the trumpet; most concentrate only on using audio for this purpose. We used a Convolutional Neural Network (CNN), with and without self-attention mechanisms, to classify the images. We chose this type of neural network architecture because it can automatically extract features from images and classify them at the same time.
Dataset
We use our own dataset, named CIC-Trumpet-Detection 1, for this project. It was created by novice trumpet learners and comprises images captured from a side view of the instrument that show the valve operation. Note that the view captures the movement of the right hand, regardless of whether the performer is right-handed or left-handed. The dataset consists of eight classes, depicted in Fig. 1. Each class represents a specific position to be classified and contains 23 images, for a total of 184 images. To train the models, the dataset was divided randomly while keeping the classes balanced, giving 144 images for training and 40 images for testing. We did not use a validation set because of the small number of examples.
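The balanced random split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the file names are hypothetical placeholders, and holding out 5 of the 23 images per class is an assumption chosen because it reproduces the reported 144/40 split.

```python
import random
from collections import defaultdict

NUM_CLASSES = 8        # fingering positions (from the text)
IMAGES_PER_CLASS = 23  # images per class (from the text)
TEST_PER_CLASS = 5     # assumption: 5 of 23 per class yields 40 test images


def stratified_split(samples, test_per_class, seed=0):
    """Split (item, label) pairs so each class contributes equally to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)

    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label][:]
        rng.shuffle(items)  # random choice of held-out images within the class
        test += [(item, label) for item in items[:test_per_class]]
        train += [(item, label) for item in items[test_per_class:]]
    return train, test


# Hypothetical file names standing in for the real images.
samples = [(f"class{c}_img{i:02d}.png", c)
           for c in range(NUM_CLASSES)
           for i in range(IMAGES_PER_CLASS)]

train_set, test_set = stratified_split(samples, TEST_PER_CLASS)
print(len(train_set), len(test_set))  # 144 40
```

Splitting per class rather than over the whole pool guarantees that every fingering position appears equally often in the test set, which matters when each class has only 23 examples.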

Finger positions on the trumpet.
The labels of the dataset depend on the position we are executing. For this work, we used one-hot encoding to assign a target vector to each position.
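As a small illustration of the one-hot targets, the sketch below maps a class index to an eight-element vector and back. The ordering of the eight positions is an assumption; the text does not state which index corresponds to which fingering.

```python
NUM_CLASSES = 8  # one entry per fingering position (from the text)


def one_hot(class_index, num_classes=NUM_CLASSES):
    """Return a target vector with 1.0 at class_index and 0.0 elsewhere."""
    if not 0 <= class_index < num_classes:
        raise ValueError(f"class_index must be in [0, {num_classes})")
    vector = [0.0] * num_classes
    vector[class_index] = 1.0
    return vector


def decode(vector):
    """Return the index of the largest entry, e.g. of a softmax output."""
    return max(range(len(vector)), key=vector.__getitem__)


target = one_hot(3)
print(target)          # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(decode(target))  # 3
```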
TrumpetNet is a CNN model based on the AlexNet architecture [13] and composed of two stages. The first is the feature extraction stage, which consists of five convolutional layers with the Rectified Linear Unit (ReLU) activation function [14]. After layers 1, 2, and 4, a max-pooling layer [15] with a 2x2 filter is applied to reduce the dimensions of the feature maps. The second stage flattens the feature maps into a vector, which is fed into two fully connected layers with the ReLU activation function for classification. The output is produced by a fully connected layer with eight neurons and the Softmax activation function [16]; each neuron corresponds to one fingering position. Fig. 2 shows the TrumpetNet model in detail.

TrumpetNet model.
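To make the effect of the three pooling layers concrete, the following sketch traces feature-map shapes through the layer sequence described above. The 224x224 RGB input, the filter counts, and the size-preserving convolutions are assumptions made only for illustration; the text does not specify them.

```python
# Each entry is (kind, value). A "conv" layer is assumed to preserve height and
# width (stride 1 with padding) and sets the channel count to `value`.
# A "pool" layer divides height and width by `value` (2x2 max pooling).
LAYERS = [
    ("conv", 32), ("pool", 2),    # convolution 1 followed by pooling
    ("conv", 64), ("pool", 2),    # convolution 2 followed by pooling
    ("conv", 128),                # convolution 3
    ("conv", 128), ("pool", 2),   # convolution 4 followed by pooling
    ("conv", 256),                # convolution 5
]


def trace_shapes(height, width, channels, layers=LAYERS):
    """Return the (height, width, channels) shape after the input and each layer."""
    shapes = [(height, width, channels)]
    for kind, value in layers:
        if kind == "conv":
            channels = value
        elif kind == "pool":
            height, width = height // value, width // value
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        shapes.append((height, width, channels))
    return shapes


shapes = trace_shapes(224, 224, 3)  # assumed input size
final_h, final_w, final_c = shapes[-1]
flattened = final_h * final_w * final_c
print(shapes[-1], flattened)  # (28, 28, 256) 200704
```

Under these assumptions, the vector entering the first fully connected layer has 200,704 elements, which shows why pooling after layers 1, 2, and 4 is needed to keep the classifier stage manageable.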

Implemented self-attention mechanism layer.
The self-attention mechanism [17] can be motivated by variance and covariance, two measures that apply to random variables. Variance quantifies how much an individual random variable deviates from its mean, whereas covariance gauges how strongly two random variables vary together. When two random variables behave similarly, their covariance is large. Conversely, when they behave very differently, their covariance is small.
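For reference, the standard definitions of the two measures are given below. The symbols X and Y (random variables) and their means are notation introduced here, not taken from the original text.

```latex
\operatorname{Var}(X) = \mathbb{E}\left[(X - \mu_X)^2\right],
\qquad
\operatorname{Cov}(X, Y) = \mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right],
\qquad
\mu_X = \mathbb{E}[X], \quad \mu_Y = \mathbb{E}[Y]
```

Variance is the special case of covariance in which both variables are the same, that is, Var(X) = Cov(X, X).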
When we view each pixel within the feature map as a stochastic variable and compute the covariances among all pixel pairs, we can adjust the significance of each predicted pixel based on its resemblance to other pixels in the image. During training and prediction, we can capitalize on the strength of similarity among pixels while downplaying the significance of dissimilar ones. This concept is known as self-attention.
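The idea of reweighting each pixel by its similarity to the others can be sketched with a simplified dot-product self-attention. This is an illustration under stated assumptions, not the layer the authors implemented: it uses the pixel vectors directly as queries, keys, and values with no learned projections, and it is written in plain Python for readability rather than with an array library.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(pixels):
    """Apply dot-product self-attention to N pixel vectors of C channels each.

    Returns the attended pixel vectors and the N x N attention weights.
    """
    scale = math.sqrt(len(pixels[0]))
    outputs, weights = [], []
    for query in pixels:
        # Similarity of this pixel to every pixel, itself included.
        row = softmax([dot(query, key) / scale for key in pixels])
        weights.append(row)
        # Each output is a similarity-weighted average of all pixel vectors.
        outputs.append([sum(w * p[c] for w, p in zip(row, pixels))
                        for c in range(len(query))])
    return outputs, weights


# Toy 2x2 feature map with 2 channels, flattened into 4 pixel vectors.
feature_map = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
attended, attention = self_attention(feature_map)
print(attention[0])
```

In this toy input, the first pixel receives a larger weight from the second pixel, which resembles it, than from the third, which does not.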
TrumpetNet with attention
As mentioned in Section 3.2, this CNN is used to classify the images of the positions on the trumpet. To improve the performance of the model, we added self-attention layers to it. These layers are placed after each convolutional layer because we want the model to automatically extract more features from the fingers pushing the valves rather than from the complete trumpet. This model is presented in Fig. 4.

TrumpetNet model with self-attention mechanisms.
This section presents the results. We propose two experiments: the first trains the TrumpetNet model on the dataset without data augmentation, and the second uses online data augmentation. The CNN models were trained with backpropagation [18] using the Adam optimizer [19] with a learning rate of 0.0001. All models were trained for 100 epochs so that they can be compared fairly. Training was performed on an NVIDIA GTX 1080 Ti GPU. Without data augmentation, the architecture takes about 15 minutes to train and around 1 second to test a single image; with data augmentation, it takes approximately 1 hour to train and less than 1 second to test a single image. The model was implemented and trained with the Python framework Keras, which was also used to perform data augmentation.
Data augmentation
Since a neural network needs many examples to learn the characteristics of a dataset, we used data augmentation to increase the number of images. We applied spatial transformations: random rotations of each image around the origin [21] between 0 and 20 degrees, and random left and right translations between 1 and 10 pixels. We also randomly increased or decreased the brightness by 0 to 20 percent of the original pixel value. For each of the training and test images in the dataset, 32 random images were obtained, giving a total of 5,888 images for training and 1,280 images for testing. These data augmentation techniques were implemented with the Keras [20] library functions.
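The parameter ranges above can be sketched as a random sampling plan. This is a hedged illustration only: it draws the transformation parameters but does not transform any image, the function and key names are hypothetical, and the uniform sampling is an assumption. The authors used Keras library functions for the actual augmentation.

```python
import random

VARIANTS_PER_IMAGE = 32  # augmented images per original (from the text)


def sample_augmentation(rng):
    """Draw one set of transformation parameters within the stated ranges."""
    return {
        "rotation_deg": rng.uniform(0.0, 20.0),
        # Negative values shift left and positive values shift right.
        "shift_px": rng.choice([-1, 1]) * rng.randint(1, 10),
        # 1.0 leaves brightness unchanged; the range is plus or minus 20 percent.
        "brightness_factor": 1.0 + rng.uniform(-0.20, 0.20),
    }


def augmentation_plan(num_images, variants=VARIANTS_PER_IMAGE, seed=0):
    """Return, for each image, a list of parameter sets for its variants."""
    rng = random.Random(seed)
    return [[sample_augmentation(rng) for _ in range(variants)]
            for _ in range(num_images)]


plan = augmentation_plan(num_images=40)  # the 40 test images
print(sum(len(variants) for variants in plan))  # 1280
```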
Analysis of results
Table 1 presents the quantitative results. To evaluate classification performance, we used the Accuracy, Precision, Recall, and F1 metrics [13]. Note that these metrics were computed directly on the test set. According to them, the best performance was obtained by the TrumpetNet architecture with the self-attention mechanism and data augmentation. Although the network performed well without augmentation, adding variations to the images can improve the classification result, since the network learns variations of each image that account for movement of the performer’s hand.
TrumpetNet results. DA: data augmentation; WA: with attention; WAD: with data augmentation and attention.
Figs. 5 to 8 show the confusion matrices for both experiments. The best classification results were obtained by the TrumpetNet model with the self-attention mechanism and data augmentation.

Confusion matrix of the TrumpetNet.

Confusion matrix of the TrumpetNet with data augmentation.

Confusion matrix of the TrumpetNet with self-attention mechanism.

Confusion matrix of the TrumpetNet with self-attention mechanism and data augmentation.
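Accuracy, Precision, Recall, and F1 can all be derived from a confusion matrix such as those above. The sketch below shows one way to do so. Macro-averaging over classes is an assumption, since the text does not state how the per-class values were combined, and the 3-class matrix is made up for illustration and is not a result from this work.

```python
def classification_metrics(confusion):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    precisions, recalls, f1_scores = [], [], []
    for c in range(n):
        true_positive = confusion[c][c]
        predicted_as_c = sum(confusion[i][c] for i in range(n))  # column sum
        actually_c = sum(confusion[c])                           # row sum
        precision = true_positive / predicted_as_c if predicted_as_c else 0.0
        recall = true_positive / actually_c if actually_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Made-up 3-class example, not data from the paper.
toy_confusion = [
    [5, 0, 0],
    [1, 4, 0],
    [0, 0, 5],
]
metrics = classification_metrics(toy_confusion)
print(metrics)
```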
In Fig. 9, we present the results of the Grad-CAM algorithm [22], which identifies the parts of an image that led a convolutional neural network to its final decision. This method generates heatmaps that represent the class activations within the input images. Each class activation map is associated with a specific output class. These maps indicate the importance of each pixel with respect to the target class by increasing or decreasing the pixel’s intensity in a tested image.

Grad-CAM algorithm applied to images of all the positions on the trained TrumpetNet model. A) Grad-CAM of the TrumpetNet model, B) Grad-CAM of the TrumpetNet model with data augmentation, C) Grad-CAM of the TrumpetNet model with self-attention, D) Grad-CAM of the TrumpetNet model with self-attention and data augmentation.
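The core Grad-CAM computation can be sketched as follows, assuming that the feature maps of the last convolutional layer and the gradients of the target class score with respect to them are already available. In practice a deep learning framework obtains these gradients by backpropagation. The toy values are illustrative and the code is plain Python for readability.

```python
def grad_cam(activations, gradients):
    """Build a normalized Grad-CAM heatmap.

    activations: K feature maps, each an H x W nested list.
    gradients:   K maps of the same shape, holding d(class score)/d(activation).
    """
    height, width = len(activations[0]), len(activations[0][0])

    # Channel importance is the spatial average of the gradient map.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]

    # Weighted sum of the feature maps, followed by ReLU to keep only
    # evidence in favor of the target class.
    cam = [[max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)]
           for i in range(height)]

    # Scale to [0, 1] so that the map can be rendered as a heatmap.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example with two 2x2 feature maps.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],   # channel 0 responds at the top-left
    [[0.0, 0.0], [0.0, 2.0]],   # channel 1 responds at the bottom-right
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],       # channel 0 supports the target class
    [[-1.0, -1.0], [-1.0, -1.0]],   # channel 1 counts against it
]
heatmap = grad_cam(activations, gradients)
print(heatmap)  # [[1.0, 0.0], [0.0, 0.0]]
```

Only the top-left location is highlighted: the bottom-right activation is strong, but its channel has a negative weight, so the ReLU removes it from the map.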
According to the results obtained, we discovered that increasing the data greatly helps us improve the results of our model. This is because we added variations to the small dataset, such as brightness, rotations, and translations, which serve as simulations of different spaces where the practitioner can position themselves within a room or environment. We also found that self-attention mechanisms further enhance the results when we increase the dataset. It’s worth noting that both attention and non-attention models were trained from scratch.
Conclusions and future work
In this work, we presented a visual dataset for fingering detection on the trumpet, with which the actions exerted on the instrument’s valves can be detected. We showed that these images can be classified with a convolutional neural network model and that data augmentation can improve the model’s performance by introducing variations in angle into the images. As future work, we propose employing different CNN models with varying hyperparameters to enhance classification and incorporating attention mechanisms to ensure that the CNN models focus on the instrument’s valves. We also propose using different perspectives of the performer, such as a top-down view or a side-angular view, to adapt the model to various viewpoints.
Footnotes
Acknowledgments
The authors thank the Instituto Politécnico Nacional (COFAA, SIP-IPN, Grants SIP 20240610 and 20240666) and the Mexican Government (CONAHCyT, SNI) for their support.
