MusicNeXt: Addressing category bias in fused music using musical features and genre-sensitive adjustment layer

Abstract

Convolutional neural networks (CNNs) have been successfully applied to music genre classification tasks. With the development of diverse music, genre fusion has become common. Fused music exhibits multiple similar musical features such as rhythm, timbre, and structure, which typically arise from the temporal information in the spectrum. However, traditional CNNs cannot effectively capture temporal information, leading to difficulties in distinguishing fused music. To address this issue, this study proposes a CNN model called MusicNeXt for music genre classification. Its goal is to enhance the feature extraction method to increase focus on musical features, and increase the distinctiveness between different genres, thereby reducing classification result bias. Specifically, we construct the feature extraction module which can fully utilize temporal information, thereby enhancing its focus on music features. It exhibits an improved understanding of the complexity of fused music. Additionally, we introduce a genre-sensitive adjustment layer that strengthens the learning of differences between different genres through within-class angle constraints. This leads to increased distinctiveness between genres and provides interpretability for the classification results. Experimental results demonstrate that our proposed MusicNeXt model outperforms baseline networks and other state-of-the-art methods in music genre classification tasks, without generating category bias in the classification results.

Keywords

Music genre classification, spectrogram, deep learning, L-softmax loss

1. Introduction

Music genre classification (MGC) [1] is an important task in the field of music information retrieval (MIR), and it is of great significance for the music industry, music recommendation, search engines, and organizations. With the diversification of music development, genre fusion has become a common phenomenon. It promotes music exchange and collaboration while profoundly influencing music creation. However, it also poses challenges for genre classification. Traditional classification methods lack sufficient discriminative power, making it difficult to effectively differentiate between the multiple elements in fusion music.

There is no universally agreed-upon methodology for defining genres, making precise definitions difficult. Compared to other classification tasks, the label of music genre is subjective in nature. It is influenced by various factors such as culture, national context, and ethnic traditions. With the continuous development of music, artists are blending features and styles from different genres to create new music genres. This phenomenon is known as genre fusion, and it has resulted in music that exhibits cross-genre similarities. For example, the influence of blues music has permeated into country music and rock music, contributing to their development and evolution. Blues shares similarities with them in various aspects such as melody, rhythm, and harmony. Rock music, in particular, has evolved from Blues, Country, and Rhythm and Blues (R&B). This has also led to the formation of the fusion genre known as Country Rock, which not only impacted subsequent country and rock music but also provided inspiration for the fusion and cross-innovation of other genres. Due to the presence of multiple genre elements in fusion music, traditional MGC methods often tend to bias the classification results towards certain single-element genres.

Some researchers have recognized the impact of cross-genre similarity. For example, Esparza et al. [2] analyzed the subtle influences between rhythm and genre by combining computational (quantitative) analysis with a musicological perspective. Costa et al. [3] noticed a high confusion rate between the Forró and Gaúcha genres in their experiments and suggested that this could be explained by their similarities in rhythm and timbre content. These explanations remain somewhat ambiguous, and the studies do not propose a solution to the bias in classification results. However, they highlight the connection between cross-genre similarities and features such as rhythm and timbre, which express musical characteristics. Therefore, to address the classification problem in fusion music, it is necessary to place more emphasis on features that capture musical traits and to better adapt to the complexity and diversity of fusion music.

Figure 1.

ConvNeXt-Tiny network structure.

With the successful application of deep learning and CNNs, recent research has focused on improving model structures for spectrogram-based classification. Generally, to extract deep features and improve classification performance, the depth and complexity of the network are increased, which raises the risk of overfitting and the computational demands. To address this issue, Liu et al. [4] analyzed the structure and parameters of the Swin Transformer and proposed the ConvNeXt model illustrated in Fig. 1, a pure convolutional neural network obtained by improving ResNet-50. It achieves better classification results through architectural optimizations rather than through increased model complexity. Its overall performance surpasses that of the Transformer, demonstrating that a well-designed model structure can be more effective than added complexity. However, for music classification tasks, CNN architectures still have some limitations. Music spectrograms differ from regular images in that they encode the temporal information of the audio signal. This temporal information plays a vital role in capturing the rhythm, dynamics, and expression of music, and it cannot be fully exploited by conventional CNNs. Therefore, we propose a content-based music genre classification model called MusicNeXt. It has the advantage of effectively extracting musical features and accurately classifying them. Specifically, we introduce a genre-sensitive adjustment layer that enhances the learning of differences between genres by imposing intra-class angular constraints. This layer diminishes the impact of shared elements among fused genres, thus reducing bias in the classification results.

Our contributions in this paper can be summarized as follows:

• To extract deep features from the temporal information in spectrograms, a network stacking module using depthwise convolution is designed. A lightweight network called MusicNeXt is proposed for MGC tasks.

• A genre-sensitive adjustment layer is designed to enable differentiated classification, addressing the bias issue caused by genre fusion. This improves classification accuracy and provides a geometric interpretation of the classification results.

• Compared to several state-of-the-art methods, our approach achieves excellent performance on three datasets while significantly reducing computational costs.

2. Related works

In order to extract music features more effectively, researchers have been continuously improving the architectures of neural networks. In this section, we introduce the progress made in CNNs for MGC using spectrograms, and describe the improvements in feature extraction.

2.1 Music genre classification with spectrograms

In recent years, with the rapid development of deep learning in computer vision, neural networks have also been successfully applied to various tasks in MIR, such as music tagging [5], automatic singer recognition [6], music emotion recognition [7] and more. Deep learning models combine low-level features to discover distributed representations of data, enabling the learning of abstract high-level features. This approach has led to an increasing number of researchers considering content-based music classification tasks as image classification problems, and replacing traditional methods of manually selecting features with improved feature learning through model enhancements.

Dong et al. proposed BCRSN [7], a bidirectional convolutional recurrent sparse network based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), for music emotion analysis. This method adaptively learns influential features containing sequential information from spectrograms, reducing computational complexity. They also introduced a weighted mixed-binary representation method, transforming regression prediction into a weighted combination of multiple binary classification problems. Bian et al. [8] utilized DenseNet as a building block for the CNN architecture to enhance the performance of music audio classification and achieved data augmentation for music-specific data through time overlap and pitch shifting of spectrograms. Yu et al. [9] observed that spectrograms with different temporal resolutions have varying importance, and proposed a novel model combining bidirectional recurrent neural networks with attention mechanisms, introducing parallel and serial attention mechanisms for classification. Chang et al. [10] designed a multi-scale SincNet (MS-SincNet) for learning 2D representations from 1D raw waveform signals, jointly learning 1D and 2D kernels for classification. Liu et al. [11] proposed a CNN architecture that incorporates multi-scale audio features, considering the sensitivity of feature performance to frequency accumulation of sound events in the time domain. Inspired by traditional K-nearest neighbors (KNN), Zhang et al. [12] introduced KNN-Net for automatic singer identification (SID), restricting the decision scope of CNN within a relative range.

These previous studies have all utilized neural networks for classification tasks in MIR. Compared to traditional handcrafted features, the improved neural network models have a distinct advantage in automatically extracting features, resulting in better classification performance. However, these studies failed to recognize the influence of genre fusion on classification bias and did not fully leverage the temporal information of the spectra for classification. They cannot discern multiple musical elements, and as a result, the classification results still exhibit bias.

2.2 Improvements in feature extraction methods

Extracting effective music features is crucial for music classification. During the feature extraction process in traditional music classification tasks, short-term features are initially extracted from the music signal to capture its local or transient characteristics. Then, these short-term features are integrated or aggregated to form long-term features that reflect the overall nature of the music signal. The classical audio features used in MGC include Mel-frequency cepstral coefficients (MFCC) [13], short-time energy [14], zero-crossing rate (ZCR), spectral centroid/flux [15], discrete wavelet coefficient histogram (DWCH) [16], timbral texture, rhythmic content, and pitch content feature groups [17].

Some researchers have improved the classification performance by enhancing the feature extraction methods. Dieleman et al. [18] conducted feature learning on raw audio signals, enabling the model to autonomously discover frequency decompositions as well as phase- and shift-invariant feature representations. Choi et al. [19] proposed a CRNN model for music tagging that handles both feature extraction and feature summarization. Liang et al. [20] introduced PiRhDy, which integrates pitch, rhythm, and dynamics features. It is based on symbolic music in melodic and harmonic contexts and can be applied to various MIR tasks. Cai et al. [21] proposed a novel auditory-inspired feature set based on auditory image processing, which mimics the human auditory system. They also presented a classification framework that combines auditory image features with traditional acoustic and spectral features.

These studies have approached feature extraction from different perspectives, aiming to capture features that better align with the characteristics of music for classification purposes. However, fusion music often exhibits similar characteristics, and assigning equal weights to all features is unreasonable. To address this issue, we introduced genre-sensitive adjustment layers, enhancing the separability between genres and reducing the intra-class distance within the same genre.

3. Method

In this section, we introduce the design of MusicNeXt. It is a deep convolutional model aimed at extracting musical features and conducting differentiated classification for fused music.

3.1 Architecture of MusicNeXt

The MusicNeXt architecture consists of a feature extraction module and a classifier. The structure of MusicNeXt is illustrated in Fig. 2. In the feature extraction module, we start with the stem layer, which preprocesses the input data and extracts local features for subsequent processing. It is followed by four stacked basic modules that use depthwise convolutions to extract temporal and frequency-related information from the music. Skip connections are employed to alleviate the vanishing gradient problem and promote model convergence. The modules are connected by downsampling to reduce the computational burden and integrate deep-level features. Moreover, we have removed the batch normalization (BN) layers and most of the activation functions, keeping only the ELU activation function for the depthwise convolutional layers. This improves the stability and generalization capability of the model. In the ConvNeXt architecture, the feature extraction module achieves the best experimental results with a stacking ratio of 3:3:9:3. Therefore, we take this ratio as a reference for our model.

Figure 2.

MusicNeXt network structure.

In the classifier part, we integrate the global information of the spectrograms and employ a fully connected layer with L-softmax to map the features to the respective categories. L-softmax is an improved fully connected structure that applies an angular margin to the cosine term in order to sharpen the boundaries between classes, thereby improving inter-class discrimination. It allows genres with partially similar features to have larger inter-class distances. The L-softmax loss is then used to adjust the sensitivity to music genres, aiming to enhance the model’s ability to capture differences among music genres and eliminate classification bias in the results. Because of these changes to the classifier and their effect on classification performance, we refer to this module as the genre-sensitive adjustment layer.

3.2 Depthwise convolution

In traditional convolution, the kernel slides and performs convolution operations on the input feature map. At each sliding position, the local pixels are element-wise multiplied with the kernel and the results are accumulated to generate the output feature map. This operation captures both global and local features of the image, and the parameter count is given by:

$\displaystyle\textit{parameters}=K\times K\times C_{in}\times C_{\textit{out}}$ (1)

where $K\times K$ represents the kernel size, $C_{in}$ represents the number of input channels, and $C_{\textit{out}}$ represents the number of output channels. However, lightweight convolutional neural networks often struggle to extract deep features effectively due to their limited depth. Increasing the depth of the network is a common approach to achieve better performance, but this significantly increases the parameter count.

Depthwise convolution is a convolutional operation that first performs separate convolutions on each input channel and then linearly combines the features from different channels to generate the final output feature map. Each channel of the input feature map is convolved with its corresponding kernel, which is responsible for extracting specific local features in that channel. The parameter count for depthwise convolution is:

$\displaystyle\textit{parameters}=K\times K\times C_{in}+C_{in}\times C_{\textit{out}}$ (2)

Here, a $(K\times K,C_{in},C_{\textit{out}})$ convolution is decomposed into a $(K\times K,1,C_{in})$ depthwise convolution and a $(1\times 1,C_{in},C_{\textit{out}})$ pointwise convolution. This separation of spatial and channel convolutions significantly reduces the parameter count while maintaining a comparable level of performance and achieving higher computational efficiency.
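As a rough check of Eqs. (1) and (2), the following Python sketch compares the two parameter counts. The layer sizes in the example ($K=7$ with 96 input and output channels) are illustrative assumptions, not values stated for MusicNeXt, and bias terms are ignored as in the equations.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Eq. (1): weights of a standard K x K convolution (no bias)."""
    return k * k * c_in * c_out


def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Eq. (2): K x K per-channel weights plus 1 x 1 channel-mixing weights (no bias)."""
    return k * k * c_in + c_in * c_out


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    k, c_in, c_out = 7, 96, 96
    full = standard_conv_params(k, c_in, c_out)
    separable = separable_conv_params(k, c_in, c_out)
    print(full, separable, round(full / separable, 1))  # 451584 13920 32.4
```

For these assumed sizes the decomposed form needs roughly 32 times fewer weights, and the gap widens as $K$ or $C_{\textit{out}}$ grows.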

The advantage of depthwise convolution lies in capturing multi-level features from the width perspective, effectively extracting deep features. By stacking depthwise convolutional layers, convolutional neural networks can learn more complex and abstract deep features, enhancing the network’s expressive power and learning capability. In the field of MIR, depthwise convolution can better learn music features and patterns, leading to improved performance and results in tasks such as music classification and music generation.

Figure 3.

Differences between MusicNeXt, ConvNeXt, and ResNet in basic stacking block.

3.3 Replacement of activation functions

The Exponential Linear Unit (ELU) activation function was introduced by Clevert et al. [22] in 2015 as a solution to the vanishing gradient problem. It is defined as:

$\displaystyle\textit{ELU}(x)=\left\{\begin{array}[]{ll}x,&x>0\\ \alpha(e^{x}-1),&x\leqslant 0\\ \end{array}\right.$ (3)

where $\alpha$ is a hyperparameter that controls the range of the negative output values. In our model, we set the default value of $\alpha$ to 1.0.
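Eq. (3) can be written directly as a scalar function. The sketch below is only an illustration of the definition; the function name `elu` is our own, and a real model would use the vectorized activation provided by its deep learning framework.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Eq. (3): identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    # math.expm1(x) computes exp(x) - 1 accurately for x close to zero.
    return x if x > 0 else alpha * math.expm1(x)


if __name__ == "__main__":
    for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(value, elu(value))
```

For large negative inputs the output saturates at $-\alpha$, which is the soft saturation behavior referred to in this section.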

Compared to the Rectified Linear Unit (ReLU), ELU produces negative outputs for negative inputs, which pushes the mean activation closer to zero, an effect similar to that of batch normalization. This property enhances the model’s sensitivity to variations in the input data while requiring lower computational complexity. Additionally, ELU exhibits a soft saturation property, which improves its tolerance to noise and enhances the stability and generalization capabilities of the model.

In recent years, the effectiveness of ELU has been demonstrated in the field of MIR [6, 23, 24]. Therefore, we use ELU as the activation function in our model. Because ELU already keeps activations centered near zero, the network performs well without explicit normalization. Accordingly, as shown in Fig. 3, we removed the normalization layers from the stem and the MusicNeXt blocks.

Figure 4.

Comparison of the accuracy of MusicNeXt with softmax and L-softmax. The horizontal axis represents the number of epochs, while the vertical axis represents the accuracy.

3.4 Genre-sensitive adjustment layer

As music develops, different genres influence each other, and some genres exhibit high similarity. This makes it important to improve both intra-genre compactness and inter-genre separability. To better classify genres that share cross-genre similarities, we introduce a genre-sensitive adjustment in this study, replacing the conventional softmax loss with the L-softmax loss. Convolutional neural networks commonly use softmax with the cross-entropy loss as the classifier. The softmax loss function is defined as:

$\displaystyle L_{s}=-\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{e^{\|w_{y_{i}}\|\|x_{i}\|\cos(\theta_{y_{i}})}}{\sum_{j=1}^{n}e^{\|w_{j}\|\|x_{i}\|\cos(\theta_{j})}}\right)$ (4)

where $x_{i}$ denotes the feature representation of the $i$-th sample, and $y_{i}$ is the corresponding class label for $x_{i}$. $W$ denotes the weight matrix, which represents the parameters of the last fully connected layer and serves as the classifier, and $w_{j}$ is its weight vector for class $j$. $\theta_{j}$ ($0\leqslant\theta_{j}\leqslant\pi$) represents the angle between the vector $w_{j}$ and $x_{i}$. The number of training samples is indicated by $N$, while the number of classes is represented by $n$.

Although softmax loss is simple and performs well, it lacks boundary discrimination and class sensitivity, making it less effective for music genre classification. In this paper, we introduce the L-softmax with a large-margin angle constraint to enhance the learning of differences between different classes. As shown in Fig. 4, it demonstrates superior performance compared to the softmax loss. Moreover, L-softmax exhibits better robustness against the vanishing gradient problem, leading to more stable network training and providing interpretability for neural network classification. The L-softmax loss function is defined as:

$\displaystyle L_{s}=-\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{e^{\|w_{y_{i}}\|\|x_{i}\|\cos(\alpha\theta_{y_{i}})}}{e^{\|w_{y_{i}}\|\|x_{i}\|\cos(\alpha\theta_{y_{i}})}+\sum_{j\neq y_{i}}e^{\|w_{j}\|\|x_{i}\|\cos(\theta_{j})}}\right)$ (5)

where $\alpha$ is a hyperparameter that adjusts the sensitivity between classes by controlling the steepness of the class boundaries (it is distinct from the ELU parameter $\alpha$ in Eq. (3)). A larger $\alpha$ value results in steeper boundaries between classes, causing the model to focus more on the differences between classes. Conversely, a smaller $\alpha$ value leads to smoother class boundaries, making the model pay more attention to the features within classes. The parameter $\alpha$ also restricts the angle $\theta$ by imposing constraints on the softmax output, thereby enhancing classification interpretability. The model is encouraged to classify different classes within a specific angle range, thereby clarifying the relationships and boundaries between classes.
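To make the angular formulation concrete, the sketch below evaluates the loss of Eq. (5) in plain Python, under our reading that the target-class logit is $\|w_{y_{i}}\|\|x_{i}\|\cos(\alpha\theta_{y_{i}})$ and all other logits are left unchanged. The function and variable names are hypothetical, the weights and features are toy values, and the piecewise extension of the cosine used in some L-softmax implementations is deliberately omitted, so this should not be taken as the exact training code.

```python
import math


def _norm(v):
    return math.sqrt(sum(a * a for a in v))


def _angle(w, x):
    """Angle in [0, pi] between weight vector w and feature vector x."""
    cos = sum(a * b for a, b in zip(w, x)) / (_norm(w) * _norm(x))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error


def l_softmax_loss(weights, features, labels, alpha=1.0):
    """Mean loss of Eq. (5); alpha = 1 gives the softmax loss of Eq. (4)."""
    total = 0.0
    for x, y in zip(features, labels):
        logits = []
        for j, w in enumerate(weights):
            # Only the angle to the true class is multiplied by alpha.
            scale = alpha if j == y else 1.0
            logits.append(_norm(w) * _norm(x) * math.cos(scale * _angle(w, x)))
        # Numerically stable log-sum-exp over all class logits.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_sum - logits[y]
    return total / len(features)


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]   # toy two-class classifier
    features, labels = [[2.0, 1.0]], [0]
    print(l_softmax_loss(weights, features, labels, alpha=1.0))
    print(l_softmax_loss(weights, features, labels, alpha=2.0))
```

In this toy case the sample is correctly classified, yet the loss is larger with $\alpha=2$ than with $\alpha=1$, because the enlarged angle lowers the target logit. The sample must therefore lie closer to its class direction to reach the same loss, which is the source of the tighter intra-class clustering.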

Table 1

The genre labels and the actual number of songs used in the datasets

GTZAN		ISMIR2004		Extended Ballroom	
Genre	Track number	Genre	Track number	Genre	Track number
Blues	100	Classical	634	Chacha	455
Classical	100	Electronic	221	Jive	350
Country	100	Jazz-Blues	52	Quickstep	497
Disco	100	Metal-Punk	90	Rumba	470
Hip-hop	100	Rock-Pop	203	Samba	468
Jazz	100	World	244	Tango	464
Metal	100			Viennese waltz	252
Pop	100			Waltz	529
Reggae	100			Foxtrot	507
Rock	100				

4. Experiments

We evaluate our method on three widely-used public datasets. In this section, we will provide descriptions of these datasets and then give the experimental results. MusicNeXt was compared with state-of-the-art methods to verify the effectiveness of the approach.

4.1 Datasets

In this work, we use the GTZAN, ISMIR2004, and Extended Ballroom datasets to evaluate performance. Detailed information about the datasets, including the genre labels and the actual number of songs used, is provided in Table 1.

GTZAN.

GTZAN is a publicly available dataset for music genre classification, created by Tzanetakis [17]. The dataset consists of 10 genres: blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock, each with 100 music clips of 30 seconds, totaling 1000 clips. The GTZAN dataset has been widely used in many studies on music genre classification. In this work, we randomly split the dataset at a ratio of 8:2.

ISMIR2004.

The ISMIR2004 dataset [25] consists of 1458 music tracks from 6 unbalanced genre classes: classical, electronic, jazz-blues, metal-punk, rock-pop, and world. The dataset is split into 729 tracks for training and the remaining 729 for testing. Full songs were not used in the experiment. We removed tracks shorter than 30 seconds and extracted one 30-second segment from each remaining track: if the track is shorter than 60 seconds, we take its last 30 seconds; otherwise, we take the 30 seconds that follow the first 30 seconds. After preprocessing, we had 724 files for training and 720 files for testing in our dataset.
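The segment selection rule described above can be sketched as a small helper. The function name `select_segment` and the convention of returning start and end times in seconds are our own assumptions; how a track of exactly 30 or 60 seconds is treated follows a literal reading of the rule.

```python
from typing import Optional, Tuple

SEGMENT_SECONDS = 30.0


def select_segment(duration: float) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds of the 30-second excerpt, or None if
    the track is too short and is dropped."""
    if duration < SEGMENT_SECONDS:
        return None  # tracks shorter than 30 s are removed
    if duration < 2 * SEGMENT_SECONDS:
        # Shorter than 60 s: keep the last 30 s of the track.
        return (duration - SEGMENT_SECONDS, duration)
    # Otherwise: skip the first 30 s and keep the following 30 s.
    return (SEGMENT_SECONDS, 2 * SEGMENT_SECONDS)


if __name__ == "__main__":
    for seconds in (25.0, 45.0, 200.0):
        print(seconds, select_segment(seconds))
```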

Extended Ballroom.

The Extended Ballroom dataset is a genre classification dataset proposed in 2016 by Marchand [26] as an extension of the original Ballroom dataset. It contains 4,180 tracks in 13 genres: Chacha, Jive, Quickstep, Rumba, Samba, Tango, Viennese waltz, Waltz, Foxtrot, Pasodoble, Salsa, Slow waltz, and Wcswing. Each track in Extended Ballroom lasts about 30 seconds. To balance the dataset, we used the first 9 genres for the experiment.

4.2 Experiment settings

4.2.1 Data preprocessing

In this work, we employed the short-time Fourier transform (STFT) provided by Librosa to extract mel-spectrograms with 128 mel-filters, resulting in spectrograms of size 2560 $\times$ 256. Then we cropped the spectrograms of each audio segment into 10 fragments of size 256 $\times$ 256. Each image contains 3 seconds of audio information and serves as input to the MusicNeXt. An example of such a spectrogram is shown in Fig. 5.
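The cropping step can be sketched as follows. We assume here that the 2560-sample axis is the time axis and that the ten fragments do not overlap; the spectrogram is represented as a plain list of rows filled with placeholder zeros so that the example runs without any audio file or external library.

```python
def crop_fragments(spectrogram, width=256):
    """Split a spectrogram (list of frequency rows, time along the columns)
    into consecutive non-overlapping fragments of `width` time frames."""
    n_frames = len(spectrogram[0])
    return [
        [row[start:start + width] for row in spectrogram]
        for start in range(0, n_frames - width + 1, width)
    ]


if __name__ == "__main__":
    # Placeholder spectrogram: 256 frequency rows x 2560 time frames.
    spec = [[0.0] * 2560 for _ in range(256)]
    fragments = crop_fragments(spec)
    print(len(fragments), len(fragments[0]), len(fragments[0][0]))  # 10 256 256
```

Under these assumptions each fragment covers one tenth of the 30-second segment, that is, 3 seconds of audio.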

Figure 5.

A spectrogram image obtained from the STFT of a 30-second Blues audio file. Each audio segment is divided into 10 fragments, with each fragment sized at 256 $\times$ 256.

4.2.2 Evaluation metrics

In this study, accuracy and test loss are used as the evaluation metrics to assess the classification performance of the proposed approach. In addition, confusion matrices are used to show more directly the effect of our model on the classification bias.

4.2.3 Implementation details

In this experiment, we employed the Adam optimizer for model training and used random horizontal flipping of images as a data augmentation technique to mitigate overfitting. The model was trained with a batch size of 50 for 50 epochs. Following the strategy of ConvNeXt, we increased the number of channels from 64 to 96 and set the margin parameter $m=0.4$. To enhance training effectiveness, we employed a dynamic learning rate adjustment strategy. The initial learning rate was set to 1e-2, and after 5 epochs, the optimizer’s learning rate was multiplied by 0.7. Finally, for the classification decision, we adopted a voting scheme to determine the recognition result for the entire song: the final class is obtained by aggregating the predictions for all fragments of the song through voting.
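A minimal sketch of the song-level voting step is given below. It assumes a simple plurality vote over the per-fragment predictions; the function name `vote_song_label` and the tie-breaking behavior (the label seen first among the tied ones wins) are our assumptions, since the text does not specify them.

```python
from collections import Counter


def vote_song_label(fragment_predictions):
    """Return the label predicted for the largest number of fragments."""
    counts = Counter(fragment_predictions)
    # most_common orders equal counts by first occurrence (Python 3.7+).
    label, _ = counts.most_common(1)[0]
    return label


if __name__ == "__main__":
    # Hypothetical predictions for the 10 fragments of one song.
    predictions = ["rock", "rock", "blues", "rock", "country",
                   "rock", "blues", "rock", "rock", "metal"]
    print(vote_song_label(predictions))  # rock
```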

5. Experimental results

5.1 Selection of the angular margin values

The angular margin is an important parameter that influences the genre-sensitive adjustment layer. To investigate the impact of the parameter $m$ on the classification performance, we conducted a series of experiments on the GTZAN, ISMIR2004, and Extended Ballroom datasets. We varied the value of $m$ from 0.1 to 0.5 and evaluated the model’s accuracy for each value. As shown in Fig. 6, our model achieved the highest classification accuracy when $m=0.4$ .

Figure 6.

Accuracy of MusicNeXt under different angular margin values on GTZAN, ISMIR2004, and Extended Ballroom. The horizontal axis represents the angular margin value, while the vertical axis represents the accuracy.

Figure 7.

Experimental results of comparing different stacking block ratios in MusicNeXt and MusicNeXt with softmax.

5.2 Selection of the stacking block ratios

To study the performance differences of different feature extraction module stacking ratios, we conducted experiments by varying the stacking ratios, and the experimental results are shown in Fig. 7. We can observe that the 3:3:9:3 stacking ratio for the feature extraction block achieved the best classification performance for both the MusicNeXt and MusicNeXt with softmax models on the three datasets. This indicates that appropriately increasing the stacking ratio of stage-3 improves the classification performance, but too many stage-3 modules result in a worse performance. Based on the results in Fig. 7, we chose the 3:3:9:3 stacking ratio as the baseline for the feature extraction module.

5.3 Comparison with the state-of-the-art

Table 2

Comparison of the performance of different state-of-the-art methods on the GTZAN dataset

Method	Accuracy (%)
Bian et al. [8]	90.20
Chang et al. [10]	91.49
Cai et al. [21]	91.80
Zhao et al. [28]	81.10
Li et al. [29]	74.50
MusicNeXt (ours)	92.45

Table 3

Comparison of the performance of different state-of-the-art methods on the ISMIR2004 dataset

Method	Accuracy (%)
Medhat et al. [30]	86.04
Ng et al. [27]	92.46
Chang et al. [10]	91.91
Cai et al. [21]	82.90
MusicNeXt (ours)	92.13

Table 4

Comparison of the performance of different state-of-the-art methods on the Extended Ballroom dataset

Method	Accuracy (%)
Liang et al. [31]	95.10
Yu et al. [9]	92.70
Ng et al. [27]	95.50
Liu et al. [11]	94.50
MusicNeXt (ours)	95.82

We compare our proposed method with state-of-the-art methods on the GTZAN, ISMIR2004, and Extended Ballroom datasets. According to the results shown in Tables 2–4, our method achieves the highest classification accuracy on the GTZAN and Extended Ballroom datasets, reaching 92.45% and 95.82%, respectively. On the ISMIR2004 dataset, the FusionNet method proposed by Ng et al. [27] achieves the best classification performance (92.46% versus 92.13% for our method). FusionNet combines eight different features to obtain its best test results on each dataset. However, when it uses the mel-spectrogram as a single feature, its classification accuracy is only 86.42%. This indicates that our method achieves the best classification performance when music genre classification is performed with a single model on a single input feature.

MusicNeXt demonstrates its classification advantage on the GTZAN dataset, which contains a significant amount of genre fusion. In the ISMIR2004 dataset, similar genres are merged into a single class for classification tasks due to an imbalanced distribution of song quantities. The Extended Ballroom dataset clusters different dance genres with similar timbres together. These two datasets artificially eliminate some cross-genre fusion issues, yet our model still exhibits strong feature extraction capabilities for classification. Overall, the results demonstrate that our method exhibits excellent generalization capabilities on datasets of different scales, suggesting potential applications in handling other MIR problems in the future.

Table 5

The ablation to evaluate the effectiveness of the proposed method with three baseline models on the GTZAN, ISMIR2004, and Extended Ballroom datasets

Model	Classifier	GTZAN loss	GTZAN accuracy	ISMIR2004 loss	ISMIR2004 accuracy	Extended Ballroom loss	Extended Ballroom accuracy
ResNet-50	Softmax	0.4395	0.8470	0.3451	0.8834	0.2917	0.9086
ResNet-50	L-softmax	0.3168	0.8884	0.3208	0.8917	0.3053	0.9036
ConvNeXt	Softmax	0.8882	0.6880	0.6886	0.7724	0.4572	0.8199
ConvNeXt	L-softmax	0.7462	0.7562	0.7745	0.7183	0.4389	0.8456
MusicNeXt	Softmax	0.3100	0.8981	0.3031	0.8992	0.2257	0.9470
MusicNeXt	L-softmax	0.2593	0.9245	0.2568	0.9213	0.1723	0.9582

Table 6

Confusion matrix (%) of MusicNeXt with softmax on the GTZAN dataset

	(0)	(1)	(2)	(3)	(4)	(5)	(6)	(7)	(8)	(9)
(0) Blues	92	0	2	1	0	0	0	0	0	5
(1) Classical	0	99	0	0	1	0	0	0	0	0
(2) Country	1	3	87	1	0	0	0	1	1	6
(3) Disco	0	0	1	93	1	1	0	3	0	1
(4) Hiphop	0	0	1	0	96	0	0	1	1	1
(5) Jazz	2	3	2	1	0	92	0	0	0	0
(6) Metal	0	0	0	0	3	0	87	0	0	10
(7) Pop	0	0	0	3	4	1	0	81	8	3
(8) Reggae	1	0	1	1	0	1	0	1	95	1
(9) Rock	4	0	3	6	5	0	4	0	1	76

Table 7

Confusion matrix (%) of MusicNeXt with L-softmax on the GTZAN dataset

	(0)	(1)	(2)	(3)	(4)	(5)	(6)	(7)	(8)	(9)
(0) Blues	95	0	2	1	0	0	0	0	0	2
(1) Classical	0	100	0	0	0	0	0	0	0	0
(2) Country	3	1	92	1	0	0	0	1	1	3
(3) Disco	0	0	1	93	1	1	0	2	0	2
(4) Hiphop	0	0	1	0	97	0	0	2	0	0
(5) Jazz	1	3	2	0	0	94	0	0	0	0
(6) Metal	0	0	0	0	2	0	93	0	0	5
(7) Pop	0	0	0	1	2	1	0	91	4	1
(8) Reggae	1	0	0	0	1	1	0	1	96	0
(9) Rock	2	0	3	2	1	0	4	0	1	87

5.4 Ablation study

As shown in Table 5, networks with L-softmax outperform their softmax counterparts in most settings; the exceptions are ResNet-50 on Extended Ballroom (0.9036 versus 0.9086) and ConvNeXt on ISMIR2004 (0.7183 versus 0.7724). This suggests that L-softmax, with its large-margin angle constraint, generally provides better class discrimination. Furthermore, our model achieved the highest classification accuracy among the compared models on the GTZAN, ISMIR2004, and Extended Ballroom datasets, reaching 92.45%, 92.13%, and 95.82%, respectively. These results demonstrate our model’s ability to extract deep features and its good compatibility with L-softmax.

For the GTZAN dataset, existing methods [11, 21] tend to misclassify other genres as Rock. This is consistent with the understanding that Rock often blends with and influences other genres, and it highlights the impact of cross-genre similarity on classification. As shown in the confusion matrices in Tables 6 and 7, our model mitigates this issue by reducing the bias towards a specific genre and improving the overall classification accuracy.
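One simple way to quantify the bias towards Rock is to sum the off-diagonal entries of the Rock column, that is, the percentage mass of other genres predicted as Rock. The sketch below does this for the values in Tables 6 and 7, assuming that rows are true genres and columns are predicted genres; the helper names are hypothetical and this measure is our illustration rather than a metric reported in the experiments.

```python
GENRES = ["Blues", "Classical", "Country", "Disco", "Hiphop",
          "Jazz", "Metal", "Pop", "Reggae", "Rock"]

# Table 6: MusicNeXt with softmax (rows assumed true, columns predicted, in %).
SOFTMAX = [
    [92, 0, 2, 1, 0, 0, 0, 0, 0, 5],
    [0, 99, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 3, 87, 1, 0, 0, 0, 1, 1, 6],
    [0, 0, 1, 93, 1, 1, 0, 3, 0, 1],
    [0, 0, 1, 0, 96, 0, 0, 1, 1, 1],
    [2, 3, 2, 1, 0, 92, 0, 0, 0, 0],
    [0, 0, 0, 0, 3, 0, 87, 0, 0, 10],
    [0, 0, 0, 3, 4, 1, 0, 81, 8, 3],
    [1, 0, 1, 1, 0, 1, 0, 1, 95, 1],
    [4, 0, 3, 6, 5, 0, 4, 0, 1, 76],
]

# Table 7: MusicNeXt with L-softmax.
L_SOFTMAX = [
    [95, 0, 2, 1, 0, 0, 0, 0, 0, 2],
    [0, 100, 0, 0, 0, 0, 0, 0, 0, 0],
    [3, 1, 92, 1, 0, 0, 0, 1, 1, 3],
    [0, 0, 1, 93, 1, 1, 0, 2, 0, 2],
    [0, 0, 1, 0, 97, 0, 0, 2, 0, 0],
    [1, 3, 2, 0, 0, 94, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 0, 93, 0, 0, 5],
    [0, 0, 0, 1, 2, 1, 0, 91, 4, 1],
    [1, 0, 0, 0, 1, 1, 0, 1, 96, 0],
    [2, 0, 3, 2, 1, 0, 4, 0, 1, 87],
]


def misassigned_to(matrix, cls):
    """Sum of the off-diagonal entries in column `cls`."""
    return sum(row[cls] for i, row in enumerate(matrix) if i != cls)


if __name__ == "__main__":
    rock = GENRES.index("Rock")
    for name, matrix in (("softmax", SOFTMAX), ("L-softmax", L_SOFTMAX)):
        print(name, misassigned_to(matrix, rock), matrix[rock][rock])
```

With the tabulated values, the mass wrongly assigned to Rock drops from 27 to 13 percentage points, while the share of Rock tracks classified correctly rises from 76% to 87%.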

Figure 8 compares the training loss curves of MusicNeXt and three baseline networks on the GTZAN dataset. The results indicate that our network has a stronger learning capacity, enabling it to learn the deep features of spectrograms rapidly.

Figure 8.

Comparison of MusicNeXt and other baseline models in music genre classification, where MusicNeXt-S is MusicNeXt with softmax. The horizontal axis represents the number of epochs, while the vertical axis represents the loss.

6. Conclusion

In this paper, we introduce a CNN architecture called MusicNeXt for precise music genre classification. It utilizes temporal information from spectrograms to extract musical features and employs intra-class angle constraints to strengthen the learning of differences between genres. As a result, our model can accurately classify fused music and address the issue of classification bias. We stack feature extraction modules to extract time-related and frequency-related information from music using deep convolutions. Additionally, we utilize the genre-sensitive adjustment layer to enhance intra-class compactness and significantly improve the classification performance for fused music. We evaluate our model on multiple datasets and demonstrate its superiority over state-of-the-art methods, showing its effectiveness for MGC tasks.

In the present work, we adopt a genre-sensitive adjustment layer based on L-softmax, which improves feature representation learning but is sensitive to the data distribution, so it may not yield better results than softmax on other datasets. Therefore, in future work, we plan to explore other softmax variants to improve classification performance and to further investigate the performance of our proposed model in other MIR tasks.

Acknowledgments

This work is supported by the Tianjin “Project + Team” Key Training Project under Grant No. XC202022.

References

1. Chaki, Pattern analysis based acoustic signal processing: A survey of the state-of-art, Int. J. Speech Technol 24(4) (2021), 913–955.

2. Tlacael Miguel Esparza, Juan Pablo Bello and Eric J. Humphrey, From Genre Classification to Rhythm Similarity: Computational and Musicological Insights 44 (2015), 39–57.

3. Costa, Y.M.G., Oliveira, L.S., Jr., C.N.S., An evaluation of Convolutional Neural Networks for music classification using spectrograms, Appl. Soft Comput 52 (2017), 28–38.

4. Liu, Mao, Feichtenhofer, Darrell, Xie, A ConvNet for the 2020s, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022, IEEE, 2022, pp. 11966–11976.

5. Choi, Fazekas, Sandler, M.B., Cho, Convolutional recurrent neural networks for music classification, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5–9, 2017, IEEE, 2017, pp. 2392–2396.

6. Zhang, Qian, Sun, Singer Identification Using Deep Timbre Feature Learning with KNN-NET, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6–11, 2021, IEEE, 2021, pp. 3380–3384.

7. Dong, Yang, Zhao, Bidirectional convolutional recurrent sparse network (BCRSN): An efficient model for music emotion recognition, IEEE Trans. Multim 21(12) (2019), 3150–3163.

8. Bian, Wang, Zhuang, Yang, Wang, Xiao, Audio-Based Music Classification with DenseNet and Data Augmentation, in: PRICAI 2019: Trends in Artificial Intelligence – 16th Pacific Rim International Conference on Artificial Intelligence, Cuvu, Yanuca Island, Fiji, August 26–30, 2019, Proceedings, Part III, Springer, 2019, pp. 56–65.

9. Luo, Liu, Qiao, Liu, Feng, Deep attention based music genre classification, Neurocomputing 372 (2020), 84–91.

10. Chang, Chen, Lee, MS-SincResNet: Joint Learning of 1D and 2D Kernels Using Multi-scale SincNet and ResNet for Music Genre Classification, in: ICMR ’21: International Conference on Multimedia Retrieval, Taipei, Taiwan, August 21–24, 2021, ACM, 2021, pp. 29–36.

11. Liu, Feng, Liu, Wang, Liu, Bottom-up broadcast neural network for music genre classification, Multim. Tools Appl 80(5) (2021), 7313–7331.

12. Zhang, Qian, Sun

13. Logan, Mel Frequency Cepstral Coefficients for Music Modeling, in: ISMIR 2000, 1st International Symposium on Music Information Retrieval, Plymouth, Massachusetts, USA, October 23–25, 2000, Proceedings, IEEE, 2000, pp. 293–302.

14. Zhang, S.Z., Content-based audio classification and segmentation by using support vector machines, Multim. Syst 8(6) (2003), 482–492.

15. Rabiner, L.R., Juang, Fundamentals of speech recognition, Prentice Hall signal processing series, Prentice Hall, 1993.

16. Ogihara, Toward intelligent music information retrieval, IEEE Trans. Multim 8(3) (2006), 564–574.

17. Tzanetakis, Cook, P.R., Musical genre classification of audio signals, IEEE Trans. Speech Audio Process 10(5) (2002), 293–302.

18. Dieleman, Schrauwen, End-to-end learning for music audio, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4–9, 2014, IEEE, 2014, pp. 6964–6968.

19. Choi, Fazekas, Sandler, M.B., Cho

20. Liang, Lei, Chan, P.Y., Yang, Sun, Chua, PiRhDy: Learning Pitch-, Rhythm-, and Dynamics-aware Embeddings for Symbolic Music, in: MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event/Seattle, WA, USA, October 12–16, 2020, ACM, 2020, pp. 574–582.

21. Cai, Zhang, Music genre classification based on auditory image, spectral and acoustic features, Multim. Syst 28(3) (2022), 779–791.

22. Clevert, Unterthiner, Hochreiter, Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), in: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings, arXiv preprint arXiv:1511.07289, 2016.

23. Kim, Won, Serra, Liem, C.C.S., Transfer Learning of Artist Group Factors to Musical Genre Classification, in: Companion of The Web Conference 2018, WWW 2018, Lyon, France, April 23–27, 2018, ACM, 2018, pp. 1929–1934.

24. Zhong, Wang, Jiao, MusicCNNs: A New Benchmark on Content-Based Music Recommendation, in: Neural Information Processing – 25th International Conference, ICONIP 2018, Siem Reap, Cambodia, December 13–16, 2018, Proceedings, Part I, Springer, 2018, pp. 394–405.

25. Cano, Gómez, Gouyon, Herrera, Koppenberger, Ong, Serra, Streich, Wack, ISMIR 2004 Audio Description Contest, 2006.

26. Marchand, Peeters, The Extended Ballroom Dataset, in: ISMIR 2016 Late-Breaking Session, 2016, https://hal.science/hal-01374567/file/ISMIR2016LBD-ExtendedBallroom.pdf

27. W.W.Y., Zeng, Wang, Multi-level local feature coding fusion for music genre recognition, IEEE Access 8 (2020), 152713–152727.

28. Zhao, Zhang, Zhu, Zhang, S3T: Self-Supervised Pre-Training with Swin Transformer For Music Classification, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, IEEE, 2022, pp. 606–610.

29. Han, Wang, Yuan, Yang, Yan, Combined angular margin and cosine margin softmax loss for music classification based on spectrograms, Neural Comput. Appl 34(13) (2022), 10337–10353.

30. Medhat, Chesmore, Robinson, Masked Conditional Neural Networks for sound classification, Appl. Soft Comput 90 (2020), 106073.

31. Liang, Zhou, Wan, Shu, Deep Neural Networks with Depthwise Separable Convolution for Music Genre Classification, in: 2019 IEEE 2nd International Conference on Information Communication and Signal Processing (ICICSP), 2019, pp. 267–270.
