Abstract
With the growth of people’s demand for personalized music, how to use AI technology to achieve accurate understanding and creative transformation of music styles has become an important topic. In this study, a transfer learning algorithm based on a deep learning framework is designed to automatically identify and simulate different music style characteristics in order to break through the traditional music creation mode. By pre-training a large-scale multi-style music library and then fine-tuning it for a specific target style, the effective migration of music styles is achieved. The experimental data show that this method can significantly improve the accuracy of style conversion and make the similarity of the generated music works in timbre, melody, rhythm, and other dimensions reach more than 92% while maintaining good novelty and diversity. In order to verify the audience acceptance of the generated works, this study invited participants from different age groups and musical preferences to conduct a listening comparison experiment. The results show that compared with the direct use of non-transfer learning models or artificially created music, the works generated based on transfer learning algorithms have achieved higher praise rates, especially in the two key indicators of innovation and emotional resonance, which have improved, respectively. About 23% and 16%.
Introduction
In the digital information age, the integration of art and technology has become one of the important driving forces to promote cultural innovation. Music, as an art form that crosses languages and national boundaries and touches people’s hearts, is ushering in unprecedented changes with the blessing of technology.1,2 In recent years, the development of artificial intelligence, especially deep learning technology, has opened up a new path for the exploration of the music field. Among them, transfer learning, as a cutting-edge technology, is gradually showing its great potential in music style transfer and creation. 3
Traditional music creation often depends on the accumulation of artists’ personal experience and inspiration explosion, and this process is full of uncertainty and difficult to replicate on a large scale. 4 However, with the advent of the era of big data, massive music data provides rich training materials for machine learning, which enables computers not only to understand and imitate human creative styles but even to innovate on this basis to create new works with unique charm. 5 Transfer learning, that is, the method of improving learning efficiency by applying knowledge acquired in one domain (source domain) to another related but different domain (target domain), is the key bridge connecting technology and art, tradition and innovation.6,7
In the field of music, the application of transfer learning is mainly reflected in two aspects: one is music style transfer, that is, the ability to let the model learn to change from one style to another; The second is music creation based on this ability, that is, the fusion of different elements to generate novel works.8,9 For example, a model trained in classical music can quickly adapt to the stylistic characteristics of popular music through transfer learning and then create musical works with both classical charm and modern flavor.
However, it is not easy to realize the transfer and creation of high-quality music styles. First of all, music is a complex form of artistic expression involving information in multiple dimensions, such as melody, rhythm, and harmony. How to effectively capture and express these abstract concepts poses challenges to existing algorithms. 10 The boundaries between different styles are not clear. How to accurately identify and transfer these nuances tests the design wisdom of the algorithm. The emotional transmission of music is its core value, and whether machines can truly understand and transmit emotions is still an urgent problem to be solved. 11 To meet the technical challenges of music style transfer and creation, we have taken the following measures: first, we use more advanced feature representation technology and transfer learning models to extract deep features of audio signals, learn and extract more distinguishing and representative features, more accurately capture the subtleties of music style, and ensure that the music remains coherent and natural when incorporating new style elements. A transfer learning algorithm that can adapt to different styles has been developed, and the strategies and parameters are flexibly adjusted to realize style transfer and creation effectively. In addition, this research aims to solve the limitations of current transfer learning algorithms in dealing with the complexity and diversity of music styles, develop more complex and intelligent algorithms, capture the nuances of music styles more comprehensively, and avoid the problem of overfitting a specific style or being difficult to generalize to an unseen style. Additional context-aware information, music metadata, artist background, and listener preferences are incorporated to enhance the algorithm’s ability to understand and express music styles.
By introducing transfer learning algorithms, this study aims to build a system that is efficient, flexible, and responsible for handling music creation. Compared with existing music style transfer and composition methods, our process uniquely uses transfer learning algorithms, innovative pre-processing steps, and new style transfer strategies that combine musical characteristics. We hope the system can provide unlimited creative inspiration for musicians, bring listeners a more diversified auditory feast, and ultimately promote the development of the music industry and even the entire cultural sector in a more prosperous direction.
Related theoretical techniques
Fundamentals of music
Music can be divided into three parts: pitch domain (tonality and scale), rhythm and beat of music, and keynote of song (chord). 12 Music is composed of multiple notes, which include pitch, level, interval, scale, and mode. 13 Interval, as the basic element in music composition, is essentially to describe the spatial distance between two notes. According to the law of 12 equals, a complete octave is evenly divided into 12 different pitches, and between these pitches, two adjacent intervals are exactly one half-tone distance. From the beginning of the tonic to the reappearance of the tonic in the next octave, all the notes in this process are arranged in a stepped way according to a certain order, from low to high or from high to low. This arrangement of notes with a clear order and structure is called a scale. The mode of music can be regarded as a specific scale in essence, and each mode has its unique scale structure. These scales with different structures endow different modes with unique charm and expressive force. Modes are not only divided into natural major and minor, but common modes also include harmonic minor, melodic minor, and harmonic major. 14
Melody, harmony and rhythm form the basic elements of a musical composition. 15 The speed of the beat directly determines the rhythm benchmark that musicians or MIDI instruments follow when playing. The basic notation method of note rhythm is mainly composed of two parts: the first part is the rhythm head, which is hollow or solid; the second part is the rune stem, which is closely connected with the rune head. The length of the hollow header is twice the length of the hollow header of the adding bar, and the length of the solid header of the adding bar is half the length of the hollow header of the adding bar; add a tail, and the time required will be cut in half. When the rule head is on the third line or higher, the rule stem should move downward. When the rune head is on the third line or lower, the rune stem rises. The beat constitutes the core morphology of the song.16,17 The beat pattern is orderly divided into the basic units of two beats, three beats or four beats, and is clearly distinguished and defined by the bar line, the “punctuation mark” in music, thus ensuring the rigor of music structure and the fluency of rhythm.
Dataset description
This study uses a dataset called the Diverse Musical Style Repository, derived from contributions from significant music platforms and independent music producers worldwide, ensuring a wide range and diversity of musical styles. The dataset contains thousands of music tracks, covering a variety of music genres such as classical, jazz, rock, and pop. Each track details its key characteristics, such as duration, rhythmic pattern, harmonic structure, and instrument use, providing rich material for in-depth analysis of musical styles.
The raw audio data was meticulously pre-processed. This step includes data cleansing, which involves noise removal, correction for temporal errors, and standardization of audio levels to ensure the quality and consistency of the dataset. Then, we divide the long audio file into manageable segments according to the musical structure and annotate these passages with relevant metadata (such as music genre and artist) so that the model can learn the music’s local characteristics and overall structure. In the feature extraction session, we elaborate on the method of converting the original audio signal into an informative and easy-to-manage representation, using techniques such as Mel’s frequency cepstrum coefficient (MFCC), pitch profile analysis, and rhythmic histogram to extract key information that can characterize the musical style. To ensure that these features have a uniform scale and are suitable as inputs to the transfer learning model, we normalize them. In addition, we explore the possibility of feature selection. If applicable, we will employ techniques such as principal component analysis (PCA) or mutual information-based selection to reduce the number of features to the features most relevant to the transfer learning algorithm to improve the algorithm’s efficiency. At the same time, if possible, we will also consider using data augmentation techniques such as pitch shift, time stretching, or applying various filters and effects to increase the diversity and scale of the training dataset to improve the performance of transfer learning algorithms in music style transfer tasks.
However, during the construction of the dataset, we were aware of the potential biases that could arise from selection criteria, data collection methods, and sample representativeness. In the results section, we provide an in-depth analysis of the possible impact of these biases on the study’s results, including how they affect the performance of our proposed methods and the limitations we face in generalizing the results. To alleviate these issues, we propose and implement strategies, such as diversifying the dataset to cover a broader range of musical styles and composers and using techniques such as oversampling or undersampling to balance the dataset. In addition, we performed sensitivity analyses to evaluate the performance of our methods on different subsets of datasets. We compared the results with those obtained from the entire dataset to assess the robustness of our results against changes in the dataset.
Transfer learning
The core idea of transfer learning is to use the training data taken from one problem to optimize the solution of another. It is a learning approach based on the connection between existing knowledge and new situations.18,19 In the case of limited labeling data, transfer learning is especially important to obtain better labeling results. On the other hand, in the process of parameter training using deep neural networks in the past, the introduction of transfer learning can significantly reduce the time consumption. Transfer learning encompasses two core concepts, one of which is called a domain.
20
We have clarified the method for hyperparameter tuning, employing grid search, random search, and cross-validation strategies to ensure that the selected hyperparameters have good generalization capabilities on unseen data.
In equation (1), Transfer learning algorithm.
Inductive transfer learning is a learning method in which the source domain and the target domain are consistent, but the source task and the target task show significant differences.22,23 Depending on whether the source domain contains labeled data, inductive transfer learning can be subdivided into two more specific subcategories: multitask learning and autonomous learning. Both learning methods focus on extracting useful information for the target task from the knowledge in the source domain, but they have different emphases on how to deal with labeled data and task relevance. Unsupervised transfer learning and inductive transfer learning are consistent at the domain level; that is, they both focus on the same or similar domain scope. Unsupervised transfer learning particularly emphasizes the absence of labeled data in the source domain and the target domain and focuses more on mining useful information from unlabeled data. Direct-push transfer learning presents a different characteristic. In this learning style, the source task and the target task are similar; that is, the problems they want to solve or the goals they achieve are similar. However, unlike inductive transfer learning, direct transfer learning involves different fields. This means that although the source task and the target task are similar in objectives, there are differences in the scenarios or backgrounds they are applied to. There is usually a large amount of labeled data in source tasks, which provides strong support for learning tasks. In the target domain, labeled data is often missing or insufficient for various reasons. This feature makes direct-push transfer learning necessary to pay more attention to how to use the labeled data of the source domain to assist the learning tasks of the target domain. This classification can be further subdivided into two subcategories: differences in feature spaces and differences in edge probability distributions.
24
Another concept is the task, and the calculation is shown in equation (2).
T represents the specific target, y represents the label space, and f (·) represents the target prediction function.
Research on music style transfer and creative methods
Migration music style migration and creation calculation method
Arbitrary style transfer, a process that aims to skillfully merge a content image c with a style image s that contains a specific musical inspiration, is used to create a new image that not only retains the essential characteristics of content image C but also subtly incorporates the unique musical visual elements of the style image p.
25
When selecting a pre-trained model, we considered its architecture, performance in the relevant tasks, and ability to capture musical features closely related to style transfer. Regarding feature extraction and representation, we have carefully modified the process to involve adjusting the input representation, introducing additional layers or modules to capture style-specific features precisely, and applying specific regularization techniques to ensure that the style shift is achieved while the content is maintained. Regarding training strategy, we implemented optimization measures, including hyperparameter tuning, learning rate scheduling, and data augmentation techniques to improve the performance of music style transfer tasks. Finally, when evaluating the algorithm’s effectiveness, we employ detailed evaluation metrics, including quantitative style similarity scores, content retention metrics, and qualitative evaluations through audience or expert reviews. In addition, we also found that using low-order spatial statistical properties can effectively express the visual similarity of music styles with the same features, and the low-order statistical properties shared between visual textures and musical styles make it possible to extract the same visual texture features. Furthermore, assuming that the visual texture is homogeneous in the distribution of features, we use the Gram matrix, a mathematical tool, to represent the low-level spatial statistics, and the Gram matrix, as a symmetrical matrix of squares, plays a crucial role in the style transfer task. In this context, the complete optimization object in style transfer can be expressed as equation (3):
Among them, f
l
(x) is the feature map of layer l, which carries the key information of the image processed by this layer. n
l
represents the total number of weight units in layer l, which together constitute the parameter set of this layer. G(f
l
(x)) is the Gram matrix extracted from the lth layer feature map, which reflects the correlation between features.
26
This study employs a style transformation network, using an encoder/decoder structure, providing different normalization parameters for each painting style. Conditional instance normalization (CIN) is adopted, in which the activation value z of each weight unit is set to a value related to the painting style, and the calculation formula is shown in equation (6). In this way, CIN can adjust the parameters of the network according to different painting styles to achieve more flexible and accurate style migration.
In the above formula, μ and σ, respectively, represent the mean value and standard deviation of the feature map in the whole hierarchical space and together constitute the basic statistics of the hierarchical space. γ
s
and β
s
constitute a combination of linear transformations, which are used to express the learnable mean (γ
s
) and learnable standard deviation (β
s
) of each weight unit. According to each painting style diagram s, the style characteristics are quasi-described. Throughout the network, a learnable set of linear transformation parameters {γ
s
, β
s
} is embedded. It enables vectors to capture unique stylistic elements in paintings. The style transformation functional network architecture is named
By avoiding downsampling and adopting a series of alternative methods to realize multi-level context aggregation, the displacement invariance can be effectively preserved. In this study, the void dilation convolution method is used to enhance the feature sampling of the convolution filter.27,28 In order to significantly improve the robustness of the music style transformation network, this study introduces the loss of temporal consistency to ensure that the changes between output frames can be strictly consistent with the changes between input frames. First, w = (u, v) is defined as the forward optical flow field between the input frames c
t
and c
t-1
to measure the position movement relationship of each pixel in the images c
t
and c
t-1
. It is assumed that a certain pixel of the output image pt should theoretically have the same chromaticity value as its p
t
at the corresponding position of p
t-1
. The loss function effectively constrains the network to maintain consistency with the inter-frame changes of the input image when generating the output image, thereby improving the overall performance of the music style transformation network, as shown in equation (7):
When using the forward optical flow field technique, we first output the video frame p
t
through deformation, thus obtaining the deformed frame
In the above formula, m
t
(h,w)∈[0,l], 0 is used to represent the region where the occlusion and motion boundary are located, and l correspondingly represents the effective region in the image. H and W are the specific values of the height and width of the input frame, respectively. L
o
stands for the target domain. m
t
stands for multitask learning. In the output musical style, the temporal consistency loss is computed. When processing through advanced feature mapping, the feature vectors corresponding to objects should also be consistent. The temporal consistency loss of the feature mapping hierarchy can compensate for the timing inconsistency of the feature mapping of two consecutive input music style frames in the neural network. The formula for calculating the time consistency loss of the feature mapping hierarchy is shown in equation (9).
Network structure and training process
The overall network architecture shown in Figure 2 deeply integrates two subnet systems, and the design reflects a deep understanding and granular control of the musical style transition process. Regarding the number of training cycles, we selected 100 epochs to ensure the model can thoroughly learn the music style transformation features and avoid overfitting. For the batch size, we use 32 as the batch size, which balances the constraints of hardware resources with the training efficiency, which can effectively learn features without increasing the training time excessively. The comprehensive part elaborates on the neural network architecture used for transfer learning, including the specific structure of each layer, the selection of activation functions, and the connection mode. At the same time, the architecture introduces the pre-trained model as the starting point of transfer learning. It provides an overview of the architecture, training details, and reasons for selecting the model. To meet the needs of music style conversion and creation, the pre-trained model is also customized and adjusted, including network architecture optimization, loss function improvement, and optimization technology innovation. In particular, the style prediction mechanism introduced in the architecture can more effectively guide the style transition and ensure that the converted music is consistent with the target style, which increases the flexibility, controllability, stability, and consistency of the style transition compared with the existing methods. In addition, the overall architecture design also considers the optimization of computing efficiency and resource utilization. It reduces the computational complexity and resource consumption while ensuring the style conversion effect through reasonable network structure and parameter configuration, which is more feasible and efficient in practical applications. Network overall structure and training schematic diagram.
The network combines the instance normalization processing after the first three convolutional layers into one layer, replacing the original batch normalization layer. At the same time, in order to eliminate the possible checkerboard effect of the deconvolution layer, the neighbor upsampling is combined with the ordinary convolution layer. Except for the initial three convolution layers, the network is behind all other convolution layers. An instance conditional normalization layer is added to jointly complete the refinement of the music style. After three consecutive convolution blocks are processed, the feature map size of the input picture is reduced to a quarter of the original size. Subsequently, the model further accelerates the convergence process by cleverly connecting five residual blocks. Finally, the stylized result is successfully generated on two sampling points of the network as well as an additional convolution block. Selecting the loss function is critical because it successfully generates a stylized result on two sampling points of the network and an additional convolutional block. In the network training process, we use a three-part loss function composed of content loss Lc, style loss Ls, and time consistency loss Lt. Each loss function involves specific trade-offs that profoundly impact the balance between content retention and style transfer. By comparing the performance of different loss functions, we gain insights into why some loss functions outperform others regarding music style transfer. This provides strong support for us in optimizing network performance further. The loss function in the network training process consists of three parts: content loss L
c
, style loss L
s
and time consistency loss L
t
, which act together on the optimization process of the network, as shown in equation (11).
In terms of time complexity, based on key parameters such as the size of the input music data, the length of the music clip, and the dimension of stylistic features, we analyzed the time complexity of each operation in detail, including the number of nested loop iterations, the matrix operation dimension, and the complexity of other computationally intensive tasks, and obtained an accurate evaluation. At the same time, in terms of space complexity, we also comprehensively considered the memory space required to store the input data, intermediate results, and final output music works. To fully evaluate the performance, we compare the proposed method with the existing music style conversion methods based on deep learning (which faces long training time and high resource consumption), signal processing (which is challenging to maintain music quality), and template matching (which is limited in stylistic diversity), and found that our method has potential advantages in computational efficiency, which can achieve faster processing speed and lower resource consumption while maintaining high music quality.
Experimental results and analysis
The music style transfer technology based on a transfer learning algorithm generates a novel style by capturing the style characteristics of the source music work and transferring it to the target music work. When evaluating the creativity of the music generated by this technology, we comprehensively consider the accuracy and innovation of style transfer, the ability to capture and transfer style characteristics, and the naturalness of style integration. At the same time, it pays attention to the maintenance and innovation of music content. It measures the ability to integrate new styles based on retaining the essence of original music works through content similarity and novelty indicators. In addition, emotional expression and listener resonance are also emphasized, and the performance of generated music in emotional transmission and listener acceptance is evaluated through emotional similarity and listener satisfaction indicators. Based on these evaluation indicators, we can comprehensively and objectively assess the creativity of music generated by the music style transfer technology based on transfer learning algorithms and provide a valuable reference for optimizing and improving algorithms.
Figure 3 is a comparison chart of generating fast and slow music. When observing slow-tempo generated music pieces, it was found that the same note occurred so many times that the duration of some notes even exceeded a bar. Compared with slow-paced music clips, music clips generated by fast-paced music clips show different characteristics. In this fast-paced music, there is no situation in which the same note lasts too long. On the contrary, the melody they produce presents a ladder-like change, with a clear and clear rhythm that is easy to distinguish. Comparison chart of generating fast and slow music. (a) and (b) show the visualization effects of musical notes in the comparative experiment of generating fast and slow music, (c) and (d) respectively show the results of the music effects in the experiment.
Further comparison of the music fragments generated by slow rhythm and fast rhythm shows that the music generated in slow rhythm is still insufficient compared with fast rhythm. This is mainly reflected in the model’s learning of long-time notes in slow rhythm, and the model’s processing ability in this respect needs to be strengthened. For fast-paced music, the model shows good performance and can accurately capture and generate melodies with a distinct sense of rhythm.
Melody music theory scores.
Comparison of running complexity.
We observed the training loss curve for the negative log-likelihood (NLL) test, as shown in Figure 4. Leak-GAN shows a faster convergence speed on this indicator and has achieved good results. Leak-GAN consistently had the best NLL score throughout the training phase, while Rank-GAN performed the worst. This result shows that Leak-GAN can learn and adapt to the style characteristics more stably in the face of different styles of music input to generate music works that are more in line with expectations, which reflects the strong robustness of the method to input style changes in the style transfer task. NLL-test loss.
Mathematical-statistical indicators.
In the study, we randomly used MT-GPT-2, MT-Leak-GAN, and LSTM models to generate 50 pieces of music of equal length and compare them with the actual music in the dataset. To fully assess the quality of the generated music, we conducted a subjective evaluation, inviting musicians and musicology experts to participate in the survey, asking them to listen to the generated music and provide their opinions on the quality, creativity, coherence, and adherence to the target musical style. At the same time, we also calculated the mean values of each evaluation index, as shown in Figure 5. Based on the music evaluation criteria, we found that the music generated by MT-GPT-2 and MT-Leak-GAN was highly similar to the actual music and showed higher variability. In particular, in the progressive jump comparison and wavy detection, the music generated by the two models is highly similar to the actual music. Compared with the LSTM model, the music generated by MT-GPT-2 and MT-Leak-GAN is more musical, and the music generated by MT-GPT-2 is highly accurate; the notes are mainly in the key of C, and the melody is more suitable for the auditory beauty. We have integrated these subjective evaluations with objective results to provide a more complete picture of the performance of our methodology, further demonstrating the practical relevance and musical quality of the generated music. Evaluation index of music theory.
It is found from Figure 6 that when one layer is frozen, the migration strategy of the model achieves the optimal effect, which is better than the performance of directly fine-tuning the entire network, but the effect gradually decreases after the number of frozen layers increases. In view of the small scale of MIRE X-like data, if the entire network is fine-tuned, the network will not be able to be effectively updated, and thus, higher accuracy will not be achieved, especially when the first layer is frozen. In this study, the music texture features of the shallow network, such as tone and rhythm information, are retained in the pre-trained model. These low-level features are generic in genre and emotional content, so retraining can be avoided. Freezing this layer effectively reduces the training cost and enables the deep network to be effectively trained. Comparison of fine-tuning accuracy under different freezing layers.
Figure 7 shows the comparison of feature fusion effects of different convolutional layers in the feature migration method. The combination of the first layer and the fifth layer shows the best performance in feature fusion, which means that the fusion of low-level and high-level features can effectively achieve more accurate feature expression. The fused features can be further extracted from valuable information with the help of PCA (Principal Component Analysis) dimensionality reduction technology. Then, the feature vectors are inputted into SVM (Support Vector Machine) for in-depth training and classification. Through this process, the emotional elements contained in music can be more accurately identified and classified. The experimental data further show that in the process of feature combination, the features at all levels show good expressive force. Comparison of accuracy of fusing features of different convolutional layers.
The data in Figure 8 shows that compared with Johnson and Huang’s music stylization model, this model shows better performance in all given music segments. However, it is challenging to implement Runder’s iterative optimization-based method because it is slow to execute and difficult to apply to actual industrial scenarios. The speed of this model is quite close to that of the Johnson model, which can process up to 50 pictures per second, realizing real-time music processing capabilities. Comparative analysis of stylized models.
Figure 9 shows the PSNR values and SSIM values of various models after dealing with the migration of different types of music styles. From these data, it can be clearly seen that the PSNR values and SSIM values of music styles both reached the highest values after the migration of the improved method based on this study. The four methods also had the highest PSNR and SSIM values when dealing with human music styles. Compared with the CartoonGAN model, after the migration of the improved method, the PSNR value of the music style increased by 2.5% overall, while the SSIM value increased by 3.3% overall. PSNR under different models.
In order to ensure the practicality of the proposed modules, each module is independently incorporated into the basic network for training. The research methodology was based on the existing CartooGAN model, and four different sets of experiments were designed, as is shown in Figure 10. These experiments compare three different improvement schemes: one is to introduce residual block, the other is to introduce AdaPoLIN layer, the third is to introduce perceptual loss, and the third is to add the model proposed in this research. The SSIM and PSNR values of the reconstructed music style and the input music style of the music style reconstruction model were calculated, respectively. By adding the AdaPoLIN layer, the performance indicators of SSIM and PSNR have been significantly improved. At the same time, by introducing residual blocks and perceptual losses, the performance indicators of SSIM and PSNR have also been slightly improved. The results show that the above modules can effectively reconstruct high-quality music styles. Results of ablation experiment.
Challenges and future research on musical style transfer methods
While somewhat successful in terms of musical style transfer, our approach still faces several significant challenges. First, limited scalability is a key concern, which can affect the efficiency and performance of the method when working with large-scale music datasets, limiting its potential for application in a broader range of datasets. Secondly, the quality of the style transfer results still needs to be improved, especially in capturing and integrating the target style features more accurately while maintaining the integrity of the original music content, a significant problem the current method faces. In addition, the computing resources required in the training and inference process are relatively large, which not only puts forward high requirements for hardware conditions but also may limit the broad application of the method due to excessive resource consumption.
For example, the existing methods still have shortcomings in dealing with complex musical structures and achieving delicate capture and fusion of stylistic features. Our approach and future research direction aim to fill these gaps, promote the continuous progress of music style transfer and composition technology by introducing more advanced transfer learning technology and algorithm optimization, and inject new vitality and possibilities into music composition and style transformation.
Given the above limitations, we propose several potential future research directions. First, explore new transfer learning techniques, such as leveraging more complex hierarchies in transfer learning or introducing new regularization strategies, to improve the accuracy and efficiency of style transfer. Secondly, the computational complexity is reduced, and the resource utilization is optimized by optimizing the algorithm design, reducing redundant calculations, using more efficient computing frameworks, and reducing the dependence on hardware conditions. In addition, we plan to expand this method to new areas or use cases of music composition, such as applying style transfer techniques to music improvisation and music sentiment analysis, to diversify music style transformation and composition techniques further.
Conclusion
In the field of artificial intelligence, the development of deep learning technology has greatly promoted the progress of computer music creation, among which transfer learning, as an efficient learning method, shows great potential in cross-domain knowledge transfer. The purpose of this paper is to explore and practice the method of music style transfer and creation based on a transfer learning algorithm and to verify its effectiveness through a series of experiments. (1) This study constructs a large-scale music database that covers various music features. Using this database, first extract features through pre-trained neural network models, convert audio signals into processable data forms, and then use transfer learning strategies to fine-tune the model to adapt to the target music style, achieving knowledge transfer. The experimental results show that transfer learning models can more accurately identify and imitate the target style. New works are highly consistent with the target style, enhancing innovation and diversity. We provide detailed examples of successful style conversion to demonstrate the method’s effectiveness. In the process of transforming classical music style into jazz style, this model not only retains the original melody framework and harmony structure but also cleverly integrates the improvisation and rhythm changes of jazz music, making the transformed works not only retain the elegance of classical music but also increase the flexibility and freedom of jazz music. (2) To further evaluate the effectiveness of the method, the generated works were analyzed comparatively with those of human composers. The results showed that pieces generated by transfer learning scored an average of 18% improvement in listener preference tests, demonstrating the method’s ability to not only capture core elements of a specific musical style but also create a satisfying listening experience. (3) By comparing the performance of models under data sets of different sizes, it is found that when the training sample size increases, the quality of the generated works is significantly improved, but the marginal benefit gradually decreases, which means that there is an optimal data size. After this threshold, additional data will not bring the same proportion of performance improvement.
Statements and declarations
Footnotes
Conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
