Automatic generation of music elements based on artificial intelligence algorithms

Abstract

With the rapid development of computer information technology in China, algorithmic composition has received increasing attention in the field of musical composition. However, existing composition methods based on artificial intelligence algorithms have relatively low accuracy, and their output cannot meet practical requirements for complex tracks. In view of this, this research designs a method for representing music elements based on a recurrent neural network and proposes an automatic music element generation model based on a resonant neural network. The improved algorithm is validated experimentally. The experiments showed that the system combining the average field connection network, the initial universal connection of the resonant neural network, and the detuned oscillator performed best, with an F-value of 77.2%. The chord generation accuracy of the LSTM-RNN model was 81.99%, 81.65%, 81.02%, and 80.47% for music rhythm lengths of 4, 8, 12, and 16, respectively. The designed method can effectively support music production, meets high-precision design requirements, and achieves good results. This indicates that the proposed music element generation method, which combines a recurrent network with a gradient frequency (resonant) network, performs well. It can accurately generate music elements and provides a reference for the development of automatic music element generation technology in China. Future work should apply this method to more diverse scenarios.

Keywords

algorithmic composition, artificial intelligence, neural network, music elements, automation, LSTM, music structure

Introduction

Humanity has studied music for a long time, and music has become one of humanity’s greatest artistic pursuits. Many theories have been proposed in the study of music, and songwriters continuously summarize their composition techniques.¹,² With the development of technology, algorithmic composition by computer has become an advanced technique for creating musical works. In contemporary music, rule sets and algorithms are also considered one of the ways to compose music. For example, counterpoint in Western music, which is used to model voices, is one manifestation of this approach.³ Algorithmic composition is a precisely defined approach that can be represented as a finite-length sequence of instructions.⁴ When the instructions describe a calculation, the description starts from an initial input state. Clearly defined and finite transitions are executed in a certain order, resulting in an output. In algorithmic composition, random algorithms are often applied. The input is usually a random parameter, and after multiple transformations through specific transitions, a sequence of notes is formed.⁵ However, existing music element generation methods have shortcomings in accuracy and applicability and cannot meet practical application needs. Especially for complex tracks, existing methods cannot accurately capture the complex musical element symbols. First, the generated music segments have significant shortcomings in logic and global structural consistency. Second, existing methods often focus on generating a single element, such as rhythm or melody, without considering the synergy of multiple elements, which may result in a mismatch between the generated melody and harmony or in obvious errors in individual elements such as harmony or rhythm. Therefore, the practicality of existing methods in generating music of different styles is weak.
The research aims to construct an automatic music element generation method based on artificial intelligence algorithms that generates music elements more accurately and effectively and meets complex and diverse application needs. This research takes melody, harmony, and rhythm as the musical elements, since musical works are composed of these three elements. Therefore, combining the advantages of recurrent neural networks in data extraction, an automatic music element generation model based on a recurrent neural network and a Resonant Neural Network (RNN) is constructed. It is hoped that this method can better generate music elements and provide more effective support for high-quality music creation.

The innovations of this study are as follows. Firstly, this study uses melody, harmony, and rhythm as the basic units for intelligent algorithm learning. This provides a clear learning objective for neural network models that generate music and takes into account the integrity and coordination of the musical elements, ensuring the quality of music generation. Secondly, this study combines recurrent neural networks with resonant neural networks (RNNs). The former can effectively capture the temporal structure of musical elements, while the latter enhances the ability of the model to represent frequency-domain relationships such as pitch and harmony, thereby improving the rationality and fluency of music generation.

The contributions of this study are as follows. Firstly, this study utilizes a recurrent neural network model to represent music elements, achieving automatic continuation and combination of music elements. Secondly, this study combines RNNs and recurrent neural networks to construct an automatic music element generation model, namely, the GFNN-LSTM hybrid architecture. Thirdly, a detuning oscillator is introduced as a differentiable phase coupling module to achieve higher quality music element generation.

Related works

With the support of deep learning technology, music generation methods have developed rapidly, and combining various deep learning technologies for music generation has become a research hotspot. Malyi D conducted an in-depth analysis of existing music creation techniques and methods. The results showed that musical composition techniques were fundamentally interrelated, that the creators’ ideological intentions could be explained from different perspectives, and that the creation of musical language was consistent with historical and cultural evolution.⁶ Zongyu Y et al. evaluated several music generation systems from six aspects: style success, aesthetic pleasure, repetition or self-reference, melody, harmony, and rhythm. The results showed that the music generation method based on deep learning was significantly superior to other artificial synthesis methods.⁷ Weiming L et al. utilized artificial intelligence (AI) for song creation. The model was validated by generating five different tracks: bass, drum, guitar, piano, and string. The results indicated that many of the music clips generated by the proposed model were smooth. The model was more stable and realistic and fitted faster during music generation, indicating that the method was effective.⁸ Maryam M et al. proposed a multi-objective genetic algorithm (MO-GA) to generate polyphonic music segments. The proposed system attempted to maximize its objective functions to generate new music segments, including melodies and harmonies. The results indicated that the proposed method could generate pleasant fragments with the required style and length, as well as grammatical harmony.⁹ Jin C et al. proposed a new generation network model based on the transformer and guided by music theory to produce high-quality music works. While training the generator and discriminator networks, the global and local loss objective functions were optimized to provide a reliable adjustment method for the reward network. The reward network and cross entropy loss were combined to guide the training of the generator and produce high-quality music works. Compared with other multi-track music generation methods, the experimental results verified the effectiveness of the generative model.¹⁰

Recurrent neural networks are neural networks with fixed weights, external inputs, and internal states. Their behavior can be regarded as a function of the weights and external inputs, which allows them to describe the evolution of a system state well. They have been widely used in various fields. Xing B et al. proposed a new robust non-singular terminal sliding mode (NTSM) control method based on a recurrent neural network structure. Based on the adversarial generative network dynamics model and path following model, a robust NTSM steering controller was proposed. Then, a recurrent neural network was applied to approximate the unknown dynamic part of the system online. Compared with existing methods, the proposed method had better performance.¹¹ Traditional neural network models have low accuracy, high complexity, and lack compatibility in predicting the bending degree of composite materials. Therefore, Zamyad H et al. proposed a mixture model with internal storage units to overcome these weaknesses. The experimental validation indicated that the model had acceptable accuracy and flexibility.¹² Dsouza K B et al. proposed an automatic recurrent neural network encoder based on deep long short-term memory (LSTM) to capture long-term dependence in epigenome data. This method could capture latent representations of various gene phenomena, including gene expression, promoter-enhancer interactions, replication time, frequent interaction regions, and evolutionary conservation. The experimental results showed that this method was superior to existing methods.¹³ Yang S B et al. developed a new method based on a recurrent neural network to deal with stochastic model predictive control problems. The resulting deterministic optimization problem was solved with a nonlinear optimization solver. The results demonstrated that the proposed method had high accuracy.¹⁴ Lai W H et al. proposed a stacked recurrent neural network with gated recurrent units (GRUs) and jointly optimized soft time-frequency masks for extracting target instrument sounds from mixed instrument sounds. The stacked model linked multiple simple recurrent neural networks, giving it temporal dynamic behavior and true depth. The results showed that this method could successfully extract electric guitar and drum sounds.¹⁵

In conclusion, music element generation methods that use modern advanced technology have been studied in depth, providing effective support for musical composition and research. However, existing music generation technologies focus more on simulating and creating various musical sounds. There are still many shortcomings in the extraction and representation of specific musical elements and rhythm elements. Therefore, combining the advantages of recurrent neural networks in data mining, a music element generative model based on a recurrent neural network is constructed. It is expected that this method can better generate music elements and provide effective support for high-quality musical composition.

Automatic generation method of music elements based on resonant neural network

Representation method of music elements based on recurrent neural network

Recurrent neural networks can capture lists of musical symbols that carry a large amount of information and temporal structure. To automatically continue and combine existing elements and form an automatic element generation model, a multi-layer recurrent neural network is trained on a large folk song dataset. During training, the parameters of the network are continuously updated so that it accurately predicts subsequent notes. After training, the model generates each following note from the preceding notes, creating a brand new sequence of note elements. A note is the combination of a duration and a pitch, and a sequence of notes forms the melody.¹⁶ Therefore, note $n$ can be represented by a vector for the time value, $d[n]$, and a vector for the pitch, $p[n]$. These vectors are used for encoding and training of duration and pitch. In this way, every song can be mapped to a matrix of time values and pitches.
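The mapping from notes to duration and pitch vectors can be illustrated with a minimal sketch. It assumes one-hot vectors for both $d[n]$ and $p[n]$; the vocabularies `PITCHES` and `DURATIONS` and the helper names are hypothetical and chosen only for illustration, since the exact symbol sets are not listed here.

```python
# Hypothetical vocabularies (assumptions, not the study's actual symbol sets).
PITCHES = list(range(60, 73))             # MIDI 60..72, one octave above middle C
DURATIONS = [0.25, 0.5, 1.0, 2.0, 4.0]    # durations in quarter-note units


def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode_note(pitch, duration):
    """Encode one note as the pair (p[n], d[n]) of one-hot vectors."""
    p = one_hot(PITCHES.index(pitch), len(PITCHES))
    d = one_hot(DURATIONS.index(duration), len(DURATIONS))
    return p, d


def encode_melody(notes):
    """Map a melody, given as (pitch, duration) pairs, to two matrices."""
    pairs = [encode_note(pitch, duration) for pitch, duration in notes]
    pitch_matrix = [p for p, _ in pairs]
    duration_matrix = [d for _, d in pairs]
    return pitch_matrix, duration_matrix


melody = [(60, 1.0), (62, 0.5), (64, 0.5), (60, 2.0)]
pitch_matrix, duration_matrix = encode_melody(melody)
```

Each row of `pitch_matrix` and `duration_matrix` corresponds to one note, so a song of $N$ notes becomes two matrices with $N$ rows.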

The representation method of melody is shown in Figure 1. To reduce redundancy and control data size appropriately, time values and pitch are standardized. Firstly, the melody of each song is standardized to A minor or C major. Then, based on the most common note, the durations of the other notes are standardized.¹⁷

Figure 1.

Representation method of melody.

Construction of music elements generative model based on recurrent neural network

This study uses melody, harmony, and rhythm as the basic units of intelligent algorithm learning, which provides clear learning objectives for neural network models that generate music. Therefore, after analyzing the basic elements of music, the study adopts a neural network model to construct a method for generating music elements. A neural network is a nonlinear function, represented by $y = f_{w}(x)$, where $x$ represents the input, $y$ represents the output, and $w$ represents the weight parameters. All three are elements of high-dimensional spaces. A recurrent neural network is a network model that can accurately model time series data. Music elements are typical sequence data, in which each note is strongly related to the notes before and after it. In a recurrent neural network, the input and output are sequences of arbitrary dimension and length. If the network input at time $t$ is $x_{t}$ and the state value of the network model at the previous time is $h_{t - 1}$, the current state value of the model can be obtained as shown in formula (1).

h_{t} = f (x_{t}, h_{t - 1}) = σ (W_{x}^{h} x_{t} + W_{h}^{h} h_{t - 1} + b_{h})

(1)

In formula (1), $W_{x}^{h}$ represents the weight matrix from the input layer to the hidden layer, $W_{h}^{h}$ represents the weight matrix between hidden layers, $b_{h}$ represents the hidden layer bias parameter, and $f$ represents the recurrent neural network function. The basic model framework of recurrent neural network is shown in Figure 2.

Figure 2.

Structural model of recurrent neural network.
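As a minimal sketch, the state update of formula (1) can be written in plain Python, with the recurrent weight matrix applied to the previous hidden state $h_{t-1}$. The weight values, the layer sizes, and the helper names `matvec` and `rnn_step` are illustrative assumptions, not values from the study.

```python
import math


def sigmoid(v):
    """Logistic sigmoid used as the activation in formula (1)."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of h_t = sigmoid(W_xh x_t + W_hh h_{t-1} + b_h)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [sigmoid(a + r + b) for a, r, b in zip(from_input, from_state, b_h)]


# Toy parameters (assumed): 2 inputs, 2 hidden units.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.1]

h = [0.0, 0.0]
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the state carries over between steps
```

The same hidden state `h` is passed from one step to the next, which is what lets the network condition each prediction on the notes seen so far.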

When a recurrent neural network adjusts its weights during backpropagation, a very small weight value in one layer causes the gradient passed to the previous layer to shrink toward zero, resulting in the vanishing gradient problem. The Long Short-Term Memory (LSTM) network is a variant of the recurrent neural network that preserves long-term dependency information in the input corpus better than a plain recurrent neural network. At the same time, LSTM can effectively avoid the vanishing gradient of the recurrent neural network. The basic unit structure of LSTM mainly includes the input gate, output gate, forget gate, and memory unit. All gate structures are composed of a feedforward neural network layer and an activation function. LSTM introduces the concept of state in every layer of the recurrent neural network. The state, which serves as the network memory, changes with the input of each sample. To better simulate the distributions of time values and pitches, two LSTM structures are used to represent two different networks, one for rhythm and one for melody. The hidden layers of the two networks are composed of 128 gated recurrent units (GRUs). The Theano library is used for the implementation. The connection method and network structure are shown in Figure 3.

Figure 3.

Connection method and network structure.

Figure 3(a) shows the note propagation method based on two LSTM structures. The duration vector $d[n]$ of each note $n$ is input into the rhythm network. Simultaneously, the pitch vector $p[n]$ is input into the melody network, together with the duration vector $d[n + 1]$ of the subsequent note. After each step, the network state is updated, and the rhythm network outputs the probability distribution of the time value, $p(d[n + 1] | d[n])$. Similarly, the melody network outputs the probability distribution of the pitch, $p(p[n + 1] | p[n])$. In other words, each note of the melody is split into its time value and pitch components, which are input into two multi-layer recurrent neural networks. The networks then output the duration and pitch probability distributions of the subsequent note.¹⁸ When the rhythm network receives the current pitch and duration, the input of the melody network is the duration of the subsequent note and the current pitch. Figure 3(b) shows the mapping process of notes in the two LSTM structures. The input layer $x[n]$ maps note $n$ to the hidden layers $h$ and the output layer $o$. The hidden layers are recurrently connected between notes, as shown by the dashed lines in the figure, for example $h^{1}[n] \to h^{1}[n + 1]$, and they also feed the higher hidden layers and the output layer.

The activation vector of layer $i \in \{1, 2, 3\}$ is $h^{i}[n]$, as shown in formula (2).

h^{i} [n] = z^{i} [n] \cdot h^{i} [n - 1] + (1 - z^{i} [n]) \cdot {\bar{h}}^{i} [n]

(2)

Here, the candidate activation ${\bar{h}}^{i}[n]$ is given by

{\bar{h}}^{i} [n] = \tanh (w_{y^{i} h^{i}} y^{i} [n] + r^{i} [n] \cdot w_{h^{i} h^{i}} h^{i} [n - 1])

The update gate $z^{i}[n]$ is shown in formula (3).

z^{i} [n] = σ (w_{y^{i} z^{i}} y^{i} [n] + w_{h^{i} z^{i}} h^{i} [n - 1] + b_{z}^{i})

(3)

The calculation method for reset gate $r^{i} [n]$ is shown in formula (4).

r^{i} [n] = σ (w_{y^{i} r^{i}} y^{i} [n] + w_{h^{i} r^{i}} h^{i} [n - 1] + b_{r}^{i})

(4)

In formulas (1) to (4), $y^{i}$ represents the feedforward input of layer $i$, which includes the hidden layer activations $h^{j < i}[n]$ and the global input $x[n]$. $σ(x) = 1 / (1 + \exp(- x))$ represents the logistic sigmoid function. The activation $o_{j}[n]$ of output unit $j$ at note $n$ is shown in formula (5).

o_{j} [n] = Θ {(w_{y^{o} o} y^{o} [n] + b^{o})}_{j}

(5)

In formula (5), $Θ$ represents the Softmax function and $y^{o}$ represents the feedforward input of the output layer. The Softmax normalization keeps the sum of the output units at 1. The recurrent neural network has two outputs, representing the probability distributions of duration and pitch, respectively. The combination process of LSTM and RNN is shown in Figure 4.

Figure 4.

The process of combining LSTM and resonant neural network.
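The gated update of formulas (2) to (4) and the Softmax output of formula (5) can be sketched in plain Python. This is an illustration under assumptions: the parameter dictionary `P`, its key names, the toy weights, and the layer sizes are hypothetical, and a single layer is shown instead of the three stacked layers.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(y, h_prev, P):
    """One gated update of a single layer, following formulas (2) to (4)."""
    z = [sigmoid(a + b + c) for a, b, c in           # update gate, formula (3)
         zip(matvec(P["W_yz"], y), matvec(P["W_hz"], h_prev), P["b_z"])]
    r = [sigmoid(a + b + c) for a, b, c in           # reset gate, formula (4)
         zip(matvec(P["W_yr"], y), matvec(P["W_hr"], h_prev), P["b_r"])]
    cand = [math.tanh(a + ri * b) for a, ri, b in    # candidate activation
            zip(matvec(P["W_yh"], y), r, matvec(P["W_hh"], h_prev))]
    # Formula (2): blend the previous state and the candidate activation.
    return [zi * hp + (1.0 - zi) * ci for zi, hp, ci in zip(z, h_prev, cand)]


def softmax(v):
    """Numerically stable Softmax; the outputs sum to 1."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def output_distribution(h, W_o, b_o):
    """Formula (5): Softmax over the affine output-layer input."""
    return softmax([a + b for a, b in zip(matvec(W_o, h), b_o)])


# Toy parameters (assumed): 2 inputs, 2 hidden units, 3 output classes.
P = {
    "W_yz": [[0.1, -0.2], [0.3, 0.4]], "W_hz": [[0.0, 0.1], [0.1, 0.0]], "b_z": [0.0, 0.0],
    "W_yr": [[0.2, 0.1], [-0.1, 0.3]], "W_hr": [[0.1, 0.0], [0.0, 0.1]], "b_r": [0.0, 0.0],
    "W_yh": [[0.5, -0.4], [0.3, 0.2]], "W_hh": [[0.2, 0.1], [0.1, 0.2]],
}
W_o = [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.4]]
b_o = [0.0, 0.0, 0.0]

h = [0.0, 0.0]
for y in [[1.0, 0.0], [0.0, 1.0]]:
    h = gru_step(y, h, P)
o = output_distribution(h, W_o, b_o)  # probability of each output class
```

In the two-network setup described above, one such output distribution would be produced for duration and another for pitch.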

In Figure 4, music is input as a series of discrete events (such as [Pitch, Duration]). These discrete symbols are then mapped into continuous dense vectors through an embedding layer. The LSTM encoder reads the entire input sequence and learns music patterns such as rhythm and melody. The final hidden state of the encoder is input into the RNN layer. There, the resonance mechanism is activated to enhance the characteristic signal while suppressing irrelevant information, and an enhanced context vector is output. Finally, the LSTM decoder takes the enhanced context vector as its initial state and generates the output sequence note by note.

The probability distribution of note pitch $j$ is shown in formula (6).

\Pr (p_{j} [n + 1] = 1 \mid \text{previous notes and } θ) = o_{j} [n]

(6)

Formula (6) indicates that the network output $o[n]$ depends on the model parameters and on the preceding notes. The melody and rhythm networks are trained by maximizing the log-likelihood of their parameters. Therefore, stochastic gradient ascent with an adaptive learning rate is used to optimize the model. The parameters are $α = 10^{- 3}$, $β_{1} = 0.9$, $β_{2} = 0.999$, and $ε = 10^{- 8}$. The log-likelihood of parameter $θ$ is shown in formula (7).

L (θ) = \frac{1}{S} \sum_{s = 1}^{S} \frac{1}{N_{s} - 1} \sum_{n = 1}^{N_{s} - 1} \log (\Pr (x_{j}^{s} [n + 1] = 1 \mid \text{previous notes and } θ))

(7)

In formula (7), $p^{s}[n]$ represents the pitch vector of the melody network, $x^{s}[n]$ represents the time vector of the rhythm network, $N_{s}$ represents the length of song $s$, and $S$ represents the number of songs. Cross validation is used for training. In each training cycle, 80% of the samples in the dataset are used for training, and the remaining 20% are used for evaluating the automatically generated music. After each training cycle, the performance of the model is evaluated on random samples. The parameters that achieve the best likelihood on unseen data are taken as the final model parameters.¹⁹
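The averaged log-likelihood of formula (7) and the 80%/20% split can be sketched as follows. The function names, the seed, and the toy probabilities are assumptions made for illustration; each song is represented as a pair of predicted distributions and the indices of the notes that actually followed.

```python
import math
import random


def song_log_likelihood(probs, targets):
    """Inner sum of formula (7) for one song.

    probs[n] is the predicted distribution for the next note and targets[n]
    is the index of the note that actually followed, so len(targets) = N_s - 1.
    """
    return sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def mean_log_likelihood(songs):
    """Outer average of formula (7) over all S songs."""
    return sum(song_log_likelihood(p, t) for p, t in songs) / len(songs)


def split_80_20(items, seed=0):
    """Shuffle and split a dataset into 80% training and 20% held-out data."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy data (assumed): two short songs with hand-written predictions.
song_a = ([[0.25, 0.25, 0.25, 0.25]] * 3, [0, 1, 2])   # uniform predictions
song_b = ([[0.5, 0.5], [0.5, 0.5]], [1, 0])
L = mean_log_likelihood([song_a, song_b])

train, held_out = split_80_20(range(10))
```

Because the value is a log-likelihood, larger is better, and training moves the parameters in the direction that increases it.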

Construction of music elements automatic generative model based on resonant neural network

The neurally inspired model of rhythm perception used here is the RNN. In the RNN, a network of oscillators is distributed across a frequency spectrum. The internal connections between oscillators can be trained through Hebbian-type learning, and the RNN is characterized by the connection matrix between its oscillators. When the network is stimulated by a signal whose frequency content lies within the rhythmic range, the oscillators resonate at frequencies in integer ratios to the pulse. These resonances can be interpreted as the perception of hierarchical metrical structures.²⁰ The RNN is implemented in MATLAB and includes 192 oscillators whose natural frequencies are logarithmically distributed between 0.5 Hz and 8 Hz. Two LSTM networks are connected to the RNN. The first has 192 linear inputs, one for each oscillator in the RNN, each taking the real part of that oscillator’s output. The second has only one linear input, which contains the frequency information of the input signal. When generating music elements, the metrical structure of the music should be given special attention. Any given musical rhythm can be classified as strong or weak. When a rhythm at a given level is considered strong, it also appears at the next higher level, which gives the hierarchical structure of the music rhythm. In theory, this hierarchical structure may include various phrases and rests. A simple metrical analysis of a music rhythm is shown in Figure 5.

Figure 5.

Measurement analysis of music elements.
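The logarithmic spacing of the 192 oscillator frequencies between 0.5 Hz and 8 Hz can be sketched as follows. The function name `log_spaced_frequencies` is hypothetical, and the sketch assumes that both end points are included.

```python
def log_spaced_frequencies(n=192, f_min=0.5, f_max=8.0):
    """Return n natural frequencies spaced evenly on a logarithmic axis."""
    ratio = f_max / f_min
    return [f_min * ratio ** (k / (n - 1)) for k in range(n)]


freqs = log_spaced_frequencies()
# Constant ratio between neighbouring oscillators (geometric progression).
step_ratio = freqs[1] / freqs[0]
```

The range from 0.5 Hz to 8 Hz spans four octaves, so this spacing places roughly 48 oscillators in each octave.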

Music rhythm analysis shows that the occurrence time of music structure is similar to the basic pattern of brain mechanics. There is a resonance between the nervous system and the rhythm pattern of music, as shown in formula (8).

\frac{d z}{d t} = z (a + i w + (β_{1} + i δ_{1}) {| z |}^{2} + \frac{(β_{2} + i δ_{2}) ε {| z |}^{4}}{1 - ε {| z |}^{2}}) + k P (ε, x (t)) A (ε, \bar{z})

(8)

In formula (8), $z$ represents a complex-valued state variable, and $\bar{z}$ is the complex conjugate of $z$. $w$ is a driving frequency, and $a$ is a linear damping parameter. $β_{1}$ and $β_{2}$ are amplitude compression parameters, which are used to increase the stability of the model. $δ_{1}$ and $δ_{2}$ are frequency detuning parameters. $ε$ is used to control the nonlinearity in the system. $x(t)$ represents the time-varying external stimulus. The coupling to the stimulus consists of the passive part $P(ε, x(t))$ and the active part $A(ε, \bar{z})$ and is controlled by the coupling parameter $k$. When two RNNs are connected together, they can be used for pulse detection experiments in syncopated rhythms.
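To illustrate how a single oscillator of formula (8) responds to a periodic stimulus, the equation can be integrated numerically. The sketch below uses forward Euler integration and makes several assumptions that are not taken from the study: the coupling is simplified to $P(ε, x(t)) = x(t)$ and $A(ε, \bar{z}) = 1$, and all parameter values, the step size, and the function names are chosen only for illustration.

```python
import math


def oscillator_rhs(z, x, a, w, b1, d1, b2, d2, eps, k):
    """Right-hand side of formula (8) for one oscillator state z and stimulus x."""
    m2 = abs(z) ** 2
    intrinsic = z * (a + 1j * w
                     + (b1 + 1j * d1) * m2
                     + (b2 + 1j * d2) * eps * m2 ** 2 / (1.0 - eps * m2))
    # Assumption: simplified coupling with P(eps, x) = x and A(eps, conj(z)) = 1.
    return intrinsic + k * x


def peak_response(f_osc, f_in, seconds=10.0, dt=0.001):
    """Drive an oscillator tuned to f_osc with a cosine at f_in (both in Hz).

    Returns the largest amplitude |z| observed during the final second.
    """
    params = dict(a=-1.0, w=2 * math.pi * f_osc, b1=-1.0, d1=0.0,
                  b2=-1.0, d2=0.0, eps=0.5, k=1.0)   # illustrative values
    z = 0.1 + 0j
    steps = int(seconds / dt)
    last_second = steps - int(1.0 / dt)
    peak = 0.0
    for n in range(steps):
        x = math.cos(2 * math.pi * f_in * n * dt)
        z = z + dt * oscillator_rhs(z, x, **params)   # forward Euler step
        if n >= last_second:
            peak = max(peak, abs(z))
    return peak


resonant = peak_response(f_osc=2.0, f_in=2.0)   # stimulus at the oscillator frequency
detuned = peak_response(f_osc=2.0, f_in=5.0)    # stimulus far from it
```

With these settings the oscillator driven at its own frequency settles at a clearly larger amplitude than the detuned one, which is the frequency selectivity that a bank of such oscillators exploits.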

An RNN model is trained to predict audio data and form expressive rhythmic events. The activation function mainly relies on the audio input and output of the model. The schematic diagram of the RNN model is shown in Figure 6.

Figure 6.

Resonant neural network.

In Figure 6, A represents the input audio data, B represents the conversion of the audio signal into a rhythmic form in which events can be distinguished, C represents the multiple oscillators, D represents the RNN representation of the audio data, and E represents the output activation function for audio events. The main dataset used in the study is the MAZ dataset, which mainly exists in the form of recorded audio. It contains a large number of dance performance clips. Each segment in MAZ is a performed interpretation, and the interpretations present different rhythms and dynamics. All works belong to the same type and are all played on the piano, which makes the comparison more reliable. In this research, randomly selected fragments are sliced into 40 s excerpts and made into 50 subsets. When processing rhythm events, the audio signal is first converted into a form suitable for rhythm analysis. Usually, a continuous peak function is applied to extract notes from the data.

Performance analysis of music elements automatic generative model based on resonant neural network

Experiments are designed to analyze the performance of the automatic music element generative model based on the recurrent network and the RNN. To build music datasets that meet the requirements of the model, a program is written to compile and parse symbol lists in abc notation, and the symbols are converted until the model requirements are met. The dataset used in the study is the Free Music Archive (FMA), an open and easily accessible dataset for experimental analysis. It provides full-length, high-quality audio, pre-computed features, and track-level and user-level metadata, tags, and free-form text (such as biographies). It includes 106,574 tracks and 14,854 albums from 16,341 artists, with a note length range of 136 ± 83, and it includes information about duration and pitch transitions. The study uses the Min-Max Scaling method to compress data to [0,1]. Pitch and duration are two distinct characteristics and are processed separately. Pitch is an ordinal variable that is processed by One-Hot encoding: each pitch is treated as an independent category, resulting in a high-dimensional sparse vector. Min-Max Scaling is then applied. First, the pitch range is defined as pitch_min = 21 (MIDI note 21, A0, the lowest piano note) and pitch_max = 108 (MIDI note 108, C8, the highest piano note). Then, the pitch is scaled as pitch_normalized = (original_pitch - pitch_min)/(pitch_max - pitch_min). Duration is a continuous positive variable whose distribution is usually long-tailed (with many short notes and a few very long notes). The study uses proportional time encoding. For example, a quarter note is denoted as 1, and an eighth note is denoted as 0.5. After standardization preprocessing, durations in the dataset most often take the value 1, which makes the remaining duration transitions easier to compare on a unified time scale. The specific content is shown in Figure 7. From Figure 7, the $C$ major scale is the most common. The transitions from $B$ and $G♯$ to the tones $C$ and $A$ have the highest likelihood. This reflects a characteristic of Western music that can be called melodic resolution. After analyzing the relationship between duration and pitch, both duration and pitch meet the expectations of music theory.

Figure 7.

Statistics of tone harmony and time conversion information.
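The pitch scaling and the proportional duration encoding described above can be sketched as follows. The constants come from the text (MIDI notes 21 and 108, quarter note = 1, eighth note = 0.5); the function names and the choice to express a note value by its denominator are illustrative assumptions.

```python
PITCH_MIN = 21    # MIDI note 21, A0
PITCH_MAX = 108   # MIDI note 108, C8


def normalize_pitch(original_pitch):
    """Min-Max scale a MIDI pitch to the interval [0, 1]."""
    if not PITCH_MIN <= original_pitch <= PITCH_MAX:
        raise ValueError("pitch outside the piano range 21..108")
    return (original_pitch - PITCH_MIN) / (PITCH_MAX - PITCH_MIN)


def proportional_duration(note_denominator):
    """Encode a note value relative to a quarter note.

    note_denominator is 1 for a whole note, 2 for a half note, 4 for a
    quarter note, 8 for an eighth note, and so on.
    """
    return 4.0 / note_denominator


middle_c = normalize_pitch(60)          # MIDI 60 is middle C
quarter = proportional_duration(4)
eighth = proportional_duration(8)
```

Scaling pitch and duration separately keeps the two features on comparable ranges without mixing their units.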

Using the value of the next note, a distribution model for subsequent notes is constructed to determine the distributions of time value and pitch after decomposition. This ensures the correct calculation of the conditional probability values. Finally, an example of the corresponding melody is automatically generated, as shown in Figure 8. The F-value, recall rate, and accuracy used in data retrieval are applied to evaluate the generation results of music elements. Event prediction uses a threshold on the output gradient. When searching for signal peaks, the main focus is on changes in the gradient relative to this threshold. When a gradient change occurs during the search for a signal peak, the point of change is marked as an onset. The tolerance window for rhythm events is ±58.1 ms. An onset with a positive gradient may exist within the same time window.

Figure 8.

Example of automatic melody generation.
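The event-based evaluation with a ±58.1 ms tolerance window can be sketched as follows. This is a simplified illustration under assumptions: predicted and target onsets are matched greedily one-to-one in time order, the "accuracy" reported for rhythm events is treated as precision, and the onset times are invented for the example.

```python
def count_hits(predicted, target, tol=0.0581):
    """Greedy one-to-one matching of onset times (in seconds) within +/- tol."""
    predicted, target = sorted(predicted), sorted(target)
    i = j = hits = 0
    while i < len(predicted) and j < len(target):
        diff = predicted[i] - target[j]
        if abs(diff) <= tol:
            hits += 1          # matched pair, consume both onsets
            i += 1
            j += 1
        elif diff < 0:
            i += 1             # predicted onset too early to match anything
        else:
            j += 1             # target onset was missed
    return hits


def precision_recall_f(predicted, target, tol=0.0581):
    """Return (precision, recall, F-value) for predicted rhythm events."""
    hits = count_hits(predicted, target, tol)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(target) if target else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical onset times in seconds.
predicted_onsets = [0.50, 1.02, 1.60, 2.30]
target_onsets = [0.52, 1.00, 1.50, 2.00, 2.31]
precision, recall, f_value = precision_recall_f(predicted_onsets, target_onsets)
```

In this example three of the four predicted onsets fall inside the window of a target onset, while two of the five target onsets are missed.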

Based on the sampling rate, the tolerance window covers 5 sampling points on both sides of a rhythm event. Since the focus of this study is mainly on the rhythm structure, the impact of this limit is controllable and can be ignored. The experimental results are shown in Table 1. From Table 1, the accuracy, recall, and F-value of LSTM-RNN with non-online full connection are 0.6114, 0.6182, and 0.6059, respectively. The accuracy, recall, and F-value of LSTM-RNN with the non-online mean field are 0.6878, 0.6883, and 0.6823, respectively. After combining the mean field connection network, the LSTM-RNN initial universal connection, and the detuned oscillator, the output performance is optimal. The mean field network is superior to the LSTM-RNN fully connected model. The results reflect that the mean field can capture the maximum resonance frequency, which achieves noise cancellation at lower resonance frequencies. Therefore, obtaining fully connected and mean field signals plays an important role in predicting rhythmic events.

Table 1.

Output results evaluation for resonant neural network model.

Learning methods	Non-online full connection	Non-online mean field	Online full connection	Online mean field	Initial online full connection	Initial online mean field
Accuracy	0.6114 (0.035)	0.6878 (0.100)	0.5637 (0.043)	0.6862 (0.039)	0.5982 (0.055)	0.7032 (0.031)
Recall	0.6182 (0.034)	0.6883 (0.067)	0.6185 (0.076)	0.6401 (0.050)	0.6230 (0.041)	0.6979 (0.041)
F-value	0.6059 (0.021)	0.6823 (0.081)	0.5798 (0.042)	0.6548 (0.042)	0.6000 (0.018)	0.6958 (0.036)

One output result of the model is shown in Figure 9. Multiple peaks appear in the target output because the input sample contains distinct features: the input music data itself follows certain rhythm variation patterns, and the model attempts to generate a peak for each feature. The actual output is the test result obtained from the designed model, while the target output is the exact rhythm change pattern of the sample. The actual output should be as close as possible to the target output. Overall, there is a certain difference between the two, and the actual output value is slightly smaller than the target output value. When the running time is 15.2 s, the difference between the actual output and the target output is 0.15. When the running time is 16.4 s, the difference is 0.2. When the running time is 16.7 s, the difference reaches its maximum of 0.3. When the running time is 18.5 s, the difference is 0.02. However, the peak positions of the actual output and the target output basically coincide. This indicates that the LSTM-RNN music element generation network model constructed in the study has good performance. The network can effectively capture the rhythm structure and generate new music rhythms.

Figure 9.

One output result of the resonant neural network model.

To verify the accuracy of the proposed method in generating music elements, the chord generation accuracy of the commonly used Hidden Markov Model (HMM), RNN, and LSTM is compared with that of the proposed LSTM-RNN. The training data samples are the same as those used in the previous section. The parameter settings for each model are as follows. For the HMM, the number of states is 24, the maximum number of iterations is 100, and the convergence threshold tol is 1e-5. For the LSTM, the vocabulary size is 320 and the embedding dimension is 256. For both the LSTM and the RNN, the hidden layer dimension is 512, the number of layers is 3, the dropout is 0.5, and the batch size is 64.

In this study, each music piece is composed of a series of notes. The Dynamic Time Warping (DTW) algorithm is used to minimize the overall distance (such as the onset time difference) and to find, for each generated note, a corresponding note in the real sequence. On the aligned notes, precision is the proportion of correctly pitched notes among all generated notes, recall is the proportion of correctly generated pitches among all real notes, and the F1 value is the harmonic mean of precision and recall. Duration accuracy is the proportion of note pairs whose duration falls within the allowable error range among all aligned and correctly pitched note pairs.

The results are shown in Table 2. The chord generation accuracy of the LSTM-RNN music element generative model proposed in the study is higher than that of the other methods at every music rhythm length, exceeding the LSTM baseline by about 15 percentage points. Specifically, when the music rhythm lengths are 4, 8, 12, and 16, the chord generation accuracy of the HMM model is 45.33%, 45.06%, 44.51%, and 43.74%, respectively. The chord generation accuracy of the RNN model is 52.41%, 52.15%, 51.69%, and 50.86%, respectively. The chord generation accuracy of the LSTM model is 66.78%, 66.24%, 65.62%, and 64.87%, respectively. The chord generation accuracy of the LSTM-RNN model is 81.99%, 81.65%, 81.02%, and 80.47%, respectively. The study attributes this advantage to the use of multiple oscillators, which reduces the impact of noise and gives the proposed method better accuracy than the comparison methods.
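The alignment-based metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the `Note` fields, the greedy reduction of the DTW path to one-to-one note pairs, and the 0.05 s duration tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Note:
    onset: float     # onset time in seconds
    pitch: int       # MIDI pitch number
    duration: float  # duration in seconds


def dtw_path(gen: List[Note], ref: List[Note]) -> List[Tuple[int, int]]:
    """Classic DTW on onset-time differences; returns the warping path."""
    n, m = len(gen), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(gen[i - 1].onset - ref[j - 1].onset)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def align(gen: List[Note], ref: List[Note]) -> List[Tuple[int, int]]:
    """Reduce the DTW path to one-to-one pairs, closest onsets first (assumed rule)."""
    pairs = sorted(dtw_path(gen, ref),
                   key=lambda p: abs(gen[p[0]].onset - ref[p[1]].onset))
    used_gen, used_ref, matched = set(), set(), []
    for g, r in pairs:
        if g not in used_gen and r not in used_ref:
            used_gen.add(g)
            used_ref.add(r)
            matched.append((g, r))
    return sorted(matched)


def evaluate(gen: List[Note], ref: List[Note], dur_tol: float = 0.05) -> Dict[str, float]:
    """Pitch precision/recall/F1 plus duration accuracy on correctly pitched pairs."""
    correct = [(g, r) for g, r in align(gen, ref) if gen[g].pitch == ref[r].pitch]
    precision = len(correct) / len(gen) if gen else 0.0
    recall = len(correct) / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    dur_ok = sum(abs(gen[g].duration - ref[r].duration) <= dur_tol for g, r in correct)
    dur_acc = dur_ok / len(correct) if correct else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "duration_accuracy": dur_acc}


# Hypothetical toy sequences, for illustration only.
reference = [Note(0.0, 60, 0.5), Note(0.5, 62, 0.5), Note(1.0, 64, 0.5), Note(1.5, 65, 0.5)]
generated = [Note(0.02, 60, 0.5), Note(0.51, 62, 0.3), Note(1.48, 67, 0.5)]
result = evaluate(generated, reference)
```

In this toy case two of the three generated notes carry the correct pitch, so precision and recall differ because the generated sequence is shorter than the reference. Duration accuracy is computed only over the correctly pitched pairs, matching the definition given above.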

Table 2.

Chord generation accuracy of different methods.

Music rhythm length	HMM (%)	RNN (%)	LSTM (%)	LSTM-RNN (%)
4	45.33	52.41	66.78	81.99
8	45.06	52.15	66.24	81.65
12	44.51	51.69	65.62	81.02
16	43.74	50.86	64.87	80.47

The average field networks score higher on the evaluation metrics than the other networks, but with a large standard deviation, which indicates that the results fluctuate considerably across cross-validation rounds. The average field network also requires more steps to converge. Musical tempo can be seen as the mutual entrainment of local oscillator groups, which is more conducive to beat tracking. A strong resonance appears in a local region where the oscillator frequencies are similar. This region shifts with the stimulus frequency, so the resonance moves along the direction of the frequency gradient. The F-value is 71.8%. However, for music scores with a fixed tempo, the F-value for rhythm generation exceeds 80%. The system that combines the average field connection network, RNN initial universal connection, and detuned oscillator performs the best, reaching 77.2%.

The amplified response values at different time values are analyzed, as shown in Figure 10. The red box represents the audio connection point. In Figure 10, at t = 1 s, the generated music shows little linear change and little abruptness at the connection point, and the join is hard to distinguish by ear. The best processing effect appears in the smooth region.
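To make the F-value for rhythm concrete, the sketch below scores estimated rhythm events against reference events using a tolerance window. It is an illustrative assumption rather than the study's evaluation code: the ±58.1 ms window is the fault tolerance limit reported in the Conclusion, while the function name, the greedy two-pointer matching, and the event times are hypothetical.

```python
from typing import Dict, List


def rhythm_f_value(estimated: List[float], reference: List[float],
                   window: float = 0.0581) -> Dict[str, float]:
    """Match events one-to-one within +/- window seconds and return P, R, F."""
    est, ref = sorted(estimated), sorted(reference)
    hits = i = j = 0
    while i < len(est) and j < len(ref):
        diff = est[i] - ref[j]
        if abs(diff) <= window:
            hits += 1        # event falls inside the tolerance window
            i += 1
            j += 1
        elif diff < 0:
            i += 1           # estimated event is too early: false positive
        else:
            j += 1           # reference event was missed: false negative
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f": f}


# Hypothetical event times in seconds, for illustration only.
reference_events = [0.5, 1.0, 1.5, 2.0]
estimated_events = [0.52, 1.07, 1.49, 2.05, 2.5]
scores = rhythm_f_value(estimated_events, reference_events)
```

Here the event at 1.07 s lies 70 ms from its nearest reference event and is therefore counted as a miss, whereas the event at 2.05 s (50 ms off) is accepted. A narrower window would lower the F-value for the same output, which is why the window size should be reported alongside the score.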

Figure 10.

Sound track loudness chart (t = 1).

The amplified response value at t = 2 s is analyzed in Figure 11. The red box represents the audio connection point. At t = 2 s, the variation area is longer and the musical elements appear less compact. Aurally, the music has a noticeable sense of confusion. The track loudness analysis shows that some loudness maps contain long segments of marked weakening, so the element generation effect is poor.

Figure 11.

Sound track loudness chart (t = 2).

The above results show that the music element generation method designed in the study performs better in both automatic melody generation and chord generation accuracy. It also produces smoother connections in sound track loudness. Compared with the GAN used in Reference 8, the designed method achieves a smoother sound track, because the RNN used in the method contains multiple oscillators, which reduce the impact of noise and improve the music element generation effect. Xu X et al. used convolutional neural networks to analyze music elements, but this method requires fitting a large number of parameters, which limits its application.²¹ By comparison, the RNN-based music element generation method designed in the study has simpler operations and parameter requirements. Overall, the designed method has clear advantages.

To further validate the robustness and generalization ability of the research method, it is applied to the Weimar Jazz Database and the Lakh MIDI Dataset. The Weimar Jazz Database contains a large amount of MIDI transcription data for jazz solos, including the onset, pitch, duration, and other details of each note. The Lakh MIDI Dataset contains over 170,000 MIDI files covering multiple genres, including a large amount of popular music. The evaluation indicators are precision, recall, F1-score, and overall accuracy. The comparison methods are HMM, RNN, LSTM, and LSTM-RNN. The evaluation results are shown in Table 3.

Table 3.

Performance comparison on different datasets.

Model	Dataset	Precision (%)	Recall (%)	F1-score (%)	Overall accuracy (%)
HMM	Weimar jazz	67.9	66.8	65.6	64.2
HMM	Lakh MIDI	70.6	72.5	71.3	69.7
RNN	Weimar jazz	75.5	76.2	73.1	72.3
RNN	Lakh MIDI	76.9	78.5	77.2	73.7
LSTM	Weimar jazz	79.6	80.5	77.8	76.9
LSTM	Lakh MIDI	82.6	84.1	83.3	80.2
LSTM-RNN	Weimar jazz	86.4	85.8	86.2	83.7
LSTM-RNN	Lakh MIDI	88.6	85.2	87.1	84.3

All models score higher on the Lakh MIDI dataset than on the Weimar Jazz dataset on nearly every metric. The only exception is the recall of LSTM-RNN (85.2% versus 85.8%). This may be because the harmony and rhythm of popular music are simpler and more regular than those of jazz, which makes generation less challenging. The precision, recall, F1-score, and overall accuracy of LSTM-RNN on the Weimar Jazz dataset are 86.4%, 85.8%, 86.2%, and 83.7%, respectively. On the Lakh MIDI dataset they are 88.6%, 85.2%, 87.1%, and 84.3%, respectively. The research method outperforms the comparison methods on both datasets, indicating stronger generalization ability and robustness. From this perspective, the research method has good adaptability and potential in practical applications.
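One way to read Table 3 for generalization is to compare each model's F1-score across the two datasets. The sketch below uses only the F1 values reported in Table 3. Treating a small cross-dataset gap as a rough proxy for generalization is an assumption of this example, not a metric defined in the study.

```python
# F1-scores (%) copied from Table 3: (Weimar Jazz, Lakh MIDI).
f1_scores = {
    "HMM": (65.6, 71.3),
    "RNN": (73.1, 77.2),
    "LSTM": (77.8, 83.3),
    "LSTM-RNN": (86.2, 87.1),
}

# Gap between the easier (Lakh MIDI) and harder (Weimar Jazz) dataset, in points.
f1_gap = {model: round(lakh - weimar, 1) for model, (weimar, lakh) in f1_scores.items()}

# Model whose F1-score changes least between the two datasets.
most_stable = min(f1_gap, key=f1_gap.get)

for model, gap in f1_gap.items():
    print(f"{model}: {gap} points")
```

With the reported values, the gap is 0.9 points for LSTM-RNN, against 4.1 to 5.7 points for the baselines, which is consistent with the generalization claim above.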

Conclusion

In the field of musical composition, the research designs an automatic music element generation method based on the RNN. From the experimental results, after standardizing and preprocessing the sample data, the time value of the dataset is usually 1. This increases the probability of converting other states into unified time values, of which the C major scale is the most common. The fault tolerance limit window for rhythm events is within the range of ±58.1 ms, indicating that starting points with a positive gradient may exist within the same time window. The output performance is best after combining the average field connection network, RNN initial universal connection, and detuned oscillator. The average field network is superior to the RNN fully connected model, with the F-value reaching 77.2%. The RNN model has multiple peaks in one output result, and the maximum difference between the actual output value and the target output value is 0.3. However, the peak changes of the actual and target output values basically coincide, indicating that the LSTM-RNN music element generation network model can capture the rhythm structure and generate new music rhythms. When the music rhythm lengths are 4, 8, 12, and 16, the chord generation accuracy of the LSTM-RNN model is 81.99%, 81.65%, 81.02%, and 80.47%, respectively, which is higher than that of all comparison methods. The results reflect that the average field can obtain the maximum resonance frequency, which can eliminate noise at lower resonance frequencies.

This study mainly focuses on music element generation. There is little discussion of how the musical elements are presented, and the emotional expression, coherence of expression, and consistency of style in music have not been systematically addressed. Future research will pay more attention to this area. In addition, cross-modal learning strategies can be used to further explore melody modeling between different styles. Lyric-melody joint modeling could be introduced and extended to traditional Chinese music and opera styles to generate music in a wider range of styles.

Footnotes

ORCID iD

Sha Li

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Eckberg, Mertan, Ledbetter, et al. 1043: continuous prediction of ICU-free days using a recurrent neural network in a pediatric ICU. Crit Care Med 2021; 49(1): 521.

2. Zhang, Phillips, Wang, et al. Fruit classification by biogeography‐based optimization and feedforward neural network. Expert Syst 2016; 33(3): 239–253.

3. Pimpalkar, Singh, Sheikh, et al. Fake news classification using Bi-Directional LSTM-recurrent neural network. J Huazhong Univ Sci Technol 2021; 50(6): 1–9.

4. Wang, Rao, Chen, et al. Abnormal breast detection in mammogram images by feed-forward neural network trained by jaya algorithm. Fundam Inf 2017; 151(1-4): 191–211.

5. Krishnan, Rajarajeswari, Krishnamohan, et al. Music generation using deep learning techniques. J Comput Theor Nanosci 2020; 17(9/10): 3983–3987.

6. Malyi. The writing technique as a component of the compositional process (on the examples of creative practice of the second half of the 20th–21st centuries). Probl of Int Arts Ped T and P of Edu 2021; 59(59): 117–130.

7. Zongyu, Federico, Susan, et al. Deep learning’s shallow gains: a comparative evaluation of algorithms for automatic music generation. Mach Learn 2023; 112(5): 1785–1822.

8. Weiming. Literature survey of multi-track music generation model based on generative confrontation network in intelligent composition. J Supercomput 2022; 79(6): 6560–6582.

9. Maryam, Mahdian. A combination of multi-objective genetic algorithm and deep learning for music harmony generation. Multimed Tool Appl 2022; 82(2): 2419–2435.

10. Jin, Wang, et al. A transformer generative adversarial network for multi-track music generation. CAAI Trans Intell Technol 2022; 7(3): 369–380.

11. Xing, Wei, et al. Recurrent neural network non-singular terminal sliding mode control for path following of autonomous ground vehicles with parametric uncertainties. IET Intell Transp Syst 2022; 16(5): 616–629.

12. Zamyad, Naghavi, Godaz, et al. A recurrent neural network–based model for predicting bending behavior of ionic polymer–metal composite actuators. J Intell Mater Syst Struct 2020; 31(17): 1973–1985.

13. Dsouza, Bhargava, et al. Latent representation of the human pan-celltype epigenome through a deep recurrent neural network. IEEE ACM Trans Comput Biol Bioinf 2022; 19(4): 2313–2323.

14. Yang. Recurrent neural network-based joint chance constrained stochastic model predictive control. IFAC-PapersOnLine 2022; 55(7): 780–785.

15. Lai, Wang. Monaural instrument sound segregation by stacked recurrent neural network. J Inf Sci Eng 2022; 38(3): 499–515.

16. Huang, Fan, Huang, et al. Main melody configuration and chord algorithm for relaxing music generation. Intelligent Automation & Soft Computing 2023; 35(1): 661–673.

17. Wenkai, Yihao, Zefeng, et al. Polyphonic music generation generative adversarial network with markov decision process. Multimed Tool Appl 2022; 81(21): 29865–29885.

18. Omar, Oleg, Alejandro, et al. Algorithmic music generation by harmony recombination with genetic algorithm. J Intell Fuzzy Syst 2022; 42(5): 4411–4423.

19. Duarte AEL. Algorithmic interactive music generation in videogames. SoundEffects 2020; 9(1): 38–59.

20. Shan, Tsai. Automatic generation of piano score following videos. Transactions of the International Society for Music Information Retrieval 2021; 4(1): 29–41.

21. Schieck AFG, Kang, et al. Effect of music tempo on duration of stay in exhibition spaces. Appl Acoust 2023; 207(5): 1–16.