Abstract
With the advancement of modern medical concepts, the beneficial effects of music on human health have gradually become accepted, and music therapy has become a research direction that has received much attention in recent years. However, folk music has certain peculiarities, so no efficient method exists that can be applied directly throughout repertoire selection. Building on previous research, this paper combines deep learning theory with ethnomusic therapy and proposes a deep learning-based approach to ethnomusic therapy song selection. Traditional feature extraction suffers from insufficient information in each frame, excessive redundancy, an inability to process multiple consecutive frames of a music signal that contain related music features, and weak noise immunity; these shortcomings increase the computational effort and reduce the efficiency of the system. To address them, this paper introduces deep learning into the feature extraction process, combining the feature extraction of the deep autoencoder (DAE) with the music classification of the Gaussian mixture model (GMM) to form a new DAE-GMM music classification model. Finally, for music therapy selection, this paper compares the music selection methods based on the co-matrix and on physiological signals with the proposed method. The theoretical analysis and simulation plots show that the proposed method achieves good classification over a large number of music pieces and further optimizes music therapy song selection from both subjective and objective aspects by considering the therapeutic effect of music on patients.
The results show that a two-layer classifier built on the deeply optimized feature vectors achieves higher accuracy. In addition, compared with the classification model based on the original features, the optimized Gaussian mixture model classifies music more accurately. In the selection experiment, “The Original Landscape of the Hometown” scored 0.9487 and was preferred by insomnia patients; it is played mainly on the ceramic flute, has a soft style with no irritating tones and little melancholy, and has a calming, mind-nourishing effect.
Introduction
In the context of social progress and development, traditional medicine is also changing from a medical model dominated by biology itself to a new model of “biology-environment-society-medicine”. The connection between the environment and human health is becoming stronger, and as a result, research on the use of music to treat illnesses in the psychiatric field continues to receive attention [1]. Music therapy is a comprehensive, emerging interdisciplinary field that focuses on medicine and integrates knowledge from multiple fields such as acoustics, psychology, philosophy, and biology. With the advancement of modern medical concepts, the beneficial effects of music on human health have gradually become accepted, and music therapy has become a research direction that has received much attention in recent years [2]. Music is a type of audio with complex components: a piece of music is often a combination of multiple sound elements, usually including various instrument sounds, human voices, faint noises, and other sounds. Moreover, the components contained in different styles of music may be very similar, which makes the analysis, feature extraction, and modeling of music signals more difficult [3–7]. The current mainstream music classifiers are mainly based on three types of methods: (1) rule-based music analysis [8]; (2) minimum distance-based music classification [9]; (3) statistical learning algorithm-based music classification [10]. A typical example of the last type is neural network music classification, which is simple and intuitive: a network is built by training the neurons so that the music categories to be discriminated correspond directly to the labeled textual information [11].
In recent years, more and more researchers have successfully practiced statistical learning methods such as K-means algorithm, Gaussian mixture model, Markov model, and Bayesian algorithm in music classification.
Related work
The US-based musicfish system combines features and pattern recognition to solve the problem of music classification. The system extracts the time-domain and frequency-domain characteristics of music and forms them into a multidimensional feature vector, which largely covers characteristics of music such as rhythm and pitch. On this basis, the above two approaches can be said to have laid the foundation for music classification and provided a clear idea for future research. Classification methods that focus on musical genres and singer traits are also common [12–14]. Most of the research in this area has drawn on common approaches used in speech recognition. Tzanetakis [3] used a Gaussian mixture model to solve the problem of genre classification of music. Dave et al. [15], on the other hand, used a neural network algorithm to achieve better results. Zhou et al. [16] combined audio analysis techniques with pattern recognition to achieve accurate classification of eight typical Chinese regional operas. Laurier et al. [6] used word meanings as well as score information for music classification; Sebe et al. [7] applied naive Bayes to music classification for the first time. Hu et al. [8] used a combination of audio-visual data and lyric information for sentiment categorisation in a music database. Yang et al. [9] made the first attempt to use the meaning of lyrics to uncover the sentiment of music. Yao et al. [17] proposed a method to classify music using linear discriminant analysis and support vector machines. Wang et al. [18] then proposed an improved support vector machine model for the classification of music. Li et al. [19] applied data mining techniques to music classification based on the principles of statistical learning. Qin and Ma [20] then compared the advantages and disadvantages of different algorithms based on classification rules combining scales and intervals.
In summary, traditional feature extraction suffers from insufficient information in each frame, excessive redundancy, an inability to process multiple consecutive frames of a music signal that contain related music features, and weak noise immunity, so it increases the computational effort and reduces the efficiency of the system [21–23]. To address these shortcomings, this paper introduces deep learning methods into the feature extraction process, combining the feature extraction of the deep autoencoder with the music classification of the Gaussian mixture model to form a new DAE-GMM music classification model [24–26]. A deep structure is defined relative to a shallow structure, and the difference between them lies in the number of layers of nonlinear operational units they contain [27–29]. The advantages of deep learning are: (1) it emphasizes the depth rather than the breadth of the model network; common deep learning models use 3, 4, or even more hidden layers; (2) it emphasizes the importance of feature learning, which makes classification easier by transforming the features of the original sample into a new space through multiple computational transformations; (3) it places more emphasis on the importance of training data to the model [30]. Compared with features constructed by manual annotation, automatically learned features are better able to portray the rich intrinsic information of the data [31–32]. Deep learning can also effectively mine the internal structure of data and improve modeling capabilities [33]. The practical application of deep learning theory to music signal processing has strong theoretical significance and practical value [34]. The deep learning model used in this paper is the deep autoencoder (DAE), which is briefly described below.
This paper first introduces the research background and significance of music therapy selection, together with the theoretical basis of music therapy and related research on music selection at home and abroad. It then introduces the extraction of traditional musical feature vectors and compares the time-domain features experimentally, showing that the short-time autocorrelation function cannot be used for the analysis of musical features, in preparation for subsequent experiments. Finally, experiments with traditional neural networks illustrate the advantages of the Gaussian mixture model in music classification, and, based on traditional features and deep autoencoders, we show that the proposed DAE-GMM classification method further improves music classification performance over the traditional Gaussian mixture model.
Methodology
Extraction of music feature vectors
In music analysis, certain feature parameters are usually used to approximate different parts of the music, and the choice of feature parameters affects the performance of the system. A collection of data that accurately and comprehensively characterizes a piece of music and can be used for computation and comparison is called a feature vector of that piece. By its nature, feature extraction represents the original music signal in a reduced and compact form without affecting its statistical properties, i.e., it represents the music signal well with fewer parameters. Representing the original music signal by feature vectors makes the research process more convenient, because once the features have been extracted, the music can be modeled with sophisticated model learning algorithms. The feature parameters used in this paper are divided into time-domain features and frequency-domain features; the preprocessing of music signals and the feature parameters used are briefly introduced below. Time-domain and frequency-domain features are two common kinds of features in deep learning, describing the distribution of a signal in the time domain and in the frequency domain. Time-domain features express how the signal changes along the time dimension and include parameters such as amplitude, phase, and frequency. They can describe the trend and degree of change of the signal, for example whether it is rising slowly, falling quickly, or in a stable state. Frequency-domain features describe the distribution of the signal in the frequency domain, for example whether the high-frequency part or the low-frequency part carries more information. Each kind has its own advantages and disadvantages.
Different feature representations suit different application scenarios: for example, the signal can be decomposed into different frequency-domain levels to extract features at each level, or multiple features can be combined to construct a comprehensive representation.
Before the characteristic parameters of the music signal are extracted, the signal is preprocessed to save space and facilitate the analysis. Preprocessing includes acquiring the music signal, pre-emphasis, windowing and framing, and discrimination of silent frames. The flow of the preprocessing is shown in Fig. 1.

Pre-processing process.
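The preprocessing steps named above (pre-emphasis, windowing and framing, silent-frame discrimination) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the pre-emphasis coefficient 0.97, the Hamming window, the frame length, the hop size, and the energy threshold are all assumptions, since the paper does not state them.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]

def frame_and_window(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply a Hamming window (assumed)."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames

def drop_silent_frames(frames, energy_threshold):
    """Keep only frames whose short-time energy exceeds the (assumed) threshold."""
    return [f for f in frames if sum(x * x for x in f) > energy_threshold]

# Toy signal: 40 samples of silence followed by 80 samples of a sine wave.
signal = [0.0] * 40 + [math.sin(2 * math.pi * 0.1 * n) for n in range(80)]
emphasized = pre_emphasis(signal)
frames = frame_and_window(emphasized, frame_len=20, hop=10)
voiced = drop_silent_frames(frames, energy_threshold=1e-3)
print(len(frames), len(voiced))
```

On the toy signal, the frames that fall entirely inside the leading silence are discarded and the remaining frames are passed on to feature extraction.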
Time-domain analysis of music signals extracts their time-domain characteristic parameters. The earliest form of a sampled audio signal is the time-domain signal, so extracting time-domain characteristics was one of the earliest and is one of the most widely used analysis methods. Time-domain analysis has the following characteristics: the audio signal is expressed intuitively and its physical meaning is clear; the required observation equipment is simple, the computation is light, and implementation is easy; and important time-domain characteristic parameters can be obtained.
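As an illustration of how light time-domain analysis is, the sketch below computes three common per-frame parameters. The paper does not list its time-domain features at this point, so short-time energy and zero-crossing rate are assumed examples; the short-time autocorrelation is included only because the introduction mentions it (and reports it to be unsuitable for musical feature analysis).

```python
def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def short_time_autocorrelation(frame, lag):
    """Unnormalized autocorrelation of one frame at the given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

# Toy frame that alternates in sign on every sample.
frame = [0.5, -0.5, 0.5, -0.5, 0.5]
print(short_time_energy(frame))              # 1.25
print(zero_crossing_rate(frame))             # 1.0
print(short_time_autocorrelation(frame, 1))  # -1.0
```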
Deep learning is itself a kind of feature learning, and this paper uses it directly to learn and optimize the original features. The deep autoencoder is a special kind of deep neural network whose output target is the original input data itself (the same essence and dimension as the original data, but a different form of representation). The original data X is fed into the encoder and transformed by the mapping of the nonlinear hidden layers, and the output C(X) is simply another representation of X; in essence it is still the original data X. The network does not need the class labels of the training samples, and the original input data finally serves as the target data for calibration and fine-tuning. The deep autoencoder can therefore be seen as a decomposition and reorganization of the original signal. Its basic structure is shown in Fig. 2. As a fully connected network, it can extract features in two ways: (1) hierarchical feature extraction, in which the image is divided into different levels and the units of each layer extract the image features of that level; this approach handles massive image data easily and extracts the important information in the image; (2) global feature extraction, in which the fully connected network extracts global features of the image, such as gradient direction and color information; this approach obtains the global information in the image and describes its characteristics better. A fully connected network can realize feature extraction in a variety of ways, including multiple convolution and pooling layers, ReLU activation functions, batch training algorithms, and so on. Through network optimization, a fully connected network can achieve better feature extraction without affecting model performance.

Basic structure of the deep autoencoder.
The restricted Boltzmann machine (RBM) is a typical unsupervised learning model that fits the input data with an unsupervised learning algorithm. It is especially suited to complex data sets whose input distribution is unknown: it can characterize arbitrary probability distributions through energy distributions to uncover valid information, and it can provide the necessary target results for the unsupervised learning process. Therefore, this paper uses RBMs to build the deep autoencoder. Since an RBM can be trained quickly and efficiently with the layer-wise contrastive divergence algorithm, using RBMs simplifies training, avoids high complexity, and improves overall efficiency. An RBM is also a type of neural network [35, 36], made up of an interconnected visible layer and hidden layer, as shown in Fig. 3. In addition, the authors introduce a connection layer into the model, for the following reasons: (1) improved generalization: the connection layer lets the trained neural network learn the probability distribution of the data, which improves the generalization ability of the model; (2) improved performance: the connection layer selects and optimizes the input data so that the model converges faster; (3) reduced complexity: the connection layer reduces the complexity of the model, making it simpler and more efficient; (4) better feature extraction: the connection layer extracts features better, which improves classification accuracy.

RBM structure diagram.
As can be seen in Fig. 3, this structure has no visible-to-visible or hidden-to-hidden connections. RBMs are trained with unsupervised greedy layer-by-layer training: the feature vector of the visible layer is passed to the hidden layer, the visible layer is reconstructed from the hidden layer, and the new visible layer passes its feature vector to the hidden layer again, from which the weights are obtained. The features from each layer are then input to the next layer as training samples. The training process is therefore fast and efficient.
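The greedy layer-by-layer procedure can be sketched as below. This is a minimal, unoptimized illustration that assumes Bernoulli units and one-step contrastive divergence (CD-1); the learning rate, epoch count, initialization, layer sizes, and toy data are all hypothetical and are not taken from the paper. Here `b` denotes the visible biases and `a` the hidden biases, which is also an assumption about the paper's notation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, w, a):
    """p(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(a[j] + sum(v[i] * w[i][j] for i in range(len(v))))
            for j in range(len(a))]

def visible_probs(h, w, b):
    """p(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(b[i] + sum(h[j] * w[i][j] for j in range(len(h))))
            for i in range(len(b))]

def train_rbm(data, n_hidden, epochs=50, lr=0.1, seed=0):
    """Train one Bernoulli RBM with one-step contrastive divergence (CD-1)."""
    rng = random.Random(seed)
    n_visible = len(data[0])
    w = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
    b = [0.0] * n_visible  # visible biases
    a = [0.0] * n_hidden   # hidden biases
    for _ in range(epochs):
        for v0 in data:
            h0 = hidden_probs(v0, w, a)                      # visible -> hidden
            h0_sample = [1.0 if rng.random() < p else 0.0 for p in h0]
            v1 = visible_probs(h0_sample, w, b)              # reconstruct visible layer
            h1 = hidden_probs(v1, w, a)                      # new visible -> hidden
            for i in range(n_visible):
                for j in range(n_hidden):
                    w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
                b[i] += lr * (v0[i] - v1[i])
            for j in range(n_hidden):
                a[j] += lr * (h0[j] - h1[j])
    return w, a, b

def greedy_pretrain(data, layer_sizes):
    """Stack RBMs: the hidden activations of one layer train the next layer."""
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        w, a, b = train_rbm(inputs, n_hidden)
        layers.append((w, a, b))
        inputs = [hidden_probs(v, w, a) for v in inputs]
    return layers

# Toy binary data with two patterns; layer sizes are arbitrary.
toy = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
layers = greedy_pretrain(toy, layer_sizes=[3, 2])
```

Each RBM is trained on its own, so no labels and no end-to-end backpropagation are needed at this stage.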
A complete RBM is composed of an interconnected visible layer and hidden layer. The joint distribution relationship between the two can be expressed as an energy function as:
Here, θ = {w, a, b}, where w denotes the connection weights between the visible and hidden layers, and b and a are the corresponding biases, respectively. The probability distribution is given by the Boltzmann distribution as:
It is known from Equation (3) that the probability of the j-th node of the hidden layer being 1 or 0, given the visible layer v, can be found as follows:
Similarly, it is possible to find the probability that the i-th node of the visible layer is 1 or 0 given the nodes of the hidden layer:
Maximizing the log-likelihood function gives:
Taking the derivative of the above equation yields the parameter w corresponding to the maximum value of L; w is the key parameter of the whole network.
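The displayed equations for this derivation are not reproduced in this copy. For reference, the standard forms for a Bernoulli-Bernoulli RBM are given below. They are assumed here rather than recovered from the paper: the equation numbering is not restored, b_i is taken to be the bias of visible unit i and a_j the bias of hidden unit j, and σ denotes the logistic sigmoid.

```latex
\begin{align}
E(v, h; \theta) &= -\sum_{i}\sum_{j} w_{ij}\, v_i h_j - \sum_{i} b_i v_i - \sum_{j} a_j h_j \\
p(v; \theta) &= \frac{\sum_{h} e^{-E(v, h; \theta)}}{Z(\theta)}, \qquad
Z(\theta) = \sum_{v}\sum_{h} e^{-E(v, h; \theta)} \\
p(h_j = 1 \mid v; \theta) &= \sigma\Big(\sum_{i} w_{ij} v_i + a_j\Big), \qquad
p(h_j = 0 \mid v; \theta) = 1 - p(h_j = 1 \mid v; \theta) \\
p(v_i = 1 \mid h; \theta) &= \sigma\Big(\sum_{j} w_{ij} h_j + b_i\Big), \qquad
p(v_i = 0 \mid h; \theta) = 1 - p(v_i = 1 \mid h; \theta) \\
L(\theta) &= \sum_{n} \log p\big(v^{(n)}; \theta\big) \\
\frac{\partial \log p(v; \theta)}{\partial w_{ij}} &=
\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}
\end{align}
```

In the last line, the first expectation is taken over the training data and the second over the distribution defined by the model; contrastive divergence approximates the second term with a short Gibbs chain.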
After training the multilayer network, we use the traditional supervised learning method to fine-tune the entire network from back to front, and finally build the complete DAE, which can be fine-tuned by using the original training data in the supervised part.
There are three advantages of using DAE network for feature optimization: (1) it can extract multi-frame features, which is beneficial for information extraction and analysis; (2) unsupervised learning is beneficial for extracting information from unlabeled data; (3) the dimensionality of the feature vector can be reduced without reducing the representation information, which improves the computing efficiency.
Since its introduction, the deep autoencoder has proven to be a great success in image recognition, speech recognition, and so on. However, no research has so far combined it with GMM and applied it in music emotion recognition. In this paper, we propose a method to optimize features from samples using deep autoencoder by combining and optimizing the original multi-frame and multi-dimensional feature vectors to form a new feature vector, and using GMM to train and model the new feature vector.
The experiments prove that using this method is slightly more accurate than the original feature vector modeling method, proving its effectiveness. In practical application, we can take advantage of the deep learning network’s ability to process consecutive multi-frame signals by feeding multiple frames into the network at one time (10 consecutive frames are selected in this paper during the experiments). Therefore, the number of nodes in the input layer is equal to the number of frames multiplied by the dimensionality of each frame; the number of nodes is usually set larger when setting the first hidden layer to provide good modeling capability. If feature compression is required, the number of nodes in the middle hidden layer is set smaller; if feature expansion is required, the number of nodes in the middle hidden layer is set larger.
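The relation between the number of frames and the input-layer size can be illustrated with a short sketch. It assumes 32-dimensional frames (the per-frame dimension given later for the four-class experiment) and a sliding window that advances one frame at a time; the paper does not say whether consecutive 10-frame groups overlap, so the stride is an assumption.

```python
def splice_frames(frames, context=10):
    """Concatenate each run of `context` consecutive frames into one input vector."""
    spliced = []
    for start in range(len(frames) - context + 1):
        vector = []
        for frame in frames[start:start + context]:
            vector.extend(frame)
        spliced.append(vector)
    return spliced

# Toy data: 12 frames of 32 dimensions; frame t is filled with the value t.
frames = [[float(t)] * 32 for t in range(12)]
inputs = splice_frames(frames, context=10)
print(len(inputs), len(inputs[0]))  # 3 320
```

Each spliced vector has 10 × 32 = 320 entries, which matches the 320 input nodes used for the direct four-class experiment.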
To illustrate the advantages of the deep learning-based folk music therapy model in music classification and related problems, this section uses a traditional backpropagation (BP) neural network for music selection experiments, for comparison with the subsequent deep learning classification experiments. The main experiments are: two-class music experiments, two-class experiments with added samples, two-class experiments with replaced samples, three-class music experiments, and four-class music experiments.
DAE feature extraction
The structure of the deep autoencoder is mainly determined by the number of hidden layers, the number of nodes in each hidden layer, and the type of nodes. After several experiments, the final deep autoencoder used in this paper consists of one input layer, three hidden layers (the second hidden layer is called the middle layer), and one output layer. For the direct four-class classification experiment, the numbers of nodes in the layers are 320-640-160-640-320. The input layer has 320 nodes because each frame has 32 dimensions and splicing 10 frames gives 320 dimensions, so the number of nodes equals the number of sample dimensions. Hidden layer 1 has 640 nodes, which constructs a high-dimensional feature space; the middle layer has 160 nodes, which compresses the original features; hidden layer 3 has 640 nodes, mirroring hidden layer 1; and the output layer has 320 nodes, matching the input layer and realizing the feature reconstruction. Since the network input consists of music signal features with an uncertain value range, rather than the binary distribution of a conventional black-and-white image, the input layer usually has to be modeled with a Gaussian RBM, and after simple processing the mean and variance of each feature dimension of the input data can be normalized. The input and output layer nodes use a linear activation, corresponding to Gaussian-type nodes, and the hidden layer nodes use a sigmoid nonlinear activation, corresponding to Bernoulli-type nodes. For feature extraction in the hierarchical four-class experiment, the number of nodes in each layer of the deep autoencoder has to be changed because the training data differ: the layer sizes are 170-340-85-340-170 for the Arousal axis model and 150-300-75-300-150 for the Valence axis model.
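The layer layout described above can be sketched as a plain forward pass. The sketch only illustrates the shapes and activation types (sigmoid hidden units, linear output); the weights are random placeholders rather than the RBM-pretrained and fine-tuned weights the paper uses, and the function names are hypothetical.

```python
import math
import random

# Input, hidden 1, middle layer, hidden 3, output (direct four-class experiment).
LAYER_SIZES = [320, 640, 160, 640, 320]

def init_layers(sizes, seed=0):
    """Random placeholder weights; in the paper these come from RBM pretraining."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, layers):
    """Return the activations of every layer; the last entry is the reconstruction."""
    activations = [x]
    for index, (weights, bias) in enumerate(layers):
        pre = [sum(w * v for w, v in zip(row, activations[-1])) + b
               for row, b in zip(weights, bias)]
        is_output = index == len(layers) - 1
        # Hidden layers use sigmoid units; the output layer is linear.
        activations.append(pre if is_output else [sigmoid(p) for p in pre])
    return activations

layers = init_layers(LAYER_SIZES)
x = [0.0] * 320                   # one normalized, spliced 10-frame input vector
activations = forward(x, layers)
code = activations[2]             # 160-dimensional middle-layer feature
print([len(a) for a in activations])
```

After training, the middle-layer activations (`code` above) would serve as the optimized feature vector passed on to the classifier.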
DAE-GMM music classification experiment
To allow a comparison with the neural network results, only the classification algorithm is changed; all other conditions are kept the same. The classification results depend on the number of mixture components in the Gaussian mixture model [37]. In this paper, experiments are conducted for four commonly used component counts: GMM8, GMM16, GMM32, and GMM64.
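The classification step with one Gaussian mixture model per music category can be sketched as follows. The sketch assumes diagonal covariances and a maximum-likelihood decision over all frames of a piece; the paper states neither, and the mixture parameters below are hypothetical two-component toys rather than models trained by EM with 8 to 64 components.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2 * math.pi * s) + (xi - m) ** 2 / s
                      for xi, m, s in zip(x, mean, var))

def gmm_log_likelihood(x, gmm):
    """log sum_k weight_k * N(x | mean_k, var_k), computed with log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify(frames, class_gmms):
    """Pick the class whose GMM gives the highest total log-likelihood."""
    scores = {label: sum(gmm_log_likelihood(x, gmm) for x in frames)
              for label, gmm in class_gmms.items()}
    return max(scores, key=scores.get), scores

# Hypothetical 2-component, 2-dimensional models: (weight, mean, variance).
class_gmms = {
    "Exciting": [(0.5, [2.0, 2.0], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])],
    "Peace": [(0.5, [-2.0, -2.0], [1.0, 1.0]), (0.5, [-3.0, -3.0], [1.0, 1.0])],
}
frames = [[2.5, 2.4], [2.0, 3.1]]  # toy feature vectors of one piece
label, scores = classify(frames, class_gmms)
print(label)  # Exciting
```

Changing the number of tuples in each list corresponds to changing the component count (GMM8, GMM16, GMM32, GMM64).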
The experimental results of directly constructing the four-class classification system with DAE-GMM are shown in Table 1. The corresponding results for the neural network algorithm are shown in Table 2.
Accuracy of DAE-GMM direct construction four classification system
Accuracy of neural network four classification system
The comparison between Table 1 and Table 2 shows that the classification accuracy of the system improves after feature optimization with the deep belief network, but deep feature extraction is time-consuming when the training volume is already large. Next, the optimized feature vectors of the deep belief network are used in classification experiments with a two-layer classifier. After the feature matrix is optimized, accuracy generally improves, especially for the Exciting and Peace types of music, while there is still much room for improvement in recognizing the Joy and Sad types. Two limitations of this experiment suggest improvements. First, the number of training and testing samples is small, so the advantages and disadvantages are not clearly distinguished. Second, the selected training songs are not broad enough, or their type labels are not accurate enough, which leads to ambiguity in discrimination. In general, however, the two-layer classifier built on the deeply optimized feature vectors is still more accurate than both direct classification and two-layer classification without feature optimization. This shows that the deep belief network combined with the Gaussian mixture model is effective for music classification and outperforms the previous classification model without feature optimization. For demonstration, a planar radar diagram of the folk music “Picking Tea Lights” (“Cai Cha Lantern”) is shown in Fig. 4.

Radar picture of the folk music ‘Cai Cha Lantern’.
Some scholars emphasize that music therapy itself is a highly subjective process, and it is only the patient’s word whether music is applicable to the treatment of disease. For this reason, the idea of the co-matrix algorithm for music selection can be summarized as follows: the algorithm takes a few mature and effective therapeutic music as the core, combines patients’ actual experience of new music, compares the new music with the core music, compares patients’ recognition of the new music with the core music, and then ranks the new music to analyze whether it is also suitable for the treatment of a certain disease.
The implementation process is as follows: given a mature core music piece i, n pieces of music to be validated, and m patients who test the new music, a matrix is formed from each patient’s degree of recognition of the music being validated.
Each row represents one patient’s judgment of whether each piece is effective in treating their disease: 1 if considered effective, 0 if not. Each column represents a piece of music. The first column is the effective core music i, so every entry in the first column is 1; that is, this music is taken by default to be effective for this type of patient.
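The matrix itself is not reproduced in this copy. From the description above, its general shape can be written as follows, where the symbols A and a_{pk} are introduced here only for illustration: row p corresponds to patient p, the first column to the core music i, and column k + 1 to the k-th piece being validated.

```latex
A =
\begin{pmatrix}
1 & a_{11} & a_{12} & \cdots & a_{1n} \\
1 & a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix},
\qquad a_{pk} \in \{0, 1\}
```

The matrix therefore has m rows and n + 1 columns, with a_{pk} = 1 when patient p considers piece k effective.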
The following is a brief description of the process of selecting music for the treatment of insomnia. The music currently considered generally applicable for the treatment of insomnia is “Night at the Military Port”. It is assumed that there are 9 new pieces that need to be experienced by patients, namely: “The Mark of Rain”, “The Moon in the Second Spring”, “The Starry Sky”, “The Original Landscape of the Hometown”, “Crested Bamboo in the Moonlight”, “Whispers in Autumn”, “Song Family Dynasty”, “Thinking of the King Eerily”, and “Wedding in Dreams”. The new music was tested with the help of 10 insomnia patients, and the following matrix was established.
According to the above matrix, “The Original Landscape of the Hometown” scored 0.9487, compared with the identification result of 0.8956 reported by Weber et al. [37], which shows that the matrix constructed in this paper identifies suitable music more accurately. This piece is preferred by the insomnia patients because it is played mainly on the ceramic flute, with a soft style, no irritating tones, and little melancholy, which is good for the mood and has a calming, mind-nourishing function. “Moonlight Bamboo” scored 0.3162: although its overall style is mainly calm, it still contains an element of cheerfulness that is not suitable for insomnia patients, so it is rated lower. The co-matrix-based selection process thus integrates the subjective feelings of patients well and brings the selection closer to their actual needs.
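The scoring formula is not given in this copy. One formula that would produce the reported values is the cosine similarity between the core-music column and a candidate column: with 10 patients, a candidate approved by 9 patients scores 9 / (√10 · √9) ≈ 0.9487, and one approved by a single patient scores 1 / √10 ≈ 0.3162. The sketch below uses that assumption; the matrix and the candidate names are hypothetical and are not the paper's data.

```python
import math

def cosine_score(matrix, column):
    """Cosine similarity between the core-music column (0) and a candidate column."""
    core = [row[0] for row in matrix]
    candidate = [row[column] for row in matrix]
    dot = sum(c * x for c, x in zip(core, candidate))
    norm = math.sqrt(sum(c * c for c in core)) * math.sqrt(sum(x * x for x in candidate))
    return dot / norm if norm else 0.0

def rank_candidates(matrix, names):
    """Return (name, score) pairs sorted from most to least similar to the core music."""
    scored = [(name, cosine_score(matrix, k + 1)) for k, name in enumerate(names)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical votes from 10 patients (rows). Column 0 is the core music (all 1),
# column 1 is approved by 9 patients, column 2 by 1 patient.
matrix = [[1, 1, 0] for _ in range(9)] + [[1, 0, 1]]
ranking = rank_candidates(matrix, ["candidate A", "candidate B"])
for name, score in ranking:
    print(name, round(score, 4))
```

Under this assumption the score depends only on how many patients approve a piece, so the ranking follows the approval counts.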
In this study, a new method for selecting music therapy songs using deep learning and the co-matrix algorithm is proposed, and a detailed comparison and analysis is carried out to illustrate its advantages over existing methods.
Much of the traditional music therapy song selection process relies on the therapist’s experience and subjective judgment, or is categorized by simple musical attributes (such as rhythm, melody, etc.). There are two main problems with this approach: First, the selection process is affected by the therapist’s personal preference, and the individual differences of patients may not be fully taken into account; Second, existing classification methods often fail to accurately reflect the impact of music on patients’ emotions and diseases, because the therapeutic effect of music is often related to multiple factors (such as music style, lyrics meaning, human emotions, etc.).
Proposed work: Unlike existing work, the proposed approach uses deep learning and the co-matrix algorithm to select therapeutic music systematically along multiple dimensions (such as music characteristics and patient feedback). First, deep learning models extract deep features of music, which reflect the complexity and diversity of music more accurately than traditional musical attributes. The co-matrix algorithm then compares and ranks individual pieces of music, taking into account the actual experience and feedback of patients, so that individual patient needs can be better met.
Experimental comparison shows that the method using deep learning and the co-matrix is superior to the existing methods in music classification accuracy and patient satisfaction. In particular, in the Gaussian mixture model (GMM) experiments, the different configurations of the deep autoencoder-Gaussian mixture model (DAE-GMM) achieved better results than the original GMM on all four emotion categories (excitement, joy, sadness, peace). In addition, the method achieved good results in the music selection experiment for the treatment of insomnia, which verifies its practicability and effectiveness.
In general, compared with the existing work, the proposed method in this study not only improves the accuracy of song selection in music therapy, but also improves the level of individualization of treatment, which is conducive to the further development and application of music therapy.
Conclusion
This paper introduces the deep autoencoder into feature optimization, compares the proposed approach with the traditional music selection process, and explains its subjective and objective effectiveness. The music selection method for music therapy can be further improved in two ways:
(1) Improvement in the music classification model
Like other statistical models, the performance of a Gaussian mixture model depends mainly on the accuracy and quantity of the training data, that is, on whether the model can be fully trained. Therefore, how to improve the accuracy of the training data and reduce the requirements on the length of the model remains a problem worth studying.
(2) The study of the meaning of the lyrics and human emotions
Current classification techniques still do not handle lyric emotion and human emotion in music well, that is, how to extract and recognize the semantics of lyrics. Therefore, once the association between music style and human emotion has been solved, how to incorporate lyric meaning and study the relationship between music and emotion more completely remains worth studying.
Data availability statement
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Funding statement
No funding was received.
