Abstract
BACKGROUND:
Snoring source analysis is essential for an appropriate surgical decision for both simple snorers and obstructive sleep apnea/hypopnea syndrome (OSAHS) patients.
OBJECTIVE:
As snoring sounds carry significant information about tissue vibrations within the upper airway, a new feature entitled compressed histogram of oriented gradients (CHOG) is proposed to recognize vibration patterns of the snoring source acoustically by compressing histogram of oriented gradients (HOG) descriptors via the multilinear principal component analysis (MPCA) algorithm.
METHODS:
Each vibration pattern corresponds to a sole or combinatorial vibration among the four upper airway soft tissues of soft palate, lateral pharyngeal wall, tongue base, and epiglottis. 1037 snoring events from noncontact sound recordings of 76 simple snorers or OSAHS patients during drug-induced sleep endoscopy (DISE) were evaluated.
RESULTS:
With a support vector machine (SVM) as the classifier, the proposed CHOG achieved a recognition accuracy of 89.8% for the seven observable vibration patterns of the snoring source categorized in our most recent work.
CONCLUSION:
The CHOG outperforms other single features widely used for acoustic analysis of a sole vibration site.
Introduction
Snoring is a prevalent sleep-related disorder affecting approximately 25% of females and 45% of males in the general population [1]. Defined as loud upper airway breathing caused by vibrations of the pharyngeal tissues during sleep [2], snoring may disturb the sleep of a bed partner or others and may become a cause of social disturbance. Snoring is also an important indicator of obstructive sleep apnea/hypopnea syndrome (OSAHS) [3], which is characterized by repetitive events of hypopnea and apnea resulting from partial or complete collapse within the upper airway [4]. OSAHS decreases sleep quality, leading to daytime sleepiness and poor job performance. Furthermore, it increases the risk of hypertension, stroke, and cardiovascular diseases [5,6].
There are several treatments for snoring such as continuous positive airway pressure (CPAP), oral appliances, and surgical treatment [7–9]. Since their effectiveness depends on the patient selection as well as the treatment procedure choice [9], the knowledge of anatomic structures of vibration and obstruction is necessary for a targeted and successful treatment.
Drug-induced sleep endoscopy (DISE) is a widely used method to determine the sites of vibration and obstruction through direct visual observation of the patient's pharyngeal cavity [10]. However, since patients must be induced to sleep by intravenous sedation with endoscopic equipment inserted into their pharyngeal cavity, DISE is invasive and time-consuming. Considering that snoring sounds carry significant information about the dynamics of the upper airway [11], acoustic analysis has become a promising means of identifying the snoring source, with advantages such as lower cost, convenience, and non-invasiveness.
To date, an increasing number of studies have attempted to relate different snoring sources to various acoustic features such as crest factor [12], peak and centre frequencies [13], fundamental and formant frequencies [14], statistical moment coefficients of skewness and kurtosis [15], wavelet features [16], and a combination of multiple features [17]. However, most of these studies analyzed only snoring events originating from the vibration of a single soft tissue. In practice, many snoring events involve combinatorial vibrations of multiple soft tissues, which are especially common in OSAHS patients [18].
In this work, a total of seven vibration patterns of snoring sources in line with the work of Beijing Hospital, China [14] are recognized, including vibrations of one, two, or more soft tissues. Since the acoustic characteristics of the first several post-apnea snoring events are always irregular to some extent due to the fact that the upper airway shape always changes gradually after apnea [19], only those recordings with at least 10 consecutive steady snoring events were chosen for analysis, during which there is no noticeable occurrence of apnea or hypopnea.
Recently, texture-based descriptors have been successfully used in the field of image recognition [20]. When the spectrograms of snoring events are given as the input images, these descriptors are able to depict the spectral information [21], which can serve as features related to the vibration patterns of the snoring source. Motivated by these studies, we propose a new feature denoted as compressed histogram of oriented gradients (CHOG), which is obtained by compressing the histogram of oriented gradients (HOG), a set of texture-based descriptors [20], via the multilinear principal component analysis (MPCA) algorithm [22]. Then, a support vector machine (SVM) is used to classify the vibration patterns of the snoring source.
The rest of this paper is organized as follows: Section 2 describes the method of data acquisition and annotation. Section 3 illustrates the proposed feature and recognition procedure. Detailed experimental results are provided in Section 4. Section 5 presents further discussions. Lastly, Section 6 concludes the work.
Data acquisition and annotation
This work was approved by the Ethics Committee at Beijing Hospital, China, and all the snoring recordings were obtained from the Otolaryngology Department of Beijing Hospital. We used the recordings from 76 patients diagnosed as either simple snorers or OSAHS patients with an apnea-hypopnea index (AHI) ≤45 assessed by polysomnography (PSG). Informed consent of these patients was obtained prior to the study. The patients’ demographic information (age, body mass index (BMI), and AHI) is listed in Table 1.
Demographic information of the 76 patients
DISE was performed in the operating room by the same otolaryngologist and anesthesiologist for the purpose of snoring source determination. All patients were in the supine position and were given topical nasal administration of aerosolized 3% ephedrine and 2% lidocaine, and were then induced to sleep by injecting dexmedetomidine and propofol intravenously. When patients snored steadily, a video fibrolaryngoscope (Machida ENT-30PIII), which was connected to a high-speed video system (Karl Storz Image 1 HUB HD), was inserted through a naris into the patient’s pharyngeal cavity. The vibrations of soft tissues in the upper airway were observed and recorded by the video system [19]. In parallel with the video recording, the snoring sounds were collected with a non-contact microphone (20 Hz–20 kHz, CHZ-213, BAST) positioned about 30 cm above the patient’s mouth, which was connected to a data acquisition system (developed by Nanjing University of Science and Technology) configured with a sampling frequency of 16 kHz and a resolution of 16 bits. Only snoring recordings with at least 10 consecutive steady snoring events were used for analysis in order to eliminate the influence of shape changes in the upper airway and to keep the characteristics of snoring events predominantly related to the vibration patterns of the snoring source [14].
According to the DISE results, the snoring sounds were mainly generated by the soft palate, lateral pharyngeal wall, tongue base, and/or epiglottis. Among these tissues, the soft palate is the most frequent vibration site, followed by the lateral pharyngeal wall. In our most recent work, seven vibration patterns of the snoring source were observed as follows [14]: (1) sole vibration of soft palate; (2) obvious vibration of soft palate with mild vibration of epiglottis or tongue base; (3) sole vibration of epiglottis; (4) no obvious vibration of any tissues (snoring might be generated by the airflow passing through the extremely narrow upper airway since the soft palate almost touches the posterior pharyngeal wall); (5) combinatorial vibration of soft palate and lateral wall with mild vibration of epiglottis or tongue base; (6) combinatorial vibration of soft palate and lateral wall; (7) sole vibration of lateral pharyngeal wall. For simplicity, these seven categories are denoted in the sequel as P, P+E/T, E, NONE, ALL, P+L and L, respectively. We removed the recordings with strong background noise and manually extracted snoring events from the recordings. As a result, 1037 events in total with lengths ranging from 0.22 s to 2.18 s were available; detailed information is provided in Table 2.
Detailed information of the recordings and snoring events used in this work
In this section, we illustrate the recognition procedure (Fig. 1). The individual processing steps are described in the following subsections.

Block diagram of the recognition procedure in this work.
Extracting significant information from snoring events is the major challenge in snoring source analysis. In this work, we extracted CHOG from spectrograms to characterize the spectral texture of snoring events. The extraction process consists of three steps: spectrogram resizing, HOG descriptors extraction, and MPCA-based feature compression.
Spectrogram resizing
The snoring events used in this work were divided into frames of 64 ms with a shift of 32 ms between adjacent frames. Then the short-time Fourier transform (STFT) with a Hamming window was applied to each frame to obtain the spectrograms, which serve as the input images for the subsequent HOG descriptor extraction. Because the HOG calculation requires inputs of identical size, these spectrograms were all resized to 64 × 256 by means of nearest neighbor interpolation [23].
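The framing arithmetic and the resizing step can be illustrated with a minimal sketch. It assumes the 16 kHz sampling rate stated in the data acquisition section, counts only full frames without padding, and uses a generic nearest-neighbor mapping; the helper names are hypothetical, and the text does not specify which spectrogram axis has 64 points and which has 256.

```python
def frame_params(fs_hz=16000, frame_ms=64, shift_ms=32):
    """Convert the frame length and shift from milliseconds to samples."""
    frame_len = fs_hz * frame_ms // 1000
    hop = fs_hz * shift_ms // 1000
    return frame_len, hop


def num_frames(num_samples, frame_len, hop):
    """Number of full frames in the signal (no padding; an assumption)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop


def resize_nearest(image, out_rows, out_cols):
    """Nearest-neighbor resize of a 2-D list of lists to out_rows x out_cols."""
    in_rows, in_cols = len(image), len(image[0])
    resized = []
    for r in range(out_rows):
        # Map the centre of each output pixel back to the closest input row.
        src_r = min(in_rows - 1, int((r + 0.5) * in_rows / out_rows))
        row = []
        for c in range(out_cols):
            src_c = min(in_cols - 1, int((c + 0.5) * in_cols / out_cols))
            row.append(image[src_r][src_c])
        resized.append(row)
    return resized


frame_len, hop = frame_params()
# Shortest (0.22 s) and longest (2.18 s) events reported in the data section.
short_event = num_frames(round(0.22 * 16000), frame_len, hop)
long_event = num_frames(round(2.18 * 16000), frame_len, hop)

# Toy 2 x 2 image enlarged to 4 x 4.
toy_resized = resize_nearest([[1, 2], [3, 4]], 4, 4)
```

Under these assumptions a frame spans 1024 samples with a hop of 512 samples, so the number of frames varies widely between events, which is why a fixed-size resize is needed before HOG extraction.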
HOG descriptors extraction
HOG descriptors were used to describe the local object appearances based on the distribution of local gradients. The computation steps are as follows [20,21]:
1. Given a resized spectrogram of a snoring event, compute the gradient at each point. For example, consider a spectrogram S(x, y) with amplitude H(x, y), where x is the index of temporal frames and y is the index of Fourier coefficients. Its horizontal and vertical gradients, denoted as $G_x(x, y)$ and $G_y(x, y)$, respectively, are calculated with the simple [−1 0 1] mask as
$$G_x(x, y) = H(x+1, y) - H(x-1, y), \qquad G_y(x, y) = H(x, y+1) - H(x, y-1).$$
2. Divide the spectrogram into small non-overlapping regions, i.e. cells. At each point, compute a function of the gradient magnitude as the weighted vote of the corresponding gradient orientation bin. Then, a local gradient histogram for each cell, denoted as a cell descriptor, is constructed by accumulating the votes of all points within the cell in each orientation bin.
3. Group cells into larger regions, i.e. blocks, with 50% overlap between neighboring blocks. Each block descriptor is the concatenation of all cell descriptors within the block. It is worth noting that local normalization of each block independently turns out to be essential for good performance because gradient strengths vary over a wide range. In addition, overlapping blocks reuse cells in the final descriptors with different normalizations, which enhances reliability [20].
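The first two steps can be sketched as follows. This is only an illustration under stated assumptions: border points replicate the edge value, orientations are treated as unsigned (0 to 180 degrees), and each point votes with its gradient magnitude. These choices follow common HOG practice and are not confirmed by the text; the function names are hypothetical.

```python
import math


def gradients(H):
    """Central-difference gradients with the [-1 0 1] mask.

    H[x][y] is the amplitude at frame x and frequency bin y. Border points
    replicate the edge value (an assumption of this sketch).
    """
    nx, ny = len(H), len(H[0])
    Gx = [[0.0] * ny for _ in range(nx)]
    Gy = [[0.0] * ny for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            Gx[x][y] = H[min(x + 1, nx - 1)][y] - H[max(x - 1, 0)][y]
            Gy[x][y] = H[x][min(y + 1, ny - 1)] - H[x][max(y - 1, 0)]
    return Gx, Gy


def cell_histogram(Gx, Gy, x0, y0, cell=8, bins=9):
    """Magnitude-weighted histogram of unsigned orientations for one cell."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for x in range(x0, x0 + cell):
        for y in range(y0, y0 + cell):
            magnitude = math.hypot(Gx[x][y], Gy[x][y])
            angle = math.degrees(math.atan2(Gy[x][y], Gx[x][y])) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# Toy 8 x 8 "spectrogram" whose amplitude grows linearly along the frame axis.
H = [[float(x) for _ in range(8)] for x in range(8)]
Gx, Gy = gradients(H)
hist = cell_histogram(Gx, Gy, 0, 0)
```

In this toy case every gradient points along the frame axis, so all votes fall into the first orientation bin.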
Specifically, given a spectrogram with size 64 × 256, a cell and a block are defined as a region with size 8 × 8 and 16 × 16, respectively. The cell descriptor is calculated for each cell with nine gradient orientation bins, so that each block descriptor is 36-dimensional which concatenates four cell descriptors and is then normalized independently. The final HOG descriptors are obtained as a combination of all block descriptors with a dimension of 7 × 31 × 36, denoted as a third-order feature tensor sample
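The descriptor dimensions quoted above follow directly from the cell and block sizes, as the sketch below illustrates. The L2 normalization and the placeholder cell histograms are assumptions for illustration only; the text states that blocks are normalized independently but does not name the norm.

```python
import math


def block_grid(height, width, block=16):
    """Number of blocks along each axis when blocks overlap by 50%."""
    stride = block // 2
    return (height - block) // stride + 1, (width - block) // stride + 1


def l2_normalize(vector, eps=1e-6):
    """L2-normalize one block descriptor (the norm choice is an assumption)."""
    norm = math.sqrt(sum(v * v for v in vector) + eps * eps)
    return [v / norm for v in vector]


def block_descriptor(cell_hists, i, j):
    """Concatenate the 2 x 2 cell histograms of block (i, j) and normalize."""
    concatenated = []
    for di in (0, 1):
        for dj in (0, 1):
            concatenated.extend(cell_hists[i + di][j + dj])
    return l2_normalize(concatenated)


rows, cols = block_grid(64, 256)
cells_per_block, bins = 4, 9
descriptor_len = cells_per_block * bins
total_dim = rows * cols * descriptor_len

# Placeholder cell histograms: an 8 x 32 grid of cells with 9 bins each.
cell_hists = [[[1.0] * bins for _ in range(32)] for _ in range(8)]
hog = [[block_descriptor(cell_hists, i, j) for j in range(cols)]
       for i in range(rows)]
```

The resulting nested list has shape 7 × 31 × 36, i.e. 7812 values in total, which matches the vector length obtained when the descriptors are flattened.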
Conventionally, HOG descriptors are required to be vectorized before being fed into a classifier. However, because the overlapping blocks increase the reliability of classification but generate considerable redundancy, transforming HOG descriptors (7 × 31 × 36) into vectors (7812 × 1) results in high computational cost. Furthermore, vectorization may break the original structure of the feature set and lose the spatial correlation information. To overcome these drawbacks, we applied the MPCA algorithm for feature compression.
Principal component analysis (PCA) is a popular unsupervised linear technique for vector dimension reduction. As a generalization of PCA, MPCA was proposed in [22] for tensor input instead of a conventional vector. To be specific, MPCA is a tensor decomposition algorithm with orthogonality constraints, which decomposes a tensor into a core tensor multiplied by matrices each along one mode (dimension).
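To make the notion of multiplying a tensor by one matrix per mode concrete, the sketch below projects a toy third-order tensor onto a smaller one. The projection matrices here are hand-picked for illustration; in MPCA they would be learned from the training tensors, which this sketch does not attempt. The function names are hypothetical.

```python
def mode_product(tensor, matrix, mode):
    """Multiply a third-order tensor (nested lists) by a matrix along one mode.

    matrix has shape (new_size, old_size), where old_size is the size of the
    tensor along `mode` (0, 1, or 2).
    """
    dims = [len(tensor), len(tensor[0]), len(tensor[0][0])]
    out_dims = list(dims)
    out_dims[mode] = len(matrix)
    out = [[[0.0] * out_dims[2] for _ in range(out_dims[1])]
           for _ in range(out_dims[0])]
    for i in range(out_dims[0]):
        for j in range(out_dims[1]):
            for k in range(out_dims[2]):
                index = [i, j, k]
                total = 0.0
                for m in range(dims[mode]):
                    src = list(index)
                    src[mode] = m
                    total += matrix[index[mode]][m] * tensor[src[0]][src[1]][src[2]]
                out[i][j][k] = total
    return out


def project(tensor, factors):
    """Apply one projection matrix per mode to obtain the compressed tensor."""
    for mode, matrix in enumerate(factors):
        tensor = mode_product(tensor, matrix, mode)
    return tensor


# Toy 2 x 3 x 2 tensor compressed to 1 x 2 x 1 with orthonormal rows.
X = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
     [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]]
U1_T = [[1.0, 0.0]]                          # keeps the first index of mode 1
U2_T = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]    # keeps indices 1 and 3 of mode 2
U3_T = [[0.0, 1.0]]                          # keeps the second index of mode 3
Y = project(X, [U1_T, U2_T, U3_T])
```

Because each mode is reduced separately, the tensor keeps its three-way layout after compression instead of being flattened into a single vector.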
Given a set of third-order feature tensor samples

Illustration of tensor decomposition.
The main challenge of MPCA is to find appropriate factor matrices to project the original tensor as optimally as possible while retaining the maximum information. The rank $P_n$ is determined depending on the number of dominant eigenvalues of the product
The decomposition process was implemented by the N-way Toolbox [27]. The threshold 𝜃 was set to 0.9 and the corresponding $P_1$, $P_2$, and $P_3$ were 6, 27, and 23, respectively. Considering that each element of the compressed tensor, denoted as
We employed SVM for the classification task, which is a robust supervised learning method widely applied to pattern recognition. Since SVM was originally designed for binary classification by constructing a hyperplane based on the theory of structural risk minimization [29], we utilized the “one-against-one” strategy [30] for multi-class classification in this work. LIBSVM [31] was used to implement SVM with radial basis function (RBF) kernel while the two parameters, gamma and cost, were set to 0.125 and 4, respectively.
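As a rough illustration of the two ingredients named above, the sketch below evaluates the RBF kernel with gamma = 0.125 and combines pairwise decisions by majority voting. It is not LIBSVM code: `pairwise_winner` is a hypothetical stand-in for a trained binary SVM, and the tie-breaking rule (first listed class wins) is an assumption.

```python
import math
from itertools import combinations

PATTERNS = ["P", "P+E/T", "E", "NONE", "ALL", "P+L", "L"]


def rbf_kernel(u, v, gamma=0.125):
    """RBF kernel exp(-gamma * ||u - v||^2) with the gamma quoted in the text."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def one_against_one_vote(pairwise_winner, classes=PATTERNS):
    """Return the class that wins the most pairwise contests.

    pairwise_winner(a, b) stands in for a trained binary SVM and must return
    either a or b. Ties go to the class listed first (an assumption).
    """
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_winner(a, b)] += 1
    return max(classes, key=lambda c: votes[c]), votes


num_binary_classifiers = len(list(combinations(PATTERNS, 2)))

# Hypothetical decisions: every contest involving "E" is won by "E";
# all other contests are won by the first class of the pair.
predicted, votes = one_against_one_vote(
    lambda a, b: "E" if "E" in (a, b) else a
)
```

With seven vibration patterns, the one-against-one strategy trains 7 × 6 / 2 = 21 binary classifiers, and a test event is assigned to the pattern that collects the most votes.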
Experimental results
Experimental setup
Two hundred independent trials were conducted for performance evaluation. In each trial, the 1037 snoring events listed in Table 2 were randomly separated into two sets, with 60% of the events used for training and the remaining 40% for testing. The ratio of training to testing snoring events within each category of vibration pattern was also kept at 60:40. Note that the factor matrices calculated from the feature tensor of the training set were preserved for use in the testing stage.
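A per-category 60:40 split repeated over 200 trials could be sketched as below. The label list is a placeholder (the real per-pattern counts are in Table 2), and the rounding of the per-class cut point as well as the fixed random seed are assumptions of this sketch.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.6, rng=None):
    """Split event indices so each pattern keeps roughly the same train share."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return sorted(train), sorted(test)


# Placeholder labels for illustration; not the counts used in the paper.
labels = ["P"] * 50 + ["E"] * 20 + ["L"] * 30
rng = random.Random(0)
splits = [stratified_split(labels, 0.6, rng) for _ in range(200)]
train, test = splits[0]
```

Any projection or scaling learned for the features would be fitted on `train` only and then reused on `test`, mirroring the note above about preserving the factor matrices.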
Performance metrics
We employed recognition accuracy, i.e. the proportion of test snoring events whose vibration pattern is correctly recognized, to evaluate the overall recognition performance.
In addition, the performance of each category of vibration pattern was measured by the three common metrics: precision, recall, and F-score, which were then averaged over 200 independent random trials.
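These metrics can all be derived from a confusion matrix, as in the following sketch. The row/column convention (rows are true patterns, columns are predicted patterns) and the toy matrix are assumptions for illustration, not data from this study.

```python
def accuracy(confusion):
    """Fraction of events on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_metrics(confusion):
    """Precision, recall, and F-score for each class.

    confusion[i][j] counts events of true class i predicted as class j
    (the row/column convention is an assumption of this sketch).
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))
        actual_k = sum(confusion[k])
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f_score))
    return metrics


# Toy 3-class confusion matrix; not data from the paper.
confusion = [[8, 2, 0],
             [1, 9, 0],
             [0, 0, 10]]
overall = accuracy(confusion)
metrics = per_class_metrics(confusion)
```

Averaging `overall` and each entry of `metrics` across the trials would give figures of the kind reported in the following tables.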
In this section, we compared the recognition performance of the proposed CHOG and other widely used features including crest factor (CF), fundamental frequency (F0), formants, spectral frequency features (SFF), power ratio between below and above 800 Hz (PR800), sub-band energy ratio (SER), Mel-scale frequency cepstral coefficients (MFCC), empirical mode decomposition-based features (EMDF), and wavelet energy features (WEF) [16,17,32]. The box plot in Fig. 3 depicts the distribution of recognition accuracy values over the 200 trials. It demonstrates that CHOG is superior to the other features with an average accuracy of 89.8%, followed by MFCC and SFF, which achieved 83.0% and 80.7%, respectively. By contrast, the average accuracies of PR800, F0, and CF are below 40%, indicating that they may not be suitable for the snoring source recognition task.

The box plot of recognition accuracy for all features. The central line represents the median. The bottom and top edges of the box indicate the first and third quartiles, respectively. The whiskers extend to the extreme non-outliers and the outliers are symbolized as ‘+’.
To further investigate the performance of these features, comparisons of the averaged precision, recall, and F-score over 200 trials are provided in Tables 3, 4, and 5. The performance metrics of CHOG are higher than those of the other features for almost all categories of vibration pattern, except for pattern E, for which the metrics of CHOG are slightly lower than those of MFCC.
Comparison of all features in terms of the averaged precision for each category
Comparison of all features in terms of the averaged recall for each category
Comparison of all features in terms of the averaged F-score for each category
Note that the threshold 𝜃 in (6), which determines the factor matrix ranks $P_n$, n = 1, 2, 3, and the CHOG dimension K may have a non-negligible impact on the performance of CHOG. In this section, we investigated the influence of these two parameters. The threshold 𝜃 was set to 0.9, 0.8, 0.7, and 0.6, while the corresponding ranks

Influence of threshold 𝜃 and dimension K on the performance of CHOG.
In addition, according to the experimental results, the performance metrics for vibration patterns P+E/T and E are higher than those for pattern P+L. To be specific, the confusion matrix after one trial is provided in Table 6. There is only one snoring event of pattern P+E/T and pattern E that is misclassified, while a small number of events are confused with each other for three pairs of patterns: P and P+L, NONE and L, ALL and P+L.
Confusion matrix after one trial using CHOG
Snoring solely originating from vibration of the epiglottis (vibration pattern E) is relatively rare in practice, and snorers who have this type of snoring cannot be treated by regular ENT surgical procedures. In this work, we distinguished snoring events of pattern E from others with high accuracy, which helps to avoid ineffective surgery for these snorers. In addition, the difference between snoring events in patterns ALL and P+L is the slight flutter of the epiglottis or tongue base, which is often indiscernible even under DISE observation. Therefore, the confusion between these two patterns in our experiment is consistent with the outcome of DISE. However, the differences between snoring events in vibration patterns P and P+L, and between NONE and L, are relatively significant in DISE. According to the analysis of detailed experimental results, we found that misclassified events are concentrated in several recordings, and the reason is under further investigation.
Considering that CHOG depicts the texture information of the spectrogram from an image processing perspective while MFCC and SFF, which achieve comparable performance to CHOG as depicted in Fig. 3, characterize the Mel-spectral and spectral information, respectively, we fused these features in order to investigate the impact of a multi-feature analysis strategy on recognition performance. The ReliefF algorithm [33] was applied to the aggregation of CHOG, MFCC, and SFF to calculate a contribution weight for each feature element. Since a higher positive weight represents a more important contribution, all feature elements were ranked in descending order of weight and the first 120 feature elements were selected to construct the fusion feature set, to which CHOG, MFCC, and SFF contribute 50%, 26.7%, and 23.3%, respectively. Experimental results after 200 trials show that the fusion feature set achieves an average recognition accuracy of 92.6%. This demonstrates that fusing multiple diversified features may improve recognition performance, and a multi-feature analysis strategy can be adopted in future research to construct a more comprehensive and discriminative feature set.
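The ranking-and-selection step described above can be sketched as follows. The sketch does not implement ReliefF itself: the weights are placeholder values standing in for ReliefF output, and the function name is hypothetical.

```python
from collections import Counter


def select_top_k(weights, sources, k=120):
    """Rank feature elements by descending weight and keep the first k.

    weights[i] is the relevance weight of element i (for example from
    ReliefF) and sources[i] names the feature it came from.
    """
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    selected = order[:k]
    counts = Counter(sources[i] for i in selected)
    shares = {name: 100.0 * count / len(selected)
              for name, count in counts.items()}
    return selected, shares


# Placeholder weights for six feature elements; not values from the paper.
weights = [0.30, -0.05, 0.22, 0.10, 0.18, 0.01]
sources = ["CHOG", "CHOG", "MFCC", "SFF", "CHOG", "MFCC"]
selected, shares = select_top_k(weights, sources, k=4)
```

For the 120 selected elements reported above, shares of 50%, 26.7%, and 23.3% correspond to 60, 32, and 28 elements from CHOG, MFCC, and SFF, respectively.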
1. It was demonstrated that the noncontact acoustic analysis is a promising alternative to determine the snoring source of patients, which is neither complicated nor invasive compared to conventional methods such as DISE.
2. The snoring source was categorized into seven vibration patterns in our recent work according to the different combinations of upper airway soft tissues vibrating simultaneously.
3. The proposed feature CHOG can characterize the textural information of the snoring sound spectrograms. As a result, it achieved an accuracy of 89.8% for recognizing the seven vibration patterns of the snoring source based on the sound recordings of 76 simple snorers or OSAHS patients during DISE.
4. Experimental results showed that the snoring source recognition performance of CHOG was better than that of other widely used single features, while a higher recognition accuracy could be obtained when CHOG was fused with other features such as MFCC and SFF.
Acknowledgements
The work was supported by the National Natural Science Foundation of China under Grant Nos 61271410 and 61401203.
Conflict of interest
None to report.
