Sound event localization and detection using element-wise attention gate and asymmetric convolutional recurrent neural networks

Abstract

In sound event localization and detection, the standard square convolution kernel has limited representational ability, and recurrent neural networks usually ignore the differing importance of the elements within an input vector. This paper proposes an element-wise attention gate-asymmetric convolutional recurrent neural network (EleAttG-ACRNN) to improve the performance of sound event localization and detection. First, a convolutional neural network with context gating and asymmetric squeeze-excitation residual blocks is constructed: asymmetric convolution enhances the capability of the square convolution kernel, squeeze excitation models the interdependence between channels, and context gating weights the important features and suppresses the irrelevant ones. Next, to improve the expressiveness of the model, we integrate the element-wise attention gate into the bidirectional gated recurrent network, which highlights the importance of different elements within an input vector and further learns the temporal context information. Evaluation results on the TAU Spatial Sound Events 2019-Ambisonic dataset show the effectiveness of the proposed method: compared to a CRNN method, it improves SELD performance by up to 0.05 in error rate, 1.7% in F-score, 0.7° in DOA error, and 4.5% in frame recall.

Keywords

Sound event localization and detection, asymmetric convolution, context gating, squeeze excitation, element-wise attention gate

1. Introduction

Sound event localization and detection (SELD) consists of identifying the temporal activity of each sound event, estimating its spatial position trajectory while it is active, and associating a textual label with it. It is a combined task of sound event detection (SED) and direction-of-arrival (DOA) estimation. SELD is used in many fields, such as robotics, smart cities, smart homes and industries, smart conferences, and biodiversity monitoring, and it has broad application prospects and important research value.

Owing to the emergence of more hard-labelled audio datasets, a growing number of methods based on deep neural network models have greatly improved the performance of SELD. Recently, deep neural networks have achieved good results in sound event detection [16] and have been applied successfully to pure source localization [2,7,19], showing potential for joint modelling of the SELD task. Adavanne et al. [1] present SELDnet, a convolutional recurrent neural network that can identify, locate and track multiple sound events at the same time; their results show that the method is generic and applicable to any array structure, and robust to reverberation and low signal-to-noise ratio scenarios. Since then, most architectures have combined a CNN with an RNN. Kong et al. [14] propose a sound event localization and detection method based on convolutional neural networks; comparing CNNs with 5, 9, and 13 layers, they find that the 9-layer CNN performs best. Kapka et al. [12] use four SELDnet-like single-output CRNN models for estimating the number of sound sources, estimating the direction of arrival of a single sound source, estimating the direction of arrival of a second source when the direction of the first is known, and multi-label classification, achieving a lower error rate than CRNN. Cao et al. [5] use two CRNNs for SED and DOA estimation, with log mel features for sound event detection and intensity vector and generalized cross-correlation features for localization, which improves both SED and DOA estimation and performs significantly better than CRNN. Ranjan et al. [20] combine a deep residual network with a recurrent neural network to estimate the classes and directions of sound events in reverberant environments, a great improvement over the CRNN. Cordourier et al. [8] use the generalized cross-correlation with phase transform algorithm to augment the magnitude and phase features at each frame. Because generalized cross-correlation can estimate the time difference of arrival in the audio signal, the resulting CRNN achieves good results compared with using only phase and amplitude as input features. Ronchini et al. [21] use a convolutional recurrent neural network with rectangular filters to identify important task-related features, together with data augmentation to increase the size of the training dataset; the system obtains better results than the CRNN model. Celsi et al. [6] use quaternions as input features. These are related to the sound intensity, which in turn is related to the direction of arrival of the sound source, and the method improves the performance of sound event localization compared to existing methods. Krause et al. [15] design arborescent convolutional recurrent neural networks for joint localization and detection of overlapping acoustic events, in which the phase and amplitude channels are processed independently in two branches that are connected before the recurrent layer. Nustede et al. [18] incorporate group delay features into the baseline system of DCASE 2019 task 3, supplementing them with amplitude features. Group delay encoding may constitute a more robust feature for data-driven algorithms, as it represents time delays of the signal’s spectral-band envelopes. Komatsu et al. [13] combine the learnable gated linear unit with a convolutional neural network to extract useful features from the amplitude and phase, which improves the performance of SELD compared to the CRNN method. Guirguis et al. [10] use a temporal convolutional network for sound event localization and detection. It uses dilated convolution to enlarge the receptive field, so that more input data contributes to each output. The framework achieves good results on four different datasets, is robust to different types of noise and reverberation, and improves the performance of sound event localization and detection.

The above SELD methods based on deep neural network models have achieved good results. However, most models use the standard square convolution kernel, which cannot extract sufficiently rich feature information. In addition, recurrent neural networks (RNNs) ignore the differing importance of the elements within an input vector, which degrades SELD performance. To address these problems and further improve the performance of SELD, this paper proposes the EleAttG-ACRNN model, which is evaluated on the TAU Spatial Sound Events 2019-Ambisonic dataset. The experimental results show that the proposed method improves the performance of sound event localization and detection over a CRNN method. The main contributions of this manuscript are as follows:

An asymmetric convolutional block (ACB) replaces the standard convolutional layer. The ACB increases the weight of the kernel skeleton positions and improves the representational ability of the square convolutional kernel, allowing the network to extract rich features.

Context Gating (CG) is used as the activation function. CG highlights the important information of time-frequency units by weighting important features, which reduces the interference of useless information.

Squeeze Excitation (SE) is added after convolution. SE models the interdependence between channels, allows the network to recalibrate features, and learns to use global information to selectively emphasize informative features, enabling the model to extract more discriminative sound features.

An Element-wise Attention Gate (EleAttG) is integrated into the Bidirectional Gated Recurrent Unit (BGRU) network. It highlights the importance of different elements in the input vector, suppresses the influence of unimportant elements, and captures the temporal correlation between frames.

2. Sound event localization and detection based on EleAttG-ACRNN model

2.1. EleAttG-ACRNN model structure

We propose the EleAttG-ACRNN model, which consists of an input layer, three CG-ResASECNN modules, two EleAttG-BGRU modules, fully connected (FC) layers and an output layer, as shown in Fig. 1. First, the phase and amplitude of the sound signal are fed to the three stacked CG-ResASECNN modules for deep feature extraction. The output of the CG-ResASECNN modules is then sent to the EleAttG-BGRU modules, which highlight the importance of different elements within an input vector and further learn the temporal context information. The output of the EleAttG-BGRU modules is sent to the SED branch and the DOA branch. The SED branch outputs the event classes active in each time frame through a sigmoid activation function, and the DOA branch outputs the azimuth and elevation of each event class in each time frame.

Fig. 1.

EleAttG-ACRNN network structure diagram.

2.2. Context gating-asymmetric squeeze excitation residual convolution neural network (CG-ResASECNN)

A conventional CNN performs a standard square convolution, which cannot extract sufficiently rich feature information. Therefore, this paper adopts the asymmetric convolution block (ACB) as the building block of the CNN [9]. The ACB uses one-dimensional asymmetric convolutions to improve the representational ability of the square convolution kernel and enrich the feature space. To extract effective information from amplitude and phase, context gating (CG) is used as the activation function to weight the features. Squeeze excitation (SE) is then applied to model interdependencies between channels and to strengthen the representational power of the CNN by improving the quality of spatial encodings throughout its feature hierarchy. The residual connection speeds up training, greatly alleviates the vanishing gradient problem in deep networks, and improves model performance. The structure is shown in Fig. 2.

Fig. 2.

CG-ResASECNN network structure diagram.

2.2.1. Asymmetric convolution block (ACB)

A standard $d \times d$ convolutional layer can be replaced by a sequence of two layers with $1 \times d$ and $d \times 1$ kernels to reduce the number of parameters and the required computation. However, such a transformation results in significant loss of feature information and does not work well at the low-level layers of the model. Therefore, in this paper we use the Asymmetric Convolution Block (ACB) [9] as a building block to replace the standard convolutional layers. Specifically, to replace a $d \times d$ layer, we construct an ACB consisting of three parallel layers with $d \times d$, $1 \times d$, and $d \times 1$ kernels, whose outputs are summed to enrich the feature space. As the introduced $1 \times d$ and $d \times 1$ layers have non-square kernels, we refer to them as asymmetric convolutional layers. The ACB introduces no hyperparameters that need tuning and requires no extra inference time.

The importance of the individual kernel positions can be examined by randomly setting the weights at different locations of the $d \times d$ convolution kernel to zero. It is found that zeroing the weights at the kernel corners has little effect on the performance of the network, whereas zeroing the weights at the skeleton locations of the kernel (its central row and column) has a much greater effect. This suggests that the skeleton weights are more important to the model’s representational capacity. Therefore, in each convolution kernel of this paper, the horizontal and vertical kernels are added onto the skeleton locations, giving the skeleton locations more weight. The ACB thus uses 1D asymmetric convolutions to enrich the feature space and then fuses the learned knowledge into the square kernel layers, enhancing the expressiveness of the standard square convolution kernel. Each convolution in the ACB is computed as follows: $\begin{matrix} (1) & y = \sum_{c = 1}^{C} \sum_{h = 1}^{H} \sum_{w = 1}^{W} F_{h, w, c}^{(j)} X_{h, w, c} \end{matrix}$ where $F^{(j)}$ is the j-th filter and X is the corresponding sliding window of the input.

In the network model of this paper, the ACB consists of three parallel layers with kernel sizes of $3 \times 3$, $1 \times 3$, and $3 \times 1$, respectively. Each of the three layers is followed by batch normalization, and their outputs are summed to form the output of the ACB, as indicated in Fig. 3.

Fig. 3.

ACB structure diagram.
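As a minimal sketch of the ACB computation described above (not the authors' implementation), the following NumPy example sums the outputs of a $3 \times 3$, a $1 \times 3$ and a $3 \times 1$ convolution on a single-channel input and checks that the result equals one convolution with a fused kernel in which the 1D kernels have been added onto the skeleton of the square kernel. Batch normalization is omitted for brevity, and the helper `conv2d_same` and all tensor sizes are illustrative assumptions.

```python
import numpy as np

def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero padding ('same' output size)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))  # zero padding on both axes
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))       # hypothetical time-frequency patch
k_sq = rng.standard_normal((3, 3))    # square kernel
k_hor = rng.standard_normal((1, 3))   # horizontal asymmetric kernel
k_ver = rng.standard_normal((3, 1))   # vertical asymmetric kernel

# ACB output: sum of the three parallel branches (batch norm omitted).
acb_out = conv2d_same(x, k_sq) + conv2d_same(x, k_hor) + conv2d_same(x, k_ver)

# Equivalent fused kernel: add the 1D kernels onto the skeleton positions.
fused = k_sq.copy()
fused[1, :] += k_hor[0, :]   # central row
fused[:, 1] += k_ver[:, 0]   # central column
fused_out = conv2d_same(x, fused)

max_diff = float(np.max(np.abs(acb_out - fused_out)))
print("max difference between ACB and fused kernel:", max_diff)
```

Because convolution is linear, the two outputs agree up to floating-point error, which illustrates why the skeleton positions receive extra weight and why the block needs no extra inference time once the branches are fused.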

2.2.2. Context gating (CG)

To make the model pay more attention to the important parts of the audio features, we use the learnable CG as an activation function to replace the conventional ReLU activation function in the CRNN model [17]. CG can capture the dependencies between features to recalibrate the activation intensities of the input features. A CG is connected to each CNN layer, and the output of each CNN layer is weighted according to the importance of the input. The CG is computed as follows: $\begin{matrix} (2) & Y = σ (W * X + b) ⊙ X \end{matrix}$ where X and Y represent the input and output of the CG, respectively, W is the convolutional filter, b is the bias, ⊙ is element-wise multiplication, ∗ is the convolution operator, and σ is the element-wise sigmoid activation. The vector of weights $σ (W * X + b) \in [0, 1]$ represents a set of learned gates applied to each dimension of the input feature X.

Compared to the gated linear unit, the CG weights the input vector X directly, rather than the linear transformation of X, so the CG only learns a set of weights and reduces the training parameters.
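Equation (2) can be illustrated with a short NumPy sketch. It assumes, for simplicity, that W is a $1 \times 1$ convolution (a per-position linear map across channels); the kernel size used in the paper is not stated, and the function name `context_gating` and the tensor sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_gating(x, w, b):
    """Context gating Y = sigmoid(W * X + b) ⊙ X.

    x: (H, W, C) feature map
    w: (C, C) weights of an assumed 1x1 convolution
    b: (C,) bias
    """
    gates = sigmoid(np.einsum("hwc,cd->hwd", x, w) + b)  # one gate per element of x
    return gates * x, gates

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 3))
w = rng.standard_normal((3, 3)) * 0.1
b = np.zeros(3)
y, gates = context_gating(x, w, b)
print(y.shape, float(gates.min()), float(gates.max()))
```

Since every gate lies strictly between 0 and 1, the output keeps the sign of the input and never exceeds it in magnitude, so the gate can only attenuate features, never amplify them.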

2.2.3. Squeeze excitation (SE)

SE improves the quality of the network representation by modelling the interdependencies between the channels of its convolutional features [11]. This mechanism allows the network to perform feature recalibration, in which it learns to use global information to selectively emphasize informative features and suppress less useful ones. It is shown in Fig. 4.

Fig. 4.

SE structure diagram.

For any given transformation $F_{tr}$, the input feature X is mapped to U, where $U \in R^{H \times W \times T}$. The feature U is first passed through a squeeze operation $F_{sq} (.)$, which compresses the features along the spatial dimensions: each two-dimensional feature map becomes a single real number, which is equivalent to pooling with a global receptive field, and the number of feature channels remains unchanged. The squeeze is followed by an excitation operation $F_{ex} (.)$, which generates a weight for each feature channel through the parameter W, where W models the correlation between feature channels. Once the weight of each feature channel has been obtained, it is applied to the corresponding original feature channel, so that the network learns the importance of the different channels.
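The squeeze and excitation steps can be sketched in NumPy as follows. This is an illustration under assumptions rather than the authors' code: global average pooling is used for the squeeze, and the excitation is a two-layer bottleneck with ReLU and sigmoid, as in the original SE design [11]; the reduction ratio `rho = 4` and all tensor sizes are chosen only for the example.

```python
import numpy as np

def squeeze_excitation(u, w1, b1, w2, b2):
    """u: (H, W, C) feature map. Returns recalibrated features and channel weights."""
    z = u.mean(axis=(0, 1))                        # squeeze: one number per channel, (C,)
    s = np.maximum(0.0, z @ w1 + b1)               # bottleneck with ReLU, (C // rho,)
    s = 1.0 / (1.0 + np.exp(-(s @ w2 + b2)))       # sigmoid channel weights, (C,)
    return u * s, s                                # excitation: rescale each channel

rng = np.random.default_rng(0)
C, rho = 64, 4                                     # 64 filters; rho is the reduction ratio
u = rng.standard_normal((10, 12, C))
w1 = rng.standard_normal((C, C // rho)) * 0.1
b1 = np.zeros(C // rho)
w2 = rng.standard_normal((C // rho, C)) * 0.1
b2 = np.zeros(C)

out, weights = squeeze_excitation(u, w1, b1, w2, b2)
print(out.shape, weights.shape)
```

A larger `rho` shrinks the bottleneck and the parameter count; with `rho = 1` the bottleneck keeps all C channels.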

2.3. Bidirectional gated recurrent unit with element-wise attention gate (EleAttG-BGRU)

Recurrent neural networks (RNNs) are capable of modelling temporal dependencies in complex sequential data. Existing RNN structures generally control the contributions of current and previous information, but they do not explore the differing importance of the elements within an input vector. Therefore, we add EleAttG to the RNN block to give the RNN neurons an attentiveness capability [22]. For all neurons of an RNN block, the EleAttG produces an attention vector of the same dimension as the input. The original input is then modulated by this attention vector, strengthening the impact of important elements while suppressing the impact of unimportant ones.

For an RNN block, the EleAttG is a vector $a_{t}$ with the same dimension as the input $x_{t}$ of the RNN. The $a_{t}$ is computed as follows: $\begin{matrix} (3) & a_{t} = ϕ (W_{xa} x_{t} + W_{ha} h_{t - 1} + b_{a}) \end{matrix}$ where ϕ denotes the sigmoid activation function, $ϕ (s) = 1 / (1 + e^{- s})$. The current input $x_{t}$ and the previous hidden state $h_{t - 1}$ are used to determine the importance of each element of the input.

The element-wise attention gate $a_{t}$ acts on each element of the current input to generate a modulated input ${\bar{x}}_{t}$, which replaces the original input in the subsequent recurrent computations. The ${\bar{x}}_{t}$ is computed as follows: $\begin{matrix} (4) & {\bar{x}}_{t} = a_{t} ⊙ x_{t} \end{matrix}$

With EleAttG added, the recurrence of each GRU direction in the BGRU is computed as follows, where $r_{t}$ and $z_{t}$ are the reset and update gates: $\begin{matrix} (5) & \begin{matrix} r_{t} = σ (W_{xr} {\bar{x}}_{t} + W_{hr} h_{t - 1} + b_{r}) \\ z_{t} = σ (W_{xz} {\bar{x}}_{t} + W_{hz} h_{t - 1} + b_{z}) \\ h_{t}^{'} = tanh (W_{xh} {\bar{x}}_{t} + W_{hh} (r_{t} ⊙ h_{t - 1}) + b_{h}) \\ h_{t} = z_{t} ⊙ h_{t - 1} + (1 - z_{t}) ⊙ h_{t}^{'} \end{matrix} \end{matrix}$

The structure of EleAttG-BGRU is shown in Fig. 5.

Fig. 5.

EleAttG-BGRU structure diagram.
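Equations (3) to (5) translate almost line by line into NumPy. The sketch below is a minimal illustration with randomly initialized weights, not the trained model: the helper names (`init_params`, `eleattg_gru_step`, `run`), the layer sizes, and the choice to concatenate the forward and backward hidden states are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hid, rng):
    """Random weights for the attention gate (a) and the GRU gates (r, z, h)."""
    p = {}
    for g, n_out in (("a", n_in), ("r", n_hid), ("z", n_hid), ("h", n_hid)):
        p["Wx" + g] = rng.standard_normal((n_out, n_in)) * 0.1
        p["Wh" + g] = rng.standard_normal((n_out, n_hid)) * 0.1
        p["b" + g] = np.zeros(n_out)
    return p

def eleattg_gru_step(x_t, h_prev, p):
    a_t = sigmoid(p["Wxa"] @ x_t + p["Wha"] @ h_prev + p["ba"])        # Eq. (3)
    x_bar = a_t * x_t                                                   # Eq. (4)
    r_t = sigmoid(p["Wxr"] @ x_bar + p["Whr"] @ h_prev + p["br"])       # Eq. (5), reset gate
    z_t = sigmoid(p["Wxz"] @ x_bar + p["Whz"] @ h_prev + p["bz"])       # update gate
    h_cand = np.tanh(p["Wxh"] @ x_bar + p["Whh"] @ (r_t * h_prev) + p["bh"])
    h_t = z_t * h_prev + (1.0 - z_t) * h_cand
    return h_t, a_t

def run(xs, p, n_hid, reverse=False):
    """Run one direction over a (T, n_in) sequence; outputs are stored in time order."""
    h = np.zeros(n_hid)
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    out = [None] * len(xs)
    for t in order:
        h, _ = eleattg_gru_step(xs[t], h, p)
        out[t] = h
    return np.stack(out)

rng = np.random.default_rng(0)
n_in, n_hid, T = 8, 5, 12
xs = rng.standard_normal((T, n_in))
fwd = run(xs, init_params(n_in, n_hid, rng), n_hid)
bwd = run(xs, init_params(n_in, n_hid, rng), n_hid, reverse=True)
bgru_out = np.concatenate([fwd, bwd], axis=1)   # assumed merge: concatenation
print(bgru_out.shape)
```

The attention gate has the dimension of the input (`n_in`), while the reset and update gates have the dimension of the hidden state (`n_hid`), which is why the gate parameters are sized separately.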

3. Experiments

3.1. Dataset

The experiments use the TAU Spatial Sound Events 2019-Ambisonic dataset, which consists of a development set and an evaluation set [4]. The development set consists of 400 one-minute recordings sampled at 48000 Hz, divided into four cross-validation splits of 100 recordings each. The evaluation set consists of 100 one-minute recordings. Both sets are synthesized using spatial room impulse responses (IRs) collected from five indoor locations, which cover 504 unique combinations of azimuth, elevation and distance. To synthesize the recordings, the collected IRs are convolved with the isolated sound events dataset from DCASE 2016 task 2. Finally, to create realistic sound scene recordings, natural ambient noise collected at the IR recording locations is added to the synthesized recordings such that the average SNR of the sound events is 30 dB. The sound source direction takes 36 azimuth angles at 10° intervals from −180° to 180° and 9 elevation angles at 10° intervals from −40° to 40°. The dataset contains 11 sound event classes, such as throat clearing, coughing, doorbell pressing, door pushing, door knocking, speaking, laughter, and flipping books. In addition to the Ambisonic format, the dataset also provides Microphone Array format data.
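The angle counts quoted above can be checked with a few lines of Python. The sketch assumes that the azimuth grid excludes the +180° endpoint, since it coincides with −180°; this is what yields 36 rather than 37 values.

```python
# Azimuth: 10-degree steps from -180 up to (but excluding) 180 -> 36 values.
azimuths = list(range(-180, 180, 10))
# Elevation: 10-degree steps from -40 to 40 inclusive -> 9 values.
elevations = list(range(-40, 41, 10))

# One-minute recording at 48000 Hz.
samples_per_recording = 48000 * 60

print(len(azimuths), len(elevations), samples_per_recording)
```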

3.2. Measurement metrics

The SELD task is evaluated using individual metrics for SED and DOA estimation. For SED, we use the standard SED metrics, error rate (ER) and F-score ( $F 1$ ) calculated in segments of one second.

$F 1$ is an evaluation metric that combines precision (P) and recall (R). The $F 1$ is computed as follows: $\begin{matrix} (6) & F 1 = \frac{2 PR}{P + R} \end{matrix}$

ER measures the number of errors in terms of insertion errors $I (k)$, deletion errors $D (k)$, and substitution errors $S (k)$ in each segment k, relative to the number of reference events $N (k)$ active in that segment. The ER is computed as follows: $\begin{matrix} (7) & ER = \frac{\sum_{k = 1}^{K} S (k) + \sum_{k = 1}^{K} D (k) + \sum_{k = 1}^{K} I (k)}{\sum_{k = 1}^{K} N (k)} \end{matrix}$ where K is the total number of segments.
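A small worked example may make Eqs. (6) and (7) concrete. The sketch below assumes the usual segment-based convention in which, per segment, substitutions are paired false negatives and false positives, S(k) = min(FN, FP), and the remainder counts as deletions or insertions; the class labels and segments are invented for illustration.

```python
def segment_counts(ref, est):
    """ref, est: sets of class labels active in one one-second segment."""
    tp = len(ref & est)
    fn = len(ref - est)
    fp = len(est - ref)
    s = min(fn, fp)          # substitutions
    d = max(0, fn - fp)      # deletions
    i = max(0, fp - fn)      # insertions
    return tp, fp, fn, s, d, i, len(ref)

def sed_scores(ref_segments, est_segments):
    totals = [0] * 7
    for ref, est in zip(ref_segments, est_segments):
        for idx, value in enumerate(segment_counts(ref, est)):
            totals[idx] += value
    tp, fp, fn, s, d, i, n = totals
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    er = (s + d + i) / n if n else 0.0
    return f1, er

ref_segments = [{"speech"}, {"speech", "cough"}, {"doorbell"}, set()]
est_segments = [{"speech"}, {"speech"}, {"laughter"}, {"cough"}]
f1, er = sed_scores(ref_segments, est_segments)
print(f1, er)
```

In this toy example the second segment contributes one deletion, the third one substitution and the fourth one insertion, giving three errors over four reference events.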

For DOA estimation, we use two frame-wise metrics: DOA error and Frame recall.

For a recording of length T time-frames, ${DOA}_{R}^{t}$ denotes the reference DOAs at time-frame t and ${DOA}_{E}^{t}$ denotes the estimated DOAs at time-frame t [3]. The DOA error is the error between the reference angles and the angles estimated by the system, and is computed as follows: $\begin{matrix} (8) & DOA error = \frac{1}{\sum_{t = 1}^{T} D_{E}^{t}} \sum_{t = 1}^{T} H ({DOA}_{R}^{t}, {DOA}_{E}^{t}) \end{matrix}$ where $D_{E}^{t}$ is the number of DOAs in ${DOA}_{E}^{t}$ at time-frame t, and H is the Hungarian algorithm, which solves the assignment problem of pairing estimated DOAs with reference DOAs.

To account for time frames in which the numbers of estimated and reference DOAs are unequal, we report a second metric, Frame recall. The Frame recall is computed as follows: $\begin{matrix} (9) & Frame recall = \frac{\sum_{t = 1}^{T} 1 (D_{R}^{t} = D_{E}^{t})}{T} \end{matrix}$ where $D_{R}^{t}$ is the number of DOAs in ${DOA}_{R}^{t}$ at time-frame t, and $1 (\cdot)$ is the indicator function, which returns one if the condition $(D_{R}^{t} = D_{E}^{t})$ is met and zero otherwise.
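The two frame-wise metrics can be sketched in plain Python as follows. Two simplifications are assumed: the pairwise cost is the central angle between two (azimuth, elevation) directions, and the Hungarian algorithm is replaced by a brute-force search over permutations, which returns the same minimum-cost assignment and is practical only for the small number of overlapping sources considered here. The example frames are invented.

```python
import math
from itertools import permutations

def angular_distance(a, b):
    """Central angle in degrees between two (azimuth, elevation) pairs in degrees."""
    az1, el1 = map(math.radians, a)
    az2, el2 = map(math.radians, b)
    c = (math.sin(el1) * math.sin(el2)
         + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def min_cost_assignment(ref, est):
    """Brute-force stand-in for the Hungarian algorithm H in Eq. (8)."""
    if not ref or not est:
        return 0.0
    small, large = (ref, est) if len(ref) <= len(est) else (est, ref)
    return min(
        sum(angular_distance(s, l) for s, l in zip(small, perm))
        for perm in permutations(large, len(small))
    )

def doa_metrics(ref_frames, est_frames):
    total = sum(min_cost_assignment(r, e) for r, e in zip(ref_frames, est_frames))
    n_est = sum(len(e) for e in est_frames)
    doa_error = total / n_est if n_est else 0.0
    matches = sum(len(r) == len(e) for r, e in zip(ref_frames, est_frames))
    frame_recall = matches / len(ref_frames)
    return doa_error, frame_recall

ref_frames = [[(0, 0)], [(0, 0), (90, 0)], [(30, 10)], []]
est_frames = [[(10, 0)], [(90, 0), (0, 0)], [], []]
doa_error, frame_recall = doa_metrics(ref_frames, est_frames)
print(round(doa_error, 3), frame_recall)
```

Here the only angular error is the 10° offset in the first frame, spread over three estimated DOAs, and three of the four frames have matching DOA counts.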

3.3. Experimental setup

The experiments run on an Intel(R) Xeon(R) Gold 5122 CPU @ 3.60 GHz with 64 GB of memory and an NVIDIA Corporation GK210GL8 GPU, in GPU mode. The operating system is Ubuntu 16.04, the development environment is Anaconda, the development frameworks are Keras and TensorFlow, and the development language is Python.

All models are trained for up to 100 epochs with the Adam optimizer, a batch size of 16 and a learning rate of 0.0001. Early stopping is employed: training stops if no improvement on the validation split is observed for 20 epochs.
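The stopping rule can be sketched as a small Python function. The monitored quantity is assumed here to be a validation loss where lower is better; the paper does not specify which validation score is monitored, and the loss curve below is synthetic.

```python
def train_with_early_stopping(val_losses, max_epochs=100, patience=20):
    """Return (epochs actually run, index of the best epoch)."""
    best, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0   # improvement: reset patience
        else:
            wait += 1
            if wait >= patience:                      # 20 epochs without improvement
                return epoch + 1, best_epoch
    return min(len(val_losses), max_epochs), best_epoch

# Synthetic curve: improves for 10 epochs, then plateaus at a worse value.
val_losses = [1.0 - 0.05 * e for e in range(10)] + [0.6] * 90
epochs_run, best_epoch = train_with_early_stopping(val_losses)
print(epochs_run, best_epoch)
```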

4. Experimental results and analysis

To evaluate the performance of the proposed EleAttG-ACRNN network, we conducted six sets of experiments. The first two sets of experiments are in search of the best hyperparameter ρ of SE on the development and evaluation set. The third and fourth sets of experiments are to verify the effectiveness of the proposed network. The last two sets of experiments are the comparison of the proposed EleAttG-ACRNN model with other network models.

4.1. Parameter selection experiment

To study the contribution of the squeeze-excitation residual blocks, we carry out a grid search over different possible ratios. The network is made up of 3 blocks of 64 filters, and the ratio (ρ) is the same for all blocks, as can be seen in Fig. 1. The experimental results are given in Table 1 and Fig. 6.

Table 1

Experimental results of different parameters in the development set

Parameter	SED		DOA
	ER	F1(%)	DOA error (°)	Frame recall(%)
$ρ = 1$	0.33	80.6	26.4	85.6
$ρ = 2$	0.34	80.2	28.0	84.5
$ρ = 4$	0.33	81.0	27.8	85.7
$ρ = 8$	0.35	80.9	28.1	85.6
$ρ = 16$	0.33	80.3	28.0	85.8

Fig. 6.

Experimental results of different parameters in the evaluation set.

Table 2

Experimental results of ablation under the development set

Model	SED		DOA

	ER	F1(%)	DOA error (°)	Frame recall(%)
CRNN	0.34	79.9	28.5	85.4
ResSE-CRNN	0.33	80.6	26.4	85.6
ACB-CRNN	0.35	80.1	27.7	84.9
CG-CRNN	0.27	84.1	24.8	87.7
EleAttG-CRNN	0.31	81.5	27.8	86.3
EleAttG-ACRNN	0.32	81.3	27.3	86.4

For sound event localization and detection, lower ER and DOA error values and higher F1 and Frame recall values indicate a better-performing network model. The evaluation-set results of the parameter selection experiment are shown in Fig. 6, using both the SED evaluation metrics (F1 and ER) and the DOA estimation metrics (DOA error and Frame recall). When $ρ = 1$, ER reaches its minimum value of 0.25 and F1 is also high, so SED performs well. At the same time, DOA error reaches its minimum value of 22.4° and Frame recall is also high. Since both SED and DOA estimation perform well when $ρ = 1$, we set the parameter ρ to 1.

4.2. Ablation experiment

To verify the effectiveness of the ResSE, ACB, CG and EleAttG components used in this paper, six models are designed for ablation experiments. Starting from the CRNN model, we add ResSE, ACB, CG and EleAttG. The results are shown in Table 2 and Fig. 7. The specific models are as follows:

CRNN network, which contains only a simple three-layer CNN and a two-layer RNN.

ResSE-CRNN network, which adds ResSE to the CRNN network.

ACB-CRNN network, which adds ACB to the CRNN network.

CG-CRNN network, which adds CG to the CRNN network.

EleAttG-CRNN network, which adds EleAttG to the CRNN network.

EleAttG-ACRNN network, which adds all the above modules to the CRNN network.

Fig. 7.

Experimental results of ablation under the evaluation set.

The results of the evaluation set experiments are shown in Fig. 7. We compare each variant with the CRNN network; the values below are listed for ResSE-CRNN, ACB-CRNN, CG-CRNN, EleAttG-CRNN and EleAttG-ACRNN, in the order given above. For SED, the ER values are reduced by 0.03, 0.08, 0, 0.05 and 0.05 respectively, and the F1 values change by 0.4, −2.0, 3.3, 1.2 and 1.7 respectively. For DOA estimation, the DOA error values are reduced by 2.2, 0.3, 0.2, 0.1 and 0.7 respectively, and the Frame recall values are increased by 1.0, 2.4, 4.6, 4.4 and 4.5 respectively. This indicates that the proposed EleAttG-ACRNN network is superior to CRNN in both SED and DOA estimation performance.
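For comparison with the evaluation-set differences quoted above, the same kind of difference can be computed on the development set directly from the values in Table 2. The short Python sketch below only performs this arithmetic; the helper name `gains_over_baseline` is hypothetical.

```python
# Development-set values from Table 2: (ER, F1 %, DOA error deg, Frame recall %).
table2 = {
    "CRNN": (0.34, 79.9, 28.5, 85.4),
    "ResSE-CRNN": (0.33, 80.6, 26.4, 85.6),
    "ACB-CRNN": (0.35, 80.1, 27.7, 84.9),
    "CG-CRNN": (0.27, 84.1, 24.8, 87.7),
    "EleAttG-CRNN": (0.31, 81.5, 27.8, 86.3),
    "EleAttG-ACRNN": (0.32, 81.3, 27.3, 86.4),
}

def gains_over_baseline(row, base):
    """Positive numbers mean an improvement over the baseline on every metric."""
    er, f1, doa, fr = row
    b_er, b_f1, b_doa, b_fr = base
    return (round(b_er - er, 2), round(f1 - b_f1, 1),
            round(b_doa - doa, 1), round(fr - b_fr, 1))

gains = {name: gains_over_baseline(row, table2["CRNN"])
         for name, row in table2.items() if name != "CRNN"}
for name, g in gains.items():
    print(name, g)
```

On the development set, EleAttG-ACRNN lowers ER by 0.02 and DOA error by 1.2° and raises F1 by 1.4 and Frame recall by 1.0 relative to CRNN.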

4.3. Comparative experiment with other network models

We compare the results of the EleAttG-ACRNN model with those of other network models on the TAU Spatial Sound Events 2019-Ambisonic dataset. The results are shown in Table 3 and Fig. 8.

Table 3

Experimental results of different models under the development set

Model	SED		DOA
	ER	F1(%)	DOA error (°)	Frame recall(%)
CNN[6]	0.31	81.2	43.9	78.4
CRNN[5]	0.34	79.9	28.5	85.4
A-CRNN[13]	0.19	88.5	46.9	88.6
GDE-CRNN[14]	0.30	82.1	28.9	85.8
EleAttG-ACRNN	0.32	81.3	27.3	86.4

Fig. 8.

Experimental results of different models under the evaluation set.

The results of the evaluation set experiments are shown in Fig. 8. Compared to CNN, A-CRNN and GDE-CRNN, the ER values of the EleAttG-ACRNN network are reduced by 0.06, 0.01 and 0.05 respectively, and the F1 values change by 3.7, −0.3 and 3.3 respectively, which indicates that the proposed EleAttG-ACRNN network compares favourably in SED performance, apart from a slightly lower F1 than A-CRNN. For DOA estimation, the DOA error values of the EleAttG-ACRNN network are reduced by 13.7, 7.1 and 5.3 respectively, and the Frame recall values are increased by 8.9, 3.2 and 6.1 respectively, which indicates that the proposed EleAttG-ACRNN network is superior in DOA estimation performance compared to the other networks. The proposed EleAttG-ACRNN network therefore has some advantages over the other networks on the same dataset.

5. Conclusion

This paper has proposed an EleAttG-ACRNN model for sound event localization and detection. Compared with CRNN, the EleAttG-ACRNN model improves the representational capability of the convolution and enriches the feature space. It weights the important features and suppresses the irrelevant ones, and it models the interdependence between feature channels. It also highlights the importance of different elements within an input vector and further learns the temporal context information. Evaluation results on the TAU Spatial Sound Events 2019-Ambisonic dataset show the effectiveness of the proposed network: compared to a CRNN method, it improves SELD performance by up to 0.05 in ER, 1.7% in F-score, 0.7° in DOA error, and 4.5% in Frame recall.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61902228) and the Fundamental Research Funds for the Central Universities (Grant Nos. GK202105006 and GK202103083).

References

1. Adavanne, Politis, Nikunen et al., Sound event localization and detection of overlapping sources using convolutional recurrent neural networks, IEEE Journal of Selected Topics in Signal Processing 13(1) (2018), 34–48. doi:10.1109/JSTSP.2018.2885636.

2. Adavanne, Politis and Virtanen, Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network, in: 2018 26th European Signal Processing Conference (EUSIPCO), IEEE, 2018, pp. 1462–1466. doi:10.23919/EUSIPCO.2018.8553182.

3. Adavanne, Politis and Virtanen, Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network, in: 2018 26th European Signal Processing Conference (EUSIPCO), 2018, pp. 1462–1466. doi:10.23919/EUSIPCO.2018.8553182.

4. Adavanne, Politis and Virtanen, A multi-room reverberant dataset for sound event localization and detection, 2019, https://arxiv.org/abs/1905.08546.

5. Cao, Kong, Iqbal et al., Polyphonic sound event detection and localization using a two-stage strategy, 2019, https://arxiv.org/abs/1905.00268.

6. M.R. Celsi, Scardapane and Comminiello, Quaternion neural networks for 3D sound source localization in reverberant environments, in: 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 2020, pp. 1–6.

7. Chakrabarty and E.A.P. Habets, Multi-speaker DOA estimation using deep convolutional networks trained with noise signals, IEEE Journal of Selected Topics in Signal Processing 13(1) (2019), 8–21. doi:10.1109/JSTSP.2019.2901664.

8. Cordourier, Lopez Meyer, Huang et al., GCC-PHAT cross-correlation audio features for simultaneous sound event localization and detection (SELD) on multiple rooms, in: Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York, NY, USA, 2019.

9. Ding, Guo, Ding et al., Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks, in: IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1911–1920.

10. Guirguis, Schorn, Guntoro et al., SELD-TCN: Sound event localization & detection via temporal convolutional networks, in: 2020 28th European Signal Processing Conference (EUSIPCO), IEEE, 2021, pp. 16–20. doi:10.23919/Eusipco47968.2020.9287716.

11. Hu, Shen and Sun, Squeeze-and-excitation networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.

12. Kapka and Lewandowski, Sound source detection, localization and classification using consecutive ensemble of CRNN models, 2019, https://arxiv.org/abs/1908.00766.

13. Komatsu, Togami and Takahashi, Sound event localization and detection using convolutional recurrent neural networks and gated linear units, in: 2020 28th European Signal Processing Conference (EUSIPCO), IEEE, 2021, pp. 41–45. doi:10.23919/Eusipco47968.2020.9287372.

14. Kong, Cao, Iqbal et al., Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems, 2019, arXiv preprint arXiv:1904.03476.

15. Krause and Kowalczyk, Arborescent neural network architectures for sound event detection and localization, Detection Classification Acoust. Scenes Events Challenge, 2019, Tech. Rep.

16. Mesaros, Diment, Elizalde et al., Sound event detection in the DCASE 2017 challenge, IEEE/ACM Transactions on Audio, Speech, and Language Processing 27(6) (2019), 992–1006.

17. Miech, Laptev and Sivic, Learnable pooling with context gating for video classification, 2017, https://arxiv.org/abs/1706.06905.

18. Nustede and Anemuller, Group delay features for sound event detection and localization (Task 3) of the DCASE 2019 challenge, Detection Classification Acoust. Scenes Events Challenge, 2019, Tech. Rep.

19. Perotin, Serizel, Vincent et al., CRNN-based multiple DoA estimation using acoustic intensity features for ambisonics recordings, IEEE Journal of Selected Topics in Signal Processing 13(1) (2019), 22–33. doi:10.1109/JSTSP.2019.2900164.

20. Ranjan, Jayabalan, T.N.T. Nguyen et al., Sound event detection and direction of arrival estimation using residual net and recurrent neural networks, in: Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), New York, NY, USA, 2019.

21. Ronchini, Arteaga and Pérez-López, Sound event localization and detection based on crnn using rectangular filters and channel rotation data augmentation, 2020, https://arxiv.org/abs/2010.06422.

22. Zhang, Xue, Lan et al., EleAtt-RNN: Adding attentiveness to neurons in recurrent neural networks, IEEE Transactions on Image Processing 29 (2019), 1061–1073. doi:10.1109/TIP.2019.2937724.
