Abstract
In autonomous systems and robotics, acoustic signals provide valuable information for tasks such as acoustic source localization and recognition (LR), particularly in environments where visual sensing is limited. This paper investigates two real-world scenarios based on unmanned aerial vehicles (UAVs) that leverage acoustic scene awareness: (1) localization and recognition of human speech for search-and-rescue missions, and (2) detection and classification of other UAVs for counter-drone applications. To address these tasks, we design two deep learning models based on convolutional neural networks (CNNs) and a feature-based approach. These models process acoustic signals captured by two types of microphone arrays mounted on UAVs: a 4-microphone linear array and a 19-microphone spherical array. Each model performs direction of arrival (DOA) estimation and source classification under challenging ego-noise conditions using real-world datasets recorded in controlled experimental setups. We evaluate the models across different signal-to-ego-noise ratios and training configurations. Results show robust performance in both localization and recognition tasks: in the human speech scenario, DOA estimation reaches approximately 6 degrees mean error and 7 degrees root mean square error (RMSE), with multi-speaker classification accuracy up to 0.95; in the UAV sound scenario, DOA estimation reaches 3-5 degrees mean error and 7-11 degrees RMSE, with multi-UAV classification accuracy up to 0.98. This demonstrates the potential of deep acoustic learning for UAV-based scene understanding in complex operational environments.
Introduction
Sensing technologies are central to many civil applications, including event monitoring, object classification, search and rescue, and surveillance of both urban environments and disaster zones. The choice of sensors varies based on application needs and environmental constraints. In hostile environments, acoustic sensing and processing prove suitable where other sensor data are inadequate, inapplicable, or unreliable. Examples include reconnaissance and surveillance with unmanned aerial vehicles (UAVs) for countering intrusions,1–3 and search and rescue in adverse weather or even in hostile environments. 4 In such conditions, speech and audio data fusion with video, as described in Cecchinato et al., 5 Toma et al.,6,7 may be infeasible, even if multiple cameras are used as in Liu et al., 8 particularly in environments with occlusions, in disaster areas9,10 or in cases where the target is outside the field of view (FoV). Indeed, in scenarios with low visibility, camera-based sensors cannot provide data of sufficient quality for processing algorithms to perform well, so applications based solely on visual data cannot guarantee the desired performance. Similarly, integrating visual and acoustic data may not be appropriate, as it increases algorithmic complexity without benefiting from camera capabilities. In these conditions, relying on acoustic data alone overcomes these limitations.
Recently, within the fields of deep learning (DL) and deep neural networks (DNNs), as demonstrated in studies,11–15 acoustic source localization and recognition (LR) algorithms have received increasing attention in the literature, as highlighted in Kolamunna et al., 16 Al-Emadi et al., 17 Kolamunna et al. 18 The usage of DL and DNNs in acoustic data processing could support acoustic scene awareness. In particular, regarding the application of UAVs, recent work analyzed the combined sound signals emitted by both UAV components, like propellers and electrical engines, and mechanical vibrations. 18 That study demonstrated that the combined noise features an acoustic signature sufficiently unique to identify specific UAVs among different UAV categories. DL and DNNs are also being investigated in various multi-channel acoustic processing applications, such as those in Choi and Chang, 19 He et al., 20 Salvati et al., 21 where multichannel spectral phase information serves as input to a convolutional neural network (CNN) for the direction of arrival (DOA) estimation. The use of DL in audio processing applications for improving or designing new multi-channel LR algorithms has been investigated only recently,19,20,22 and its application is being explored in many data-based contexts, as shown in Salvati et al., 21 Farhadi et al. 23 Further investigations about localization of speakers based on DL networks can be found in Varanasi et al., 24 Chakrabarty and Habets, 25 Wang et al. 26
In this study, we introduce and evaluate CNN-based models for feature-based LR tasks involving acoustic signals. The input stages process the phase values computed from the covariance matrix and the magnitude of the frequency spectrum of the multichannel acoustic signals. The output stages perform regression where sound source localization (SSL), in terms of DOA estimation, is required, and classification where acoustic source recognition, in terms of class estimation, is required. This is investigated in two real-world scenarios addressing UAV-based sensing and processing of: (1) voice emitted by humans, and (2) propeller noise emitted by UAVs. Specifically, in the experimental framework, UAVs gather information from the environmental acoustic scene using their onboard acoustic sensors, which consist of microphone arrays with different configurations. In both Scenario 1 and Scenario 2, the task of the acoustic method is LR, namely localization and recognition, of an acoustic source. Localization provides DOA estimation of the human voice in Scenario 1 and of the propeller noise in Scenario 2. Recognition differentiates voice types in Scenario 1 (normal or loud) and distinguishes sounds from different UAVs in Scenario 2. This research is conducted within the context of autonomous systems where sensory data, here represented by acoustic signals, can form the basis for acoustic scene awareness and decision-making. Two different real datasets of acoustic signals are collected for the two specific scenarios, where different acoustic sources are considered as targets. The first dataset consists of voice signals recorded using a small 4-microphone linear array mounted on a UAV (Parrot Bebop quadcopter). The human target is positioned in front of the UAV at various distances and angles and emits vocal signals in both normal and loud voice modes.
To generate the second dataset, propeller noise recordings were produced using two different UAVs (a DJI Matrice 200 and an F450 custom-built quadcopter) as targets. The data is collected by a medium-sized 19-microphone spherical array and includes two sound categories according to the target UAVs. In both cases, the DOA and category of the acoustic sources are unknown parameters to be estimated. To benchmark the performance of the LR method, DOA estimation errors, sound recognition accuracy, and computing time are measured under different parameter settings.
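As an illustration of the evaluation quantities named above, the following minimal sketch computes the mean error and RMSE of DOA estimates together with the recognition accuracy. It assumes that the mean error is a mean absolute angular error in degrees and that angular differences are wrapped to [-180, 180); the function names and the numerical values are hypothetical and are not taken from the experiments.

```python
import numpy as np


def wrap_angle_deg(angle):
    """Wrap angles in degrees to the interval [-180, 180)."""
    return (np.asarray(angle, dtype=float) + 180.0) % 360.0 - 180.0


def doa_mean_error_and_rmse(true_deg, est_deg):
    """Return (mean absolute error, RMSE) of DOA estimates, in degrees."""
    diff = np.asarray(est_deg, dtype=float) - np.asarray(true_deg, dtype=float)
    err = wrap_angle_deg(diff)
    return float(np.mean(np.abs(err))), float(np.sqrt(np.mean(err ** 2)))


def classification_accuracy(true_labels, predicted_labels):
    """Fraction of sources whose class was recognized correctly."""
    return float(np.mean(np.asarray(true_labels) == np.asarray(predicted_labels)))


# Hypothetical values, for illustration only.
mean_err, rmse = doa_mean_error_and_rmse([0, 30, -30], [3, 26, -35])
acc = classification_accuracy(
    ["normal", "loud", "loud", "normal"],
    ["normal", "loud", "normal", "normal"],
)
```

Because the RMSE weights large deviations more heavily than the mean error, reporting both values gives an indication of how often large localization outliers occur.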
In these scenarios, when acoustic data are collected by microphone arrays with small inter-microphone distances installed on multirotor UAVs as described in Cobos et al., 27 Oh et al., 28 Salvati et al., 29 processing becomes challenging due to inadequate spatial resolution, limited signal-to-noise enhancement, and insufficient spatial information. The ego-noise generated by the propellers, electrical engines, and mechanical vibrations of the UAV carrying the microphone array further challenges LR algorithm performance. Considering these limitations, we implemented an automatic acoustic source LR algorithm using a data-driven approach. For this purpose, two alternative DL schemes based on CNNs are exploited for estimation of both DOA and category of the acoustic source. A feature-based approach for learning the models is considered.
The primary motivation for this research is the analysis of the acoustic scene exclusively through audio processing in scenarios that still require adequate investigation in the literature, particularly within the DL framework. Additionally, in scenarios where video information is unreliable or unavailable, integrating audio processing with video analysis becomes infeasible, and algorithms based solely on audio processing must guarantee fully accurate results. In summary, this study introduces two real-world application scenarios based solely on acoustic data processing: (1) speaking subject LR from UAVs, and (2) UAV LR from UAVs, both reproduced entirely at our university. Real acoustic sensors, UAVs, and acoustic sources are deployed for generating two real datasets representing both the acoustic scenes. Deep models using a data-driven approach are then investigated for acoustic source LR. These models are capable of analyzing the multi-channel acoustic signals even in the presence of ego-noise. The evaluation of the LR models is conducted on datasets from both Scenario 1 and Scenario 2, collected in real environments. LR estimation accuracy and computational complexity have been investigated and compared with both the conventional steered response power with the phase transform (SRP-PHAT) method and a state-of-the-art (SOTA) CNN model30,31 to evaluate the performance and the real-time capability of our approach. Increased dataset diversity is also investigated for Scenario 1 with 2 to 6 speakers, and for Scenario 2 with 6 different UAVs.
The rest of the paper is organized as follows: discussion on related work from the literature is provided in Sect. 2; two real-world scenarios based on LR from UAVs are introduced in Sect. 3; the general architecture of the proposed system with the logical structure of the algorithm is described in Sect. 4; acoustic sensors and experimental datasets based on acoustic data related to the two scenarios are presented in Sect. 5; experiments and results of the LR algorithm based on the two DNN models for a data-driven approach are shown in Sect. 6; a detailed analysis of the outcomes is then reported in Sect. 7; conclusion of this study and future work are the final parts of the manuscript in Sect. 8.
Related work
From the analysis of the state of the art, four primary categories, among others, have been identified based on the acoustic applications addressed: 1) human voice recognition, 2) human voice localization, 3) UAV recognition, and 4) UAV localization.
- The first category concerns voice recognition from UAVs. A voice recognition-based detection system for search and rescue operations in large-scale disasters, such as major earthquakes, is presented in Yamazaki et al. 32 A speaker installed on the UAV produces sounds to elicit responses from victims, who are then detected by a recognizer that captures their vocal reactions. The captured acoustic signal consisting of the human voice also contains ego-noise and other outdoor environmental sounds.
A speaker identity verification through voice recognition is presented in Abdulghani et al. 33 for security and privacy of voice-controlled UAVs. Features are first computed from the audio signals as mel-frequency cepstral coefficients (MFCC), and then a feature matching is applied. The authors employed DL as a soft computing tool capable of enabling intelligent systems that emulate human behavior.
A sound source separation and identification algorithm for processing noise-contaminated acoustic signals is presented in Morito et al. 34 Audio data are gathered using a microphone array embedded in a UAV to detect human voices quickly and over a wide area during disaster situations. The authors propose a partially-shared DNN (PS-DNN) that can learn from a limited amount of annotated data.
- The second category is voice localization from UAVs. An algorithm proposed for drone-based search and rescue operations during disaster management is investigated in Banerjee et al. 35 to estimate the coordinates of the speech source. Specifically, a person requiring assistance (acoustic source) screams, and the drone with an onboard microphone array captures the vocal signal emitted by the human. An approach is proposed to analyze the captured audio signals for acoustic source localization. The ego-noise (comprising noise related to UAV motion, propeller noise, electrical motor noise, and other stationary structural noise) challenges the performance of the SSL algorithm based on time difference of arrival.
The publicly-available DREGON dataset introduced in Strauss et al. 36 was developed for SSL research purposes. It was generated using a microphone array embedded in a UAV. The dataset contains both clean and noisy in-flight audio recordings annotated with the 3D position of the target sound source. It can be used for emerging tasks of UAV-embedded SSL. The study conducted on the dataset showed promising localization performance for broadband acoustic sources in the presence of high noise levels, while speech localization remained challenging under extreme noise conditions.
The authors of the work related to the IEEE Signal Processing Cup 2019 Student Competition 37 highlight that UAV-based acoustic localization has not yet been sufficiently investigated, although UAVs equipped with acoustic sensors such as microphone arrays could significantly aid in localizing people during emergencies where video acquisition is severely limited by poor lighting (e.g., nocturnal or foggy conditions) or by obstacles restricting the FoV.
- The third category relates to UAV recognition. The drone recognition system studied in Solis et al., 38 based on audio signals generated by UAVs, employs MFCCs as audio features and uses both a support vector machine (SVM) and a CNN for recognizing UAV-generated audio. A small UAV audio dataset was created.
UAV identification under extreme environmental conditions with large dataset requirements is addressed in McCoy et al., 39 where an ensemble DL framework is proposed to counter unauthorized or malicious UAVs. The UAV classification is based on hybrid synthetic and deep features computed from acoustic signals fused with data from other sensors.
UAV recognition in Lee et al. 40 is performed using CNN models trained on audio spectrograms along with other data, with the CNN output probabilities subsequently processed by multinomial logistic regression. For the experiments, datasets were both collected through field measurements of real UAVs using audio microphones for acoustic processing and obtained from open online repositories.
- The fourth category corresponds to UAV localization. The drone acoustic detection system (DADS) from Stevens Institute of Technology uses DOA estimation and localization to track UAVs based on their propeller noise. 41 The system is based on microphones arranged in a tetrahedron. A 16-channel two-tier cross array, the OptiNav 40-microphone phased array, and parabolic and shotgun microphones were also considered. The multirotor UAVs employed in testing were DJI models Phantom 4, M600, and S1000.
A UAV acoustic source localization algorithm based on ESPRIT (estimation of signal parameters via rotational invariance techniques) with Toeplitz matrix reconstruction is proposed in Hu et al. 42 Indoor and outdoor tests were conducted with a 12-channel spherical microphone array and a circular micro-electro-mechanical systems (MEMS) microphone array designed by the authors.
The objective in Chang et al. 43 is DOA estimation of an intruding UAV through its acoustic signature, using harmonics extracted from received sound signals. The proposed method first estimates the harmonic frequencies corresponding to the frequency-domain acoustic signal of the UAV. A classifier for multiple signals is then applied to compute DOA estimates for the set of harmonics. A weighted sum of the estimated DOAs is then computed as the drone’s DOA estimate, where the weights are proportional to the harmonic energies.
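The final combination step described for Chang et al. 43 can be illustrated with a short sketch. It only covers the energy-weighted sum of per-harmonic DOA estimates; the harmonic extraction and the per-harmonic DOA estimator are not reproduced, and the function name and input values are hypothetical.

```python
import numpy as np


def energy_weighted_doa(harmonic_doas_deg, harmonic_energies):
    """Combine per-harmonic DOA estimates using weights proportional to harmonic energy."""
    doas = np.asarray(harmonic_doas_deg, dtype=float)
    energies = np.asarray(harmonic_energies, dtype=float)
    weights = energies / energies.sum()  # weights sum to one
    return float(np.sum(weights * doas))


# Hypothetical per-harmonic estimates (degrees) and energies, for illustration only.
doa = energy_weighted_doa([10.0, 12.0, 20.0], [2.0, 1.0, 1.0])
```

A plain weighted sum is adequate when the per-harmonic estimates are close together; for estimates that straddle the ±180 degree boundary, a circular mean would be the safer choice.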
Recent and detailed overviews on SSL in drone audition are also provided in Chevtchenko et al., 44 Martinez-Carranza and Rascon. 45
As mentioned above, this study introduces two real-world application scenarios based solely on acoustic data processing: (1) speaking subject LR from UAVs, and (2) UAV LR from UAVs, which are described later in this manuscript. These scenarios were reproduced at our university, deploying real acoustic sensors, UAVs, and acoustic sources, to generate one real dataset per scenario. This allowed the proposed DL models for acoustic source LR to be experimentally validated. Specifically, this research proposes two deep models capable of extracting information from acoustic scenes using a data-driven approach.
The motivation for considering the analysis of the acoustic scene alone is that the application of multi-channel audio processing for LR tasks requires further investigation in the literature, especially within the rapidly developing field of DL. Moreover, this research addresses the LR problem in scenarios where video information is unreliable or unavailable, such as during nocturnal or foggy conditions. In these cases, audio-based algorithms alone should provide fully accurate analysis of the surrounding scene. An iterative diagonal unloading (IDU) beamforming based on the identification of the dominant signal for DOA estimation in acoustic multi-channel signal processing is studied in Salvati et al. 46
Each cited work addresses specific scenarios, models, and issues. We conclude this section with a brief account of recent works most similar to our approach. Hence, the comparison is restricted to the CNN-based solutions mentioned above. Among those contributions, several studies propose systems for predicting the DOA of sound sources. For instance, the approach in Choi and Chang 19 focuses on a search and rescue application that employs a simple stereo microphone onboard a UAV and proposes a parametric multi-channel Wiener filter to address UAV ego-noise. Then, power level-based features are extracted (e.g., power level ratio (PLR), power level difference (PLD), and power level summation (PLS)) and fed into a CNN to predict the DOA of the sound source. The CNN architecture comprises a chain of six convolution blocks, each with a two-dimensional convolution, batch normalization, rectified linear unit (ReLU) activation, and a max-pooling layer, followed by two fully connected blocks. In He et al., 20 within the context of multi-speaker DOA estimation, the issue of insufficient labeled training data is addressed by leveraging data augmentation and weakly-supervised domain adaptation. Simulation is used to generate source domain data, while collected real data are annotated with the number of sound sources as weak labels. The real data are then augmented by mixing single-source segments. Finally, weakly-supervised domain adaptation is applied to models pre-trained on simulated data. This approach, according to experiments with real robot audio data conducted by the authors, achieves performance similar to that of fully-labeled real data scenarios. The problem of determining the DOA of sound sources as a classification task is also addressed in Varanasi et al., 24 where two CNNs are used to infer the elevation and azimuth of sound sources by leveraging spherical harmonic decomposition.
This decomposition enables the extraction of two feature sets containing information about the elevation and azimuth of the sound source. Finally, the authors of Wang et al. 26 focus on DOA estimation in noisy and reverberant environments by exploiting DNNs to identify speech-dominant time-frequency units with relatively clean phase. This is particularly useful for DOA estimation with a DNN trained using only monaural spectral information. This yields a model directly applicable to microphone arrays with diverse geometries.
From the above-mentioned studies, it emerges that one of the novel contributions of this work is an approach that combines sound classification and DOA estimation in a complete LR system using two types of non-trivial microphone arrays. Moreover, the associated CNNs are relatively lightweight, using 3-4 convolutions for Scenario 1 (two separate CNNs for sound classification and DOA estimation) and 7 convolutions for Scenario 2 (employing a data fusion strategy to combine both types of predictions). This is a consequence of the effectiveness of the chosen features.
In conclusion, our study primarily focuses on applying the LR algorithm to real scenarios featuring multi-channel acoustic data for two different targets: human voice and UAV sound. We present two datasets collected according to these scenarios, where two microphone array configurations have been employed: a 4-microphone linear array with a 16 kHz sample rate and a 19-microphone spherical array with a 48 kHz sample rate. Multiple UAVs and voices have also been employed to increase data diversity. The algorithm is designed to be suitable for embedded applications. Complexity analysis and comparison with the conventional SRP-PHAT method and a SOTA CNN model30,31 are conducted to verify the performance of the DNN models.
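For reference, the conventional SRP-PHAT baseline can be sketched as follows for a far-field source and a linear array. This is a minimal, generic illustration and not the implementation used in the comparison: the speed of sound (343 m/s), the angle convention (azimuth measured from the array broadside), the single-frame processing, and the simulated white-noise source are all assumptions made for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value


def far_field_delays(mic_x, theta_deg):
    """Relative delays (s) at microphones on the x-axis for a plane wave from theta."""
    return -np.asarray(mic_x, dtype=float) * np.sin(np.deg2rad(theta_deg)) / SPEED_OF_SOUND


def simulate_plane_wave(source, fs, mic_x, theta_deg):
    """Apply (circular) fractional delays in the frequency domain to build array signals."""
    n = len(source)
    spectrum = np.fft.rfft(source)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    delays = far_field_delays(mic_x, theta_deg)
    return np.stack(
        [np.fft.irfft(spectrum * np.exp(-2j * np.pi * freqs * tau), n=n) for tau in delays]
    )


def srp_phat(signals, fs, mic_x, grid_deg):
    """Steered response power with PHAT weighting over a grid of candidate angles."""
    n_mics, n = signals.shape
    spectra = np.fft.rfft(signals, axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # delays[k, m]: delay of microphone m for candidate angle k
    delays = np.stack([far_field_delays(mic_x, theta) for theta in grid_deg])
    power = np.zeros(len(grid_deg))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cross = spectra[i] * np.conj(spectra[j])
            cross = cross / (np.abs(cross) + 1e-12)  # PHAT: keep phase only
            tdoa = delays[:, i] - delays[:, j]
            steering = np.exp(2j * np.pi * np.outer(tdoa, freqs))
            power += np.real(steering @ cross)
    return power


# Simulated example: 4 microphones spaced 2 cm apart, source at 30 degrees.
rng = np.random.default_rng(0)
fs = 16000
mic_x = np.array([-0.03, -0.01, 0.01, 0.03])
grid = np.arange(-90.0, 91.0, 1.0)
x = simulate_plane_wave(rng.standard_normal(4096), fs, mic_x, theta_deg=30.0)
estimated_doa = float(grid[np.argmax(srp_phat(x, fs, mic_x, grid))])
```

The grid search over candidate angles, repeated for every microphone pair and frequency bin, is the part whose cost grows with array size and angular resolution, which is why computing time is a relevant point of comparison with learned models.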
Two real-world scenarios with UAVs based on acoustic signal processing
In this section, the two real-world scenarios for analyzing the acoustic scene, (1) speaking subject LR from UAVs and (2) UAV LR from UAVs, are described in detail. The term “real-world” refers to non-simulated scenarios representing applications that exist in reality (as opposed to hypothetical scenarios). The two datasets and their extended versions were collected in real environments.
Scenario 1: acoustic data-based speaking subject LR from UAVs
In Scenario 1, a UAV equipped with an onboard microphone array collects acoustic data from its environment. The target source is a human emitting vocal signals, which are captured alongside ego-noise generated by the UAV itself. This configuration enables the system to detect and localize human presence through voice, supporting applications such as human-UAV interaction and search-and-rescue missions.
The acoustic signal received at the microphone array consists of two main components: the voice emitted by the target source (human) and the propeller noise emitted by the acquisition UAV on which the microphone array is mounted (ego-noise). The collected signals can then be processed using signal processing methods for different tasks, such as detection, recognition, and localization of the target human (sound source) based on the intrinsic characteristics of the voice. Possible application areas include human-UAV interaction and search and rescue, where the acquisition and processing of acoustic sensory information for human voice LR may enable situation awareness.
Scenario 2: acoustic data-based UAV LR from UAVs
An acquisition UAV gathers information from the surrounding acoustic scene using its onboard microphone array. The acoustic sensor is mounted onboard the UAV. It can capture acoustic signals from sources in the surrounding area. In this scenario, the acoustic source is a UAV that performs flight maneuvers in the observed area. It generates propeller noise that propagates throughout the surrounding area.
The acoustic signal received at the microphone array consists of two main components: the propeller noise emitted by the target source (UAV) and the propeller noise emitted by the acquisition UAV where the microphone array is mounted (ego-noise). The collected signals can then be processed through signal processing methods for different tasks, such as detection, recognition, and localization of the target UAV (sound source) by means of the intrinsic characteristics of its propeller noise. Possible application areas include UAV-UAV interaction and counter-UAV operations (Figure 1), where the acquisition and processing of acoustic sensory information for UAV LR may enable situation awareness.

An example of Scenario 2 designed for counter-UAV applications, where a sentinel UAV and interceptor UAVs detect an intruder UAV by acquiring information from the surrounding acoustic scene through microphone arrays.
Acoustic data-based speaking subject LR from UAVs (Scenario 1) and acoustic data-based UAV LR from UAVs (Scenario 2) are the two real-world scenarios investigated in this work. In both scenarios, the primary cause of performance degradation is the ego-noise generated by the propellers. Several studies on UAV propeller noise characterization can be found in the literature, such as Insausti et al., 47 Moshkov, 48 Podsédkowski et al., 49 Kingan et al. 50
The general system architecture is represented in Figure 2, which shows the physical representation of the system, consisting of an acoustic source (either a UAV or a human) and a UAV equipped with a multi-microphone array (upper part of Figure 2), along with the logical structure of the algorithm for DOA estimation and source recognition (lower part of Figure 2).

General architecture of the system. Physical representation of the system (top) and logical structure of the algorithm (bottom).
The acoustic array formulation considers a microphone array with
The covariance matrix of the array signal can be computed as
On the other hand, the spectrum
The estimated covariance matrix
The spectrum
Acoustic tasks performed by the DNN models
The localization component of the algorithm provides estimates, denoted by
The recognition component of the algorithm provides estimates, denoted by
CNN-based DNN models
In this study, two deep network architectures are investigated and proposed. The first structure, shown in Figure 3, is employed in Scenario 1. The regression network is separated from the classification network. Sound DOA estimation is based on covariance matrix features, whereas sound classification uses the single-channel spectrum magnitude features. Both branches pass through a convolution layer before the corresponding head networks for regression and classification, respectively. The layer composition and corresponding parameters are also listed in the scheme.

Feature-based DNN scheme for data-driven acoustic source LR designed for Scenario 1. The layer composition and parameters are visualized in the scheme.
The second structure in Figure 4 is employed in Scenario 2. Two parallel branches process the input features –specifically the phase values computed from the estimated covariance matrix,

Feature-based DNN scheme for data-driven acoustic source LR designed for Scenario 2. The layer composition and parameters are visualized in the scheme.
Furthermore, network and learning parameter values are also detailed in Section 6.3 to facilitate understanding of the experiments.
The motivation for introducing these two deep networks is to substantially reduce the computational complexity of the algorithm by replacing portions of the computationally intensive multi-channel array processing with data-driven deep learning models without compromising LR algorithm performance.
The algorithm’s tasks are DOA estimation to predict the azimuth angle,
Acoustic sensors and experimental datasets
Experiments with acoustic signals have been conducted to validate and demonstrate the proposed method. In this section, we introduce two microphone arrays (a 4-microphone linear array and a 19-microphone spherical array) and two acoustic signal datasets (a dataset of human voice generated by speech and a dataset of UAV noise generated by propellers) used in the experiments. The three UAVs used in the experiments and mentioned in this section are: (1) a Parrot Bebop, 53 (2) a DJI Matrice 200, 54 and (3) an F450 custom-built quadcopter. 55
Acoustic sensors
1) 4-microphone linear array. The Sony PlayStation Eye (PS3 Eye) 56 is a small and lightweight device, as shown in Figure 5(a). In addition to a camera, the PS3 Eye features a built-in four-capsule linear microphone array, enabling technologies for multidirectional voice location tracking, echo cancellation, and background noise suppression. The microphones are equidistant, with an inter-microphone distance of 2 cm and a total array length of 6 cm. The peripheral can be used in applications such as speech recognition (SR) and audio chat in noisy environments without requiring a headset. Each channel of the PS3 Eye microphone array processes 16-bit samples at a sampling rate of 16 kHz, with a signal-to-noise ratio of 90 dB. Several technologies are available for the PS3 Eye. Among these is PSVR (PlayStation voice recognition), an SR library intended to support about 20 different languages.

a) Sony PlayStation Eye (PS3 Eye) where the 4-microphone linear array is visible on top of the device and b) Zylia ZM-1 consisting of 19 microphones organized in a spherical array.
2) 19-microphone spherical array. The Zylia ZM-1 57 is a multichannel microphone device used for acquisition. As a compact spherical array, it consists of 19 digital omnidirectional MEMS microphone capsules (XENSIV, from the German company Infineon Technologies) arranged on a sphere with a diameter of 9 cm (Figure 5(b)). The nominal signal-to-noise ratio is 69 dB, the dynamic range is 105 dB, and the output linearity is guaranteed up to 130 dB. The Zylia ZM-1 is capable of capturing the entire surrounding acoustic scene in 3D. The sample rate and the resolution are 48 kHz and 24 bit, respectively. The acoustic gain is adjustable in the range from 0 to 70 dB. The microphones do not require re-calibration because their parameters are matched and constant over time. The front of the sphere is marked with a painted dark-red dot.
The two experimental datasets described in this section and the related annotation files are freely available. 58
1) Data consisting of human voice generated by speech. The dataset was collected according to Scenario 1 introduced in Sect. 3.1 and corresponds to the experimental setup represented in Figure 6. It features a human subject whose pre-recorded vocal signals were alternately emitted by two loudspeakers placed in front of the drone. The two loudspeakers were positioned symmetrically with respect to the drone’s frontal direction and were moved to different angles and distances, each playing the same sentence one at a time in two different voice modes: normal and loud. A set of 30 positions was used for each of the two categories (normal voice and loud voice) according to the following angles

Setup for Scenario 1, where the acoustic signal generated by human voice is captured by the 4-microphone PS3 Eye array. The main scheme of the experiment is shown in the lower-left part of the image, indicating the positions of the target human (speaker). The horizontal distance values
An example of an acoustic signal generated by the human voice of a target speaker and collected by the PS3 Eye is shown in Figure 7, where all four channels are visualized. The time duration of the visualized recording is 410 seconds and the sampling rate is 16 kHz.

Acoustic signal generated by human voice of a target speaker and collected by the PS3 Eye sensor (all the 4 channels are visualized in the image). The time duration of the visualized recording is 410 seconds and the sampling rate is 16 kHz.
The ego-noise is a pre-recorded signal previously obtained by recording the propeller noise of a Parrot Bebop UAV using the 4-microphone PS3 Eye array, as shown in Figure 8. In the synthesis method, the ego-noise is added to the target signals described above by time-domain superposition. The signal-to-ego-noise ratio (SNR) can be set according to the specifications of the experiment. To increase the diversity of the dataset, a software tool was used to produce different voices from the acoustic recordings of Scenario 1. It applies acoustic transformations to the voice signal to change its characteristics, simulating different voices and yielding multiple speakers (more than 10 speakers can be obtained). The drawbacks of this method are the loss of spatial information, normal/loud characteristics, and channel separation. The extended diversity dataset is employed for testing the multi-voice classification performance of the DNN-based recognition algorithm.
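The time-domain superposition at a prescribed signal-to-ego-noise ratio can be sketched as follows. This is a minimal illustration under stated assumptions: signal powers are averaged over all channels and samples, the ego-noise recording is repeated if it is shorter than the target signal, and the function name and the synthetic signals are hypothetical.

```python
import numpy as np


def mix_at_snr(target, ego_noise, snr_db):
    """Add ego-noise to a multichannel target signal at the requested SNR (dB).

    target, ego_noise: arrays of shape (n_mics, n_samples).
    Returns the mixture and the scaled ego-noise that was added.
    """
    target = np.asarray(target, dtype=float)
    ego_noise = np.asarray(ego_noise, dtype=float)
    n = target.shape[1]
    repeats = -(-n // ego_noise.shape[1])  # ceiling division
    noise = np.tile(ego_noise, (1, repeats))[:, :n]
    target_power = np.mean(target ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that target_power / scaled_noise_power = 10^(snr_db / 10)
    gain = np.sqrt(target_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return target + scaled_noise, scaled_noise


# Synthetic stand-ins for the voice and ego-noise recordings, for illustration only.
rng = np.random.default_rng(0)
voice = rng.standard_normal((4, 16000))
ego = 0.3 * rng.standard_normal((4, 8000))
mixture, scaled_noise = mix_at_snr(voice, ego, snr_db=-5.0)
achieved_snr_db = 10.0 * np.log10(np.mean(voice ** 2) / np.mean(scaled_noise ** 2))
```

Applying one common gain to all channels preserves the inter-channel relationships of the ego-noise recording, which matters because the localization features depend on inter-channel phase.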

Ego-noise signal generated by propeller noise of a hovering Parrot Bebop UAV and collected by the PS3 Eye array mounted on the UAV (all the 4 channels are visualized in the image).
2) Data consisting of UAV noise generated by propellers. The dataset was collected according to Scenario 2, introduced in Sect. 3.2, and corresponds to the experimental setup shown in Figure 9. The acoustic signals, sequentially emitted by UAVs generating propeller noise while hovering at six specific positions defined by their coordinates, were recorded using the Zylia. Because of the array's weight, it is unsafe for a lightweight UAV to carry the Zylia. The altitude of the spherical acoustic array, measured relative to the ground, was 1.60 m. Two UAVs with different characteristics, serving as unidentified acoustic sources, flew in the area near the acoustic sensor (first the DJI Matrice 200 and then the F450, in two separate acquisition sessions). Both UAVs were in hovering mode during acquisition at six coordinate points, according to the following azimuth angles

Setup for Scenario 2, where the acoustic signal generated by hovering UAV propellers is captured by the 19-microphone Zylia ZM-1 array. The main scheme of the experiment is shown in the lower-right part of the image, indicating the positions of the target UAVs. The 3D distance values
The front direction relative to the Zylia is the reference direction. The experimental setup that reproduces the real acoustic scene is shown in Figure 9. The Zylia is placed in the middle of a meadow on the university campus, near the gym building. To increase dataset diversity, four additional UAV targets were employed following the same procedure according to the Scenario 2 specifications. The four additional UAVs were (1) an Aurelia X6 hexacopter, (2) a DJI Phantom 4 quadcopter, (3) a DJI Mini 4K quadcopter, and (4) a Yuneec H520E hexacopter, resulting in a total of six UAV targets (including the two UAVs described previously). This results in a multi-class problem, and the extended diversity dataset is used to test the multi-UAV classification performance of the DNN-based recognition algorithm. Regarding environmental noise, reverberation from obstacles, and other challenging conditions, dataset acquisitions were conducted outdoors in open spaces (near roads, industrial buildings, trees, etc.), where reverberation and various environmental sounds from the surrounding area were also captured, undoubtedly affecting LR algorithm performance. To create the dataset of propeller noise recordings for the six configurations of the hovering UAVs, the audio signals collected with the spherical acoustic array were recorded as mp4 files. The recording time for each configuration was 19-30 seconds, the Zylia audio gain was set to 0 dB, the data format was 32-bit float, the FFT length was 2048 samples, and the sampling rate was 48 kHz. The sounddevice Python module is used in the recording software.
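To make the recording parameters concrete, the sketch below splits a 19-channel, 48 kHz, 32-bit float recording into 2048-sample frames and computes one FFT per frame and channel. The Hann window and the non-overlapping hop are assumptions made for illustration, not settings reported here, and the helper name `frame_spectra` is hypothetical.

```python
import numpy as np

FS = 48_000     # sampling rate reported in the text (Hz)
N_FFT = 2048    # FFT length reported in the text (samples)
N_CH = 19       # number of Zylia ZM-1 channels

def frame_spectra(x, n_fft=N_FFT, hop=N_FFT):
    """Split x of shape (n_samples, n_channels) into frames and FFT each one.

    Returns a complex array of shape (n_frames, n_fft // 2 + 1, n_channels).
    hop == n_fft means non-overlapping frames (assumption).
    """
    n_frames = 1 + (x.shape[0] - n_fft) // hop
    window = np.hanning(n_fft)[:, None]                     # Hann window (assumption)
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=1)

# One second of silence stands in for a real recording
x = np.zeros((FS, N_CH), dtype=np.float32)
spectra = frame_spectra(x)

frame_duration_ms = 1000.0 * N_FFT / FS   # duration of one frame in milliseconds
bin_width_hz = FS / N_FFT                 # frequency resolution of one FFT bin
```

With these values, one frame spans about 42.7 ms and each FFT bin is about 23.4 Hz wide.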
An example of an acoustic signal generated by propeller noise from a target UAV (Matrice 200) and collected by the Zylia ZM-1 is shown in Figure 10, where 8 out of 19 channels are visualized. The time duration of the visualized recording is 33 seconds and the sampling rate is 48 kHz. Tests on the Scenario 2 setup were conducted with the Zylia ZM-1 mounted below an F450 UAV, as shown in Figure 11(a), for ego-noise measurements. Additionally, real ego-noise was collected by flying a UAV (a Holybro X500 V2 quadcopter) over the Zylia at a safe distance from the spherical array, as an alternative ego-noise source, as shown in Figure 11(b). Since it is unsafe for the F450 to carry the Zylia, and lighter spherical arrays and planar arrays do not offer comparable quality and performance, the hovering state of the acquisition UAV was simulated both by fixing the UAV to a stable support and by flying the UAV over the Zylia. In this manner, the ego-noise is a pre-recorded signal obtained by recording the propeller noise of the F450 UAV with the 19-microphone Zylia array. An example is shown in Figure 12. After SNR calibration, the synthesis method adds the ego-noise to the target signals described above by time-domain superposition. This simulates the hovering state of the F450. The SNR can be set according to the specifications of the experiment.

Acoustic signal generated by propeller noise of a target UAV (Matrice 200) and collected by the Zylia ZM-1 sensor (only 8 out of 19 channels are visualized in the image). The time duration of the visualized recording is 33 seconds and the sampling rate is 48 kHz.

Zylia ZM-1 (a) mounted below a fixed F450 quadcopter and (b) below a flying Holybro X500 V2 quadcopter, for testing the Scenario 2 setup and for ego-noise measurements.

Ego-noise signal generated by propeller noise of a hovering F450 UAV and collected by the Zylia ZM-1 array mounted on the UAV (only 8 out of 19 channels are visualized in the image).
HW and SW processing platforms
A laptop with an Intel i7-11800H CPU @ 2.30 GHz, 16 GB RAM, an NVIDIA GeForce RTX 3050 Ti Laptop GPU (4 GB), and 64-bit Windows 10 Pro OS was used for data collection and storage. A desktop workstation with an Intel i9-10920X CPU @ 3.50 GHz, 64 GB RAM, an NVIDIA GeForce RTX 3090 GPU (24 GB), and 64-bit Windows 10 OS was used for processing and for training, validation, and testing of the deep neural model.
Python 3.11.4 with PyTorch 2.0.1+cu117 was used for the deep network experiments, including training, validation, and testing. The Python libraries soundfile and sounddevice were used for reading and extracting the acoustic data.
Datasets of acoustic signals
The datasets used in the experimental study cover both Scenario 1 for acoustic data-based speaking subject LR from UAVs (Section 5.2-1) and Scenario 2 for acoustic data-based UAV LR from UAVs (Section 5.2-2).
Signal and learning parameters
Values of the signal and learning parameters used for the experiments are listed in Table 1.
Signal and learning parameters.
The outcomes of the experiments for the two scenarios are introduced in this section.
Scenario 1 (speaking subject LR with 4-channel signals)
The system was tested under two signal-to-ego-noise ratios (SNR1 = 2:1 and SNR2 = 1:1) and two batch sizes (32 and 64). Results demonstrated consistent localization accuracy across settings (Table 2). The mean azimuth error varied between 5.78
LR performance (Scenario 1) for two SNR levels (SNR1 = 2:1 and SNR2 = 1:1) and two batch sizes (BS1 = 32 and BS2 = 64) for the audio signal with 4 channels.
The conventional SRP-PHAT method and a SOTA CNN model, both designed for acoustic target localization, were applied to the human voice-based dataset of Scenario 1 to compare their performance with that of our DNN-based method. The time-domain variant of SRP-PHAT was evaluated with 100 and 200 grid points. The results are organized in Table 3, where two ego-noise levels are considered.
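For readers unfamiliar with the baseline, the sketch below illustrates the principle of SRP-PHAT on a far-field source and a linear array, scanning a grid of 200 candidate azimuths. It is not the implementation evaluated here: it evaluates the PHAT-weighted cross-correlation of each microphone pair at the candidate lag directly in the frequency domain, which is equivalent to interpolating the time-domain GCC at fractional lags. The microphone positions, the signal, and all function names are hypothetical.

```python
import numpy as np

C = 343.0  # speed of sound in air (m/s)

def srp_phat_azimuth(x, fs, mic_x, n_grid=200):
    """Estimate azimuth (degrees from broadside, in [-90, 90]) with SRP-PHAT.

    x: (n_samples, n_mics) signals; mic_x: microphone positions on the array axis (m).
    Positive angles point toward increasing mic_x. Far-field model (assumption).
    """
    X = np.fft.rfft(x, axis=0)                           # (n_freq, n_mics)
    freqs = np.fft.rfftfreq(x.shape[0], d=1.0 / fs)      # (n_freq,)
    grid = np.linspace(-90.0, 90.0, n_grid)              # candidate azimuths
    # Arrival delay at each microphone for each candidate direction
    tau = -np.outer(np.sin(np.deg2rad(grid)), mic_x) / C  # (n_grid, n_mics)

    power = np.zeros(n_grid)
    n_mics = x.shape[1]
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cross = X[:, i] * np.conj(X[:, j])
            cross /= np.abs(cross) + 1e-12                # PHAT weighting: keep phase only
            lag = tau[:, i] - tau[:, j]                   # candidate TDOA for this pair
            steer = np.exp(2j * np.pi * np.outer(lag, freqs))
            power += np.real(steer @ cross)               # steered response per candidate
    return grid[np.argmax(power)], power

# Synthetic check: white-noise source at 30 degrees, 4 microphones (hypothetical geometry)
rng = np.random.default_rng(0)
fs, n = 16_000, 4096
mic_x = np.array([0.0, 0.05, 0.10, 0.15])
true_deg = 30.0
spectrum = np.fft.rfft(rng.standard_normal(n))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
delays = -mic_x * np.sin(np.deg2rad(true_deg)) / C
x = np.stack(
    [np.fft.irfft(spectrum * np.exp(-2j * np.pi * freqs * d), n=n) for d in delays],
    axis=1,
)
est_deg, srp_power = srp_phat_azimuth(x, fs, mic_x, n_grid=200)
```

The angular resolution is bounded by the grid spacing (about 0.9 degrees for 200 points over 180 degrees), and the cost grows linearly with the number of grid points and quadratically with the number of microphones.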
SRP-PHAT and SOTA CNN performance (scenario 1) against two SNR (SNR1

Classifier performance on the extended dataset (scenario 1) against the number of voices (2-6) with BS=32 and SNR=1:1 for the audio signal with 4 channels.
The classifier performance on the increased diversity dataset with 2 to 6 different voices can be seen in the test accuracy curve in Figure 13, obtained with a batch size of 32 and an ego-noise level equal to the signal level. For this experiment, MFCCs were used as input features, and the DNN was replaced with the model described in Ref. 59. These speaker classification results are justified within the context of our study, which focuses on low-complexity methods where acoustic data are heavily degraded by UAV ego-noise. In this situation, the diversity in distinctive characteristics of each voice is reduced.
For computational complexity analysis, single-frame computation time was measured for both the binary dataset (localization and recognition) and the multi-class dataset (recognition only) for: (1) training of the DNN model; (2) testing of the DNN model; (3) performing the SRP-PHAT method; (4) training and testing of the SOTA CNN-based model. The measured times in msec are in Table 4.
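A single-frame computation time of this kind can be obtained by averaging wall-clock time over many frames after a short warm-up. The sketch below shows one way to do so; the measurement protocol actually used is not detailed in the text, so the warm-up count, the helper name `time_per_frame_ms`, and the FFT workload used as a stand-in for a model are all assumptions.

```python
import time
import numpy as np

def time_per_frame_ms(process, frames, warmup=3):
    """Average wall-clock time (in milliseconds) that `process` takes per frame.

    A few warm-up calls are excluded so that one-off costs (caches, lazy
    initialization) do not bias the average. For GPU models, the device would
    also need to be synchronized before each clock read (e.g.
    torch.cuda.synchronize() in PyTorch), because kernels run asynchronously.
    """
    for frame in frames[:warmup]:
        process(frame)
    start = time.perf_counter()
    for frame in frames:
        process(frame)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(frames)

# Stand-in workload: one FFT per 4-channel frame
frames = [np.zeros((2048, 4), dtype=np.float32) for _ in range(20)]
ms_per_frame = time_per_frame_ms(lambda f: np.fft.rfft(f, axis=0), frames)
```

Comparing `ms_per_frame` with the duration of the audio frame itself indicates whether processing keeps up with real time.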
Single-frame computation time (msec) for scenario 1 at two different batch size (BS1
When trained with a batch size of 32, the system consistently produced low mean azimuth errors of approximately 3.05
LR performance (Scenario 2) for two SNR levels (SNR1 = 2:1 and SNR2 = 1:1) and two batch sizes (BS1 = 32 and BS2 = 64) for the audio signal with 19 channels.
These results validate the system’s ability to effectively estimate direction and recognize source types, even in realistic, noisy UAV conditions. Notably, a smaller batch size appears more favorable for localization performance in both scenarios.
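The two localization metrics reported for both scenarios, the mean azimuth error and the RMSE, can be computed as in the following sketch. It assumes that the mean error is the mean absolute angular difference and that differences are wrapped to [-180, 180) degrees before averaging; the exact definitions are not restated in this section, and the function name and sample values are hypothetical.

```python
import numpy as np

def doa_errors_deg(pred_deg, true_deg):
    """Return (mean absolute error, RMSE) between predicted and true azimuths.

    Angular differences are wrapped to [-180, 180) so that, for example,
    350 degrees and 10 degrees are 20 degrees apart rather than 340.
    """
    diff = np.asarray(pred_deg, dtype=float) - np.asarray(true_deg, dtype=float)
    diff = (diff + 180.0) % 360.0 - 180.0
    mean_abs = float(np.mean(np.abs(diff)))
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    return mean_abs, rmse

# Illustrative values only (not taken from the experiments)
mean_err, rmse = doa_errors_deg([10.0, 350.0, -30.0], [0.0, 10.0, -20.0])
```

Since the RMSE weights large deviations more heavily, it is never smaller than the mean absolute error, which is consistent with the RMSE values being higher than the mean errors in the reported results.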
The conventional SRP-PHAT method and the selected SOTA CNN model, both designed for acoustic target localization, were applied to the UAV sound-based dataset of Scenario 2 to compare their performance with that of our DNN-based method. The time-domain variant of SRP-PHAT was considered with 200 grid points. The results are organized in Table 6, where two ego-noise levels are considered.
SRP-PHAT and SOTA CNN performance (scenario 2) against two SNR (SNR1
The performance of the classifier on the increased diversity dataset with 6 different UAVs is reported in Table 7, obtained with batch sizes of 32 and 64 and with the signal-to-ego-noise ratio set to two different levels.
Classifier performance on the extended dataset (scenario 2) against two SNR (SNR1
For computational complexity analysis, single-frame computation time was measured for both the binary dataset (localization and recognition) and the multi-class dataset (recognition only) for: (1) training of the DNN model; (2) testing of the DNN model; (3) performing the SRP-PHAT method; (4) training and testing of the SOTA CNN-based model. The measured times in msec are in Table 8.
Single-frame computation time (msec) for scenario 2 at two different batch size (BS1
Visualizations of internal network representations (e.g., Figure 14) further illustrate how the deep learning model processes features through its layers and consolidates information for final predictions, providing insight into the interpretability and functioning of the learned models.

Internal representations of the most important network layers for Scenario 2 (validation, batch size 64, SNR 2:1): (a) spectrum feature linear layer, (b) covariance matrix feature linear layer, (c) fusion layer, and (d) first of the two linear layers in the recognition head.
Additionally, the internal representations of the most important network layers during the validation process for Scenario 2, with batch size 64 and SNR 2:1, are extracted and shown in Figure 14. These images represent how signals are encoded throughout the DNN layers. Considering the DNN model in Figure 4, the image in Figure 14(a) corresponds to the linear layer of the spectrum feature branch, and the image in Figure 14(b) corresponds to the linear layer of the covariance matrix feature branch. The fusion layer generated the representation in Figure 14(c). After the fusion block in the DNN model, one linear layer produces the sound DOA estimation at one network head, while another linear layer produces the sound class prediction at the other head. Just before the aforementioned linear layer of the recognition head, the representation is generated as shown in Figure 14(d). Specifically, the DNN outputs (localization and recognition predictions) are related to encoded signal representations that propagate through the DNN layers and can be visualized through these internal representations. Once the DNN is trained, its model parameters are frozen during validation and testing; thus, internal representations change only with variations in the input signal, while network parameters have no influence during these phases. This facilitates both DNN testing and investigation of how outputs are formed.
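Internal representations of this kind can be captured in PyTorch with forward hooks while the parameters remain frozen. The sketch below is a minimal illustration under stated assumptions: `TwoBranchNet` is a hypothetical stand-in that mirrors only the layer roles named above (spectrum branch, covariance branch, fusion, DOA head, two-layer recognition head), omits the CNN feature extractors, and uses arbitrary layer sizes; it is not the model of Figure 4.

```python
import torch
import torch.nn as nn

class TwoBranchNet(nn.Module):
    """Hypothetical two-branch network with a DOA head and a recognition head."""

    def __init__(self, spec_dim=64, cov_dim=32, hidden=16, n_classes=2):
        super().__init__()
        self.spec_fc = nn.Linear(spec_dim, hidden)    # spectrum feature branch
        self.cov_fc = nn.Linear(cov_dim, hidden)      # covariance matrix feature branch
        self.fusion = nn.Linear(2 * hidden, hidden)   # fusion layer
        self.doa_head = nn.Linear(hidden, 1)          # localization head
        self.cls_fc1 = nn.Linear(hidden, hidden)      # recognition head, first layer
        self.cls_fc2 = nn.Linear(hidden, n_classes)   # recognition head, second layer

    def forward(self, spec, cov):
        branches = [torch.relu(self.spec_fc(spec)), torch.relu(self.cov_fc(cov))]
        z = torch.relu(self.fusion(torch.cat(branches, dim=1)))
        return self.doa_head(z), self.cls_fc2(torch.relu(self.cls_fc1(z)))

def capture_activations(model, layer_names, *inputs):
    """Run one forward pass and return (outputs, {layer name: layer output})."""
    captured, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        def hook(_module, _inputs, output, name=name):
            captured[name] = output.detach().cpu()
        handles.append(modules[name].register_forward_hook(hook))
    model.eval()                      # inference mode: parameters are not updated
    with torch.no_grad():
        outputs = model(*inputs)
    for handle in handles:            # remove hooks so later passes are unaffected
        handle.remove()
    return outputs, captured

torch.manual_seed(0)
model = TwoBranchNet()
layers = ["spec_fc", "cov_fc", "fusion", "cls_fc1"]
(doa, logits), acts = capture_activations(
    model, layers, torch.randn(8, 64), torch.randn(8, 32)
)
```

Each entry of `acts` is a batch-by-features tensor that can be rendered as an image; with the parameters fixed, any change in these tensors reflects a change in the input signal only.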
This work comprises three core components: system setup, dataset acquisition and processing, and deep learning-based acoustic scene interpretation for localization and recognition tasks. The proposed models are designed for real-world deployment in scenarios where traditional sensing may be limited, such as search-and-rescue or UAV monitoring. By analyzing multichannel audio data, the models extract spatial and semantic information from acoustic scenes, demonstrating robustness even under high ego-noise conditions. The network’s ability to detect and discriminate acoustic signatures of the signals using the CNN and other model layers enables the DNN to mitigate the impact of noise.
Results were obtained under different signal-to-noise conditions (SNR1 and SNR2) and batch sizes (BS1 and BS2) for both scenarios. Validation and testing produced similar values for each metric, SNR, and batch size configuration, for both azimuth angle estimation and class prediction across both scenarios. Overall, a higher ego-noise level (SNR2) slightly increases azimuth prediction errors, although this effect is more pronounced in Scenario 1 than in Scenario 2. A similar pattern is observed for classification accuracy. In Scenario 1, the batch size parameter has essentially no influence. However, when considering testing
Representations of acoustic data as they propagate through the DNN are visualized for Scenario 2 to investigate the DNN and how it produces DOA and sound class predictions. Understanding these representations offers two advantages: highlighting the internal processes at the layer level within the DNN and enabling DNN optimization based on these representations. Once fully investigated within the DL field, this study may yield important findings and applications for contexts such as acoustic scene awareness from UAVs for real-time search-and-rescue operations and counter-UAV intrusion detection.
Conclusion and future work
This research presented experimental UAV-based acoustic scene awareness through LR in counter-UAV and search-and-rescue applications. Using acoustic sensors (a 19-microphone spherical array and a 4-microphone linear array), information from the acoustic scene can be extracted and analyzed. The motivation is that analysis of the acoustic scene exclusively through multi-channel audio processing for LR tasks requires further investigation, especially in the context of rapidly developing DL. Specifically, this study addresses the LR problem in scenarios where video information is unreliable or unavailable. In such cases, audio-based algorithms must guarantee accurate results. A real experimental framework based on acoustic data acquisition by UAVs was utilized for LR of acoustic sources in two scenarios: (1) LR of target human voice from UAVs for search-and-rescue applications (the human voice is the source) and (2) LR of target UAVs from UAVs for counter-UAV applications (the propeller noise is the source). Within a data-driven framework, two feature-based DNN models based on CNNs were investigated and utilized to analyze the data gathered from the corresponding acoustic scenes and perform LR tasks involving DOA estimation (localization) and class prediction (recognition). These models are capable of processing the multi-channel acoustic signals even in the presence of UAV ego-noise.
The results demonstrated the validity of the algorithm for both scenarios at different signal-to-noise levels. Moreover, the LR study also included (1) a comparison with both the conventional SRP-PHAT method and a selected SOTA CNN model to evaluate DOA estimation accuracy, (2) a computational time analysis demonstrating the real-time capability of the DNN-based approach, and (3) multi-class classification using an increased diversity dataset with 2 to 6 speakers in Scenario 1 and 6 UAVs in Scenario 2. Future work will expand the range of acoustic sources and sensing devices to improve model generalizability. Additionally, ongoing research will explore both the optimization of deep neural network architectures, hyperparameters, and interpretability through internal feature visualization and the potential of combining multi-channel audio processing with deep learning. Environmental influences caused by reverberation from walls, buildings, trees, and other obstacles will also be investigated in future work to fully characterize their impact on LR tasks. To improve the suitability of the LR method for UAV-embedded platforms, strategies for reducing the computational complexity, such as model pruning, quantization, and signal downsampling, should be considered. These efforts aim to enhance the practical deployment of audio-based scene awareness in critical applications such as search-and-rescue and UAV intrusion detection.
Footnotes
Acknowledgements
Partial financial support was received from PNRR DD 3277 del 30 dicembre 2021 (PNRR Missione 4, Componente 2, Investimento 1.5) - iNEST, and from the Programma Operativo Complementare (POC) della Regione Autonoma Friuli Venezia Giulia 2014-2020 (cup G23C25000880002, operation identifier 2025/13846, project identifier 2025/13846/1).
Dr. Andrea Gulli from the University of Udine took part in the “human voice generated by speech” data collection and dataset preparation used in this work.
Author contributions
All authors have had an active part in the study and the manuscript preparation. All authors have approved the manuscript, and agree with its submission to the Integrated Computer-Aided Engineering journal.
Ethical considerations
All the research meets the ethical guidelines and legal requirements specified by the Integrated Computer-Aided Engineering journal.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
