Abstract
Depth estimation from images is fundamental for the autonomous navigation of robots, vehicles, and drones, and for navigation aid systems for people with visual impairments. Despite the challenges of obtaining depth information from complex scenes, advancements in Deep Learning have opened new possibilities. This work introduces an approach based on recent Convolutional Neural Network architectures and attention mechanisms to enhance monocular image depth estimation, with potential applications in navigation aid systems for the visually impaired. The proposal focuses on implementing a Convolutional Neural Network model with an attention mechanism configuration that has not yet been tested in the literature, primarily integrating the Convolutional Block Attention Module and the Modified Global Context Network in the encoder and decoder, respectively. Unlike stereo camera-based systems, which require complex setups and image pairs, this model simplifies data collection and processing, although it still faces the challenge of requiring large datasets and significant computational capacity. The experiments show, however, that these limitations can be mitigated by using reduced-resolution images and resizing techniques. The proposed model performed satisfactorily against state-of-the-art works that use images with the same resolution as this work, which keeps the comparison fair. It improved the Absolute Relative Error by 25.22% and the Root Mean Squared Error by 6.28%. The results highlight the feasibility of conducting Deep Learning research even with limited hardware resources.
Introduction
Depth estimation technology is crucial for ensuring the safety and autonomy of modern autonomous vehicles 1 and drones. Beyond the automotive sector, depth maps play a vital role in various fields, including navigation assistance for visually impaired individuals, enhancing their ability to perceive their surroundings. The ability to analyze a scene in detail provides enhanced resources for decision-making. 2
Depth estimation concepts date back to early theories of vision, such as Wheatstone’s stereoscope and the exploration of motion parallax. 3 Wheatstone’s groundbreaking work involved creating a stereoscope, revealing that the brain uses horizontal disparity to perceive relative depth. Another significant contribution came from Von Helmholtz, 4 who explored depth perception through motion structure. He proposed that objects at varying distances exhibit different speeds when moving across the retinal surface, introducing the concept of motion parallax. This phenomenon allows the human brain to extract three-dimensional shapes from dynamic scenes. These fundamental theories laid the groundwork for the development of multi-view depth mapping.
While traditional methods such as stereo vision systems and motion parallax laid the groundwork, modern approaches have shifted towards monocular depth estimation using deep learning. A stereo vision system leverages the principles introduced by Wheatstone to estimate depth from multiple viewpoints. Similarly, another technique creates a depth map based on motion structure: a single camera is systematically moved with a constant baseline, and the resulting views are used to build the depth map. 5 These historical perspectives and methodologies paved the way for contemporary advancements in depth estimation techniques. 6
However, the complexity of setting up stereo cameras for depth estimation has discouraged the use of these methods in real-world applications, leading to widespread adoption of monocular estimation. This simplifies the data acquisition process, eliminating the need for multiple images and the complexity associated with stereo matching. 7
A notable early work on monocular depth estimation is that of Hoiem et al., 8 who reconstructed a three-dimensional scene in a virtual environment using hand-crafted features to categorize pixels by shape and color. Other researchers, such as Karsch et al. 9 and Ladicky et al., 10 applied the theories of Hoiem et al. 8 to conduct depth analyses, referring to this process as “depth from semantic segmentation”.
Compared to parallax methods,11,12 our neural network-based approach offers significant advantages. It is capable of generating depth maps from a single monocular image, which eliminates the need to capture multiple images from different angles. Furthermore, deep learning allows for adaptation to a wide range of scenarios and lighting conditions, potentially overcoming the limitations of traditional methods that heavily rely on the quality and alignment of stereoscopic images.
However, it is important to recognize that the neural network-based approach also has limitations, such as the need for large training datasets and the potential for limited generalization in scenarios that differ significantly from the training data. In contrast, parallax methods may be more robust in environments where capturing multiple images is feasible.
In summary, while parallax-based methods offer a traditional and straightforward solution for depth estimation, our neural network approach provides flexibility and the ability to handle monocular images, excelling in depth estimation across diverse and realistic scenarios.
Recent advancements in deep learning, such as Convolutional Neural Networks (CNNs) and attention mechanisms, have significantly improved the accuracy of monocular depth estimation, 6 and advanced models have been studied in recent years in various ways for application in these tasks.13–16
Finally, depth estimation is a powerful approach for the pre-training of deep networks17,18 using image datasets. 19 However, collecting large, diversified training datasets with accurate ground truth is a challenging task.13,20
This study aims to implement a U-Net architecture integrated with attention mechanisms, specifically the Convolutional Block Attention Module (CBAM) and Global Context Network (GCNet), to enhance monocular depth estimation. The motivation is to develop a reliable method for integration into navigation assistance systems for visually impaired individuals.
Accordingly, the suggested architecture incorporates the CBAM and GCNet attention mechanisms as its main components. This configuration was defined after a thorough, though not exhaustive, review of the specialized literature, which identified no U-Net architectures that simultaneously integrate these two specific attention mechanisms for estimating depth maps from monocular images. As the experimental results show, this configuration improved on the state-of-the-art methods cited in this work in terms of Absolute Relative Error and Root Mean Squared Error.
A similar approach for feature selection, utilizing the attention mechanism, was employed by Xue et al. 21 in their development of an external attention-based feature ranker for feature selection, highlighting the effectiveness of attention models in identifying important features across datasets. This connection further underscores the relevance of attention mechanisms in the context of our study.
The article by Xue et al. employs an external attention mechanism inspired by concepts used in deep neural networks, particularly the Transformer model. External attention helps model the relationships between features without the need for the model to directly access the original data, which can be advantageous in terms of processing and efficiency. However, unlike the present study, the authors used Transformers, originally applied in natural language processing (NLP) models, and deep neural networks adapted to solve problems in other areas, such as feature selection. Feature selection is a crucial step in machine learning pipelines, but this approach requires more powerful hardware to train the model.
The paper is organized as follows: the introduction covers the key concepts of monocular depth estimation, followed by a comprehensive review of related works. In the fundamentals section, the Convolutional Block Attention Module (CBAM) and the Global Context Network (GCNet) are introduced, which form the basis of the proposed model. The methodology section details the implementation process of the model, while the experimental results section analyzes the performance metrics and outcomes. The ablation section explores the various tested configurations, providing an in-depth view of the model variations. Finally, the conclusion synthesizes the main contributions of the study.
Related works
This section reviews key state-of-the-art methodologies in monocular depth map estimation, highlighting significant advancements and their contributions to the field.
Eigen et al. 22 pioneered the use of convolutional neural networks (CNNs) for predicting depth maps from single images. Their two-stage model, comprising a coarse and fine network, leveraged pre-trained weights from the ImageNet dataset to enhance performance. 23
Kumari et al. 24 introduced a residual encoder-decoder CNN with hourglass networks, enhancing depth estimation by analyzing encoded features at various scales. The inclusion of a perceptual loss function accelerated model convergence. 25
Alhashim and Wonka 26 used transfer learning with a Dense-Net encoder-decoder network, leveraging pre-trained weights for efficient feature extraction. Despite improved training time, the model faced challenges with memory usage and overfitting due to color augmentation.23,27
Eigen and Fergus 28 extended the work of Eigen et al. by integrating semantic labels and surface normals into depth map estimation, using a deeper network based on VGG architecture. This approach improved performance but faced challenges with sharp transitions.22,23,29
To address limitations in depth map sharpness, Seo 30 applied a signal-to-noise ratio (SNR) based method for edge and line detection, enhancing the quality of depth maps through post-processing.31–33
Brief theoretical foundation
The goal of monocular depth estimation is to obtain distance information from a single image of the scene. This challenge is significant due to the complex correspondence between pixels in a 2D image and infinite points in 3D space. Despite this complexity, humans are able to estimate distance by aggregating visual cues in the observed scene, even when using just one eye. This phenomenon suggests that visual cues contain distance information. Thus, monocular depth estimation can be formulated as the search for a mapping between visual cues and distance information of objects in the scene, 34 which can be defined as follows, equation (1):
The concept of attention, rooted in biological inspiration, plays a crucial role in tasks such as image classification 35 and machine translation. Prominent components, identified through a global analysis of the scene, hold greater relevance for the object category and demand more attention. In machine translation, certain words have a greater influence on the output, indicating that each component of the source impacts the target in a distinct manner, requiring differentiated treatment. The attention mechanism is not limited to modeling only the relevance between source and destination; it also has the capability to generate new representations, adjusted according to the weights assigned to each source component. From a technical standpoint, the attention model calculates coefficients through a query (
In Convolutional Neural Networks (CNNs), attention mechanisms are incorporated through one of two methods: post-hoc analysis of the network or the use of trainable attention mechanisms. The former technique has been predominantly employed to access the network’s intrinsic reasoning in the context of object visual recognition.29,37 Trainable attention, in turn, is subdivided into hard attention and soft attention.
Hard attention, characterized by a stochastic process, poses significant challenges to model training due to the iterative proposal of region and cropping. 38 These models often lack differentiability and require reinforcement learning for parameter refinement. Recently, hard attention has been extensively adopted in trainable transformers, especially due to its application in the iterative proposal of region and cropping during feature learning. 6
Soft attention, characterized by a deterministic process, uses conventional backpropagation. These attention mechanisms eliminate the need for Monte Carlo or repeated random sampling. Soft attention is employed to highlight only the relevant activations during training, resulting in the reduction of computational resources spent on irrelevant activations and promoting more effective network generalization. Recently, additive soft attention has been applied in tasks such as image categorization and sentence-to-sentence translation.6,39,40 Mathematically, soft attention can be formally expressed as shown in equation (3)
The expectation of the content vector
Non-local networks have played a crucial role in providing solid intuition and establishing the foundations for various contemporary attention mechanisms employed in deep neural network architectures for computer vision. 43 They model the attention map of an individual pixel by aggregating relational information from its neighboring pixels. This process is achieved through a reduced number of permutation operations, enabling the construction of the attention map with emphasis on the query pixel. 44 From an abstract perspective, this approach is similar to the Self-Attention Mechanism proposed in SAGAN by Zhang et al. 45
The Global Context Network (GCNet) was developed under the influence of Non-local Networks and Squeeze-and-Excitation Networks, with the aim of modeling an attention mechanism that enables the network to capture long-range dependencies at substantially reduced cost. More details about non-local networks (Figure 1) and squeeze-and-excitation can be found in Cao et al., 43 Hu et al., 46 and Wang et al. 44

Architecture of the non-local block (a) and its simplified version (b). Feature maps are presented by their dimensions, for example, C
The Convolutional Block Attention Module (CBAM) is a lightweight and versatile module 47 that integrates seamlessly into various Convolutional Neural Network (CNN) architectures and can be trained end-to-end alongside the base CNN. 48 As illustrated in Figure 2, from an intermediate feature map, attention weights are sequentially inferred along the channel and spatial dimensions and multiplied by the feature map to perform adaptive adjustments. In the channel module, the pooled descriptors of the input feature map pass through a shared fully connected network, and the resulting outputs are summed and passed through a sigmoid activation to generate the channel attention map. The feature map refined by the channel module is then used as input for the spatial module, where the spatial attention map is generated through the sigmoid function. Finally, this map is multiplied by the spatial module’s input, resulting in the final feature map. 49

Representative block of the CBAM attention mechanism.

Representative block of the Modified GCNet attention mechanism. Sigmoid was used instead of softmax. The first part of the GCNet is a simplified NL block; the second part comprises a piece of the squeeze-excitation mechanism, followed by a spatial attention mechanism.
The Convolutional Block Attention Module (CBAM) attention mechanism is a sequential attention mechanism consisting of two main components: channel attention and spatial attention. The goal of CBAM is to enhance important features and suppress less useful ones both in the spatial and channel domains to improve the performance of convolutional neural networks.34,50,51 Algorithm 1 shows the implementation of the CBAM attention mechanism at a high level.
Channel Attention input: Feature map
Average-Pooling: Apply average pooling along the spatial dimensions
Max-Pooling: Apply max pooling along the spatial dimensions H and W to produce a
MLP with Weight Sharing: Apply a MultiLayer Perceptron (MLP) with a hidden layer on
Sum and Activation: Sum the outputs of the MLP applied to
Multiply: Multiply the channel attention map
Spatial Attention input: Feature map
Concatenation: Apply average pooling and max pooling along the channel dimension in
Conv2D: Apply a 2D convolution with a
Activation: Apply the sigmoid activation function on
Multiply: Multiply the spatial attention map
CBAM Output: Final Feature Map
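The steps of Algorithm 1 can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names, the channels-last layout `(H, W, C)`, the ReLU in the shared MLP, and the 7×7 spatial kernel (the default in the original CBAM paper) are assumptions made for illustration, and the weights are plain arrays rather than trained layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Channel attention for feat of shape (H, W, C).

    w1 (C, C // r) and w2 (C // r, C) are the weights of the shared MLP
    (hypothetical names; r is the reduction ratio).
    """
    avg = feat.mean(axis=(0, 1))  # average pooling over H and W -> (C,)
    mx = feat.max(axis=(0, 1))    # max pooling over H and W -> (C,)

    def mlp(v):
        # One hidden layer with ReLU, shared by both pooled descriptors.
        return np.maximum(v @ w1, 0.0) @ w2

    return sigmoid(mlp(avg) + mlp(mx))  # sum, then sigmoid -> (C,)

def spatial_attention(feat, kernel):
    """Spatial attention for feat of shape (H, W, C).

    kernel has shape (k, k, 2) with k odd; zero padding keeps H and W.
    """
    # Pool along the channel axis and stack the two maps -> (H, W, 2).
    pooled = np.stack([feat.mean(axis=2), feat.max(axis=2)], axis=2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(pooled, ((pad, pad), (pad, pad), (0, 0)))
    h, w = pooled.shape[:2]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Cross-correlation, as computed by deep learning frameworks.
            out[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)
    return sigmoid(out)  # (H, W)

def cbam(feat, w1, w2, kernel):
    # Channel attention first, then spatial attention on the refined map.
    refined = feat * channel_attention(feat, w1, w2)
    return refined * spatial_attention(refined, kernel)[..., None]
```

The explicit loops keep the convolution readable; a framework layer would replace them in practice.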
The GCNet (Global Context Network) is a neural network architecture that incorporates global context into local convolutions to improve the performance of computer vision tasks such as object detection and segmentation. The central component of GCNet is an attention mechanism that efficiently aggregates global context information.53,52,44 Algorithm 2 shows the high-level implementation of the GCNet attention mechanism.
In the present study, a modified GCNet (Figure 3) mechanism was used, which combines three types of attention. Non-local attention is employed to capture long-range dependencies in the image. Squeeze-and-Excitation attention is used to recalibrate input channels according to their importance. Spatial attention focuses on specific areas of the image. Additionally, we used the sigmoid activation function instead of the softmax from the original GCNet design.
The GlobalContextAttention procedure takes as input a feature map
GlobalPooling: The first step within the procedure is to apply GlobalPooling to the feature map
Transform:
This excerpt defines an MLP (multilayer perceptron) for the “excitation” part in the SE block. The MLP has two dense layers: the first reduces the channels by the specified ratio, and the second returns to the original channels with a sigmoid activation.
The transformation is followed by the application of a sigmoid function
This attention mask
The weighted feature map
The procedure returns
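The procedure above can be sketched in NumPy under stated assumptions. Only the steps recoverable from the description are shown: global pooling, the two-layer excitation MLP with a sigmoid output, and the channel-wise weighting of the feature map. The use of average pooling, the ReLU in the hidden layer, the channels-last layout, and all names are assumptions; the simplified non-local context weighting and the spatial attention stage of the modified GCNet are omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_context_attention(feat, w1, b1, w2, b2):
    """Simplified global-context attention for feat of shape (H, W, C).

    w1 (C, C // ratio), b1 (C // ratio,): first dense layer, reduces channels.
    w2 (C // ratio, C), b2 (C,): second dense layer, restores channels.
    All parameter names are hypothetical.
    """
    # GlobalPooling: one descriptor per channel (average pooling assumed).
    context = feat.mean(axis=(0, 1))               # (C,)
    # Transform: reduce the channels, then return to the original count.
    hidden = np.maximum(context @ w1 + b1, 0.0)    # (C // ratio,)
    # Sigmoid (instead of softmax) yields a per-channel attention mask.
    mask = sigmoid(hidden @ w2 + b2)               # (C,)
    # Weighted feature map: the mask is broadcast over H and W.
    return feat * mask, mask
```

Returning the mask alongside the weighted feature map makes it easy to inspect which channels are emphasized.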
Experimental model
The experimental model described here is constructed with an encoder that uses a sequence of convolutional layers to encode image information. Each block in the encoder consists of a convolutional layer (Conv2D) followed by a max-pooling layer (MaxPooling2D), aiming to reduce the spatial dimensions of the input image. The hierarchical organization of the blocks is established by progressively increasing the number of filters, ranging from 64 to 256 in the convolutional layers as it goes deeper into the encoder, as illustrated in Figure 4.

Experimental model with the appropriate CBAM and GCNet attention mechanisms.
The decoder plays a key role in reconstructing the image from the encoded representation, using up-sampling operations (UpSampling2D) to expand the spatial dimensions. Furthermore, skip connections were incorporated between the encoder and decoder layers. These connections play a crucial role in the direct transmission of low-level information to high-level layers, resulting in a substantial improvement in the accuracy of image reconstruction. In this context, the number of filters is initially set at 256 for the first layer, followed by 128 and 64.
In both the encoder and the decoder, uniform parameters were adopted, including strides = 3, the ‘relu’ activation function, and padding = ‘same’. A notable distinction is made in the decoder, however, where bilinear interpolation is performed during the execution of UpSampling2D. This approach aims to enhance the resolution of the prediction, aligning it more accurately with the desired image resolution, as previously addressed in studies.26,54
At the end of the decoder, a resizing layer (Conv2D) was incorporated with the purpose of adapting the dimensions to the desired output, employing a ’linear’ activation function.
In the transition between the encoder and the decoder, a ‘bottleneck’ layer was introduced, which serves as a compact transition region between the encoding (encoder) and decoding (decoder) phases.
The bottleneck layer emphasizes crucial and discriminative information for the task at hand, focusing the network’s attention on critical aspects of the input during encoding and subsequently on reconstruction during the decoding phase.
The addition of this layer can result in computational gains, as subsequent operations are carried out on a more compact representation, promoting efficiency in processing. The main details of the internal components for each layer of the model are shown in Figure 5.

Specification of convolutional layers that integrate the models mentioned in this work.
In the model proposed in this study, the encoder is configured with residual convolutional layers, 6 providing efficient learning of representational features. Each residual block is enriched with attention mechanisms, a strategic addition to enhance the model’s ability to focus on specific areas of the image. This refinement highlights crucial features during the encoding phase, improving the ability to extract and interpret relevant information.
Each block of the decoder also includes residual layers and attention mechanisms similar to those of the encoder.
The main distinction between the proposed model (Figures 6 and 7) and the model implemented for comparison (Figure 4) lies in the fact that the former was built using two consecutive layers of convolution in the encoder and two consecutive layers of up-sampling in the decoder. All other parameters were kept consistent in both configurations.

Detailed representation of the proposed model, with respective indications in the legend.

Proposed model with the appropriate CBAM and GCNet attention mechanisms.
For comparison purposes with the article, 6 at the end of the model, the crop mentioned in Eigen et al. 22 was used, and the images resized to
The proposed model essentially consists of a U-Net type convolutional neural network (CNN), which adopts an encoder-decoder architecture.
The complete model is constructed using the Model class, with RGB images as inputs and estimated depth maps as outputs. At the end of each layer in the encoder, a CBAM attention mechanism is incorporated, while in the final layers of the decoder, a GCNet attention mechanism is introduced. Further details can be observed in Figures 6 and 7.
Algorithm 3 represents a simplified version of the proposed model for predicting depth maps. The model is constructed by sequentially applying downsampling blocks to reduce the resolution and increase the depth of channels, followed by upsampling blocks to reconstruct the resolution of the original image. Residual connections and attention mechanisms are applied at each step to improve the flow of information and the representational capacity of the model.
The input image is initially processed through a series of residual attention blocks without upsampling, where the number of filters is progressively increased through a predefined set of values (64, 128, 256, 512).
After initial processing, the image is processed by a series of residual attention blocks with upsampling, where skip connections from earlier layers are integrated. This allows the model to recover higher-resolution information lost during downsampling operations.
The model is finalized with a convolution to adjust the output dimensions as needed, and the model configuration is completed by specifying the inputs and outputs.
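To make the flow of resolutions through such an encoder-decoder concrete, the following pure-Python sketch traces feature-map shapes. It is a simplification under stated assumptions, not the model itself: every downsampling block is assumed to halve height and width, every upsampling block to double them, the decoder to mirror the encoder filters, and the output to be a single-channel depth map. The function name and the 240×320 input in the usage note are hypothetical.

```python
def trace_unet_shapes(height, width, filters=(64, 128, 256, 512)):
    """Return (stage_name, H, W, C) tuples for a symmetric encoder-decoder."""
    factor = 2 ** len(filters)
    if height % factor or width % factor:
        # Otherwise the upsampled maps would not match the skip connections.
        raise ValueError(f"height and width must be multiples of {factor}")

    stages = [("input", height, width, 3)]
    skips = []
    h, w = height, width
    for f in filters:
        skips.append((h, w, f))       # resolution kept for the skip connection
        h, w = h // 2, w // 2         # downsampling halves H and W
        stages.append((f"down_{f}", h, w, f))
    for skip_h, skip_w, f in reversed(skips):
        h, w = h * 2, w * 2           # upsampling doubles H and W
        assert (h, w) == (skip_h, skip_w)  # shapes align with the skip
        stages.append((f"up_{f}", h, w, f))
    stages.append(("depth_output", h, w, 1))  # final convolution
    return stages
```

Calling `trace_unet_shapes(240, 320)` lists one entry per stage, which is a quick way to check that a chosen input resolution survives four halvings without rounding.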
This architectural design, which incorporates attention and residual connections into an encoder-decoder structure, aims to enhance the model’s ability to capture and reconstruct meaningful features in image processing tasks. However, the effectiveness of this model will be empirically evaluated in specific tasks, and adjustments may be necessary based on the data and requirements of the target application.
In this section, comparative experiments will be conducted using three widely recognized datasets in the literature. The objective is to evaluate the performance of state-of-the-art architectures against the method proposed in this study, both quantitatively and qualitatively. This comprehensive analysis will help demonstrate the effectiveness of the proposed method compared to state-of-the-art solutions.
Datasets
The dataset used for comparisons with state-of-the-art works, NYU Depth v2, 55 consists of video sequences of 464 indoor scenes recorded with the Microsoft Kinect. It includes 120,000 training samples and 654 test samples. For the proposed method, a subset of 49,000 images was used for training, divided into 7 parts of 7,000 images each due to hardware limitations. The model was trained on each of these subsets for 100 epochs. The maximum depth of the depth maps is 10 meters. The original size of the image and depth map in the dataset is
The labeled dataset is a subset of the raw dataset, comprising pairs of synchronized RGB and depth frames, each annotated with dense labels. In addition to the original projected depth maps, the dataset provides a set of preprocessed depth maps in which missing values have been filled using the colorization method of Levin et al. 56 Unlike the raw dataset, the labeled dataset is distributed as a Matlab .mat file, from which the images used in our experiments were extracted. Beyond hardware constraints, another reason for selecting images with the specified dimensions was to ensure a fair comparison with the works cited in Jan and Seo. 6
KITTI 57 is an outdoor dataset consisting of stereo images and 3D scans of 61 scenes captured by multiple sensors mounted on top of a moving vehicle. The dataset contains an RGB input image with a resolution of
In this study, the NYU v2 dataset was used for both quantitative and qualitative analyses, while the KITTI dataset was exclusively used for qualitative evaluations.
Quantitative comparison of different state-of-the-art methods on the NYU-depth v2 dataset.
A loss function considers the difference between the expected depth map and the map predicted by the network. 22 Variations in the configuration of the loss function can have a significant impact on both the training rate and the overall performance of depth estimation. Several adaptations in the formulation of the loss function, used to optimize the neural network, are documented in the literature dedicated to depth estimation.22,58–60
The function used in the present work is the same as that described in Alhashim and Wonka. 26
Where, during the training of the network, the loss
Finally,
The U-Net was implemented using TensorFlow. 62 For training, the Adam optimizer 63 was used with a learning rate of 0.001, Beta1 of 0.9, Beta2 of 0.999, and epsilon of
The model was trained for one hundred epochs on each of the seven subsets, each composed of seven thousand images, totaling forty-nine thousand images used for training. The training time for each subset was approximately 1.5 hours.
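The subset schedule can be written down as a short sketch. It only illustrates the arithmetic of the split; the function names are hypothetical, the images are represented by index ranges, and the total time assumes that the subsets are trained one after another so that the per-subset times add up.

```python
def make_subsets(n_images, n_subsets):
    """Split image indices 0..n_images-1 into equally sized consecutive parts."""
    if n_images % n_subsets:
        raise ValueError("n_images must be divisible by n_subsets")
    size = n_images // n_subsets
    return [range(i * size, (i + 1) * size) for i in range(n_subsets)]

def estimated_hours(n_subsets, hours_per_subset):
    """Total training time if the subsets are processed sequentially."""
    return n_subsets * hours_per_subset

# Values reported in the text: 49,000 images, 7 subsets, about 1.5 h each.
subsets = make_subsets(49_000, 7)
total_hours = estimated_hours(len(subsets), 1.5)
```

Under these assumptions the full schedule takes roughly 10.5 hours, which is what makes the approach practical on limited hardware.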
Regarding monocular depth map predictions, once the model is trained, it can estimate the depth map of a scene (image) at the specified resolution in as little as 1 millisecond. However, we cannot confirm if this processing time remains consistent for scenes with higher resolution images.

Qualitative comparison with some state-of-the-art methods. (a) RGB images of the scene, (b) their respective ground truth depth maps, (c) depth maps estimated by the method proposed by Fu et al., 58 (d) depth maps estimated by the method of Alhashim and Wonka, 26 (e) depth maps estimated by the method of Jan and Seo, 6 and (f) depth maps estimated by the method proposed in this study.

Qualitative Evaluation: Results obtained with the NYU v2 dataset.

Qualitative Evaluation: Results obtained with the KITTI dataset.
This study conducts a systematic evaluation of monocular depth estimation methods, employing several metrics for quantitative analysis. The evaluation compared the performance of the proposed method against state-of-the-art techniques, using metrics such as Absolute Relative Error (ABS Rel), Root Mean Square Error (RMSE), Mean Log10 Error (Log10), and Threshold Accuracy (
Table 1 shows a quantitative comparison of the methods studied in this work with other architectures. The proposed model, using the CBAM and Modified GCNet attention mechanisms, achieved satisfactory results relative to the state-of-the-art methods listed. Specifically, the U-Net with attention demonstrated a 25.22% improvement in Absolute Relative Error (ABS Rel) compared to the lowest value presented in Fu et al. 58 Furthermore, the proposed model achieved a 6.28% improvement in RMSE compared to the method presented in Jan and Seo. 6
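For reference, the metrics named above can be computed as in the NumPy sketch below. It follows the definitions conventionally used since Eigen et al., including accuracy thresholds of 1.25, 1.25² and 1.25³, which is assumed to match the evaluation in this study. The function names are hypothetical, and the relative-improvement helper assumes lower-is-better error metrics.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Error and accuracy metrics for positive predicted and true depths."""
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()

    abs_rel = np.mean(np.abs(pred - gt) / gt)              # ABS Rel
    rmse = np.sqrt(np.mean((pred - gt) ** 2))              # RMSE
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))  # Log10

    # Threshold accuracy: share of pixels whose ratio is below 1.25 ** k.
    ratio = np.maximum(pred / gt, gt / pred)
    metrics = {"abs_rel": float(abs_rel), "rmse": float(rmse), "log10": float(log10)}
    for k in (1, 2, 3):
        metrics[f"delta_{k}"] = float(np.mean(ratio < 1.25 ** k))
    return metrics

def relative_improvement(baseline, ours):
    """Percentage reduction of an error metric with respect to a baseline."""
    return 100.0 * (baseline - ours) / baseline
```

A percentage improvement such as those quoted from Table 1 corresponds to `relative_improvement(baseline_error, proposed_error)` applied to the respective table entries.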
The NYU Depth v2 dataset contains RGB images along with their respective depth maps, which are stored in separate files, typically organized in pairs. The images are represented in three channels, while the depth maps are monochromatic matrices in which each pixel value corresponds to a depth measurement. This measurement information is available in the form of MAT (Matlab) files, which contain the RGB images, depth maps, and segmentation annotations. However, the information contained in the .mat files was not used in the model training process, as the depth maps effectively function as numerical masks that assign a depth value to each pixel. Nevertheless, in situations where it is necessary to determine the distance in meters of objects contained in the NYU V2 dataset, the values provided in the .mat files can be utilized.
In this work, we decided not to use the values contained in the available masks, focusing exclusively on the RGB images and their respective depth maps during the model training. This approach allowed us to simplify the process and concentrate our analysis on the visual features of the images and the depth information directly associated with each pixel. Although the masks can provide useful information about the reliability of the depth data, we opted for a strategy that prioritizes direct interaction with the images.
The use of this additional information, such as the values of the masks, will be addressed in a future phase of the project, in which we plan to explore how these data can contribute to improving the accuracy and robustness of the model. We believe that by integrating the masks in the next phase, we will be able to enhance the identification of problematic areas in the images and refine the analyses conducted. This strategy will allow us to leverage all the information available in the dataset, maximizing the results achieved.
Qualitative evaluation of results
Figure 8 presents the depth maps generated from the NYU-depth v2 dataset, following the split approach proposed by Eigen. 22 Figure 9 shows more results obtained with the NYU v2 dataset.
For validation purposes, the KITTI dataset 57 was also used for a qualitative analysis of the model, where Figure 10 displays the results obtained with the proposed model.
The results obtained were compared with state-of-the-art methods. The last column of Figure 8 highlights the clarity and sharpness of the depth maps produced by the model proposed in this work. In column (f), details (highlighted by rectangles) can be observed that were preserved in the reconstruction of the monocular depth maps and that do not appear in the images generated by the other compared methods.
Quantitative results of ablation tests using the NYU v2 dataset for the experimental model.
With the goal of choosing a model capable of producing competitive results in comparison with the methods presented in the state of the art, various tests were conducted with different configurations and attention mechanisms.
Two models were developed to decide which would be used in the tests carried out in this work. The first consists of single convolutional layers in both the encoder and the decoder (Figure 4). The second contains two consecutive convolutional layers (Figure 7). Both are based on state-of-the-art models.
In both structures, various attention mechanisms were integrated, including channel-wise attention, spatial attention (2D), two-dimensional attention, self-attention, CBAM attention (Figure 2), and GCNet attention (Figure 3). However, it was not possible to train the model with the self-attention mechanism: given the input dimensions, the defined number of heads and key dimensions exceeded the limits of the available hardware.
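The hardware limitation mentioned for self-attention can be illustrated with a rough memory estimate. The sketch below counts only the attention-weight matrix of a single layer, which grows with the square of the number of spatial positions. The feature-map sizes, the number of heads, and the 4-byte (float32) storage are illustrative assumptions, not the settings used in this work, and the function name is hypothetical.

```python
def attention_matrix_bytes(height, width, num_heads, batch_size=1, bytes_per_value=4):
    """Bytes needed to store the attention weights of one self-attention layer.

    Assumes every spatial position of the feature map is treated as one token
    and that each head stores a full tokens x tokens weight matrix.
    """
    tokens = height * width      # one token per spatial position
    entries = tokens * tokens    # every token attends to every token
    return batch_size * num_heads * entries * bytes_per_value


if __name__ == "__main__":
    # Hypothetical feature-map sizes; 8 heads and float32 are assumptions.
    for side in (32, 64, 128):
        gib = attention_matrix_bytes(side, side, num_heads=8) / 2**30
        print(f"{side}x{side} feature map, 8 heads: {gib:.3f} GiB per sample")
```

Under these assumptions, doubling the side of the feature map multiplies the memory by sixteen, and training needs further memory for gradients and intermediate activations, so the estimate is a lower bound.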
A review of the literature, including Jan and Seo, 6 Agarwal and Arora, 54 Alhashim and Wonka, 26 and Li et al., 34 indicated that using two consecutive convolutional layers with the same number of filters in each layer may prolong the time required for training. However, this approach improved the final results compared to the model that did not adopt this structure.
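One reason the double-convolution configuration may take longer to train is that it adds trainable parameters at every stage. The following sketch counts them for a single stage, assuming 3x3 kernels with bias terms and hypothetical channel counts (64 input channels, 128 filters); none of these values are taken from the models in Figures 4 and 7.

```python
def conv2d_params(in_channels, out_channels, kernel_size=3, bias=True):
    """Trainable parameters of one standard 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def stage_params(in_channels, filters, num_convs):
    """Parameters of a stage with `num_convs` consecutive convolutions,
    all using the same number of filters."""
    total = 0
    channels = in_channels
    for _ in range(num_convs):
        total += conv2d_params(channels, filters)
        channels = filters  # the next convolution sees `filters` input channels
    return total


if __name__ == "__main__":
    single = stage_params(64, 128, num_convs=1)
    double = stage_params(64, 128, num_convs=2)
    print(f"single convolution: {single} parameters")
    print(f"double convolution: {double} parameters")
```

With these assumed sizes the second convolution roughly triples the parameter count of the stage, which is consistent with longer training in exchange for greater capacity.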
Additional experiments were conducted using 1024 filters in the bottleneck layer. However, this modification did not translate into significant advances for
The study was conducted using the NYU Depth v2 dataset, starting with an implementation that incorporated skip connections in both models. Subsequently, a residual convolution block was added. The adoption of these blocks improved the quantitative indicators of the tests, leading to the decision to keep them for subsequent experiments with attention mechanisms. It is important to highlight that, in all phases of testing, RGB images with a resolution of
In a subsequent step, the tests were repeated with segmented versions of the same images. The k-means algorithm was used for segmentation, with two, three, five, and ten regions. The proposed model, which incorporates residual convolution blocks and double convolution per layer, was chosen for these tests. The most promising results were observed with segmentation into ten regions, a configuration that was maintained in the following experiments with segmented images and attention mechanisms. However, the results with RGB input images surpassed those obtained with segmented images. This comparison informed the choice of the most effective model for the comparative tests detailed in the next section of the work. The results of the ablation study are shown in Tables 2 and 3: for both models, the best results were obtained with the configuration combining the CBAM and Modified GCNet attention mechanisms (highlighted in bold in the tables).
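The segmentation step can be sketched with a plain k-means over pixel colours. The text does not specify which features were clustered or how the algorithm was initialized, so this minimal NumPy version assumes clustering on RGB values, random initialization from image pixels, and a fixed number of iterations. The function name and its parameters are hypothetical.

```python
import numpy as np


def kmeans_segment(image, k, n_iter=20, seed=0):
    """Segment an H x W x C image into at most k colour regions with k-means.

    Returns the label map (H x W) and the image repainted with the centroid
    colour of each region (H x W x C, float64).
    """
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k randomly chosen pixel positions.
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()

    def assign(points, centres):
        # Squared Euclidean distance from every pixel to every centroid.
        dists = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(pixels, centroids)
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

    labels = assign(pixels, centroids)  # final assignment with updated centroids
    segmented = centroids[labels].reshape(h, w, c)
    return labels.reshape(h, w), segmented


if __name__ == "__main__":
    # Synthetic image standing in for an RGB frame; ten regions as in the text.
    demo = np.random.default_rng(1).integers(0, 256, size=(48, 64, 3)).astype(np.uint8)
    label_map, quantized = kmeans_segment(demo, k=10)
    print(label_map.shape, quantized.shape, len(np.unique(label_map)))
```

The repainted image contains at most k distinct colours, which is the kind of reduced-detail input that was compared against the original RGB images.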
Quantitative results of ablation tests using the NYU v2 dataset for the proposed model.
The tests using segmentation of the input images were conducted only with the proposed model, given that the experimental model showed less significant results.
The automation of Depth Map estimation through advanced Deep Learning techniques shows great potential for integration into navigation aids for the visually impaired and other applications. Researchers are increasingly focused on improving depth map accuracy with single-camera systems, as stereo camera setups pose significant challenges, particularly with calibration.
This study aimed to implement, train, and evaluate an encoder-decoder architecture for estimating depth maps from RGB images. The proposed model (Figure 7), incorporating CBAM attention in the encoder and Modified GCNet in the decoder, achieved accuracy comparable to state-of-the-art methods, as shown in both quantitative (Table 3) and qualitative (Figure 9) evaluations. The model’s design followed current trends in supervised depth estimation, prioritizing accuracy even with increased model complexity and computation.
This study shows that significant results in monocular depth map estimation can be achieved with a simple architecture, avoiding the need for complex, resource-intensive models. Such results could support safer navigation when the model is integrated into assistance systems for the visually impaired.
Future experiments should test the integration of transformers and Generative Adversarial Networks into the proposed model, as some studies have reported significant results in depth map estimation with these methods. However, as noted by Kim et al., 66 integrating transformers increases computational complexity.
The main limitations of the depth map estimation system using U-Net with CBAM and GCNet include computational complexity, which, while lower than that of more advanced models such as GANs and Transformers, can still pose a challenge for real-time applications or devices with limited resources. There are also uncertainties regarding the model’s performance on high-resolution images. Additionally, the model may struggle to generalize to real-world scenarios with environmental variations, and the requirement for large amounts of labeled training data represents a significant barrier. Another relevant limitation is the lack of evaluation in dynamic scenarios, which are crucial for assisted navigation applications, as well as the low flexibility in integrating data from multiple sources, such as LiDAR or stereo cameras, which could improve accuracy in more complex environments.
Incorporating hybrid systems could address some of these limitations. Hybrid approaches that combine deep learning models with traditional algorithms can leverage the strengths of both methods, potentially enhancing performance and robustness. For instance, integrating Generative Adversarial Networks (GANs) could improve the realism of depth estimations, while combining U-Net with classical stereo vision techniques may yield more reliable results in diverse environmental conditions. Such hybrid models can also offer greater adaptability by fusing data from multiple sources, thus improving accuracy and performance in complex and dynamic scenarios.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
