Abstract
This article examines the integration of CNN-based artificial vision with robotic navigation algorithms, with the aim of efficient autonomous driving of a tracked mobile robot in residential environments. The development is based on a machine vision system in which a camera mounted on the robot captures scenes from different environments within a residential home to identify the robot’s current location.
PROBLEM:
Robotic navigation’s kinematics are usually implemented in spatial coordinates of an unknown environment, thus limiting the human-robot interaction to a naive completion of commands by ignoring the potential behind the environmental context in which the robot behaves. The integration of artificial vision into robotic navigation is expected to enhance a robot’s performance in supporting domestic environment tasks.
METHODOLOGY:
To achieve the identification of the robot’s location and its direction of movement, a convolutional neural network is employed, which has two branches that identify different aspects of the environment from the robot’s perspective. Once a destination is set within the environment, a branched exploration algorithm is implemented, allowing the robot to navigate while knowing its location.
RESULTS:
Mobile robotic algorithms for path planning and obstacle avoidance were implemented along with a CNN that reached 98.33% accuracy in identifying residential rooms from the robot’s first-person perspective. Incorporating these algorithms resulted in the successful guidance of a tracked differential mobile robot through the rooms of a virtual residential environment, avoiding obstacles in the process and identifying the locations through which the robot passes.
Introduction
The technological development of the 21st century has led to the creation of so-called intelligent systems, which are now evident in mobile phones and computational algorithms, among other things. These advancements go hand in hand with the development of robots that, since the last century, have supported industrial cutting-edge tasks, and are now even present in homes. The interaction between humans and robots has become increasingly common, but it still requires numerous investigations and developments in this field. For instance, enabling a human to instruct a robot to move (navigate) to a specific location known by a given name (e.g., kitchen) rather than using numerical coordinates requires the robot to visually recognize that location.
Robotic navigation algorithms continuously evolve to adapt to the displacement requirements of various robot types. In home or residential settings, service robots necessitate trajectory planning [1], obstacle avoidance [2], environmental adaptation models [3], and perception [4] as essential components. In the context of assistive robotics, the navigation requirements dictate the selection of algorithms to be applied [5], encompassing image prediction [6], reinforcement learning [7, 8], fuzzy systems [9], fuzzy neuro systems [10], and deep learning [11].
In the current state of the art, deep learning stands out as one of the most effective techniques in pattern recognition [12, 13]. Research utilizing these techniques spans diverse applications, from smart traffic lights [14, 15] and hyperspectral image analysis [16] to robotic vision systems [17]. In the field of robotics, deep learning is applied to robot navigation primarily through Convolutional Neural Networks (CNNs) [18].
The versatility of CNNs extends to various fields, such as the detection of pandemic diseases [19] and fraud detection in web environments [20]. Some applications involve modifying the base structure of the CNN architecture, as discussed in [21], where the authors employ a quantum graph convolutional network for predicting traffic congestion.
Robotic navigation also requires an analysis of the environment to avoid collisions. For instance, in [22], sensing methods utilizing obstacle clustering are proposed to identify road boundaries. In [23], algorithms for automated parking systems are presented, including regional differentiation accuracy requirements for Automated Valet Parking (AVP), such as ramp area, surface fluctuation area, and narrow area, which are essential in the final stage of robotic navigation.
The cooperation between pattern recognition and navigation algorithms in assistive robots is an ongoing area of development and research [24–26]. This reinforces the necessity for coaction between artificial intelligence and machine vision algorithms in achieving autonomous navigation [27], with the perception of the robot’s surroundings being a fundamental factor in this process [28]. In [29], a navigation algorithm based on nodes is presented and validated in closed simulation environments for drone navigation. It employs transfer learning with convolutional networks, using the AlexNet network architecture, intending to reduce energy consumption. Similarly, in [30], the development of a drone navigation algorithm using convolutional networks is presented, focusing on reducing energy consumption, in open environments.
Based on the referenced state of the art, this document presents the development of a robotic navigation algorithm in a residential environment. It relies on environmental perception through convolutional networks and a navigation algorithm, serving as a foundation for the design and implementation of a home-assistant robot. Given the high computational cost of convolutional architectures used for transfer learning [31, 32], the proposed design introduces a dual-branch Convolutional Neural Network (DAG-CNN) with a maximum of 6 convolution layers. This architecture is intended for use with embedded systems tailored for mobile robots. The network identifies different environments within a residential setting, accompanied by a navigation algorithm that enables the robot to move between these environments.
The main contribution of this project lies in the design of a network architecture and the evaluation of the navigation algorithm, which operates not on coordinates, as is commonly done, but on the direct recognition of the robot’s environment through the proposed convolutional network. Sensors are employed to avoid surrounding obstacles, allowing a residential robot to adjust its navigation even when the environment changes, for example when typical residential obstacles are rearranged, such as a dining table being swapped or a sofa being moved.
This article is divided into four sections: the current introduction, the methodology section where recognition and navigation algorithms are presented, the results section validating robotic navigation in a virtual residential environment, and finally, the conclusions attained.
Methodology
The navigation of robots within a residential environment, as outlined in this paper, is proposed to be achieved through the integration of two distinct modules. The first module utilizes an algorithm that employs convolutional networks to recognize various spatial zones within the residential setting. The second module involves the use of a robotic displacement algorithm achieved through trajectory planning. Detailed explanations for each of these modules are provided below.
Machine vision algorithm
The navigation algorithm is designed for a mobile robot operating within a residential environment, and its initial functionality revolves around identifying the robot’s position. To determine the robot’s location, a deep learning algorithm based on convolutional neural networks is employed [33].
Considering the unique characteristics associated with recognizing individual spaces within a residential environment, a two-branch DAG-CNN network was implemented. This network architecture is tailored to distinguish both the general and specific features of spaces such as bedrooms, bathrooms, dining rooms, or TV rooms. To train the chosen network architecture, a database containing images of various rooms within residential environments was constructed. These images were captured from the anticipated viewpoints and heights of the robot’s perspective. This database includes a total of 426 training images and 120 validation images. Figure 1 offers a visual representation of the database. The network gives the robot a general awareness of its location within a house or apartment.

Sample from the database.
The architecture of the two-branch DAG-CNN is shown in Fig. 2. It was developed with the primary objective of classifying images into one of four categories: bedroom, living room, dining room, and bathroom. It is important to emphasize that no distinction was made between rooms of the same type, as residential environments may contain multiple instances of spaces falling within the four selected categories.

Architecture of the implemented two-branch DAG-CNN.
In Fig. 2, the design parameters of the network are established for each convolutional layer (C), such as the kernel (k), the number of filters (F), padding (P), and stride (s). Additionally, the kernel for the pooling stage is denoted as (KMP), where M implies the use of the maximum method. These parameters are defined up to the final classification stage of the network in the fully connected layer (FC) [33]. The network is designed based on equations that determine the dimensions of the volumes for each layer of the network according to width, height, and depth, as outlined in equations (1), (2), and (3) respectively.
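Equations (1) to (3) are not reproduced in this text. As a minimal sketch, the standard relations for the width, height, and depth of a layer's output volume can be written as follows, assuming square kernels and integer (floor) division. The 64×64×3 input and the layer parameters in the usage lines are hypothetical and are not the values of Fig. 2.

```python
def conv_output_shape(width, height, kernel, filters, padding=0, stride=1):
    """Output volume (width, height, depth) of a convolutional layer.

    Standard relations, assumed here to correspond to equations (1)-(3):
        W_out = (W_in - k + 2P) / s + 1
        H_out = (H_in - k + 2P) / s + 1
        D_out = F
    """
    out_w = (width - kernel + 2 * padding) // stride + 1
    out_h = (height - kernel + 2 * padding) // stride + 1
    return out_w, out_h, filters


def pool_output_shape(width, height, depth, kernel, stride):
    """Output volume of a max-pooling stage; pooling keeps the depth unchanged."""
    out_w = (width - kernel) // stride + 1
    out_h = (height - kernel) // stride + 1
    return out_w, out_h, depth


# Hypothetical layer: 64x64 input, 5x5 kernel, 16 filters, padding 2, stride 1.
w, h, d = conv_output_shape(64, 64, kernel=5, filters=16, padding=2, stride=1)
# Followed by a 2x2 max pooling with stride 2.
w, h, d = pool_output_shape(w, h, d, kernel=2, stride=2)
print(w, h, d)  # 32 32 16
```

Chaining these two functions layer by layer gives the size of the volume that reaches the fully connected layer, which is useful when checking whether a candidate architecture fits an embedded target.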
The network’s hyperparameters, which define the filter size, number of filters, stride (step displacement between filters), and padding, were established through an iterative process. As depicted in Fig. 3, the best training run reached an accuracy of 98.33% after 100 iterations. The training process took 36 minutes on a computer equipped with an Intel Core i7 processor and an NVIDIA RTX 3070 GPU with 16 GB of memory. The optimizer is based on Stochastic Gradient Descent (SGD), and 100 epochs are employed to limit the training time. In all cases the kernel is a randomly initialized square filter whose size decreases as the network deepens, in order to capture more specific features of the visual area to be learned.

DAG-CNN training progress.
The performance of the implemented DAG-CNN was evaluated through the confusion matrix presented in Fig. 4, in which precision and accuracy metrics are reported for each category on the vertical and horizontal axes, respectively. A precision of 100% was obtained when classifying dining rooms and bedrooms, and an accuracy of 100% was achieved when distinguishing bathrooms and dining rooms from the other room types.

Confusion matrix of the DAG-CNN.
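As an illustration of how per-category metrics are read from a confusion matrix, the sketch below computes precision along the columns and recall along the rows, together with the overall accuracy. Treating the second per-class metric as recall is an assumption about what Fig. 4 reports. The counts in `EXAMPLE` are made up for the example and are not the values of Fig. 4.

```python
LABELS = ("bedroom", "living room", "dining room", "bathroom")

# Hypothetical counts (NOT the results of Fig. 4).
# Rows are the true class, columns are the predicted class.
EXAMPLE = [
    [9, 1, 0, 0],
    [0, 10, 0, 0],
    [0, 0, 10, 0],
    [0, 1, 0, 9],
]


def per_class_metrics(matrix, labels):
    """Return precision and recall for every class of a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted_as_class = sum(row[i] for row in matrix)  # column total
        actually_class = sum(matrix[i])                     # row total
        metrics[label] = {
            "precision": true_positives / predicted_as_class if predicted_as_class else 0.0,
            "recall": true_positives / actually_class if actually_class else 0.0,
        }
    return metrics


def overall_accuracy(matrix):
    """Fraction of all samples that lie on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for label, values in per_class_metrics(EXAMPLE, LABELS).items():
    print(label, values)
print("accuracy:", overall_accuracy(EXAMPLE))
```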
The root cause of the mislabeling was identified by inspecting the misclassified samples individually. The limiting factor is the semi-lateral viewpoint, which omits key room characteristics needed to differentiate room types.
The four components encompassing trajectory planning, namely perception, localization, cognition, and control [27], were addressed using the MATLAB® Robotics System Toolbox. This toolbox offers a comprehensive set of tools and algorithms integrated into Simulink, making it an ideal choice for communication with a virtual residential environment previously constructed in MATLAB’s VRML development tool, “3D World Editor”.
The trajectory planning algorithm consists of four essential modules. The first is the kinematic model, which determines the mobile robot’s pose (x, y, θ) with respect to the global reference. The second is the Rapidly Exploring Random Tree star (RRT*) global path planning algorithm, which calculates a feasible and optimal path from an initial to a final point within the virtual residential environment. The third is the inverse dynamics method, which computes the rotation speed of each track to ensure accurate tracking of the expected pose. The fourth is the Vector Field Histogram (VFH) obstacle avoidance module, which adjusts the robot’s trajectory in real time to avoid collisions with any obstacles encountered along the previously planned path.
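The relation between the pose (x, y, θ) and the track speeds can be sketched with the usual differential-drive model. This is a purely kinematic approximation that ignores track slip and dynamics, so it is a simplification of the inverse dynamics method mentioned above. The names `track_width` and `dt` and the numeric values are assumptions introduced for the example.

```python
import math


def track_speeds(v, omega, track_width):
    """Inverse kinematics: linear speed of the left and right tracks (m/s)
    needed to obtain a body speed v (m/s) and a turn rate omega (rad/s)."""
    v_left = v - omega * track_width / 2.0
    v_right = v + omega * track_width / 2.0
    return v_left, v_right


def step_pose(pose, v_left, v_right, track_width, dt):
    """Forward kinematics: integrate the pose (x, y, theta) over one time step."""
    x, y, theta = pose
    v = (v_right + v_left) / 2.0
    omega = (v_right - v_left) / track_width
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    # Wrap the heading to the interval [-pi, pi].
    theta = math.atan2(math.sin(theta), math.cos(theta))
    return x, y, theta


# Hypothetical robot: tracks 0.4 m apart, simulated for 1 s in 0.1 s steps.
pose = (0.0, 0.0, 0.0)
left, right = track_speeds(v=0.5, omega=0.3, track_width=0.4)
for _ in range(10):
    pose = step_pose(pose, left, right, track_width=0.4, dt=0.1)
print(pose)
```

Dividing each track speed by the drive sprocket radius would give the rotation speed of the corresponding track.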
The obstacle evasion algorithm relies on the ‘Vector Field Histogram’ (VFH) technique, allowing the robot to determine the best obstacle-free direction for following a predefined path. This is achieved by continuously creating a polar histogram, as shown in Fig. 5, using data from a distance Lidar sensor [34].

Representation of the polar histogram generated in the virtual environment for obstacle avoidance using VFH. HT: Histogram Threshold, RR: Robot Radius, SD: Safety Distance, StD: Turning Direction.
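The construction of the polar histogram can be outlined as follows. This is a simplified sketch: it uses a single threshold and omits the robot radius and safety distance enlargement shown in the figure. The number of sectors, the maximum range, and the threshold value are assumptions chosen for the example.

```python
import math


def polar_histogram(ranges, angles, num_sectors=36, max_range=3.0):
    """Accumulate an obstacle density per angular sector from Lidar readings.

    ranges: measured distances (m); angles: beam directions (rad).
    Closer obstacles contribute more to their sector.
    """
    sector_width = 2.0 * math.pi / num_sectors
    histogram = [0.0] * num_sectors
    for distance, angle in zip(ranges, angles):
        if not 0.0 < distance < max_range:
            continue  # no return, or obstacle too far away to matter
        sector = int((angle % (2.0 * math.pi)) / sector_width) % num_sectors
        histogram[sector] += (max_range - distance) / max_range
    return histogram


def free_sectors(histogram, threshold):
    """Indices of the sectors whose density is below the histogram threshold."""
    return [i for i, density in enumerate(histogram) if density < threshold]


# Hypothetical scan: one obstacle 1 m ahead, nothing within range behind.
hist = polar_histogram([1.0, 5.0], [0.0, 3.0], num_sectors=4, max_range=2.0)
print(hist)                      # [0.5, 0.0, 0.0, 0.0]
print(free_sectors(hist, 0.25))  # [1, 2, 3]
```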
The VFH algorithm’s programming involves three critical parameters that require fine-tuning through trial and error. These parameters, known as the “weights of the cost function,” include the target steering weight, current steering weight, and previous steering weight. They are calculated to prioritize an obstacle-free path among available options. The specific values for these parameter weights were determined through iterative simulations of a straightforward trajectory involving the avoidance of two obstacles, as demonstrated in Fig. 6.

Tuning the weights of the cost function. a) Global trajectory determined with RRT*. b) Target direction weight = 3, current direction weight = 3, previous direction weight = 3. c) Target direction weight = 3, current direction weight = 5, previous direction weight = 1.
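The role of the three weights can be illustrated with a weighted sum of angular differences, in the style of the VFH+ cost function. The exact expression used by the toolbox is not given in this text, so the form below is an assumption. The weight sets (3, 3, 3) and (3, 5, 1) are the ones listed in the caption above, while the candidate directions are hypothetical.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    return abs((a - b + math.pi) % (2.0 * math.pi) - math.pi)


def steering_cost(candidate, target, current, previous, weights):
    """Weighted cost of steering towards `candidate`.

    weights = (target weight, current weight, previous weight).
    """
    w_target, w_current, w_previous = weights
    return (w_target * angle_diff(candidate, target)
            + w_current * angle_diff(candidate, current)
            + w_previous * angle_diff(candidate, previous))


def best_direction(candidates, target, current, previous, weights):
    """Obstacle-free candidate direction with the lowest cost."""
    return min(candidates,
               key=lambda c: steering_cost(c, target, current, previous, weights))


# Hypothetical situation: the goal lies to the left (0.5 rad) while the robot
# is heading straight (0 rad); two obstacle-free directions are available.
candidates = [-0.5, 0.4]
print(best_direction(candidates, target=0.5, current=0.0, previous=0.0,
                     weights=(3, 5, 1)))  # 0.4
```

A larger current direction weight penalizes sharp changes from the present heading, which is consistent with the smoother path reported for the (3, 5, 1) setting.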
While the VFH algorithm performs well on its own, it can be enhanced, especially when confronted with abrupt changes in direction. Because of the continuous updates between the target direction and the obstacle-free direction, these situations can compromise the safety distance initially set to safeguard the mobile robot from collisions. To tackle this issue, an integral factor and a derivative factor were introduced. These factors are designed to amplify the importance of the deviations recommended by the VFH algorithm.
The gains of the integral and derivative factors are determined through an iterative process that considers their impact on the robot’s trajectory tracking behavior. The derivative gain, when increased significantly, amplifies the deviations signaled by the VFH algorithm, whereas the integral gain adds a smoothing effect and retains a memory of the path covered by the mobile robot up to that point.
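One possible way to apply integral and derivative factors to the deviation signal is sketched below. How the two terms are combined with the VFH output is not detailed in this text, so the PID-like structure, the class name `DeviationCorrector`, and the gain values are all assumptions made for illustration.

```python
class DeviationCorrector:
    """Adds integral and derivative terms to the steering deviation,
    i.e. the difference between the obstacle-free direction suggested
    by VFH and the target direction (rad)."""

    def __init__(self, k_integral, k_derivative, dt):
        self.k_integral = k_integral
        self.k_derivative = k_derivative
        self.dt = dt
        self.accumulated = 0.0         # memory of past deviations
        self.previous_deviation = 0.0

    def update(self, deviation):
        """Return the corrected deviation for the current time step."""
        self.accumulated += deviation * self.dt
        rate = (deviation - self.previous_deviation) / self.dt
        self.previous_deviation = deviation
        return (deviation
                + self.k_integral * self.accumulated
                + self.k_derivative * rate)


# Hypothetical gains: a sudden 0.2 rad deviation is amplified on the first
# step by the derivative term and then settles as the deviation stays constant.
corrector = DeviationCorrector(k_integral=0.5, k_derivative=0.1, dt=0.1)
print(corrector.update(0.2))
print(corrector.update(0.2))
```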
The practical implications of fine-tuning these parameters become evident in Fig. 6, where iterative adjustments were made. In Fig. 6(a), the occupancy map and the selected navigation route are displayed. Figure 6(b) illustrates the robot’s navigation with obstacle avoidance during the third iteration of the steering parameters, while Fig. 6(c) represents the fifth iteration, showcasing a noticeable improvement in the mobile robot’s path taken when evading the same presented obstacles.
Both algorithms were seamlessly integrated into the overall system, contributing to route planning, route tracking, object avoidance, and location identification. The system was configured with starting and destination points strategically positioned within a virtual residential environment, aligning with the robot’s anticipated daily tasks, such as transporting specific objects from various locations within a house to the main bedroom. This operation followed the algorithm delineated in the flowchart depicted in Fig. 7.

Flowchart navigation.
Results
Validating what the DAG-CNN learned to discriminate in the residential environment shows that training focuses on the objects that characterize each environment, as shown in Fig. 8, where the heat map (right) of the input image (left) highlights the learned regions. For example, in Fig. 8a the recognition of the bathroom is centered on the toilet, and in Fig. 8b the recognition of the bedroom is centered on the bed. In other words, learning is based on the particular elements that define each environment, regardless of where those elements are located within it.

Heat map of filters activations.
During robot navigation within the real environment, the recognition of each environment is validated by classifying images captured from the robot’s perspective, as shown in Fig. 9. Figure 9a illustrates the recognition of the dining area from a perspective that is not included in the training database. Figure 9b illustrates the successful recognition of the television area.

Final environment recognition. a. Dining room. b. TV room. c. Bedroom. d. Bathroom.
Figure 9c illustrates the recognition of the bedroom area from an untrained perspective, where the intersection between two spaces (left and right) stands out, which would eventually allow the robot to determine where to go. In Fig. 9d (right), there is a misclassification because the scene is an untrained and unlabeled environment. The network can only generate the four established outputs, so it assigns each image to the category that most closely resembles the database.
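The forced choice among four outputs can be illustrated with the final classification step. The sketch assumes a softmax output layer followed by an arg-max decision, which is the usual arrangement but is not stated explicitly in this text. The score values are hypothetical.

```python
import math

LABELS = ("bedroom", "living room", "dining room", "bathroom")


def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return the most probable label and its probability."""
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return LABELS[best], probabilities[best]


# Hypothetical scores for an unfamiliar scene: no class stands out,
# yet one of the four labels is still returned.
label, probability = classify([0.1, 0.2, 0.1, 0.3])
print(label, round(probability, 3))
```

Even when the winning probability is barely above the others, the image is still assigned to one of the four trained categories, which is the behavior observed in Fig. 9d.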
With the robot’s location identified and a destination set within the virtual environment, the navigation algorithm generates potential trajectory branches, ultimately determining the optimal route among the available choices. This process is exemplified in Fig. 10 (on the left), where it seeks to minimize both travel time and distance (as depicted in Fig. 10 on the right). In Fig. 11, the two red dots within the corridor represent dynamic objects that the robot must avoid, underscoring the algorithm’s effective performance.

Direct trajectory test.

Obstacle trajectory test.
Conclusions
The implementation of derivative and integral gains on the deviation signals has been shown to significantly enhance the performance of the VFH obstacle avoidance algorithm. Combined with an optimized route planning algorithm such as RRT*, it yields a comprehensive path planning solution for an assistant mobile robot within a residential environment.
The utilization of the DAG-CNN architecture has demonstrated remarkable proficiency in accurately identifying rooms within residential environments. This achievement is notable, especially considering its shorter sequence of layers when compared to other well-established CNN designs like AlexNet, which requires longer computation times to achieve comparable results.
The analysis of learning activations within the network facilitated the swift refinement of parameters, leading to faster convergence and enhanced accuracy during the training process. These insights also provided a clear view of the filter learning approach unique to each environment within the residential context. Similarly, an iterative process was applied to the navigation algorithm settings, resulting in smoother robot movement.
The results obtained up to this point lead to a subsequent stage of the project, in which a voice command recognition system will be added to instruct the robot where in a residential environment it should go and what action it is expected to perform.
Acknowledgments
The authors are grateful to Nueva Granada Military University for the funding of this project with code IMP-ING-3405 and titled “Prototipo robótico móvil para tareas asistenciales en entornos residenciales”.
