Abstract
Detecting cracks and damage, especially in multi-storeyed buildings, is a crucial aspect of infrastructure and building maintenance, as it ensures safety and reliability. An enhanced crack detection framework is proposed to identify fine cracks that are present at greater heights and cannot be seen by the human eye from the ground. The cracks are identified and classified by a deep convolutional neural network (CNN) model. The Oriented Non-Maximal Suppression (ONMS) module reduces false positives to improve classification accuracy and reliability. The proposed method, O-CNN (CNN with ONMS), can be used in the real world for infrastructure inspection and has potential applications in civil engineering construction. The ability to accept different types of input data, including images and videos, makes the proposed system user-friendly. Furthermore, the system reduces the risk of human error and helps prevent major damage to buildings and the loss of life that such damage may cause. Overall, the proposed system contributes to the field of deep learning and computer vision by providing an effective solution for crack detection in real-world scenarios.
Introduction
Crack detection is a critical task in various industries, including civil engineering, aerospace, and manufacturing. Cracks can cause significant damage to structures, reduce their strength and stability, and compromise their safety. However, detecting cracks in structures can be challenging, as they can be small and difficult to detect visually, especially in complex or inaccessible locations.
Recent advances in computer vision and deep learning have shown great promise in detecting cracks in images and videos. Convolutional neural networks (CNNs) are a type of deep learning algorithm that can learn to extract features from images and identify patterns that correspond to specific objects or structures. Moreover, oriented non-maximal suppression (ONMS) is a post-processing algorithm that enhances the detection of elongated features, such as cracks, while suppressing other features that do not have a similar orientation.
However, existing methods for crack detection, such as manual inspection and rule-based approaches, have limitations: they can be time-consuming and costly. Moreover, traditional methods are not as accurate as deep learning-based approaches and can miss cracks. This work proposes a system that detects cracks using CNNs and ONMS. We train a CNN to learn the features that distinguish cracks from other image features, and then apply ONMS to the output of the CNN to enhance the detection of crack-like features. A user-friendly interface is also developed that lets users supply images, videos, or webcam feeds to detect cracks in real time. The proposed method addresses the limitations of existing systems by leveraging deep learning and computer vision, leading to more reliable and accurate detection of cracks. The main objective of this work is to develop a crack detection system that can detect cracks in images and videos using CNNs and ONMS. Furthermore, the proposed system, O-CNN (CNN with ONMS), can be beneficial for detecting cracks in large, long concrete structures that are difficult to inspect manually. The aim is to achieve high accuracy and reliability in crack detection by combining CNNs and ONMS, and to evaluate the proposed approach on a public dataset to compare its performance with existing crack detection methods. The work also aims to demonstrate the effectiveness of the proposed method in real-world scenarios and to provide a user-friendly system that can handle different types of input data. Overall, it aims to contribute to the field of crack detection by proposing a new approach that can detect cracks that may not be visible to the human eye or that are difficult to detect manually in larger buildings.
Section 2 surveys the literature on existing models. Section 3 discusses traditional image processing methods for crack detection and their drawbacks, and sets out the motivation for the CNN model. Section 4 explains the architecture of the model. Section 5 describes the experimental setup, and Section 6 presents the results and discussion. Section 7 concludes and outlines future work.
Literature review
Saberian et al. (2020) [1] developed a deep neural network-based method combined with digital image processing techniques for automatic crack detection and measurement in concrete structures. The authors used a deep convolutional neural network architecture and introduced an augmented dataset for training to handle the class imbalance problem. The proposed method achieved high accuracy in crack detection, with an F1 score of 0.97.
Sankar and Vijayakumar (2020) [10] proposed a crack detection system for concrete structures using an improved convolutional neural network (CNN) architecture. The authors introduced a modified CNN with a novel block unit to enhance the feature extraction capabilities of the network. They also utilized an augmented dataset for training to address the class imbalance problem. The proposed method was evaluated on a dataset of concrete images and achieved high accuracy in crack detection, with an F1 score of 0.97.
Mehta et al. (2019) [3] proposed an efficient crack detection method using machine learning for concrete structures. The authors utilized a hybrid feature extraction approach and combined it with a machine learning algorithm to improve the accuracy of crack detection. The proposed approach outperformed other conventional approaches, receiving an F1 score of 0.95.
Park and Kim (2019) [4] proposed a crack detection method in concrete structures using piezoceramic sensors and neural network-based pattern recognition. The authors utilized the piezoceramic sensor system to capture the acoustic emission signals produced by cracks, and the neural network-based approach to recognize the crack patterns. The proposed method achieved high accuracy in crack detection with an F1 score of 0.98.
Feng et al. (2019) [5] proposed a hybrid crack detection method for concrete structures using image processing and wavelet transform. The authors used a Canny edge detection algorithm to extract the crack edge information and then applied wavelet transform to enhance the crack features. The proposed method achieved an accuracy of 92.3% on a dataset of concrete structure images, demonstrating the effectiveness of the hybrid approach for crack detection.
S. S. S. Sivasankari et al. (2020) [25] proposed a crack detection system for industrial radiographs using a CNN. The authors utilized a pre-trained VGG-16 network as the backbone architecture and fine-tuned it for crack detection. They also incorporated data augmentation techniques and a transfer learning approach to improve the accuracy of crack detection. The proposed method achieved an accuracy of 93.7% on a dataset of industrial radiographs, outperforming traditional methods for crack detection in industrial radiographs.
Banerjee and Majumder (2021) [24] conducted a comprehensive review of crack detection techniques for concrete structures using image processing and machine learning. The authors discussed various methods, including traditional image processing techniques, deep learning-based methods, and hybrid approaches. They also highlighted the advantages and limitations of each method and identified potential areas for future research. The review provides valuable insights for researchers and practitioners interested in crack detection in concrete structures.
R. P. Pawar et al. (2020) [15] proposed a crack detection system for bridge images using oriented non-maximal suppression (O-NMS). The authors utilized a CNN with a ResNet-50 backbone architecture and incorporated O-NMS to improve the accuracy of crack detection. They also utilized a patch-based approach to handle large images and alleviate the effect of class imbalance. The proposed method achieved an F1 score of 0.87 on a dataset of bridge images, outperforming other deep learning-based methods.
Zhao et al. (2021) [2] proposed a deep learning-based method for crack detection in pavement using a CNN. The authors used a ResNet-18 network as the backbone architecture and incorporated a multi-scale feature fusion module to capture features at different scales. They also utilized an attention mechanism to highlight informative regions in the input image. The proposed method achieved an F1 score of 0.948 on a dataset of pavement images, outperforming other deep learning-based methods.
M. Zafar et al. (2021) [22] proposed an automated crack detection system for building walls using deep learning and oriented non-maximal suppression (O-NMS). The authors utilized a CNN with an Inception-ResNet-V2 backbone architecture and incorporated O-NMS to improve the accuracy of crack detection. They also utilized a patch-based approach to handle large images and alleviate the effect of class imbalance. The proposed method achieved an F1 score of 0.88 on a dataset of building wall images, outperforming other deep learning-based methods.
Das and Chakraborty (2019) [6] conducted a comparative study of various machine learning algorithms for crack detection in concrete structures. The authors evaluated the performance of six algorithms, including Support Vector Machines, Random Forests, and Convolutional Neural Networks, on a dataset of concrete images. They found that the CNN-based methods outperformed the other algorithms, achieving an accuracy of up to 95.4%. The study highlights the potential of deep learning methods for effective crack detection in concrete structures.
Gu et al. (2020) [9] developed a CNN-based method for crack detection in asphalt pavement using a multi-scale feature fusion strategy. The authors used a ResNet-50 network as the backbone architecture and combined feature maps from different convolutional layers to detect cracks at different scales. They also utilized a spatial pyramid pooling module to capture features at different levels of granularity. The proposed method achieved an F1 score of 0.905 on a dataset of asphalt pavement images, outperforming other deep learning-based methods.
Khatibinia et al. (2021) [14] proposed a crack detection method for concrete structures using fuzzy clustering and hybrid machine learning algorithms. The authors utilized fuzzy C-means clustering to segment the images and extract crack features. Then, they used a combination of support vector machines and artificial neural networks for classification. The proposed method achieved an accuracy of 94% on a dataset of concrete structure images, demonstrating the effectiveness of the hybrid machine learning approach for crack detection.
Xu et al. (2020) [5] developed a CNN-based method for crack detection in pavement using a novel attention mechanism. The authors used a ResNet-34 network as the backbone architecture and incorporated an attention mechanism to highlight informative regions in the input image. They also utilized a patch-based approach to handle large images and reduce the computational cost. The proposed method achieved an F1 score of 0.959 on a dataset of pavement images, outperforming other deep learning-based methods.
Ghayvat and Nimbalkar (2019) [7] conducted a review of various image processing techniques for crack detection in concrete structures. The authors discussed the advantages and limitations of different techniques and highlighted the importance of selecting the appropriate technique based on the application requirements.
Jiang et al. (2021) [11] developed a CNN-based method for crack detection in pavement using a combination of residual attention and feature pyramid networks. The authors used a ResNet-50 network as the backbone architecture and incorporated residual attention and feature pyramid networks to capture both local and global features of the input image. They also utilized a patch-based approach to handle large images and reduce the computational cost. The proposed method achieved an F1 score of 0.961 on a dataset of pavement images, outperforming other deep learning-based methods.
Zhu et al. (2019) [18] proposed a CNN-based method for crack detection in concrete surfaces using a novel boundary-aware loss function. The authors used a U-Net network as the backbone architecture and incorporated the boundary-aware loss function to improve the accuracy of crack detection. The boundary-aware loss function assigns higher weights to pixels near the crack boundary, which helps to improve the localization of cracks. The proposed method achieved an F1 score of 0.935 on a dataset of concrete surface images, outperforming other deep learning-based methods.
Lee et al. (2020) [30] developed a CNN-based method for crack detection in asphalt pavement using a multi-scale feature fusion strategy. They also utilized a patch-based approach to handle large images and reduce the computational cost. The proposed method achieved an F1 score of 0.908 on a dataset of asphalt pavement images, outperforming other deep learning-based methods.
From the literature review, it can be concluded that deep learning approaches have demonstrated immense potential in identifying cracks in various surfaces, such as concrete, pavement, roads, bridges, and railway tracks. Most of the proposed methods employ convolutional neural networks (CNNs) [12, 26–29] as the fundamental architecture, with different strategies for multi-scale feature fusion [23] and context aggregation to boost the distinctiveness of the features [8]. Transfer learning [16, 17] and data augmentation techniques, such as rotation, flip, and color distortion, have been extensively employed to enhance the diversity of the training dataset and improve the generalization capability of the models. Moreover, some approaches use attention mechanisms to emphasize informative regions in the input image, resulting in reduced false positives. Overall, the proposed methods have achieved better accuracy and robustness, even in the presence of noise, occlusions, and varying lighting conditions, surpassing traditional crack detection techniques.
Limitations of image processing in crack detection
Image processing techniques have been widely used for crack detection in various structures. However, these methods have limitations that can reduce their effectiveness in certain scenarios. First, image processing techniques can be sensitive to changes in lighting conditions, which lead to variations in pixel intensities. This sensitivity can result in false positives or false negatives when detecting cracks, especially in outdoor environments with varying lighting. Second, these methods may struggle to detect cracks with complex patterns or irregular shapes. Traditional edge detection algorithms may fail to capture subtle or fragmented cracks, leading to incomplete crack detection. Third, images captured in real-world scenarios often contain noise and artifacts, which can interfere with crack detection algorithms. Noise can introduce false edges, leading to incorrect crack identification. Fourth, many image processing techniques require the tuning of parameters, such as threshold values or filter sizes. Selecting appropriate parameter values can be challenging, and suboptimal choices may result in poor crack detection performance. Fifth, traditional image processing methods may lack scale and orientation invariance, making it difficult to detect cracks in images with varying scales and orientations. They may also not be robust enough to handle variations caused by different surface materials, coatings, or surface roughness. In addition, some image processing techniques, especially those based on complex algorithms, can be computationally expensive and may not be suitable for real-time or large-scale crack detection tasks. Many techniques also require preprocessing steps such as denoising, normalization, and enhancement. These additional steps add complexity to the pipeline and may require domain-specific knowledge.
Image processing techniques are often designed for specific crack types or materials and may not generalize well to different structures or materials, requiring adaptations for each new scenario.
Due to these limitations, image processing techniques are often complemented by other methods, such as machine learning or deep learning approaches, to improve crack detection accuracy and robustness. Machine learning models, especially CNNs, can automatically learn features from data and adapt to different crack patterns and scenarios, mitigating some of the limitations of traditional image processing methods.
Methodology
Automated detection of cracks has emerged as a promising alternative to traditional manual visual inspection for concrete structures. Computer-based methods, such as edge detection, thresholding, and texture analysis, have shown potential in detecting cracks automatically. However, these methods are limited by the need for extensive manual feature extraction and selection, which can be time-consuming and challenging to implement.
The development of deep learning-based approaches has shown significant potential in addressing the limitations of traditional methods and has been demonstrated to be effective in detecting cracks in various surfaces, including concrete, pavement, roads, bridges, and railway tracks. These deep learning-based methods use convolutional neural networks as the backbone architecture, with varying strategies for multi-scale feature fusion and context aggregation to enhance the discriminative power of the features. Transfer learning and data augmentation techniques such as rotation, flip, and color distortion have also been widely used to increase the diversity of the training dataset and improve the generalization of the models. Furthermore, attention mechanisms and other techniques such as non-maximal suppression have been used to reduce false positives and highlight informative regions in the input image. As a result, deep learning-based methods have been shown to achieve high accuracy and robustness, outperforming traditional methods for crack detection.
The current system for crack detection in infrastructure involves manual inspection, which is time-consuming and labour-intensive. It also suffers from limited accuracy, as it relies on human judgement and expertise, which can be subjective and inconsistent. Feature extraction is another challenge, as it requires domain-specific knowledge and can be influenced by lighting conditions, camera angles, and other factors. The existing system is also inflexible, as it may not be adaptable to different types of infrastructure or new crack patterns. Finally, integrating the system with existing workflows can be difficult, as it may require significant changes to existing processes and technologies.
Fig. 1 shows the architecture of the system. Images are captured using a high-resolution camera or a drone. The camera or drone moves around the tall multi-storeyed building to capture thin cracks, diagonal cracks, and any other form of crack in the building. All types of images are captured. Next, the captured images are preprocessed. The proposed system, O-CNN (CNN with ONMS), for crack detection in concrete structures leverages deep learning with convolutional neural networks (CNNs) and oriented non-maximum suppression (ONMS). It consists of three stages: image preprocessing, O-CNN feature extraction, and O-CNN crack detection. In the preprocessing stage, the input image is normalized, resized, and converted to grayscale. In the feature extraction stage, the relevant features are extracted from the image. Finally, the crack detection stage uses ONMS to suppress non-maximal responses and accurately locate crack regions in the image, which enhances the detection of cracks.

Fig. 1. Model architecture.
The performance of the system is measured in terms of accuracy, speed, and robustness, and is compared with existing methods such as U-Net, VGGNet, Fast R-CNN, DenseNet, and PixelNet. The system has the potential to significantly improve the accuracy and efficiency of crack detection in concrete structures and other surfaces, enabling early detection and prevention of potential structural failures. It automates crack detection, eliminating the need for manual inspection, and its faster processing times make it suitable for real-time detection applications. It is also flexible and can be integrated into existing inspection workflows. Overall, the proposed methodology offers accurate and fast detection, automation, flexibility, and real-time capability. Feature extraction, crack detection, and classification are discussed below and shown in Fig. 2.

Fig. 2. Feature extraction and classification.
The core of feature extraction in O-CNNs is the use of convolutional filters or kernels. These filters are small grids of learnable weights that slide over the input crack images. At each position, the filter performs element-wise multiplication with the pixel values in its receptive field and sums them up. This process captures local patterns and features, such as edges and corners. As the convolutional filters slide over the input crack images, they produce feature maps. Each filter corresponds to one feature map. The feature maps represent the response of each filter to specific patterns or features in the input image. These feature maps collectively hold important information about the image’s content. After the convolution operation, an activation function is applied element-wise to introduce non-linearity. The Rectified Linear Unit (ReLU) is commonly used as the activation function. It replaces any negative pixel values in the feature maps with zero, preserving only the positive values and discarding the negative ones.
This non-linearity allows the CNN to learn more complex patterns and relationships in the data. Pooling layers are used to downsample the spatial dimensions of the feature maps while retaining the most important information. Max pooling is a common pooling technique that selects the maximum value from a small neighborhood in the feature maps. Pooling reduces computational complexity and helps the network be more translation-invariant, making it capable of detecting cracks at different positions in the input image. O-CNN stacks multiple convolutional layers on top of each other. Each layer learns increasingly abstract and complex features as the input propagates through the network. Lower layers tend to capture low-level features like edges and textures, while deeper layers learn high-level features and semantics. This hierarchical learning allows O-CNN to detect relevant features for crack detection in a step-by-step manner. After several convolutional and pooling layers, the feature maps are flattened into a one-dimensional vector. This flattening process transforms the spatial information into a format that can be processed by fully connected layers. The fully connected layers further process the learned features to make the final crack detection predictions. The CNN’s ability to learn hierarchical features from crack images allows it to distinguish between crack patterns and background noise effectively. Through training, the model becomes capable of generalizing to new crack images and accurately detecting cracks in concrete structures.
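The convolution, ReLU, and max pooling steps described above can be illustrated with a minimal sketch. It is not the O-CNN implementation: plain Python lists stand in for image arrays, the toy image is made up, and the filter weights are hand-picked for illustration, whereas in a CNN they would be learned during training.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Replace negative responses with zero, keep positive ones."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size neighbourhood."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Toy 6x6 image (assumption): a bright vertical "crack" in column 2.
image = [[1.0 if c == 2 else 0.0 for c in range(6)] for _ in range(6)]

# Hand-picked vertical-line filter (assumption); a CNN would learn its weights.
kernel = [[-1.0, 2.0, -1.0]] * 3

feature_map = relu(conv2d(image, kernel))  # 4x4 map, strong response at the crack
pooled = max_pool(feature_map)             # 2x2 map after downsampling
print(pooled)  # [[6.0, 0.0], [6.0, 0.0]]
```

The pooled map still shows the crack on the left half of the image, which is the sense in which pooling keeps the most important information while reducing resolution.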
Crack detection using ONMS
Oriented Non-Maximum Suppression (ONMS) is a technique used in edge detection and feature extraction tasks, including crack detection in images. It is a variation of the traditional Non-Maximum Suppression (NMS) algorithm, which is commonly used to thin edges or features in an image to produce more accurate representations. In the context of crack detection, ONMS is applied to edge maps generated by the Canny edge detector. The edge map highlights regions in the image that contain edges or high-intensity gradients, which are often indicative of crack boundaries in concrete or other structures. The procedure is as follows. First, the Canny edge detection algorithm is applied to the input crack image. This generates an edge map that represents detected edges as high-intensity pixels. For each pixel in the edge map, its intensity is compared with the two neighboring pixels along the gradient direction, which is perpendicular to the edge. If the current pixel’s intensity is not greater than both of its neighbors, it is suppressed (its intensity is set to zero). This ensures that only local maxima across the edge are retained, effectively thinning the edges. The orientation is the direction along which the edge runs. After ONMS, hysteresis thresholding is applied to further refine the edges and eliminate weak, noisy edge pixels. Hysteresis thresholding involves setting two threshold values: a low threshold and a high threshold. Pixels with intensity values higher than the high threshold are considered strong edge pixels, while pixels with values between the low and high thresholds are considered weak edge pixels. Weak edge pixels that are connected to strong edge pixels are kept, while isolated weak edge pixels are discarded. ONMS helps in obtaining more accurate and localized edge representations, which are beneficial for detecting thin and curvilinear structures (cracks) in concrete structures.
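The suppression and hysteresis steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the gradient magnitude and angle maps are assumed to be precomputed (for example with Sobel filters), angles are quantised to four directions, border pixels are simply set to zero, weak pixels are linked through 8-connectivity, and the toy values and thresholds are made up.

```python
# Neighbour offset (row, col) along each quantised gradient direction,
# with the angle measured from the column axis towards the row axis.
OFFSETS = {0: (0, 1), 45: (1, 1), 90: (1, 0), 135: (1, -1)}


def quantise(angle_deg):
    """Map an angle in degrees to the nearest of 0, 45, 90, 135 (mod 180)."""
    return int(round((angle_deg % 180) / 45.0)) % 4 * 45


def oriented_nms(mag, angle):
    """Keep a pixel only if it exceeds both neighbours along the gradient."""
    h, w = len(mag), len(mag[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            dr, dc = OFFSETS[quantise(angle[r][c])]
            if mag[r][c] > mag[r + dr][c + dc] and mag[r][c] > mag[r - dr][c - dc]:
                out[r][c] = mag[r][c]
    return out


def hysteresis(mag, low, high):
    """Keep strong pixels and weak pixels connected to them (8-connectivity)."""
    h, w = len(mag), len(mag[0])
    keep = [[mag[r][c] > high for c in range(w)] for r in range(h)]
    stack = [(r, c) for r in range(h) for c in range(w) if keep[r][c]]
    while stack:
        r, c = stack.pop()
        for rr in range(max(0, r - 1), min(h, r + 2)):
            for cc in range(max(0, c - 1), min(w, c + 2)):
                if not keep[rr][cc] and mag[rr][cc] > low:
                    keep[rr][cc] = True
                    stack.append((rr, cc))
    return keep


# Toy gradient magnitudes (assumption): a thick vertical edge around column 2,
# weaker in row 2, plus one isolated weak response at (3, 5).
mag = [
    [0.0, 0.4, 0.9, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.9, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.3, 0.5, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.9, 0.4, 0.0, 0.3, 0.0],
    [0.0, 0.4, 0.9, 0.4, 0.0, 0.0, 0.0],
]
angle = [[0.0] * 7 for _ in range(5)]  # horizontal gradient everywhere

thin = oriented_nms(mag, angle)
mask = hysteresis(thin, low=0.2, high=0.7)
edges = [(r, c) for r in range(5) for c in range(7) if mask[r][c]]
print(edges)  # [(1, 2), (2, 2), (3, 2)]
```

In this toy case the three-pixel-wide edge is thinned to a single column, the weak pixel at (2, 2) survives because it touches strong pixels, and the isolated weak pixel at (3, 5) is discarded.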
Classification of cracks
After preprocessing and feature extraction, the system classifies the image as either containing a crack or not. This step is crucial for accurate and reliable crack detection. The CNN model is trained using a large dataset of images of both cracked and non-cracked concrete structures. During training, the CNN model learns to identify the patterns and features in the input images that distinguish cracked from non-cracked structures. These patterns and features may include cracks of different shapes, sizes, and orientations, as well as variations in color, texture, and lighting. The O-CNN model uses these patterns and features to predict whether an input image contains a crack. To obtain better accuracy, the hyperparameters of the model are tuned. Hyperparameter optimization seeks the combination of hyperparameter values with which an O-CNN achieves the best performance on the crack detection task. Hyperparameters are not learned during training; they are set before training begins and influence the training process. The most important hyperparameters considered are learning rate (LR), batch size, number of epochs, dropout rate, kernel size, number of filters, activation function, and optimizer. The significance of each hyperparameter and the value set for it are discussed next. The learning rate controls the step size with which the optimizer adjusts the model’s weights during training. Too high a learning rate can lead to overshooting the optimal solution, and too low a learning rate can slow down convergence.
The learning rate can be fixed or adapted during training. In the O-CNN model, the learning rate is adjusted using a cyclical LR scheduler. Batch size determines the number of samples fed into the model during each forward and backward pass. Larger batch sizes can lead to faster training but may require more memory, so a moderate batch size of 64 is used in this model. The number of epochs is the number of times the CNN iterates over the entire training dataset during training. Setting the number of epochs too low may result in underfitting, while setting it too high may lead to overfitting. A typical value for the number of epochs is 100. Dropout is a regularization technique used to prevent overfitting. It randomly sets a fraction of input units to 0 during training, which helps the model become more robust. A dropout rate of 0.2 is used for the model. The kernel size refers to the dimensions of the filters used in the convolutional layers. Smaller kernels capture local features, while larger kernels capture more global features. The kernel size is (5, 5), which favors the identification of more global features. The number of filters in each convolutional layer controls the depth of the feature maps. Increasing the number of filters can help the model learn more complex patterns but also increases its computational complexity, so this parameter must have a moderate value. The activation function used is ReLU due to its simplicity and effectiveness. Optimizers control how the model’s weights are updated during training. Popular choices are Adam, RMSprop, and SGD (Stochastic Gradient Descent) with momentum. As a learning rate scheduler is used, SGD with momentum is chosen as the optimizer. The optimal values for hyperparameters can vary depending on the specific dataset, task, and model architecture.
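The settings above can be collected in one place, together with a sketch of a cyclical learning rate. The text does not say which cyclical policy or which learning rate bounds are used, so the triangular policy, `base_lr`, `max_lr`, and `step_size` below are assumptions chosen only for illustration. The number of iterations per epoch follows from the 80% training split of 40,000 images (32,000) and the batch size of 64.

```python
HYPERPARAMS = {
    # Values stated in the text
    "batch_size": 64,
    "epochs": 100,
    "dropout_rate": 0.2,
    "kernel_size": (5, 5),
    "activation": "relu",
    "optimizer": "sgd_momentum",
    # Assumed values, not reported in the text
    "base_lr": 1e-4,
    "max_lr": 1e-2,
}

TRAIN_IMAGES = 32000  # 80% of the 40,000-image dataset
ITERS_PER_EPOCH = TRAIN_IMAGES // HYPERPARAMS["batch_size"]  # 500
STEP_SIZE = 4 * ITERS_PER_EPOCH  # assumed half-cycle length: 4 epochs


def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical LR: rise linearly from base_lr to max_lr over
    step_size iterations, then fall back to base_lr over the next step_size."""
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Learning rate at the start, the peak, and the end of the first cycle.
for it in (0, STEP_SIZE, 2 * STEP_SIZE):
    lr = triangular_clr(it, HYPERPARAMS["base_lr"], HYPERPARAMS["max_lr"], STEP_SIZE)
    print(it, round(lr, 6))
```

With these assumed bounds the learning rate starts at 0.0001, peaks at 0.01 after 2,000 iterations, and returns to 0.0001 after 4,000, so 100 epochs contain 12.5 such cycles.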
The architecture of crack detection using the O-CNN model is shown in Fig. 3, which illustrates the overall working of the system. Once the trained O-CNN model generates a probability score for the input image, a threshold is applied to the score to determine whether the image is classified as containing a crack. The threshold value may be set based on the specific requirements of the application, such as the desired trade-off between sensitivity and specificity. If the probability score for the input image exceeds the threshold value, the image is classified as containing a crack. Otherwise, the image is classified as not containing a crack. The output of this step is a binary decision indicating whether the input image contains a crack.
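The effect of the threshold on the trade-off between sensitivity and specificity can be seen in a small sketch. The probability scores, labels, and threshold values below are made up for illustration; the text does not report the threshold actually used.

```python
def classify(score, threshold=0.5):
    """Binary decision: True (crack) if the score exceeds the threshold."""
    return score > threshold


def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) for the given threshold."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted = classify(score, threshold)
        if label and predicted:
            tp += 1
        elif label:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical model outputs: the first four images contain cracks (label 1).
scores = [0.95, 0.80, 0.55, 0.40, 0.60, 0.30, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for threshold in (0.35, 0.5, 0.7):
    print(threshold, sensitivity_specificity(scores, labels, threshold))
# 0.35 (1.0, 0.75)
# 0.5 (0.75, 0.75)
# 0.7 (0.5, 1.0)
```

Lowering the threshold catches every cracked image at the cost of a false alarm, while raising it removes false alarms but misses half of the cracks, which is why the threshold should follow the needs of the application.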

Fig. 3. CNN architecture for crack detection.
After the cracks have been detected and classified, the final step is to display the results in output visualization for easy interpretation. The visualization should provide information about the location, size, and severity of the detected cracks. This information is useful for making decisions about repairs and maintenance of the structure. One common approach to output visualization is to overlay the detected crack regions on the original image. This allows the user to easily see the location and extent of the cracks. Additionally, it is important to provide a summary of the crack characteristics, such as length, width, and depth. This can be achieved by generating a report that includes a detailed description of the cracks, along with images and measurements.
Another approach to output visualization is to generate a heat map that highlights the regions of the image that contain cracks. This can be achieved by assigning a color scale to the probability scores generated by the crack classification algorithm. Regions with higher probability scores are assigned a higher color value, while regions with lower probability scores are assigned a lower color value. This allows the user to easily identify the regions of the image that are most likely to contain cracks. In addition to these visualization techniques, it is important to provide a user-friendly interface for accessing and interacting with the results. This can include features such as zooming, panning, and scrolling, as well as the ability to save and export the results for further analysis. By providing comprehensive and user-friendly output visualization, the proposed crack detection system can enable more efficient and effective maintenance of concrete structures.
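A heat map overlay of this kind can be sketched as below. The blue-to-red color scale, the blending weight `alpha`, and the toy values are assumptions for illustration; plain Python lists stand in for the grayscale image and the per-region probability map.

```python
def prob_to_rgb(p):
    """Linear colour scale: blue for p = 0, red for p = 1."""
    p = min(1.0, max(0.0, p))
    return (int(round(255 * p)), 0, int(round(255 * (1 - p))))


def overlay(gray, probs, alpha=0.4):
    """Blend a grayscale image (0-255) with the heat map colours.

    alpha is the weight of the heat map; 1 - alpha is the weight of the image.
    """
    blended = []
    for gray_row, prob_row in zip(gray, probs):
        row = []
        for g, p in zip(gray_row, prob_row):
            heat = prob_to_rgb(p)
            row.append(tuple(int(round((1 - alpha) * g + alpha * h)) for h in heat))
        blended.append(row)
    return blended


# Toy 1x2 image: the left region is scored as a certain crack, the right as none.
gray = [[100, 100]]
probs = [[1.0, 0.0]]
print(overlay(gray, probs))  # [[(162, 60, 60), (60, 60, 162)]]
```

Regions with high crack probability are tinted red and regions with low probability are tinted blue, while the underlying image stays visible through the blend.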
Experimental setup
This section describes the dataset, the working of each layer of the CNN, and how ONMS is applied to the results of the CNN.
Dataset
The dataset consists of 40,000 images divided into two folders: “Positive” and “Negative”. The Positive folder contains images of concrete that have cracks, while the Negative folder contains images of concrete without cracks. Each folder has 20,000 images with a resolution of 227 x 227 pixels and RGB channels. No data augmentation techniques, such as random rotation or flipping, were applied to the dataset. However, data preprocessing techniques were applied to normalize the pixel values of the images. Specifically, each pixel value was divided by 255 to scale it between 0 and 1. The dataset has a total size of 233MB. The dataset was split into a training set and a test set. The training set consisted of 80% of the total dataset, while the remaining 20% was used for testing. This split was chosen to ensure that the model had enough data to learn from while also having sufficient data to test its performance. The dataset is balanced, meaning that there are an equal number of positive and negative images in the dataset. This ensures that the model does not have a bias towards one class over the other.
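The normalization and the 80/20 split described above can be sketched as follows. This is a minimal illustration on a tiny random stand-in for the dataset; the helper names, the random seed, and the use of a shuffled split are assumptions, since the text does not say how the split was drawn.

```python
import numpy as np


def normalize(images):
    """Scale 8-bit pixel values to the range [0, 1]."""
    return images.astype(np.float32) / 255.0


def train_test_split(images, labels, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    cut = int(train_fraction * len(images))
    train_idx, test_idx = order[:cut], order[cut:]
    return images[train_idx], labels[train_idx], images[test_idx], labels[test_idx]


# Tiny stand-in for the dataset: 10 random "images" of 227 x 227 x 3.
rng = np.random.default_rng(1)
images = rng.integers(0, 256, size=(10, 227, 227, 3), dtype=np.uint8)
labels = np.array([0, 1] * 5)  # 1 = Positive (crack), 0 = Negative
x_train, y_train, x_test, y_test = train_test_split(normalize(images), labels)
```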
Image enhancement
Image preprocessing is a crucial step in automated crack detection as it helps to enhance the quality of the input image and extract relevant features for accurate detection of cracks. This step involves various techniques to improve the image quality, such as resizing, color normalization, and contrast enhancement. Image resizing is used to standardize the size of the input image, which is essential for consistent processing across different images. It can also help to reduce the computational load by resizing the image to a smaller size without compromising the features of interest. Color normalization is a technique used to standardize the color distribution of the input image. This is important as color variations can affect the accuracy of crack detection. For example, cracks on a grayscale image may appear different from cracks on a color image. Color normalization techniques can help to reduce such variations and improve the accuracy of crack detection. Contrast enhancement is used to improve the visibility of cracks in the input image. This is achieved by adjusting the contrast of the image such that the difference between the light and dark regions is increased.
This technique can help to make faint cracks more visible and enhance the accuracy of crack detection. In addition to these techniques, other preprocessing methods such as noise reduction, edge enhancement, and illumination normalization can also be applied depending on the nature of the input image. The goal of image preprocessing for enhancement is to prepare the input image for feature extraction and improve the accuracy of crack detection. The algorithm applies a convolution operation to the input image to extract relevant features.
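One simple form of contrast enhancement is linear contrast stretching, sketched below. It is shown only as an example of the idea; the text does not specify which enhancement method was used, and the function name `stretch_contrast` is hypothetical.

```python
import numpy as np


def stretch_contrast(gray):
    """Linearly rescale intensities so they span the full range [0, 1]."""
    gray = gray.astype(float)
    low, high = gray.min(), gray.max()
    if high == low:               # flat image: nothing to stretch
        return np.zeros_like(gray)
    return (gray - low) / (high - low)


# A low-contrast patch whose values occupy only a narrow band of the 0-255 range.
faint = np.array([[100, 110], [120, 130]], dtype=np.uint8)
enhanced = stretch_contrast(faint)
```

After stretching, the darkest pixel maps to 0 and the brightest to 1, so the difference between light and dark regions is increased.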
Images are high-dimensional inputs: an image with a resolution of 256 x 256 pixels has over 65,000 input features (256 x 256 = 65,536 values per channel). CNNs are able to reduce the dimensionality of this data by learning to identify relevant features, such as edges, corners, and textures. One of the key advantages of CNNs is their ability to learn features automatically: the network learns to identify the most relevant features from the data itself, which makes it more adaptable to different types of images and tasks. In the next part, we discuss the different layers that make up a typical CNN architecture, including the input layer, convolutional layer, activation layer, pooling layer, dropout layer, fully connected layer, and output layer. We also discuss how these layers work together to enable the CNN to learn relevant features and make accurate predictions on new images. Figures 4 and 5 show the image before and after enhancement, respectively.

Original image.

Enhanced image.
The following subsections describe in detail the working of the individual layers, from the input layer through the convolutional layer to the output layer.
Input layer
The input layer is the first layer in the CNN architecture. It takes the input image as a matrix of values, where each value represents the intensity of the image at that pixel. In crack damage detection, the input image is usually a grayscale image, where each pixel’s intensity ranges from 0 (black) to 255 (white). The input image’s size may vary depending on the specific application, but it is typically a square matrix such as 256x256, 512x512, or 1024x1024. The primary role of the input layer is to convert the raw image data into a format that the model can understand and process further. The input layer does not learn any features from the input image but passes it on to the next layer for further processing. The input layer’s size and number of channels depend on the size and the number of color channels of the input image. For instance, a grayscale image of size 256x256 has a tensor of size (256, 256, 1), while an RGB image of size 256x256 has a tensor of size (256, 256, 3), where the 3 channels represent the red, green, and blue color channels. The input layer’s role is crucial in determining the CNN’s performance, as the quality of the input data significantly affects the model’s accuracy.
Convolutional layer
The convolutional layer is one of the key components of a CNN. It applies a set of filters (also known as kernels) to the input image to extract relevant features. Each filter slides over the input image and performs element-wise multiplication followed by summation to generate a feature map. The filters are typically small square matrices whose values are learned during the training phase of the model. The filters are usually much smaller than the input image; for example, a 3x3 filter is commonly used in practice. The number of filters in this layer is typically chosen based on the complexity of the input image and the number of features that need to be extracted. By applying multiple filters to the input image, the CNN is able to detect edges, shapes, and textures in the image. The outputs of the convolutional layer are passed through the ReLU activation function to introduce non-linearity into the model and capture complex patterns in the data. By repeatedly adjusting the filter weights during training, the model learns to extract relevant features from the input image that can be used for classification tasks.
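The sliding-window operation can be sketched in a few lines of NumPy. This is a minimal single-channel illustration with stride 1 and no padding; the vertical-edge kernel is hand-picked for the example, whereas in a CNN the kernel values are learned. The function names are hypothetical.

```python
import numpy as np


def conv2d(image, kernel):
    """Valid-mode 2D convolution as used in CNNs (cross-correlation, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiplication followed by summation.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def relu(x):
    """Keep positive responses and set negative ones to zero."""
    return np.maximum(0.0, x)


# A dark vertical line (a crude "crack") on a bright background.
image = np.ones((5, 5))
image[:, 2] = 0.0
# Hand-picked vertical-edge kernel; in a CNN these values are learned.
kernel = np.array([[1.0, 0.0, -1.0]] * 3)
feature_map = relu(conv2d(image, kernel))
```

The filter responds strongly where the image goes from bright to dark (the left side of the line) and, after ReLU, not at all elsewhere.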
ReLU activation layer
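The ReLU activation layer introduces non-linearity into the model. Non-linearity is important in capturing complex patterns in the data because it allows the model to learn non-linear relationships between the input and output. The ReLU activation function is given in Equation (1).
\[ f(x) = \max(0, x) \tag{1} \]
Here x is the input to the activation: positive values pass through unchanged, and negative values are set to zero.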
Pooling layer
The pooling layer reduces the spatial dimensionality of the feature maps generated by the convolutional layer by applying a pooling operation. In more detail, the pooling layer downsamples the output of the convolutional layer, reducing its spatial size and hence the number of parameters in subsequent layers. The pooling operation replaces the output at a certain location with a summary statistic of the nearby outputs, which can be thought of as summarizing the presence of features at that location. Max pooling, the pooling operation used in the model, takes the maximum value within a rectangular window as the output. It is effective in retaining the most prominent features while reducing the dimensionality of the feature maps. The pooling layer typically has no learnable parameters; by shrinking the feature maps, it helps to reduce overfitting and improve the generalization performance of the model.
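Max pooling can be sketched as below. The 2 x 2 window and the non-overlapping stride are assumptions for illustration, as the text does not give the window size used in the model.

```python
import numpy as np


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size          # drop rows/columns that do not fit
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))             # maximum within each window


feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])
pooled = max_pool2d(feature_map)
```

Each 2 x 2 window is replaced by its largest value, so the 4 x 4 feature map becomes a 2 x 2 map.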
Dropout layer
The dropout layer helps to prevent overfitting by probabilistically dropping out some of the neurons in the previous layer during each training iteration. This means that during each iteration, some neurons are randomly selected and their outputs are ignored. The dropout rate is a hyperparameter that controls the fraction of neurons that are dropped out. By dropping out neurons, the model is forced to learn more robust features that do not depend on specific neurons. This makes the model more generalizable and less prone to overfitting. Additionally, the dropout layer acts as an ensemble of smaller networks, each with a subset of the neurons. This ensemble effect helps to reduce the variance of the model and improve its performance on new, unseen data. The dropout layer is a simple yet effective way to improve the performance of neural networks, especially in cases where the dataset is small or the model is complex.
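A minimal sketch of dropout is given below. It uses the common "inverted dropout" variant, which rescales the surviving activations so that their expected value is unchanged; this rescaling and the rate of 0.5 are assumptions, since the text does not state the variant or the rate used.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations            # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones(10_000)
dropped = dropout(activations, rate=0.5, rng=rng)
```

Roughly half of the outputs are zeroed on each call, and a different random subset is dropped each time, which is what produces the ensemble-like behavior described above.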
Fully connected layer
The fully connected layer takes the output of the previous layers and applies a set of weights to generate a vector of class scores. In this layer, every neuron in the previous layer is connected to every neuron in the current layer, which gives the layer its name. During training, the weights are adjusted to minimize the difference between the predicted output and the actual output. The output of the convolutional and pooling layers is typically flattened into a one-dimensional array before being fed into the fully connected layer. Flattening retains all the feature values learned by the previous layers while giving the fully connected layer the one-dimensional input it requires. The number of neurons in the fully connected layer depends on the complexity of the problem and the number of classes that the model needs to classify. The fully connected layer can be thought of as a high-level feature extractor that maps the input image to a set of abstract features that are relevant for the classification task. The drawback of fully connected layers is that they are computationally expensive when dealing with large input images. To address this issue, techniques such as convolutional layers with global average pooling have been developed to reduce the number of parameters in the model and improve its efficiency.
Output layer
The output layer produces the final output of the model. In crack damage detection, the output layer typically has two neurons, one for each class: cracked and non-cracked. The output of the previous layer, which is a vector of class scores, is fed into the output layer. These scores represent the likelihood that the input image belongs to each of the classes. To convert these scores into probabilities, the output layer applies a sigmoid activation function, which maps any input value to a value between 0 and 1. The output of the sigmoid function can therefore be interpreted as the probability that the input image belongs to a particular class. The sigmoid function is given in Equation (2):
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \tag{2} \]
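The path from pooled feature maps to class probabilities can be sketched as follows. The weights, bias, and feature values are made-up numbers chosen for illustration; in the trained model they are learned.

```python
import numpy as np


def sigmoid(x):
    """Map any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def dense(features, weights, bias):
    """Fully connected layer: every input feeds every output unit."""
    return features @ weights + bias


pooled_maps = np.array([[[0.0, 1.0], [2.0, 3.0]]])    # shape (1, 2, 2)
features = pooled_maps.reshape(-1)                     # flattened to shape (4,)
# Hypothetical weights: one column per class (cracked, non-cracked).
weights = np.array([[0.5, -0.5],
                    [-0.5, 0.5],
                    [0.25, -0.25],
                    [0.25, -0.25]])
bias = np.array([0.25, -0.25])
scores = dense(features, weights, bias)                # class scores
probabilities = sigmoid(scores)                        # one probability per class
```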
Oriented Non-Maximal Suppression (ONMS) is a technique used in crack detection, which helps in refining the results obtained from an edge detection algorithm. It suppresses weaker edges that are not aligned with the strongest edge, thus ensuring that only the strongest edges are kept. First, the strongest edge in the image has to be identified. This is done by calculating the edge response at each edge point and selecting the point with the highest edge response. The edge response can be calculated using the magnitude of the gradient at each edge point or any other suitable measure. Once the strongest edge is identified, the gradient orientation at each edge point is compared to the gradient orientation of the strongest edge. Edge points that are not aligned with the direction perpendicular to the strongest edge are suppressed. If the edge response at the current edge point is greater than the edge response at the neighbouring pixel, the edge point is retained.
Figure 6 shows the edges detected for the crack without ONMS. From these, the appropriate edge is selected using ONMS, which extends NMS by taking into account the gradient orientation of the edge points.

Before oriented non-maximal suppression.
Figure 7 shows the output of crack detection with oriented non-maximal suppression (ONMS) applied. ONMS is a post-processing technique used to refine the predicted boxes produced by our crack detection model, and improve the accuracy of the crack detection. We can see that the number of predicted boxes has been significantly reduced, and the remaining boxes have been adjusted and refined to better capture the actual cracked areas in the image. By using ONMS, we are able to eliminate false positives and overlapping boxes, which can lead to more accurate and reliable detection of cracks.

After oriented non-maximal suppression.
To achieve this, we set a threshold value for the overlap between adjacent boxes and kept only the boxes with the highest prediction scores among those with overlapping regions. The threshold value was chosen based on the trade-off between reducing the number of false positives and preserving the true cracks. Hence, the use of ONMS in this step helps to refine the initial predictions produced by our crack detection model and provides a more accurate and reliable detection of cracked areas in the input image. ONMS helps to suppress noise and other non-relevant edges, and thus enhances the detection of cracks in images. It ensures that the detected cracks are aligned with the direction of the strongest edge and are not affected by weaker edges in the image. Overall, ONMS is an effective technique that can help to improve the accuracy of crack detection algorithms.
The ONMS function implements Oriented Non-Maximal Suppression on the output of a convolutional neural network (CNN) model. It takes two arguments. The first, P-output, is an array of shape (num_boxes, 5), where each row corresponds to a predicted bounding box and contains the coordinates of the top-left and bottom-right corners of the box (x1, y1, x2, y2) followed by a confidence score. The second, threshold, is a float value between 0 and 1 that is used as the threshold for the intersection-over-union (IoU) ratio. The formulas for Intersection over Union are given in (4) and (5).
If the IoU between two boxes K and L is greater than threshold, the box with the lower confidence score is suppressed (i.e., removed).
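For reference, the usual definition of IoU for two axis-aligned boxes K and L, each given by its corners (x1, y1, x2, y2), can be written as follows. This is the standard formulation and is assumed, not confirmed, to correspond to Equations (4) and (5); the superscripts identifying the box are notation introduced here.

```latex
\[
\operatorname{area}(K \cap L) =
  \max\bigl(0,\ \min(x_2^K, x_2^L) - \max(x_1^K, x_1^L)\bigr)
  \cdot
  \max\bigl(0,\ \min(y_2^K, y_2^L) - \max(y_1^K, y_1^L)\bigr)
\]
\[
\operatorname{IoU}(K, L) =
  \frac{\operatorname{area}(K \cap L)}
       {\operatorname{area}(K) + \operatorname{area}(L) - \operatorname{area}(K \cap L)}
\]
```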
The function implements the following algorithm:
i. Sort the boxes by confidence score in descending order.
ii. Initialize a boolean array that records which boxes to keep.
iii. Iterate over the sorted boxes. If the current box has already been marked for removal, skip it; otherwise compute its area.
iv. For each remaining (lower-scoring) box that has not already been marked for removal, compute the IoU between the current box and that box. If the IoU is greater than the threshold, mark the box with the lower confidence score for removal.
v. Filter out the boxes marked for removal and return the remaining boxes.
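The steps above can be sketched in plain Python as follows. This is an illustrative sketch of the IoU-based suppression only, not the authors' implementation: the names `iou` and `suppress` are hypothetical, and any orientation-specific handling in ONMS is omitted because it is not specified in these steps.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def suppress(boxes, threshold):
    """Greedy suppression over rows of (x1, y1, x2, y2, score)."""
    order = sorted(boxes, key=lambda box: box[4], reverse=True)
    keep = [True] * len(order)
    for i, current in enumerate(order):
        if not keep[i]:
            continue
        for j in range(i + 1, len(order)):
            # Boxes after position i always have a lower or equal score.
            if keep[j] and iou(current, order[j]) > threshold:
                keep[j] = False
    return [box for box, kept in zip(order, keep) if kept]


boxes = [
    (10, 10, 50, 50, 0.90),
    (12, 12, 52, 52, 0.80),      # overlaps the first box heavily
    (100, 100, 140, 140, 0.70),  # isolated box
]
kept = suppress(boxes, threshold=0.5)
```

In this example the second box is suppressed because its IoU with the higher-scoring first box exceeds the threshold, while the isolated third box is retained.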
The pseudocode for the ONMS algorithm is:
    M_onms ← ∅
    for m_i ∈ M do
        discard ← false
        for m_j ∈ M do
            if identical(m_i, m_j) > α_onms then
                if edgescore(c, m_j) > edgescore(c, m_i) then
                    discard ← true
        if not discard then
            M_onms ← M_onms ∪ {m_i}
    return M_onms
The current implementation of the ONMS function has a time complexity of O(n^2), where n is the number of predicted boxes. This is due to the nested loops that iterate over all pairs of boxes. To optimize the function, an algorithm with a lower time complexity can be used. One possible approach is to use a spatial data structure such as a quadtree or a k-d tree to efficiently find the pairs of boxes that overlap. This can reduce the time complexity to O(n log n) or
The use of oriented non-maximal suppression (ONMS) in crack detection offers several advantages. Firstly, it provides accurate edge detection, allowing for precise localization of cracks in images. Secondly, it improves object detection by reducing false positives and false negatives. Additionally, ONMS enables faster processing of images, which is particularly useful for real-time crack detection applications. Lastly, ONMS is robust to varying edge orientations, making it effective in detecting cracks of different shapes and sizes.
The evaluation metrics used for Convolutional Neural Networks (CNNs) in classification and semantic segmentation tasks include global accuracy, class average accuracy, and mean Intersection over Union (Mean IoU). They are discussed below.
i. Global accuracy, also known as overall accuracy or classification accuracy, is a commonly used evaluation metric in classification tasks, including those involving Convolutional Neural Networks (CNNs). It measures the overall correctness of the model’s predictions across all classes in the dataset. Global accuracy is calculated as the ratio of correctly classified samples to the total number of samples in the dataset. The formula for calculating global accuracy is
\[ \text{Global accuracy} = \frac{\text{Number of correctly classified samples}}{\text{Total number of samples}} \]
The total number of samples includes all the samples across all classes. A higher global accuracy indicates better performance of the model in correctly classifying samples, while a lower global accuracy suggests the model’s predictions are less accurate.
ii. Class average accuracy, also known as per-class accuracy or mean class accuracy, is an evaluation metric used in multi-class classification tasks, including those involving Convolutional Neural Networks (CNNs). Unlike global accuracy, which measures overall correctness across all classes, class average accuracy assesses the performance of the model for each individual class and then calculates the average accuracy across all classes. It provides insights into how well the model is performing for each specific class in the dataset. The formula for calculating class average accuracy is
\[ \text{Class average accuracy} = \frac{\text{Sum of class accuracies}}{\text{Number of classes}} \]
Class accuracies are the individual accuracies for each class in the dataset. To calculate the accuracy for a specific class, the number of correctly classified samples of that class is divided by the total number of samples belonging to that class. The sum of class accuracies is the sum of these accuracies over all the classes in the dataset. Class average accuracy provides a more granular view of the model’s performance for each individual class, helping to identify classes where the model excels and classes where it may struggle.
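The difference between the two accuracy measures can be seen in a small sketch. The labels below are invented for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def global_accuracy(y_true, y_pred):
    """Correctly classified samples divided by the total number of samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def class_average_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    per_class = {c: correct[c] / total[c] for c in total}
    return sum(per_class.values()) / len(per_class)


# Hypothetical labels: 1 = crack, 0 = no crack.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
overall = global_accuracy(y_true, y_pred)
per_class_mean = class_average_accuracy(y_true, y_pred)
```

Here four of six samples are correct overall, while the per-class accuracies are 3/4 for the crack class and 1/2 for the no-crack class, so the two measures differ when the classes are not equally easy.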
iii. Mean Intersection over Union (Mean IoU or mIoU) is an evaluation metric commonly used in semantic segmentation tasks, including those involving Convolutional Neural Networks (CNNs). It measures the accuracy of a model’s segmentation predictions by calculating the overlap between the predicted segmentation masks and the ground truth masks for each class and then taking the average across all classes. The formula for calculating Mean Intersection over Union is
\[ \text{Mean IoU} = \frac{\text{Sum of IoU over all classes}}{\text{Number of classes}} \]
IoU is a measurement of the overlap between the predicted segmentation mask and the corresponding ground truth mask for a specific class. It is calculated as the ratio of the area of the intersection of the two masks to the area of their union. The IoU ranges from 0 to 1, with 0 indicating no overlap between the predicted and ground truth masks (completely incorrect prediction), and 1 indicating a perfect overlap (perfect prediction). The number of classes is the total count of unique classes present in the semantic segmentation task. Each class represents a different category or object that the model is expected to segment. To calculate the Mean IoU, sum up the IoUs for all classes and then divide the sum by the number of classes. The result represents the average accuracy of the model’s segmentation predictions across all classes.
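A minimal NumPy sketch of the Mean IoU calculation on integer label masks is shown below. Skipping classes that are absent from both masks is a convention assumed here, and the masks are invented for illustration.

```python
import numpy as np


def mean_iou(pred_mask, true_mask, num_classes):
    """Average the per-class IoU of two integer label masks."""
    ious = []
    for c in range(num_classes):
        pred_c, true_c = pred_mask == c, true_mask == c
        union = np.logical_or(pred_c, true_c).sum()
        if union == 0:            # class absent from both masks: skip it
            continue
        ious.append(np.logical_and(pred_c, true_c).sum() / union)
    return float(np.mean(ious))


# Hypothetical masks: 0 = background, 1 = crack.
true_mask = np.array([[0, 0, 1, 1],
                      [0, 0, 1, 1]])
pred_mask = np.array([[0, 0, 0, 1],
                      [0, 0, 1, 1]])
score = mean_iou(pred_mask, true_mask, num_classes=2)
```

In this example the background class has an IoU of 4/5 and the crack class an IoU of 3/4, giving a Mean IoU of 0.775.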
In CNNs, precision, recall, and F1-score serve as critical metrics for evaluating the model’s performance, particularly in binary classification tasks with class imbalances. These metrics provide a more detailed understanding of the model’s strengths and weaknesses.
iv. Precision measures the proportion of true positive predictions (correctly classified positive samples) over the total number of positive predictions (samples predicted as positive by the model, including true positives and false positives). It focuses on how many of the predicted positive samples are actually relevant (correct). The standard formula is
\[ \text{Precision} = \frac{TP}{TP + FP} \]
where TP and FP denote the numbers of true positives and false positives, respectively.
v. Recall (Sensitivity or True Positive Rate) measures the proportion of true positive predictions over the total number of actual positive samples (including both true positives and false negatives). It focuses on how many of the positive samples the model has correctly identified. The formula is
\[ \text{Recall} = \frac{TP}{TP + FN} \]
where FN denotes the number of false negatives.
vi. F1-score is the harmonic mean of precision and recall. It provides a balanced view of the model’s performance, taking into account both precision and recall. The F1-score is especially useful when dealing with imbalanced datasets, where precision and recall can be in conflict. It is computed as
\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
By considering precision, recall, and F1-score, it is possible to make more informed decisions about the model’s performance and adjust its hyperparameters, architecture, or training strategy to strike a balance between precision and recall, depending on the specific requirements of the applications.
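These three metrics can be computed directly from the confusion counts, as in the following sketch. The labels are invented for illustration, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = crack, 0 = no crack.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```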
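vii. True Positive Rate (TPR) is the proportion of positive samples (belonging to the positive class) that the model correctly identifies as positive. It is also known as recall or sensitivity. The standard formula is
\[ \text{TPR} = \frac{TP}{TP + FN} \]
where TP and FN denote the numbers of true positives and false negatives, respectively.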
False positive rate
Comparison of the proposed model with other models
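viii. False Positive Rate (FPR) is the proportion of negative samples (belonging to the negative class) that the model incorrectly identifies as positive. The formula is
\[ \text{FPR} = \frac{FP}{FP + TN} \]
where FP and TN denote the numbers of false positives and true negatives, respectively.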
In a binary classification model, predictions are typically made using a probability threshold. Samples with predicted probabilities above the threshold are classified as positive, while those below the threshold are classified as negative. The false positive rates for all the models are recorded in Table 1. The proposed model has a lower false positive rate than the other models, which indicates that the proposed O-CNN gives better results in detecting cracks.
The AUC-ROC is obtained from the TPR and FPR and provides a single-value representation of the model’s overall performance. The ROC curve makes it possible to visualize the trade-off between sensitivity and specificity and to determine the threshold that best suits the specific requirements of the task. Models with higher AUC-ROC values generally have better discriminative ability and are preferred in binary classification tasks.
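The construction of the ROC curve and its area can be sketched as below. The scores and labels are invented for illustration, a sample is counted as positive when its score is at or above the threshold, and the area is approximated with the trapezoidal rule; these choices are assumptions, not details taken from the experiments.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and t == 1 for s, t in zip(scores, y_true))
        fp = sum(s >= threshold and t == 0 for s, t in zip(scores, y_true))
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = crack) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
```

Each threshold yields one (FPR, TPR) point, and lowering the threshold moves along the curve from (0, 0) towards (1, 1).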
Table 2 compares the proposed model with existing models on the metrics discussed above. The proposed O-CNN model achieves good accuracy compared with the other models, and it also performs better on the other metrics: class average accuracy, MIoU, precision, recall, and F1 measure. Figure 8 (accuracy vs. epoch) depicts the performance of the proposed crack detection system during the training phase. The X-axis represents the number of epochs, and the Y-axis represents the accuracy of the system. From the graph, it is clearly visible that the system starts stabilizing from the earlier epochs onwards. The graph shows a significant improvement in the accuracy of the system over time.

Accuracy vs epoch.
Figure 9 (Loss vs Epoch graph) shows the performance of the O-CNN model during training. The X-axis represents the number of epochs, and the Y-axis represents the loss. At epoch 1, the system achieved an accuracy of 52%, while at epoch 70, it achieved an accuracy of 96%, which is a substantial improvement. This indicates that the system was able to learn and improve its performance over time, which is a desirable feature in any machine learning system.

Loss vs epoch.
The graph shows that at epoch 1, the model had a loss of 35%, indicating poor performance. However, as training continued, the loss gradually decreased, and by epoch 70, the model had achieved a low loss of 0.01596, indicating better performance. This demonstrates that the O-CNN model has effectively learned the features necessary for accurate crack detection. The model converges in the early iterations, and its high global accuracy and low loss value indicate that it generalizes well.
In this study, we have presented a crack detection system that uses an O-CNN model and the ONMS algorithm to accurately detect cracks in concrete surfaces. The proposed system achieved better accuracy, precision, and recall values on a dataset of 40,000 images and was able to perform well on real-world images and videos of concrete surfaces. The combination of CNNs and ONMS allowed us to improve the accuracy and robustness of crack detection in concrete surfaces. The O-CNN model was able to effectively learn the features of cracked and non-cracked surfaces, while the ONMS algorithm was able to refine the edge detection results and suppress weaker edges. The O-CNN gives better results on all the metrics considered for evaluation, namely accuracy, class average accuracy, MIoU, precision, recall, and F1 measure, and it compares well with the existing models. As future work, the model can be trained to analyze both input images and videos, and additional features such as crack length calculation and crack width estimation can be included. Overall, our study demonstrates the potential of using O-CNN for enhancing crack detection in concrete surfaces and has significant implications for the field of infrastructure monitoring.
