Abstract
Cell segmentation in microscopy images is a difficult and widely studied task. In recent years, deep learning-based techniques have made remarkable progress in medical and microscopy image segmentation. In this paper, we propose a novel deep learning approach for cell segmentation called Residual-Atrous MultiResUnet with Channel Attention Mechanism (RAMRU-CAM), which combines the MultiResUNet architecture with a Channel Attention Mechanism (CAM) and Residual-Atrous connections. The Residual-Atrous path mitigates the semantic gap between the encoder and decoder stages and preserves the spatial dimension of the feature maps. Furthermore, CAM blocks are used in the decoder stages to better maintain spatial details before the feature maps from the encoder are concatenated with those of the decoder. We evaluated the proposed model on the PhC-C2DH-U373 and Fluo-N2DH-GOWT1 datasets. The experimental results show that it outperforms recent variants of the U-Net model and state-of-the-art approaches, segmenting cells precisely while using fewer parameters and lower computational complexity.
Keywords
Introduction
Accurate diagnosis is a central concern for doctors, since it determines the appropriate course of action against a disease. Medical imaging, which visually portrays the function of tissues and organs, has become a baseline in medical intervention and diagnosis [1]. Researchers are therefore interested in effective methods for classifying and segmenting medical images. In microscopy images, segmenting salient objects such as cells is an important and difficult operation, required by many applications in both industrial and scientific settings. Accurately describing how cells change shape is critical to understanding the mechanobiology of cell migration and its various implications in several diseases as well as in natural tissue development [2].
Cell segmentation divides a microscopic image into segments that each represent a single cell instance. It is regarded as a foundation of image-based cellular study and is a major step in many scientific studies. A properly segmented image captures physiologically significant morphological data. Manually segmenting cell instances in large cell datasets is impractical, so automatic analysis techniques are necessary for medical images. Although several methods have already been developed [3], automatic segmentation still falls short of hand annotation for cells with more complicated textures or shapes [2]. Datasets with low contrast, high cell densities, and inadequate edge information still require further study.
Many methods are available to overcome these segmentation issues. Among them, deep learning models and convolutional neural networks (CNNs), which are widely employed today, have achieved high accuracy [1]. These models offer a viable solution to the shortcomings of older algorithmic approaches. Deep learning models excel at extracting high-level features from input data and constructing multi-layer correlations between input and target [4]. CNNs were later applied to end-to-end image segmentation through the Fully Convolutional Network (FCN) [5], which was a symbolic turning point. FCN enhances segmentation accuracy by combining representations from various layers and transferring the pre-trained weights of a classifier. Subsequently, the U-Net [6] architecture, which is based on FCN, was successfully applied to cell segmentation. Moreover, it has had a notable effect on biomedical image segmentation in general [7].
Segmentation accuracy must be as high as possible, because biomedical image segmentation is a component of actual patient diagnosis. In recent years, several studies in biomedical image segmentation, and specifically in cell segmentation, have sought to improve upon the U-Net [6] architecture, and many U-Net variations have been proposed to achieve accurate segmentation results. MultiResUNet [8], CNL-UNet [9], Res2-UNeXt [10], and DenseRes-Unet [11] are some notable advancements in deep learning for more accurate cell segmentation. However, these architectures improve segmentation outcomes at the expense of a high parameter count and intensive computational requirements.
To address these challenges, this study introduces a novel deep learning-based segmentation network, termed RAMRU-CAM, which combines the MultiResUNet [8] architecture with a Channel Attention Mechanism (CAM) [12] and Residual-Atrous connections [11]. The model performs well on 2D biomedical images, especially in cell segmentation. We use the Residual-Atrous path instead of the Res path [8] as the skip connection to minimize the semantic gap between the feature maps of the encoder and decoder layers, capture extra contextual information, and preserve the image resolution. Moreover, we add a CAM block in the decoder sub-network so that the network focuses on key information before the feature maps from the encoder are concatenated with those of the decoder. The proposed architecture is built with fewer parameters to reduce overfitting and yield a lightweight network. Evaluation on the PhC-C2DH-U373 dataset demonstrates that the proposed RAMRU-CAM outperforms both its baseline models and the most recent segmentation methods.
This paper has six sections. Section 2 briefly describes previous work related to ours. Section 3 explains the proposed method, including image pre-processing, the network architecture and its components, and image post-processing. Section 4 introduces the datasets, training methodology, and evaluation metrics. Section 5 describes the experimental results and compares them with state-of-the-art methods to demonstrate the effectiveness of the proposed approach. Finally, Section 6 presents the conclusion of the study.
Related work
Cell segmentation using classical methods
Automatic detection and segmentation of muscle fibers, cells, cell nuclei, or other sub-millimeter objects in images or videos is a crucial step in many biomedical image processing applications. Cell and nucleus segmentation techniques fall into two primary categories: traditional techniques and learning-based techniques. Traditional methods for segmenting cell nuclei include statistical models [13], watershed segmentation [14], active contours [15], thresholding [16], and so on. Taheri et al. [13] designed a nuclei segmentation method that refines the nuclei border curves using a statistical level set method with topology-preserving criteria. This is a region-based segmentation technique in which the nuclei are located using color deconvolution. Watershed is another such method and is frequently used in medical image analysis. The authors of [17] introduced a watershed approach to detect and segment cell nuclei in 3D tissue images, followed by a graph-cut optimization to locate the 3D cell nuclei. However, this approach frequently suffers from over-segmentation, and it requires a lengthy computing time to find the cells.
Another classical approach, the active contour (AC), is widely used to delineate disease regions in cell segmentation [18]. The fundamental concept is to derive level curves from the possibility function. However, the iterative nature of this technique makes it computationally costly on large histopathology images. In addition, it needs prior knowledge of the required contour shape, which is challenging in cell segmentation because of the variety of cell shapes. Another approach is thresholding, whose basic idea is to replace every pixel in the image with a white pixel if its value exceeds a particular threshold and with a black pixel otherwise. Although thresholding is the simplest approach to implement, it is affected by image noise because it depends on pixel values alone [19]. Lee et al. [16] developed a segmentation approach that extracts precise cell borders by assigning image super-pixels to the cell with the nearest nucleus. Cell-wise graph cuts were then used to further refine the contour. However, this method is computationally expensive.
Cell segmentation using deep learning-based methods
Although biomedical image segmentation is challenging in itself, cell segmentation presents particular difficulties: high cell densities, inadequate edge information, a low signal-to-noise ratio, and biological variability. Recent work addressing these difficulties relies on deep learning techniques, which have proven precise and robust for image segmentation in general [20] and for medical image analysis specifically [7].
In deep learning, the most straightforward method for semantic segmentation of cells is to classify pixels as foreground or background and then label the connected regions of foreground pixels to obtain cell-specific masks. This approach can effectively segment single cells with irregular shapes and appearances. Convolutional neural networks (CNNs), first proposed by Y. LeCun et al. [21], are a kind of artificial neural network used in deep learning. CNNs are frequently used for image processing, including image classification [22], object detection [23], image segmentation [24], and other tasks.
Baseline models
The authors of [5] introduced fully convolutional networks (FCN), which achieved state-of-the-art performance and were the first successful CNN architecture to handle semantic segmentation with end-to-end learning on entire images, combining representations from various layers. This end-to-end process makes learning and inference both simple and fast. However, the FCN’s simplified decoder yields segmentation with unsatisfactory accuracy. To address this issue, the widely used U-Net architecture [6], inspired by the FCN, was proposed by expanding the FCN’s downsampling and upsampling paths. U-Net extracts intricate visual features at various scales and has been demonstrated to be effective for biomedical image segmentation. Its most noticeable feature is the skip connections between the encoder and decoder paths, which facilitate learning [25].
Although the U-Net model has the aforementioned benefits, it also has certain drawbacks. For instance, the model does not account for the semantic discrepancies between the two feature maps that are concatenated. Furthermore, model parameters and computing resources are used redundantly, because similar low-level information is repeatedly extracted by all sequential network layers. Hence, Ibtehaz and Rahman [8] proposed a more effective model, namely MultiResUNet, which makes several adjustments to the standard U-Net. It replaces the encoder and decoder blocks with MultiRes blocks, which decrease the memory requirements, and it uses residual paths instead of the standard skip connections. MultiResUNet produced superior results in fewer epochs. Although this architecture improved segmentation accuracy and reduced the number of parameters relative to U-Net, it still requires more computation, making it vulnerable to overfitting.
Attention mechanism
Attention gates were designed to suppress unrelated activations and concentrate on the elements relevant to the particular task. Recently, the attention mechanism has drawn considerable interest in various computing fields, such as medical image classification [26, 27], automatic segmentation [28–32], and other tasks. Oktay et al. [33] proposed the Attention U-Net by integrating attention gates into a typical U-Net model. It reduces the likelihood of false positives and lets the model focus on the task at hand to improve performance. Furthermore, it eliminates the additional model parameters from which the U-Net model suffered. Although Attention U-Net is effective for medical image segmentation, it cannot extract multi-scale receptive field features from the high-level feature maps.
The channel attention mechanism is a module for channel-based attention in convolutional neural networks. The channel attention map is built by exploiting the relationships between features across channels. Each channel of a feature map is regarded as a feature detector, so channel attention concentrates on “what” is meaningful given an input image. Squeeze-and-excitation networks (SENet) [32], an early architectural unit, were used to investigate the mechanism of channel attention. Through learning, SENet adaptively assesses the significance of the feature channels with a focus on their interdependence and assigns weights according to that significance. SENet presents an architectural unit that can be combined with deep CNN architectures to enhance performance at a low computational cost. However, it lacks spatial attention, which is crucial for choosing “where” to concentrate. Wang et al. [34] introduced a channel attention mechanism based on SENet, namely ECA-Net. This module uses a 1D convolution instead of the fully connected layers of SENet to further reduce the number of parameters, so the model’s complexity is significantly reduced while it still performs well and more efficiently than SENet. Li et al. [35] presented a triple attention mechanism (TA-Net) comprising space, channel, and feature internal attention to extract global details across different dimensions. This can enhance segmentation performance and feature discrimination capability. However, it has the drawbacks of a large amount of computation and a poor operating speed when the size ratio of the background and target varies widely. In our network, the CAM is used to obtain feature maps at various levels, which forces the network to pay close attention to the significant details in the feature images.
The state-of-the-art models
In recent years, a wide variety of techniques for automatically segmenting cells have been published [10, 36–39]. Deeper models should, in theory, perform segmentation tasks more effectively, but training becomes harder as the model is deepened. Residual networks (ResNets) [40], proposed by He et al., resolve this issue. ResNets demonstrated the effectiveness of learning deeper networks by bypassing the non-linear transformations with identity-based shortcut connections. These connections are the main factor that makes these networks easy to train. Identity shortcuts add neither computational complexity nor new parameters, and thus enhance computational efficiency.
Recently, the U-Net [6] model has undergone many changes to increase its accuracy. S. Das et al. [41] altered the U-Net by adding residual blocks to produce a model for cell nuclei segmentation. These blocks enable U-Net to extract additional information at each layer. Furthermore, the network partially overcame U-Net’s drawbacks of a single segmentation scale and limited information distribution. Jha et al. [42] proposed ResUNet++ as an improved version of Res-UNet. It takes advantage of squeeze-and-excitation blocks, attention blocks, residual blocks, and Atrous Spatial Pyramid Pooling (ASPP). These extra layers aid in learning deep features that can improve pixel prediction in object segmentation tasks. This method segments the entire object; however, it disregards the area-boundary constraint, which is essential for improving segmentation performance. Zhou et al. [43] proposed a nested U-Net (U-Net++) that re-designs the skip paths as a sequence of dense and nested skip connections. These modified skip paths are intended to minimize the semantic gap between the encoder and decoder feature maps. The network has been used to segment cell nuclei, colon polyps, lung nodules, and the liver. A network of this kind can enhance segmentation performance, but it is computationally expensive and more challenging to train. Zeng et al. [44] designed the RIC-Unet model to segment nuclei using residual blocks, multi-scale features, and a channel attention mechanism. The segmentation mask was refined with a post-processing technique because the predicted nuclei included numerous overlapping cells. Compared with the conventional U-Net model, the F1-score increased by 2% as a result of this processing.
The capacity to extract spatial information at different scales is not generally given further attention in the majority of the studies described above. Moreover, these architectures have many parameters, making training computationally costly. To offset these defects, we are inspired by MultiResUNet [8] and propose a novel and lightweight architecture for cell segmentation. It is more practical than the heavyweight models stated earlier.
Proposed method
In this study, we propose a novel hybrid architecture called Residual-Atrous MultiResUnet with Channel Attention Mechanism (RAMRU-CAM) for cell segmentation. We modify the MultiResUNet architecture [8] by integrating Channel Attention Mechanism (CAM) [12] blocks at Depth 2 and Depth 3 of the decoder network, and by using Residual-Atrous connections [11] instead of the Res path as skip connections at Depth 0 and Depth 1. An overview of the proposed network architecture is shown in Fig. 1. The components employed in our architecture are discussed in the following subsections, followed by the details of the proposed architecture.

Fig. 1. The overview of the proposed RAMRU-CAM model.
Data pre-processing plays a significant role in optimizing segmentation performance. We use 2D medical images. In pre-processing, we first apply full-image analysis instead of patch-wise analysis [1, 7], which analyzes the data slice by slice: the complete 2D image is fed to the model rather than cropped patches or a split image. Second, we resize the images, which is an important pre-processing step in computer vision. Many deep learning models require all input images to have the same size, and because the raw images may differ in size, they must be reshaped to a fixed size before being fed into the CNN. The original size of our images is 520×696, so all input images are resized to 592×592. To reduce computational cost, no other pre-processing is used in our proposed approach.
MultiRes block
We use MultiRes blocks, inspired by MultiResUNet [8], to help the architecture reconcile image features at different scales. The block uses a succession of 3×3 kernels to avoid the greater memory requirements of 5×5 and 7×7 kernels, because the outputs of the first, second, and third 3×3 filter sets approximate the use of 3×3, 5×5, and 7×7 kernels, respectively. The MultiRes block consists of three 3×3 convolutions whose outputs are concatenated to extract spatial information at different scales. A residual connection is then added by summing the concatenated output with the output of a 1×1 convolutional layer, which extracts some additional spatial features.
The MultiRes block is shown in Fig. 2. The block uses the parameter W = α×U to control the number of filters in its convolutional layers, where α = 1.67 is a scalar coefficient and U is the number of filters at the corresponding level of the network, namely 32, 64, 128, 256, or 512. The block progressively increases the number of filters across the three sequential convolutional layers by assigning them W/6, W/3, and W/2 filters, respectively, as in MultiResUNet [8].
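The filter arithmetic of the MultiRes block can be made concrete with a short sketch. It assumes the W/6, W/3, W/2 split and the integer truncation used in the MultiResUNet design [8]; the function name `multires_filters` is hypothetical and is not taken from the authors' code.

```python
def multires_filters(u, alpha=1.67):
    """Return W and the filter counts of the three 3x3 layers of one MultiRes block.

    Assumption: the W/6, W/3, W/2 split with truncation to integers follows
    MultiResUNet [8]; it is not confirmed for this model.
    """
    w = alpha * u
    return w, (int(w / 6), int(w / 3), int(w / 2))


if __name__ == "__main__":
    for u in (32, 64, 128, 256, 512):
        w, layers = multires_filters(u)
        # The concatenated block output has sum(layers) channels.
        print(f"U={u:3d}  W={w:7.2f}  layers={layers}  concatenated={sum(layers)}")
```

Under these assumptions the first level (U = 32) uses 8, 17, and 26 filters, so the concatenated output of the block has 51 channels.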

Fig. 2. The MultiRes block.
Deeper networks are necessary to extract more complex features from data. However, as the network becomes deeper, it loses more spatial information, which lowers segmentation accuracy. In the U-Net architecture, some essential information is lost during the max-pooling operations. To address this problem, U-Net introduces skip connections between the corresponding layers before max-pooling in the encoder path and after the deconvolution operations in the decoder path. These connections allow the network to transfer spatial information lost during pooling from the encoder to the decoder [6]. However, skip connections have a drawback, which can be illustrated as follows. The first skip connection bridges the encoder path before the first max-pooling with the decoder path after the final deconvolution at depth-0. Here, the encoder’s features are assumed to be lower-level features because they are computed in the network’s earliest layers. The decoder’s features, on the other hand, are higher-level features because they are computed after the network’s very deep layers (at depth-0) and are therefore subjected to far more processing than the encoder’s features. Combining these two incompatible sets of features causes some conflict during learning, which degrades prediction. This is described as the semantic gap problem, i.e., there is a possible semantic gap between the two sets of feature maps being combined [8].
To address this semantic gap, non-linear transformations can be incorporated along the skip connections by adding convolutional layers that reduce the disparity between the encoder and decoder features, as in the residual paths (Res paths) of [8]. To ease learning, the Res path can be used as the skip connection instead of directly concatenating features from the encoder to the decoder stages.
In this paper, we introduce a Residual-Atrous connection [11] as the skip connection, combining atrous (dilated) convolutions with a residual connection instead of the residual connection alone used in MultiResUNet [8]. In the proposed connection, a residual block is followed by an atrous block. The residual block consists of a series of 3×3 convolutional layers with 1×1 filters on the residual connections, while the atrous block consists of two 3×3 convolutions with dilation rates of 2 and 4, respectively. We apply the ReLU activation function after every sum operation in both blocks to guarantee non-linear mapping [45], followed by a Batch Normalization layer. The ReLU activation function is defined by Equation (1):
ReLU(x) = max(0, x)    (1)

Fig. 3. The Residual-Atrous connection.
It should be noted that the semantic gap is likely to shrink progressively as we move to the skip connections at depth-2 and depth-3. This is because, at these later stages, the encoder features have already undergone more processing [8]. Therefore, we use the proposed connection only at depth-0 and depth-1, where the semantic gap is much larger than at depth-2 and depth-3 and where this connection has a notable effect.
Residual blocks have been used to address the problems of saturation and degradation [46]. Atrous convolutions, in turn, cover a larger area of the input; in other words, they extend the receptive field without loss of resolution or coverage and at the same computational cost [47]. A dilated convolution and an ordinary convolution have the same kernel size, which means that the network has the same number of parameters, yet the dilated convolution has a larger receptive field than the standard convolution [48]. Combining the residual and atrous blocks helps extract sufficient spatial information, because atrous convolution preserves additional contextual features of the input, which improves the model’s generalizability.
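The receptive-field argument can be checked with a few lines of arithmetic. The sketch below uses the standard relation k_eff = k + (k − 1)(d − 1) for a kernel of size k with dilation rate d, and assumes stride-1 convolutions; the helper names are illustrative only.

```python
def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolutions.

    `layers` is a sequence of (kernel_size, dilation_rate) pairs.
    """
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf


if __name__ == "__main__":
    atrous = [(3, 2), (3, 4)]   # the two 3x3 convolutions of the atrous block
    plain = [(3, 1), (3, 1)]    # two ordinary 3x3 convolutions for comparison
    print("atrous block receptive field:", receptive_field(atrous))
    print("plain block receptive field: ", receptive_field(plain))
    # Both stacks use 3x3 kernels, i.e. 9 weights per input/output channel pair.
```

Two 3×3 convolutions with dilation rates 2 and 4 therefore see a 13×13 window, compared with 5×5 for two ordinary 3×3 convolutions, with the same number of weights.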
The proposed network uses the CAM [12] to obtain feature maps at various levels, so that the network filters out unnecessary information and focuses on the details in the feature images that are directly related to the target. CAM creates a channel attention map by exploiting the relationships between features across channels. Each channel of a feature map is regarded as a feature detector, so channel attention concentrates on “what” is significant given an input image.
To implement feature selection, squeeze-and-excitation (SE) networks [32] are used as the CAM, as illustrated in Fig. 4. The CAM consists of two stages. The first stage is squeezing. Using average pooling, the input features of size W×H×C coming from the MultiRes block in the encoder path are squeezed, channel by channel, into a vector of real numbers of size 1×1×C. In this manner, the global spatial information is squeezed into a channel descriptor, and the significance of every channel is determined automatically through learning.

Fig. 4. Channel Attention Mechanism (CAM).
The second stage is excitation, which seeks to fully capture channel-wise dependencies. The previously obtained set of real numbers is transformed by a sequence of convolutions, each followed by a non-linear activation function, to produce the weight vector (i.e., the attention vector) as follows. The features are passed through a first 1×1 convolution, which performs channel-level dimension reduction, followed by a ReLU activation function. The result is then passed through a second 1×1 convolution, which restores the original number of channels, followed by the sigmoid activation function. Here, the sigmoid maps every channel weight into the range (0, 1), which keeps the output normalized. The sigmoid activation function and the output of the above operations are given in the equations up to Equation (5).
The real numbers produced by the above procedure are used as a set of weights that describe the significance of every channel’s features in the original features. The final output of CAM is obtained by multiplying the output of the excitation operation, F(x), by the original output feature map x from the MultiRes block, as given by Equation (6):
CAM_out = F(x) × x    (6)
where CAM_out is the CAM output feature map with dimensions W×H×C and C is the number of channels. This output gives more weight to the key features of the input image and filters out undesirable regions. The CAM output is then concatenated with the upsampled result from the preceding decoder layer.
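The squeeze, excitation, and scaling steps can be traced on a toy feature map. The following is a minimal pure-Python sketch, not the authors' implementation: the weight matrices `w_reduce` and `w_expand` are hypothetical stand-ins for the two learned 1×1 convolutions (which act as matrix-vector products on a 1×1×C tensor), and biases are omitted.

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def channel_attention(x, w_reduce, w_expand):
    """Apply squeeze-and-excitation style channel attention.

    x is a list of C channels, each an H x W nested list.
    Returns the rescaled feature map and the per-channel attention weights.
    """
    # Squeeze: global average pooling gives one real number per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: reduce -> ReLU -> expand -> sigmoid.
    weights = sigmoid(matvec(w_expand, relu(matvec(w_reduce, squeezed))))
    # Scale: multiply every channel by its attention weight.
    out = [[[wt * p for p in row] for row in ch] for wt, ch in zip(weights, x)]
    return out, weights


if __name__ == "__main__":
    x = [[[1.0, 1.0], [1.0, 1.0]],      # channel 0
         [[3.0, 3.0], [3.0, 3.0]]]      # channel 1
    w_reduce = [[0.5, 0.5]]             # hypothetical weights: C=2 -> 1
    w_expand = [[1.0], [0.5]]           # hypothetical weights: 1 -> C=2
    out, weights = channel_attention(x, w_reduce, w_expand)
    print("attention weights:", [round(w, 4) for w in weights])
    print("rescaled channel 1:", out[1])
```

In this toy case each weight lies strictly between 0 and 1, but the weights do not add up to 1: the sigmoid bounds each channel independently rather than normalizing across channels as a softmax would.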
As shown in Fig. 1, we incorporate the CAM block into the decoder blocks at depth-2 and depth-3 by passing the output features of MultiRes block 3 and MultiRes block 4 on the encoder path into the CAM through the skip connections, in order to restore the information lost during downsampling and focus on important features. The features coming from the encoder side at depth-2 and depth-3 are coarser than those coming from the shallower depths (depth-0 and depth-1) [49] and are therefore the best candidates for passing to the CAM, which removes the ambiguity of noisy and irrelevant responses in the skip connections.
The architecture of the proposed model is illustrated in Fig. 1. After pre-processing, the model takes a 2D input image of dimensions 592×592×1, where the image resolution is 592×592. The model has a depth of 4, and the numbers of filters at the successive depths are 32, 64, 128, 256, and 512. In the encoder stage, each MultiRes block is followed by a 2×2 max-pooling layer with stride 2. We use the Residual-Atrous connection as the skip connection at depth-0 and depth-1, whereas the output features of MultiRes block 3 and MultiRes block 4 on the encoder side are passed as input to the CAM at depth-2 and depth-3, respectively. In the decoder path, we concatenate the output of the upsampling operation, which uses transposed convolutions, with the output features of the CAM blocks at depth-2 and depth-3, and with the output of the Residual-Atrous connection at depth-0 and depth-1. The concatenated features are propagated to the corresponding MultiRes blocks in the decoder. All convolutional layers except the output layer are activated by the Rectified Linear Unit (ReLU) activation function and followed by a Batch Normalization layer. After the last MultiRes block, we apply a 1×1 convolutional layer as the output layer, activated by a sigmoid activation function. In the final output, each channel is a probability map for a specific class. The loss is then computed by comparing these outputs with the ground truth. The proposed model has 6,051,322 trainable parameters.
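To see how the spatial resolution evolves through the encoder, the sketch below traces one side length through four 2×2 max-pooling steps with stride 2. It only illustrates the arithmetic; the function name `encoder_sizes` is hypothetical, and the check for odd sizes is an assumption about what a clean halving and later upsampling would require, not a statement about the authors' code.

```python
def encoder_sizes(size, poolings=4):
    """Side length of the feature maps at each encoder level.

    Raises ValueError if a level has an odd side length, since a 2x2 pooling
    with stride 2 would then not halve it exactly.
    """
    sizes = [size]
    for _ in range(poolings):
        if size % 2:
            raise ValueError(f"side length {size} cannot be halved exactly")
        size //= 2
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    nominal_filters = (32, 64, 128, 256, 512)  # U at each level, from the text
    for side, u in zip(encoder_sizes(592), nominal_filters):
        print(f"{side}x{side}, U={u}")
```

With a 592×592 input, the levels have side lengths 592, 296, 148, 74, and 37. By contrast, neither 520 nor 696 is divisible by 16, so the original image dimensions would not halve cleanly four times.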
Image post-processing
After the prediction sample is pre-processed, the prediction is performed. The prediction output has a size of 592×592. We then apply post-processing by resizing the prediction output to the original shape of 520×696.
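The resize step of the pre- and post-processing can be sketched as follows. The interpolation method is not stated in the text, so nearest-neighbour sampling is assumed here because it keeps a binary mask binary; `resize_nearest` is a hypothetical helper written in pure Python for illustration.

```python
def resize_nearest(img, out_h, out_w):
    """Resize a 2D nested list to out_h x out_w by nearest-neighbour sampling.

    Assumption: nearest-neighbour interpolation; the paper does not say which
    interpolation was used.
    """
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


if __name__ == "__main__":
    # A blank 592x592 prediction mapped back to the original 520x696 shape.
    prediction = [[0] * 592 for _ in range(592)]
    restored = resize_nearest(prediction, 520, 696)
    print(len(restored), len(restored[0]))
```

Because the 592×592 network input changes the aspect ratio of the 520×696 image, the same mapping has to be inverted after prediction so that the mask aligns with the original pixels.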
Experiment setup
Dataset
We apply the proposed model to cell segmentation on light microscopy images from the ISBI Cell Tracking Challenge. We use the PhC-C2DH-U373 and Fluo-N2DH-GOWT1 datasets; the former is recorded by phase contrast microscopy. Each dataset is partitioned into a 60% training set, a 20% validation set, and a 20% testing set. Table 1 shows the dataset details together with the means and standard deviations of our experiments. As shown in Table 1, the first dataset consists of glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate and contains 230 2D images of size 520×696 in TIFF format. The second dataset consists of GFP-GOWT1 stem cells and contains 184 2D images of size 1024×1024 in TIFF format. In the segmentation ground truth of each image, the segmented objects (cells) have unique positive labels and the background has the label zero. As illustrated in Table 1, the standard deviations of our experiments are small, which indicates that the results are insensitive to changes in the dataset and that most values are clustered around the mean.
Table 1. Experiment details and standard deviations
The goal of semantic segmentation is to determine whether each pixel belongs to an object of interest or to the background. The problem therefore reduces to a binary classification problem at the pixel level. Accordingly, we use the binary cross-entropy (BCE) function as the loss function and minimize it. The BCE loss function is defined by Equation (7).
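For reference, the standard form of binary cross-entropy averaged over N pixels is L = −(1/N) Σ [y·log(p) + (1 − y)·log(1 − p)], where y is the ground-truth label and p the predicted foreground probability. The sketch below evaluates this standard form on flattened masks; the symbol names and the clipping constant `eps` are assumptions for illustration, not the paper's notation.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over flattened pixel labels and probabilities.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


if __name__ == "__main__":
    truth = [1, 0, 1, 0]
    print("uncertain prediction:", binary_cross_entropy(truth, [0.5, 0.5, 0.5, 0.5]))
    print("confident prediction:", binary_cross_entropy(truth, [0.9, 0.1, 0.9, 0.1]))
```

An uninformative prediction of 0.5 for every pixel gives a loss of ln 2 ≈ 0.693, and the loss falls toward zero as the predicted probabilities approach the ground-truth labels.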
During training, we use adaptive moment estimation (Adam) [50] as the gradient descent optimization algorithm. Adam combines the benefits of RMSProp [51] and AdaGrad [52] by using estimates of the first and second moments of the gradients to compute individual learning rates for different parameters. This optimizer has a low memory requirement, runs quickly, and is computationally efficient. It updates the network weights and learning rate to improve accuracy and minimize the loss function. The model was trained for 50 epochs with a batch size of 2 using the Adam optimizer with default parameters β1 = 0.9, β2 = 0.999, and ε = 1×10⁻⁷, and an initial learning rate of 0.001, which is reduced by a decay factor of 0.1 every 5 epochs when the monitored metric has stopped improving.
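One plausible reading of the learning-rate rule is a reduce-on-plateau schedule: the rate is multiplied by 0.1 once the monitored metric has failed to improve for 5 consecutive epochs. The sketch below implements that reading in plain Python. It is an interpretation, not the authors' training code; the choice of validation loss as the monitored metric and the name `plateau_schedule` are assumptions.

```python
def plateau_schedule(val_losses, lr=1e-3, factor=0.1, patience=5):
    """Learning rate in effect after each epoch under a reduce-on-plateau rule.

    Assumption: the rate is multiplied by `factor` whenever the validation
    loss has not improved for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        history.append(lr)
    return history


if __name__ == "__main__":
    # Hypothetical validation losses: two improving epochs, then a plateau.
    losses = [0.50, 0.40, 0.40, 0.41, 0.40, 0.42, 0.40, 0.39]
    for epoch, rate in enumerate(plateau_schedule(losses), start=1):
        print(epoch, rate)
```

With these illustrative losses the rate stays at 0.001 through the sixth epoch and drops to 0.0001 after the seventh, the fifth epoch in a row without improvement.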
The value of the loss function on the training and validation sets is computed at the end of each training epoch. The training and testing accuracy of the proposed model for the PhC-C2DH-U373 dataset is illustrated in Fig. 5, which shows how the model's performance improves during training as the accuracy increases and the loss gradually decreases. The loss keeps decreasing through to the end of the 50 epochs and might decrease further with more training. However, we found that increasing the number of training epochs raises the error on the validation data, so stopping training after 50 epochs, at a loss value of 0.26, is suitable. As Fig. 5 shows, the model accuracy improves from epoch to epoch. On the PhC-C2DH-U373 dataset, both the training and testing accuracies converge quickly after about 10 epochs. This strong convergence speed is due to the interaction between batch normalization and residual connections [25].

Training and testing performance of proposed RAMRU-CAM model.
The proposed model was run on Windows 10 with an Intel Core i5 CPU, 16 GB of RAM, and an NVIDIA GeForce GTX 1080Ti GPU. All experiments were run with Python 3.7.11, TensorFlow 2.7.0, and Keras 2.7.0.
The regions of interest in semantic segmentation often make up only a small part of the full image. As a result, metrics such as recall and precision are insufficient and frequently create an illusion of excellence. Therefore, to evaluate the segmentation performance of the proposed method, we apply two metrics, the Dice similarity coefficient (DSC) and the Jaccard index (Jac). The DSC and Jac are the most widely used metrics for evaluating image segmentation; they measure the overlap (or similarity) between the predicted sample and the ground truth [53]. The DSC and Jac metrics are given by Equations (8) and (9), respectively.
We measure the classification performance of our model during training using the accuracy (Acc) metric, which describes performance over the whole dataset. Accuracy is the number of correct predictions divided by the total number of input samples, as defined by Equation (10).
TP, TN, FP, and FN denote the number of true-positive, true-negative, false-positive, and false-negative samples, respectively.
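For illustration, the three metrics can be computed from the confusion counts as in the following sketch. It uses the standard confusion-matrix forms of the DSC, Jac, and Acc; the function names are illustrative, and returning 1.0 when both the prediction and the ground truth are empty is a convention assumed here rather than one stated in the text.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for two equally long sequences of 0/1 pixel labels."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t and p:
            tp += 1
        elif not t and not p:
            tn += 1
        elif p:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def dice(tp, fp, fn):
    """DSC = 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # assumed convention for empty masks


def jaccard(tp, fp, fn):
    """Jac = TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # assumed convention for empty masks


def accuracy(tp, tn, fp, fn):
    """Acc = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)
```

The two overlap metrics are monotonically related through DSC = 2·Jac / (1 + Jac), so they always rank a pair of predictions for the same image in the same order.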
This section details the experiments performed to test and evaluate the proposed model. We evaluate the proposed model in comparison to its single counterparts and the most recent architectures.
Complexity analysis
To understand the complexity of CNN models, it is important to identify the primary components that contribute to the network's complexity, so that effective architectures can be designed. The key contributors to energy consumption in CNNs are the computations required by the filtering operations and the memory required for the feature map transactions. Processing a CNN involves a large number of operations: input feature maps (IFMs) are processed by convolutional filters, which produce output feature maps (OFMs). The computational cost can be quantified by counting the floating-point operations (FLOPs), which include multiplications, additions, divisions, etc., and by counting the multiply-and-accumulate operations (MACs) used to process the CNN.
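As a rough illustration of how such counts arise, the sketch below estimates the MACs and FLOPs of a single standard 2D convolution layer. It assumes that bias additions and activations are ignored and that one MAC is counted as two FLOPs (one multiplication and one addition); profiling tools differ in these conventions, so the values reported in a comparison table may follow a different one.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, k_h, k_w):
    """MACs of a standard 2D convolution.

    Each of the h_out * w_out * c_out output values is a dot product over a
    k_h x k_w window across all c_in input channels.
    """
    return h_out * w_out * c_out * c_in * k_h * k_w


def conv2d_flops(h_out, w_out, c_in, c_out, k_h, k_w):
    """FLOPs under the assumed convention 1 MAC = 2 FLOPs (bias ignored)."""
    return 2 * conv2d_macs(h_out, w_out, c_in, c_out, k_h, k_w)


def conv2d_params(c_in, c_out, k_h, k_w, bias=True):
    """Trainable parameters of the same layer (weights plus optional bias)."""
    return c_in * c_out * k_h * k_w + (c_out if bias else 0)
```

Summing these per-layer values over all convolution layers gives a first-order estimate of the cost of a network; because the cost scales with the output resolution, layers that operate on full-resolution feature maps dominate the total.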
As shown in Table 2, the complexity of the proposed model is measured against other recent variants of U-Net in terms of the number of parameters, memory usage, FLOPs, and MACs. The proposed model has fewer trainable parameters than the compared models, which means that it requires fewer calculations, mitigates overfitting, and results in a lightweight network. The proposed model needs approximately 13 GFLOPs fewer than MultiResUNet and has the lowest FLOPs count of all compared models. Additionally, the proposed model performs 62.14 GMACs, the smallest MACs value among all models. On the other hand, the proposed model requires 2075.66 MB of memory, which is less than both the MultiResUNet and UNet++ models; however, the Attention U-Net and AGResU-Net models require less memory. Overall, the smaller number of parameters, FLOPs, and MACs indicates that the proposed model has lower computational complexity.
Complexity comparison of the proposed model and other recent variants of U-Net model
To verify the effectiveness and efficiency of the proposed approach, two case studies are conducted, one on the PhC-C2DH-U373 dataset and one on the Fluo-N2DH-GOWT1 dataset. Efficiency is measured by the computational cost of the proposed model, i.e., training time and testing time. Table 3 compares the computational costs of the proposed model with those of different methods on the PhC-C2DH-U373 dataset under the same implementation details. The training and testing times of the proposed model are shorter than those of the MultiResUNet, AGResU-Net, and UNet++ models. The Attention U-Net model has shorter training and testing times than the proposed model, but it has lower segmentation accuracy. Although MultiResUNet achieves high segmentation performance, it has the longest training and testing times of all models.
Computational costs of different state-of-the-art methods on the PhC-C2DH-U373 dataset
To evaluate the proposed model, we used unseen test images that were not used in the training or validation sets. Figure 6 presents examples of the cell segmentation results of the proposed model on the PhC-C2DH-U373 dataset for five samples from the dataset. For each sample, the first row shows the input image, the second row shows the prediction of our model, and the third row shows the ground-truth segmentation; the Jaccard index (Jac) reported for each sample gives the similarity score between the prediction and the ground truth. This shows that the segmentation produced by the proposed model is quite close to the ground truth, which provides a sound basis for further investigation.

Examples of Cell Segmentation Results on PhC-C2DH-U373 dataset of the proposed model.
For a fair comparison, the comparative models were scaled to depth 4. Each model is compiled and trained using the same training set and parameters, and all models are then evaluated on the same test set. Two methods were used to evaluate the segmentation performance of the proposed model in comparison to the state-of-the-art models. The first was a qualitative analysis in which the segmentation quality of each model was assessed by visual inspection. The second was a quantitative analysis that involved computing the Jaccard (Jac) and Dice (DSC) similarity coefficients.
For the PhC-C2DH-U373 dataset, we evaluated the proposed model in comparison to its baseline, MultiResUNet [8], and two other variants of the U-Net model, Attention U-Net [33] and Attention Gate ResU-Net (AGResU-Net) [31]. Figure 7 presents the qualitative comparison by visualizing the predictions of these models on five unseen images picked at random from the test set. Under each prediction, the Jaccard index (Jac) value gives the similarity score between the ground-truth segmentation and the prediction of the corresponding model. Figure 7 clearly shows that, compared to the other segmentation techniques, the proposed model segments cell regions that are the most similar to the ground truth.

Visualization of Cell Segmentation Results of the proposed model compared to its single counterparts on the PhC-C2DH-U373 dataset.
In addition to the qualitative study through visual inspection, a quantitative comparison of the segmentation accuracy based on the Jaccard index (Jac) and Dice Similarity Coefficient (DSC) values was also carried out. Table 4 reports the Jac and DSC values of the proposed model and its single counterparts for ten samples from the test set of the PhC-C2DH-U373 dataset. On these ten samples, the proposed model achieved average Jac and DSC values of 95.23 % and 97.55 %, respectively, which were higher than those of the MultiResUNet, AGResU-Net, and Attention U-Net models. The MultiResUNet model also produced promising results, except for Sample 02-021, which has a complex structure as previously demonstrated. Overall, the proposed model outperformed the other models.
Performance comparison of the proposed model and its single counterparts for ten random test samples on the PhC-C2DH-U373 dataset
The quantitative comparison of the proposed model with the state-of-the-art variants of the U-Net model on the whole test set is shown in Table 5. The proposed model obtained a Jaccard index (Jac) of 94.12 % and a Dice Similarity Coefficient (DSC) of 96.93 %, which is better than the other compared models. MultiResUNet took second place with 93.28 % in the Jaccard index and 96.46 % in the DSC. Attention U-Net achieved lower segmentation accuracy, with 92.59 % and 96.10 % for the Jac and DSC metrics, respectively. The AGResU-Net model took fourth place with a Jaccard index of 85.73 % and a DSC of 92.21 %. The UNet++ model obtained the lowest accuracy of all compared models.
Comparison of our proposed model with the state-of-the-art variants of the U-Net model on the test set for the PhC-C2DH-U373 dataset
In addition to the comparison with recent variants of U-Net above, we compare the proposed architecture with other state-of-the-art models used for semantic segmentation. The quantitative performance comparison of the proposed RAMRU-CAM model with the state-of-the-art models RSHN [37], Res2-UNeXt [10], temporal feedback [38], and GRUU-Net [39] is shown in Table 6.
Performance Comparison of our proposed model and the existing state-of-the-art models on the PhC-C2DH-U373 dataset
The results show that the proposed method outperforms the state-of-the-art methods for cell segmentation on the PhC-C2DH-U373 dataset. As presented in Fig. 8, the proposed method achieves higher segmentation accuracy than both the variants of the U-Net model and the other recent methods, and therefore yields better segmentation results.

The comparison results of the Proposed method with different methods on the PhC-C2DH-U373 dataset.
Fine-tuning is a transfer learning mechanism that retains the knowledge learned from solving one problem and reuses it to solve different but related problems. Given that CNNs include a large number of layers and parameters, the training stage can benefit from databases with a variety of examples to prevent overfitting. Fine-tuning is implemented by freezing the parameters (weights) of some layers and running a new training operation to adjust the parameters of the remaining layers, rather than completely retraining the model on the new dataset.
In this work, we fine-tune the parameters of the proposed model, which was pre-trained on the original dataset, namely the PhC-C2DH-U373 dataset. We then train this fine-tuned model on a new dataset, namely the Fluo-N2DH-GOWT1 dataset. The fine-tuning process is applied as follows. First, we freeze the initial layers because they extract low-level features; in other words, the pre-trained weights of these layers are reused unchanged. To reduce the computational cost of training, we increased the number of frozen layers by freezing all model layers except the convolutional layers of the last two MultiRes blocks and the output layer, whose weights are initialized at random. The new dataset has two classes, the same as the original dataset, so the number of output classes remains unchanged.
Similarly, the other pre-trained models, MultiResUNet, Attention U-Net, AGResU-Net, and UNet++, are fine-tuned by freezing all layers of each model except the links between the penultimate and final layers, whose parameters are initialized at random.
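The layer-freezing step can be sketched as follows. The helper only relies on each layer exposing `name` and `trainable` attributes, as Keras layers do; the layer names and prefixes in the example are hypothetical placeholders and do not correspond to the actual layer names of the proposed model.

```python
from types import SimpleNamespace


def freeze_except(layers, trainable_prefixes):
    """Freeze every layer except those whose name starts with a given prefix.

    layers:             objects with `name` and `trainable` attributes.
    trainable_prefixes: name prefixes of the layers that stay trainable.
    Returns the names of the layers left trainable.
    """
    prefixes = tuple(trainable_prefixes)
    for layer in layers:
        layer.trainable = layer.name.startswith(prefixes)
    return [layer.name for layer in layers if layer.trainable]


# Stand-in layers with hypothetical names, used only to demonstrate the helper.
demo_layers = [
    SimpleNamespace(name=n, trainable=True)
    for n in ["enc_block1_conv", "enc_block2_conv",
              "dec_block3_conv", "dec_block4_conv", "output_conv"]
]
kept_trainable = freeze_except(demo_layers, ["dec_block3", "dec_block4", "output"])
```

When such a helper is applied to a real Keras model, the model generally has to be compiled again after the `trainable` flags change for the new setting to take effect during training.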
The pre-trained model has an architecture that was optimized for the dataset on which it was originally trained. During the training stage, hyperparameter optimization is applied alongside the optimization of the connection weights. The search space considered for the parameter optimization is shown in Table 7.
Parameters setting
To improve accuracy, certain parameters were tuned, and the optimal value of each parameter was obtained. Table 8 shows the optimal configuration of parameters and the segmentation accuracy for the proposed model and the other models on the Fluo-N2DH-GOWT1 dataset. For this dataset, we also evaluated the proposed model in comparison to recent models in terms of qualitative performance. Figure 9 presents the qualitative comparison by visualizing the predictions of these models on five unseen images picked at random from the test set.
The optimal configuration of parameters and segmentation accuracy for different models on Fluo-N2DH-GOWT1 dataset

Visualization of Cell Segmentation Results of the proposed model compared to its single counterparts on the Fluo-N2DH-GOWT1 dataset.
Cross-validation tests evaluate an algorithm's overall performance on data held out from training while balancing bias and variance. A k-fold cross-validation test randomly divides the dataset into k mutually exclusive groups of equal or nearly equal size. The algorithm is then executed k times, each time using one of the k groups as the validation set and the others as the training set. To evaluate the segmentation accuracy of the proposed model, we applied 3-fold cross-validation tests.
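The splitting procedure described above can be sketched as follows. This is an illustrative index-based split, not the code used in the experiments; the function name, the fixed random seed, and the strided assignment of shuffled indices to folds are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k=3, seed=0):
    """Split sample indices into k mutually exclusive, nearly equal folds.

    Returns a list of k (train_indices, val_indices) pairs; each fold serves
    as the validation set exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    # Strided slicing yields folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]

    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits
```

For each of the k pairs, the model is trained on the training indices and scored on the validation indices, and the k scores are then averaged to obtain the overall estimate.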
The validation performance of the 3-fold cross-validation for the proposed model on the PhC-C2DH-U373 dataset, in terms of the Dice Similarity Coefficient (DSC) metric presented in Equation (8), is illustrated in Fig. 10. In each run of the 3-fold cross-validation, the best result on the validation set obtained over all training epochs (50 epochs in our case) is recorded. The overall performance of the algorithm is then estimated by averaging the results of the 3 runs. The results of the 3-fold cross-validation for the proposed model on the validation and testing sets of the PhC-C2DH-U373 dataset in terms of the DSC metric are shown in Table 9.

Validation performance progress of the 3-fold cross-validation for proposed model on the PhC-C2DH-U373 dataset. The value of Dice Similarity Coefficient is recorded on training and validation data after every epoch for the three folds (a) fold 1, (b) fold 2, and (c) fold 3.
Results of 3-fold cross-validation. The optimal results of the proposed model in the three folds are presented for the validation and testing sets of the PhC-C2DH-U373 dataset in terms of DSC
In microscope images, the task of cell segmentation is challenging and popular. In this study, we proposed a novel cell segmentation method called Residual-Atrous MultiResUnet with Channel Attention Mechanism (RAMRU-CAM). We enhanced the MultiResUNet architecture by using Residual-Atrous connections as shortcut connections instead of the Res path to manage the spatial dimension of feature maps. In addition, we integrated Channel Attention Mechanism (CAM) blocks into the decoder network to filter out unnecessary information and focus on the important details in the feature maps that are directly related to the target. The proposed model has been validated for segmenting cells on the PhC-C2DH-U373 and Fluo-N2DH-GOWT1 datasets. Experimental results indicate that the proposed model achieves the best segmentation results compared with the baseline MultiResUNet model and other state-of-the-art cell segmentation methods. Furthermore, the proposed architecture is lightweight and requires less memory than the MultiResUNet and UNet++ models.
The recent variants of the U-Net model that are compared with the proposed model perform cell segmentation with high precision. However, on extremely difficult images, these models frequently under-segment, over-segment, predict incorrectly, or even ignore the objects entirely. The performance gain provided by the proposed model is greatest for complex images with noise, clutter, a lack of distinct boundaries, etc. The proposed model also succeeds in capturing fine details. Although the segmentations produced by the proposed model are not perfect, it often performs significantly better than the recent variants of the U-Net model considered in this study. However, for both the proposed method and these models, delineating cell borders in areas where cells overlap or touch remains especially challenging. Therefore, future work is still necessary to advance the segmentation of overlapping or touching cells by integrating this work with other recent deep-learning models. Moreover, the segmentation performance of the proposed RAMRU-CAM model may be enhanced in the future by using 3D microscope images as input, and the enhanced architecture may be extended and applied to more datasets to demonstrate its generalizability.
Acknowledgments
The first author would like to sincerely thank Profs: Dr. Wagdy Gomaa El-Sayed, Dr. Yasser Fouad Hassan, and Dr. Yousef Sardahi for their insightful comments and suggestions that helped significantly improve this research.
