Abstract
Fiber and vessel structures located in the cross-section are anatomical features that play an important role in identifying tree species. In order to determine the microscopic anatomical structure of these cell types, each cell must be accurately segmented. In this study, a segmentation method is proposed for wood cell images based on deep convolutional neural networks. The network, which was developed by combining two-stage CNN structures, was trained using the Adam optimization algorithm. For evaluation, the method was compared with SegNet and U-Net architectures, trained with the same dataset. The losses in these models trained were compared using IoU (Intersection over Union), accuracy, and BF-score measurements on the test data. The automatic identification of the cells in the wood images obtained using a microscope will provide a fast, inexpensive, and reliable tool for those working in this field.
Introduction
As a safe, easy to handle, and aesthetic structure, woods have been used as building and engineering material since ancient times. Wooden materials are preferred mainly for their anatomical and chemical structures as well as their mechanical and physical properties. The rational use of this raw material is associated with the known structural properties of wood.
Anatomical features such as fibers, vessels, parenchyma, and rays play an important role in identifying wood species [1, 2]. These anatomical features affect the parameters related to the quality of wood and its area of use. Therefore, for the anatomical description of the microscopic character of each wood cell, the cells must be segmented accurately.
So far, numerous methods have been implemented for the segmentation of an image into groups of similar attributes, for determining the object class of each pixel [3, 4]. Yet, continuous improvement is necessary due to the need for more accurate techniques in this field. In recent years, deep learning has attracted the attention of researchers due to the need for more accurate and effective methods. The deep learning technique is used in various fields for better performance. It’s a preferred technique, especially in medical image analysis [5–7]. Deep learning models, U-Net in particular, are very effective in image segmentation [8–11].
In this regard, several studies were carried out using computer-aided analysis for the microscopic wood images. In most of these studies, the anatomical structures of the plants were analyzed using image processing [12]. For example, the watershed algorithm has been used to identify cells in the microscopic images of softwood [13, 14]. Another study has focused on the detailed analysis of the wood cell type, and addressed only three cell types: trachea, tracheid, and rays, the main cells of wood, which have been analyzed using image analysis software, and, height, sphericity, thickness, lumen and surface area of the cells have been calculated for each cell [15].
In another study, it was stated that two challenges need to be solved in pore segmentation. The first is the variety in sizes and shapes. Fibers and vessels are similar in color and shape but vary in size. Therefore, this variance should be considered to increase the accuracy of segmentation. Consequently, an algorithm based on mathematical morphology was proposed for the solution to this problem. This algorithm uses a disk-shaped structure element that changes its radius according to the porosity size [16].
The purpose of a segmentation network is to tag each pixel in an image in such a way that pixels with the same tag share certain properties. A segmentation network classifies each pixel in an image and provides an image segmented according to its class. Commonly used for image segmentation, the CNNs replace fully connected layers with convoluted layers to perform tagging each pixel. Such architectures consist of down-sampling and sampling steps, also known as encoders and decoders, respectively. However, additional sampling layers often provide more trainable parameters in such architectures [17–20].
SegNet, which is used since 2015, is a practical, fully deep-convolutional neural network (CNN) architecture for semantic pixel-based segmentation. It consists of an encoder network, a corresponding decoder network, and a pixel-classification layer [21]. In the subsequent research, fully connected layers were removed to preserve high-resolution feature maps at the deepest encoder output. The excessive number of parameters is also significantly reduced in the SegNet encoder. Hence, SegNet provides significant improvements in accuracy, especially for smaller classes [18–20]. Another study shows that the benefit of a significantly reduced number of parameters reduces memory consumption and feature extraction time, without compromising on the performance [22].
One of the popular approaches in segmentation is the use of an encoder-decoder structure. In the traditional autoencoder architecture, the size of the input information is reduced through the subsequent layers, starting with the first stage [23].The decoder part comes after the encoder of the architecture. Linear feature representation is learned in this part, and the size gradually increases. The output size at the end of the architecture is the same as the input size. The architecture is ideal for maintaining the output size, but as a drawback, it compresses the input linearly, resulting in a bottleneck that cannot covey all the features. It differs from the U-Net in this regard. The U-Net model performs deconvolution on the decoder side, and overcomes this “bottleneck” drawback, thanks to the connections from the encoder side of the architecture.
Although numerous studies have been conducted on the microscopic images of wood, studies with deep learning on wood cell images are still in their infancy [8, 24]. Vessel and fiber structures are very similar but vary in size according to the species. In traditional methods, especially in morphological-based image processing techniques, some parameters of each image must be changed or adjusted according to the size of the pore. Moreover, fibers and vessels are difficult to characterize in detail due to their multicellular structures. Therefore, a two-stage CNN structure, customized according to the characteristics of these cells, was proposed for the segmentation of the fiber and vessel cells. In this way, pores with similar shapes, but with different sizes were better identified. Since they contain sufficient information, a deep learning method was applied on smaller input image sizes, without the need for advanced hardware. The steps taken for the segmentation of wood images are discussed following sub-sections.
Material and methods
Preparation of the dataset
In the study, cross-section microscopic images of diffuse and semi-ring porous wood species were used [25]. The species include of Populus, Aesculus, Alnus Glutinosa, Aesculus Hippocastanum L. Juglans Acer, and Betula.
The images were labeled using morphological procedures [26]. The values of the structural elements to be selected can vary for each image. Thus, these elements need to be checked and identified for each image. After labeling, the images were controlled in line with expert guidance for an accurate wood image classification. The images were grouped under three classes: black for the background, white for fiber, and dark gray for woodgrain. Manual segmentation is one of the best methods for the segmentation of small datasets, but it’s a time-consuming process [27]. Figure 1 shows the original image and the corresponding labeling (ground truth).

Original image and corresponding labeled image. (background (0), white for fiber and dark gray for vessel).
The original image was divided into patches of 48x48 pixel in size (Fig. 2). Of these 48x48-pixel patches, images with dominant background information were not taken into account and not included in the database. The database consists of 40000 patches of 48x48 pixels in size. After dividing the dataset into two parts as training and test sets (4000), data augmentation was performed. The training set was doubled using the data augmentation.

Taking 48x48 non-overlapping patches, a) original image, b) Traning Patch Extraction.
Data augmentation is the most widely used method to increase the number of images in a dataset, improve the model”s performance, and reduce the over-adaptation in the image data. As adopted by numerous studies, data augmentation is an effective way of increasing training data set for deep learning [28–30]. Data augmentation aims to help the network to learn features better. Rotation, translation, adding noise, reflection and contrast adjustment are among the frequently used methods of data augmentation in the literature.
The data were augmented by applying rotation, salt-and-pepper noise (0.01 to 0.05), circular mean filter (pillbox) for focus blur, gauss filter, reflection, and applying random transformations. Pillbox was used for simulating distortion seen in microscopic images.
There are three labeled classes in the dataset in this study. Since some classes may outperform the others in number, which may affect potential feature distributions and classification results, each class was weighted by the class balance method [31].
Where fi median is the median of all fi; fi is the pixel frequencies of class i.
Data class imbalance is a major problem negatively affecting especially minority classes in controlled classifiers. A wide variety of strategies have been developed to overcome this repetitive problem, including excessive sampling, sampling, preservation of natural proportions in training samples, data synthesis and class-weighted loss functions [32, 33].
In this study, a method is proposed for the segmentation of microscopic images of wood cross-sections. As shown in Fig. 3, the proposed method consists of a combination of two-stage CNN architectures.

The architecture of the customized CNN model.
The first stage is composed of 3×3 filtered convolution layer with 64 channels, batch normalization (BN), and ReLU blocks. The convolution layer consists of a large number of trainable filters that can be updated during training. Filters apply the convolution process to the images coming from the previous layer and generate the output data. The activation map (feature map) is generated as a result of this convolution process. The activation map shows the regions with specific features discovered by each filter [34].
Then, the activation map is fed to the next layer. Padding is selected so that the output size of the convolution layer becomes equal to the input size. A BN layer normalizes each input channel along a mini-size. Batch normalization layers are often used between convolution layers and ReLU layers in order to speed up the training of CNNs and reduce the sensitivity to the initial condition of the network. The layer normalizes the activation of each channel by first calculating the mini-batch average and dividing it to the mini-batch standard deviation. Consequently, BN adds two trainable parameters to each layer, so the normalized output is multiplied by a gamma parameter and add a beta parameter [35].
B = {x1….m}: values of x over a mini-batch, ɛis a constant for numerical stability (very small). γ and β are parameters to learn and shift.
ReLU activation was added after all the convolutional and batch normalization layers in the network.
As an effect on the input data, it sets negative values to zero. With the use of this layer, the network learns faster. ReLU is a non-linear and computationally simple method for capturing more complex properties of input data, and greatly accelerates CNN training [28, 30]. For most purposes, ReLU units are effective [36].
Unlike typical encoder-decoder based segmentation architectures, the image resolution is preserved from input to output of the first CNN. Therefore, the padding process is not necessary to maintain intermediate sizes of the images since we do not perform down-sampling on the image. Although sub-sampling (pooling) increases the robustness of a model against small variations in input, the image resolution will be poor in such cases. Since the maximum pooling process cannot be reversed, the pooling layer was not used for the images we use to prevent smoothing out small, porous features. Pooling is usually placed after the ReLU layer. Its main purpose is to reduce the input size for the next convolution layer. The reduction in size results in a loss of information due to this layer. Although the depth of the CNN architecture can be increased by additional layers, each additional layer will cause an extra parameter for training.
The second stage has an encoder-decoder architecture. The convolutional layer consists of BN, ReLU, and maximum pooling layers. The deconvolutional layer has upsampling, convolution, and BN layers. The maximum pooling layer divides the input into rectangular pooling regions and calculates the maximum of each region.
Although the pooling layer causes loss of information, it is useful for two reasons: It has a less computational burden for the following layers of the network, and reduces variance.
In the final stage, the feature maps obtained were sent to the softmax layer on a parallel network. Separate softmax classifiers were used for calculating the loss
Where N is the number of classes, x is the input vector, and σ(x)i is the output class probability.
The class probabilities of each pixel are combined and sent to another softmax classifier. The output produces a segmented image at the same resolution as the input image.
Training is an iterative process, in which training data are fed into the model in groups, predictions are made with a forward-propagation by the current model, and errors between the predictions and labels in the accuracy table are calculated. Errors are back-propagated over the network, parameters (weights) are calculated for all neurons, and the parameters are updated to minimize errors [37].
The input weights are initialized with the Glorot initializer. The choice of initializer has a bigger impact on networks without batch normalization layers [38].
During learning, the loss function calculates the difference between the pixel class distributions and the predicted probabilities.
where m i is truth value and x i is the Softmax probability for the ith class.
The weights are updated at each mini-batch and a revised model is produced at each epoch. Adam (adaptive moment estimation) adaptive learning optimization algorithm was used to minimize loss Adam maintains an element-wise moving average of both the parameter gradients and their squared values. If the gradients contain too much noise, the moving average of the gradient becomes smaller, and therefore the parameter updates are smaller too [39].
The network was trained using the Adam optimization algorithm. The architecture of the first CNN consists of blocks 3×3 convolution with 64 channels. The learning rate was set to 0.0002, the mini-batch size was 4, the training period was 20, the basic learning rate was 0.001, momentum was 0.9 and gamma was 0.1. The learning rate was reduced by 0.1 times in every 5 epochs that of the previous one.
The patches are applied to first CNN and a probability map of the same size as the input image is obtained as output. This map shows the probability of the class to which each pixel belongs. There is no pooling layer at this stage. The output of the first layer was able to detect wood cells with relatively high sensitivity, but there were some problems in the border areas of the regions. In addition, it was planned to further improve the segmentation map in the first stage by applying the second stage.
The feature maps obtained are sent to the softmax layer on the parallel network. Separate softmax classifiers were used to calculate the loss. The class probabilities of each pixel are combined and sent to another softmax classifier. The result of the softmax function shows the probability of each category. As the last step of the proposed segmentation algorithm, the probability map obtained for wood cells is converted into a segmentation mask. The output produces a segmented image at the same resolution as the input image.
Accuracy is the ratio of correctly classified pixels to the total number of pixels in that class, relative to the ground-truth, for each class. In other words,
Global Accuracy is the ratio of correctly classified pixels, regardless of class, to the total number of pixels.
where, true positives (TP), false positives (FP), false negatives (FN) and TN (true negative).
The IoU of two sets A and B (also known as Jaccard similarity coefficient) is expressed as; Jaccard(A,B)=| intersection(A,B) | / | union(A,B) |. The Jaccard index is related to the Dice index. In other words,
The BF-score indicates how well the projected boundary of each class is aligned with the actual boundary.
In this study, a method is proposed for the automatic segmentation of wood cell images using deep neural networks. A 5-fold cross-validation data split was used for training in the proposed network. The sample segmentation results for test images are shown in Fig. 4. The segmentation results are indicated by the color scale for a more specific representation.

Example segmentation results. a) test images, b) segmentation results with the proposed method (colored representation).
The pixels on the border region are added symmetrically around the image for the seamless segmentation of larger images by an overlap-tile strategy [17]. Even if 48x48 patches were used in training, the recommended method can be applied to the entire image.
To evaluate the method, it was compared with SegNet [20] and U-Net [17] architectures trained using the same dataset. The U-Net and SegNet architectures have often been used in literature as references in segmentation studies. The losses of the trained models were evaluated on the test data and the results were compared with ground truth. Since there was no previous study with this dataset, IoU, accuracy, and F-Score measurements were used to evaluate the segmentation results. A comparison of the segmentation methods is shown in Table 1. The values in the table show the average values of the test data for 5-fold validation.
Numerical comparison of segmentation methods
The global accuracy presented in Table 1 refers to the ratio of correctly classified pixels to the total number of pixels, regardless of class. Mean Accuracy refers to the mean accuracy of all classes in all images in the dataset. Mean IoU and BFScore are the average IoU and BF scores of all classes in all images, respectively. Weighted-IoU is the average IoU of each class, weighted by the number of pixels in that class. The values in Table 1 are the averages of test images. In this study, accuracies were 71% with SegNet, 80% with U-Net, and 89% with the proposed method, respectively. The proposed method was found to yield better results than the others.
In order to identify a tree microscopically, it is necessary to determine the anatomical features of all individual cell types (vessels, axis and ray parenchyma, and fibers) in three cross-sections. Determining the features of each cell type depends on the precise segmentation of each cell into segments. The features of elements in images are key to identifying sample species. However, there are many difficulties in pore detection due to various challenges, such as varying shapes and sizes, ambiguous borders, and low image quality.
The quest for more accurate and effective methods, and the increase in the number of complex problems have led researchers to deep learning. Although large data sets are needed for the training process in the convolutional deep learning models, better results have been obtained in the literature than traditional methods for small data sets [40–43].
Since the resolution of the images in our dataset was low, 48x48-pixel patch size was selected for training. The network trained with small images was used for the segmentation of larger images. The image size with adequate pixel information was used for low-resolution images. In this way, deep learning methods can be used without the need for advanced equipment.
In studies with neural networks, the model performances are known to differ depending on the structure of the data set used, the size and depth of the architecture, the type of optimization methods, and activation functions used. In addition, the performance of optimization algorithms varies depending on the choice of parameters and how the neural network is structured [44].
Many of the methods proposed in the literature have often been compared with SegNet and U-Net structures. U-Net has been shown to have better results than SegNet in many studies [8, 46]. In particular, U-Net has become a standard method for image segmentation [8–11].
In this study, the proposed method performed better than the SegNet and U-Net architectures. The method combines the results obtained with and without using the pooling layer. The process performed on the pooling layer cannot be completely reversed, therefore there is a loss. The pooling operations downsample the resolution by abstracting a local area with a single value (i.e. average or maximum pooling). A gradual reduction of the spatial size of the feature map as it moves from one convolution layer to another helps to reduce the number of parameters. Another advantage is that it allows gradual identification of important features (maximum pooling is more advantageous than average pooling). Also, pooling can be used for rotation, and position invariant feature extraction. The maximum value can be at any location within the area. Pooling does not capture the position of the maximum value, so it allows the extraction of rotational/positional invariance features.
The maximum pooling process is irreversible, but we can calculate the maximum values approximately by registering the positions (max position keys), then these positions are used to reconstruct the data in the layer above (deconvolution). The SegNet encoder also uses pooling indexes in upsampling. However, instead of using pooling indexes in U-Net, all feature maps are transferred from the encoder to the decoder, then combined to perform convolution. This requires more memory by expanding the model.
One of the important issues is that reconstruction is lossy. This issue does not pose a problem for the image classification task, as classification only deals with what the image contains (not where it is located). Thus, pooling can decrease the computational burden by periodically downsampling the feature maps. However, reconstruction, i.e. upsampling, is of importance for image segmentation.
The density of wood decreases as the ratio of vessel and fibers increases in wood. Low-density wood is soft and light and has low mechanical and technological properties. Soft and light woods are preferred in furniture and veneer industries, whereas hard and dense wood is used in mining, shipbuilding, machinery industry, and railway traverses. Therefore, the amount of vessel and fibers in the wood affects the area of use. It is important to customize the segmentation phase, according to the application to determine the number of available wood cells. The advantage of deep networks has the potential to design many models for the solution of a problem.
Conclusions
This study provides the method for the segmentation of vessel and fiber cells. The method developed by combining CNN structures was compared to SegNet and U-Net architectures. The architecture was developed in two stages to eliminate the disadvantages of the pooling layer while keeping its advantages. The promising results were achieved with a few training images. IoU, accuracy, F-score measurements were used to evaluate the results. The resulting averages of test images were presented. And, the proposed method was found to perform better compared to SegNet and U-Net in the experiments. It is deep CNN developed specifically for this problem and can be extended to solve similar segmentation tasks. It will facilitate research on wood. The use of a deep learning method is of importance in microscopic wood image segmentation. Although the use of CNNs is not common in wood, the number of relevant studies is increasing. Although CNN is used in fields such as semantic partitioning, medical image processing and autonomous driving, its applications in wood images have not yet been extensively studied due to the limited data available.
