Abstract
In spite of a large volume of segmentation algorithms, it is difficult to coherently segment a series of images from the same scene. Semantic segmentation might be a solution, but it is sophisticated with heavy computing cost. Many contour detectors could locate objects accurately, but the contours are broken, and Watershed algorithm could find the subtle ridges that could connect the broken contours, but often fails for over-segmentation. We try to combine these two methods to decompose images into meaningful regions, and propose an index S-measure to measure the segmentation consistency between images. Two steps are involved: detect edges by contour detector, and fuse the broken edges by Watershed algorithm. To fuse the broken edges, two ways are proposed, dilating edges with same template and with adaptive width templates. Experiments illustrate that the decomposing is fast, effectively, meaningful, robustly, and coherently, and the segmentation is consistent between images even with some transformations.
Keywords
Introduction
As a challenge in computer vision, it has been attracting for many years to decompose image into meaningful regions due to its importance in many applications such as scene understanding, image retrieval, autopilot, and so on. Great achievements have been made over the years and some methods, e.g. active contour [1], level set [2], Chan–Vese [3], super pixel [4], regions with Convolutional Neural Network (R-CNN) [5–7], and Fully Convolutional Networks (FCN) [8] could segment image into some relatively closed regions. However, when facing truly practical problems, we find that few methods can run at a fast speed and keep a high accuracy at the same time. Some methods heavenly rely on high performance computing capability. But even though the high performance computing technology is adopted, the speed still cannot meet the requirements in applications. Furthermore the computing capability is lack in restricted environment, especially on the embedded devices. On the other hand, some methods can run in real time, but their detection accuracy is usually not satisfactory, for example, the super pixel detector. The lack of meaningful segmentation has hindered many further developments.
In this paper, we propose a novel method to decompose image fast, accurately and meaningfully. We achieve this based on two classical methods, contour detectors and Watershed algorithm. We realize that these two classical methods have some complementary advantages and try to integrate them into one. The regions surrounded by contours are marked as seed sections and Watershed algorithm moderately grows in the rest, so that the image can be segmented accurately. Although the elementary methods are not novel, the performance is remarkably improved. Experimental results show that the effect is satisfactory and we illustrate a pair of stereo images in Fig. 1. There are some transforms between these two images and few methods could decompose them into coherent regions. But the proposed method achieves this, so we affirm that the proposed method decomposes the images into meaningful regions.

The simple, fast, efficient and meaningful segmentation. From left to right: (a) the original images; (b) the segmented images by the proposed method; (c) the colorful segmented images by the proposed method.
The major contributions of this paper are summarized as follows: Watershed algorithm is proposed to fuse the intermittent contours brought by contour detectors. As everyone knows, contour detectors are fast enough, especially the Canny operator. Even the latest widely accepted method, gPb [5, 6], still adopts the Canny operator with a set of different gradient features. Only a little computing cost is added, so the proposed method is nearly as fast as Canny operator. Object boundaries adhere well to its contours. Every object is segmented by its contour and all regions are close. So, it can be thought to be meaningful and semantic. There is few over segmentation and under segmentation which is always companying with Watershed algorithm, so the segmentation is superior to super pixel.
This paper is organized as follows: Section 2 reviews the literature on contour detector, especially on the Canny detector, and Watershed algorithm; Section 3 introduces the proposed method in detail; Section 4 illustrates some experimental results and demonstrates the advantages. Some conclusions are drawn in Section 5.
The proposed method is based on two classical algorithms, a contour detector, and a Watershed algorithm, which touches abundant literature. The related methods, such as Globalized Probability of Boundary (gPb) [9, 10], Oriented Watershed Transform with Ultrametric Contour Map (OWT-UCM) [11, 12], super pixel [4], R-CNN [5–7] and FCN [8] are involved too.
Contour detectors
Many classical contour detectors, e.g. Canny operator, can obtain very precise edges. But the edges are intermittent, especially at corners [13]. The segmentations built up from locally detected contours, such as Canny operator, often mistakenly join regions due to the leak of bounding contour. So, under-segmentation always accompanies these methods, even the latest excellent algorithm, gPb [9, 10]. gPb is a simple linear combination of the probability of edges and the salient of curves in the image, partially but not completely resolved the tradeoff. So, Pablo Arbeláezrt et al. proposed OWT-UCM method to enforce closure [11, 12].
The edges detected by the Canny operator are broken, especially at corner parts. Some mathematical morphology methods are proposed to remedy this tradeoff. Xiao-qin Zhang uses a mathematical morphology method to fill in edge gaps and pores [14]. LIANG Liang proposed a morphological method to connect the broken edges and complement canny Algorithm, but the morphological method would forcibly split an integral region into several regions [15]. CHENG Pengfei used the Hough transformation to connect broken edges [16], but Hough transformation is not applicable to corners. These methods tried to improve the Canny operator and closely related to our work, but Watershed algorithm is not proposed to find the weak edges that separate the under-segmented objects.
Watershed algorithm
Watershed algorithm, can extract precise object contours and segment image into regions, but easily suffered from noise, over-segmentation and a large number of small regions [17]. Haris used Gaussian function before Watershed algorithm to filter out the noises and the high-frequency components of the image [18]. However, the Gaussian function cannot filter out the fine textures. In contrast, morphological methods, which extend flat areas while the basic shapes of the objects and the background were preserved, might improve Watershed algorithm [19]. Another approach to solving the over-segmentation is to merge similar adjacent regions in a post-process, such as OWT-UCM [11, 12]. OWT-UCM produces a set of initial regions, and then an Ultrametric Contour Map is constructed from the boundaries of these initial regions to generate a hierarchical region tree. Contours encoded in the hierarchical segmentation retain some real-valued weights, which indicate their likelihood of being a true boundary. Given a threshold, a set of closed contours is outputted as segmentation. These operations, OWT-UCM, can be seen as a generic process to solve the over-segmentation drawback. OWT-UCM method might have a better F-measure than our method, but it is more complex, slower and more inefficient [10, 11]. Furthermore, the better F-measure is partly because of the good contour-constructed method, gPb [9, 10]. Usually, several seconds will be taken by OWT-UCM method, but only one-tenth or less time cost is taken by the proposed method.
Another well-known drawback of Watershed algorithm is too slow. The other drawback well known is the heavy computing cost. It is because of growing process, which has to compare every pixel with the seed regions, but few methods are proposed to improve it. If some seeds are marked out in advance, the drawback could be partially overcome. If the seed sections are large, and the searching sections are small, Watershed algorithm will become much faster.
Super pixel
Super pixel algorithms group pixels into perceptually atomic regions, which can be used to replace the rigid structure of the pixel grid. Super pixel was once considered to be a temporary intermediate-level. But the regions produced by super pixel usually are some regular or irregular lattices which are a transition level between a pixel level and a region level, not standing at the region level. The lattices could be seen as over-segmented regions. Furthermore, the boundaries of the super pixel are inconsistent with the semantic objects. So, super pixel, an accurate decomposition, is just pre-processing step to reduce computational complexity [4].
Convolutional networks
Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification, but also making progress on local tasks with structured output. These include advances in bounding box object detection, part and keypoint prediction, and local correspondence. R-CNN [5–7], FCN [8], and so on have used convnets for semantic segmentation, in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses. FCN is trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. But just as noted above, deep learning methods heavenly rely on high performance computing capability, neither in training period nor in detecting period. Without this kind of computing hardware, convolutional networks could not be running.
Proposed method
We combine a contour detector and Watershed algorithm to decompose an image into meaningful regions. The Canny detector is selected to detect edges because of its competitive advantage in speed, and areas surrounded by Canny edges are used as seeds of Watershed algorithm. Later these seeds grow within some narrow strips, which are the dilated Canny edges, to achieve the real edges. It is based on this assumption: the edge endpoints should appear in pairs because the intermittent boundaries would locate between two objects. To fuse an intermittent boundary is to connect the point-point pair. The fusion can be handled by a simple but dogmatical way or by an adaptive but slightly more tome cost way. If the intermittent boundaries are connected with same width template, the region number of segmented image is manageable, and the running time is few. If the intermittent boundaries are connected with adaptive width templates, the region number of segmented image is fixed and the running time is more. But the segmentation is more robust and more meaningful.
The outline of the proposed method
There are three steps in the proposed method. Firstly, the image is smoothed by Gaussian function and edges are detected by Canny detector. Then edges are dilated with a disk filter. Then, there are two cases happened to broken edges according to gap widths: (1) broken edges are closed if gaps are narrower than the diameter 2r of the disk filter; (2) broken edges are still broken if gaps are wider than the diameter 2r of the disk filter, and these broken edges will be vanished at the next step. Lastly, Watershed algorithm is conducted at the dilated strips to find the final edges, with the strips as potential sections and the others as the marked seeds. We illustrate the process in Fig. 2.

The closed edges and the vanished edges. From left to right: (a) the broken edges and the gaps; (b) the dilated edges and the fused gaps; (c) the connected edges and the vanished edges. Two types’ results would happen according to the gaps: if the gap is narrower than 2r, the edges would be enforced to connect each other; if the gap is wider than 2r, the broken edges would not be connected, and the isolated edges would vanish in next process.
There are two different ways to connect broken edges. One is to dilate all edges with a same template, so gaps being narrower than the template will be filled and wide gaps will stay the same. The other is to dilate edges with different width templates which are adaptive to the corresponding gap widths.
The process which fuses all gaps with a same width template is demonstrated in Fig. 3. The original image is shown in Fig. 3a. Canny edges are shown in Fig. 3b. The dilated Canny edges are exposed in Fig. 3c. The dilated Canny edges (the white sections) are the potential sections where Watershed algorithm is conducted and the other section, the black section, are seeds of Watershed algorithm. The seeds and the potential sections are also shown in the gradient image, Fig. 3d. The regions with small gradients are seeds of Watershed algorithm and the regions with bigger gradients are regions where Watershed algorithm is conducted, just as the binary image of Fig. 3(c). Finally, Watershed algorithm is conducted from seeds to the potential sections. The segmentation and the final edges are shown in Fig. 3e. Because the final edges are firstly detected by the Canny detector and later refined by Watershed algorithm, it should be closer to the real boundaries. On the other hand, because the width of the potential section is 2r+1, and the other section is handled as seeds, there are only some very small sections to be traversed through by Watershed algorithm. So, a fast computing should be expected.

The proposed method with a same width template. From left to right: (a) the original image (b) Canny edges; (c) the dilated Canny edges with the black regions as the seeds of Watershed algorithm and the potential sections, the white regions where Watershed algorithm is conducted; (d) the gradient image with marked seeds and the potential sections; (e) the segmented image.
The regions whose widths are less than 2r+1 would be eliminated by the dilating process, so the survived regions are larger than 2r+1. Even the thin regions which are longer than 2r+1 but the width is shorter than 2r+1 would be also eliminated. So, the disk radius r defines the small regions’ scale. The smaller the radius r, the more the small regions survive, and the more the number of regions. To demonstrate this point, an image is segmented by 4 different radius r and the segmented images are illustrated in Fig. 4. There is another conclusion that increasing the radius r is equal to decreasing the resolution to a lower level. So, pyramid representation is unhelpful, and the costs to create pyramid representation, to merge regions and to project regions are cut down [20].

The segmented images with different radiuses.
To evaluate the segmentation, 4 common objective measurements are adopted: the number of segmented regions, the precision, the recall [21] and the computation time. The precision is the proportion of how many machine generated boundaries can be found in human labeled boundaries and is sensitive to over-segmentation while the recall is the proportion of how many human labeled boundaries can be found in machine generated boundaries and is sensitive to under-segmentation. The measurements are showed in Table 1. For the computation time, it took about 0.17 s for the Canny detector step, and the slowest sample took 0.3095 s. The fewer the region number, the more time would be taken. But the time cost of the slowest sample is not more than 2 times of Canny operator. So, it may be the fastest method.
The measurements with different radius
If the radius r varies from 1 to 20, the corresponding values of precision, the recall and the region number are displayed in Fig. 5.

The precision, the recall and the region number with different radius.
To illustrate the effectiveness, seven classical samples in different scenes are selected to be segmented with the proposed method in Fig. 6. These samples, a path image, a remote sensing image of a city, a bear image, a girl leaning on Great wall image, a Cambridge University image, a peppers image included in MATLAB and a road image, easily fall into the trap under segmentation and over segmentation. But our segmentations are excellent without too much under segmentation and over segmentation. Rows from left to right are the original images, Canny edge images, the binary segmentation with the proposed method and the colorful segmentation by the proposed method.

The experiments with different radiuses r. Columns from left to right: (a) the original images; (b) Canny edges; (c) the binary segmentation by the proposed method; (d) the colorful segmentations by the proposed method. Rows from up to bottom: (1) a grassland path; (2) a remote sensing image of a city; (3) a bear image; (4) a girl leaning on Great wall image; (5) a Cambridge University image; (6) a peppers image included in MATLAB; (7) a road image.
To dilate the edges and seal the intermittent boundaries with adaptive width template we have to know the gap sizes. A gap is an endpoint-endpoint pair without edge, and the distance between the endpoints is the gap size. So, we hide all edges but remain the endpoints. Then find the nearest point for every endpoint and make up endpoint-endpoint pairs, but in fact only the nearest endpoints which located on deferent edge segments make up endpoint-endpoint pairs. Besides the endpoints, cross points should be remained also. Different with endpoint, Cross point link 3 or more edge segments. For example, there are 3 cross points that are labeled as × with red color and 6 endpoints labeled as Δ with green color in Fig. 7. So, the first step of the process is to break all edges into edge segments by cross points. Then all edge segments are hidden and the endpoints and cross points are remained. Later find the nearest point for all endpoints, match these points to make up point-point pairs, connect these pairs, and dilate the connecting lines with rational widths which are a fraction of the connecting line lengths.

Edges, edge segments, cross points and end points. The edges are broken into edge segments by cross points, which connect at least 3 edge segments. Cross points are labeled as × with red color and endpoints are labeled as Δ with green color.
But there is a trap. If an endpoint is close to an edge and the nearest endpoint is further, we would connect the point with the edge rather than the nearest endpoint. Connect an endpoint with an edge, which means to connect the point with the nearest point on the edge. This is shown in Figs. 7 and 8. There are an endpoint-endpoint pair, endpoint 1 and endpoint 2, and the distance between the pair is D12, but there is an edge going through the gap. We should not connect this pair of endpoints but connect the endpoints with the nearest edge points.

The way to connect the endpoint with the nearest edge point rather than the nearest endpoint. From left to right: (a) D12 edge segment pixels are hidden when the edge segment is longer than D12; (b) the edge segment is hidden when the segment is shorter than D12; (c) the nearest edge point is located on a edge; (d) connect the points and dilate the connecting line.
Which is the nearest edge point? There is another trap that the nearest edge points are always located on the same edge segment, which would connect every endpoint to it self’s edge segment and some curled curves would come into being. So, we hide the adjacent points until the whole edge segment is hidden or the number of hidden pixels is equal to the distance between the endpoint and the corresponding nearest endpoint. We illustrate this in Fig. 8. If the edge segment is longer than D12, we hide the most adjacent D12 pixels on the segment, and the endpoint of this segment is point 3, showing in Fig. 8a. If the edge segment is shorter than D12, we hide the entire segment and only the cross point 4 is remained. This is shown in Fig. 8b. Supposing the nearest point of the endpoint 2 is the edge point 5 for these 2 cases, we connect point 2 and point 5 in Fig. 8c and dilate it in Fig. 8d.

The flow to measure gap sizes.

The pipeline of the proposed method with adaptive widths. From left to right: (a) Canny edge image with many broken edges; (b) the blocks to fuse gaps with adaptive width templates; (c) the image integrating the Canny edges and the fusing blocks; (d) the gradient image with marked seed regions; (e) the decomposed image with some broken edges; (f) the final decomposed image without broken edges.
We illustrate the process in Fig. 9 and a sample in Fig. 10. Figure 10a is the image with fragmentary Canny edges, and we try to fuse these edges by some blocks whose width are adaptive to the gap sizes. The blocks to fuse the edges are shown in Fig. 10b. Then we integrate the broken Canny edges and the fusing blocks in Fig. 10c. Then the black sections in Fig. 10c, are marked as seed regions, and the white sections are marked as potential sections. Later, the seed regions and the potential sections are transmitted to the gradient image, shown in Fig. 10d. Finally, Watershed algorithm is conducted in the potential sections of the gradient image. Some seed regions would be merged if the basin edges are too low, the fusing block will be melt and the open edges are still broken. It is shown in Fig. 10e. Fig. 10f is the final image, whose broken edges are eliminated. To illustrate details, we zoom in a Section 4 times and show it in a circle.
To validate the proposed method, we compare them with the current best segmentation, such as Ren’s algorithm and OWT-UCM. From a large number of experiments, it is found that the segmented regions are meaningful and accurate. Now that the decomposition is semantic, this method should be robust even with affine, rotation, illumination, scale and transformations. So, we validate this on some image pair with these transformations and find that these transformations have little influence on the decomposition. All these experiments are conducted in MATLAB at a computer with AMD Athlon II 2.7 GHz CPU, 2G memory.
Comparisons with the Berkeley segmentation dataset
We compare the proposed method with two algorithms, Ren’s algorithm [22] and OWT-UCM. The Ren’s algorithm and OWT-UCM have attained the top marks from the Berkeley Segmentation Dataset and Benchmark, so it is deemed to be the best ones. We compare these methods with 6 samples, including a plane image, an arc cave image, soldier image, a man image, a fishman image and a building shadow image, shown in Fig. 11. It can be seen that the proposed method is good in effect, though the other two methods might have a better F-measure. The proposed algorithm has at least two advantages: it is much more effective than the existing approaches, and the segmentation is meaningful. And an important point is that the proposed method is the fastest with the least memory. So the proposed method is simple and robust. These 6 samples come from different scenes with different features. We list the original images, the segmentations by Ren, the segmentations by OWT-UCM, the manual segmentation and the proposed method. The first 4 columns are all downloaded from the Berkeley segmentation dataset [23].

The experimental images which are decomposed with adaptive width templates, compared with the current best segmentation. Columns from left to right: (a) the original images; (b) the segmentations by Ren; (c) the segmentations by OWT-UCM; (d) manual segmentation; (e) the proposed method. Rows from up to bottom: (1) a plane image; (2) an arc cave image; (3) soldier image; (4) a man image; (5) a fishman image; (6) a building shadow image. The first 4 columns, including the original images, the Ren’s segmented images, the OWTUCM segmented images, and the manual segmentation, are all downloaded from the Berkeley segmentation dataset.
To illustrate the method’s effectiveness, we compare the time cost of the proposed method with the one of Canny operator, which is well know for its advantage on computing speed. The examples are 3 pair’s images, the house image pair which is famous and frequently used, car image pair and office image pair, shown in Fig. 12. There are some transforms in every image pair. The quantitative values are exhibited in Table 2. In Table 2 every method has 2 time costs respectively for image 1 and image 2. It took about several 0.1 seconds to find the Canny edges, and in our experiment it is usually less than 0.3 s. If we fuse the broken edges with same width template, the proposed method took less than 3 times of the ones of Canny detector. If we fuse the broken edges with adaptive width templates, the proposed method would take several seconds. Most of the time is spent to find the nearest point. If we code this program with C and optimize it, the time cost will be radically reduced. For now, it is still the fastest method. So the proposed method offers a noticeable advantage.

Some coherent segmentation examples. From left to right: (a) the original images; (b) Canny edges; (c) the segmented images by the proposed method with same width template; (d) the segmented images by the proposed method with adaptive width templates. From up to bottom: (1) house image pair; (2) cars image pair; (3) office image pair.
Comparison of the proposed method
*Every method has 2 time costs, respectively for the image 1 and 2.
The proposed method could decompose an image into meaningful regions, because Canny edges are the principal components that are causal boundaries to discern objects. So, even with some transformations, the images will be segmented consistently. We validate this on some stereo images and hope to get coherent segmentation. This has been demonstrated in Fig. 1. To measure the consistency between image pairs, we define a similarity segmentation S-measure as
N1 is the pixel number of segmented image 1, and N2 is the corresponding value of image 2 of course. NC is the common pixel number between the image pair after the transform is rectified. The S-measure is inspired and similar to the F-measure [17]. We exhibit the S-measure in Table 2. It is apparent that our method is helpful to improve the S-measure and can segment image pair consistently, especially the method with adaptive width template. So, the segmentation is meaningful, and the regions are semantic, being better than superpixel as an intermediate-level.
Case study 1: Semantic decomposition
Now that the proposed method could decompose image into meaningful regions, it is natural to be applied to retrieve image, recognize object, track object, analyze behavior and understand scene. But usually object is hierarchical, which hinders the further developments. This is also the reason why no algorithm can not segment image just as good as human. We try to describe objects in segmented image with Region Adjacency Graphs (RAG), and match these RAGs to find corresponding objects. Since the method is robust, the RAGs should be nearly-identical even with some transformations. We illustrate two RAGs of the famous photo, Lena, in Fig. 13 and match the graphic node one by one.

The RAGs of the image pair with affine transformation.
Finally, we applied the proposed framework to the case of segmenting MRI images such as brain image volumes, being very important in the field of medical image analysis. Figure 14 shows some results, including 4 original images, Canny edge images, the segmented images by the proposed method with same width template and the segmented images by the proposed method with adaptive width templates. The original images come from the MRRF library of University of Iowa Carver College of Medicine [24]. It is clear that the segmentation provides a promising result.

The segmentation of MRI images. From left to right: (a) the original images; (b) Canny edges; (c) the segmented images by the proposed method with same width template; (d) the segmented images by the proposed method with adaptive width templates.
The proposed method coherently decomposes 2 images from the same scene, and the regions are almost homonymous. To approve this, images with some transforms are illustrated to be decomposed into almost identical regions and an index S-measure is proposed to measure the consistency. It is because that contour detectors could locate objects, and Watershed algorithm could find very subtle ridges that could connect the broken contours. The advantage is that the method is very fast, effectively, and robustly. As shown in experiments, the segmentation is visually meaningful, which makes regions or object-based coding efficiently and significantly with less computational cost. So, the proposed method is a semantic segmentation with fast-speed, and better than superpixel to face with application tasks as an intermediate-level.
Footnotes
Acknowledgments
This research was sponsored by Zhejiang Provincial Public Technology Research Project of China (Project No. 2016C31117) and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry (Project No. R20150404), which are greatly appreciated by the authors.
