Abstract
PURPOSE:
Segmentation of magnetic resonance images (MRI) of the left ventricle (LV) plays a key role in quantifying the volumetric functions of the heart, such as the area, volume, and ejection fraction. Traditionally, LV segmentation is performed manually by experienced experts, which is both time-consuming and prone to subjective bias. This study aims to develop a novel capsule-based automated segmentation method to automatically segment the LV from images obtained by cardiac MRI.
METHOD:
The segmentation technique uses Fourier analysis and the circular Hough transform (CHT) to indicate the approximate location of the LV and a capsule network to precisely segment the LV. The neurons of the capsule network output a vector and preserve much of the information about the input by replacing max-pooling layers with strided convolutions and dynamic routing. Finally, the segmentation result is postprocessed by threshold segmentation and morphological processing to increase the accuracy of LV segmentation.
RESULTS:
We fully exploit the capsule network to achieve the segmentation goal and combine LV detection and capsule concepts to complete LV segmentation. In the experiments, the tested methods achieved LV Dice scores of 0.922±0.05 at the end-diastolic (ED) phase and 0.898±0.11 at the end-systolic (ES) phase on the ACDC 2017 data set. The experimental results confirm that the algorithm can effectively perform LV segmentation from a cardiac magnetic resonance image. To verify the performance of the proposed method, visual and quantitative comparisons are also performed, which show that the proposed method exhibits improved segmentation accuracy compared with the traditional method.
CONCLUSIONS:
The evaluation metrics of medical image segmentation indicate that the proposed method in combination with postprocessing and feature detection effectively improves segmentation accuracy for cardiac MRI. To the best of our knowledge, this study is the first to use a deep learning model based on capsule networks to systematically evaluate end-to-end LV segmentation.
Abbreviations
CVD: Cardiovascular disease
LV: Left ventricle
CT: Computed tomography
CMRI: Cardiovascular magnetic resonance imaging
SVM: Support vector machine
FCN: Fully convolutional network
2D: Two-dimensional
3D: Three-dimensional
CHT: Circular Hough transform
FT: Fourier transform
Introduction
An estimated 17.9 million people died from cardiovascular disease (CVD) in 2016, representing 31% of all deaths globally [1]. The American Heart Association (AHA) estimates that life expectancy could be extended by 10 years if CVD were effectively prevented and treated. In contrast, life expectancy could be extended by only three years even if all cancers were effectively treated. CVD has gradually become the leading cause of death in recent years, and the proportion of deaths attributable to CVD is increasing. Therefore, early quantitative diagnosis and risk assessment of CVD play a key role in extending human life expectancy and improving human quality of life. The development of medical imaging technology has provided options for diagnosing CVD, including computed tomography (CT) [2] and cardiovascular magnetic resonance imaging (CMRI) [3]. Each type of imaging has advantages and disadvantages. Magnetic resonance imaging (MRI) is the most widely used technique because it is noninvasive and provides high-resolution images [4]. For the development of cardiovascular disease treatments, short-axis cine MRI has been used as a standard technique for understanding the global structural and functional characteristics of the heart by deriving clinical indicators, such as ventricular volume, stroke volume and ejection fraction [5]. Accurate assessments are based on the precise segmentation of the left ventricle (LV) [6]. However, relying on traditional manual LV segmentation by medical experts is a time-consuming process that is prone to error and rater variability [7]. Therefore, a fully automated method for LV segmentation is desirable.
Machine learning algorithms, such as deep neural networks, possess tremendous potential for matching human-level performance on generic and challenging recognition tasks [8, 9]. Machine learning methods for the automatic segmentation of LV MR images include the support vector machine (SVM), fully convolutional network (FCN) [10], and U-net. Although segmentation via SVM achieves better results than other methods when the training sample set is small, it is not suitable when the imaging changes substantially. FCN is a pioneering deep learning network in image segmentation that can achieve end-to-end segmentation. FCN classifies individual pixels and thus operates at the level of image semantic segmentation [11, 12], and it can accept an input image of arbitrary size [13]. Compared with conventional convolutional neural networks, FCNs use convolutional layers instead of fully connected layers. However, FCNs have disadvantages as well, the primary ones being that large-scale data are required for training and that the segmentation results provide insufficient detail. The U-net architecture with data augmentation has been very successful for medical image segmentation and requires only a small training sample set [14]. In recent years, many state-of-the-art LV segmentation methods based on deep learning have substantially surpassed the previous methods in performance. Deep learning methods can be roughly divided into two classes: two-dimensional (2D) methods, which segment each slice independently [15, 16]; and three-dimensional (3D) methods, which segment multiple slices together as a volume [17]. Gustavo et al. proposed using deep belief networks, a 2D method, to identify the LV region of interest and fit a spline to its edges [18]. Dangi et al. proposed a CNN-based multi-task learning approach to simultaneously perform LV segmentation and cardiac indices estimation [19]. A method based on the FCN algorithm for LV segmentation has also been proposed [20].
A new image recognition architecture proposed by Sabour et al. [21], called a capsule network, has shown great potential in digit recognition and image classification. This type of network is also promising for segmentation tasks [22]. A capsule is a group of neurons that output a vector [23]. The length of the vector represents the probability of the existence of the object, and the direction of the vector represents the instance parameters [21, 22]. Capsule networks can provide equivariant mapping, which means that they can preserve position and pose information (e.g., location, rotation, and thickness). A capsule network can retain information by replacing max-pooling layers with strided convolutions and dynamic routing; thus, we believe that the capsule network warrants further exploration and that capsule network-based LV segmentation has great potential.
In this study, we propose LV detection and error correction to improve the segmentation detail in capsules for object segmentation. This study has three contributions. First, as far as we know, this is the first study to use the capsule network for LV segmentation from CMRI. Second, a method that combines feature detection and SegCaps has been proposed to improve segmentation performance. Finally, in terms of speed, the introduced method can analyze the short axis of a subject in a matter of seconds, thus saving manpower and time and improving the diagnostic efficiency of doctors. We analyze and compare the results of the research method and the original algorithm on the MICCAI ACDC data set [6].
The remainder of this paper is organized as follows. Section 2 describes the feature detection of the LV and the segmentation framework and outlines the experimental parameters, the experimental setup and the data processing. Section 3 reports the study with the experimental data, the quantitative evaluation parameters, and the evaluation results. Section 4 presents the discussion and conclusions.
Materials and methods
Figure 1 illustrates our automated LV segmentation framework, which consists of three modules. The first module is composed of a Fourier analysis and the circular Hough transform (CHT) for LV detection [6]. The second module is the segmentation algorithm based on the capsule network. Using the labeled LV images, the algorithm continuously learns features and updates the parameters to classify pixels during the training process. This module outputs a segmentation probability map. Based on the probability map obtained from the second module, the third module performs the postprocessing operation, which consists of threshold segmentation and morphological processing. In the following sections, we provide details about feature detection, the segmentation algorithm and the postprocessing operations.

Figure 1. Proposed framework for automated left ventricular segmentation.
The LV detection step is based on the combination of the Fourier transform (FT) [24, 25] in the temporal domain and the CHT [6, 26], as illustrated in Fig. 2. This step returns the location and approximate region of the LV in the cardiac magnetic resonance (CMR) image. This information can be used as input for higher-level segmentation and to reduce the time of the fully automatic planning of CMR examinations [24].

Figure 2. Overview of left ventricular detection.
Because the heart beats periodically, the intensity at each pixel position within the heart varies over a large range over time, and this characteristic makes the heart distinguishable from other structures. Therefore, we compute only the FT and the first-harmonic (H1) image. The short-axis cardiac MR images of a slice cover an entire cardiac cycle, and each slice image sequence can be viewed as a 2D signal varying over time, that is, as a 3D array f(t, x, y). Therefore, we perform a 3D FT on the image sequence of each slice. The discrete FT F(T, u, v) of a 3D array f(t, x, y) is defined as
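In its standard unnormalized form, writing N_t for the number of frames and N_x × N_y for the image size (these symbols are assumed here, since the text does not name them), the transform reads:

```latex
F(T, u, v) = \sum_{t=0}^{N_t-1} \sum_{x=0}^{N_x-1} \sum_{y=0}^{N_y-1}
  f(t, x, y)\,
  \exp\!\left[-2\pi i \left(\frac{T t}{N_t} + \frac{u x}{N_x} + \frac{v y}{N_y}\right)\right]
```

Under this convention, T indexes the temporal frequency, so T = 1 corresponds to one oscillation per cardiac cycle.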
After the 3D FT, we can obtain the H1 image (Fig. 2b) by applying an inverse fast Fourier transform (FFT) to the first harmonic of the FFT, because cardiac motion is concentrated at the frequency of the heart cycle.
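The H1 computation described above can be sketched with NumPy. This is a minimal illustration, not the authors' code: the function name `h1_image`, the (T, H, W) array layout, and the use of the magnitude of the inverse transform are assumptions.

```python
import numpy as np

def h1_image(stack):
    """First-harmonic (H1) magnitude image of one short-axis slice.

    stack: array of shape (T, H, W), one frame per cardiac phase (assumed layout).
    """
    spectrum = np.fft.fftn(stack)                 # 3D FT over (t, x, y)
    first_harmonic = spectrum[1]                  # keep temporal frequency index 1
    return np.abs(np.fft.ifft2(first_harmonic))   # back to image space

# Synthetic slice: static background plus a block pulsating once per cycle.
T, H, W = 20, 32, 32
t = np.arange(T)
stack = np.full((T, H, W), 100.0)
stack[:, 12:20, 12:20] += 50.0 * np.cos(2 * np.pi * t / T)[:, None, None]
h1 = h1_image(stack)
```

In this synthetic example the static background cancels out of the first harmonic, whereas the pulsating block keeps a large response, which is the property the detection step relies on.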
The CHT is a feature extraction technique for detecting circles [27]. The CHT algorithm can be summarized in four steps: (1) find edges by using Canny edge detection; (2) for each edge point, draw a circle of radius r centered on that point and increment every accumulator cell that the perimeter of the circle passes through; (3) find one or several maxima in the accumulator; and (4) map the parameters (r, a, b) corresponding to each maximum back to the image. The center coordinates and radii of many Hough circles are obtained by Canny edge detection and the CHT. Only the P highest scoring Hough circles are retained, where P is a hyper-parameter. Finally, the most likely center of the LV is determined by a Gaussian kernel function. The maximum value of the LV likelihood surface is selected as the center of the LV, and the image is cropped to a fixed size (128×128). The Gaussian function can be defined as
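The Gaussian expression is not reproduced in this text, so the sketch below assumes, purely for illustration, a likelihood surface formed by summing isotropic Gaussians of a hypothetical width `sigma` centered on the P retained circle centers. It also assumes that the Canny edge map from step (1) is already available as a boolean array. The names `circular_hough`, `top_circles` and `lv_center` are illustrative and are not taken from the authors' implementation.

```python
import numpy as np

def circular_hough(edges, radii, n_angles=90):
    """Step (2): vote for circle centers. edges is a boolean (H, W) edge map."""
    h, w = edges.shape
    acc = np.zeros((len(radii), h, w), dtype=np.int64)
    ys, xs = np.nonzero(edges)
    theta = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    for k, r in enumerate(radii):
        # Trace a circle of radius r around every edge point and count hits.
        a = np.rint(xs[:, None] + r * np.cos(theta)).astype(int)
        b = np.rint(ys[:, None] + r * np.sin(theta)).astype(int)
        ok = (a >= 0) & (a < w) & (b >= 0) & (b < h)
        np.add.at(acc[k], (b[ok], a[ok]), 1)
    return acc

def top_circles(acc, radii, p):
    """Steps (3)-(4): the p highest-scoring (r, a, b) triples."""
    order = np.argsort(acc, axis=None)[::-1][:p]
    k, b, a = np.unravel_index(order, acc.shape)
    return [(radii[i], int(x), int(y)) for i, x, y in zip(k, a, b)]

def lv_center(circles, shape, sigma=3.0):
    """Assumed likelihood: sum of isotropic Gaussians at the circle centers."""
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    surface = np.zeros(shape)
    for _, a, b in circles:
        surface += np.exp(-((xx - a) ** 2 + (yy - b) ** 2) / (2.0 * sigma ** 2))
    row, col = np.unravel_index(np.argmax(surface), shape)
    return int(col), int(row)

# Synthetic edge map: one circle of radius 10 centered at (a, b) = (30, 25).
phi = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
edges = np.zeros((64, 64), dtype=bool)
edges[np.rint(25 + 10 * np.sin(phi)).astype(int),
      np.rint(30 + 10 * np.cos(phi)).astype(int)] = True
radii = [8, 10, 12]
circles = top_circles(circular_hough(edges, radii), radii, p=5)
center = lv_center(circles, edges.shape)
```

The returned `center` would then serve as the midpoint of the fixed-size crop. In practice an optimized library routine for the Hough transform would likely replace the explicit voting loop.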
Convolutional and deconvolutional capsule networks were proposed in SegCaps, and the results showed a substantial decrease in the number of parameters for the segmentation task [22]. The primary capsule layer is the first capsule layer. The capsules replace the scalar-output feature detectors of CNNs with vector-output capsules. In each convolutional capsule layer, there is a set of capsule types T = {t_1, t_2, …, t_n | n ∈ N}. For every type, C = {c_11, …, c_1w, …, c_h1, …, c_hw} is the h×w grid of z-dimensional capsules.
At layer l + 1, every capsule
The deconvolution operation is an algorithm-based process used to mathematically reverse the effects of convolution on recorded data; this process is widely used in image segmentation [28, 29]. In the deconvolutional capsule layer, the “deconvolutional” capsules use transposed convolutions routed by a locally constrained dynamic routing algorithm.
As illustrated in Fig. 3, the explored architecture is similar to U-net but replaces the convolution and pooling layers with convolutional capsule layers and the deconvolution layers with deconvolutional capsule layers. The former form the “contracting” stage, and the latter form the “expanding” stage. The contracting path consists of a number of convolutional capsule layers that extract image features. Each convolutional capsule layer uses a 5×5 kernel. After every convolutional capsule layer, the feature map is downsampled by a convolutional capsule layer with a stride of 2 to allow the network to learn features at a more global scale. Every step in the expanding path consists of upsampling the feature map with a 4×4 deconvolutional capsule layer that halves the number of feature channels, concatenation with the correspondingly cropped feature map from the contracting path, and a 5×5 convolutional capsule operation. Finally, one convolutional layer with a 1×1 kernel followed by a soft-max function is used to predict a probabilistic label map. The segmentation is determined at each pixel by the label class with the highest soft-max probability. The weighted binary cross-entropy (BCE) loss between the predicted binary mask and the manually annotated label map is used as the loss function [22]. The input to the network is a 128×128-pixel image, which in this case is a slice of an MRI scan. This image is passed through a 2D convolutional layer that produces 16 feature maps of identical size, which are reshaped into a four-dimensional (128×128×1×16) tensor. The tensor is the input of the primary capsule layer. This network has a total of 16 layers, including 4 convolutional layers, 3 deconvolutional capsule layers and 9 convolutional capsule layers.
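The weighted BCE loss mentioned above can be illustrated with a short NumPy sketch. The weighting scheme is an assumption: here a single hypothetical factor `pos_weight` scales the foreground (LV) term, which is one common way to counter the dominance of background pixels. The text does not state how the weights were chosen.

```python
import numpy as np

def weighted_bce(y_true, y_prob, pos_weight=1.0, eps=1e-7):
    """Mean weighted binary cross-entropy over all pixels.

    pos_weight is a hypothetical foreground weight, not a value from the paper.
    """
    p = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    loss = -(pos_weight * y_true * np.log(p)
             + (1.0 - y_true) * np.log(1.0 - p))
    return float(loss.mean())

y_true = np.array([1.0, 0.0])
y_prob = np.array([0.5, 0.5])
plain = weighted_bce(y_true, y_prob)                     # unweighted BCE
weighted = weighted_bce(y_true, y_prob, pos_weight=2.0)  # foreground counted twice
```

With `pos_weight` above 1, errors on LV pixels cost more than errors on background pixels, which matters because the LV occupies only a small part of each 128×128 crop.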

Figure 3. Overview of the network architecture.
The network output is a probability map that specifies, for each pixel, the probability of belonging to the target. A threshold must be applied to this map; it is obtained using the Otsu adaptive threshold algorithm [30], which divides the probability map into two classes with minimum within-class variance. The Otsu adaptive threshold algorithm is a classical image segmentation method that has been widely applied in image processing [30]. In the first step of the method, the threshold obtained by the Otsu algorithm is applied to the network output to produce a binary mask: pixels larger than the threshold are set as targets, and those smaller than the threshold are set as the background. In the second step, connected component analysis is performed. The connected areas are marked (two pixels are considered to be in the same connected area if they are adjacent and have the same value in the binary image). All pixels in a connected area are marked by the same value, called the “connected area mark.” The connected areas are sorted according to their size. The pixels of the largest connected area are set to 1, and all others are set to 0. In the last step, morphological operations are performed, such as binary hole-filling inside the ventricular cavity.
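The three postprocessing steps can be sketched as follows. This is a minimal illustration using NumPy and `scipy.ndimage`, not the authors' implementation: the histogram-based Otsu routine, the 256-bin resolution, and the default 4-connectivity of `ndimage.label` are assumptions.

```python
import numpy as np
from scipy import ndimage

def otsu_threshold(prob, bins=256):
    """Threshold in [0, 1] that maximizes the between-class variance."""
    hist, edges = np.histogram(prob, bins=bins, range=(0.0, 1.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)                      # pixel count in bins 0..k
    w1 = w0[-1] - w0                          # pixel count in bins k+1..end
    s0 = np.cumsum(hist * centers)
    mu0 = s0 / np.maximum(w0, 1)              # class means (guard empty classes)
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)
    k = np.argmax(w0 * w1 * (mu0 - mu1) ** 2)
    return edges[k + 1]

def postprocess(prob):
    """Otsu threshold, keep the largest connected area, fill holes."""
    binary = prob > otsu_threshold(prob)
    labels, n = ndimage.label(binary)         # default: 4-connectivity in 2D
    if n == 0:
        return binary
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                              # ignore the background label
    largest = labels == np.argmax(sizes)
    return ndimage.binary_fill_holes(largest)

# Synthetic probability map: LV-like blob with a hole, plus a small false positive.
prob = np.full((64, 64), 0.05)
prob[20:32, 20:32] = 0.9
prob[25:27, 25:27] = 0.05
prob[50:52, 50:52] = 0.8
mask = postprocess(prob)
```

Maximizing the between-class variance is equivalent to minimizing the within-class variance, so the routine matches the Otsu criterion described above. In the synthetic example the small false positive is discarded and the hole inside the blob is filled.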
Experimental results
Data
The adopted dataset was sourced from the MICCAI ACDC challenge 2017 training phase dataset. The challenge data contain 100 patients along with corresponding manual labels based on an analysis by a single clinical expert. The challenge images were acquired from breath-hold MRI with retrospective or prospective gating and with an SSFP sequence in the short-axis orientation. In particular, a series of short-axis slices covered the LV from the base to the apex with a thickness of 5 mm (or sometimes 8 mm). The spatial resolution ranged from 1.37 to 1.68 mm²/pixel, and 28 to 40 images were obtained that completely or partially covered the cardiac cycle (in the second case, with prospective gating, only 5 to 10% of the end of the cardiac cycle was omitted), all depending on the patient.
The training data from the ACDC dataset were used throughout the experiments, and we selected all slices of each subject for experimentation. From the training challenge data, 100 subjects were arbitrarily divided into the training set, validation set and test set.
Evaluation
We used four traditional metrics to evaluate the segmentation accuracy of the network, namely, the Dice metric, Jaccard coefficient, average symmetric surface distance and Hausdorff distance. The Dice metric is defined as the overlap ratio of automated segmentation (A) and manual segmentation (B). The Jaccard coefficient is the intersection of A and B divided by the union of A and B. The Dice metric and Jaccard coefficient are defined in Equations (6) and (7), respectively, as
$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \tag{6}$$
$$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{7}$$
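Both overlap metrics are straightforward to compute on binary masks. The sketch below is a minimal NumPy illustration; returning 1.0 when both masks are empty is a convention assumed here, not one stated in the text.

```python
import numpy as np

def dice(a, b):
    """Dice metric: twice the overlap divided by the sum of the two areas."""
    a, b = a.astype(bool), b.astype(bool)
    total = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / total if total else 1.0

def jaccard(a, b):
    """Jaccard coefficient: intersection divided by union."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# Two 10x10 squares offset by two pixels: 80 shared pixels out of 100 each.
A = np.zeros((32, 32), dtype=bool)
B = np.zeros((32, 32), dtype=bool)
A[10:20, 10:20] = True
B[10:20, 12:22] = True
d, j = dice(A, B), jaccard(A, B)
```

The two metrics are monotonically related by J = D / (2 − D), so they always rank segmentations in the same order; reporting both mainly eases comparison with other studies.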
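Let S(A) denote the set of surface voxels of A. The shortest distance of an arbitrary voxel v to S(A) is defined as
$$d(v, S(A)) = \min_{s_A \in S(A)} \lVert v - s_A \rVert \tag{8}$$
where ‖·‖ denotes the Euclidean distance.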
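The average symmetric surface distance (ASSD) and Hausdorff distance (HD) are defined in Equations (9) and (10), respectively, as
$$\mathrm{ASSD}(A, B) = \frac{1}{|S(A)| + |S(B)|} \left( \sum_{s_A \in S(A)} d(s_A, S(B)) + \sum_{s_B \in S(B)} d(s_B, S(A)) \right) \tag{9}$$
$$\mathrm{HD}(A, B) = \max \left\{ \max_{s_A \in S(A)} d(s_A, S(B)),\ \max_{s_B \in S(B)} d(s_B, S(A)) \right\} \tag{10}$$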
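A minimal NumPy sketch of both surface distances follows. It makes several assumptions: the masks are 2D and non-empty, surface pixels are those with at least one 4-neighbor outside the mask, and distances are measured in pixels (multiplying by the pixel spacing would convert them to millimeters). The helper names are illustrative.

```python
import numpy as np

def surface_points(mask):
    """Coordinates of mask pixels that have a 4-neighbor outside the mask."""
    m = np.pad(mask.astype(bool), 1)
    interior = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                & m[1:-1, :-2] & m[1:-1, 2:])
    return np.argwhere(mask.astype(bool) & ~interior)

def surface_distances(a, b):
    """Return (ASSD, HD) between two non-empty binary masks, in pixels."""
    sa, sb = surface_points(a), surface_points(b)
    # Pairwise Euclidean distances between the two surface point sets.
    dist = np.linalg.norm(sa[:, None, :] - sb[None, :, :], axis=2)
    d_ab = dist.min(axis=1)     # each point of S(A) to its nearest point of S(B)
    d_ba = dist.min(axis=0)     # each point of S(B) to its nearest point of S(A)
    assd = (d_ab.sum() + d_ba.sum()) / (len(sa) + len(sb))
    hd = max(d_ab.max(), d_ba.max())
    return float(assd), float(hd)

# Two 10x10 squares offset by two pixels along one axis.
A = np.zeros((32, 32), dtype=bool)
B = np.zeros((32, 32), dtype=bool)
A[10:20, 10:20] = True
B[10:20, 12:22] = True
assd, hd = surface_distances(A, B)
```

The brute-force pairwise distance matrix is adequate for 128×128 crops; for larger images or volumes, a distance transform would be the more economical choice.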
Experiments were conducted on the ACDC dataset, which was randomly split into four training/validation/testing sets for the experiment. Throughout the entire experiment, 1260 images were used in the training set and 460 images were used in the testing set. To increase the amount and variability of the training data, we applied random translation, rotation, scaling, and shear to the original images. For example, the original image was rotated by a random angle within a range of 45 degrees and spatially shifted within a range of twenty percent of the length or width. The network used the Adam optimizer to optimize the loss function, the number of iterations was 10,000, and an initial learning rate of 0.001 was adopted. If the model performance did not improve over 5 epochs, the learning rate was scaled by a factor of 0.05. Training of the capsule networks was run on an NVIDIA GeForce GTX 1080Ti GPU.
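The rotation and translation part of the augmentation can be sketched with `scipy.ndimage`. Several details are assumptions: the rotation range is read as ±45 degrees and the shift range as ±20% of each image dimension, scaling and shear are omitted because their ranges are not stated, and the label map is warped with nearest-neighbor interpolation so that it stays binary.

```python
import numpy as np
from scipy import ndimage

def augment(image, label, rng, max_angle=45.0, max_shift=0.2):
    """Apply one random rotation and translation to an image/label pair."""
    angle = rng.uniform(-max_angle, max_angle)                 # degrees
    shift = rng.uniform(-max_shift, max_shift, size=2) * np.array(image.shape)

    def warp(arr, order):
        out = ndimage.rotate(arr, angle, reshape=False, order=order,
                             mode="constant")
        return ndimage.shift(out, shift, order=order, mode="constant")

    # Linear interpolation for intensities, nearest neighbor for labels.
    return warp(image, 1), warp(label, 0)

rng = np.random.default_rng(0)
image = rng.random((128, 128))
label = np.zeros((128, 128), dtype=np.uint8)
label[48:80, 48:80] = 1
aug_image, aug_label = augment(image, label, rng)
```

Using the same sampled `angle` and `shift` for both arrays keeps the label aligned with the transformed image.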
Segmentation results
Figure 4 illustrates the results of segmentation from our method at the ED (end-diastolic) and ES (end-systolic) phases of the cardiac cycle on different slices. For the rows from top to bottom, the Dice coefficient of segmentation was 0.947, 0.943, 0.895, 0.912, 0.928, and 0.861. The errors were mostly concentrated in the apical slices. These results show that the developed method of automated segmentation achieved accurate results on most of the slices, although the segmentation at the apex slices was occasionally incorrect.

Figure 4. Segmentation results at both the ED and ES phases of the cardiac cycle on a subset of the ACDC training set reserved for testing. The columns from left to right show the input images, the segmentations generated by our method and the corresponding ground truths. The rows from top to bottom show short-axis slices of the heart at the basal, mid and apex regions.
In this section, we analyze the effects of postprocessing on the segmentation results. Figure 5 shows the sample results based on postprocessing and the networks: (a) shows the network inputs, (b) shows the network outputs, and (c) shows the postprocessing results. The output of the segmentation network is the result of a binary category classification of image pixels, which divides the image into background and target pixels. In this paper, the target refers to the LV to be segmented and the background is the non-LV portion. In the second column, some white pixels exist around the target and the structure around the LV is misclassified. The second and third rows show a few organs similar to the LV segment that are incorrectly segmented.

Figure 5. Illustration of the results for the architecture output and the corresponding sample postprocessing results: (a) network input; (b) network output; and (c) postprocessing of the network output.
The network output results show that the segmentation algorithm worked properly overall, although a few segmentation errors remained. Such errors must be corrected by postprocessing, which can improve segmentation accuracy.
Figure 6 allows a comparison of the segmentation results between our method and SegCaps on different slices. For the apex, mid and base slices, our Dice index was 0.959, 0.969, and 0.722, respectively, whereas the SegCaps Dice index was 0.885, 0.907, and 0.655, respectively. The first row shows that SegCaps is subject to LV under-segmentation and over-segmentation.

Figure 6. One patient’s (a) basal slice, (b) mid slice, and (c) apex slice. The first row is the SegCaps segmentation result, the second row is the result of our developed method, and the third row is the ground truth.
Table 1 provides the results of the two methods on the test dataset, which was retained from the ACDC training set for testing. Both methods segmented all slices. Based on the four evaluation measures, our developed method is superior to SegCaps in LV MR image segmentation. Approximately 460 images from 30 patients were used in the test set, and the values in Table 1 represent the mean and standard deviation.
Table 1. Comparison of the performance of the four architectures: (1) the original SegCaps, (2) SegCaps with postprocessing (SegCaps_Post), (3) SegCaps with feature detection (SegCaps_Dec), and (4) SegCaps with feature detection and postprocessing (Ours). Paired t-tests were performed to compare architectures (4) and (2) and architectures (3) and (4).
In this study, we developed a segmentation framework for MR images from limited training data based on a capsule network with dynamic routing. The framework was validated by experiments performed with real data. Feature detection and error correction play very important roles in our proposed segmentation method. The framework based on the combination of feature detection with a capsule network produces clearly improved accuracy for LV segmentation compared with SegCaps on the 2017 ACDC dataset, as assessed by traditional segmentation metrics. Experimentally, we used more than 1100 cardiac MR images from the 2017 ACDC dataset for training, indicating the potential use of the framework in MR image segmentation given limited training data. In terms of speed, the introduced method can analyze the short axis for one subject within a few seconds.
In this study, we also demonstrated the utility and efficacy of a capsule network architecture based on SegCaps with feature detection for LV MR segmentation. In the experiment, we did not compare all current methods. Although this was a preliminary study aimed at exploring the feasibility of capsule-based LV segmentation, the results demonstrate the potential of the capsule network for LV segmentation. In future research, we will further explore the capsule network. First, we will expand the segmentation from 2D to 3D networks to study the segmentation performance of capsule 3D networks. Second, we will expand the segmentation of capsule networks from single objects to multiple targets. Finally, we will study the differences between capsule network segmentation and convolutional network segmentation. We believe that the proposed approach for ventricular segmentation will help clinicians diagnose cardiovascular disease and the capsule-based model has substantial potential for image detection and segmentation.
Finally, this study aimed only to improve the capsule network and apply it to LV segmentation, and we did not compare the results with published results on LV segmentation in MRI images. Furthermore, we did not include clinical measures or detect multiple objects. In future research, we will compare the performance of the proposed method with state-of-the-art methods and explore the development of more detailed methods based on capsules for analyzing CMR images, such as multi-target segmentation and the analysis of clinical measures.
Conclusions
In this study, a simple and effective method of refining segmentation performance was proposed for LV MRI segmentation. To the best of our knowledge, this study is the first to use a deep learning model based on SegCaps to perform end-to-end LV segmentation. The main contributions of this study are three-fold. (i) This study demonstrates the feasibility of segmenting the LV based on a capsule network. (ii) A segmentation architecture that combines feature detection with a capsule network is proposed, and postprocessing is added to improve the accuracy of segmentation. Specifically, feature detection uses FT and CHT to identify the LV from CMRI. (iii) LV segmentation based on this method can be performed more rapidly than traditional manual segmentation and improves diagnostic efficiency.
Funding
This study was supported by the National Natural Science Foundation of China (81871441, 61901463), the Shenzhen International Cooperation Research Project of China (GJHZ20180928115824168), the Guangdong International Science and Technology Cooperation Project of China (2018A050506064), the Natural Science Foundation of Guangdong Province in China (2017A030313743), the Guangdong Special Support Program of China (2017TQ04R395), and the Shenzhen Science and Technology Program of China (JCYJ20170413161350892, JCYJ20170818160306270).
