Abstract
In recent years, much attention has been paid to the electronic cluster eye (eCley), a new type of artificial compound eye, because of its small size, wide field of view (FOV) and sensitivity to moving objects. An eCley is composed of a number of optical channels organized as an array. Each optical channel spans a small, fixed FOV. To obtain a complete image with a full FOV, the images from all the optical channels must be fused together. The parallax between unparallel neighboring optical channels in eCley may blur the reconstructed image and lead to incorrectly estimated depth. To solve this problem, this paper proposes a geometry based three-dimensional image processing method (G3D) for eCley to obtain a complete focused image and a dense depth map. In G3D, we derive the geometry relationship of the optical channels in eCley to obtain the mathematical relation between parallax and depth among unparallel neighboring optical channels. Based on this geometry relationship, all of the optical channels are used to estimate the depth map and reconstruct a focused image. Subsequently, an edge-aware interpolation method is used to obtain a sharply focused image and a dense depth map. The effectiveness of the proposed method is verified by experimental results.
Introduction
Modern robots are progressively becoming more realistic and similar to humans [23], and are being asked to solve tasks such as path planning [58], path following [18], trajectory tracking [49], and reproducing the behavior of human limbs [24].
A fundamental role in robotics is played by Artificial Intelligence [55], Computational Modelling [32, 57], and Metaheuristic Optimisation. Algorithms belonging to the last category are usually inspired by nature. Some recently proposed examples draw inspiration from music [41, 40], physics [43, 42, 44], and swarm behavior [34, 37, 36]. Artificial Intelligence is often integrated within robot controllers [45, 28] and combined with neural systems for Machine Intelligence [19, 2, 13].
The eye of a robot is a very important and complex device, since it is one of the sensors that allow a robot to make decisions and thus be autonomous. Several problems arise in relation to the eyes of robots. Some are related to the design of the eye itself, e.g. retina modelling [25], while others are related to the interpretation of what the cameras incorporated in the eyes capture. Examples include image segmentation [3], feature extraction [47, 35], image recognition [7, 26], and image processing in medical applications [16, 17].
The present paper focuses on robots’ eyes and, more specifically, on artificial compound eyes. Inspired by natural compound eyes, an artificial compound eye is manufactured as an alternative imaging system that miniaturizes the optical system [48]. Owing to merits such as small size, wide field of view (FOV) and sensitivity to moving objects, much attention has been paid to artificial compound eye systems in recent years [51]. Inspired by a wasp parasite called Xenos Peckii, the electronic cluster eye (eCley) [5] was introduced by Brückner et al. in 2010 as a new type of artificial compound eye. Details of the optical module, the working principle and the manufacturing process can be found in [5]. The camera module of eCley is shown in Fig. 1 and its basic system parameters are listed in Table 1.
Parameters of the eCley system in [5]
The eCley module (the one cent coin is used for the size comparison [5]).
An eCley is composed of a number of optical channels organized as an array. The array of microlenses is used for imaging. As shown in Fig. 2, each imaging channel captures only a small part of the full FOV, so the FOV of each imaging channel is small and the FOVs of neighboring channels partially overlap. With this structure, the total track length is only about 1.4 mm, which is much shorter than that of a single-aperture camera [1]. In addition, since the focal length of eCley is only 0.778 mm, the eCley can easily focus on very close objects. Therefore, eCley can be used for the navigation of micro-robots and for internal examination based on micro capsule robots. However, the captured image is composed of a series of separate, small-FOV, low-resolution images. It is hard for a robot to make decisions based on the small-FOV, low-resolution image of a single channel or even of several channels, so post-processing is required to obtain a complete high-resolution image.
In the eCley system, a complete image with a full FOV is obtained by fusing the images from all the optical channels. The principal optic axes of optical channels are considered to be unparallel so as to gain a larger FOV than the parallel case on condition that the overall size of eCley is not changed. Each channel has its inherent offset against its neighboring channel. Thus, the parallaxes of objects at the same distance are not identical in different channels. The parallax from unparallel neighboring optical channels in eCley may lead to reconstructed image blurring and incorrectly estimated depth.
(a) The captured image. (b) The extracted images after flat field correction. (c) Reconstructed image focused on a single distance. (d) The reconstructed image of our method. (e) The depth map of our method.
To solve this problem, a geometry based three-dimensional image processing method (G3D) for eCley is proposed to obtain a complete focused image and dense depth map. The main idea of G3D is the introduction of the geometry relationship of optical channels for eCley. Based on the geometry relationship, the mathematical relation between the parallax and depth among unparallel neighboring optical channels can be derived. Further, a sharply focused image and a depth map can be gained. Compared with the methods in [30, 31, 59], G3D has three main advantages:
(1) G3D can reconstruct a complete all-in-focus image.
(2) Based on the geometry relationship derived, all channel images can be directly used to estimate the depth of an object.
(3) G3D can reach sub-pixel precision in principle, because the matching is based on the space coordinates instead of the pixel shift.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the geometry based three-dimensional image processing method. Experimental results are shown in Section 4. Finally, conclusions are drawn in Section 5.
This study discusses a three-dimensional image processing method for eCley that covers object depth estimation and image reconstruction. This section therefore starts with brief introductions to object depth estimation and image reconstruction in computer vision, and then surveys the related work on artificial compound eyes.
Object depth estimation is one of the most active research topics in computer vision, see [56, 6]. Depending on the setting of the system parameters, two different methods, depth from defocus (DFD) [52] and stereo matching [38, 11], are widely studied. DFD attempts to extract depth information from multiple images (at least two) captured at different focus settings [4], while stereo matching tries to obtain depth from multiple images captured at different positions. A performance analysis and comparison of DFD and stereo matching algorithms was presented in [39]. Moreover, many public test images can be found in the web site by the link:
Meanwhile, super-resolution (SR) image reconstruction has also been a very attractive research area over the past two decades. An SR reconstruction algorithm obtains one or multiple high-resolution images from one or multiple low-resolution images. In general, SR algorithms fall into four main categories: the nonuniform interpolation approach, the frequency domain approach, the regularized SR reconstruction approach and the projection onto convex sets approach [33, 27].
In the area of artificial compound eyes, apart from system design, image reconstruction and depth estimation are two active research areas [51]. In SR reconstruction, Kitamura et al. proposed a pixel rearrangement method to reconstruct the high-resolution image based on Thin Observation Module by Bound Optics (TOMBO) [21]. In [29], an iterative back projection method (IBP) was presented to further improve the reconstructed results. El-Sallam et al. introduced a spectral-based blind method for TOMBO [9, 10], but the point spread function (PSF) of the sub-images must be known in advance. As for depth estimation, most algorithms were designed for TOMBO. In [53], a multiple baselines method was discussed. In [14], a back-projection method based on sum of squared difference (SSD) evaluation was described, based on the fact that back-projection results in poor reconstruction performance when the object is located at the wrong position. Especially for eCley, a braiding algorithm was introduced in [30, 31] to obtain the reconstructed image with the final resolution 640
Schematic diagram of G3D.
In summary, work on image processing for eCley is still at an initial stage, and many problems remain to be solved before it can be used in practical applications.
To eliminate the reconstructed image blurring and incorrectly estimated depth resulting from the parallax coming from unparallel neighboring optical channels in eCley, this section discusses the proposed geometry based three-dimensional image processing method (G3D) for eCley. The main idea of G3D is that the geometry relationship of optical channels in eCley is derived to obtain the mathematical relation between the parallax and depth among unparallel neighboring optical channels. The geometry relationship is very useful for gaining a sharply focused image and a depth map.
The main steps of G3D are shown in Fig. 3. In the eCley system, the captured image goes through several processes: image extraction, flat field correction, spatial coordinate calculation, matching cost computation, matching cost aggregation, initial depth estimation, initial image reconstruction, and depth & image refinement. Finally, the dense depth map and focused image can be obtained. Each step will be described in detail as follows.
The flat field correction of a channel image. (a) Before the flat field correction. (b) After the flat field correction.
An eCley system consists of multiple optical channels, but the images of all the channels are captured by one image sensor. The image of each channel occupies a small circular region on the image sensor. Assume that the eCley system has
Due to the border effect, less incident light reaches the pixels near the border of a channel image than the other pixels, so the border of each image is darker than the rest of the image. This is visible in the upper-left corner of the left image (pixels in the red box) in Fig. 4, where the pixels in the upper-left corner of the red box are darker than the other pixels in the box. Under this intensity attenuation, the depth of each pixel cannot be accurately estimated, because the brightness consistency assumption does not hold, and the influence is more severe than that of the random sampling noise and the system noise of eCley. To reduce these effects, flat field correction [21, 50] is performed. The steps are as follows:
Reference image capture. Two types of reference images are captured by using the eCley. One is the white reference image, the other is the black reference image. The black reference image is the Dark Current (DC) offset signal. To reduce the sampling noise, multiple reference images are captured continuously. In our case, 5 images of each reference type are enough to obtain a smooth reference image, and the average intensity of each pixel over these images is taken as the pixel intensity.
Pixel response calculation. The pixel responsivity of camera sensor
Flat field correction. For a captured object image
The result of the flat field correction is shown in Fig. 4b. Comparing the pixels in the red boxes of the left and right images shows that the intensity attenuation near the border, which is visible in the original image, is reduced significantly after flat field correction.
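The three steps above can be illustrated with a minimal Python sketch. It assumes the standard form of flat field correction, in which the dark-subtracted object image is divided by the dark-subtracted white reference and rescaled by the mean gain. This may differ in detail from the formulas used in this paper, and the names `average_frames` and `flat_field_correct` are hypothetical.

```python
def average_frames(frames):
    """Average several frames pixel by pixel to suppress sampling noise."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def flat_field_correct(raw, white, black, eps=1e-6):
    """Standard flat field correction (assumed form, not taken from the paper).

    raw, white and black are 2D lists of equal size: the object image and
    the averaged white and black (dark current) reference images.
    """
    rows, cols = len(raw), len(raw[0])
    # Per-pixel response: white reference minus dark current offset.
    gain = [[white[r][c] - black[r][c] for c in range(cols)]
            for r in range(rows)]
    mean_gain = sum(sum(row) for row in gain) / (rows * cols)
    # Divide out the per-pixel response and restore the overall scale.
    return [[(raw[r][c] - black[r][c]) * mean_gain / max(gain[r][c], eps)
             for c in range(cols)]
            for r in range(rows)]


# Toy example: the right pixel receives half the light of the left pixel.
white = average_frames([[[200.0, 100.0]], [[200.0, 100.0]]])
black = [[0.0, 0.0]]
raw = [[100.0, 50.0]]  # a uniform scene seen through the attenuation
corrected = flat_field_correct(raw, white, black)
```

In this toy example both corrected pixels receive the same value, which is the brightness consistency that the subsequent matching steps rely on.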
The simplified graph of the principal optic axis in eCley (only one dimension). The principal optic axes are not parallel with each other, and the angular deviation between adjacent channels is about 
This step is the key point of G3D and the foundation of its subsequent steps. The aim of spatial coordinate calculation is to handle the problem of nonuniform disparities at the same distance, which is caused by the nonuniform spatial sampling interval in the eCley system. Concretely, given several depth plane candidates, each pixel is back-projected onto each depth plane to obtain its corresponding spatial coordinates. Several processes of the local stereo matching method [38], such as matching cost computation, matching cost aggregation and depth selection, can then be applied to the spatial coordinates to estimate the correct depth.
Interleaving pixels from neighboring channels in the eCley. Left: two marginal channels. Right: two central channels. For simplicity, each channel has nine pixels in one dimension. The central channels have a nearly uniform spatial sampling interval, while the marginal channels do not. At the same distance, the central channels have exactly the same parallax for all pixels, but the parallax of the pixels in the marginal channels is not uniform.
The simplified graph showing the relationship of camera system parameters (only in one dimension).
Differing from a single traditional lens camera, an array of microlenses is placed on the eCley, where each microlens has a small diameter, up to 400
Unlike the spatial sampling interval, the angular sampling interval of the channel images is uniform. This study therefore proposes to use the spatial coordinates of pixels directly to find the corresponding pixels among neighboring channels and so estimate the correct depth of an object. Each pixel is back-projected into space based on the geometry relationship of the optical channels and the incidence angle of the pixel, so the spatial coordinates of pixels that capture the same object will be nearly the same at the correct depth plane. The calculation of spatial coordinates is described in detail as follows.
In a camera system, suppose that
and the FOV
In our eCley system, the pixel pitch is 3.2
Schematic graph for imaging process. Brown point: optical center; red point: image center; green point: object position; blue point: the projection position of the object on the image plane. The principal optic axis of the image plane and the image plane are not orthogonal and it has a horizontal offset angle 
For a micro-lens channel shown in Fig. 8, the principal optic axis has a horizontal offset angle
The spatial coordinates of the object have the following relationship
Once the depth
In a unified coordinate system, let the optical center of the central channel
where
Schematic graph showing the object captured by 4 adjacent channels.
Consequently, for a specified depth
where the superscript
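The idea of back-projecting pixels onto candidate depth planes can be illustrated with a simplified Python sketch. It is not the paper's Eq. (7). It assumes a pinhole model in which each channel has a hypothetical optical center `center` and offset angles `tilt` of its principal optic axis, and it ignores image inversion. The function name `back_project`, the channel pitch of 0.4 mm and the reading of the pixel pitch as 3.2 micrometres are assumptions made only for this illustration.

```python
import math


def back_project(u, v, depth, center, tilt, focal_length, pixel_pitch):
    """Back-project a pixel onto the plane z = depth (simplified model).

    u, v: pixel offsets from the channel image center, in pixels.
    center: (x, y) of the channel's optical center, in mm.
    tilt: (horizontal, vertical) offset angles of the optic axis, in radians.
    """
    cx, cy = center
    tx, ty = tilt
    x = cx + depth * (math.tan(tx) + u * pixel_pitch / focal_length)
    y = cy + depth * (math.tan(ty) + v * pixel_pitch / focal_length)
    return x, y


F = 0.778      # focal length in mm, as stated in the text
P = 0.0032     # pixel pitch in mm (assumed reading of "3.2")
PITCH = 0.4    # hypothetical spacing of two channels in mm

# A point at x = 4.0 mm, z = 100 mm seen by two untilted channels.
u_a = 4.0 * F / (100.0 * P)            # pixel offset in channel A
u_b = (4.0 - PITCH) * F / (100.0 * P)  # pixel offset in channel B

xa_ok, _ = back_project(u_a, 0, 100.0, (0.0, 0.0), (0.0, 0.0), F, P)
xb_ok, _ = back_project(u_b, 0, 100.0, (PITCH, 0.0), (0.0, 0.0), F, P)
xa_bad, _ = back_project(u_a, 0, 200.0, (0.0, 0.0), (0.0, 0.0), F, P)
xb_bad, _ = back_project(u_b, 0, 200.0, (PITCH, 0.0), (0.0, 0.0), F, P)
```

At the correct depth plane (100 mm) the two back-projected coordinates coincide, whereas at the wrong plane (200 mm) they differ by the channel spacing. This disagreement is what the matching cost in the following step measures.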
Matching cost computation is to calculate the intensity difference between a pixel and its corresponding pixels at adjacent channels. Like stereo matching, the depth is estimated based on several depth plane candidates. With the spatial coordinates of each pixel in each depth plane obtained in Eq. (7), the matching cost of these pixels with similar spatial coordinates can be obtained. If the pixel is placed at the proper distance, the matching cost would be small due to the similar intensities of its corresponding pixels.
According to the eCley parameters, there is a slight offset between the principal optic axes of adjacent channel images, though an object in the FOV of eCley can still be captured by at least four channels (except for objects located at the margin of the FOV), as shown in Fig. 9. Due to the fact that the direction of FOV between adjacent channels has a shift and the width of each image is only
An example for showing the pixels projected to the spatial plane, where the black points denote the projected point of pixels; blue and green boxes indicate the sampling regions of one pixel. (a) The spatial plane with several pixels projected to this plane. (b) A magnified figure of the green box in (a).
In order to gather as much information as possible for depth estimation, this study uses the pixels from different images located at similar spatial coordinates to compute the matching cost simultaneously. For a specified depth plane, all pixels are projected to this plane based on Eq. (7) in Section 3.2. As shown in Fig. 10, the black dots are the spatial points at this depth plane. The spatial coordinates of several pixels from different images may not coincide exactly and can deviate slightly. According to the expected resolution of the reconstructed image, the depth plane is divided into multiple small, non-overlapping windows, each of which is the sampling area of one re-sampling point. The pixel points within a window are considered the supporting points of its re-sampling point. As shown in Fig. 10b, the orange dot is a re-sampling point; the pixel points located in the blue box are its supporting points; the intensity of the orange dot is the average intensity of the supporting points. Moreover, the sampling interval is related to the depth. The farther the depth plane is, the larger the sampling interval will be. For simplicity, those points in all candidate depth planes are re-projected to a fixed depth plane
where
The sampling area of a re-sampling point
where
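The re-sampling on a depth plane described above can be sketched in Python as follows. The grouping of projected points into non-overlapping windows and the averaging of their intensities follow the text. The cost measure used here, the mean absolute deviation of the supporting points from their average, is only an illustrative assumption, and the function name `resample_plane` is hypothetical.

```python
import math
from collections import defaultdict


def resample_plane(points, cell):
    """Group projected pixels into non-overlapping square windows.

    points: iterable of (x, y, intensity) on one depth plane.
    cell: side length of a window (the sampling interval).
    Returns {(ix, iy): (mean_intensity, cost, support_count)}.
    """
    bins = defaultdict(list)
    for x, y, value in points:
        key = (math.floor(x / cell), math.floor(y / cell))
        bins[key].append(value)

    result = {}
    for key, values in bins.items():
        mean = sum(values) / len(values)
        # Assumed cost: supporting points of the same object should agree.
        cost = sum(abs(v - mean) for v in values) / len(values)
        result[key] = (mean, cost, len(values))
    return result


# Two pixels from different channels fall into window (0, 0); one into (1, 0).
projected = [(0.1, 0.1, 10.0), (0.2, 0.3, 12.0), (1.5, 0.2, 50.0)]
grid = resample_plane(projected, cell=1.0)
```

A window supported by a single point gets zero cost under this measure, so a practical implementation would probably need to treat such windows separately.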
Matching cost aggregation is widely used in local stereo matching methods to aggregate the local matching information and improve the matching results. In this study, due to matching ambiguity, depth estimates based on the point-wise matching cost may be noisy and inaccurate. So a local window-based, edge-preserving method based on the bilateral filter is used for cost aggregation [54, 12]. Assume that a window
where
where
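A generic bilateral-weighted cost aggregation can be sketched in Python as below. It uses the usual bilateral weights, a spatial Gaussian multiplied by an intensity Gaussian on a guide image. The exact weights and parameters of this paper may differ, and `aggregate_cost`, `sigma_s` and `sigma_r` are names chosen for this illustration.

```python
import math


def aggregate_cost(cost, guide, radius=1, sigma_s=1.0, sigma_r=10.0):
    """Aggregate a per-pixel cost slice with bilateral weights.

    cost and guide are 2D lists of equal size; guide supplies the
    intensities that make the aggregation edge preserving.
    """
    rows, cols = len(cost), len(cost[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            weight_sum = 0.0
            acc = 0.0
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    dist2 = (rr - r) ** 2 + (cc - c) ** 2
                    diff2 = (guide[rr][cc] - guide[r][c]) ** 2
                    w = math.exp(-dist2 / (2.0 * sigma_s ** 2)
                                 - diff2 / (2.0 * sigma_r ** 2))
                    weight_sum += w
                    acc += w * cost[rr][cc]
            # The center pixel always has weight 1, so weight_sum > 0.
            out[r][c] = acc / weight_sum
    return out


# An intensity edge between columns 1 and 2 stops the costs from mixing.
guide = [[0.0, 0.0, 100.0, 100.0]]
cost = [[1.0, 1.0, 5.0, 5.0]]
smoothed = aggregate_cost(cost, guide, radius=1, sigma_s=1.0, sigma_r=1.0)
```

With a small `sigma_r`, neighbors on the other side of the edge receive negligible weight, so the costs on each side stay at 1 and 5 respectively.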
Initial depth estimation and image reconstruction
In this section, the depth with the minimal matching cost is selected as the initial estimated depth. For different values of distance
In addition, it is well known that the parallax of an object captured by neighboring images is related to the depth of the object. If an object is placed in an improper depth plane, the obtained image of the object will be blurred. So the image reconstruction is based on depth information. Concretely, the intensity
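The selection of the depth with the minimal matching cost can be sketched in Python as follows. The sketch also takes the reconstructed intensity from the re-sampled image at the selected depth plane, which is an assumption about the reconstruction rule. `select_depth` and the layout of the cost volume are hypothetical.

```python
def select_depth(cost_volume, intensity_volume, depths):
    """Winner-take-all depth selection over candidate depth planes.

    cost_volume[k][r][c]: aggregated cost of pixel (r, c) at depths[k].
    intensity_volume[k][r][c]: re-sampled intensity at depths[k].
    Returns (depth_map, image).
    """
    rows, cols = len(cost_volume[0]), len(cost_volume[0][0])
    depth_map = [[None] * cols for _ in range(rows)]
    image = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Index of the candidate plane with the smallest cost.
            best = min(range(len(depths)), key=lambda k: cost_volume[k][r][c])
            depth_map[r][c] = depths[best]
            image[r][c] = intensity_volume[best][r][c]
    return depth_map, image


depths = [50.0, 100.0]
costs = [[[0.9, 0.1]], [[0.2, 0.5]]]
intensities = [[[10.0, 20.0]], [[30.0, 40.0]]]
depth_map, image = select_depth(costs, intensities, depths)
```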
Depth and image refinement
However, the depth obtained may still suffer from mismatches. Therefore, a reliable depth map is constructed by comparing the minimal cost
Now a sparse reliable depth map and an initial reconstructed image
where
where
Therefore, the final depth map
Then based on the final depth map
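The general idea of filling a sparse reliable depth map in an edge-aware manner can be sketched in Python as below. This is a generic illustration and not the interpolation formula of this paper. Each unreliable pixel, marked `None`, receives a weighted average of nearby reliable depths, with weights that decrease with spatial distance and with intensity difference in the reconstructed image. The name `fill_depth` and all parameters are assumptions.

```python
import math


def fill_depth(depth, guide, radius=2, sigma_s=2.0, sigma_r=5.0):
    """Fill None entries of a sparse depth map using a guide image."""
    rows, cols = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for r in range(rows):
        for c in range(cols):
            if depth[r][c] is not None:
                continue  # reliable depths are kept unchanged
            weight_sum = 0.0
            acc = 0.0
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    if depth[rr][cc] is None:
                        continue
                    dist2 = (rr - r) ** 2 + (cc - c) ** 2
                    diff2 = (guide[rr][cc] - guide[r][c]) ** 2
                    w = math.exp(-dist2 / (2.0 * sigma_s ** 2)
                                 - diff2 / (2.0 * sigma_r ** 2))
                    weight_sum += w
                    acc += w * depth[rr][cc]
            if weight_sum > 0.0:
                out[r][c] = acc / weight_sum
    return out


# The intensity edge in the guide decides which reliable depth is copied.
guide = [[0.0, 0.0, 100.0, 100.0]]
sparse = [[10.0, None, None, 20.0]]
dense = fill_depth(sparse, guide)
```

In this example the two missing pixels take the depth of the reliable pixel on their own side of the intensity edge, so the depth discontinuity stays aligned with the image edge.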
In this section, the feasibility of the proposed method is tested and the comparisons with the two methods in [59, 46] are made in terms of the quality of reconstructed images and the depth map. Actually, the depth estimation method in [59] is replaced by the cost filter technique [15] in this paper. Both the proposed method and the other algorithms for comparison were implemented in MATLAB 2016b.
Parameters used in this study
Input image captured by the eCley camera system. (a) Original image before flat field correction. (b) Image after flat field correction.
Since there is no ground truth for our image sets, the quantitative comparisons are performed by comparing the reconstruction errors between the final image and the original channel images. Suppose a reconstructed image
where
The captured image is extracted from all channels and shown in Fig. 11, where the left is the original image without flat field correction and the right is the image after flat field correction. In this paper, due to the large oblique distortion, only
MSE of the reconstructed image
The execution time of the cost aggregation step with different window sizes 
Figure 13 shows five test images and the corresponding reconstructed images. The first, second and fifth images each contain three fronto-parallel objects placed in front of the eCley system. The third image contains two objects which, unlike those in the other images, are placed on a textured slanted plane, and the fourth image is a pure slanted plane. All objects are placed up to 200 mm away from the eCley system. As shown in Fig. 13, the proposed method performs better than the methods in [59] and [46], especially for the third and fourth images, where the words on the slanted plane show fewer artifacts. In addition, the MSE listed in Table 3 indicates that the proposed method has the smallest reconstruction error among the three methods.
Figure 14 shows the final depth maps obtained by the proposed method and by the methods in [46, 59]. Since the principal optical axes of the eCley apertures are not parallel, the depth estimated by directly applying a stereo method to the eCley images suffers from the problem that the depth of an object decreases as the object approaches the image border. This is visible in the right two columns of Fig. 14, especially in the first two rows, where the objects are fronto-parallel to the eCley system. In the proposed method, this problem is largely avoided, and the depth of an object is more accurate.
Figures 15–19 show the 3D point clouds of the reconstructed images based on the depth maps. As shown in Figs 15, 16 and 19, the objects obtained by the proposed method are located in the same plane and are accurately recovered, while with the methods in [46, 59] the objects near the border of the image are not located in the same plane. In addition, for the slanted plane shown in Figs 17 and 18, the proposed method can also accurately recover the patterns on the slanted plane, while with the methods in [46, 59] the reconstructed images are much worse and the patterns on the slanted plane can hardly be recognised. Therefore, although the reconstructed images obtained by the proposed method and by the methods in [46, 59] do not differ much, the depth maps and the 3D point clouds show that the recovered three-dimensional results differ considerably. The distance of an object does not change dramatically in the proposed method, but for the methods in [46, 59] the distance of an object located in the corner of the reconstructed image always changes dramatically.
It is worth pointing out that the computing time of the proposed method does not differ much from that of the method in [59]: both take several minutes in MATLAB 2016b, and the computing times are listed in Table 4. The two methods use the same cost aggregation approach. The method in [46] is a global method that needs many iterations to obtain a good result, so its computing time is about twice that of the proposed method. In the proposed method, all neighboring images of the reference image are used for cost matching, whereas the methods in [59, 46] use two-view stereo matching followed by fusion. In this study, no fusion process is required.
Average computing time
In this work, a three-dimensional image processing method is presented that uses the multi-aperture images of eCley to reconstruct a focused image and to estimate depth information. Based on the projection geometry, the proposed method directly obtains an initial depth map and a reconstructed image from the spatial coordinates of pixels. The reconstructed image and the initial depth map are then used to obtain a dense and consistent depth map by means of an edge-aware interpolation method. Subsequently, the reconstructed image is refined based on the final dense and consistent depth map, so that the three-dimensional structure of the scene can be recovered. Unlike depth maps obtained by stereo matching and fusion, the proposed method back-projects the pixels into space and directly uses the images from all channels to estimate the depth map. The information from multiple channels is merged into a single matching step. The reconstructed image and the depth map are compared with those of methods in the literature. Experimental results show that the proposed method can correctly estimate the reconstructed image and the depth map, and achieves better reconstructed images and depth results than two state-of-the-art stereo methods. Because the results still contain some errors, future work will focus on an auto-calibration method to rectify the channel images and further improve the depth result, and on the use of super-resolution image reconstruction.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61373047, 61672437, 61702428 and 51641506), Sichuan Science and Technology Program (18ZDYF2877, 18ZDYF1985), the State Key Laboratory of Robotics Program (2014-O09), Scientific Research Foundation of CUIT (J201508), Scientific Research Fund of Sichuan Provincial Science & Technology Department (2015GZ0304); Fund of Robot Technology Used for Special Environment Key Laboratory of Sichuan Province (14zxtk04) and China Scholarship Council.
