A confidence-aware depth estimation method for light-field cameras based on multiple cues

Abstract

Depth map estimation from a light-field camera is an interesting and challenging problem. Recent works have demonstrated promising results based on different cues in light-field images. Exploiting the spatial refocusing capability of light fields, we introduce a confidence-aware depth estimation method based on multiple cues. The focus/defocus cue of the focal stack is estimated in the Discrete Cosine Transform (DCT) domain. The correspondence cue between the different rays of a refocused pixel is extracted using a photo-consistency metric and relevance analysis. Edge confidence analysis is then introduced as the depth and color discontinuity cue. To obtain a refined depth map, an iterative graph cut optimization framework with label costs integrates these cues together with their confidences. Experimental results show that our method achieves accurate depth maps, especially in depth-discontinuous areas.

Keywords

Light-field camera, depth estimation, multiple cues, confidence-aware, graph cut optimization

1. Introduction

Light-field (plenoptic) cameras capture the 4D spatio-angular information of the light field [1, 2]. Compared with traditional cameras, this gives light-field cameras additional cues for visual analysis and understanding. An important benefit is that multiple views (sub-aperture images) can be obtained from a single shot, which makes depth recovery from a single light-field image an attractive task. However, because of the fundamental tradeoff between spatial and angular resolution [3], depth estimation with a light-field camera still faces challenges in robustness and accuracy. Many recent works focus on exploiting different cues, such as focus/defocus and correspondence.

Figure 1.

Comparisons of depth estimation results.

Inspired by the recent work [4], we propose a multiple cues confidence-aware depth estimation method. The main differences and our main contributions are:

A multiple-cue confidence-aware depth map optimization framework is proposed. It integrates multiple cues with confidence into a high-order multi-label graph cut optimization.

Edge confidence is incorporated in our optimization framework as the depth and color discontinuity cues. It can improve the accuracy of depth map in depth discontinuous areas.

A robust focus/defocus cue is estimated from focal stack by analyzing the alternating current (AC) coefficients calculated in DCT domain.

Before the correspondence cue is computed, rays that may be occluded are excluded according to the photo-consistency metric and relevance analysis.

Figure 1 compares our method with that of Tao et al. [4]. Figure 1a shows the input central views of the light-field images, Fig. 1b the depth maps of Tao et al., Fig. 1c the depth maps of our method, and Fig. 1d their pseudo-color versions. Our depth maps are visibly sharper and more accurate.

2. Related works

Compared with a traditional 2D image, a light-field image contains many additional cues. Our work is related to depth recovery from light-field images, which can be traced back to research on focal stack extraction [5] and depth-of-field (DOF) extension [6].

Georgiev and Lumsdaine [7] estimated disparity maps by computing a normalized cross correlation between microlens images. Bishop and Favaro [8] proposed a Bayesian framework to reconstruct scene depth. Perwass and Wietzke [9] introduced a correspondence technique to estimate depth. Yu et al. [10] analyzed the 3D geometry of lines in a light-field image and computed disparity maps through line matching between the sub-aperture images. Tao et al. [4] showed how multiple cues such as defocus and correspondence can be combined. Jeon et al. [11] estimated multi-view stereo correspondences with sub-pixel accuracy using the phase shift theorem. Lin et al. [12] described a technique to recover depth from a light-field image based on focal stack symmetry analysis and a data consistency measure. References [13] and [14] introduced methods to measure or control DOF.

Inspired by Tao et al. [4], we integrate five depth cues (photo-consistency, defocus, correspondence, occlusion, depth discontinuity) into a multi-label graph cut optimization framework.

3. Multiple cues from focal stack

Ray gathering and computational refocusing are the distinguishing capabilities of a light-field camera. In light-field imaging, rays with different directions are separated by a microlens array (MLA) and then recorded by the image sensor. Computational refocusing is a process of ray tracking: the rays that leave different positions on the main lens plane and arrive at a pixel of a virtual refocusing plane can be tracked and summed. Ng et al. [1] derived the refocusing equation for light-field images. We rewrite it in discrete form and ignore the dilation factor:

$\displaystyle\left\{\begin{array}{l}I_{\alpha}\left(x,y\right)=\sum\limits_{u}\sum\limits_{v}R_{\alpha}\left(x,y,u,v\right)\\ R_{\alpha}\left(x,y,u,v\right)=L_{F}\left(x+u\left(1-\frac{1}{\alpha}\right),\,y+v\left(1-\frac{1}{\alpha}\right),\,u,\,v\right)\end{array}\right.$ (1)

where $L_{F}(x,y,u,v)$ is the 4-dimensional light-field data, and $(x,y)$ and $(u,v)$ are the spatial coordinates on the sensor plane and the angular coordinates on the main lens plane, respectively. $I_{\alpha}(x,y)$ is the value of pixel $(x,y)$ in the refocusing plane $\alpha$. The main lens plane, the sensor plane, and the refocusing plane are shown in Fig. 2. $R_{\alpha}(x,y,u,v)$ is the radiation of the ray that arrives at pixel $(x,y)$ in the sensor plane from position $(u,v)$ in the main lens plane. $\alpha=F^{\prime}/F$ identifies a virtual refocusing plane.
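As an illustration of Eq. (1), the following is a minimal shift-and-add sketch in Python with NumPy. It is not the authors' implementation. The array layout `(U, V, X, Y)`, the rounding of shifts to whole pixels, and the wrap-around at image borders are all simplifying assumptions made here, and the function name `refocus` is hypothetical.

```python
import numpy as np

def refocus(light_field, alpha):
    """Shift-and-add refocusing following Eq. (1).

    light_field: array of shape (U, V, X, Y), angular axes first (assumed layout).
    alpha: refocusing parameter F'/F.
    Returns (I_alpha, R_alpha): the refocused image of shape (X, Y) and the
    re-sampled rays of shape (U, V, X, Y).
    """
    U, V, X, Y = light_field.shape
    # Angular coordinates centred on the optical centre, so that the
    # central view sits at (u, v) = (0, 0).
    us = np.arange(U) - U // 2
    vs = np.arange(V) - V // 2
    scale = 1.0 - 1.0 / alpha
    rays = np.empty_like(light_field, dtype=float)
    for iu, u in enumerate(us):
        for iv, v in enumerate(vs):
            # Assumption: shifts are rounded to the nearest integer pixel.
            dx = int(round(u * scale))
            dy = int(round(v * scale))
            # Rolling by -d samples the view at (x + dx, y + dy);
            # pixels that leave the image wrap around (assumption).
            rays[iu, iv] = np.roll(light_field[iu, iv],
                                   shift=(-dx, -dy), axis=(0, 1))
    return rays.sum(axis=(0, 1)), rays
```

With `alpha = 1` every shift is zero, so the result reduces to the plain sum of all views. A practical implementation would more likely use sub-pixel interpolation instead of integer shifts.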

For convenience, we define the central ray as the ray emitted from a scene point and passing through the optical center $O(u=0,v=0)$ of the main lens; its radiation is $R(x,y,0,0)$. Note that $R(x,y,0,0)$ does not depend on $\alpha$. We define the central image (central view) $I_{0}(x,y)$ as the image formed by the central rays,

$\displaystyle I_{0}\left({x,y}\right)=R\left({x,y,0,0}\right)$ (2)

To recover the depth map of the scene, a focal stack is built from a set of virtual refocusing planes, and the best depth estimate of each scene point is then searched for in the focal stack. For a scene point $P$, as shown in Fig. 2, the best depth estimate is obtained by selecting the most in-focus position $P_{\alpha}^{\prime}$ among the intersections of the central ray $PO$ with the virtual refocusing planes of the focal stack.

Figure 2.

Ray tracking and refocusing.

To determine the most in-focus position, we consider the following cues.

3.1 Photo-consistency cue

Photo-consistency is an important cue that is implicit in most computer vision methods. It states that all rays from the same scene point have the same radiation.

3.2 Defocus cue

The focal stack is a group of images focused on different regions. In general, a focused area is more informative: it contains more detail and therefore higher spatial frequencies, and the higher the spatial frequency content, the sharper the image. We use the variance of the AC coefficients of a DCT block as the defocus metric. A DCT-based metric is simple and efficient [15, 16] because it yields a stable defocus measure without a large number of convolutions.

For a pixel $p(x,y)$ in the refocusing plane $\alpha$ , we extract an $N\times N$ image block $I_{\alpha}^{p}$ with its center at $p$ . The two-dimensional DCT transform of this image block is defined as:

$\displaystyle d_{\alpha}^{p}\left(m,n\right)=\frac{2}{N}w\left(m\right)w\left(n\right)\sum\limits_{j=0}^{N-1}\sum\limits_{i=0}^{N-1}I_{\alpha}^{p}\left(i,j\right)\cos\left[\frac{\left(2i+1\right)m\pi}{2N}\right]\cos\left[\frac{\left(2j+1\right)n\pi}{2N}\right]$ (3)

$w\left(k\right)=\left\{\begin{array}{ll}1/\sqrt{2},&k=0\\ 1,&\text{otherwise}\end{array}\right.,\quad k\in\left\{m,n\right\}$

where $m,n=0,1,2,\ldots,N-1$, and $i$ and $j$ are the pixel coordinates within the image block. Here $d_{\alpha}^{p}(0,0)$ is the direct-current (DC) coefficient, which is proportional to the mean of the image block. The remaining coefficients are the AC coefficients. The defocus metric can be written as follows,

$\displaystyle D_{\alpha}\left(p\right)=\frac{1}{N}\sqrt{\left(\sum\limits_{m=0}^{N-1}\sum\limits_{n=0}^{N-1}d_{\alpha}^{p}\left(m,n\right)^{2}\right)-d_{\alpha}^{p}\left(0,0\right)^{2}}$ (4)

We find the $\alpha$ value with the maximum defocus metric, and take it as the most in-focus position:

$\displaystyle\alpha_{D}\left(p\right)=\mathop{\arg\max}\limits_{\alpha}D_{% \alpha}\left(p\right)$ (5)

3.3 Correspondence cue

Without occlusion, all the rays that converge at the most in-focus point should come from the same scene point and, according to the photo-consistency cue, should have the same radiation.

Considering noise and the errors introduced by the refocusing process, we assume that the radiation of these converging rays obeys a narrow Gaussian distribution with small variance, whose mean should be close to the radiation of the central ray.

For a pixel $p(x,y)$ in the refocusing plane $\alpha$, we estimate a Gaussian distribution model ${\cal G}(\mu_{\alpha}(p),\sigma_{\alpha}(p))$ from all rays that converge to pixel $p$. As the $(u,v)$ are uniform samples of those rays, we use the sample mean and the sample variance as the parameters of the Gaussian model, as shown in Eq. (6).

$\displaystyle\left\{\begin{array}{l}\mu_{\alpha}\left(p\right)=\frac{1}{N_{u}N_{v}}\sum\limits_{u}\sum\limits_{v}R_{\alpha}\left(x,y,u,v\right)\\ \sigma_{\alpha}\left(p\right)=\frac{1}{N_{u}N_{v}}\sum\limits_{u}\sum\limits_{v}\left(R_{\alpha}\left(x,y,u,v\right)-\mu_{\alpha}\left(p\right)\right)^{2}\end{array}\right.$ (6)

The correspondence metric can be written as follows,

$\displaystyle C_{\alpha}\left(p\right)=\sigma_{\alpha}\left(p\right)+\left|{% \mu_{\alpha}\left(p\right)-R\left({x,y,0,0}\right)}\right|$ (7)

We take the $\alpha$ value with the minimal correspondence metric as the most in-focus position:

$\displaystyle\alpha_{C}\left(p\right)=\mathop{\arg\min}\limits_{\alpha}C_{\alpha}\left(p\right)$ (8)
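A minimal per-pixel sketch of Eqs. (6) to (8) in Python is given below. It assumes that the rays converging on one pixel have already been gathered for every refocusing plane into an array of shape `(A, U, V)`; this layout and the function names are hypothetical. The plane is selected by taking the minimum of the metric, as the text describes.

```python
import numpy as np

def correspondence_metric(rays_by_alpha, central):
    """Eqs. (6)-(7) for one pixel.

    rays_by_alpha: array (A, U, V), the rays R_alpha(x, y, u, v) converging on
                   the pixel for each of the A refocusing planes (assumed layout).
    central: radiation R(x, y, 0, 0) of the central ray.
    Returns an array of A metric values C_alpha(p).
    """
    mu = rays_by_alpha.mean(axis=(1, 2))
    # Population variance (divide by Nu * Nv), matching Eq. (6).
    var = rays_by_alpha.var(axis=(1, 2))
    return var + np.abs(mu - central)

def best_plane(rays_by_alpha, central):
    """Eq. (8): index of the plane with the smallest correspondence metric."""
    return int(np.argmin(correspondence_metric(rays_by_alpha, central)))
```

For example, a plane on which all rays equal the central ray has a metric of exactly zero and is therefore preferred over a plane on which the rays disagree.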

3.4 Occlusion cue

Occlusion is caused by the unavoidable overlap of objects and is difficult to detect in traditional computer vision tasks. Because the photo-consistency cue does not hold when occlusion occurs, the correspondence metric is no longer correct. As shown in Fig. 2, the red rays are occluded by the gray object and should be excluded before the correspondence metric is computed. In other words, the red rays are emitted from different points on the occluding object rather than from the scene point $P$.

Because each ray in a light field is, in principle, computable and traceable, we propose a simple and efficient occlusion elimination method based on the 3-sigma rule. It is not an occlusion detection method; it only uses an iterative hypothesis-testing strategy to remove the outliers that violate the 3-sigma rule, and we assume that these outliers include the occluded rays. More specifically, we first compute the mean and variance over all $(u,v)$. After removing the outliers, we re-compute the mean and variance over the remaining $(u,v)$ and remove the outliers again. The process ends when no more outliers can be removed or when fewer than half of the rays remain. Finally, we substitute the new mean ${\mu}^{\prime}_{\alpha}(p)$ and the new variance ${\sigma}^{\prime}_{\alpha}(p)$ into Eq. (7) to compute the correspondence metric with the occlusion cue.

$\displaystyle{C}^{\prime}_{\alpha}\left(p\right)={\sigma}^{\prime}_{\alpha}\left(p\right)+\left|{\mu}^{\prime}_{\alpha}\left(p\right)-R\left(x,y,0,0\right)\right|$ (9)

$\displaystyle{\alpha}^{\prime}_{C}\left(p\right)=\mathop{\arg\min}\limits_{\alpha}{C}^{\prime}_{\alpha}\left(p\right)$ (10)
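The iterative outlier removal can be sketched as follows. This is one possible reading of the procedure, not the authors' code: the 3-sigma test is applied with the standard deviation (the square root of the variance in Eq. (6)), and the loop stops without discarding anything further once a removal would leave fewer than half of the original rays. The function name and the `k` parameter are hypothetical.

```python
import numpy as np

def robust_ray_statistics(rays, k=3.0):
    """Iterative k-sigma rejection of rays that are likely occluded.

    rays: radiation values of all rays converging on one pixel.
    Returns (mean, variance, keep), where keep is a boolean mask of the
    rays that survive the rejection.
    """
    rays = np.asarray(rays, dtype=float).ravel()
    keep = np.ones(rays.size, dtype=bool)
    while True:
        mu = rays[keep].mean()
        sd = rays[keep].std()
        outliers = keep & (np.abs(rays - mu) > k * sd)
        remaining = keep.sum() - outliers.sum()
        # Stop when nothing is rejected, or when rejecting would leave
        # fewer than half of the original rays (assumed stopping rule).
        if not outliers.any() or remaining < rays.size / 2:
            break
        keep &= ~outliers
    return rays[keep].mean(), rays[keep].var(), keep

def occlusion_aware_metric(rays, central, k=3.0):
    """Eq. (9): correspondence metric computed from the surviving rays."""
    mu, var, _ = robust_ray_statistics(rays, k)
    return var + abs(mu - central)
```

As a usage example, twenty rays of value 0.5 plus one occluded ray of value 5.0 yield a metric of zero against a central ray of 0.5, because the single outlier lies more than three standard deviations from the mean and is discarded in the first pass.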

3.5 Depth discontinuity cue

Depth discontinuities generally occur at object boundaries or occluding contours, and are reflected by edge pixels. We obtain the edge confidence map $E$ by applying a gradient-based edge detector with embedded confidence [17, 18] to the central view $I_{0}$. The advantage of this edge detection technique is that it retains more low-level visual cues, such as sharp but weak edges.

4. Multiple-cue confidence-aware depth optimization framework

As mentioned before, our light-field depth estimation selects the most suitable $\alpha$ value (the most in-focus position) for each pixel. This is a labeling problem, and all $\alpha$ values make up a finite label set $L$. To find an optimal labeling, we propose a multiple-cue confidence-aware global optimization framework. It is based on energy minimization with label costs [19, 20]. The energy function combines data costs, smooth costs and label costs.

$\displaystyle E\left(\alpha\right)=\overbrace{\sum\limits_{p}E_{\textit{data}}\left(\alpha_{p}\right)}^{\text{data costs}}+\overbrace{\sum\limits_{\left\{p,q\right\}\in N}E_{\textit{smooth}}\left(\alpha_{p},\alpha_{q}\right)}^{\text{smooth costs}}+\overbrace{\sum\limits_{\alpha,\beta\in L}E_{\textit{label}}\left(\alpha,\beta\right)}^{\text{label costs}}$ (11)

where $N$ is the set of neighboring pixel pairs. The data term $E_{\textit{data}}(\alpha_{p})$ is a unary term that measures the cost of assigning the label $\alpha_{p}$ to pixel $p$. The smooth term $E_{\textit{smooth}}(\alpha_{p},\alpha_{q})$ is a pairwise term that penalizes assigning different labels $\alpha_{p}$ and $\alpha_{q}$ to adjacent pixels $p$ and $q$. It imposes piecewise smoothness constraints on the depth surface. The label term $E_{\textit{label}}(\alpha,\beta)$ is a high-order term that expresses the likelihood of assigning the label $\beta$ when the current label is $\alpha$.

Our data term integrates defocus metric and correspondence metric with the occlusion cue by estimating confidences as their weights. We use Peak Ratio to estimate the confidences of the defocus cue and correspondence cue, similar to the method proposed by Tao et al. [4].

Let $\alpha_{D}(p)$ and $\alpha_{D}^{\ast}(p)$ be the labels at which the defocus metric $D_{\alpha}(p)$ attains its maximum and its second maximum, respectively, and let ${\alpha}^{\prime}_{C}(p)$ and $\alpha_{C}^{\ast}(p)$ be the labels at which the correspondence metric with the occlusion cue ${C}^{\prime}_{\alpha}(p)$ attains its minimum and its second minimum, respectively. Their confidences are defined as follows.

$\displaystyle D_{\textit{conf}}\left(p\right)=D_{\alpha_{D}}\left(p\right)/D_{\alpha_{D}^{\ast}}\left(p\right)$

$\displaystyle C_{\textit{conf}}\left(p\right)=C^{\prime}_{\alpha^{\prime}_{C}}\left(p\right)/C^{\prime}_{\alpha_{C}^{\ast}}\left(p\right)$ (12)
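The peak-ratio confidences of Eq. (12) can be sketched as below. The sketch assumes that the metrics are stored with the label axis first, so that it works both for a single pixel (shape `(A,)`) and for a full cost volume (shape `(A, H, W)`). The small constant `eps` that guards against division by zero is an addition made here and is not part of Eq. (12); the function name is hypothetical.

```python
import numpy as np

def peak_ratio_confidences(D, C, eps=1e-12):
    """Eq. (12): ratio of the best to the second-best response.

    D: defocus metric per label, label axis first.
    C: occlusion-aware correspondence metric per label, label axis first.
    Returns (D_conf, C_conf).
    """
    D_sorted = np.sort(D, axis=0)  # ascending along the label axis
    C_sorted = np.sort(C, axis=0)
    # Defocus: maximum divided by second maximum.
    D_conf = D_sorted[-1] / (D_sorted[-2] + eps)
    # Correspondence: minimum divided by second minimum.
    C_conf = C_sorted[0] / (C_sorted[1] + eps)
    return D_conf, C_conf
```

With these definitions a distinct defocus peak gives a ratio well above 1, while a distinct correspondence minimum gives a ratio well below 1.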

The cost of assigning the label $\alpha_{p}$ to pixel $p$ can be defined as follows,

$\displaystyle E_{\textit{data}}\left(\alpha_{p}\right)=\lambda_{C}\cdot C_{\textit{conf}}\left(p\right)+\lambda_{D}\cdot D_{\textit{conf}}\left(p\right)\cdot\left(1-D_{\alpha_{p}}\left(p\right)\right)$ (13)

where $\lambda_{C}$ and $\lambda_{D}$ control the weights between defocus metric and correspondence metric.

Our smooth term is a weighted Potts model,

$\displaystyle E_{\textit{smooth}}\left({\alpha_{p},\alpha_{q}}\right)=w\left({% p,q}\right)\cdot T\left({\alpha_{p}\neq\alpha_{q}}\right)$ (14)

where $T(\cdot)$ is the generalized Potts metric, which equals 1 when its argument is true and 0 otherwise, and $w\left(p,q\right)$ is a weight derived from the edge confidence map $E$ that preserves depth discontinuities.

$\displaystyle w\left({p,q}\right)=\min\left({1-\left|{\varepsilon(p)-% \varepsilon(q)}\right|,c_{\varepsilon}}\right)$ (15)

where $c_{\varepsilon}$ is a constant. Our label term is defined as

$\displaystyle E_{\textit{label}}\left({\alpha,\beta}\right)=\min\left({1-e^{-% \frac{\left({\alpha-\beta}\right)^{2}}{2\sigma^{2}}},c_{l}}\right)$ (16)

where $c_{l}$ is a constant.
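The pairwise and label terms of Eqs. (14) to (16) are simple enough to write out directly. The sketch below assumes that $\varepsilon(p)$ is the value of the edge confidence map $E$ at pixel $p$. The default value of `sigma` is a placeholder chosen here, since the text does not state it, while the defaults for `c_eps` and `c_l` follow the experimental settings reported in Section 5. The function names are hypothetical.

```python
import numpy as np

def smooth_cost(alpha_p, alpha_q, eps_p, eps_q, c_eps=0.1):
    """Eqs. (14)-(15): weighted Potts penalty between neighbouring pixels.

    eps_p, eps_q: edge confidence at pixels p and q (assumed to be E(p), E(q)).
    """
    w = min(1.0 - abs(eps_p - eps_q), c_eps)
    # Potts indicator T(.): 1 if the labels differ, 0 otherwise.
    return w * float(alpha_p != alpha_q)

def label_cost(alpha, beta, sigma=0.1, c_l=0.3):
    """Eq. (16): truncated Gaussian-shaped cost between two labels.

    sigma is a placeholder value; it is not specified in the text.
    """
    return min(1.0 - np.exp(-(alpha - beta) ** 2 / (2.0 * sigma ** 2)), c_l)
```

Both terms are truncated, so a large edge-confidence difference lowers the smoothness penalty across an edge, and the cost between two very different labels never exceeds `c_l`.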

Figure 3 shows intermediate results of our method: the defocus metric, the correspondence metric with the occlusion cue, the edge confidence map, and the final refined depth map.

Figure 3.

The intermediate results of our method.

5. Experiments and analysis

In our experiments, the refocusing plane parameter $\alpha$ ranges from 0.2 to 2.0, and the number of depth layers in the focal stack is set to 256. Based on experience and experimental analysis, we set $\lambda_{C}=0.2$, $\lambda_{D}=0.6$, $c_{\varepsilon}=0.1$, and $c_{l}=0.3$. We compare our method with that of Tao et al. [4], which estimates depth from the defocus and correspondence cues.
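To make these settings concrete, the label set and the pixel shifts it implies in Eq. (1) can be computed as follows. Uniform spacing of the 256 layers is an assumption made here, since the text does not state how $\alpha$ is sampled; the variable names are hypothetical, and the angular offset of 7 corresponds to the 15 $\times$ 15 angular resolution of the data used in the experiments.

```python
import numpy as np

# Label set L: 256 refocusing planes between alpha = 0.2 and alpha = 2.0
# (uniform spacing is an assumption).
alphas = np.linspace(0.2, 2.0, 256)

# Per-unit angular shift factor (1 - 1/alpha) from Eq. (1).
shift_per_unit = 1.0 - 1.0 / alphas

# For a 15 x 15 angular grid centred on the central view, |u| and |v| are at most 7.
max_angular_offset = 15 // 2

# Largest spatial shift, in pixels, needed to build the focal stack.
max_shift = np.abs(shift_per_unit).max() * max_angular_offset
```

Under these assumptions the shift factor runs from $-4$ at $\alpha=0.2$ to $0.5$ at $\alpha=2.0$, so the outermost views are displaced by up to 28 pixels.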

We validate our method on real-world scenes captured with a Lytro Illum camera. The size of the light-field data is 15 $\times$ 15 $\times$ 434 $\times$ 625, where 15 $\times$ 15 and 434 $\times$ 625 are the angular and spatial resolutions of the light field, respectively. We focus on the effectiveness and robustness of the algorithm in handling discontinuities and occlusions. Therefore, our experimental data consists of scenes with many depth, color and texture discontinuities, especially objects and backgrounds with complex textures.

Figure 4.

Comparisons of depth estimation results between Tao’s method and ours.

Figure 4 shows the experimental results on our data with fine structures and occlusions. Tao's method loses fine structures, because details are vulnerable to edges and noise. Our method not only preserves more detail at depth discontinuities, but is also more robust at surface color and texture discontinuities.

Figure 5.

Failure case: Large weak texture or textureless regions.

6. Limitations and discussion

Since our defocus metric relies on image detail, it may fail to produce good results in large weakly textured or textureless regions, as shown in Fig. 5.

7. Conclusions

In this paper, we introduced a multiple-cue, confidence-aware depth estimation method that combines the focus/defocus cue, the correspondence cue, and edge confidence analysis. In addition, a multi-label graph cut optimization framework is applied to correct and refine the depth estimates in uncertain regions. Experimental results show that our method achieves accurate depth maps, especially in depth-discontinuous regions.

Conflict of interest

The authors confirm that this article content has no conflicts of interest.

Acknowledgments

This work is supported by the Zhejiang Provincial Natural Science Foundation of China (No. LY15F010002), and the Key Project of the Zhejiang Provincial Natural Science Foundation of China (No. LZ14F020003).

References

1. Ng, Levoy, Brédif, Duval, Horowitz and Hanrahan, Light field photography with a handheld plenoptic camera, Computer Science Technical Report CSTR 2(9) (2005), 4.

2. Georgiev, Lumsdaine and Goma, Lytro Camera Technology: Theory, Algorithms, Performance Analysis, in Proc. SPIE 8667, Multimedia Content and Mobile Devices, 2013.

3. Liang and Ramamoorthi, A Light Transport Framework for Lenslet Light Field Cameras, ACM Transactions on Graphics, 2015.

4. Tao M.W., Hadap, Malik and Ramamoorthi, Depth from Combining Defocus and Correspondence Using Light-Field Cameras, in International Conference on Computer Vision (ICCV), 2013.

5. Nava, Marichal-Hernández and Rodríguez-Ramos, The discrete focal stack transform, in 16th European Signal Processing Conference (EUSIPCO 2008), 2008.

6. Georgiev and Lumsdaine, Depth of field in plenoptic cameras, in 30th Annual Conference of the European Association for Computer Graphics (EuroGraphics 2009), 2009.

7. Georgiev and Lumsdaine, Reducing plenoptic camera artifacts, Computer Graphics Forum 29(6) (2010), 1955–1968.

8. Bishop T.E. and Favaro, The light field camera: Extended depth of field, aliasing, and superresolution, IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 34(5) (2012), 972–986.

9. Perwass and Wietzke, Single lens 3d-camera with extended depth-of-field, in Proceedings Volume 8291, Human Vision and Electronic Imaging XVII, 829108, 2012.

10. Yu, Guo, Ling, Lumsdaine and Yu, Line assisted light field triangulation and stereo matching, in Proceedings of International Conference on Computer Vision (ICCV), 2013.

11. Jeon H.-G., Park, Choe, Park, Bok, Tai Y.-W. and Kweon, Accurate depth map estimation from a lenslet light field camera, in Proceedings of International Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

12. Lin, Chen, Kang S.B. and Yu, Depth recovery from light field using focal stack symmetry, in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.

13. Liu and Zhang, A new freeform depth of field controlling method based on focused plenoptic camera, Journal of Computational Methods in Science and Engineering 16(2) (2016), 87–195.

14. […] and Fang, Method for measuring target depth by using optical field focusing image, Journal of Computational Methods in Science and Engineering 16(2) (2016), 369–377.

15. Haghighat, Aghagolzadeh and Seyedarabi, Multi-focus Image Fusion for Visual Sensor Networks in DCT Domain, Computers and Electrical Engineering 37(3) (2011), 789–797.

16. Phamila and Amutha, Discrete cosine transform based fusion of multi-focus images for visual sensor networks, Signal Processing 95 (2014), 161–170.

17. Meer and Georgescu, Edge detection with embedded confidence, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 23 (2001), 1351–1365.

18. Comaniciu and Meer, Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002), 603–619.

19. Delong, Advances in Graph-Cut Optimization: Multi-Surface Models, Label Costs, and Hierarchical Costs, University of Western Ontario, Canada, 2011.

20. Delong, Osokin, Isack and Boykov, Fast Approximate Energy Minimization with Label Costs, International Journal of Computer Vision 96(1) (2012), 1–27.