Cell-based shape reconstruction from incomplete silhouettes

Abstract

Shape reconstruction from images is one of the most widely adopted approaches to compute accurate 3D reconstructions of people or objects in a multi-camera environment. However, such algorithms are traditionally very sensitive to errors in the silhouettes due to imperfect foreground-background estimation or occluding objects appearing between the camera and the object of interest. We propose a novel algorithm that is still able to provide high quality reconstruction from incomplete silhouettes. At the core of the method is the partitioning of the reconstruction space in cells, i.e. regions with uniform camera and silhouette coverage properties. An iterative process is proposed which incrementally adds cells to the temporal reconstruction based on their potential to explain the observed silhouettes from different cameras. Experimental results are close to manually labelled approaches and outperform standard leave-M-out reconstruction techniques in terms of F1-score.

Keywords

Shape reconstruction occlusion multi-camera fusion

1. Introduction

Shape-from-silhouettes algorithms are very sensitive to errors in the provided silhouettes. Three types of errors are common: inaccurate silhouette boundaries, holes in the silhouettes and parts of the silhouette that are missing entirely. Such incomplete silhouettes may be due to errors in the segmentation algorithm, such as foreground/background (FG/BG) segmentation, but their primary cause is occlusion. If a static object is positioned between the camera and the moving object, foreground/background segmentation is unable to segment parts of the silhouette of the moving object. Occlusion is very common in many real-life situations: in indoor as well as outdoor environments, it is often inevitable. Furthermore, occlusion is a major issue in the field of people tracking and 3D reconstruction, where it leads to target loss and incorrect shape reconstruction.

An example application of the presented algorithm is monitoring traffic situations using multiple cameras. Occluding objects, such as trees, light poles and parked cars prevent the cameras from completely observing the objects of interest, such as moving cars, cyclists and pedestrians. Occlusion causes inconsistencies when the images of the different cameras are fused. Ignoring these inconsistencies may lead to the disappearance of objects which may cause accidents.

Another application in which occlusion causes major problems is athlete performance analysis. For instance, when the leg of a person is occluded by a table in one of the camera views, the leg will be cut off from the person’s 3D reconstruction when using conventional 3D reconstruction methods, such as shape-from-silhouettes. In case of pose estimation, this is a major setback as this leg will not be found and therefore the leg’s position will remain unknown, or worse, be detected in the wrong place.

We propose a novel algorithm that is still able to provide a high quality reconstruction from incomplete silhouettes. Experimental results show that we obtain results close to a reconstruction method where the occlusion is known by manual labelling. This work is a further improvement and adaptation of the work published in [1, 2, 3]. In [1] we tried reconstructing occlusion depth maps over time on a pixel level by checking pairwise reconstructions to find the inconsistencies in the silhouettes. Our publication in [2] introduced our cell concept, which is also used in [3] and this article. The key difference between these publications is the strategy used to find the subset of cells that are part of the object being reconstructed.

The contributions of this article are twofold: a novel consistency score is presented and the algorithm is extended to also operate in parts of the reconstruction scene which are not necessarily covered by all cameras. These contributions both lead to a more general approach while producing improved results.

2. Related work

2.1 Depth sensors

A first class of methods models depth relations to detect and map occluding objects. The most straightforward way to obtain such depth relations is by using sensors with inherent depth perception, e.g. stereo cameras as in [4, 5, 6, 7], time of flight cameras [8], or micro-lens cameras [9]. However, such specialized sensors are not as cheap and widely adopted as general purpose cameras, which are already ubiquitous in many sites including industrial environments and traffic intersections.

2.2 Standard cameras

Standard cameras have also been used to infer depth relations in the scene, as in [10], but the drawback of this method is that it considers part-time occluded regions as full-time black spots in the camera view, from which no information is used even if an object passes in front of the occluder. When rigid objects are modelled, it is possible to infer their interrelationship, despite being partially occluded [11]. However, such methods fail for non-rigid and unknown objects.

2.3 Object modelling

A second class of occlusion-proof 3D reconstruction methods models the object of interest to estimate the position of occluded object parts as well as possible [12, 13, 14, 15, 16]. While these methods work well for pose estimation of known objects, they cannot be generalized to more diverse object classes such as road vehicles.

Figure 1.

Illustration of shape from inconsistent silhouettes. The blue vehicles function as occluders for the 6 camera views on the left. Each colour in the reconstructed car corresponds to another cell.

2.4 Tracking

Related to the topic of occlusion-affected pose estimation is the field of occlusion-affected tracking. This is usually solved using temporal modelling of trajectories, the inference of occlusion regions, as well as appearance models for known object classes [17, 18, 19, 20, 21]. In [22] multiple frames in time are used to cope with the incompleteness of the observed silhouettes. However, the method only works on rigid objects which is unsuitable for tracking people. These concepts offer few benefits for the more general applications targeted in this paper.

2.5 Extended visual hull

Much more relevant are the adaptations to standard shape-from-silhouettes reconstruction proposed by [23], who show that extending silhouettes into regions suspected to be occluded yields more accurate 3D reconstruction than ignoring the occluded areas completely as done in [10]. However, occluded regions are still treated as completely devoid of information even when objects can pass in front of the occluding object. This is remedied in [24], where the reconstruction takes into account that one or more of the cameras may have unreliable silhouette information, without explicitly modelling in which cameras or in which parts of the scene this occurs. In [25] this approach is extended to estimate the number of unreliable cameras locally rather than globally, using an a priori object model to reduce the overestimation of object size.

2.6 Octree-based reconstruction

An attempt to reduce the computational cost was made in [26], using Dempster-Shafer theory with an octree-based method that obtained results comparable to voxel-based methods [24, 27].

2.7 Clustering

Another way to reduce the computation time is to cluster the smallest building blocks of space (voxels), which significantly decreases the number of possibilities [28].

The method proposed in this paper is similar to these works in the sense that it also aims to reconstruct the shape of arbitrary objects of unknown size and complexity in a way that maximally agrees with the incomplete silhouettes. However, instead of minimizing an energy function over a very large set of voxel configurations, we will reason in terms of voxel space regions with homogeneous properties, which is computationally less demanding. Additionally, our iterative growing algorithm does not require sparsity or regularization parameters.

3. Incomplete silhouettes and space partitioning

A key idea of our approach is to apply geometric reasoning not to individual points, but to point sets called cells. The cells we define are continuous regions in space formed by the backprojected generalized cones of the silhouettes. The intersections of the cone boundaries delimit regions of 3D space with uniform silhouette information, e.g. all points within one such cell fall inside the cones of one particular set of cameras and outside the cones of the silhouettes of the remaining cameras. We use geometric reasoning on cells, rather than on smaller entities, such as 3D points for two reasons. Each object usually consists of a 3D connected set of points (rather than a scattered set of points in the reconstruction space) and, given the observed silhouettes and the camera calibration, the cells are the smallest coherent entities. Figure 1 illustrates a shape reconstruction with the proposed algorithm.

A major advantage of our approach over the work of Landabaso et al. [24] and Haro and Pardàs [25] is that we are able to decide for every cell if it is a valuable addition to the reconstruction, rather than using a single threshold which treats all cells equally.

In the remainder of this section, we review the visual hull and shape-from-silhouettes concepts before we arrive at the partitioning of the reconstruction space into cells.

3.1 Visual hull and shape-from-silhouettes

For an object ${\cal S}\subset\mathbb{R}^{3}$ , the visual hull depends on the object ${\cal S}$ and the viewing set ${\cal V}=\{\text{V}_{1},\ldots,\text{V}_{N}\}$ , where each of the $N$ viewpoints $\text{V}_{j}$ corresponds to a camera position ( $j$ indicates the index of a camera).

Definition 1.

The visual hull $\text{{VH}}({\cal S},{\cal V})$ of an object ${\cal S}$ relative to a viewing set ${\cal V}$ is the set of all points $P\in\mathbb{R}^{3}$ such that, for each point $P\in\text{{VH}}({\cal S},{\cal V})$ and each viewpoint $V_{j}\in{\cal V}$ , the half-line starting at $V_{j}$ and passing through $P$ contains at least one point of ${\cal S}$ (definition from [29]).

However, Definition 1 is based on a known object ${\cal S}$ . More common is the approximation of an unknown shape based on observed silhouettes $I_{j}$ , which are projections of the object ${\cal S}$ on each of the image planes corresponding to the viewpoints in ${\cal V}$ . Therefore, we define the shape-from-silhouettes as follows.

Definition 2.

The shape-from-silhouettes $\text{{SfS}}(I,{\cal V})$ of the observed silhouettes $I_{j}$ from a set of viewpoints ${\cal V}$ is the set of all points $P\in\mathbb{R}^{3}$ such that, for each viewpoint $V_{j}\in{\cal V}$ , the projection of $P$ on the image plane of camera $j$ , $P_{j}$ , is part of silhouette $I_{j}$ .

If the object ${\cal S}$ is fully visible by all cameras, both Definitions 1 and 2 describe the same set of points in $\mathbb{R}^{3}$ . However, as soon as silhouettes are prone to errors, discrepancies arise between both volumes. Errors may be due to bad FG/BG segmentation, but in many cases, they are due to other objects blocking the view of the object of interest, either partially or completely, from a viewpoint $V_{j}$ , the so-called occluders. Our aim is to work around these discrepancies and obtain the shape-from-silhouettes as close to $\text{{{VH}}}({\cal S},{\cal V})$ as possible despite the incomplete silhouettes.

Figure 2.

Example of the space partitioning into cells in 2D with 4 viewpoints in case a stationary truck is partially blocking the view of camera 3. The aim is to find those cells which are part of the car. Different cells with membership count 2, 3 and 4 are coloured as these are the cells which have to be evaluated. In some of the cells, we printed the cell’s membership vector $\bm{\psi}$ .

3.2 Space partitioning into cells

Let $\text{$I$}_{j}$ be the silhouette of an object ${\cal S}$ with respect to camera $j$ . We denote the projection of a point $P\in\mathbb{R}^{3}$ on the image sensor of camera $j$ as $P_{j}$ . For each point $P\in\mathbb{R}^{3}$ we define a membership function $\text{$\psi$}_{j}(P)$ as follows:

$\displaystyle\psi_{j}(P)=\begin{cases}1&\quad\text{if }P_{j}\in I_{j}\\ 0&\quad\text{otherwise,}\end{cases}$ (1)

which indicates whether the projection of point $P$ lies inside or outside the silhouette of a particular camera $j$ . For $N$ cameras we define the membership vector

$\displaystyle\bm{\psi}(P)=(\text{$\psi$}_{1}(P),\ldots,\text{$\psi$}_{N}(P)).$ (2)

That is, $\bm{\psi}(P)$ will be a binary vector of the form $(\ldots,0,\ldots,1,\ldots)$ that indicates for which cameras the projection of $P$ lies within the silhouette, and for which cameras it does not. When the silhouettes are complete, the membership vector will be of the form $(1,1,\ldots,1)$ not only for the points in ${\cal S}$ , but also for points in the shape-from-silhouettes of ${\cal S}$ . The points outside the shape-from-silhouettes are the points $P$ for which at least one element of $\bm{\psi}(P)$ is zero. When at least one of the silhouettes is incomplete, however, some elements of $\bm{\psi}(P)$ may be zero even when $P$ belongs to ${\cal S}$ . We will refer to the number of ones in the membership vector as the membership count. Figure 2 shows some of these membership vectors.
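As a minimal sketch of Eqs. (1) and (2), the membership vector and the membership count could be computed as follows. The 2D toy scene, the two orthographic line-sensor projections and the names `membership_vector` and `membership_count` are illustrative assumptions and not part of the described implementation.

```python
from typing import Callable, Sequence, Set, Tuple

Point = Tuple[int, int]


def membership_vector(point: Point,
                      projections: Sequence[Callable[[Point], int]],
                      silhouettes: Sequence[Set[int]]) -> Tuple[int, ...]:
    """Eq. (2): one entry per camera, 1 if the projection lies in the silhouette."""
    return tuple(1 if project(point) in silhouette else 0
                 for project, silhouette in zip(projections, silhouettes))


def membership_count(psi: Tuple[int, ...]) -> int:
    """Number of ones in the membership vector."""
    return sum(psi)


# Hypothetical toy setup: camera 0 projects onto the x axis, camera 1 onto the y axis.
projections = [lambda p: p[0], lambda p: p[1]]
silhouettes = [{2, 3, 4}, {1, 2}]  # silhouette pixels on each line sensor

psi_inside = membership_vector((3, 2), projections, silhouettes)
psi_partial = membership_vector((3, 5), projections, silhouettes)
print(psi_inside, membership_count(psi_inside))    # (1, 1) 2
print(psi_partial, membership_count(psi_partial))  # (1, 0) 1
```

With real cameras, the projection functions would come from the camera calibration rather than from simple axis projections.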

Now let $P$ be any point in $\mathbb{R}^{3}$ , then the cell $A$ is defined as the set of all points $Q\in\mathbb{R}^{3}$ for which there exists a continuous path from $Q$ to $P$ such that $\bm{\psi}(P)=\bm{\psi}(R)$ for all points $R$ along the path. Thus, a cell has the following properties:

(i) All points in a cell share the same membership vector;

(ii) All points in a cell are path-connected;

(iii) A cell is maximal in the sense that it cannot be a subset of a larger set that has properties (i) and (ii).

The above definition of a cell can easily be adapted for a 2D scene. Figure 2 illustrates the subdivision of 2D space into polygonal cells. The moving object is a car, the view of which is partially blocked by a parked truck. The cells arise from the backprojection of the silhouettes observed by each camera. Hence in 2D all cells are convex polygons as the objects on the line sensor are convex. In 3D, however, cells are no longer convex by definition, but may be very irregular.
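On a discretized 2D grid, the cell definition amounts to connected-component labelling over points with equal membership vectors. The sketch below assumes an integer grid with 4-connectivity and a precomputed dictionary of membership vectors; the function name `partition_into_cells` is hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Point = Tuple[int, int]
Psi = Tuple[int, ...]


def partition_into_cells(psi_grid: Dict[Point, Psi]) -> List[Tuple[Psi, Set[Point]]]:
    """Group grid points into maximal connected sets sharing one membership vector."""
    unvisited = set(psi_grid)
    cells = []
    while unvisited:
        seed = unvisited.pop()
        vector = psi_grid[seed]
        cell = {seed}
        queue = deque([seed])
        while queue:  # breadth-first flood fill from the seed
            x, y = queue.popleft()
            for neighbour in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if neighbour in unvisited and psi_grid[neighbour] == vector:
                    unvisited.remove(neighbour)
                    cell.add(neighbour)
                    queue.append(neighbour)
        cells.append((vector, cell))
    return cells


# A single row of five grid points: the point in the middle has a different
# membership vector and therefore separates two cells with identical vectors.
psi_grid = {(0, 0): (1, 1), (1, 0): (1, 1), (2, 0): (1, 0),
            (3, 0): (1, 1), (4, 0): (1, 1)}
cells = partition_into_cells(psi_grid)
print(sorted(len(points) for _, points in cells))  # [1, 2, 2]
```

The example shows that two regions with the same membership vector still form separate cells when they are not connected.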

Definition 2 is equivalent to the union of cells for which all elements in the membership vector are 1. Clearly, this will be a poor approximation of the visual hull when one or more of the silhouettes are incomplete. This is illustrated in Fig. 2, where the cell D is not part of the shape-from-silhouettes. By adding cell $D$ which is seen from cameras 1, 2 and 4, but not from 3, we could almost reconstruct the visual hull as if the occluder had not been present.

The approximated shape that we want to find is the minimal union of all the cells that contain at least one point of the visual hull. We can write this as:

$\displaystyle\bigcup_{k\in K}A_{k}:A_{k}\cap\text{VH}({\cal S},{\cal V})\neq\emptyset.$ (3)

Note however that Eq. (3) does not provide a method for finding the cells $A_{k}$ , since we do not know $\text{{{VH}}}({\cal S},{\cal V})$ . Hence, we need other criteria to decide whether $A_{k}$ should be part of the reconstruction.

4. Cell-based geometric reasoning

A first difficulty is that it is not feasible to test all possible cell configurations, since the number of these configurations scales exponentially with the number of cells (which easily exceeds 60 in a real-world experiment with 8 cameras). A strategy is therefore needed to quickly navigate the search space.

A second challenge is that parts of a silhouette can be explained by multiple cells. Reconsidering Fig. 2, the missing purple part in the silhouette of camera $V_{1}$ (given the shape-from-silhouettes reconstruction, the red cell), can be explained by cells A, B, C, D and E. It also illustrates that once a cell is added to the reconstruction (e.g. cell D), other cells may become superfluous.

The above remarks suggest that we need an iterative strategy to add cells, and that this strategy must be based on some sort of consistency score for each cell. The consistency score needs to measure the improvement of consistency across the silhouettes. The consistency is improved by explaining parts of the silhouettes that have not been explained. On the other hand consistency may also deteriorate by adding unnecessary parts to some silhouettes.

The iteration starts from a reconstruction that equals the shape-from-silhouettes $\text{{{SfS}}}(I,{\cal V})$ . This initial shape $\text{$Y$}^{0}$ is then projected onto each of the camera views. We define $I(\text{$Y$}^{0})_{j}$ as the projection of the initial shape $\text{$Y$}^{0}$ on camera $j$ . At each iteration a cell will be added. The iteration thus proceeds as follows:

$\displaystyle\text{$Y$}^{t}=\text{$Y$}^{t-1}\cup A_{k_{m}},$ (4)

where $A_{k_{m}}$ indicates the cell which corresponds to the highest consistency score at iteration $t-1$ .

4.1 Evaluation metric

We will explain the choice for our consistency score by first looking at the end goal. A method such as the one presented in this paper needs an objective evaluation to compare different solutions. While we want to approximate the 3D shape of the object as closely as possible as in Eq. (3), the only information available to us are 2D silhouettes. Therefore, a reconstruction method is usually evaluated by simulating incomplete silhouettes [23, 27, 26], so that the ground truth is known and the F-score can be used in the evaluation:

$\displaystyle F_{\beta}=(1+\beta^{2})\frac{\text{precision}\cdot\text{recall}}{(\beta^{2}\cdot\text{precision})+\text{recall}}.$ (5)
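When the ground truth is available as a voxel set, Eq. (5) can be evaluated directly. The following sketch assumes that both the reconstruction and the ground truth are given as sets of voxel indices; the function name and the toy data are illustrative only.

```python
from typing import Set


def f_beta_voxels(reconstruction: Set[int], ground_truth: Set[int],
                  beta: float = 1.0) -> float:
    """Eq. (5) with precision and recall measured by counting voxels."""
    true_positives = len(reconstruction & ground_truth)
    if true_positives == 0:
        return 0.0  # no overlap: precision and recall are both zero
    precision = true_positives / len(reconstruction)
    recall = true_positives / len(ground_truth)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


ground_truth = set(range(10))             # 10 object voxels
reconstruction = set(range(6)) | {10, 11}  # 6 correct voxels, 2 spurious ones
score = f_beta_voxels(reconstruction, ground_truth)
print(round(score, 4))  # precision 0.75 and recall 0.6 give F1 = 0.6667
```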

Figure 3.

Coverage (a) and resemblance (b) based on the projection of a shape $I(H_{t,k})_{j}$ and the extended silhouette $I^{t}_{E,j}$ .

Figure 4.

Typical evolution of coverage and resemblance over the iterations, for an occluded camera view and averaged over all camera views.

Ideally, the reconstruction method adds cells one by one such that the F-score in Eq. (5) increases by an amount $\Delta{}F$ at each step. However, since ground truth is not available, $\Delta{}F$ cannot be evaluated. Therefore, we propose a technique that estimates $\Delta{}F$ , which proves to work well.

4.2 Extended silhouettes

As our consistency score needs to reflect whether a cell explains part of a silhouette that was not yet explained by earlier additions, the silhouette we compare to should be a combination of the initial silhouette and that of the reconstructed volume at iteration step $t$ rather than only the initial silhouette. We represent the extended silhouette at iteration $t$ of camera $j$ with the newest addition of cell $A_{k_{m}}$ as follows:

$\displaystyle I^{t}_{E,j}=I^{t-1}_{E,j}\cup I(A_{k_{m}})_{j},$ (6)

where $I^{0}_{E,j}=I_{j}$ , the observed silhouettes.

A small drawback of using extended silhouettes is that the consistency scores for the remaining cells in the search space need to be recalculated each time a cell is accepted. In fact, as the reconstruction $Y^{t}$ grows at every iteration, the extended silhouettes $I^{t}_{E,j}$ grow accordingly and therefore the consistency scores of the remaining cells may change. However, the iterative adaptation of the silhouettes also makes it possible to find a maximum consistency score and therefore a clear stop criterion.
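The update in Eq. (6) is a per-camera set union. A minimal sketch, assuming silhouettes and cell projections are stored as sets of pixel indices (the name `extend_silhouettes` is hypothetical):

```python
from typing import List, Set


def extend_silhouettes(extended: List[Set[int]],
                       cell_projection: List[Set[int]]) -> List[Set[int]]:
    """Eq. (6): add the projection of the accepted cell to each extended silhouette."""
    return [silhouette | projection
            for silhouette, projection in zip(extended, cell_projection)]


observed = [{1, 2, 3}, {4, 5}]   # extended silhouettes at t = 0 for two cameras
accepted = [{3, 4}, {5}]         # projections of the accepted cell
updated = extend_silhouettes(observed, accepted)
print(updated)  # [{1, 2, 3, 4}, {4, 5}]
```

Only the first camera gains a pixel here, because the projection of the cell in the second camera was already explained.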

4.3 Coverage and resemblance

We will compare how well a cell fits the extended silhouettes using scores inspired by recall and precision as in Eq. (5), as these describe the properties of overlap and excess well. Since the comparison is not against the ground truth however, we will not use the terms recall and precision so as not to confuse the reader. Instead we will define coverage and resemblance as the equivalent comparison metrics against extended silhouettes.

From Eq. (6) we know that the extended silhouettes depend on the previous iteration step $t-1$ . Coverage and resemblance in this section are calculated for candidate extensions of the current reconstruction. A candidate extension is denoted as $H_{t,k}=Y^{t-1}\cup A_{k}$ .

Coverage on camera $j$ is defined as the fraction of the extended silhouette that is covered by the projection of $H_{t,k}$ denoted as $I(H_{t,k})_{j}$ (Fig. 3a):

$\displaystyle\text{cov}_{j}(I^{t}_{E,j},I(H_{t,k})_{j})=\frac{\text{area}(I(H_{t,k})_{j}\cap I^{t}_{E,j})}{\text{area}(I^{t}_{E,j})}.$ (7)

Resemblance on camera $j$ equals the ratio between the area of the projected shape $I(H_{t,k})_{j}$ that is part of the extended silhouette $I^{t}_{E,j}$ , and the area of the projected shape $I(H_{t,k})_{j}$ (Fig. 3b):

$\displaystyle\text{res}_{j}(I^{t}_{E,j},I(H_{t,k})_{j})=\frac{\text{area}(I(H_{t,k})_{j}\cap I^{t}_{E,j})}{\text{area}(I(H_{t,k})_{j})}.$ (8)
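Equations (7) and (8) can be sketched with pixel counts standing in for areas. The set-of-pixels representation, the function names and the handling of empty sets are assumptions made for this illustration.

```python
from typing import Set


def coverage(extended_silhouette: Set[int], projection: Set[int]) -> float:
    """Eq. (7): fraction of the extended silhouette covered by the projection."""
    if not extended_silhouette:
        return 0.0
    return len(projection & extended_silhouette) / len(extended_silhouette)


def resemblance(extended_silhouette: Set[int], projection: Set[int]) -> float:
    """Eq. (8): fraction of the projection that lies inside the extended silhouette."""
    if not projection:
        return 0.0
    return len(projection & extended_silhouette) / len(projection)


extended_silhouette = set(range(10))   # pixels 0..9
projection = set(range(6, 12))         # pixels 6..11, four of which overlap
cov = coverage(extended_silhouette, projection)
res = resemblance(extended_silhouette, projection)
print(cov, round(res, 3))  # 0.4 0.667
```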

Figure 4 shows an example of the iterative progress of coverage and resemblance for an occluded view and averaged over all camera views. Even when the resemblance drops in iteration 4 for the occluded camera, we see that the average resemblance still increases.

4.4 Consistency score

Based on Eq. (5), we estimate an F-score per camera, where precision is estimated by resemblance and recall by coverage. Equations (7) and (8) are used to determine the value $\lambda_{\beta,j}(I^{t}_{E,j},I(H_{t,k})_{j})$ for each view:

$\displaystyle\lambda_{\beta,j}=(1+\beta^{2})\frac{\text{res}_{j}\cdot\text{cov}_{j}}{(\beta^{2}\cdot\text{res}_{j})+\text{cov}_{j}},$ (9)

where we omit the parameters $I^{t}_{E,j}$ and $I(H_{t,k})_{j}$ for simplicity. The value of $\beta$ will be discussed in Section 4.5.

For a scenario where $N$ cameras are equally important to the reconstruction, we consider the estimated F-score of the cell with regard to the global reconstruction as the average of all $\lambda_{\beta,j}$ -scores. The consistency score $\Lambda_{\beta}(I^{t}_{E},H_{t,k})$ of candidate solution $H_{t,k}$ is then

$\displaystyle\Lambda_{\beta}(I^{t}_{E},H_{t,k})=\frac{1}{N}\sum_{j=1}^{N}\lambda_{\beta,j}(I^{t}_{E,j},I(H_{t,k})_{j}).$ (10)

Furthermore, we assume that the F-score of the 3D reconstruction as in Eq. (5) is related to $\Lambda_{\beta}(I^{t}_{E},H_{t,k})$ , where the latter is computed by taking the average of estimated F-scores based on 2D silhouettes. In particular, we assume that any increase of $\Lambda_{\beta}(I^{t}_{E},H_{t,k})$ also increases $F$ .
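As a sanity check of Eqs. (9) and (10), the sketch below recomputes two consistency scores from the per-camera resemblance and coverage values listed in Table 1 ( $\beta=2$ ). The function names are hypothetical; the input values are copied from the table.

```python
from typing import Sequence


def lambda_score(res: float, cov: float, beta: float) -> float:
    """Eq. (9): per-camera estimate of the F-score."""
    denominator = beta ** 2 * res + cov
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * res * cov / denominator


def consistency_score(res: Sequence[float], cov: Sequence[float],
                      beta: float) -> float:
    """Eq. (10): unweighted average of the per-camera scores."""
    scores = [lambda_score(r, c, beta) for r, c in zip(res, cov)]
    return sum(scores) / len(scores)


# Row Y^0 of Table 1: all resemblances equal 1, coverage 0.60, 0.80, 1.00, 1.00.
score_y0 = consistency_score([1.0, 1.0, 1.0, 1.0], [0.60, 0.80, 1.0, 1.0], beta=2)
# Row C of iteration 1: resemblance drops to 0.83 in camera 4.
score_c = consistency_score([1.0, 1.0, 1.0, 0.83], [0.60, 1.0, 1.0, 1.0], beta=2)
print(round(score_y0, 3), round(score_c, 3))  # 0.871 0.903
```

Both values match the $\Lambda_{\beta}$ column of Table 1.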

This consistency score is calculated at each iteration step for each cell $A_{k}$ in the search space by testing the corresponding candidate solution $H_{t,k}$ . If the consistency score $\Lambda_{\beta}(I^{t}_{E},H_{t,k})$ is larger than the current consistency score $\Lambda_{\beta}(I^{t}_{E},Y^{t-1})$ , the cell is kept in the search space. If not, the cell is removed from the search space. This mechanism reduces the search space at each iteration step.

The cell $A_{k_{m}}$ with the highest consistency score is added to the current reconstruction $Y^{t}=Y^{t-1}\cup A_{k_{m}}$ and also removed from the search space.

The stop criterion in this optimization process is straightforward. Once no cell in the search space can improve the consistency score, the algorithm terminates. Since the number of cells is finite, the algorithm will always stop.
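The iteration, pruning and stop criterion described above can be sketched as a greedy loop. This is an illustration under simplifying assumptions and not the authors' implementation: every shape is represented only by its per-camera projection (a set of pixel indices), areas are pixel counts, and all names and the toy data are hypothetical.

```python
from typing import Dict, List, Set

Views = List[Set[int]]  # one pixel set per camera


def consistency(extended: Views, shape: Views, beta: float) -> float:
    """Average per-camera score of Eqs. (7)-(10), with pixel counts as areas."""
    total = 0.0
    for silhouette, projection in zip(extended, shape):
        overlap = len(silhouette & projection)
        if overlap == 0:
            continue  # this view contributes a score of zero
        res = overlap / len(projection)
        cov = overlap / len(silhouette)
        total += (1 + beta ** 2) * res * cov / (beta ** 2 * res + cov)
    return total / len(extended)


def union(a: Views, b: Views) -> Views:
    return [x | y for x, y in zip(a, b)]


def grow(observed: Views, initial: Views,
         cells: Dict[str, Views], beta: float = 1.0) -> List[str]:
    """Greedily add the best-scoring cell until no cell improves the score."""
    extended = [set(s) for s in observed]
    shape = [set(s) for s in initial]
    search_space = set(cells)
    added: List[str] = []
    while search_space:
        current = consistency(extended, shape, beta)
        scores = {k: consistency(extended, union(shape, cells[k]), beta)
                  for k in search_space}
        best = max(scores, key=scores.get)
        if scores[best] <= current:
            break  # stop criterion: no candidate improves the score
        added.append(best)
        shape = union(shape, cells[best])
        extended = union(extended, cells[best])  # extended silhouettes, Eq. (6)
        # Prune the cells that did not beat the current score.
        search_space = {k for k in search_space
                        if k != best and scores[k] > current}
    return added


full, part = set(range(10)), set(range(6))
observed = [full, full, part]            # the third camera is occluded
initial = [part, part, part]             # projections of the initial shape
cells = {
    "B": [set(range(6, 10))] * 3,        # the occluded part of the object
    "X": [set(range(20, 24)), {0, 1}, {0, 1}],  # a spurious cell
}
result = grow(observed, initial, cells)
print(result)  # ['B']
```

In this toy scene the cell behind the occluder raises the score from about 0.83 to about 0.92 and is added, while the spurious cell lowers it and is pruned in the first iteration.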

4.5 Value of $\beta$

Depending on the application, coverage or resemblance becomes more important. On the one hand, the reconstruction of a person in sports analysis should reconstruct the complete person in order to perform further analysis such as skeleton fitting. Therefore the coverage should be as high as possible. On the other hand, in very cluttered real-world scenes with a lot of spurious foreground detections, resemblance should be as high as possible, because otherwise very large cells may be included which are barely part of the object that is being reconstructed. Typical values for $\beta$ are between 0.5 and 10. If a system needs to be calibrated we propose to start out with $\beta$ equal to 1 and adapt $\beta$ for a typical scene until the subjective assessment of an operator in terms of connectedness, convexity or any other measure is met. The ideal value of $\beta$ depends on the characteristics of the camera and silhouette detector. A sensitivity analysis of parameter $\beta$ is provided in Section 6.2.
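As a worked example of this trade-off under Eq. (9), consider a hypothetical view with $\text{res}_{j}=0.5$ and $\text{cov}_{j}=1$ (values chosen purely for illustration):

$\displaystyle\lambda_{0.5,j}=1.25\cdot\frac{0.5\cdot 1}{0.25\cdot 0.5+1}\approx 0.556,\qquad\lambda_{10,j}=101\cdot\frac{0.5\cdot 1}{100\cdot 0.5+1}\approx 0.990.$

A large $\beta$ thus barely penalizes the low resemblance and favours coverage, whereas a small $\beta$ penalizes it strongly.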

4.6 The occlusion handling algorithm

Algorithm 1 shows the different steps of the proposed method as explained in the previous sections. In a practical implementation, we often reconstruct an object as a set of voxels, and we define a cell as a set of voxels. However, the algorithm can also be implemented without discretization to a voxel space.

Objective

Given $N$ calibrated observed (incomplete) silhouettes, find the subset of cells that are part of the visual hull.

Algorithm

(i) Divide the reconstruction space into cells: $K_{0}$ represents the set of all cell indices.

(ii) Compute the reconstructed shape:

(a) Find all cells that project inside each silhouette of its observing cameras:
    $Y^{0}=\{P\in\mathbb{R}^{3}:\psi_{j}(P)=1,\forall j\}$
    $K^{0}=\{k:A_{k}\not\subset Y^{0}\}$

(b) Set $I^{0}_{E,j}=I_{j}$ for all camera views $j$ and set $t=1$

while $K^{t-1}\neq\emptyset$ :
    $k_{m}=\underset{k\in K^{t-1}}{\mathrm{argmax}}\ \Lambda_{\beta}(I^{t-1}_{E},Y^{t-1}\cup A_{k})$
    If $\Lambda_{\beta}(I^{t-1}_{E},Y^{t-1}\cup A_{k_{m}})>\Lambda_{\beta}(I^{t-1}_{E},Y^{t-1})$ :
        $Y^{t}=Y^{t-1}\cup A_{k_{m}}$
        $I^{t}_{E,j}=I^{t-1}_{E,j}\cup I(A_{k_{m}})_{j}$ for all camera views $j$
        $K^{t}=K^{t-1}\setminus\{k_{m}\}$
        $t=t+1$
    Else: stop

return $Y^{t-1}$

Algorithm 1: The proposed shape reconstruction method from incomplete silhouettes. For simplicity we have omitted the pruning of unfit cells from the above algorithm. Note that the algorithm terminates immediately when there are no incomplete silhouettes.

4.7 Limited field of view (FOV)

Real cameras have limited fields of view. Therefore, parts of the scene may lie outside the FOV of a camera. Equations (7) and (8) only consider the part of the scene which projects inside the FOV of each camera. In Section 6 a number of experiments are conducted where the object of interest is not inside the FOV of all cameras to illustrate this feature. The initial reconstruction also differs from the strict definition of $\text{SfS}(I,{\cal V})$ : the initial shape becomes the union of all cells which project inside the observed silhouettes, only considering the cameras for which the cell projects inside the camera’s FOV.
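The adapted initial shape can be sketched by letting the membership vector carry a third state for cameras that do not see the cell. The encoding with `None`, the assumption that a cell lies either completely inside or completely outside a FOV, and the choice to exclude cells seen by no camera are all illustrative assumptions.

```python
from typing import Optional, Sequence


def in_initial_shape(psi: Sequence[Optional[int]]) -> bool:
    """True if the cell projects inside the silhouette of every camera that sees it.

    Entries are 1 (inside the silhouette), 0 (outside the silhouette) or
    None (the cell projects outside the FOV of that camera).
    """
    visible = [value for value in psi if value is not None]
    # Assumption: a cell that no camera observes is not part of the initial shape.
    return bool(visible) and all(value == 1 for value in visible)


print(in_initial_shape([1, None, 1, 1]))   # True: camera 2 is ignored
print(in_initial_shape([1, None, 0, 1]))   # False: camera 3 sees the cell as background
print(in_initial_shape([None, None]))      # False: no camera observes the cell
```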

Table 1
Example of the algorithm with a stationary truck as an occluder for camera 3, captured by four cameras and $\beta=2$ . From iteration 1 we list all candidate cells, which form the search space. The last column indicates the cell’s decision: keep (K), add (A) and reject (R). A cell is rejected if its consistency score is lower than that of the previously accepted cell in the updated reconstruction. The algorithm adds the cell with the highest consistency score at each iteration as long as the consistency score increases. Iteration 2a shows the final iteration of the algorithm when extended silhouettes are used. Iteration 2b is the second iteration in case the observed silhouettes are used. Note that the stop criterion is hard to find in the latter case.

It	Cell	Mc	$\text{res}_{1}$	$\text{res}_{2}$	$\text{res}_{3}$	$\text{res}_{4}$	$\text{cov}_{1}$	$\text{cov}_{2}$	$\text{cov}_{3}$	$\text{cov}_{4}$	$\Lambda_{\beta}$	Decision
0	$Y^{0}$	4	1.00	1.00	1.00	1.00	0.60	0.80	1.00	1.00	0.871	A
1	$Y^{0}$	4	1.00	1.00	1.00	1.00	0.60	0.80	1.00	1.00	0.871	$-$
	A	3	1.00	0.67	1.00	1.00	0.61	0.80	1.00	1.00	0.858	R
	B	3	1.00	1.00	0.31	1.00	1.00	1.00	1.00	1.00	0.922	A
	C	3	1.00	1.00	1.00	0.83	0.60	1.00	1.00	1.00	0.903	K
	D	3	1.00	1.00	0.83	1.00	0.60	0.80	1.00	1.00	0.862	R
	E	3	0.71	1.00	1.00	1.00	0.60	0.80	1.00	1.00	0.863	R
	F	3	1.00	1.00	1.00	0.83	0.60	0.80	1.00	1.00	0.862	R
	G	2	1.00	0.67	0.50	1.00	1.00	0.80	1.00	1.00	0.901	K
	H	2	1.00	0.80	0.40	1.00	0.90	0.80	1.00	1.00	0.872	K
	I	2	1.00	1.00	0.73	0.90	0.70	1.00	1.00	1.00	0.913	K
	J	2	1.00	1.00	0.67	0.83	0.60	1.00	1.00	1.00	0.881	K
	K	2	0.63	1.00	1.00	0.63	0.60	0.80	1.00	1.00	0.833	R
2a	B	3	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.000	$-$
	C	3	1.00	1.00	1.00	0.83	1.00	1.00	1.00	1.00	0.990	R
	G	2	1.00	0.75	0.50	1.00	1.00	1.00	1.00	1.00	0.943	R
	H	2	1.00	0.84	0.62	1.00	1.00	1.00	1.00	1.00	0.963	R
	I	2	1.00	1.00	1.00	0.90	1.00	1.00	1.00	1.00	0.995	R
	J	2	1.00	1.00	0.87	0.83	1.00	1.00	1.00	1.00	0.983	R
2b	B	3	1.00	1.00	0.31	1.00	1.00	1.00	1.00	1.00	0.922	$-$
	C	3	1.00	1.00	1.00	0.83	1.00	1.00	1.00	1.00	0.990	?
	G	2	1.00	0.67	0.50	1.00	1.00	1.00	1.00	1.00	0.936	?
	H	2	1.00	0.80	0.31	1.00	1.00	1.00	1.00	1.00	0.911	?
	I	2	1.00	1.00	1.00	0.90	1.00	1.00	1.00	1.00	0.995	?
	J	2	1.00	1.00	0.27	0.83	1.00	1.00	1.00	1.00	0.902	?

4.8 Weighted average consistency scoring

In Eq. (10), we assumed that all cameras are equally important. In practical applications, camera calibration and silhouette extraction using foreground-background segmentation are both subject to inaccuracies and noise. Cameras closer to the object of interest can usually observe the object better. Therefore, as an extension, we propose the use of weights $w_{j,k}$ per camera $j$ and per cell $A_{k}$ depending on the average distance between the camera and the cell. Let $d_{j,k}$ be the average distance between the cell $A_{k}$ and camera $j$ , then we are able to normalize the weights as follows:

$\displaystyle w_{j,k}=\frac{d_{j,k}}{\sum_{i=1}^{N}d_{i,k}}.$ (11)

Therefore, we are able to rewrite Eq. (10) as:

$\displaystyle\Lambda_{\beta}(I^{t}_{E},H_{t,k})=\sum_{j=1}^{N}w_{j,k}\lambda_{\beta,j}(I^{t}_{E,j},I(H_{t,k})_{j}).$ (12)
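Equations (11) and (12) can be sketched as follows, taking Eq. (11) literally as printed, so that each weight is the camera's share of the summed distances. The function names, the distances and the per-camera scores are made-up values for illustration.

```python
from typing import List, Sequence


def distance_weights(distances: Sequence[float]) -> List[float]:
    """Eq. (11): normalize the camera-to-cell distances so that they sum to one."""
    total = sum(distances)
    return [d / total for d in distances]


def weighted_consistency(lambdas: Sequence[float],
                         weights: Sequence[float]) -> float:
    """Eq. (12): weighted sum of the per-camera scores."""
    return sum(w * l for w, l in zip(weights, lambdas))


distances = [2.0, 4.0, 6.0, 8.0]     # hypothetical average distances d_{j,k}
lambdas = [1.0, 1.0, 0.5, 1.0]       # hypothetical per-camera scores
weights = distance_weights(distances)
score = weighted_consistency(lambdas, weights)
print(weights, round(score, 2))  # [0.1, 0.2, 0.3, 0.4] 0.85
```

With equal distances all weights become $1/N$ and Eq. (12) reduces to Eq. (10).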

5. An example of how the method works

Figure 5 shows an example where four cameras are observing a red car. A parked truck is occluding the car partially in camera view C3. We illustrate how the proposed consistency score $\Lambda_{\beta}$ is able to reconstruct the car entirely despite the occluding view by listing all values in Table 1. The algorithm initializes with cell $Y^{0}$ which corresponds to a consistency score of 0.871. At the first iteration the resemblance and coverage are calculated based on the extended silhouettes $I_{j}\cup I(Y^{0})_{j}$ for each camera $j$ , which is basically $I_{j}$ in the first iteration. In the next iteration, cells A, D, E, F and K are already rejected because their consistency score is lower than the current consistency score. The highest consistency score, 0.922, corresponds to cell B. This cell is added to the reconstruction, which is now $Y^{1}=Y^{0}\cup B$ .

Table 2
Description of the comparison methods

Author	Ref	Method name	Method description
Laurentini’94	[29]	Classical shape-from-silhouettes	Shape-from-silhouettes algorithm, not taking into account the possibility of occlusion.
Guan’06	[23]	Occlusion mask reconstruction	OR-operation between occlusion mask and silhouettes.
Landabaso’08	[24]	Shape from inconsistent silhouettes	Shape-from-silhouettes which keeps all voxels projecting within the silhouettes of at least $N-e$ views, $e=$ number of occluders.
Slembrouck’17	[3]	Cell-based reconstruction v1	Previous approach: cell-based with cell types and counting functions less optimal than the presented approach in this paper.

Figure 5.

Example with one stationary truck, acting as a static occluder for camera 3.

Possible candidates to be added to the reconstruction are cells C, G, H, I and J. Notice that in the table we have iteration steps 2a and 2b. 2a uses the extended silhouettes, whereas 2b uses the observed silhouettes (just for illustration). In 2a we clearly see that after updating the extended silhouettes, we cannot find any cell that corresponds to a higher consistency score than $Y^{0}\cup B$ . However, in case of iteration 2b it is unclear when to stop, as the candidate cells C, G, I and H may still produce a consistency score that is close to or higher than the current one. This example illustrates that improving consistency of the extended silhouettes results in a better reconstruction than merely explaining the observed silhouettes as well as possible.

Figure 6.

The simulation consists of a car moving from left to right, captured by 7 cameras. The blue truck on the opposite side of the road is static and occludes the car in the three uppermost cameras on the right, depending on the location of the car. Different colours in the result of the proposed algorithm indicate different cells.

6. Experiments

We present three experiments that were conducted to show the potential of our method. We compare against four methods from the literature (Table 2). We simulated Guan’06 by using perfect occlusion masks because we lack the source code of the actual method; the real performance of this method will therefore most likely be worse than the reported results. Moreover, Guan’06 learns a static occlusion mask over time and does not take depth into account, so it fails when a moving object can appear both in front of and behind a static object. For Landabaso’08, the parameter $e$ has to be chosen carefully, depending on the actual camera coverage in the scene. We therefore report two results for this method: one where $e$ equals the number of occluders in the scene, which allows a full reconstruction, and one with one extra camera, which increases precision at the cost of full reconstruction. The previous approach, Slembrouck’17, is also included in the comparison where possible. All code for the simulations and experiments was developed in-house using libraries such as OpenCV [30] and PCL [31], and the implementations of the different methods were thoroughly tested. The reconstructed shapes are evaluated at the level of voxels.
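To make the role of the parameter $e$ concrete, the voting rule listed for Landabaso’08 in Table 2 (keep every voxel that projects within the silhouettes of at least $N-e$ views) can be sketched as follows. This is a simplified illustration, not the published method: the function name and the `inside` layout are assumptions, and cameras that do not see a voxel at all are not modelled.

```python
from typing import List, Sequence


def voxels_in_enough_views(inside: Sequence[Sequence[bool]], e: int) -> List[int]:
    """Return the indices of voxels lying inside at least N - e silhouettes.

    inside[j][v] is True when voxel v projects inside the silhouette of
    camera j; N is the number of cameras (rows of `inside`).
    """
    n_views = len(inside)
    n_voxels = len(inside[0]) if n_views else 0
    threshold = n_views - e
    kept = []
    for v in range(n_voxels):
        votes = sum(1 for j in range(n_views) if inside[j][v])
        if votes >= threshold:
            kept.append(v)
    return kept


# Three cameras, three voxels: voxel 0 is seen by all views,
# voxel 1 by two views and voxel 2 by one view only.
inside = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
]
strict = voxels_in_enough_views(inside, e=0)    # [0]
tolerant = voxels_in_enough_views(inside, e=1)  # [0, 1]
```

With $e=0$ the rule reduces to the intersection of all silhouette cones, which is the classical shape-from-silhouettes behaviour that does not take occlusion into account.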

6.1 Experiment 1: smart traffic simulations

The first experiment concerns smart traffic. The idea is to make full reconstructions of vehicles in traffic situations. The silhouettes of the moving car are obtained by projecting the 3D model of the car onto the different virtual camera sensors. The same holds for the other vehicles. By subtracting the silhouettes of the other vehicles from the silhouette of the car of interest in a pixel-wise fashion, we obtain the occluded silhouette of the moving car. Camera positions are chosen such that the cameras could be mounted on street lights and traffic lights.
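The pixel-wise subtraction used to simulate occluded silhouettes can be sketched as below. The sketch assumes binary masks stored as nested lists and assumes that every occluder lies between the camera and the car, so no depth test is performed; the function name is hypothetical.

```python
from typing import List, Sequence

Mask = List[List[bool]]


def occluded_silhouette(target: Mask, occluders: Sequence[Mask]) -> Mask:
    """Remove from the target mask every pixel covered by any occluder mask.

    All masks must have the same height and width. The input is not modified.
    """
    result = [row[:] for row in target]
    for occluder in occluders:
        for y, row in enumerate(occluder):
            for x, covered in enumerate(row):
                if covered:
                    result[y][x] = False
    return result


car = [
    [True, True],
    [True, False],
]
truck = [
    [False, True],
    [False, False],
]
visible_car = occluded_silhouette(car, [truck])
```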

Figure 6 shows the first situation. A stationary truck is blocking the view of multiple cameras. The red car at the bottom drives in a straight line from left to right. Laurentini’94 is unable to reconstruct the complete car (in the example only 30% of the car is reconstructed). When the car is completely next to the truck, not a single voxel is reconstructed. Landabaso’08 performs better because more voxels are found, but it also includes many voxels that are not part of the car.

Table 3 shows that Guan’06 achieves the highest F1-score, followed by our proposed algorithm. The precision of Laurentini’94 is almost 100%, but its recall is considerably lower. With Landabaso’08, in contrast, the recall is above 90% but the precision is lower. In Fig. 6, we see that the proposed reconstruction is much more accurate than the reconstructions produced by Landabaso’08 and Laurentini’94, and comparable to that of Guan’06. The ground truth in this context is the visual hull of the model in Fig. 6a when the truck is not present. It is important to note, however, that the application of Guan’06 is much more restricted than our method, as it first requires knowledge about the occluders in the scene.
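For reference, voxel-level precision, recall and F1-score can be computed as in the sketch below. Voxels are represented as sets of integer grid indices, which is an assumption made for illustration, and returning 0 when a set is empty is a convention chosen here, not one stated in the paper.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def voxel_scores(
    reconstructed: Set[Voxel], ground_truth: Set[Voxel]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of a reconstruction at voxel level."""
    true_positives = len(reconstructed & ground_truth)
    precision = true_positives / len(reconstructed) if reconstructed else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall, f1_score(precision, recall)


# Consistency check against the Laurentini'94 row of Table 3.
laurentini_f1 = round(f1_score(1.000, 0.396), 3)

# Toy reconstruction: two of three ground-truth voxels plus one spurious voxel.
truth = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
recon = {(0, 0, 0), (1, 0, 0), (5, 5, 5)}
p, r, f = voxel_scores(recon, truth)
```

Because the F1-score is the harmonic mean of the two quantities, a method with perfect precision but low recall, such as Laurentini’94 under occlusion, still obtains a low F1-score.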

Figure 7.

Typical traffic situation where both the blue truck and the blue car want to turn to their left. The car is unable to see the oncoming traffic due to the truck. One extra camera is mounted on the traffic lights (yellow pole).

Figure 8.

Visualization of the occluders. Each camera has one possible occluder (green), which can be either turned on or off. Two examples are shown. The eight cameras are placed facing the same area. All cameras are placed at approximately 2.2 m from the ground plane.

Table 3

Results for a car going straight, occluded by a parked truck. The proposed method performs almost as well as the supervised method of Guan’06 in terms of precision, recall and F1-score. Other methods are less accurate. We show the results of Landabaso’08 with both $e=$ 2 and $e=$ 3 to show the 100% recall when $e$ equals the number of occluders

Automatic methods
Method	Prec.	Rec.	F1-score
Laurentini’94	1.000	0.396	0.567
Landabaso’08 ( $e=$ 2)	0.561	1.000	0.718
Landabaso’08 ( $e=$ 3)	0.817	0.904	0.859
Proposed	0.947	0.922	0.934
Semi-supervised method
Guan’06	0.945	1.000	0.972

The second situation is a common one in traffic (Fig. 7). The blue car wants to turn to its left at the intersection. However, the blue truck also wants to turn to its left from the opposite direction. Since the truck is blocking the view of the oncoming traffic, the driver of the blue car has to wait or take a risk with possibly catastrophic consequences. The proposed algorithm could be used to detect oncoming traffic and indicate whether crossing is safe, even if the truck in the middle of the road blocks some of the camera views of the oncoming traffic. One extra camera on top of the traffic lights increases camera coverage. Table 4 shows again that our method performs almost as well as Guan’06.

Given the difficulty of automatically or manually labelling occluders in dynamic traffic scenes, the good performance of the proposed method is an important result. Even in case of severe occlusion, the algorithm manages to obtain a reliable reconstruction, as long as the object of interest can be partially observed from multiple viewpoints.

Table 4

Results in terms of precision, recall and F1-score for an oncoming car, with a truck in the middle of the road in front of a car that is trying to cross the street

Automatic methods
Method	Prec.	Rec.	F1-score
Laurentini’94	1.000	0.093	0.171
Landabaso’08 ( $e=$ 2)	0.486	1.000	0.654
Landabaso’08 ( $e=$ 3)	0.744	0.837	0.788
Proposed	0.900	0.952	0.925
Semi-supervised method
Guan’06	0.918	0.998	0.957

6.2 Experiment 2: Qualitative comparison of the 3D reconstruction

The second experiment uses the JP sequences (breakdancer) from the CVSSP-3D dataset [32], which can be requested from the authors. Each of these sequences consists of a synchronised stream of images from 8 cameras, which are placed around the subject at a height of 2.2 m, about every 45 degrees (see Fig. 8). A total of six sequences is available. Each sequence is between 10 and 20 seconds long. In this experiment we simulate the presence of occluding objects and compare with the ground truth, which is the output of the classical shape-from-silhouettes without occlusion (see Fig. 9). Each frameset has equal weight in the evaluation. Displayed values are therefore averages over each sequence. Reconstructions are performed at a cubic voxel size of 20 mm.

Figure 9.

Some examples from the reconstruction of the CVSSP-3D dataset [32] at 40 mm voxel size. In the experiments we use 20 mm.

In this experiment we simulate the presence of occluding objects by recalculating the input images as if an occluder were present. We defined an occluding object for each camera. This object is modelled as a vertical column partially blocking the view of the camera (similar to a spectator in front of the camera). For each number of occluders between 1 and 8 we choose eight random combinations (the same for each reconstruction method). For example, 2 occluders could block the view of cameras 3 and 5. Each occluder occludes roughly 17% of the total image. We averaged the results over these combinations for each number of occluders. Figure 8 visualizes the camera setup together with the possible occluders.
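The occluder sampling described above can be sketched as follows. Several details are assumptions made for illustration only: cameras are indexed from 0, combinations are sampled without repetition (so fewer than eight are returned when fewer exist, for instance with 8 occluders on 8 cameras), and the column spans the full image height so that its width fraction equals the occluded image fraction.

```python
import itertools
import random
from typing import List, Tuple


def sample_occluder_sets(
    n_cameras: int, n_occluders: int, n_samples: int, seed: int = 0
) -> List[Tuple[int, ...]]:
    """Draw distinct sets of occluded cameras.

    Returns at most n_samples combinations of n_occluders camera indices,
    sampled without repetition using a fixed seed so that every
    reconstruction method can be evaluated on the same combinations.
    """
    all_sets = list(itertools.combinations(range(n_cameras), n_occluders))
    rng = random.Random(seed)
    return rng.sample(all_sets, min(n_samples, len(all_sets)))


def column_mask(width: int, height: int, x_start: int, fraction: float) -> List[List[bool]]:
    """Binary mask of a full-height column covering `fraction` of the image width."""
    column_width = round(width * fraction)
    return [
        [x_start <= x < x_start + column_width for x in range(width)]
        for _ in range(height)
    ]


two_occluders = sample_occluder_sets(n_cameras=8, n_occluders=2, n_samples=8)
mask = column_mask(width=100, height=10, x_start=40, fraction=0.17)
occluded_fraction = sum(map(sum, mask)) / (100 * 10)
```

Fixing the seed is one simple way to guarantee that the same combinations are used for every reconstruction method, as required by the experiment.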

In general, a lower value of $\beta$ translates to a lower recall in the 3D reconstruction. On the other hand, lower values of $\beta$ produce results with higher precision. Therefore the optimal value of $\beta$ in terms of F1-score is neither zero nor very large.

Figure 10.

Sensitivity analysis of $\beta$. The value of $\beta$ becomes more important in case of severe occlusion, but the range of suitable values is quite wide.

Table 5 shows the results for all methods, averaged over all sequences. Only the F1-score is shown, but for Guan’06, Landabaso’08 and the proposed method the recall is close to or equal to 100%. Only Laurentini’94 does not obtain a high recall value because occluded parts are not reconstructed. On the other hand, the precision of Laurentini’94 is 100%. The last two columns represent the results of the proposed method, with $\beta=3$ and with the optimal $\beta$.

Table 5

F1-scores in case of multiple occluders. The first column indicates the number of occluders. Each occluder is placed in front of a certain camera. Averages over all possible combinations are shown

# occluders	Laurentini’94 [29]	Guan’06 [23]	Landabaso’08 [24]	Slembrouck’17 [3]	Proposed ( $\beta=$ 3)	Proposed (opt. $\beta$ )
0	1.00	1.00	1.00	1.00	1.00	1.00
1	0.48	0.99	0.95	0.99	0.99	0.99
2	0.38	0.98	0.86	0.97	0.97	0.97
3	0.36	0.96	0.77	0.94	0.94	0.94
4	0.31	0.93	0.63	0.90	0.91	0.91
5	0.26	0.88	0.43	0.81	0.85	0.85
6	0.24	0.77	0.20	0.64	0.68	0.69
7	0.23	0.51	0.04	0.30	0.39	0.39

In Fig. 10 we provide a sensitivity analysis which shows that the exact value of $\beta$ is not critical: there is a wide range of suitable values. The width of this range decreases with an increasing amount of occlusion. Note that the scale of $\beta$ values in the graph does not increase linearly.

Figure 11.

Visualization of the output from the proposed method, Guan’06, Laurentini’94 and Landabaso’08 in a real world example in case of a single person and multiple large occluders in the scene.

6.3 Experiment 3: Real-world single person tracking

For the third experiment we use the setup shown on the left in Fig. 11. It is a staged setup of an office environment. This setup has 7 cameras, mounted around the scene at a height of about 3.5 meters. A table, 2 chairs, an L-profile panel and a display introduce natural occlusion. The goal is to track the position of a person as well as possible. The plot in Fig. 11 shows that the proposed method is able to track the person very well in the room. Note that the measured trajectories are not smoothed in any way. Although the recordings were made in a lab, the conditions were not optimal, especially for the foreground/background segmentation step, because the colours of the clothes of the tracked person were similar to colours in the background, not unlike many real-world environments. A classical tracking algorithm suffers from tracking loss when the person is completely occluded for at least one of the cameras. Partial occlusion introduces inaccurate positions because the position is then calculated from a limited number of voxels.

Both the proposed method and Landabaso’08 produce a person location in every frame from the moment the person enters the region of interest until he leaves again (total track: 505 frames). On the other hand, Laurentini’94 only outputs positions in 175 frames, which represents only 34.65% of the frames of interest. This fact and the smoother tracking with our method show the need for occlusion handling. We also compared against Guan’06 because that was the nearest competitor in the other experiments. We clearly see that the track of this method is much less smooth than the one produced by our method. There are two reasons for that. First, Guan’06 cannot recover from bad foreground/background segmentation because this is different from static occlusion (resulting in fewer voxels in the 3D reconstruction). Second, their method indicates per pixel whether or not occlusion can happen, thereby discarding depth information. Although their algorithm outputs a position in 503/505 frames, the position is less accurate than with our proposed method. In Fig. 11 we also show the number of voxels detected at each frame. Landabaso’08 produces comparable tracking results in this experiment, but the number of voxels varies significantly, which means the reconstructed person is less accurate. Fortunately, many of the unnecessary voxels do not deteriorate the person’s position too much, even though they are not part of the person. Position averaging softens the error, but for a full 3D reconstruction these voxels are unwanted, in particular when accurate proximity detection is needed (e.g. cooperation between man and machine).
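The effect of position averaging and the reported frame percentages can be illustrated with the sketch below. It assumes that the person's position is taken as the mean of the reconstructed voxel centres, which is a plausible reading of "position averaging" but is not spelled out in the text; both function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def voxel_centroid(voxel_centres: Sequence[Point]) -> Optional[Point]:
    """Mean of the voxel centres, or None when no voxel was reconstructed."""
    if not voxel_centres:
        return None
    n = len(voxel_centres)
    return tuple(sum(v[i] for v in voxel_centres) / n for i in range(3))


def track_coverage(positions: List[Optional[Point]]) -> float:
    """Fraction of frames in which a position could be computed."""
    return sum(p is not None for p in positions) / len(positions)


position = voxel_centroid([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)])
lost = voxel_centroid([])  # a frame without voxels yields no position

# Percentages quoted in the text for a track of 505 frames.
laurentini_pct = round(100 * 175 / 505, 2)
guan_pct = round(100 * 503 / 505, 2)
```

Because the centroid averages over all voxels, a moderate number of spurious voxels shifts the position only slightly, which is consistent with the observation that Landabaso’08 tracks comparably while its voxel count fluctuates.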

7. Conclusions

In this paper we presented an algorithm for shape-from-silhouettes which can cope with incomplete silhouettes. We showed that our algorithm is able to perform well under different complexity levels of occlusion and without prior knowledge of the occluders. The algorithm automatically detects the occluded parts in the camera views and uses this information to only use relevant silhouette information for the reconstruction of the object of interest.

The algorithm succeeds in reconstructing the entire object of interest and the reconstruction closely resembles the visual hull.

As illustrated in the results section, our algorithm improves on the state of the art for the reconstruction as well as the tracking of a moving object in a scene with occlusion, in a completely automatic fashion.

References

1. Slembrouck, Van Cauwelaert, Van Hamme, Van Haerenborgh, Van Hese, Veelaert, et al. Self-learning voxel-based multi-camera occlusion maps for 3D reconstruction. In: International Conference on Computer Vision Theory and Applications, Proceedings. SCITEPRESS, 2014, p. 8.

2. Slembrouck, Van Cauwelaert, Veelaert, Philips. Shape-from-silhouettes algorithm with built-in occlusion detection and removal. In: International Conference on Computer Vision Theory and Applications, Proceedings. SCITEPRESS, 2015.

3. Slembrouck, Veelaert, Van Hamme, Van Cauwelaert, Philips. Cell-based approach for 3D reconstruction from incomplete silhouettes. In: P. and Veelaert. Springer. 2017, p. 12.

4. Kang, Szeliski, Chai. Handling occlusions in dense multi-view stereo. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. vol. 1. IEEE, 2001. pp. 1-103.

5. Wegner, Stankiewicz, Domański. Occlusion handling in depth estimation from multiview video. In: Signals and Electronic Systems (ICSES), 2014 International Conference on. IEEE, 2014, pp. 1-4.

6. Zitnick, Kanade. A cooperative algorithm for stereo matching and occlusion detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000; 22(7): 675-684.

7. Sun, Kang, Shum. Symmetric stereo matching for occlusion handling. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). vol. 2. IEEE, 2005. pp. 399-406.

8. Ding, Song. Robust object tracking using color and depth images with a depth based occlusion handling and recovery. In: Fuzzy Systems and Knowledge Discovery (FSKD), 2015 12th International Conference on. IEEE, 2015. pp. 930-935.

9. Zhang, Zhu, Jiang, Neri. Geometry based three-dimensional image processing method for electronic cluster eye. Integrated Computer-Aided Engineering. 2018; (Preprint): 1-16.

10. Favaro, Duci, Soatto. On exploiting occlusions in multiple-view geometry. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE, 2003. pp. 479-486.

11. Xiang, Savarese. Object Detection by 3D Aspectlets and Occlusion Reasoning. In: 2013 IEEE International Conference on Computer Vision Workshops, 2013. pp. 530-537.

12. Girshick, Felzenszwalb, Mcallester. Object detection with grammar models. In: Advances in Neural Information Processing Systems, 2011, pp. 442-450.

13. Mathias, Benenson, Timofte, Van Gool. Handling occlusions with franken-classifiers. In: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1505-1512.

14. Duan, Lao. A structural filter approach to human detection. In: European Conference on Computer Vision. Springer, 2010, pp. 238-251.

15. Ouyang, Wang. A discriminative deep model for pedestrian detection with occlusion handling. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3258-3265.

16. Ouyang, Zeng, Wang. Partial occlusion handling in pedestrian detection with a deep model, 2015.

17. Possegger, Mauthner, Roth, Bischof. Occlusion Geodesics for Online Multi-Object Tracking. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

18. Topçu, Alatan, Ercan. Occlusion-aware 3D multiple object tracker with two cameras for visual surveillance. In: Advanced Video and Signal Based Surveillance (AVSS), 2014 11th IEEE International Conference on, 2014. pp. 56-61.

19. Jung, Yoon, Paik. Object Occlusion Detection Using Automatic Camera Calibration for a Wide-Area Video Surveillance System. Sensors. 2016; 16(7): 982. Available from: http://www.mdpi.com/1424-8220/16/7/982.

20. Otsuka, Mukawa. Multiview occlusion analysis for tracking densely populated objects based on 2-D visual angles. In: Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. vol. 1, 2004. p. I-90-I-97.

21. Ess, Schindler, Leibe, Van Gool. Improved multi-person tracking with active occlusion handling. In: ICRA Workshop on People Detection and Tracking. vol. 2. Citeseer, 2009.

22. Toyoura, Iiyama, Funatomi, Kakusho, Minoh. 3d shape reconstruction from incomplete silhouettes in multiple frames. In: Pattern Recognition, 2008. ICPR 2008. 19th International Conference on. IEEE, 2008. pp. 1-4.

23. Guan, Sinha, Franco, Pollefeys. Visual hull construction in the presence of partial occlusion. In: 3D Data Processing, Visualization, and Transmission, Third International Symposium on. IEEE, 2006. pp. 413-420.

24. Landabaso, Pardàs, Casas. Shape from inconsistent silhouette. Computer Vision and Image Understanding. 2008; 112(2): 210-224.

25. Haro, Pardàs. Shape from incomplete silhouettes based on the reprojection error. Image and Vision Computing. 2010; 28(9): 1354-1368.

26. Díaz-Más, Madrid-Cuevas, Muñoz-Salinas, Carmona-Poyato, Medina-Carnicer. An octree-based method for shape from inconsistent silhouettes. Pattern Recognition. 2012; 45(9): 3245-3255.

27. Díaz-Más, Muñoz-Salinas, Madrid-Cuevas, Medina-Carnicer. Shape from silhouette using Dempster-Shafer theory. Pattern Recognition. 2010; 43(6): 2119-2131.

28. Poikolainen, Neri, Caraffini. Cluster-Based Population Initialization for differential evolution frameworks. Information Sciences. 2015; 297: 216-235. Available from: shttp//www-sciencedirect-com.web.bisu.edu.cn/science/article/pii/S0020025514010962.

29. Laurentini. The visual hull concept for silhouette-based image understanding. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 1994; 16(2): 150-162.

30. Bradski. The OpenCV Library. Dr Dobb’s Journal of Software Tools, 2000.

31. Rusu, Cousins. 3D is here: Point Cloud Library (PCL). In: IEEE International Conference on Robotics and Automation (ICRA). Shanghai, China, 2011.

32. Starck, Hilton. Surface capture for performance-based animation. Computer Graphics and Applications, IEEE. 2007; 27(3): 21-31.
