Abstract
Shape reconstruction from images is one of the most widely adopted approaches to compute accurate 3D reconstructions of people or objects in a multi-camera environment. However, such algorithms are traditionally very sensitive to errors in the silhouettes due to imperfect foreground-background estimation or occluding objects appearing between the camera and the object of interest. We propose a novel algorithm that is still able to provide high quality reconstruction from incomplete silhouettes. At the core of the method is the partitioning of the reconstruction space in cells, i.e. regions with uniform camera and silhouette coverage properties. An iterative process is proposed which incrementally adds cells to the temporal reconstruction based on their potential to explain the observed silhouettes from different cameras. Experimental results are close to manually labelled approaches and outperform standard leave-M-out reconstruction techniques in terms of F1-score.
Introduction
Shape-from-silhouettes algorithms, are very sensitive to errors in the provided silhouettes. Three types of errors are common: inaccurate silhouette boundaries, holes in the silhouettes and parts of the silhouette that are missing entirely. Such incomplete silhouettes may be due to errors in the segmentation algorithm, such as foreground/background (FG/BG) segmentation, but their primary cause is occlusion. If a static object is positioned between the camera and the moving object, foreground/background segmentation is unable to segment parts of the silhouette of the moving object. However, occlusion is very common in many real life situations. In indoor as well as outdoor environments, occlusion is often inevitable. Furthermore, occlusion is a major issue in the field of people tracking and 3d reconstruction. It leads to target loss and incorrect shape reconstruction.
An example application of the presented algorithm is monitoring traffic situations using multiple cameras. Occluding objects, such as trees, light poles and parked cars prevent the cameras from completely observing the objects of interest, such as moving cars, cyclists and pedestrians. Occlusion causes inconsistencies when the images of the different cameras are fused. Ignoring these inconsistencies may lead to the disappearance of objects which may cause accidents.
Another application in which occlusion causes major problems is athlete performance analysis. For instance, when the leg of a person is occluded by a table in one of the camera views, the leg will be cut off from the person’s 3d reconstruction when using conventional 3d reconstruction methods, such as shape-from-silhouettes. In case of pose estimation, this is a major setback as this leg will not be found and therefore the leg’s position will remain unknown, or worse, detected in the wrong place.
We propose a novel algorithm that is still able to provide a high quality reconstruction from incomplete silhouettes. Experimental results show that we obtain results close to a reconstruction method where the occlusion is known by manual labelling. This work is a further improvement and adaptation over the work published in [1, 2, 3]. In [1] we tried reconstructing occlusion depth maps over time on a pixel level by checking pairwise reconstructions to find the inconsistencies in the silhouettes. Our publication in [2] introduced our cell-concept, which is also used in [3] and this article. The key difference between these publications, is the strategy to find the subset of cells that are part of the object being reconstructed.
The contributions of this article are twofold: a novel consistency score is presented and the algorithm is extended to also operate in parts of the reconstruction scene which are not necessarily covered by all cameras. These contributions both lead to a more general approach while producing improved results.
Related work
Depth sensors
A first class of methods models depth relations to detect and map occluding objects. The most straightforward way to obtain such depth relations is by using sensors with inherent depth perception, e.g. stereo cameras as in [4, 5, 6, 7], time of flight cameras [8], or micro-lens cameras [9]. However, such specialized sensors are not as cheap and widely adopted as general purpose cameras, which are already ubiquitous in many sites including industrial environments and traffic intersections.
Standard cameras
Standard cameras have also been used to infer depth relations in the scene, as in [10], but the drawback of this method is that it considers part-time occluded regions as full-time black spots in the camera view, from which no information is used even if an object passes in front of the occluder. When rigid objects are modelled, it is possible the infer their interrelationship, despite being partially occluded [11]. However, such methods fail for non-rigid and unknown objects.
Object modelling
A second class of occlusion-proof 3D reconstruction methods models the object of interest to estimate the position of occluded object parts as well as possible [12, 13, 14, 15, 16]. While these methods work well for pose estimation of known objects, they cannot be generalized to more diverse object classes such as road vehicles.
Illustration of shape from inconsistent silhouettes. The blue vehicles function as occluders for the 6 camera views on the left. Each colour in the reconstructed car corresponds to another cell.
Related to the topic of occlusion-affected pose estimation is the field of occlusion-affected tracking. This is usually solved using temporal modelling of trajectories, the inference of occlusion regions, as well as appearance models for known object classes [17, 18, 19, 20, 21]. In [22] multiple frames in time are used to cope with the incompleteness of the observed silhouettes. However, the method only works on rigid objects which is unsuitable for tracking people. These concepts offer few benefits for the more general applications targeted in this paper.
Extended visual hull
Much more relevant are the adaptations to standard shape-from-silhouettes reconstruction proposed by [23], who show that extending silhouettes into regions suspected to be occluded yields more accurate 3D reconstruction than ignoring the occluded areas completely as done in [10]. However, occluded regions are still treated as completely devoid of information even when objects can pass in front of the occluding object. This is remedied in [24], where the reconstruction takes into account that one or more of the cameras may have unreliable silhouette information, without explicitly modelling in which cameras or in which parts of the scene this occurs. In [25] this approach is extended to estimate the number of unreliable cameras locally rather than globally, using an a priori object model to reduce the overestimation of object size.
Octree-based reconstruction
An attempt of reducing the computational cost was made in [26] using Dempster-Shafer theory with an octree-based method that obtained comparable results to voxel-based methods [24, 27].
Clustering
Another way to reduce the computation time is performed by clustering the smallest building blocks of space (voxels). It decreases the number of possibilities significantly [28].
The method proposed in this paper is similar to these works in the sense that it also aims to reconstruct the shape of arbitrary objects of unknown size and complexity in a way that maximally agrees with the incomplete silhouettes. However, instead of minimizing an energy function over a very large set of voxels configurations, we will reason in terms of voxel space regions with homogeneous properties, which is computationally less demanding. Additionally, our iterative growing algorithm does not require the need for sparsity or regularization parameters.
Incomplete silhouettes and space partitioning
A key idea of our approach is to apply geometric reasoning not to individual points, but to point sets called cells. The cells we define are continuous regions in space formed by the backprojected generalized cones of the silhouettes. The intersections of the cone boundaries delimit regions of 3D space with uniform silhouette information, e.g. all points within one such cell fall inside the cones of one particular set of cameras and outside the cones of the silhouettes of the remaining cameras. We use geometric reasoning on cells, rather than on smaller entities, such as 3D points for two reasons. Each object usually consists of a 3D connected set of points (rather than a scattered set of points in the reconstruction space) and, given the observed silhouettes and the camera calibration, the cells are the smallest coherent entities. Figure 1 illustrates a shape reconstruction with the proposed algorithm.
A major advantage of our approach over the work of Landabaso et al. [24] and Haro and Pardàs [25] is that we are able to decide for every cell if it is a valuable addition to the reconstruction, rather than using a single threshold which treats all cells equally.
In the remainder of this section, we review the visual hull and shape-from-silhouettes concepts before we arrive at the partitioning of the reconstruction space into cells.
Visual hull and shape-from-silhouettes
For an object
.
The visual hull
However, Definition 1 is based on a known object
.
The shape-from-silhouettes
If the object
Example of the space partitioning into cells in 2D with 4 viewpoints in case a stationary truck is partially blocking the view of camera 3. The aim is to find those cells which are part of the car. Different cells with membership count 2, 3 and 4 are coloured as these are the cells which have to be evaluated. In some of the cells, we printed the cell’s membership vector 
Let
which indicates whether or not the projection of point
That is,
Now let
The above definition of a cell can easily be adapted for a 2D scene. Figure 2 illustrates the subdivision of 2D space into polygonal cells. The moving object is a car, the view of which is partially blocked by a parked truck. The cells arise from the backprojection of the silhouettes observed by each camera. Hence in 2D all cells are convex polygons as the objects on the line sensor are convex. In 3D, however, cells are no longer convex by definition, but may be very irregular.
Definition 2 is equivalent to the union of cells for which all elements in the membership vector are 1. Clearly, this will be a poor approximation of the visual hull when one or more of the silhouettes are incomplete. This is illustrated in Fig. 2, where the cell D is not part of the shape-from-silhouettes. By adding cell
The approximated shape that we want to find is the minimal union of all the cells that contain at least one point of the visual hull. We can write this as:
Note however that Eq. (3) does not provide a method for finding the cells
A first difficulty is that it is not feasible to test all possible cell configurations, but since the number of these configurations scales exponentially in the number of cells (which easily exceeds 60 in a real-world experiment with 8 cameras), a strategy is needed to quickly navigate the search space.
A second challenge is that parts of a silhouette can be explained by multiple cells. Reconsidering Fig. 2, the missing purple part in the silhouette of camera
The above remarks suggest that we need an iterative strategy to add cells, and that this strategy must be based on some sort of consistency score for each cell. The consistency score needs to measure the improvement of consistency across the silhouettes. The consistency is improved by explaining parts of the silhouettes that have not been explained. On the other hand consistency may also deteriorate by adding unnecessary parts to some silhouettes.
The iteration starts from a reconstruction that equals the shape-from-silhouettes
where
We will explain the choice for our consistency score by first looking at the end goal. A method such as the one presented in this paper needs an objective evaluation to compare different solutions. While we want to approximate the 3D shape of the object as closely as possible as in Eq. (3), the only information available to us are 2D silhouettes. Therefore, a reconstruction method is usually evaluated by simulating incomplete silhouettes [23, 27, 26], so that the ground truth is known and the F-score can be used in the evaluation:
Coverage (a) and resemblance (b) based on the projection of a shape 
Typical iterative coverage and resemblance graph of an occluded camera view and averaged.
Ideally, the reconstruction method adds cells one by one such that the F-score in Eq. (5) increases by an amount
As our consistency score needs to reflect whether a cell explains part of a silhouette that was not yet explained by earlier additions, the silhouette we compare to should be a combination of the initial silhouette and that of the reconstructed volume at iteration iteration step
where
A small drawback of using extended silhouettes is that the consistency scores for the remaining cells in the search space need to be recalculated each time a cell is accepted. In fact, as the reconstruction
We will compare how well a cell fits the extended silhouettes using scores inspired by recall and precision as in Eq. (5), as these describe the properties of overlap and excess well. Since the comparison is not against the ground truth however, we will not use the terms recall and precision so as not to confuse the reader. Instead we will define coverage and resemblance as the equivalent comparison metrics against extended silhouettes.
From Eq. (6) we know that the extended silhouettes depend on the previous iteration step
Figure 4 shows an example of the iterative progress of coverage and resemblance for an occluded view and averaged over all camera views. Even when the resemblance drops in iteration 4 for the occluded camera, we see that the average resemblance still increases.
Based on Eq. (5), we estimate an F-score per camera, where precision is estimated by resemblance and recall by coverage. Equations (7) and (8) are used to determine the value
where we omit the parameters
For a scenario where
Furthermore, we assume that the F-score of the 3D reconstruction as in Eq. (5) is related to
This consistency score is calculated at each iteration step for each cell
The cell
The stop criterion in this optimization process is straightforward. Once no cell in the search space can improve the consistency score, the algorithm terminates. Since the number of cells is finite, the algorithm will always stop.
Depending on the application, coverage or resemblance becomes more important. On the one hand, the reconstruction of a person in sports analysis should reconstruct the complete person in order to perform further analysis such as skeleton fitting. Therefore the coverage should be as high as possible. On the other hand, in very cluttered real-world scenes with a lot of spurious foreground detections, resemblance should be as high as possible, because otherwise very large cells may be included which are barely part of the object that is being reconstructed. Typical values for
The occlusion handling algorithm
Algorithm 1 shows the different steps of the proposed method as explained in the previous sections. In a practical implementation, we often reconstruct an object as a set of voxels, and we define a cell as a set of voxels. However, the algorithm can also be implemented without discretization to a voxel space.
Limited field of view (FOV)
Real cameras have limited fields of view. Therefore parts of the scene may lay outside of the FOV of a camera. Equations (7) and (8) only consider the part of the scene which project inside the FOV of each camera. In Section 6 a number of experiments are conducted where the object of interest is not inside the FOV of all cameras to illustrate this feature. Also the initial reconstruction is different from the strict definition of
Example of the algorithm with a stationary truck as occluders for camera 3, captured by four cameras and
2. From iteration 1 we list all candidate cells, which form the search space. The last column indicates the cell’s decision: keep (K), add (A) and reject (R). A cell is rejected if the consistency score is lower than the previous accepted cell in the updated reconstruction. The algorithm adds the cell with highest consistency score at each iteration as long as the consistency score increases. Iteration 2a shows the final iteration of the algorithm when extended silhouettes are used. Iteration 2b is the second iteration in case the observed silhouettes are used. Note that the stop criterion is hard to find in the latter case
Example of the algorithm with a stationary truck as occluders for camera 3, captured by four cameras and
In Eq. (10), we assumed that all cameras are equally important. In practical applications, camera calibration and silhouette extraction using foreground-background segmentation are both subject to inaccuracies and noise. Cameras closer to the object of interest can usually observe the object better. Therefore, as an extension, we propose the use of weights
Therefore, we are able to rewrite Eq. (10) as:
Figure 5 shows an example where four cameras are observing a red car. A parked truck is occluding the car partially in camera view C3. We illustrate how the proposed consistency score
Description of the comparison methods
Description of the comparison methods
Example with one stationary truck, acting as a static occluder for cameras 3.
Possible candidates to be added to the reconstruction are cells C, G, H, I and J. Notice that in the table we have iteration steps 2a and 2b. 2a uses the extended silhouettes, whereas 2b uses the observed silhouettes (just for illustration). In 2a we clearly see that after updating the extended silhouettes, we cannot find any cell that corresponds to a higher consistency score than
The simulation consists of a car moving from left to right, captured by 7 cameras. The blue truck on the opposite side of the road is static and occludes the car in the 3 upper most cameras on the right, depending on the location of the car. Different colours in the proposed algorithm indicate different cells.
We present three experiments that were conducted to show the potential of our method. We compare against four methods from literature (Table 2). We simulated Guan’06 by using perfect occlusion masks because we lack the source code of the actual method. The real performance of this method will most likely be worse than the reported results. Also Guan’06, learns a static occlusion mask over time and therefore fails in the case that a moving object can both appear in front as well as behind a static object because depth is not taken into account. For Landabaso’08, the parameter
Experiment 1: smart traffic simulations
The first experiment concerns smart traffic. The idea is to make full reconstructions of vehicles in traffic situations. The silhouettes of the moving car are obtained by projecting the 3D model of the car on the different virtual camera sensors. The same accounts for the other vehicles. By subtracting the car of interest from the other shapes in a pixel-wise fashion, we obtain the occluded silhouette of the moving car. Camera positions are chosen to mount on street and traffic lights.
Figure 6 shows the first situation. A stationary truck is blocking the view of multiple cameras. The red car at the bottom drives in a straight line from left to right. Laurentini’94 is unable to reconstruct the complete car (in the example only 30% of the car is reconstructed). When the car is completely next to the truck, not a single voxel is reconstructed. Landabaso’08 performs better because more voxels are found, but also include a lot of voxels that are not part of the car.
Table 3 shows that Guan’06 achieves the highest F1-score followed by our proposed algorithm. The precision of Laurentini’94 is almost 100%, but its recall is considerably less. Whereas with Landabaso’08, the recall is above 90%, but the precision is lower. In Fig. 6, we see that the proposed reconstruction is much more accurate than the reconstructions produced by the Landabaso’08 and Laurentini’94 and comparable to Guan’06. The ground truth in this context is the visual hull of the model in Fig. 6a when the truck is not present. It is important to note, however, that the application of Guan’06 is much more restricted than our method, as the method first requires knowledge about the occluders in the scene.
Typical traffic situation where both the blue truck and the blue car want to turn to their left. The car is unable to see the oncoming traffic due to the truck. One extra camera is mounted on the traffic lights (yellow pole).
Visualization of the occluders. Each camera has one possible occluder (green) which can be either turned on or off. Two examples are shown The eight cameras are places facing the same area. All cameras are placed at approximately 2.2 m from the ground plane.
Results for a car going straight, occluded by a parked truck. The proposed method performs almost as well as the supervised method of Guan’06 in terms of precision, recall and F1-score. Other methods are less accurate. We show the results of Landabasso with both
The second situation is a common one in traffic (Fig. 7). The blue car wants to turn to its left at the intersection. However, the blue truck also wants to turn to its left from the opposite direction. Since the truck is blocking the view of the oncoming traffic, the blue car has to wait or take a risk with possible catastrophic consequences. The proposed algorithm could be used to detect oncoming traffic and indicate if crossing is safe or not, even if the truck in the middle of the road is blocking some camera views to observe the oncoming traffic. One extra camera on top op the traffic lights increases camera coverage. Table 4 shows again that our methods performs almost as well as Guan’06.
Given the difficulty of automatically or manually labelling occluders in dynamic traffic scenes, the good performance of the proposed method is an important result. Even in case of severe occlusion, the algorithm manages to obtain a reliable reconstruction, as long as the object of interest can be partially observed from multiple viewpoints.
Results for an oncoming car with a truck in the middle of the road in front of a car, trying to cross the street in terms of precision, recall and F1-score
The second experiment uses the JP sequences (breakdancer) from the CVSSP-3D dataset [32], which can be requested to the authors. Each of these sequences consists of a synchronised stream of images from 8 cameras, which are placed around the subject at 2.2 m high about every 45 degrees (see Fig. 8). A total of six sequences is available. Each sequence is between 10 and 20 seconds long. In this experiment we will simulate the presence of occluding objects and compare with the ground truth, which is the output of the classical shape-from-silhouettes without occlusion (see Fig. 9). Each frameset has equal weights on the evaluation. Displayed values are therefore averages over each sequence. Reconstructions are performed at cubic voxel size of 20 mm.
Some examples from the reconstruction of the CVSSP-3D dataset [32] at 40 mm voxel size. In the experiments we use 20 mm.
In this experiment we simulate the presence of occluding objects by recalculating the input images as if there would be an occluder present. We defined an occluding object for each camera. This object is modelled as a vertical column partially blocking the view of the camera (similar to a spectator in front of the camera. For each number between 1 and 8 occluders we choose eight random combinations (the same for each reconstruction method). For example, 2 occluders could block the view of camera 3 and 5. Each occluder roughly occludes 17% of the total image. We averaged the results over these combinations for each number of occluders. Figure 8 visualizes the camera setup together with the possible occluders.
In general, a lower value of
Sensitivity analysis of 
Table 5 shows the results for all methods, averaged over all sequences. Only the F1-score is shown, but for Guan’06, Landabaso’08 and the proposed method the recall is close or equal to 100%. Only Laurentini’94 does not obtain a high recall value because occluded parts are not reconstructed. On the other hand the precision of Laurentini’94 is 100%. The last column represents the results of the proposed method.
Results in case of multiple occluders. First column indicates the number of occluders. The occluder is placed in front of a certain camera. Average over all possible combinations are shown
In Fig. 10 we provide a sensitivity analysis which shows that the optimal
Visualization of the output from the proposed method, Guan’06, Laurentini’94 and Landabaso’08 in a real world example in case of a single person and multiple large occluders in the scene.
For the third experiment we use the setup shown at the left in Fig. 11. It is a staged setup of an office environment. This setup has 7 cameras, mounted around the scene at about 3.5 meters high. A table, 2 chairs, a L profile panel and a display introduce natural occlusion. The goal is to track the position of a person as well as possible. The plot in Fig. 11 shows that the proposed method is able to track the person very well in the room. Note that the measured trajectories are not smoothed in any way. Although the recordings were made in a lab, the conditions where not optimal, especially for the foreground/background segmentation step because the colours of the clothes of the tracked person were similar to colours in the background, not unlike many real world environments. A classical tracking algorithm suffers from tracking loss in case the person is completely occluded for at least one of the cameras. Partial occlusion introduces inaccurate positions because the position is then calculated on a limited number of voxels.
Both the proposed method and Landabaso’08 produce a person location in every frame from the moment the person enters the region of interest until he leaves again (total track: 505 frames). On the other hand, Laurentini’94 only outputs positions on 175 frames, which represents only 34.65% of the interesting frames. This fact and the smoother tracking with our method show the need for occlusion handling. We also compared against Guan’06 because that was the nearest competitor in the other experiments. We clearly see that the track of this method is much less smooth than the one produced by our method. There are two reasons for that: Guan’06 cannot recover from bad foreground/background segmentation because this is different from static occlusion (resulting in less voxels in de 3D reconstruction) and their method indicates per pixel whether or not occlusion can happen thereby rejecting depth information. Although their algorithm outputs a position in 503/505 frames, the position is less accurate than with our proposed method. In Fig. 11 we also show the number of voxels detected at each frame. Landabaso’08 produces comparable tracking results in this experiment, but the number of voxels varies significantly, which means the reconstructed person is less accurate. Fortunately, many of the unnecessary voxels do not detoriate the person’s position too much, despite being no part of it. Position averaging softens the incorrectness, but for a full 3D reconstruction, this is unwanted, in particular when an accurate proximity detection is needed (e.g. cooperation between man and machine).
Conclusions
In this paper we presented an algorithm for shape-from-silhouettes which can cope with incomplete silhouettes. We showed that our algorithm is able to perform well under different complexity levels of occlusion and without prior knowledge of the occluders. The algorithm automatically detects the occluded parts in the camera views and uses this information to only use relevant silhouette information for the reconstruction of the object of interest.
The algorithm succeeds in reconstructing the entire object of interest and the reconstruction closely resembles the visual hull.
As illustrated in the results section, our algorithm improves state-of-the art for reconstruction as well as the tracking of a moving object in a scene with occlusion in a completely automatic fashion.
