Abstract
Depth data from conventional cameras in monitoring fields provides a thorough assessment of human behavior. In this context, the depth of each viewpoint must be calculated using binocular stereo, which requires two cameras to retrieve 3D data. In networked surveillance environments, this consumes excess energy and demands extra infrastructure. We introduce a new computational photography technique for depth estimation using a single camera, based on the ideas of perspective projection and the lens magnification property. The person-to-camera distance (or depth) is obtained from the focal length, field of view and magnification characteristics. Before the distance is found, the real height is estimated using human body anthropometrics. These metrics are given as inputs to a Gradient Boosting machine learning algorithm for estimating the real height. The magnification and Field of View measurements are then extracted for each sample. The depth (or distance) is predicted on the basis of the geometrical relationship between field of view, magnification and camera-to-object distance. Using physical distance and height measurements taken in real time as ground truth, experimental validation is performed. It is inferred that within the 3 m–7 m range, in both indoor and outdoor environments, the camera-to-person distance (Pred_dist) predicted from field of view and magnification is 91% correlated with the actual depth at a confidence level of 95%, with an RMSE of 0.579.
Introduction
Microsoft’s latest version of Kinect [2] circumvents the above problem by utilizing the time-of-flight principle. Unlike the previous version, it uses only a single viewpoint to compute depth within a complete scene in a single shot. However, when operating in dynamic scenes, especially where there is motion, a time-of-flight camera also requires multiple shots. In such circumstances, the camera may shake and induce blur in the depth map, so the result is corrupted by motion artifacts.
Compared with 2D images, 4D structures encoding angular information provide better solutions for vision and scene understanding, especially when dealing with video stabilization, object detection, tracking and recognition problems. Nevertheless, light field imaging [3] also suffers from limited spatial resolution, poor reconstructed depth quality and large storage requirements.
Considering all these implications, and favoring practicality and actionability, we have built on the idea of mapping a real-world 3D object onto the 2D image plane of a sensor, i.e. perspective projection, and relating three fundamental parameters of the photograph: Field of View, objective magnification and camera-to-object distance. The work is accomplished in two phases. In the first phase, human anthropometrics are fed as inputs to the machine learning algorithm in order to predict the real height of a person. In the second phase, knowing the real height and image height of a person, the Magnification (M), Field of View (FoV) and camera-to-object distance (or depth) measurements are obtained from the derived relation. This model requires neither additional hardware nor special software to interact with the existing surveillance infrastructure. Hence, it is affordable for use in both indoor and outdoor environments.
Research contributions
A relation is established between Field of View, lens magnification and camera-to-object distance using the definitions of Field of View and magnification. An image dataset is created for anthropometric-based real height estimation. Anthropometric-based feature extraction is introduced for real height estimation. A standard error is estimated in order to deal with perspective errors that arise from lens properties. An error model is introduced to characterize the influence of the Field of View and magnification parameters on camera-to-person distance (or depth) for a variable focal length and fixed sensor size.
The state of the art is presented in Section 2 and the proposed work in Section 3. Results and interpretation, including privileged real height and distance prediction, estimation of measurement errors and method validation, are discussed in Section 4. The paper is concluded in Section 5.
State of art
It is well known that traditional 2D approaches are no longer suitable and cannot achieve the level of accuracy required in 3D imaging. In today’s consumer landscape, however, three technologies [4], namely stereo triangulation, time of flight and structured light, are available on the market for 3D imaging.
The stereo triangulation technique in [5] computes the disparity map between the corresponding features of the test and reference images seen by the left and right eyes respectively. In the computer vision literature, disparity is treated as inversely related to depth. In particular, D. Scharstein et al. [6–8] have formulated a sequence of steps, viz. matching cost computation, cost aggregation and disparity map computation. The matching cost [9, 10] is generally expressed as the Sum of Squared Differences (SSD), Absolute Intensity Differences (AID) or correlation. Overall, the pixel-to-pixel correspondence is determined either with maximum correlation or with minimum SSD or AID, and a disparity map is constructed. However, constructing the disparity map requires knowledge of the camera configuration under the epipolar geometry constraint. Also, occlusions in the monitoring environment create mismatches in pixel-to-pixel correspondence, which in turn lead to ambiguity in estimating the disparity map.
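As a minimal illustration of the matching-cost idea above, the following sketch estimates disparity along a single rectified scanline by minimizing the SSD over a small window. It is not the method of any cited work: the function names, the window size, the search range and the toy intensity rows are all assumptions made for illustration, and cost aggregation and occlusion handling are omitted.

```python
def ssd(a, b):
    """Sum of squared differences between two equally sized patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scanline_disparity(left, right, window=3, max_disp=4):
    """Per-pixel disparity for one rectified scanline (hypothetical helper).

    A pixel at column x in the left row is searched for at column x - d
    in the right row, for d = 0..max_disp; the d with minimum SSD wins.
    """
    half = window // 2
    disparities = []
    for x in range(half, len(left) - half):
        patch = left[x - half:x + half + 1]
        best_d, best_cost = 0, float("inf")
        for d in range(max_disp + 1):
            xr = x - d
            if xr - half < 0:  # candidate window would leave the image
                break
            cost = ssd(patch, right[xr - half:xr + half + 1])
            if cost < best_cost:
                best_d, best_cost = d, cost
        disparities.append(best_d)
    return disparities


# Toy rows (synthetic): the textured pattern 9, 5, 7 is shifted by 2 columns.
left_row = [0, 0, 0, 0, 9, 5, 7, 0, 0, 0]
right_row = [0, 0, 9, 5, 7, 0, 0, 0, 0, 0]
disp = scanline_disparity(left_row, right_row)
```

In the flat regions of the toy rows every candidate has the same cost, which is a small-scale view of the ambiguity in low-textured and occluded areas discussed in this section.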
Jian Sun et al. [11] have addressed these problems by deriving a better solution through a Bayesian approach, handling the uncertainties by defining prior constraints such as spatial smoothness and occlusions. However, these methods show poor results when subjected to changes in illumination and contrast. Fradi et al. [12] have argued that introducing bidirectional matching in occluded and low-textured areas resolves the image ambiguity problem.
In real-time scenarios, correlation methods are employed to retain the smoothness of the disparity map. High variation and distortion of the information inside a window area lead to lower matching accuracy in correlation-based models. As a design alternative, symbolic features derived from intensity images, rather than the image intensities themselves, are considered. On-uma Pramote et al. [13] report an improvement of about 22–25% in accuracy and speed after testing their algorithm on the Middlebury stereo image data set. However, feature-based strategies fail to select an appropriate interpolation method for non-featured areas, where reconstruction of the 3D surface is a prerequisite. For better refinement of the disparity map, Nadia Baha et al. [14] have trained a neural correlation network with training data comprising a hundred pairs of matched and unmatched pixels, which is a burden for resource-constrained surveillance cameras. Even though stereo vision systems work well in many environments, they are fundamentally limited by the baseline distance.
Yujiao Chen [31] has used the SSD correlation disparity method for depth extraction. In this process, SIFT feature points corresponding to the object region are first extracted in both images. For each feature point in one image, the corresponding matching feature point on the object is obtained using the sum of squared differences disparity estimation technique. The depth is computed from the disparity map, the micro-lens pitch and the distance between the image plane and the micro-lens array.
As highlighted in the literature reviewed above, matching points always induces the pixel-to-pixel correspondence problem, especially in occluded environments. Whenever there is a discontinuity at the boundaries, SSD correlation fails and, as a result, the information flow in surveillance gets disrupted.
Kulkarni et al. [15] have highlighted these pitfalls and proposed a triangulation technique in their three-tier camera sensor network design. This approach requires at least two visual sensor nodes for depth information retrieval, which adds a burden to the infrastructure.
A single-viewpoint ToF depth sensor [2, 16] offers a significant benefit in providing accurate depth measurements without being affected by ambient lighting, shadows and occlusions. Here the design itself provides the illumination, and the phase measurement rather than the intensity is taken as the criterion for depth measurement. At present, CW-ToF (Continuous Wave ToF) sensors dominate the consumer electronics and low-end robotics space, with certain limitations. As stated in [17], the design suffers from three fundamental issues. The first is range, which is limited by power consumption and eye safety considerations. The second is accuracy, which is adversely affected by illumination effects. The third is interference, which comes into the picture when many such sensors operate together in indoor and outdoor environments. Further, in order to evaluate depth range accuracy, plenoptic camera designs [18–20] came into existence. Such a camera uses not only the intensity of light in the scene but also the directional information of the light distribution in the scene to retrieve the depth of a surface. However, with poor reconstructed depth quality and large storage requirements, human action recognition in surveillance systems is impractical, and hence the design is not recommended.
Owing to the economic and technical feasibility issues addressed in the literature, we have proposed a theory relating three parameters, namely Field of View, object distance and sensor size. This theory can be applied in all existing surveillance camera systems without any additional hardware or software. The entire theory can be embedded into a single RGB camera providing depth (or distance) from a single viewpoint.
Proposed work
Prerequisites
Focal Length:
Consider an object ‘O’ placed at a certain distance, say ‘Obj_dist’, which is photographed by a rectilinear convex lens with a focal length ‘f’ to form an image at a distance, say ‘Img_dist’, from the lens as shown in Fig. 1.

Mathematical Model of Lens.
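From the thin lens equation [21] it is known that
$$\frac{1}{f} = \frac{1}{Obj_{dist}} + \frac{1}{Img_{dist}} \qquad (1)$$
In many imaging applications, the object distance from the lens is considerably greater than the image distance. Hence, the $1/Obj_{dist}$ term is negligible and the above equation can be approximated as
$$\frac{1}{f} \approx \frac{1}{Img_{dist}}, \quad \text{i.e. } Img_{dist} \approx f \qquad (2)$$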
The angle of view always remains constant for a given sensor size ‘dim’ and lens of fixed focal length ‘f’ but the coverage distance of lens (termed as

Relationship between Field of View, Sensor size and Object distance.
For conventional cameras, the sensor is the film plane, i.e. the place where the image is formed. In digital cameras, it is an imaging sensor such as a CCD or CMOS array. The sensor size refers to its physical dimensions (i.e. width and height). In our experimental setup, we used a Nikon D5300 camera with a 23.5×15.6 mm DX-format (APS-C) CMOS sensor. Here 23.5 mm and 15.6 mm are referred to as the physical dimensions (i.e. width (m) and height (n)) of the sensor.
As our primary objective is to estimate the distance of the person from the lens center (i.e. the camera-to-object distance), it can be derived from the relation involving focal length, Field of View and sensor size as shown in Fig. 2.
In general, an increase in the size of the object image, ‘Img_ht’, relative to the true size, ‘Real_ht’, of an object is seen when the lens projects the person onto the 2D image plane. This is termed the objective magnification ‘m’.
In conventional camera systems, lenses are structured [22] in such a way that a relatively consistent person image size is always seen on the camera sensor (or image plane). Hence, higher magnifications are seen at smaller Field of View (FoV) values. This inverse relationship means that the Field of View (FoV) can be obtained from the magnification measurement for a given sensor size (dim).
By substituting Equation (5) in Equation (3), we obtain
By substituting Equation (4) in Equation (6), we obtain
The study aims at measuring the depth of a person using a single camera by understanding the Field of View and magnification parameters for a given sensor size and a lens of variable focal length. To this end, a Nikon DSLR camera with an 18–55 mm zoom lens (a conventional lens with a variable angle of view) is used to focus on the object at different working distances in order to obtain the corresponding Fields of View (FoVs).
The F-stop is used to regulate the amount of light entering the camera. At the same time, the shutter must be kept open for a fixed duration until the sensor receives the required amount of light.
In addition, the ISO setting is adjusted to obtain a better quality picture in different light conditions. The F-stop, shutter speed and ISO are tuned together so that the sensor receives the correct exposure. The device and its exposure settings are given in Tables 1 and 2. The experiments are carried out both indoors and outdoors on subjects (persons) in a standing posture with frontal exposure.
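One consistent reading of the relations used in this subsection is sketched below. It is a reconstruction under stated assumptions, not a verbatim copy of the numbered equations: the Field of View is taken to be the linear scene coverage at the object plane, the far-object approximation $Img_{dist} \approx f$ is used, and `dim`, `Img_ht` and `Real_ht` are measured along the same sensor axis.

```latex
\begin{align*}
\mathrm{FoV} &= \frac{dim \cdot Obj_{dist}}{f}
  && \text{(similar triangles: sensor size vs. scene coverage)} \\
m &= \frac{Img_{ht}}{Real_{ht}}
  && \text{(objective magnification)} \\
\mathrm{FoV} &= \frac{dim}{m}
  && \text{(inverse relation for a fixed sensor size)} \\
Obj_{dist} &= \frac{f \cdot \mathrm{FoV}}{dim} = \frac{f}{m}
            = \frac{f \cdot Real_{ht}}{Img_{ht}}
  && \text{(camera-to-object distance)}
\end{align*}
```

Under these assumptions the last line agrees with the thin-lens magnification $m = Img_{dist}/Obj_{dist} \approx f/Obj_{dist}$, which is why the real height has to be known before the distance can be recovered.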
Experimental Device Details for Photogrammetry Experiment
Parameter influencing the Camera Exposure
As shown in Fig. 3, starting at a distance of 381 cm along the camera axial line, the person keeps moving away from the camera until 681 cm is reached. In the indoor environment, a total of 506 photographs of standing postures are shot at 30 cm intervals while varying the focal length. Subsequently, for our outdoor visual data collection, a total of 450 photographs of 50 persons are shot within a 10 m range using a 50 mm Nikkor prime lens. All these images are preprocessed and used for real height and distance prediction.

Standing Posture of Person along Camera Axial line.
As mentioned in [23] we invited 33 distinct subjects (22 males and 11 females) for our data collection and captured 506 image samples with a fixed camera view point as shown in Fig. 3 at different focal lengths using Nikon D5300 DSLR with Nikon AF-S Nikkor 18–55 mm under the supervision of
Sample photographs of subject(s) taken at different working distances considering the age, gender and height variations
Similarly, we invited 50 distinct subjects (persons) for our outdoor visual data collection and captured 450 images (9 classes with 50 instances each) using a Nikon DSLR camera with an 18–55 mm zoom lens (a conventional lens with a variable angle of view). Illumination effects are also considered, with images captured in daylight, at midday and in moonlight. A few samples of subjects taken under different lighting conditions throughout the day are depicted in Table 4.
Sample photographs of subject(s) taken at different working distances in outdoor environment under different lighting conditions
To remove blur [24, 25] caused by linear motion or unfocused optics, a filtering operation is applied. Wiener filters are known to be suitable for reconstructing the original from a noisy image, so a Wiener filter is chosen for this step. Finally, dilation and erosion operations are applied to remove image imperfections.
As mentioned in our previous work [23], we initially considered nine anthropometrics for real height estimation: (1) stature (hairline to feet), (2) acromial length (from base of feet to acromion), (3) neck height (from base of feet to the trapezius), (4) head length (from stomion to the top of the hairline), (5) center of stomion to the top of the hairline, (6) forehead to chin distance (lowest point of hairline to chin), (7) sellion to chin distance, (8) biocular distance (distance between the outer corners of the eyes) and (9) bitragion distance (distance between the left and right tragion). The anthropometric statistics obtained for 22 males and 11 females are analyzed in terms of mean and standard deviation and compared with the seminal anthropometric statistical surveys in [26, 27]. The details are given in Table 5.
Anthropometric statistics (in cms) for Males and Females
After evaluating the performance on samples from the indoor environment, the experiment is extended to the outdoor environment. Across the indoor and outdoor environments, around 956 photographs of different persons in an upright standing pose, without slouching or leaning, are considered for real height prediction. Among the nine anthropometrics, only six, namely Body_height (head to foot distance), Face_height (face to chin distance), Neck_height, Mouth_to_Forehead_Distance, Eye_to_Chin_Distance and binocular distance, are used for real height estimation.
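These six measurements are the inputs to the gradient boosting regressor used for real height estimation. The following toy sketch shows the boosting principle only (squared loss, one-split regression stumps, fixed learning rate). It is not the regressor used in the experiments: the function names, hyperparameters and the two-feature synthetic rows are assumptions made for illustration.

```python
from statistics import mean


def fit_stump(X, residuals):
    """Best single-feature threshold split (least squares) on the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = mean(left), mean(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:  # every row identical: fall back to a constant leaf
        m = mean(residuals)
        return (0, float("inf"), m, m)
    return best[1:]


def stump_predict(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value


def fit_gbr(X, y, n_rounds=100, lr=0.1):
    """Each round fits a stump to the current residuals (negative gradient)."""
    base = mean(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + lr * stump_predict(stump, row) for p, row in zip(preds, X)]
    return base, lr, stumps


def predict_gbr(model, row):
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)


# Synthetic rows: [Body_height_px, Face_height_px] -> height in cm (made up).
X_toy = [[100.0, 20.0], [110.0, 21.0], [120.0, 23.0], [130.0, 24.0]]
y_toy = [150.0, 160.0, 171.0, 182.0]
toy_model = fit_gbr(X_toy, y_toy)
```

A production setup would more likely rely on an established library implementation with deeper trees and held-out validation; the sketch only makes the residual-fitting loop explicit.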
Feature Extraction for Real height prediction
Given a photograph as input, Haar features are first used to extract the frontal face, full body, mouth and eye regions.
As shown in Algorithm 1, all possible multi-scale instances are stored as 4-tuples in ‘rects’. Each tuple is (x, y, w, h), where (x, y) is the top-left point, w is the width and h is the height. Three parameters are defined: the cascade of the item (body/face/mouth/eye) to be detected, the test image in which the item is to be detected, and a temp variable that decides the mode of operation. For temp = 1, the best possible body/face coordinates are selected. For temp = 2, the best possible mouth coordinates are extracted, and for temp = 3, the best possible eye coordinates.
If the test image contains a body/face, the height of the corresponding bounding box is itself taken as the maximum length
If the test image contains a face, all the possible instances are stored as 4-tuples (x, y, w, h) in the ‘rects’ variable. The best fitting face is cropped and stored as illustrated in Algorithm 3, and then the best possible mouth (max y + h) is extracted. The max y + h represent
In order to calculate the
Following the procedure illustrated in Algorithms 2–4, all six anthropometrics, namely Body_height, Face_height, Neck_height, Mouth_to_Forehead_Distance, Eye_to_Chin_Distance and Binocular_distance, are extracted for about 956 images. A few of them are listed in Tables 6 and 7.
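The mode-based selection over ‘rects’ described above can be sketched as follows. The helper name `select_best` is hypothetical. The criterion for temp = 2 (largest y + h) comes from the text; taking the tallest box for temp = 1 and the two topmost boxes for temp = 3 are assumptions, since the corresponding algorithm listings are not reproduced here.

```python
def select_best(rects, temp):
    """Pick the best detection from a list of (x, y, w, h) tuples.

    temp = 1: body/face, tallest bounding box (assumed criterion).
    temp = 2: mouth, box with the largest y + h, i.e. lowest bottom edge.
    temp = 3: eyes, the two topmost boxes ordered left to right (assumed).
    """
    if not rects:
        return None
    if temp == 1:
        return max(rects, key=lambda r: r[3])
    if temp == 2:
        return max(rects, key=lambda r: r[1] + r[3])
    if temp == 3:
        top_two = sorted(rects, key=lambda r: r[1])[:2]
        return sorted(top_two, key=lambda r: r[0])
    raise ValueError("temp must be 1, 2 or 3")


# Made-up detections for illustration only.
body_rects = [(10, 20, 50, 120), (12, 18, 60, 150)]
mouth_rects = [(30, 40, 20, 10), (28, 70, 24, 12)]
eye_rects = [(40, 31, 8, 5), (10, 30, 8, 5), (25, 60, 10, 6)]

best_body = select_best(body_rects, 1)
best_mouth = select_best(mouth_rects, 2)
best_eyes = select_best(eye_rects, 3)
```

The height `h` of the selected body or face box then serves directly as the pixel measurement, in line with the statement that the bounding box height is taken as the maximum length.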
Extraction of anthropometrics from samples taken in indoor environment
D1: Body_height D2: Face_height D3: Neck_height D4: Mouth_to_Forehead_Distance D5: Eye_to_Chin_Distance D6: Binocular_distance.
Extraction of anthropometrics from samples taken in outdoor environment
D1: Body_height D2: Face_height D3: Neck_height D4: Mouth_to_Forehead_Distance D5: Eye_to_ Chin_Distance D6: Binocular_distance.
Privileged real height and distance prediction
A Gradient Boosting Regressor (GBR) is trained with the six anthropometrics D_i (i = 0 to 5) along with Act_ht, as mentioned in Section 3.5, in order to obtain Pred_ht. The real height (Pred_ht) prediction rate is evaluated using RMSE and Pearson’s correlation coefficient (r) for various proportions of test and train samples. Once the real height is predicted, Pred_Obj_dist is obtained for the corresponding test and train samples from Equation (9). Algorithm 5 is as follows.
The magnification and the Field of View measurements are calculated from Equations (5) and (6). The camera to object distance is obtained from Equation (8). The results are tabulated as shown in Tables 8–11.
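A worked numerical sketch of this chain is given below. It assumes the relations m = Img_ht / Real_ht, FoV = dim / m and Obj_dist = f · FoV / dim, with FoV as the linear coverage at the object plane; the exact form of the numbered equations may differ. The function names, the pixel-to-millimetre conversion and the input values (an 18 mm focal length, a 4000-pixel image height on the 15.6 mm sensor side, and a 1700 mm person spanning 1000 pixels) are illustrative assumptions, not measured data.

```python
def image_height_mm(img_ht_px, image_px, dim_mm):
    """Convert a pixel height to millimetres on the sensor (assumed mapping)."""
    return img_ht_px * dim_mm / image_px


def magnification(img_ht_mm, real_ht_mm):
    """Objective magnification m = Img_ht / Real_ht."""
    return img_ht_mm / real_ht_mm


def field_of_view(dim_mm, m):
    """Linear scene coverage at the object plane, FoV = dim / m."""
    return dim_mm / m


def object_distance(fov_mm, f_mm, dim_mm):
    """Camera-to-object distance, Obj_dist = f * FoV / dim."""
    return fov_mm * f_mm / dim_mm


f_mm = 18.0          # focal length (illustrative)
dim_mm = 15.6        # sensor height used as 'dim'
real_ht_mm = 1700.0  # predicted real height of the person (illustrative)

img_mm = image_height_mm(1000, 4000, dim_mm)  # about 3.9 mm on the sensor
m = magnification(img_mm, real_ht_mm)
fov_mm = field_of_view(dim_mm, m)             # about 6800 mm
dist_mm = object_distance(fov_mm, f_mm, dim_mm)  # about 7846 mm
```

Because FoV = dim / m, the distance reduces to f / m, so an error in the predicted real height carries over proportionally into the predicted distance.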
Predicted Height and Distance Measurements of Persons in Indoor Environment with respect to the obtained Magnification and Field of View Measurements
Predicted Height and Distance Measurements of Persons in Outdoor Environment with respect to the obtained Magnification and Field of View Measurements
In general, perspective errors [29] appear as compression or expansion around the center of an image captured with conventional lenses. As shown in Fig. 4, the lens is subject to perspective errors due to variation in the diverging view angle ‘α’ of the camera and the lateral displacement of the object from the point where it is placed.

Perspective Errors caused by out of plane displacement.
As a result, significant changes in the lens magnification properties are seen. This in turn affects the accuracy of the predicted object distance, as shown in Tables 8–11. To quantify how far the estimated distance deviates from the actual distance, it is essential to include the error in the observed measurement.
From Fig. 4 the Perspective Error (Δx) is [30] computed as follows:
The estimated camera-to-object distance Pred_Obj_dist (cm) is corrected with a factor ±Δx, where $\Delta x = \sigma / \sqrt{n}$ and n is the number of samples (n = 956).
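The correction factor can be computed as sketched below. The helper names are hypothetical, σ is taken here as the sample standard deviation of the measurements (an assumption, since the choice between sample and population deviation is not stated), and the five input values are placeholders rather than experimental data.

```python
from math import sqrt
from statistics import stdev


def standard_error(samples):
    """Delta x = sigma / sqrt(n), with sigma as the sample standard deviation."""
    return stdev(samples) / sqrt(len(samples))


def corrected_interval(pred_dist_cm, delta_x):
    """Lower and upper bounds of Pred_Obj_dist after applying +/- delta x."""
    return (pred_dist_cm - delta_x, pred_dist_cm + delta_x)


placeholder = [1.0, 2.0, 3.0, 4.0, 5.0]  # not measured data
dx = standard_error(placeholder)
low, high = corrected_interval(381.0, dx)
```

With n = 956 samples the divisor is about 30.9, so the correction is small relative to the spread σ of the individual measurements.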
The data obtained from Tables 8–11 fit a multiple linear regression model relating the dependent variable Pred_Obj_dist (cm) to the independent variables.
Where Pred_Obj_dist is an observed score on the dependent variable, c is the intercept, p1 and p2 are slopes, and FoV, m and f are observed scores on the independent variables.
For Validation purpose we have taken a statistical Analysis tool named SPSS (
Table 12 provides R, R², adjusted R² and the standard error of the estimate (Δx), which together indicate how well the regression model fits the data.
Determining how well Model fits: Model Summaryb
a: Predictors: (constant), FoV, objective magnification (m), focal length (f). b: Dependent variable: Pred_Obj_dist.
The R column represents the multiple correlation coefficient, which is a measure of the quality of prediction of the dependent variable. A value of 0.992 indicates a good level of prediction. The R² column represents the proportion of variance in the dependent variable that is explained by the independent variables. The value of 0.984 shows that our independent variables explain 98.4% of the variability of our dependent variable.
With reference to the F-table, the critical value of the F-ratio is 2.79. The F-ratio obtained from Table 13 is 1006.237, which is larger than the critical value. Hence we conclude that the obtained F-ratio is unlikely to have occurred by chance (p < 0.05). The output also shows that the independent variables statistically significantly predict the dependent variable (Pred_Obj_dist), F(3, 49) = 1006.237, p < 0.05.
ANOVA Table-Statistical Significancea
a: Predictors: (constant), FoV, objective magnification (m), focal length (f). b: Dependent variable: Pred_Obj_dist. SS: sum of squares. MS: mean square.
As shown in the ‘Sig’ column of Table 14, the coefficients of all independent variables are significantly different from 0 (zero), with p < 0.05. Hence, from Table 14, the general form of the equation to predict distance from FoV, m and f is expressed as:
Estimated Model Coefficients-Statistical significance of Independent variables
B, Std. Error: unstandardized coefficients; Beta: standardized coefficients.
Putting it all together, a multiple regression has been computed to predict the estimated Pred_Obj_dist from magnification, Field of View and focal length. These variables statistically significantly predicted Pred_Obj_dist, F(3, 49) = 1006.237, p < 0.05, R² = 0.984. All three predictors added statistically significantly to the prediction, p < 0.05.
It is found from Table 15 that our model performs on par with the depth estimation techniques of Yujiao Chen et al. [31], C.Q. Farias et al. [32], Said Pertuz et al. [33] and Palmieri, L. et al. [34], showing high consistency between the predicted (estimated) Pred_Obj_dist and the reference (actual) distance values (Act_Obj_dist) with a Pearson’s correlation coefficient of r = 0.81.
State of Art comparison with reference to recent existing methods in terms of Experimental Set up and Techniques
From a design point of view, unlike the micro-lens and parallel stereo system arrangements in Yujiao Chen et al. [31] and C.Q. Farias et al. [32], we used a DSLR with a Nikkor 18–55 mm zoom lens for taking 2D images. Instead of the disparity and the distance between the image plane and the micro-lens array, we used inherent properties of the camera, namely Field of View and lens magnification, for distance measurement. No binocular vision strategy is applied in the proposed method; instead, a single 2D image is used for depth measurement.
A quantitative comparison of the proposed method has also been made with the recent works of C.Q. Farias et al. (2016) [32] and Said Pertuz et al. (2018) [33] in terms of two regression performance metrics, namely Root Mean Square Error (RMSE) and R-squared. RMSE is used to assess how well the model performs, while R-squared, in the context of a linear regression model, gives the amount of variability in the predicted distance that is explained by the model. A third metric, Pearson’s correlation coefficient (r), is considered in order to examine the strength of the linear association between the predicted and ground truth (actual) distance. It is observed from Table 16 that our model is on par with the existing works, exhibiting 98.4% correlation in predicting the dependent variable with an RMSE of 0.579.
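For reference, the three metrics can be computed as in the following sketch, which uses only the standard library. The function names are illustrative, and the two short lists are synthetic values chosen to make the arithmetic easy to follow; they are not taken from Table 16.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson's correlation coefficient between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Synthetic distances in metres, for illustration only.
actual_m = [3.0, 4.0, 5.0, 6.0, 7.0]
predicted_m = [3.5, 3.5, 5.0, 6.5, 6.5]

err = rmse(actual_m, predicted_m)
r2 = r_squared(actual_m, predicted_m)
r = pearson_r(actual_m, predicted_m)
```

Note that R-squared computed this way compares predictions against the mean of the actual values, whereas Pearson's r measures linear association only; the two need not coincide, which is why both are reported.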
Quantitative comparison of camera to person distance estimation methods on the Test Images: The Values Show the Root Mean Square Error (RMSE), Regression Correlation Coefficient (R2), Pearson’s Correlation Coefficient(r) between Predicted and Ground Truth (Actual) Distance values
This work presented a theory relating the Field of View and magnification for camera-to-object distance (or depth) estimation. Close range photography was employed to investigate the influence of magnification and Field of View on camera-to-object distance, and also the impact of perspective errors on estimating the object distance from the lens center. A Nikon DSLR camera with an AF-P Nikkor 18–55 mm zoom lens was used to extract the Field of View and magnification measurements up to 6.81 m at 30 cm intervals indoors and up to 7 m outdoors, with reference to the camera lens center.
The proposed model was tested on a data set taken in indoor and outdoor environments in the distance range of 3 m–7 m. For estimating the real height of a person, 22 males and 11 females, each with 11 instances at different distances and varying focal lengths, were taken initially, and the experiment was later extended to the outdoor environment. Around 50 persons, each with 9 instances within 3 m–7 m, were photographed under different lighting conditions (early hours, midday and sunset) using an 18 mm Nikkor zoom lens. Considering human body anthropometrics, the real height is estimated and the depth values are subsequently obtained from the Field of View and magnification measurements. The study developed a multiple regression model relating the Field of View and magnification parameters to camera-to-object distance, with R and R² of 0.984 and 0.983. The model also appears competitive with recently reported works, providing high consistency between the reference and observed measurements with a Pearson’s correlation coefficient of 0.91.
Acknowledgments
The research is supported by the Visvesvaraya Ph.D. scheme for Electronics & IT [Order No: PhD-MLA/4(16)/2014], a division of the Ministry of Electronics & IT, Govt of India.
We gratefully acknowledge the association of SICA (Southern Indian Cinematographers), Chennai, in creating the dataset used to validate the proposed theory. The authors would also like to thank the Student society, Department of Computer Applications, who supported us in the dataset creation.


