Abstract
The security of elderly people living alone is a major issue. A system that detects anomalies can be useful in both individual homes and retirement homes. In this paper, we present an adaptive human tracking method built on a particle filter, using depth and thermal information based on the velocity and the position of the head. The main contribution of this paper is the fusion of information to improve tracking. For each frame, a new combination of coefficients is computed for each particle based on an adaptive weighting. Results show that the tracking method can deal with fast motion (falls), partial occlusion and scale variation. To assess the impact of fusion on the tracking process, the robustness and accuracy of the method are tested on a variety of challenging scenarios with and without depth-thermal fusion.
Introduction
According to the French institute of health education (INPES), 9,300 people die each year from falls. These falls occur mainly at home (78% of falls) and especially at night (60%), with physical and psychological consequences. According to the World Health Organization (WHO), falling is the second leading cause of accidental or unintentional injury deaths worldwide [1]. For these reasons, an automatic system that could prevent and detect falls and call emergency services can be useful, including in retirement homes. Many fall detection (FD) and fall prevention (FP) systems have been presented by researchers. These systems can be classified into three categories according to the type of sensor used: wearable technologies, ambient technologies and a combination of wearable and ambient technologies. Wearable technologies encompass two different types of hardware: inertial sensors (e.g., tri-axial accelerometers) and locating systems (e.g., GPS). Ambient technologies include vision sensors (e.g., cameras), sound sensors (e.g., microphones), radar sensors (e.g., Doppler radar), infrared sensors and pressure sensors (e.g., floor sensors) or combinations of them [2].
In this paper, we aim to develop a person tracking algorithm in order to improve the accuracy and the sensitivity of the system proposed in [3] and to reduce the number of false alarms. Moreover, we would like to track the elderly person’s activity in order to prevent falls.
In a previous work [4], we proposed a head tracking method that fuses low-cost thermal and depth sensors for home environments whilst preserving privacy. Adding a thermal sensor improves tracking with the depth sensor. For example, thermal information corrects depth detection by discriminating between hot objects and cold objects that were moved after the background image was calculated. The results demonstrated that fusion improves tracking, particularly when segmentation was erroneous. However, the method missed partially occluded falls and was unable to track fast motions in real time, both of which matter for fall detection. For these reasons, this paper examines data fusion to improve tracking under fast motion and partial occlusion using a particle filter (PF) algorithm based on head position. Particle filtering is a sequential importance sampling method that uses a set of particles to estimate the posterior distribution of a Markovian process, given noisy observations. The key idea of PFs is to represent and maintain the posterior density function by a set of random samples with associated weights and to compute the state estimate from those samples and their weights. For each depth-thermal image pair, the head position is first segmented in the depth image, and then matched with the thermal image using calibration information to predict the actual position according to the previous state. The fusion of thermal and depth information is used to update this predicted state.
This paper extends the depth-thermal tracking method based on a particle filter, explained in [4], by including the velocity of the head in the state vector to improve the tracking of fast motion. The method was tested on several sequences, with and without depth-thermal fusion. The results show its robustness and accuracy and also demonstrate that adaptive measurements for each particle, using the velocity and the position of the head, improve tracking under fast motion, partial occlusion and scale variation.
The paper is organized as follows: Section 1 gives a general introduction to fall detection systems. Section 2 gives an overview of the state-of-the-art vision-based fall detection systems. Section 3 describes the material used, the architecture of the tracking algorithm, and the proposed methodology to detect falls. Section 4 discusses the experimental set-up of our dataset, the results with and without depth-thermal fusion, as well as the performance evaluation and a detailed discussion. Section 5 concludes and outlines directions for further research.
Related work
A fall is defined as an event which results in a person coming to rest inadvertently on the ground or floor or other lower level. Adults older than 65 years of age suffer the greatest number of fatal falls [1]. Several FD systems have been proposed to identify and classify human activities of daily living (ADL) and to reduce the risk of falls among the elderly as well as the response and rescue time. The number of survey studies on FD has grown rapidly worldwide. For example, Mubashir et al. [5] chose to classify the FD systems into three categories: wearable device based, ambience sensor based and camera (vision) based. However, Igual et al. [6] chose only two categories: context-aware systems and wearable devices. While fall detection is a promising field, open challenges remain. In this paper, we study the most commonly cited works in the literature according to their advantages and their drawbacks such as cost, application, installation and privacy.
Over the last decade, the focus has been on context-aware systems (vision systems especially), because the person is more independent and not constrained by the presence and the configuration of the device. Several methods use particle filters for object tracking and localization. In [7, 8] the authors describe the application of particle filters for tracking moving objects, using background subtraction to track human silhouettes based on color images. In [9], Rougier et al. used the head's velocity to detect falls in visual videos by setting thresholds manually. In the same vein, Bouaynaya et al. [10] used a particle filter for head tracking based on color histograms. In [11], Loza et al. applied PF to thermal imaging. Mubashir et al. [5] used head position to track the person's silhouette based on a Gaussian classifier. In [12] the silhouette was extracted from video to localize the person, which is a common strategy in the literature. However, these methods produce false alarms because it is difficult to distinguish a fall from other similar actions, e.g. sitting down. Therefore, in [13], Auvinet et al. added other cameras to analyze the shape of the person in 3D and avoid hidden falls. But elderly people dislike the use of visual cameras even with local processing. They prefer non-invasive devices which preserve their privacy, according to a psychosocial study conducted by the LAUREPS laboratory at the University of Rennes 2.
In order to protect user privacy, 3D fall detection systems using depth sensors have been proposed. The aim of using a depth camera like Kinect is to analyze the human shape and extract 3D features for fall detection [14]. A recent work used head position detection, extracted from depth images [15], and the experimental results confirm the feasibility and the effectiveness of the approach for real world applications. In [16], 3D data are exploited to perform head detection for a fall detection framework. Human silhouettes, obtained by background subtraction, are detected and all possible head positions are searched on contour segments. However, this approach failed to recognize some falls correctly, for instance when the person bent their knees considerably to slow down the fall.
To avoid this problem, some works have used other non-invasive sensors such as thermal sensors. For example, Hayashida et al. [17] integrate a thermal infrared array sensor to detect falls by computing the maximal thermal difference between the background and foreground pixels, a technique used for static cameras. The current frame is subtracted from the model of the background scene, and the difference determines the moving objects. However, the configuration was sensitive to room temperature and brightness. In [18], the authors proposed a system to recognize human activities, including falls, by means of a single thermal infrared sensor. Several features based on temperature thresholds are evaluated by a Support Vector Machine (SVM) for classification. In [19, 20], the authors proposed another type of thermal sensor, which is however relatively expensive. In [21], a very economical thermal imaging based input modality is proposed to detect falls using the optical flow of human movements, tested on public datasets. These methods achieved good performance but showed some confusion between falling and sitting.
The number of studies using analytical methods is still increasing but there is a new trend in fall detection which is the use of machine learning methods and the most popular algorithm in this context is deep learning. For instance, Quero et al. [22] detected falls from non-invasive thermal vision sensor (Heimann HTPA
In order to efficiently improve results, some papers combined sensors. Interesting examples are provided in [25] and [26], for example Kinect and accelerometers, or cameras with microphones plus accelerometers. In [27], human silhouette was extracted using RGB-D camera. Recently RGB-T systems attracted a lot of attention, e.g. Wu et al. [28] combined RGB and thermal data into one vector which, however, introduces redundant information and the use of color information cannot preserve user privacy.
In this paper, we propose a combination between depth and thermal sensors, FLIR sensor (80
Material and methods
The proposed system aims to track the head position using two types of sensors with different resolutions which are mounted together. The head position is tracked with an analytical method applied to a segmented frame. With the calibration step done before processing starts, the unidirectional thermal-depth matching can be made throughout all sequences.
In this research, we chose the head position as the Region Of Interest (ROI) because the head is non-deformable and is the hottest, highest and least hidden part of the body, which can easily be approximated as an ellipse with only a few parameters. Head motion is also a significant marker for fall detection.
Cameras and dataset
The fall detection system is based on thermal sensor (FLIR lepton 2.5, Focal length: 5 mm, Thermal Horizontal Field of View
Camera system.
Figure 2 illustrates the framework of our proposed fall detection system. The system can be divided into three principal stages: calibration, segmentation and tracking. The calibration step, which is performed only once, is executed after attaching the sensors to the ceiling so that a depth pixel can be matched to its corresponding thermal pixel. The segmentation step, based on acquired images, detects the depth foreground image by background subtraction and extracts the head position. The tracking step is based on the head position segmented on the depth image, matched to the thermal position and improved by the particle filter.
Fall detection framework.
A calibration step is required to calculate the transformation parameters (extrinsic parameters). In the literature, a conventional black and white chessboard pattern is used in many existing methods to calibrate two or more cameras. To obtain more accurate calibration results, this pattern needs to be kept near the cameras, and its orientation can limit the number of usable poses [29]. Besides, this pattern cannot be seen by thermal sensors. For these reasons, we decided to design a special pattern which contains several tubes of different heights mounted together on a board, with a resistor fixed on each tube. The idea of this pattern is simple: the tubes are seen by the depth sensor and the heat emitted by the resistors is seen by the thermal sensor. The calibration pattern is shown on depth, thermal and color images in Fig. 3.
Calibration pattern on thermal image, depth image and color image respectively from the left.
The calibration operation comprises modeling the image transformation process. This process transforms points from image coordinates to a common world coordinates system. The idea is to find the relation between the coordinates of a point in the depth image with the associated point in the thermal image taking into consideration the spatial coordinates of each point (Fig. 4). The estimation of the relationship between these two coordinate systems needs three steps [30]:
Calibration system.
The estimation of the transformation of the depth image coordinates
The transformation between the coordinate system
where The transformation between the coordinate system
In our case, the intrinsic parameters are the values given by the constructor. So the purpose of the calibration is to estimate 3 parameters of rotation transformation and 3 parameters of translation transformation
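To make the role of the six extrinsic parameters concrete, the following minimal sketch maps a depth pixel to a thermal pixel under a standard pinhole camera model. It is an illustration under stated assumptions, not the calibration procedure of this paper: the function names `rotation_matrix` and `depth_to_thermal`, the rotation order (R = Rz Ry Rx) and all intrinsic values used with it are hypothetical.

```python
import math


def rotation_matrix(rx, ry, rz):
    """Build a 3x3 rotation matrix R = Rz * Ry * Rx from three angles in radians.

    The angle convention is an assumption made for this sketch.
    """
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def depth_to_thermal(u, v, z, depth_k, thermal_k, rot, trans):
    """Map depth pixel (u, v) with depth z to thermal pixel coordinates.

    depth_k and thermal_k are (fx, fy, cx, cy) intrinsics (hypothetical values);
    rot is a 3x3 rotation and trans a 3-vector: the six extrinsic parameters.
    Returns None if the point lies behind the thermal camera.
    """
    fx, fy, cx, cy = depth_k
    # Back-project the depth pixel to a 3D point in the depth camera frame.
    point = ((u - cx) * z / fx, (v - cy) * z / fy, z)
    # Apply the rigid transformation into the thermal camera frame.
    xt, yt, zt = [
        sum(rot[i][j] * point[j] for j in range(3)) + trans[i] for i in range(3)
    ]
    if zt <= 0:
        return None
    # Project the 3D point onto the thermal image plane.
    fx_t, fy_t, cx_t, cy_t = thermal_k
    return (fx_t * xt / zt + cx_t, fy_t * yt / zt + cy_t)
```

With an identity rotation, zero translation and identical intrinsics, the function returns the input pixel unchanged, which is a convenient sanity check before plugging in estimated parameters.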
In order to improve the segmentation robustness, we calculate a reference map based on the mean and standard deviation of the
if
then
else
where
At this point, one or more areas are detected as foreground: we compare these areas and keep only the largest one. Once the area is chosen, we approximate the body with an ellipse.
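The selection of the largest foreground area and its approximation by an ellipse can be sketched as follows. This is a minimal illustration, assuming a binary foreground mask given as a list of rows, 4-connectivity, and an ellipse derived from second-order moments; the paper does not state which fitting method is used, and the names `largest_region` and `fit_ellipse` are hypothetical.

```python
import math
from collections import deque


def largest_region(mask):
    """Return the pixels [(row, col), ...] of the largest 4-connected foreground region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill of one connected region.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best


def fit_ellipse(region):
    """Approximate a pixel region by an ellipse using its second-order moments.

    Returns (cx, cy, semi_major, semi_minor, angle), angle in radians
    measured from the image x axis.
    """
    n = len(region)
    cy = sum(p[0] for p in region) / n
    cx = sum(p[1] for p in region) / n
    sxx = sum((p[1] - cx) ** 2 for p in region) / n
    syy = sum((p[0] - cy) ** 2 for p in region) / n
    sxy = sum((p[1] - cx) * (p[0] - cy) for p in region) / n
    # Eigenvalues of the 2x2 covariance matrix give the spread along each axis.
    common = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam1 = (sxx + syy) / 2.0 + common
    lam2 = max((sxx + syy) / 2.0 - common, 0.0)
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # For a uniformly filled ellipse, the variance along an axis is (semi-axis)^2 / 4.
    return cx, cy, 2.0 * math.sqrt(lam1), 2.0 * math.sqrt(lam2), angle
```

A typical use is `fit_ellipse(largest_region(mask))`, which discards small spurious blobs before the silhouette ellipse is computed.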
Silhouette and head position.
Finally, we model the head as a smaller ellipse with the same orientation of the silhouette ellipse. Human adult body proportions are brought about by differential growth of the body segments. From 25 years of age the head is only approximately one-sixth of the total body length [32]. Therefore, we fixed the center
where
The aim of the tracking is to estimate the position of the head during a sequence by considering the last movement of this ROI. Therefore, we chose a sequential Monte Carlo method which is Particle Filter (PF) method. At each frame
PF method seeks to estimate the hidden state vector
PF uses a sample of
To improve the estimation of the head position especially in cases of fast motion, we added the velocity
Below, we briefly define the PF algorithm. For each frame, we resample a new sample of
where
Segmentation step and PF algorithm.
Thus, the steps of iterative PF tracking algorithm are:
Initialization: Generate a sample of
Resampling: Resample particles to prevent the problem of particles degeneration, if frame
Prediction: Propagate particles according to prediction model to predict the state vector
Updating: Update the particle weight Then normalize the weight:
and return to step 2.
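The four steps above can be sketched as a generic bootstrap particle filter. This is a minimal illustration under assumptions, not the implementation of this paper: the state is reduced to position and velocity (x, y, vx, vy), the prediction uses a constant-velocity model with Gaussian noise, and the weight is a simple Gaussian likelihood of the distance to a measured head position rather than the fused depth-thermal weight of Eq. (11). All function names and noise values are hypothetical.

```python
import math
import random


def init_particles(n, x, y, spread, rng):
    """Initialization: draw n particles [x, y, vx, vy] around the first head position."""
    return [[rng.gauss(x, spread), rng.gauss(y, spread), 0.0, 0.0] for _ in range(n)]


def resample(particles, weights, rng):
    """Resampling: draw particles in proportion to their weights (multinomial)."""
    chosen = rng.choices(particles, weights=weights, k=len(particles))
    return [p[:] for p in chosen]  # copy so duplicates evolve independently


def predict(particles, dt, pos_noise, vel_noise, rng):
    """Prediction: propagate each particle with a constant-velocity model plus noise."""
    for p in particles:
        p[2] += rng.gauss(0.0, vel_noise)
        p[3] += rng.gauss(0.0, vel_noise)
        p[0] += p[2] * dt + rng.gauss(0.0, pos_noise)
        p[1] += p[3] * dt + rng.gauss(0.0, pos_noise)


def update(particles, measurement, sigma):
    """Updating: weight particles by a Gaussian likelihood, then normalize."""
    mx, my = measurement
    raw = [
        math.exp(-((p[0] - mx) ** 2 + (p[1] - my) ** 2) / (2.0 * sigma ** 2))
        for p in particles
    ]
    total = sum(raw)
    if total == 0.0:  # all particles far from the measurement: fall back to uniform
        return [1.0 / len(particles)] * len(particles)
    return [w / total for w in raw]


def estimate(particles, weights):
    """State estimate: weighted average of the particle positions."""
    x = sum(w * p[0] for p, w in zip(particles, weights))
    y = sum(w * p[1] for p, w in zip(particles, weights))
    return x, y


def track(measurements, n=500, seed=0):
    """Run the filter over a list of measured (x, y) positions, one per frame."""
    rng = random.Random(seed)
    x0, y0 = measurements[0]
    particles = init_particles(n, x0, y0, 2.0, rng)
    weights = [1.0 / n] * n
    path = [(x0, y0)]
    for z in measurements[1:]:
        particles = resample(particles, weights, rng)
        predict(particles, 1.0, 1.0, 1.0, rng)
        weights = update(particles, z, 3.0)
        path.append(estimate(particles, weights))
    return path
```

Keeping the velocity in the state lets particles whose velocity matches the target survive resampling, which is what allows the predicted positions to follow a fast-moving head.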
Updating particle weights is a key point of PF and is specific for each application (see [35] for color information).
The weight of a particle is defined by Eq. (11):
where
We have tested several values and we returned the one that rendered the best result and
When updating particle weight, we have observed that an occluded particle can decrease the performance of tracking. For example, it can influence the result of the AM model. To avoid this problem, we have added a flag to each particle at each frame (Flag (
Occluded particle conditions
A particle is occluded for a sensor if it is out of the vision field of this sensor. This can occur especially after the prediction step.
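The field-of-view condition can be illustrated with a short sketch. It assumes that a particle carries a position in depth image coordinates and that a calibration-based mapping to thermal coordinates is available; the function names, the `to_thermal` callable and the image sizes in the usage note are hypothetical and serve only as an example.

```python
def in_field_of_view(x, y, width, height):
    """Return True if pixel position (x, y) lies inside an image of the given size."""
    return 0 <= x < width and 0 <= y < height


def occlusion_flags(particle_xy, to_thermal, depth_size, thermal_size):
    """Return (occluded_for_depth, occluded_for_thermal) for one particle.

    particle_xy is the particle position in depth image coordinates.
    to_thermal is a hypothetical callable mapping depth coordinates to thermal
    coordinates (for example, derived from the calibration); it may return None
    when no valid mapping exists.
    """
    x, y = particle_xy
    occluded_depth = not in_field_of_view(x, y, *depth_size)
    mapped = to_thermal(x, y)
    occluded_thermal = mapped is None or not in_field_of_view(
        mapped[0], mapped[1], *thermal_size
    )
    return occluded_depth, occluded_thermal
```

For instance, with an illustrative depth size of (320, 240) and thermal size of (80, 60), a particle near the border of the depth image can be visible to the depth sensor yet fall outside the narrower thermal image, in which case only its thermal flag is set.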
In this work, we tested four models of coefficient combination to update the particle weights in Eq. (11).
The first model (M1) uses only 2 depth coefficients (
where
The second model (M2) combines one depth coefficient (
The third model (M3) combines the 3 coefficients, Eq. (14):
We call parameters
We tested several combinations of static IF in order to estimate the impact of each coefficient (see Table 2).
Importance factor
Tracking results on different frames of depth sequence a) Segmentation only, b) Depth version (M1 model), c) First fusion model (M2). Tracking results are in white, silhouette ellipse is red and GT ellipse is black.
The use of static IF values is a general way to estimate the impact of each coefficient, because the value is fixed during the whole sequence. But at certain frames, thermal information can be more important than depth information and conversely. For instance, when the person is close to furniture that was moved after the reference map was calculated, the depth observations may not be relevant, because the silhouette can be merged with the furniture. Similarly, if the person is close to a heater, the thermal observation may be unreliable. Therefore, we decided to adjust the importance factors dynamically and change the values at each frame according to the importance of each coefficient using these rules, Eqs (3.7)–(3.7).
Tracking results on two frames of sequence a) MM model and b) AM model. Tracking results are white, segmentation ellipse is red and GT ellipse is black.
Tracking comparison results between a) first fusion model (M2) and b) second fusion model (M3). Tracking results are white, silhouette ellipse is red and GT ellipse is black.
Quantitative measurements over a sequence. Localization error (a) and the overlap score (b) using the segmentation (red), the depth version (blue) the M2 model (cyan) and the M3 model (blue).
Tracking results of 6 IF tests on the same frame a) C1 test, b) C2 test, c) C3 test, d) C4 test, e) C5 test and f) C6 test, tracking results are white, silhouette ellipse is red and GT ellipse is black.
In the following sections, we compare this model, called M4, with the models defined previously.
In this section, we demonstrate the performance of the proposed algorithm. We recorded several sequences of people moving in a room with co-calibrated static depth and thermal cameras fixed to the ceiling. We tested our system with the following objectives: (1) compare our proposed method with segmentation only and depth tracking methods, (2) evaluate the performance of the fusion algorithm, (3) evaluate each IF model, and (4) compare IF values.
In all tests, we used the following values:
The evaluation of our method uses two quantitative metrics (more details in [36]): the localization error (called precision plot), which is defined as the average Euclidean distance between the center locations of the tracked targets and the manually labeled ground truths, and the overlap score (called success plot), which is the overlap of the ground truth area and the tracking area.
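The two metrics can be sketched as follows. This is a minimal illustration under assumptions: the overlap is computed as intersection over union of axis-aligned bounding boxes (x, y, width, height), a common simplification, whereas the tracked and ground truth regions in this paper are ellipses. The function names are hypothetical.

```python
import math


def localization_error(tracked_centers, gt_centers):
    """Average Euclidean distance between tracked and ground truth centers."""
    dists = [
        math.hypot(tx - gx, ty - gy)
        for (tx, ty), (gx, gy) in zip(tracked_centers, gt_centers)
    ]
    return sum(dists) / len(dists)


def overlap_score(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle (zero if boxes are disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

A lower localization error and a higher overlap score both indicate better tracking; the overlap score also penalizes errors in the estimated size, which the center distance alone does not capture.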
Fusion of information
In this section, we illustrate, as described in Sections 3.4 and 3.5, the results of head segmentation (Eq. (6)), depth version (M1 model) (Eq. (12)) and first fusion model M2 (Eq. (13)), the difference between AM and MM models and a comparison between the first fusion model M2 and second fusion model M3 (Eq. (14)).
Figure 7 shows a representative normal ADL sequence from our dataset used in the evaluation experiment. The first three images in Fig. 7 represent the results of segmentation. These results show that segmentation is wrong because the size and the position of the head do not vary whereas the silhouette's position relative to the sensor does. In this case, the problem is caused by the segmented silhouette, which does not contain the legs. The second test is based on the depth method mentioned before. Comparing these results, we can see that the depth version M1 is totally erroneous because this method uses segmentation to calculate the distance coefficient. In other words, depth information alone is not sufficient in this case. The last three images show the results of the first fusion model M2, which is able to track the head more accurately as the person moves because it combines thermal and depth imaging.
The results of MM and AM models are shown on Fig. 8a) and b) respectively. We can see that the AM model provides the closest pose to GT.
In order to evaluate coefficient combination, Fig. 9 shows a comparison of two fusion models a) M2 model and b) M3 model. Visually the second fusion model provides the closest pose to GT.
To validate these results, we evaluated these models according to the precision and success metrics. The quantitative measurements over a sequence, shown in Fig. 10, indicate that the fusion of 3 coefficients provides the most accurate results. As expected, considering three coefficients together gives better results than using only two coefficients.
Comparison of static IF models
As mentioned in Section 3.6, the third model (M3) combines the 3 coefficients (Eq. (14)).
To evaluate the importance of each IF
Figure 11 illustrates a comparison between these tests on a normal ADL frame. The visual results show the impact of IF in estimating the new head position. Confirming the IF impact during a sequence using the two quantitative measurements, Fig. 12 shows a clear difference between the performance of C4 (Fig. 11d) compared to other tests.
IF quantitative measurement over a sequence. Localization error of C1 test (blue), C2 test (yellow), C3 test (black), C4 test (red), C5 test (cyan) and C6 test (blue).
In this study, we present an improved version of an algorithm initially proposed in our earlier work [4]. In addition to size and orientation of the head ellipse, we have added the velocity to the state vector. Figure 13 illustrates a representative scene of fast motion. The first two images (Fig. 13a) show results of the algorithm without adding velocity on the state vector. Figure 13b) shows the impact of velocity especially in fast movement.
Tracking comparison results between a) algorithm without velocity b) algorithm with velocity. Tracking results are white, silhouette ellipse is red and GT ellipse is black.
As mentioned in Section 3.7, the fourth model (M4) adjusts the importance factors dynamically and changes the values at each frame according to the importance of each coefficient. To evaluate the impact of the adaptive combination, we compared this model with the result of test C4 mentioned in Section 4.2. Figure 14 illustrates a comparison between C4 and this model according to the success metrics, and shows a clear difference between the performance of M4 and that of C4.
Summary
In this research, we started by testing the models of head estimation. The first model AM, which considers the weighted average of these
Quantitative measurements over a sequence. Localization error of C4 (red) and M4 (purple).
Raw depth image with large black regions (masked information due to poor quality).
In order to estimate the impact of each observation, we assigned an importance factor IF to each coefficient and we compared 6 different tests of static IF. The results were clearly different between each test according to the environment at time
Finally, we added velocity to the state vector to improve the estimation of the head position especially in cases of fast motion.
In this paper, we have detailed a tracking approach based on a particle filter using depth and thermal information fusion to detect the position of the head of a person in an indoor environment. Position, velocity, orientation, and size of the ellipse enclosing the head are used to predict the new position of the head. Furthermore, adaptive weighting was applied on the measurements of each particle according to the strength of each coefficient to update the predicted position on each frame. Consequently, this method solves the updating problem we encountered in previous tracking works, caused by changes in the background. The proposed framework has been tested in several situations with different models and compared with other methods to establish the accuracy of the algorithm. Moreover, results have shown that our system gave the most accurate tracking results even in critical situations with very low resolution images.
Our aim is to refine the work presented here and better address the constraints of fall detection systems. Going forward, we plan to use deep learning (DL) methods, given their performance in recent works [19, 20], to recognize human posture more accurately. We will start with four postures (standing up, sitting, lying on the ground and lying on a bed or sofa) in the context of fall detection and also fall prevention by activity analysis. Before using DL, we will apply a preprocessing step on depth images to enhance their quality and avoid losing pertinent information (see Fig. 15).
Acknowledgments
This work is funded under the PRuDENCE project (ANR-16-CE19-0015-02) which has been supported by the French National Research Agency. A sincere thank you to Tabitha Courbin for her diligent proofreading of this paper.
