Abstract
Recently, drone technology has developed rapidly for various purposes. A drone is very useful for aerial surveillance due to its remote sensing capability. Multiple target detection and tracking are essential to recognize harmful threats in advance; however, an image captured at a distance is easily degraded by blurring and noise as well as low resolution. This paper addresses the detection and tracking of moving vehicles with drone imaging. A drone captures video sequences of multiple moving vehicles from a distance. The objects of interest are cars and buses driving on urban roads. The detection step consists of frame difference followed by thresholding and morphological operations, taking into account the size of each region of interest (ROI). The centroids of the ROIs are used as measurements for tracking. Tracking is performed with interacting multiple model (IMM) filtering, which estimates the state vectors and covariance matrices using multiple modes of Kalman filtering. The measurements in the validation region are associated with established tracks by the nearest neighbor rule. In the experiment, a total of seven moving cars and buses are captured at a long distance by a drone. The results show that the proposed method detects the moving vehicles well and estimates their locations with good accuracy.
Introduction
Recently, the use of unmanned aerial vehicles (UAVs), or drones, for aerial surveillance has increased. Drones are cost-effective, and the camera installed on a drone is easily manipulated to capture scenes of interest. Thus, advances in drone technology can enable efficient surveillance systems [1].
Multiple object detection and tracking are important for security and surveillance [2]. Various studies on visual detection and tracking of moving objects have been reported in the literature. Methods based on background subtraction and frame difference are studied in [3–9]. Gaussian mixture modeling (GMM) was used to analyze background and target regions [3–5]. The scale-invariant feature transform (SIFT) was adopted to extract foreground objects from the background scene [6, 7]. In [8], the background was subtracted under a Gaussian mixture assumption, followed by morphological filtering. Long-range moving objects are detected by a drone in [9]. In [10, 11], moving cars are detected with a background image and tracked by Kalman filtering.
Many studies have been conducted on multiple target tracking [10–15]. Highly maneuvering targets, closely located multiple targets, heavy clutter (false alarms), and low detection probability are often considered difficulties to overcome [12]. Kalman filtering provides a solution for estimating the state of the target in real time. It is known to be optimal under the independent Gaussian noise assumption [13]. The interacting multiple model (IMM) approach is widely applied to highly maneuvering targets because it can adapt by switching among multiple modes [14]. When multiple measurements are detected in a frame, data association is required to assign the measurements to the established tracks [12, 15].
In this paper, we address the detection and tracking of multiple moving vehicles on urban roads captured by a drone. A drone captures video sequences of moving cars and buses from a long distance. Object detection is performed through frame difference followed by thresholding, morphological filtering, and a size limit on the regions of interest (ROIs). The frame difference is calculated using two frames at a constant interval. Then, thresholding is applied to generate a binary image. Two morphological operations, erosion and dilation, are applied to the binary image to produce candidate ROIs, which are disjoint areas that may represent moving vehicles. Finally, false ROIs are removed by comparing the size of each ROI with the real object sizes. The centroids of all remaining ROIs are fed to the tracking stage as measured two-dimensional positions.
Tracking is performed by IMM filtering to estimate the state of the target and the covariance matrix. Nearly constant velocity (NCV) models with two different covariance matrices of the process noise are assumed for the dynamic states of the target [14]; that is, each IMM mode is set up with a different process noise. A gating process excludes measurements outside the validation region. The nearest neighbor measurement-to-track association scheme assigns one measurement to one track [10, 11]. In the experiments, a total of seven moving cars and buses are captured at a long distance by a drone at a height of 100 meters. Experimental results show that their locations are detected and tracked with good accuracy by the proposed method.
The remainder of the paper is organized as follows. Object detection is discussed in Section II. Multiple target tracking is presented in Section III. Section IV demonstrates experimental results. Conclusions follow in Section V.
Object detection with frame difference
The block diagram of object detection is described in Fig. 1. First, the difference between the current and the past frames at a constant interval is obtained and then, a thresholding step follows to generate a binary image as

Fig. 1. Block diagram of moving object detection.
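As a minimal sketch of the frame difference and thresholding step, the following assumes 8-bit grayscale frames stored in a NumPy array and a strict "greater than" comparison against the threshold. The function name `frame_difference_mask`, the array layout, and the comparison direction are illustrative assumptions; the defaults `k_d=10` and `theta_T=30` are the values reported in the experiments.

```python
import numpy as np


def frame_difference_mask(frames, k, k_d=10, theta_T=30):
    """Return a binary mask of pixels that changed between frame k - k_d and frame k.

    frames : array of shape (num_frames, height, width), 8-bit grayscale (assumed)
    k      : 0-based index of the current frame; must be at least k_d
    """
    if k < k_d:
        raise ValueError("frame k - k_d does not exist")
    # Cast to a signed type first so the subtraction of uint8 values cannot wrap around.
    current = frames[k].astype(np.int32)
    past = frames[k - k_d].astype(np.int32)
    return np.abs(current - past) > theta_T
```

With `k_d = 10`, the first frame that can be processed is the 11th one (index 10 when counting from zero), which matches the starting frame used in the experiments.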
System modeling
The dynamic state of the target is modeled as a nearly constant velocity (NCV) model; the targets’ maneuvering is modeled by the uncertainty of the process noise, which is assumed to follow the Gaussian distribution. The following is the discrete state equation of a target:
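A common textbook form of the NCV model can be sketched as follows. The state ordering [x, vx, y, vy], the white-acceleration noise gain, and the position-only measurement matrix are assumptions made for illustration, and the helper name `ncv_model` is hypothetical; the notation used in this paper may differ.

```python
import numpy as np


def ncv_model(dt, sigma_x, sigma_y):
    """Build NCV matrices for the assumed state [x, vx, y, vy].

    dt               : sampling time in seconds
    sigma_x, sigma_y : process noise standard deviations (m/s^2), treated as
                       random accelerations along each axis (assumption)
    """
    f_axis = np.array([[1.0, dt], [0.0, 1.0]])   # position/velocity transition per axis
    g_axis = np.array([[0.5 * dt ** 2], [dt]])   # how an acceleration enters the state
    F = np.kron(np.eye(2), f_axis)               # 4x4 block-diagonal transition matrix
    G = np.kron(np.eye(2), g_axis)               # 4x2 noise gain
    Q = G @ np.diag([sigma_x ** 2, sigma_y ** 2]) @ G.T
    H = np.array([[1.0, 0.0, 0.0, 0.0],          # only positions are measured
                  [0.0, 0.0, 1.0, 0.0]])
    return F, Q, H
```

Under these assumptions, the two IMM modes would share `F` and `H` and differ only in `Q`, for example `ncv_model(1 / 48, 30, 30)` and `ncv_model(1 / 48, 60, 60)` for the noise levels used in the experiments.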
The state vectors and the covariance matrices of the IMM mode filters at the previous frame k - 1 are mixed to generate the initial state vectors and the covariance matrices for each IMM mode filter at the current frame k. The initial state vector and the covariance matrix of target t for mode j with the mixing probabilities are obtained, respectively, as
where r is the number of IMM filter modes,
Kalman filtering is performed for each IMM mode. The first step is to predict the state of a target when the dynamic state is modeled by mode j [14]:
Measurement gating is a pre-process of data association that reduces the number of candidate measurements. Data association is the process of associating multiple measurements with established tracks [12]. Figure 2 illustrates an example of the gating process. Each ellipse shows the validation region of a target. Tracks 1 and 2 are initialized at frames 1 and 2, respectively. They intersect each other between frames 5 and 6. When no measurement exists inside the validation region of track 1 at frame 4, the state estimate is the same as the state prediction for track 1. At frame 5, two measurements are detected, but only the one inside the validation region is associated with track 1. Two measurements exist in the validation region of track 2, where the nearest neighbor association rule is applied to assign one measurement to the prediction of track 2.

Fig. 2. Illustration of gating process and data association.
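The gating and nearest neighbor association illustrated in Fig. 2 can be sketched as below. This is a minimal version for a single track and a single filter mode. It assumes that "nearest" is measured by the normalized (Mahalanobis) squared distance of the measurement residual, which is also the statistic compared with the gate threshold γ. The function name `gate_and_associate` and its arguments are hypothetical.

```python
import numpy as np


def gate_and_associate(z_pred, S, measurements, gamma=16.0):
    """Return the index of the nearest measurement inside the gate, or None.

    z_pred       : predicted measurement, shape (2,)
    S            : residual (innovation) covariance, shape (2, 2)
    measurements : iterable of candidate measurements, each of shape (2,)
    gamma        : gate threshold on the normalized squared distance
    """
    S_inv = np.linalg.inv(S)
    best_index, best_d2 = None, None
    for i, z in enumerate(measurements):
        nu = np.asarray(z, dtype=float) - z_pred   # measurement residual
        d2 = float(nu @ S_inv @ nu)                # normalized squared distance
        if d2 > gamma:
            continue                               # outside the validation region
        if best_d2 is None or d2 < best_d2:
            best_index, best_d2 = i, d2
    return best_index
```

When the function returns `None`, no measurement falls inside the validation region, and the state prediction would be carried forward as the estimate, as described for track 1 at frame 4.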
Let Z(k) be a set of m_k measurement vectors at frame k:
The measurement gating is performed by chi-square hypothesis testing, assuming Gaussian measurement residuals. Thus, a set of valid measurements for target t and mode j at frame k is obtained as
The state estimate and the covariance matrix of targets are updated as
If
The mode probability is updated as
Finally, the state vector and the covariance matrix of targets are updated as
The procedures from Equation (9) to Equation (28) are repeated until the track is terminated. A track is terminated when no validated measurement is detected for several frames.
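One cycle of the recursion above (mixing, mode-matched Kalman filtering, mode probability update, and combination) can be sketched for a single track as follows. This follows the standard IMM formulation rather than reproducing the paper's equations term by term. The function name `imm_step`, the argument layout, and the mode transition matrix `trans` are assumptions for illustration. The measurement `z` is expected to be the one already selected by gating and association, or `None` when no measurement is validated.

```python
import numpy as np


def imm_step(xs, Ps, mu, trans, F, Qs, H, R, z):
    """Run one IMM cycle and return per-mode and combined estimates.

    xs, Ps : lists of per-mode state vectors and covariances at frame k - 1
    mu     : mode probabilities at frame k - 1, shape (r,)
    trans  : trans[i, j] = probability of switching from mode i to mode j (assumed)
    Qs     : list of per-mode process noise covariances
    z      : associated measurement at frame k, or None
    """
    r = len(xs)
    # 1) Mixing: initial conditions for each mode-matched filter.
    c = trans.T @ mu                            # predicted mode probabilities
    mix = trans * mu[:, None] / c[None, :]      # mix[i, j]: weight of mode i for filter j
    x0, P0 = [], []
    for j in range(r):
        xj = sum(mix[i, j] * xs[i] for i in range(r))
        Pj = sum(mix[i, j] * (Ps[i] + np.outer(xs[i] - xj, xs[i] - xj))
                 for i in range(r))
        x0.append(xj)
        P0.append(Pj)
    # 2) Kalman prediction and update for each mode.
    xs_new, Ps_new, like = [], [], np.ones(r)
    for j in range(r):
        xp = F @ x0[j]
        Pp = F @ P0[j] @ F.T + Qs[j]
        if z is None:                           # no validated measurement: keep prediction
            xs_new.append(xp)
            Ps_new.append(Pp)
            continue
        S = H @ Pp @ H.T + R
        nu = z - H @ xp
        K = Pp @ H.T @ np.linalg.inv(S)
        xs_new.append(xp + K @ nu)
        Ps_new.append(Pp - K @ S @ K.T)
        # Gaussian likelihood of the residual under mode j.
        like[j] = (np.exp(-0.5 * nu @ np.linalg.solve(S, nu))
                   / np.sqrt(np.linalg.det(2.0 * np.pi * S)))
    # 3) Mode probability update.
    mu_new = like * c
    mu_new /= mu_new.sum()
    # 4) Combination of the mode-conditioned estimates.
    x = sum(mu_new[j] * xs_new[j] for j in range(r))
    P = sum(mu_new[j] * (Ps_new[j] + np.outer(xs_new[j] - x, xs_new[j] - x))
            for j in range(r))
    return xs_new, Ps_new, mu_new, x, P
```

Calling `imm_step` once per frame with the outputs of the previous call reproduces the repetition described above. Track termination after several frames without a validated measurement is left to the caller.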
Scenario description
A drone (Phantom 4 Advanced) captures seven moving vehicles at 48 frames per second (fps). A total of 590 frames are captured over about 12 seconds. The size of one frame is 4096×2160 pixels; the frame is reduced to 20% of its size, 820×432 pixels, for efficient image processing. The drone hovers (stays still in the air) while capturing video sequences at a height of 100 meters above the urban road. One pixel corresponds to 0.168 meters. The interval k_d in Equation (1) is set at 10; thus, the detection process is applied from the 11th frame.
Figure 3(a) shows Cars 1–5 at the 11th frame, Fig. 3(b) shows Cars 2–7 at the 311th frame, and Fig. 3(c) shows Cars 2–4 and Car 7 at the 590th frame. In the figures, the red circles indicate the moving vehicles. The first car (Car 1) is located at the top middle of the scene from the 11th frame and moves to the right until it disappears after the 121st frame. The second car (Car 2) moves around a curve to the left from the 144th frame to the 590th frame. The third car (Car 3) moves up diagonally to the right from the 11th frame to the 461st frame. The fourth car (Car 4) moves down from the 11th frame to the 209th frame. The fifth car (Car 5) moves down diagonally to the right from the 11th frame to the 356th frame. The sixth car (Car 6) moves to the right from the 103rd frame to the 552nd frame. The seventh car (Car 7) moves in the same direction as the sixth car from the 208th frame to the 590th frame. When Cars 3 and 4 move very slowly, they are not considered targets of interest; the speed of Car 3 is below 2.28 m/s (8.2 km/h) after the 461st frame, and the speed of Car 4 is below 0.8 m/s (2.9 km/h) after the 209th frame.
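The speeds quoted above follow from the image scale and the frame rate. The sketch below shows the unit conversions only, assuming that the scale of 0.168 meters per pixel refers to the reduced frame; the function names are hypothetical.

```python
METERS_PER_PIXEL = 0.168   # image scale reported for this scenario
FPS = 48.0                 # frame rate of the captured video


def speed_from_pixels(displacement_px, frames_elapsed,
                      meters_per_pixel=METERS_PER_PIXEL, fps=FPS):
    """Convert a pixel displacement over a number of frames into a speed in m/s."""
    meters = displacement_px * meters_per_pixel
    seconds = frames_elapsed / fps
    return meters / seconds


def mps_to_kmh(speed_mps):
    """Convert meters per second into kilometers per hour."""
    return speed_mps * 3.6
```

For instance, `mps_to_kmh(2.28)` is about 8.2 km/h and `mps_to_kmh(0.8)` is about 2.9 km/h, the two thresholds stated for Cars 3 and 4.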

Fig. 3. (a) Cars 1–5 at the 11th frame, (b) Cars 2–7 at the 311th frame, (c) Cars 2–4 and Car 7 at the 590th frame.
A total of 580 frames are processed to detect moving objects; θ_T in Equation (1) is set at 30; E in Equation (2) is set at [1]_{2×2} and [1]_{16×16} for the two erosion processes, and D in Equation (3) is set at [1]_{18×18} for the dilation process; θ_s and θ_f in Equation (4) are set at 200 and 2000 pixels, respectively.
The first row in Fig. 4 shows the detection process for Fig. 3(a). The second and third rows correspond to Fig. 3(b) and (c), respectively. Figure 4(a) shows the frame difference after thresholding, which generates a binary image. Figure 4(b) shows the morphological filtering result; the centroids of the segmented areas are marked by ‘x’. Assuming that the size of the moving cars is known, Equation (4) is applied to Fig. 4(b) to produce Fig. 4(c). Figure 4(d) shows the segmented regions as ROI windows representing moving vehicles.

Fig. 4. (a) Frame difference and thresholding, (b) morphological filtering and centroids, (c) size limitation, (d) segmented regions by ROI windows.
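The morphological filtering and size limitation steps can be sketched with NumPy alone, as below. The sketch assumes that each structuring element is an all-ones square, that the ROI size is the pixel area of a 4-connected region, and that the size limits are inclusive. The handling of even-sized elements (where the origin is ambiguous) is also a choice made here. All function names are hypothetical.

```python
from collections import deque

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _windows(binary, k, before):
    """View every k x k neighborhood of a boolean image, zero-padded at the border."""
    after = k - 1 - before
    padded = np.pad(binary, ((before, after), (before, after)), constant_values=False)
    return sliding_window_view(padded, (k, k))


def erode(binary, k):
    """Binary erosion with a k x k all-ones structuring element."""
    return _windows(binary, k, k // 2).all(axis=(2, 3))


def dilate(binary, k):
    """Binary dilation with the reflected k x k element, so even sizes do not shift."""
    return _windows(binary, k, k - 1 - k // 2).any(axis=(2, 3))


def regions(binary):
    """Yield the pixel list of each 4-connected foreground region (breadth-first search)."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    for r0, c0 in zip(*np.nonzero(binary)):
        if seen[r0, c0]:
            continue
        seen[r0, c0] = True
        queue, pixels = deque([(r0, c0)]), []
        while queue:
            r, c = queue.popleft()
            pixels.append((r, c))
            for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= rr < h and 0 <= cc < w and binary[rr, cc] and not seen[rr, cc]:
                    seen[rr, cc] = True
                    queue.append((rr, cc))
        yield pixels


def roi_centroids(binary, min_area=200, max_area=2000):
    """Return (x, y) centroids of regions whose area lies within the size limits."""
    centroids = []
    for pixels in regions(binary):
        if min_area <= len(pixels) <= max_area:
            rows, cols = zip(*pixels)
            centroids.append((float(np.mean(cols)), float(np.mean(rows))))
    return centroids
```

With the parameters reported above, a binary difference image `mask` would be processed as `roi_centroids(dilate(erode(erode(mask, 2), 16), 18))`. A library routine such as those in OpenCV or SciPy would normally replace the hand-written morphology for full-size frames.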
Table 1 shows the detection rate of the seven cars; the average detection rate is 97.51%.
Table 1. Detection rate
Figure 5 shows all the measurements of the seven cars for 580 frames, including false alarms. The sampling time Δ in Equation (6) is 0.021 sec since the frame rate is 48 fps. The standard deviations of the process noise and the measurement noise are set at σ_x1 = σ_y1 = 30 m/s², σ_x2 = σ_y2 = 60 m/s², and r_x = r_y = 1.5 m, respectively; γ in Equation (17) is set at 16. The track is initialized by a two-point initialization method with speed gating, which limits the maximum speed of a target to 30 m/s [11]. Figure 6(a)-(g) shows the tracking results of Cars 1–7, respectively.
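Two-point initialization with speed gating can be sketched as follows. The sketch assumes the state ordering [x, vx, y, vy] and uses the usual two-point differencing covariance built from the measurement noise standard deviation; neither detail is stated in this section, so both are assumptions, and the function name `two_point_init` is hypothetical.

```python
import numpy as np


def two_point_init(z_prev, z_curr, dt, r_std=1.5, v_max=30.0):
    """Initialize a track from two consecutive measurements, or return None.

    z_prev, z_curr : positions (x, y) in meters at two successive sampling times
    dt             : sampling time in seconds
    r_std          : measurement noise standard deviation in meters
    v_max          : speed gate in m/s; faster pairs are rejected
    """
    z_prev = np.asarray(z_prev, dtype=float)
    z_curr = np.asarray(z_curr, dtype=float)
    velocity = (z_curr - z_prev) / dt
    if np.linalg.norm(velocity) > v_max:
        return None                      # implied speed exceeds the gate
    x0 = np.array([z_curr[0], velocity[0], z_curr[1], velocity[1]])
    r2 = r_std ** 2
    # Covariance of a position and a differenced velocity along one axis.
    p_axis = np.array([[r2, r2 / dt],
                       [r2 / dt, 2.0 * r2 / dt ** 2]])
    P0 = np.kron(np.eye(2), p_axis)
    return x0, P0
```

At Δ = 1/48 sec, the 30 m/s gate corresponds to a displacement of 0.625 m between the two measurements, so pairs that are farther apart do not start a track.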

Fig. 5. Measurements including false alarms for 580 frames.

Fig. 6. Tracking result when Δ = 0.021 sec, (a) Car 1, (b) Car 2, (c) Car 3, (d) Car 4, (e) Car 5, (f) Car 6, (g) Car 7.
The sampling time is set differently in the following experiments. Figure 7 shows the tracking result with Δ = 0.063 sec; that is, measurements are acquired every third frame from the original data. Cars 1–7 are initialized by the first and the fourth frames. Figure 8 shows the tracking results when every sixth frame is used (Δ = 0.125 sec). Figure 9 shows the tracking results when every eighth frame is used (Δ = 0.167 sec).
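The sampling times above are the frame step divided by the frame rate. A minimal sketch of this relation and of the corresponding subsampling of per-frame measurements is given below; the helper names are hypothetical.

```python
FPS = 48  # frame rate of the original video


def sampling_time(step, fps=FPS):
    """Sampling time in seconds when every `step`-th frame is used."""
    return step / fps


def subsample(per_frame_measurements, step):
    """Keep the measurements of every `step`-th frame, starting from the first."""
    return per_frame_measurements[::step]
```

Steps of 1, 3, 6, and 8 give 1/48, 3/48, 6/48, and 8/48 sec, which are reported as 0.021, 0.063, 0.125, and 0.167 sec. With a step of 3, the first two retained frames are the first and the fourth, which is why tracks are initialized from those frames.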

Fig. 7. Tracking result when Δ = 0.063 sec, (a) Car 1, (b) Car 2, (c) Car 3, (d) Car 4, (e) Car 5, (f) Car 6, (g) Car 7.

Fig. 8. Tracking result when Δ = 0.125 sec, (a) Car 1, (b) Car 2, (c) Car 3, (d) Car 4, (e) Car 5, (f) Car 6, (g) Car 7.
Figure 10 shows the ground truth of the positions of Cars 1–7. Table 2 shows the root mean-squared error (RMSE) of the position. The large RMSEs of Cars 1, 6, and 7 are mostly caused by false alarms. The RMSE is similar for sampling steps of 1, 3, and 6, but becomes larger when the sampling step is 8.
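A position RMSE of this kind can be computed as sketched below. The sketch assumes that the error at each frame is the Euclidean distance between the estimated and ground-truth positions and that both are already expressed in meters; the exact definition used for Table 2 is not given in this section, and the function name `position_rmse` is hypothetical.

```python
import numpy as np


def position_rmse(estimates, truth):
    """Root mean-squared Euclidean position error over all frames of one track.

    estimates, truth : arrays of shape (num_frames, 2) holding (x, y) in meters
    """
    errors = np.asarray(estimates, dtype=float) - np.asarray(truth, dtype=float)
    squared_distance = np.sum(errors ** 2, axis=1)   # squared error per frame
    return float(np.sqrt(np.mean(squared_distance)))
```

Because the errors are squared before averaging, a few frames in which a false alarm is associated with the track can dominate the result.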

Fig. 9. Tracking result when Δ = 0.167 sec, (a) Car 1, (b) Car 2, (c) Car 3, (d) Car 4, (e) Car 5, (f) Car 6, (g) Car 7.

Fig. 10. Ground truth, (a) Car 1, (b) Car 2, (c) Car 3, (d) Car 4, (e) Car 5, (f) Car 6, (g) Car 7.
Table 2. Root mean-squared error (RMSE) in meters
In this paper, multiple moving objects are captured from a long distance by a drone. The locations of objects are detected using frame difference. Seven targets are tracked with IMM filtering. Experimental results show that the proposed algorithm detects and tracks moving objects with good accuracy. More complicated scenarios where targets are closely located may require more advanced data association techniques, which remain for future study.
Acknowledgments
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grant Number: 2017R1D1A3B03031668).
