Abstract
According to the National Crime Records Bureau, 63,407 children went missing in India in 2016, which amounts to almost 174 children every day, and only 50% of them are ever found again. This calls for an efficient solution to trace missing children. The proposed solution uses face recognition to provide machine assistance during such searches and can serve as the basis for applications that use CCTV footage across a camera network to identify a lost person. We use One-Shot learning for face recognition to identify stranded people in places such as mass gatherings. The same technology can be used to identify criminals across a city. The paper also addresses tracking people across a network of multiple non-overlapping cameras, with the added feature of shifting the target to a vehicle if the person gets into one. The experiments are performed using mobile cameras, and the approach can help in monitoring the actions of criminals and finding their hideouts.
Introduction
With a current world population of more than 7.8 billion people, identifying specific people in challenging environments such as mass gatherings, political events and large shopping malls is an arduous task [1]. Crowded environments, especially in large cities, are particularly prone to such challenges. The anonymity of the city permits people to commit crimes with little fear of being recognized. A very common and disturbing problem, both in the past and today, is identifying and locating missing people, of whom missing children are the most common case.
An estimated 8 million children go missing every year worldwide [2]. A number of surveillance systems are employed to identify and track missing persons, but they still fail to give accurate results in many respects. Providing accurate results remains a challenge because occlusion, pose variation and similar factors must all be handled when developing a surveillance system. Reliable tracking with ordinary cameras is often difficult, mostly because of the occlusions that occur when many people are involved [3] and because application-specific information is hard to obtain. Over the years, with the advent of Deep Learning (DL) and its immense practicability, DL has found applications in various AI domains such as computer vision, machine learning, natural language processing and robotics [4]. Also, since the viewing angle of a single camera is limited, it is preferable to employ multiple cameras to cover the entire region of interest [5]. The goal of our proposed system is the identification, location and tracking of missing people in crowded areas using multiple cameras. A recorded set of videos is available from each camera for the system to process. Human supervision of these multiple videos is very tedious, and multiple overlapping views on different cameras further complicate tracking [6, 33]. Some of the best-recognized tracking algorithms in computer vision are TLD [7], CT [8] and KCF [9]. In this paper, we propose a system that is capable of identifying and tracking one or more persons. Section 2 reviews the related work. Section 3 discusses the proposed method for identifying the target person(s) and for long-term tracking of a person in motion, taking into account the different angles and postures a person can take. Section 4 presents the results obtained using the proposed method, followed by the conclusions, a summary and future work. The last section lists the references cited in this paper.
Literature review
The main concern with deep learning is that it requires training on a very large dataset to ensure good accuracy [29, 31], and training on such datasets takes an enormous amount of time. An alternative is One-Shot Learning [10], in which a model is trained on fewer images and can be used on new images without being retrained. One-Shot learning thus aims to learn object categories from only a few training images. Facenet is a deep learning architecture that uses deep convolutional networks to learn a direct mapping from face images to a Euclidean space in which distances correspond to a measure of face similarity [11]. This method achieves an accuracy of 99.63% on the widely used Labelled Faces in the Wild dataset and 95.12% on the YouTube Faces DB [11]. Some recognized state-of-the-art face recognition methods such as VGG, KDA and DCNN are briefly explained in [34]. Facenet, however, does not explicitly use any alignment technique; any alignment is handled internally by the network. It also does not use any intermediate CNN layers or additional processing. It is trained with triplet loss, which reduces the L2 distance between faces of the same identity and enforces a margin between faces of different identities. Face detection using the Multi-Task Cascaded Neural Network (MTCNN) obtains highly accurate results by performing face detection and alignment jointly in a multi-task training fashion, as explained in [12]. However, detection alone cannot deliver the desired results, because it tends to lose track of faces when the person is in motion or when the face is subject to different angles, occlusion, illumination changes, etc. Some tracking methods are explained in [36]. One of the main problems faced by tracking algorithms over the years is data association across frames, i.e. recovering a person in later frames after the track has been lost in an earlier one, and several papers have focused on this re-identification (re-ID) of lost targets. One such implementation is shown in [13]. Another major challenge is the trade-off between accuracy and speed. YOLOv3, discussed in [14, 30], is one of the fastest object detection algorithms while being as accurate as the best detectors such as the Single Shot Detector (SSD).
We focus on the person re-identification (ReID) task across consecutive frames as well as across multiple cameras using existing methods, and provide a solution to keep tracking the target(s) when he or she tries to escape in a vehicle. To achieve this, we use deep learning frameworks that perform well in partially or fully occluded scenes, function over the long term and operate in real time.
Some of the state-of-the-art reID methods are discussed in [35].
Methodology
This section gives a detailed description of the proposed technique and the algorithms used. The work in [15] performs person reID in a single camera using triplet loss as the metric learning step and ResNet-50 as the feature extractor. We have extended this work to multiple cameras; our experiment employs exactly 5 cameras. We divide the task into three scenarios: searching for a person across the 5 cameras, person reID and tracking when the cameras are installed within a building, and the scenario in which the cameras are installed in a parking lot.
Figure 3.1 is a flowchart of the proposed method. Photos of the missing person are collected and stored in a dataset, which is then used to train a one-shot model; for this we start from a partially trained model.

Figure 3.1: Block diagram for searching a person.
The required person can be identified with the help of Facenet and, if found, the corresponding camera location is sent to the user of interest. If the target person is not found, tracking is performed based on when and where the person was last seen, and a route map is plotted from the target's movements. The flowchart also covers the multi-camera scenario: when the target is not found in one camera, the search continues across the neighbouring cameras.
Figure 3.2 describes how a person is identified from the last seen location and tracked within a single camera as well as in the multi-camera scenario. A detailed explanation is given later in this section.

Figure 3.2: Block diagram for re-identification (other than parking lot camera).
Figure 3.3 depicts the scenario in which the person enters a parking lot, for example at a shopping mall or an exhibition.

Figure 3.3: Block diagram for tracking car.
MTCNN consists of three stages, namely P-Net, R-Net and O-Net. The input image is first resized and passed to P-Net, which outputs a number of bounding boxes, containing both correct and false predictions, together with their confidence scores. Redundant boxes are eliminated using Non-Maximum Suppression (NMS). The bounding box coordinates are saved and passed to the R-Net stage. Here, the pixel values of every bounding box are stored in an array. If a bounding box extends beyond the image bounds, only the part inside the image is copied to the array and the rest is padded with zeros. Further resizing is performed and NMS is applied again to remove the remaining redundant bounding boxes. At the final O-Net stage, the image is resized to its original scale and we obtain the bounding box coordinates, the facial landmarks and the confidence for each detected face. In this way, all the faces present in a frame, from large to small, are detected. This algorithm is used in [12].
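The NMS step used between the stages can be illustrated with a minimal sketch. The box format `(x1, y1, x2, y2)`, the function names and the IoU threshold of 0.5 are assumptions made for this example; they are not taken from the MTCNN implementation used in the experiments.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that are kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical candidates: two near-duplicate boxes and one separate box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.90, 0.75, 0.80]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.82 and is therefore suppressed, while the separate third box is retained.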
Facenet and one-shot learning
Unlike other deep learning techniques that use large training datasets, as in [11–17], we train on a relatively small dataset using the One-shot learning technique.
We use an existing, partially pre-trained model. We manually add our faces to this pre-trained model, and the complete model is then used at training time. This technique uses features learned from previous classes, and these features are compared to identify the new class (here, the faces of new people). This is briefly explained in [18].
Facenet uses a one-shot learning technique for face recognition and identification. It is a 22-layer deep Convolutional Neural Network (CNN). Given an input batch, it outputs the face embeddings, each a 128-D representation. These features can be treated as vectors and hence represented as points in a Cartesian coordinate system. At the start of training, the network places similar and different face images essentially at random; it then learns gradually by randomly choosing an image as the anchor, a positive image of the same identity as the anchor, and a negative image of a different identity. Similarity or dissimilarity is learned by computing the L2 distance between the extracted facial features.
In this way, the network parameters are adjusted until convergence, and the system learns to cluster images of the same face at small distances and images of different identities at larger distances. This clustering is driven by an important loss function known as triplet loss.
Tracking
One of the key challenges in tracking is detecting an object when it reappears in later frames and re-tracking objects that have been lost. Tracking in video streams that are processed at frame rate and run for very long periods of time is called long-term tracking, and it is evident that neither tracking nor detection alone can solve the challenge of long-term tracking [19]. In this section we explain the tracking method employed and how a person can be searched for across multiple cameras with non-overlapping fields of view (FOV) without losing track of the target.
YOLOV3 and Deep SORT
After the person has been identified, the target must be tracked in order to obtain results in real time. For this, we employ Deep SORT.
Simple Online and Real-time Tracking (SORT) is an approach for tracking multiple objects. Tracking by detection has become the state-of-the-art paradigm in multi-object tracking. SORT performs Kalman filtering in image space and data association using the Hungarian method, with an association metric that measures the overlap of bounding boxes [20]. Compared with Deep SORT, SORT has a relatively high number of identity switches and is deficient at tracking through occlusions, which occur frequently. Deep SORT addresses this by integrating a CNN into SORT, using an association metric that combines appearance and motion information. The detections for the tracker are obtained using the YOLOv3 detector. YOLO is a fully convolutional network (FCN). We obtain the detections using a pre-trained Darknet model as the feature extractor. Originally a 53-layer network, it has an additional 53 layers stacked on it to obtain detections at three different scales.
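The overlap-based association performed by SORT can be sketched as follows. This is an illustration only: an exhaustive search over assignments stands in for the Hungarian method (both return an optimal assignment, but exhaustive search is practical only for a handful of boxes), and the names `associate` and `min_iou` as well as the value 0.3 are assumptions made for the example.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.3):
    """Match predicted track boxes to detections by maximising total IoU.

    Returns a list of (track_index, detection_index) pairs.
    """
    # None slots allow a track to stay unmatched.
    slots = list(range(len(detections))) + [None] * len(tracks)
    best_total, best_pairs = -1.0, []
    for perm in permutations(slots, len(tracks)):
        pairs = [(t, d) for t, d in enumerate(perm) if d is not None]
        total = sum(iou(tracks[t], detections[d]) for t, d in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    # Reject assignments whose overlap is too small to be credible.
    return [(t, d) for t, d in best_pairs
            if iou(tracks[t], detections[d]) >= min_iou]


# Hypothetical predicted track boxes and new detections.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11), (200, 200, 210, 210)]
matches = associate(tracks, detections)
```

In this example each track is matched to the detection that has shifted by one pixel, and the distant third detection stays unmatched, so it would start a new track.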
In Deep SORT, the tracking scenario is defined on an eight-dimensional state space containing the centre of the bounding box (u, v), its aspect ratio, its height, and their respective velocities in image coordinates. For every track, we count the number of frames since the last successful measurement association. This counter is incremented during Kalman filter prediction and reset to zero whenever an association succeeds. Tracks that exceed a predefined maximum age are considered to have left the scene and are deleted from the track list. Associating the predicted Kalman states with newly arrived measurements is an assignment problem, which is solved using the Hungarian algorithm. Into this formulation we integrate motion and appearance information through a combination of two appropriate metrics. Motion information is incorporated using the Mahalanobis distance between the predicted Kalman state and the newly arrived measurement [21], as shown in (3.1).
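For reference, the motion metric can be written as below. This is a sketch that assumes (3.1) follows the Deep SORT formulation; the symbols $d_j$, $y_i$, $S_i$ and the gate $t^{(1)}$ are taken from that formulation and are not defined elsewhere in this paper.

```latex
\begin{align}
d^{(1)}(i, j) &= (d_j - y_i)^{\mathsf{T}} \, S_i^{-1} \, (d_j - y_i) \\
b^{(1)}_{i,j} &= \mathbb{1}\left[ d^{(1)}(i, j) \le t^{(1)} \right]
\end{align}
```

Here $y_i$ and $S_i$ are the mean and covariance of the $i$-th track's predicted state projected into measurement space, and $d_j$ is the $j$-th detected bounding box. An association is admissible only if $b^{(1)}_{i,j} = 1$; for the four-dimensional measurement space, the 95% quantile of the chi-square distribution gives $t^{(1)} = 9.4877$.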
The contribution of deep learning to person reID has risen immensely, as seen in [22–24]. Person reID is considered a challenge in tracking, where proper data association is required to correctly track the person of interest. We perform person reID in two steps: feature extraction and deep metric learning.
Here, Resnet_v1_50 is used as the feature extractor and creates an embedding vector for each person; its last 2 layers are replaced with normalization to make the embedding 128-dimensional. For the identified target person, a human signature is generated following the procedure in [25]. The weights are taken from [26]. The human signature of the person to be tracked is compared with that of every person in the subsequent frames. We use triplet loss for deep metric learning, and a threshold is applied when assigning an ID: the person with the least distance who also satisfies the declared threshold is assigned the ID of the target.
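The ID assignment step can be sketched as a nearest-neighbour search over embeddings. The sketch assumes that the threshold is an upper bound on the L2 distance; the function names, the 3-D toy embeddings and the value 0.6 are hypothetical and stand in for the 128-D signatures and the threshold used in the experiments.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def assign_target_id(signature, candidates, max_distance=0.6):
    """Return the index of the candidate closest to the target signature.

    Returns None when even the closest candidate is farther away than
    max_distance, i.e. the target is not present in this frame.
    """
    best_index, best_dist = None, float("inf")
    for idx, embedding in enumerate(candidates):
        dist = l2_distance(signature, embedding)
        if dist < best_dist:
            best_index, best_dist = idx, dist
    return best_index if best_dist <= max_distance else None


# Toy 3-D embeddings standing in for 128-D human signatures.
target = [0.0, 1.0, 0.0]
frame_with_target = [[1.0, 0.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]
frame_without_target = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

match = assign_target_id(target, frame_with_target)
no_match = assign_target_id(target, frame_without_target)
```

In the first frame the second person is closest to the target signature and receives the target ID; in the second frame nobody is close enough, so no ID is assigned.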
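In its standard form [11], the triplet loss can be formulated as:
\[ L = \sum_{i=1}^{N} \left[ \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha \right]_+ \]
where $f(x)$ is the embedding of image $x$, $x_i^a$ is the anchor, $x_i^p$ is a positive image of the same identity as the anchor, $x_i^n$ is a negative image of a different identity, $\alpha$ is the margin enforced between positive and negative pairs, $N$ is the number of triplets, and $[z]_+ = \max(z, 0)$.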
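A minimal sketch of the triplet loss for a single triplet is given below. It uses squared L2 distances and a margin of 0.2; both the margin value and the 2-D toy embeddings are assumptions made for illustration and are not the settings used in the experiments.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin."""
    return max(squared_l2(anchor, positive)
               - squared_l2(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.1, 0.0]

# Easy negative: already far from the anchor, so the loss is zero.
easy = triplet_loss(anchor, positive, negative=[1.0, 0.0])

# Hard negative: almost as close as the positive, so the loss is positive.
hard = triplet_loss(anchor, positive, negative=[0.2, 0.0])
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate the margin contribute to training.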
This section briefly describes how the system was implemented, along with its software and hardware requirements.
The frames are captured using mobile cameras and transmitted wirelessly to the server with the help of an application called IP Webcam. This is an open-source Android application that turns a smartphone camera into an IP camera; the phone must be connected to the same wireless network as the server.
The smartphones used were a Samsung A50 (25 MP camera) and a Samsung M20 (13 MP camera), both recording at a resolution of 1920×1080 at 30 frames per second.
Processing happens at the server. The captured frames are read one after another using OpenCV [27], and processing is performed using TensorFlow [28]. The computations were run on an NVIDIA GeForce GPU. As mentioned in the methodology, the processing consists of the following steps: creating the dataset from photos of the target with the help of MTCNN; training and creating a classifier using Facenet; searching across neighbouring cameras; and identifying the person's vehicle.
Dataset creation using MTCNN
Individual photos of the target(s) are collected and stored in folders, each labelled with the respective name. The higher the number of photos, the greater the accuracy. For our experiment we used 30 photos for a single identity. The face of the target is extracted from each picture in the respective folder by passing it through the MTCNN model. Once a face is extracted, it is aligned so that the face is centred in the image, which eases the subsequent face recognition step.
Training using facenet
Once the dataset has been created, a customized classifier is generated by producing embeddings from the faces of each target in the dataset. A half-trained face model is built using Facenet. This half-trained model helps in identifying a face and in creating its embedding. The embeddings are used for comparison in the One-Shot Siamese network.
Training a face recognition model
Photos of the persons to be recognized are collected into a database and processed with the MTCNN algorithm. The training images are augmented by rotating them horizontally and aligned to obtain a face-centred image. We have also applied a pre-processing technique called globing to remove blurring from the images. The face recognition model is then trained on these images. We used 10 training images for each individual, and from these images an embedding is formed using triplet loss. In triplet loss, training is performed using an anchor image, a positive image and a negative image [15]. The initial part of the model is a pre-trained Facenet model, trained to identify a human face, and the final part is trained to produce an embedding for the person's face. Running the MTCNN algorithm requires a TensorFlow backend, which was installed on the laptop.
Once the coordinates of the detected faces are obtained, OpenCV is used to draw rectangular boxes around the faces.
Searching across neighbouring cameras
The cameras are numbered 1, 2, 3, 4 and 5; all of them have the IP Webcam application installed and are connected to the same network. The neighbouring cameras of each camera are designated during installation. Figure 4.1 is a top-view representation of the proposed camera network. The feeds from the cameras are read continuously, and when a person leaves through an exit/entry point of a camera, only the neighbouring cameras are assigned the task of searching. This reduces the search time drastically. Finally, a route map of the person's movements across the cameras is produced for the specified area.
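The neighbour-restricted search can be sketched with an adjacency table. The camera layout in `NEIGHBOURS`, the function name `trace_route` and the `sightings` input are hypothetical: `sightings` stands in for the per-time-step output of the face recognition and reID modules, and the layout is not the one used in the experiments.

```python
# Hypothetical adjacency of the five cameras, fixed at installation time.
NEIGHBOURS = {1: [2], 2: [1, 3, 4], 3: [2, 5], 4: [2, 5], 5: [3, 4]}


def trace_route(start_camera, sightings, neighbours):
    """Follow the target through the network and return the route taken.

    sightings is a list with one entry per time step; each entry is the
    set of cameras in which the target is currently recognised.
    """
    route = [start_camera]
    current = start_camera
    for seen_in in sightings:
        if current in seen_in:
            continue  # target is still in view of the current camera
        # Target left the view: search only the neighbouring cameras.
        hits = [cam for cam in neighbours[current] if cam in seen_in]
        if hits:
            current = hits[0]
            route.append(current)
    return route


sightings = [{1}, {2}, {2}, {4}, {5}]
route = trace_route(1, sightings, NEIGHBOURS)
```

Because only the neighbours of the current camera are queried at each hand-over, the number of feeds to be searched stays small regardless of the size of the network.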

Figure 4.1: Path identified using the algorithm.
When the target enters the parking lot, the respective camera begins tracking. The camera also detects the vehicles in the scene using YOLOv3. As the target moves towards their vehicle, the algorithm measures the Euclidean distance between the target and all surrounding vehicles and identifies the nearest one. It also checks whether the target is still outside: if a vehicle is near but the target is still visible, tracking is not shifted to the vehicle. Tracking is shifted from the person to the vehicle only when two conditions are met: (1) the vehicle is the one nearest to the target, and (2) the person is no longer visible. In this way we not only track the person but also keep the track when the target tries to escape in a vehicle.
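The hand-over rule can be sketched as follows. Distances are measured between bounding-box centres, which is an assumption, as are the box format `(x1, y1, x2, y2)`, the function names and the vehicle labels.

```python
import math


def centre(box):
    """Centre point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def nearest_vehicle(person_box, vehicle_boxes):
    """Return (vehicle_id, distance) for the vehicle closest to the person."""
    px, py = centre(person_box)
    distances = []
    for vehicle_id, box in vehicle_boxes.items():
        vx, vy = centre(box)
        distances.append((vehicle_id, math.hypot(px - vx, py - vy)))
    return min(distances, key=lambda item: item[1])


def update_target(last_person_box, person_visible, vehicle_boxes):
    """Decide whether to keep tracking the person or shift to a vehicle."""
    if person_visible or not vehicle_boxes:
        return ("person", None)  # condition 2 not met: keep the person
    vehicle_id, _ = nearest_vehicle(last_person_box, vehicle_boxes)
    return ("vehicle", vehicle_id)  # person vanished next to this vehicle


# Hypothetical detections in one parking-lot frame.
vehicles = {"car_1": (100, 100, 200, 160), "car_2": (400, 100, 500, 160)}
person = (210, 90, 240, 170)

still_walking = update_target(person, True, vehicles)
entered_car = update_target(person, False, vehicles)
```

While the person remains visible the track stays on the person even though `car_1` is close; once the person disappears, the track shifts to `car_1`, the nearest vehicle to the last known position.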
Results
We have categorised our results into three tasks: face recognition, person reID and vehicle tracking.
Face recognition
The results of face detection are shown in Fig. 5.1 and the recognized faces in Fig. 5.2. The Siamese network was trained using 30 images of 3 persons, of whom two are male and one is female. Feature extraction was done using Facenet. The trained Siamese network was then tested on a set of 30 different images of the same persons.

Figure 5.1: MTCNN detected faces.

Figure 5.2: Identified faces.

Figure 5.3: (Left) Person being tracked in camera 1; (Middle) Person being re-identified in camera 2; (Right) Person being re-identified in camera 2.

Figure 5.4: Re-identification on CCTV images.
The accuracy of the model was obtained from a confusion matrix over the 30 test images, using the formula given in (5.1).
This model gives an accuracy of 97%. The threshold for the detection has been set to 80%.
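Computing accuracy from a confusion matrix can be sketched as the ratio of correctly classified images (the diagonal) to all test images. The matrix below is hypothetical and is not the one obtained in the experiment; it merely shows that 29 correct predictions out of 30 would be consistent with the reported figure.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix.

    confusion[i][j] is the number of images of person i predicted as j.
    """
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical 3-person confusion matrix with 10 test images per person.
example = [
    [10, 0, 0],
    [0, 9, 1],
    [0, 0, 10],
]
acc = accuracy(example)
percent = round(acc * 100)
```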
The person reID method has been evaluated on two leading reID datasets, Market-1501 and MARS.
The model obtained a mean average precision (mAP) of 69.14% on Market-1501 and 67.70% on the MARS dataset [15].
However, this method at times assumes that persons wearing the same clothes are the same person and assigns them the same ID. To address this, we have added a confirmation step that checks whether it is indeed the same person, which helps reduce identity switches.
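As a rough illustration of the metric, mAP averages, over all query images, the average precision of the ranked gallery returned for each query. The sketch below uses one common definition of average precision; the official evaluation protocols of the two datasets may differ in detail, and the rankings shown are hypothetical.

```python
def average_precision(ranked_matches):
    """Average precision for one query.

    ranked_matches lists the gallery results in order of increasing
    distance; an entry is True when it shows the query identity.
    """
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(all_rankings):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


# Two hypothetical queries with their ranked gallery results.
rankings = [
    [True, False, True],  # precisions 1/1 and 2/3
    [False, True],        # precision 1/2
]
m_ap = mean_average_precision(rankings)
```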
This experiment has been done in a controlled environment with multiple people and multiple vehicles in the scene. The results for vehicle tracking are shown in Fig. 5.5.

Figure 5.5: Shifting from person to vehicle.
It is observed that each person is given a label based on the count within the person class, and similarly each vehicle is given its respective label. As seen in Fig. 5.5, YOLOv3 performs detection and labelling well, and the detections did not fail at any point, which makes the approach efficient. The proposed method performs well when the person is not occluded in the scene.
We note that combining the algorithm with background subtraction and a license plate extractor would make the system more efficient.
In this paper, we have presented an intelligent surveillance system and used mobile cameras for testing purposes. By employing IP cameras, this system can be set up inside a building or across an area. Using this system, missing children or other persons can be identified by searching across the camera network using a model trained with the Facenet algorithm and one-shot learning. If the person is not found, his or her movements can be tracked from the last seen location. We combine the face recognition and reID methods discussed and add the feature of shifting the track when the target person tries to escape in a vehicle. This system can be extended to camera networks across cities to search for people and track their paths if needed. In future, we aim to improve the accuracy obtained in reID. The system can thus help reduce the number of missing-person cases and help catch criminals.
