Abstract
The accurate tracking of vehicle loads is essential for the condition assessment of bridge structures. In recent years, computer vision methods that combine multiple sources of data from monitoring cameras and weigh-in-motion (WIM) systems have become a promising strategy for bridge vehicle load identification in structural health monitoring (SHM) and have attracted increasing attention. Vehicle re-identification, namely, the identification of the same vehicle in images that were captured at different locations or time instants, is the key topic of this study. A vehicle re-identification method is proposed that is based on HardNet, a deep convolutional neural network (CNN) specialized in extracting local image features. First, the positions of the vehicle point features in the image are obtained through feature detection. Then, HardNet is employed to encode the point feature image patches into deep learning feature descriptors. Re-identification of the target vehicle is achieved by matching the encoded descriptors, which are robust to scaling, rotation, and other types of noise, between two images. A comparison of the proposed method with two published vehicle re-identification methods is performed using vehicle image data from a real bridge, and the superior performance of the proposed method is demonstrated.
Introduction
The primary function of bridge structures is to carry vehicle loads (e.g., heavy trucks and cars); therefore, vehicle loads are the primary loads of bridges. The determination of vehicle loads is one of the research foundations of time-varying condition assessment, remaining life prediction, and ultimate bearing capacity calculation of bridge structures (Chen et al., 2019; Lan et al., 2011). An installed weigh-in-motion (WIM) system can provide vehicle weight information as vehicles pass a measurement site, while a monitoring camera system installed along the bridge records the real-time position information of the vehicles. The integration of these two sources of information is a promising strategy for bridge vehicle load identification in structural health monitoring (SHM) and has attracted increasing attention in recent years. Vehicle re-identification is the key step of this technique. The objective of vehicle re-identification is to quickly recognize and track the position of target vehicles in images that were captured at various time instants or by different cameras by establishing vehicle matching relationships between images. In the SHM community, Chen et al. (2016) were the first to combine monitoring cameras and WIM systems for the estimation of bridge vehicle loads; they employed grayscale features, edge features, and image centroids to identify the same vehicle in images that were captured by different monitoring cameras. However, these features are sensitive to illumination changes; hence, the recognition accuracy for various scenarios cannot be guaranteed. Dan et al. (2019) assumed that all cameras on a bridge are aligned over time and that the fields of view (FOVs) of adjacent cameras overlap, and they employed a target handoff algorithm under the overlapped field of view to perform vehicle re-identification; this approach is only applicable to these special cases. Zhou et al. (2020) presented a method that identifies only nine vehicle types (e.g., sedan car, mini bus, and large truck) on a bridge and sets a representative load value for each type of vehicle, which renders the bridge vehicle load identification relatively inaccurate. Therefore, the exploration of highly accurate vehicle re-identification methods is important for the identification of the vehicle loads of bridges.
Recently, SHM based on computer vision has attracted extensive attention in civil engineering, for example in visual inspection and damage detection (Gao and Mosalam, 2018; Morgenthal and Hallermann, 2014), crack width measurement (Coca et al., 2020; Yang et al., 2019), and dynamic displacement measurement (Dong and Catbas, 2019; Dong et al., 2020). Vehicle re-identification based on computer vision in large-scale traffic monitoring scenes can be regarded as a nearest neighbor search problem for repetitive image retrieval (Mei et al., 2014) and is of substantial significance in intelligent traffic (Zhang et al., 2011), urban computing (Zheng et al., 2014), and road travel time estimation (Kwong et al., 2009). The image-based vehicle re-identification methods can be divided into two main categories: metric learning and feature representation-based methods (Zhu et al., 2019).
Metric learning (Xing et al., 2002), which is also known as distance metric learning or similarity learning, aims at learning a metric space in which the samples of the same class are close to each other and the samples of different classes are far away from each other. It is a commonly applied method in computer vision tasks such as image retrieval (Lee et al., 2008), pedestrian re-identification (Yu et al., 2017), and face recognition (Nguyen and Bai, 2010). Guo et al. (2018) proposed a coarse-to-fine ranking loss for metric learning under which images of the same vehicle are pulled as close together as possible, and which was used to discriminate between images of different vehicles and of vehicles from different vehicle models. To solve the problems of intra-class differences and inter-class similarities, Bai et al. (2018) designed a group-sensitive triplet embedding method that uses an end-to-end approach for metric learning. The method can focus on the details of a vehicle and can achieve satisfactory re-identification performance, but the training of the model is unstable, and the convergence speed is slow.
For the feature representation-based methods, the employed features can be divided into handcrafted features and deep learning features according to the strategy of feature acquisition. Handcrafted features are extracted according to manual design rules; therefore, they are interpretable. Lowe (2004) proposed the scale invariant feature transformation (SIFT) method, which is robust against image rotation, scaling, brightness changes, viewing angle changes, and affine transformation, but it is time consuming. The method performs feature detection in the difference of Gaussian scale-space pyramid to obtain the position and scale of a point feature. Then, the gradient information of the 16 × 16 neighborhood that is centered on the point feature is transformed into a 128-dimensional descriptor for feature matching. Leutenegger et al. (2011) developed the binary robust invariant scalable keypoints (BRISK) method, which can perform as well as SIFT but has faster processing speed. This method employs the adaptive and generic accelerated segment test (AGAST) algorithm to quickly locate point features in the scale-space pyramid. Then, the sampling points are obtained by sampling concentric circles in the point feature neighborhood, the gray values are compared between the sampling points, and finally, a cascaded binary bit string is obtained for describing each point feature. However, the performance of the handcrafted feature-based re-identification method depends on the prior knowledge and parameter adjustment of the handcrafted feature designer. Moreover, the generalization capability of the model is poor. Liu et al. (2016a) proposed the bag of words with scale-invariant feature transform (BOW-SIFT) and bag of words model with color name features (BOW-CN) methods, which are based on texture and color features, respectively. Chen et al. (2009) presented a vehicle retrieval system that is based on vehicle color and multi-instance learning in a complex background, which is robust against car orientation and color changes. Brown (2010) obtained color features by using weighted and quantization interval color histograms for the retrieval of vehicles. Zapletal and Herout (2016) employed the color histogram and the histogram of oriented gradients (HOG) features with linear regression to perform vehicle re-identification.
Deep neural networks, especially convolutional neural networks (CNNs), have injected new vitality into the development of feature representation-based vehicle re-identification methods. The image features that are extracted by deep neural networks such as GoogLeNet, AlexNet, and VGGNet (He et al., 2016; Krizhevsky et al., 2012; Simonyan and Zisserman, 2014) have been introduced into vehicle re-identification in recent years. Liu et al. (2016a) proposed a fusion of attributes and color features (FACT) method, which combines BOW-SIFT texture features, BOW-CN color features, and GoogLeNet deep learning features with multimodal features to analyze image similarity for vehicle re-identification. Subsequently, Liu et al. (2016b) added license plate verification information and spatiotemporal correlation information to the FACT method and proposed a progressive vehicle re-identification (PROVID) approach based on a deep neural network. Tang et al. (2017) proposed a multimodal learning architecture in which LBP and BOW-CN are used directly as the input of the neural network. After learning by a multilayer perceptron and CNN, they are fused to produce a robust and recognizable feature representation for vehicle re-identification. The main drawback of the deep learning features used in the current vehicle re-identification methods is that they are often obtained by training the classification task with the entire vehicle image as input; hence, these methods cannot fully extract the local information, which is important for distinguishing vehicles. In addition, compared to handcrafted features such as SIFT, these deep learning features are more sensitive to image rotation and scaling.
In recent years, the study of deep learning-based local image feature descriptors, such as HardNet (Mishchuk et al., 2017), has provided a powerful tool for feature-based computer vision research. In contrast to the deep learning features mentioned above, HardNet features possess satisfactory scale and rotation invariance properties. These properties are obtained by training the model on image patches instead of entire images, so that the local information of images can be fully extracted. As a CNN-based local image feature descriptor, HardNet has demonstrated excellent feature description capabilities in a wide range of tasks, such as wide baseline stereo, patch verification, and instance retrieval benchmarks (Mishchuk et al., 2017).
In this study, a vehicle re-identification method based on HardNet deep learning descriptors is proposed, which is highly accurate and robust for re-identifying individual vehicles. Compared with the existing methods, it can re-identify individual vehicles rather than only vehicle types, and it imposes fewer restrictions on the camera placement along the bridge. First, it obtains the point feature positions in a vehicle image through a feature detection method. Then, the HardNet network is employed to encode the point feature image patches into robust deep learning feature descriptors. Finally, the matching relationship of point features between two images is established through the descriptors, and the target vehicle is re-identified.
This article is organized as follows. In the section Vehicle Re-identification Method Based on HardNet Features, the proposed vehicle re-identification method, which is based on the HardNet local feature descriptor, is presented. Then, the vehicle image data from a real bridge are used to evaluate the performance of the proposed method, and a comparison study with two vehicle re-identification methods that are based on handcrafted features is performed in the section Validation Test. Finally, the conclusions of this study are presented in the section Conclusions.
Vehicle re-identification method based on HardNet features
Point features are prominent points in an image that do not easily change under influencing factors such as light, affine transformation, and noise. Point features include corner points, edge points, bright spots in dark areas, and dark points in bright areas. When object occlusion occurs or an image has readily observable scale and slight changes, image matching can be still achieved effectively by employing these point features. In the proposed vehicle re-identification method, image matching is the core step, which is used to establish a vehicle point feature matching between two images.
The point feature-based image matching procedure can be divided into three steps: feature detection, feature description, and feature matching. In feature detection, the point features in an image, which are to be matched with those in other images, are identified. Then, in feature description, the detected point features and their surrounding areas (small image patches) are encoded into a compact and stable descriptor. Finally, feature matching is performed to search for possible matching pairs of point feature descriptors in two images.
To realize satisfactory matching performance, the descriptors of different point features should be sufficiently distinguishable. Meanwhile, the descriptors should be robust against angle changes, scale changes, and brightness changes. Traditional handcrafted feature descriptors are typically designed artificially for low-level features (such as pixel intensity gradients) through biological knowledge or professional experience, which are easily affected by image transformations (e.g., scale changes, brightness changes, and viewing angle changes); thus, these descriptors inevitably suffer from information loss. In this study, a HardNet descriptor is introduced, which is a data-driven deep learning local feature descriptor that can fully extract high-level local features. These high-level features are robust against image changes and, hence, can produce excellent description performance.
Feature detection
For point feature detection, to ensure the scale invariance of image point features, we construct an image scale-space pyramid with inspiration from the BRISK algorithm (Leutenegger et al., 2011). The image scale-space pyramid consists of n octave layers $c_i$ and intra-octave layers $d_i$.
Figure 1. Scale-space pyramid (the squares denote pixels; the number of pixels on the side of octave $c_i$ is twice that of $c_{i+1}$, and the same holds for intra-octaves $d_i$ and $d_{i+1}$).
Therefore, the scale $t$ for layers $c_i$ and $d_i$ doubles from one layer to the next, that is, $t(c_{i+1}) = 2\,t(c_i)$ and $t(d_{i+1}) = 2\,t(d_i)$.
Figure 2. Feature detection in an image patch.
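The layer scales of such a pyramid can be sketched as follows. This is a minimal illustration that assumes the BRISK convention of Leutenegger et al. (2011), in which the first octave has scale 1 and the first intra-octave has scale 1.5; the function names and the `intra_octave_factor` parameter are hypothetical and not taken from this study.

```python
def pyramid_scales(n_octaves, intra_octave_factor=1.5):
    """Return (octave_scales, intra_octave_scales) relative to the original image.

    Assumption (BRISK convention): t(c_i) = 2**i and
    t(d_i) = intra_octave_factor * 2**i, so each layer type doubles per level.
    """
    octaves = [2.0 ** i for i in range(n_octaves)]
    intra_octaves = [intra_octave_factor * 2.0 ** i for i in range(n_octaves)]
    return octaves, intra_octaves


def interleaved_layers(n_octaves):
    """List the layers from fine to coarse: c0, d0, c1, d1, ..."""
    octaves, intra_octaves = pyramid_scales(n_octaves)
    layers = []
    for i in range(n_octaves):
        layers.append((f"c{i}", octaves[i]))
        layers.append((f"d{i}", intra_octaves[i]))
    return layers
```

Interleaving the two layer types gives a finer sampling of the scale dimension than octaves alone, which is what the later scale refinement relies on.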
In the next step, to obtain the most stable point feature in a region, non-maxima suppression (Leutenegger et al., 2011) is performed on all candidate point features (pixels) to remove redundant candidates that are neither the brightest nor the darkest in their neighborhoods. For this purpose, the function value V of each candidate point feature, which represents image saliency, is calculated.
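As a rough sketch of non-maxima suppression, the snippet below keeps only those pixels whose saliency value V is strictly larger than the values of all eight neighbors in the same layer. It assumes that V is already available as a 2D grid and ignores the comparison with adjacent pyramid layers; the function name and the grid representation are illustrative assumptions.

```python
def non_maxima_suppression(scores):
    """Return (row, col) positions whose score exceeds all 8 neighbours.

    `scores` is a 2D list of saliency values V; zero or negative entries are
    treated as non-candidates (an assumption made for this sketch).
    """
    rows, cols = len(scores), len(scores[0])
    kept = []
    for r in range(rows):
        for c in range(cols):
            value = scores[r][c]
            if value <= 0:
                continue
            is_maximum = True
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    if inside and scores[rr][cc] >= value:
                        is_maximum = False
            if is_maximum:
                kept.append((r, c))
    return kept
```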
Based on this procedure, we can obtain the point feature positions in each layer of the scale-space pyramid. Since the saliency of the image is continuous not only on the image but also in the scale dimension, the subpixel and continuous scale refinement are performed on each detected position. First, the V values of 3 × 3 patches that surround these point features are fitted with 2D quadratic functions to find the subpixel maximum V values. Next, these values are used to fit a 1D parabola (see Figure 1) to yield the final value estimate, which is the maximum value of the 1D parabola, and its corresponding scale is the determined scale. Finally, the image coordinates are re-interpolated between the patches in the layers next to the determined scale, and the precise coordinates of the image point features are obtained, as presented on the right side of Figure 1. Next, the HardNet descriptor will be introduced for feature description.
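The 1D parabola fit along the scale dimension can be illustrated with three (scale, V) samples. The sketch below is a generic three-point parabola fit under the assumption that the samples come from the layer of the detection and its two neighboring layers; the function name is hypothetical.

```python
def parabola_peak(xs, ys):
    """Fit y = a*x**2 + b*x + c through three samples and return its maximum.

    xs: three distinct sample positions (e.g., layer scales, not necessarily
        equally spaced); ys: the refined V values at those positions.
    Returns (x_peak, y_peak). Raises ValueError if the fit has no maximum.
    """
    (x0, x1, x2), (y0, y1, y2) = xs, ys
    # Newton divided differences: y = y0 + a1*(x - x0) + a2*(x - x0)*(x - x1)
    a1 = (y1 - y0) / (x1 - x0)
    a2 = ((y2 - y1) / (x2 - x1) - a1) / (x2 - x0)
    if a2 >= 0:
        raise ValueError("samples do not describe a maximum")
    # Setting the derivative a1 + a2*(2x - x0 - x1) to zero gives the vertex.
    x_peak = 0.5 * (x0 + x1) - a1 / (2.0 * a2)
    y_peak = y0 + a1 * (x_peak - x0) + a2 * (x_peak - x0) * (x_peak - x1)
    return x_peak, y_peak
```

For example, samples taken from the curve V = 5 - (t - 2)^2 at t = 1, 1.5, and 3 yield the peak at t = 2 with V = 5, even though none of the sampled layers lies at that scale.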
Feature description based on the HardNet deep convolutional neural network
The goal of HardNet (Mishchuk et al., 2017) is to learn an excellent feature descriptor, which can fully describe the information of local areas of an image. This is achieved by training the network with a specially selected set of descriptor triplets, each of which contains a pair of matching descriptors that originate from the same point feature area and the closest non-matching descriptor that originates from another area. The loss function of HardNet aims at maximizing the distance between the closest positive sample (matching descriptor) and the closest negative sample (non-matching descriptor) in a batch.
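The idea of selecting the closest non-matching descriptor within a batch can be sketched in plain Python. The snippet follows the triplet margin form published by Mishchuk et al. (2017) with a margin of 1; the function names, the list-based descriptor representation, and the absence of any gradient computation are simplifications made for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two descriptors given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hardest_in_batch_loss(anchors, positives, margin=1.0):
    """Triplet margin loss with the hardest negative mined inside the batch.

    anchors[i] and positives[i] describe the same point feature; every other
    descriptor in the batch is a non-matching candidate. Requires at least
    two pairs so that a negative exists.
    """
    n = len(anchors)
    dist = [[l2_distance(anchors[i], positives[j]) for j in range(n)]
            for i in range(n)]
    total = 0.0
    for i in range(n):
        d_pos = dist[i][i]
        # Closest non-matching positive for anchor i.
        d_neg_row = min(dist[i][j] for j in range(n) if j != i)
        # Closest non-matching anchor for positive i.
        d_neg_col = min(dist[k][i] for k in range(n) if k != i)
        d_neg = min(d_neg_row, d_neg_col)
        total += max(0.0, margin + d_pos - d_neg)
    return total / n
```

The loss is zero once every matching pair is closer than its hardest negative by at least the margin, which is why only the most confusable negatives drive the training.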
The structure of HardNet is illustrated in Figure 3. It consists of six consecutive convolutional layers with 3 × 3 convolution kernels and one convolutional layer with 8 × 8 convolution kernels. Except for the last layer, all convolutional layers are padded with zeroes to preserve the spatial size information. Meanwhile, a batch normalization layer and a ReLU non-linear layer are added after each convolutional layer. It is found that the pooling layer will reduce the performance of the descriptor; hence, no pooling layer is present in the HardNet network structure, and regularization with a dropout rate of 0.3 is applied before the last convolutional layer. In our method, the input to the network is an image patch that contains a point feature.
Figure 3. HardNet deep convolutional neural network structure.
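To make the layer structure more concrete, the following sketch traces the feature map size through the seven convolutional layers. The 32 × 32 input size, the channel widths, and the stride of 2 in the third and fifth layers are assumptions taken from the published HardNet architecture (Mishchuk et al., 2017), not values stated in this section.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


# (out_channels, kernel, stride, padding); assumed values, see the lead-in.
LAYERS = [
    (32, 3, 1, 1),
    (32, 3, 1, 1),
    (64, 3, 2, 1),
    (64, 3, 1, 1),
    (128, 3, 2, 1),
    (128, 3, 1, 1),
    (128, 8, 1, 0),  # last layer: 8 x 8 kernel, no padding
]


def trace_shapes(input_size=32):
    """Return the (channels, height, width) shape after each layer."""
    size = input_size
    shapes = []
    for channels, kernel, stride, padding in LAYERS:
        size = conv_output_size(size, kernel, stride, padding)
        shapes.append((channels, size, size))
    return shapes
```

Under these assumptions, the unpadded 8 × 8 kernel of the last layer collapses the remaining 8 × 8 feature map into a single 128-dimensional vector, which serves as the descriptor.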
For the training of HardNet, small image patches are employed, each of which contains a point feature. The loss function is defined on descriptor triplets that are sampled within each training batch.
Figure 4. Sampling process for obtaining a descriptor triplet in the loss function.
Finally, L descriptor triplets
In the illustrative application in the next section, the training of the HardNet model is performed on the combination of the UBC Phototour (Brown and Lowe, 2007) dataset and the HPatches (Balntas et al., 2017) dataset, which ensures the richness and diversity of the training data. The stochastic gradient descent optimization algorithm is employed for network optimization. The basic learning rate, momentum, and weight decay are set to 10, 0.9, and 0.0001, respectively, and the learning rate decays linearly. The network weights are initialized orthogonally with a gain of 0.6, and the biases are set to 0.01. All hyper-parameters above are determined according to the default settings in Mishchuk et al. (2017).
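A linearly decaying learning rate can be written in a few lines. The sketch assumes that the rate falls from the basic learning rate to zero over the whole training run; the function name and the `total_steps` parameter are hypothetical.

```python
def linear_learning_rate(base_lr, step, total_steps):
    """Learning rate that decays linearly from base_lr (step 0) to zero."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the progress so that steps outside the schedule stay in range.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * (1.0 - progress)
```

With a basic learning rate of 10, the rate would be 5 halfway through training and 0 at the end.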
After the HardNet model has been trained, a point feature is described by feeding its image patch into the network and taking the network output as its descriptor. Based on the HardNet descriptor, point feature matching is investigated in the next section.
Feature matching based on the HardNet descriptor
To match HardNet descriptors in two images, a two-step feature matching procedure is employed in this study. The brute force matching method and SIFT matching criteria (Lowe, 2004) are first combined to find possible matching relationships. Then, because false matches may remain among these relationships, the random sample consensus (RANSAC) algorithm (Fischler and Bolles, 1981) is utilized to remove the outliers.
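The first step of this procedure can be sketched as a brute force search combined with the distance ratio criterion of Lowe (2004). The ratio of 0.8 is the value suggested by Lowe and is an assumption here, as are the function name and the tuple-based descriptor representation.

```python
import math


def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Brute force matching with the nearest/second-nearest ratio test.

    For every descriptor in desc_a, all descriptors in desc_b are compared.
    A match (i, j) is kept only if the nearest distance is clearly smaller
    than the second-nearest distance.
    """
    matches = []
    for i, da in enumerate(desc_a):
        distances = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(distances) < 2:
            continue  # the ratio test needs two candidates
        (d1, j1), (d2, _) = distances[0], distances[1]
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches
```

The ratio test discards ambiguous descriptors whose two best candidates are almost equally distant, which is a common situation on repetitive vehicle textures.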
In the first step, brute force matching is performed: a HardNet point feature descriptor
To further increase the matching accuracy, the RANSAC (Fischler and Bolles, 1981) algorithm, which is an outlier removing algorithm for geometric transformation and estimation of multiple view relationship between corresponding point features, is employed to remove mismatched relationships by learning a feature matching model between images.
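The consensus idea behind RANSAC can be illustrated with a deliberately simplified model. The sketch below swaps the geometric model for a pure 2D translation, which needs only one sampled match per hypothesis; it is a stand-in chosen for brevity and is not the matching model used in this study. The function name, tolerance, iteration count, and seed are all hypothetical.

```python
import random


def ransac_translation(pairs, tol=3.0, iterations=100, seed=0):
    """Return the largest set of matches that agree on one translation.

    pairs: list of ((x1, y1), (x2, y2)) matched point positions.
    Each iteration samples one match, hypothesises the translation it
    implies, and counts the matches that agree within `tol` pixels.
    """
    if not pairs:
        return []
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.choice(pairs)
        dx, dy = x2 - x1, y2 - y1
        inliers = [
            p for p in pairs
            if abs(p[1][0] - p[0][0] - dx) <= tol
            and abs(p[1][1] - p[0][1] - dy) <= tol
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers
```

Matches outside the best consensus set are treated as mismatches and removed; a richer model changes only the hypothesis and the residual, not the loop.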
In the RANSAC algorithm, as the definition of a matching model M, the homography matrix
Vehicle re-identification
The HardNet feature representation-based image matching method establishes vehicle point feature matching relationships between two vehicle images. In the proposed method, the number of matching point features is the basis for judging whether the vehicle to be identified coincides with the target vehicle. However, difficulties can be encountered in determining the gap that separates vehicles that should be considered matching vehicles from other vehicles. Corresponding to each target vehicle, a sequence of vehicles is arranged in descending order of the numbers T of the point feature matching pairs with the target vehicle. The numbers of matching point feature pairs are examined (in descending order), and the threshold $T_n$ is selected as the sudden decrease point in T. If the number of matching point feature pairs satisfies $T > T_n$, the vehicle to be re-identified will be considered to coincide with the target vehicle; otherwise, it is considered an interfering vehicle. The steps of the proposed vehicle re-identification method are summarized as follows, and a flowchart of the image matching, which is the key component of the proposed vehicle re-identification method, is presented in Figure 5.
1. Construct the image scale-space pyramid and detect point features in vehicle images by running the AGAST algorithm.
2. Input the point feature image patches into the trained HardNet deep neural network to obtain HardNet point feature descriptors.
3. Use the brute force matching and SIFT matching criteria to initially establish a point feature matching between images. Then, remove the remaining mismatched point feature pairs by running the RANSAC algorithm.
4. Judge whether the vehicle to be identified coincides with the target vehicle according to whether the number of matching point feature pairs exceeds the selected threshold $T_n$.
Figure 5. The flowchart of the image matching based on the HardNet network model.
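One possible reading of the threshold selection is sketched below: the candidates are sorted by their numbers of matching pairs, the largest drop between consecutive counts is taken as the sudden decrease point, and the count just below that drop is used as the threshold. This interpretation, the function names, and the dictionary input are assumptions made for illustration.

```python
def select_threshold(match_counts):
    """Return the count just below the largest drop in the sorted counts."""
    ranked = sorted(match_counts, reverse=True)
    if len(ranked) < 2:
        raise ValueError("at least two candidates are needed")
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    k = drops.index(max(drops))  # position of the sudden decrease
    return ranked[k + 1]


def reidentify(counts_by_vehicle):
    """Return the candidates whose match count exceeds the threshold."""
    threshold = select_threshold(list(counts_by_vehicle.values()))
    return [name for name, count in counts_by_vehicle.items()
            if count > threshold]
```

For instance, candidate counts of 0, 3, and 194 give a threshold of 3, so only the candidate with 194 matching pairs is accepted.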

When the monitoring information of WIM at one cross-section of a bridge is available, it is tractable to identify the spatial information of vehicle loads on this bridge for each time instant by combining the WIM data and the video images from cameras along the bridge. Once a vehicle is detected in the region of the WIM location, vehicle detection and segmentation methods can be employed to obtain this vehicle image from the camera video at the location of the WIM, and the associated vehicle weight, which is measured by the WIM, can also be extracted. Then, for a time instant, using the video images that have been captured by various cameras along the bridge, all vehicles can be re-identified and located by running our proposed vehicle re-identification method. In combination with the weight information of each re-identified vehicle, the spatial information of vehicle loads on bridges for this time instant can finally be obtained. The flowchart of bridge vehicle load identification using the proposed vehicle re-identification method is shown in Figure 6.
Figure 6. The flowchart of bridge vehicle load identification.
Validation test
To evaluate the performance of the proposed method, monitoring image data of vehicles on the Dayang Port Bridge are used. The Dayang Port Bridge is an extra-large bridge in Qidong City, China. It is 817.2 m long and 33.5 m wide and has six lanes in both directions. Since heavy trucks are the main concern for examining the loading conditions and service behaviors of bridges, only vehicles of this type are considered in this illustrative application.
In Figure 7, the images of a target vehicle and three vehicle candidates for re-identification are presented. To match the point features with a focus on the target vehicle, the vehicle detection and segmentation process is conducted on the original image, as shown in Figure 8, which can be simply realized using vehicle segmentation methods (Ren et al., 2016; Zhang et al., 2019).
Figure 7. Images of the target vehicle and a set of vehicles to be re-identified.
Figure 8. Detection and segmentation of the target vehicle.
Our proposed image matching method, which is based on the HardNet point feature descriptor, is applied to establish the matching relationships between the target vehicle and each vehicle to be identified, and the results are presented in Figure 9. The lines in the figures represent the matching relationships that have been established between the two images, and the ends of the lines are the corresponding two point features. To show the matching effect more clearly, the images are displayed in grayscale. According to Figure 9, Vehicle 3 has significantly more matching relationship lines (194) with the target vehicle compared with the other two vehicles (0, 3). Therefore, Vehicle 3 can be identified as the same vehicle as the target vehicle, which is consistent with the ground truth information.
Figure 9. Matching relationships between the target vehicle and all candidate vehicles to be identified.
To evaluate the performance of our proposed method (called the HardNet method), a comparison study with two published point feature-based vehicle re-identification methods that are based on the SIFT feature (Lowe, 2004) (called the SIFT method) and based on the BRISK feature (Leutenegger et al., 2011) (called the BRISK method) is also conducted, as shown in Figures 10–13. Various scenarios are investigated, which include matching of a vehicle front (Figure 10) and rear (Figure 11) during the daytime and matching of a vehicle front (Figure 12) and rear (Figure 13) during the night. Compared with the two published methods, significantly more point feature matching pairs are established by our HardNet method for each case. The BRISK method produces less than half as many point feature matching pairs as our method (the numbers are 56, 3, 31, and 72 compared with 226, 48, 77, and 183 for the four cases that correspond to Figures 10–13, respectively), while the numbers of point feature matching pairs that are identified by the SIFT method are even smaller (only 9, 3, 9, and 0 for the four cases). This implies that our HardNet method can establish vehicle point feature matching relationships stably and effectively in various scenarios with brightness, viewing angle, and scale changes; therefore, it can substantially outperform the two traditional methods in vehicle re-identification.
Figure 10. Illustrative example of the matching performance comparison of the three methods using images of vehicle fronts that were captured during the daytime.
Figure 11. Illustrative example of the matching performance comparison of the three methods using images of vehicle rears that were captured during the daytime.
Figure 12. Illustrative example of the matching performance comparison of the three methods using images of vehicle fronts that were captured during the night.
Figure 13. Illustrative example of the matching performance comparison of the three methods using images of vehicle rears that were captured during the night.
To evaluate the accuracy of vehicle re-identification, the indices of mean average precision (mAP) and rank-k (Khan and Ullah, 2019) are introduced. Corresponding to each target vehicle, a sequence of vehicle candidate images for re-identification is arranged in descending order of the number of point feature matching pairs with the target vehicle. For a target vehicle, the average precision is defined as
Based on the same vehicle image sequence, another index for a target vehicle is defined as
Evaluation results of the three methods on real vehicle image data.
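The two indices can be sketched under the assumption that they follow their standard retrieval definitions: the average precision is the mean of the precision values at the ranks where a correct match appears, mAP is its mean over all target vehicles, and rank-k equals 1 if a correct match is found among the first k candidates. The function names and the boolean-list representation of a ranked candidate sequence are illustrative.

```python
def average_precision(ranked_is_match):
    """Average precision of one ranked candidate sequence.

    ranked_is_match[i] is True if the candidate at rank i + 1 is the
    same vehicle as the target.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_is_match, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over all target vehicles."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


def rank_k(ranked_is_match, k):
    """1 if a correct match appears within the first k candidates, else 0."""
    return 1 if any(ranked_is_match[:k]) else 0
```

Averaging `rank_k` over all target vehicles gives the rank-k accuracy that is typically reported alongside mAP.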
Conclusions
Vehicle loads are the primary loads of bridge structures, and the identification of vehicle loads is essential for time-varying condition assessment, remaining life prediction, and ultimate bearing capacity calculation of bridges. Aiming at the realization of vision-based vehicle recognition for bridge vehicle load identification together with measurement information of WIM systems, a vehicle re-identification method that is based on monitoring image data and deep learning is proposed in this study. The robust HardNet deep CNN local image descriptor is introduced for establishing the point feature matching relationships between the target vehicle and the vehicle to be identified. Three steps are performed: feature detection, feature description, and feature matching. The performance of the method is evaluated by applying it to vehicle image data sets from a real bridge. It is demonstrated that our proposed method can re-identify vehicles with high accuracy under various scenarios with brightness, viewing angle, and scale changes. Compared with two published methods, our proposed method can identify significantly more matching point features and, thus, produce superior vehicle re-identification results.
When the monitoring information from a weigh-in-motion system is available, we can identify the spatial information of vehicle loads on the bridge for each time instant by employing the vehicle re-identification method. In future studies, it would be useful to explore the practicality of the method when the image quality is poor (e.g., when the vehicles are photographed using a low-resolution camera or in foggy and hazy weather).
Acknowledgments
This research is financially supported by the Open Funding of State Key Laboratory of Safety and Health for In-Service Long Span Bridges (Grant No. 1800008) and the National Natural Science Foundation of China (Grant No. 51778192).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
