Abstract
Traditional object detection algorithms and strategies struggle to meet current requirements for data processing efficiency, performance, speed and intelligence. By studying and imitating the cognitive ability of the brain, deep learning can analyze and process data features. It offers strong visualization ability and has become the mainstream approach in current object detection applications. First, we discuss the development of traditional object detection methods. Second, the object detection frameworks that combine region proposals with convolutional neural networks (CNNs), e.g. Region-based CNN (R-CNN), Spatial Pyramid Pooling Network (SPP-Net), Fast-RCNN and Faster-RCNN, are briefly characterized for optical remote sensing applications. The You Only Look Once (YOLO) algorithm is representative of the frameworks (e.g. YOLO and Single Shot MultiBox Detector (SSD)) that transform object detection into a regression problem. The limitations of remote sensing images and object detectors are highlighted and discussed. The feasibility and limitations of these approaches should help researchers to select appropriate image enhancements prudently. Finally, the open problems of deep learning based object detection are summarized and recommendations for future work are given.
Introduction
Object detection is a prerequisite for advanced visual tasks such as scene content understanding. It has been applied to intelligent video surveillance, content-based image retrieval, robot navigation and enhancement [1, 2, 3, 4, 5, 6]. Compared with object detection in video, object detection in static images is more difficult and challenging [7, 8, 9, 10, 11, 12, 13]. Object detection in optical remote sensing imagery is widely used for surveillance, traffic monitoring, agricultural development, disaster planning, geographical referencing and many other purposes. Processing optical remote sensing images is more challenging because of several limitations, such as the small number of datasets, low-resolution images and complex environments [14, 15, 16, 17]. Traditional object detection methods mainly use the Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT) for feature extraction. HOG [18, 19, 20] and SIFT [21, 22, 23] features are used to discriminate between sliding windows. The main representative methods are the Deformable Part Model (DPM) and its extensions [24, 25, 26, 27]. Because sliding windows impose a massive computational burden, object detection methods based on region proposals have come to the fore. At present, the more promising region proposal methods include Selective Search (SS) [28] and Edge-Box [29].
In recent years, deep convolution networks have made breakthroughs in the field of computer vision. A deep convolution network deepens the network hierarchy through a weight-sharing strategy, which makes the network more analytical. The establishment of large-scale image databases such as ImageNet [30] and COCO [31] has greatly promoted the development of deep convolution networks. The Alex-Net [32] convolutional network with seven layers won the ImageNet image classification contest by a wide margin, and its effectiveness has been verified repeatedly [33, 34, 35, 36]. Subsequently, the VGG [37] network, GoogLeNet [38] and the residual network [39] pushed convolution networks to greater depth, which greatly improved network performance and raised the accuracy of large-scale image classification to a high level.
At the same time, researchers have begun to extend deep convolution networks to other fields. In object detection, the region-based convolutional neural network (R-CNN) [40] successfully connects object detection with deep convolution networks and raises detection accuracy to a new level. R-CNN consists of three independent steps: candidate window generation, feature extraction, and SVM classification followed by window regression. R-CNN mainly uses the SS method to generate many plausible candidate windows. Each candidate window is sent to the deep network to extract features. Finally, an SVM classifier is trained to classify all candidate windows, and regression is performed on them. The detection efficiency of R-CNN is very low because it is divided into three independent processes. To address this, researchers improved R-CNN and proposed the Spatial Pyramid Pooling Net (SPP-net) [41] and the Fast Region-based Convolutional Neural Network (Fast-RCNN) [42]. In these methods, candidate windows are mapped onto a certain layer of the network, which greatly improves the detection speed of the model.
Faster region-based convolutional neural network (Faster-RCNN) [43] generates candidate regions using a Region Proposal Network (RPN), then classifies and regresses those regions using the same structure as Fast-RCNN. RPN and Fast-RCNN share the main body of the deep network, so Faster-RCNN unifies object detection in a single deep network framework. Building on this, the Region-based Fully Convolutional Network (RFCN) [44] offers a further improvement. Its authors observe that the network layers after Region of Interest (ROI) pooling are no longer shift-invariant, and that the number of layers after ROI pooling directly affects detection efficiency. Therefore, RFCN designs a position-sensitive ROI pooling layer, which directly discriminates the results after pooling and greatly improves detection efficiency. YOLO (You Only Look Once) [45] and SSD (Single Shot Multibox Detector) [46] aim to improve detection efficiency and to achieve real-time detection in object detection applications. In particular, SSD can enhance both detection accuracy and detection efficiency.
Compared with traditional object detection methods, methods based on deep networks have obvious advantages in terms of accuracy. First, a neural network is a self-learning structure that simulates the human brain. Second, forward propagation through a deep network can be regarded as a process of continuously abstracting objects. The upper layers of a deep neural network (near the output layer) record more perceptual information about objects [34]. Third, the structure of a deep network can fit large-scale training samples well. The major difficulty of object detection lies in the variety of target objects, which have different colors, sizes and shapes in different scenarios. Deep networks have a large number of parameters, which gives them strong learning ability, and a large number of training samples helps to activate their neurons. For these reasons, it is unsurprising that deep networks achieve excellent results in object detection, with accuracy much higher than that of conventional methods.
The main aim of this contribution is to systematically summarize the object detection frameworks based on deep convolution networks that have been proposed in recent years. We also describe the performance of regression-based object detection methods and assess the advantages, disadvantages and limitations of the different methods in computer vision. In addition, we describe some basic preprocessing approaches that could enhance the accuracy or computational speed of detectors. Finally, we describe the current limitations and future challenges of deep convolutional neural networks (DCNNs) for object detection in optical remote sensing imagery.
The rest of the paper is organized as follows: The general framework of object detection, limitations of optical remote sensing images and object detectors, along with challenges have been described in Section 2. Section 3 presents a review of different approaches for object detection based on deep learning and regression. Finally, Section 4 concludes this paper.
General framework of object detection
Object detection is challenging because of variations in illumination, view angle and partial occlusion during the detection process. Using the same standard database makes it possible to compare the detection performance of various algorithms effectively. At present, the main databases used are PASCAL VOC [47], Caltech and ImageNet [48]. Pre-processing adjusts the brightness, color and size of the original image in order to obtain accurate features of the object to be detected, reduce the complexity of subsequent processing and improve efficiency. Common image preprocessing operations include image enhancement, binarization and grayscale conversion.
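The preprocessing operations named above can be illustrated with a minimal NumPy sketch. The function names, the BT.601 luma weights, the fixed threshold of 128 and the linear contrast stretch are illustrative assumptions and are not taken from any of the surveyed methods.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB array to grayscale (assumed BT.601 luma weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights

def binarize(gray, threshold=128.0):
    """Map pixels at or above the (assumed) threshold to 1 and the rest to 0."""
    return (gray >= threshold).astype(np.uint8)

def stretch_contrast(gray):
    """Simple enhancement: linearly rescale intensities to the range [0, 255]."""
    lo, hi = gray.min(), gray.max()
    if hi == lo:  # flat image: nothing to stretch
        return np.zeros_like(gray, dtype=np.float64)
    return (gray - lo) / (hi - lo) * 255.0
```

In practice a detector pipeline would apply such steps before feature extraction; which of them help depends on the sensor and the dataset.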
In the feature extraction process, the required features are expressed as much as possible in the numerical form so it becomes convenient to filter out the interference features and obtain the actual features of the image. The main problem of feature generation is how to extract features, which affects the accuracy and timeliness of the detection algorithm. Building a model is a key step of the object detection system, which extracts the similarities of the same category of objects, distinguishes different types of objects effectively, and efficiently processes, stores and utilizes the extracted features and the spatial structure between them. According to the statistical structure, the model construction can be divided into Generative Model and Discriminative Model [49].
When the characteristics and models of the object to be detected are obtained, the training of the model is carried out. The important basis of object detection is to obtain the model parameters of the object by learning and training the specified training image set. This training can be divided into unsupervised training, supervised training and semi-supervised training according to the different training methods. According to the different classifiers, it can be divided into Neural Networks (NNs), KNN, Support Vector Machine (SVM) and Random Forest (RF).
The general process of object detection is to match the model obtained by training on the sample set against the model extracted from the image under test, in order to obtain the category and orientation information of the objects in that image. Object localization is the key step, because it directly affects the final performance evaluation of the detection system. Current object search methods mainly include search based on image segmentation and search based on sliding windows. With the development of computer and communication technology, it is easier than before to obtain large amounts of monitoring data. A difficult problem that remains is how to extract useful information efficiently from such large amounts of data. Taking object detection and recognition based on image and video data as an example, traditional manual and human-computer interaction methods cannot cope with the current “explosion” of data, especially in terms of recognition efficiency. The generalized steps of the object detection system for optical remote sensing imagery are shown in Fig. 1.
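To make the cost of sliding window search concrete, the following sketch enumerates the windows a detector would have to classify at a single scale. The function name and the window and stride values are hypothetical and chosen only for illustration.

```python
def sliding_windows(height, width, win_h, win_w, stride):
    """Yield (x1, y1, x2, y2) boxes of a fixed-size window moved over the image."""
    for top in range(0, height - win_h + 1, stride):
        for left in range(0, width - win_w + 1, stride):
            yield (left, top, left + win_w, top + win_h)

def count_windows(height, width, win_h, win_w, stride):
    """Number of classifier evaluations needed at one scale."""
    return sum(1 for _ in sliding_windows(height, width, win_h, win_w, stride))
```

Under these assumptions, a 1000 x 1000 image scanned with a 64 x 64 window and a stride of 8 already requires 13924 classifier evaluations at one scale, and the count grows further when several scales and aspect ratios are searched.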
General implementation steps of the object detection system for optical remote sensing imagery.
In general, object detection algorithms based on deep learning can be divided into two categories: 1) the R-CNN series of algorithms, which are based on region proposals; 2) the YOLO and SSD series of algorithms, which do not use region proposals. The R-CNN and YOLO frameworks provide the two basic frameworks for research on object detection. On this basis, researchers have proposed several methods to improve detection performance: 1) hard negative mining [50]; 2) multi-layer feature fusion [51]; 3) use of context information [52, 53]. In addition, a variety of open-source deep learning frameworks have emerged, including Theano, Keras, Caffe and TensorFlow. These open-source frameworks accelerate the development of deep learning and provide efficient tools for object detection. They also help to design more reasonable network structures, improve the detection efficiency of the recurrent neural network, and achieve multi-scale and multi-category object detection. Object detection based on deep learning nevertheless remains a challenging subject, with the main difficulties lying in two aspects: robustness and computational complexity [54, 55, 56]. With the arrival of the era of big data and artificial intelligence, the subject offers new opportunities and challenges and is worth further research. The feasibility of the methods covered in this discussion can be observed in Fig. 6. Among all methods, Faster-RCNN is widely used in object detection because of its simplified structure and robustness.
Limitations of remote sensing images
Optical remote sensing images have several drawbacks with respect to the shapes, sizes, occlusions, resolutions and pixel counts of objects. Detectors generally rely on regions of interest (ROIs), which are frequently not appropriate. The selection of training objects needs to be improved: the set of pixels that actually holds the characteristics of an object should be selected for training. Regarding resolution, a detector trained on high-resolution images struggles to reach optimal accuracy on low-resolution images and generates more false alarms. Researchers are addressing this problem in various ways, such as fine-tuning, careful selection of the layers of the neural network, and scaling of images. The key problems are given below in point form:
The targeted object frequently appears at diverse scales, and its scale changes with the resolution of the images. This limitation leads to poor object detection performance. Objects in high-resolution images are comparatively very small, which reduces accuracy when CNN features are pooled by the uppermost layer at a lower resolution. After several down-sampling steps, the feature map is 1/16 the size of the input image, which causes the loss of key information about small objects and creates false alarms. In remote sensing imagery, labeled data is scarce and the accessible annotated data is not enough for optimal object detection performance. Collecting new labeled data is time-consuming and very expensive. Remote sensing data is generated by different platforms and sensors (e.g. satellite, airborne, airplane, UAV and SAR) with varying resolution. This increases the demand for an object detection method that is applicable to all kinds of remote sensing imagery.
Datasets contain objects of many different sizes, and these irregular changes in size and shape reduce the performance of classifiers and detectors. There is a pressing need to improve optical remote sensing training images for object detection using geometric transformations, because the training images have a great impact on detection performance. Several valuable contributions to image enhancement have been proposed in past decades [57, 58, 59, 60], but there is still room to improve training images. Another sensible approach is to compress training images to reduce the storage space required for high-resolution remote sensing images. Losslessly compressed images require training similar to that of the original images but save storage space. Lossy compression, on the other hand, reduces the accuracy of object detection because information is lost. Image enhancement approaches for training images include the following:
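The effect of the 1/16 down-sampling on small objects can be checked with simple arithmetic. The sketch below assumes a total stride of 16, as stated above; the function names and the example object sizes are illustrative.

```python
import math

def feature_map_extent(object_size_px, stride=16):
    """Approximate side length of an object, in feature-map cells, after down-sampling."""
    return object_size_px / stride

def feature_map_shape(image_h, image_w, stride=16):
    """Approximate feature-map size for an input image (assumes 'same'-style padding)."""
    return math.ceil(image_h / stride), math.ceil(image_w / stride)

# A vehicle that is 8 pixels wide covers only half a cell on the final feature map,
# whereas a 128-pixel building still covers 8 cells.
sizes = {px: feature_map_extent(px) for px in (8, 16, 32, 128)}
```

An object smaller than the stride occupies less than one feature-map cell, which is one way to see why key information about small objects is lost.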
Image denoising; image filtering; edge sharpening; image rotation; image scaling; image compression, etc.
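For the geometric items in this list, the bounding-box annotations have to be transformed together with the pixels. The following NumPy sketch shows one possible way to do this for a horizontal flip, a 90-degree rotation and integer up-scaling. The function names and the box convention (x1, y1, x2, y2) with exclusive right and bottom edges are assumptions made for this example.

```python
import numpy as np

def hflip(image, box):
    """Mirror an H x W (x C) image left-to-right and move the box with it."""
    width = image.shape[1]
    x1, y1, x2, y2 = box
    return image[:, ::-1], (width - x2, y1, width - x1, y2)

def rot90_ccw(image, box):
    """Rotate 90 degrees counter-clockwise; the output has shape W x H (x C)."""
    width = image.shape[1]
    x1, y1, x2, y2 = box
    return np.rot90(image), (y1, width - x2, y2, width - x1)

def upscale(image, box, factor):
    """Nearest-neighbour up-scaling by an integer factor."""
    scaled = np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)
    return scaled, tuple(v * factor for v in box)
```

Each function returns the transformed image and the transformed box, so the pair can be added to the training set as a new sample.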
The compression of the object detection model will improve detection efficiency. Because the deep network model will occupy a large amount of storage space and its application in a mobile terminal is limited, compression of the object detection model will also be an important research idea in the future. The research based on analog network aims to design special supervised and unsupervised tasks to make object detection promising. The main idea is to train smaller object detection models by using the trained deep network object detection framework. The training of the small framework becomes more efficient through the training of additional tasks. Therefore, the overall framework of object detection can be compressed on the premise of ensuring the performance of object detection. Specifically, the object detection with deep training can be achieved. In addition, unsupervised training tasks can be added to strengthen network training. Additional tasks can be done to the shallow network by setting up coders and decoders. The training of service will further improve the efficiency of network training.
The search ability for large-scale remote sensing images is weak. At present, most algorithms focus on object detection in small images that contain the object to be interpreted, and their ability to interpret objects in large-scale remote sensing images is relatively weak. The algorithms have high time complexity and cannot fully meet the practical need for automatic interpretation of rapidly growing data volumes and for fast detection and recognition of objects. Moreover, because the viewing angle of the detector and its distance from the object vary, the features presented by the object change greatly, which enlarges the database of object characteristic views and consumes a lot of storage space. At the same time, the computational complexity of the related algorithms increases during object search. Existing interpretation algorithms are highly specialized. To interpret different kinds of objects in one remote sensing image, different algorithms must be designed and the same image must be processed many times, which increases the processing time. Therefore, the interpretation of different object types deserves attention and research.
Objects in remote sensing images are affected by a variety of random factors, such as the number of unknown objects, the probability of complete or partial occlusion of objects and the probability that unknown objects appear, which make the object information incomplete. Research on knowledge bases for remote sensing is not comprehensive, and existing algorithms do not fully integrate important prior knowledge such as the object environment, so it is necessary to further mine the characteristics of the object environment.
Object detection based on deep learning
Reinforcements of CNN’s towards computational burden
In many recent applications of object detection and classification, CNNs are more popular than traditional feature extraction methods. Earlier approaches were based on pre-designed, hand-crafted features. Traditional features such as HOG and SIFT are less promising than CNN features, although CNNs involve redundant feature computations. Feature learning in a CNN is an end-to-end process that uses gradient descent to update the parameters through back-propagation. In a conventional CNN, the initial layers are convolutional and pooling layers, while the final layers are fully connected layers and a softmax layer. The softmax layer serves as the classification/detection layer and can be replaced by a Support Vector Machine (SVM) classification layer for performance evaluation. In each layer of the CNN, features are generated from the preceding layer using shared weights. Han et al. [61, 62] presented an efficient framework for CNN training that keeps useful connections and prunes useless ones. After pruning, the network must be retrained to fine-tune the remaining weights.
Parallel computing is an effective way to reduce processing time. It has been exploited in Deep Neural Networks (DNNs) to increase speed tenfold compared with an individual DNN, although with a slight loss of accuracy [63]. A single-stage Fully Convolutional Network (FCN) has been proposed to enable faster detection than multi-stage schemes [64]. This method replaces the fully connected layers with convolutional layers. Another CNN-based approach increases robustness and accuracy by combining Binarized Normed Gradients (BING) with a CNN, in which BING is used for localization and the CNN for classification [65]. Computational cost and storage space are major concerns in optical remote sensing imagery, so training images have been compressed and down-scaled to reduce the computational burden while retaining effective accuracy. A comparison of region-based methods shows that training with compressed images is more beneficial than training with down-scaled images, and that the size of the images can be reduced significantly to save storage space [66]. Artificial Neural Networks (ANNs) play a vital role in robotics, where DNNs are becoming more attractive. DNNs have many variants, which differ mainly in their number of layers. A DNN consists of multiple hidden layers between one input layer and one output layer. The number of neurons in each layer may be the same or different, depending on the architecture.
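The prune-then-retrain idea can be sketched in a few lines of NumPy. This is a minimal illustration that assumes "useless" connections are those with the smallest absolute weights; the function names, the sparsity parameter and the plain SGD update are hypothetical and are not claimed to reproduce the procedure of [61, 62].

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out roughly the given fraction of weights with the smallest magnitude.

    Returns the pruned weights and a boolean mask of surviving connections.
    Ties at the threshold are pruned too, so slightly more may be removed.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def masked_sgd_step(weights, grad, mask, lr=0.01):
    """One retraining step that updates only the surviving connections."""
    return weights - lr * grad * mask
```

Applying the mask to every gradient step keeps pruned connections at zero while the remaining weights are fine-tuned.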
Object detection frameworks for region-based CNN methods
Region-based convolutional neural network (R-CNN)
In 2014, Girshick et al. [67] introduced the CNN model into the object detection field and proposed the R-CNN model. SPP-Net, proposed by He et al. [41] in 2015, and Fast R-CNN, proposed by Girshick [42], address several problems of R-CNN: 1) to extract a fixed-length feature representation, each candidate region must be resized to a uniform size (i.e. the resized patches shown in Fig. 2), which may lose image information to some extent; 2) candidate region proposal, deep feature extraction and SVM training are carried out step by step, which is usually costly for large-scale data; 3) training time and storage requirements are high, because candidate regions overlap heavily, which results in repeated feature extraction.
R-CNN framework for object detection and classification.
Intuitively, object detection gives the location of an object, expressed as the x, y, width and height of the bounding box that covers it, whereas classification tells us which object is located there. In other words, classification/recognition gives the class label of the object. The whole detection process of R-CNN is shown in Fig. 2. The R-CNN algorithm can be divided into four steps: 1) candidate region generation: take a natural image as input and extract about 2000 candidate regions with the Selective Search method; 2) feature extraction: normalize the image of each candidate region and feed it to the deep convolution network for convolution, pooling and other operations to extract features; 3) classification: feed the extracted features to a linear SVM classifier to determine the class; 4) position refinement: refine the position of the candidate box by bounding-box regression to obtain an accurate object region.
To overcome the time-consuming feature extraction of R-CNN, SPP-Net [41] extracts features from the whole image first, instead of running the cumbersome convolution calculation for each candidate region, and then crops the features of each candidate region from the feature map before classification. This method needs 0.5 seconds to process an image.
In SPP-Net, He et al. [41] extract the feature representation of each candidate region using spatial pyramid pooling, which effectively avoids losing information when the size of the candidate region is adjusted and allows the detection accuracy of SPP-Net to surpass that of the R-CNN model. Girshick proposed an ROI Pooling layer in Fast-RCNN. With this layer, deep features are extracted only once. Then, according to the location of each candidate region mapped onto the ROI Pooling layer, a feature representation of uniform length is extracted, borrowing the idea of SPP-Net. In addition, thanks to the ROI Pooling layer, feature extraction and classification model training are completed in the same network. Therefore, compared with R-CNN and SPP-Net, Fast-RCNN improves training speed, testing speed and detection accuracy.
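The position refinement in step 4 can be illustrated with the scale-invariant box parameterization commonly associated with R-CNN-style regressors. The sketch below assumes boxes in (center x, center y, width, height) form; the function names are hypothetical, and the regressor that would predict the deltas from CNN features is omitted.

```python
import math

def encode(proposal, ground_truth):
    """Regression targets that move a proposal box onto the ground-truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

def decode(proposal, deltas):
    """Apply predicted deltas to a proposal to obtain the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

Normalizing the offsets by the proposal size and using log ratios for width and height makes the targets independent of the absolute scale of the object.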
SPP-net structure for fixed-length image representation.
The detection flow of SPP-Net is shown in Fig. 3. Although the overall detection process remains unchanged, the following improvements have been made on the basis of R-CNN: 1) it is no longer necessary to unify the images into fixed-size by intercepting and normalizing operations in advance, so as to solve the problem of image distortion and information loss caused by normalization operations; 2) using spatial pyramid structure to replace the pooling layer of the last convolution layer.
Therefore, SPP-Net has the following advantages: 1) by defining a scalable pooling layer, SPP-Net can process images of arbitrary aspect ratio and scale, produce fixed-size output, and improve the robustness of the extracted features through multi-scale pooling; 2) because the features of all candidate regions are extracted directly from the overall feature map, the repeated computation of the convolution layers is avoided, which improves efficiency. At the same time, SPP-Net still has some problems: 1) the isolated training stages make it impossible to train the network parameters as a whole, and a large number of intermediate results still have to be stored; 2) the convolution layers and the fully connected layers on either side of the spatial pyramid pooling layer still need to be tuned separately, which limits the effect of the deep convolution network; 3) the extraction of candidate regions still plays a very important role in the whole detection process.
Fast-RCNN [42] mainly accelerates R-CNN. The detection flow of Fast-RCNN is shown in Fig. 4. In Fast-RCNN, first, a set of candidate regions is generated by a region proposal method; second, the convolutional feature maps of the input image are generated by multiple convolution layers; third, the ROI pooling layer maps the candidate regions onto the convolutional feature maps and extracts feature maps of the same size for candidate regions of different sizes; finally, the layers that follow the ROI pooling layer extract the final features, from which the classification scores and the refined locations of the detection boxes are predicted.
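The fixed-size output of spatial pyramid pooling can be demonstrated with a short NumPy sketch. The pyramid levels (1, 2, 4), the use of max pooling and the function name are assumptions made for illustration and do not reproduce the exact configuration of [41].

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map over n x n grids and concatenate the results.

    The output length is C * sum(n * n for n in levels), whatever H and W are.
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges are chosen so that the n bins cover the whole map even
        # when H or W is not divisible by n (neighbouring bins may overlap).
        row_edges = np.linspace(0, height, n + 1)
        col_edges = np.linspace(0, width, n + 1)
        for i in range(n):
            r0 = int(np.floor(row_edges[i]))
            r1 = int(np.ceil(row_edges[i + 1]))
            for j in range(n):
                c0 = int(np.floor(col_edges[j]))
                c1 = int(np.ceil(col_edges[j + 1]))
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)
```

With these levels, an 8-channel feature map always yields a vector of length 8 * (1 + 4 + 16) = 168, so feature maps from images of different sizes can feed the same fully connected layers.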
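The ROI pooling step, which maps candidate regions of different sizes to feature maps of the same size, can be sketched as follows. The total stride of 16, the 7 x 7 output grid, the rounding of box coordinates and the function name are assumptions for this example rather than details confirmed by the text above.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7), stride=16):
    """Max-pool the part of a C x H x W feature map covered by an image-space box.

    roi is (x1, y1, x2, y2) in input-image pixels. The result always has
    shape C x output_size[0] x output_size[1].
    """
    channels, fm_h, fm_w = feature_map.shape
    out_h, out_w = output_size
    # Project the box from image coordinates onto the feature map.
    x1, y1, x2, y2 = (int(round(v / stride)) for v in roi)
    x1 = min(max(x1, 0), fm_w - 1)
    y1 = min(max(y1, 0), fm_h - 1)
    x2 = min(max(x2, x1 + 1), fm_w)  # keep at least one cell
    y2 = min(max(y2, y1 + 1), fm_h)
    region = feature_map[:, y1:y2, x1:x2]
    _, h, w = region.shape
    rows = np.linspace(0, h, out_h + 1)
    cols = np.linspace(0, w, out_w + 1)
    out = np.empty((channels, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        r0, r1 = int(np.floor(rows[i])), int(np.ceil(rows[i + 1]))
        for j in range(out_w):
            c0, c1 = int(np.floor(cols[j])), int(np.ceil(cols[j + 1]))
            out[:, i, j] = region[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```

Because the convolutional feature map is computed once per image, pooling 2000 candidate regions only repeats this inexpensive step rather than the full convolution.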
Fast-RCNN framework for object detection and classification.
The improvements of Fast-RCNN can be summarized as follows: 1) the detection quality is higher than that of R-CNN; 2) a multi-task loss layer that combines the classification and bounding-box regression loss functions is introduced, and a single-stage training process is used, which improves the accuracy of the algorithm; 3) training is end-to-end except for candidate region extraction, and all features are kept in GPU memory without additional disk space; 4) following the idea of SPP-Net, a single fixed-scale ROI layer is proposed, and the fully connected layers are accelerated by SVD.
However, Fast-RCNN still has the following shortcomings: 1) the SS method is still used to extract candidate regions, and the associated feature extraction is still time-consuming; 2) truly end-to-end training and testing are not achieved, so the requirements of real-time detection cannot be met; 3) GPU acceleration is used, but candidate region extraction still runs on the CPU.
Faster-RCNN is an advanced improvement of Fast-RCNN [68]. It introduces the region proposal network (RPN), which takes only 10 ms to generate regions when run on a GPU together with the convolutional neural network. Region proposal generation and Fast-RCNN are merged into a single network model, forming a real-time object detection framework that improves detection speed and accuracy. The object detection structure of Faster-RCNN is shown in Fig. 5.
Ren et al. proposed the CNN-based Region Proposal Network (RPN) in 2017 [43]. The experimental results show that, compared with conventional region proposal algorithms such as Selective Search (SS), RPN achieves higher accuracy while proposing fewer regions (RPN: 300, Selective Search: 2000). The RPN was combined with Fast-RCNN and the two were trained alternately, yielding the unified detection model known as Faster-RCNN. In addition, Redmon et al. proposed a network model without region proposals, known as YOLO (You Only Look Once) [45]. This model predicts the likely locations of the objects in an image using regression. It achieves real-time performance on a GPU, with a detection speed of about 45 FPS (frames per second). Nevertheless, Faster-RCNN retains a clear advantage in detection accuracy. Its main shortcoming is that it does not reach real-time detection, because obtaining region boxes before classification requires a large amount of computation. This limitation led to a more advanced approach called Mask-RCNN [69]. The important features that distinguish region-based CNN object detectors are given in Table 1.
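The multi-task loss in point 2 combines a classification term with a box regression term for a single region. The sketch below assumes a log loss for classification, a smooth L1 loss for the four box offsets, class index 0 for background and a balancing weight `lam`; these names and choices are labeled assumptions, not a reproduction of the reference implementation.

```python
import numpy as np

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1 (less sensitive to outliers than L2)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus box regression loss for one region.

    The regression term is skipped for background regions (true_class == 0).
    """
    cls_loss = -np.log(class_probs[true_class])
    if true_class >= 1:
        loc_loss = smooth_l1(np.asarray(box_pred) - np.asarray(box_target)).sum()
    else:
        loc_loss = 0.0
    return cls_loss + lam * loc_loss
```

Training both terms jointly is what allows classification and box refinement to share one network instead of separate SVM and regressor stages.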
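How a proposal stage can hand a small, fixed budget of regions to the detector is sketched below. The sketch assumes that proposals are ranked by an objectness score and filtered with greedy non-maximum suppression before the top 300 are kept; the IoU threshold of 0.7 and the function names are assumptions for illustration, and boxes are assumed to have positive area.

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and an N x 4 array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def select_proposals(boxes, scores, top_n=300, iou_threshold=0.7):
    """Greedy NMS: keep the best-scoring box, drop its near-duplicates, repeat."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0 and len(keep) < top_n:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```

The returned indices refer to rows of `boxes`, ordered from the highest to the lowest score among the kept proposals.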
The distinctive characterization of CNN object detectors
Faster-RCNN framework for object detection and classification.
In recent years, many scholars have presented CNN models for vehicle detection in aerial images because of the excellent feature representation ability of CNNs. Ammour et al. proposed a deep learning based model for vehicle detection in unmanned aerial vehicle (UAV) images [70]. The model is analogous to the R-CNN framework described above; the key difference is that it uses segmentation rather than a region proposal algorithm. Image segmentation has two advantages: it is faster than Selective Search, and the extracted regions do not overlap with each other. However, the regions extracted by segmentation need post-processing before they are passed to the CNN. Qu et al. proposed a detection method for UAV images based on SPP-Net and a region proposal algorithm [71]. In both detection methods the CNN framework is used for feature extraction, so they share problems with R-CNN models. Regions are very important for accuracy; therefore, RCNN has been used to produce accurate bounding boxes [72]. These boxes are horizontal at the first stage and rotated at the second stage, an idea adopted from cascade RCNN [73], and the results verify that regressing targets in this way produces more precise results. Detection accuracy also depends on the characteristics of the dataset. More specifically, the average precision of Faster-RCNN has been improved for a fixed input size [74]. In region-based methods, SS plays a prominent role in enhancing accuracy. A systematic region proposal method has been proposed as an alternative to SS to reduce the processing time while retaining good accuracy [75].
The Faster-RCNN algorithm is one of the main object detection algorithms at present, but its speed does not meet real-time requirements. Algorithms such as YOLO and SSD have therefore gradually shown their superiority. Methods of this kind make full use of the idea of regression: they regress directly at multiple positions of the original image and output the bounding box and category of the object.
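One way to picture detection as regression is to encode each ground-truth box as a target attached to a cell of a regular grid laid over the image. The sketch below assumes a YOLO-style 7 x 7 grid, a 448 x 448 input and a (cell offset, relative size) encoding; the function name and these values are illustrative assumptions rather than details stated in this text.

```python
def encode_box_on_grid(box, image_size, grid=7):
    """Return the responsible grid cell and the regression targets for one box.

    box is (x1, y1, x2, y2) in pixels and image_size is (width, height).
    The targets are the box center offset inside its cell and the box width
    and height relative to the image, all in the range [0, 1].
    """
    x1, y1, x2, y2 = box
    img_w, img_h = image_size
    gx = (x1 + x2) / 2 * grid / img_w  # box center in grid units
    gy = (y1 + y2) / 2 * grid / img_h
    col = min(int(gx), grid - 1)       # cell that contains the center
    row = min(int(gy), grid - 1)
    targets = (gx - col, gy - row, (x2 - x1) / img_w, (y2 - y1) / img_h)
    return row, col, targets
```

A network trained on such targets outputs, for every cell at once, box coordinates and class scores in a single forward pass, with no separate region proposal stage.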
You only look once (YOLO)
The YOLO algorithm, proposed by Redmon et al. in 2016, is a convolutional neural network that predicts multiple box locations and classes at once. Its network design follows the core idea of GoogLeNet [76] and achieves end-to-end object detection in real time. It takes full advantage of its speed, but at the cost of reduced accuracy. The YOLO9000 algorithm [77], also proposed by Redmon et al. in 2016, improves accuracy while retaining the speed of the original YOLO algorithm. There are two main improvements: 1) a series of refinements to the original YOLO detection framework compensates for its lack of detection accuracy; 2) a method for jointly training on object detection and classification data is proposed. During training, the YOLOv2 network [78] can dynamically adjust its input size in steps determined by its down-sampling factor. This mechanism lets the network make predictions on images of different sizes and balance detection speed and accuracy. The YOLO framework treats detection as a regression problem. It transforms an image into S
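The grid-based regression idea can be illustrated with a minimal sketch, assuming the S × S grid of the original YOLO formulation with S = 7 and a 448 × 448 input. The function name `responsible_cell` and its parameters are hypothetical and not taken from any YOLO implementation. Each ground-truth box is assigned to the grid cell that contains its centre, and the centre is then encoded as an offset within that cell.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Locate the grid cell that contains a box centre.

    cx, cy       : box centre in pixels
    img_w, img_h : image size in pixels
    S            : number of grid cells per side (assumed 7, as in the original YOLO)

    Returns (row, col, x_offset, y_offset); the offsets are the position
    of the centre inside the cell, measured in cell units.
    """
    # Centre position measured in grid-cell units.
    gx = cx / img_w * S
    gy = cy / img_h * S

    # Clamp so that a centre lying exactly on the right or bottom border
    # still falls into the last cell.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)

    return row, col, gx - col, gy - row


row, col, x_off, y_off = responsible_cell(224, 224, 448, 448)
print(row, col, x_off, y_off)  # 3 3 0.5 0.5
```

In this formulation only the cell returned here is responsible for predicting the object, which is why a single forward pass can output all boxes at once.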
Single shot multibox detector (SSD)
In 2016, Liu et al. proposed the SSD algorithm [46], which combines YOLO’s regression idea with the anchor mechanism of Faster-RCNN to achieve both speed and accuracy. The original YOLO algorithm recognizes objects by dividing the image into S
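As an illustration of the anchor mechanism, the sketch below computes default box scales and shapes using the linear scale rule described in the SSD paper, s_k = s_min + (s_max - s_min)(k - 1)/(m - 1) for k = 1, ..., m. The values s_min = 0.2, s_max = 0.9 and six feature maps are the example settings from that paper; the function names and the chosen aspect ratios are assumptions for illustration, and the extra box that SSD adds for aspect ratio 1 is omitted.

```python
import math


def default_box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced scales, one per feature map (relative to image size)."""
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) of each default box for one feature map.

    Width is scale * sqrt(ratio) and height is scale / sqrt(ratio),
    so every box keeps the same area, scale ** 2.
    """
    return [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]


scales = default_box_scales(6)
for s in scales:
    print(round(s, 2), [(round(w, 3), round(h, 3)) for w, h in default_box_shapes(s)])
```

Shallow feature maps receive the small scales and deep feature maps the large ones, which is how SSD covers objects of different sizes without a separate region proposal step.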
Statistical assessments of feasibility of object detectors.
Finally, based on the related work reviewed above, we observe that the sliding window approach is the simplest to implement, since it essentially repeats image classification over many windows, but it can be very slow in both training and testing. Its accuracy depends on the effectiveness of the classification network, and building such a mature network requires expertise. The region proposal approach reduces training and testing time compared with conventional sliding window approaches and helps improve detection accuracy. Indeed, Faster-RCNN attains the highest accuracy among the compared methods [81]. In terms of training and testing time, SSD is considerably faster than the state-of-the-art methods because it eliminates the region proposal step, but at the cost of lower accuracy than the techniques that use region proposals.
Object detection is a fundamental and challenging topic in the field of optical remote sensing imagery. Several substantial contributions to object detection have been proposed during the past few years. This paper presents a brief review of the current progress in object detection. We have broadly discussed the region-based and regression-based object detection approaches; more specifically, we have briefly discussed R-CNN, Fast-RCNN, Faster-RCNN, SSD and YOLO. In addition, we have summarized the current limitations and challenges, which should help future researchers find and fill the gaps in this research.
Object detection is the premise of high-level visual tasks such as scene content understanding. Static image object detection faces many difficulties, such as object deformation, object occlusion and small object detection. Solving these problems more effectively and reasonably is an important direction for future research. Traditional object detection methods that are effective for specific problems (object deformation, occlusion, etc.) can be introduced into deep networks to strengthen their ability to handle these problems and improve detection performance. The neurons of the corresponding network layer could be organized appropriately to obtain the required results, since some operations need to be carried out on a single neuron and others on different areas of the feature map. Backpropagation requires derivatives, and approximations of these derivatives are obtained by limited methods. The local network structure can then be formed by combining different network-layer operations.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant 61471148.
Conflict of interest
The authors declare no conflict of interest.