Abstract
Salient object detection plays a vital role in image processing applications such as image retrieval, security and real-time surveillance. In recent years, advances in deep neural networks have gained attention in automatic learning systems for various computer vision applications. To reduce the detection error and detect objects effectively, we propose a detection classifier that learns the features of the object using a deep neural network, the convolutional neural network (CNN), together with the discrete quaternion Fourier transform (DQFT). Before the CNN stage, the image is pre-processed by the DQFT so that the three color channels are handled holistically, which avoids loss of image information and in turn makes object detection more effective. The features of the image are learned by training the CNN model; the convolution is carried out in the Fourier domain to reduce computation time, and the image is converted back to the spatial domain before the fully connected layer. The proposed model is evaluated on the HDA and INRIA benchmark datasets. The results show that convolution in the quaternion Fourier domain speeds up evaluation and improves the detection rate. A comparative study is carried out against CNN, discrete Fourier transform CNN (DFTCNN), RCNN and Mask RCNN.
Keywords
Introduction
Object detection is the localization of a targeted object in an image or video. Its purpose is to understand and analyze scenes in images or videos, for example by tracking or counting objects [1, 2]. The detected object is denoted by a bounding box or by a pixel mask of the object [3, 4]. Applications include vehicle detection, human behavior identification, machine inspection, image retrieval and security-related problems. The hardware device for these applications is the smart video camera. Many challenges occur while capturing images in indoor and outdoor areas, such as illumination variation, appearance changes of moving objects, occlusion, complex backgrounds and shadows. Before the deep learning era, an object was detected in three steps: 1) proposal generation, 2) feature extraction and 3) region classification. Proposal generation searches for the locations of objects in the image, namely the regions of interest (ROI). Within each ROI, features are extracted by a sliding window process using feature descriptors such as Gradient Location and Orientation Histogram (GLOH) [5], Speeded Up Robust Features (SURF) [6], scale-invariant feature transform (SIFT) and principal component analysis (PCA). Region classification labels the covered regions; the techniques used to classify regions include AdaBoost, SVM and cascade learning. A large number of false positives in proposal generation means that complex image features cannot be handled, and because each step is optimized separately, there is no global optimization of the whole pipeline. To overcome these limitations, deep learning approaches [7, 8] have achieved remarkable progress in computer vision applications.
The handcrafted feature descriptors [9] used in classical models have been replaced by feature descriptors learned automatically from raw pixels while training a deep learning model, namely the CNN, a biologically inspired model that extracts both the high-level and low-level features of the image.
In this paper, the image is processed in the Fourier domain to make the CNN more computationally efficient. Different kinds of Fourier transform are now used in many computer vision applications to pre-process the image for better results [31]. The conventional Fourier transform cannot handle a color image holistically. Hence, in this paper we use quaternions, four-dimensional hypercomplex numbers in which the three color channels form the imaginary parts and the real part is zero. The discrete quaternion Fourier transform (DQFT) handles the three color channels of each pixel simultaneously, which makes it suitable for real-time as well as large-scale offline processing and improves computational efficiency. The features of the pre-processed images are then learned by the CNN, and finally the image is transformed back into the spatial domain to yield the detected object. Fig. 1 shows the process flow diagram of the proposed work.

Process flow diagram.
The rest of the paper is structured as follows: Section 2 presents related work on current methods of object detection, Section 3 gives the background definitions of the methods used in the proposed system, Section 4 discusses the proposed work, Section 5 gives the experimental results with a comparative analysis, and Section 6 gives the conclusion.
In the literature, object detection mainly focuses on the detection of features and also behaves like a classifier [10]. For video frames, background subtraction is a pre-processing step for efficient object detection [11]. Classical examples include Haar features with a sliding window for human face detection and HOG features with an SVM for pedestrian detection [12]. Deep learning approaches to object detection fall into two categories [13]: (a) two-stage detectors and (b) one-stage detectors. Category (a) detects the object through proposal generation followed by region classification; the Region-based Convolutional Neural Network (RCNN) is one example. Category (b) takes all the image pixels as a single input to the model, so it does not require a two-step process. Sermanet et al. developed object detection using a one-stage method [14]. AlexNet is a CNN architecture consisting of five convolutional layers and three fully connected layers [15]. GoogLeNet, developed by Google, is a deep convolutional network that applies convolution and pooling layers in parallel [16, 17]. GoogLeNet contains Inception modules, which perform convolutions of various sizes and concatenate the filter outputs for the next layer; AlexNet does not support filter concatenation. Chu implemented a CNN model for unary object proposal and for feature extraction of the object [18]. Chen combines a region proposal network with object detection, using a feature extraction layer to detect the object categories [19]. Li applied CNNs in a coarse and a fine detection phase: the coarse phase detects the candidate regions that contain objects, and in the fine phase the candidate regions are cropped and classified as objects or background [20]. Kang developed a temporal convolution framework that uses temporal information for video-based object detection [21].
Quaternion convolutional neural networks have been used for color image classification and denoising, theme identification of telephone conversations between agents and customers [22, 23], image reconstruction [24], image forensics [25] and the detection of lake, grass, forest and town areas [26]. Ren et al. detect objects using two modules: the first is a deep fully convolutional network that proposes regions, and the second is the object detector [38]. Mask RCNN is a two-stage pipeline, with a region proposal network (RPN) as the first stage and a second stage that predicts the class and box offset [39]. Our proposed method reduces the computational complexity of the methods discussed above while detecting objects more efficiently.
Background
Quaternion complex number
A quaternion [27] is a four-dimensional hypercomplex number Q = a + bi + cj + dk, or (a, b, c, d), with one real unit and three imaginary units i, j, k, where a, b, c, d are real numbers, subject to the multiplication laws ij = -ji = k; jk = -kj = i; ki = -ik = j; i^2 = j^2 = k^2 = ijk = -1.
Let Q1 = a1 + b1i + c1j + d1k and Q2 = a2 + b2i + c2j + d2k. The addition of two quaternions is componentwise: Q1 + Q2 = (a1 + a2) + (b1 + b2)i + (c1 + c2)j + (d1 + d2)k.
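A minimal sketch of quaternion arithmetic may help to make the multiplication laws concrete. It assumes componentwise addition and the Hamilton product that follows from the laws above; the class name `Quaternion` and the constants `I`, `J`, `K` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quaternion:
    """Q = a + b*i + c*j + d*k with real coefficients a, b, c, d."""
    a: float = 0.0
    b: float = 0.0
    c: float = 0.0
    d: float = 0.0

    def __add__(self, o: "Quaternion") -> "Quaternion":
        # Addition is componentwise.
        return Quaternion(self.a + o.a, self.b + o.b, self.c + o.c, self.d + o.d)

    def __mul__(self, o: "Quaternion") -> "Quaternion":
        # Hamilton product, expanded with ij = k, jk = i, ki = j and i^2 = j^2 = k^2 = -1.
        return Quaternion(
            self.a * o.a - self.b * o.b - self.c * o.c - self.d * o.d,
            self.a * o.b + self.b * o.a + self.c * o.d - self.d * o.c,
            self.a * o.c - self.b * o.d + self.c * o.a + self.d * o.b,
            self.a * o.d + self.b * o.c - self.c * o.b + self.d * o.a,
        )


# Illustrative unit quaternions.
I = Quaternion(0, 1, 0, 0)
J = Quaternion(0, 0, 1, 0)
K = Quaternion(0, 0, 0, 1)
```

Because ij = k while ji = -k, the product is not commutative, so the order of the factors matters whenever quaternion-valued kernels and images are multiplied.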
The discrete color image f(m, n) can be transformed into a quaternion matrix by encoding the red, green and blue channels as a pure quaternion with zero real part, i.e., f(m, n) = R(m, n)i + G(m, n)j + B(m, n)k.
Conversion of 2D image into quaternion image
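The encoding of a color image as a pure quaternion image can be sketched with NumPy as follows. The array layout (height, width, 4), holding the coefficients (a, b, c, d) per pixel, and the function name `rgb_to_pure_quaternion` are assumptions made for illustration only.

```python
import numpy as np


def rgb_to_pure_quaternion(image: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) RGB image to an (H, W, 4) array of (a, b, c, d) with a = 0."""
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an (H, W, 3) RGB array")
    h, w, _ = image.shape
    q = np.zeros((h, w, 4), dtype=np.float64)
    q[..., 1:] = image  # R -> i, G -> j, B -> k; the real part stays zero
    return q


# Tiny 1 x 2 demo image: one pure red pixel and one mixed green/blue pixel.
demo = np.array([[[255, 0, 0], [0, 128, 64]]], dtype=np.uint8)
q_demo = rgb_to_pure_quaternion(demo)
```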
For any pure quaternion μ1 = μ(1, 1) i + μ(1, 2) j + μ(1, 3) k, p = p(1, 1) i + p(1, 2) j + p(1, 3) k, the pixel intensity of quaternion image is obtained by
Let
The Fourier transform is a fundamental transform used in both signal and image processing. It converts a signal (in sample format) or an image (in pixel format) from the spatial domain into the amplitudes and phases of sinusoids in the frequency domain [30, 31]. The discrete Fourier transform (DFT) is the sampled Fourier transform: it does not contain all frequencies, only a set of samples large enough to fully describe the spatial-domain image. The size of the Fourier image is equal to the size of the image in the spatial domain. The discrete quaternion Fourier transform (DQFT) [32–34] generalizes the Fourier transform by using quaternions in place of complex numbers; it is computed through the symplectic decomposition, which is explained below.
Symplectic decomposition
Once the image has been converted into a pure quaternion image, it is decomposed into two components: the first contains the real part and the red component, and the second contains the green and blue components. The 2D DFT is applied to each part separately, and the combined results of the two DFTs give the DQFT, as outlined in the following steps; the respective images are shown in Fig. 2:

From left to right: Original image, (i, j, k) component for processing the quaternion image, Discrete quaternion Fourier transform of image.
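The decomposition described above can be sketched with NumPy under simplifying assumptions that the paper does not state: the transform axis is fixed to i, the left-sided form of the transform is used, and no normalization is applied on the forward transform (NumPy's `fft2` convention). With axis i, a quaternion image splits as f = (a + bi) + (c + di)j, so two ordinary complex 2D DFTs suffice. The function names `dqft_axis_i` and `idqft_axis_i` are hypothetical.

```python
import numpy as np


def dqft_axis_i(q: np.ndarray) -> np.ndarray:
    """Left-sided DQFT with transform axis i of an (H, W, 4) quaternion image."""
    first = q[..., 0] + 1j * q[..., 1]    # real part and red (i) component
    second = q[..., 2] + 1j * q[..., 3]   # green (j) and blue (k) components
    f1 = np.fft.fft2(first)               # complex 2D DFT of the first part
    f2 = np.fft.fft2(second)              # complex 2D DFT of the second part
    # Recombine as F = f1 + f2 * j, stored as the coefficients (a, b, c, d).
    return np.stack([f1.real, f1.imag, f2.real, f2.imag], axis=-1)


def idqft_axis_i(spectrum: np.ndarray) -> np.ndarray:
    """Inverse of dqft_axis_i, returning the spatial-domain quaternion image."""
    f1 = np.fft.ifft2(spectrum[..., 0] + 1j * spectrum[..., 1])
    f2 = np.fft.ifft2(spectrum[..., 2] + 1j * spectrum[..., 3])
    return np.stack([f1.real, f1.imag, f2.real, f2.imag], axis=-1)


# Demo on a small random color image encoded as a pure quaternion image.
rng = np.random.default_rng(0)
rgb = rng.random((4, 6, 3))
q_img = np.zeros((4, 6, 4))
q_img[..., 1:] = rgb
spectrum = dqft_axis_i(q_img)
restored = idqft_axis_i(spectrum)
```

The spectrum has the same size as the input, and the inverse transform recovers the original quaternion image up to floating-point error. A general transform axis, such as the gray line (i + j + k)/√3, would need an additional change of basis that this sketch omits.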
Object feature extraction plays an important role in object detection. There are many feature extractors, such as PCA, SIFT and HOG [35]. A convolutional neural network is a deep learning method that extracts the features of the image. To detect the targeted object, the network must first be trained on the object features; once trained, it can detect the targeted object easily. A CNN has three types of layers: 1) the convolutional layer, 2) the pooling layer and 3) the fully connected layer. They are explained here for a quaternion image, and the process flow of the CNN architecture is shown in Fig. 3.

CNN architecture.
This section briefly explains our proposed model. Once the pre-processing has been done using the DQFT, the whole color image becomes a 2D pure quaternion matrix. The convolution product (⊙) is the basic step in a CNN: the kernel matrix K(m, n) of order X × X is applied in the convolutional layer. The quaternion Fourier convolution
In the fully connected layer, the following two equations are used to connect the neurons of one layer to the next.
The CNN architecture is used for both training and testing. 75% of the image dataset is used for training; precision improves only when the training images outnumber the testing images. Transfer learning is the reuse of a previously trained model for a new problem. Its advantages include reduced training time, better network performance and a smaller data requirement. In deep learning, this method first trains a CNN for a particular task such as object classification or object detection. The pre-trained ImageNet model comprises the network architecture and its learned weights. This pre-trained model is used in our proposed method to define the initial weights of the CNN for the detection task. This lessens the computational cost of training a deep neural network from scratch and retains the necessary feature extractors trained during the initial stage.
Algorithm for training
Input: f (RGB image)
Output: Classified object with label
The remaining 25% of the images in the dataset are used for testing. Steps 1, 2, 5, 6, 7 and 8 of the training algorithm, without backward propagation, apply to the testing process, and the object is detected from the response map.
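The 75%/25% partition can be illustrated with a short sketch. The shuffling, the fixed seed and the file names are assumptions made for the example, since the paper does not describe how the images are assigned to each part.

```python
import random


def split_dataset(items, train_fraction=0.75, seed=0):
    """Shuffle the items reproducibly and split them into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names, used only to demonstrate the split.
image_names = [f"img_{i:03d}.png" for i in range(100)]
train_names, test_names = split_dataset(image_names)
```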
Experiment analysis
The experiments are conducted virtually in an Ubuntu environment using the Python programming language. The TensorFlow and Keras libraries are used to implement the CNN.
Data analysis
For the experimental analysis, we use the INRIA [40] and HDA [41] person datasets, which cover both static and moving persons. In these datasets, the training and testing data are provided separately. The training images are 96 x 160 pixels and the testing images are 70 x 134 pixels. To avoid boundary conditions during processing, the testing images are scaled to the same size as the training images [36]. For positive and negative training, 1208 and 1218 images are used respectively. The images are taken in indoor, outdoor, city, mountain and beach settings. The HDA person dataset is a high-definition surveillance dataset in which 18 cameras recorded videos of 30 minutes' duration at resolutions of 640 x 480, 1280 x 800 and 2560 x 1600. The dataset is analyzed in both the spatial and the Fourier convolution domain. In terms of computational complexity, when the image resolution is low the two approaches give more or less the same result; at higher resolutions, the quaternion Fourier convolution has a major impact, reducing the computation time compared with spatial convolution.
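The saving attributed to Fourier-domain convolution rests on the convolution theorem: circular convolution in the spatial domain equals pointwise multiplication of the DFTs. The sketch below checks this with NumPy on a single real-valued channel. It is an illustration only, not the quaternion convolution of the proposed model, and the function names are hypothetical. CNN libraries usually compute a non-circular cross-correlation, so an actual implementation would also need padding and kernel flipping.

```python
import numpy as np


def circular_conv2d_spatial(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Direct circular convolution, costing O(H * W * kh * kw) multiplications."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h, w))
    for m in range(h):
        for n in range(w):
            for p in range(kh):
                for q in range(kw):
                    out[m, n] += kernel[p, q] * image[(m - p) % h, (n - q) % w]
    return out


def circular_conv2d_fourier(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """The same convolution via the DFT, costing O(H * W * log(H * W))."""
    h, w = image.shape
    kernel_spectrum = np.fft.fft2(kernel, s=(h, w))  # zero-pads the kernel to (H, W)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * kernel_spectrum))


rng = np.random.default_rng(1)
test_image = rng.random((8, 8))
test_kernel = rng.random((3, 3))
spatial_result = circular_conv2d_spatial(test_image, test_kernel)
fourier_result = circular_conv2d_fourier(test_image, test_kernel)
```

The cost of the direct method grows with the kernel area, whereas the Fourier route does not depend on the kernel size, which is consistent with the larger gains reported at higher resolutions.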
Comparative analysis
The performance of the proposed method is compared with the classical CNN [37], which takes the raw spatial image as input and extracts the object features from it, and with the discrete Fourier transform (DFT) based CNN (DFTCNN); the resulting images are displayed in Fig. 4. As the dataset grows larger, processing the spatial image requires more computation time and memory. In DFTCNN, the spatial image is transformed into the Fourier domain, which reduces the computation time. However, the r, g and b values are still handled as individual components, so the computation and memory requirements remain a problem. To overcome this, the proposed method uses the DQFT, which combines quaternions with the Fourier transform to handle the RGB components simultaneously. The DQFT image is then given to the CNN, which addresses the limitations of the previous models (CNN, DFTCNN). RCNN [38] integrates a CNN with a region proposal method; its training is expensive for larger datasets and its testing process is slow. Mask RCNN [39] is a simple, effective framework for object instance segmentation that overcomes the shortfall of RCNN, but its computational complexity is a major drawback in real-time applications. Our proposed method, DQFTCNN, is cost-efficient and well suited to real-time applications.

Object Detection Results of DQFTCNN with other Current Methods.
Jin et al. [42] used the quaternion Fourier transform in the spatial domain for orientation detection, which helps to dynamically update the vector median filter with respect to size and shape. Our proposed work, in contrast, uses the quaternion Fourier transform as a pre-processing step to enhance the image. To incorporate all the color channels in a single component, we use quaternions with the Fourier transform, and the CNN is used to train the human object features. Guo et al. [43] used the quaternion Fourier transform (QFT) to locate salient areas and to represent the image as spatio-temporal saliency. They extracted only the phase of the QFT, discarding the magnitude, and took intensity, color and motion features as the three imaginary components. In our proposed work, we take the r, g and b color components as the three imaginary components and use the whole QFT instead of the phase alone, because the aim of our work is to enhance the image, not to locate the object.
The performance of the proposed work is evaluated using the mean average precision (mAP). After the precision and recall have been calculated, the mAP is computed from the average precision. Precision (P) is the ratio of correctly detected objects to the total number of true positive and false positive detections. Recall (R) is the ratio of correctly detected objects to the total number of true positive and false negative objects in the image. With TP, FP and FN denoting the numbers of true positives, false positives and false negatives, P = TP / (TP + FP) and R = TP / (TP + FN).
In Table 1, the 2nd, 3rd and 4th columns represent the average precision values of the images. Fig. 5 shows the mAP results for the different methods.
mAP evaluation metric for different methods and the proposed method on the INRIA and HDA datasets
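As a hedged illustration of the metric, the sketch below computes precision, recall, a per-class average precision and their mean. The paper does not state which variant of average precision it uses, so the uninterpolated area under the precision-recall curve is assumed here, and all function names are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); defined as 0 when there are no detections."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); defined as 0 when there are no ground-truth objects."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(detections, num_ground_truth: int) -> float:
    """Uninterpolated AP from (confidence, is_true_positive) pairs."""
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    previous_recall = 0.0
    for _, is_true_positive in ranked:
        if is_true_positive:
            tp += 1
        else:
            fp += 1
        current_recall = tp / num_ground_truth
        # Rectangle rule: precision at this rank times the recall gained.
        ap += precision(tp, fp) * (current_recall - previous_recall)
        previous_recall = current_recall
    return ap


def mean_average_precision(ap_values) -> float:
    """mAP is the mean of the per-class (or per-dataset) AP values."""
    values = list(ap_values)
    return sum(values) / len(values) if values else 0.0


# Made-up detections: two ground-truth objects, three ranked detections.
example_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], 2)
```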

Graphical representation of mAP.
Our proposed method of object detection, a CNN combined with the discrete quaternion Fourier transform (DQFT), processes the color image holistically, which leads to less computation and more efficient detection. The novelty of our work is that the DQFT-processed image is passed to the CNN. Further, background modeling is used for background subtraction to eliminate non-moving objects in the background. The experimental analysis shows that the results obtained with the proposed method are more efficient than those of prior works. The proposed work can be incorporated directly into real-time applications for security management and surveillance systems.
In future, the work will be extended with other algebraic approaches, such as octonions, and with other statistical modeling, which help to enhance the image by considering more features of the object and thereby ease CNN training for object detection. In addition, a GPU implementation would make the method more powerful when handling larger datasets.
Acknowledgments
The work of R. Sundara Rajan was partially supported by Project No. 2/48(4)/2016/NBHM-R&D-II/11580, National Board of Higher Mathematics (NBHM), Department of Atomic Energy (DAE), Government of India.
