Abstract
Video retrieval technology has drawn considerable attention over the years. Compared with low-level information such as color and edges, the text in a video carries rich semantic information and can summarize the video content well. Many scholars have proposed SVM-based methods to detect video text. For most of these methods, however, the feature dimension is too large, and both the time complexity and the detection effect remain to be improved. In this paper, a new SVM-based video text detection method using color, edge and HOG features is proposed. To address the limitations of single-frame detection, the detection effect is further improved by detecting over three adjacent frames. Video text detection is implemented through four steps: sample selection, feature extraction, model training, and text detection. Finally, experiments are performed to compare the proposed method with existing methods in the literature. The results show that the proposed single-frame and three-frame detection algorithms achieve high recall and precision, reduce the false detection rate, and improve the effectiveness of video text detection.
Introduction
The rapid development of social informatization has made the demand for all kinds of information increasingly urgent. Video plays a very important role in daily life because it carries rich semantic content in diverse forms of expression, and it has been widely applied in fields such as aerospace, video conferencing, safe cities, intelligent transportation, and education and training. Video has thus become one of the most important means of information dissemination. As more and more textual information is presented in video, many scholars have conducted in-depth research on video text and have shown that it carries very detailed semantic content, which not only summarizes the main information expressed by the video but also makes the video easier for people to understand. Most researchers have used color or edge features alone to detect video text, but the detection effect of a single feature is not ideal. Therefore, in order to improve the detection effect, this paper proposes an algorithm that combines color, edge and HOG features with a support vector machine (SVM) to detect video text. However, when the video background is complicated and its edges are pronounced, the false detection rate of single-frame detection is relatively high. In response to this deficiency, this paper extends single-frame detection and proposes a video text detection algorithm based on three adjacent frames.
At present, many scholars at home and abroad have studied the retrieval of video text in detail and have proposed many methods. The commonly used methods fall into three groups: methods based on edge features, methods based on texture features, and methods based on machine learning.
Method based on edge features
Cao et al. [1] proposed to preprocess the image first, then apply projection-based statistical feature analysis to the image and locate the text region in the video according to the final projection. Zhang [17] proposed to use the Sobel edge operator to obtain the edge map of the image, then process the edges with mathematical morphology and obtain the text areas based on prior knowledge about text. Yang et al. [12] proposed to first detect the candidate caption frames in the video according to the color features of the text, and then detect the region where the video text is located according to its edge features. Liang [7] detected key frames based on the difference between frames and detected edges with the Sobel operator. The detection result was projected horizontally, and candidate regions were screened according to text characteristics such as size and area to obtain the final video text area. After analyzing and comparing the effects of several edge detection operators, Yin et al. [13] chose the Sobel operator to obtain the edge information of the image, then analyzed the connected domains and filtered out non-text areas using text characteristics such as size and positional relationships, thus obtaining the final text. Piao [6] also started from the characteristics of text: after analyzing edge detection operators, the Sobel operator was used to obtain the edge map, morphological processing was then performed, and the arrangement rules of the text together with some prior conditions were used to determine the final text position.
Method based on texture features
Yang et al. [14] first used MSER to extract candidate text regions, and then removed non-text regions with a text classifier based on Hu moment features. Finally, according to the texture information of the text, feature values computed from the co-occurrence matrix were used to train a text classification model, which yields the final text area. Wu [9] proposed to use text edges and wavelet characteristics to obtain candidate video text regions, and then to train an SVM on the extracted text features to locate the video text accurately. Manjunath Aradhya et al. [5] proposed to use the wavelet transform and Gabor filters to extract texture features of text, and then to determine the exact text region based on the energy value obtained from the information entropy of the wavelet transform.
Method based on machine learning
With the emergence and rapid development of machine learning, it has been successfully applied in many fields. As a result, video text extraction algorithms that combine text features with machine learning methods have become possible. SVM is a machine learning method based on statistical learning theory. It copes well with problems that arise in video text detection, such as non-linearity, small sample sizes, and high feature dimensionality, and combining it with common video text feature extraction algorithms is of great value to the development of video text detection technology. Therefore, SVM-based video text detection can not only advance video text retrieval technology but also greatly improve its efficiency.
Yan [15] proposed to detect text by combining Gabor filters with an SVM. First, Gabor filters are used to obtain texture maps in four directions: horizontal, vertical and the two diagonals. The SVM is then trained on the resulting feature data, and the video text is detected with the obtained SVM classification model. Zeng et al. [16] proposed to first use the maximum gradient difference (MGD) to decompose an image into multiple images. The texture features of the text are then obtained with the proposed T-LBP and sent to the SVM for training, and the resulting SVM classification model is finally used to detect the text area. Liu [4] proposed to decompose the text image with the Haar wavelet and then extract texture features using LBP operators at different scales. After the multi-level wavelet transform and the LBP operations, a 36-dimensional feature vector is formed and sent to the SVM for training, and text is detected with the resulting SVM classification model.
In summary, most existing video text detection techniques first obtain the edge information of the image and then perform projection analysis on the edge map to obtain the final text area. Because of the complexity of this approach, methods that detect video text using SVM have emerged. Although SVM can improve the detection effect, many methods of this type are based directly on pixel values, so the feature dimension is too large and the detection speed is not ideal. Later, many scholars proposed to combine text color or edge features with SVM, but these methods rely on a single feature and still have high time complexity. In order to balance time complexity and detection effect, this paper proposes a method that combines multiple video text features with SVM: the HOG feature is combined with color and edge features to detect the text. Furthermore, in order to reduce the false detection rate, this paper proposes an algorithm based on the detection of three adjacent frames.
The proposed method
At present, many works combine the HOG feature with SVM for pedestrian detection and achieve good results. Based on this, this paper combines HOG features with color and edge features in order to overcome the shortcomings of previous algorithms, and proposes a new video text detection algorithm that combines multiple features with SVM. This section analyzes the color, edge, and HOG features and introduces the overall flow of SVM-based text detection, including sample selection, feature extraction, model training, and text detection. Finally, the experimental results and the evaluation criteria, compared against a reference algorithm, show that the proposed algorithm can effectively improve the precision of text detection and reduce the false detection rate.
Sample selection
Generally speaking, both the ratio of positive to negative samples and the choice of training samples affect the classification effect of SVM [10, 18]. For text detection in a video sequence, the SVM divides the image into text areas and non-text areas. Therefore, we first need to select a certain number of samples from the video for training. We randomly select 500 images with a resolution of 1280×720 from the video as training images. The text areas of each image are selected as positive samples, and each positive sample is resized to 16×16. The color, edge, and HOG features of each positive sample are then extracted and together constitute a 45-dimensional feature vector (5 color, 4 edge, and 36 HOG dimensions); each feature vector is one positive sample. In total, the positive sample feature data form a 1600×45 matrix: there are 1600 positive samples, each described by a 45-dimensional feature vector.
Choosing positive samples is relatively simple, but selecting negative samples reasonably is more difficult. Because the image background changes dynamically and randomly, no single background area without text can represent all non-text samples. We therefore need to select a large number of regions containing varied content as negative samples, so that they cover most non-text situations and are representative. Specifically, we randomly select non-text regions from the 500 sample images as negative samples, with the same 16×16 size as the positive samples, and compute the color, edge, and HOG features of each one. Each negative sample is therefore also a 45-dimensional feature vector, and the negative sample data form a 2100×45 matrix.
After obtaining the feature data for all positive and negative samples, we assign the label +1 to text blocks, which correspond to the positive sample data, and the label -1 to non-text blocks, which correspond to the negative sample data. The SVM classifier is trained on the 45-dimensional positive and negative sample data together with the label indicating whether each block is a text block.
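The assembly of the training data described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `build_training_set` is hypothetical, and the random arrays merely stand in for real feature data with the reported shapes (1600×45 positive, 2100×45 negative).

```python
import numpy as np

def build_training_set(pos_features, neg_features):
    """Stack positive and negative feature rows and build the +1 / -1 label vector."""
    pos = np.asarray(pos_features, dtype=np.float64)
    neg = np.asarray(neg_features, dtype=np.float64)
    if pos.shape[1] != neg.shape[1]:
        raise ValueError("positive and negative samples must share one feature dimension")
    X = np.vstack([pos, neg])
    # Text blocks are labeled +1, non-text blocks -1, in the same row order as X.
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    return X, y

# Placeholder data with the shapes reported in the text (random values, not real features).
rng = np.random.default_rng(0)
X, y = build_training_set(rng.random((1600, 45)), rng.random((2100, 45)))
print(X.shape, y.shape)  # (3700, 45) (3700,)
```

The resulting matrix and label vector are the kind of input a two-class SVM training routine expects.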
Feature extraction
Feature extraction plays an important role in video text detection and strongly influences detection quality. Therefore, selecting text features and deciding how to extract them is a key step in this paper. Comparative analysis shows that, relative to the background, text has distinctive color, edge, and HOG characteristics. Many works detect video text by extracting only text color or edge features; because a single feature is relatively simple, the detection effect is not ideal. This paper therefore analyzes and improves on the previous algorithms and proposes a new feature combination to detect video text.
Color-based feature extraction
Because text has distinctive color features compared with the background, this paper extracts the color features of the text in the video for detection. In 1995, Stricker and Orengo proposed a simple and effective way of describing color features: color moments. This method computes the first-order moment (mean), second-order moment (variance) and third-order moment (skewness) of each channel of an image [8]. The underlying idea is that the color distribution of an image can be represented by its moments, and because the low-order moments already express most of the color information, the first three moments are sufficient. We extract the first-order moments of the three channels of the YUV image and the second- and third-order moments of the Y channel to form a 5-dimensional feature vector. Compared with other color features, this feature vector has a low dimension and requires less training time. In the standard formulation of [8], where the second- and third-order moments are taken as the square root and cube root of the corresponding central moments, the three color moments are given in Equation (1):
$$
\mu_i = \frac{1}{N}\sum_{j=1}^{N} p_{ij}, \qquad
\sigma_i = \left(\frac{1}{N}\sum_{j=1}^{N} (p_{ij}-\mu_i)^2\right)^{1/2}, \qquad
s_i = \left(\frac{1}{N}\sum_{j=1}^{N} (p_{ij}-\mu_i)^3\right)^{1/3} \tag{1}
$$
where $p_{ij}$ is the value of the $j$-th pixel in channel $i$ and $N$ is the number of pixels.
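A minimal NumPy sketch of the 5-dimensional color feature may help make the construction concrete. It assumes the 16×16 block has already been converted to YUV with channels ordered Y, U, V, and it takes the second- and third-order moments in the usual Stricker-Orengo form (standard deviation and cube root of the third central moment). The function name `color_moment_features` is hypothetical.

```python
import numpy as np

def color_moment_features(yuv_block):
    """Return [mean_Y, mean_U, mean_V, std_Y, skew_Y] for an H x W x 3 YUV block."""
    yuv = np.asarray(yuv_block, dtype=np.float64)
    # First-order moments of all three channels.
    means = yuv.reshape(-1, 3).mean(axis=0)
    # Second- and third-order moments of the Y channel only.
    centered = yuv[..., 0].ravel() - means[0]
    std_y = np.sqrt(np.mean(centered ** 2))
    skew_y = np.cbrt(np.mean(centered ** 3))  # cbrt keeps the sign of the third moment
    return np.concatenate([means, [std_y, skew_y]])
```

Using `np.cbrt` rather than a fractional power avoids invalid results when the third central moment is negative.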

Color moment feature extraction flow chart.
Generally speaking, text has obvious and stable edges in static regions: its edges are rich, and the proportions of edges in the different directions are almost equal. The text edge is therefore an important feature for distinguishing text regions from non-text regions. Different edge detection operators extract different edge information from an image, and for text the horizontal, vertical, 45-degree and 135-degree edges account for a large proportion of all edges. Detecting edges in these four directions therefore describes the edge characteristics of text well, and we use the Sobel operator to do so. The Sobel edge detection operators used are shown in Fig. 2. The edge map contains rich edge information, but it cannot be used directly as an edge feature, so we need to extract features from it. Because the standard deviation represents the edge intensity in the corresponding direction of the edge map, we process the image with the edge detection operators in the four directions and represent the edge characteristics by the standard deviations of the four directional edge gradient maps. The edge feature is thus a 4-dimensional vector. The edge feature extraction flow chart is shown in Fig. 3.

Sobel edge detection operator.

Edge feature extraction flowchart.
The first two operators shown in Fig. 2 are the vertical and horizontal edge operators, which respond to the vertical and horizontal edge information in the image. The last two operators in Fig. 2 detect edge information in the 135° and 45° directions.
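The 4-dimensional edge feature can be sketched as follows. The exact masks of the Sobel figure are not reproduced in the text, so the four 3×3 kernels below are commonly used directional Sobel variants and should be read as an assumption, as should the names `SOBEL_KERNELS`, `filter3x3` and `edge_features`. Which diagonal kernel is called 45° and which 135° is likewise only a naming convention here.

```python
import numpy as np

# Commonly used 3x3 directional Sobel masks (assumed, not taken from the paper's figure).
SOBEL_KERNELS = {
    "horizontal": np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=np.float64),
    "vertical": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64),
    "diag_45": np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]], dtype=np.float64),
    "diag_135": np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], dtype=np.float64),
}

def filter3x3(img, kernel):
    """Correlate a 2-D image with a 3x3 kernel, keeping only fully covered pixels."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out

def edge_features(gray_block):
    """Standard deviation of each directional edge map, giving a 4-dimensional vector."""
    img = np.asarray(gray_block, dtype=np.float64)
    return np.array([filter3x3(img, k).std() for k in SOBEL_KERNELS.values()])
```

For a block containing only a vertical step edge, the horizontal-edge entry is zero while the vertical-edge entry is positive, which is the directional behavior the feature relies on.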
Dalal and Triggs proposed the HOG (histogram of oriented gradients) descriptor at the CVPR conference in 2005 [2]. This method computes the gradient magnitude and gradient direction within each block region and forms the corresponding HOG feature from them. Text in video tends to have sharp edges and characteristic gradient directions. We can therefore distinguish text regions from the background by extracting HOG features, and so detect the region where the text is located. The combination of HOG features and support vector machines has been widely used in pedestrian detection, where it has achieved very good results. Following that use of HOG with SVM, this paper applies the combined HOG, color, and edge features with SVM to the detection of video text.
In this paper, we define the size of each cell as 8×8 pixels, and one block consists of 2×2 cells, so the size of a block is 16×16. A block diagram is shown in Fig. 4. Because the positive and negative samples are 16×16 images, each sample is exactly one block. We divide the gradient directions of each cell into 9 bins and accumulate the gradients of the cell over these 9 directions, so each cell yields a 9-dimensional feature. Because a block consists of 4 cells and each cell has a feature dimension of 9, the final feature vector of a block has 4×9 = 36 dimensions. Figure 5 shows the flowchart of HOG-based feature extraction.
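A simplified sketch of the 36-dimensional block descriptor is given below. It is not the full Dalal-Triggs pipeline: it assumes unsigned gradient directions in [0°, 180°), hard assignment of each pixel to a single bin (no interpolation), centered differences for the gradient, and L2 normalization over the block. The function name `hog_block_features` and the parameter `eps` are hypothetical.

```python
import numpy as np

def hog_block_features(gray_block, cell=8, bins=9, eps=1e-6):
    """HOG descriptor of one block: 2x2 cells of 8x8 pixels, 9 bins each, 36 values."""
    img = np.asarray(gray_block, dtype=np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    # Centered differences; border pixels keep a zero gradient.
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned direction
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    feats = []
    for cy in range(0, img.shape[0], cell):
        for cx in range(0, img.shape[1], cell):
            idx = bin_idx[cy:cy + cell, cx:cx + cell].ravel()
            w = mag[cy:cy + cell, cx:cx + cell].ravel()
            # Magnitude-weighted histogram of directions for this cell.
            feats.append(np.bincount(idx, weights=w, minlength=bins))
    v = np.concatenate(feats)
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
```

With a 16×16 input the loops visit exactly four cells, so the output length is 4×9 = 36, matching the dimension stated in the text.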

Block diagram.

HOG feature extraction flow chart.
A single frame based SVM training
We choose the radial basis function (RBF) as the kernel function of the SVM to train the single-frame SVM classification model. The color, edge, and HOG features are extracted from all positive and negative samples, giving a 45-dimensional feature vector per sample. Positive samples are labeled +1 and negative samples are labeled –1, and the feature vectors and their class labels (+1, –1) are sent to the training function. Training yields the single-frame SVM classifier.
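For readers unfamiliar with the RBF kernel, the quantity the SVM evaluates between two feature vectors is exp(-γ‖x − z‖²). The sketch below computes the full kernel matrix in NumPy; it is only an illustration of the kernel itself, not of the training routine, and the kernel width `gamma` is a free parameter whose value is not reported in the text.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq, 0.0))
```

Identical feature vectors give a kernel value of 1, and the value decays toward 0 as two 45-dimensional vectors move apart, which is what lets the classifier draw a non-linear boundary between text and non-text blocks.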
The adjacent three frames based SVM training
The three-frame SVM classifier is trained in the same way as the single-frame classifier. From the sample data of three adjacent frames, a 135-dimensional vector is formed from the color, edge, and HOG features taken at the same position in the three adjacent frames of the video. In the single-frame case, each 45-dimensional feature vector corresponds to one classification label (+1, –1); in the three-frame case, each 135-dimensional feature vector corresponds to one classification label (+1, –1). We again select the RBF kernel and input all feature data and the corresponding labels into the fitcecoc function for training, which yields the SVM classification model based on three adjacent frames.
Text detection
Text detection based on a single frame
For the video sequence to be detected, we set the sliding window size to 16×16, consistent with the sample size, and the sliding step of the window to 16. The sliding window traverses each area of the image, and the feature data of each window are extracted by computing its color, edge and HOG features. The feature data are then passed to the trained SVM classification model, which decides whether the window contains text. Following the classification labels defined earlier, text regions are marked +1 and non-text regions are marked -1. Figure 6 shows the text detection flow chart.
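The traversal described above can be sketched as follows. The two callables `extract_features` and `classify` are hypothetical stand-ins for the 45-dimensional feature extraction and the trained SVM decision; they are not part of any specific library, and the function names are illustrative only.

```python
import numpy as np

def sliding_windows(height, width, size=16, step=16):
    """Yield the (top, left) corner of every window that fits inside the frame."""
    for top in range(0, height - size + 1, step):
        for left in range(0, width - size + 1, step):
            yield top, left

def detect_text_blocks(frame, extract_features, classify, size=16, step=16):
    """Return a grid of +1 / -1 labels, one per window position."""
    h, w = frame.shape[:2]
    rows = (h - size) // step + 1
    cols = (w - size) // step + 1
    labels = np.full((rows, cols), -1, dtype=int)
    for top, left in sliding_windows(h, w, size, step):
        window = frame[top:top + size, left:left + size]
        # classify() is assumed to return +1 for text and -1 for non-text.
        labels[top // step, left // step] = classify(extract_features(window))
    return labels
```

With a 1280×720 frame and a step of 16, the windows do not overlap and the grid has 45 rows and 80 columns, that is, 3600 windows per frame.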

SVM-based single-frame detection flow chart.
To improve the detection performance, we extend single-frame detection to SVM-based detection over three adjacent frames. The procedure is the same as single-frame detection, except that the color, edge and HOG features are extracted from three adjacent frames. We denote the three frames as pre, cur, and next. We first extract the color, edge, and HOG features of each of the three frames and then subtract the HOG features of every pair of frames: pre-cur, cur-next, and pre-next. These differences are used as the HOG part of the feature. The feature vector thus consists of the color and edge features of the three adjacent frames together with the HOG differences pre-cur, cur-next, and pre-next, giving 3×5 + 3×4 + 3×36 = 135 dimensions. An SVM classifier based on three adjacent frames is trained on these feature vectors and the corresponding labels. In the detection process, a 16×16 sliding window with a step of 16 is used in the same manner as for a single frame, and each area of the image is traversed by the sliding window. The color, edge, and HOG features of each window are computed to form its feature vector, and the video text is detected with the trained classification model. Figure 7 describes the flow of text detection.
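The construction of the 135-dimensional vector can be sketched as below, assuming per-frame color (5), edge (4) and HOG (36) features are already available for the same window position in the pre, cur and next frames. The ordering of the parts inside the vector is an assumption, since the text does not specify it, and the function name `three_frame_vector` is hypothetical.

```python
import numpy as np

def three_frame_vector(color, edge, hog):
    """Build the three-frame feature from per-frame features ordered (pre, cur, next).

    color: three arrays of length 5, edge: three arrays of length 4,
    hog: three arrays of length 36.
    """
    pre, cur, nxt = (np.asarray(h, dtype=np.float64) for h in hog)
    # Pairwise HOG differences replace the raw HOG features.
    hog_diffs = [pre - cur, cur - nxt, pre - nxt]
    parts = [np.asarray(c, dtype=np.float64) for c in color]
    parts += [np.asarray(e, dtype=np.float64) for e in edge]
    parts += hog_diffs
    return np.concatenate(parts)  # 3*5 + 3*4 + 3*36 = 135 values
```

For a static window whose HOG features are identical in all three frames, the last 108 entries are zero, so the differences carry information only where the content changes between frames.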

SVM-based three-frame detection flow chart.
The hardware used in this paper has 8.00 GB of RAM and an Intel Core i5 CPU, and the experiments are conducted under the Windows 10 operating system in the MATLAB 2017a environment.
This paper focuses on detecting the text in video. In order to verify the feasibility and effectiveness of the proposed algorithm, we randomly selected 600 images with a resolution of 1280×720 containing text from the video as test samples. We then tested these samples with the previously trained SVM classification model. The test samples are shown in Fig. 8.

Testing sample set.
We run the algorithm in the MATLAB 2017a environment and verify its effectiveness through the experimental results shown in Fig. 9. In Fig. 9, the first column is the original image from the video, the second column is the result of video text detection using the color feature of [11], and the third column is the result of the proposed video text detection based on the combined color, edge and HOG features with SVM. The results show that the algorithm of [11] produces false detections, whereas the proposed algorithm effectively avoids the false detection of some non-text regions in the video.
As video text detection algorithms continue to improve, evaluating the performance of a detection algorithm becomes particularly important. The evaluation needs to be objective and accurate, so many researchers have put forward objective criteria for it. Literature [3] lists a series of evaluation criteria for detection algorithms, including recall, precision, difficulty of detection, importance of detection, and quality of text border detection. Precision, recall and the F-Measure are widely used because they are easy to calculate, easy to understand, and directly reflect the performance of the algorithm. This paper therefore selects these three commonly used criteria to evaluate the proposed video text detection algorithm. The precision rate reflects the correctness of the detections, the recall rate reflects the completeness of the detections, and the F-Measure combines the recall and precision rates. They are given in Equations (2–4):

Comparison of single-frame detection results. (a) original image, (b) results of literature [11], (c) results of the proposed method.
Recall Rate:
$$R = \frac{n_c}{n_c + n_m} \tag{2}$$
Precision Rate:
$$P = \frac{n_c}{n_c + n_f} \tag{3}$$
F-Measure:
$$F = \frac{2 \times R \times P}{R + P} \tag{4}$$
Here, $n_c$ is the total number of text regions correctly detected in the result of video text detection; $n_m$ is the total number of regions that belong to the text area but were not detected, that is, the number of missed text regions; and $n_f$ is the total number of non-text regions that were falsely detected as text.
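The three criteria can be computed directly from the three counts defined above. The sketch below assumes the F-Measure is the balanced harmonic mean of recall and precision (the usual F1 form); the function name `detection_metrics` is hypothetical, and the counts in the usage line are made-up illustration values, not results from this paper.

```python
def detection_metrics(n_c, n_m, n_f):
    """Return (recall, precision, f_measure) from detection counts.

    n_c: correctly detected text regions
    n_m: missed text regions
    n_f: non-text regions falsely detected as text
    """
    recall = n_c / (n_c + n_m) if (n_c + n_m) else 0.0
    precision = n_c / (n_c + n_f) if (n_c + n_f) else 0.0
    denom = recall + precision
    f_measure = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f_measure

# Illustrative counts only.
recall, precision, f_measure = detection_metrics(n_c=90, n_m=10, n_f=30)
```

The guards against zero denominators cover degenerate cases such as a frame with no text regions and no detections.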
We evaluate our algorithm using the three criteria above: recall rate, precision rate and F-Measure. Table 1 compares the color feature detection algorithm proposed in [11] with the algorithm proposed in this paper. The data in Table 1 were obtained with a sliding window of size 16×16 and a sliding step of 16.
Text detection algorithm comparison
Among them, the recall rate of the algorithm is:
Precision Rate:
The recall rate of Literature [11] is:
Precision Rate:
From Table 1, we can see that the text detection algorithm proposed in this paper is effective. In feature selection, this paper uses a new combination of three features (color, edge and HOG), whereas [11] uses only a single color feature, which cannot fully describe the characteristics of text. In this paper, a 45-dimensional feature vector is extracted from each sample based on color, edge and HOG, while the feature extracted in [11] has 9 dimensions. Compared with [11], the proposed algorithm improves both recall and precision, and it also avoids false detections in some non-text areas. The experimental results therefore show that the proposed algorithm improves the detection effect.
In order to further reduce the false detection rate, we extended the proposed text detection algorithm to video text detection based on three adjacent frames. We ran the algorithm in the MATLAB 2017a environment, and the experimental results are shown in Fig. 10.

Comparison of three-frame detection results. (a) original image, (b) single-frame detection result, (c) three-frame detection result.
In Fig. 10, the first column is the original image from the video, the second column is the result of single-frame video text detection, and the third column is the result based on three adjacent frames. Although the proposed single-frame detection method avoids the false detection of most non-text regions, some false detections remain in regions with strong edges. We therefore also carry out video text detection based on three adjacent frames. The experimental results show that three-frame detection avoids the false detection of some background regions and improves the precision of text detection.
Table 2 shows the results based on a single frame and on three adjacent frames. In the detection process, the sliding window size is set to 16×16 and the sliding step is set to 16.
Comparison of text detection algorithms
Among them, the recall rate of single frame detection is:
Precision Rate:
The recall rate of three-frame detection is:
Precision Rate:
From Table 2, we can see that the feature dimension is 45 for single-frame detection and 135 for three-frame detection. According to the evaluation criteria, the algorithm based on three adjacent frames improves both recall and precision, which effectively reduces the false detection rate and improves the detection effect. The three-frame video text detection algorithm is therefore effective.
This paper proposes a video text detection algorithm that combines color, edge and HOG features with SVM, aimed at the high feature dimension, reliance on a single feature, and unsatisfactory detection effect of existing SVM-based video text detection algorithms. The algorithm detects video text through sample selection, feature extraction, model training, and text detection. The experimental results show that the proposed method achieves higher recall and precision than the compared method and reduces the false detection rate. When the background is complex and its edges are strong, single-frame detection produces more false detections; the proposed algorithm based on three adjacent frames effectively alleviates this problem. The experimental results therefore show that the proposed algorithm is effective. However, the algorithm also has a shortcoming: it only detects video text and does not recognize it. To better support video retrieval, video text recognition is the direction of our future research.
Acknowledgment
This research was supported by the National Natural Science Foundation of China (Grant Nos. 61401265, 41171338, 61501286), the Fundamental Research Funds for the Central Universities (GK201803058), and the Interdisciplinary Incubation Project of Learning Science of Shaanxi Normal University.
