In this paper, a new method is proposed for video summarization and keyframe extraction using combined color, texture and motion information of the video as well as the sparse representation of local descriptors. To reduce the computational overhead of the algorithm, a non-uniform frame sampling strategy is employed using a shot detection algorithm. Subsequently, Binary Robust Invariant Scalable Keypoints (BRISK) are detected in the sampled frames and the Histograms of Oriented Gradients (HOG) around the keypoints are extracted as the local descriptors. By sparse representation and spatial partitioning of the local features, the frame discriminating curve is constructed. We extract initial keyframes by detecting local maxima of the frame discriminating curve and removing weak maxima. To remove redundant keyframes, we use a similarity measure and a motion model between the initial keyframes to extract the final keyframes. Experimental results and a comparison with other methods show that the proposed algorithm improves the recall and F-measure indices.
During the past decade, due to the widespread use of digital videos and their high volume of data, the need for fast review and retrieval of video information has attracted considerable interest from researchers. Among the various research areas, video summarization is one of the most effective approaches for automatically analyzing events in video content. Video summarization can be used not only to reduce the volume of digital video data and thus facilitate video browsing, but also as the main stage in content-based indexing and retrieval of video sequences. Video summarization may also be used as an end product in applications such as providing sport highlights. Video summarization methods are categorized into two groups: methods based on video skims and keyframe-based approaches [1]. A video skim is a moving storyboard, in which the main and more informative segments of a video along with audio data are extracted as the video abstract [2, 3, 4, 5]. The result of summarization based on video skimming is a video with audio that is shorter than the original video. In keyframe-based video summarization, a set of the more important frames of a video, or keyframes, is extracted as the video summary. Although video skimming is useful for video browsing by humans, a keyframe-based summary is more appropriate for computer-based video retrieval, indexing and content analysis. Feature extraction is the main step in keyframe-based video summarization. Although different audio and visual properties have been used for keyframe extraction, global image-based features such as color [6, 7, 8, 9, 10, 11], texture [12, 13], textual [14], facial [15] and motion [16, 17, 18] descriptors are mostly used for keyframe-based video summarization. Some methods apply a clustering algorithm to the extracted features to group similar frames. Then a limited number of frames are selected from the clusters as the video summary.
For instance, in [19] a descriptor based on the color histogram and the k-means clustering algorithm are used for video summarization. This method, called the Video SUMMarization (VSUMM) algorithm, selects keyframes by extracting the frames with minimum distances to the cluster centroids. Furini et al. [20] also used the Video STOryboard (VISTO) algorithm, which is based on low-level color features, for keyframe-based video summarization. Mundur et al. [12] utilized the color histogram in the HSV color space and a technique based on Delaunay Triangulation (DT) for clustering the frames in videos. Wu et al. used the histogram of HSV color channels and their singular value decomposition for keyframe extraction [21]. The main problem with video summarization algorithms based on clustering is the loss of the time-evolving nature of the video during the clustering stage. Furthermore, color-based features can only distinguish obvious changes in the video frames. Tint and Soe [22] used wavelet coefficients as a texture descriptor for keyframe extraction. The algorithm suffers from detecting multiple similar frames in different temporal segments of the video. The use of texture features cannot completely overcome the drawbacks of color descriptors; therefore, some authors used color and texture descriptors in a combined approach to obtain more efficient video summarization results [23, 24, 25, 26]. Various algorithms have been employed for the fusion of color and textural descriptors. In some algorithms, the color descriptor is used for video segmentation and the texture descriptor is leveraged for feature extraction and keyframe detection. Other algorithms, however, combine textural and color descriptors at the feature extraction level. In addition to color and texture descriptors, motion information is also used as another global descriptor for video summarization [27, 28, 29, 30].
For instance, in [27] the relative motion of video frames is estimated using a block matching algorithm and the frames with high motion content are selected as keyframes. In [31], the global motion of a video frame is modeled by a 2D affine model and the coefficients of the affine model are analyzed to detect special camera motions and recognize the scenes. Lu and Grauman used dense optical flows and the flow angles and magnitudes to segment an input egocentric video into a series of subshots [30]. Descriptors based on motion information are generally restricted to the global motion of the camera and produce weak results for frames with local motions.
Chakraborty et al. defined keyframe selection as an optimization problem in which the summary length and frame selection criteria are combined into a single framework [32]. Lee and Grauman used a supervised learning algorithm for egocentric video summarization [33]. This algorithm concentrates on detecting video frames with important objects and people that the camera wearer interacts with. Zhang et al. used the Long Short-Term Memory (LSTM) network and its variants for supervised video summarization [34]. LSTM and other supervised deep learning models have been used for video summarization in several studies [35, 36]. The use of supervised learning approaches enhances the robustness of the algorithm for the trained video categories; however, the algorithm does not generalize as well to unseen video categories.
Local image descriptors such as Scale-Invariant Feature Transform (SIFT) descriptors [37] are also used for video summarization. Guan et al. [38] used SIFT features and their descriptors for keyframe extraction. In [39], SIFT descriptors and their importance are used for keyframe-based video summarization. The method employs the bag of importance model to determine an importance score for individual features. The representativeness of each frame is then determined based on the scores of the features in the frame and, finally, the more representative frames are extracted as the keyframes. SIFT feature extraction and matching are computationally expensive; therefore, video summarization algorithms based on SIFT features carry a high computational burden.
In this paper, we propose a new method for video summarization using local and global image features. The proposed algorithm is a multistage algorithm that utilizes color, texture and motion information of the video frames for robust and fast extraction of keyframes. In the first stage of the proposed method, the input video is divided into shots using global color descriptors. The shot boundaries are used for non-uniform sampling of the video frames in order to reduce the computational overhead of the algorithm. Subsequently, local texture features are extracted from the sampled frames using the Binary Robust Invariant Scalable Keypoints (BRISK) approach [40]. The local features are then represented in the sparse space to reduce redundancy and extract the initial keyframes. Finally, a motion-based similarity index is used to remove redundant keyframes and extract the final keyframes.
It is important to note that the sparse representation of global features has been used in several studies to formulate keyframe extraction task [41, 42, 43]. However, we use the sparse representation of local features for keyframe extraction. Instead of using the coefficients of sparse representation, we also employ the sparse representation error of a local descriptor to measure its distinctiveness. In brief, the main contributions of this paper are as follows:
A new approach based on the sparse representation of local features is used for keyframe extraction.
A new method is proposed to construct frame-level representation by calculating the spatially divided and accumulative sparse representation errors of local descriptors.
To reduce the computational overhead of the sparse representation of local features, a shot boundary detection algorithm is used for non-uniform sampling of the video frames.
A motion-based similarity index is used to remove redundant keyframes and reduce false positives.
The rest of this paper is structured as follows. In the next section, the proposed algorithm and its various stages are described. Experimental results and the comparison of the results of the proposed algorithm with other methods appear in Section 3, and we conclude the paper in Section 4.
Figure 1. The general block scheme of the proposed algorithm for keyframe extraction.
Proposed algorithm
Figure 1 shows the general block scheme of the proposed algorithm for keyframe extraction. The proposed algorithm for video summarization utilizes a hierarchical structure that employs texture, color and motion information of video frames for efficient video summarization. The proposed algorithm comprises seven steps: 1 – shot boundary detection, 2 – keypoint extraction and local descriptor construction, 3 – sparse representation of local descriptors, 4 – construction of frame discriminating vectors, 5 – initial keyframe extraction, 6 – refining keyframes using image-based similarity measure, and 7 – refining keyframes using motion-based similarity measure. Figure 2 graphically illustrates the functionality and output of various steps of the proposed algorithm. The detailed functionality of each step will be described in the following sections.
Figure 2. A graphical view of the various steps of the proposed algorithm.
Shot boundary detection
The main objective of the proposed algorithm is to use the sparse representation of local features of video frames. However, processing all video frames leads to a high computational cost. Although uniform temporal sampling is used in some video summarization algorithms to handle the problem of time complexity, this method may lose some keyframes in time intervals with rapid scene changes. Additionally, in time intervals with no significant changes in the video content, it increases the computational burden of the algorithm. To handle the problem, a shot detection algorithm is used here to detect changes in the visual content of the videos. Then a non-uniform temporal sampling approach is employed based on the location of the video shots. This method considerably reduces the time complexity of the algorithm without losing keyframes. In the proposed shot detection algorithm, which is based on the method of [44], it is not necessary to detect the exact locations of the shot boundaries. In this approach, the input video is divided into non-overlapping temporal windows of 21 frames and the difference between the first and last frames of each window is calculated using the color information of the three color channels as follows:
where the result represents the difference between the first and last frames of a window. By applying a threshold to these differences, the windows with shot boundaries are extracted. Here the threshold is adaptively determined as follows:
where the threshold combines the global mean of all difference values in the input video with the local mean and standard deviation of the difference values in 10 consecutive windows. Once the windows with shot boundaries are detected, non-uniform temporal sampling is employed to select candidate video frames for further analysis. To this end, we select two groups of frames from the input video, comprising intrashot and intershot frames. Intrashot frames include frames from windows containing a shot boundary, and intershot frames are sampled from windows without any shot boundaries. We select 5 frames from each window containing a shot boundary and at most 30 frames from the windows between two successive windows with shot boundaries.
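The sampling rule can be illustrated with a short sketch. It assumes that the per-window boundary decisions are already available as a list of booleans, and that frames are picked at evenly spaced positions; the text fixes only the counts (5 per boundary window, at most 30 between boundary windows), so the spacing and the names `evenly_spaced` and `sample_frames` are hypothetical.

```python
from typing import List, Tuple

WINDOW = 21           # frames per temporal window (from the text)
INTRA_PER_WINDOW = 5  # frames taken from each window with a shot boundary
INTER_MAX = 30        # cap for each run of windows without a boundary


def evenly_spaced(start: int, stop: int, k: int) -> List[int]:
    """Pick up to k distinct, evenly spaced frame indices from [start, stop)."""
    n = stop - start
    k = min(k, n)
    return [start + (i * n) // k for i in range(k)]


def sample_frames(num_frames: int, has_boundary: List[bool]) -> Tuple[List[int], List[int]]:
    """Return (intrashot, intershot) frame indices, using the paper's naming."""
    intrashot: List[int] = []
    intershot: List[int] = []
    run_start = None  # first frame of the current run of boundary-free windows
    for w, flag in enumerate(has_boundary):
        start = w * WINDOW
        stop = min(start + WINDOW, num_frames)
        if flag:
            if run_start is not None:
                # Close the boundary-free run that ends just before this window.
                intershot += evenly_spaced(run_start, start, INTER_MAX)
                run_start = None
            intrashot += evenly_spaced(start, stop, INTRA_PER_WINDOW)
        elif run_start is None:
            run_start = start
    if run_start is not None:
        intershot += evenly_spaced(run_start, num_frames, INTER_MAX)
    return intrashot, intershot
```

For an 84-frame clip whose third window contains a boundary, `sample_frames(84, [False, False, True, False])` takes 5 frames from that window, 30 frames from the two windows before it and all 21 frames of the single window after it.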
Keypoint extraction and local descriptor construction
In the proposed algorithm, local feature points and their descriptors are used for keyframe extraction. Local feature points, or keypoints, are extracted using the BRISK algorithm [40] and described using the Histogram of Oriented Gradients (HOG) [45]. The SIFT algorithm [37] is one of the well-known local feature point detectors; it extracts keypoints using the difference of Gaussians. Although the SIFT algorithm is popular and generally stable, it has a high computational overhead, which restricts its use in real-time applications. The performance of binary keypoint extractors such as BRISK is similar to that of SIFT, but with a significant reduction in computational cost. However, the BRISK algorithm employs a binary descriptor that is not suitable for sparse representation. The efficiency of the HOG descriptor has been demonstrated in several applications such as image classification and object detection [46, 47]. Consequently, the HOG approach is used in this study to construct local descriptors for the keypoints extracted by the BRISK algorithm. After extracting the BRISK points, the HOGs of local patches centered on the keypoints are calculated as their descriptors. We construct an initial feature matrix from the HOG descriptors of all BRISK keypoints extracted from the sampled video frames, where each local descriptor vector has a size of 36.
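To make the descriptor size concrete, the sketch below computes a 36-element HOG for a single grayscale patch. The layout is an assumption: one block of 2 × 2 cells with 9 unsigned orientation bins gives 2 · 2 · 9 = 36 values, which matches the stated size, but the patch size, cell layout and normalization used in the paper are not given here. The function name `hog36` is hypothetical.

```python
import math
from typing import List


def hog36(patch: List[List[float]], cells: int = 2, bins: int = 9) -> List[float]:
    """HOG of one patch: cells x cells spatial cells, `bins` unsigned orientations."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * (cells * cells * bins)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central-difference gradients.
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            b = min(int(angle / (180.0 / bins)), bins - 1)
            cy = min(y * cells // h, cells - 1)
            cx = min(x * cells // w, cells - 1)
            # Each pixel votes with its gradient magnitude.
            hist[(cy * cells + cx) * bins + b] += mag
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist
```

A real-valued, L2-normalized vector of this kind can be approximated by a weighted sum of dictionary atoms, whereas a binary BRISK descriptor cannot, which is the motivation given above for pairing BRISK keypoints with HOG descriptors.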
Sparse representation of local descriptors
In a video summarization process, keyframes are the representative frames of the video. When a frame is selected as a keyframe, other similar frames should be removed from the list of candidate keyframes. Local features can be used to measure the similarity between video frames and hence to select the more distinctive frames; however, the direct use of this method needs a pairwise similarity measure between the sampled video frames, which increases the computational complexity of the algorithm dramatically. To overcome this problem, in this paper a new method is used to select the more distinctive frames of the video based on the sparse representation of local descriptors. In sparse representation, a feature vector $x$, which denotes a local descriptor vector, is expressed as:

\[ x = D\alpha + e \tag{4} \]

where $\alpha$ is the sparse code for the descriptor $x$, $D$ is the dictionary of basic signals and $e$ is the error vector. Sparse representation can be considered as a solution to an underdetermined system of linear equations as follows:

\[ \min_{\alpha} \|\alpha\|_0 \quad \text{subject to} \quad x = D\alpha \tag{5} \]

where $\|\cdot\|_0$ refers to the number of nonzero elements in a vector. To make the solution of Eq. (5) more feasible, the sparse representation problem may be expressed as follows:

\[ \min_{\alpha} \|\alpha\|_1 \quad \text{subject to} \quad x = D\alpha \tag{6} \]

Here $\|\cdot\|_1$ refers to the L1 norm. Since both $D$ and $\alpha$ in Eq. (6) are unknown, the sparse representation of local descriptors comprises two stages: 1 – the selection of appropriate basic signals or atoms, and 2 – the calculation of sparse coefficients. To solve the first problem, which is generally called dictionary learning, the K-SVD algorithm [48] is used in this study. To learn a dictionary for an input video, we use the local descriptors of the intrashot frames, i.e. the frames sampled from windows with a shot boundary. The local descriptors of the intrashot frames and their corresponding sparse coefficients are collected into two matrices, with one column per feature point in the intrashot frames.
The calculation of the dictionary matrix using the K-SVD algorithm comprises the following steps:
1. Initialize the dictionary with L2-normalized columns.
2. Calculate the sparse coefficients using the following equation:
3. Update the dictionary as:
4. Repeat steps 2 and 3 until convergence.
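For orientation, the two alternating steps can be written in the textbook form of K-SVD [48]. This is the standard formulation, not necessarily the exact expressions used in this paper, and the notation is assumed: $X$ holds the training descriptors as columns, $D$ is the dictionary with atoms $d_k$, $A$ holds the sparse codes $\alpha_i$ as columns, $\alpha_T^{j}$ is the $j$-th row of $A$, and $T_0$ is a sparsity limit.

```latex
% Overall objective, solved by alternating between the codes A and the dictionary D
\min_{D,\,A} \; \| X - D A \|_F^2
\quad \text{subject to} \quad \| \alpha_i \|_0 \le T_0 \;\; \forall i

% Dictionary update for one atom d_k: residual without the contribution of d_k,
% restricted (superscript R) to the descriptors whose codes actually use d_k,
% followed by a rank-1 approximation obtained from the SVD
E_k = X - \sum_{j \neq k} d_j \, \alpha_T^{j},
\qquad
E_k^{R} = U \Delta V^{\top},
\qquad
d_k \leftarrow u_1,
\quad
\alpha_R^{k} \leftarrow \Delta(1,1)\, v_1^{\top}
```

Because each updated atom is the first left singular vector $u_1$, the dictionary columns remain L2-normalized after every update, consistent with the initialization in step 1.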
After the calculation of the dictionary, the feature-sign search algorithm [49] is used to calculate the sparse coefficients for all features in the sampled frames, and the corresponding sparse codes are collected in a matrix.
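The coding step can be illustrated with a small sketch. It does not implement feature-sign search; it swaps in the simpler iterative shrinkage-thresholding algorithm (ISTA), which minimizes the same kind of L1-regularized least-squares objective, 0.5·||x − Dα||² + λ·||α||₁. The function names, the regularization weight `lam` and the step size `step` are assumptions, and `step` must not exceed the reciprocal of the largest eigenvalue of DᵀD for the iteration to converge.

```python
import math
from typing import List

Vector = List[float]


def soft_threshold(value: float, t: float) -> float:
    """Shrink value towards zero by t (the proximal operator of the L1 norm)."""
    return math.copysign(max(abs(value) - t, 0.0), value)


def reconstruct(atoms: List[Vector], code: Vector) -> Vector:
    """Return D @ code, with the dictionary D stored as a list of atoms (columns)."""
    return [sum(a[i] * c for a, c in zip(atoms, code)) for i in range(len(atoms[0]))]


def ista(x: Vector, atoms: List[Vector], lam: float, step: float, iters: int = 200) -> Vector:
    """Sparse code of x: minimise 0.5*||x - D a||^2 + lam*||a||_1 by ISTA."""
    code = [0.0] * len(atoms)
    for _ in range(iters):
        recon = reconstruct(atoms, code)
        residual = [xi - ri for xi, ri in zip(x, recon)]
        updated = []
        for atom, c in zip(atoms, code):
            # Gradient step on the quadratic term, then shrinkage for the L1 term.
            correlation = sum(ai * ri for ai, ri in zip(atom, residual))
            updated.append(soft_threshold(c + step * correlation, step * lam))
        code = updated
    return code


def representation_error(x: Vector, atoms: List[Vector], code: Vector) -> float:
    """L2 norm of the residual x - D @ code (assumed form of the representation error)."""
    recon = reconstruct(atoms, code)
    return math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, recon)))
```

With the two-atom dictionary `[[1, 0, 0], [0, 1, 0]]`, a descriptor such as `[3, 0.05, 4]` has a large component that no atom can explain, so its representation error is much larger than that of `[3, 0, 0]`. This is the sense in which the residual measures how distinctive a descriptor is with respect to the dictionary.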
Construction of frame discriminating vectors
In this study, the sparse representation of the local descriptors is used to measure the distinctiveness of the local features. Consequently, a frame is selected as a keyframe if its local features are more distinctive than those of other frames. To this end, the sparse representation errors of the local features are used to measure the distinctiveness of video frames. The sparse representation error for a local descriptor is defined using the following equation.
Figure 3. Spatial division of local features into 21 different areas: (a) one area includes all features of the frame, (b) four areas contain the feature points in 4 non-overlapping windows of the frame, (c) sixteen areas comprise the features in 16 non-overlapping windows of the frame.
The sparse representation error of a local feature denotes its dissimilarity to the dictionary atoms and hence its distinctiveness. A frame with more distinctive descriptors is also considered a more representative frame. Therefore, in this study, the accumulative error of sparse representation of the local descriptors in a frame is used as a criterion to measure the distinctiveness of the frame. Generally, the number of local features increases dramatically in highly textured areas; therefore, these areas have much more impact on the calculation of the accumulative error. To handle this problem, the feature points in a frame are spatially divided into 21 areas, as illustrated in Fig. 3. As shown in this figure, the areas are defined as:
One area includes all the features in a frame (Fig. 3a).
Four areas comprise the feature points in 4 non-overlapping windows of a frame (Fig. 3b).
Sixteen areas include the feature points in 16 non-overlapping windows of a frame (Fig. 3c).
By employing the feature points in the 21 areas of a frame, the discriminating vector of the frame is defined. Each of its 21 elements is the accumulative error of sparse representation for the local descriptors of the frame in the corresponding area.
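The spatial accumulation can be sketched as follows. The sketch assumes that each keypoint is given as `(x, y, error)` with pixel coordinates and an already computed sparse representation error, that the accumulative error of an area is the plain sum of the errors of the points inside it, and that the areas are ordered as whole frame, 2 × 2 grid, then 4 × 4 grid. The ordering and the function names are hypothetical.

```python
from typing import List, Tuple


def area_indices(x: float, y: float, width: int, height: int) -> List[int]:
    """0-based indices of the three areas that contain the point (x, y)."""
    indices = [0]   # area 0: the whole frame
    offset = 1
    for grid in (2, 4):  # 2 x 2 windows, then 4 x 4 windows
        col = min(int(x * grid / width), grid - 1)
        row = min(int(y * grid / height), grid - 1)
        indices.append(offset + row * grid + col)
        offset += grid * grid
    return indices


def discriminating_vector(points: List[Tuple[float, float, float]],
                          width: int, height: int) -> List[float]:
    """21-element vector of accumulated representation errors (1 + 4 + 16 areas)."""
    vec = [0.0] * 21
    for x, y, err in points:
        for i in area_indices(x, y, width, height):
            vec[i] += err
    return vec
```

Every keypoint contributes to exactly three elements, one per level of the partition, so a cluster of features in one highly textured corner raises only the elements of its own windows while the remaining window elements stay low.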
Initial keyframe extraction
In the proposed approach, the frame discriminating vectors are calculated for the sampled frames of the video. The frame discriminating curve is then calculated from the discriminating vectors of the sampled frames. Figure 4 shows the frame discriminating curve for a typical video. The frame discriminating curve is calculated using the L2 norm of the frame discriminating vectors as follows:
where the curve takes one value for each sampled frame of the video, namely the L2 norm of the discriminating vector of that frame. By applying a threshold and detecting local maxima in the frame discriminating curve, the initial keyframes are extracted. In comparison with a pairwise similarity measure using local descriptors, the time complexity of the proposed algorithm is dramatically reduced; however, among the detected keyframes there may be some similar frames, which are removed by the subsequent analysis.
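A minimal sketch of this step is given below. The curve is the L2 norm of each discriminating vector, as stated above; the rule for discarding weak maxima is an assumption, since the threshold is not specified here. The sketch keeps a local maximum only if it reaches a fraction `rel_threshold` of the global maximum, and both that parameter and the function names are hypothetical.

```python
import math
from typing import List


def discriminating_curve(vectors: List[List[float]]) -> List[float]:
    """One value per sampled frame: the L2 norm of its discriminating vector."""
    return [math.sqrt(sum(v * v for v in vec)) for vec in vectors]


def initial_keyframes(curve: List[float], rel_threshold: float = 0.5) -> List[int]:
    """Indices of local maxima that are at least rel_threshold * max(curve)."""
    if not curve:
        return []
    floor = rel_threshold * max(curve)
    peaks = []
    for i, value in enumerate(curve):
        left = curve[i - 1] if i > 0 else -math.inf
        right = curve[i + 1] if i + 1 < len(curve) else -math.inf
        # Strict on the left, non-strict on the right: a plateau yields one peak.
        if value > left and value >= right and value >= floor:
            peaks.append(i)
    return peaks
```

The returned indices refer to positions in the list of sampled frames, so they still need to be mapped back to the original frame numbers of the video.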
Figure 4. Frame discriminating curve for a sample video.
Figure 5. The results of the proposed keyframe extraction algorithm for a typical video.
Refining keyframes using image-based similarity measure
After detecting the initial keyframes using the frame discriminating curve, an image-based similarity measure is used to identify similar consecutive keyframes. To calculate the similarity between two consecutive keyframes among the intrashot frames, the Structural SIMilarity (SSIM) index is used [50]. The SSIM index between two consecutive frames is defined as follows:
where the definition involves two positive constants and two comparison functions that are calculated using the following equations:
where the comparison functions depend on the mean intensity values and the standard deviations of the two frames. The SSIM index is not a proper approach to measure the similarity between two consecutive frames in the case of gradual shot transitions. Therefore, to measure the similarity between two consecutive keyframes among the intershot frames, edge information and a correlation-based similarity index are used. To detect edge pixels, the Sobel edge detection algorithm is employed in this study.
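As an illustration of an SSIM-style comparison, the sketch below evaluates the widely used single-expression form of the index from the SSIM reference [50] over a whole frame at once. This is an assumption rather than the paper's exact definition: the constants and the comparison functions used in this study are not restated here, the standard index also includes a covariance term, and it is normally averaged over local windows. Frames are passed as flat lists of intensities, and the function name and default constants are hypothetical.

```python
from typing import Sequence


def ssim_global(a: Sequence[float], b: Sequence[float],
                dynamic_range: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Single-window SSIM between two equally sized grayscale frames."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    # Small constants that stabilise the ratios when means or variances are near zero.
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

Two identical frames give a value of 1, and the value drops as luminance, contrast or structure diverge, so a pair of consecutive keyframes can be flagged as redundant when the index exceeds a chosen threshold.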
Refining keyframes using motion-based similarity measure
SSIM and edge-based similarity indices fail to measure the correct similarity between two frames with camera motion, rotation or scale change. To handle the problem, a motion-based similarity index is used here to measure the similarity between consecutive keyframes in the case of camera motion or zoom. Here an affine transformation is used to represent the motion model between two consecutive keyframes. The affine transformation accounts for camera translation, zoom, rotation and skew as follows:
where the motion model parameters map the coordinates of salient features in one keyframe to the coordinates of their matched points in the next keyframe. The affine transformation has six parameters; therefore, three salient points and their matches are required to fully recover it. However, because of erroneous and outlier matches, robust statistical methods such as the RANdom SAmple Consensus (RANSAC) [51] or Least Median of Squares (LMedS) [52] algorithms are required to select three proper matching pairs. In this study, the LMedS algorithm is used to calculate the affine parameters. After calculating the affine transformation, it is applied to one keyframe to calculate a transformed frame. Then normalized cross-correlation is used to measure the similarity between the transformed frame and the other keyframe. If the two are sufficiently similar, the redundant keyframe is removed from the list of extracted keyframes.
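The estimation step can be sketched in a few lines. The sketch assumes the usual six-parameter form x' = a·x + b·y + c, y' = d·x + e·y + f, fits it exactly to random triplets of matched points, and keeps the model with the smallest median squared residual over all matches, which is the core idea of LMedS. The number of trials, the fixed seed and all function names are hypothetical, and no refinement on the inliers is performed.

```python
import random
from statistics import median


def solve3(m, r):
    """Solve the 3x3 linear system m @ p = r by Cramer's rule; None if singular."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    if abs(d) < 1e-9:  # collinear points: the affine model is not determined
        return None
    solution = []
    for col in range(3):
        mc = [row[:] for row in m]
        for i in range(3):
            mc[i][col] = r[i]
        solution.append(det(mc) / d)
    return solution


def affine_from_triplet(src, dst):
    """Exact affine model ((a, b, c), (d, e, f)) from three point matches."""
    m = [[x, y, 1.0] for x, y in src]
    row_x = solve3(m, [u for u, _ in dst])
    row_y = solve3(m, [v for _, v in dst])
    return None if row_x is None or row_y is None else (row_x, row_y)


def apply_affine(model, point):
    (a, b, c), (d, e, f) = model
    x, y = point
    return a * x + b * y + c, d * x + e * y + f


def lmeds_affine(src, dst, trials=200, seed=0):
    """Affine model with the least median of squared residuals over all matches."""
    rng = random.Random(seed)
    best, best_median = None, float("inf")
    for _ in range(trials):
        idx = rng.sample(range(len(src)), 3)
        model = affine_from_triplet([src[i] for i in idx], [dst[i] for i in idx])
        if model is None:
            continue
        residuals = []
        for p, q in zip(src, dst):
            u, v = apply_affine(model, p)
            residuals.append((u - q[0]) ** 2 + (v - q[1]) ** 2)
        current = median(residuals)
        if current < best_median:
            best, best_median = model, current
    return best
```

Because the score is a median, the estimate tolerates wrong matches as long as they make up less than half of the pairs. This also suggests why several independent motion sources in one scene are difficult for this kind of estimator: no single affine model then fits the majority of the matches.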
Table 1. Performance evaluation indices of the proposed algorithm for various test videos

| Video title | Tp | Fp | Fn | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| America’s new frontier, segment 01 | 9 | 1 | 1 | 0.9 | 0.9 | 0.9 |
| Oceanfloor legacy, segment 02 | 10 | 1 | 0 | 0.9091 | 1 | 0.9524 |
| Senses and sensitivity, introduct. to lecture 4 | 11 | 5 | 3 | 0.7333 | 0.8462 | 0.7858 |
| America’s new frontier, segment 10 | 8 | 1 | 0 | 0.8889 | 1 | 0.9412 |
| Voyage of the lee, segment 15 | 8 | 1 | 2 | 0.8 | 0.8889 | 0.8422 |
| Drift ice as a geologic agent, segment 05 | 5 | 0 | 3 | 1 | 0.625 | 0.7692 |
| Hurricane force-a coastal perspective, 04 | 12 | 0 | 3 | 1 | 0.8 | 0.8888 |
| Exotic terrane, segment 04 | 17 | 1 | 2 | 0.9444 | 0.8947 | 0.9188 |
| Future of energy gases, segment 05 | 8 | 0 | 1 | 1 | 0.8889 | 0.94118 |
| A new horizon, segment 03 | 23 | 0 | 3 | 1 | 0.8846 | 0.9388 |
Experimental results
The proposed algorithm was implemented in the MATLAB environment and tested on the database described below. To show the efficiency of the proposed algorithm, its results are also compared with those of other methods.
The test dataset in this study comprises 50 color videos from the Open Video (OV) project [53], covering various subjects such as sport, students, aerology and so forth. To compare the results of the proposed algorithm with existing approaches under fair conditions, we used the same 50 videos that were used in [19, 12, 20]. These videos are in MPEG-1 format with a spatial resolution of 352 × 240 pixels and an average frame rate of 30 frames per second. The duration of the videos varies from approximately 1 to 4 minutes.
Figure 5 illustrates a typical video from the test database and the results of keyframe extraction. To evaluate the performance of the proposed algorithm and compare its results with those of other methods, the F-measure, recall and precision criteria are used, which are defined as follows [54]:

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F = \frac{(1 + \beta^2)\,\text{Precision} \cdot \text{Recall}}{\beta^2\,\text{Precision} + \text{Recall}} \]

where $TP$ is the number of true positives, i.e. correctly detected keyframes, $FP$ is the number of false positives, i.e. falsely detected keyframes, and $FN$ is the number of false negatives, i.e. undetected keyframes. $\beta$ is a constant that is set to 1 in our experiments. To calculate these performance evaluation metrics, we have used the same ground truth as [19]. In [19], the results of video annotation by five different users were used to provide the ground truth for the test dataset comprising 50 videos. Table 1 shows the results of the proposed keyframe extraction algorithm for 10 randomly selected videos of the test dataset. The F-measure, precision and recall indices for the proposed algorithm are 0.873, 0.881 and 0.863, respectively.
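The three indices follow directly from the keyframe counts. The sketch below uses the standard definitions of precision, recall and the F-measure with a weighting constant `beta`; the function name and the convention of returning 0 when a denominator is zero are assumptions.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int, beta: float = 1.0) -> Tuple[float, float, float]:
    """Precision, recall and F-measure from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

For example, reading the three counts listed in Table 1 for "Oceanfloor legacy, segment 02" as 10 true positives, 1 false positive and 0 false negatives (an assumption about the column order) gives a precision of 10/11 ≈ 0.9091, a recall of 1 and an F-measure of 20/21 ≈ 0.9524, which matches that row.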
Effect of algorithm parameters on the performance
We evaluated the performance of the proposed algorithm by changing various parameters of the algorithm. One of the important parameters of the algorithm is the constant in Eq. (3) that determines the threshold for the detection of video shots. Larger values of this constant increase the threshold, and hence some shot boundaries may not be detected. Small values may also create false shot boundaries. We tested the proposed algorithm with values of 0.1, 0.3, 0.5, 0.7 and 0.9 for this constant. Figure 6 shows the F-measure of the proposed algorithm for these values. The figure shows a maximum F-measure of 0.86 and only a limited change in the F-measure over a wide range of values.
We also tested the proposed algorithm with various values of the sparse representation parameter of the feature-sign search algorithm. Table 2 shows the F-measure of the proposed algorithm for parameter values of 0.1, 0.15, 0.3 and 0.5. The value used in our experiments was chosen based on the results of this table. The results of the table also show the low sensitivity of the proposed algorithm to this parameter.
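Table 2. Effect of the sparse representation parameter on the performance of the proposed algorithm

| Parameter value | F-measure |
|---|---|
| 0.1 | 0.8702 |
| 0.15 | 0.87254 |
| 0.3 | 0.81426 |
| 0.5 | 0.78806 |

Figure 6. F-measure of the proposed algorithm for various values of the shot detection threshold constant.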
In the proposed approach, the input video is divided into non-overlapping temporal windows of 21 frames and the differences between the first and last frames of the windows are calculated using the color information. Figure 7 shows the effect of the non-overlapping temporal window size on the F-measure index. According to the results of Fig. 7, we use the window size of 21 in our experiments.
Table 3. Execution time of the proposed algorithm in seconds for one frame of an input video with a spatial resolution of 352 × 240 pixels

| Processing stage | Processing time (sec.) |
|---|---|
| Shot boundary detection | 0.010 |
| Keypoint extraction and dictionary learning | 1.001 |
| Sparse representation and initial keyframe extraction | 0.019 |
| Motion-based refining | 0.257 |
| Total | 1.288 |
Figure 7. The effect of the non-overlapping temporal window size on the F-measure index of the proposed algorithm.
Figure 8. The results of the proposed algorithm using three different local features.
Figure 9. Video summarization results for 1 – A New Horizon, segment 01 (left) and 2 – Oceanfloor Legacy, segment 09 (right) videos: (a) ground truth, (b) the result of the VSUMM approach [19], (c) the result of the DT method [12], (d) video summarization using the STIMO algorithm [8], (e) the result of the BOIVS method [39], and (f) the results of the proposed method.
Figure 10. Comparison of the proposed algorithm with three existing approaches using the F-measure, precision and recall indices.
Figure 11. A source of error of the proposed algorithm: (a) keyframes that are extracted by the proposed algorithm, and (b) keyframes in the ground truth.
We also tested the proposed algorithm with Speeded-up Robust Features (SURF) [55] as well as local deep features [56]. Our algorithm uses the sparse representation of local features; therefore, the intermediate local deep features are used to test the algorithm. Figure 8 illustrates the results of the proposed algorithm for keyframe extraction using three local features: combined BRISK and HOG descriptors, the SURF algorithm and local deep descriptors. The results of Fig. 8 show that the combined BRISK and HOG descriptors give better results. Although the efficiency of deep features has been shown in several applications, our analysis shows that local deep features produce high errors during the sparse representation and hence do not produce results as good as those of the HOG features. In addition, the combined BRISK and HOG descriptors are scale and rotation invariant; therefore, they are robust in the case of camera zoom and rotation.
Table 3 lists the execution time of the proposed algorithm for one frame of an input video with a spatial resolution of 352 × 240 pixels. The table shows the processing time for the various stages of the algorithm as well as the total processing time. Since our algorithm is implemented in the MATLAB environment, its processing time is high. We expect that a C or C++ implementation of the proposed algorithm will considerably improve the processing speed.
Comparison with existing approaches
To demonstrate the efficiency of the proposed algorithm, we compared its results with existing approaches both quantitatively and qualitatively. Figure 9 illustrates the ground truth and the extracted keyframes of the 1 – A New Horizon, segment 01 and 2 – Oceanfloor Legacy, segment 09 videos using the proposed algorithm and four existing video summarization algorithms: the VSUMM algorithm [19], the DT method [12], the STIMO approach [8] and the BOIVS algorithm [39]. In this figure, correctly recognized, unrecognized and falsely recognized keyframes are labeled as Tp, Fn, and Fp, respectively.
We also quantitatively compared the performance of the proposed algorithm with existing approaches, including the VSUMM algorithm [19], the DT method [12] and the STIMO approach [8]. For this comparison, we use the results reported in the published papers of these approaches. The F-measure is the general criterion for quantitatively evaluating the efficiency of video summarization algorithms. Precision and recall indices are also used in the literature to compare various keyframe extraction algorithms. Figure 10 shows the F-measure, precision and recall indices for the proposed approach and three state-of-the-art approaches.
The results of Fig. 10 show that the proposed algorithm outperforms the competing methods according to the F-measure index. The results of Fig. 10 also show that the precision index of the proposed algorithm is higher than those of the STIMO and DT methods; however, the precision index of the VSUMM algorithm is marginally better than that of the proposed algorithm. This means that the proposed method produces more false positives than the VSUMM algorithm. This is generally because of the higher sensitivity of the proposed algorithm to scene changes in videos with a static background and moving objects entering the scene. In this case, our algorithm detects a keyframe that is not detected by the VSUMM algorithm. As will be discussed in the error analysis section, these frames are categorized as keyframes by some experts. The comparison of the proposed algorithm with other algorithms using the recall index shows that the proposed algorithm outperforms existing approaches.
Error analysis
We analyzed the sources of errors for the proposed algorithm. One of the sources of errors is the detection of keyframes in video segments with a static background while moving objects enter the scene. In this case, our algorithm detects a keyframe that is not considered a keyframe in the ground truth. Figure 11 shows two video segments that reveal this type of error. Although these frames have not been considered as keyframes in the ground truth provided by [19], they are considered as keyframes by several human experts depending on the application of video summarization.
Another source of error for the proposed algorithm is false positives during video dissolve or fading. In the case of video dissolve, a gradual transition occurs from one frame to another. Therefore, a combined image is constructed during the transition that is detected as an extra keyframe by the proposed algorithm. In rare cases, the video segments may contain several motion sources, such as camera motion or the motion of multiple objects in the scene. In such situations, the LMedS algorithm fails to calculate the motion model correctly. Therefore, the algorithm may detect an extra keyframe.
Conclusions
In this article, a new method was proposed for keyframe extraction and video summarization using combined color, texture and motion information. The method employs a non-uniform frame sampling strategy based on a shot detection algorithm. Local features based on the BRISK algorithm are then used for keyframe extraction. The proposed algorithm introduces a new approach that uses the sparse representation of local descriptors to measure the similarity among video frames and to extract keyframes. It also employs a similarity index based on the motion model to measure the similarity between two frames that differ by camera motion, rotation or scale change. The proposed algorithm was tested on a dataset comprising 50 videos and evaluated using various performance indices. Experimental results using three types of local features, namely combined BRISK and HOG, SURF, and deep local features, showed that the combined BRISK and HOG features yield higher F-measure, precision and recall indices. Additionally, the comparison with other methods shows that the proposed algorithm improves the recall and F-measure indices. We analyzed the sources of errors of the proposed algorithm, and the results showed that the main sources are false positives during video dissolves or fades, scenes with a static background in which moving objects enter the scene, and scenes with several moving objects.
References
1.
Truong BT, Venkatesh S. Video abstraction: A systematic review and classification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM). 2007; 3(1): 3.
2.
Streib K, Davis JW. Summarizing high-level scene behavior. Machine Vision and Applications. 2013; 25(1): 229-44.
3.
Gygli M, Grabner H, Van Gool L, editors. Video summarization by learning submodular mixtures of objectives. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015.
4.
Kumar K, Shrimankar DD. F-DES: Fast and deep event summarization. IEEE Transactions on Multimedia. 2018; 20(2): 323-34.
5.
Kumar K, Shrimankar DD. Deep event learning boost-up approach: DELTA. Multimedia Tools and Applications. 2018: 1-21.
6.
Xiaohua H, Ling J, editors. A video summarization method based on key frames extracted by TMOF. International Conference on Image Analysis and Signal Processing (IASP); 2012.
7.
Gong Y, Liu X, editors. Video summarization using singular value decomposition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2000.
8.
Furini M, Geraci F, Montangero M, Pellegrini M. STIMO: STIll and MOving video storyboard for the web scenario. Multimedia Tools and Applications. 2010; 46(1): 47-69.
9.
Majumdar J, Santhosh Kumar K, Venkatesh G, editors. Analysis of video shot detection using color layout descriptor and video summarization based on expectation-maximization clustering. International Conference on Cognitive Computing and Information Processing (CCIP); 2015.
10.
Zong Z, Gong Q, editors. Key frame extraction based on dynamic color histogram and fast wavelet histogram. 2017 IEEE International Conference on Information and Automation (ICIA); 2017: IEEE.
11.
Chupeau B, Forest R. Evaluation of the effectiveness of color attributes for video indexing. Journal of Electronic Imaging. 2001; 10(4): 883-95.
12.
Mundur P, Rao Y, Yesha Y. Keyframe-based video summarization using Delaunay clustering. International Journal on Digital Libraries. 2006; 6(2): 219-32.
14.
Otani M, Nakashima Y, Sato T, Yokoya N. Video summarization using textual descriptions for authoring video blogs. Multimedia Tools and Applications. 2017; 76(9): 12097-115.
15.
Zhang T, Wen D, Ding X, editors. Person-based video summarization and retrieval by tracking and clustering temporal face sequences. IS&T/SPIE Electronic Imaging; 2013: SPIE.
16.
Ma YF, Hua XS, Lu L, Zhang HJ. A generic framework of user attention model and its application in video summarization. IEEE Transactions on Multimedia. 2005; 7(5): 907-19.
17.
Osian M, Van Gool L. Video shot characterization. Machine Vision and Applications. 2004; 15(3): 172-7.
18.
Divakaran A, Radhakrishnan R, Peker KA. Video summarization using descriptors of motion activity: A motion activity based approach to key-frame extraction from video shots. Journal of Electronic Imaging. 2001; 10(4): 909-17.
19.
De Avila SEF, Lopes APB, da Luz A, de Albuquerque Araújo A. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters. 2011; 32(1): 56-68.
20.
Furini M, Geraci F, Montangero M, Pellegrini M, editors. VISTO: Visual storyboard for web video browsing. 6th ACM International Conference on Image and Video Retrieval (CIVR’07); 2007.
21.
Wu J, Zhong S-h, Jiang J, Yang Y. A novel clustering method for static video summarization. Multimedia Tools and Applications. 2016: 1-17.
22.
Tint KT, Soe K. Key frame extraction for video summarization using DWT wavelet statistics. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET). 2013; 2(5): 1829-33.
23.
Carvajal J, McCool C, Sanderson C, editors. Summarisation of short-term and long-term videos using texture and colour. IEEE Winter Conference on Applications of Computer Vision (WACV); 2014.
24.
Cahuina EJ, Camara CG, editors. A new method for static video summarization using local descriptors and video temporal segmentation. 26th SIBGRAPI - Conference on Graphics, Patterns and Images (SIBGRAPI); 2013.
25.
Papadopoulos DP, Chatzichristofis SA, Papamarkos N. Video summarization using a self-growing and self-organized neural gas network. Computer Vision/Computer Graphics Collaboration Techniques: Springer; 2011. p. 216-26.
26.
Mahmoud KM, Ismail MA, Ghanem NM, editors. VSCAN: An enhanced video summarization using density-based spatial clustering. 17th International Conference on Image Analysis and Processing (ICIAP); 2013.
27.
Kamoji S, Mankame R, Masekar A, Naik A. Key frame extraction for video summarization using motion activity descriptors. International Journal of Research in Engineering and Technology. 2014; 3(3): 491-5.
28.
Peyrard N, Bouthemy P. Motion-based selection of relevant video segments for video summarization. Multimedia Tools and Applications. 2005; 26(3): 259-76.
29.
Li C, Wu YT, Yu SS, Chen T, editors. Motion-focusing key frame extraction and video summarization for lane surveillance system. 16th IEEE International Conference on Image Processing (ICIP); 2009.
30.
Lu Z, Grauman K, editors. Story-driven summarization for egocentric video. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2013.
31.
Luo J, Papin C, Costello K. Towards extracting semantically meaningful key frames from personal video clips: from humans to computers. IEEE Transactions on Circuits and Systems for Video Technology. 2009; 19(2): 289-301.
32.
Chakraborty S, Tickoo O, Iyer R, editors. Adaptive keyframe selection for video summarization. 2015 IEEE Winter Conference on Applications of Computer Vision (WACV); 2015: IEEE.
33.
Lee YJ, Grauman K. Predicting important objects for egocentric video summarization. International Journal of Computer Vision. 2015; 114(1): 38-55.
34.
Zhang K, Chao W-L, Sha F, Grauman K, editors. Video summarization with long short-term memory. European Conference on Computer Vision; 2016: Springer.
35.
Ejaz N, Khan UA, Martínez-del-Amor MA, Sparenberg H, editors. Deep learning based beat event detection in action movie franchises. Tenth International Conference on Machine Vision; 2018: SPIE.
36.
Yan X, Gilani SZ, Qin H, Feng M, Zhang L, Mian A. Deep keyframe detection in human action videos. arXiv preprint arXiv:1804.10021; 2018.
37.
Lowe DG. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. 2004; 60(2): 91-110.
38.
Guan G, Wang Z, Lu S, Deng JD, Feng DD. Keypoint-based keyframe selection. IEEE Transactions on Circuits and Systems for Video Technology. 2013; 23(4): 729-34.
39.
Lu S, Wang Z, Mei T, Guan G, Feng DD. A bag-of-importance model with locality-constrained coding based feature learning for video summarization. IEEE Transactions on Multimedia. 2014; 16(6): 1497-509.
40.
Leutenegger S, Chli M, Siegwart RY. BRISK: Binary robust invariant scalable keypoints. IEEE International Conference on Computer Vision (ICCV). 2011: 2548-55.
41.
Cong Y, Yuan J, Luo J. Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Transactions on Multimedia. 2012; 14(1): 66-75.
42.
Mei S, Guan G, Wang Z, Wan S, He M, Dagan Feng D. Video summarization via minimum sparse reconstruction. Pattern Recognition. 2015; 48(2): 522-33.
43.
Kumar M, Loui AC, editors. Key frame extraction from consumer videos using sparse representation. 18th IEEE International Conference on Image Processing; 2011: IEEE.
44.
Lu ZM, Shi Y. Fast video shot boundary detection based on SVD and pattern matching. IEEE Transactions on Image Processing. 2013; 22(12): 5136-45.
45.
Dalal N, Triggs B, editors. Histograms of oriented gradients for human detection. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR); 2005.
46.
Paisitkriangkrai S, Shen C, Zhang J. Performance evaluation of local features in human classification and detection. IET Computer Vision. 2008; 2(4): 236-46.
47.
Liang J, Ye Q, Chen J, Jiao J, editors. Evaluation of local feature descriptors and their combination for pedestrian representation. 21st International Conference on Pattern Recognition (ICPR); 2012: IEEE.
48.
Aharon M, Elad M, Bruckstein A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing. 2006; 54(11): 4311-22.
49.
Lee H, Battle A, Raina R, Ng AY, editors. Efficient sparse coding algorithms. Advances in Neural Information Processing Systems; 2006.
50.
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing. 2004; 13(4): 600-12.
51.
Fischler MA, Bolles RC. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM. 1981; 24(6): 381-95.
52.
Rousseeuw PJ. Least median of squares regression. Journal of the American Statistical Association. 1984; 79(388): 871-80.
53.
Open Video (OV) project [Internet]. [cited March, 2016]. Available from: https://open-video.org/.
54.
Sokolova M, Japkowicz N, Szpakowicz S. Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. AI 2006: Advances in Artificial Intelligence. Lecture Notes in Computer Science; 2006. p. 1015-21.
55.
Bay H, Ess A, Tuytelaars T, Van Gool L. Speeded-up robust features (SURF). Computer Vision and Image Understanding. 2008; 110(3): 346-59.
56.
Krizhevsky A, Sutskever I, Hinton GE, editors. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems; 2012.