Abstract
The ever growing video data over the internet has raised the challenge of efficient storage and retrieval of multimedia data. Video Summarization is one of the solutions to the problem which extracts interesting parts of a video. These summaries capture the essential content of the original video and enable the viewers to pick out interesting videos from huge video repositories by viewing its compact representation. To address the challenge of managing the video data, a new method for static video summarization is proposed here which is capable of describing input videos using a precise yet meaningful subset of frames. The method utilizes HOG descriptors of Gabor maps of input frames. A high-level representation of HOG descriptors is created using sparse auto-encoders (SAE). The final summary of the video is formed using candidate frames obtained after K-means clustering of these feature vectors. The summarization step is preceded by redundancy elimination to overcome computational burden. The method provides better results compared to the other state-of-the-art video summarization methods.
Keywords
Introduction
The expansion of video data over the internet with the advancement of technology leads to the problem of managing huge video repositories pertaining to entertainment, business, education and personal applications. The users have to browse large volumes of video data each spanning several hours long to get the relevant video. The mining of these huge video repositories to select a particular video of interest is costly, time-consuming and tedious task. This revolution in video data has posed the challenge of storing and retrieving the data effectively and efficiently. Video summarization is a solution to handle tremendous increase in the number of videos over the internet. The aim of video summarization is to enable users to understand the video in short time by viewing a condensed form of input video that conveys the semantic content of the original video.
There are two categories of summarization methods: static summarization methods [1] and dynamic summarization methods [2]. The static summarization methods produce still images or frames corresponding to the input video such that these frames capture the prime contents of the video. These are called still abstracts or static storyboard. The temporal components of the input video is not preserved in static summarization. Whereas, dynamic summarization methods produce short videos preserving the temporal component of the original video. These summaries are known as dynamic skims. Most of the existing approaches focus on dynamic summarization and are meant for the videos pertaining to a particular domain such as sports videos, surveillance videos, consumer videos etc. An efficient method for domain-independent static video summarization which facilitates fast browsing and retrieval of video data is still lacking in the literature.
This paper focuses only on static summarization approaches in the literature. Traditional methods extract key-frames based on ranking algorithms. The various frame discrepancies strategies are used to rank the frames of the video. The ranked frames are filtered to discard similar frames and to retain the frames with a content change in the summary. Many works in literature detect key-frames based on low-level visual features. These features fail to capture the semantics of the video which is very important for creating summaries. The low-level features are computed both locally and globally. The global features represent global characteristics of frames such as colour histogram, texture features etc. The features need to be localized to small portions in an image for incorporating the local details to improve the accuracy. The commonly used localized low-level features for summarization in literature are SIFT [3] and HOG [4]. Mutual Information between consecutive frames is also explored in summarization as in [5]. The low-level features are not good at determining semantically meaningful frames as they fail to capture the characteristics of the video that seize the attention of users, and works has been extended by utilizing higher level features capturing object identities as in [6]. In [7] object level video abstraction is done by detecting objects and correlation between objects using high-level semantic features extracted from frames. Recent works on summarization in [8–10] emphasize on deep features which can better discriminate frames carrying key contents than handcrafted features.
The summarization methods also vary in the computational mechanism used. Unsupervised clustering is most commonly used as in [11–14] in which keyframes are centroids of each cluster formed based on chosen distance measure between feature vectors. In these methods, number of cluster are predetermined but these need to be adaptively fixed as in [11] for domain independent methods where number of clusters vary with the input video.
There is no consistent evaluation method for video summarization in literature. The summaries are evaluated either using subjective measures, objective metrics or using both the methods. In subjective evaluation, the summaries are evaluated by a group of experts visually as in [15]. The summaries are also evaluated by comparing the key-frames automatically generated with the user summaries in ground truth by a distance measure between feature vectors corresponding to frames in user summary and ground truth [11]. The summaries are also evaluated using Precision, Recall and F-score which also requires an efficient similarity measure between the frames [16]. By considering the merits of related works and concepts, we propose a method to generate high-quality video summary which can address some of the shortcomings of existing approaches. The generated summaries can ultimately speed up video indexing and retrieval processes.
The proposed method concentrates on generating static summaries from the videos belonging to different categories. To reduce the time complexity, redundancy elimination is done on a set of input frames before summarization based on flow vectors computed using SIFT Flow algorithm. The local HoG descriptors are extracted from Gabor maps corresponding to each frame in the reduced set. These low-level features are converted to high-level features using autoencoders which are clustered using K-Means algorithm to find out the frames that best represent a group of similar frames. The method is tested on VSUMM and OVP dataset, and it attain better results compared to the other state-of-the-art video summarization methods.
The rest of this paper is organized as follows. Section 2 describes the different steps of the proposed methodology for static video summarization. Experimental results and discussions are illustrated in Section 3. We conclude the paper in Section 4.
Proposed methodology
This paper focuses on the development of a novel algorithm for domain independent static video summarization. The method processes input video on a frame by frame basis. Initially, uniform sampling is done on the set of frames. Then, redundancy elimination is performed on this pre-sampled set to lessen the computational burden of processing the frames in later steps. Each frame in the subset is represented using HOG descriptors of Gabor maps of frames. The high-level representation of these feature vectors are generated using SAE. The key-frames representing the input video are then selected using K-means clustering. Fig. 1 illustrates the different steps included in the proposed method.

Overview of the proposed approach.
The input video needs to be divided into frames before processing. Instead of processing the entire set of input frames, a subset of frames is generated using uniform sampling. The subset is chosen based on pre-determined sampling rate which is an important parameter that directly influence the quality of final summary. The sampling rate depends on the frame rate associated with the input video. A low sampling rate leads to key-frame miss, and a high sampling rate can cause more redundant frames to appear in the output preserving key-frames. The sampling rate used in existing approaches is one frame per second. It works well with high frame rate videos like sports videos and is not suitable for low frame rate videos like cartoon videos where frames are dissimilar even among the frames that span one second due to an abrupt change in the content.
To overcome this issue, the sampling rate that we choose in this step is two frames in one second. The segments are formed from input frames where each segment consists of frames that correspond to one second in the input video. The frames are sampled by selecting the first and middle frame of each segment. 91% of frames are eliminated from input set after uniform sampling, and the set preserves all the distinct frames in the input.
Redundancy elimination
There are still many redundant frames in the sampled subset of frames as a result of similarity in the content of frames at many points in a video. This step discards duplication of contents which adversely affects the goal of producing a concise summary. So, to reduce the complexity of subsequent steps and to meet the goal of generating a good quality summary the proposed approach first eliminate redundant frames preserving the key-frames.
The motion vectors based on optical flow between consecutive frames is explored here which is an important feature of videos for redundancy removal. The optical flow vectors are calculated from SIFT images of corresponding frames which are formed by replacing each pixel in the frame with 128 - D SIFT feature vectors [17]. Each SIFT image is of size X × Y × 128 for an image of size X × Y. The optimal flow vectors are calculated at each pixel position at the finer level based on a coarse to fine level processing of frames as in [18]. The magnitude of displacement at each pixel position is calculated as in (1)
This section gives detailed description of feature extraction step in the proposed method which is done using Gabor filters, HoG and SAE.
Convolution with Gabor filters
Gabor filters are widely used in image processing applications like texture analysis, face verification, handwritten number recognition and image retrieval. A complex sinusoidal plane wave is used to modulate a Gaussian kernel function to form 2D Gabor filter. The impulse response of the filter is given in (3).

Representation of a frame in different scales and orientation using Gabor filter bank.
Gabor filters provide an ideal solution for combined localization of both spatial and frequency domain. These filters have been successfully applied for edge detection tasks, and it facilitates detection of edges in the orthogonal base. The features captured by bank filters are correlated to one another. The filters can represent complex objects using local features invariant to scale, rotation, translation and illumination changes by combining the response of several filters in different orientations, frequencies and scale [19]. Considering the advantages of Gabor filters, we utilized multi-resolution and multi-scale Gabor filters for decomposition of the frames instead of extracting HoG descriptors directly from intensity values.
The representation of a frame using Gabor filter bank is shown in Fig. 2. In the proposed approach a gabor filter bank with two different scales (4,6) and orientation (90,180) is considered. The choice of scale and orientation values of filter bank has been decided empirically. From Fig. 2, it is evident that the selected combination can better represent the given image. Each frame that remains in the subset of input frames after redundancy elimination is convolved with Gabor filter bank. We explored only the magnitude component of Gabor maps. We observed experimentally that the average of all Gabor maps obtained after convolution can better represent the corresponding frame than individual maps. Thus, to reduce the computational burden of feature extraction, we replaced four maps corresponding to a frame with one average map.
Histograms of Oriented Gradients (HOGs) [20] are feature descriptors of images invariant to 2-D rotation which have been used in many different problems in computer vision. The descriptor can characterize the appearance of local object and shape by the distribution of local intensity gradients or edge directions. The HOG descriptor is obtained by counting occurrences of gradient orientation within the region of interest (ROI) defined in an image. The histogram of edge orientation is computed for each pixel within the cell where a cell is a small connected region within the image. According to the gradient orientation, each cell is discretized into angular bins. A weighted gradient is contributed by the pixels of each cell to its corresponding angular bins. Blocks are groups of adjacent cells and histograms corresponding to cells are normalized to form the block histogram. The HoG descriptor is represented by the set of these block histograms. The basic configuration parameters required for the computation of the HoG descriptor are the (i) Masks to compute derivatives and gradients (ii) Geometry of splitting an image into cells and grouping cells into a block (iii) Block overlapping and (iv) Normalization parameters.
The parameters for the extraction of HoG descriptors in the proposed method is chosen as in [20] such that 1-D centered derivative mask is [-1, 0, +1], detection window size is same as input image size, cell size is 8 × 8 and the block size is 16 × 16 (2 × 2 cells). The descriptors are extracted from average Gabor maps corresponding to each frame in the subset of input frames [21]. The HoG descriptors are calculated for each frame from the Gabor map representation of the corresponding frame.
Extraction of high level features
The next step is to generate a better representation of high dimensional HoG descriptors using SAE. Autoencoders are unsupervised neural network which uses backpropagation algorithm and iteratively minimizes the error between the actual and the predicted output. By doing so, the network learns the weights corresponding to the input data. It is an alternative to well-known Principal Component Analysis (PCA) and can represent the high dimensional input vectors using low dimensional vectors.
A basic autoencoder consists of three types of layers input layer, encoder layers and decoder layer. It takes an input which is an element of d - dimensional space (o ∈ R
d
) and generates an output z which is an element of k - dimensional space (z ∈ R
k
) where k is less than d. The output z is given in (4).

Architecture of SAE used in the proposed method.
Once the frames are represented efficiently using feature descriptors, the next step in the summarization process is to group similar frames among the set of frames. We applied the modified version of the simplest unsupervised K-means clustering as in [11] to group similar frames into clusters. Instead of allocating frames randomly to each cluster, the frames are grouped sequentially and thus,the similar consecutive frames in videos allow fast convergence of K-means algorithm. The main drawback of K-means algorithm is that the value of the number of clusters ’k’ is predetermined. The predetermined value of ’k’ is not acceptable for domain-independent approaches in video summarization since summary length depends on the content of the video. So, ’k’ is determined by adaptively thresholding the Euclidean distance between the feature vectors of consecutive frames in the set of frames as given in Section. 3.1.3. The frame closest to the centroid is chosen to be the key-frame.
Experiments and discussion
In this section, the effectiveness and efficiency of the proposed high-level and low-level features based video summarization method is validated through vaious experiments and comparisons. We conducted performance tests of the proposed method on two standard datasets, VSUMM and OVP with videos of different categories. VSUMM dataset consists of videos belonging to cartoon, sports and news. The videos span 1 to 4 minutes duration. The method is also tested on OVP dataset of 50 documentary videos. Datasets also include ground truth created with user summaries of 5 different users for each video. All the videos and groundtruth are available at [22].
Performance metrics
The lack of uniform evaluation standards or criteria in video sumarization since there is no objective ground truth. The existing approaches evaluated abstracts by visually comparing the frames in the output with the ground truth. Even humans cannot compare different summaries of the same video because the portions of the video which are relevant to one user may not be interesting to the other. The accuracy of the proposed approach is evaluated using Precision, Recall and F-score which is calculated by measuring the similarity between the output frames with those in the human created summaries. We explored a simple similarity measure based on Mutual Information between the output frames and frames in the ground truth. Two frames are considered similar if MI value is greater than 0.95. Mutual Information between two images measures variation between the sum of individual entropies and the joint entropy of two images. Given two frames f1 and f2 with a and b representing intensity values of f1 and f2, respectively and c represent intensity value of F. The joint entropy H(f1,f2), individual entropy H(f1) and H(f2) of f1 and f2 and Mutual information MI is calculated as in (9) - (11).
Let N
total
represents the number of frames in the input video, N represents the number of frames in the summary generated by the proposed approach, N
nm
represents the number of frames in the generated summary that doesnot match with the frames in user summaries given in the dataset, N
m
represents the number of frames in the generated summary compared that are similar to the frames in the user summaries given in the dataset, N
GT
represents the number of frames in the user summaries.Then,
where n represents the number of user summaries corresponding to each video.
This section gives parameter settings of SAE used in the proposed method. The values of the parameters that affect the performance of SAE is chosen as in [23] where β=3, ρ=0.05, λ=0.003, γ=1e-9, δ=400. Here, the sigmoid function is selected based on preliminary experiments conducted as it perform bette than ReLu function.
Determining the number of nodes in the hidden layer of sparse autoencoders
The dimensionality of feature vector extracted from each hidden layer is same as the number of nodes in the hidden layer. So, the number of nodes in the hidden layer must be chosen carefully. Here, we extracted the reduced feature vector from the last layer of two-layer SAE and dimensionality of the feature vector is equal to the number of nodes in the last hidden layer. The choice of the number of nodes is made in such a way that the error calculated using (5) has the least value. Fig. 4 shows the reconstruction error obtained with different values for number of hidden nodes using VSUMM dataset. Based on the plot, the proposed method makes use of 170 nodes in the first layer and 110 nodes in the second layer. We utilized the same number of nodes for OVP dataset and achieved better results compared to existing approaches.

The choice of number of hidden nodes in each layer versus error.
Figures 5 and 6 illustrates how the Euclidean distance between HSV histogram of consecutive frames can be utilized to determine the number of clusters in K-means clustering.

Pairwise distances of frames (’v88.avi’).

Pairwise distances of frames (’v21.mpg’).
The graph shows the distribution of distances along time, and there are peaks at the certain point in time which shows a significant change in the content of the video. The frames between two peaks are similar and belong to one cluster. Thus, normalized pairwise distances are thresholded, and the number of clusters k is same as the number of frames with distance score greater than the threshold value. The threshold value is chosen to be 0.3 empirically.
In this section, we present results of our method and also the comparative analysis of results with state-of-the-art summarization methods. Table 1 demonstrates the key-frame extraction results of the proposed method on two benchmark datasets and a comprehensive evaluation to illustrate the effect of high-level features in contrast to low-level features.
Comparison of results using different features on VSUMM and OVP dataset
Comparison of results using different features on VSUMM and OVP dataset

Comparison of time taken by SAE and SAE+HOG.
The comparison is done by clustering HOG descriptors of Gabor images without computing its high-level representation using SAE and also with the high-level features extracted using SAE directly from input frames. The results show that our method is more efficient in detecting key-frames than the low-level features. Though high-level feature vectors extracted directly from images is competent in recognizing key-frames, the computation burden is high as evident in the Fig. 7. It shows time taken by proposed approach is very low when compared to time taken by SAE where images are given directly. We have represented the time analysis of only first 10 videos from two datasets graphically.
To evaluate the efficiency of our approach, we compare the results with summarization methods in VRHDPS [14], VISCOM [16], VSUMM [24] on VSUMM and OVP dataset.

Comparison of results of proposed method with other methods on VSUMM dataset.

Comparison of results of proposed method with other methods on OVP dataset.
Figures 8 and 9 illustrates the performances of the comparison results on VSUMM and OVP dataset, respectively. As shown in the figure our method has high recall and F-score compared to the other methods, but the precision is slightly less compared to VRHDPS. As far as summarization method is considered recall is an important measure since it calculates the similarity of the automatically extracted summary with ground truth. A high precision and low recall indicate that there are equal number of frames in the generated summary and the ground truth, but the number of matching frames is less which shows automatically generated summary is semantically weak which conflicts the objective of summarization. Our method, when compared to other methods, achieved high recall maintaining precision close to the state-of-the-art methods.
Comparison with other clustering based approaches on VSUMM dataset
Comparison with other clustering based approaches on OVP dataset
We compared results with clustering based approaches to further prove the efficiency of our method. Tables 2 and 3 shows the comparison of the results. Table 2 illustrates the comparison of the results of the proposed approach on VSUMM dataset with the results of existing clustering based approaches in literature that consider VSUMM dataset for evaluation. Table 3 illustrates the comparison of the proposed method on OVP dataset with the results of existing clustering based approaches in literature that consider OVP dataset for evaluation. It shows our method which clusters frames based on high-level features outperform other methods mentioned in [14].
The rapid increase in the video data over the internet has raised the challenge of developing algorithms which can summarize videos belonging to different categories. The problem has attracted researchers into this field. This paper presents a domain independent video summarization method utilizing high and low-level features. Summarization step is preceded by redundancy elimination step based on motion vectors preserving key-frames. The approach attains good results compared to the other existing standard techniques. The work can be extended by including fine-tuning of the parameters of hidden layers of SAE which leads to more representative features and can ultimately improve the quality of generated summaries.
Footnotes
Acknowledgment
We would like to thank University of Kerala, Thiruvananthapuram, India for providing financial support for this work under University Junior Research Fellowship Scheme.
