An optimal video summarization of surveillance systems using LFOB-COA with deep features and RBLSTM

Abstract

The visual data acquired from single-camera or multi-view surveillance camera networks is increasing exponentially every day. The major task in video summarization is to identify the important shots in a given video that faithfully represent the original video. For efficient video summarization of surveillance systems, an optimization algorithm called LFOB-COA is proposed in this paper. After data collection, the proposed method has five steps: pre-processing, deep feature extraction (FE), shot segmentation using JSFCM, classification using a Rectified Linear Unit activated BLSTM, and keyframe selection using LFOB-COA. Finally, a post-processing step is applied. To assess the effectiveness of the proposed method, its results are compared with those of existing methods.

Keywords

Video summarization, Levy Flight (LF) and opposition-based learning Coyote Optimization Algorithm (LFOB-COA), Bi-directional Long Short-term Memory (BLSTM), Jaccard Similarity-centered Fuzzy C-Means (JSFCM)

1 Introduction

The quantity of video data is growing explosively, which introduces new challenges for numerous video processing tasks [1]. As a result, video summarization has received increasing attention [2]. Video Summarization (VS) aims to generate a shortened version of a video that expresses the gist of the original [3]. It falls mainly into two categories: storyboard and video skim [4]. In a storyboard, a few images are selected from the original video and are expected to approximate the visual content of the entire video [5]. Since it highlights the exact key points within the video, the generation process is easy [6]. Consequently, keyframe-based VS eliminates the time axis [7]. A video skim compiles a group of video shots that contain interesting content [8]. Such a summary contains both audio and motion elements, which enrich the emotion and the quantity of information conveyed by the summary [9]. However, the difficulty lies in identifying the important video segments [10]. A video summary can enhance the efficiency of video browsing [11]. Furthermore, Single Video Summarization (SVS) focuses on summarizing a long video into a shorter form [12], whereas Multi-view Video Summarization (MVS) summarizes multi-view videos into informative video summaries [13]. As a consequence, SVS is less difficult than MVS [14]. VS reduces the time and cost needed for analyzing video information [15]. However, creating a summary from a sequence of topic-related videos remains a difficult problem [16]. The computational complexity of several existing techniques is extremely high because of redundant frames [17]. Moreover, most approaches neglect high-level semantics [18]. Therefore, an efficient deep learning (DL)-centered video summarization method using the LFOB-COA and RBLSTM algorithms is proposed here.

This paper is organized as follows: Section 2 reviews the literature. Section 3 presents the proposed technique. Section 4 presents the results along with a discussion. Lastly, Section 5 concludes the paper.

2 Literature review

Shaohui Mei et al. [19] proposed a true-sparse-constrained Minimum Sparse Reconstruction (MSR) model for VS. A Percentage of Reconstruction criterion and a selection matrix were used for keyframe selection. Experimental results showed that this MSR-centered VS approach surpassed the state-of-the-art methods. Its disadvantages were that the framework was effective only for videos with low dimensionality and could become stuck at local minima.

Shuwen Xiao et al. [20] introduced a Query-biased Self-Attentive Network for query-focused VS. It proposed a hierarchical self-attentive network that models the relative relationships at three levels: the different segments of the same video, the textual information of the video description, and the associated visual content. The results showed improved efficacy and efficiency; however, training the network required more time, which delayed the overall summarization process.

Xingrun Wang et al. [21] proposed a VS system centered on modality correlation. The correlation between the video frames and the text was measured using a recurrent neural network (RNN) model. The framework attained a promising level of performance. However, processing longer video sequences was very complicated owing to the RNN.

Archana and Malmurugan [22] proposed a Multi-Edge optimized Long Short-Term Memory (LSTM) RNN for VS, in which the multi-edge optimized LSTM was used to extract edge pixel information. This method showed higher performance. Its drawback was that the phase information of the image was affected when converting the image from the spatial domain to the frequency domain.

Zhong Ji et al. [23] proposed attentive encoder-decoder networks that use a Bidirectional LSTM, and explored two attention-centered LSTM networks. This model outperformed the competing methods with higher scores. However, the activation function (AF) used in the BLSTM caused a vanishing gradient problem, which made the computation slow.

3 Proposed methodology

This paper proposes an optimization algorithm for effective VS using LFOB-COA with deep features. The method comprises pre-processing, FE, segmentation, classification, and keyframe selection. Figure 1 illustrates the proposed method.

Fig. 1

Illustration of the proposed methodology.

3.1 Video splitting

The input video v_s is first collected and split into m frames, $v_{s} = f_{i} = \{f_{1}, f_{2}, \dots, f_{m}\}$ (1)

3.2 Pre-processing

3.2.1 Image rotation

Image rotation is used to correct the orientation of the image by mapping a position (a₁, b₁) in the input image (a split frame f_i), with center coordinates (a₀, b₀), onto a position (a₂, b₂) through an angle φ, $a_{2} = \cos (φ) \cdot (a_{1} - a_{0}) - \sin (φ) \cdot (b_{1} - b_{0}) + a_{0}$ (2) $b_{2} = \sin (φ) \cdot (a_{1} - a_{0}) + \cos (φ) \cdot (b_{1} - b_{0}) + b_{0}$ (3)

3.2.2 Image translation

To improve the visualization of the image, each position is translated by (ψ_a, ψ_b). $a_{2} = a_{1} + ψ_{a}$ (4) $b_{2} = b_{1} + ψ_{b}$ (5)
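The coordinate mappings of Equations (2) to (5) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `rotate_point` and `translate_point` and the sample coordinates are assumptions, and the angle is taken to be in radians.

```python
import math

def rotate_point(a1, b1, a0, b0, phi):
    """Rotate (a1, b1) about the center (a0, b0) by phi radians, per Eqs. (2)-(3)."""
    a2 = math.cos(phi) * (a1 - a0) - math.sin(phi) * (b1 - b0) + a0
    b2 = math.sin(phi) * (a1 - a0) + math.cos(phi) * (b1 - b0) + b0
    return a2, b2

def translate_point(a1, b1, psi_a, psi_b):
    """Shift (a1, b1) by the offsets (psi_a, psi_b), per Eqs. (4)-(5)."""
    return a1 + psi_a, b1 + psi_b

# Hypothetical example: rotate the position (2, 1) by 90 degrees about the
# center (1, 1), then shift the result by (3, -1).
rotated = rotate_point(2.0, 1.0, 1.0, 1.0, math.pi / 2)
moved = translate_point(*rotated, 3.0, -1.0)
```

Applying these mappings to every pixel position of a frame, followed by resampling, would give the rotated or translated frame; the resampling step is not specified in the text and is omitted here.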

3.2.3 Image scaling

Image scaling is employed to alter the quantity of information stored in a scene. The pre-processed images ρ_s can then be denoted as, $ρ_{s} = \{f_{1}, f_{2}, \dots, f_{m}\}$ (6)

3.3 Deep feature extraction

After pre-processing, deep features are extracted from the video frames (VF) using a DL model called GoogLeNet. The visual descriptors of the input are extracted using the convolutional layers (CL) along with the pooling layers of GoogLeNet. Figure 2 shows the GoogLeNet architecture.

Fig. 2

Architecture of GoogLeNet.

The output x_i of the i^th CL is given by,

$x_{i} = \sum_{i = 1}^{n} (ρ_{s} * w_{i} + b_{i}), \quad \text{where } ρ_{s} = \{f_{1}, f_{2}, \dots, f_{m}\}$ (7)

where n, w_i and b_i denote the number of filters, the filter weight and the bias value, respectively. The output of GoogLeNet is therefore the set of N extracted deep feature vectors y_j, j = 1, 2, ⋯ , N,

$y_{j} = \{y_{1}, y_{2}, \dots, y_{N}\}$ (8)
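To make the filter term ρ_s * w_i + b_i of Equation (7) concrete, the following sketch applies a single filter to a small frame. It is a minimal illustration and not the GoogLeNet implementation: the function name `conv2d_valid`, the stride of 1, the absence of padding, and the sample frame and filter are all assumptions.

```python
def conv2d_valid(frame, kernel, bias=0.0):
    """Slide `kernel` over `frame` (no padding, stride 1) and add `bias`.

    As in most CNN layers, this is a cross-correlation: the kernel is
    not flipped before it is applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(frame) - kh + 1
    out_w = len(frame[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    acc += frame[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# Hypothetical 3x3 grayscale frame and a 2x2 filter that sums its window.
frame = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 1],
          [1, 1]]
feature_map = conv2d_valid(frame, kernel, bias=0.5)
# feature_map == [[12.5, 16.5], [24.5, 28.5]]
```

In practice the deep features would be taken from a pre-trained network loaded through a DL library rather than computed with hand-written loops.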

3.4 Shot segmentation using JSFCM

For shot segmentation, the JSFCM algorithm groups the extracted features (y_j) into distinct clusters of similar frames. Conventional FCM forms the cluster centers based on the Euclidean distance (ED), which makes the cluster centers complicated to compute. Therefore, the Jaccard similarity (JS) is employed instead, which measures the number of similar and distinct items in two sets. This JS-centered clustering of similar frames in FCM is named JSFCM.

Step 1: Initially, the N feature vectors are divided into k clusters. The objective function (ψ) is, $ψ = \sum_{p = 1}^{k} \sum_{j = 1}^{N} {∥ y_{j} - c_{p} ∥}^{2} \cdot {(η_{pj})}^{z}$ (9)

where c_p denotes the p^th cluster center, η_pj denotes the membership function (MF), and z denotes the fuzziness of the membership function.

Step 2: Compute the cluster center c_p using Equation (10) $c_{p} = \frac{\sum_{j = 1}^{N} {(η_{pj})}^{z} \cdot y_{j}}{\sum_{j = 1}^{N} {(η_{pj})}^{z}}, 1 ⩽ p ⩽ k$ (10)

Step 3: Compute the JS between two different frames, $J_{j} (l) = \frac{\sum_{j = 1}^{N} \sum_{l = 1}^{M} \min (y_{j} (l), y_{j} (l - 1))}{\sum_{j = 1}^{N} \sum_{l = 1}^{M} \max (y_{j} (l), y_{j} (l - 1))}$ (11)

where l = 1, 2, ⋯ , M indexes the frames from y_j, J_j (l) denotes the JS, and y_j (l) and y_j (l - 1) denote the l^th and (l - 1)^th frames.

Step 4: Update the MF η_pj using Equation (12) $η_{pj} = \frac{1}{\sum_{j = 1}^{N} \sum_{l = 1}^{M} {(\frac{1}{J_{j} (l)})}^{\frac{2}{z - 1}}}$ (12)

Hence, a collection of possible clusters is obtained by comparing the JS between the color histograms of successive frames. The n clustered frames (C_f) can be denoted as, $C_{f} = \{C_{1}, C_{2}, \dots, C_{n}\}$ (13)
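Two building blocks of the steps above, the min/max form of the JS in Equation (11) and the membership-weighted cluster center in Equation (10), can be sketched as follows. The helper `shot_boundaries`, its threshold of 0.5, and the sample histograms are hypothetical and serve only to show how the similarity between successive frames could be used; the membership update of Equation (12) is not reproduced here.

```python
def jaccard_similarity(u, v):
    """Ratio of summed element-wise minima to summed maxima, as in Eq. (11)."""
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return num / den if den else 1.0

def cluster_center(features, memberships, z=2.0):
    """Membership-weighted mean of the feature vectors, as in Eq. (10)."""
    weights = [m ** z for m in memberships]
    total = sum(weights)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)]

def shot_boundaries(histograms, threshold=0.5):
    """Indices l at which frame l is dissimilar to frame l-1 (assumed rule)."""
    return [l for l in range(1, len(histograms))
            if jaccard_similarity(histograms[l], histograms[l - 1]) < threshold]

# Hypothetical 3-bin color histograms for four successive frames.
hists = [[4, 2, 0], [4, 1, 1], [0, 1, 5], [0, 2, 4]]
cuts = shot_boundaries(hists)          # a new shot starts at frame index 2
center = cluster_center([[0.0, 0.0], [2.0, 4.0]], [0.5, 0.5])
```

With equal memberships the center reduces to the plain mean of the feature vectors, which is a quick sanity check for Equation (10).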

3.5 Classification using RBLSTM

Using RBLSTM, the clustered frames are categorized as informative or non-informative sequences. The conventional BLSTM employs a sigmoid AF, which suffers from the vanishing gradient problem. Therefore, ReLU is used in place of the sigmoid AF to overcome the vanishing gradient problem. This ReLU-activated BLSTM is called RBLSTM.

Step 1: The input gate I (t), forget gate (FG) F (t), output gate O (t), candidate memory cell M (t) and the state value within the memory cell $\vec{M (t)}$ are computed as, $I (t) = χ (W_{I} C_{f} + u_{I} h_{t - 1} + b_{I})$ (14) $F (t) = χ (W_{F} C_{f} + u_{F} h_{t - 1} + b_{F})$ (15) $O (t) = χ (W_{O} C_{f} + u_{O} h_{t - 1} + b_{O})$ (16) $M (t) = \tanh (W_{M} C_{f} + u_{M} h_{t - 1} + b_{M})$ (17) $\vec{M (t)} = I (t) \otimes M (t) + F (t) \otimes M (t - 1)$ (18)

where t denotes the time step; h_t-1 denotes the hidden state; W_I, u_I and b_I are the parameters of the RBLSTM's input gate; W_F, u_F, b_F, W_O, u_O, b_O, W_M, u_M and b_M denote the input vector weights, the weights at the previous time step, and the biases of the FG, output gate and memory cell of the RBLSTM, respectively; the values of I (t) and F (t) range between 0 and 1; M (t - 1) denotes the information preserved in the preceding memory; and χ denotes the ReLU AF. $χ = \frac{1}{1 + e^{- C_{f}}}$ (19)

Step 2: The RBLSTM's output is given in Equations (20) and (21), $h_{t} = \vec{h_{t}} \oplus \overset{\leftarrow}{h_{t}}$ (20) $h_{t} = O (t) \otimes \tanh (\vec{M (t)})$ (21)

where $\vec{h_{t}}$ and $\overset{\leftarrow}{h_{t}}$ denote the forward and backward hidden states at time t + 1. Thus, the output of the RBLSTM (h_t) contains both informative and non-informative sequences of frames, from which the keyframes are further selected using LFOB-COA.
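The gate computations of Equations (14) to (18) and the outputs of Equations (20) and (21) can be illustrated with a scalar-valued sketch. Several simplifications are assumed: every weight is a single number rather than a matrix, the gate activation χ is taken to be ReLU, max(0, x), as the text describes (Equation (19) as printed has the form of the logistic function), and the parameter values are hypothetical.

```python
import math

def relu(x):
    return max(0.0, x)

def cell_step(x, h_prev, c_prev, p):
    """One scalar LSTM step following Eqs. (14)-(18) and (21), with ReLU gates."""
    i = relu(p["W_I"] * x + p["u_I"] * h_prev + p["b_I"])       # input gate
    f = relu(p["W_F"] * x + p["u_F"] * h_prev + p["b_F"])       # forget gate
    o = relu(p["W_O"] * x + p["u_O"] * h_prev + p["b_O"])       # output gate
    m = math.tanh(p["W_M"] * x + p["u_M"] * h_prev + p["b_M"])  # candidate cell
    c = i * m + f * c_prev                                      # cell state
    h = o * math.tanh(c)                                        # hidden state
    return h, c

def run_direction(xs, p):
    """Run the cell over a sequence, starting from zero hidden and cell states."""
    h, c, hs = 0.0, 0.0, []
    for x in xs:
        h, c = cell_step(x, h, c, p)
        hs.append(h)
    return hs

def bidirectional(xs, p_fwd, p_bwd):
    """Pair forward and backward hidden states per time step, as in Eq. (20)."""
    fwd = run_direction(xs, p_fwd)
    bwd = run_direction(xs[::-1], p_bwd)[::-1]
    return list(zip(fwd, bwd))

# Hypothetical parameters: the gates pass the input through, no recurrence.
params = {"W_I": 1.0, "u_I": 0.0, "b_I": 0.0,
          "W_F": 0.0, "u_F": 0.0, "b_F": 0.0,
          "W_O": 1.0, "u_O": 0.0, "b_O": 0.0,
          "W_M": 1.0, "u_M": 0.0, "b_M": 0.0}
states = bidirectional([1.0, -1.0], params, params)
```

A classifier head on top of the paired states (for example a dense layer that labels each sequence as informative or non-informative) would be needed for the full classification step; it is not specified in the text and is left out.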

3.5.1 Key frame selection using LFOB-COA

After the frames are classified, the keyframes are chosen from the informative frames using LFOB-COA. The COA is based on the social behavior of coyotes and their adaptation to the environment. The original COA is liable to become trapped in local optima and needs a proper balance between exploration and exploitation. Therefore, Levy flight (LF) and opposition-centered learning are incorporated into the original COA to overcome these limitations. This LF and opposition-based COA is called LFOB-COA.

Step 1: Initially, the population (h_t) is divided into p groups with q coyotes each within the search space, $ζ_{i} = L_{i} + R_{i} (U_{i} - L_{i})$ (22)

where ζ_i denotes the initial social condition, U_i and L_i denote the upper and lower bounds of the i^th decision variable, and R_i denotes a real random number within the range [0, 1] generated using the LF distribution. $R_{i} = \begin{cases} \frac{μ \cdot φ}{{| ν |}^{R_{i}}} & \text{if } R_{i} > 1 \\ 1 & \text{else} \end{cases}$ (23)

where μ and ν are drawn from normal distributions, and φ is a fixed parameter computed with the standard gamma function γ, $φ = {\left[\frac{γ (1 + R_{i}) \times \sin (\frac{π R_{i}}{2})}{γ (\frac{1 + R_{i}}{2}) \times R_{i} \times 2^{\frac{R_{i} - 1}{2}}}\right]}^{\frac{1}{R_{i}}}$ (24)

Step 2: Next, with the objective of obtaining the highest memorability m_i and entropy value e_i for each coyote, the fitness function f (ζ) is computed using Equation (25). $f (ζ_{i}) = \sum_{i = 1}^{D} \max (m_{i}, e_{i})$ (25)

Step 3: Occasionally, a coyote moves to another group with probability (p_l), $p_{l} = 0.005 \cdot q^{2}$ (26)

Step 4: The alpha coyote α of the p^th group at the t^th instant of time is given by, $α = ζ_{i} \text{ for } \min (f (ζ_{i}))$ (27)

Step 5: The cultural tendency (C_t) of every group is computed using Equation (28) $C_{t} = \begin{cases} R_{\frac{q + 1}{2}, i}^{p, t} & \text{if } q \text{ is odd} \\ \frac{R_{\frac{q}{2}, i}^{p, t} + R_{(\frac{q}{2} + 1), i}^{p, t}}{2} & \text{otherwise} \end{cases}$ (28)

where R^p,t denotes the ranked social conditions of the coyotes.

Step 6: The life cycle of the coyotes, i.e., birth and death (N_c), is given in Equation (29), $N_{c} = \begin{cases} ζ_{p_{1}, i} & n_{i} < p (s) \text{ or } i = i_{1} \\ ζ_{p_{2}, i} & n_{i} ⩾ p (s) + p (a) \text{ or } i = i_{2} \\ T_{i} & \text{otherwise} \end{cases}$ (29)

where p₁ and p₂ are random coyotes within p, i₁ and i₂ are two random decision variables, p (s) and p (a) denote the scatter and association probabilities, T_i denotes a random number within the limits of the D-dimensional decision variable, and n_i denotes a random number within the range [0, 1] generated using Equation (23). The probabilities p (s) and p (a) are given by, $p (s) = \frac{1}{D}$ (30) $p (a) = \frac{1 - p (s)}{2}$ (31)

Step 7: The cultural interactions κ₁ and κ₂ and the update of the coyote's social condition ζ_i,new are given below, $κ_{1} = α - ζ_{i, p_{1}}$ (32) $κ_{2} = C_{t} - ζ_{i, p_{2}}$ (33) $ζ_{i, new} = ζ_{i} + n_{1} κ_{1} + n_{2} κ_{2}$ (34)

Step 8: The new fitness value of the coyote (f (ζ_i,new)) is evaluated using the equation below, $f (ζ_{i, new}) = \sum_{i = 1}^{D} \max (m_{i, new}, e_{i, new})$ (35) $ζ_{i} (t + 1) = \begin{cases} ζ_{i, new} & \text{if } f (ζ_{i, new}) < f (ζ_{i}) \\ ζ_{i} & \text{else} \end{cases}$ (36)

Step 9: If ζ_i is a solution in the existing search space, its counterpart ${\hat{ζ}}_{i}$ in the space transformed by the opposition-centered learning method is given by, ${\hat{ζ}}_{i} = L_{i} + R_{i} - ζ_{i}$ (37)

Thus, the candidate fitness $f ({\hat{ζ}}_{i})$ is evaluated, and the condition for updating the search space (S) is, $S = \begin{cases} ζ_{i} \leftarrow {\hat{ζ}}_{i} & \text{if } f ({\hat{ζ}}_{i}) ⩽ f (ζ_{i}) \\ ζ_{i} & \text{else} \end{cases}$ (38)
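Some of the individual operators in Steps 1 to 9 can be sketched in isolation. The sketch below rests on several assumptions that are not taken from the text: the Levy step follows the standard Mantegna form with a stability index β = 1.5 and exponent 1/β, the opposition point uses the common form L + U - ζ (Equation (37) is written with R_i instead of U_i), and all quantities are scalars rather than D-dimensional vectors. All function names are hypothetical.

```python
import math
import random

def levy_sigma(beta=1.5):
    """Scale factor of Eq. (24), with the stability index written as beta."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)

def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step mu / |nu|^(1/beta) (Mantegna form, assumed)."""
    mu = rng.gauss(0.0, levy_sigma(beta))
    nu = rng.gauss(0.0, 1.0)
    while nu == 0.0:  # avoid a division by zero in the unlikely exact-zero case
        nu = rng.gauss(0.0, 1.0)
    return mu / abs(nu) ** (1 / beta)

def migration_probability(q):
    """Probability that a coyote leaves its group, Eq. (26)."""
    return 0.005 * q ** 2

def cultural_tendency(ranked):
    """Median of the ranked social conditions of one group, Eq. (28)."""
    q = len(ranked)
    mid = q // 2
    return ranked[mid] if q % 2 else (ranked[mid - 1] + ranked[mid]) / 2

def scatter_association(D):
    """Scatter and association probabilities, Eqs. (30)-(31)."""
    p_s = 1 / D
    return p_s, (1 - p_s) / 2

def opposite(zeta, lower, upper):
    """Opposition point of zeta within [lower, upper] (assumed standard form)."""
    return lower + upper - zeta

def greedy_update(current, candidate, fitness):
    """Keep the candidate only if its fitness is strictly lower, as in Eq. (36)."""
    return candidate if fitness(candidate) < fitness(current) else current

sigma = levy_sigma(1.5)
median_odd = cultural_tendency([0.1, 0.4, 0.9])
median_even = cultural_tendency([0.1, 0.4, 0.6, 0.9])
mirrored = opposite(0.2, 0.0, 1.0)
p_s, p_a = scatter_association(4)
```

The fitness of Equation (25) depends on memorability and entropy scores of the frames, which require a scoring model that the text does not specify, so `greedy_update` takes the fitness as a caller-supplied function.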

The global solution to the optimization problem is the social condition of the coyote that has best adapted itself to the environment. Lastly, the selected keyframes (K_S) are merged to obtain the summary of the input video. Finally, post-processing is performed, which uses the color histogram difference to discard frames of the same shot from the summarized video. The pseudocode of LFOB-COA is shown in Fig. 3.

Fig. 3

Pseudocode of LFOB-COA.
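The histogram-based post-processing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: frames are flat lists of 8-bit grayscale pixels instead of color images, the histogram has four bins, the difference is a normalized L1 distance, and the threshold of 0.3 is hypothetical.

```python
def histogram(frame, bins=4, max_value=256):
    """Count the pixel intensities of a flat list `frame` in equal-width bins."""
    counts = [0] * bins
    width = max_value / bins
    for px in frame:
        counts[min(int(px / width), bins - 1)] += 1
    return counts

def histogram_difference(h1, h2):
    """Normalized L1 distance between two histograms, in the range [0, 1]."""
    total = sum(h1) + sum(h2)
    return sum(abs(a - b) for a, b in zip(h1, h2)) / total if total else 0.0

def drop_near_duplicates(frames, threshold=0.3):
    """Keep a keyframe only if it differs enough from the last kept keyframe."""
    kept = []
    for frame in frames:
        if not kept or histogram_difference(
                histogram(kept[-1]), histogram(frame)) >= threshold:
            kept.append(frame)
    return kept

# Hypothetical keyframes: the second one is nearly identical to the first.
keyframes = [[10, 20, 30, 200], [12, 22, 28, 205], [250, 240, 230, 10]]
summary = drop_near_duplicates(keyframes)
# summary keeps the first and third keyframes
```

For color frames the same comparison could be applied per channel; the threshold would have to be tuned on the target videos.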

4 Results and discussion

The performance of the proposed technique is compared with that of conventional systems. The proposed JSFCM, RBLSTM, and LFOB-COA are compared with existing techniques in terms of the following performance metrics: precision, recall, accuracy, F-measure, frame count, and execution time. Five surveillance videos are selected from the CAVIAR and CViSOR datasets for the analysis [26]. Figure 4 shows the input VF and the selected keyframes from a sample video of the CAVIAR and CViSOR datasets.

Fig. 4

(a) Input frames, (b) Selected key frames.

4.1 Performance analysis of JSFCM

Here, the performance of the proposed JSFCM for shot segmentation is compared with that of existing methods, namely FCM, K-means, and K-medoids, in terms of recall, precision, F-measure, and accuracy.
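For reference, the four metrics named above can be computed from the counts of true positives (tp), false positives (fp), false negatives (fn) and true negatives (tn) using their standard definitions. The function name and the counts in the example are hypothetical and are not results of this work.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard precision, recall, F-measure and accuracy as fractions in [0, 1]."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}

# Hypothetical counts for one video; multiply by 100 to report percentages.
scores = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```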

Table 1 shows that the proposed JSFCM attains higher precision and recall values than the existing methods. For video-1, the proposed work attains 96.4325% precision and 95.8796% accuracy. Likewise, for the remaining videos, the proposed work attains higher precision and recall than the conventional techniques. Thus, the proposed method demonstrates superior performance compared with the existing methodologies.

Table 1

Performance comparison: (a) precision (%) and (b) recall (%)

(a) Precision
Input videos	Proposed JSFCM	FCM	K-means	K-medoids
Video-1	96.4325	93.2653	92.1254	88.3467
Video-2	97.5639	93.4687	91.6542	88.9873
Video-3	95.3686	94.2469	91.4321	89.7453
Video-4	96.5678	93.6218	92.7809	89.1245
Video-5	95.4191	94.4509	92.3789	88.3642

(b) Recall
Input videos	Proposed JSFCM	FCM	K-means	K-medoids
Video-1	95.4387	94.5967	92.1145	89.2793
Video-2	95.2846	94.1497	92.8976	89.6029
Video-3	96.1093	93.8096	92.5784	90.4956
Video-4	96.3857	93.3907	92.6598	90.9856
Video-5	95.6809	94.2138	91.9985	90.2867

Table 2

Performance comparison in terms of F-measure (%)

Input videos	Proposed RBLSTM	BLSTM	KNN	SVM
Video-1	97.2353	91.2476	85.1123	82.2674
Video-2	97.1934	92.7846	83.6734	80.2734
Video-3	96.9734	91.9237	85.2874	82.3864
Video-4	96.2374	88.2137	84.3484	81.2363
Video-5	97.2136	89.6345	83.8734	80.8234

Fig. 5 shows that the proposed work attains an F-measure of 95.2809% for video-2, while the existing systems FCM, K-means, and K-medoids attain F-measures of 93.4782%, 91.2976%, and 89.2647%, respectively. Meanwhile, the accuracy of the proposed method is 95.8827% for video-3, whereas the accuracies of the existing techniques are lower: FCM (94.9865%), K-means (91.3746%), and K-medoids (90.2536%). In addition, the proposed method attains higher performance for the remaining videos. This confirms that the proposed method attains better performance than the conventional methods.

Fig. 5

Performance comparison based on (a) F-measure and (b) Accuracy.

4.2 Performance analysis of RBLSTM

In this stage, the performance of the proposed RBLSTM classifier is compared with that of existing classifiers, namely BLSTM, K-nearest neighbor (KNN), and Support Vector Machine (SVM), in terms of F-measure and accuracy.

As shown in Table 2, the proposed method attains an F-measure of 97.2136% for video-5; likewise, for videos 1–4, the proposed method attains a higher F-measure than the existing methods. The conventional systems attain lower performance than the proposed one for every video. Therefore, the proposed technique shows superior performance over the existing techniques.

Fig. 6 shows the accuracy of the proposed RBLSTM and of the existing methods. For video-4, the proposed work attains 97.2334% accuracy, which is the maximum among the compared methods. The accuracy of the existing approaches is lower than that of the proposed work. Thus, the proposed method attains higher accuracy than the existing classifiers.

Fig. 6

Performance measure in terms of Accuracy of the proposed RBLSTM.

4.3 Performance analysis of LFOB-COA

In this phase, the proposed LFOB-COA is compared with the existing COA, Normalized K-means (NK-means), and Center Surround Model and Integer Knapsack formulation for K-means (CSMIK-Kmeans) methods in terms of frame count and execution time.

For video-1, the proposed method yields a frame count of 172 frames and an execution time of 2.6589 sec. The existing techniques, in contrast, yield a higher frame count and execution time. Likewise, the total frames and execution time are higher for the existing methods for each video. Hence, the proposed technique performs better than the existing approaches. Figure 7 shows the comparative analysis based on frame count and execution time.

Fig. 7

Comparative analysis based on (a) Frame count and (b) Execution time.

5 Conclusion

RBLSTM and LFOB-COA are proposed in this paper for effective video summarization of surveillance systems. To examine its effectiveness, the performance of the proposed method is compared with that of existing methods. The proposed technique attains a higher F-measure (97.2353%) and accuracy (97.2334%) for the RBLSTM classifier, and a low frame count (172 frames) and execution time (2.6589 sec) for LFOB-COA-centered keyframe selection. Thus, the proposed technique shows better results on the above performance metrics. Therefore, the proposed RBLSTM and LFOB-COA offer effective video summarization of surveillance systems with greater accuracy and lower execution time. In the future, the work will be extended to multi-view videos of surveillance systems using an advanced video summarization model.

References

1. …, Mei, Wan, Hou, Wang and Feng D.D., Video summarization via block sparse dictionary selection, Neurocomputing 378 (2020), 197–209.

2. Yuan, Mei, Cui and Zhu, Video summarization by learning deep side semantic embedding, IEEE Transactions on Circuits and Systems for Video Technology 29(1) (2017), 226–237.

3. Gao, Yang, Zhang and Xu, Unsupervised video summarization via relation-aware assignment learning, IEEE Transactions on Multimedia, (Early Access) (2020), DOI: https://doi.org/10.1109/TMM.2020.3021980.

4. Li X., Zhao and Lu, A general framework for edited video and raw video summarization, IEEE Transactions on Image Processing 26(8) (2017), 3652–3664.

5. Huang and Wang, A novel key-frames selection framework for comprehensive video summarization, IEEE Transactions on Circuits and Systems for Video Technology 30(2) (2019), 577–589.

6. Yuan, Li and Wang, Spatiotemporal modeling for video summarization using convolutional recurrent neural network, IEEE Access 7 (2019), 64676–64685.

7. Fei, Jiang and Mao, A novel compact yet rich key frame creation method for compressed video summarization, Multimedia Tools and Applications 77(10) (2018), 11957–11977.

8. …, Yao, Ling and Mei, Detecting shot boundary with sparse coding for video summarization, Neurocomputing 266 (2017), 66–78.

9. Kannappan, Liu and Tiddeman, DFP-ALC: automatic video summarization using distinct frame patch index and appearance based linear clustering, Pattern Recognition Letters 120 (2019), 8–16.

10. Michael M.T. and Balachandran, A classified study on semantic analysis of video summarization, 2017 International Conference on Algorithms, Methodology, Models and Applications in Emerging Technologies (ICAMMAET), IEEE, 16-18 Feb. 2017, Chennai, India, 2017.

11. Zhao, Li and Lu, Property-constrained dual learning for video summarization, IEEE Transactions on Neural Networks and Learning Systems 31(10) (2019), 3989–4000.

12. …, Zhang, Pang and Li, Hypergraph dominant set based multi-video summarization, Signal Processing 148 (2018), 114–123.

13. …, Guo, Zhu, Liu, Song and Zhou Z.-H., Multi-view video summarization, IEEE Transactions on Multimedia 12(7) (2010), 717–729.

14. Hussain, Muhammad, Ding, Lloret, Baik S.W. and Victor Hugo de Albuquerque, A comprehensive survey of multi-view video summarization, Pattern Recognition 109 (2021), 1–15.

15. Zhang, Kampffmeyer, Zhao and Tan, Deep reinforcement learning for query-conditioned video summarization, Applied Sciences 9(4) (2019), 1–16.

16. …, Zhang, Pang, Li and Pan, Multi-video summarization with query-dependent weighted archetypal analysis, Neurocomputing 332 (2019), 406–416.

17. Mohan and Nair M.S., Domain independent redundancy elimination based on flow vectors for static video summarization, Heliyon 5(10) (2019), 1–8.

18. Sun, Zhu, Lei, Hou, Zhang, Duan and Qiu, Learning deep semantic attributes for user video summarization, In 2017 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 10-14 July 2017, Hong Kong, China, 2017.

19. Mei, Guan, Wang, Wan, He and Feng D.D., Video summarization via minimum sparse reconstruction, Pattern Recognition 48(2) (2015), 522–533.

20. Xiao, Zhao, Zhang, Guan and Cai, Query-biased self-attentive network for query-focused video summarization, IEEE Transactions on Image Processing 29 (2020), 5889–5899.

21. Wang, Nie, Liu, Wang and Yin, Modality correlation-based video summarization, Multimedia Tools and Applications 79(45) (2020), 33875–33890.

22. Archana and Malmurugan, Multi-edge optimized LSTM RNN for video summarization, Journal of Ambient Intelligence and Humanized Computing 12 (2020), 5381–5395.

23. Ji, Xiong, Pang and Li, Video summarization with attention-based encoder–decoder networks, IEEE Transactions on Circuits and Systems for Video Technology 30(6), 1709–1717.

24. Michael Moses and Balachandran, A deterministic key-frame indexing and selection for surveillance video summarization, International Conference on Data Science and Communication (IconDSC), 1-2 March 2019, Bangalore, India, 2019.