Detection of shot boundaries and extraction of key frames for video retrieval

Abstract

Detecting shot transitions and extracting key frames are prerequisite steps for video summarization and retrieval. We propose a method based on the Scale Invariant Feature Transform (SIFT). SIFT features capture the statistical changes that occur at different kinds of shot transitions. Key frames (representative frames) are then extracted from each segmented shot by computing the entropy of every frame in the shot. The performance of the proposed system is further improved by removing repeated representative frames using the edge matching rate technique. The algorithm is applied to several classes of videos to detect shot transitions and extract key frames, and the experimental results indicate its efficacy and accuracy.

Keywords

Video dataset, frame separation, Scale Invariant Feature Extraction, similarity measures, shot transitions, representative frames

1. Introduction

The quantity of video data has expanded rapidly in recent years, driven by the availability of video recording equipment, affordable storage devices and high-speed communication services. An effective video database system is therefore necessary for retrieving, browsing and searching [1, 2]. Current video database systems retrieve videos by text or description, which costs the user considerable time and requires manual effort. To overcome this hurdle, a content-based video retrieval system is required [3]. Detecting shot transitions and extracting representative frames are the crucial steps in video retrieval [4]. A shot is the sequence of continuous frames captured between switching the camera on and off. A shot boundary separates two successive shots of a video and is classified as either a cut boundary or a gradual boundary. The detection of shot transitions is a significant step in indexing, classifying, summarizing and retrieving videos [5, 16].

Consequently, research on shot boundary detection has undergone a paradigm shift, and many algorithms have been developed for this purpose, including pixel-based, block-based and histogram-based boundary detection. In the pixel-based process [6], corresponding pixels of successive frames are compared. Because it is arithmetically simple, it saves time. Its disadvantage is that even a small shake of the camera or movement of the object affects the result, so that two similar consecutive frames appear dissimilar. Even a small change in the illumination of the object results in dissimilarity. In block-based detection of shot transitions [7], each frame is split into multiple blocks, and the corresponding blocks of successive frames are compared and evaluated. Compared to the pixel-based technique, this technique provides better outcomes, but it is slow owing to its computational cost, and it cannot handle objects that move quickly. Histogram-based detection of shot boundaries [8, 17] utilizes the statistical distribution of intensity values. It is robust to disturbances such as translation, rotation and camera motion [18]. Its main drawback is that it does not give accurate results when two dissimilar frames have similar color distributions [29].

To overcome the drawbacks of existing algorithms, in this paper we propose a methodology for detecting shot boundaries using the Scale Invariant Feature Transform (SIFT). It is robust, accurate and invariant to noise, image scaling, image rotation and changes in brightness [9, 28]. Shot boundary detection is the initial stage in video abstraction and is followed by key frame extraction. Representative frame extraction captures the notable content of a shot and summarizes the features of its remaining frames. It eliminates repeated information, which in turn decreases the data needed for video retrieval. In the proposed algorithm, the entropy value of every frame in each shot is computed, and the frames that have distinct entropy values are selected as key frames. Redundant key frames are then eliminated using the edge matching rate method, which yields the ultimate representative frames [10, 19]. The remainder of the paper is organized as follows: Section 2 discusses shot boundary detection, Section 3 deals with key frame extraction, and Section 4 presents the framework of our proposed method. Section 5 covers the results of our experiments and Section 6 concludes the paper.

2. Shot boundary detection

A shot is the sequence of successive frames recorded continuously by a single camera. A shot boundary separates two consecutive shots and is located on the basis of the similarity and dissimilarity of frames. Shot boundaries are classified as cut boundaries and gradual boundaries [11, 20]. A cut boundary is an abrupt change from one shot to the succeeding shot: the cut lies between the end frame of the first shot and the start frame of the second. In a gradual boundary, the transition extends over several frames [21]. To detect the shots, features must be extracted to measure the similarity and dissimilarity between two frames of a video. We use SIFT to create the feature vectors of the video frames [12].

2.1 Scale Invariant Feature Transform

SIFT is an efficient and powerful feature extraction algorithm for frame matching, and a robust algorithm for matching objects against a large database. It is accurate and reliable, and it is invariant to changes in lighting, motion, object movement and image scaling, as well as to added noise. It comprises four phases: 'Recognition of Scale-Space Extrema', 'Exact Key Point Localization', 'Orientation Assignment' and 'Descriptor Illustration' [22].

2.1.1 Recognition of scale-space extrema

In this phase, interest points are detected in the scale space of a frame. These detected points are known as the key points of the frame, and they act as features of that image. The key points are obtained by locating the local extrema of the difference-of-Gaussian (DoG) scale space of the image. The DoG scale space is obtained from Eq. (1).

$\displaystyle D(m,n,\sigma)=(G(m,n,k\sigma)-G(m,n,\sigma))*I(m,n)=L(m,n,k\sigma)-L(m,n,\sigma)$ (1)

$\displaystyle L(m,n,\sigma)=G(m,n,\sigma)*I(m,n)$ (2)

Here $I(m,n)$ is the input frame, $*$ denotes convolution, and the Gaussian $G(m,n,\sigma)$ is defined as:

$\displaystyle G(m,n,\sigma)=\frac{1}{2\pi{\sigma}^{2}}e^{-(m^{2}+n^{2})/2{\sigma}^{2}}$ (3)
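Equations (1) to (3) can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the kernel radius, the border handling (edge replication), the renormalization of the sampled kernel and the default value of $k$ are all assumptions, and the function names are hypothetical.

```python
import math


def gaussian_kernel(sigma, radius):
    """Sample G(m, n, sigma) from Eq. (3) on a (2*radius+1)^2 grid.

    The sampled kernel is renormalized to sum to 1 (an assumption, so that
    flat image regions keep their intensity after smoothing).
    """
    kernel = [
        [
            math.exp(-(m * m + n * n) / (2.0 * sigma * sigma))
            / (2.0 * math.pi * sigma * sigma)
            for n in range(-radius, radius + 1)
        ]
        for m in range(-radius, radius + 1)
    ]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def convolve(image, kernel):
    """2-D convolution G * I with replicated borders; output has the input size."""
    rows, cols = len(image), len(image[0])
    radius = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            acc = 0.0
            for i in range(-radius, radius + 1):
                for j in range(-radius, radius + 1):
                    mm = min(max(m - i, 0), rows - 1)
                    nn = min(max(n - j, 0), cols - 1)
                    acc += kernel[i + radius][j + radius] * image[mm][nn]
            out[m][n] = acc
    return out


def difference_of_gaussians(image, sigma, k=math.sqrt(2.0)):
    """D(m, n, sigma) = L(m, n, k*sigma) - L(m, n, sigma), as in Eq. (1)."""
    radius = int(math.ceil(3.0 * k * sigma))  # assumed truncation radius
    smoothed_k = convolve(image, gaussian_kernel(k * sigma, radius))  # L(., k*sigma)
    smoothed = convolve(image, gaussian_kernel(sigma, radius))        # L(., sigma)
    return [
        [a - b for a, b in zip(row_k, row)]
        for row_k, row in zip(smoothed_k, smoothed)
    ]
```

On a flat image the two smoothed versions coincide, so the DoG response is zero everywhere; key point candidates would then be the local extrema of the response across neighbouring pixels and scales.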

2.1.2 Exact key point localization

In this phase, the location and scale of each candidate key point are determined. Key points are selected according to a stability measure: discarding low-contrast and noise-sensitive candidates leaves the exact key points.

2.1.3 Orientation assignment

Here, to make the descriptor invariant to rotation, we assign an orientation to each key point. To calculate the orientation of a key point, an orientation histogram of local gradients is formed from the closest smoothed image $L(m,n,\sigma)$. Using pixel differences of this image, we calculate the gradient magnitude and angle at each sample point.

$\displaystyle M(m,n)=\sqrt{(L(m+1,n)-L(m-1,n))^{2}+(L(m,n+1)-L(m,n-1))^{2}}$ (4)

$\displaystyle\theta(m,n)={\text{tan}}^{-1}\left(\frac{L(m,n+1)-L(m,n-1)}{L(m+1,n)-L(m-1,n)}\right)$ (5)

The orientation histogram covers the full 360 degrees with thirty-six bins, and its main peak gives the orientation of the key point.
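The gradient computation of Eqs. (4) and (5) and the thirty-six-bin histogram can be sketched as below. Two choices here are assumptions rather than details stated in the text: `math.atan2` is used in place of a plain arctangent so that the angle covers the full 360 degrees, and each sample contributes its gradient magnitude to its bin. The function names are hypothetical.

```python
import math


def gradient(L, m, n):
    """Gradient magnitude and angle (degrees in [0, 360)) at an interior pixel."""
    dm = L[m + 1][n] - L[m - 1][n]          # L(m+1, n) - L(m-1, n)
    dn = L[m][n + 1] - L[m][n - 1]          # L(m, n+1) - L(m, n-1)
    magnitude = math.sqrt(dm * dm + dn * dn)              # Eq. (4)
    angle = math.degrees(math.atan2(dn, dm)) % 360.0      # Eq. (5), full circle
    return magnitude, angle


def orientation_histogram(L, bins=36):
    """Magnitude-weighted histogram of gradient angles over the interior of L."""
    hist = [0.0] * bins
    width = 360.0 / bins
    for m in range(1, len(L) - 1):
        for n in range(1, len(L[0]) - 1):
            magnitude, angle = gradient(L, m, n)
            hist[int(angle // width) % bins] += magnitude
    return hist


def dominant_orientation(hist):
    """Centre angle (degrees) of the highest bin of the histogram."""
    width = 360.0 / len(hist)
    peak = max(range(len(hist)), key=lambda b: hist[b])
    return (peak + 0.5) * width
```

For an image whose intensity increases only along $m$, every interior gradient points in the same direction, so all of the weight falls into a single bin.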
2.1.4 Descriptor illustration

The gradient data in the local image region around each key point yields a descriptor. Here we consider a 16 $\times$ 16 region around the key point to compute the descriptor [13]. The region is divided into sixteen subregions of 4 $\times$ 4 pixels, and each subregion is represented by a histogram with 8 bins corresponding to 8 directions. There are 16 subregions with 8 directions each, giving 16 $\times$ 8 $=$ 128 values. This procedure generates the 128-bin feature vector shown in Fig. 1.
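The layout of the 128-bin vector can be made concrete with a short sketch. It assumes that gradient magnitudes and angles (in degrees) are already available for the 16 $\times$ 16 region, that each sample adds its magnitude to one of eight 45-degree bins, and that the final vector is normalized to unit length; the normalization and the function name are assumptions, not details given in the text.

```python
def build_descriptor(magnitudes, angles):
    """Build a 128-value descriptor from 16x16 gradient magnitudes and angles.

    The region is split into 4x4 blocks of 4x4 pixels; each block contributes
    an 8-bin orientation histogram (16 blocks x 8 bins = 128 values).
    """
    vector = []
    for block_row in range(4):
        for block_col in range(4):
            hist = [0.0] * 8
            for i in range(4):
                for j in range(4):
                    m = 4 * block_row + i
                    n = 4 * block_col + j
                    bin_index = int((angles[m][n] % 360.0) // 45.0) % 8
                    hist[bin_index] += magnitudes[m][n]
            vector.extend(hist)
    # Unit-length normalization (assumed) so that distances are comparable.
    norm = sum(v * v for v in vector) ** 0.5
    return [v / norm for v in vector] if norm > 0 else vector
```

With unit magnitudes and a single common direction, each of the sixteen blocks puts all of its weight into one bin, which gives sixteen equal non-zero entries in the normalized vector.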

Figure 1.

Generation of key point descriptors in 16 $\times$ 8 directions.

Figure 2.

Matching of key points.

2.2 Matching of key points

Using a k-Nearest Neighbor (kNN) search, the key points in the current frame are matched to those of the subsequent frame. The key point with the smallest Euclidean distance is taken as the nearest neighbor. kNN search is one of the simplest algorithms in machine learning; here it relies on the Euclidean distance to determine whether the key points of the current frame match those of the following frame [14]. The Euclidean distance measures the difference between two feature vectors [24].

$\displaystyle d(a,b)=\sqrt{\sum^{p}_{i=1}(a_{i}-b_{i})^{2}}$ (6)

Here $a$ and $b$ are the two feature vectors and $p$ is their dimension. When the key points of two frames are matched, lines are drawn between the corresponding key points. A match is accepted if the distance between the two key points is less than 0.6. Figure 2 shows an instance of key point matching. If the number of matched key points between two frames is larger than a threshold value, the two frames belong to the same shot. Otherwise, a shot boundary is identified.
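The matching rule can be illustrated with a brute-force nearest-neighbor search over descriptors. The sketch below uses the 0.6 distance limit from the text; the match-count threshold `min_matches` is left as a parameter because its value is not stated, and the function names are hypothetical.

```python
import math


def count_matches(descriptors_a, descriptors_b, max_distance=0.6):
    """Count key points of frame A whose nearest neighbor in frame B is close.

    Distances are Euclidean, as in Eq. (6). A match is accepted when the
    nearest-neighbor distance is below max_distance.
    """
    if not descriptors_b:
        return 0
    matches = 0
    for a in descriptors_a:
        nearest = min(math.dist(a, b) for b in descriptors_b)
        if nearest < max_distance:
            matches += 1
    return matches


def is_shot_boundary(descriptors_a, descriptors_b, min_matches):
    """Declare a boundary unless the match count exceeds min_matches."""
    return count_matches(descriptors_a, descriptors_b) <= min_matches
```

The search costs one distance evaluation per pair of key points, which is adequate for illustration; a spatial index would be the usual choice for large descriptor sets.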

3. Extraction of key frame or representative frame

Representative frame extraction plays a significant role in video retrieval. A key frame is also called a selective or representative frame [25]. It contains the essential data of the shot and summarizes the characteristics of the remaining frames. It removes duplicated data and reduces the quantity of data for indexing and video retrieval.

3.1 Computation of entropy

The information entropy of an image indicates the amount of information in the image [15, 26]. To extract representative frames, the information entropy of each frame in the shot is evaluated. The computation of the information entropy of an image is given in Eq. (7).

$\displaystyle\textit{Entropy}=-\sum^{L}_{i=1}p(x_{i})\times\log(p(x_{i}))$ (7)

Here, $p(x_{i})$ is the probability of intensity value $x_{i}$,

$\displaystyle p(x_{i})=\frac{\textit{total}(x_{i})}{m\times n},\quad 0\leqslant p(x_{i})\leqslant 1\text{ and }\sum^{L}_{i=1}p(x_{i})=1$

where $\textit{total}(x_{i})$ is the number of pixels with intensity value $x_{i}$, $L$ is the number of intensity values in the image, $m$ is the number of image rows and $n$ is the number of image columns.
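Equation (7) translates directly into a few lines of Python. The sketch assumes a grayscale frame given as a list of rows of integer intensities and uses the base-2 logarithm, so the result is in bits; the base of the logarithm is not stated in the text, and the function name is hypothetical.

```python
import math
from collections import Counter


def image_entropy(image):
    """Information entropy of a grayscale image, following Eq. (7)."""
    counts = Counter(value for row in image for value in row)  # total(x_i)
    pixel_count = sum(counts.values())                         # m x n
    entropy = 0.0
    for count in counts.values():
        p = count / pixel_count                                # p(x_i)
        entropy -= p * math.log2(p)
    return entropy
```

A frame with a single intensity value has zero entropy, while a frame whose pixels are spread evenly over four intensity values has an entropy of two bits.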

The frames with distinct entropy values are regarded as key frames [27]. Within a single shot, the representative frames selected in this phase are dissimilar. When the other shots are taken into account, however, similar key frames can result. Entropy depends on the gray-level distribution of the image, so two visually similar frames may have different gray-level distributions, which can lead to redundant representative frames.

Table 1

The precision and recall values for various kinds of videos

Video type	Histogram-based precision (%)	Histogram-based recall (%)	Block-matched precision (%)	Block-matched recall (%)	Proposed precision (%)	Proposed recall (%)
Cartoon	82.3	82.2	86.9	83.6	95.3	95.1
News	88.4	83.2	85.6	84.7	92.4	97.8
Movies	85.2	84.1	84.9	83.6	95.4	94.6
Sports	83.6	81.6	87.3	83.6	94.6	91.8
Flowers	81.4	83.6	82.5	82.6	95.8	93.5

3.2 Extraction of ultimate key frames using edge matching rate

The objective of this step is to remove redundant key frames and so avoid duplication. The edge matching rate is used to remove the repeated candidate frames, after which the final representative frames are obtained. The number of false key frames depends on the nature of the particular video: news videos contain more redundant key frames than movies, cartoons and animated videos. The Prewitt operator is applied to extract the edges of the key frames, and the edges of each key frame candidate are then compared with those of the other candidates. A threshold on the edge matching rate controls whether two frames are judged similar or dissimilar: two representative frames are regarded as similar if their edge matching rate is higher than the threshold. The redundant key frames are eliminated from the set of candidates, and the remaining frames are the ultimate representative frames. The steps below explain the mechanism.

1. Apply the Prewitt operator to the candidate representative frames to detect edges.
2. Compute the edge matching rate between two key frames using the Euclidean difference.
3. Assign a cut-off value for the similarity comparison.
4. Compare the edge matching rate between two candidate frames: if it is less than the cut-off value, both frames are kept as ultimate key frames; otherwise one of them is treated as a redundant key frame.
5. Repeat the previous step for all candidate key frames.
6. Once all redundant key frames have been removed, the remaining frames are the ultimate key frames.
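The six steps above can be sketched as follows. The text does not give a formula for the edge matching rate, so the one used here is an assumption: the Euclidean distance $d$ between the Prewitt edge-magnitude maps of two equally sized frames is mapped to a similarity $1/(1+d)$, so identical edge maps score 1 and very different ones score close to 0. The function names and the cut-off value in the usage note are hypothetical.

```python
import math

PREWITT_X = ((-1, 0, 1), (-1, 0, 1), (-1, 0, 1))
PREWITT_Y = ((-1, -1, -1), (0, 0, 0), (1, 1, 1))


def prewitt_edges(image):
    """Prewitt gradient magnitude at every interior pixel, as a flat list."""
    edges = []
    for m in range(1, len(image) - 1):
        for n in range(1, len(image[0]) - 1):
            gx = gy = 0.0
            for i in range(3):
                for j in range(3):
                    pixel = image[m + i - 1][n + j - 1]
                    gx += PREWITT_X[i][j] * pixel
                    gy += PREWITT_Y[i][j] * pixel
            edges.append(math.hypot(gx, gy))
    return edges


def edge_matching_rate(frame_a, frame_b):
    """Assumed similarity in (0, 1]: 1 / (1 + Euclidean distance of edge maps)."""
    distance = math.dist(prewitt_edges(frame_a), prewitt_edges(frame_b))
    return 1.0 / (1.0 + distance)


def remove_redundant(candidates, cutoff):
    """Keep a candidate only if its rate against every kept frame is below cutoff."""
    kept = []
    for frame in candidates:
        if all(edge_matching_rate(frame, other) < cutoff for other in kept):
            kept.append(frame)
    return kept
```

For example, `remove_redundant(frames, cutoff=0.5)` keeps the first of two identical frames and drops the second, because their rate is 1; a suitable cut-off would have to be tuned on real data.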

4. Proposed method framework

Here we present the step-by-step methodology of our proposed method, as shown in Fig. 3.

Proposed algorithm

1. Take a video from the video dataset as input.
2. Extract the sequence of frames from the video.
3. For each frame, obtain feature vectors using SIFT.
4. To detect shot transitions, match the key points of the current frame with those of the successive frame.
5. Extract representative frames from each shot using the image information entropy method.
6. Eliminate false key frames by applying a threshold to the edge matching rate between the candidate key frames.
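Step 4 of the algorithm above, which turns pairwise key point matching into shots, can be sketched independently of how the features are computed. In this illustration `match_count` is a hypothetical callable that returns the number of matched key points between two frames, and `min_matches` is the match-count threshold, whose value is not given in the text.

```python
def segment_shots(frames, match_count, min_matches):
    """Group frame indices into shots.

    Consecutive frames stay in the same shot while their match count exceeds
    min_matches; otherwise a shot boundary is placed between them.
    """
    if not frames:
        return []
    shots = [[0]]
    for index in range(1, len(frames)):
        if match_count(frames[index - 1], frames[index]) > min_matches:
            shots[-1].append(index)      # same shot as the previous frame
        else:
            shots.append([index])        # boundary: start a new shot
    return shots
```

As a toy usage, frames can be stood in for by sets of feature labels with `match_count=lambda a, b: len(a & b)`; the per-shot lists of indices returned here are what the entropy-based key frame selection of step 5 would then operate on.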

Figure 3.
Framework for the proposed method.

Table 2
The extracted shot transitions, representative frames and ultimate key frames

S. no	Video type	Total frames	Number of shot transitions	Representative frames	Ultimate key frames
1.	News	1255	12	54	22
2.	Cartoon	1585	26	82	55
3.	Movies	1785	22	66	52
4.	Sports	1155	9	38	29
5.	Flowers	855	17	42	35

Figure 4.
Performance comparison of the proposed technique with existing techniques.

Figure 5.
Candidate key frames obtained from the flower video.

Figure 6.
Ultimate representative frames extracted after removing the false frames.

5. Experimental results

The proposed algorithm uses two essential techniques: SIFT for shot boundary detection and image information entropy for the extraction of representative frames. To examine the performance of the algorithm, five classes of videos are taken from the video dataset, namely cartoon, news, movie, sports and flower videos. Shot boundaries are detected with the SIFT algorithm, and the performance is estimated in terms of precision and recall.

$\displaystyle\textit{Precision}=\frac{\textit{correct detection}}{\textit{correct detection}+\textit{false detection}}\times 100$ (8)

$\displaystyle\textit{Recall}=\frac{\textit{correct detection}}{\textit{correct detection}+\textit{missed detection}}\times 100$ (9)
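Equations (8) and (9) amount to the following small helpers. The counts in the usage note are purely illustrative and are not taken from the experiments; returning 0 when the denominator is zero is an assumption made to keep the functions total.

```python
def precision(correct, false):
    """Precision in percent, Eq. (8): correct / (correct + false) x 100."""
    detected = correct + false
    return 100.0 * correct / detected if detected else 0.0


def recall(correct, missed):
    """Recall in percent, Eq. (9): correct / (correct + missed) x 100."""
    actual = correct + missed
    return 100.0 * correct / actual if actual else 0.0
```

For instance, a hypothetical detector that finds 19 true boundaries, reports 1 false boundary and misses 1 would have `precision(19, 1)` and `recall(19, 1)` both equal to 95.0.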

The experimental outcomes in Table 1 show that the proposed technique performs better than the existing methods.

The proposed shot transition detection method achieves precision between 92.4% and 95.8% and recall between 91.8% and 97.8% across the five video classes, higher than the histogram-based and block-matched algorithms in every class, as shown in Fig. 4.

Table 2 lists the types of videos considered, the total number of frames in each video, the number of shot transitions identified, the number of candidate representative frames obtained and the number of ultimate key frames remaining after the redundant key frames are eliminated.

The candidate key frames extracted from the flower category video are shown in Fig. 5. The ultimate key frames, obtained by excluding the repeated key frames with the edge matching rate technique, are shown in Fig. 6. The proposed system thus achieves high efficacy with little redundancy.

In the proposed algorithm, the SIFT technique is used for the detection of shot transitions. It is invariant to image scaling, variation in brightness, rotation and added noise. The main drawback of the existing algorithms is the extraction of redundant representative frames. In our proposed algorithm, the repeated representative frames are removed using the edge matching rate technique. The proposed algorithm thus overcomes this drawback and gives high precision and recall values, which indicates that it is effective and robust.

6. Conclusion

In this paper, shot boundary detection is performed with the SIFT technique, followed by key frame extraction using image information entropy values. Repeated key frames are then removed using the edge matching rate method, and after the redundant key frames are eliminated we obtain the ultimate representative frames. These steps improve the performance and demonstrate the robustness and accuracy of the proposed framework. The two steps, detection of shot transitions and key frame extraction, are carried out efficiently and allow the algorithm to perform properly in the process of video retrieval.

References

1. Montazer G.A. and Giveki, Content based image retrieval system using clustered scale invariant feature transforms, Optik 126(18) (2015), 1695–1699.
2. Chen and Wang, Automatic key frame extraction in continuous videos from construction monitoring by using colour, texture, and gradient features, Automation in Construction, Elsevier, 2017, 355–368.
3. Cotsaces, Nikolaidis and Pitas, Video Shot Detection and Condensed Representation, IEEE Signal Processing Magazine, March 2006, 28–37.
4. Lowe D.G., Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2004), 91–110.
5. Gharbi, Massaoudi, Bahroun and Zagrouba, Key frames extraction based on local features for efficient video summarization, in: International Conference on Advanced Concepts for Intelligent Vision Systems, Springer, Cham, 2016, pp. 275–285.
6. Naveen Kumar G.S. and Reddy V.S.K., Key Frame Extraction Using Rough Set Theory for Video Retrieval, in: Soft Computing and Signal Processing, Springer, Singapore, 2019, pp. 751–757.
7. Guru D.S., Suhil and Lolika, A novel approach for shot boundary detection in videos, in: Multimedia Processing, Communication and Computing Applications, Springer, New Delhi, 2013, pp. 209–220.
8. Hannane, Elboushaki, Afdel, Naghabhushan and Javed, An efficient method for video shot boundary detection and keyframe extraction using SIFT-point distribution histogram, International Journal of Multimedia Information Retrieval 5(2) (2016), 89–104.
9. Yuan J.H., Wang H.Y. and Zhang, A formal study of shot boundary detection, Journal of Transactions on Circuits and Systems for Video Technology 17(2) (February 2007), 168–186.
10. Kabbai, Azaza, Abdellaoui and Douik, Image matching based on lbp and sift descriptor, in: 2015 IEEE 12th International Multi-Conference on Systems, Signals & Devices (SSD15), IEEE, 2015, pp. 1–6.
11. Kouayep S.C., Doyoddorj and Rhee K.-H., A Copy-Move Forgery Detection Method based on SIFT-LBP Features.
12. et al., A divide-and-rule scheme for shot boundary detection based on SIFT, JDCTA 4(3) (2010), 202–214.
13. Sun and Zhou, A key frame extraction method based on mutual information and image entropy, in: 2011 International Conference on Multimedia Technology, Hangzhou, 2011, pp. 35–38.
14. Liu and Hao, Key frame extraction based on improved hierarchical clustering algorithm, in: 2014 11th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), IEEE, 2014, pp. 793–797.
15. Liu et al., Shot boundary detection and keyframe extraction based on scale invariant feature transform, Computer and Information Science, 2009, ICIS 2009, in: Eighth IEEE/ACIS International Conference on, IEEE, 2009.
16. Liu and Dai, A method of video shot-boundary detection based on grey modeling for histogram sequence, International Journal of Signal Processing, Image Processing and Pattern Recognition 9(4) (2016), 265–280.
17. Z.-M. and Shi, Fast video shot boundary detection based on SVD and pattern matching, IEEE Transactions on Image Processing 22(12) (2013), 5136–5145.
18. Mohamadzadeh and Farsi, Content based video retrieval based on HDWT and sparse representation, Image Analysis & Stereology 35(2) (2016), 67–80.
19. Nasreen and Shobha, Reducing redundancy in videos using reference frame and clustering technique of key frame extraction, in: International Conference on Circuits, Communication, Control and Computing, IEEE, 2014, pp. 348–440.
20. Lin, Gao and Wang, An improved keyframe extraction method based on HSV colour space, JSW 8(7) (2013), 1751–1758.
21. Ren et al., Key frame extraction based on information entropy and edge matching rate, in: Future Computer and Communication (ICFCC), 2010 2nd International Conference on, IEEE, Vol. 3, 2010.
22. Thakre K.S., Rajurkar A.M. and Manthalkar R.R., Video partitioning and secured keyframe extraction of MPEG video, Procedia Computer Science 78 (2016), 790–798. Elsevier.
23. Upesh, Shah and Panchal, Shot detection using pixel wise difference with adaptive threshold and colour histogram method in compressed and uncompressed video, International Journal of Computer Applications (0975–8887) 64(4) (February 2013).
24. and Xu, Shot boundary detection in video retrieval, Electronics Information and Emergency Communication (ICEIEC), in: 2013 IEEE 4th International Conference on, IEEE, 2013.
25. Song and Xie, Shot boundary detection using convolutional neural networks, in: 2016 Visual Communications and Image Processing (VCIP), IEEE, 2016, pp. 1–4.
26. Yuan et al., A formal study of shot boundary detection, IEEE Transactions on Circuits and Systems for Video Technology 17(2) (2007), 168–186.
27. Zhao, A Novel Approach for Shot Boundary Detection and Key Frames Extraction, in: 2008 International Conference on Multimedia and Information Technology, IEEE.
28. Zhang, Zeng, Zhang and Wu, Sift matching with cnn evidences for particular object retrieval, Neurocomputing 238 (2017), 399–409.
29. Zhao, Zhai, Dubois and Wang, Image matching algorithm based on SIFT using color and exposure information, Journal of Systems Engineering and Electronics 27(3) (2016), 691–699.