Abstract
Content-based indexing methods need to analyze all frames of a video. Because such a procedure is extremely time consuming, the indexing may be limited to key-frames, i.e. to only one or a few frames for every shot or every video scene. The detection of the key-frame of a shot or of a scene requires very effective temporal segmentation methods. Content-based indexing of videos is based on the automatic detection of the video structure, and a video shot is the main structural video unit. Temporal aggregation groups shots into scenes of a given category. Moreover, determining the most likely category on the basis of time relations also reduces the analysis time, because it enables us to apply the adequate method of content-based indexing first. The main problem is to distinguish report shots from non-report shots, because different indexing strategies should usually be applied to them. The paper examines the usefulness of the temporal aggregation method and of the pre-categorization of shots in news videos for reducing the processing time of the very time-consuming content-based video indexing process.
Keywords
Introduction
A great variety of approaches and methods of content-based video indexing is applied to the automatic processing of television broadcasts. Television broadcast archives hold a huge number of news and TV sports news videos, both from recent years and from the more distant past. Many old analogue videos are being digitized, and new news videos are stored every day. Very efficient methods of automatic content-based analysis of digital videos are therefore strongly desirable. The main goal of content-based indexing of broadcast news videos is to ensure an effective retrieval of special events or special people, of official statements or political polemics and commentaries, etc. In the case of sports news, by contrast, the main purpose is to detect players and games, or to select reports of a given sports category. To achieve this purpose, the automatic categorization of sports events, i.e. the automatic detection of the sports disciplines of reported events, should be provided. In consequence, it becomes possible to effectively retrieve sports news and sports highlights of a given sports category, such as the best or current games, tournaments, matches, contests, races, cups, etc., as well as special player behaviors or actions like penalties, jumps, or race finishes.
In most approaches the content-based indexing method needs to analyze all frames of a video. Because such a procedure is extremely time consuming, the indexing should be limited to key-frames, i.e. to only one or a few frames for every shot or every video scene. The detection of a key-frame of a shot or of a scene requires very effective temporal segmentation methods.
The automatic detection of transitions between shots in a digital video for the purpose of temporal segmentation is relatively well mastered and applied in practice. Unfortunately, the detection of scenes is not yet carried out effectively. A scene is usually defined as a group of consecutive shots sharing similar visual properties and having a semantic correlation, following the classical rule of unity of time, place, and action. The temporal aggregation method [1] detects player scenes taking into account only shot lengths. A player scene is a scene presenting a sports game, i.e. a scene recorded at a sports venue such as a playground, tennis court, sports hall, swimming pool, ski jump, etc. All other non-player shots and scenes, usually recorded in a TV studio, such as commentaries, interviews, charts, tables, announcements of future games, discussions of decisions of sports associations, etc., are called studio shots or studio scenes. Studio shots are not useful for video categorization and therefore should be rejected. It was observed that studio scenes may make up as much as two thirds of sports news. Rejecting non-player scenes before starting content analyses creates a great opportunity to significantly reduce computing time and to conduct these analyses more efficiently.
Generally, different video genres have different editing styles. The specific nature of videos has an important influence on the efficiency of temporal segmentation methods. In the experiments performed in [2] the efficiency of segmentation methods was analyzed for five different categories of movies: TV talk-show, documentary movie, animal video, action & adventure, and pop music video. It has been shown that the segmentation parameters should be adapted to the specific nature of the videos.
TV news video editing is similar to that of TV sports news, but shots are longer on average. Moreover, statements and commentaries can be more significant in news than in sports news, because these statements are spoken not only by the anchorman but also, for example, by politicians or famous people. The detection of politicians is important and may be realized using, for example, face detection methods.
In this paper the usefulness of the temporal aggregation in detection of report and non-report shots in a news video is verified and then it is shown that the pre-categorization of video news shots can significantly optimize the choice of content analysis strategy. This paper is a revised and extended version of the conference papers presented at the ICCCI’2015 Conference: “Automatic Categorization of Shots in News Videos Based on the Temporal Relations” [3] and at the ACIIDS’2016 Conference: “Probabilistic Approach to Content-Based Indexing and Categorization of Temporally Aggregated Shots in News Videos” [4]. The new last section discusses the advantages of applying temporal aggregation and pre-categorization of shots in news videos to reduce the number of frames analyzed during the content-based video indexing.
The paper is organized as follows. The next section describes related work on automatic news and sports news shot categorization. The third section presents the main idea of the temporal aggregation method and outlines the detection of pseudo-scenes using temporal aggregation. The fourth section presents the experimental results of the detection of the main structural units of TV news obtained in the AVI Indexer. The fifth section presents the results of the tests demonstrating the usefulness of temporal analyses of shots in news videos to estimate the probability of a shot category and to optimize the order of applied strategies. The sixth section discusses the advantages of applying temporal aggregation and pre-categorization of shots in news videos to reduce the number of frames analyzed during content-based video indexing. The final conclusions and the future research work areas are discussed in the seventh and last section.
Related work
Much research has been carried out in the area of automatic recognition of video content, for example on the recognition of human behaviors recorded by surveillance video systems [5], as well as in the area of video news indexing and retrieval [6–8]. Video files are also a significant part of the big data transferred over the Internet [9]. Traditional textual techniques frequently applied to videos are not sufficient for today's video archive browsers. Effective methods of automatic categorization of an immense amount of broadcast news videos, mainly sports news videos, would be highly desirable. Most of the proposed methods require the detection of the structure of the videos being indexed and the categorization of the shots and scenes detected in the videos [10].
Different criteria can be used for shot categorization in the indexing process; the most interesting criterion is the content of a shot. In news videos the following shot categories can be defined [11]: anchor shots, animation (intro), communication (static image of reporter), interview, reporter, maps, studio (discussion with a guest), synthetic (tables, charts, diagrams), weather, and of course report shots.
Other authors developed methods for classifying sports shots into such classes as: court views, players and coach, player close-up views, audience views, and setting long views [12]; long views, middle views, close-up views, and out-of-field views [13]; or intro, headlines, player shots, and studio shots [14]. In [15] all sports shots from all types of field video are classified based on the perceived distance between the camera and the object presented in the shot. Fourteen different shot classes were defined: close up shot head with simple background, close up shot head complex background, close up shot head mixture background, close up shot waist up simple background, close up shot waist up complex background, close up shot waist up mixture background, short distance shot presenting player(s) simple background, short distance shot presenting player(s) complex background, short distance shot presenting player(s) mixture background, short distance shot presenting spectators, long distance shot presenting centre of the field, long distance shot presenting right side of the field, long distance shot presenting left side of the field, and long distance shot presenting spectators.
Anchor/non-anchor shots are frequently used as a starting point for the automatic recognition of a news or sports news video structure. Anchorperson shot detection is still a challenging and important stage of news video analysis and indexing, and in recent years many algorithms have been proposed to detect anchorperson shots. Because anchor shots are very similar to each other (very static sequences of frames, small changes, the same repeated background), one of the approaches to anchor shot detection is based on template matching, whereas other methods are based on different specific properties of anchor shots or on temporal analyses of shots. In the first group of methods a set of predefined models of an anchor is defined and then matched against all frames in a news video in order to detect potential anchor shots. The second group of anchor shot detection methods is mainly based on clustering. Unfortunately, the methods of both groups are very time-consuming because they require complex analyses of a great number of video frames. The third approach, based on temporal aggregation, is very fast because only shot durations are analyzed.
High values of recall and precision for anchorperson detection have been obtained in the experiments on 10 news videos [16]. The news videos were first segmented, as usual, into shots by a four-threshold method. Then the key-frames were extracted from each shot. The anchorperson detection was conducted on these key-frames using a clustering-based method based on a statistical distance derived from Pearson’s correlation coefficient.
The method presented in [17] can also be used for dynamic studio backgrounds and multiple anchorpersons. It is based on spatio-temporal slice analysis. This method proposes to extract two different diagonal spatio-temporal slices and divide them into three portions. Then all slices from two sliding windows obtained from each shot are classified to get the candidate anchor shots. Finally, the real anchor shots are detected using the structure tensor. The experiments carried out on news of seven different styles confirmed the effectiveness of this method.
The algorithm described in [18] analyzes audio, frame, and face information to identify the content. These three elements are independently processed during the cluster analysis and then jointly in a compositional mining phase. The temporal features of the anchorpersons for finding the speaking person that appears most often in the same scene are used to differentiate the role played by the detected people in the video. Significant values of precision and recall have been obtained in the experiments carried out for broadcast news coming from eight different TV channels.
A novel anchor shot detection method proposed in [19] detects an anchorperson cost-effectively by reducing the search space. It is achieved by using skin color and face detectors, as well as support vector data descriptions with non-negative matrix factorization.
It is observed that the most frequent speaker is the anchorman [20]. An anchor speaks many times during the program, so the anchorperson shots are distributed all along the program timeline. This observation leads to the selection of the speaker who most likely is the anchorman. It is assumed that a speaker clustering process labels all the speakers present in the video and associates them to temporal segments of the content. However, there are some obvious drawbacks, because a shot with a reporter (interview shots) or with a politician (statement shots) frequently found in news can be erroneously recognized as an anchor shot.
Another observation in a large database [21] draws much attention to interview scenes. In many interview scenes an interviewer and an interviewee recursively appear. A technique called interview clustering method based on face similarity can be applied to merge these interview units.
In [22] a fast method of automatic detection of anchorperson shots has been presented. The method is useful for detecting long-duration shots such as anchor, reporter, interview, or any other statement shots.
The automatic detection and classification of shots in news videos proposed in [23] used a probabilistic framework based on the Hidden Markov Models and the Bayesian Networks paradigms. The system has been tested on news videos of two different Italian TV channels.
Video analyses discussed in the related papers as well as in this research are the methods using visual features only. There are also audio-visual approaches analyzing not only visual information but also audio (see for example [24] or [25]).
Temporal segmentation and temporal aggregation in the AVI Indexer
The Automatic Video Indexer AVI [26] is a research system designed to develop new tools and techniques of automatic content-based video indexing for retrieval systems, mainly based on video structure analyses [27] and using the temporal aggregation method [1]. The standard process of automatic content-based analysis and video indexing is composed of several stages. Usually, it starts with a temporal segmentation, which divides a movie into small units called video shots. Shots can be grouped into scenes, and then one or more key-frames can be selected for every scene for further analyses. In the case of TV sports news every scene is categorized using such strategies as: detection of playing fields, detection of superimposed text like player or team names, identification of player faces, detection of lines typical for a given playing field and for a given sports discipline, recognition of player and audience emotions, and detection of sports objects specific for a given sports category. In the case of TV news, scenes can be categorized based on the detection of people or places, using face detection or object detection.
The detection of video scenes facilitates the optimization of the indexing process. The automatic categorization of news videos will be less time consuming if the analyzed video material is limited only to the scenes most adequate for content-based analyses, like player scenes in TV sports news or official statements in TV news. The temporal aggregation method implemented in the AVI Indexer is applied for video structure detection. The method detects and aggregates long anchorman shots. The shots are grouped into scenes based on the length of the shots as the sole, sufficient criterion.
The temporal aggregation method has two main advantages. First of all, it detects player scenes, that is the most informative parts of sports news videos. Second, it significantly reduces the video material analyzed in content-based indexing of TV sports news, because it allows the indexing process to be limited to player scenes only. Globally, the total length of all player scenes is significantly lower than the total length of all studio shots.
The temporal aggregation is specified by three values: minimum shot length as well as lower and upper limits representing the length range for the most informative shots. The values of these parameters should be determined taking into account specific editing style of a video and its high-level structure.
Formally, the temporal aggregation process is defined as follows [28]:
a single frame detected as a shot is aggregated to the next shot,
    if (L(shot_i) == 1 [frame]) then L(shot_i+1) = L(shot_i+1) + 1;
    LS = LS - 1;
where L(shot_i) is the length [measured in frames] of the detected shot i, shot_i+1 is the next shot on the timeline, and LS is the number of shots detected;
very short shots should be aggregated till their aggregated length attains a certain value MIN_Shot_Length,
    while ((L(shot_i) < MIN_Shot_Length) and (L(shot_i+1) < MIN_Shot_Length)) do
    { L(shot_i) = L(shot_i) + L(shot_i+1);
      LS = LS - 1; }
all long consecutive shots should be aggregated because these shots seem to be useless in further content analyses and categorization of sports events,
    while ((L(shot_i) > MAX_Shot_Length) and (L(shot_i+1) > MAX_Shot_Length)) do
    { L(shot_i) = L(shot_i) + L(shot_i+1);
      LS = LS - 1; }
after aggregation all shots of the length between the two a priori defined minimum and maximum values should remain unchanged; these shots are very probably the most informative shots for further content-based analyses.
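The three aggregation steps can be illustrated with a minimal Python sketch. It is only one possible reading of the rules above, not the AVI Indexer implementation: the function names are hypothetical, shots are represented simply by their lengths in frames, the default limits of 45 and 305 frames are the values used in the experiments reported in this paper, and the handling of a single-frame shot at the very end of a video (which has no next shot) is an assumption.

```python
def merge_single_frames(lengths):
    """Step 1: add every one-frame shot to the shot that follows it."""
    merged, carry = [], 0
    for length in lengths:
        if length == 1:
            carry += 1              # keep the frame for the next shot
        else:
            merged.append(length + carry)
            carry = 0
    if carry:                       # assumption: trailing single frames join the last shot
        if merged:
            merged[-1] += carry
        else:
            merged.append(carry)
    return merged


def merge_neighbours(lengths, both_match):
    """Merge a shot into its predecessor while both_match(previous, current) holds."""
    merged = []
    for length in lengths:
        if merged and both_match(merged[-1], length):
            merged[-1] += length
        else:
            merged.append(length)
    return merged


def aggregate_shots(lengths, min_len=45, max_len=305):
    """Return the shot lengths (in frames) after temporal aggregation."""
    shots = merge_single_frames(lengths)
    # Step 2: join consecutive very short shots until min_len is reached.
    shots = merge_neighbours(shots, lambda a, b: a < min_len and b < min_len)
    # Step 3: join consecutive long shots.
    shots = merge_neighbours(shots, lambda a, b: a > max_len and b > max_len)
    return shots


print(aggregate_shots([1, 20, 10, 30, 100, 400, 350, 50]))  # [61, 100, 750, 50]
```

In this made-up example the one-frame shot and the three short shots collapse into a single 61-frame shot, the two long shots are joined into one 750-frame shot, and the shots of 100 and 50 frames, which lie between the two limits, remain unchanged.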
Very short shots, including single frames, are relatively frequent. Generally, very short shots of one or several frames are detected in the case of dissolve effects, or they are simply wrong detections. The causes of false detections may be different [29]. Most frequently they are due to very dynamic movements of players during the game, very dynamic movements of objects just in front of a camera, changes (lights, content) in advertising banners near the playing fields, very dynamic movements of a camera during the game, or light flashes during games or interviews. These extremely short shots resulting from temporal segmentation are joined with the next shot in a video. So, the first two steps of the temporal aggregation of shots also lead to a significant reduction of false cuts detected during temporal segmentation.
Report and non-report shots in temporally aggregated news
The method of temporal aggregation has been applied in the experiments performed in the AVI Indexer. The temporal aggregation has been used with such parameters that only shots with a duration not lower than 45 frames (MIN_Shot_Length) and not greater than 305 frames (MAX_Shot_Length) have not been aggregated. These are shots with a length from 2 to 12 seconds, with a tolerance of ±5 frames.
The six editions of the TV news program “Teleexpress” used in the experiments were broadcast on the first national Polish TV channel (TVP1). Their characteristics before and after temporal aggregation are presented in Table 1. “Teleexpress” is broadcast every day and lasts 15 minutes. This TV program is mainly addressed to young people. It is dynamically edited and very fast paced, with very quickly uttered anchor comments. So, the dynamics of the “Teleexpress” news can be compared to the dynamics of player scenes in TV sports news. However, the number of topics and events reported in the news is usually much greater than in sports news.
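The relation between these frame limits and the quoted durations can be checked with a few lines of Python. The frame rate of 25 frames per second is an assumption here; it is not stated at this point, but it is consistent with the thresholds of 125 and 300 frames given in this paper for 5 and 12 seconds. The names FPS, TOLERANCE and seconds_to_frames are illustrative only.

```python
FPS = 25          # assumed frame rate (frames per second)
TOLERANCE = 5     # tolerance in frames, as stated in the text


def seconds_to_frames(seconds, fps=FPS):
    """Convert a duration in seconds to a number of frames."""
    return round(seconds * fps)


MIN_SHOT_LENGTH = seconds_to_frames(2) - TOLERANCE    # lower limit in frames
MAX_SHOT_LENGTH = seconds_to_frames(12) + TOLERANCE   # upper limit in frames

print(MIN_SHOT_LENGTH, MAX_SHOT_LENGTH)  # 45 305
```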
The tested videos have the standard structure of a news video. Every video starts with the intro animation, then several events are reported and commented on by an anchorman, a reporter, politicians, or even casual observers. A sequence of report shots can optionally be illustrated by charts, tables, diagrams, or maps. A news video always ends with a final graphical animation of several seconds, usually with text superimposed on the image. It should also be noticed that the lengths of the intro and of the final animation vary, so the temporal segmentation detects different sequences of shots at the beginning as well as at the end of different news videos. This results from the fact that fade effects from black to the intro part are sometimes used and, similarly, the final animation often fades away to a black frame. Moreover, the first frame of the intro and the last frame of the final animation are frozen for some short time, so the lengths of the intro and of the final animation of a news video are not constant. Nevertheless, in a temporally aggregated news video these parts, i.e. the intro and the final animation, as well as the headlines, can be easily detected [28]. To speed up the process of shot categorization, all shots before the first long anchor shot and all shots after the last anchor shot have been eliminated from further analyses because their categories are already known. The main problem discussed in the rest of the paper is how to determine the most probable categories of the remaining shots in the internal part of the news. What kinds of categories can be expected in this internal part? These are mainly report shots, anchor shots, statement (interview) shots, and chart shots (Table 2).
The question is how the temporal aggregation method changes the temporal relations of shots, and whether the temporal characteristics of shots still enable us to predict a shot category even though the shots have been aggregated.
The aggregation is incorrect if the length of an aggregated shot is no longer adequate for its category, e.g. an aggregated report shot becomes so long that it can be treated as an anchor shot. The aggregation is also incorrect if two or more shots of different categories are aggregated, mainly if a report shot is aggregated with a non-report shot. This happens very rarely: only 2.8 incorrectly aggregated shots on average have been observed (last row in Table 1). Most frequently, a very short report shot was aggregated with a subsequent anchor shot. Such a long aggregated shot would then be processed as an anchor shot, and thus it should not distort the results of content-based indexing.
The most important result is that after the application of temporal aggregation the anchor shots are still the longest shots in news (Table 3). It should be noticed, however, that the statement, reporter, or interview shots are also near the top of the ranking of the longest shots in news videos. They are almost as frequent among long shots (7 shots) as report shots (8 shots). So, shot aggregation facilitates the detection of speaking-person shots. Thus, the temporal analysis of aggregated shots makes it easy to separate report shots from non-report shots, which is a key problem in indexing.
Among all 88 long aggregated shots there are 70 anchor shots, 7 other speaking-person shots, 8 report shots, one chart shot, and one final animation. The report shots represent only about 9% of all long aggregated shots. Temporal aggregation is therefore desirable for detecting anchor shots faster.
The structure of a news video is as follows: it starts with the intro animation, then several stories are presented and commented on, each containing an anchor shot or shots followed by a sequence of report shots, optionally enhanced by reporter, politician’s statement, interview, or chart, table, and diagram shots. Like the opening part, the news ends with an animation shot. Although the intro and the final animation are always the same, their lengths, as has already been mentioned, are not constant because of frozen frames as well as fades from black to the intro and fades from the final animation to a black frame. So, after temporal segmentation we can obtain different sequences of shots at the beginning as well as at the end of a news video. The duration of these parts of a news video is unpredictable.
The analyses of the lengths of both the 50 longest shots (Table 4) and of all shots (Table 5) in the tested videos clearly confirm that anchor shots are the longest parts of videos before and also after aggregation: 414 frames on average for the longest shots in a news video and 341 frames on average for all shots. At the same time report shots are the shortest parts of videos: 173 frames on average for the longest shots in a news video and 92 frames on average for all shots.
Most importantly, the number of shots decreased almost three times. If the indexing process is based on the analysis of a single key-frame for every shot in a video, the processing time of indexing will also decrease about three times. This is a great advantage of temporal aggregation.
Shot category probabilities in temporally aggregated news
Tables 6–8 present, for the tested internal parts of the videos, the statistics of the lengths of aggregated shots initially classified into the anchor, chart, statement (interview), or report shot categories. These tables show the numbers of all shots of a given category (All) in the internal parts of the tested videos and the numbers of shots of a given category selected on the basis of shot lengths (Sel). The probabilities that shots of a given category have lengths in a given range have also been estimated.
The first results (Table 6) are obtained for the following thresholds: 130, 210, and 290 frames. That is, it has been assumed that report shots are shorter than 130 frames, the lengths of statement shots are from 130 to 209 frames, the lengths of chart shots are from 210 to 289 frames, and anchor shots are 290 frames or longer.
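This length-based pre-categorization amounts to a lookup in a sorted list of thresholds, as the following minimal Python sketch shows. It assumes that the thresholds delimit half-open ranges, so that a shot of exactly 130 frames counts as a statement shot and a shot of exactly 290 frames as an anchor shot; the names THRESHOLDS, CATEGORIES and predict_category are hypothetical.

```python
from bisect import bisect_right

THRESHOLDS = (130, 210, 290)                            # in frames
CATEGORIES = ("report", "statement", "chart", "anchor")


def predict_category(length_frames):
    """Return the category whose length range contains the given shot length."""
    # bisect_right counts how many thresholds are <= length_frames,
    # which is exactly the index of the matching range.
    return CATEGORIES[bisect_right(THRESHOLDS, length_frames)]


for length in (92, 130, 209, 250, 341):
    print(length, predict_category(length))
```

Using other thresholds, for example the two thresholds 125 and 300 with three categories, only requires changing the two tuples.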
The estimated probabilities that all shots of a given category have the length of a given range are as follows: anchor shots – 0.83, chart shots – 0.33, statement shots – 0.41, and report shots – 0.85. Average is equal to 0.6049.
Because chart shots are rare, the second test (Table 7) skips this category and uses two thresholds: 125 and 300 frames (5 and 12 seconds). Table 8, in turn, presents the results of the test with the same range for chart shots as for statement shots. This option is based on the suggestion that these two categories should perhaps be analysed in the same way (with the same ranges of lengths).
The estimated probabilities that all shots of a given category have the length of a given range are as follows: anchor shots – 0.80, chart shots – 0.00, statement shots – 0.76, and report shots – 0.83. Average is equal to 0.5995. Average without taking into account chart shots is equal to 0.7993.
The estimated probabilities that all shots of a given category have the length of a given range are as follows: anchor shots – 0.80, chart shots – 0.33, statement shots – 0.76, and report shots – 0.83. Average is equal to 0.6828.
Now, let us estimate the probability that a shot of a given length is a shot of the predicted, most probable category (Table 9). The statistical estimations of the probabilities that all shots of a given category have the length of a given range are as follows: anchor shots – 0.80, chart shots – 0.01, statement shots – 0.21, report shots – 0.97.
The tests have shown that anchor shots as well as report shots can be selected with very high probability. But the problem remains how to categorize shots of the medium length. Chart and statement shots are the most critical. Table 10 presents the probability of a given category if the length of a shot is neither adequate for anchor shots nor for report shots. The statistical estimations of the probabilities that shots of a given category have the length in the critical medium range are as follows: anchor shots – 0.02, chart shots – 0.01, statement shots – 0.21, report shots – 0.76.
The results of analyses of shots of the length from the medium range (Table 10) show that we can expect most probably report shots or less probably statement shots. Other categories are unlikely.
These estimations clearly suggest the following strategy: also for the shots with lengths in this medium range, the procedure of content-based analysis adequate for report shots should be applied first, next that for statement shots, and then the one adequate for anchor shots. The special procedure for the detection of chart shots can be applied at the end (if at all).
It is interesting to compare the categorization of shots based on these probabilities with the case of a random order of categorization strategies. Random order means that the content analysis starts by assuming that the analyzed shot is of a randomly chosen category, which is however always the same for all shots. This comparison is presented in Table 11.
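The effect of the order of strategies can be illustrated with a small Python sketch using the medium-range probabilities quoted above. It is a simplified model and not the computation behind Table 11: it assumes that the strategies are tried one after another, that each strategy is conclusive, and that all strategies cost the same, so the cost is simply the expected number of strategies tried. The function names are hypothetical.

```python
# Category probabilities for shots in the medium length range, as quoted in the text.
MEDIUM_RANGE = {"anchor": 0.02, "chart": 0.01, "statement": 0.21, "report": 0.76}


def strategy_order(probabilities):
    """Order the categories from the most to the least probable."""
    return sorted(probabilities, key=probabilities.get, reverse=True)


def expected_attempts(probabilities, order):
    """Expected number of strategies tried until the matching one is reached."""
    return sum(rank * probabilities[cat] for rank, cat in enumerate(order, start=1))


best = strategy_order(MEDIUM_RANGE)
print(best)  # ['report', 'statement', 'anchor', 'chart']
print(round(expected_attempts(MEDIUM_RANGE, best), 2))        # 1.28
print(round(expected_attempts(MEDIUM_RANGE, best[::-1]), 2))  # 3.72
```

Under these assumptions the probability-based order needs about 1.28 strategies per shot on average, whereas the least favourable fixed order needs about 3.72.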
These analyses lead to the very practical conclusion that the order of strategies for content-based analyses can be optimized, i.e. for the long shots (more than 12 seconds) the anchor category is the most probable, whereas, for the short shots (less than 5 seconds) the report category is the most probable. Furthermore, for the shots of the length of a medium range (from 5 to 12 seconds) two categories are more probable than others, these are report shots and statement shots.
Reduction of the number of frames analyzed during content-based indexing
In most approaches the content-based indexing method needs to analyze all frames of a video. This is mainly justified when the methods need to analyse object or camera movement, in which case the analysis of all frames ensures better results than the analysis of a single frame. In other cases, because the procedure of content analysis is extremely time consuming, the indexing is limited to key-frames, i.e. to only one or a few frames for every shot or every video scene. However, the detection of video scenes is still not performed sufficiently well. One of the solutions is to group shots of the same scene, or shots not useful for further content analyses, for example anchor shots.
So, we are interested in reducing the number of analyzed shots and, in consequence, the number of key-frames chosen from them. The methods of selecting a key-frame from a given shot and their efficiency are another problem, not discussed in this paper. Much research work has been performed to efficiently extract video key-frames. Early approaches select key-frames by randomly or uniformly sampling the video frames at predefined intervals. Other techniques try to group frames with similar features and select the frame closest to each group centroid as a key-frame. The more sophisticated approaches take into account visual content, object tracking, motion analysis, and shot dynamics. For some proposed approaches and techniques of key-frame selection see for example [30–33].
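The two simplest families of key-frame selection mentioned above can be sketched in plain Python. This is only an illustration under stated assumptions, not one of the methods from [30–33]: frames are assumed to be already described by numeric feature vectors (for example colour histograms), all frames passed to closest_to_centroid are assumed to belong to one group, and both function names are hypothetical.

```python
def uniform_keyframes(n_frames, step):
    """Indices of the frames sampled at a fixed interval of `step` frames."""
    return list(range(0, n_frames, step))


def closest_to_centroid(features):
    """Index of the feature vector nearest to the mean of all vectors."""
    dims = len(features[0])
    centroid = [sum(vec[d] for vec in features) / len(features) for d in range(dims)]

    def squared_distance(vec):
        return sum((x - c) ** 2 for x, c in zip(vec, centroid))

    return min(range(len(features)), key=lambda i: squared_distance(features[i]))


print(uniform_keyframes(10, 4))                            # [0, 4, 8]
print(closest_to_centroid([[0, 0], [1, 1], [10, 10]]))     # 1
```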
If we assume that every shot is represented by a single frame, its key-frame, the number of frames analyzed is reduced almost 50 times. After applying the temporal aggregation method, the number of frames analyzed is reduced 129 times compared with the number of all frames. Then, if we limit the content analysis only to the internal part of a news video, that is without the intro animation, the final animation, and the first and last anchor shots, the number of frames analyzed can be reduced 134 times. Compared with the number of key-frames extracted for all shots, the number of key-frames is diminished 2.71 times when applying the temporal aggregation method and limiting the analysis to the internal part of a news video (Table 12). This is a very significant decrease of the processing time taken by the very time-consuming content-based video indexing.
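To give a feeling for the scale of these factors, the following back-of-the-envelope Python sketch applies them to one 15-minute edition. The frame rate of 25 frames per second is an assumption, the factor of 50 stands for "almost 50", and the resulting key-frame counts are rough estimates, not values taken from Table 12; the function name is hypothetical.

```python
FPS = 25                          # assumed frame rate (frames per second)
TOTAL_FRAMES = 15 * 60 * FPS      # frames in one 15-minute edition

# Reduction factors quoted in the text, relative to the number of all frames.
REDUCTION = {
    "one key-frame per shot": 50,
    "after temporal aggregation": 129,
    "internal part only": 134,
}


def estimated_keyframes(total_frames, factor):
    """Approximate number of frames left to analyze for a given reduction factor."""
    return round(total_frames / factor)


for stage, factor in REDUCTION.items():
    print(f"{stage}: about {estimated_keyframes(TOTAL_FRAMES, factor)} frames")

relative_gain = REDUCTION["internal part only"] / REDUCTION["one key-frame per shot"]
print(f"relative gain: {relative_gain:.2f}")   # 2.68
```

The ratio of 2.68 obtained with the rounded factor of 50 is consistent with the reported value of 2.71, since the first reduction is slightly below 50.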
The pre-categorization of report and anchor shots is very efficient, whereas such a process is not useful for chart shots (Table 13). But chart shots are not frequent, so the rare cases of chart shots do not lower the usefulness of this approach.
Final conclusions and remarks
The methods of content-based video indexing are still being improved, new methods are still being proposed. Many of them adapted to the content analysis of news videos are based on video structure. The detection of news video structure and categorization of news video shots is very important. The key problem is to select report shots and non-report shots because usually different indexing strategies should be applied.
Temporal aggregation can successfully reduce the video space analyzed in content-based indexing without disturbing news shot categorization. Furthermore, temporal aggregation can also be used to separate report shots from non-report shots in news videos. Moreover, determining the most likely category on the basis of time relations also reduces the analysis time, because it enables us to apply the adequate method of content-based indexing first.
The temporal aggregation method is very efficient in detecting the headlines and the first welcome anchor shot, but it can also be applied to optimize the decision on which content-based indexing strategy should be used first. The results of tests performed in the AVI Indexer have confirmed that temporal aggregation facilitates the automatic parsing of the structure of news videos.
All candidates for non-report shots, such as anchor shots as well as statement, reporter, or interview shots, can then be analyzed using the commonly proposed approaches, mainly based on face detection and person recognition. The report shots, in contrast, demand a variety of methods based on different approaches. The tests have shown that the analysis of time relations of shots can easily determine the most likely category of a shot in a news video and then optimize the order of strategies applied to a given shot. If a shot is most likely an anchor shot, the detection of the studio background or of the anchor face should be applied first. If the estimated probability suggests another category for an analysed shot, the strategy adequate for that category should be launched.
Finally, the reduction of the number of analyzed shots significantly decreases the number of key-frames and in consequence decreases processing time taken by a very time-consuming content-based video indexing.
