Abstract
The development of health information technology has dramatically improved the efficiency and quality of medical care. Developing interoperable health information systems for healthcare providers has the potential to improve the quality and equitability of patient-centered healthcare. In this article, we describe an automated content-based medical video analysis and management service that makes it easy to access relevant medical video content without sequential scanning. The system provides temporal video segmentation and content-based visual information retrieval, which enable a more reliable understanding of medical video content. The system is implemented as a Web- and mobile-based service and has the potential to offer a knowledge-sharing platform for efficient access to medical video content.
Introduction
Medical video repositories play important roles in many health-related areas such as medical imaging, medical research and education, medical diagnostics, and training of medical professionals. Because of limitations in accessing medical expertise, the health maintenance system of a country may face a variety of problems that directly affect an individual's quality of life and the well-being of society as a whole. Connecting as many hospitals as possible to a medical information system would raise the standard of medical practice and benefit the education of medical students and staff who cannot reach medical resources because of resource, geographical, and time constraints. 1,2
The maturation of Adobe Flash and Web 2.0 has led to the launch of several video-sharing Web sites such as YouTube 3 and Vimeo, 4 allowing users to post video clips online and share them with others. 5 There are also topic-specific video sites like OrLive, 6 an online surgical and healthcare video and Web cast platform. These Web sites enable users to stream video content. Although they are efficient in distributing videos over the broadband network, they lack mechanisms for effective content management and organization. Video content in digital libraries is of only limited utility without appropriate organization and management. 7 Video streams must be divided into smaller meaningful segments, and their semantics must be described in order to construct an index for effective retrieval. Indexing the video makes access to specific parts of the content timely and efficient. The data must be partitioned hierarchically into meaningfully clustered subgroups so that the foundational structures needed for point operations that extract the related information are obtained. 8
In this article, we present a content-based management service for medical videos that makes it easy to access relevant medical video content without sequential scanning. The proposed service (1) automatically detects shot boundaries and partitions a video into shorter segments, (2) provides a pictorial summarization of the video, (3) enables retrieval of and access to the video content based on a query image, and (4) provides a variety of ways to access any particular part of the video (e.g., clicking a key frame starts the video playing from that point in time). We implemented the system for both Web and mobile environments. We give a high-level description of key system components here. The details of the algorithms used in the system can be found in our previous work. 9,10
Materials and Methods
BACKGROUND
This section presents a brief overview of temporal video segmentation, including the shot boundary detection and key frame extraction processes, and of the video content retrieval problem. Temporal segmentation detects shot boundaries in order to break the video into meaningful segments (shots), each of which contains the same semantic information; key frames are then selected to represent each shot.
Most existing methods use a similarity metric between successive frames to detect shot boundaries. Based on the similarity measure, the algorithms can be divided into three categories: pixel, block-based, and histogram comparisons.
Pixel-level comparison 11 –15 is the simplest approach: it evaluates the differences in intensity values of corresponding pixels between successive frames. A shot boundary is declared if the mean absolute change in pixel intensity is greater than a prespecified threshold T.
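The pixel-level test described above can be sketched in a few lines. This is a minimal illustration and not the implementation of any cited method: the function names, the nested-list grayscale frame format, and the threshold value are assumptions chosen for the example.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute intensity difference between two equal-sized grayscale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def pixel_shot_boundaries(frames, threshold):
    """Return indices k where the change from frame k-1 to frame k exceeds threshold."""
    return [
        k
        for k in range(1, len(frames))
        if mean_abs_diff(frames[k - 1], frames[k]) > threshold
    ]


# Toy 2x2 grayscale frames: two dark frames followed by two bright frames.
dark = [[10, 12], [11, 10]]
bright = [[200, 198], [201, 199]]
frames = [dark, dark, bright, bright]
boundaries = pixel_shot_boundaries(frames, threshold=50)  # threshold T is hypothetical
```

Because every pixel contributes directly, even a small camera pan can push the mean difference over T, which motivates the block-based and histogram-based alternatives.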
Block-based approaches 11,13,16 –18 are based on the comparison of corresponding regions (blocks) in two successive frames. Frames are divided into blocks that are compared with their corresponding blocks. In contrast to pixel-level comparison, which is based on global image characteristics, block-based approaches use local characteristics to increase the robustness to camera and object movement while retaining enough spatial information.
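A block-based comparison can be illustrated as follows. The sketch assumes one simple variant (comparing block mean intensities and counting the blocks that change); the cited methods use their own block statistics, and the names `block_means`, `changed_block_fraction`, and both parameters are hypothetical.

```python
def block_means(frame, block_size):
    """Mean intensity of each non-overlapping block_size x block_size block."""
    height, width = len(frame), len(frame[0])
    means = []
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            values = [
                frame[r][c]
                for r in range(top, min(top + block_size, height))
                for c in range(left, min(left + block_size, width))
            ]
            means.append(sum(values) / len(values))
    return means


def changed_block_fraction(frame_a, frame_b, block_size, block_threshold):
    """Fraction of blocks whose mean intensity changed by more than block_threshold."""
    means_a = block_means(frame_a, block_size)
    means_b = block_means(frame_b, block_size)
    changed = sum(
        1 for ma, mb in zip(means_a, means_b) if abs(ma - mb) > block_threshold
    )
    return changed / len(means_a)


# 4x4 frames: only the top-left 2x2 block changes (e.g., a small moving object).
frame_a = [[10] * 4 for _ in range(4)]
frame_b = [row[:] for row in frame_a]
for r in range(2):
    for c in range(2):
        frame_b[r][c] = 200
fraction = changed_block_fraction(frame_a, frame_b, block_size=2, block_threshold=30)
```

A boundary would then be declared only when the fraction of changed blocks is large, so a localized change such as the one in this toy example does not by itself trigger a cut.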
To increase the robustness to camera and object motion, alternative approaches have been proposed based on the comparison of histograms of successive images. The histogram comparison algorithms can be divided into two categories: global and local histogram comparisons. Several comparisons of histogram-based techniques for shot boundary detection, based on the difference between two histograms, have been performed. 13,19 –21 On the other hand, several local histogram comparison methods 22 –24 have been proposed in which frames are divided into uniform, nonoverlapping regions. Histogram values of each region are then compared with those of the corresponding region of the successive frame.
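The robustness of histograms to motion can be seen in a small example. The sketch below uses a global grayscale histogram with a sum of absolute bin differences; the bin count, the difference measure, and the function names are assumptions made for illustration only.

```python
def gray_histogram(frame, bins=4, max_value=256):
    """Histogram of grayscale intensities (0..max_value-1) with equal-width bins."""
    hist = [0] * bins
    bin_width = max_value // bins
    for row in frame:
        for p in row:
            hist[min(p // bin_width, bins - 1)] += 1
    return hist


def histogram_difference(hist_a, hist_b):
    """Sum of absolute bin-wise differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


# The same bright object, shifted one pixel to the right between frames.
frame_a = [[250, 0, 0], [0, 0, 0]]
frame_b = [[0, 250, 0], [0, 0, 0]]
# A genuinely different frame: everything is bright.
frame_c = [[250, 250, 250], [250, 250, 250]]

motion_diff = histogram_difference(gray_histogram(frame_a), gray_histogram(frame_b))
content_diff = histogram_difference(gray_histogram(frame_a), gray_histogram(frame_c))
```

The shifted object leaves the histogram unchanged, whereas a pixel-level comparison of the same two frames would report a difference; only the real content change in `frame_c` alters the histogram. A local variant would apply the same comparison separately to each region of the frame.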
Key frame extraction involves selecting one or multiple frames that will represent the content of the video. The techniques for key frame extraction can be classified into three categories: curve simplification, matrix-based, and clustering-based methods.
Curve simplification methods 25 –27 approximate a curve by one with a smaller number of vertices. A simplified curve is computed that approximates a trajectory curve representing a video sequence in a high-dimensional feature space, according to some predefined error criterion. The junctions between simplified curve segments are then chosen as key frames.
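To make the idea concrete, the sketch below applies the Ramer-Douglas-Peucker algorithm, a well-known general-purpose curve simplification method that is used here only as a stand-in and is not necessarily the one used in the cited works. The feature trajectory, the tolerance, and the function names are hypothetical.

```python
import math


def _distance_to_segment(point, start, end):
    """Euclidean distance from point to the segment start-end (any dimension)."""
    seg = [e - s for s, e in zip(start, end)]
    rel = [p - s for s, p in zip(start, point)]
    seg_len_sq = sum(v * v for v in seg)
    if seg_len_sq == 0:
        return math.sqrt(sum(v * v for v in rel))
    # Parameter of the closest point on the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, sum(r * v for r, v in zip(rel, seg)) / seg_len_sq))
    closest = [s + t * v for s, v in zip(start, seg)]
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(point, closest)))


def simplify(features, tolerance, first=0, last=None):
    """Return sorted frame indices kept by Ramer-Douglas-Peucker simplification."""
    if last is None:
        last = len(features) - 1
    worst_index, worst_dist = None, 0.0
    for i in range(first + 1, last):
        d = _distance_to_segment(features[i], features[first], features[last])
        if d > worst_dist:
            worst_index, worst_dist = i, d
    if worst_index is None or worst_dist <= tolerance:
        return [first, last]
    left = simplify(features, tolerance, first, worst_index)
    right = simplify(features, tolerance, worst_index, last)
    return left[:-1] + right  # drop the duplicated junction index


# Toy 2-D feature trajectory: flat, then a sharp turn at frame 3.
trajectory = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
key_frames = simplify(trajectory, tolerance=0.5)
```

The retained indices are the junctions of the simplified curve; in this toy trajectory they are the two endpoints and the frame where the direction changes.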
Another main approach to key frame extraction is matrix factorization. 28 –30 The frames of video sequences are represented as matrices. Key frames are then selected by applying a matrix factorization technique to this feature-frame matrix.
Clustering-based techniques 8,9,31 are alternative methods for key frame extraction. After the extracted features are grouped into clusters, key frames are selected from these clusters.
Video content retrieval aims to provide access to video content based on a query image by applying content-based image retrieval principles. Features representing the visual content of the video frames and of the query image are extracted. A similarity metric then measures how close the query image is to the video frames. Retrieval results are ranked according to the similarity score. Several general-purpose content-based image retrieval systems have been previously proposed. Some examples include SIMPLIcity, 32 CIRES, 33 ALIPR, 34 FIRE, 35 AMORE (Advanced Multimedia Oriented Retrieval Engine), 36 and MARS. 37
System Overview
The system has two components: (1) temporal video segmentation and (2) content-based retrieval. The temporal video segmentation process includes partitioning a video sequence into a set of shots and extracting one or multiple key frames to represent each shot. In the retrieval process, video content can be searched, browsed, or retrieved based on a query image.
Temporal video segmentation
Temporal video segmentation is the first process for automatic video indexing, aiming to split visual data into coherent and smooth groups along the time axis. Figure 1 shows the overview of the temporal segmentation process. It includes two fundamental steps: (1) shot boundary detection and (2) key frame extraction. Shot boundary detection targets partitioning a video into shorter segments (shots). Key frame extraction provides a compact pictorial summarization and representation of a video sequence.

Fig. 1. Overview of the temporal segmentation process.
Shot boundary detection in our system is based on hue–saturation–value (HSV) color histogram comparison. The RGB color space is converted to HSV space, and the differences of HSV histograms between consecutive frames are computed using the equation
where h_k is the color histogram of the kth frame of the video sequence with N bins. 9,10 In addition, the colors are quantized to 256 levels (16 levels for the H channel, 4 levels for the S channel, and 4 levels for the V channel) in order to reduce computational effort. Figure 2 shows the color histogram differences for part of a video. Peaks, where large discontinuities occur between histograms, are identified as shot boundaries.
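The quantization and histogram comparison can be sketched as follows. This is an illustration under stated assumptions rather than the system's code: the difference measure (a sum of absolute bin-wise differences over the N = 256 bins), the fixed threshold, the flat list-of-RGB-tuples frame format, and all function names are assumptions; the exact equation and boundary criterion are given in the cited previous work.

```python
import colorsys


def hsv_histogram(pixels, h_levels=16, s_levels=4, v_levels=4):
    """Quantized HSV histogram (16 x 4 x 4 = 256 bins) of RGB pixels in 0..255."""
    hist = [0] * (h_levels * s_levels * v_levels)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Each channel lies in [0, 1]; clamp so a value of exactly 1.0 stays in range.
        hi = min(int(h * h_levels), h_levels - 1)
        si = min(int(s * s_levels), s_levels - 1)
        vi = min(int(v * v_levels), v_levels - 1)
        hist[(hi * s_levels + si) * v_levels + vi] += 1
    return hist


def histogram_difference(hist_a, hist_b):
    """Sum of absolute bin-wise differences (an assumed difference measure)."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def detect_shot_boundaries(frames, threshold):
    """Return indices k where frame k differs from frame k-1 by more than threshold."""
    hists = [hsv_histogram(f) for f in frames]
    return [
        k
        for k in range(1, len(hists))
        if histogram_difference(hists[k - 1], hists[k]) > threshold
    ]


# Toy frames as flat lists of RGB pixels: two red frames, then two blue frames.
red = [(255, 0, 0)] * 4
blue = [(0, 0, 255)] * 4
boundaries = detect_shot_boundaries([red, red, blue, blue], threshold=4)
```

Reducing each frame to 256 bins keeps the per-frame comparison cheap, and the peak in the difference signal between the second red frame and the first blue frame marks the boundary.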

Fig. 2. Hue–saturation–value (HSV) color histogram difference. 10
After the video sequence is divided into shots, key frames are chosen from these shots such that each represents the content of the corresponding shot. Our system uses k-means clustering and principal component analysis for the key frame extraction process. Figure 3 shows a clustering plot of a video with k=3. The horizontal and vertical axes of the clustering plot are the projections of the HSV histogram vectors onto the first two principal components. The frames closest to the cluster centroids are selected as key frames.
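The selection of the frames closest to the cluster centroids can be illustrated with a small k-means sketch. It makes several assumptions that are not taken from the system: the initialization (evenly spaced frames), the fixed iteration count, and the function names are hypothetical, the feature vectors are toy 2-D points standing in for histogram vectors, and the principal component analysis step is omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_key_frames(vectors, k, iterations=20):
    """Cluster feature vectors with k-means; return the frame index nearest each centroid."""
    # Deterministic initialization: k frames spread evenly across the sequence.
    step = len(vectors) / k
    centroids = [list(vectors[int(i * step)]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return sorted(
        min(range(len(vectors)), key=lambda i: squared_distance(vectors[i], c))
        for c in centroids
    )


# Toy 2-D feature vectors forming three well-separated groups of frames.
features = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (21, 0), (20, 1),
]
key_frames = kmeans_key_frames(features, k=3)
```

Returning frame indices rather than centroids matters here: a centroid is an average that generally does not correspond to any real frame, so the nearest actual frame is what is shown to the user.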

Fig. 3. Clustering plot with k=3. 9
Content-based retrieval
To search for shots and images inside the video sequences, our system takes advantage of the visual features of the key frames. The query image and the key frames of the videos are represented by HSV color histogram feature vectors, which are then compared using a similarity metric. Based on the similarity, the relevant shots or images are retrieved from the videos. The retrieval process is depicted in Figure 4.

Fig. 4. Overview of the retrieval process. HSV, hue–saturation–value.
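To evaluate the similarity between the query image and the key frames, the Euclidean distance between the corresponding color histograms h_1 and h_2 is computed 38 :
$$d(h_1, h_2) = \sqrt{\sum_{x}\sum_{y}\sum_{z} \left(h_1(x, y, z) - h_2(x, y, z)\right)^2}$$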
where x, y, and z denote the color channels of hue, saturation, and value, respectively, for HSV space.
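Ranking key frames by this distance can be sketched as follows. The key frame identifiers and the 4-bin histograms are toy stand-ins for flattened 256-bin HSV histograms, and the function names are hypothetical; summing over a flattened histogram is equivalent to summing over the three channel indices.

```python
import math


def euclidean_distance(hist_a, hist_b):
    """Euclidean distance between two histograms given as flat lists of bin counts."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(hist_a, hist_b)))


def rank_key_frames(query_hist, key_frame_hists):
    """Return (key_frame_id, distance) pairs, most similar (smallest distance) first."""
    scored = [
        (frame_id, euclidean_distance(query_hist, hist))
        for frame_id, hist in key_frame_hists.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Toy 4-bin histograms standing in for flattened 256-bin HSV histograms.
key_frames = {
    "shot1_kf0": [4, 0, 0, 0],
    "shot2_kf0": [0, 4, 0, 0],
    "shot3_kf0": [3, 1, 0, 0],
}
ranking = rank_key_frames([4, 0, 0, 0], key_frames)
```

A smaller distance means a closer match, so the retrieval results are presented in ascending order of distance.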
Data Structure
We designed an XML (Extensible Markup Language) tree structure to store the results of temporal video segmentation. The XML is used for subsequent access, browsing, and retrieval of the video content. A nested hierarchy of XML elements organizes the information about the video sequence: the name, description, ID, and frames per second of the video; its key frames and their histogram bins; the start and end times of the shots; and the shot to which each key frame belongs. Figure 5 shows this information for a video described in the XML tree structure.
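A nested structure of this kind could be built with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The element and attribute names (`video`, `shots`, `shot`, `keyframe`, `histogram`, `start`, `end`, `time`) and all values are hypothetical and are not the schema used by the system; only the kinds of information stored follow the description above.

```python
import xml.etree.ElementTree as ET

# Video-level information: ID, name, frames per second, and description.
video = ET.Element("video", id="v001", name="sample_procedure", fps="25")
ET.SubElement(video, "description").text = "Hypothetical example entry"

# Each shot records its start and end times and contains its key frames.
shots = ET.SubElement(video, "shots")
shot = ET.SubElement(shots, "shot", id="1", start="0.0", end="12.4")
key_frame = ET.SubElement(shot, "keyframe", id="kf1", time="3.2")

# Histogram bins of the key frame, stored as space-separated counts (toy 4-bin example).
ET.SubElement(key_frame, "histogram").text = " ".join(str(v) for v in [4, 0, 0, 0])

xml_text = ET.tostring(video, encoding="unicode")
```

Nesting key frames inside their shot makes the "which shot does this key frame belong to" relation implicit in the tree, so no separate lookup table is needed.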

Fig. 5. Extensible Markup Language tree structure describing the information about the video. FPS, frames per second.
Using this XML structural organization, users can efficiently access the specific parts of the video they are interested in, browse the video summary (key frames), and query the stored information to retrieve the video content.
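The following sketch shows how such a file might be queried to list the video summary and to find where playback should start for a selected key frame. The XML layout is the same hypothetical one as before (element and attribute names are assumptions), and starting playback at the beginning of the enclosing shot is one possible interpretation of the behavior described.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<video id="v001" name="sample_procedure" fps="25">
  <shots>
    <shot id="1" start="0.0" end="12.4">
      <keyframe id="kf1" time="3.2"/>
    </shot>
    <shot id="2" start="12.4" end="30.0">
      <keyframe id="kf2" time="15.0"/>
      <keyframe id="kf3" time="27.5"/>
    </shot>
  </shots>
</video>
"""


def playback_start(root, key_frame_id):
    """Return the start time (seconds) of the shot that contains the key frame."""
    for shot in root.iter("shot"):
        for key_frame in shot.iter("keyframe"):
            if key_frame.get("id") == key_frame_id:
                return float(shot.get("start"))
    return None  # unknown key frame


root = ET.fromstring(XML_TEXT.strip())
summary = [kf.get("id") for kf in root.iter("keyframe")]  # pictorial summary order
start = playback_start(root, "kf3")
```

Because the segmentation results are precomputed and stored, browsing and seeking require only a lookup in the XML tree rather than any processing of the video itself.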
Application
Our service is implemented for both Web and mobile environments. Figure 6 shows the overall view of the Web interface. The shot boundary locations of a video chosen from the video list on the right panel are automatically marked by red lines on the timer of the control bar. The key frames of the selected video are arranged on the key frame panel below.

Fig. 6. Overall view of the Web interface.
The user can browse the key frames of a video and select any key frame of interest. Clicking on a key frame opens a separate window that shows all key frames of the shot (Fig. 7). When the user clicks on one of them, the video is played starting from the segment represented by the selected key frame. Users can therefore watch the part they are interested in without having to look through the entire video.

Fig. 7. Key frame selection and video browsing.
In the content-based retrieval module (Fig. 8), the user can search for and retrieve particular content from the video list based on a query image. Once a query image is specified, the HSV color histogram feature is extracted from the image and compared with the histogram values of the key frames stored in the XML file. Relevant shots and their key frames are presented according to the similarity score. Once the user identifies an interesting result, clicking on its key frame starts video playback from that point.

Fig. 8. Content-based retrieval module.
We also implemented our system for mobile devices. Figure 9 shows the mobile interface of the service deployed on a Windows Phone 7 emulator.

Fig. 9. Mobile interface on the Windows Phone 7 emulator.
A demo of our Web-based service is available online at
Discussion
Medical video libraries are dedicated to many health-related applications such as medical imaging, medical research and education, medical diagnostics, and training of medical professionals. In recent years, rapid expansion in the use of digital videos has led to a significant increase in the availability and the amount of video data. 39 In this article, we have presented an automated content-based management service for medical videos. The proposed service brings ease and effectiveness to the access of visual medical content. Our system is designed for Web and mobile platforms and has the potential to provide a robust framework for effective management and organization of medical video content.
Footnotes
Disclosure Statement
No competing financial interests exist.
