Abstract
We present a new approach for generating thumbnail images from H.264/AVC coded bit streams. We have verified analytically that mismatch errors between the encoder and decoder prevent the direct generation of thumbnail images from H.264/AVC transform coefficients. Based on this analysis, we have devised a method that exploits both the spatial and transform domains. What distinguishes our algorithm from previous work is that it determines the thumbnail image pixels by summing the residual and estimate block averages. The residual block averages are acquired directly in the transform domain, and the estimate block averages are calculated in the spatial domain. The proposed method produces thumbnail images that are indistinguishable from the ones produced by decoding the H.264/AVC I-slice bit streams and then scaling them down, while, for most images, it executes almost 3 times faster than the down-scaling method at frequently used bandwidths.
Introduction
The use of spatially reduced images, also known as thumbnail images, has become more prevalent as video browsing and retrieval have become more important. To generate thumbnail images, the encoded video stream must be decoded and the decoded video sequence must then be down-scaled. This can lead to significant computing and storage requirements as the demand for video browsing and retrieval expands to real-time broadcasting and mobile terminals [1]. Therefore, efficient methods for the rapid generation of thumbnail images from coded video streams must be developed.
Efficient methods have been developed for generating thumbnail images directly from MPEG-2 and MPEG-4 coded video streams. The current method for MPEG-2 and MPEG-4 coded bit streams collects the DC values of the image spectra in I-frames, since the DC values are the average values of the image blocks [2, 3]. However, this conventional method is not readily applicable to H.264/AVC coded bit streams for the following reasons. The H.264/AVC standard adopts the integer DCT (IntDCT). The kernel of the IntDCT is extracted from the kernel of the modified DCT (MDCT), which approximates the DCT kernel found in the MPEG-2 and MPEG-4 standards [4, 5]. The standard also performs intra prediction in the spatial domain and applies the transform to the resulting spatial residual blocks [6]. Consequently, the H.264/AVC transform coefficients are not directly associated with the image block spectra. In addition, the quantization parameters and the integer DCT coefficients are closely coupled. Therefore, it has become necessary to develop a method that efficiently generates thumbnail images from an H.264/AVC encoded video stream.
To produce the image spectrum from an H.264/AVC bit stream, Chen et al. developed matrix operations that translate the spatial-domain intra prediction into the MDCT domain [7]. Using this work, Kim et al. generated thumbnail images by collecting the DC values of the MDCT coefficients [8]. Since the method in [7] considers neither the quantization process nor the procedure of separating the IntDCT kernels from the MDCT kernels, it induces mismatch errors between the encoder and decoder when applied to an actual H.264/AVC coding system [9, 10]. To compensate for these errors, Kim et al. adopted lookup tables (LUTs). However, the mismatch errors cannot be analyzed exactly in the transform domain, so it is difficult for the lookup tables to compensate for them. The method also requires real-number operations because the MDCT coefficients are real-valued.
In this work, we propose new algorithms that exploit both the spatial and transform domains. We analyze the intra prediction in the MDCT domain, taking the mismatch errors into consideration. Based on this analysis, the proposed method obtains thumbnail pixels by summing the averages of the residual and estimate blocks. The averages of the residual blocks are acquired from the DC values in the integer transform domain, and those of the estimate blocks are computed in the spatial domain. The proposed method not only eliminates the source of the mismatch errors but also uses integer operations only. The experiments verify that the proposed method produces thumbnail images that are subjectively indistinguishable from the down-scaled images and reduces the complexity by more than 60% relative to the down-scaling method.
The paper is organized as follows. Section 2 introduces the transformation and intra prediction used in H.264/AVC. In Section 3, we prove the infeasibility of generating the thumbnail image directly from the transform coefficients of H.264/AVC. Section 4 elaborates the dual domain method for generating thumbnail images from H.264/AVC bit streams. The performance of the developed method is discussed in Section 5. Finally, Section 6 summarizes the conclusions.
Transform and Intra prediction in H.264/AVC
This section briefly addresses the transformation and the intra prediction of H.264/AVC.
Transformation of H.264
In H.264/AVC, the spectrum of an image block is obtained by the modified DCT (MDCT), which approximates the real DCT. However, to allow integer operations, H.264/AVC performs the integer DCT (IntDCT), whose kernels are derived from the MDCT kernel [4].
Let
Intra prediction of H.264/AVC encodes the residual between current blocks to be coded and blocks estimating the current blocks. Let

Reference pixels in accordance with block sizes. (a) The 13 reference pixels for the 4×4 intra predictions. (b) The 33 reference pixels associated with the 16×16 intra predictions.
It is possible to describe the procedures of reconstructing estimated block in matrix operations [7, 8]. We denote
The n-th residual block and its reconstructed residual block are denoted as
In this section, we analyze the intra prediction in the transform domain and then verify that it is infeasible to generate thumbnail images directly from H.264/AVC transform coefficients.
We use the subscripts “MDCT” and “IntDCT” to denote the modified DCT blocks and the integer DCT blocks. From (3), the intra prediction in MDCT domain can be described as
From (4), if Q_s is the quantization scale, the spectrum of a reconstructed residual block is
In H.264/AVC, the inverse quantization procedure is implemented by multiplying
The rescaling matrix
For integer operations, the inverse quantization of H.264/AVC produces the integer block
As shown from (4) and (9), the errors of the elements containing rounding error spread over entire elements of the reconstructed spectrum block
Figure 2 compares the thumbnail images produced by taking the average values of the 4×4 decoded image blocks and by collecting the DC values of the reconstructed 4×4 MDCT blocks. As shown in Fig. 2(b), the integer approximation error embedded in the MDCT coefficients propagates and accumulates, so the quality of the thumbnail image degrades rapidly as the intra prediction proceeds.

Comparison of thumbnail images. (a) Thumbnail image produced by the average values of 4×4 blocks. (b) Thumbnail image produced by collecting DC values of the 4×4 reconstructed MDCT blocks.
This section develops a method for the fast generation of thumbnail images from H.264/AVC coded bit streams without any mismatch errors. A thumbnail pixel must be the average of the reconstructed block, and the reconstructed block is the sum of the residual block and the estimate block. Therefore, a thumbnail pixel Avg_rec is obtained as the sum of the residual block average and the estimate block average Avg_EstBL.
The proposed method obtains the residual block averages from the integer transform coefficients and calculates the estimate block averages in the spatial domain. The proposed method is therefore carried out in dual domains.
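The decomposition above rests on the linearity of the block average. The following minimal sketch checks this on one 4×4 block. The pixel values and the helper name `block_sum` are made up for illustration, and clipping of reconstructed pixels to the valid sample range is ignored, which is an assumption of this sketch.

```python
# Illustrative 4x4 blocks (hypothetical values, not taken from the paper).
residual = [
    [3, -1, 0, 2],
    [1, 0, -2, 1],
    [0, 4, -1, -3],
    [2, -2, 1, 3],
]
estimate = [
    [100, 102, 104, 106],
    [100, 102, 104, 106],
    [100, 102, 104, 106],
    [100, 102, 104, 106],
]


def block_sum(block):
    """Sum of all samples in a square block given as a list of rows."""
    return sum(sum(row) for row in block)


# Reconstructed block = residual + estimate (clipping ignored in this sketch).
reconstructed = [
    [r + e for r, e in zip(res_row, est_row)]
    for res_row, est_row in zip(residual, estimate)
]

n_samples = 16
avg_res = block_sum(residual) / n_samples
avg_est = block_sum(estimate) / n_samples
avg_rec = block_sum(reconstructed) / n_samples

# The thumbnail pixel can be formed from the two averages separately.
print(avg_rec, avg_res + avg_est)
```

Because the sums are exact integers, the two printed values agree exactly; the two averages can therefore be computed in different domains and added afterwards.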
At {(i, j)} = {(0, 0) , (2, 0) , (0, 2) , (2, 2)}, the elements of
From (9) and (11), the reconstructed MDCT coefficients at { (i, j)} = { (0, 0) , (2, 0) , (0, 2) , (2, 2) } are
The DC coefficient of image spectrum,
From (13), we can obtain the residual block average directly from integer DCT coefficients without any mismatch errors.
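As a rough illustration of why the DC coefficient carries the block average, the sketch below applies the well-known 4×4 forward core transform matrix of H.264/AVC to a residual block and checks that the (0, 0) coefficient equals the sum of the 16 samples. Quantization and rescaling, which the derivation above accounts for, are deliberately left out, and the helper names and sample values are hypothetical.

```python
# 4x4 forward core transform matrix of H.264/AVC (integer kernel).
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul(a, b):
    """Plain integer matrix product of two 4x4 matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward_core_transform(x):
    """W = CF * X * CF^T, without any scaling or quantization."""
    return matmul(matmul(CF, x), transpose(CF))


# Hypothetical residual block used only for this check.
residual = [
    [3, -1, 0, 2],
    [1, 0, -2, 1],
    [0, 4, -1, -3],
    [2, -2, 1, 3],
]

w = forward_core_transform(residual)
total = sum(sum(row) for row in residual)

# The first row of CF is all ones, so W[0][0] is the sum of all samples.
print(w[0][0], total, w[0][0] / 16)
```

Under these assumptions the residual block average is simply the unscaled DC coefficient divided by 16, with no inverse transform required.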
Similarly, for the 8×8 transform, the elements of
As shown in Section 3, the estimate block averages calculated from transform coefficients must contain the accumulated mismatch errors. Therefore, we need to calculate the estimate block averages in the spatial domain.
The estimate blocks are constructed by referring to the pixels at the boundaries of previously reconstructed blocks [6]. Therefore, Avg_EstBL can be calculated from the boundary pixels alone. Table 1 lists the formulas for calculating Avg_EstBL for each intra prediction mode. In the table, the calculations of the estimate block averages for the vertical, horizontal and DC modes of the 16×16 size are equivalent to those of the same modes of the 4×4 size.
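For the three simplest 4×4 modes, the idea can be sketched as follows. The code builds the estimate block as the standard prescribes for the vertical, horizontal and DC modes (with both neighbours available) and compares its average with a shortcut that uses only the reference pixels. The function names and pixel values are assumptions of this sketch, and the exact formulas and rounding conventions of Table 1 are not reproduced here.

```python
def predict_vertical(top):
    """Vertical mode: each row copies the 4 pixels above the block."""
    return [list(top) for _ in range(4)]


def predict_horizontal(left):
    """Horizontal mode: each row repeats the pixel to its left."""
    return [[p] * 4 for p in left]


def predict_dc(top, left):
    """DC mode with both neighbours available: rounded mean of 8 pixels."""
    dc = (sum(top) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


def block_average(block):
    return sum(sum(row) for row in block) / 16


# Shortcuts that never build the 4x4 estimate block.
def avg_vertical(top):
    return sum(top) / 4


def avg_horizontal(left):
    return sum(left) / 4


def avg_dc(top, left):
    return (sum(top) + sum(left) + 4) >> 3


# Hypothetical reference pixels from neighbouring reconstructed blocks.
top = [120, 124, 129, 131]
left = [118, 117, 121, 126]

print(block_average(predict_vertical(top)), avg_vertical(top))
print(block_average(predict_horizontal(left)), avg_horizontal(left))
print(block_average(predict_dc(top, left)), avg_dc(top, left))
```

The directional modes need weighted combinations of the reference pixels, but the principle is the same: the average depends only on the boundary pixels, not on the interior of any previously reconstructed block.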
It can be observed that the reference pixels are produced from the combinations of only the 7 boundary pixels located at bottom and right boundaries of previously reconstructed blocks. Figure 3 represents how combinations of the 7 boundary pixels construct a 4×4 estimate block. In Fig. 3(a), the block

Reference pixels for constructing a 4×4 estimate block. (a) Constitution of 13 reference pixels used for estimating the block
As seen in (3), the reconstruction of the 7 boundary pixels requires the reconstruction of the corresponding pixels of the residual block and the estimate block. The 7 boundary pixels of a residual block are reconstructed by calculating the integer inverse DCT [13]. The boundary pixels of an estimate block are generated through combinations of the previously reconstructed boundary pixels, where the combinations are made up in accordance with each intra prediction mode [6].
Since only the 7 boundary pixels of a residual block need to be reconstructed, we develop a method that greatly reduces the complexity of the integer inverse DCT (IntIDCT). From (1), the IntIDCT is computed as
Let
Similarly, Avg_EstBL for the 8×8 block size can be calculated from the 25 reference pixels shown in Fig. 4. Table 2 lists the formulas for calculating Avg_EstBL for each 8×8 intra prediction mode. As in the 4×4 block, only the 15 boundary pixels of the 8×8 block are reconstructed. The 15 integer IDCT (15-IntIDCT) is derived in the following manner.

Reference pixels for constructing an 8×8 estimate block.
where the horizontal components are
Tables 3 and 4 compare the complexities of the proposed integer IDCTs and the conventional integer IDCTs. The computational structure of the 7-IntIDCT and 15-IntIDCT contains neither the diagonal component computations nor any internal dependencies. Accordingly, the 7-IntIDCT and 15-IntIDCT not only greatly decrease the number of operations but also avoid any recursive operations. By avoiding the recursive structure, the proposed integer IDCTs dramatically reduce memory access compared to the fast integer IDCT. In addition, they require fewer arithmetic operations than the direct IDCT since fewer components are computed.
By constructing only the 7 or 15 boundary pixels, we can greatly reduce the complexity of calculating the average of an estimate block compared to the method that reconstructs entire blocks.
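The following sketch illustrates the boundary-only idea for the 4×4 case. It is not the 7-IntIDCT itself, whose closed-form expressions are not reproduced here: as a simplifying assumption it still runs the standard row pass on all four rows and only restricts the column pass to the 7 outputs on the bottom row and right column. The 1-D butterfly and the final `(x + 32) >> 6` rounding follow the H.264/AVC inverse core transform; the function names and coefficient values are hypothetical.

```python
def inverse_1d(w0, w1, w2, w3):
    """1-D inverse core transform butterfly of H.264/AVC."""
    e0 = w0 + w2
    e1 = w0 - w2
    e2 = (w1 >> 1) - w3
    e3 = w1 + (w3 >> 1)
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


def full_inverse(w):
    """Reference: full 4x4 inverse transform (rows, then columns)."""
    rows = [inverse_1d(*r) for r in w]
    out = [[0] * 4 for _ in range(4)]
    for j in range(4):
        col = inverse_1d(rows[0][j], rows[1][j], rows[2][j], rows[3][j])
        for i in range(4):
            out[i][j] = (col[i] + 32) >> 6
    return out


def boundary_inverse(w):
    """Compute only the right column and bottom row (7 distinct pixels)."""
    rows = [inverse_1d(*r) for r in w]
    # Right column: all four outputs of the column pass for column 3.
    col3 = inverse_1d(rows[0][3], rows[1][3], rows[2][3], rows[3][3])
    right = [(v + 32) >> 6 for v in col3]
    # Bottom row: only the last output (e0 - e3) of each column pass.
    bottom = []
    for j in range(4):
        f0, f1, f2, f3 = rows[0][j], rows[1][j], rows[2][j], rows[3][j]
        bottom.append(((f0 + f2) - (f1 + (f3 >> 1)) + 32) >> 6)
    return right, bottom


# Hypothetical rescaled coefficients of one residual block.
coeffs = [
    [640, -128, 64, 0],
    [192, 0, -64, 0],
    [0, 64, 0, 0],
    [-64, 0, 0, 0],
]

full = full_inverse(coeffs)
right, bottom = boundary_inverse(coeffs)
print(right, [full[i][3] for i in range(4)])
print(bottom, full[3])
```

The boundary values match the full inverse transform while the 9 interior pixels are never produced, which is the saving that the tables quantify for the proposed transforms.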
Complexities of the 7-IntIDCT and the 4×4 integer IDCTs
Complexities of the 15-IntIDCT and the 8×8 integer IDCTs
Figure 5 illustrates the overall scheme of the proposed method. The quantized residual integer DCT block

The overall scheme for the proposed dual domain method.
Distinctions of the proposed method can be listed as follows. First, the proposed method uses multiple domains exploiting both the transform and spatial domain. Second, it performs the 7-IntIDCT or the 15-IntIDCT instead of the 4×4 or 8×8 integer IDCT. Finally, it recovers only the 7 or 15 estimate boundary pixels rather than the entire estimate blocks.
In chroma components, even if the block size for the intra prediction is 8×8, the size of integer DCT is 4×4. Thus, 7-IntIDCT can be directly applied except that the reference pixels are at boundaries of the 8×8 block.
We evaluate the performance of the proposed method in terms of thumbnail image quality, theoretical complexity and actual computation time. The proposed method was implemented on the H.264/AVC reference software JM17.2 at the main and high profiles [14]. The test sequences are ‘table setting’, ‘playing cards’, and ‘rolling tomatoes’, whose resolution is HD (1920×1080). The transform block size for the main profile is 4×4, and the resolution of the thumbnail images is 480×270. At the high profile, both 4×4 and 8×8 block transforms are used, so the resolution of the thumbnail images is 240×136.
Evaluation of thumbnail image quality
The original thumbnail images are naturally regarded as the images obtained through fully decoding the intra-coded video stream and then down-scaling the resultant images. Average, bilinear and bicubic filters are used in down-scaling.
We used the Structural SIMilarity(SSIM) index to measure the similarities between the original thumbnail images and ones generated by the proposed method, and also between the original thumbnail images and the ones obtained through the DC extraction method [15]. For two images
An SSIM score of 1.0 implies that the compared images are identical. Table 5 shows the SSIM results. The SSIM values comparing the original images with the images produced by the proposed method are very close to 1, which indicates that the compared images are indistinguishable. Conversely, the low SSIM values for the DC extraction method indicate that the images it produces are clearly distinguishable from the original images.
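For readers unfamiliar with the index, the sketch below computes a simplified SSIM over a single global window. This is an assumption made for brevity: SSIM is normally evaluated over local windows and then averaged, so the numbers produced here are not comparable to those in Table 5. The constants K1 = 0.01 and K2 = 0.03 are the commonly used defaults, and the function name and pixel values are hypothetical.

```python
def ssim_global(x, y, dynamic_range=255):
    """Single-window SSIM of two equally sized, flattened grayscale images."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Unbiased sample variances and covariance.
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Stabilizing constants with the commonly used K1 and K2.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Hypothetical flattened thumbnails.
reference = [52, 55, 61, 66, 70, 61, 64, 73]
same = list(reference)
degraded = [60, 50, 70, 58, 80, 55, 72, 65]

print(ssim_global(reference, same))
print(ssim_global(reference, degraded))
```

Identical inputs give a score of 1 up to floating-point rounding, and any structural deviation lowers the score.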
The structural similarity(SSIM) index scores. The original images are measured against the images generated by the proposed method and the DC extraction method
Figure 6 shows the thumbnail images generated using each method and the differences between these images and the original images. We used an average filter to obtain the original image. The image resolution was reduced to 1/4 in each direction. As seen in the figure, the thumbnail image produced by the proposed method is identical to the original image, whereas the DC extraction method displays a degradation in quality caused by the mismatch errors.

Thumbnail images generated by the proposed method and the DC extraction method and differences against the original images. (a) Thumbnail images by the proposed method. (b) Thumbnail images by the DC extraction method.
According to both the SSIM and the subjective quality results, the proposed algorithm produces thumbnail images that are indistinguishable from the original ones.
To evaluate the complexity of the proposed method, we compared it with the down-scaling method. The down-scaling method generally uses an average filter, since the average filter has the lowest complexity and the thumbnail images it generates are almost identical to the ones generated by the other down-scaling filters.
For calculating the complexities theoretically, we multiplied the frequencies of each mode and the theoretical complexities found in the corresponding modes. The frequencies of each mode were actually counted. The theoretical complexity of the proposed method includes the computation of the 7-IntIDCT and 15-IntIDCT, the construction of the 7 or 15 boundary pixels of estimate blocks, and the calculation of the formulas found in Tables 1 and 2 used to obtain the estimate block average. The complexity of the down-scaling method includes the computation of the fast 4×4 and 8×8 integer IDCTs, the construction of the 4×4 or 8×8 estimate blocks, and the calculation of the reconstructed block average [4, 6, 11].
Table 6 compares the theoretical complexities. Compared to the down-scaling method, the proposed method reduces additions by 26.9%, multiplications by 45.5%, and memory accesses by 79.9% at the main profile, and reduces additions by 33.2%, multiplications by 38%, and memory accesses by 82.7% at the high profile.
Complexities for generating a pixel of thumbnail images from bit-stream coded with main and high profiles. ‘Scale’ and ‘Prop’ indicate ‘down-scaling method’ and ‘proposed method’, respectively
To evaluate the execution times on an actual system, we executed the proposed method and the down-scaling method on a 3 GHz 32-bit Pentium processor. Table 7 compares the average execution times for generating a thumbnail image. As the QP becomes larger, more integer DCT coefficients become zero. Zero coefficients need neither inverse quantization nor inverse transformation, so the execution time decreases further for larger QPs. At QP values of around 25, which produce the frequently used bandwidth for HD size, the proposed method executes almost 3 times faster than the down-scaling method.
Comparison of the average execution time for generating a thumbnail image (msec)
We have presented a new method for generating thumbnail images from H.264/AVC coded bit streams. The proposed method produces the pixel values of thumbnail images by summing the averages of the residual and estimate blocks. The averages of the residual blocks are obtained from the spectrum of the residual blocks in the transform domain, and those of the estimate blocks are calculated from combinations of reference pixels in the spatial domain. The proposed method not only eliminates the source of the mismatch errors but also uses integer operations only.
When the proposed method was tested on an actual system, it executed almost 3 times faster than the method that decodes the video stream and then down-scales the decoded images. The proposed method produces thumbnail images that are indistinguishable to the human eye from the ones produced by the down-scaling method.
Appendix
We derive how the error accumulates as the intra prediction proceeds. From (2), the MDCT domain version of the n-th estimate block
From (9),
The solution of recursive equation (19) is determined as
Thus,
Consequently, the spectrum of the reconstructed block contains integer approximation errors that accumulate as the intra prediction proceeds.
