Abstract
With the increasing variety of display devices, image retargeting has become an indispensable technology for adjusting the aspect ratio of images to suit different display terminals. Since the retargeting operation can cause geometric distortion and content loss, image retargeting quality assessment (IRQA) is necessary to guide the optimization, selection, and design of retargeting algorithms. This paper systematically reviews the state-of-the-art technologies in IRQA. It first discusses the image registration algorithms used to match the original image with the retargeted image, and then investigates the feature measurement methods for image retargeting quality evaluation. To facilitate the quantitative assessment of IRQA methods, the paper lists publicly available datasets and reports the performance of the mainstream methods. Finally, some promising research directions for IRQA are pointed out. From this survey, engineers in industry may find techniques to improve their image retargeting systems, and researchers in academia may find ideas for innovative work.
Introduction
In recent years, due to the diversity of display devices and the versatility of image media sources, images and videos need to adapt to display terminals with different resolutions and aspect ratios [1, 2]. Image retargeting technology adjusts the size of input images while retaining important areas [3, 4] for display devices with different resolutions and aspect ratios. Since the composition and characteristics of each image are different, no single retargeting method achieves satisfactory results on all types of photos [5, 6]. In addition, although objective image quality assessment methods have made significant progress, they are limited to comparing two images of the same size. Because of information loss and geometric changes, the size and resolution of the retargeted image differ from those of the original image, so their difference must be evaluated through registration or similarity measurement. Therefore, it is crucial to design an image retargeting quality assessment (IRQA) index that is consistent with subjective assessment, both to quantify the effects of different retargeting operators [7] and to optimize image retargeting technology so that it produces high-quality retargeted images.
Following the development of IRQA, this paper summarizes the key technologies, as shown in Fig. 1. Early IRQA indicators usually use Edge Histogram (EH) [8] and Color Layout (CL) [9] to measure the difference between the retargeted image and the original one. Taking the image as a whole, they calculate the spatial edge similarity and color distribution similarity between the original image and the retargeted image. Bidirectional Similarity (BDS) [10] uses the bidirectional mapping between the patches of the original image and the patches of the retargeted image as a measure. Bidirectional Warping (BDW) [11] is similar to BDS, but its mapping adopts asymmetric dynamic time warping, which minimizes the warping cost and keeps the patches in order. Still, both treat every patch as equally important, which easily leads to the loss of important information in retargeted images [12]. Earth-Mover’s Distance (EMD) [13] can capture the structural attributes of images and is more consistent with human subjective perception than the above indicators. These methods mainly assess the quality of the retargeted image by calculating the similarity between the original image and the retargeted image. Since the retention of important content is not considered [14], the consistency between these indicators and subjective perception is low. To align the original image with the retargeted image and compare the quality of images with different sizes, Scale-Invariant Feature Transform flow (SIFT Flow) [15] uses the SIFT descriptor [16] to match the original image with the retargeted image, which improves the consistency with subjective assessment. Following SIFT Flow, most IRQA methods first apply a registration algorithm to establish the pixel correspondence between the original image and the retargeted image, which solves the problem that images of different sizes and resolutions cannot be compared directly.
Then features are extracted from the original and retargeted images to calculate their similarity (mainly considering geometric distortion and content loss). Finally, these features are fused to obtain the predicted IRQA score. With the success of deep learning in image processing, high-level semantic features have become another vital feature measure in IRQA. Preserving the semantic information of the image is the primary task of image retargeting, and IRQA without precise evaluation of the semantic component is incomplete [17, 18].

Timeline of representative IRQA methods.
Over the past decade, there have been a few reviews of IRQA, most of which focus on classical image quality assessment measures. In 2012, Ma et al. summarized simple mathematical distance metrics for IRQA and evaluated them on their proposed dataset [19]. In 2020, Karimi et al. described subjective and objective image quality assessment metrics, classified IRQA methods according to low-level and high-level features, and listed their performance indicators [20]. However, the IRQA algorithms covered in [20] are limited. In 2022, Asheghi [21] listed the existing IRQA methods from subjective and objective perspectives while summarizing image retargeting methods, but the objective IRQA methods are only listed by year and not classified. Moreover, since the original and the retargeted image have different aspect ratios, image registration is generally used as a preprocessing step for IRQA, yet the existing IRQA surveys do not analyze or discuss the registration algorithms. Besides, they do not cover emerging topics. Considering these limitations, this paper introduces the image registration algorithms in detail, which can help later researchers understand the development of IRQA. We also summarize the general workflow of existing IRQA algorithms, as shown in Fig. 2, which is roughly divided into four stages: (1) registering the original image with the retargeted image; (2) extracting features and measuring similarity; (3) fusing the features from stage (2); and (4) predicting the quality of the retargeted image. Based on the algorithms used to implement (1) and (2), the existing IRQA algorithms are comprehensively summarized.

General method of IRQA.
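The four stages listed above can be illustrated with a minimal Python sketch. Everything in it is a toy stand-in chosen for illustration, not any surveyed method: `register` uses proportional index scaling where a real system would use SIFT Flow or backward registration, the two features and the equal fusion weights are hypothetical, and images are assumed to be 2D grayscale arrays with values in [0, 1].

```python
import numpy as np


def register(original, retargeted):
    """Stage 1 (toy): map each retargeted row/column to an original one by
    proportional index scaling. A real IRQA system would use SIFT Flow or
    backward registration here."""
    h_o, w_o = original.shape
    h_r, w_r = retargeted.shape
    rows = np.round(np.linspace(0, h_o - 1, h_r)).astype(int)
    cols = np.round(np.linspace(0, w_o - 1, w_r)).astype(int)
    return rows, cols


def measure_features(original, retargeted, rows, cols):
    """Stage 2 (toy): two similarity features, each in [0, 1]."""
    matched = original[np.ix_(rows, cols)]
    # Content feature: intensity agreement between matched pixels.
    content = 1.0 - float(np.mean(np.abs(matched - retargeted)))
    # Shape feature: how evenly width and height were scaled.
    r_w = retargeted.shape[1] / original.shape[1]
    r_h = retargeted.shape[0] / original.shape[0]
    shape = min(r_w, r_h) / max(r_w, r_h)
    return np.array([content, shape])


def fuse(features, weights):
    """Stages 3 and 4 (toy): weighted linear fusion gives the predicted score."""
    weights = np.asarray(weights, dtype=float)
    return float(features @ weights / weights.sum())


def irqa_score(original, retargeted, weights=(0.5, 0.5)):
    rows, cols = register(original, retargeted)
    features = measure_features(original, retargeted, rows, cols)
    return fuse(features, weights)
```

An unchanged image scores 1.0 under this sketch, and any crop or non-uniform scaling scores lower. The methods reviewed in the following sections replace each toy stage with a more faithful registration, feature, or fusion step.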
The main contributions of this work are as follows: (1) The existing IRQA algorithms are summarized into three parts: registering the pixels of the original image with the retargeted image; extracting features for feature measurement; and fusing multiple features. (2) The registration algorithms used for IRQA are summarized for the first time. (3) The feature measurement methods are analyzed from several aspects, including aesthetic perception and high-level semantic features based on deep learning. (4) We thoroughly compare the performance of existing IRQA methods, shedding light on potential directions for future research in registration, aesthetics, and symmetry based on deep learning.
The organization of this paper is as follows: Section 2 briefly summarizes the IRQA methods covered in this paper. Section 3 introduces the registration algorithms used by IRQA. Section 4 describes the feature measurement methods involved in IRQA and the commonly used feature fusion methods. Section 5 lists the datasets and evaluation indexes of image retargeting and compares the performance of state-of-the-art IRQA methods. The conclusion and future developments are discussed in Section 6.
We summarize the state-of-the-art IRQA algorithms by publication year in Table 1. In addition, we further classify and describe the literature in Table 1 according to feature measurement in Table 3. As shown in Table 1, we analyze the advantages and disadvantages of the referenced papers and provide a general overview of these IRQA algorithms.
A summary of representative literature in the field of IRQA
IRQA methods based on commonly used feature measures
Early quality assessment methods mainly measure the distance between two images without considering the preservation of important content or structural distortion [19]. Therefore, the objective evaluation results are not well aligned with subjective scores. Owing to the excellent performance of SIFT in registration, IRQA indexes use SIFT [16] to establish the pixel correspondence between the original image and the retargeted image. The extracted features are then used to measure the similarity directly. Speeded Up Robust Features (SURF) [22], Deformable Spatial Pyramid (DSP) [23], backward registration [24], and bidirectional registration are also commonly used registration methods. Their principles, advantages, and disadvantages are summarized in Table 2, which describes how IRQA uses each algorithm to establish pixel correspondence.
Comparative analysis of different registration algorithms. “P”, “A”, and “D” indicate “Principles”, “Advantages”, and “Disadvantages”, respectively
SIFT is a local operator that describes local gradient information. It is a discrete feature representation comprising feature point extraction, feature point description, and feature point matching. The SIFT descriptor extracts the local structure of the image and encodes the context information. A discrete, discontinuity-preserving flow estimation algorithm then matches the SIFT features at the pixel level. In 2014, Fang et al. used the SIFT Flow algorithm to establish the pixel correspondence between the original image and the retargeted image [25]. They then measured the retention of structural information in the retargeted image with the structural similarity index measure (SSIM).
To compare two images with different sizes, Hsu et al. [26] also used the SIFT Flow algorithm to establish a dense correspondence, from which they measured the perceived geometric distortion and information loss. They proposed a new measure to quantify the information loss of retargeted images. As shown in Fig. 3, based on the SIFT Flow mapping, the saliency map of the retargeted image is obtained by warping the saliency map of the original image [27]. The saliency loss ratio (SLR) is the ratio of the sum of the saliency values after the retargeting operation to the sum before it, as shown in Equation (1).

Information loss estimation [26].
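As a hedged illustration of the saliency loss ratio, the following Python sketch follows only the verbal definition given in the text (total saliency after retargeting divided by total saliency before). The function names and the representation of the registration result as two integer index arrays are assumptions made for this example; the exact form of Equation (1) in [26] is not reproduced here.

```python
import numpy as np


def warp_saliency(saliency_org, map_rows, map_cols):
    """Build the retargeted saliency map by looking up, for every retargeted
    pixel, the original pixel it corresponds to (assumed integer indices,
    e.g. obtained from a dense registration such as SIFT Flow)."""
    return saliency_org[map_rows, map_cols]


def saliency_loss_ratio(saliency_org, saliency_ret):
    """Ratio of total saliency after retargeting to total saliency before."""
    total_org = float(saliency_org.sum())
    if total_org == 0.0:
        raise ValueError("original saliency map is all zeros")
    return float(saliency_ret.sum()) / total_org


# Usage: cropping a 4x6 image to its first three columns removes half of a
# salient 2x2 object located in columns 2 and 3.
saliency = np.zeros((4, 6))
saliency[1:3, 2:4] = 1.0
rows, cols = np.meshgrid(np.arange(4), np.arange(3), indexing="ij")
ratio = saliency_loss_ratio(saliency, warp_saliency(saliency, rows, cols))
```

Under this definition a value near 1 means that most salient content survived the retargeting, while a small value signals that salient regions were removed.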
Speeded Up Robust Features (SURF) is a scale- and rotation-invariant interest point detector and descriptor [22]. It generates feature points and extracts features by constructing a Hessian matrix (corresponding to the Gaussian convolution process in SIFT). In feature point descriptor generation, Haar wavelet responses are used to assign the orientation of SURF feature points. Each sub-region is described by four components: the sum of the horizontal responses, the sum of the vertical responses, the sum of the absolute values of the horizontal responses, and the sum of the absolute values of the vertical responses. In comparison, SIFT counts eight gradient directions in each region. The generation time of the SURF feature point descriptor is reduced by half, which dramatically improves the feature extraction and description speed and generates stable edge points. Zhang et al. [28] proposed an intermediate domain optimization registration method based on SURF, which can establish a robust match. They then used the extracted SURF features to calculate the Average Local Similarity (ALS) of all image blocks [29]. The Content Loss Degree (CLD) of the retargeted image is also calculated from the unmatched SURF features; see Equations (2) and (3).
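The four-component sub-region description can be made concrete with a short Python sketch. It is a simplified illustration, not the SURF implementation of [22]: plain finite differences stand in for Haar wavelet responses, no Gaussian weighting or orientation alignment is applied, and the function names are hypothetical. With a 4×4 grid of sub-regions the vector has 4 × 4 × 4 = 64 entries, compared with 4 × 4 × 8 = 128 for the standard SIFT descriptor.

```python
import numpy as np


def surf_like_subregion_descriptor(patch):
    """Return (sum dx, sum dy, sum |dx|, sum |dy|) for one sub-region.
    Finite differences stand in for Haar wavelet responses."""
    patch = np.asarray(patch, dtype=float)
    dx = patch[:, 1:] - patch[:, :-1]   # horizontal response
    dy = patch[1:, :] - patch[:-1, :]   # vertical response
    return np.array([dx.sum(), dy.sum(), np.abs(dx).sum(), np.abs(dy).sum()])


def surf_like_descriptor(patch, grid=4):
    """Concatenate the four sums over a grid x grid layout of sub-regions
    and normalize to unit length (64 values when grid is 4)."""
    patch = np.asarray(patch, dtype=float)
    sub_h, sub_w = patch.shape[0] // grid, patch.shape[1] // grid
    parts = [
        surf_like_subregion_descriptor(
            patch[i * sub_h:(i + 1) * sub_h, j * sub_w:(j + 1) * sub_w]
        )
        for i in range(grid)
        for j in range(grid)
    ]
    vec = np.concatenate(parts)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```

For a 20×20 patch around a keypoint, `surf_like_descriptor` returns a 64-dimensional unit vector, which is half the length of a 128-dimensional SIFT descriptor.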
State-of-the-art matching methods (SIFT Flow and PatchMatch [30]) generally operate at the pixel level. In [23], the DSP algorithm introduces a pyramid model that regularizes matching consistency over multiple spatial extents simultaneously (from the whole image, to coarse grid cells, and then to each pixel), which greatly improves accuracy and running time. To measure the perceived geometric distortion and information loss, Shigwan et al. [31] replaced the SIFT Flow algorithm used in [26] with the DSP algorithm to establish a fast and dense correspondence. The predicted quality score is more consistent with subjective evaluation.
Backward registration algorithm
Zhang et al. [24] showed that geometric change estimation efficiently clarifies the relationship between the original and retargeted images. They proposed a backward registration algorithm that maps the retargeted image to the original image to reveal the geometric change, which further improves the registration accuracy. Their paper interprets image retargeting as resampling grid generation followed by forward resampling. As shown in Fig. 4(a), the red nodes are the regular retargeted image grid, the gray nodes are the regular original image grid, and the blue nodes are the resampling grid, which can take different shapes. Backward registration is formulated as the problem of recovering the resampling grid, which is the inverse of forward resampling, as shown in Fig. 4(b). The red node p is the observed pixel in the retargeted image, the blue node l is the unknown resampling location for pixel p, and the subscripts {L, T, B, R} denote the four-connected neighborhood. The task is to estimate the resampling location for each pixel of the retargeted image. The backward registration problem is expressed as a labeling problem over the retargeted image pixels with minimum energy [24], and the computation is summarized in Equation (4).

Forward resampling and backward registration [24].
To obtain accurate backward registration results, the authors of [24] develop a hybrid feature descriptor for the data term, combining the CIE-Lab intensity component, the dense SIFT descriptor, and a relative position component, in contrast to earlier methods that match with the SIFT descriptor alone.
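The idea of treating backward registration as a minimum-energy labeling problem can be sketched on a single image row. The Python example below is a one-dimensional toy under stated assumptions, not the formulation of [24]: the data term is a plain absolute intensity difference instead of the hybrid descriptor, the smoothness term penalizes neighboring pixels whose resampling locations are not exactly one step apart, the weight `smooth_weight` is hypothetical, and the energy is minimized exactly by dynamic programming.

```python
import numpy as np


def backward_register_1d(original, retargeted, smooth_weight=0.1):
    """For each retargeted pixel, pick a resampling location in the original
    row by minimizing a data cost plus a pairwise smoothness cost."""
    original = np.asarray(original, dtype=float)
    retargeted = np.asarray(retargeted, dtype=float)
    m, n = len(original), len(retargeted)
    labels = np.arange(m)
    # data[p, l]: cost of assigning original location l to retargeted pixel p.
    data = np.abs(retargeted[:, None] - original[None, :])
    # pair[a, b]: penalty when neighboring pixels map to locations a then b;
    # it is zero when b == a + 1 (locally undistorted sampling).
    pair = smooth_weight * np.abs(labels[None, :] - labels[:, None] - 1)
    cost = data[0].copy()
    back = np.zeros((n, m), dtype=int)
    for p in range(1, n):
        total = cost[:, None] + pair            # total[a, b]
        back[p] = np.argmin(total, axis=0)      # best predecessor for each b
        cost = total[back[p], labels] + data[p]
    # Backtrack the minimum-energy labeling.
    locations = np.zeros(n, dtype=int)
    locations[-1] = int(np.argmin(cost))
    for p in range(n - 1, 0, -1):
        locations[p - 1] = back[p, locations[p]]
    return locations
```

For an original row `[0, 10, 20, 30, 40, 50]` and a retargeted row `[0, 10, 40, 50]`, in which two middle pixels were removed, the recovered locations are `[0, 1, 4, 5]`: the jump between the second and third label reveals where content was deleted.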
A bidirectional registration algorithm was proposed in [10] to improve the accuracy of backward registration. It generally applies SIFT Flow registration in two directions: forward matching from the original image to the retargeted image, and backward matching from the retargeted image to the original image. It can therefore be regarded as a supplement to backward registration. As shown in Fig. 5, the visualization of forward matching has the same size as the retargeted image, and the visualization of backward matching has the same size as the original image. Pixel matching is established in both the forward and backward directions.

Bidirectional registration.
No-reference image quality evaluation methods do not involve a registration algorithm. In 2016, Ma et al. [32] proposed the first no-reference IRQA method, based on pairwise rank learning. GIST [33] features (a scene feature descriptor that can capture the information most relevant to the perceived quality of the retargeted image) are extracted from each image and used for pairwise rank learning to obtain a ranking model. Based on the ranking results of the learned model and the subjective quality scores, the image quality scores are obtained by fitting the ranking function with an exponential curve.
When using registration algorithms, researchers found that incorrect matches easily occur when pixel correspondence is established in regions where texture overlaps with smooth areas of the image [34]. To solve this problem, Jiang et al. [34] did not establish a dense correspondence between images but instead measured the similarity between the original image and the retargeted image from a global perspective. Two overcomplete dictionaries are learned from an image to represent its distortion-sensitive features. The dictionaries of the retargeted image and the original image are concatenated to obtain a joint overcomplete dictionary, which is used for sparse representation, and the similarity is then measured. However, this method is less efficient than the SIFT Flow and EH methods because learning features from overcomplete dictionaries is very time-consuming, accounting for about 70% of the total time.
Feature measurement method for image retargeting quality evaluation
In the first stage, image registration establishes the pixel correspondence between the original and the retargeted image. This is the crucial step of IRQA and solves the problem that the two images differ in size. The features of the two images are then measured. A feature measurement method that is consistent with subjective perception improves the agreement between IRQA and subjective evaluation [53]. The different feature measurement methods are classified and summarized in Table 3.
Structural distortion and information loss
Many experiments show that accounting for shape distortion and content information loss improves metric performance [19]. Therefore, most IRQA methods use structural distortion and information loss as measurement indicators, and the following IRQA methods mainly consider these two. Zhang et al. [35] proposed a novel objective index that accounts for three major determining factors in human visual perception of retargeted images: global structural distortion (G), local region distortion (L), and loss of salient information (S). Figure 6 visualizes these three dominant distortion types. They are measured by graph structural similarity, image block similarity, and information integrity, respectively. Finally, a machine learning model fuses these features to predict the quality score. The quality index is called GLS.

Three dominant distortion types [35].
To limit visual artifacts in the retargeted image while preserving the critical content and internal structures of the original image, the Hybrid Distortion Pooled Model (HDPM) [36] takes into account common types of mixed distortion: local image distortion, content information loss, and image structure distortion. First, SIFT descriptors are extracted from the original image and the retargeted image and matched, and the local distortion of the retargeted image is measured by calculating the SSIM between each pair of matched blocks. Next, the loss of content information is evaluated using the unmatched SIFT descriptors, and the distortion of image structure is quantified using the gray-level co-occurrence matrices (GLCM) [54] of the two images. Finally, the three distortion measures are linearly fused to predict the final quality. Because some distortions are difficult to evaluate at the pixel level, a region-based evaluation framework was proposed to measure shape distortion and information loss [37]. First, the saliency map of the original image and a group of region segments, which determine the region information, are extracted. Then, the SIFT Flow method establishes the correspondence between the regions of the original and the retargeted image, and the local distortion, global distortion, and information loss of the retargeted image are calculated. Finally, the quality index is the weighted sum of these three measures.
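A gray-level co-occurrence matrix is convenient for this task because its size depends only on the number of gray levels, so the matrices of two images with different sizes can be compared directly. The Python sketch below illustrates this under several assumptions that are not taken from [36]: images are 8-bit grayscale arrays, only horizontally adjacent pixel pairs are counted, histogram intersection is used as a stand-in similarity between the two matrices, and the fusion weights are hypothetical.

```python
import numpy as np


def glcm(image, levels=8):
    """Normalized co-occurrence matrix of horizontally adjacent pixel pairs
    for an 8-bit grayscale image quantized to `levels` gray levels."""
    img = np.asarray(image, dtype=float)
    quantized = np.clip((img / 256.0 * levels).astype(int), 0, levels - 1)
    left = quantized[:, :-1].ravel()
    right = quantized[:, 1:].ravel()
    matrix = np.zeros((levels, levels), dtype=float)
    np.add.at(matrix, (left, right), 1.0)   # count each (left, right) pair
    return matrix / matrix.sum()


def glcm_similarity(image_a, image_b, levels=8):
    """Histogram intersection of two GLCMs (1.0 for identical statistics).
    This similarity is an assumed stand-in, not the measure used in HDPM."""
    return float(np.minimum(glcm(image_a, levels), glcm(image_b, levels)).sum())


def linear_fusion(scores, weights):
    """Weighted linear fusion of several distortion scores."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(scores @ weights / weights.sum())
```

A call such as `linear_fusion([local, content_loss, structure], weights=[1, 1, 1])` then combines the three distortion scores; in practice the weights would be tuned against subjective scores.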
Since human visual perception depends strongly on image edges, the geometric distortion caused by image retargeting often distorts edges, which seriously affects the retargeted image quality. Peng et al. [38] proposed the Local and Global Geometric Distortions (LGGD) metric, which measures both local and global geometric distortions using a sketch token-based local edge descriptor (ST-LED) to represent geometry-aware features. This index, named LGGD+, has good compatibility and can be fused with existing IRQA indexes (including geometric and non-geometric distortion measures). Tables 5 and 6 show that the correlation between LGGD+ and human subjective perception is not very high. However, LGGD is an excellent supplement to existing IRQA methods. Designing a quality model that covers all quality-related aspects is not easy, and LGGD measures geometric distortion more accurately and raises IRQA performance to a higher level.
In IRQA, full reference images are not always available, but the reference image can be described with partial information. Wei et al. [39] proposed a reduced-reference IRQA method that calculates the local structural similarity and global saliency similarity between the original image and the retargeted image by extracting SIFT features and visual saliency features. Since pixels at different positions have different weights, a weighted EMD is proposed to calculate the similarity between features. Finally, the local EMD and global EMD are fused to obtain the final quality score. Unlike IRQA methods based on a full reference image, the reduced-reference method uses only part of the reference image information, so its correlation with subjective scores is usually relatively low. Their work aims to achieve a breakthrough in reduced-reference quality assessment of retargeted images.
The first step of IRQA is to establish pixel correspondence between the original image and the retargeted image. However, the overlapping part of the texture and smooth area in the image is prone to wrong matching when establishing the corresponding pixel, which leads to inaccurate feature similarity measurement. Given this problem, researchers have done a lot of improvement work.
Hsu et al. [26] simultaneously considered perceptual geometric distortion (PGD) and information loss. First, they measured the local Geometric Distortion Map (GDM) of the retargeted image using the local variance of the SIFT Flow vector field. They then warped the retargeted image back to the original image and evaluated the prediction residual between the warped image and the original image based on SIFT Flow. This residual serves as a Local Confidence Map (LCM) to suppress errors caused by SIFT Flow mismatches. The visual saliency map (VSM) is then used to determine the weight of each block in the GDM. Finally, PGD is calculated by combining GDM, VSM, and LCM. In addition, the information loss is measured by the saliency loss ratio (SLR), which is the ratio of the saliency value of the retargeted image to that of the original image. To alleviate the measurement error in the spatial domain, a learning-based spatial-frequency domain metric (LSFM) [40] was proposed, which measures shape distortion and visual content change (local and global) in the spatial domain and the frequency domain, respectively. Measuring in the frequency domain requires no matching, which reduces the matching error of the spatial domain. Finally, machine learning is used to fuse the different quality estimators. Similar to the confidence measure used in [26] to suppress false distortion detection caused by mismatches, Niu et al. proposed a metric named RN-IRQA [6]. The retargeted image is reconstructed by backward registration, and the similarity map between the retargeted image and the reconstructed image is calculated to measure the accuracy of backward registration; this map is called the registration confidence measure (RCM). In the RCM, the darker a pixel, the lower the accuracy of backward registration. The RCM is then fused with the similarity measure to obtain a reliable image fidelity measurement.
Niu et al. also put forward a noticeability-based pooling (NBQ) strategy, which reflects the different sensitivities of different image regions to distortion. Compared with the previous Importance Weighted Pooling (IWP) strategy, it yields a clear improvement.
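The combination of a distortion map with saliency and confidence weights can be sketched as follows. The Python example is an illustration under assumptions rather than the definition in [26]: the geometric distortion map is taken as the local variance of each flow component in a 3×3 window, and the pooled score is a weighted average whose weights are the product of saliency and confidence. The function names, the window size, and the pooling formula are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def geometric_distortion_map(flow, radius=1):
    """Local variance of a dense flow field of shape (H, W, 2), summed over
    the two flow components, using an edge-padded square window."""
    flow = np.asarray(flow, dtype=float)
    size = 2 * radius + 1
    gdm = np.zeros(flow.shape[:2])
    for k in range(flow.shape[2]):
        padded = np.pad(flow[:, :, k], radius, mode="edge")
        windows = sliding_window_view(padded, (size, size))
        gdm += windows.var(axis=(2, 3))
    return gdm


def pooled_distortion(gdm, vsm, lcm, eps=1e-12):
    """Average the distortion map with weights saliency * confidence, so that
    salient, reliably matched pixels dominate the score."""
    gdm, vsm, lcm = (np.asarray(a, dtype=float) for a in (gdm, vsm, lcm))
    weights = vsm * lcm
    return float((weights * gdm).sum() / (weights.sum() + eps))
```

A constant flow field (pure translation or cropping) yields a zero distortion map, while a discontinuity in the flow produces non-zero variance only around the discontinuity. Setting the confidence to zero at mismatched pixels removes their contribution from the pooled score.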
Aspect ratio similarity
The backward registration proposed in [24] reveals the geometric changes in image retargeting. As shown in Fig. 7, the geometric changes establish the block correspondence between the original and retargeted images. An Aspect Ratio Similarity (ARS) metric is designed to evaluate the local block changes and thereby the quality of the retargeted image. Since then, much of the literature has used ARS as one of its measurement features. Zhang et al. [41] further proposed a Multiple-Level Feature (MLF) method, which uses the backward registration algorithm to estimate the pixel correspondence between the original image and the retargeted image and then generates multi-level features from this correspondence, including low-level ARS [24], Edge Group Similarity [55], and high-level Face Block Similarity [56]. MLF models different attribute changes in image retargeting and addresses the failure of earlier metrics to capture the deformation of common high-level semantics, such as faces and structures. Guo et al. [42] mainly considered global distortion and local distortion. To measure the deformation of the overall structure, they proposed elastic registration based on the B-spline function and a foreground retention rate. To measure local content loss, they proposed an Improved ARS (IARS) and a concentrated deletion ratio. Because retargeting algorithms differ, the retargeted blocks are irregular polygons, so regarding the longest edges as the width and height, as in [24], is unreasonable. To better measure the change in local aspect ratio, the length and width are redefined: the maximum length and width of a block are divided by four and multiplied by N, the number of blocks into which the retargeted image is divided, to give the length and width of the retargeted block. Finally, a weighted linear fusion method is used to fuse the four features and evaluate the quality of the retargeted image.

ARS frame diagram [24].
Wu et al. [44] proposed a method based on multi-scale distortion-aware (MSDA) features, which further improves the calculation of ARS [24]. Since different distortions appear at different image scales, ARS features are also obtained from blocks at multiple scales (8×8 and 16×16). However, for images with complex foreground objects, symmetrical structures, and textures, its consistency with subjective perception is no better than that of MLF. The ARS used in MSDA is improved in a different way from the one proposed in [42]. The authors point out that, for each image block before and after retargeting, the original ARS calculation [24], Equation (6), has limitations: the visual distortion is ignored when a block in the original image has no match in the retargeted image.
Therefore, λ is introduced as the penalty factor of visual distortion, and the improved equation is Equation (7).
To extract an accurate saliency map as the weight for fusing the improved ARS block scores, Zhang et al. [57] proposed a Visual Attention Fusion (VAF) framework to calculate the saliency map. This framework integrates the saliency maps generated by the DCTS [27] and BSCA [58] algorithms and enhances face and line features. Low-level, mid-level, and high-level image features are all considered, which improves accuracy compared with previous saliency detection methods. In 2019, Li et al. [43] used backward registration to establish pixel correspondence and then applied different measures depending on whether the image contains foreground objects. For images with a salient foreground, a foreground measure (including a high-level semantic similarity feature and a low-level size ratio feature) and a global measure (including the ARS feature and the edge group similarity feature) are used; otherwise, only the global measure is applied. Niu et al. [6] also improved the ARS block quality [24]; the improved equation is as follows [5].
According to the improved ARS, area similarity and block-level shape similarity are used to measure the features to evaluate information loss and the geometric changes of each retargeted block. RCM and similarity measurement are fused to obtain block-level image fidelity measurement.
Human beings are sensitive to the shape distortion of important regions of retargeted images; however, existing IRQA algorithms seldom consider this, which leads to a mismatch between subjective perception and objective evaluation results. In 2021, Yao et al. [59] proposed an ARS metric that measures the salient region between the original and retargeted images. A new algorithm is used to extract the salient region of the retargeted image, and the calculation process is shown in Fig. 8. Backward registration mapping is established by using the SIFT Flow algorithm. L represents the position information of the important areas of the original image, and R represents the position information of backward registration. Then, the intersection C of L and R is used to form the set of important regions of the retargeted image. In their paper, the width and height of the important region of the original image and of the retargeted image are denoted as $S^{O}_{W}$, $S^{O}_{H}$, $S^{R}_{W}$, and $S^{R}_{H}$, respectively; the change rates of width and height between the original image and the retargeted image are expressed as $S_{RW} = S^{R}_{W} / S^{O}_{W}$ and $S_{RH} = S^{R}_{H} / S^{O}_{H}$, respectively; and the average value of the change rates is denoted as $S_{RM} = (S_{RH} + S_{RW}) / 2$. Therefore, the ARS of the important region is calculated as in Equation (9).

Generation of salient regions of retargeted images [59].
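The width and height change rates defined above can be turned into a similarity score, as in the following Python sketch. It assumes a commonly quoted form of the block-level ARS attributed to [24], in which a shape term rewards equal scaling of width and height and an exponential term penalizes content loss. Because Equations (6) to (9) are not reproduced in this text, the functional form and the parameter values `alpha` and `c` are assumptions, not the surveyed authors' exact definitions.

```python
import math


def change_rates(w_org, h_org, w_ret, h_ret):
    """Width rate S_RW, height rate S_RH, and their mean S_RM."""
    s_rw = w_ret / w_org
    s_rh = h_ret / h_org
    s_rm = (s_rw + s_rh) / 2.0
    return s_rw, s_rh, s_rm


def ars_score(w_org, h_org, w_ret, h_ret, alpha=0.3, c=1e-6):
    """Assumed ARS form: shape similarity times a penalty on size change."""
    s_rw, s_rh, s_rm = change_rates(w_org, h_org, w_ret, h_ret)
    # Shape term: 1 when width and height are scaled by the same factor.
    shape_term = (2.0 * s_rw * s_rh + c) / (s_rw ** 2 + s_rh ** 2 + c)
    # Size term: 1 when the mean change rate is 1 (no content removed).
    size_term = math.exp(-alpha * (s_rm - 1.0) ** 2)
    return shape_term * size_term
```

Under this assumed form an unchanged region scores exactly 1, a region squeezed to half its width scores about 0.79, and a region uniformly scaled to half size scores about 0.93, reflecting the intuition that aspect ratio changes are more objectionable than uniform scaling.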
In conclusion, ARS plays a vital role in IRQA.
Users always favor beautiful photos since they provide users with a more pleasant feeling, so aesthetic image analysis has attracted more and more attention in computer vision [60–62]. As part of feature measurement, aesthetics can improve the consistency between IRQA and subjective evaluation, so it has attracted the attention of many researchers.
Unlike previous methods, which mainly consider structural distortion and content loss, the following methods consider feature measurement from several aspects. Liang et al. [45] put forward five key factors: 1) retaining significant regions; 2) analyzing the influence of artifacts; 3) retaining the global structure; 4) observing aesthetic rules; and 5) maintaining symmetry. The aesthetic rules in their paper mainly refer to the rule of thirds and visual balance, each of which is assigned a 50% weight in the calculation. Finally, the five measurement results are linearly fused to predict the aesthetic score. Yan et al. [46] proposed an open framework for IRQA that uses saliency weighting, a CW-SSIM model with SIFT guidance [63], and an aesthetic evaluation model, and combines them with eight existing retargeting evaluation methods (seven early distance methods and the SSIM indicator) to assess the quality of image retargeting. The aesthetic evaluation model is shown in Fig. 9. A total of 1320 images from the Aesthetic Visual Analysis (AVA) [64] dataset are selected as the training set, and 11 aesthetic features are extracted from each image as input to train neural networks that predict subjective aesthetic scores. In [45], the features provided by these new and old models are used to train radial basis function networks [65, 66] to learn and predict the retargeted image quality. In 2019, Liu et al. [12] used a General Regression Neural Network (GRNN) [67] to model the combination of nine known objective quality evaluation indexes. They used a nine-dimensional vector to express the quality of the retargeted image, as shown in Equation (10).

Aesthetic evaluation model [46].
These indicators can be roughly divided into four categories: (1) retention of the global structure; (2) preservation of significant areas; (3) influence of visual distortion and introduced artifacts; (4) aesthetic measurement. The aesthetic measurement is calculated in the same way as in [44]. The calculated scores retain the scores of the same original images and provide a reference for comparing the retargeting results of different original images.
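As an illustration of the linear fusion described earlier for Liang et al. [45], the following minimal Python sketch combines the two aesthetic rules with equal weights and then fuses five factor scores linearly. The function names, the rule-of-thirds formula, the example factor values, and the equal fusion weights are assumptions made for illustration only; they are not taken from [45].

```python
import math


def rule_of_thirds_score(cx, cy, width, height):
    """Closeness of a salient centroid to the nearest rule-of-thirds power point.

    Hypothetical formulation: 1.0 at a power point, 0.0 at an image corner.
    """
    power_points = [(width * i / 3, height * j / 3) for i in (1, 2) for j in (1, 2)]
    nearest = min(math.hypot(cx - px, cy - py) for px, py in power_points)
    # An image corner is the farthest any pixel can be from its nearest power point.
    worst = math.hypot(width / 3, height / 3)
    return 1.0 - nearest / worst


def aesthetic_score(thirds, balance):
    """Rule of thirds and visual balance, each weighted 50%."""
    return 0.5 * thirds + 0.5 * balance


def linear_fusion(factors, weights):
    """Weighted sum of factor scores (weights are assumed to sum to 1)."""
    if len(factors) != len(weights):
        raise ValueError("factors and weights must have the same length")
    return sum(w * f for w, f in zip(weights, factors))


# Hypothetical scores in [0, 1] for one retargeted image.
thirds = rule_of_thirds_score(cx=100, cy=100, width=300, height=300)
aesthetics = aesthetic_score(thirds, balance=0.6)
# Order: salient regions, artifacts, global structure, aesthetics, symmetry.
factors = [0.9, 0.7, 0.8, aesthetics, 0.5]
weights = [0.2] * 5  # assumed equal weights
quality = linear_fusion(factors, weights)
print(round(aesthetics, 3), round(quality, 3))
```

In a real system the fusion weights would be fitted to subjective scores rather than fixed by hand.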
Bi-directional measurement assesses image retargeting quality by measuring features in both directions, from the original image to the retargeted image and back. Early bi-directional measures focused on the loss of relevant content. Zhang et al. [37] used bi-directional measurement to evaluate information loss in the foreground and background regions. Liu et al. [40] used the bidirectional similarity (BDS) measure to quantify changes in local visual content. Chen et al. [47] were the first to incorporate image Natural Scene Statistics (NSS) into IRQA. They additionally measured salient global structure distortion and bi-directional salient information loss, and proposed a bi-directional natural salient scene distortion model (BNSSD). The three measures are fed into a support vector regression model to predict the score. The forward salient information loss measures how much salient information of the original image is retained in the retargeted image, and the backward salient information loss measures how much salient information of the original image can be correctly recovered from the retargeted image. In 2018, Oliveira et al. [48] proposed Bi-directional Importance Map Similarity (BIMS). Its critical step is to measure features, such as the bi-directional quality score, the retargeting ratio feature, and the keypoint matching feature, in a bi-directional way. Bi-directional features help to estimate relative positions and to analyze how information is retained after retargeting; when related content is lost or visual distortion occurs, analyzing these positions enables accurate feature similarity measurement.
Bi-directional measures of geometric distortion have also been developed. Shao et al. [49] proposed the Transform-Aware Similarity (TRASIM) measure, which includes: (1) bidirectional geometric distortion (BDGD), which uses saliency values as weights to calculate the forward and backward geometric distortions; (2) bidirectional information loss (BDIL), which calculates the changes of salient regions under forward and backward retargeting; and (3) global salient structure distortion (GSSD). The TRASIM framework, shown in Fig. 10, differs from previous IRQA methods: a similarity transformation is established by bi-directional re-warping to simulate different types of retargeting operators, and geometric distortion and content loss are then measured based on this transformation to determine the quality of the retargeted image.

TRASIM framework [49].
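The bi-directional idea behind measures such as BDS can be illustrated with a minimal one-dimensional sketch: every patch of the original signal is matched to its nearest patch in the retargeted signal (completeness), and vice versa (coherence). The sketch below assumes 1-D lists in place of images, a sum-of-squared-differences patch distance, and a patch size of 2; all function names are hypothetical, and it is not the implementation used in [40] or [49].

```python
def patches(signal, size):
    """All contiguous patches of the given size."""
    if len(signal) < size:
        raise ValueError("signal is shorter than the patch size")
    return [tuple(signal[i:i + size]) for i in range(len(signal) - size + 1)]


def patch_distance(p, q):
    """Sum of squared differences between two equally sized patches."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def directed_distance(source, target, size):
    """Average distance from each source patch to its nearest target patch."""
    source_patches = patches(source, size)
    target_patches = patches(target, size)
    total = sum(
        min(patch_distance(p, q) for q in target_patches) for p in source_patches
    )
    return total / len(source_patches)


def bidirectional_distance(original, retargeted, size=2):
    """Return (completeness, coherence); lower values mean higher similarity."""
    # Completeness: is the content of the original still present after retargeting?
    completeness = directed_distance(original, retargeted, size)
    # Coherence: does the retargeted result contain content absent from the original?
    coherence = directed_distance(retargeted, original, size)
    return completeness, coherence


original = [1, 1, 5, 5, 9, 9]
cropped = [1, 1, 5, 5]  # cropping removes the last region
completeness, coherence = bidirectional_distance(original, cropped)
print(completeness, coherence)
```

In this example cropping introduces no new content, so only the completeness term is non-zero; an operator that creates artifacts would raise the coherence term instead.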
Some evaluation methods fuse features with simple regression algorithms, which cannot simulate how the human visual system perceives retargeted images. Jiang et al. [50] therefore trained a segmented stacked autoencoder based on geometric shape and content matching [68, 69] to fuse features and output two scores, which are then combined through a weighting mechanism to obtain the final image retargeting quality score. In recent years, deep-learning-based IRQA has mainly used neural networks to extract semantic features as high-level measurement features.
Fu et al. [51] extracted hand-crafted features by establishing the pixel correspondence between the original and retargeted images to measure structural distortion and information loss. A pre-trained VGG16 network [70] was then used to construct encoders that extract deep-learned features for measuring texture similarity and semantic similarity. As shown in Fig. 11, texture similarity is calculated from the conv1_1 layer of VGG16, and semantic similarity is measured from the output of the second fully connected layer. Finally, the hand-crafted and deep-learned features are combined to evaluate the quality of retargeted images. In [43], a foreground measure (comprising a high-level semantic similarity feature and a low-level size ratio feature) and a global measure (comprising an improved ARS feature and an edge group similarity feature) are extracted for images with foreground objects. Fu's method directly reshapes the image to the input size required by the network when extracting semantic features, which discards the original aspect ratio. An adaptive input method is designed in [43] so that the image matches the network's input size without changing its original aspect ratio, which further improves the accuracy of semantic measurement. The MLF evaluation method also considers ARS and edge group features, but its semantic analysis only detects faces. In contrast, [43] performs semantic analysis of foreground objects in general and therefore performs relatively well.

Deep feature maps extracted by VGG16 [51].
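One common way to fit an image to a fixed network input without changing its aspect ratio is to scale it uniformly and pad the remainder. The sketch below computes that geometry only; it is an illustrative assumption, not necessarily the adaptive input method of [43], and the function name and the 224-pixel square input (the usual VGG16 input size) are chosen for the example.

```python
def letterbox_geometry(width, height, target=224):
    """Resized size and padding that fit an image into a square network input.

    The image is scaled uniformly, so its aspect ratio is preserved, and the
    remaining border is split as evenly as possible between the two sides.
    """
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("sizes must be positive")
    scale = min(target / width, target / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_x = target - new_w
    pad_y = target - new_h
    return {
        "size": (new_w, new_h),
        "pad_left": pad_x // 2,
        "pad_right": pad_x - pad_x // 2,
        "pad_top": pad_y // 2,
        "pad_bottom": pad_y - pad_y // 2,
    }


# A 400x200 original and its 300x200 retargeted version keep their own
# aspect ratios instead of both being stretched to 224x224.
original = letterbox_geometry(400, 200)
retargeted = letterbox_geometry(300, 200)
print(original["size"], retargeted["size"])
```

Directly reshaping both images to the network size would make their aspect ratios identical, which hides exactly the change that retargeting introduced.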
The methods mentioned above all treat semantic features as additional features. In [52], however, semantic features are the primary measurement features, and the instance is used as the basic semantic unit. When comparing the original image with the retargeted image, the original image can be segmented with the Mask R-CNN model [71], but this model cannot be used for instance segmentation of retargeted images. A backward registration strategy therefore establishes instance-level pixel correspondence to reconstruct the retargeted image. Then, the semantic features at the instance level (shape distortion, size similarity, information loss, and position movement) are extracted, and an adaptive pooling method based on semantics is proposed to integrate them. Finally, the global feature ARS is integrated with the instance features, which further improves the consistency with subjective evaluation.
Effective image quality prediction depends not only on the selection of features but also on the mechanism that fuses all features into a single quality score, since each feature may contribute differently to the final score. Commonly used basic pooling methods include simple summation, multiplication, and linear combination. However, these methods assume the relative importance of each feature without a convincing basis [35]. In recent years, support vector regression (SVR) algorithms, including linear kernel SVR (l-SVR), polynomial kernel SVR (poly-SVR), and radial basis function kernel SVR (RBF-SVR) [72], have often been used to train prediction functions that fuse different features, and they have achieved good results. In addition, the fusion operation in Ambiguous D-means fusion clustering [73] can fuse the best features to generate a satisfactory result image, and it may be used to fuse the above features and improve the quality of images.
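The basic pooling methods named above can be sketched in a few lines. The feature values and weights below are hypothetical; the point is that each rule fixes the relative importance of the features in advance, which is the shortcoming that learned regressors such as SVR are meant to address.

```python
import math


def sum_pool(features):
    """Simple summation: every feature counts equally."""
    return sum(features)


def product_pool(features):
    """Multiplication: one low feature value pulls the whole score down."""
    return math.prod(features)


def linear_pool(features, weights):
    """Linear combination with hand-chosen weights."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical feature scores in [0, 1] for one retargeted image.
features = [0.8, 0.5, 0.9]
weights = [0.5, 0.3, 0.2]  # assumed, not learned

print(sum_pool(features), product_pool(features), linear_pool(features, weights))
```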
IRQA dataset, evaluation index, and performance comparison
Dataset
The commonly used image retargeting quality evaluation datasets are RetargetMe [7], CUHK [19], and NRID [26]. Table 4 briefly describes these three datasets in terms of the number of original images, the retargeting operators, the number of retargeted images, and the image attributes.
(1) RetargetMe dataset.
RetargetMe is the first dataset for IRQA and includes 37 original images. Each image has one or more of six attributes: line/edge, face/person, foreground object, texture, geometric structure, and symmetry. The length or width of each original image is reduced by 25% or 50% using 8 different retargeting operators: 1) Seam Carving (SEAM) [74]; 2) Nonhomogeneous Warping (WARP) [75]; 3) Scale and Stretch (SCST) [76]; 4) Multi-Operator (MOP) [11]; 5) Uniform Scaling (SCAL); 6) Shift-Map (SHIF) [77]; 7) Manual Cropping (CROP); and 8) Streaming Video (STVI) [78]. With 37 images and 8 operators, the dataset contains a total of 296 retargeted images. The subjective study adopts paired comparison [79]: participants are shown two different retargeted images and vote for the one with the better visual effect. The total number of votes obtained is taken as the subjective score of the retargeted image.
(2) CUHK dataset.
The CUHK dataset contains 57 original images, each retargeted with three different operators. As in RetargetMe, the length or width is reduced by 25% or 50%. The operators applied to each image may differ because they are randomly selected from 10 representative retargeting operators: the 8 operators in RetargetMe plus two others, optimized seam carving and scale (SCSL) [80] and energy-based deformation (ENER) [81]. The dataset comprises 171 retargeted images. In the subjective evaluation, participants rate each image on a five-point scale of “very bad”, “bad”, “fair”, “good” and “very good”, and these ratings are used to calculate the Mean Opinion Score (MOS) of each retargeted image.
Common datasets for image retargeting.
(3) NRID dataset.
The NRID dataset contains 35 original images, each with five retargeted versions, for a total of 175 retargeted images. The retargeting operators are SEAM, WARP, MULTI, SCAL, and SHIF. Each image is also labeled with the six attributes used in RetargetMe, and the subjective evaluation method is the same as in RetargetMe. Because few experimental results are available on this dataset, the performance comparison of IRQA methods on NRID is not listed in this paper.
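The two subjective scoring protocols described above can be sketched as follows: paired-comparison votes are tallied per retargeted image (RetargetMe, NRID), and five-point ratings are averaged into a MOS (CUHK). Mapping the five labels to the integers 1 to 5 is an assumption for illustration, as are the function names and the sample data.

```python
from collections import Counter

# Assumed numeric mapping of the five-point scale.
RATING_SCALE = {"very bad": 1, "bad": 2, "fair": 3, "good": 4, "very good": 5}


def tally_votes(images, comparisons):
    """Count how often each retargeted image wins a paired comparison.

    Each comparison is (first_image, second_image, chosen_image).
    """
    votes = Counter({name: 0 for name in images})
    for first, second, chosen in comparisons:
        if chosen not in (first, second):
            raise ValueError("the chosen image must be one of the compared pair")
        votes[chosen] += 1
    return dict(votes)


def mean_opinion_score(ratings):
    """Average of the numeric ratings given to one retargeted image."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(RATING_SCALE[label] for label in ratings) / len(ratings)


# Hypothetical paired comparisons for three retargeted versions of one image.
votes = tally_votes(
    ["SEAM", "CROP", "SCAL"],
    [
        ("SEAM", "CROP", "CROP"),
        ("SEAM", "SCAL", "SCAL"),
        ("CROP", "SCAL", "CROP"),
        ("SEAM", "CROP", "CROP"),
    ],
)
mos = mean_opinion_score(["good", "fair", "very good", "good"])
print(votes, mos)
```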
(1) Kendall rank correlation coefficient (KRCC).
KRCC is used to measure the correlation between predicted values and MOS:
$$\mathrm{KRCC} = \frac{n_c - n_d}{\frac{1}{2} n (n - 1)}$$
where $n_c$ represents the number of image pairs consistent with the subjective ranking, $n_d$ is the number of image pairs inconsistent with the subjective ranking, and $n$ represents the ranking sequence length.
(2) Pearson linear correlation coefficient (PLCC).
PLCC is used to describe the linear correlation between predicted values and MOS:
$$\mathrm{PLCC} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $X$ and $Y$ are the predicted values and the ground truth, respectively, $\operatorname{cov}(X, Y)$ is their covariance, and $\sigma_X$ and $\sigma_Y$ are the corresponding standard deviations. PLCC is sensitive to extreme values (outliers) in the predicted values.
(3) Spearman rank-order correlation coefficient (SRCC).
SRCC is used to measure the monotonic relationship between two variables:
$$\mathrm{SRCC} = \frac{\operatorname{cov}(r_X, r_Y)}{\sigma_{r_X} \sigma_{r_Y}}$$
where $r_X$ and $r_Y$ represent the rank values of $X$ and $Y$, respectively. For example, among nine values, the highest score is assigned rank 1, the lowest score rank 9, and the other values integer ranks between 2 and 8. SRCC evaluates the rank correlation of two sets of data, which alleviates the sensitivity of PLCC to extreme values.
(4) Root Mean Squared Error (RMSE).
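RMSE estimates the difference between the predicted values and the true MOS values:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (X_i - Y_i)^2}$$
where $X_i$ and $Y_i$ are the predicted value and the MOS of the $i$-th image, and $m$ is the number of values.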
(5) Outlier ratio (OR).
OR represents the ratio of the number of outliers to the total number of objective evaluation scores. An outlier is a predicted score that falls outside the range [MOS - 2σ, MOS + 2σ], where σ is the standard deviation of the subjective scores.
Among the above evaluation indexes, larger KRCC, PLCC and SRCC values indicate better correlation between subjective and objective scores, and smaller RMSE and OR values indicate that the objective scores are closer to the subjective scores.
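The five indexes can be computed with the standard library alone, as in the sketch below. It assumes the tau-a form of KRCC (ties counted as neither consistent nor inconsistent), average ranks for ties in SRCC, and a per-image standard deviation `sigma` supplied by the caller for OR; the function names and sample scores are hypothetical. In practice, routines from a statistics library are usually preferred.

```python
import math
from itertools import combinations


def krcc(x, y):
    """Kendall tau-a: (n_c - n_d) / (n * (n - 1) / 2)."""
    n = len(x)
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            n_c += 1  # pair ordered the same way in both lists
        elif product < 0:
            n_d += 1  # pair ordered in opposite ways
    return (n_c - n_d) / (n * (n - 1) / 2)


def plcc(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)  # the 1/n factors cancel


def ranks(values):
    """Rank 1 goes to the highest value; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def srcc(x, y):
    """Spearman correlation: Pearson correlation of the rank values."""
    return plcc(ranks(x), ranks(y))


def rmse(x, y):
    """Root mean squared error between predictions and MOS."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def outlier_ratio(predicted, mos, sigma):
    """Fraction of predictions outside [MOS - 2*sigma, MOS + 2*sigma]."""
    outliers = sum(1 for p, m, s in zip(predicted, mos, sigma) if abs(p - m) > 2 * s)
    return outliers / len(predicted)


# Hypothetical objective predictions and subjective scores for five images.
predicted = [4.1, 2.0, 3.2, 1.5, 4.8]
mos = [4.0, 2.5, 3.0, 2.9, 4.5]
sigma = [0.5, 0.5, 0.5, 0.5, 0.5]

for name, value in [
    ("KRCC", krcc(predicted, mos)),
    ("PLCC", plcc(predicted, mos)),
    ("SRCC", srcc(predicted, mos)),
    ("RMSE", rmse(predicted, mos)),
    ("OR", outlier_ratio(predicted, mos, sigma)),
]:
    print(f"{name}: {value:.3f}")
```

Because SRCC and KRCC depend only on the ordering of the scores, a single badly predicted image changes them less than it changes PLCC or RMSE.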
The performances of IRQA methods under the registration algorithms listed in Table 1 are compared in Tables 5 and 6.
Performance comparison of IRQA model in RetargetMe dataset
Performance comparison of IRQA model in CUHK dataset
In Tables 5 and 6, red font marks the best performance, bold font the second best, and yellow and green fonts the third and fourth best, respectively. Table 5 compares different IRQA methods on the RetargetMe dataset, giving the mean and standard deviation of the KRCC of each IRQA measure and the average KRCC for each attribute. The IRQA methods with good performance are concentrated among those using the backward registration algorithm. The SURF-based method performs relatively poorly on symmetric structures but well in other respects. Zhang et al. [82] achieve the best performance on symmetric structures, probably because they detect inconsistency at the area/block level, which helps avoid the shape distortion of foreground objects and important areas caused by over-compression or stretching. In [52], semantic information is the primary measurement feature, so it achieves the best performance on textures. The method in [43] detects whether foreground objects are present, so it scores best on images with foreground objects. Table 6 compares IRQA methods on the CUHK dataset. As in Table 5, the best-performing IRQA methods mainly use the backward registration algorithm, and the evaluation methods proposed in [6, 82] perform outstandingly. Because [31] reports no experimental results on these two datasets with the corresponding indicators and its source code has not been published, its performance is not listed in this paper.
Descriptions of abbreviation and notation used in this paper
To verify the generalization ability of IRQA methods, cross-dataset comparison has been adopted since 2018: the parameters are trained on one dataset and tested on another. In these experiments, a virtual DMOS is assigned to each retargeted image in the RetargetMe and NRID datasets so that their scores correspond to those of CUHK, which ensures that the three datasets share the same scoring standard. Such experiments have been carried out in [6, 52], and some comparative results are shown in Tables 7 and 8. The bold entries on the right are the scores obtained when the training and testing datasets are the same. The cross-dataset results are second only to these, which indicates that the proposed evaluation indexes generalize well.
Cross databases performance in [46]
Cross databases performance in [69]
IRQA is important in image retargeting and is mainly applied in two ways. First, it can evaluate the performance of different image retargeting algorithms; second, it is widely used to optimize multi-operator image retargeting by determining which operator to use in each iteration to obtain the best retargeted image. This paper first introduces the commonly used registration algorithms and then classifies and summarizes IRQA methods according to their feature measurement methods. The datasets and evaluation indicators are outlined, and the performance of IRQA methods using different registration algorithms is compared. Although some achievements have been made in image retargeting quality assessment, it still faces the following limitations and challenges.
Most recent IRQA methods first establish a dense correspondence between the original image and the retargeted image. However, registration is inaccurate in some image areas, which leads to inconsistency between objective and subjective evaluations. With the continuous development of deep learning, we hope to use it to establish the correspondence between the original and retargeted images and thereby reduce registration errors. Registration methods based on deep learning have gradually matured in other fields, such as remote sensing images [84] and image segmentation [85]. As reported in [84], using a CNN to generate robust multi-scale feature descriptors for matching yields better results than SIFT matching. In the future, we will try these methods in IRQA.
Integrating semantic features into quality assessment is increasingly influential [86]. At present, image quality evaluation methods rely mostly on low-level and intermediate-level features, and high-level semantic features are often treated as additional features. Deep learning [87] is a powerful hierarchical feature representation tool for image classification and target recognition. In the future, we hope to accurately detect the semantic features of retargeted images and use the extracted semantic features to train a network for quality prediction.
IRQA often needs to evaluate results from multiple angles, and early work that added aesthetic and symmetry evaluation to quality evaluation achieved good results. However, symmetry detection is not clearly defined and its accuracy is not high, which limits its application. In recent years, symmetry in image retargeting has also been studied [88, 89], and aesthetic evaluation with deep learning [90, 91] has also made progress. If symmetry and aesthetic assessments can be integrated into IRQA, the consistency between objective and subjective evaluations will improve.
In recent years, fuzzy set theory has been widely used because it is closer to the human thinking process. Inspired by the fusion operation in the ambiguous D-means fusion clustering algorithm [73], the fuzzy threshold method could be used to convert the image to the fuzzy feature plane through a membership function and to assign different fusion weights to different measurement features, which may further improve the performance of IRQA. As IRQA algorithms continue to mature, the quality evaluation of stereoscopic image retargeting [92, 93] will be further developed.
