Abstract
This paper proposes a novel method for video quality evaluation based on a machine learning technique. The research addresses the correct interpretation of objective video quality evaluation (Quality of Service – QoS) in relation to subjective end-user perception (Quality of Experience – QoE), typically expressed by the mean opinion score (MOS). Our method interconnects the results obtained from objective and subjective video assessment methods by means of a neural network (a computing model inspired by biological neural networks). So far, no unified interpretation scale has been standardized for both approaches, so it is difficult to determine the level of end-user satisfaction from an objective assessment. The contribution of the proposed method therefore lies in describing how to create a hybrid metric that delivers a fast and reliable subjective score of perceived video quality for internet television (IPTV) broadcasting companies.
Introduction
Although there were a few online service providers during the 1980s, they offered only limited capabilities for accessing the Internet, such as e-mail and file exchange. The first real Internet service providers (ISPs) as we know them today started to appear in the 1990s, especially after the introduction of the World Wide Web. However, the best starting position on this market belonged to the cable television companies and telecommunication providers. These companies were able to adapt their existing infrastructure to new kinds of services without the huge cost of building a network. New techniques such as the Integrated Services Digital Network (ISDN) or the digital subscriber line (DSL) ensured better use of the existing network bandwidth for added services, and no additional resources for costly cable laying were required. As a result, those companies usually became the dominant providers in the national ISP markets. Ongoing digitalization and the convergence of transmission systems during the last decade of the previous century allowed previously separate services, such as voice and video, to be transmitted over a unified digital infrastructure. Television broadcasting over packet networks has rapidly grown in popularity among internet service providers in recent decades.
From today’s perspective, the Open Systems Interconnection (OSI) model promoted by telecommunication companies remained a conceptual model, while TCP/IP came into widespread use for internetworking on multi-vendor networks. However, the IP network was not designed to transmit real-time services such as video or audio because of its “best-effort” delivery. The User Datagram Protocol (UDP) is a connectionless datagram protocol and, like IP, it is an “unreliable” protocol that offers no guarantee of delivery.
This can cause packet loss, especially during periods of high network utilization. In the past, the Quality of Service approach represented the main evaluation framework. However, QoS cannot precisely characterize the user’s perception, although metrics such as SSIM (structural similarity index) or VQM (Video Quality Metric) try to model the human visual system (e.g. by weighting contrast and brightness differently). Hence, the framework of Quality of Experience (QoE) was created and gradually standardized by the International Telecommunication Union (ITU) [11, 12]. This concept requires real observers who subjectively assess the video quality.
It is relatively easy for network providers to monitor QoS transmission parameters such as packet loss or overall delay, but it is impossible for them to run subjective tests with continuous and immediate evaluation. In addition, providers do not know how to interpret the score obtained from objective metrics, or whether an intervention in the network is needed because of potentially poor video quality delivered to end users. This led to attempts to analyze the gap between the perception of video quality by the human brain and the distortions occurring in the transmission system. In contrast to voice service (where the mapping function that translates objective results into a subjective score is standardized by the ITU), this issue remains a crucial research topic for video service, especially IPTV.
We propose a mapping function in the form of a software application that interprets the objective outcome and translates it into a subjective score. Our hybrid method, based on a neural network, thus provides a monitoring tool capable of predicting the QoE satisfaction score from an objective metric (SSIM), the network packet loss representing network behavior, and the qualitative characteristics of the video sequences.
Related works
The amount of data generated by end-users is enormous, and multimedia accounts for most of it. In the last decade, video has become the dominant part of all multimedia content sent through IP networks. Every minute, hundreds of hours of video content are uploaded and millions are watched. Moreover, IPTV providers and media service providers have become very popular in recent decades. Video content is more demanding in terms of network throughput than other multimedia content. Fortunately, aside from video conferences and video calls, video service is one-directional, as opposed to traditional communication services, so network delay is less critical than for voice service. Therefore, packet dropping caused by insufficient capacity of the transmission system is a major factor in the overall video quality [8].
Two main aspects are usually addressed in papers dedicated to this topic. The first is the robustness of video codecs against network packet loss, which causes video artifacts and defects when the stream is reconstructed for playback on the receiving side. The second is the compression efficiency and complexity, which are reflected in the required computational performance [8, 9].
New codecs such as VP9 (Google) and H.265 (a partnership between MPEG and ITU-T) are regularly compared with their predecessors shortly after release. QoS metrics are limited in their ability to capture the end-user’s visual experience. Evaluating video quality in terms of QoE instead of QoS should provide a more useful solution for two reasons. Firstly, an improvement in QoS does not directly translate into an increase in QoE; secondly, it can be associated with higher costs that lead to lower profit and thus to lower overall effectiveness without the desired impact. A real visual assessment requires real subjects, but such testing is time-consuming and needs observers who complete the survey voluntarily or for a small fee or other reward. Because of that, there is a tendency to use objective metrics that simulate human perception. These metrics are based on mathematical models and require the original reference video sequence to compute a quality score for the degraded test sequence [13].
The key objective and motivation of this work is to find a method capable of linking the findings of the objective and subjective methods. The mean opinion score (MOS) is a measure used in the field of QoE in the IT and communication sector. The rating is usually expressed on a five-level scale of perceived quality, ranging from “bad-1” to “excellent-5” [11]. However, every objective metric uses its own scale, and no unified mapping function for interpreting objective results has been established so far, especially for expressing the subjective score on the MOS scale. The first endeavor to use a neural network for this task is called PSQA (Pseudo Subjective Quality Assessment) [18]; its authors used objective qualitative parameters such as packet loss or bitrate as inputs for neural network training with the aim of calculating a subjective rating. Although only the old MPEG-2 codec together with low resolution (352
Valderrama et al. [27] selected different attributes for the neural network training process, such as various lengths of the group of pictures (GOP) and prioritization policies (best-effort and DiffServ), together with bottlenecks in the test network topology. Although Pearson’s coefficient exceeded 0.9 in their paper, only low-resolution pictures (740
The main benefit of the next work [16] is the improvement of the SSIM method. The research team led by WoeiTan Loh incorporated the concept of spatial and temporal video quality into the SSIM index. They obtained improvement in accuracy; but on the other hand, the computational time of the proposed method was 50% higher when compared to the “classic” SSIM metric.
The new generation of mobile networks offers high-speed internet connection (especially the 4
The newer H.265 codec, with a Pearson’s coefficient slightly over 0.92 but without UHD resolution, was used in an effort to synthesize a regression function for subjective score prediction in the papers of Cheng et al. [10] and Anegekuh et al. [4]. Alreshoodi’s team designed fuzzy-logic rules for QoE prediction using specific qualitative attributes of the video sequence, namely the spatial and temporal information [3]. Their fuzzy inference system estimates QoE on the MOS scale with a correlation of more than 0.95. Although much recent research in video quality assessment uses supervised learning methods, one paper applies unsupervised learning (clustering) to video quality assessment [28]. The authors used the LAMDA learning algorithm to calculate the global membership degree of a sample with respect to a class, considering the contributions of all its attributes (physical variables or features). Therefore, the output of this method is not a number (MOS value); instead, the video sequence is assigned to a predefined class representing a subjective quality perception. Low resolution of the selected video sequences (720
All the above-mentioned works attempt to design a computational model capable of extrapolating the subjective perception of video quality from a set of objective parameters. This paper analyzes the impact of all crucial parameters, such as the codec used, the bitrate, the resolution and even the type of motion present in the scene. The impact of the individual parameters on the prediction accuracy is also of interest.
Methodology
Video content exists in different resolutions and refresh rates, combined with various bitrates and codecs, but H.264/AVC (MPEG-4 Part 10) is currently the most used compression standard for high-definition formats. The main improvements over its predecessor lie in the adjustable block size of motion compensation, redundant pictures and multiple reference frame motion estimation [9]. High Efficiency Video Coding (HEVC), also known as H.265, is considered its successor and brings a better level of data compression. A similar video quality requires about half the bitrate of its predecessor, but at the expense of increased computational complexity. HEVC also supports screen resolutions up to 8K. The primary changes in HEVC are [21, 25]:
Motion compensation. Intra and inter-prediction is enhanced by the various block size up to 64 Motion vector prediction with better accuracy results in a lower residual error and can use up to 35 directions, while H.264 can only use 9.
Adaptive Motion Vector Prediction – an advanced method for the inter-prediction process.
Sample Adaptive Offset – an additional filter for reduction of block-edge artifacts.
As already mentioned, the major disadvantage of this higher efficiency, and consequently higher complexity, is the higher demand for computing power, i.e. more powerful hardware is needed to encode video of the same quality as with H.264 [26].
TV broadcast bitrates typically range from 10 to 15 Mbps, and according to tests performed by public service broadcasters, the standard bitrate for HEVC seems likely to settle at 15 Mbps in the near future [8, 9].
High definition (HD) is the current standard video format for terrestrial and cable TV broadcasting, Blu-ray discs and streaming videos over the internet with standardized screen resolutions such as HD (1280
As already mentioned, the IP protocol uses the best-effort approach and therefore offers unreliable connectionless communication without any guarantee of delivery; furthermore, with UDP the sender receives no feedback on whether the transmission succeeded.
When the network is congested and the internal buffers of network components are exhausted, incoming packets may be dropped, or significantly delayed when rerouted around the network bottleneck. Our previous work showed that packet loss higher than 1% caused significant deterioration of video quality; we therefore consider packet loss over 1% unacceptable for video quality [8, 9].
Monitoring jitter and one-way delay is critical for voice service and other types of bidirectional communication. Broadcasting, however, is a one-way data stream from a content distributor to a consumer. Therefore, we have focused primarily on packet loss and its impact on picture quality.
The first decision is to choose an objective and a subjective method for video quality assessment. The score calculated by an ideal “objective model” for quality estimation should equal the score obtained in subjective tests by averaging the ratings of a large number of observers. In general, the objective video quality evaluation methods can be divided as follows [11, 12]:
Full Reference (FR): requires the reference (original) video sequence; the reference is compared with the tested video sequence.
Reduced Reference (RR): some features of the original video sequence (metadata) are sent concurrently with the tested video.
No Reference (NR): the tested video quality is estimated with respect to the pixel domain of a video, utilizes information embedded in the bitstream of the related video format, or performs a quality assessment.
The SSIM method is widely used because of its high correlation with human perception [9, 16]. It is a full reference metric, i.e. the image quality prediction is based on an initial distortion-free image as a reference. The analysis is based on measurements of luminance and contrast together with the structural similarity to the reference video sequence, as shown in Fig. 1. The resulting scores range from 0 (no similarity) to 1 (a sequence identical to the reference). The formula for the final combination represents the similarity measure of a test signal
where:
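For orientation, the combination of the luminance, contrast and structure terms can be sketched with the widely used single-window form of the SSIM index. This is a minimal illustration only: the function name, the stabilizing constants K1 = 0.01 and K2 = 0.03, the 8-bit dynamic range and the use of whole-frame (population) statistics instead of a sliding window are assumptions, not details taken from this paper.

```python
from statistics import fmean


def ssim_global(ref, test, dynamic_range=255, k1=0.01, k2=0.03):
    """Single-window SSIM of two equally sized lists of grayscale pixels.

    Assumption: statistics are taken over the whole frame; practical
    implementations average the index over many local windows.
    """
    if len(ref) != len(test) or not ref:
        raise ValueError("inputs must be non-empty and of equal length")
    c1 = (k1 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = fmean(ref), fmean(test)
    var_x = fmean((p - mu_x) ** 2 for p in ref)
    var_y = fmean((p - mu_y) ** 2 for p in test)
    cov_xy = fmean((a - mu_x) * (b - mu_y) for a, b in zip(ref, test))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

A frame compared with itself yields 1, and any distortion lowers the value, which matches the score range described above.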
There are two subjective methods for video quality assessment standardized by ITU-T. The first is the Degradation Category Rating (DCR) method, which presents sequences in pairs; each pair consists of a reference sequence followed by a test sequence [11]. However, we chose the second method because, in real circumstances, an end-user cannot compare the received video with the original sequence. Absolute Category Rating (ACR) is a category judgment method in which the video sequences are presented individually, i.e. one at a time.
The block diagram of the SSIM index metric [9].
Each presentation is followed by a video quality evaluation by the invited observers, using the standard MOS scale described above. As depicted in Fig. 2, the voting time is limited to ten seconds.
Stimulus presentation of the ACR method [11].
Before creating a neural network model that can extrapolate the subjective quality rating from the objective score and the qualitative parameters of the video, it is necessary to prepare test video sequences of appropriate quality. None of the approaches described in the Related works section can predict quality for both H.264 and H.265 in a single model.
The character of a scene, from sports broadcasts through action movies to TV news, can be described by its temporal information (TI) and spatial information (SI). Recommendation ITU-T P.910 [11] defines several categories of video content based on these two parameters. Scenes covering a range of SI and TI values were released by a research team at Shanghai Jiao Tong University. These video sequences last only 10 seconds and contain 300 frames, i.e. the frame rate is 30 frames per second. The descriptive characteristics of these scenes are displayed in Fig. 3 [24].
SI and TI values of tested video sequences [24].
Based on their different SI and TI values, i.e. their positions in the depicted graph, we selected 4 video sequences (Fig. 4):
Four testing video sequences [24].
UHD resolution with 4:2:0 chroma subsampling and 8 bits per channel (16.7 million colors – true color) was used in all video sequences (a typical TV broadcasting profile). At the beginning of the database creation process, we downloaded the selected uncompressed video sequences in the YUV format from the web server [24]. Next, the FFmpeg tool (with the x264 and x265 encoders) was used to encode all sequences into both standards (H.264/AVC and H.265/HEVC). Three bitrates were used (5, 10 and 15 Mbps), and the size of the GOP (Group of Pictures) was set to half of the frame rate (M
The video stream was captured and saved via a local computer interface (using VLC Player). The Real-time Transport Protocol (RTP) was used to stream the video, and each RTP packet was encapsulated in a UDP datagram. The MPEG-TS digital container format was used to simulate the IPTV broadcast system. In total, 432 video sequences were created.
The quality of the predictions and results depends largely on the database size, especially when training artificial intelligence models. An insufficient database can lead to under-fitting, so a large training database is required for more accurate predictions. Nevertheless, the amount of video data in the created database can be considered adequate for the purposes of this model. A subjective evaluation of these video sequences was required as a reference output for the designed model. The testing room (lighting conditions, viewing distance) with a TV screen (24” Dell P2415Q UHD) was prepared to meet the recommendations stated in [12]. Sixty observers aged 18 to 35 participated, with men outnumbering women 38:22. Observers had a short break every half hour, and a whole session took two hours [12]. The ACR method described above was used for the subjective assessment. The complete list of input and output parameters is summarized in Table 1, and the whole process is depicted in Fig. 5.
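The encoding step described earlier (x264 and x265 encoders, three bitrates, GOP size of half the frame rate) could be scripted roughly as follows. The exact FFmpeg options used by the authors are not given, so the option set, the file names, the UHD frame size and the direct MPEG-TS output are assumptions made for illustration; the sketch only builds the argument list and does not run FFmpeg.

```python
def build_encode_command(src, dst, codec, bitrate_mbps, fps=30,
                         size="3840x2160"):
    """Return an FFmpeg argument list for encoding a raw YUV 4:2:0 file.

    Hypothetical helper: src/dst names and the chosen options are
    illustrative, not the authors' actual command line.
    """
    encoders = {"h264": "libx264", "h265": "libx265"}
    gop = fps // 2  # GOP size set to half of the frame rate
    return [
        "ffmpeg",
        # Raw video has no header, so its geometry must be stated up front.
        "-f", "rawvideo", "-pixel_format", "yuv420p",
        "-video_size", size, "-framerate", str(fps),
        "-i", src,
        "-c:v", encoders[codec],        # select the x264 or x265 encoder
        "-b:v", f"{bitrate_mbps}M",     # target video bitrate
        "-g", str(gop),                 # maximum distance between keyframes
        "-f", "mpegts", dst,            # MPEG-TS container, as used for IPTV
    ]


# One command per codec and bitrate for a single (hypothetical) source file.
commands = [
    build_encode_command("scene.yuv", f"scene_{c}_{b}M.ts", c, b)
    for c in ("h264", "h265")
    for b in (5, 10, 15)
]
```

Each list can be passed to `subprocess.run` on a machine where FFmpeg with both encoders is installed.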
List of input and output parameters
The assessments made by the subjects can be highly volatile, so we had to verify how accurately the calculated mean represents the subjective perception. For this purpose, the variation coefficient, i.e. the ratio of the standard deviation to the mean, was calculated for each tested video sequence. A value higher than 50% indicates significant dispersion of the collected data, in which case the arithmetic mean cannot be used as a representative value [13].
However, only 50 of the 432 subjective results showed a variation coefficient higher than 35%, and none of them exceeded 40%. The following deductions can be made based on these results:
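The check described above can be sketched in a few lines. The function names are illustrative, and the use of the sample standard deviation (rather than the population one) is an assumption, since the paper does not state which was used.

```python
from statistics import mean, stdev


def variation_coefficient(scores):
    """Ratio of the standard deviation to the mean, in percent."""
    return stdev(scores) / mean(scores) * 100.0


def mos_is_representative(scores, limit_percent=50.0):
    """True if the arithmetic mean (MOS) may represent the ratings."""
    return variation_coefficient(scores) <= limit_percent


# Hypothetical ACR ratings of one sequence on the 1..5 scale.
ratings = [4, 4, 5, 3, 4]
cv = variation_coefficient(ratings)
```

For the hypothetical ratings above the coefficient is about 17.7%, well below the 50% limit, so the mean would be accepted as the MOS of that sequence.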
When packet loss occurs, it affects the newer H.265 (HEVC) codec more than its predecessor H.264; the successor thus seems to be more vulnerable to network distortion.
More data-demanding transmission associated with a higher bitrate shows lower resistance to packet loss.
A large variability of the SSIM values relative to the MOS scale was observed, as described in Table 2. The good quality MOS evaluation (at least 4) was connected to the lowest SSIM value of 0.945, while the average quality mark (MOS
Relatively static scenes such as “Construction Field” and “Campfire Party” achieved better ratings than more dynamic sequences such as “Marathon runners” and “Wood”. It is easier to compute missing data in a video sequence where the displayed objects do not drastically change position or color in the scene. Therefore, the algorithm is more effective with less dynamic content [1, 14].
Measured SSIM intervals related to the MOS scale (N/A – not available)
The whole procedure of creating and evaluating the test video sequences.
Contrary to the assumption that screen resolution affects perceived quality, no evidence was found to support this expectation, either positively or negatively. In video streaming, most packets belong to I-frames, and thus most dropped packets affect I-frame integrity. With stronger compression of the I-frame, the quality loss can be more noticeable at high resolution than at low resolution, assuming comparable bitrates. On the other hand, in MPEG-TS transmission one UDP packet carries 7 transport stream (TS) packets of 188 bytes each, which hold the Packetized Elementary Stream (PES) data. One such packet therefore holds less fundamental information at a higher resolution, so its loss should be less harmful to macroblock reconstruction, with a positive impact on error propagation within the GOP. The final perceived video quality is mostly influenced by factors such as bitrate, codec type, and the temporal and spatial information of the reproduced scene [3, 5].
The impact of packet loss on the results for a high-motion scene compared with a static scene is shown in Figs 6 and 7. High-motion scenes include dynamic camera motion, object motion, or both; due to space limitations, the “Wood” scene was selected to represent this kind of scene. The “Construction Field” scene was chosen to represent the opposite.
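To give a feeling for the packet arithmetic mentioned above (seven 188-byte packets per UDP datagram), the following sketch estimates how many datagrams per second a stream produces and how many 188-byte packets are lost on average at a given loss ratio. It assumes, for simplicity, that the whole stated bitrate is carried as datagram payload and ignores RTP, UDP and IP header overhead; the function names are illustrative.

```python
PACKET_BYTES = 188          # size of one 188-byte transport packet
PACKETS_PER_DATAGRAM = 7    # packets carried in one UDP datagram


def datagrams_per_second(bitrate_mbps):
    """Approximate number of UDP datagrams needed per second."""
    payload_bits = PACKET_BYTES * PACKETS_PER_DATAGRAM * 8  # 1316 bytes
    return bitrate_mbps * 1_000_000 / payload_bits


def lost_packets_per_second(bitrate_mbps, loss_ratio):
    """Expected number of 188-byte packets lost per second."""
    return (datagrams_per_second(bitrate_mbps)
            * loss_ratio * PACKETS_PER_DATAGRAM)
```

Under these assumptions a 15 Mbps stream needs roughly 1425 datagrams per second, so a loss ratio of 1% removes on the order of one hundred 188-byte packets every second.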
SSIM results for the highly dynamic scene Wood (left) and the relatively static scene Construction (right) in UHD resolution.
ACR (MOS) results for the highly dynamic scene Wood (left) and the relatively static scene Construction (right) in UHD resolution.
All the above-mentioned data, i.e. the objective and subjective assessments, are included in the neural network (NN) database, which serves for training, validating and testing the neural network topology and its activation function. The NN inputs also include various transmission system properties for different model scenarios.
An artificial neural network is a computing system inspired by biological neural networks. Neural networks currently provide some of the best solutions to many problems, such as recognition tasks (images, emotions) or the creation of prediction models (e.g. voice/video quality approximation).
The Multi-Layer Perceptron (MLP) is a feedforward neural network consisting of one input layer, at least one hidden layer and one output layer. A single perceptron works well as a so-called binary classifier: it can distinguish two states (zero or one). For more sophisticated problems, we need to choose an activation function that keeps the binary character of the perceptron while offering better adaptability for classification into multiple classes.
Feedforward means that data moves from the input layer to the output layer. This type of network is trained by the backpropagation learning algorithm. MLPs are widely used for prediction, pattern classification, recognition and estimation, and they can solve problems that are not linearly separable. A principal benefit of neural networks is the iterative learning process, during which the samples of the dataset are presented to the network one at a time and the weights associated with the input values are adjusted each time. After all cases have been presented, the process often starts over again. During this learning stage, the network learns by calibrating the weights, which allows it to predict the proper outcome for input samples.
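The binary character of a single perceptron can be illustrated with a minimal sketch that learns the logical AND function using a step activation. The training data, learning rate and number of epochs are illustrative choices and are not related to the model built in this paper.

```python
def step(x):
    """Step activation: the neuron outputs one of two states."""
    return 1 if x >= 0 else 0


def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(samples, epochs=20, lr=1.0):
    """Classic perceptron learning rule for a single neuron."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Weights change only when the prediction is wrong.
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


AND_SAMPLES = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(AND_SAMPLES)
```

Because AND is linearly separable, the rule converges after a few passes; a problem such as XOR would not, which is why hidden layers and smoother activation functions are needed.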
The error value indicates how much the network’s output (actual) is off the mark from the expected output (target). We used the Mean Squared Error (MSE) function to calculate the error. The delta rule (gradient descent learning rule for updating the weights of the inputs) is derived by attempting to minimize the error in the output of the neural network through gradient descent [29]. The error value of a single output neuron is a function of its actual value and the target value.
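As a hedged illustration of the delta rule, the sketch below trains a single linear neuron by gradient descent on the squared error and reports the MSE. The toy data (points on the line y = 2x + 1), the learning rate and the epoch count are assumptions chosen for the example only.

```python
def mse(targets, outputs):
    """Mean squared error between target and actual outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


def train_linear_neuron(samples, lr=0.05, epochs=500):
    """Delta rule: move each weight against the gradient of the error."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            output = weight * x + bias
            delta = target - output      # error of the single output neuron
            weight += lr * delta * x     # update proportional to the input
            bias += lr * delta
    return weight, bias


SAMPLES = [(0, 1), (1, 3), (2, 5), (3, 7)]   # toy data: y = 2x + 1
w, b = train_linear_neuron(SAMPLES)
final_error = mse([t for _, t in SAMPLES], [w * x + b for x, _ in SAMPLES])
```

Backpropagation applies the same idea layer by layer, passing the output error backwards through the hidden layers.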
The total error of the network is a sum of all error values from all output neurons. For a network with
The advantages of neural networks include their high tolerance to noisy data, as well as their capability to categorize patterns which they have not been trained for.
The number of layers and the number of neurons per layer are crucial choices. For a feedforward topology trained with backpropagation, these settings reflect the “art” of the network designer. Minimizing the difference between the target outputs and the actual outputs of our model plays a significant role in network modeling. To keep the neural network from overfitting the training dataset, we applied dropout and k-fold cross-validation, both of which can reduce overfitting. We also considered computing time as one of the decision parameters, because a simpler topology with a shorter computation time offers better generalization ability.
The activation function is an important part of artificial neural network modeling. It decides whether a neuron will be activated or not. In fact, it is usually a non-linear mapping function between the inputs and the response variables [5]. Finding the best-fitting activation function was therefore the second step of NN modeling. During NN training, we tried several well-known activation functions (sigmoid, hyperbolic tangent, rectified linear unit – ReLU) and chose the best one based on the measured values (Pearson correlation coefficient – PCC, root mean square error – RMSE, and computation time [13]). The performance of the activation functions on the testing set is shown in Table 3. For NN modeling we used the Python programming language (version 3.5.2) with the Keras and TensorFlow machine learning libraries [15].
The first three steps load the training dataset (the code creates two arrays, X and Y, where X is the input vector and Y holds the output variables) and call the selected activation functions for analysis. Steps 4 and 5 run a loop in which each activation function is applied in turn to the hidden layers, and the accuracy and MSE are computed from how close the actual model outputs are to the target outputs (the results obtained for all topologies were saved automatically). RMSE is the square root of the MSE.
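The k-fold cross-validation mentioned above can be sketched as a generator of index splits. The value of k is not stated in the paper, so k = 5 is an assumption, and the function name is illustrative.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size
```

Each sample is used for validation exactly once and for training in the remaining k − 1 rounds; in practice the indices would be shuffled first.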
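For reference, the three activation functions compared here have the following standard definitions. The sketch uses plain Python for clarity; in the actual model they would be provided by the machine learning library.

```python
import math


def sigmoid(x):
    """Logistic function, output in the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent, output in the interval (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)
```

Sigmoid and tanh saturate for large absolute inputs, whereas ReLU stays linear for positive inputs, which is one reason their training behavior differs.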
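The two accuracy measures used for the comparison, RMSE and the Pearson correlation coefficient, can be computed as in the following minimal sketch. The function names and the example values are illustrative.

```python
from math import sqrt


def rmse(targets, predictions):
    """Root mean square error: the square root of the MSE."""
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sqrt(sum(squared) / len(squared))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

A model whose predicted MOS values follow the observers' scores closely yields a PCC near 1 and a small RMSE.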
Activation functions accuracy performance
In general, a neural network is overfitted when it fits the training set so closely that it has difficulty generalizing and providing the correct classification for new (unknown) inputs. Good practice is therefore to divide the dataset into three parts: a training set, a validation set (a decreasing MSE on the training set together with an increasing MSE on the validation set indicates overfitting) and a test set (which takes no part in training or validation). Since our dataset did not contain millions of entries, a relatively smaller portion was dedicated to the training set and a larger portion to the validation and test sets, specifically 70:15:15 (huge datasets are usually divided in an 80:10:10 or 90:5:5 ratio).
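A 70:15:15 split of this kind can be sketched as follows. The shuffling seed and the function name are assumptions; any list of samples can be passed in.

```python
import random


def split_dataset(samples, train=0.70, validation=0.15, seed=0):
    """Shuffle the samples and split them into train/validation/test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # fixed seed for repeatability
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])      # the remainder is the test set


train_set, val_set, test_set = split_dataset(range(432))
```

The test part is put aside and only used once the topology and activation function have been chosen.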
Another popular technique that is supported by Keras library is called Dropout [15]. This concept is actually very simple – every neuron (except those belonging to the output layer) is given the probability
In our case, we tested more than 140 topologies (up to 8 layers and a maximum of 230 neurons per layer). We created an automated Python script for this purpose (steps 8–10 in algorithm 3 provide the loop for adding hidden layers and neurons).
The learning rate was experimentally set to 0.0013 (we started with a value of 0.001), and the maximum number of epochs was limited to 300 (in order to prevent overfitting). Table 4 shows the most accurate topologies in terms of the correlation coefficient.
List of best rated topologies
Creation of the dataset resulted in 25 920 (432
The proposed model works with two video codecs, namely H.264 and H.265. The testing dataset, which did not participate in the training and validation phases, was divided into two equal groups, one for H.264 and one for H.265.
This step helped us determine whether the prediction accuracy is similar for both codecs. All statistical tests were performed at a significance level of 0.05. Table 5 compares the median, mode, confidence intervals and standard deviation of the reference and testing sets (prediction of subjective perception in MOS) for both video codecs. As shown in the table, the reference and model datasets are very similar. Since the Shapiro-Wilk test rejected the hypothesis that the data are normally distributed, the nonparametric Mann-Whitney-Wilcoxon U test was chosen as the next statistical method [13]. This test can be used to determine whether two independent samples come from the same population; it compares the ranks of the two samples rather than assuming normality. The calculated values for this test are shown in Table 6.
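The rank-based comparison can be sketched as follows. This is a simplified illustration: it uses the normal approximation for the two-sided p-value without tie or continuity correction, which is an assumption; a statistics package would normally be used for the reported values.

```python
from math import erf, sqrt


def mann_whitney_u(a, b):
    """Return (U, two-sided p-value) for two independent samples."""
    combined = sorted((v, g) for g, sample in enumerate((a, b))
                      for v in sample)
    rank_sum_a = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2           # tied values share a mean rank
        rank_sum_a += avg_rank * sum(1 for k in range(i, j)
                                     if combined[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u1 = rank_sum_a - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u                  # z <= 0 because u <= mean_u
    p_value = 1 + erf(z / sqrt(2))           # equals 2 * Phi(z)
    return u, p_value
```

At the 0.05 significance level, a p-value above 0.05 means that the hypothesis of both samples coming from the same population is not rejected.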
Basic statistical investigation
Results of the Mann-Whitney-Wilcoxon test
The relative error distribution for the validation set.
The
The last phase of verification was a comparison with recently published models. As shown in Table 8, our proposed method based on a backpropagation neural network (BPNN) achieved better results in all cases except the first one (for the now outdated MPEG-2 codec), which had provided the basis and motivation for our research. Our model can estimate the subjective perception of video quality for two codecs and three resolutions, which none of the previously mentioned models offers.
Prediction accuracy for the test set
Comparison of input parameters and statistical assessment with published models for subjective prediction of video quality
NOTE: ND – not defined, X – presented, BR – bitrate, FR – framerate, PLR – packet loss rate, RNN – recurrent neural network, DBN – deep belief network, ABFR – adaptive basis function regression, FIS – fuzzy inference system.
This paper presents a new method for predicting the subjective score of video quality. We identified several key parameters influencing the end-user’s perception of video quality and measured the impact of different features (packet loss, bitrate, codec, resolution, scene dynamics). Four scenes, three bitrates and three resolutions, affected by a packet loss ratio of up to 1%, offered many scenarios reflecting the real situation of an IPTV service. The proposed model is unique because it offers subjective quality estimation in real time for the most used IPTV video codecs, H.264 and H.265. Video on demand and IPTV account for a dominant part of all data transferred via packet-based networks. Netflix and HBO Go are popular services nowadays, and Netflix, for example, has developed its own objective assessment metric, which will be analyzed in a future study. Although that kind of service is not operated in real time (whereas IPTV is sensitive to packet loss due to its real-time character), service providers need to know what quality of service they deliver to customers. Objective video metrics do not convey this information as well as subjective tests do. On the other hand, subjective tests cannot be performed in real time. Thus, a mapping function or hybrid metric is a solution for all service and content providers. From the collected data, we also extracted the SSIM intervals that correspond to at least level 3 on the MOS scale. If the picture quality is worse than 3 on the MOS scale, the offered service is poor, with annoying visual impairment; the results shown in Table 2 can thus serve for very fast video quality estimation.
The model runs as a Python application (on both Linux and Windows operating systems) and is intended as a tool for providers to monitor the impact of network behavior on customers’ subjective perception.
Acknowledgments
This work was supported by the Institutional research of Faculty of Operation and Economics of Transport and Communications – University of Zilina, no. 11/PEDAS/2019. The research was also supported by the Czech Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPUII) Project “IT4Innovations excellence in science” reg. no. LQ1602.
