Abstract
Neural style transfer is an optimization technique that combines two images, a content image and a style reference image, to produce an output image that retains the structure of the content image but is rendered in the style of the style reference image. This is achieved by iteratively updating the output image so that its content statistics match those of the content image and its style statistics match those of the style reference image. These statistics are extracted from the images using a convolutional network. Early models such as WCT and Deep Photo Style Transfer were followed by PhotoWCT, whose spatial and temporal limitations motivated further work. Eventually, wavelet transforms were introduced to perform photorealistic style transfer. A wavelet-corrected transfer based on whitening and colouring transforms, i.e., WCT2, was proposed that preserved core content and eliminated the need for post-processing steps and constraints. A model called Domain-Aware Universal Style Transfer, which supports both artistic and photorealistic style transfer, was also introduced. This study provides an overview of the neural style transfer technique. Recent advancements in the field, including the development of multi-scale and adaptive methods and the integration of semantic segmentation, are discussed. Experiments have been conducted to determine the roles of the encoder-decoder architecture and Haar wavelet functions, and to ascertain the settings at which these can be leveraged for effective style transfer. The study also contrasts the VGG-16 and VGG-19 structures and analyzes various performance parameters to establish which works more efficiently for particular use cases.
A comparison of quantitative metrics across Gatys, AdaIN, and WCT showed a gradual improvement across the models: AdaIN performed 99.92 percent better than the original Gatys model in terms of processing time. Over 1000 iterations, we found that VGG-16 and VGG-19 have comparable style loss, but their content loss differs by 73.1 percent. VGG-19, however, displays better overall performance since it keeps both content and style losses low.
Introduction
Neural style transfer took shape after Gatys et al. [1] first developed the idea of using Convolutional Neural Networks to render the content of one image in the style of another. It can broadly be classified into two categories, artistic and photorealistic neural style transfer, depending upon the aesthetic of the final image produced. After the initial developments in artistic style transfer, successive improvements over the first model produced works such as the Instance Norm Model, Conditional Instance Norm Model, and Adaptive Instance Norm Model. This eventually culminated in the Whitening and Coloring Transform (WCT) Model, which was faster and allowed more diverse style transfer. Gradually, there was also a need to produce more lifelike and authentic results that would not merely qualify as art pieces. This need paved the way for photorealism. Rudimentary methods such as Deep Photo Style Transfer (DPST) could only produce low-quality images when extended to photorealistic style transfer. In 2018, PhotoWCT was introduced. Here, nearest-neighbor upsampling was discarded in favor of unpooling to maintain the structural integrity of the chosen content image. This was later replaced with wavelet pooling and unpooling modules (WCT2) to remove post-processing steps and enable stylization with minimal losses and fewer temporal and spatial constraints. Recently, a lot of work has been done in the domain of neural style transfer. One application is video style transfer, which aims to extend the style of an image to the frames of a video. Photorealistic style transfer now has applications ranging from AR/VR to game development, neuroaesthetics, filmmaking, and social/communication creative tools. Over the years, a number of models have been developed and improved upon. It thus becomes even more important to compare and contrast these state-of-the-art models and identify their strengths and weaknesses.
This work aims to run experiments to perform neural style transfer and trace different facets of it. Each aspect of the technique is varied by conducting a series of studies to determine the best operating setting for performing style transfer. Different architectures are employed and their quantitative as well as qualitative differences are identified to decode the benefits of using one over the other.
Our contributions are as follows:
Neural style transfer was performed using VGG-16 and VGG-19.
WCT Core and WCT Core Segment were executed successfully.
An exhaustive analysis was done to determine the role of convolutional layers in the encoder-decoder architecture.
Haar wavelet filter experiments were conducted to determine the optimum settings for the most efficient style transfer.
Style transfer using neural networks is an emerging field of research. The work of Yoo et al. [2] has led the way for further research in photorealistic style transfer. They presented a theoretically grounded modification to the network design that greatly enhanced photorealism and produced images free of spatial distortions and unrealistic artifacts. Wavelet transforms, which fit naturally into deep networks, were the main component of the approach. A wavelet-corrected transfer based on the whitening and coloring transforms (WCT2) was put forth, which enabled features to retain the original structural data and statistical characteristics of the VGG feature space during stylization. It was the first end-to-end model capable of stylizing an image with a resolution of 1024 by 1024 in 4.7 seconds while maintaining an appealing and photorealistic quality.
Luan et al. [3] presented a deep-learning technique for photographic style transfer that faithfully reproduced the reference style while handling a diverse range of visual data. They built on work on painterly transfer, which separates an image’s style from its content by considering the different layers of a neural network. They deemed that approach inappropriate for photorealistic style transfer because its results showed painting-like distortions even when the input and reference pictures were both photos. They demonstrated that their method effectively eliminates such distortions and produces pleasing photorealistic results.
Reinhard et al. [4] employed straightforward statistical analysis to apply one image’s color characteristic to another. It was proposed that color correction can be achieved by selecting a suitable source image and transposing its corresponding characteristics to another image.
Williams et al. [5] stated that the convolutional neural network had a significant role in the advancement of object classification. Network regularization primarily focuses on convolution operations with little focus on pooling operations. They introduced wavelet pooling as a replacement for neighborhood pooling. Their approach led to reducing features in a more structurally compact manner compared to pooling using neighborhood regions. Also, it addressed the overfitting issue that max-pooling encounters.
An et al. [6] proposed solutions to the heavy dependence of photorealistic style transfer results on pre- and post-processing. Their approach included a construction step (C-step) to create a photorealistic stylization network and an acceleration step (P-step). Their search produced a network architecture called PhotoNAS that offered significant acceleration over PhotoNet while retaining almost all of its stylization features.
Hong et al. [7] noted that existing methods are confined to a specific domain by their structural limitations, which prevents them from generating results across other domains. They proposed Domain-aware Style Transfer Networks (DSTN) that capture a particular reference image’s domainness, an attribute of the domain, and transfer it along with the style.
Gatys et al. [1] proposed a method using a Neural Algorithm of Artistic Style, which could deconstruct and reconstitute the style and content of actual photographs. It enabled the creation of new images with excellent perceptual quality, and the results demonstrated the potential of Convolutional Neural Networks to be used for high-level image synthesis.
Huang et al. [8] observed that feed-forward neural networks accelerate neural style transfer through fast approximations, but that such networks are bound to a predetermined set of styles and cannot accommodate arbitrary new styles. They provided a method that permits real-time transfer of arbitrary styles. Its basis is an adaptive instance normalization (AdaIN) layer that aligns the mean and variance of the content features with those of the style features. The system uses a single feed-forward neural network and offers user controls such as a content-style trade-off, style interpolation, and color and spatial controls.
Deng et al. [9] developed a brand-new database dubbed “ImageNet” to manage and organize image data. Three straightforward applications in object recognition, image classification, and autonomous object clustering were provided as examples of how beneficial ImageNet is in terms of its size, precision, diversity, and hierarchical structure.
Nijaguna et al. [10] incorporated the usage of VGG-16 and Resnet-50 models for feature extraction while using the QFFA (Quantum Fruit Fly Algorithm) technique to improve the effectiveness of classification in medical diagnosis. The QFFA model has provided better results in terms of sensitivity and accuracy over existing models.
Fujieda et al. [11] noticed that while convolutional neural networks (CNNs) have made tremendous progress in image classification, classifying textures remains challenging. They proposed a new CNN design called wavelet CNNs, which incorporates spectral analysis. Their insight was that the pooling and convolution layers can be seen as a constrained form of spectral analysis. Wavelet CNNs leverage spectral information that is lost in traditional CNNs but is helpful in texture classification.
Plenty of roadblocks were also encountered along the way. Most models are restricted to particular styles, which is an inflexible approach and leaves only a minimal selection of styles that can be transferred successfully. Problems have also arisen in identifying and accurately replicating an image’s texture, which can be overcome to a certain extent using Domain-Aware Style Transfer. In some traditional models, additional post-processing steps often created distortions and increased the time needed to produce results. Optimisation has thus been a significant issue to tackle. Furthermore, the limitations of the SSIM Index as the ultimate metric for quantitative analysis in neural style transfer need to be addressed, and a replacement that is more compatible with diverse forms of data must be found. Many drawbacks were also faced while extending neural style transfer techniques to 3D data representations, and work needs to be done to plug the gaps in this space too.
Comparison of different quantitative metrics
The principal metric for the comparison of results concerning the content image is called the Structural Similarity Index (SSIM). In the case of these photorealistic style transfer models, it measures the degradation of image quality after all the processing has taken place when WCT has been applied at every consecutive convolutional layer. It takes into consideration factors such as loss of information and data compression.
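The SSIM comparison described above can be illustrated with a minimal sketch. It assumes grayscale images with values in [0, 255] and evaluates the standard SSIM expression once over the whole image; practical implementations average the same expression over local windows, so this simplified form is an illustration only and not the evaluation code used in this study. The function name `global_ssim` and the test images are hypothetical.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """SSIM evaluated over a single window covering the whole image (simplified)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Small constants that stabilise the division when means or variances are near zero.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Hypothetical data: a random "content" image and a noisy copy standing in for a stylized result.
rng = np.random.default_rng(0)
content = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
noisy = np.clip(content + rng.normal(0.0, 25.0, size=content.shape), 0.0, 255.0)

score_same = global_ssim(content, content)
score_noisy = global_ssim(content, noisy)
```

An image compared with itself scores 1, and any degradation lowers the score, which is why SSIM is read here as a measure of how much content structure survives stylization.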
When an image is fed to a Convolutional Neural Network, different sub-matrices called feature maps are generated. They contain information about the image’s physical features such as lines and edges. These feature maps cannot be used in their raw form to capture the style of the image. A correlation between these feature maps has to be calculated to get the style information. This is done by flattening each feature map into a vector and multiplying the resulting matrix by its transpose. The resulting matrix is known as the Gram matrix, and the style loss is computed from it.
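The Gram matrix computation described above can be sketched in a few lines of NumPy. The sketch assumes a single layer's activations stored as a (channels, height, width) array and uses the normalisation of Gatys et al. [1] for the style loss; the function names and the random inputs are illustrative assumptions, not part of the experiments reported here.

```python
import numpy as np

def gram_matrix(features):
    """features: (channels, height, width) activations from one layer."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)  # one row vector per feature map
    return flat @ flat.T               # (c, c) correlations between feature maps

def style_loss(generated, style):
    """Squared Gram-matrix difference for one layer, normalised as in Gatys et al."""
    c, h, w = generated.shape
    diff = gram_matrix(generated) - gram_matrix(style)
    return np.sum(diff ** 2) / (4.0 * c ** 2 * (h * w) ** 2)

# Hypothetical activations standing in for CNN feature maps.
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(4, 5, 5))
feat_b = rng.normal(size=(4, 5, 5))
g = gram_matrix(feat_a)
```

Because the spatial positions are summed out, the Gram matrix records which feature maps fire together but not where, which is what makes it a description of style and not of layout.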
Table 1 displays the loss functions used in different publications in the domain of neural style transfer. Table 2 displays the formulae of the different loss functions that can be used.
Loss functions
Formulae of loss functions
Tools and methodology
Our dataset consists of two subsets. The first is the content images. These are the main frame of reference, and the result has to remain consistent with them. The second is the style images. These are the secondary frame of reference: the result, even though based primarily on the content image attributes, is modified artistically depending on what the style image looks like. Therefore, each result image needs two inputs, a content image and a style image. To achieve photorealism, a model should recover the structural information of a given content image while stylizing the image faithfully at the same time.
Input dataset: https://github.com/luanfujun/deep-photo-styletransfer/tree/master/examples/input
Style dataset: https://github.com/luanfujun/deep-photo-styletransfer/tree/master/examples/style
The experimental procedure was as follows:
1. Resize both content and style images to the same size, as this is essential for processing them.
2. Display the images after resizing them.
3. Load the pre-trained VGG-16 and VGG-19 models.
4. Feed the images to the pre-trained VGG-16 and VGG-19 models to perform neural style transfer.
5. Vary the number of iterations over 100, 500, and 1000 for both VGG-16 and VGG-19.
6. Calculate the style and content loss at 100, 500, and 1000 iterations for both VGG-16 and VGG-19.
7. Compare the loss values for all runs. A lower total loss indicates a more successful neural style transfer.
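The iterative update at the heart of this procedure can be illustrated with a deliberately small toy. It is a sketch under strong assumptions: the feature extractor is the identity instead of VGG, only the content loss is minimised, and plain gradient descent with a hypothetical learning rate is used. It shows only how the loss is recorded at the 100, 500, and 1000 iteration checkpoints, and does not reproduce the reported experiments.

```python
import numpy as np

def content_loss(generated, target):
    """Half the squared error between generated and target features."""
    return 0.5 * np.sum((generated - target) ** 2)

def optimise(content, checkpoints=(100, 500, 1000), lr=0.01, seed=0):
    """Gradient descent on the output image, logging the loss at each checkpoint."""
    rng = np.random.default_rng(seed)
    image = rng.random(content.shape)   # start from noise
    history = {}
    for step in range(1, max(checkpoints) + 1):
        grad = image - content          # gradient of content_loss w.r.t. image
        image = image - lr * grad
        if step in checkpoints:
            history[step] = content_loss(image, content)
    return history

# Hypothetical 8x8 "content image" with values in [0, 1).
target = np.random.default_rng(1).random((8, 8))
losses = optimise(target)
```

In the full method the gradient is obtained by back-propagating the weighted sum of content and style losses through the pre-trained network, but the bookkeeping per checkpoint is the same.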
Figure 1 represents the flowchart for experiments conducted using VGG-19 and VGG-16.

Flowchart for VGG-16 and VGG-19.
The second implementation was of the encoder-decoder method. Here, the feature extraction layers of the VGG-19 model serve as the encoder, and a mirrored copy of them serves as the decoder. Specifically, the convolutional layers up to the fourth one are used for the encoder, while the corresponding mirror architecture is used for the decoder. This architecture was trained for image reconstruction, meaning that when given an image as input, the encoder should encode the image and the decoder should output the original image. To evaluate the effectiveness of the encoder-decoder mechanism, certain layers can be skipped and the resulting output analyzed. Figure 2 displays the visual representation of the encoder-decoder method.

Visual representation for encoder-decoder method.
This section describes the strategies employed for implementation. These include experiments using the VGG-16, VGG-19, Encoder-Decoder model, and Haar Wavelet Filter.
VGG-19
VGG-19 was one of the first networks used for photorealistic style transfer. It was applied here using the Keras library. It has 16 convolution layers, 5 MaxPool layers, 3 fully connected layers, and 1 SoftMax layer. One of the ways in which VGG-19 has been used is for superposing content and style. The VGG-19 network is trained using images from the ImageNet database. It was trained on 224

VGG architecture.
VGG16 is a deep convolutional neural network (CNN) architecture. It is a variant of the VGG network family and is widely used for image recognition and classification tasks. There are 16 layers total in the VGG16 architecture, including 13 convolutional layers and three fully connected layers. The fully connected layers identify the image using the convolutional layers’ features extracted from the input image. Convolutional layers with 64 filters of size 3
Encoder decoder structure
The encoder-decoder mechanism for image reconstruction has been used in newer models to improve the original VGG architecture by considerably limiting distortions and data losses. The first four layers of the convolutional network were kept as the encoder, which takes in the input images and enables them to be processed further by making necessary modifications. The decoder, made by replicating the architecture of the encoder, should then give the final output with minimal post-processing losses. After passing through this four-layer tentative encoder-decoder VGG architecture, a few sample images had outputs that resembled the original ones. There were very few post-processing flaws, such as minor blurriness. This indicates the relative effectiveness of the technique mentioned above. One of the four convolutional layers was skipped in an extension of the same, and the results were documented. The resultant images obtained were faintly similar to the original but lacked almost all necessary details.
Haar wavelet filter
The functionality of the Haar wavelet concept was tested by generating four kernels, i.e., LL, LH, HL, and HH filters. These wavelet filters were applied to a few sample images from the dataset. This led to the following observations: The LL filter brings minute changes to an image that are mostly not discernible since it is a low-pass filter that does not capture higher frequencies. The HH filter, however, captures higher frequencies, and its absence creates a major difference in the image quality as it is responsible for the intricacies of the picture. The same has been showcased in Fig. 4.
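The four Haar kernels and their effect can be sketched as follows. This is a minimal NumPy illustration, assuming a single-channel image with even height and width and the usual orthonormal low-pass and high-pass pair; the helper names `haar_pool` and `haar_unpool` and the random test image are hypothetical and do not come from the experiments above.

```python
import numpy as np

# One-dimensional Haar pair; the 2x2 kernels are their outer products.
low = np.array([1.0, 1.0]) / np.sqrt(2.0)
high = np.array([-1.0, 1.0]) / np.sqrt(2.0)
KERNELS = {
    "LL": np.outer(low, low),
    "LH": np.outer(low, high),
    "HL": np.outer(high, low),
    "HH": np.outer(high, high),
}

def haar_pool(image):
    """Split an (H, W) image into four half-resolution sub-bands (stride-2 filtering)."""
    h, w = image.shape
    # blocks[i, j] is the 2x2 patch whose top-left pixel is image[2i, 2j].
    blocks = image.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    return {name: np.einsum("ijkl,kl->ij", blocks, k) for name, k in KERNELS.items()}

def haar_unpool(bands):
    """Invert haar_pool; exact because the four kernels are orthonormal."""
    blocks = sum(np.einsum("ij,kl->ijkl", bands[name], k) for name, k in KERNELS.items())
    h2, w2 = blocks.shape[:2]
    return blocks.transpose(0, 2, 1, 3).reshape(h2 * 2, w2 * 2)

rng = np.random.default_rng(0)
image = rng.random((4, 4))
bands = haar_pool(image)
restored = haar_unpool(bands)
# Reconstruct again with the highest-frequency band removed.
without_hh = haar_unpool({**bands, "HH": np.zeros_like(bands["HH"])})
```

Keeping all four sub-bands reconstructs the image exactly, whereas zeroing a high-frequency band removes detail that the low-pass band alone cannot restore.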

Image and result.
WCT Core Segment applies a semantic segmentation mask to the output before passing it to the next layer of the architecture. A segmentation mask isolates a particular part of an image from the rest. An image and its generated segmentation mask are shown in Fig. 5. Masks accentuate the areas that need to be labeled for study by computer vision models. Semantic segmentation is performed on the input images to avoid content mismatch and information loss, which improves the photorealism of the results. WCT Core is an older technique that does not use segmentation masks; it was used in the models prior to WCT2.
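How a segmentation mask restricts the transfer to matching regions can be sketched as below. This is an illustration under stated assumptions: for brevity, per-channel mean and standard deviation matching is substituted for the full whitening and coloring transform, features are (channels, height, width) arrays, and masks are integer label maps with the same spatial size as the features. The function name `masked_transfer` and the toy masks are hypothetical.

```python
import numpy as np

def masked_transfer(content_feat, style_feat, content_mask, style_mask, eps=1e-5):
    """Match feature statistics separately for each label present in both masks."""
    out = content_feat.astype(np.float64)
    for label in np.intersect1d(content_mask, style_mask):
        c_sel = content_mask == label         # (H, W) boolean region in the content
        s_sel = style_mask == label           # region with the same label in the style
        c_region = content_feat[:, c_sel]     # (C, number of content pixels)
        s_region = style_feat[:, s_sel]
        c_mean = c_region.mean(axis=1, keepdims=True)
        c_std = c_region.std(axis=1, keepdims=True) + eps
        s_mean = s_region.mean(axis=1, keepdims=True)
        s_std = s_region.std(axis=1, keepdims=True)
        out[:, c_sel] = s_std * (c_region - c_mean) / c_std + s_mean
    return out

# Toy inputs: content split left/right, style split top/bottom, labels 0 and 1.
rng = np.random.default_rng(0)
content_feat = rng.normal(size=(3, 8, 8))
style_feat = rng.normal(loc=4.0, scale=2.0, size=(3, 8, 8))
content_mask = np.zeros((8, 8), dtype=int)
content_mask[:, 4:] = 1
style_mask = np.zeros((8, 8), dtype=int)
style_mask[4:, :] = 1
result = masked_transfer(content_feat, style_feat, content_mask, style_mask)
```

Regions whose label does not occur in the style mask are left untouched, which is one simple way to avoid the content mismatch mentioned above.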

Image and its segmentation mask.
A visual representation of the segmentation masks for the content and style images is shown in Fig. 6.

Content image and style and content segments.
In Fig. 7 a comparison of images obtained with and without using segmentation masks is shown.

Image with segmentation and without segmentation.
Generative Adversarial Networks (GANs) are a class of deep learning models that can be employed for performing image transformation. Their architecture consists of a generator and a discriminator. The generator network learns to generate plausible data. It takes random noise as input and generates samples that resemble the data from the training set. The generated instances serve as negative training samples for the discriminator. The discriminator network then acts as a binary classifier that aims to distinguish between the real data from the training set and the fake samples generated by the generator. The architecture of GAN has been shown in Fig. 8.

GAN architecture.
CycleGANs are used to perform image transformation. They are much larger than regular GANs because they use four networks: two are responsible for generation and the other two for discrimination. CycleGAN does not require paired images for translating from one image domain to another. It allows the model to be trained with multiple pictures of different classes. After training, the model has learned how to transform one class into another without requiring any additional style image. CycleGANs can be used in cases when there is not much variation among the images in the style dataset. A sample is shown in Fig. 9.

CycleGAN sample.
Adversarial loss incentivizes the generator to generate realistic images. Cycle consistency loss ensures consistency between input and reconstructed images. CycleGAN employs these two losses during neural style transfer. If the desired style is comparable to the input image content, it may also use an identity loss to preserve that content. To accomplish the desired behavior, the individual losses are weighted and combined into a total loss function. By simultaneously optimizing these losses throughout training, CycleGAN achieves efficient style transfer between two domains without the need for paired data samples.
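The way these losses are weighted and combined can be sketched with toy stand-ins for the networks. Everything below is an assumption made for illustration: the generators and discriminators are simple functions instead of trained networks, the adversarial term uses a least-squares form, and the weights `lambda_cyc` and `lambda_id` are placeholder values.

```python
import numpy as np

def l1(a, b):
    return np.mean(np.abs(a - b))

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """Translating to the other domain and back should return the input."""
    return l1(g_yx(g_xy(x)), x) + l1(g_xy(g_yx(y)), y)

def identity_loss(x, y, g_xy, g_yx):
    """A generator fed an image already in its target domain should leave it unchanged."""
    return l1(g_xy(y), y) + l1(g_yx(x), x)

def adversarial_loss(d_fake):
    """Least-squares generator loss: push discriminator scores on fakes towards 1."""
    return np.mean((d_fake - 1.0) ** 2)

def generator_objective(x, y, g_xy, g_yx, d_x, d_y, lambda_cyc=10.0, lambda_id=5.0):
    adv = adversarial_loss(d_y(g_xy(x))) + adversarial_loss(d_x(g_yx(y)))
    cyc = cycle_consistency_loss(x, y, g_xy, g_yx)
    idt = identity_loss(x, y, g_xy, g_yx)
    return adv + lambda_cyc * cyc + lambda_id * idt

# Toy domains and networks (hypothetical): domain Y is domain X shifted by one.
x = np.zeros((4, 4))
y = np.ones((4, 4))
g_xy = lambda img: img + 1.0
g_yx = lambda img: img - 1.0
d_x = lambda img: np.full(img.shape, 0.5)   # an undecided discriminator
d_y = lambda img: np.full(img.shape, 0.5)
total = generator_objective(x, y, g_xy, g_yx, d_x, d_y)
```

In this toy the two generators are exact inverses, so the cycle term vanishes while the identity term does not, which shows that the two terms constrain different behaviours.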
Tables 3 and 4 report the content and style loss obtained at different numbers of iterations for VGG-16 and VGG-19.
Comparison of quantitative metrics for VGG-19
Comparison of quantitative metrics for VGG-16
The graphs in Figs 10 and 11 show the comparison between style loss and content loss for both VGG-16 and VGG-19.

Loss using VGG-19.
Both VGG-16 and VGG-19 are widely used models for neural style transfer and have been shown to produce good results. VGG-19 is a deeper version of VGG-16, with 19 layers instead of 16; the additional convolutional layers may allow it to capture more complex features [17]. However, the choice of model ultimately depends on the requirements of a particular project and the resources available. VGG-19 may provide better results for more complex images or larger datasets, but it also requires more computational resources and may be slower to train. On the other hand, VGG-16 offers a good balance between accuracy and speed and may be a better choice for smaller datasets or less complex images. It is worth noting that many other pre-trained models are available for neural style transfer, such as ResNet, Inception, and MobileNet. These models have different architectures and characteristics, which may make them more suitable for specific tasks or datasets. Hence, experimenting with different models and comparing their performance is the most reliable way to find the most appropriate fit for a specific task.
Table 5 reports the content and style loss obtained for different style transfer models. A gradual improvement can be seen as models evolve and newer developments appear.
Comparison of quantitative metrics for style transfer models

Loss using VGG-16.
Gatys’ work on neural style transfer is groundbreaking and can work with any style. Although it takes more time, since back-propagation is used to iteratively reduce the content and style loss, the method is the inspiration for those that came after it. While Gatys’ approach can produce reasonable results, the iterative solver often fails to find the optimal solution within a limited number of iterations, resulting in less stylized outputs. More modern approaches, on the other hand, produce outputs that are more stylized and better preserve the structures of the content images, since they explicitly limit style and content loss while searching for a closed-form solution.
The feature transformation method AdaIN [8] assumes that CNN feature channels are independent and matches the statistics of each channel separately, in effect aligning pairs of one-dimensional Gaussian distributions. However, the channels are correlated, and because AdaIN does not account for channel correlation in CNN features, it fails to achieve visually pleasing transfers [18]. Newer approaches consider channel correlation and achieve more stylized results.
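The per-channel matching performed by AdaIN, and its blindness to channel correlation, can be seen in a short sketch. It assumes features stored as (channels, height, width) arrays and a small `eps` for numerical stability; the random inputs are hypothetical stand-ins for CNN features.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Align each content channel's mean and standard deviation with the style's."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean

rng = np.random.default_rng(0)
content = rng.normal(size=(3, 8, 8))
style = rng.normal(loc=5.0, scale=2.0, size=(3, 8, 8))
stylised = adain(content, style)

# Correlation between channels before and after the transform.
corr_content = np.corrcoef(content.reshape(3, -1))
corr_stylised = np.corrcoef(stylised.reshape(3, -1))
```

The output takes on the style's channel-wise means and standard deviations, but its between-channel correlations are still those of the content, which is the limitation discussed above.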
The WCT [19] method proposes using feature whitening and coloring for style transfer, specifically ZCA whitening. When other whitening methods were tested on the ReLU3_1 feature, it was found that only ZCA whitening [20] produces usable outcomes because it considerably reduces the difference between whitened and content features. However, ZCA whitening places no constraints on the final transformed feature [21]. This drawback of WCT has become the starting point for more recent papers, which aim to lessen the gap between the original and the transformed feature by incorporating the content loss from Gatys’ method. The best solution would be a closed-form approach [22] that minimizes both style and content loss and better preserves the input image’s characteristics, as opposed to WCT, which only focuses on style loss.
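The whitening and coloring steps can be sketched with an eigendecomposition of the feature covariances. This is a minimal NumPy illustration under stated assumptions: features are (channels, height, width) arrays, the covariance matrices are full rank (eigenvalues are clipped at a small `eps` otherwise), and no blending with the original content feature is applied. The helper names and random inputs are hypothetical.

```python
import numpy as np

def _centered(feat):
    """Flatten to (channels, pixels) and subtract the per-channel mean."""
    flat = feat.reshape(feat.shape[0], -1)
    mean = flat.mean(axis=1, keepdims=True)
    return flat - mean, mean

def _sym_power(cov, power, eps=1e-8):
    """Power of a symmetric positive semi-definite matrix via its eigenvectors."""
    vals, vecs = np.linalg.eigh(cov)
    vals = np.clip(vals, eps, None)
    return (vecs * vals ** power) @ vecs.T

def wct(content_feat, style_feat):
    fc, _ = _centered(content_feat)
    fs, s_mean = _centered(style_feat)
    cov_c = fc @ fc.T / (fc.shape[1] - 1)
    cov_s = fs @ fs.T / (fs.shape[1] - 1)
    whitened = _sym_power(cov_c, -0.5) @ fc       # ZCA whitening: identity covariance
    coloured = _sym_power(cov_s, 0.5) @ whitened  # coloring: impose style covariance
    return (coloured + s_mean).reshape(content_feat.shape)

rng = np.random.default_rng(0)
content = rng.normal(size=(4, 16, 16))
style = rng.normal(size=(4, 16, 16)) * np.array([1.0, 2.0, 3.0, 4.0]).reshape(4, 1, 1)
transformed = wct(content, style)
```

Unlike per-channel normalization, the transformed feature reproduces the full covariance of the style feature, including the correlations between channels. Nothing in the transform ties the result back to the content feature, which is the missing constraint noted above.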
Table 6 reports the difference in time taken for image generation across three different style transfer models.
Comparison of processing time for different style transfer models
This research combines refined methodologies and diverse tools to offer a comprehensive exploration of neural style transfer, contributing to the existing body of knowledge in the field. Tables 3 and 4 enumerate the style and content loss obtained by running experiments using VGG-16 and VGG-19, and Table 5 contrasts these with the loss results of existing models. Table 6 compares the processing time of different style transfer models.
Dealing with the information loss and noise in the resultant images is essential. As neural style transfer has applications in medical imaging, accuracy is paramount. One approach involves utilizing image processing algorithms, such as Gaussian smoothing, median filtering, or wavelet denoising, to remove unwanted noise while preserving essential image details. Gaussian smoothing blurs an image by convolving it with a Gaussian filter kernel, which reduces noise and softens sharp transitions. Median filtering replaces each pixel’s value with the median value of its neighboring pixels, which aids in noise removal. Wavelet denoising decomposes an image into wavelet coefficients and selectively attenuates high-frequency noise components, preserving important image features. Additionally, convolutional neural networks (CNNs) can reduce noise by learning complex patterns and features from the data. By employing these noise reduction strategies in the image reconstruction pipeline, it becomes possible to achieve higher-quality images with reduced noise levels.
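Two of the filters mentioned above can be sketched directly in NumPy. The sketch assumes a single-channel image, a 3x3 median window, an odd-sized Gaussian kernel, and edge padding at the borders; these choices and the function names are illustrative assumptions, not settings used in this study.

```python
import numpy as np

def median_filter3(image):
    """Replace each pixel with the median of its 3x3 neighbourhood."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    windows = np.stack([padded[i:i + h, j:j + w] for i in range(3) for j in range(3)])
    return np.median(windows, axis=0)

def gaussian_kernel(size=5, sigma=1.0):
    """Normalised 2-D Gaussian kernel; size is assumed to be odd."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def gaussian_smooth(image, size=5, sigma=1.0):
    """Convolve the image with a Gaussian kernel using shifted, weighted copies."""
    kernel = gaussian_kernel(size, sigma)
    padded = np.pad(image, size // 2, mode="edge")
    h, w = image.shape
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(size):
        for j in range(size):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

# A flat image with a single "salt" pixel as hypothetical impulse noise.
noisy = np.full((7, 7), 10.0)
noisy[3, 3] = 255.0
median_result = median_filter3(noisy)
gaussian_result = gaussian_smooth(noisy)
```

On this example the median filter removes the isolated outlier entirely, while Gaussian smoothing spreads it over neighbouring pixels at a lower amplitude, which illustrates why the two are suited to different kinds of noise.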
In this paper, we presented the variations between different style transfer techniques. We conducted experiments on the encoder-decoder mechanism to determine the optimal CNN setup, and analyzed two widely used architectures for neural style transfer: VGG-19 and VGG-16. Our findings suggest that the choice of model depends on the project requirements and available resources. VGG-19 performs well on complex datasets, while VGG-16 is better suited for smaller ones.
In conclusion, neural style transfer using deep neural networks is a promising technique for generating artistic images through computer algorithms. It allows the integration of content from one image with the style of another, resulting in aesthetically captivating and artistically intriguing new images. While neural style transfer is still an emerging research topic, it has already shown great promise in various industries such as film, advertising, image editing, virtual reality, and medicine. In the film industry, it is utilized for visual enhancements, while in medicine, it brings consistency to results obtained from different machines and locations, making it easier to apply further machine-learning techniques. Although there are challenges that need to be addressed, such as the development of more efficient algorithms and improved methods for evaluating the realism of generated images, the future of neural style transfer looks promising. As technology continues to advance, the potential for more profound applications of neural style transfer in our daily lives will increase. Our work can be viewed here: Neural Style Transfer.
Future work
Future work in this field could involve further exploring the Haar wavelet function, the various filters, and their role in style transfer. In addition, experiments are to be conducted to develop a better understanding of content segmentation and the various pooling techniques. The 3D limitations of neural style transfer ought to be looked into. A detailed study to find more compatible alternatives to the SSIM Index can also be carried out.
