Neural Style Transfer with Human Comparison

Filis Coba & Ryan Abrahams
Nebulearn
[email protected]

Abstract

We compare a computer-generated style transfer to a physically painted style transfer, and contrast the computer algorithm with the self-recorded thought process of the artist. The neural style transfer was performed with the VGG16 network using the algorithm of Gatys et al. (2016). The computer successfully detected the sfumato, chiaroscuro, and color scheme that Leonardo's painting exhibits. The human was more successful at capturing Leonardo's style, which comes from a balance of the sfumato and chiaroscuro techniques. This is because the human has the ability to venture off and "invent" a whole new set of features which do not exist in the input reference images. We believe this allows the human to make a more convincing piece of art, one much more closely resembling what Leonardo would have made.

1. Introduction

Italian painters of the early 1300s had a hard time transposing three-dimensional natural scenes onto two-dimensional canvas surfaces. Their depictions of figures and architecture appeared flat, without depth or form. Soon after 1400, the architect Filippo Brunelleschi used mathematics to generate geometric perspective for his architectural designs. This revolutionized the way artists interpreted the structure of their surroundings, and greatly improved the scientific accuracy of their observations. Leonardo da Vinci especially benefited from this, and combined it with anatomy, botany, medicine, architecture, mechanics, and optics. He mainly used earth tones and oil painting techniques called "sfumato" and "chiaroscuro" in his paintings. Sfumato is the hazy appearance of a painting achieved by blending colors so that no hard lines are present, only shadows. Chiaroscuro is the illumination of figures and landscapes by a single light source with high contrast of shading and tinting (Isaacson, 2017). By combining the knowledge of science with his own studies of nature, Leonardo painted (sometimes using his fingers) earthy-toned, three-dimensional forms of art with exquisite detail.

For this project, we quantified the style of Leonardo da Vinci as earth tones, sfumato, and chiaroscuro. We used machine learning to extract and apply that style to an image of our choice. In parallel, we repeated the process with a traditional oil painting, then compared our result against the computer's. We intentionally focused on this famous polymath to honor the union of art and science, and to test how the program handles a more subtle style.

2. Convolutional Neural Networks (CNNs)

A neural network is a complex function built up from less complicated "neurons". A neuron takes in any number of inputs, evaluates a simple function, and outputs a number. Neurons are combined to approximate a complicated function. A common example is to classify objects within an image.

[Figure 1 panels: Earth tones (Color), Sfumato (Texture), Chiaroscuro (Form)]

Figure 1: Leonardo da Vinci's artistic style consists of earth tones and the painting techniques called sfumato (gradual transitioning of hues producing a hazy effect) and chiaroscuro (dramatic lighting to achieve high contrast and 3D form). These represent Leonardo's style, and we will refer to them as color, texture, and form.

CNNs are similar, except that each neuron's function is a weighted sum: a convolution operation. A cartoon of two layers in a CNN is given in Figure 2. CNNs differ from standard neural networks because each node is connected only to specific nodes in the previous layer, whereas standard networks allow connections to any or all of the nodes in the previous layer.

We used the optimization method proposed by Gatys et al. (2016) to transfer the style from one image onto another. This method requires quantitative definitions of the 'content' and the 'style' of an image. The algorithm minimizes a combined measure of the difference in content and the difference in style between the generated image and two provided images: one is the content reference image and the other is the style reference image. In this section, we describe this process in detail.

The goal of the CNN style transfer is to generate a "pastiche" as if Leonardo had painted the image himself. The pastiche is the final image incorporating the content of the content reference image with the style of the style reference image, both of which are defined below. To discuss the specific definitions of content and style, we first need to describe (a) the input data and (b) the set of features within our CNN. The overall picture of the optimization process is shown in Figure 3. In short: evaluate the style and content of the reference images and of the pastiche, add up the difference in content and the difference in style, and modify the pastiche to reduce this sum.

2.1 Input Data

The input data were digital images. We used JPEG images, which contain red, green, and blue channels to reproduce color. This mimics the structure of the human eye, which contains three types of color-sensitive cells. Therefore, the input data were a stack of three 2D arrays.
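As a concrete illustration, the following minimal sketch (using the Pillow and NumPy libraries; the file name is hypothetical) loads an image as a stack of three 2D arrays and counts its data points:

```python
import numpy as np
from PIL import Image

# Load a JPEG and convert it to an array of shape (height, width, 3):
# one 2D array for each of the red, green, and blue channels.
# "content.jpg" is a hypothetical file name.
image = np.asarray(Image.open("content.jpg").convert("RGB"))

height, width, channels = image.shape
print(image.shape)                 # e.g. (512, 512, 3)
print(height * width * channels)   # 512 * 512 * 3 = 786,432 data points
```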

Figure 2: Diagram of part of a CNN. From top to bottom, the input image passes through three convolution operations (the number is chosen arbitrarily), resulting in three images F_i^1. Then, each of these images passes through three more convolution operations, resulting in a total of nine output images. Each of the resulting images F_i^ℓ is called a feature.

Every number in the array was a separate data point; a 512 x 512 pixel image therefore contained 786,432 data points. To extract the content and style of the images, we used a CNN called VGG16 (Simonyan and Zisserman, 2014). This network had already been trained on a large dataset of images to determine whether an image contains one of 1,000 object classes (such as a person, airplane, cat, etc.). The pre-trained VGG16 network is available online[1] and contains all of the model parameters. We used the version available in Python's Keras module[2].

[1] http://www.robots.ox.ac.uk/~vgg/research/very_deep/
[2] https://keras.io/

2.2 Network Features

The features are the qualities which describe an object; features act to discriminate one object from another. For example, consider the ways a person differs from a cat. A description of a person would not include whiskers or the same pattern of hair; the eyes are different, and the proportions and placement of body parts (arms, head) are different. These are a few possible features which may describe and differentiate these two object types. A CNN learns aspects of images which distinguish different objects.

The features of a CNN are images. These images are generated from the input data through convolution operations. The model parameters, which are grouped into 3D arrays of numbers, are the filters used to perform the convolution operations. A filter encodes how nearby pixels in the image are weighted in a weighted average: to compute the convolution of an image at a pixel x, take the weighted average of all of the surrounding pixels. This weighted average becomes the value of that pixel in the convolved image.
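To make the notion of a network feature concrete, here is a minimal sketch (assuming TensorFlow's bundled Keras; the layer name and the random stand-in image are illustrative) of loading the pre-trained VGG16 and reading out the feature maps of one layer:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# Pre-trained VGG16 without its final classification layers;
# the convolutional layers are what produce the features.
vgg = VGG16(weights="imagenet", include_top=False)

# A model that maps an input image to the feature maps of a single layer.
# "block3_conv1" is an arbitrary choice for illustration.
feature_model = tf.keras.Model(
    inputs=vgg.input,
    outputs=vgg.get_layer("block3_conv1").output,
)

# Feed a (batch, height, width, 3) image through the network.
# A real content or style image would be loaded here instead of random noise.
image = np.random.rand(1, 512, 512, 3).astype("float32") * 255.0
features = feature_model(preprocess_input(image))

# Each channel of the output is one feature F_i^l: a convolved image.
print(features.shape)  # (1, 128, 128, 256) for a 512 x 512 input
```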


Figure 3: A generalization of the CNN style-transfer process. The computer is initially fed a style image, a content image, and a pastiche of random noise through the CNN. The noise pastiche gradually develops features from the content and style images, with the end goal of minimizing the loss.


Some specific convolution filters can detect vertical or horizontal edges, or can blur or sharpen an image. The first layer of the network performs a number of convolutions on the input data. Each convolution uses a different filter, and each filter is learned when the network is trained. Figure 2 shows this; each arrow represents a different filter. Because each filter is determined during training, the exact features represented cannot be determined before training.

Subsequent layers are defined the same way. A number of filters are convolved with the output images of the previous layer to generate a larger number of convolved images: N output images in the previous layer and M filters defined in the current layer result in N × M convolved images in this layer. See the bottom layer in Figure 2 for an example. This means that there are N × M features in the layer; each feature is a convolved image. These higher-level features, which are generated towards the middle and end of the CNN, are more difficult to interpret alone, as they are convolutions of convolutions.

2.3 Content Definition

When an image is fed through the network, the set of features is computed through the convolution operations. The content, based on a layer ℓ, is a combination of these features:

C^ℓ = \sum_i F_i^ℓ ,    (1)

where F_i^ℓ is the ith feature in layer ℓ. CNNs appear hierarchical: more complex objects are built up out of less complex objects. For example, an eye can be thought of as a number of shapes, e.g., some circles and other shapes, each of which is in turn made up of simple combinations of lines. This suggests that higher-level features are more complex, as they incorporate lower-level features. We assume that the image can be recreated as a combination of the network features, so the abstract content of the image should be represented as a combination of the image features.

2.4 Gram Matrix and the Definition of Style

Unlike the content, the definition of style is more than just summing the features. Gatys et al. (2016) define the style as a combination of the Gram matrices of the features. In this section, we dissect the Gram matrix. The Gram matrix is the set of all dot products among a collection of vectors. It is defined as

G_{i,j}^ℓ = \sum_k F_{i,k}^ℓ F_{j,k}^ℓ ,    (2)

where F_{i,k}^ℓ represents the ith feature in layer ℓ at position k. Figure 2 shows this notation in the context of a CNN. Each feature is vectorized: the position index k may be broken up into pixel coordinates k_x, k_y, and k_z (the z dimension being the color channel: red, green, or blue). The inner product given above is the dot product between different features in the same layer of the CNN.
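As a sketch of how the Gram matrix of one layer might be computed (assuming a NumPy array of feature maps of shape (height, width, n_features); variable names are illustrative), each feature map is flattened into a vector and all pairwise dot products are collected into a matrix:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix G[i, j] = sum_k F[i, k] * F[j, k] for one layer.

    `features` has shape (height, width, n_features); each feature map
    is flattened into a vector of length height * width.
    """
    h, w, n = features.shape
    flat = features.reshape(h * w, n).T   # shape (n_features, h * w)
    return flat @ flat.T                  # shape (n_features, n_features)

# Illustrative example with random stand-in feature maps.
layer_features = np.random.rand(128, 128, 256)
G = gram_matrix(layer_features)
print(G.shape)              # (256, 256)
print(np.allclose(G, G.T))  # True: the Gram matrix is symmetric
```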

We have two feature images, F_i^ℓ(x, y, z) and F_j^ℓ(x, y, z). To compute the inner product, multiply the values of each pixel in both images and sum over all pixels. The Gram matrix, then, is the array of these dot products between all combinations of features (e.g., 1 & 2, 1 & 3, 2 & 3, etc.). Note that the Gram matrix is symmetric: G_{i,j}^ℓ = G_{j,i}^ℓ.

What is the concept behind the Gram matrix? The dot product, in general, represents the projection of one vector onto another; it quantifies the similarity of the directions of two vectors within a space. The space in which our vectors live is R^{n×m×c}, where n and m are the image height and width and c is the number of color channels. The CNN generates a list of features from the input, and these features form the internal representation of the image. Each layer of the CNN consists of a number of images F_i^ℓ, which we can flatten to form vectors F_{i,k}^ℓ. The Gram matrix entry G_{i,j}^ℓ indicates whether feature i and feature j occur together within the image. To clarify this, consider a feature F_i^ℓ as a vector within some abstract space; call its (normalized) direction f̂_i^ℓ and its length |F_i^ℓ|. The Gram matrix may then be written as

G_{i,j}^ℓ = |F_i^ℓ| \, |F_j^ℓ| \, \hat{f}_i^ℓ \cdot \hat{f}_j^ℓ .    (3)

The Gram matrix is written here as the product of a projection term (f̂_i^ℓ · f̂_j^ℓ) and an activation term (|F_i^ℓ| |F_j^ℓ|). The projection term quantifies the similarity between the two features: we can think of it as a correlation, asking how similar the features are within the abstract space. The activation term represents whether each feature is present in the input image; because it is a product, it indicates whether the two features occur together within the image.

2.5 The Optimization Process

We will cover the optimization only briefly; other sources (Nar; Des) cover this portion well. In the following sections, we go over what is being optimized: the loss function which we want to minimize. We have three images: the content reference image, the style reference image, and the generated pastiche. The pastiche combines the content of the content reference with the style of the style reference image. The process is iterative: generate a combination, evaluate the loss function, and determine the next best combination. This process is shown in Figure 3. The loss function is the sum of three terms: the content loss, the style loss, and a regularization term. We describe each term below.

2.6 Content Loss

L_{content} = \alpha \sum_{i,j} \left( F_{i,j}^ℓ(content) - F_{i,j}^ℓ(pastiche) \right)^2    (4)

The content portion of the loss function is the square of the difference between the features of the content reference image and those of the pastiche. This is summed over all features within one layer, then multiplied by a user-defined constant called the content weight. We want to match the content of the content reference image and the generated pastiche as closely as possible: making this loss zero means there is no difference in content between the content reference image and the pastiche.

2.7 Style Loss

L_{style} = \beta \sum_ℓ \frac{1}{4 x_ℓ} \sum_{i,j} \left( G_{i,j}^ℓ(style) - G_{i,j}^ℓ(pastiche) \right)^2 ,    (5)

where x_ℓ represents the number of color channels (red, green, blue) times the number of pixels in the images of layer ℓ. The style loss considers the difference between the Gram matrix of the style reference and the Gram matrix of the pastiche; essentially, it describes the difference in style between the style reference image and the pastiche. The result is normalized according to the number of pixels in the style reference image. We sum this quantity over each layer used, then multiply by a constant called the style weight.

2.8 Additional Term

Finally, we add another term to regularize the loss function. This follows Mahendran and Vedaldi (2014), and encourages a smoother, less noisy output:

L_{reg} = \gamma \sum_{x,y,z} \left( \Delta_x(pastiche)^2 + \Delta_y(pastiche)^2 \right)^{1.25} .    (6)

Given an image I(x, y), Δ_x(I) = I(x, y) − I(x − 1, y) and Δ_y(I) = I(x, y) − I(x, y − 1); these represent spatial gradients within the image. We sum over all pixels and color channels. Very noisy images tend to have large gradients, which would result in a large L_reg term, so including this term suppresses noisy pastiche images.

All three of these terms are summed to form the total loss function. We then set the number of iterations to run and let the optimizer find the optimal pastiche. Li et al. (2017) showed that this optimization process is equivalent to a special case of the maximum mean discrepancy (MMD) technique. In other words, this implementation of style transfer tries to "match the feature distribution between the style images and the generated images."
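Putting the pieces together, the following is a minimal sketch, not the exact code used for this project, of how the three loss terms and the iterative update of the pastiche might be assembled with TensorFlow's Keras and the pre-trained VGG16. The layer choices, the loss weights, the optimizer (Adam), and the iteration count are illustrative assumptions, and image preprocessing is glossed over:

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

# Layers used for content and style are illustrative choices, not
# necessarily the ones used for the figures in this paper.
CONTENT_LAYER = "block4_conv2"
STYLE_LAYERS = ["block1_conv1", "block2_conv1", "block3_conv1", "block4_conv1"]

vgg = VGG16(weights="imagenet", include_top=False)
outputs = [vgg.get_layer(name).output for name in STYLE_LAYERS + [CONTENT_LAYER]]
extractor = tf.keras.Model(vgg.input, outputs)


def gram(F):
    """Gram matrix of a (1, height, width, n_features) feature tensor."""
    n = F.shape[-1]                       # number of feature maps
    flat = tf.reshape(F, (-1, n))         # (pixels, n_features)
    return tf.matmul(flat, flat, transpose_a=True)


def total_loss(pastiche, content_features, style_grams,
               alpha=1.0, beta=1e-4, gamma=1e-6):
    feats = extractor(pastiche)
    style_feats, content_f = feats[:-1], feats[-1]

    # Content loss (Eq. 4): squared feature difference in one layer.
    content_loss = alpha * tf.reduce_sum((content_f - content_features) ** 2)

    # Style loss (Eq. 5): squared Gram-matrix difference, summed over layers.
    style_loss = 0.0
    for F, G_style in zip(style_feats, style_grams):
        x_l = tf.cast(tf.size(F), tf.float32)   # channels times pixels
        style_loss += tf.reduce_sum((gram(F) - G_style) ** 2) / (4.0 * x_l)
    style_loss = beta * style_loss

    # Regularization (Eq. 6): spatial gradients, raised to the power 1.25.
    dx = pastiche[:, :, 1:, :] - pastiche[:, :, :-1, :]
    dy = pastiche[:, 1:, :, :] - pastiche[:, :-1, :, :]
    reg_loss = gamma * tf.reduce_sum(
        (dx[:, :-1, :, :] ** 2 + dy[:, :, :-1, :] ** 2) ** 1.25)

    return content_loss + style_loss + reg_loss


def run_style_transfer(content_img, style_img, iterations=100):
    """content_img and style_img: (1, height, width, 3) preprocessed tensors."""
    content_features = extractor(content_img)[-1]
    style_grams = [gram(F) for F in extractor(style_img)[:-1]]

    # Start from random noise and iteratively reduce the total loss.
    pastiche = tf.Variable(tf.random.uniform(tf.shape(content_img), 0.0, 255.0))
    optimizer = tf.keras.optimizers.Adam(learning_rate=5.0)
    for _ in range(iterations):
        with tf.GradientTape() as tape:
            loss = total_loss(pastiche, content_features, style_grams)
        grads = tape.gradient(loss, pastiche)
        optimizer.apply_gradients([(grads, pastiche)])
    return pastiche
```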

3. Results

3.1 Machine results

The first run involved recreating an image of a figure in the style of Leonardo's painting Virgin of the Rocks (Isaacson, 2017). Specifically, we compared it to a study of one of the characters in the painting, the angel Gabriel, as shown in Figure 4. We chose a content reference image similar to Leonardo's painting (our style reference image) in order to ease the learning process. Figure 5 shows the results, along with a thresholded version to show the lighting.
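As an aside, the thresholded versions referred to here and in Figure 9 can be produced in a few lines. The sketch below (using NumPy and Pillow; the file name is hypothetical) shows one plausible reading of the 75% threshold mentioned later: keep only pixels brighter than 75% of the maximum value, which highlights where the light falls.

```python
import numpy as np
from PIL import Image

# Convert to grayscale and keep only pixels brighter than 75% of the maximum,
# which highlights where the light falls (the chiaroscuro).
# "pastiche.jpg" is a hypothetical file name.
gray = np.asarray(Image.open("pastiche.jpg").convert("L"), dtype=float)
binary = np.where(gray >= 0.75 * 255.0, 255, 0).astype(np.uint8)
Image.fromarray(binary).save("pastiche_threshold.png")
```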

Figure 4: Leonardo da Vinci's Virgin of the Rocks, shown on the left, and the study of the angel used for this project, shown on the right. The image on the right serves as the style reference image.

It compares the generated pastiche (middle) with both the content and style images (left and right, respectively) at iterations 20, 40, and 100. Initially, we ran it for 10 iterations, then increased this to 40. By the 10th iteration, we noticed that the computer had detected the direction of the light source in the painting. The computer-generated pastiche mirrored the intensity of the light source seen on the face of Leonardo's figure, as well as the sfumato effect. The computer also noticed that the shadow on the face of the figure included a reddish, earthy undertone, especially near the figure's mouth and the side of the face. The skin of the figure developed the same texture found in Leonardo's painting, and the color scheme was starting to resemble that found in the painting. The hair was becoming darker at 40 iterations, as was the background. The algorithm did not apply the style randomly: it learned to recognize the face and the hair, and even the background changed colors and texture to simulate the background in the content figure.

Although after 40 iterations the pastiche did not look identical to the style image used, the computer picked up on the hazy effect (sfumato), the earthy tones, and some of the high contrast (chiaroscuro). This implies the computer was able to detect some of Leonardo's style. How well was the style applied to the content image? The resulting pastiche did not have a strong enough saturation of earth tones, chiaroscuro, and sfumato; the computer applied the style rather weakly.

3.2 Did the CNN Capture da Vinci's Style?

The images shown in Section 3.1 show that many of the style elements can be reproduced. The lighting and color, for example, are clearly transferred. The pastiche has dramatically different lighting and color from the content reference image: it is primarily earth brown, and there is larger contrast between light and shadow (like a spotlight).

Figure 5: This figure shows the content and style reference images with the iterated pastiche sandwiched in between. Notice how the computer gradually transformed the pastiche into a more glowing, hazy appearance, as seen in image 8.


Figure 6: Pastiche minus the content image, red channel only. The image on the left shows the positive differences, where the pastiche contains more red, and the image on the right shows the negative differences.

This is seen in Figure 6, where we take the red channel of the difference between the generated pastiche and the content reference image. Figure 6 also shows one sense of 'focus', where different features are modified in different ways: the faces in the pastiche are more red than in the content reference image, whereas the central figure's shoulder is less red. This is supported by the image histograms of the red channel (Figure 7) for the pastiche (top) and the content (bottom-left) and style (bottom-right) reference images. The content image has many white pixels (R = 255) and, in general, a broad distribution of brightness. The histogram for the pastiche is much more similar to that of the style reference. Thus, one action of the algorithm is to match the color distribution.

There are red undertones, specifically in the transition from light to dark, which are missing in the computer-generated pastiche. This is partly because the algorithm cannot match the color distribution of the style reference image exactly, and partly because the optimization is a global operation: these transition regions are a small fraction of the entire image.

We can see in Figure 10 that the computer-generated pastiche maintains the outline of the central figure more faithfully than the painted pastiche: its silhouette more closely matches the content reference image. The silhouette in the painted pastiche, however, incorporates more features of the style reference image. The computer-generated pastiche is missing da Vinci's characteristic shadows around the eyes. This is a prominent feature of the central figure in the style reference, but the algorithm is constrained to be faithful to the content reference image.

3.3 Painting Results

As a comparison, we used the same style and content reference images given to the computer to generate a new image in the form of a physical painted pastiche. Unlike a computer, painting is messy, tedious, requires some potentially toxic materials, and is limited by human-related necessities such as bathroom breaks, food, and sleep.

Figure 7: Image histogram of the red channel for each of the three images.

Initially, we used reference images oriented differently from the ones used with the computer. This proved to be extremely difficult: our brain constantly wanted to focus entirely on one of the reference images while transferring the renderings to the final product. The sequence of progression for the painting is shown in Figure 8. The following thought process led us to complete the painting:

1. Examine the overall painting by Leonardo da Vinci. What colors did Leonardo use? Where is the light source coming from?

2. Examine the content image we want to recreate in the style of Leonardo's painting. What features should we preserve?

3. Decide what colors Leonardo would have used. This required reading books on Leonardo and understanding his personality, choice of colors, and technique.

Painting the pastiche ourselves helped us better understand the steps necessary to perform the style transfer successfully. It was not sufficient to simply copy Leonardo's shadows and color scheme from the locations where they are found in his painting. To obtain a convincing final painted pastiche, we needed to apply his style (i.e., earth tones, sfumato, chiaroscuro) where it would be found on the content reference image, given the features that exist there. This is not an easy task for either the computer or the human, for it requires mental calculation and spatial reasoning. However, the human has the ability to venture off into expression mode and can, in some way, "invent"

Figure 8: The progression of the oil painting pastiche from inception. Initially, we used different orientations for the reference style and content images (see the middle image in the top row), but that proved to be highly difficult.

a whole new set of features which do not exist in either the style image or the content image. We believe this allows the human to make a more convincing piece of art, one much more closely resembling what Leonardo would have made.

We compare the painted pastiche to the content and style reference images in Figure 9, just as we compared the computer-generated pastiche at different stages in Figure 5. The figure includes all three images at a 75% threshold to highlight the lighting. The lighting is closer to the style reference's lighting; it is less diffuse than in the content reference.

To reiterate, the computer and the human both detected the sfumato, chiaroscuro, and color scheme that Leonardo's painting includes. Visually, it appears that the human succeeded more at capturing Leonardo's caricature style, which comes from a balance of the sfumato and chiaroscuro techniques. The painted pastiche seems more illustrative, exaggerating the shape of the nose and the curvature of the face. The painted pastiche does not resemble the content image as much as it does the style image, which implies that the human (us) had a hard time preserving the original shadows of the content image while applying the style.

4. Conclusion

This project required not only a good knowledge of machine learning and programming, but also of art and science. Leonardo studied science because he understood that to create convincing art of his surroundings, he needed to understand the mechanics behind them. This

Figure 9: Comparison of the painting to the content and style images, with black-and-white versions. In the black-and-white version, the pastiche has more circular under-cheek shadows compared to the style image. The painting failed to capture the contours, however, and contains more sharp edges than the da Vinci painting.

Figure 10: Comparison of handmade painting (far-left) to computer generated pastiche (middle-left) with respect to the content reference image (middle-right) and the style reference image (far-right).


Figure 11: Computer pastiche (left) and painted pastiche (right). While the computer-generated pastiche preserved the content, the painted pastiche better captured da Vinci's blending and color scheme, and has more consistent lighting.

interconnectedness between art and science was crucial in helping him generate his famous art. Similarly, by quantifying the definition of artistic style, analyzing how a computer makes art, and comparing the results to how a human makes art, we tried to show that in order to understand science, you need to understand art.

The notion of style used here is incomplete: style is not exactly described by the correlation of features. This indicates that the problem posed is not well defined. What is the style of an image? What exactly do we mean by transferring the style of one image onto another? If the content and style cannot be fully disentangled, how much modification to the content is acceptable?

4.1 Weaknesses in the computer model

As can be seen in Figures 10 and 11, the iterations produce spotty results. Figure 10 shows both the computer pastiche and the painted pastiche next to the content and style reference images, and Figure 11 highlights the comparison between the computer-generated pastiche and the painted pastiche. The computer-generated pastiche appears patchy, perhaps because there is no similar object within the style reference image. This showcases the global nature of the optimization process: the loss function reduces the image to a scalar. The painted pastiche shows a more uniform application of style, at the expense of removing the background; a person can focus on specific elements.

Maybe the patchiness is a result of using only up to a few hundred iterations, but it also highlights the geometry of the loss function. There are a lot of features, and it is not obvious that the problem

has a unique global minimum. Another drawback lies in the memory and time required to run each model.

5. Implications

We can see from the results of the style transfer that the computer is capable of detecting Leonardo da Vinci's subtle style of sfumato and chiaroscuro. When we painted our own pastiche, we made a more convincing pastiche than the computer did, because it had more of the sfumato features. We found it difficult to simultaneously alternate back and forth between the style and content images to assess which features of the content to keep and which features of the style reference to apply in order to create a painting that Leonardo would have painted himself.

Mainstream CNNs can make anyone an artist these days through widely available web and mobile app filters. What is overlooked, however, is an understanding of what is meant by artistic style, and why some CNNs choose to alter certain features of an image and not others. Knowing these answers can help critique and quantify the overall definition of art and the way we teach it. This can assist in teaching classical art styles by focusing on style, as opposed to content. A student can express more freedom while learning a particular art style: instead of being limited to faithful "studies" of existing works of art, they can try to apply the style to anything. A quantitative assessment of style can guide a student's efforts by providing concrete feedback. This is especially true if some notion of attention is added to the algorithm: instead of optimizing across the entire image, perhaps optimizing for certain features (e.g., figures or landscapes). We do not have to stop at art; CNNs can also be used to classify astronomical data in the form of images, a project soon to follow this one.


References

W. Isaacson. Leonardo da Vinci: The Biography. Simon & Schuster UK, 2017. ISBN 9781471166761.

L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414-2423, 2016.

Y. Li, N. Wang, J. Liu, and X. Hou. Demystifying neural style transfer. CoRR, abs/1701.01036, 2017. URL http://arxiv.org/abs/1701.01036.

A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. CoRR, abs/1412.0035, 2014. URL https://arxiv.org/abs/1412.0035.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL https://arxiv.org/abs/1409.1556.
