This is an article accompanying the paper with the same name which I’ll present at the 2nd Workshop on Machine Learning for Creativity and Design at the 32nd Conference on Neural Information Processing Systems (NeurIPS) 2018, where I discuss some of the underlying practical and technical methods (not so much the conceptual or artistic motivations) used in the creation of my “Deep Meditations” series of works. This article includes most of the information from the paper, expanded with some additional information.
We introduce a method which allows users to creatively explore and navigate the vast latent spaces of deep generative models such as Generative Adversarial Networks. Specifically, we focus on enabling users to discover and design interesting trajectories in these high dimensional spaces, to construct stories, and produce time-based media such as videos, where the most crucial aspect is providing meaningful control over narrative. Our goal is to encourage and aid the use of deep generative models as a medium for creative expression and story telling with meaningful human control. Our method is analogous to traditional video production pipelines in that we use a conventional non-linear video editor with proxy clips, and conform with arrays of latent space vectors.
One of the recent major advances we’ve seen in the so-called field of ‘Artificial Intelligence’ (AI) is in ‘generative models’. A nice summary (if slightly out of date due to the insane speed at which the field is moving) can be found on the OpenAI blog.
First I’d like to take a slight diversion to underline the difference between two different definitions of ‘generative’: i) In the arts or other creative fields — I’ll refer to this as generative content. vs ii) In the context of Machine Learning (ML), or more broadly speaking, in statistics — I’ll refer to this as a (statistical) generative model.
In the arts, creative fields, computer graphics, computational design etc. ‘generative content’ refers to content made by working with rule-based or (semi or fully) autonomous systems (which can also be thought of as governed by rules). From a computational perspective we might quickly think of early generative art pioneers such as Manfred Mohr or Vera Molnar; or the types of works typically made with creative programming toolkits such as openFrameworks, Processing, vvvv, Houdini, TouchDesigner, and many more. However, ‘generative’ can also be ‘analog’ (yet still algorithmic), e.g. Lillian Schwartz, Sol LeWitt, Steve Reich, Terry Riley are classic oft-quoted examples. (Personally I prefer the term ‘procedural’ or ‘algorithmic’ over ‘generative’ in this context).
In ML/statistics however, a ‘generative model’ refers to a system which captures the distribution of some observations (e.g. ‘training data’), in the hope that ideally it is able to model the underlying dynamic processes that gave rise to those observations. So when we sample from this distribution, we are effectively executing the function which models that dynamic process; thus, this generates new data that should hopefully be structurally similar to the observations. This also means that when we manipulate parameters of the generative model, we are manipulating parameters of the dynamic process, and thus generating varying outputs that are consistent with the world from which the training data came. (This is of course, an ideal case scenario. It’s important to remember that “All models are wrong, but some are useful”, and “…the practical question is, how wrong do they have to be to not be useful.”- George E. Box.)
These are two very different, quite unrelated definitions of ‘generative’. However, we can loosely relate them by noting that sampling from ‘statistical generative models’ can almost always be considered ‘generative content’ from the ‘generative art and design’ point of view. But ‘generative content’ is not by default ‘generative’ from a statistical point of view. Even using ML does not guarantee statistical generative-ness. E.g. Deepdream does not use a statistical generative model or process (there is no distribution of data, it’s just gradient ascent), but it is generative content.
My use of the term ‘generative’ in this article follows the ML/statistics definition. Most likely, any paper or article you read on generative models in ML, particularly if written by someone with a background in statistics or ML, will also be using this definition, whereas one written by someone with a background in art, computational design, computer graphics etc. will most likely be using the ‘generative content’ aka ‘procedural’ definition.
‘Deep’ Learning (in this context)
Some of the major recent advances in generative models are in deep generative models, i.e. those using Deep Artificial Neural Networks, i.e. Deep Learning (DL), the current dominant paradigm in AI research.
Sometimes people (quite rightly) point out weaknesses of DL. A common one is that these systems often require a lot (e.g. thousands, hundreds of thousands, even millions) of examples of training data. While this is absolutely true, I find it difficult to see this as a weakness of DL per se. Rather, I see this as a weakness of the state of the field of AI research in general. Because according to my understanding, this is exactly the whole point of DL: to extract meaningful information from a vast sea of humanly-unmanageable big data. (Of course it’s no coincidence, in our current times of extreme surveillance capitalism, that the particular sub-field of AI which is seeing the most progress and financial investment is the one which deals with extracting information from big data; I discuss this in mini, short and long versions elsewhere.)
Referring to DL’s dependency on big data as a weakness, seems analogous to referring to the big size and heaviness of a sledgehammer as its weakness.
Whereas perhaps that is exactly what defines the sledgehammer and what it’s for. But luckily, we also have hammers of all sizes, suitable for different jobs.
In fact we have all kinds of tools that aren’t even hammers.
It’s undoubtedly important to note that right now DL is a big heavy weapon best suited for dealing with vast amounts of data, for smashing open massive boulders into millions of pieces, to see what’s hidden inside. And for optimum results, it’s best combined with other tools specializing in different functions for different purposes.
Control, Relinquishing control, Meaningful control
(This section was too epic, it will become its own post eventually, but ultimately a section on it belongs here).
The progress I’m referring to over the past few years has been in deep generative models, i.e. generative neural networks capable of training on vast datasets to learn hierarchies of representations. In this particular paper I’ll be focusing on image based models: deep generative models that are able to learn to produce high resolution, realistic (or not) and novel images. The techniques mentioned here can be adapted to other domains; in fact the audio in all of these videos, including “When I’m stuck with a day that’s grey”, was also created by a similar generative model, which will be the subject of an upcoming article.
The goal of these generative models is to learn the distribution of the training data such that we can sample from the distribution to generate new images. We refer to this distribution as the latent space of the model. It is usually high dimensional, though much lower than the dimensions of the raw medium. E.g. When dealing with full colour (e.g. RGB) images with 1024x1024 pixel resolution, we are dealing with roughly 3 million dimensions (i.e. features — 1024x1024x3), whereas we might use a latent space of only 512 dimensions (a compression ratio of 6144:1!).
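To make the numbers above concrete, here is a minimal sketch of the arithmetic, together with drawing a single z from a standard normal prior. The prior is an assumption at this point in the article (it matches the Gaussian latent spaces discussed in the appendix), and the variable names are purely illustrative.

```python
import numpy as np

# Dimensions of the raw medium: one value per pixel per colour channel.
height, width, channels = 1024, 1024, 3
pixel_dims = height * width * channels      # roughly 3 million

# Dimensions of the latent space used as the example in the text.
latent_dims = 512
ratio = pixel_dims // latent_dims           # the "compression ratio"

# One point in latent space, assuming a standard normal prior.
rng = np.random.default_rng(0)
z = rng.standard_normal(latent_dims)

print(pixel_dims, ratio, z.shape)
```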
We denote vectors (i.e. points, coordinates) in this latent space with z, and any point in the latent space can be decoded into pixel space via a special decoder function. These days it’s quite common for this decoder to be a deep Convolutional Neural Network (CNN), that e.g. takes a 512 dimensional z vector and outputs a ~3 million dimensional vector (i.e. a 1024x1024 RGB image). So in short, any point in this space can be decoded to a unique image. (A much longer introductory article on high dimensional latent spaces can be found here).
As mentioned in the OpenAI blog linked above, there are a number of architectures and methods which are very good at this task. Since that post in 2016, more recent developments include Progressive Growing of GANs, Glow and Large Scale GAN (aka BigGAN). The research and techniques I discuss below are agnostic of the architecture and training method; I focus purely on how to explore and construct meaningful journeys in latent space, no matter how the space was constructed.
Depending on our goals, there are many different ways in which we could explore such a latent space. At one end of the spectrum, we could use a fully automated search whereby points of interest are found algorithmically based on various heuristics, such as novelty. At the other end of the spectrum, we could provide a target image and retrieve a corresponding z vector. Some architectures — e.g. Variational Auto-Encoders (VAE) or Glow — have an encoder which does exactly this, while other architectures — e.g. vanilla Generative Adversarial Networks (GAN) — do not. However it is still possible either by manually adding an encoder, such as in the case of a VAE/GAN, or using gradient based optimization techniques to search for a corresponding latent vector. Other techniques for exploring latent spaces include systematic visualizations of various interpolations between key points in space. A very recent example uses the metaphor of breeding, by mixing different images together (in latent space) to find new images.
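As a rough illustration of the gradient based search mentioned above, the sketch below recovers a latent vector for a target output by gradient descent on the squared reconstruction error. The decoder here is a hypothetical stand-in (a fixed random linear map, so the gradient can be written by hand); with a real network one would use a deep learning framework's automatic differentiation instead, and the step size and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dims, pixel_dims = 8, 64

# Hypothetical stand-in for a trained decoder: a fixed random linear map.
W = rng.standard_normal((pixel_dims, latent_dims)) / np.sqrt(pixel_dims)

def decode(z):
    """Map a latent vector to 'pixel' space (toy decoder)."""
    return W @ z

# The target 'image' we want to find a latent vector for.
z_true = rng.standard_normal(latent_dims)
target = decode(z_true)

# Gradient descent on the loss ||decode(z) - target||^2, starting from zero.
z_hat = np.zeros(latent_dims)
learning_rate = 0.1
for _ in range(2000):
    residual = decode(z_hat) - target
    grad = 2.0 * W.T @ residual      # gradient of the squared error w.r.t. z
    z_hat -= learning_rate * grad

error = np.linalg.norm(decode(z_hat) - target)
print(error)
```

For a real, non-linear decoder the search is not guaranteed to find an exact match, so the result is better thought of as the closest image the model can produce.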
Our paper is not an attempt to replace any of these methods for discovering interesting points in latent space. Rather, we can incorporate them into our process to enable users to discover and design interesting trajectories in latent space, to construct stories, and produce sequences with meaningful control over narrative. Our main contribution is a workflow based on familiar tools such as traditional non linear video editing software, operating on proxy video clips, and conforming the edit with arrays of latent space vectors.
We face a number of challenges to overcome in crafting trajectories in such latent spaces:
- The space is very large and high dimensional (e.g. 128, 512, 2048 etc.).
- It is not distributed ‘evenly’, or as one might expect or desire. I.e. if we were to sample uniformly across the space, we might end up with many more images of one type over another (e.g. in the case of our test model, flowers seem to occupy a large portion of the space, probably due to a higher representation in the training data).
- As a result of the uneven distribution, interpolating from latent vector z_A to z_B at a constant speed might produce images changing with a perceptually variable speed.
- It is very difficult to anticipate trajectories in high dimensions. E.g. interpolating from z_A to z_B might pass through points z_X and z_Y, which may be undesirable.
- The mass of the distribution is concentrated in the shell of a hypersphere.
- The latent space changes with subsequent training iterations.
We discuss these all in more detail below.
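The point about the hypersphere shell is easy to check numerically. The sketch below assumes a 512 dimensional standard normal latent space and shows that sample norms cluster tightly around sqrt(512), while the midpoint of a straight line between two samples falls well inside that shell.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = 512

# Many independent samples from a standard normal latent space.
samples = rng.standard_normal((10000, dims))
norms = np.linalg.norm(samples, axis=1)
expected_radius = np.sqrt(dims)   # about 22.6

# Midpoints of straight lines (linear interpolation) between pairs of samples.
midpoints = 0.5 * (samples[:5000] + samples[5000:])
mid_norms = np.linalg.norm(midpoints, axis=1)

print("mean norm of samples:  ", norms.mean())
print("spread (std) of norms: ", norms.std())
print("mean norm of midpoints:", mid_norms.mean())
```

The midpoints have a noticeably smaller norm than any typical sample, i.e. a straight line between two keyframes leaves the region where the model was trained to produce sensible images.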
It is common practice in video editing and post-production pipelines to perform an offline edit on a set of proxy clips (e.g. small, easily manageable, low quality video) and later conform the edit by (mostly automatically) applying it to online material (e.g. high quality video). (NB this has nothing to do with being online or offline with regards to the internet :), but is based on much older terminology related to automated machines operating on reels of footage.)
This conforming process is usually performed by transferring information regarding the edit to an online system (i.e. with access to the full quality rushes) using a file such as an Edit Decision List (EDL) which contains reel and timecode information as to where each video clip can be found in order to perform the final cut.
Our method is analogous to this workflow. Our offline clips are the video outputs from a generative model for given z-sequences. Our online clips are numpy arrays of corresponding z-sequences.
In other words, we define (z-sequence, video) pairs. A z-sequence is a numpy array (saved to disk) containing a sequence of z-vectors, i.e. a trajectory in latent space. A video is a normal QuickTime file where each frame is the image output from the model decoding the corresponding z vector from the corresponding z-sequence.
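A minimal sketch of the z-sequence half of such a pair is shown below: a (frames, dimensions) array saved to disk and loaded back. The file names and the shared-basename convention are hypothetical, and writing the actual video (decoding each z with the model and encoding the frames) is left out.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
num_frames, latent_dims = 250, 512

# A z-sequence: one row per video frame, one column per latent dimension.
z_sequence = rng.standard_normal((num_frames, latent_dims)).astype(np.float32)

# Hypothetical convention: the video and its z-sequence share a base name.
clip_dir = tempfile.mkdtemp()
z_path = os.path.join(clip_dir, "journey_0001.npy")
video_path = os.path.join(clip_dir, "journey_0001.mov")  # decoded frames would go here

np.save(z_path, z_sequence)

# Later (e.g. during the conform) the array is loaded back by name.
loaded = np.load(z_path)
print(loaded.shape)
```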
A high level summary of the process is as follows:
- We can edit the videos in a conventional Non Linear Video Editor (NLE) such as Adobe Premiere, Apple Final Cut, Avid Media Composer or Kdenlive (my weapon of choice, discussed more in the appendix).
- Run a custom script to conform the edit with the corresponding numpy arrays containing the z-sequences (i.e. apply the video edit from the NLE, onto the corresponding numpy arrays).
- Feed the resulting conformed z-sequence into the model for final output.
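The conform step above can be sketched as follows, assuming the edit has already been read out of the NLE as a list of (clip name, first frame, last frame) entries. That representation and the function name are my own illustration here, not a standard format.

```python
import numpy as np

def conform(edit, z_sequences):
    """Apply a video edit to the matching z-sequences.

    edit:        list of (clip_name, first_frame, last_frame), frames inclusive,
                 in the order the clips appear on the timeline.
    z_sequences: dict mapping clip_name to an array of shape (frames, dims).
    """
    segments = [z_sequences[name][first:last + 1] for name, first, last in edit]
    return np.concatenate(segments, axis=0)

rng = np.random.default_rng(0)
z_sequences = {
    "clip_a": rng.standard_normal((100, 512)),
    "clip_b": rng.standard_normal((80, 512)),
}

# Ten frames of clip_a, five frames of clip_b, then a single frame of clip_a.
edit = [("clip_a", 10, 19), ("clip_b", 0, 4), ("clip_a", 50, 50)]
conformed = conform(edit, z_sequences)
print(conformed.shape)
```

The conformed array has one row per frame of the edited timeline, ready to be decoded by the model.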
It’s best to perform the edit on keyframes, i.e. temporally sparse destination points in latent space, so that after the conform, we can interpolate between them for smooth, continuous output.
Below is an example process in more detail. This is merely a suggestion, as many different workflows could work, however this is the process I used to create the videos on this page.
In the following context the phrase ‘render a z-sequence’ refers to i) saving the z-sequence to disk as a numpy array, and ii) decoding the z-sequence with many snapshots of the model (from 28 different training iterations, spaced 1000 iterations apart) and saving out a video where the output of each snapshot is tiled into a grid (e.g. 7x4) and labelled with the corresponding training iteration. Rendering multiple snapshots in a grid on a single frame in this way gives us an overview of how the latent space has evolved across training iterations, and allows us to easily see and select the most aesthetically desirable snapshot(s) (we go deeper into the motivations behind this in the Snapshots across time Appendix).
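The tiling part of ‘rendering’ can be sketched with plain numpy. The example assumes the 28 snapshot outputs for one z are already available as an array of images, and that ‘7x4’ means 7 columns by 4 rows; the random images stand in for real model outputs, and labelling each tile with its iteration is omitted.

```python
import numpy as np

def tile_grid(images, columns):
    """Tile images of shape (n, h, w, c) into one (rows*h, columns*w, c) frame."""
    n, h, w, c = images.shape
    if n % columns != 0:
        raise ValueError("number of images must be a multiple of columns")
    rows = n // columns
    grid = images.reshape(rows, columns, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, h, columns, w, c)
    return grid.reshape(rows * h, columns * w, c)

rng = np.random.default_rng(0)

# Stand-in for the same z decoded by 28 snapshots of the model (tiny 32x32 images).
snapshot_outputs = rng.random((28, 32, 32, 3))
frame = tile_grid(snapshot_outputs, columns=7)
print(frame.shape)
```

Tiles are laid out left to right, top to bottom, so the earliest snapshot ends up in the top left corner.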
1. Take many (e.g. hundreds or thousands of) unbiased (i.e. totally ‘random’) samples in latent space and render. This produces a video (and corresponding z-sequence) where each frame is an entirely different ‘random’ image. This gives us an idea of what the model has learnt, and how it is distributed. It also gives us an idea of how the distribution changes across subsequent training iterations, and which snapshots provide more aesthetically desirable images.
2. Edit the video in a NLE to remove undesirable (i.e. ‘bad’) images or to bias (or de-bias) the distribution (e.g. remove some frames containing flowers if there are too many, or duplicate frames containing bacteria if there aren’t enough of them etc.).
3. Run the script to conform the edit with the original z-sequence and re-render. This produces a new video (and corresponding z-sequence) where each frame is still an entirely different ‘random’ image, but which hopefully has a desired distribution (i.e. no ‘bad’ images, and a desirable balance between different images).
4. Repeat steps 2–3 until we are happy with the distribution (one or two rounds is usually enough). Optionally apply varying amounts of noise in z to explore neighborhoods of selected frames (e.g. to look for and include more images of bacteria, with more variation).
5. Load the final edited z-sequence (with desired distribution) and render many (e.g. tens or hundreds of) short journeys interpolating between two or three random (or hand picked) z (selected from the z-sequence). This produces tens or hundreds of short videos (and corresponding z-sequences) that contain smooth, slow interpolations between two or three keyframes, where the keyframes are chosen from our preferred distribution. This gives us an idea of how the model transitions between selected images. E.g. the shortest path from a mountain to a face might have to go through buildings, which might not be desirable, but inserting a flower in between might avoid the buildings and look nicer, both aesthetically and conceptually.
6. Repeat step 5, homing in on journeys which seem promising, optionally applying varying amounts of noise in z to explore neighborhoods of selected frames and journeys.
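Two of the smaller operations in the steps above, exploring the neighborhood of a frame with noise and picking keyframes for short journeys, might look something like this. The function names, the noise scale and the random stand-in for the curated z-sequence are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the final edited z-sequence (one curated z per row).
curated = rng.standard_normal((200, 512))

def neighbourhood(z, num_variations, noise_scale, rng):
    """Return variations of z obtained by adding Gaussian noise."""
    noise = rng.standard_normal((num_variations, z.shape[0]))
    return z + noise_scale * noise

def pick_keyframes(z_sequence, num_keyframes, rng):
    """Pick distinct random rows of the z-sequence to use as keyframes."""
    indices = rng.choice(len(z_sequence), size=num_keyframes, replace=False)
    return z_sequence[indices]

# Sixteen variations around one selected frame.
variations = neighbourhood(curated[42], num_variations=16, noise_scale=0.2, rng=rng)

# Keyframes for ten short journeys of three keyframes each.
journeys = [pick_keyframes(curated, 3, rng) for _ in range(10)]
print(variations.shape, len(journeys), journeys[0].shape)
```

Larger values of `noise_scale` wander further from the selected frame; each set of keyframes would then be interpolated and rendered as a short clip.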
The above steps produce an arsenal of short snippets of video clips.
These can then be further edited and joined in a NLE, and then conformed with the corresponding z-sequences to produce a final video. We have produced many hours worth of carefully constructed stories using this method.
Model architecture and data
We applied the approach mentioned in this paper on a number of different models and architectures, however the primary test case we refer to (and from which we also show the results) is a GAN (specifically, a Progressively Grown GAN) trained on over 100,000 images scraped from the photo sharing website Flickr. The dataset is very diverse and includes images tagged with: art, cosmos, everything, faith, flower, god, landscape, life, love, micro, macro, bacteria, mountains, nature, nebula, galaxy, ritual, sky, underwater, marinelife, waves, ocean, worship and more. We include three thousand images from each category and train the network with no classification labels. Given such a diverse dataset without any labels, the network is forced to try and organize its distribution based purely on aesthetics, without any semantic information. Thus in this high dimensional latent space we find directions allowing us to seamlessly morph from swarms of bacteria to clouds of nebulae, oceanic waves to mountains, flowers to sunsets, blood cells to technical illustrations etc. Most interestingly, we can perform these transformations across categories while maintaining overall composition and form.
Video editing and conforming the edit
I use the open source Non Linear Video Editor Kdenlive on Ubuntu. Unfortunately this editor lacks support for exporting the industry standard EDL. However, Kdenlive’s native project file format is XML based, which allows us to write a Python based parser to load the project file, inspect the edits, retrieve the corresponding numpy z-sequences, conform by performing the same edits on them, and export a new z-sequence. At this point the conform is very simple: it only supports basic operations such as trimming, cutting and joining, and does not include cross-fades or other more advanced features or transitions. However, implementing such additional features is relatively trivial and left as future work (e.g. a cross-fade between two images in the NLE can be thought of as an interpolation between the two corresponding points in latent space).
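To give a flavour of what such a parser involves, here is a heavily simplified sketch. It assumes an MLT-style structure (producers with a `resource` property, and a playlist of entries with `in` and `out` attributes, treated here as inclusive frame numbers). Real Kdenlive project files are considerably more complex and differ between versions (e.g. timecodes instead of frame numbers, multiple tracks), so treat the XML below as a made-up example rather than the actual file format.

```python
import xml.etree.ElementTree as ET
import numpy as np

# Made-up, minimal project file in the spirit of an MLT XML document.
PROJECT_XML = """
<mlt>
  <producer id="clip_a"><property name="resource">clip_a.mov</property></producer>
  <producer id="clip_b"><property name="resource">clip_b.mov</property></producer>
  <playlist id="playlist0">
    <entry producer="clip_a" in="10" out="19"/>
    <entry producer="clip_b" in="0" out="4"/>
  </playlist>
</mlt>
"""

def read_edit(xml_text):
    """Return the timeline as a list of (video_file, first_frame, last_frame)."""
    root = ET.fromstring(xml_text)
    # Map each producer id to the video file it refers to.
    resources = {}
    for producer in root.iter("producer"):
        for prop in producer.iter("property"):
            if prop.get("name") == "resource":
                resources[producer.get("id")] = prop.text
    # Read the playlist entries in timeline order.
    edit = []
    for entry in root.iter("entry"):
        video_file = resources[entry.get("producer")]
        edit.append((video_file, int(entry.get("in")), int(entry.get("out"))))
    return edit

edit = read_edit(PROJECT_XML)

# Stand-ins for the z-sequences belonging to each video file.
rng = np.random.default_rng(0)
z_sequences = {
    "clip_a.mov": rng.standard_normal((100, 512)),
    "clip_b.mov": rng.standard_normal((80, 512)),
}
conformed = np.concatenate(
    [z_sequences[video][first:last + 1] for video, first, last in edit], axis=0
)
print(edit, conformed.shape)
```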
We use generative models with high dimensional (512D) multivariate Gaussian distributed latent spaces. Because these distributions are concentrated around the surface of a hypersphere, when we wish to interpolate between points in this space, we have to make sure that our trajectory stays within the distribution. A common solution is to use spherical instead of linear interpolation. However this produces visibly noticeable discontinuities in the movement of the output images due to sudden changes in speed and direction. The images below are two different z trajectories, i.e. journeys in latent space, created by interpolating between a number of arbitrary keyframes. In both images, a single pixel wide vertical slice represents a single z vector, and time flows left to right.
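For reference, a common formulation of spherical interpolation (slerp) is sketched below, chained across a few keyframes. The keyframes are random stand-ins scaled onto a sphere of radius sqrt(512), and the last lines measure how sharply the direction of travel changes at a keyframe, which is the discontinuity described above.

```python
import numpy as np

def slerp(z_a, z_b, t):
    """Spherical interpolation between z_a (t=0) and z_b (t=1)."""
    unit_a = z_a / np.linalg.norm(z_a)
    unit_b = z_b / np.linalg.norm(z_b)
    omega = np.arccos(np.clip(np.dot(unit_a, unit_b), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1.0 - t) * z_a + t * z_b      # nearly parallel: fall back to lerp
    sin_omega = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / sin_omega) * z_a \
         + (np.sin(t * omega) / sin_omega) * z_b

def interpolate_keyframes(keyframes, frames_per_segment):
    """Chain slerps between consecutive keyframes into one z-sequence."""
    frames = []
    for z_a, z_b in zip(keyframes[:-1], keyframes[1:]):
        for t in np.linspace(0.0, 1.0, frames_per_segment, endpoint=False):
            frames.append(slerp(z_a, z_b, t))
    frames.append(keyframes[-1])
    return np.stack(frames)

rng = np.random.default_rng(0)
radius = np.sqrt(512)
keyframes = rng.standard_normal((4, 512))
keyframes *= radius / np.linalg.norm(keyframes, axis=1, keepdims=True)

trajectory = interpolate_keyframes(keyframes, frames_per_segment=30)
norms = np.linalg.norm(trajectory, axis=1)

# Direction of travel just before and just after the second keyframe (frame 30).
v_before = trajectory[30] - trajectory[29]
v_after = trajectory[31] - trajectory[30]
turn_cosine = np.dot(v_before, v_after) / (
    np.linalg.norm(v_before) * np.linalg.norm(v_after)
)
print(norms.min(), norms.max(), turn_cosine)
```

The norms stay on the sphere throughout, but a `turn_cosine` well below 1 means the trajectory turns abruptly at the keyframe.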
Figure C1 visualizes the results of spherical interpolation. We can see notch-like vertical artifacts that happen when the interpolation reaches its destination and we set a new target, creating a sudden change in speed and direction. To remedy this we introduce a simple physics based dynamical system, the results of which can be seen in Figure C2. In the high dimensional latent space we create a particle connected to both the surface of the hypersphere and the next destination point with damped springs. This ensures that the particle stays close to the distribution, but also moves without discontinuities at keyframes.
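A minimal sketch of this kind of dynamical system is given below. It is one plausible reading of the description rather than the exact implementation used for the videos: the spring constants, damping, time step and the semi-implicit Euler integrator are all assumptions chosen for illustration.

```python
import numpy as np

def spring_journey(keyframes, steps_per_keyframe, radius,
                   target_stiffness=4.0, shell_stiffness=50.0,
                   damping=4.0, dt=0.02):
    """Move a particle through the keyframes using damped springs.

    One spring pulls the particle towards the current target keyframe,
    another pulls it towards the surface of the hypersphere of given radius.
    Velocity is carried over when the target changes, so the motion has no
    sudden jumps in speed or direction at keyframes.
    """
    position = keyframes[0].copy()
    velocity = np.zeros_like(position)
    trajectory = [position.copy()]
    for target in keyframes[1:]:
        for _ in range(steps_per_keyframe):
            norm = np.linalg.norm(position)
            outward = position / norm
            force = target_stiffness * (target - position)        # towards target
            force += shell_stiffness * (radius - norm) * outward   # towards shell
            force -= damping * velocity                            # damping
            velocity = velocity + dt * force                       # semi-implicit Euler
            position = position + dt * velocity
            trajectory.append(position.copy())
    return np.stack(trajectory)

rng = np.random.default_rng(0)
radius = np.sqrt(512)
keyframes = rng.standard_normal((3, 512))
keyframes *= radius / np.linalg.norm(keyframes, axis=1, keepdims=True)

trajectory = spring_journey(keyframes, steps_per_keyframe=500, radius=radius)
norms = np.linalg.norm(trajectory, axis=1)
print(trajectory.shape, norms.min() / radius, norms.max() / radius)
```

With these constants the particle only stays approximately on the shell; a stiffer shell spring keeps it closer, at the cost of needing a smaller time step.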
A method I’ve been using more recently is borrowing from differential geometry and Riemannian manifolds. This involves projecting an offset (e.g. velocity) vector onto the tangent space of the manifold (i.e. hypersphere in the case of a normally distributed space), and from there transforming onto the manifold itself. This will be a post (and/or paper) in itself in the near future.
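Until that post exists, here is the textbook version of the two operations for a sphere: projecting a vector onto the tangent space at a point, and moving along the sphere via the exponential map. This is a generic construction under the assumption that the manifold is a hypersphere centred at the origin; it is not necessarily the exact method I use.

```python
import numpy as np

def project_to_tangent(point, vector):
    """Remove the component of vector that points along the sphere's normal at point."""
    normal = point / np.linalg.norm(point)
    return vector - np.dot(vector, normal) * normal

def exponential_map(point, tangent):
    """Move from point along the great circle in the direction of tangent.

    The distance travelled along the sphere equals the length of tangent.
    """
    radius = np.linalg.norm(point)
    length = np.linalg.norm(tangent)
    if np.isclose(length, 0.0):
        return point.copy()
    angle = length / radius
    return np.cos(angle) * point + np.sin(angle) * radius * (tangent / length)

rng = np.random.default_rng(0)
radius = np.sqrt(512)
point = rng.standard_normal(512)
point *= radius / np.linalg.norm(point)

velocity = rng.standard_normal(512)              # arbitrary offset vector
tangent = project_to_tangent(point, velocity)
new_point = exponential_map(point, tangent)

print(np.dot(tangent, point), np.linalg.norm(new_point))
```

The projected vector is orthogonal to `point`, and the new point has the same norm as the old one, so repeated steps never leave the sphere.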
Snapshots across time
As the network trains, the latent space changes with each training iteration, to hopefully represent the data more efficiently and accurately. However a noticeable change across these iterations also includes transformations and shifts in space. E.g. what may be an area in latent space dedicated to ‘mountains’ at iteration 70K, might become ‘flowers’ at iteration 80K, while ‘mountains’ slide over to what used to be ‘clouds’ (this is a bit of an exaggerated oversimplification). To investigate the effects of these transformations, we render the same z-sequence decoding from a number of different snapshots across subsequent training iterations (e.g. the last 28 snapshots spaced 1000 iterations apart), and we tile the outputs in a grid (e.g. 7x4) when saving a video. An example video can be seen below.
Here, every tile within a frame is the same z-vector decoded from a different snapshot in time (i.e. training iteration). We can see in many cases the images are relatively similar with slight variations. In other cases there are more radical shifts, where earlier snapshots are hinting at generating one type of image while later snapshots are producing another for the same z-vector. Fascinatingly, even while semantically the images might be radically different, sometimes the overall form and composition is similar. E.g. in the images below, the semantic content of the tiles changes as training continues, however overall shape is maintained. I hope to write a more in depth analysis of this in the near future.
When editing our videos in the NLE, we edit these videos containing the outputs from multiple tiled snapshots. This gives us an overview of the aesthetic qualities from the different training iterations, and allows us to choose the most aesthetically desirable snapshot(s) to use for our final output.
Many aspects of this process can be improved, from theoretical, computational, and user experience points of view. We present this research as a first step of many towards enabling users to meaningfully explore and control journeys in high dimensional latent spaces to construct stories, using and building upon industry standard tools and methods with which they may already be comfortable. Our ultimate goal is to enable users to creatively express themselves, and meaningfully control the way in which they produce time-based media using deep generative models as a medium.
While this article focuses on only image based generative models, there’s nothing inherently constrained to images about this workflow. In fact the sounds in all of these videos were produced in the same way, using GRANNMA (Granular Neural Music and Audio). I’ll be posting a paper and article about that soon.
This work has been supported by UK’s EPSRC Centre for Doctoral Training in Intelligent Games and Game Intelligence (IGGI; grant EP/L015846/1).