Deep meditations: a meaningful exploration of inner self, a controlled navigation of latent space.

It’s all about the Journey.

“Deep Meditations: When I’m stuck with a day that’s grey…”. A journey in a 512D latent space, carefully constructed to tell a particular story.

Abstract

We introduce a method that allows users to creatively explore and navigate the vast latent spaces of deep generative models such as Generative Adversarial Networks. Specifically, we focus on enabling users to discover and design interesting trajectories in these high-dimensional spaces, to construct stories, and to produce time-based media such as videos, where the most crucial aspect is providing meaningful control over narrative. Our goal is to encourage and aid the use of deep generative models as a medium for creative expression and storytelling with meaningful human control. Our method is analogous to traditional video production pipelines in that we use a conventional non-linear video editor with proxy clips, and conform the edit with arrays of latent space vectors.

Background

‘Generative’ Models

One of the recent major advances we’ve seen in the so-called field of ‘Artificial Intelligence’ (AI) is in ‘generative models’. A nice summary (if slightly out of date due to the insane speed at which the field is moving) can be found on the OpenAI blog.

‘Deep’ Learning (in this context)

Some of the major advances that have been happening recently in generative models are in deep generative models, i.e. models built with Deep Artificial Neural Networks, i.e. Deep Learning (DL), the current dominant paradigm in AI research.

A sledgehammer. Big and heavy. (image source: https://www.machines4u.com.au/mag/100-years-ago-today-antique-woodworking-tools/)
Screenshot of Andy Dufresne (played by Tim Robbins) using his rock hammer in The Shawshank Redemption (source: Castle Rock Entertainment, https://small-change.uq.edu.au/blog/2016/03/hammer-shaped-university)

Control, Relinquishing control, Meaningful control

(This section grew too epic; it will eventually become its own post, but ultimately a section on it belongs here.)

Introduction

So the progress I’m referring to, which we have been seeing in the past few years, has been in deep generative models: generative neural networks capable of training on vast datasets to learn hierarchies of representations. In this particular paper I’ll be focusing on image-based models, i.e. deep generative models that are able to learn to produce high resolution, realistic (or not) and novel images. The techniques mentioned here can be adapted to other domains; in fact the audio in all of these videos, including “When I’m stuck with a day that’s grey”, was also created by a similar generative model, which will be the subject of an upcoming article.

Challenges

We face a number of challenges in crafting trajectories in such latent spaces:

  1. The latent space is not distributed ‘evenly’, or as one might expect or desire. I.e. if we were to sample uniformly across the space, we might end up with many more images of one type than another (e.g. in the case of our test model, flowers seem to occupy a large portion of the space, probably due to their higher representation in the training data).
  2. As a result of the uneven distribution, interpolating from latent vector z_A to z_B at a constant speed might produce images that change at a perceptually varying speed.
  3. It is very difficult to anticipate trajectories in high dimensions. E.g. interpolating from z_A to z_B might pass through points z_X and z_Y, which may be undesirable.
  4. The mass of the distribution is concentrated in the shell of a hypersphere (a quick numerical check of this follows after the list).
  5. The latent space changes with subsequent training iterations.
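
To make challenge 4 concrete: for a standard multivariate Gaussian in 512 dimensions, almost all of the probability mass sits in a thin shell of radius roughly √512 ≈ 22.6, not near the origin. The minimal numpy sketch below is purely illustrative (only the 512D latent size is taken from our test model):

import numpy as np

dim = 512                      # latent dimensionality of our test model
n = 10000                      # number of random latent vectors to sample

z = np.random.randn(n, dim)    # standard multivariate Gaussian samples
norms = np.linalg.norm(z, axis=1)

# Nearly all samples land in a thin shell around radius sqrt(dim) ≈ 22.6
print("expected radius:", np.sqrt(dim))
print("mean norm      :", norms.mean())
print("std of norms   :", norms.std())   # small relative to the mean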

Method

Background

It is common practice in video editing and post-production pipelines to perform an offline edit on a set of proxy clips (e.g. small, easily manageable, low quality video) and later conform the edit by (mostly automatically) applying it to the online material (e.g. high quality video). (NB this has nothing to do with being online or offline with regard to the internet :), but is based on much older terminology related to automated machines operating on reels of footage.)

Summary

Our method is analogous to this workflow. Our offline clips are the video outputs from a generative model for given z-sequences. Our online clips are numpy arrays of corresponding z-sequences.

  1. Perform the offline edit on the proxy video clips in a conventional non-linear editor (NLE).
  2. Run a custom script to conform the edit with the corresponding numpy arrays containing the z-sequences, i.e. apply the video edit from the NLE onto the corresponding numpy arrays (a minimal sketch of this step follows below).
  3. Feed the resulting conformed z-sequence into the model for the final output.
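
As an illustration of the conforming step (step 2 above), here is a minimal numpy sketch. It assumes the NLE edit has been exported as a simple list of in/out frame ranges on the proxy video; the file names and the cuts format are hypothetical, and the actual script would instead parse the project file of whichever NLE is used (e.g. Kdenlive):

import numpy as np

# One z vector per frame of the proxy video (512D latent space).
z_seq = np.load("z_sequence.npy")              # shape: (num_frames, 512)

# Edit decision list exported from the NLE, as (in_frame, out_frame) pairs
# referring to frames of the proxy video. Purely illustrative values.
cuts = [(0, 250), (480, 620), (1200, 1555)]

# Conform: apply exactly the same cuts, in edit order, to the z-sequence.
conformed = np.concatenate([z_seq[a:b] for a, b in cuts], axis=0)

np.save("z_sequence_conformed.npy", conformed)
print("original frames :", len(z_seq))
print("conformed frames:", len(conformed))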
“A brief history of almost everything in 5 minutes”. A journey in a 512D latent space, carefully constructed to tell a particular story.

Process

Below is an example process in more detail. This is merely a suggestion, as many different workflows could work; however, this is the process I used to create the videos on this page.

  1. Generate a long sequence of random z vectors and feed it through the model to render a video (keeping the corresponding z-sequence), in which each frame is an entirely different ‘random’ image.
  2. Edit the video in an NLE to remove undesirable (i.e. ‘bad’) images or to bias (or de-bias) the distribution (e.g. remove some frames containing flowers if there are too many, or duplicate frames containing bacteria if there are not enough of them etc.)
  3. Run the script to conform the edit with the original z-sequence and re-render. This produces a new video (and corresponding z-sequence) where each frame is still an entirely different ‘random’ image, but which hopefully has the desired distribution (i.e. no ‘bad’ images, and a desirable balance between different images).
  4. Repeat steps 2–3 until we are happy with the distribution (one or two rounds is usually enough). Optionally apply varying amounts of noise in z to explore neighborhoods of selected frames (e.g. to look for and include more images of bacteria, with more variation).
  5. Load the final edited z-sequence (with the desired distribution) and render many (e.g. tens or hundreds of) short journeys interpolating between two or three random (or hand picked) z (selected from the z-sequence); see the sketch after this list. This produces tens or hundreds of short videos (and corresponding z-sequences) that contain smooth, slow interpolations between two or three keyframes, where the keyframes are chosen from our preferred distribution. This gives us an idea of how the model transitions between selected images. E.g. the shortest path from a mountain to a face might have to go through buildings, which might not be desirable, but inserting a flower in between might avoid the buildings and look nicer, both aesthetically and conceptually.
  6. Repeat step 5, homing in on journeys which seem promising, optionally applying varying amounts of noise in z to explore neighborhoods of selected frames and journeys.
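
A minimal sketch of step 5 is below. It uses plain linear interpolation between keyframes for brevity (in practice the spherical interpolation discussed in the Interpolation appendix is preferable), and the noise amount, frame counts and file names are illustrative assumptions rather than the exact script used:

import numpy as np

z_seq = np.load("z_sequence_conformed.npy")    # curated z-sequence, shape (N, 512)
rng = np.random.default_rng()

num_journeys = 100       # how many short clips to render
frames_per_leg = 200     # frames between consecutive keyframes
noise_amount = 0.1       # optional exploration of a keyframe's neighborhood

for j in range(num_journeys):
    # Pick two or three keyframes from the curated sequence ...
    num_keys = rng.integers(2, 4)              # 2 or 3
    keys = z_seq[rng.choice(len(z_seq), size=num_keys, replace=False)]
    # ... and optionally nudge them slightly to explore nearby images.
    keys = keys + noise_amount * rng.standard_normal(keys.shape)

    journey = []
    for a, b in zip(keys[:-1], keys[1:]):
        for t in np.linspace(0.0, 1.0, frames_per_leg, endpoint=False):
            journey.append((1.0 - t) * a + t * b)    # linear interpolation for brevity

    # Each saved z-sequence is then fed through the model to render a proxy clip.
    np.save(f"journey_{j:03d}.npy", np.stack(journey))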
A one hour seamless loop. A journey in a 512D latent space, carefully constructed to tell a particular story.

Appendix

Model architecture and data

We applied the approach mentioned in this paper to a number of different models and architectures; however, the primary test case we refer to (and from which we also show the results) is a GAN (specifically, a Progressively Grown GAN) trained on over 100,000 images scraped from the photo-sharing website Flickr. The dataset is very diverse and includes images tagged with: art, cosmos, everything, faith, flower, god, landscape, life, love, micro, macro, bacteria, mountains, nature, nebula, galaxy, ritual, sky, underwater, marinelife, waves, ocean, worship and more. We include three thousand images from each category and train the network with no classification labels. Given such a diverse dataset without any labels, the network is forced to try to organize its distribution based purely on aesthetics, without any semantic information. Thus in this high-dimensional latent space we find directions allowing us to seamlessly morph from swarms of bacteria to clouds of nebulae, oceanic waves to mountains, flowers to sunsets, blood cells to technical illustrations etc. Most interestingly, we can perform these transformations across categories while maintaining overall composition and form.

Video editing and conforming the edit

Example of editing z-sequences in Kdenlive

Interpolation

We use generative models with high-dimensional (512D) multivariate Gaussian distributed latent spaces. Because these distributions are concentrated around the surface of a hypersphere, when we wish to interpolate between points in this space we have to make sure that our trajectory stays within the distribution. A common solution is to use spherical instead of linear interpolation. However, this produces visibly noticeable discontinuities in the movement of the output images due to sudden changes in speed and direction. The images below show two different z trajectories, i.e. journeys in latent space, created by interpolating between a number of arbitrary keyframes. In both images, a single pixel wide vertical slice represents a single z vector, and time flows left to right.

Fig C1. z sequence using spherical interpolation
Fig C2. z sequence using physical interpolation
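
For reference, the spherical interpolation used for Fig C1 can be implemented as follows. This is a standard, generic formulation of slerp between two latent vectors, not the exact code used for the videos:

import numpy as np

def slerp(z_a, z_b, t):
    # Spherical linear interpolation between latent vectors z_a and z_b, for t in [0, 1].
    a = z_a / np.linalg.norm(z_a)
    b = z_b / np.linalg.norm(z_b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))   # angle between the two vectors
    if omega < 1e-6:                                       # (nearly) parallel: fall back to lerp
        return (1.0 - t) * z_a + t * z_b
    return (np.sin((1.0 - t) * omega) * z_a + np.sin(t * omega) * z_b) / np.sin(omega)

# A trajectory through a few arbitrary keyframes, sampled at a constant rate in t.
# Note the abrupt changes in speed and direction at each keyframe, as discussed above.
keyframes = [np.random.randn(512) for _ in range(4)]
trajectory = np.stack([slerp(a, b, t)
                       for a, b in zip(keyframes[:-1], keyframes[1:])
                       for t in np.linspace(0.0, 1.0, 100, endpoint=False)])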

Snapshots across time

As the network trains, the latent space changes with each training iteration, hopefully coming to represent the data more efficiently and accurately. However, a noticeable change across these iterations also includes transformations and shifts in the space. E.g. what may be an area in latent space dedicated to ‘mountains’ at iteration 70K might become ‘flowers’ at iteration 80K, while ‘mountains’ slide over to what used to be ‘clouds’ (this is a bit of an exaggerated oversimplification). To investigate the effects of these transformations, we render the same z-sequence decoded from a number of different snapshots across subsequent training iterations (e.g. the last 28 snapshots spaced 1000 iterations apart), and we tile the outputs in a grid (e.g. 7x4) when saving the video. An example video can be seen below.

An example z-sequence fed through different snapshots from training. Each frame shows the same z-vector decoded from 28 snapshots spaced 1000 training iterations apart.
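
The tiling itself is straightforward; a minimal sketch is below. The decode() stand-in, the snapshot count and the file names are assumptions for illustration only; in practice decode() would run the generator network loaded from the given training snapshot:

import numpy as np

def decode(snapshot_id, z, size=128):
    # Illustrative stand-in: in practice this would load the generator weights
    # saved at the given training snapshot and decode z into an RGB image.
    rng = np.random.default_rng(snapshot_id)
    return rng.random((size, size, 3))

z_seq = np.load("z_sequence_conformed.npy")     # shape: (num_frames, 512)
num_snapshots, rows, cols = 28, 4, 7            # e.g. last 28 snapshots, tiled 7x4

for frame_idx, z in enumerate(z_seq):
    tiles = [decode(s, z) for s in range(num_snapshots)]   # same z through every snapshot
    grid = np.concatenate(
        [np.concatenate(tiles[r * cols:(r + 1) * cols], axis=1) for r in range(rows)],
        axis=0)
    # Each grid image then becomes one frame of the comparison video.
    np.save(f"grid_frame_{frame_idx:05d}.npy", grid)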

Conclusion

Many aspects of this process can be improved, from theoretical, computational, and user experience points of view. We present this research as a first step of many towards enabling users to meaningfully explore and control journeys in high-dimensional latent spaces to construct stories, using and building upon industry-standard tools and methods with which they may already be comfortable. Our ultimate goal is to enable users to creatively express themselves, and to meaningfully control the way in which they produce time-based media, using deep generative models as a medium.

Acknowledgments

This work has been supported by the UK’s EPSRC Centre for Doctoral Training in Intelligent Games and Game Intelligence (IGGI; grant EP/L015846/1).
