A bit of background information for my article “#Deepdream is blowing my mind”
It’s a technical subject, but I think I can keep it relatively non-technical. If you’re familiar with the territory you can skip this. (I've put a tiny bit more info in the comments to the side).
Artificial Neural Networks
An Artificial Neural Network (ANN) can be thought of as analogous to a brain (immensely, immensely simplified. Nothing like a brain really).
It’s not really like a brain, it’s just metaphorically similar. ANNs were initially inspired by (what we thought we knew about) how the brain works, and there are still some similarities on a very high level.
An ANN consists of ‘neurons’ and ‘connections’ between neurons. The neurons are usually organized in layers. See this image from Wikipedia:
Data flows in one side of this neural network (via input nodes), gets processed along the network, and something is output on the other side via output nodes. (NB. In this context ‘Data’ and ‘Information’ mean the same thing: numbers. Long sequences of numbers)
Each connection (i.e. each arrow in the image above) between two nodes has a weight associated with it. This number is the strength of that connection. (NB. When data is fed to the input nodes, it gets passed down all of the arrows, multiplied by the ‘weight’ of each connection. The receiving nodes add up all of the numbers they receive from all their connections, put the sum through a little function called an ‘activation function’, and send the results down their own arrows to the next nodes. This is repeated throughout, all the way to the output nodes -> RESULT)
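To make the multiply, add up, activate routine a bit more concrete, here is a minimal sketch in Python. The weights, the network shape (3 inputs, 2 hidden nodes, 1 output) and the choice of a sigmoid as the activation function are all made up for illustration; real networks are vastly bigger and often use other activation functions.

```python
import math

def sigmoid(x):
    """One common activation function; squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights):
    """Compute the outputs of one layer of nodes.

    weights[j][i] is the weight on the connection from input i to node j.
    Each node adds up its weighted inputs, then applies the activation function.
    """
    return [sigmoid(sum(w * x for w, x in zip(node_weights, inputs)))
            for node_weights in weights]

def network_forward(inputs, layers):
    """Pass the data through every layer in turn, from input to output."""
    values = inputs
    for weights in layers:
        values = layer_forward(values, weights)
    return values

# Hypothetical weights: 3 inputs -> 2 hidden nodes -> 1 output node.
hidden_weights = [[0.5, -0.2, 0.1],
                  [0.3, 0.8, -0.5]]
output_weights = [[1.0, -1.0]]

result = network_forward([1.0, 0.0, 1.0], [hidden_weights, output_weights])
print(result)  # a single number between 0 and 1
```

Change any of the weights and the output changes, which is the whole point: the behaviour of the network lives in those numbers.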
In short, an ANN processes an arbitrary number of inputs and maps them to an arbitrary number of outputs. In this way the ANN acts like (and is often said to ‘model’) a mapping function.
And most importantly: Information (the function it models) is stored in the network as ‘weights’ (strengths) of connections between neurons.
If we feed a network some inputs, and the output depends on these connection weights, how do we know what weights we should use?
That’s where training comes in.
In what’s known as supervised learning, you provide the network with a bunch of training examples, in simple terms: input-output pairs. (NB. There are other types of learning too. Like unsupervised, where you don’t provide training examples, you just give the network a bunch of data, and it tries to extract patterns and relationships. Or semi-supervised learning which is a mixture of both.)
You would effectively say “For this input A, I want this output X; For this input B, I want this output Y; for this input C, I want this output Z”. This is analogous to pointing to pictures of animals with a toddler going “CAT”, “DOG” etc. You’re associating inputs (pictures of cats and dogs) with outputs (the words “CAT” and “DOG”).
Then you say LEARN! And a long iterative process tries to solve the network. The problem it’s trying to solve is: what weights do I need on each connection, such that when I feed in the inputs of the training examples, I get the corresponding outputs of the training examples? (NB. In reality you will never be able to train the network such that you get the exact same outputs for the training inputs. So it’s more about trying to minimise the error.)
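Here is a rough sketch of what that iterative process can look like, using gradient descent (one standard way of minimising the error) on a single ‘neuron’ with two weights and no activation function. The training examples, the learning rate and the number of steps are all invented for illustration. Real networks have many layers and use backpropagation to work out the gradients, but the nudge-the-weights-to-reduce-the-error loop is the same idea.

```python
# Training examples as (inputs, desired output) pairs.
# They happen to follow the rule output = 2*a - 3*b, but the training
# loop never sees that formula, only the examples.
examples = [([1.0, 0.0], 2.0),
            ([0.0, 1.0], -3.0),
            ([1.0, 1.0], -1.0),
            ([2.0, 1.0], 1.0)]

def predict(weights, inputs):
    """Weighted sum of the inputs (a single neuron, no activation function)."""
    return sum(w * x for w, x in zip(weights, inputs))

def mean_squared_error(weights):
    """How wrong the current weights are, averaged over all training examples."""
    return sum((predict(weights, x) - y) ** 2 for x, y in examples) / len(examples)

weights = [0.0, 0.0]      # start knowing nothing
learning_rate = 0.1       # assumed value; how big each nudge is

for step in range(500):
    # Gradient of the mean squared error with respect to each weight.
    gradients = [0.0, 0.0]
    for x, y in examples:
        error = predict(weights, x) - y
        for i in range(len(weights)):
            gradients[i] += 2 * error * x[i] / len(examples)
    # Nudge every weight a little in the direction that reduces the error.
    weights = [w - learning_rate * g for w, g in zip(weights, gradients)]

print(weights)                      # ends up close to [2.0, -3.0]
print(mean_squared_error(weights))  # ends up close to 0
```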
After you've trained the network, the network has (hopefully) optimum weights on each connection. Such that if you were to feed it the inputs from the training examples, you (hopefully) get the same (or near) results as the corresponding outputs.
Where it gets interesting and potentially useful (and potentially wrong and scary), is if you feed it new input data that it hasn't seen before, and it tries to interpolate / extrapolate / calculate / predict relevant new output based on the patterns it’s found from the training data. Current models predict on a spectrum ranging from spectacularly accurate, to spectacularly wrong.
So how the hell do you use this for complex tasks like image or voice recognition?
Deep / Convolutional / Recurrent Neural Networks
The network image from Wikipedia above is a very simple network. Suited to simple (i.e. low dimensional) problems. But to solve complex problems with it, you — as the trainer of the network — would need to be very specific in the data you feed it. You couldn't just feed it raw image data (i.e. millions of pixels), the network wouldn't be able to cope with the immense amount of information found in a raw image. You would need to do a hell of a lot of manual, handcrafted feature extraction first to reduce the dimensions (i.e. amount of information). Instead of feeding it raw images, you might need to run filters on the image first, find edges, break it down into simpler shapes, etc. And then feed those reduced, simplified features into the network for training and processing. This process of manually identifying and extracting features — called feature engineering — is a major bottleneck. It’s difficult, time-consuming and requires domain specific knowledge and skill.
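As a small illustration of what a handcrafted feature can look like, here is a sketch that runs a hand-picked edge filter over a tiny image. The 5x5 ‘image’ is made up, and the filter is a Sobel kernel, a classic hand-designed vertical-edge detector. Somebody had to decide that edges matter and pick those nine numbers, which is exactly the kind of manual work described above.

```python
# A tiny made-up 5x5 grayscale image: dark on the left, bright on the right.
image = [[0, 0, 0, 9, 9] for _ in range(5)]

# Hand-chosen 3x3 filter that responds to vertical edges (a Sobel kernel).
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def apply_filter(image, kernel):
    """Slide the kernel over the image and return the response at every
    position where it fits entirely inside the image.
    (Strictly speaking this is cross-correlation: the kernel is not flipped.)"""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(cols)]
            for r in range(rows)]

edges = apply_filter(image, sobel_x)
for row in edges:
    print(row)  # each row is [0, 36, 36]: no response on the flat area, strong response at the edge
```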
Bring on the bad-ass complex networks.
You can feed deep networks complex, raw inputs. And they will do the feature extraction for you, so you don’t have to. Which features do they extract? Whatever they need, they learn that too! They figure it out based on the data. The deep learning model is essentially a stack of parameterised, non-linear feature transformations that can be used to learn hierarchical representations. During training, each layer learns which transformation to apply (i.e. which feature to extract) and how. As a result, the deep learning model stores a hierarchy of features with an increasing level of abstraction. It’s pretty insane. (NB. This is why Deep Neural Networks are often referred to as ‘black boxes’. You just feed them inputs and they give you outputs. It’s quite difficult to get an intuition of what’s happening inside. That’s why people are trying to invert them and visualise layers, more on that below).
It’s worth pointing out a caveat to this: we don’t need to manually handcraft the features, but the network architecture does need to be manually handcrafted depending on the type of input data. So the network is very domain specific. E.g. for image classification, very specific convolutional networks are usually used. Yann LeCun’s LeNet was the original. AlexNet, GoogLeNet, VGGNet are some modern ones. For speech recognition completely different Recurrent Neural Networks are used. And there are loads of different options. This is quite a bottleneck, and we’re still a long way from having a single, universal, general purpose learning algorithm. (Though Jeff Hawkins’s team is working on it).
Data vs Network
An important point to make clear is: An artificial neural network does not store any of the training data. The training data was only used to learn the weights for the connections, and once trained, the data is not necessary.
Imagine a complex neural network is trained on millions of images (e.g. http://image-net.org)
This network just stores the weights of the connections required to recognise images. In doing so it is storing abstract representations of the various image features that it has learnt are required to identify the different categories.
This is worth re-iterating: e.g. for #deepdream, the original images that the network was trained on are over 1,200,000 MB (1.2 TB), whereas the trained network itself is only ~50 MB, and that’s all you need to make predictions.
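A quick back-of-the-envelope calculation puts those two numbers in perspective. The sizes come from the figures above. The assumption that each weight is stored as a 4-byte (32-bit) floating point number is mine, so treat the weight count as a rough order of magnitude and nothing more.

```python
training_data_mb = 1_200_000   # size of the training images, from the text
network_mb = 50                # approximate size of the trained network, from the text

# How many times smaller is the network than the data it was trained on?
ratio = training_data_mb / network_mb
print(ratio)  # 24000.0

# Assumption: each weight is a 32-bit float (4 bytes), and 1 MB = 1,000,000 bytes.
bytes_per_weight = 4
approx_weights = network_mb * 1_000_000 // bytes_per_weight
print(approx_weights)  # 12500000, i.e. on the order of ten million weights
```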
Inverting the network
Normally you feed data into the input layer of a neural network, that data is processed as it passes through the network, and results come out of the output layer.
There is a recently discovered process called ‘inverting the network’.
In this case you are effectively asking the network “What type of input do you need, to give this particular output”. Of course the network doesn't explicitly know that. But there are ways of manipulating an input image, such that we get the desired results. This is effectively running the network backwards.
A very crude way of putting this is you give the network a completely random picture that looks nothing like a cat and you ask it “does this look like a cat?”, the network says “no”. You make a few random changes and ask “what about this?”, and the network says “no”. And you keep repeating. If the network says “yea, maybe that looks a bit more like a cat” you say “aha! ok so I’ll make more changes in that direction, how about this?”. It’s actually not exactly like that, but you get the idea. And you can see why it takes so long.
The key thing is, the network doesn't actually know what a cat is, it only recognizes one when it sees it.
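The crude ‘make a random change, keep it if the network likes it more’ loop can be sketched in a few lines. Everything here is a stand-in: the ‘network’ is a single output neuron with made-up weights, the ‘image’ is just four pixel values, and the step size and number of attempts are arbitrary. As said above, the real thing doesn’t work exactly like this, but the loop shows the idea.

```python
import math
import random

random.seed(0)  # make the run repeatable

# Stand-in for a trained network: one output neuron with fixed, made-up weights.
CAT_WEIGHTS = [0.9, -0.4, 0.7, -0.8]

def cat_score(image):
    """How 'cat-like' the stand-in network thinks the image is, from 0 to 1."""
    total = sum(w * p for w, p in zip(CAT_WEIGHTS, image))
    return 1.0 / (1.0 + math.exp(-total))

# Start from a completely random 'picture' (4 pixel values between 0 and 1).
image = [random.uniform(0, 1) for _ in CAT_WEIGHTS]
start_score = cat_score(image)
best_score = start_score

for _ in range(2000):
    # Make a few small random changes, keeping pixel values in the valid range.
    candidate = [min(1.0, max(0.0, p + random.uniform(-0.05, 0.05))) for p in image]
    score = cat_score(candidate)
    # "Maybe that looks a bit more like a cat": keep the change and carry on from there.
    if score > best_score:
        image, best_score = candidate, score

print(start_score, best_score)  # the score never goes down, it only improves
```

Note that the network’s weights are never touched. Only the input image changes.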
The above technique can be applied not only to end results (i.e. “does this input look like a cat”), but also to intermediary (hidden) layers, by looking for inputs which maximize activity on specific neurons. So you can pick a neuron in the network, and evolve the input such that you get maximum activity on that particular neuron. Then effectively we are finding what that neuron responds to, i.e. what it represents. (NB. This is similar to how neuroscientists figure out which neurons correspond to which types of inputs. E.g. by feeding specific images to the eye and measuring what happens in the brain, they can see how some neurons respond to diagonal lines, or vertical lines, or to light then dark, or dark then light, horizontal movement, vertical movement etc.)
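Here is a minimal sketch of evolving an input to maximise one hidden neuron. Instead of random changes it uses the gradient of the neuron’s activity with respect to each pixel (gradient ascent), which is the more usual way of doing this in practice. The two hidden neurons, their weights, the 4-pixel ‘image’ and the step size are all invented for illustration.

```python
import math

# Made-up weights from 4 input pixels into 2 hidden neurons.
HIDDEN_WEIGHTS = [[0.9, -0.4, 0.7, -0.8],
                  [-0.6, 0.5, 0.2, 0.3]]

def hidden_activation(image, neuron):
    """Activity of one hidden neuron for a given input image."""
    total = sum(w * p for w, p in zip(HIDDEN_WEIGHTS[neuron], image))
    return 1.0 / (1.0 + math.exp(-total))

def maximise_neuron(neuron, steps=200, step_size=0.5):
    """Evolve an input image so that the chosen neuron becomes as active as possible."""
    image = [0.5] * 4  # start from a flat grey image
    for _ in range(steps):
        a = hidden_activation(image, neuron)
        # For a sigmoid neuron, d(activity)/d(pixel i) = a * (1 - a) * weight i.
        gradient = [a * (1 - a) * w for w in HIDDEN_WEIGHTS[neuron]]
        # Step each pixel 'uphill', keeping pixel values between 0 and 1.
        image = [min(1.0, max(0.0, p + step_size * g))
                 for p, g in zip(image, gradient)]
    return image

preferred = maximise_neuron(1)
print(preferred)  # [0.0, 1.0, 1.0, 1.0]
```

The resulting image is bright exactly where the neuron’s weights are positive and dark where they are negative. In other words, it is a picture of what that neuron responds to.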
One interesting feature is that the lower layers (close to the inputs) respond to (i.e. represent) low-level, primitive features, like corners, edges, oriented lines etc. Moving up, the layers represent higher-level, more defined features. This is remarkably analogous to how information is thought to be processed in our own brain (and the mammalian cerebral cortex). A hierarchy of features, starting at low-level primitive features close to the input, with layers building on top, establishing higher and higher level representations with each layer. (NB. Of course it’s way more complex in the brain, and no one really knows absolutely for sure, but most accepted theories point this way).