Why AlphaGo is fascinating, from a non-technical technical perspective.
I don’t work for Deepmind; the below is based on Silver et al.’s Jan 2016 paper in Nature. I'm also not a Go player, so this article is written mainly from an AI perspective. See Miles Brundage’s brilliant post for a comparison against prior work.
Thanks to @bedder and @alexjc for proofreading this and for their feedback.
At the time of writing, AlphaGo has beaten Lee Se-dol two games in a row (0). Even if Lee wins the remaining three games, this is a big deal. In the media you may have read that “Go is the holy grail of AI, it’s a very complex game, has more moves than atoms in the universe, is ‘the drosophila of AI’ ” etc. That’s all mainly true (1). But I want to focus on a few specifics that I haven’t seen covered much in the media (yet), at least not in enough detail to convey the aspects I find most interesting. I'll try not to get too technical, and to keep it under 7 minutes (excluding footnotes).
First, and most importantly, I want to mention that AlphaGo’s algorithms have little-to-no domain knowledge (2). It’s not an expert system where human experts program strategies and rules on what moves are good, what moves are bad and how to play (as they did when Deep Blue beat Kasparov in 1997). AlphaGo learns how to play. It extracts those rules from watching (and playing) other games. And this is what is so fascinating, because it is learning things that no human knows or can even conceive of. Furthermore, these general purpose learning algorithms are being tested on Go, but can potentially be transferred to other domains (3). More on all this below.
Search complexity
One of the factors that makes Go difficult for AI is indeed the complexity of the game. This is mentioned a lot in the media, and yes, playing a game of this complexity well is very challenging, but complexity is not the only interesting aspect of Go AI.
In a full (19x19) Go game, with every move there are on average roughly 250 choices that a player can make. With a game lasting on average 150 moves, that’s around 250¹⁵⁰, i.e. on the order of 10³⁶⁰ possible games. For comparison, chess has on the order of 10¹²³ possible games. (These numbers are approximate of course, but give an idea of the orders of magnitude. More here).
To put that into perspective: the number of atoms in the known universe is estimated to be on the order of 10⁸⁰. If every single atom in the known universe were actually a universe in itself with the same number of atoms, and every single one of those atoms were also a universe in itself with the same number of atoms, and every single one of those atoms played every single possible game of chess, that’s roughly how many games of Go can be played (4).
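For the curious, here's a quick back-of-the-envelope check of those orders of magnitude in Python (the branching factor, game length, atom count and chess figure are all rough values quoted above, not precise constants):

```python
from math import log10

# Rough figures from the text above; all are order-of-magnitude estimates.
branching_factor = 250   # average legal moves available per turn in Go
game_length = 150        # average number of moves in a game

go_exponent = game_length * log10(branching_factor)
print(f"Go: roughly 10^{go_exponent:.0f} possible games")           # ~10^360

atoms_exponent = 80      # atoms in the known universe: ~10^80
chess_exponent = 123     # possible games of chess: ~10^123
nested_universes = 3 * atoms_exponent + chess_exponent
print(f"Nested-universes analogy: roughly 10^{nested_universes}")   # ~10^363 (see note 4)
```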
So yes, Go, even though it has simple rules, is a very complex game.
Clearly a brute force method of just searching through the possible states and actions is not feasible (not even for chess). No matter how much computing power we have, we won’t even make a noticeable dent in the possible actions we can make. It’s absolutely key to use an efficient way of searching the decision tree and only explore paths which have potential. Currently a very good method is Monte Carlo Tree Search (MCTS), a probabilistic approach to exploring a decision tree, at its core treating actions like slot machines in a multi-armed bandit. It’s very simple and very effective. It’s also not that new (~10 years old, with roots going back much further), so to keep this post brief I'm not going to go deeper into it — pardon the pun. (Some great resources here, and I did write a basic openframeworks implementation).
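To give a flavour of the bandit idea at the heart of MCTS, here's a minimal sketch of a UCB1-style selection rule that picks which child (move) to explore next. This is a generic illustration, not AlphaGo's actual selection rule (which also folds in the Policy Network's prior probabilities); the `visits` and `wins` statistics on each child node are assumed bookkeeping:

```python
import math

def ucb1_select(children, exploration=1.41):
    """Pick the child that balances exploitation (high average win rate so far)
    against exploration (few visits so far), as in vanilla MCTS / UCT."""
    total_visits = sum(child.visits for child in children)

    def score(child):
        if child.visits == 0:
            return float("inf")  # always try unvisited moves first
        exploit = child.wins / child.visits
        explore = exploration * math.sqrt(math.log(total_visits) / child.visits)
        return exploit + explore

    return max(children, key=score)
```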
AlphaGo’s ‘Policy Network’ is impressive and arguably ‘state of the art’, but to keep it short I'm going to skip that too and go straight to the ‘Value Network’, which is similar but perhaps simpler to explain (I summarize a bit at the end).
State value evaluation
Even armed with MCTS, there is another huge — and very interesting — problem with Go: evaluating the value of the state of the board. I.e. looking at the board at any given moment, and estimating who is in the lead and likely to win (without running simulations).
We can use smart methods like MCTS (especially with smart policies, policy networks etc.) to radically limit the width of our decision tree search, but with games running potentially hundreds of moves deep, it’s not feasible to simulate every game to the end, especially thousands of times for every move. So we need to stop each simulation after N moves and estimate a value for that simulated state.
In chess for example, this is relatively straightforward (and was first proposed by Claude Shannon in 1949). There are many methods of evaluation now for chess, ranging from naively counting the number of white vs black pieces, to counting pieces with weights (e.g. a queen is worth more than a pawn), or looking for specific patterns in piece layout. So we — as humans — have managed to sit down and hand craft these rules regarding the value evaluation of a chess board state. Then we've programmed these relatively simple evaluation functions and we are now able to write chess software that humans cannot beat. These evaluation functions don’t need to be 100% accurate, they just need to be good enough to usefully compare multiple scenarios. Nowadays open-source mobile phone apps can beat expert humans at chess.
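Just to illustrate how simple such a hand-crafted evaluation can be, here is a toy weighted material count of the kind described above. The piece weights are the conventional textbook values, and the board representation is purely an assumption for the example:

```python
# Toy chess evaluation: weighted material count from White's point of view.
# Positive = White ahead, negative = Black ahead. Real engines add much more
# (piece positions, king safety, pawn structure, ...), but the idea is the same.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}

def evaluate_material(pieces):
    """pieces: iterable of piece codes, e.g. ["P", "p", "Q", "r"], where
    uppercase is White and lowercase is Black."""
    score = 0
    for piece in pieces:
        value = PIECE_VALUES[piece.upper()]
        score += value if piece.isupper() else -value
    return score

print(evaluate_material(["K", "Q", "P", "P", "k", "r", "p"]))  # 11 - 6 = 5, White ahead
```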
In Go however, the situation is different. To date there has been no successful evaluation function for the state of a board in Go. Not only because Go has a significantly higher game complexity than chess, but also because of the nature of the game.
With thousands of years of history and a rich folklore that includes accounts of famous games such as ‘games of blood and tears, nine dragons playing with a pearl, ear reddening game, blood vomiting game’ etc, Go players rely more on intuition, conceptual and metaphorical strategies as opposed to purely mathematical approaches.
Interestingly, an experienced Go player can look at a Go board and have a rough idea of who is leading and at an advantage. E.g. “Player White is clearly at an advantage”, or “it’s very close”, or “White seems a bit weaker but it could turn” or “White is clearly going to lose”.
But even though experienced Go players are able to do this, distilling and formalizing that evaluation into a function or set of rules (one that takes a Go board state and outputs an estimated score of who is likely to win) has not been possible, or at least not sufficiently successful.
And one of the things that Silver et al have managed to do is exactly that: train AlphaGo to learn how to evaluate the value of a Go board (5).
The Value Network
Silver et al have fed a neural network (called the Value Network) a huge dataset of past Go games. The network is trained by looking at each board state in each game and correlating it with information about who won that game. Furthermore, the software plays against itself millions of times to build that dataset even larger, and to improve the network using (Deep) Reinforcement Learning. From this, the Value Network learns to look at a board state and approximate its value (i.e. predict who is likely to win).
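In spirit, the supervised part of that training is a regression from board states onto final game outcomes. Here's a minimal conceptual sketch in Python/PyTorch; the tiny network, the random stand-in data and the flat board encoding are all illustrative assumptions, not DeepMind's actual architecture or pipeline:

```python
import torch
import torch.nn as nn

# Stand-in value network: flat 19x19 board features in, one scalar "who wins" out.
value_net = nn.Sequential(nn.Linear(19 * 19, 128), nn.ReLU(), nn.Linear(128, 1), nn.Tanh())
optimiser = torch.optim.SGD(value_net.parameters(), lr=0.01)

# Stand-in dataset: (board features, outcome) pairs, outcome = +1 or -1 for who won.
boards = torch.randn(1000, 19 * 19)
outcomes = torch.where(torch.rand(1000, 1) > 0.5, torch.tensor(1.0), torch.tensor(-1.0))

for board, winner in zip(boards, outcomes):
    predicted = value_net(board)               # scalar estimate of who is likely to win
    loss = (predicted - winner).pow(2).mean()  # squared error against the final game result
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```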
There are a few things that seem glaringly wrong (or impossibly difficult) with this. Firstly, just because a player won a particular game, that doesn't mean that all of the board states in that game were beneficial to that player and detrimental to the other player. Maybe the loser was initially ‘leading’ and made a mistake at the end? So somehow the network has to figure out which states actually contributed to winning and which ones didn't. (This is known as the credit assignment problem, or learning from delayed rewards, in reinforcement learning; it is tricky, but approaches to it go back decades).
Secondly, the number of possible board states in Go is on the order of 10¹⁷⁰. Not only is that more than the number of atoms in the known universe, but if every single atom in the known universe were itself a universe containing the same number of atoms, Go would still have around ten billion times as many possible board states as there are atoms (10¹⁷⁰ > 10⁸⁰ * 10⁸⁰ = 10¹⁶⁰). So clearly there’s no way that this network is going to see and memorize even a considerable fraction of possible board states.
So the software has to learn to generalize, and generalize really well.
During a game (especially during MCTS simulations), when AlphaGo encounters a board state, it needs to be able to evaluate that state to see if it is beneficial or not. Due to the vast state space complexity (10¹⁷⁰), the chance of that board state being identical to one it saw during its training is almost zero. So it has to learn to recognize similar board states with similar values. But what does that even mean? Just looking similar is not necessarily good enough, as an apparently small change might have a massive impact on the value (i.e. win prediction). In fact the programmers of AlphaGo themselves probably don’t even know the full range of what ‘similar board states with similar values’ means. The definition of ‘similar board states’ is not programmed into AlphaGo. The programmers code and tweak the learning algorithm, and AlphaGo figures out the rest (3).
Final words
AlphaGo learns what similarity in a Go board means. It learns what kinds of patterns are meaningful vs meaningless. What kinds of patterns lead to success vs don’t.
Ultimately, the network learns something that our human experts know intuitively based on their experience, but are unable to distill and formulate. Furthermore, the network learns, improves, distills and formulates it as a computable function, so it is able to act on this formulation in the future.
This is what makes machine learning so powerful. Obviously machines have always been able to do things that we can’t do. But we've always had to tell them exactly what to do. With machine learning, they are able to learn things that we don’t know. They look at data and extract patterns and regularities that we can’t see. No human actually fully understands the reasoning that goes on inside the network when it makes a decision. No human is able to predict what the network is going to decide to do or why. Seeing AlphaGo make ‘surprising’ moves surprises its programmers as much as everyone else. If anything, humans will have to watch and hope to learn from the network, as we are already seeing in many online discussions, or when Myungwan Kim refers to AlphaGo as a ‘goddess’, perhaps hoping to witness a ‘Kami No Itte’ (Divine Move).
Personally, as I'm not a Go player, this is interesting to me for one main reason. None of the code or algorithms that do this is specific to the game of Go, so they can potentially be applied to any domain (3). Health is one of the fields that could dramatically benefit, e.g. medical diagnosis and treatment (it's interesting to note the recent launch of Deepmind Health).
Needless to say, this is also what makes machine learning so terrifying. Trusting machines is already causing havoc in our lives. As we move towards more black-box algorithms which we can’t understand, the potential dangers are going to be amplified. Media hype is terrified of Artificial Intelligence spawning SkyNet, but we should first be terrified of Artificial Stupidity, especially combined with Human Stupidity.
To paraphrase Max Tegmark: in the past, we first invented the car, then we learnt the hard way that we needed seat-belts and air-bags. As we develop more powerful technologies, we can’t afford to let our wisdom lag behind (edge.org).
Or to quote John Mather, “I am both thrilled and frightened”.
Notes
(0) Worth noting, Lee was very confident that he would beat AlphaGo very comfortably. Days before the first match he was questioning not who would win the best of five, but whether AlphaGo could even win one game at all! I'm only guessing that he saw AlphaGo play Fan Hui in October 2015, and didn't think that highly of it. Perhaps he saw some mistakes and judged his own abilities to be far superior. Unless Deepmind used an inferior model in October to play mind games with Lee, in the months since, AlphaGo has been able to improve its play enough to render Lee “speechless” and “in shock” (which may be due to improvements to software or hardware, or both).
(1) AlphaGo is most definitely not — and no one claims that it is — the be-all and end-all of AI (and definitely not any kind of ‘Strong’ AI). It doesn't model opponent behaviour, and doesn’t even use long term planning or strategy (across moves) — both things which an ‘intelligent’ human player would do. And even though Go is a very tricky game for AI, it isn't necessarily representative of real world problems. Go is a perfect information game (i.e. the system has all of the state information that it needs to make a decision), it is deterministic (i.e. there is no chance involved), and it has Markov properties (i.e. how one got to a particular state is not relevant to a decision, only the information at the current state is — though a human player would of course use history as a means to model opponent behaviour, and we could roll history into the current state to try and make it Markov, but that’s another story). But actually no one is claiming any of this anyway (except maybe some crazy papers).
(2)(3) Above I mentioned that AlphaGo’s algorithms have ‘little-to-no domain knowledge’. One particular (fast) MCTS rollout policy was trained using handcrafted patterns. This policy was used as an alternative to the Value Network: the value of a state was estimated by simulating to the end of the game with it. However, it turned out that the Value Network — programmed with no domain knowledge — was consistently more accurate.
I also mentioned that these algorithms “can potentially be transferred to other domains”. This does not mean that we can pipe any data into AlphaGo and it will automatically perform at superhuman levels in that domain. Even though the algorithms and code written for AlphaGo are not limited to the game of Go, the network architectures, hyper-parameters and the whole pipeline are designed and fine-tuned thoroughly to maximize performance for this particular problem. Nevertheless, transferring this system (or knowledge gained from it) to work in another domain is likely to be a lot more feasible than it would be for hand crafted expert systems (such as Deep Blue).
(4) You got me: 10⁸⁰ * 10⁸⁰ * 10⁸⁰ * 10¹²³ = 10³⁶³, which is 1000x bigger than 10³⁶⁰. So the analogy overshoots by a factor of about a thousand, which at these orders of magnitude I'm calling close enough.
(5) It’s difficult to distinguish exactly how much of this progress comes purely through algorithmic developments vs using a better dataset to train on vs using lots more compute power than before vs combinations of all of the above. For the purposes of this article, I'm not trying to isolate algorithmic progress, but to focus on the overall increase in humanity's abilities (even if that includes throwing money at a problem to build the most powerful Go computer ever). So all of the factors are equally relevant. See Miles Brundage’s fantastic post for a comparison against prior work.
Glossary
Value Network: Given a board state, outputs a scalar value. Trained to give AlphaGo an approximate idea of the value of a board state. I.e. with one look at a board, without simulating any moves, the software can make an estimate on who is leading / winning. It’s not very accurate, but good enough. Kind of like a hunch.
Policy Network: Given a board state, outputs a probability distribution for the available actions. Trained to give AlphaGo an approximate idea on what move to make given a board state. This might not be the optimal move, but it’s quite likely to be a decent one. Again, it’s a hunch.
Monte Carlo Tree Search: Used to search the decision tree. The Policy Network suggests actions during simulations up to a specified depth, after which the Value Network evaluates the final simulated state (in fact a separate fast rollout policy also plays the position out to the end of the game, and its result can be weighted against the Value Network's estimate).
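Roughly, that leaf evaluation boils down to blending the two estimates with a mixing weight, something along these lines (a sketch only; `value_net`, `fast_rollout` and `lam` are illustrative stand-ins, with `lam` playing the role of the paper's mixing parameter):

```python
def evaluate_leaf(board_state, value_net, fast_rollout, lam=0.5):
    """Blend the Value Network's estimate of board_state with the outcome of a
    fast rollout played to the end of the game. lam=0 trusts the Value Network
    alone; lam=1 trusts the rollout alone. Both callables are stand-ins here."""
    v = value_net(board_state)     # learned estimate in [-1, 1] of who is winning
    z = fast_rollout(board_state)  # +1 or -1 from quickly playing the position out
    return (1 - lam) * v + lam * z
```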
The networks are deep convolutional neural networks, not too dissimilar to the networks used for image classification: a bunch of convolution filters, followed by ReLU etc. No pooling is used from what I can see, I guess because they need precise position information, unlike classification (similar to the convnets they used on Atari games). The networks are trained first with supervised learning using 30 million positions from games on the KGS Go Server, then with reinforcement learning by getting the software to play itself millions of times.
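As a very rough sketch of what such an all-convolutional, no-pooling stack might look like in PyTorch (the layer counts, channel sizes and the 48 input feature planes are illustrative guesses for the example, not the exact architecture from the paper):

```python
import torch.nn as nn

# Illustrative value-network-style stack: convolutions + ReLU throughout, no pooling
# (so precise stone positions are preserved), ending in a single scalar in [-1, 1].
value_net_sketch = nn.Sequential(
    nn.Conv2d(in_channels=48, out_channels=192, kernel_size=5, padding=2),  # 48 feature planes, 19x19 board
    nn.ReLU(),
    nn.Conv2d(192, 192, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(192, 192, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(192, 1, kernel_size=1),  # collapse to a single 19x19 plane
    nn.Flatten(),
    nn.Linear(19 * 19, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
    nn.Tanh(),                         # estimated winner: -1 to +1
)
```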
It all sounds relatively straightforward, especially considering prior work. The magic seems to be in getting the networks to be so good at generalizing (compressing and un-compressing) across the possible 10¹⁷⁰ board states, rather than just memorizing (overfitting to) the comparatively minute dataset they were trained on.