Mastering the Game of Go with Deep Neural Networks and Tree Search

Summary

In this article, the AlphaGo research team explains the methodology used to create a professional-level Go-playing program and highlights the difficulties they encountered. The article offers many insights and serves as the main skeleton for this project.

Overall Structure

The main structure of AlphaGo is Monte Carlo Tree Search (MCTS), accompanied by two deep neural networks: the policy network and the value network. The policy network reduces the branching factor of the search, while the value network reduces its depth when necessary. Because the policy network, although accurate, is slow, a fast rollout policy is also developed and used alongside it. Standard machine learning approaches, supervised learning and reinforcement learning, are used to train these components.
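
To make this division of labour concrete, here is a rough sketch of how one search simulation could combine the four pieces. It is only an illustration under assumed interfaces, not the paper's actual implementation; the sections below describe each piece in more detail.

```python
# Every name here (select_child, expand, playout, the `mix` weight, the network
# call signatures) is a hypothetical placeholder used only to show the shape of
# the search, not AlphaGo's actual interface.

def simulate_once(root, policy_net, value_net, rollout_policy, mix=0.5):
    node = root
    # Selection: walk down the part of the tree that has already been built.
    while node.is_expanded():
        node = node.select_child()
    # Expansion: the policy network supplies prior probabilities over moves,
    # so only plausible continuations are added (smaller branching factor).
    node.expand(priors=policy_net(node.state))
    # Evaluation: instead of searching to the end of the game, blend the value
    # network's static judgement of the position with the outcome of one fast
    # rollout (shallower effective search depth).
    leaf_value = (1 - mix) * value_net(node.state) + mix * rollout_policy.playout(node.state)
    # Backup: propagate the evaluation to every node on the selected path.
    node.backpropagate(leaf_value)
```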

Monte Carlo Tree Search

MCTS is a general game-tree search that, rather than branching over every possible move, probes a single line of play at a time. Each probe is played out to the end of the game by choosing moves at random at each state, and the resulting reward and visit counts are backpropagated to guide the next probe. This continues until time runs out, which makes MCTS well suited to playing under a time constraint. However, as we can easily tell, random moves do not match what a rational player (opponent) would do, so the policy network is employed to better reflect reality.
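
The description above corresponds to the classic simulate-to-the-end loop. Below is a minimal, self-contained sketch of that loop in Python; the `state` interface (legal_moves, play, is_over, winner, to_play) is a hypothetical stand-in, not the data structures used by AlphaGo.

```python
import math
import random
import time

class Node:
    """One node of the search tree; game states and moves are duck-typed."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}   # move -> child Node
        self.visits = 0
        self.reward = 0.0

    def ucb_child(self, c=1.4):
        # Balance exploitation (average reward) against exploration (rarely tried moves).
        return max(self.children.values(),
                   key=lambda n: n.reward / n.visits
                   + c * math.sqrt(math.log(self.visits) / n.visits))

def mcts(root_state, time_budget=1.0):
    root = Node(root_state)
    deadline = time.time() + time_budget
    while time.time() < deadline:                  # keep probing until time runs out
        node = root
        # Selection: descend while every legal move at the node has been tried.
        while node.children and len(node.children) == len(node.state.legal_moves()):
            node = node.ucb_child()
        # Expansion: add one untried move below the selected node.
        if not node.state.is_over():
            move = random.choice([m for m in node.state.legal_moves()
                                  if m not in node.children])
            node.children[move] = Node(node.state.play(move), parent=node)
            node = node.children[move]
        # Rollout: play uniformly random moves until the game ends.
        state = node.state
        while not state.is_over():
            state = state.play(random.choice(state.legal_moves()))
        # For brevity the outcome is scored from the root player's point of view;
        # a full two-player implementation would flip the sign at alternating depths.
        outcome = 1.0 if state.winner() == root_state.to_play() else 0.0
        # Backpropagation: update visit counts and rewards along the probed path.
        while node is not None:
            node.visits += 1
            node.reward += outcome
            node = node.parent
    # When the budget is spent, play the most-visited move at the root.
    return max(root.children, key=lambda m: root.children[m].visits)
```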

Policy Network

For AlphaGo, a 13-layer policy network was trained on 30 million positions from the KGS Go Server [6]. The network is given a game state together with the next move chosen by a human expert, and gradually learns to predict that move. The result was an accuracy of up to 57%, compared with a best of 44.4% from other research groups at the time. However, evaluating the network takes 3 ms per move, which limits the depth MCTS can traverse.
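
As a rough illustration of this supervised setup, the sketch below trains a small 13-layer convolutional policy network to predict the expert move with a cross-entropy loss (maximizing the likelihood of the human move). The layer layout and the 48 input feature planes follow the paper's description (k = 192 filters), but the random tensors are placeholders for the real KGS data and the hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

BOARD = 19
PLANES = 48   # input feature planes, as described in the paper

# 13 layers: a 5x5 convolution, eleven 3x3 convolutions, and a final 1x1
# convolution producing one logit per board point (192 filters per hidden layer).
layers = [nn.Conv2d(PLANES, 192, kernel_size=5, padding=2), nn.ReLU()]
for _ in range(11):
    layers += [nn.Conv2d(192, 192, kernel_size=3, padding=1), nn.ReLU()]
layers += [nn.Conv2d(192, 1, kernel_size=1)]
policy_net = nn.Sequential(*layers)

optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.003)  # illustrative value
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                                    # placeholder training loop
    states = torch.randn(16, PLANES, BOARD, BOARD)         # dummy stand-in for KGS positions
    expert_moves = torch.randint(0, BOARD * BOARD, (16,))  # dummy stand-in for expert labels
    logits = policy_net(states).flatten(1)                 # (batch, 361) move logits
    loss = loss_fn(logits, expert_moves)                   # likelihood of the expert move
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```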

Rollout Policy

The rollout policy is a fast algorithm that matches known patterns on the board to propose the most probable moves. It is also trained with supervised learning, on 8 million positions from human games on the Tygem server. The rollout policy achieves 24.2% accuracy while taking only 2 μs per move (0.067% of the policy network's computation time). In the end, both the rollout policy and the policy network are used in parallel, and both results are taken into account.
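
The paper describes the rollout policy as a linear softmax over small, locally computed pattern features, which is what makes it orders of magnitude faster than the convolutional policy network. The toy sketch below shows that idea; the feature extractor, vocabulary size, and weights are hypothetical placeholders rather than the features actually used.

```python
import numpy as np

NUM_FEATURES = 1000                # assumed size of the pattern-feature vocabulary
rng = np.random.default_rng(0)
weights = np.zeros(NUM_FEATURES)   # in practice learned by supervised learning

def move_features(state, move):
    """Hypothetical extractor: indices of the local patterns (e.g. a 3x3
    neighbourhood around `move`) that match the current board."""
    return rng.integers(0, NUM_FEATURES, size=3)   # placeholder

def rollout_move(state, legal_moves):
    # Score every legal move with a linear model over its sparse pattern
    # features, then sample from the resulting softmax distribution.
    scores = np.array([weights[move_features(state, m)].sum() for m in legal_moves])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return legal_moves[rng.choice(len(legal_moves), p=probs)]
```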

Value Network

The value network is used to truncate the tree search when deemed appropriate. When looking at a game, a professional can tell whether the black or white side is at an advantage; the value network tries to do exactly the same. At first, full game histories were provided to train the value network in a manner similar to the policy network. Because successive positions within a game are strongly correlated, this led to overfitting: the machine effectively "memorized" the training set. The network was then trained on distinct positions sampled from separate games that AlphaGo played against itself and against earlier versions, which solved the problem.
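
To make the overfitting fix concrete, the sketch below samples a single position from each (simulated) self-play game and regresses a toy value network onto the final game outcome with a mean-squared-error loss. The network shape, hyperparameters, and the `play_self_play_game` generator are hypothetical placeholders, not AlphaGo's actual pipeline.

```python
import random
import torch
import torch.nn as nn

BOARD, PLANES = 19, 48

value_net = nn.Sequential(                      # toy stand-in for the real value network
    nn.Conv2d(PLANES, 192, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(192 * BOARD * BOARD, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Tanh(),               # scalar evaluation in [-1, 1]
)
optimizer = torch.optim.SGD(value_net.parameters(), lr=0.001)   # illustrative value

def play_self_play_game():
    """Placeholder: returns all positions of one self-play game and its outcome."""
    length = random.randint(50, 200)
    positions = torch.randn(length, PLANES, BOARD, BOARD)
    outcome = random.choice([-1.0, 1.0])        # +1 win / -1 loss for the side to evaluate
    return positions, outcome

for step in range(100):
    batch_states, batch_outcomes = [], []
    for _ in range(8):
        positions, outcome = play_self_play_game()
        i = random.randrange(len(positions))    # one distinct position per game
        batch_states.append(positions[i])       # avoids correlated, near-duplicate examples
        batch_outcomes.append(outcome)
    states = torch.stack(batch_states)
    targets = torch.tensor(batch_outcomes).unsqueeze(1)
    loss = nn.functional.mse_loss(value_net(states), targets)   # regress onto game outcome
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```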