What is the Tic-Tac-Toe game? Two players take turns playing on a three-by-three board. One player plays Xs and the other Os until one player wins by placing three marks in a row, horizontally, vertically, or diagonally, as the X player has in the game shown below. If the board fills up with neither player getting three in a row, the game is a draw.

The program has three stages: train, compete, and play.

First, let’s look at the training stage. Below is the train() method.
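As a rough sketch of the structure (the Player and Judger stand-ins here are placeholders for the real classes developed later in the post):

```python
class Player:
    # placeholder: the real Player (shown later) learns a value function
    def savePolicy(self):
        self.saved = True

class Judger:
    # placeholder: the real Judger alternates the players' moves
    def __init__(self, player1, player2, feedback=True):
        self.p1, self.p2 = player1, player2
        self.feedback = feedback
    def play(self):
        pass  # one full game of self-play

def train(epochs=100):
    player1 = Player()
    player2 = Player()
    judger = Judger(player1, player2, feedback=True)
    for _ in range(epochs):
        judger.play()      # each epoch is one game of self-play
    player1.savePolicy()   # keep what each player learned
    player2.savePolicy()
    return player1, player2
```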

The train() method first creates two Player objects, then has them play tic-tac-toe against each other through a Judger object. It’s a very simple process.

Next, let’s dig into the Player object.

Below is the code of the Player object. To make it easier to follow, I have omitted the implementation details of each method.
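A skeleton consistent with this description (method names taken from the post, bodies elided just as the author did; the default parameter values are assumptions):

```python
class Player:
    def __init__(self, stepSize=0.1, exploreRate=0.1):
        self.estimations = dict()   # state hash -> estimated win probability
        self.stepSize = stepSize    # the alpha of the TD(0) update
        self.exploreRate = exploreRate
        self.states = []            # states reached in the current game
        self.symbol = 0             # 1 or -1, assigned by the Judger

    def reset(self): ...
    def setSymbol(self, symbol): ...
    def feedState(self, state): ...
    def feedForward(self, reward): ...  # the TD(0) backup
    def takeAction(self): ...           # the epsilon-greedy move choice
    def savePolicy(self): ...
    def loadPolicy(self): ...
```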

As a Player, the important thing during training is to learn a policy. A policy tells the player which move to select in each state it faces. That is why there are two methods, savePolicy() and loadPolicy(): when training ends, the player saves its learned policy, and later it loads that same policy when it competes against someone else.

Now let’s jump into the initialization method; its source code is below:

Every player holds a dictionary. For each item in the dictionary, the key is a state and the value is the estimated probability of winning from that state. We use the TD(0) method to solve the problem. That is, we update the state-value function step by step. The update rule is:

$$V(s) = V(s) + \alpha [V(s') - V(s)]$$

Here $\alpha$ is the step size, $s'$ is the next state, and $s$ is the current state. $V(s)$ is the estimated probability of winning from state $s$.
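As a quick numeric check of the rule (the value of $\alpha$ and the two estimates here are arbitrary):

```python
def td0_update(v_s, v_next, alpha):
    # V(s) <- V(s) + alpha * (V(s') - V(s))
    return v_s + alpha * (v_next - v_s)

# with alpha = 0.1, V(s) = 0.5 and V(s') = 0.7, V(s) moves one tenth
# of the way toward V(s'), i.e. to roughly 0.52
```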

So, what is the explore rate? To understand it, we need to know how to choose the next action at the current state. For every state, we first find every state it can transfer to, then look up the dictionary to find the successor with the highest estimated value; that successor determines the action we take. This is called the greedy policy. But each state’s value is only our estimate, not the true probability of winning, so the greedy policy can be wrong. One way to address this: at every state, we usually select the successor with the highest estimated value, but with probability equal to the explore rate we instead choose the next state at random. Formally, if we write the explore rate as $\epsilon$, this is called the $\epsilon$-greedy method.
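A small self-contained sketch of this selection rule (the dictionary of values and the candidate list here are toy stand-ins, not the post’s data structures):

```python
import random

def epsilon_greedy(values, candidates, exploreRate, rng=random):
    # with probability exploreRate: explore (a uniformly random candidate);
    # otherwise: exploit (the candidate with the highest estimated value)
    if rng.random() < exploreRate:
        return rng.choice(candidates)
    return max(candidates, key=lambda s: values[s])
```

With exploreRate set to 0 this degenerates to the plain greedy policy.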

Next, let’s see what the allStates variable is.

We can see that it is built by the getAllStates() function.

So what does getAllStates() look like?

By now you may be asking: what is a state? Below is the definition:
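A sketch of such a State class, matching the attributes described below (plain Python lists stand in for the n-by-n array the post describes):

```python
N = 3  # board size

class State:
    def __init__(self):
        # the board: 0 = empty, 1 = player A's piece, -1 = player B's piece
        self.data = [[0] * N for _ in range(N)]
        self.winner = None   # 1, -1, or 0 for a tie, once the game ends
        self.hashVal = None  # computed on demand, for use as a dict key
        self.end = None      # whether this state is terminal
```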

One state is one arrangement of pieces on the board, plus some extra attributes: who the winner is in the current state, whether the state is terminal, and so on. In particular, each state has a hash value, which makes it convenient to use as a representation. The board is an n-by-n array, so one state contains an n-by-n array.

Below is how to calculate the hash value of a state:
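One simple scheme (not necessarily the author’s exact formula) reads the board as a nine-digit base-3 number, shifting each cell value from {-1, 0, 1} to a digit in {0, 1, 2}:

```python
def getHash(data):
    # data: an n-by-n grid with entries in {-1, 0, 1}
    hashVal = 0
    for row in data:
        for cell in row:
            hashVal = hashVal * 3 + cell + 1
    return hashVal
```

Distinct boards get distinct hashes, so the hash can serve as a dictionary key. The empty 3-by-3 board, nine base-3 digits of 1, hashes to (3^9 - 1) / 2 = 9841.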

Below is how to judge whether a state is terminal:
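A sketch that follows this description: check every row, column, and diagonal sum, then check whether the board is full:

```python
def isEnd(data, n=3):
    # returns (end, winner); winner is 1, -1, 0 for a tie, or None if ongoing
    sums = [sum(row) for row in data]                              # rows
    sums += [sum(data[i][j] for i in range(n)) for j in range(n)]  # columns
    sums.append(sum(data[i][i] for i in range(n)))                 # diagonal
    sums.append(sum(data[i][n - 1 - i] for i in range(n)))         # anti-diagonal
    for s in sums:
        if s == n:
            return True, 1    # player A completed a line
        if s == -n:
            return True, -1   # player B completed a line
    if sum(abs(c) for row in data for c in row) == n * n:
        return True, 0        # full board, nobody won: a tie
    return False, None
```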

There are two scenarios for the end of the game: someone wins, or it is a tie. Player A’s pieces are represented by 1 and player B’s by -1, so if A wins, some row, column, or diagonal sums to n; if B wins, one of them sums to -n. The state’s winner attribute is then 1 or -1, meaning player A or player B. If the sum of the absolute values of all the pieces on the board is n × n, the board is full and the game is a tie; the winner is 0 (no one wins).

When someone puts a piece on the board, the state changes and transfers to another state. How do we get the new state?
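Presumably the transition copies the board and drops the new piece in, leaving the old state untouched (a sketch):

```python
import copy

def nextState(data, i, j, symbol):
    # copy the board so the current state is not modified in place
    newData = copy.deepcopy(data)
    newData[i][j] = symbol
    return newData
```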

And lastly, since we are playing a game, we need a GUI. That is what show() does.
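A stand-in for show() that renders the board as plain text (the original may draw something fancier):

```python
def show(data):
    marks = {1: 'x', -1: 'o', 0: ' '}
    sep = '-------------'
    lines = [sep]
    for row in data:
        lines.append('| ' + ' | '.join(marks[c] for c in row) + ' |')
        lines.append(sep)
    return '\n'.join(lines)
```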

Let’s come back to the getAllStates() method.

Now that we know what a state is, the next step is to generate all possible states. First, we need a data structure to store them, so we define a dictionary, allStates. Its key is the hash value of a state, and its value is a tuple: the first item is the state (an n-by-n array) and the second is a flag marking whether the state is terminal. To generate all the states, we jump into the getAllStatesImpl() method.
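Putting the pieces together, here is a runnable sketch of the recursive enumeration (the hashing and end-checking helpers are repeated so the block stands alone; they are assumptions consistent with the earlier descriptions):

```python
import copy

N = 3

def getHash(data):
    # read the board as a base-3 number
    hashVal = 0
    for row in data:
        for cell in row:
            hashVal = hashVal * 3 + cell + 1
    return hashVal

def isEnd(data):
    sums = [sum(row) for row in data]
    sums += [sum(data[i][j] for i in range(N)) for j in range(N)]
    sums.append(sum(data[i][i] for i in range(N)))
    sums.append(sum(data[i][N - 1 - i] for i in range(N)))
    if N in sums or -N in sums:
        return True   # someone completed a line
    return sum(abs(c) for row in data for c in row) == N * N

def getAllStatesImpl(currentData, currentSymbol, allStates):
    # try every empty cell; recurse with the other symbol until terminal
    for i in range(N):
        for j in range(N):
            if currentData[i][j] == 0:
                newData = copy.deepcopy(currentData)
                newData[i][j] = currentSymbol
                h = getHash(newData)
                if h not in allStates:
                    end = isEnd(newData)
                    allStates[h] = (newData, end)
                    if not end:
                        getAllStatesImpl(newData, -currentSymbol, allStates)

def getAllStates():
    board = [[0] * N for _ in range(N)]
    allStates = {getHash(board): (board, False)}
    getAllStatesImpl(board, 1, allStates)  # the player marked 1 moves first
    return allStates
```

On the 3-by-3 board this yields the well-known 5,478 legal positions (including the empty board).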

The getAllStatesImpl() method starts with an empty board (currentState) and generates the states step by step by recursively calling itself. Because this game is very simple, we can generate every possible state; for a larger game, that would be impossible.

Tada~Let’s come back to the Player object. We put the code here again for convenience.

We have already explained the initialization method. It’s worth noticing that the Player object has a states attribute; we’ll explain it later.

Below is the reset() method:

And below is the setSymbol() method:
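The post does not show how the estimations are seeded, so the initialization below is an assumption borrowed from similar implementations: terminal states the player wins get value 1, terminal states it loses or ties get 0 (matching the reward scheme described in the Judger section), and every other state starts at 0.5. The three-item tuples here are a toy stand-in for the real allStates dictionary:

```python
def initEstimations(allStates, symbol):
    # allStates: hash -> (state, isEnd, winner); toy stand-in structure
    estimations = {}
    for hashVal, (state, isEnd, winner) in allStates.items():
        if isEnd:
            estimations[hashVal] = 1.0 if winner == symbol else 0.0
        else:
            estimations[hashVal] = 0.5   # unknown: start in the middle
    return estimations
```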

We know that each player’s pieces on the board carry a symbol (1 or -1). This method assigns a symbol to the player. It also initializes the estimated state-value dictionary we mentioned earlier.

And the feedState() method:

As with the states variable, we’ll explain it later.

Moving on, below is the feedForward() method. This method is not only the core of the Player object but also the core of the method that solves this game; that is, it is the core of TD(0).

We mentioned the update rule earlier. Below is its implementation:
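A sketch of that backup as a standalone function: walk the visited states backwards, pulling each estimate toward the just-updated value of the state that followed it, then clear the visit list for the next game (the function signature is an assumption; the original is a Player method):

```python
def feedForward(estimations, visited, reward, stepSize):
    # visited: hashes of the states reached in this game, in order
    target = reward
    for h in reversed(visited):
        value = estimations[h] + stepSize * (target - estimations[h])
        estimations[h] = value
        target = value      # the chain: the next update aims at this value
    visited.clear()         # the states list is emptied afterwards
```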

Notice these two lines in the code:

So the update rule is a chain-like update. Note also that the states variable is set to empty here (again, we’ll explain it later).

The next method, which implements the $\epsilon$-greedy policy, is also very important, because it tells the player how to take its next action:
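A sketch of such a move selection. For brevity the estimated values here are keyed directly by board position; the real code evaluates the hash of the successor state instead:

```python
import random

def takeAction(board, symbol, estimations, exploreRate, rng=random):
    positions = [(i, j) for i in range(3) for j in range(3)
                 if board[i][j] == 0]
    if rng.random() < exploreRate:
        i, j = rng.choice(positions)                       # explore
    else:
        i, j = max(positions,
                   key=lambda p: estimations.get(p, 0.5))  # greedy
    return [[i, j], symbol]   # position first, then the player's symbol
```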

We can see that the returned action is a list whose first item is a list containing the next position and whose second item is the symbol that represents the player.

OK, our tour of the Player object is over. Next we’ll look into the Judger object; before that, let’s recall the train() process.

We can see that the Judger object accepts two parameters: the two player objects. The definition of Judger is below:

Notice that rewards are only received at the end of the game. First, let’s see the initialization method.
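A sketch of that constructor (DummyPlayer is a minimal stand-in providing just the setSymbol() hook the Judger calls):

```python
class DummyPlayer:
    def __init__(self):
        self.symbol = 0
    def setSymbol(self, symbol):
        self.symbol = symbol

class Judger:
    def __init__(self, player1, player2, feedback=True):
        self.p1 = player1
        self.p2 = player2
        self.feedback = feedback    # propagate rewards (training) or not
        self.currentPlayer = None   # whoever should move next
        self.p1Symbol = 1
        self.p2Symbol = -1
        self.p1.setSymbol(self.p1Symbol)
        self.p2.setSymbol(self.p2Symbol)
        self.currentState = [[0] * 3 for _ in range(3)]  # empty board
```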

p1 and p2 are the two players in the game. feedback controls whether rewards are propagated back: during training it is true, while during competing and playing it is false. currentPlayer records who should move next. The judger then sets a symbol for each player, and currentState is the start state (an empty board).

Moving on, below is the giveReward() method:
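A sketch of the reward handout; feedReward() is a hypothetical name for the player hook that receives the reward, and RecordingPlayer is a stand-in that just remembers what it was given:

```python
class RecordingPlayer:
    def __init__(self):
        self.reward = None
    def feedReward(self, reward):
        self.reward = reward

def giveReward(p1, p2, winner):
    # rewards only arrive at the end: 1 to the winner, 0 otherwise
    if winner == 1:
        p1.feedReward(1)
        p2.feedReward(0)
    elif winner == -1:
        p1.feedReward(0)
        p2.feedReward(1)
    else:                   # a tie: everyone gets 0
        p1.feedReward(0)
        p2.feedReward(0)
```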

As we said earlier, rewards are only received at the end of the game. If player A wins, we give A a reward of 1 and the other player 0; if the game is a tie, both rewards are 0. We’ll explain feedCurrentState() later. First, the reset() method.

It’s simple, right? Let’s skip past it and go to the core method:
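A compressed, runnable sketch of such a play loop: the two players alternate, every reached state is fed to both of them, and the loop ends when the board is terminal. ScriptedPlayer replays a fixed move list so the sketch can be exercised without a learned policy:

```python
import itertools

def isEnd(data, n=3):
    sums = [sum(row) for row in data]
    sums += [sum(data[i][j] for i in range(n)) for j in range(n)]
    sums.append(sum(data[i][i] for i in range(n)))
    sums.append(sum(data[i][n - 1 - i] for i in range(n)))
    if n in sums:
        return True, 1
    if -n in sums:
        return True, -1
    if sum(abs(c) for row in data for c in row) == n * n:
        return True, 0
    return False, None

class ScriptedPlayer:
    def __init__(self, moves):
        self.moves = iter(moves)
        self.states = []              # every state fed back to the player
    def takeAction(self, board):
        return next(self.moves)
    def feedState(self, state):
        self.states.append([row[:] for row in state])

def play(p1, p2):
    board = [[0] * 3 for _ in range(3)]
    symbols = {id(p1): 1, id(p2): -1}
    for player in itertools.cycle([p1, p2]):
        i, j = player.takeAction(board)
        board[i][j] = symbols[id(player)]
        p1.feedState(board)           # both players see every state
        p2.feedState(board)
        end, winner = isEnd(board)
        if end:
            return winner
```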

We can see the two players alternate moves. Every state reached during the game is fed into both players’ states attribute.

So here is what that method looks like:
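Presumably the feeding itself is a two-liner like this (Sink is a stand-in player that records what it receives):

```python
class Sink:
    def __init__(self):
        self.states = []
    def feedState(self, state):
        self.states.append(state)

class Judger:
    # only the state-feeding part, in isolation
    def __init__(self, player1, player2):
        self.p1 = player1
        self.p2 = player2
    def feedCurrentState(self, state):
        # hand the state the game just reached to both players
        self.p1.feedState(state)
        self.p2.feedState(state)
```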

Let’s finally explain states. Each player only updates the states reached during a single game, and every reached state is fed into the player’s states attribute. Note that one game therefore updates only a small part of all the states; only after many games can all of them be updated. This is why TD methods need many epochs.

Phew! The three core objects are finally explained. Now the three processes, train, compete, and play, should be clear.

It’s worth noting that there is a HumanPlayer object.
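A sketch of such a HumanPlayer: it implements the same interface as Player but learns nothing; it just turns keyboard input into a move. The read parameter is an addition of this sketch, injectable so the class can be tested without a console; the 0-to-8 cell numbering is an assumption:

```python
class HumanPlayer:
    def __init__(self, read=input):
        self.read = read     # defaults to input(); replaceable for tests
        self.symbol = 0
    def setSymbol(self, symbol):
        self.symbol = symbol
    def feedState(self, state):
        pass                 # a human keeps the board in their head
    def feedReward(self, reward):
        pass                 # and does not learn from rewards
    def takeAction(self):
        cell = int(self.read("input a position (0-8): "))
        i, j = divmod(cell, 3)
        return [[i, j], self.symbol]
```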

We can see that this object does no learning; it just puts a piece on the board.

OK, we’re done! Finally, here is the complete code.