
Tic-Tac-Toe Game

What is the tic-tac-toe game? Two players take turns playing on a three-by-three board. One player plays Xs and the other plays Os until one of them wins by placing three marks in a row, horizontally, vertically, or diagonally. If the board fills up with neither player getting three in a row, the game is a draw.

The program has three stages: train, compete, and play.

First, let's look at the training stage. Below is the train() method.

def train(epochs=20000):
    player1 = Player()
    player2 = Player()
    judger = Judger(player1, player2)
    player1Win = 0.0
    player2Win = 0.0
    for i in range(0, epochs):
        print("Epoch", i)
        winner = judger.play()
        if winner == 1:
            player1Win += 1
        if winner == -1:
            player2Win += 1
        judger.reset()
    print(player1Win / epochs)
    print(player2Win / epochs)
    player1.savePolicy()
    player2.savePolicy()

The train() method first creates two Player objects and then lets them play tic-tac-toe against each other through a Judger object. It's a very simple process.

Next, let's get into the Player object.

Below is the skeleton of the Player class. To make it easier to follow, I have omitted the implementation details of each method.

class Player:
    # @stepSize: step size to update estimations
    # @exploreRate: possibility to explore
    def __init__(self, stepSize=0.1, exploreRate=0.1):
    def reset(self):
    def setSymbol(self, symbol):
    # accept a state
    def feedState(self, state):
    # update estimation according to reward
    def feedReward(self, reward):
    # determine next action
    def takeAction(self):
    def savePolicy(self):
    def loadPolicy(self):

For a Player, the important thing during training is to learn a policy. A policy tells the player which move to choose when it faces a given state. That is why there are two methods, savePolicy() and loadPolicy(): when training ends, the player saves its learned policy, and it loads that same policy later when it competes against someone else.

Here are their implementation details:

def savePolicy(self):
    fw = open('optimal_policy_' + str(self.symbol), 'wb')
    pickle.dump(self.estimations, fw)
    fw.close()

def loadPolicy(self):
    fr = open('optimal_policy_' + str(self.symbol), 'rb')
    self.estimations = pickle.load(fr)
    fr.close()
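
Since the file name includes the player's symbol, training produces two policy files, optimal_policy_1 and optimal_policy_-1, one for each side; the compete and play stages later load the matching file through loadPolicy().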

Now let's jump into the initialization method; below is its source code:

def __init__(self, stepSize=0.1, exploreRate=0.1):
    self.allStates = allStates
    self.estimations = dict()
    self.stepSize = stepSize
    self.exploreRate = exploreRate
    self.states = []

Every player holds a dictionary. For each item in the dictionary, the key is a state and the value is the estimated probability of winning from that state. We use the TD(0) method to solve the problem; that is, we update the state-value function step by step. The update rule is:

$$ V(s) \leftarrow V(s) + \alpha [V(s') - V(s)] $$

Here $\alpha$ is the step size, $s'$ is the next state, and $s$ is the current state. $V(\cdot)$ denotes the estimated probability of winning from the given state.
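
For example, if the current estimate is $V(s) = 0.5$, the next state is already known to be a win so $V(s') = 1.0$, and $\alpha = 0.1$, the update gives $V(s) = 0.5 + 0.1 \times (1.0 - 0.5) = 0.55$: the value of $s$ moves a small step toward the value of the state that followed it.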

So what is the explore rate? To understand it, we need to know how the player chooses its next action in the current state. For every state, we first find every state it can transfer to, then look up the dictionary and pick the reachable state with the highest estimated value; that state determines the action to take. This is called the greedy policy. But the value of each state is only our estimate, not the true probability of winning, so the greedy policy can be misled by estimation errors. One way to address this is, at every state, to not always select the next state with the highest estimated value, but instead, with probability equal to the explore rate, to choose the next state at random. Formally, if we denote the explore rate by $\epsilon$, this is called the $\epsilon$-greedy method.
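
Here is a minimal sketch of that selection rule (the names epsilonGreedy, estimations, and nextStates are illustrative, not the article's exact code; it only assumes that estimations maps state hashes to estimated values and that nextStates lists the hashes reachable from the current position):

import numpy as np

def epsilonGreedy(estimations, nextStates, exploreRate=0.1):
    # with probability exploreRate, explore: pick a reachable state at random
    if np.random.binomial(1, exploreRate):
        return nextStates[np.random.randint(len(nextStates))]
    # otherwise exploit: pick the reachable state with the highest estimated value
    return max(nextStates, key=lambda hash: estimations[hash])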

Next, let's see what the allStates variable is.

self.allStates = allStates

Tracing it back to where it is defined at module level, we can see:

# all possible board configurations
allStates = getAllStates()

So what does getAllStates() look like?

def getAllStates():
    currentSymbol = 1
    currentState = State()
    allStates = dict()
    allStates[currentState.getHash()] = (currentState, currentState.isEnd())
    getAllStatesImpl(currentState, currentSymbol, allStates)
    return allStates

By now you may be asking: what exactly is a State? Below is its definition:

class State:
    def __init__(self):
    def getHash(self):
    def isEnd(self):
    def nextState(self, i, j, symbol):
    def show(self):

One state is one arrangement of pieces on the board. Each state also carries some extra attributes, such as who the winner is in that state and whether the state is terminal. In particular, each state has a hash value for convenient representation. The board is represented by an n-by-n array, so one state is essentially an n-by-n array.

def __init__(self):
    # the board is represented by an n * n array:
    # 1 represents a chessman of the player who moves first,
    # -1 represents a chessman of the other player,
    # 0 represents an empty position
    self.data = np.zeros((BOARD_ROWS, BOARD_COLS))
    self.winner = None
    self.hashVal = None
    self.end = None

Below is how to calculate the hash value of a state:

# calculate the hash value for one state, it's unique
def getHash(self):
    if self.hashVal is None:
        self.hashVal = 0
        for i in self.data.reshape(BOARD_ROWS * BOARD_COLS):
            if i == -1:
                i = 2
            self.hashVal = self.hashVal * 3 + i
    return int(self.hashVal)
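
For example, a board with the first player's piece in the top-left corner and the second player's piece in the centre flattens to [1, 0, 0, 0, -1, 0, 0, 0, 0]. The -1 is mapped to 2 and the nine digits are read as a base-3 number, so the hash is $1 \times 3^8 + 2 \times 3^4 = 6561 + 162 = 6723$. Since every cell contributes its own base-3 digit, no two different boards share the same hash value.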

Below is how to judge whether a state is a terminal state:

# determine whether a player has won the game, or it's a tie
def isEnd(self):
    if self.end is not None:
        return self.end
    results = []
    # check rows
    for i in range(0, BOARD_ROWS):
        results.append(np.sum(self.data[i, :]))
    # check columns
    for i in range(0, BOARD_COLS):
        results.append(np.sum(self.data[:, i]))
    # check diagonals
    results.append(0)
    for i in range(0, BOARD_ROWS):
        results[-1] += self.data[i, i]
    results.append(0)
    for i in range(0, BOARD_ROWS):
        results[-1] += self.data[i, BOARD_ROWS - 1 - i]
    for result in results:
        if result == 3:
            self.winner = 1
            self.end = True
            return self.end
        if result == -3:
            self.winner = -1
            self.end = True
            return self.end
    # whether it's a tie
    sum = np.sum(np.abs(self.data))
    if sum == BOARD_ROWS * BOARD_COLS:
        self.winner = 0
        self.end = True
        return self.end
    # game is still going on
    self.end = False
    return self.end

There are two ways the game can end: someone wins, or the game is a tie. Player A's pieces are represented by 1 and player B's by -1, so if A wins, the sum of some row, column, or diagonal equals n (3 here), and if B wins it equals -n. The state's winner attribute is then set to 1 or -1 accordingly. If the sum of the absolute values of all cells on the board equals n * n, the board is full, the game is a tie, and the winner is 0 (nobody wins).

When someone puts a piece on the board, the state changes and transfers to another state. How do we get that next state?

def nextState(self, i, j, symbol):
    newState = State()
    newState.data = np.copy(self.data)
    newState.data[i, j] = symbol
    return newState

Finally, since we are playing a game, we need a way to display the board. That is what show() does.

# print the board
def show(self):
    for i in range(0, BOARD_ROWS):
        print('-------------')
        out = '| '
        for j in range(0, BOARD_COLS):
            if self.data[i, j] == 1:
                token = '*'
            if self.data[i, j] == 0:
                token = '0'
            if self.data[i, j] == -1:
                token = 'x'
            out += token + ' | '
        print(out)
    print('-------------')
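
For example, a board where the first player has taken the top-left corner and the second player the centre prints roughly like this ('*' for the first player, 'x' for the second player, '0' for an empty square):

-------------
| * | 0 | 0 |
-------------
| 0 | x | 0 |
-------------
| 0 | 0 | 0 |
-------------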

Let’s come back to the getAllStates() method.

def getAllStates():
    currentSymbol = 1
    currentState = State()
    allStates = dict()
    allStates[currentState.getHash()] = (currentState, currentState.isEnd())
    getAllStatesImpl(currentState, currentSymbol, allStates)
    return allStates

Now that we know what a state is, the next step is to generate all possible states. First, we need a data structure to store them, so we define a dictionary allStates. Its key is the hash value of a state, and its value is a tuple: the first item is the state itself and the second is a flag indicating whether the state is terminal. To generate all the states, we jump into the getAllStatesImpl() method.

def getAllStatesImpl(currentState, currentSymbol, allStates):
    for i in range(0, BOARD_ROWS):
        for j in range(0, BOARD_COLS):
            if currentState.data[i][j] == 0:
                newState = currentState.nextState(i, j, currentSymbol)
                newHash = newState.getHash()
                if newHash not in allStates.keys():
                    isEnd = newState.isEnd()
                    allStates[newHash] = (newState, isEnd)
                    if not isEnd:
                        getAllStatesImpl(newState, -currentSymbol, allStates)

The getAllStatesImpl() method starts from an empty board (currentState) and generates the states step by step by calling itself recursively. Because the game is very simple, we can enumerate all possible states; for larger games this would be impossible.
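
As a quick sanity check, here is a minimal sketch of how the resulting dictionary can be inspected (it assumes the State class and getAllStates() above are defined in the same module):

# build the full state space once
allStates = getAllStates()
print(len(allStates))                        # number of distinct reachable boards
state, isEnd = allStates[State().getHash()]  # look up the empty starting board
print(isEnd)                                 # False: the game has not ended yet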

Tada! Let's come back to the Player object. Its skeleton is repeated here for convenience.

class Player:
    # @stepSize: step size to update estimations
    # @exploreRate: possibility to explore
    def __init__(self, stepSize=0.1, exploreRate=0.1):
    def reset(self):
    def setSymbol(self, symbol):
    # accept a state
    def feedState(self, state):
    # update estimation according to reward
    def feedReward(self, reward):
    # determine next action
    def takeAction(self):
    def savePolicy(self):
    def loadPolicy(self):

We have already explained the initialization method. It is worth noticing that the Player object has an attribute called states; we'll explain it later.

Below is the reset() method:

def reset(self):
    self.states = []

and below is the setSymbol() method:

def setSymbol(self, symbol):
    self.symbol = symbol
    for hash in self.allStates.keys():
        (state, isEnd) = self.allStates[hash]
        if isEnd:
            if state.winner == self.symbol:
                self.estimations[hash] = 1.0
            else:
                self.estimations[hash] = 0
        else:
            self.estimations[hash] = 0.5

We know that each player's pieces on the board have a symbol (1 or -1). This method assigns a symbol to the player. Furthermore, it initializes the estimated state-value dictionary mentioned earlier: terminal states where this player wins are set to 1.0, all other terminal states (losses and ties) are set to 0, and every non-terminal state starts at 0.5.

And the feedState() method:

def feedState(self, state):
    self.states.append(state)

As with the states variable, we'll explain it later.

Moving on, below is the feedReward() method. It is not only the core of the Player object but also the core of the method that solves this game; in other words, it is the heart of the TD(0) update.

def feedReward(self, reward):
    if len(self.states) == 0:
        return
    self.states = [state.getHash() for state in self.states]
    target = reward
    for latestState in reversed(self.states):
        value = self.estimations[latestState] + \
            self.stepSize * (target - self.estimations[latestState])
        self.estimations[latestState] = value
        target = value
    self.states = []

We mentioned the update rule earlier. Below is its implementation:

value = self.estimations[latestState] + \
    self.stepSize * (target - self.estimations[latestState])

Also notice these two lines in the code:

self.estimations[latestState] = value
target = value

So the update is chain-like: each updated value becomes the target for the state that preceded it. Note also that the states variable is cleared at the end (again, we'll explain it later).
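
For example, suppose the winning player visited three states during a game, each currently estimated at 0.5, and then receives a reward of 1. The last state is updated first: $0.5 + 0.1 \times (1 - 0.5) = 0.55$. That 0.55 then becomes the target for the state before it, which becomes $0.5 + 0.1 \times (0.55 - 0.5) = 0.505$, and so on back to the first move. In this way the final reward is propagated backward through every state visited in the episode.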

The next method, which implements the $\epsilon$-greedy policy, is also very important, because it tells the player how to take its next action:

def takeAction(self):
    state = self.states[-1]
    nextStates = []
    nextPositions = []
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if state.data[i, j] == 0:
                nextPositions.append([i, j])
                nextStates.append(state.nextState(i, j, self.symbol).getHash())
    if np.random.binomial(1, self.exploreRate):
        np.random.shuffle(nextPositions)
        # Not sure if truncating is the best way to deal with exploratory step
        # Maybe it's better to only skip this step rather than forget all
        # the history
        self.states = []
        action = nextPositions[0]
        action.append(self.symbol)
        return action
    values = []
    for hash, pos in zip(nextStates, nextPositions):
        values.append((self.estimations[hash], pos))
    np.random.shuffle(values)
    values.sort(key=lambda x: x[0], reverse=True)
    action = values[0][1]
    action.append(self.symbol)
    return action

Note that the returned action is a list [i, j, symbol]: the first two items give the position of the next move and the last item is the symbol representing the player. For example, [1, 2, -1] means the second player places a piece at row 1, column 2.

OK, our tour of the Player object is over. Next, we'll look into the Judger object. Before that, let's recall the train() process.

def train(epochs=20000):
    player1 = Player()
    player2 = Player()
    judger = Judger(player1, player2)
    player1Win = 0.0
    player2Win = 0.0
    for i in range(0, epochs):
        print("Epoch", i)
        winner = judger.play()
        if winner == 1:
            player1Win += 1
        if winner == -1:
            player2Win += 1
        judger.reset()
    print(player1Win / epochs)
    print(player2Win / epochs)
    player1.savePolicy()
    player2.savePolicy()

We can see that the Judger object accepts two parameters, namely the two Player objects. The definition of Judger is below:

class Judger:
    # @player1: player who will move first, its chessman will be 1
    # @player2: another player, with chessman -1
    # @feedback: if True, both players will receive rewards when the game ends
    def __init__(self, player1, player2, feedback=True):
    # give reward to the two players
    def giveReward(self):
    def feedCurrentState(self):
    def reset(self):
    # @show: if True, print each board during the game
    def play(self, show=False):

Notice that rewards are only given at the end of the game. First, let's look at the initialization method.

def __init__(self, player1, player2, feedback=True):
    self.p1 = player1
    self.p2 = player2
    self.feedback = feedback
    self.currentPlayer = None
    self.p1Symbol = 1
    self.p2Symbol = -1
    self.p1.setSymbol(self.p1Symbol)
    self.p2.setSymbol(self.p2Symbol)
    self.currentState = State()
    self.allStates = allStates

p1 and p2 are the two players that play the game. feedback controls whether rewards are propagated back to the players: it is True during training and False during the compete and play processes. currentPlayer records who should move next. The judger then assigns a symbol to each player. currentState is the start state (an empty board).

Moving on, below is the giveReward() method:

def giveReward(self):
    if self.currentState.winner == self.p1Symbol:
        self.p1.feedReward(1)
        self.p2.feedReward(0)
    elif self.currentState.winner == self.p2Symbol:
        self.p1.feedReward(0)
        self.p2.feedReward(1)
    else:
        self.p1.feedReward(0)
        self.p2.feedReward(0)

As we said earlier, rewards are only given at the end of the game. If player A wins, A receives a reward of 1 and the opponent receives 0; if the game is a tie, both rewards are 0. We'll explain feedCurrentState() later; first, let's look at the reset() method.

def reset(self):
    self.p1.reset()
    self.p2.reset()
    self.currentState = State()
    self.currentPlayer = None

It's simple, right? Let's move on to the core method:

def play(self, show=False):
    self.reset()
    self.feedCurrentState()
    while True:
        # set current player
        if self.currentPlayer == self.p1:
            self.currentPlayer = self.p2
        else:
            self.currentPlayer = self.p1
        if show:
            self.currentState.show()
        [i, j, symbol] = self.currentPlayer.takeAction()
        self.currentState = self.currentState.nextState(i, j, symbol)
        hashValue = self.currentState.getHash()
        self.currentState, isEnd = self.allStates[hashValue]
        self.feedCurrentState()
        if isEnd:
            if self.feedback:
                self.giveReward()
            return self.currentState.winner

We can see that the two players take turns to move. Every state reached during the game is fed into each player's states attribute:

self.feedCurrentState()

Here is what that method looks like:

def feedCurrentState(self):
    self.p1.feedState(self.currentState)
    self.p2.feedState(self.currentState)

def feedState(self, state):
    self.states.append(state)

Now let's finally explain the states attribute. In a single game, each player records only the states that the game actually reached: every reached state is fed into the player's states attribute, and only those states are updated when the reward arrives. In other words, each game updates just a small part of the whole state space, and only after many games will most states have been updated. That is why TD methods need a lot of epochs.

Phew! The three core objects are finally explained. Now the three processes, train, compete, and play, should be clear.

def train(epochs=20000):
    player1 = Player()
    player2 = Player()
    judger = Judger(player1, player2)
    player1Win = 0.0
    player2Win = 0.0
    for i in range(0, epochs):
        print("Epoch", i)
        winner = judger.play()
        if winner == 1:
            player1Win += 1
        if winner == -1:
            player2Win += 1
        judger.reset()
    print(player1Win / epochs)
    print(player2Win / epochs)
    player1.savePolicy()
    player2.savePolicy()

def compete(turns=500):
    player1 = Player(exploreRate=0)
    player2 = Player(exploreRate=0)
    judger = Judger(player1, player2, False)
    player1.loadPolicy()
    player2.loadPolicy()
    player1Win = 0.0
    player2Win = 0.0
    for i in range(0, turns):
        print("Epoch", i)
        winner = judger.play()
        if winner == 1:
            player1Win += 1
        if winner == -1:
            player2Win += 1
        judger.reset()
    print(player1Win / turns)
    print(player2Win / turns)

def play():
    while True:
        player1 = Player(exploreRate=0)
        player2 = HumanPlayer()
        judger = Judger(player1, player2, False)
        player1.loadPolicy()
        winner = judger.play(True)
        if winner == player2.symbol:
            print("Win!")
        elif winner == player1.symbol:
            print("Lose!")
        else:
            print("Tie!")

It’s worth noting that there is a HumanPlayer object.

class HumanPlayer:
    def __init__(self, stepSize=0.1, exploreRate=0.1):
        self.symbol = None
        self.currentState = None
        return

    def reset(self):
        return

    def setSymbol(self, symbol):
        self.symbol = symbol
        return

    def feedState(self, state):
        self.currentState = state
        return

    def feedReward(self, reward):
        return

    def takeAction(self):
        data = int(input("Input your position:"))
        data -= 1
        i = data // int(BOARD_COLS)
        j = data % BOARD_COLS
        if self.currentState.data[i, j] != 0:
            return self.takeAction()
        return (i, j, self.symbol)

As you can see, this object does no learning at all; it simply reads your input and places a piece on the board.
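
For example, on the 3-by-3 board the positions are numbered 1 through 9 from the top-left corner, row by row: typing 1 places your piece at row 0, column 0, typing 5 places it in the centre, and typing 9 places it at the bottom-right corner. If the chosen square is already occupied, takeAction() simply asks for the input again.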

OK, we're done! Finally, we put the complete code here.