Reinforcement Learning on Tetris

Rex L
Jun 4, 2021 · 9 min read

I have decided to rewrite this journal and deliver a succinct article about this side project. The code can be found here: tetris_ai. Here is the outline of this journal:
1. Videos of current results
2. Algorithm of Reinforcement Learning
3. Application Details
4. Neural Networks
5. Results and Discussion
6. Ideas and Variants I have tried

part 2 of the article: https://rex-l.medium.com/reinforcement-learning-on-tetris-2-f12f74f70788

Videos of current results

AI plays Tetris with a focus on T-spins
AI plays customized Tetris

Algorithm of Reinforcement Learning

Most content in this section does not come from authoritative sources (except where a source is provided). So if you notice something wrong or disagree with anything, you are very welcome to point it out and discuss it.

Source: https://youtu.be/w_IIP-swuVo

The major issue with applying Q-learning to Tetris is that many steps are trivial. The action set {a} for the game includes {left, right, down, turn left, turn right, hold, soft drop and hard drop}, but it is not our priority to teach the AI how to move a piece. When a human plays Tetris, figuring out how to move a piece to a certain location is trivial; what we actually think about is where to move it. Hence, it is better to provide all possible locations the current piece can drop at, and let the AI choose the best one.

All possible states and rewards: [(r, s’)], where r is simply {score’ - score}. If you are not familiar with T-spins, please check this short video: https://youtu.be/FI39WJqTLvA

With this adjustment, the action {a} is no longer involved, and the reward {r} is bound to the future state {s’}. The Q value is replaced by a V value, which is basically the maximum Q over the pool of possible states {[s’]}.

Φ denotes the trainable variables of the Neural Network for V(s)

This is a huge advantage for the AI. First, the AI does not need to learn how to move a piece; otherwise, we would expect to see it get stuck moving a piece back and forth, turning left and then right. Second, the dataset size is reduced to about 1/7, where 7 is the estimated average number of steps to move a piece. Finally, the AI does not need to learn how to score, for example how to perform a T-spin; it only needs to learn how to set up the board for a T-spin. The V value can be interpreted as the potential to score, which is expected to be lower than the actual score because of γ. So, when the AI is provided with [(r, s’)], it processes it into a list of numbers [r+V(s’)]; the high-scoring move will almost certainly have the largest value, and thus the policy will choose it.
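
To make the selection step concrete, here is a minimal sketch in Python. The function name choose_best_future_state, the value_fn callable and the shape of the candidate list are my own assumptions, not the actual code from tetris_ai.

```python
import numpy as np

def choose_best_future_state(candidates, value_fn):
    """Pick the future state with the largest r + V(s').

    candidates: list of (r, s') pairs provided by the game for the current piece
    value_fn:   callable returning V(s') for a single future state
    """
    scores = np.array([r + value_fn(s_next) for r, s_next in candidates])
    best = int(np.argmax(scores))          # the high-score move wins
    return candidates[best], scores[best]
```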

Application Details

  • Outer loop (collecting samples {(s, s’, r)}):
    For each step, the environment (the game) has a current state s and provides a list of all possible future states with their corresponding rewards [(s’, r)]. The Neural Network evaluates the V value of each, turning the list into a list of numbers [r+V(s’)], and the entry with the highest r+V(s’) is named the best future state. The AI appends the current state, the best future state and its reward (s, s’, r) to the samples. Then it generates a random number between 0 and 1: if the number is larger than ε (say, 0.05), the AI exploits and chooses the best future state; otherwise it explores and chooses a random future state. Finally, the current state s of the environment is replaced by the chosen s’.
    Next step… repeat until game over.
  • Inner loop (updating the target y):
    This is simply y_i = γ[r_i + V(s’_i)].
    I put r_i inside the bracket because I think it is bound to the future state; hence, it is a future reward.
    One thing to notice: when s’_i is game over, y_i = γ[r_i], and it must not include V(s’_i), because that value will never receive a training target and is therefore unbounded.
    Another small detail: s contains the current piece, the hold piece and 4 next pieces. s’ contains the future current piece, the hold piece, 3 next pieces and an unknown 4th next piece, so the unknown piece is fed in as [0,0,0,0,0,0,0]. (A rough sketch of both loops follows this list.)
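
Here is the promised rough sketch of both loops, using the author's convention y_i = γ(r_i + V(s’_i)). The environment interface (current_state, get_possible_states, apply, is_game_over) and the value_fn callable are placeholders of my own for readability, not the actual API of tetris_ai.

```python
import random
import numpy as np

GAMMA = 0.95
EPSILON = 0.05

def collect_samples(env, value_fn, samples):
    """Outer loop: play one game, appending (s, s', r) triples to `samples`."""
    s = env.current_state()
    while not env.is_game_over():
        candidates = env.get_possible_states()          # list of (s', r) pairs
        scores = [r + value_fn(s_next) for s_next, r in candidates]
        best = int(np.argmax(scores))
        s_best, r_best = candidates[best]
        samples.append((s, s_best, r_best))             # the best move is always recorded
        if random.random() > EPSILON:
            chosen = best                                # exploit
        else:
            chosen = random.randrange(len(candidates))   # explore
        s = env.apply(candidates[chosen][0])             # the chosen s' becomes the current state
    return samples

def make_targets(samples, value_fn, game_over_flags):
    """Inner loop: y_i = gamma * (r_i + V(s'_i)), dropping V(s'_i) at game over."""
    targets = []
    for (s, s_next, r), done in zip(samples, game_over_flags):
        y = GAMMA * r if done else GAMMA * (r + value_fn(s_next))
        targets.append(y)
    return np.array(targets, dtype=np.float32)
```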

Fitting (training the neural network):
Let the library functions of the machine learning package handle it.
I use tensorflow.keras, and the only thing I would like to point out here is that model(inputs, training=False) is much faster than model.predict(inputs, batch_size=batch_size), although I need to write the mini-batching for model(inputs) manually.
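
As an illustration, manual mini-batching over model(inputs, training=False) could look like this, assuming a single input array for simplicity; the batch size and helper name are just examples.

```python
import numpy as np
import tensorflow as tf

def predict_in_batches(model, inputs, batch_size=512):
    """Evaluate the model with direct calls instead of model.predict().

    Calling the model directly avoids predict()'s per-call overhead,
    which adds up when V(s') is evaluated thousands of times per game.
    """
    outputs = []
    for i in range(0, len(inputs), batch_size):
        batch = tf.convert_to_tensor(inputs[i:i + batch_size])
        outputs.append(model(batch, training=False).numpy())
    return np.concatenate(outputs, axis=0)
```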

Neural Networks

Inputs: the height of each column gives the AI a sense of danger, and the hole positions give it a sense of how untidy the board is. inputs[0] is a 2D array which is fed to a Convolutional Neural Network.

Inputs for the Neural Networks. Possible piece types are ‘S’, ‘Z’, ‘I’, ‘O’, ’T’, ’L’, ’J’; 7 of them in total
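
Based on the description above, the encoding might look roughly like this; the exact shapes and helper names in tetris_ai may differ, so treat this as an assumption.

```python
import numpy as np

PIECES = ['S', 'Z', 'I', 'O', 'T', 'L', 'J']

def one_hot_piece(piece):
    """Length-7 one-hot vector for a piece type; all zeros for the unknown piece."""
    vec = np.zeros(len(PIECES), dtype=np.float32)
    if piece in PIECES:
        vec[PIECES.index(piece)] = 1.0
    return vec

def encode_state(board, current, hold, next_pieces):
    """inputs[0]: the 2D board array fed to the CNN (heights and holes are visible in it).
    inputs[1]: concatenated one-hot vectors for the current, hold and next pieces."""
    board_input = np.asarray(board, dtype=np.float32)
    piece_input = np.concatenate([one_hot_piece(p) for p in [current, hold] + list(next_pieces)])
    return [board_input, piece_input]
```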

Model:

Neural Network Model

Some parameters:
Buffer size: 50000
Inner loop: 5
Epoch: 5
Batch size: 512
optimizer=keras.optimizers.Adam(0.001)
loss='huber_loss'
metrics='mean_squared_error'
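
Put together, the compile step with those parameters could look like the sketch below. The real architecture is the one in the figure above; the layer sizes and input shapes here are my own placeholders, and I use the keras.losses.Huber class, which corresponds to the 'huber_loss' setting.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder two-input network: a CNN on the 2D board (inputs[0]) and the
# one-hot piece vectors (inputs[1]) merged into a single V-value output.
board_in = keras.Input(shape=(20, 10, 1), name='board')    # board shape is an assumption
piece_in = keras.Input(shape=(42,), name='pieces')         # 6 pieces x 7 types

x = layers.Conv2D(32, 3, activation='relu')(board_in)
x = layers.Flatten()(x)
x = layers.Concatenate()([x, piece_in])
x = layers.Dense(64, activation='relu')(x)
v_out = layers.Dense(1, name='V')(x)

model = keras.Model([board_in, piece_in], v_out)
model.compile(
    optimizer=keras.optimizers.Adam(0.001),
    loss=keras.losses.Huber(),
    metrics=['mean_squared_error'],
)

# Each inner loop then fits on the collected samples, e.g.:
# model.fit([board_batch, piece_batch], targets, batch_size=512, epochs=5)
```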

Results and Discussion

The AI is surprisingly fast at learning to survive. In this video, AI playing customized Tetris, it took only 20 outer loops to reach this level of play. Each loop took about 400 seconds on my desktop, which is roughly 2.5 hours in total to train a brand-new AI to survive a customized Tetris game almost indefinitely, provided the difficulty of the customized game is reasonable. I find the AI plays this customized game really well.

However, the AI is bad at scoring efficiently, i.e. score per step. It completes one, two or three lines too often. I have not found a way to train it to achieve a higher score per 2000 pieces; the AI simply got stuck somewhere or became worse. I believe this is because a traditional Tetris game has recurring states, but different policies lead to different rewards.

As an example, given γ=0.95, completing 1 line rewards 10 points while completing 2 lines rewards 30 points:

Policy 1:
current state s → drop 1 piece → drop 1 piece → drop 1 piece and complete 1 line → drop 1 piece → drop 1 piece and complete 1 line → same state s
V[s] = 10*γ³+γ⁵*(10+V[s])
V[s] = 72.11

Policy 2:
current state s → drop 4 pieces → drop 1 piece and complete 2 lines → same state s
V[s] = γ⁵(30+V[s])
V[s] = 102.61
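
The two V[s] values above come from solving each recurrence for its fixed point; a few lines of Python reproduce them:

```python
gamma = 0.95

# Policy 1: V = 10*gamma**3 + gamma**5 * (10 + V)
v1 = (10 * gamma**3 + 10 * gamma**5) / (1 - gamma**5)

# Policy 2: V = gamma**5 * (30 + V)
v2 = (30 * gamma**5) / (1 - gamma**5)

print(f"{v1:.2f} {v2:.2f}")   # 72.11 102.61
```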

  • With a good γ value, the AI should theoretically pursue a policy with a higher score per piece. But it is really difficult for the AI to find a better policy once it has been trained on the current one: it would require the AI to hold back for a number of steps (>3) instead of completing lines and scoring right now. And even when it chooses to explore, the chance of picking a move that is not terrible is low, and this lucky coincidence has to happen several times in a row before a better policy is encountered even once; that sample will probably be drowned in the pool of samples from bad policies. (As I write this, I realize that using a distributed probability for the epsilon-greedy method instead of a uniform random choice might improve this.)
  • In other words, the current AI is really unlikely to switch to another policy. See this video https://youtu.be/BbruZtUte00: the AIs developed different strategies even though their setups were exactly the same. Some ended up always trying to do T-spins, while others never did one.
  • Another problem revealed by the example policies is that the V values may be somewhat independent of the state, as long as the state occurs after a few piece drops. This means all states, if the next pieces are not given, may have the same V value! Crazy!
  • The results are highly sensitive to the chosen γ. I imagine γ, the reward and the penalty as forming the boundary conditions; every value in the field of states depends on them, so they are crucial. For example, if γ is too low, the AI prefers grabbing points right now instead of saving up for higher points later. If γ is too high, the V values may become absurdly high, because states recur.

Ideas and Variants I have tried

  • [Tried] N-step returns: bad results. I don't know why.
  • [Tried] Sampling all possible states instead of just the best one, {(s, [s’, r])}. Then add another loop between the outer loop and the inner loop that chooses the best [s’, r] with the current model. The benefit is that all samples remain usable forever, because they are policy independent. However, there is a problem during model fitting: when a different [s’, r] is chosen, V(s’) might be unbounded, because the game likely has not experienced that route yet.
  • [Not tried yet] Based on the analysis above, should I skip adding the sample (s, s’, r) when the step was exploration instead of exploitation? Note that no matter what, the best s’ is recorded, not the one actually chosen.
  • [Tried] Replay buffer: Doesn’t seem to improve convergence.
  • [Not tried yet] Ensemble models: this has great potential because, as mentioned before, the early training is really fast (~2.5 hours), but the models then get stuck with their current policy. Some happen to develop a more efficient policy (T-spins), and which ones do is random. Perhaps it would be good to consult a dozen models that have each trained for 3 hours, and then use the samples they collect to train some new models.
  • [Tried] Custom layer: there is too much to say about it here, so I will skip it for the moment. Currently it shows a small improvement, but it still hits the same limit.
  • [Not tried yet] For exploration, using a distributed probability based on [V(s’)+r] to choose the move (a sketch of this idea follows this list).
  • [Tried, somewhat] True transition function: the last piece of inputs[1] of s’ can be any of the possibilities, [1,0,0,0,0,0,0], [0,1,0,0,0,0,0], …, [0,0,0,0,0,0,1], and V(s’) can be taken as the average over all of them. This costs more computation time, and I did not find a significant improvement. However, it is absolutely important for testing the algorithm against a tabular method. (I tested the algorithm on a 3x3 game board and it proved consistent with the tabular method :>)
  • [Tried] Manipulating the reward function: I would like to see the AI do more 4-line clears and double T-spins instead of 1-, 2- or 3-line clears, so I manipulated the reward function to differ from the added score. This works well, but when I over-compress the score from 1-, 2- or 3-line clears, the AI gets confused in an interesting and reasonable way. It stacks the board until it is 3/4 high, and then completes 1 or 2 lines just for the sake of survival. It will never figure out that keeping a clean board would be better once it can survive at 3/4 height for long enough, because after a large number of steps, say 200, γ²⁰⁰ is essentially 0 and the game-over penalty never reaches such states.
  • [Tried] Multiprocessing: very good for collecting samples. The algorithm is fully fitted V-learning, so asynchronous training is not necessary, is it?
  • [Tried but failed] GPU: I think the best setup for this AI is to collect samples on the CPU and train the model on the GPU, but I failed at switching between them. During sample collection the input batch is tiny, about 40 possible future states for a current state, so using model(inputs) on one core is faster than model.predict(inputs), due to overhead I assume. And I can let each core keep its own copy of the model, so multiprocessing scales almost perfectly. However, the GPU is better at the V-fitting stage, about 3x faster on my desktop. I tried switching between '/cpu:0' and '/gpu:0', but there were always fatal errors. Anyway, training efficiency is not the major issue for this AI, and I am already satisfied with the current speed.
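
For the distributed-probability exploration idea mentioned in the list, a softmax over [r + V(s’)] is one possible form; this sketch is only an illustration of the idea, not code from the repository.

```python
import numpy as np

def sample_future_state(candidates, value_fn, temperature=1.0):
    """Sample a future state with probability proportional to exp((r + V(s')) / T).

    A low temperature approaches the greedy choice; a high temperature
    approaches uniform random exploration.
    """
    scores = np.array([r + value_fn(s_next) for s_next, r in candidates])
    logits = (scores - scores.max()) / temperature      # stabilize the exponent
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = np.random.choice(len(candidates), p=probs)
    return candidates[idx]
```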

Questions

  • Do you have any idea how to improve this AI?
  • Do you want to see the AI play a different customized Tetris game? You can tell me the pieces you'd like to see, and I can train it to play them. In my code, when a piece is too naughty, such as the "H" and the "Donut", I lower its chance of appearing; otherwise the game would be unplayable.
