https://github.com/ChintanTrivedi/DeepGamingAI_FIFARL
A code tutorial in TensorFlow that uses Reinforcement Learning to take free kicks.
Free-kicks taken by the AI bot, trained through 1000 epochs of the Reinforcement Learning process.
In my previous article, I presented an AI bot trained to play FIFA using Supervised Learning. With this approach, the bot quickly learnt the basics of the game, such as passing and shooting. However, the training data needed to improve it further quickly became cumbersome to gather and yielded little-to-no improvement, making this approach very time consuming. For this reason, I decided to switch to Reinforcement Learning, as suggested by almost everyone who commented on that article!
Previous article: Building a Deep Neural Network to play FIFA 18
In this article, I'll provide a short description of what Reinforcement Learning is and how I applied it to this game. A big challenge in implementing this is that we do not have access to the game's code, so we can only make use of what we see on the game screen. For this reason, I was unable to train the AI on the full game, but I found a workaround that lets me apply it to skill games in practice mode. For this tutorial, I will try to teach the bot to take 30-yard free kicks, but you can modify it to play other skill games as well. Let's start by understanding the Reinforcement Learning technique and how we can formulate our free kick problem to fit it.
What is Reinforcement Learning (and Deep Q-Learning)?
Unlike Supervised Learning, Reinforcement Learning does not require us to manually label the training data. Instead, we interact with our environment and observe the outcome of each interaction. We repeat this process many times, gaining examples of positive and negative experiences, which act as our training data. Thus, we learn by experimentation and not imitation.
Let's say our environment is in a particular state s, and upon taking an action a, it changes to state s'. For this particular action, the immediate reward you observe in the environment is r. Any actions that follow this one will have their own immediate rewards, until you stop interacting due to a positive or a negative experience. These are called future rewards. Thus, for the current state s, we try to estimate which of all possible actions will fetch us the maximum immediate + future reward, denoted by Q(s,a) and called the Q-function. This gives us Q(s,a) = r + γ * max_a' Q(s',a'), which denotes the expected final reward from taking action a in state s and then choosing the best available action a' in the next state s'. Here, γ is a discount factor that accounts for uncertainty in predicting the future: we want to trust the present a bit more than the future.
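To make the Q-function update concrete, here is a minimal sketch of how the target value for a single transition could be computed. The function name q_target and the value gamma=0.9 are assumptions for illustration and are not taken from the original code; a' is taken to be the best action available in the next state.

```python
def q_target(reward, next_q_values, gamma=0.9, game_over=False):
    """Return r + gamma * max_a' Q(s', a') for one transition.

    gamma=0.9 is an assumed discount factor, chosen only for illustration.
    """
    if game_over:
        # Once the interaction has ended there are no future rewards.
        return reward
    return reward + gamma * max(next_q_values)


# Moving earns no immediate reward (r = 0), but if the next state looks
# promising for a shot, the move still receives a positive value.
print(q_target(0.0, [0.5, 0.25, -0.5, 0.0]))

# A shot that ends the attempt with a goal is worth exactly its reward.
print(q_target(1.0, [0.0, 0.0, 0.0, 0.0], game_over=True))
```

This is how a reward earned at the end of an attempt can flow back to the earlier actions that set it up.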
Deep Q-learning is a special type of Reinforcement Learning technique where the Q-function is learnt by a deep neural network. Given the environment's state as an image input to this network, it tries to predict the expected final reward for all possible actions like a regression problem. The action with the maximum predicted Q-value is chosen as our action to be taken in the environment. Hence the name Deep Q-Learning.
Formulating free-kicks in FIFA as a Q-Learning problem
· States: Screenshot images of the game, processed through a MobileNet CNN to give a 128-dimensional flattened feature map.
· Actions: Four possible actions to take: shoot_low, shoot_high, move_left, move_right.
· Reward: If the in-game score increases by more than 200 upon pressing shoot, we scored a goal, so r=+1. If we missed the goal, the score remains the same, so r=-1. Finally, r=0 for actions related to moving left or right.
· Policy: Two-layered Dense Network that takes the feature map as input and predicts the total final reward for all 4 actions.
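The reward rule in the list above can be sketched as a small function. The name compute_reward and its parameters are hypothetical; only the four action names, the 200-point threshold and the reward values come from the formulation above.

```python
# Action names as listed in the formulation above.
ACTIONS = ["shoot_low", "shoot_high", "move_left", "move_right"]
SHOOT_ACTIONS = {"shoot_low", "shoot_high"}


def compute_reward(action, score_before, score_after, goal_threshold=200):
    """Map an action and the observed score change to a reward.

    Hypothetical helper: moving gives 0, a shot gives +1 when the score
    rises by more than the threshold, and -1 otherwise.
    """
    if action not in SHOOT_ACTIONS:
        return 0
    return 1 if score_after - score_before > goal_threshold else -1


print(compute_reward("move_left", 1000, 1000))   # 0
print(compute_reward("shoot_low", 1000, 1450))   # 1
print(compute_reward("shoot_high", 1000, 1000))  # -1
```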
Reinforcement Learning process for the bot interacting with the game environment. The Q-Learning Model is the heart of this process and is responsible for predicting the estimated future reward for all possible actions that the bot can take. This model is trained and updated continuously throughout this process.
Note: If we had a performance meter in the kick-off mode of FIFA like the one in practice mode, we might have been able to formulate this problem for playing the entire game instead of restricting ourselves to free kicks. That, or we would need access to the game's internal code, which we don't have. Anyway, let's make the most of what we do have.
Results
While the bot has not mastered all the different kinds of free kicks, it has learnt some situations very well. It almost always hits the target when there is no wall of players, but struggles when there is one. Also, when it faces a situation it hasn't encountered frequently in training, such as not facing the goal, it behaves erratically. However, this behavior was observed to decrease, on average, with every training epoch.
The figure shows, for epochs 1 through 1000, the average number of free kicks converted per attempt, calculated over a moving window of 200 attempts. So, for example, a value of 0.45 at epoch 700 means that about 45% of attempts around this epoch were converted into goals.
As shown in the figure above, the average goal-scoring rate grows from 30% to 50% after training for 1000 epochs. This means the current bot scores about half of the free kicks it attempts (for reference, a human would average around 75–80%). Do consider that FIFA tends to behave non-deterministically, which makes learning very difficult.
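For readers who want to reproduce this kind of curve from their own training logs, here is one possible way to compute a moving conversion rate. The function name and the input format (one True/False outcome per attempt) are assumptions; the window of 200 attempts comes from the description of the figure.

```python
from collections import deque


def moving_conversion_rate(outcomes, window=200):
    """Return, for each attempt, the goal rate over the last `window` attempts."""
    recent = deque(maxlen=window)
    goals = 0
    rates = []
    for scored in outcomes:
        if len(recent) == window:
            # The oldest outcome is about to fall out of the window.
            goals -= recent[0]
        recent.append(1 if scored else 0)
        goals += recent[-1]
        rates.append(goals / len(recent))
    return rates


# Tiny example with a window of 2 attempts instead of 200.
print(moving_conversion_rate([True, False, True, True], window=2))
# [1.0, 0.5, 0.5, 1.0]
```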
More results in video format can be found on my YouTube channel, with the video embedded below. Please subscribe to my channel if you wish to keep track of all my projects.
Code Implementation
We shall implement this in Python using tools like TensorFlow (Keras) for Deep Learning and pytesseract for OCR. The git repository is linked from this article, with the requirements and setup instructions in the repository description.
I recommend using the code gists below only for understanding this tutorial, since some lines have been removed for brevity. Please use the full code from git when running it. Let's go over the 4 main parts of the code.
1. Interacting with the game environment
We do not have a ready-made API that gives us access to the game's code. So, let's make our own API instead! We'll use the game's screenshots to observe the state, simulated key-presses to take actions in the game environment, and Optical Character Recognition to read our reward in the game. We have three main methods in our FIFA class: observe(), act() and _get_reward(), plus an additional method is_over() to check whether the free kick has been taken.
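The shape of such a wrapper can be sketched as follows. This is not the FIFA class from the repository: the class name FreeKickEnv, the constructor arguments and the return value of act() are all assumptions, and the screen capture, key-press and OCR steps are passed in as plain callables so the sketch runs without the game.

```python
class FreeKickEnv:
    """Hypothetical environment wrapper with the four methods named above."""

    def __init__(self, grab_screen, press_key, read_score):
        # All three callables are stand-ins for the real screenshot,
        # key-press simulation and OCR code.
        self.grab_screen = grab_screen
        self.press_key = press_key
        self.read_score = read_score
        self.last_score = read_score()
        self.shot_taken = False

    def observe(self):
        """Return the current game frame as the state."""
        return self.grab_screen()

    def act(self, action):
        """Press the key for `action`, then return (state, reward, game_over)."""
        self.press_key(action)
        self.shot_taken = action in ("shoot_low", "shoot_high")
        reward = self._get_reward()
        return self.observe(), reward, self.is_over()

    def _get_reward(self):
        """Read the score only after a shot; moving always gives 0."""
        if not self.shot_taken:
            return 0
        new_score = self.read_score()
        reward = 1 if new_score - self.last_score > 200 else -1
        self.last_score = new_score
        return reward

    def is_over(self):
        """The attempt ends once the free kick has been taken."""
        return self.shot_taken


# Stand-ins: a dummy frame, a log of pressed keys, scripted score readings.
pressed = []
score_readings = iter([1000, 1450])
env = FreeKickEnv(
    grab_screen=lambda: "frame",
    press_key=pressed.append,
    read_score=lambda: next(score_readings),
)
print(env.act("move_left"))  # ('frame', 0, False)
print(env.act("shoot_low"))  # ('frame', 1, True)
```

Passing the screen, keyboard and OCR pieces in from outside is a design choice made here so the interaction logic can be tried out away from the game.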
2. Collecting training data
Throughout the training process, we want to store all our experiences and observed rewards. We will use this as the training data for our Q-Learning model. So, for every action we take, we store the experience <s, a, r, s'> along with a game_over flag. The target label that our model will try to learn is the final reward for each action, which is a real number since this is a regression problem.
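A minimal sketch of such an experience memory might look like the following. The class name ReplayMemory, the method names, the memory size and gamma=0.9 are assumptions, and predict stands for any callable that returns the model's current list of Q-values for a state.

```python
import random


class ReplayMemory:
    """Hypothetical store of <s, a, r, s'> experiences with a game_over flag."""

    def __init__(self, max_size=1000, gamma=0.9):
        self.memory = []
        self.max_size = max_size
        self.gamma = gamma  # assumed discount factor

    def remember(self, state, action, reward, next_state, game_over):
        self.memory.append((state, action, reward, next_state, game_over))
        if len(self.memory) > self.max_size:
            # Forget the oldest experience first.
            del self.memory[0]

    def sample_targets(self, predict, batch_size=10):
        """Build (inputs, targets) for a regression step on a random batch."""
        batch = random.sample(self.memory, min(batch_size, len(self.memory)))
        inputs, targets = [], []
        for state, action, reward, next_state, game_over in batch:
            # Keep current estimates for the actions that were not taken.
            target = list(predict(state))
            if game_over:
                target[action] = reward
            else:
                target[action] = reward + self.gamma * max(predict(next_state))
            inputs.append(state)
            targets.append(target)
        return inputs, targets


memory = ReplayMemory()
memory.remember("s0", 0, 1.0, "s1", True)
inputs, targets = memory.sample_targets(lambda state: [0.0, 0.5, 0.0, 0.0])
print(targets)  # [[1.0, 0.5, 0.0, 0.0]]
```

Only the entry for the action actually taken is overwritten, so the regression loss for the other three actions is zero.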
3. Training process
Now that we can interact with the game and store our interactions in memory, let's start training our Q-Learning model. For this, we need to strike a balance between exploration (taking a random action in the game) and exploitation (taking the action predicted by our model). This way we can perform trial-and-error to obtain different experiences in the game. The parameter epsilon is used for this purpose: it is an exponentially decreasing factor that balances exploration and exploitation. In the beginning, when we know nothing, we want to explore more, but as the number of epochs increases and we learn more, we want to exploit more and explore less. Hence the decaying value of the epsilon parameter.
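The exploration/exploitation trade-off described above is commonly implemented as an epsilon-greedy rule. The sketch below is one such version; the function names, the decay rate of 0.995 and the floor of 0.05 are assumptions and not values from the original code.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit


def decay_epsilon(epsilon, decay=0.995, minimum=0.05):
    """Shrink epsilon by a constant factor per epoch (exponential decay)."""
    return max(minimum, epsilon * decay)


epsilon = 1.0  # start by exploring almost all the time
for epoch in range(1000):
    action = choose_action([0.1, 0.9, 0.2, 0.0], epsilon)
    epsilon = decay_epsilon(epsilon)

print(epsilon)  # has reached the assumed floor of 0.05 by now
```

Keeping a small floor on epsilon is a common choice so the bot never stops exploring entirely.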
For this tutorial I have only trained the model for 1000 epochs due to time and performance constraints, but in the future I would like to push it to at least 5000 epochs.
4. Model definition and starting training process
At the heart of the Q-Learning process is a 2-layered Dense/Fully Connected Network with ReLU activation. It takes the 128-dimensional feature map as the input state and outputs 4 Q-values, one for each possible action. The action with the maximum predicted Q-value is the desired action to be taken as per the network's policy for the given state.
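To show the computation such a network performs, here is a plain Python forward pass. It is not the Keras model from the repository. It assumes one hidden ReLU layer followed by a linear output layer, an arbitrary hidden size of 32 and random untrained weights; only the 128 inputs and 4 outputs come from the description above.

```python
import random

ACTIONS = ["shoot_low", "shoot_high", "move_left", "move_right"]
HIDDEN_UNITS = 32  # assumed size, not taken from the original code


def make_layer(n_in, n_out, rng):
    """Create small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases, relu=False):
    """Compute one fully connected layer, optionally followed by ReLU."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(max(0.0, z) if relu else z)
    return outputs


def q_values(state, hidden_layer, output_layer):
    """Map a 128-dimensional state to one Q-value per action."""
    hidden = dense(state, *hidden_layer, relu=True)
    return dense(hidden, *output_layer)  # linear output for regression


rng = random.Random(0)
hidden_layer = make_layer(128, HIDDEN_UNITS, rng)
output_layer = make_layer(HIDDEN_UNITS, len(ACTIONS), rng)

state = [rng.random() for _ in range(128)]  # stand-in for a feature map
q = q_values(state, hidden_layer, output_layer)
best = max(range(len(q)), key=q.__getitem__)
print(len(q), ACTIONS[best])
```

The output layer is left linear here because the targets are real-valued rewards that can be negative.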
This is the starting point of execution of this code, but you'll have to make sure FIFA 18 is running in windowed mode on a second display and that you have loaded the free kick practice mode under the skill games: shooting menu. Make sure the game controls match the keys hard-coded in the FIFA.py script.
Conclusion
Overall, I think the results are quite satisfactory even though the bot fails to reach a human level of performance. Switching from Supervised to Reinforcement Learning eases the pain of collecting training data. Given enough time to explore, it performs very well in problems like learning how to play simple games. However, the Reinforcement setting seems to fail when it encounters unfamiliar situations, which makes me believe that formulating it as a regression problem does not extrapolate information as well as formulating it as a classification problem in the supervised setting. Perhaps a combination of the two could address the weaknesses of both these approaches. Maybe that's where we'll see the best results in building AI for games. Something for me to try in the future!
Acknowledgements
I would like to acknowledge this tutorial on Deep Q-Learning and this git repository on gaming with Python for providing the majority of the code. With the exception of the FIFA "custom-API", most of the code's backbone has come from these sources. Thanks to these guys!