AIM-PIbd-31-Alekseev-I-S/Lab_6/Lab6.ipynb
Ivan Alekseev 7de16ac006 I want more
2024-12-06 17:30:55 +04:00


Starting the last lab of this semester, sigh...

What needs to be done:

  1. Deploy and run a reinforcement learning project for the game of tic-tac-toe.
  2. Port the project to the gymnasium library and a current version of Python.
  3. Implement the tic-tac-toe agent as a separate class (following the lecture example).
  4. Rewrite the main training loop to work with the separate agent class (following the lecture example).
  5. Test the new version of the program.

The tic-tac-toe project used as the starting point: https://github.com/nczempin/gym-tic-tac-toe

Porting the project to the gymnasium library

Gymnasium is an open-source Python library that provides standardized environments for developing and testing reinforcement learning (RL) algorithms. It is the maintained fork and successor of OpenAI Gym: since 2022 development has continued under the new name at the Farama Foundation.

The library lets developers of RL agents interact with a range of simulations, from simple game tasks to complex physics models. Its uniform interface makes it easier to test and compare algorithms.

Main features of Gymnasium:

  1. A uniform API for RL environments:
  • Gymnasium offers a standard interface for interacting with RL environments, built around methods such as reset() and step(action).
  • This makes it easy to switch between environments without changing the agent's code.
  2. A variety of built-in environments:
  • The library ships with many ready-made simulations, from simple ones (for example, CartPole and MountainCar) to complex ones (for example, robotics and Atari games).
  • Environments are grouped into categories: control, games, physics, robotics, and others.
  3. Support for classic RL tasks:
  • Environments for studying classic problems such as balancing a pendulum, solving puzzles, controlling robots, and so on.
  4. Flexibility in creating custom environments:
  • Gymnasium lets developers write their own simulations that conform to the API.
  5. Compatibility with RL libraries:
  • Popular RL frameworks such as Stable-Baselines3, Ray RLlib, TensorFlow Agents, and TorchRL can work with Gymnasium or Gym-style environments.
  6. Visualization:
  • Environments can be rendered, which simplifies debugging and demonstrating how an algorithm behaves.

Main functions of Gymnasium:

  • Creating an environment (gymnasium.make()): creates an environment instance from its registered name.
  • Resetting the environment (reset()): returns the initial observation and a dictionary with additional information.
  • Taking an action (step(action)): passes the agent's action to the environment and returns the new observation, the reward, the terminated and truncated flags, and an info dictionary.
  • Closing the environment (close()): releases the resources associated with the environment.
  • Rendering (render()): visualizes the current state of the environment.
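
The call sequence listed above can be sketched without installing anything. The toy `CountdownEnv` below is a hypothetical stand-in, not a Gymnasium class: it only mimics the return shapes that Gymnasium documents for `reset()` (observation, info) and `step()` (observation, reward, terminated, truncated, info). The `run_episode` helper is likewise an illustrative name.

```python
class CountdownEnv:
    """Hypothetical toy environment that mimics Gymnasium's reset/step shapes."""

    def __init__(self, start=3):
        self.start = start
        self.counter = start

    def reset(self, seed=None, options=None):
        # Gymnasium-style reset: returns (observation, info).
        self.counter = self.start
        return self.counter, {}

    def step(self, action):
        # Gymnasium-style step: (observation, reward, terminated, truncated, info).
        self.counter -= 1
        terminated = self.counter == 0
        reward = 1.0 if terminated else 0.0
        return self.counter, reward, terminated, False, {}

    def close(self):
        # Nothing to release in this toy example.
        pass


def run_episode(env, policy):
    """Play one episode and return (total reward, number of steps)."""
    observation, info = env.reset()
    total_reward, steps = 0.0, 0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
        done = terminated or truncated  # an episode ends on either flag
    env.close()
    return total_reward, steps


total, steps = run_episode(CountdownEnv(start=3), policy=lambda obs: 0)
print(total, steps)
```

With a real Gymnasium environment the loop body would look the same; only the environment object would come from `gymnasium.make()` or from a custom `gym.Env` subclass.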
In [2]:
import gymnasium as gym
from gymnasium import spaces

class TicTacToeEnv(gym.Env):
    # Gymnasium expects the key 'render_modes' (the old Gym key was 'render.modes').
    metadata = {'render_modes': ['human']}

    # Indexed by cell value + 1: -1 -> 'O', 0 -> ' ', 1 -> 'X'
    symbols = ['O', ' ', 'X']

    def __init__(self):
        super().__init__()
        # Discrete action space (0-8): the index of a cell on the board.
        self.action_space = spaces.Discrete(9)
        # Nominal discrete observation space: 9 cells x 3 cell values x 2 players to move.
        # Note: reset() and step() actually return the state as a dict with 'board' and 'on_move'.
        self.observation_space = spaces.Discrete(9 * 3 * 2)
        self.reset()

    def step(self, action):
        done = False
        reward = 0

        p, square = action  # p: player (1 or -1), square: cell index

        board = self.state['board']
        proposed = board[square]
        om = self.state['on_move']
        if proposed != 0:  # the cell is already occupied
            print(f"Illegal move: square {square} is already occupied.")
            done = True
            reward = -1 * om
        elif p != om:  # the wrong player is trying to move
            print(f"Illegal move: it is not player {p}'s turn.")
            done = True
            reward = -1 * om
        else:
            # Only a legal move changes the board (elif above prevents overwriting an occupied cell)
            board[square] = p
            self.state['on_move'] = -p

        for i in range(3):
            # Rows and columns
            if (board[i * 3] == p and board[i * 3 + 1] == p and board[i * 3 + 2] == p) or \
               (board[i] == p and board[i + 3] == p and board[i + 6] == p):
                reward = p
                done = True
                break

        # Diagonals
        if (board[0] == p and board[4] == p and board[8] == p) or \
           (board[2] == p and board[4] == p and board[6] == p):
            reward = p
            done = True

        # Returns a 4-tuple (state, reward, done, info). Gymnasium's standard step()
        # returns five values (observation, reward, terminated, truncated, info);
        # the 4-tuple is kept because the main loop unpacks four values.
        return self.state, reward, done, {}

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)  # Gymnasium convention: lets the base class seed its RNG
        self.state = {}
        self.state['board'] = [0, 0, 0, 0, 0, 0, 0, 0, 0]
        self.state['on_move'] = 1  # X moves first
        return self.state, {}

    def render(self, close=False):
        if close:
            return
        print("on move: ", self.symbols[self.state['on_move'] + 1])
        for i in range(9):
            print(self.symbols[self.state['board'][i] + 1], end=" ")
            if (i % 3) == 2:
                print()

    def move_generator(self):
        # List all legal moves as [player, cell index] pairs for the player on move
        moves = []
        for i in range(9):
            if self.state['board'][i] == 0:
                p = self.state['on_move']
                m = [p, i]
                moves.append(m)
        return moves
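
The environment above declares its observation space as `spaces.Discrete(9 * 3 * 2)`, while the state it returns is a dict. As a worked illustration of the arithmetic, the sketch below shows one way a board plus the side to move could be packed into a single integer. The names `encode_state` and `NUM_STATES` are hypothetical and not part of the original project. Under this encoding the upper bound is 3^9 × 2 = 39366 indices, which also counts positions that can never occur in a real game.

```python
def encode_state(board, on_move):
    """Map a board (9 cells in {-1, 0, 1}) and the side to move (1 or -1) to one integer."""
    index = 0
    for i, cell in enumerate(board):
        # Shift the cell value -1/0/1 to a base-3 digit 0/1/2.
        index += (cell + 1) * 3 ** i
    # The lowest bit stores who is on move: 1 for X (1), 0 for O (-1).
    return index * 2 + (1 if on_move == 1 else 0)


# Upper bound on the number of distinct indices under this encoding.
NUM_STATES = 3 ** 9 * 2

empty_board = [0] * 9
print(NUM_STATES, encode_state(empty_board, 1))
```

An index like this could serve as a key into a value table, or as the observation of a `Discrete(NUM_STATES)` space if the environment were changed to return integers.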

Implementing the agent

In reinforcement learning (RL), an agent is a software component that interacts with an environment in order to learn which actions best achieve its goal.

Main aspects of an agent:

  1. What does the agent do?
  • The agent makes decisions: in each state of the environment it chooses an action (for example, which cell to play in tic-tac-toe, or how to move in a game).
  • It learns to choose actions that maximize the reward it receives from the environment.
  2. How does the agent learn?
  • The agent improves its strategy based on its experience of interacting with the environment.
  • This is done with reinforcement learning algorithms (for example, Q-learning, deep Q-learning, or policy-based methods).

Key elements of an agent:

  1. Observation: The agent perceives the current state of the environment, for example a game board, sensor readings, or an image. The observation arrives as data from the environment.
  2. Action: At each step the agent chooses an action from the set of allowed actions. This set depends on the rules of the environment (for example, which moves are legal in a game).
  3. Policy: The policy is the strategy the agent follows to choose actions. It can be:
  • Deterministic: the same state always leads to the same action.
  • Stochastic: for a given state, the agent chooses each action with some probability.
  4. Reward: After performing an action, the agent receives a reward, a number that reflects how good the action was. Rewards let the agent evaluate its actions and develop useful behavior.
  5. Value function: The agent's internal estimate of how good or promising a given state is. It lets the agent predict the long-term consequences of its actions.
  6. Learning algorithm: The agent uses an algorithm to update its policy and value function based on the rewards it receives. Examples: Q-learning, policy gradient methods, actor-critic algorithms, and so on.
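
The difference between a deterministic and a stochastic policy can be shown with a small sketch. It assumes the agent already holds a table of estimated action values; the function names and the example numbers are illustrative only.

```python
import random


def greedy_policy(action_values):
    """Deterministic: always return the action with the highest estimated value."""
    return max(action_values, key=action_values.get)


def epsilon_greedy_policy(action_values, epsilon, rng=random):
    """Stochastic: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))
    return greedy_policy(action_values)


# Hypothetical value estimates for three free cells on a tic-tac-toe board.
values = {0: 0.1, 4: 0.9, 8: -0.2}
print(greedy_policy(values))
print(epsilon_greedy_policy(values, epsilon=0.2))
```

The greedy call always picks cell 4 for these values, while the epsilon-greedy call occasionally returns another cell, which is what allows an agent to discover moves whose value it has underestimated.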
In [3]:
import random

class Agent:
    def __init__(self, symbol):
        self.symbol = symbol  # the player's symbol (1 is X, -1 is O)

    def get_action(self, moves):
        return random.choice(moves)  # pick a random move from the legal ones
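
The `Agent` above chooses moves at random and does not learn. As a rough sketch of what a learning variant could look like, the hypothetical `QLearningAgent` below applies the tabular Q-learning update Q(s, a) ← Q(s, a) + α (r + γ max Q(s', a') − Q(s, a)). It is not part of the lab code, the hyperparameter defaults are arbitrary assumptions, and it ignores the two-player subtlety that the next state belongs to the opponent. Moves are assumed to be `[player, square]` pairs, as produced by `move_generator()`.

```python
import random
from collections import defaultdict


class QLearningAgent:
    """Hypothetical learning counterpart of Agent; a sketch, not the lab's method."""

    def __init__(self, symbol, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.symbol = symbol
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.q = defaultdict(float)  # (state_key, square) -> estimated value

    @staticmethod
    def state_key(state):
        # The state dict is not hashable, so convert it to a tuple-based key.
        return (tuple(state['board']), state['on_move'])

    def get_action(self, state, moves):
        # Epsilon-greedy choice among the legal moves.
        if random.random() < self.epsilon:
            return random.choice(moves)
        key = self.state_key(state)
        return max(moves, key=lambda m: self.q[(key, m[1])])

    def update(self, key, move, reward, next_key, next_moves):
        # Best value reachable from the next state (0.0 if the game is over).
        best_next = max((self.q[(next_key, m[1])] for m in next_moves), default=0.0)
        target = reward + self.gamma * best_next
        self.q[(key, move[1])] += self.alpha * (target - self.q[(key, move[1])])


agent = QLearningAgent(symbol=1, alpha=0.5, epsilon=0.0)
state = {'board': [0] * 9, 'on_move': 1}
key = agent.state_key(state)
agent.update(key, [1, 4], reward=1.0, next_key=key, next_moves=[])
print(agent.q[(key, 4)])
```

One design point: `update` takes precomputed keys rather than state dicts because `TicTacToeEnv.step()` mutates the board in place, so the key for the current state would have to be taken before calling `step()`.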

Main training loop

In [4]:
# Main loop (works with the separate agent class).
# Note: the agent picks random moves and nothing is updated from the rewards,
# so this loop plays and scores games rather than training a policy.

# Create the tic-tac-toe environment
environment = TicTacToeEnv()

# Create the agent (symbol=1 is X); the same agent chooses the moves for both sides
agent = Agent(symbol=1)

num_episodes = 100  # number of episodes (games)
collected_rewards = []  # total reward of each episode (1: X won, -1: O won, 0: draw)

for i in range(num_episodes):
    # Reset the environment and start a new episode
    state, _ = environment.reset()

    # Total reward for the episode
    total_reward = 0

    # Game-over flag
    done = False

    # At most 9 moves, since the board is 3x3
    for j in range(9):
        moves = environment.move_generator()

        # No moves left: the game is over
        if not moves:
            break

        if len(moves) == 1:
            move = moves[0]  # only one legal move is left, so take it
        else:
            move = agent.get_action(moves)  # the agent picks a move according to its strategy

        # Make the move and update the game state
        next_state, reward, done, info = environment.step(move)
        total_reward += reward
        state = next_state

        # Show the current state of the game
        environment.render()

        if done:
            break

    collected_rewards.append(total_reward)

    print(f"Episode {i+1}, Total Reward: {total_reward}")
    average_reward = sum(collected_rewards) / len(collected_rewards)
    print(f"Average Reward: {average_reward}")
on move:  O
    X 
      
      
on move:  X
  O X 
      
      
on move:  O
  O X 
    X 
      
on move:  X
O O X 
    X 
      
on move:  O
O O X 
  X X 
      
on move:  X
O O X 
  X X 
    O 
on move:  O
O O X 
  X X 
X   O 
Episode 1, Total Reward: 1
Average Reward: 1.0
on move:  O
X     
      
      
on move:  X
X     
      
O     
on move:  O
X X   
      
O     
on move:  X
X X   
      
O O   
on move:  O
X X   
  X   
O O   
on move:  X
X X O 
  X   
O O   
on move:  O
X X O 
  X X 
O O   
on move:  X
X X O 
O X X 
O O   
on move:  O
X X O 
O X X 
O O X 
Episode 2, Total Reward: 1
Average Reward: 1.0
on move:  O
  X   
      
      
on move:  X
  X   
      
    O 
on move:  O
  X   
      
  X O 
on move:  X
  X   
      
O X O 
on move:  O
X X   
      
O X O 
on move:  X
X X   
  O   
O X O 
on move:  O
X X   
  O X 
O X O 
on move:  X
X X   
O O X 
O X O 
on move:  O
X X X 
O O X 
O X O 
Episode 3, Total Reward: 1
Average Reward: 1.0
on move:  O
      
  X   
      
on move:  X
O     
  X   
      
on move:  O
O     
  X   
    X 
on move:  X
O     
  X O 
    X 
on move:  O
O X   
  X O 
    X 
on move:  X
O X   
  X O 
  O X 
on move:  O
O X X 
  X O 
  O X 
on move:  X
O X X 
  X O 
O O X 
on move:  O
O X X 
X X O 
O O X 
Episode 4, Total Reward: 0
Average Reward: 0.75
on move:  O
      
      
X     
on move:  X
    O 
      
X     
on move:  O
    O 
X     
X     
on move:  X
    O 
X     
X O   
on move:  O
    O 
X     
X O X 
on move:  X
  O O 
X     
X O X 
on move:  O
X O O 
X     
X O X 
Episode 5, Total Reward: 1
Average Reward: 0.8
on move:  O
      
X     
      
on move:  X
      
X   O 
      
on move:  O
      
X X O 
      
on move:  X
      
X X O 
O     
on move:  O
      
X X O 
O   X 
on move:  X
    O 
X X O 
O   X 
on move:  O
  X O 
X X O 
O   X 
on move:  X
  X O 
X X O 
O O X 
on move:  O
X X O 
X X O 
O O X 
Episode 6, Total Reward: 1
Average Reward: 0.8333333333333334
on move:  O
      
      
    X 
on move:  X
      
  O   
    X 
on move:  O
      
X O   
    X 
on move:  X
  O   
X O   
    X 
on move:  O
X O   
X O   
    X 
on move:  X
X O   
X O   
  O X 
Episode 7, Total Reward: -1
Average Reward: 0.5714285714285714
on move:  O
      
  X   
      
on move:  X
      
  X   
    O 
on move:  O
      
X X   
    O 
on move:  X
      
X X O 
    O 
on move:  O
    X 
X X O 
    O 
on move:  X
  O X 
X X O 
    O 
on move:  O
  O X 
X X O 
  X O 
on move:  X
  O X 
X X O 
O X O 
on move:  O
X O X 
X X O 
O X O 
Episode 8, Total Reward: 0
Average Reward: 0.5
on move:  O
      
      
  X   
on move:  X
      
      
  X O 
on move:  O
X     
      
  X O 
on move:  X
X     
    O 
  X O 
on move:  O
X   X 
    O 
  X O 
on move:  X
X   X 
    O 
O X O 
on move:  O
X   X 
X   O 
O X O 
on move:  X
X O X 
X   O 
O X O 
on move:  O
X O X 
X X O 
O X O 
Episode 9, Total Reward: 0
Average Reward: 0.4444444444444444
on move:  O
      
      
  X   
on move:  X
      
      
  X O 
on move:  O
      
      
X X O 
on move:  X
      
  O   
X X O 
on move:  O
      
  O X 
X X O 
on move:  X
O     
  O X 
X X O 
Episode 10, Total Reward: -1
Average Reward: 0.3
on move:  O
      
  X   
      
on move:  X
    O 
  X   
      
on move:  O
    O 
X X   
      
on move:  X
    O 
X X   
O     
on move:  O
    O 
X X X 
O     
Episode 11, Total Reward: 1
Average Reward: 0.36363636363636365
on move:  O
      
X     
      
on move:  X
      
X O   
      
on move:  O
      
X O   
    X 
on move:  X
    O 
X O   
    X 
on move:  O
    O 
X O   
X   X 
on move:  X
    O 
X O   
X O X 
on move:  O
  X O 
X O   
X O X 
on move:  X
O X O 
X O   
X O X 
on move:  O
O X O 
X O X 
X O X 
Episode 12, Total Reward: 0
Average Reward: 0.3333333333333333
on move:  O
      
    X 
      
on move:  X
      
  O X 
      
on move:  O
    X 
  O X 
      
on move:  X
    X 
  O X 
    O 
on move:  O
    X 
  O X 
X   O 
on move:  X
    X 
  O X 
X O O 
on move:  O
  X X 
  O X 
X O O 
on move:  X
  X X 
O O X 
X O O 
on move:  O
X X X 
O O X 
X O O 
Episode 13, Total Reward: 1
Average Reward: 0.38461538461538464
on move:  O
      
X     
      
on move:  X
      
X   O 
      
on move:  O
      
X   O 
    X 
on move:  X
      
X   O 
  O X 
on move:  O
    X 
X   O 
  O X 
on move:  X
O   X 
X   O 
  O X 
on move:  O
O   X 
X X O 
  O X 
on move:  X
O   X 
X X O 
O O X 
on move:  O
O X X 
X X O 
O O X 
Episode 14, Total Reward: 0
Average Reward: 0.35714285714285715
on move:  O
      
      
    X 
on move:  X
      
    O 
    X 
on move:  O
    X 
    O 
    X 
on move:  X
    X 
O   O 
    X 
on move:  O
    X 
O   O 
  X X 
on move:  X
  O X 
O   O 
  X X 
on move:  O
  O X 
O X O 
  X X 
on move:  X
  O X 
O X O 
O X X 
on move:  O
X O X 
O X O 
O X X 
Episode 15, Total Reward: 1
Average Reward: 0.4
on move:  O
      
      
X     
on move:  X
      
      
X   O 
on move:  O
  X   
      
X   O 
on move:  X
  X O 
      
X   O 
on move:  O
  X O 
      
X X O 
on move:  X
O X O 
      
X X O 
on move:  O
O X O 
X     
X X O 
on move:  X
O X O 
X O   
X X O 
Episode 16, Total Reward: -1
Average Reward: 0.3125
on move:  O
      
    X 
      
on move:  X
      
    X 
  O   
on move:  O
  X   
    X 
  O   
on move:  X
  X   
  O X 
  O   
on move:  O
X X   
  O X 
  O   
on move:  X
X X   
O O X 
  O   
on move:  O
X X X 
O O X 
  O   
Episode 17, Total Reward: 1
Average Reward: 0.35294117647058826
on move:  O
      
  X   
      
on move:  X
      
O X   
      
on move:  O
      
O X   
    X 
on move:  X
  O   
O X   
    X 
on move:  O
X O   
O X   
    X 
Episode 18, Total Reward: 1
Average Reward: 0.3888888888888889
on move:  O
      
      
    X 
on move:  X
      
      
  O X 
on move:  O
X     
      
  O X 
on move:  X
X     
  O   
  O X 
on move:  O
X   X 
  O   
  O X 
on move:  X
X O X 
  O   
  O X 
Episode 19, Total Reward: -1
Average Reward: 0.3157894736842105
on move:  O
      
      
  X   
on move:  X
      
O     
  X   
on move:  O
      
O     
X X   
on move:  X
    O 
O     
X X   
on move:  O
    O 
O X   
X X   
on move:  X
    O 
O X   
X X O 
on move:  O
  X O 
O X   
X X O 
Episode 20, Total Reward: 1
Average Reward: 0.35
on move:  O
X     
      
      
on move:  X
X     
      
O     
on move:  O
X   X 
      
O     
on move:  X
X   X 
      
O O   
on move:  O
X   X 
    X 
O O   
on move:  X
X O X 
    X 
O O   
on move:  O
X O X 
X   X 
O O   
on move:  X
X O X 
X O X 
O O   
Episode 21, Total Reward: -1
Average Reward: 0.2857142857142857
on move:  O
      
      
X     
on move:  X
      
  O   
X     
on move:  O
X     
  O   
X     
on move:  X
X     
  O O 
X     
on move:  O
X X   
  O O 
X     
on move:  X
X X   
  O O 
X O   
on move:  O
X X   
X O O 
X O   
Episode 22, Total Reward: 1
Average Reward: 0.3181818181818182
on move:  O
  X   
      
      
on move:  X
  X   
O     
      
on move:  O
  X   
O     
X     
on move:  X
  X O 
O     
X     
on move:  O
X X O 
O     
X     
on move:  X
X X O 
O     
X   O 
on move:  O
X X O 
O   X 
X   O 
on move:  X
X X O 
O O X 
X   O 
on move:  O
X X O 
O O X 
X X O 
Episode 23, Total Reward: 0
Average Reward: 0.30434782608695654
on move:  O
      
    X 
      
on move:  X
      
    X 
    O 
on move:  O
X     
    X 
    O 
on move:  X
X     
    X 
  O O 
on move:  O
X X   
    X 
  O O 
on move:  X
X X   
    X 
O O O 
Episode 24, Total Reward: -1
Average Reward: 0.25
on move:  O
      
      
X     
on move:  X
      
      
X O   
on move:  O
      
    X 
X O   
on move:  X
    O 
    X 
X O   
on move:  O
    O 
X   X 
X O   
on move:  X
O   O 
X   X 
X O   
on move:  O
O   O 
X X X 
X O   
Episode 25, Total Reward: 1
Average Reward: 0.28
on move:  O
      
      
X     
on move:  X
      
      
X O   
on move:  O
      
      
X O X 
on move:  X
      
    O 
X O X 
on move:  O
    X 
    O 
X O X 
on move:  X
    X 
  O O 
X O X 
on move:  O
  X X 
  O O 
X O X 
on move:  X
O X X 
  O O 
X O X 
on move:  O
O X X 
X O O 
X O X 
Episode 26, Total Reward: 0
Average Reward: 0.2692307692307692
on move:  O
      
  X   
      
on move:  X
      
  X   
O     
on move:  O
      
X X   
O     
on move:  X
      
X X   
O O   
on move:  O
      
X X   
O O X 
on move:  X
  O   
X X   
O O X 
on move:  O
X O   
X X   
O O X 
Episode 27, Total Reward: 1
Average Reward: 0.2962962962962963
on move:  O
      
      
    X 
on move:  X
      
      
O   X 
on move:  O
  X   
      
O   X 
on move:  X
  X O 
      
O   X 
on move:  O
  X O 
  X   
O   X 
on move:  X
  X O 
  X   
O O X 
on move:  O
  X O 
X X   
O O X 
on move:  X
O X O 
X X   
O O X 
on move:  O
O X O 
X X X 
O O X 
Episode 28, Total Reward: 1
Average Reward: 0.32142857142857145
on move:  O
      
      
    X 
on move:  X
    O 
      
    X 
on move:  O
    O 
      
  X X 
on move:  X
  O O 
      
  X X 
on move:  O
X O O 
      
  X X 
on move:  X
X O O 
  O   
  X X 
on move:  O
X O O 
  O X 
  X X 
on move:  X
X O O 
  O X 
O X X 
Episode 29, Total Reward: -1
Average Reward: 0.27586206896551724
on move:  O
      
    X 
      
on move:  X
  O   
    X 
      
on move:  O
  O   
  X X 
      
on move:  X
  O   
O X X 
      
on move:  O
  O   
O X X 
    X 
on move:  X
O O   
O X X 
    X 
on move:  O
O O   
O X X 
X   X 
on move:  X
O O O 
O X X 
X   X 
Episode 30, Total Reward: -1
Average Reward: 0.23333333333333334
on move:  O
      
    X 
      
on move:  X
      
    X 
  O   
on move:  O
      
    X 
X O   
on move:  X
    O 
    X 
X O   
on move:  O
    O 
X   X 
X O   
on move:  X
    O 
X   X 
X O O 
on move:  O
X   O 
X   X 
X O O 
Episode 31, Total Reward: 1
Average Reward: 0.25806451612903225
on move:  O
X     
      
      
on move:  X
X O   
      
      
on move:  O
X O   
      
X     
on move:  X
X O O 
      
X     
on move:  O
X O O 
    X 
X     
on move:  X
X O O 
  O X 
X     
on move:  O
X O O 
  O X 
X   X 
on move:  X
X O O 
O O X 
X   X 
on move:  O
X O O 
O O X 
X X X 
Episode 32, Total Reward: 1
Average Reward: 0.28125
on move:  O
      
X     
      
on move:  X
O     
X     
      
on move:  O
O     
X     
X     
on move:  X
O     
X   O 
X     
on move:  O
O     
X   O 
X X   
on move:  X
O   O 
X   O 
X X   
on move:  O
O X O 
X   O 
X X   
on move:  X
O X O 
X   O 
X X O 
Episode 33, Total Reward: -1
Average Reward: 0.24242424242424243
on move:  O
      
      
  X   
on move:  X
  O   
      
  X   
on move:  O
  O   
  X   
  X   
on move:  X
  O   
  X O 
  X   
on move:  O
  O X 
  X O 
  X   
on move:  X
  O X 
O X O 
  X   
on move:  O
  O X 
O X O 
X X   
Episode 34, Total Reward: 1
Average Reward: 0.2647058823529412
on move:  O
    X 
      
      
on move:  X
    X 
    O 
      
on move:  O
    X 
    O 
    X 
on move:  X
    X 
O   O 
    X 
on move:  O
  X X 
O   O 
    X 
on move:  X
  X X 
O   O 
  O X 
on move:  O
X X X 
O   O 
  O X 
Episode 35, Total Reward: 1
Average Reward: 0.2857142857142857
on move:  O
  X   
      
      
on move:  X
  X   
  O   
      
on move:  O
  X X 
  O   
      
on move:  X
  X X 
  O O 
      
on move:  O
  X X 
X O O 
      
on move:  X
O X X 
X O O 
      
on move:  O
O X X 
X O O 
  X   
on move:  X
O X X 
X O O 
  X O 
Episode 36, Total Reward: -1
Average Reward: 0.25
on move:  O
      
  X   
      
on move:  X
      
  X   
    O 
on move:  O
      
  X   
X   O 
on move:  X
      
  X O 
X   O 
on move:  O
X     
  X O 
X   O 
on move:  X
X     
O X O 
X   O 
on move:  O
X   X 
O X O 
X   O 
Episode 37, Total Reward: 1
Average Reward: 0.2702702702702703
on move:  O
      
    X 
      
on move:  X
  O   
    X 
      
on move:  O
  O X 
    X 
      
on move:  X
  O X 
  O X 
      
on move:  O
  O X 
  O X 
  X   
on move:  X
  O X 
  O X 
O X   
on move:  O
  O X 
  O X 
O X X 
Episode 38, Total Reward: 1
Average Reward: 0.2894736842105263
on move:  O
      
    X 
      
on move:  X
      
O   X 
      
on move:  O
X     
O   X 
      
on move:  X
X     
O   X 
O     
on move:  O
X X   
O   X 
O     
on move:  X
X X   
O   X 
O   O 
on move:  O
X X   
O   X 
O X O 
on move:  X
X X   
O O X 
O X O 
on move:  O
X X X 
O O X 
O X O 
Episode 39, Total Reward: 1
Average Reward: 0.3076923076923077
on move:  O
    X 
      
      
on move:  X
    X 
O     
      
on move:  O
    X 
O     
    X 
on move:  X
  O X 
O     
    X 
on move:  O
  O X 
O     
  X X 
on move:  X
O O X 
O     
  X X 
on move:  O
O O X 
O   X 
  X X 
Episode 40, Total Reward: 1
Average Reward: 0.325
on move:  O
      
      
    X 
on move:  X
      
    O 
    X 
on move:  O
      
X   O 
    X 
on move:  X
      
X   O 
  O X 
on move:  O
      
X   O 
X O X 
on move:  X
O     
X   O 
X O X 
on move:  O
O X   
X   O 
X O X 
on move:  X
O X   
X O O 
X O X 
on move:  O
O X X 
X O O 
X O X 
Episode 41, Total Reward: 0
Average Reward: 0.3170731707317073
on move:  O
      
    X 
      
on move:  X
    O 
    X 
      
on move:  O
  X O 
    X 
      
on move:  X
  X O 
    X 
O     
on move:  O
  X O 
  X X 
O     
on move:  X
  X O 
O X X 
O     
on move:  O
  X O 
O X X 
O X   
Episode 42, Total Reward: 1
Average Reward: 0.3333333333333333
on move:  O
      
      
X     
on move:  X
    O 
      
X     
on move:  O
  X O 
      
X     
on move:  X
  X O 
      
X   O 
on move:  O
X X O 
      
X   O 
on move:  X
X X O 
    O 
X   O 
Episode 43, Total Reward: -1
Average Reward: 0.3023255813953488
on move:  O
      
      
    X 
on move:  X
  O   
      
    X 
on move:  O
  O   
X     
    X 
on move:  X
O O   
X     
    X 
on move:  O
O O   
X     
  X X 
on move:  X
O O   
X O   
  X X 
on move:  O
O O X 
X O   
  X X 
on move:  X
O O X 
X O   
O X X 
on move:  O
O O X 
X O X 
O X X 
Episode 44, Total Reward: 1
Average Reward: 0.3181818181818182
on move:  O
      
      
  X   
on move:  X
  O   
      
  X   
on move:  O
  O   
    X 
  X   
on move:  X
  O O 
    X 
  X   
on move:  O
  O O 
    X 
  X X 
on move:  X
  O O 
  O X 
  X X 
on move:  O
X O O 
  O X 
  X X 
on move:  X
X O O 
O O X 
  X X 
on move:  O
X O O 
O O X 
X X X 
Episode 45, Total Reward: 1
Average Reward: 0.3333333333333333
on move:  O
X     
      
      
on move:  X
X     
      
  O   
on move:  O
X X   
      
  O   
on move:  X
X X   
O     
  O   
on move:  O
X X X 
O     
  O   
Episode 46, Total Reward: 1
Average Reward: 0.34782608695652173
on move:  O
      
      
  X   
on move:  X
O     
      
  X   
on move:  O
O     
  X   
  X   
on move:  X
O   O 
  X   
  X   
on move:  O
O   O 
  X   
X X   
on move:  X
O   O 
  X O 
X X   
on move:  O
O   O 
  X O 
X X X 
Episode 47, Total Reward: 1
Average Reward: 0.3617021276595745
on move:  O
      
      
  X   
on move:  X
O     
      
  X   
on move:  O
O     
  X   
  X   
on move:  X
O     
O X   
  X   
on move:  O
O     
O X   
  X X 
on move:  X
O     
O X   
O X X 
Episode 48, Total Reward: -1
Average Reward: 0.3333333333333333
on move:  O
      
    X 
      
on move:  X
  O   
    X 
      
on move:  O
  O   
  X X 
      
on move:  X
  O   
  X X 
O     
on move:  O
  O   
  X X 
O X   
on move:  X
  O   
O X X 
O X   
on move:  O
  O   
O X X 
O X X 
on move:  X
O O   
O X X 
O X X 
Episode 49, Total Reward: -1
Average Reward: 0.30612244897959184
on move:  O
      
    X 
      
on move:  X
O     
    X 
      
on move:  O
O     
    X 
  X   
on move:  X
O O   
    X 
  X   
on move:  O
O O   
    X 
  X X 
on move:  X
O O   
    X 
O X X 
on move:  O
O O   
X   X 
O X X 
on move:  X
O O O 
X   X 
O X X 
Episode 50, Total Reward: -1
Average Reward: 0.28
on move:  O
      
      
  X   
on move:  X
  O   
      
  X   
on move:  O
  O   
    X 
  X   
on move:  X
  O   
    X 
O X   
on move:  O
  O   
  X X 
O X   
on move:  X
  O   
O X X 
O X   
on move:  O
  O X 
O X X 
O X   
on move:  X
O O X 
O X X 
O X   
Episode 51, Total Reward: -1
Average Reward: 0.2549019607843137
on move:  O
  X   
      
      
on move:  X
O X   
      
      
on move:  O
O X   
    X 
      
on move:  X
O X   
  O X 
      
on move:  O
O X   
  O X 
    X 
on move:  X
O X O 
  O X 
    X 
on move:  O
O X O 
X O X 
    X 
on move:  X
O X O 
X O X 
O   X 
Episode 52, Total Reward: -1
Average Reward: 0.23076923076923078
on move:  O
    X 
      
      
on move:  X
  O X 
      
      
on move:  O
X O X 
      
      
on move:  X
X O X 
      
    O 
on move:  O
X O X 
    X 
    O 
on move:  X
X O X 
    X 
  O O 
on move:  O
X O X 
  X X 
  O O 
on move:  X
X O X 
  X X 
O O O 
Episode 53, Total Reward: -1
Average Reward: 0.20754716981132076
on move:  O
X     
      
      
on move:  X
X     
      
O     
on move:  O
X     
      
O   X 
on move:  X
X     
O     
O   X 
on move:  O
X X   
O     
O   X 
on move:  X
X X   
O O   
O   X 
on move:  O
X X   
O O X 
O   X 
on move:  X
X X O 
O O X 
O   X 
Episode 54, Total Reward: -1
Average Reward: 0.18518518518518517
on move:  O
      
    X 
      
on move:  X
      
  O X 
      
on move:  O
  X   
  O X 
      
on move:  X
  X   
  O X 
O     
on move:  O
  X   
  O X 
O   X 
on move:  X
O X   
  O X 
O   X 
on move:  O
O X   
X O X 
O   X 
on move:  X
O X O 
X O X 
O   X 
Episode 55, Total Reward: -1
Average Reward: 0.16363636363636364
on move:  O
  X   
      
      
on move:  X
  X O 
      
      
on move:  O
  X O 
      
X     
on move:  X
  X O 
  O   
X     
on move:  O
  X O 
  O   
X X   
on move:  X
  X O 
  O   
X X O 
on move:  O
  X O 
  O X 
X X O 
on move:  X
  X O 
O O X 
X X O 
on move:  O
X X O 
O O X 
X X O 
Episode 56, Total Reward: 0
Average Reward: 0.16071428571428573
on move:  O
    X 
      
      
on move:  X
    X 
      
O     
on move:  O
    X 
X     
O     
on move:  X
    X 
X   O 
O     
on move:  O
X   X 
X   O 
O     
on move:  X
X O X 
X   O 
O     
on move:  O
X O X 
X   O 
O   X 
on move:  X
X O X 
X O O 
O   X 
on move:  O
X O X 
X O O 
O X X 
Episode 57, Total Reward: 0
Average Reward: 0.15789473684210525
on move:  O
      
      
X     
on move:  X
      
    O 
X     
on move:  O
    X 
    O 
X     
on move:  X
    X 
O   O 
X     
on move:  O
X   X 
O   O 
X     
on move:  X
X O X 
O   O 
X     
on move:  O
X O X 
O   O 
X   X 
on move:  X
X O X 
O   O 
X O X 
on move:  O
X O X 
O X O 
X O X 
Episode 58, Total Reward: 1
Average Reward: 0.1724137931034483
on move:  O
      
    X 
      
on move:  X
      
  O X 
      
on move:  O
      
  O X 
    X 
on move:  X
O     
  O X 
    X 
on move:  O
O     
  O X 
  X X 
on move:  X
O O   
  O X 
  X X 
on move:  O
O O   
  O X 
X X X 
Episode 59, Total Reward: 1
Average Reward: 0.1864406779661017
on move:  O
      
      
  X   
on move:  X
  O   
      
  X   
on move:  O
  O   
X     
  X   
on move:  X
  O   
X     
  X O 
on move:  O
  O   
X X   
  X O 
on move:  X
  O   
X X   
O X O 
on move:  O
  O   
X X X 
O X O 
Episode 60, Total Reward: 1
Average Reward: 0.2
on move:  O
X     
      
      
on move:  X
X     
      
    O 
on move:  O
X     
      
X   O 
on move:  X
X     
    O 
X   O 
on move:  O
X     
    O 
X X O 
on move:  X
X O   
    O 
X X O 
on move:  O
X O   
X   O 
X X O 
Episode 61, Total Reward: 1
Average Reward: 0.21311475409836064
on move:  O
      
      
X     
on move:  X
      
  O   
X     
on move:  O
X     
  O   
X     
on move:  X
X     
  O   
X   O 
on move:  O
X     
X O   
X   O 
Episode 62, Total Reward: 1
Average Reward: 0.22580645161290322
on move:  O
X     
      
      
on move:  X
X     
    O 
      
on move:  O
X     
    O 
  X   
on move:  X
X     
O   O 
  X   
on move:  O
X     
O   O 
X X   
on move:  X
X     
O   O 
X X O 
on move:  O
X   X 
O   O 
X X O 
on move:  X
X   X 
O O O 
X X O 
Episode 63, Total Reward: -1
Average Reward: 0.20634920634920634
on move:  O
      
X     
      
on move:  X
  O   
X     
      
on move:  O
X O   
X     
      
on move:  X
X O   
X     
    O 
on move:  O
X O   
X   X 
    O 
on move:  X
X O   
X   X 
O   O 
on move:  O
X O   
X   X 
O X O 
on move:  X
X O O 
X   X 
O X O 
on move:  O
X O O 
X X X 
O X O 
Episode 64, Total Reward: 1
Average Reward: 0.21875
on move:  O
      
      
X     
on move:  X
  O   
      
X     
on move:  O
X O   
      
X     
on move:  X
X O   
O     
X     
on move:  O
X O   
O     
X X   
on move:  X
X O O 
O     
X X   
on move:  O
X O O 
O X   
X X   
on move:  X
X O O 
O X   
X X O 
on move:  O
X O O 
O X X 
X X O 
Episode 65, Total Reward: 0
Average Reward: 0.2153846153846154
on move:  O
      
      
    X 
on move:  X
    O 
      
    X 
on move:  O
    O 
      
X   X 
on move:  X
O   O 
      
X   X 
on move:  O
O   O 
  X   
X   X 
on move:  X
O O O 
  X   
X   X 
Episode 66, Total Reward: -1
Average Reward: 0.19696969696969696
on move:  O
      
    X 
      
on move:  X
      
  O X 
      
on move:  O
    X 
  O X 
      
on move:  X
    X 
  O X 
O     
on move:  O
  X X 
  O X 
O     
on move:  X
  X X 
O O X 
O     
on move:  O
  X X 
O O X 
O X   
on move:  X
O X X 
O O X 
O X   
Episode 67, Total Reward: -1
Average Reward: 0.1791044776119403
on move:  O
  X   
      
      
on move:  X
  X   
      
  O   
on move:  O
  X   
      
  O X 
on move:  X
  X   
    O 
  O X 
on move:  O
X X   
    O 
  O X 
on move:  X
X X   
    O 
O O X 
on move:  O
X X   
  X O 
O O X 
Episode 68, Total Reward: 1
Average Reward: 0.19117647058823528
on move:  O
      
      
  X   
on move:  X
      
      
O X   
on move:  O
    X 
      
O X   
on move:  X
  O X 
      
O X   
on move:  O
  O X 
      
O X X 
on move:  X
  O X 
    O 
O X X 
on move:  O
  O X 
X   O 
O X X 
on move:  X
O O X 
X   O 
O X X 
on move:  O
O O X 
X X O 
O X X 
Episode 69, Total Reward: 0
Average Reward: 0.18840579710144928
on move:  O
    X 
      
      
on move:  X
    X 
O     
      
on move:  O
    X 
O     
  X   
on move:  X
O   X 
O     
  X   
on move:  O
O   X 
O   X 
  X   
on move:  X
O O X 
O   X 
  X   
on move:  O
O O X 
O   X 
X X   
on move:  X
O O X 
O   X 
X X O 
on move:  O
O O X 
O X X 
X X O 
Episode 70, Total Reward: 1
Average Reward: 0.2
on move:  O
  X   
      
      
on move:  X
  X   
O     
      
on move:  O
  X   
O     
  X   
on move:  X
  X   
O O   
  X   
on move:  O
  X   
O O X 
  X   
on move:  X
  X O 
O O X 
  X   
on move:  O
  X O 
O O X 
  X X 
on move:  X
  X O 
O O X 
O X X 
Episode 71, Total Reward: -1
Average Reward: 0.18309859154929578
on move:  O
      
    X 
      
on move:  X
  O   
    X 
      
on move:  O
  O   
    X 
X     
on move:  X
O O   
    X 
X     
on move:  O
O O   
    X 
X X   
on move:  X
O O O 
    X 
X X   
Episode 72, Total Reward: -1
Average Reward: 0.16666666666666666
on move:  O
X     
      
      
on move:  X
X O   
      
      
on move:  O
X O   
      
X     
on move:  X
X O   
    O 
X     
on move:  O
X O   
  X O 
X     
on move:  X
X O   
  X O 
X   O 
on move:  O
X O   
X X O 
X   O 
Episode 73, Total Reward: 1
Average Reward: 0.1780821917808219
on move:  O
      
      
X     
on move:  X
      
O     
X     
on move:  O
  X   
O     
X     
on move:  X
  X O 
O     
X     
on move:  O
X X O 
O     
X     
on move:  X
X X O 
O     
X O   
on move:  O
X X O 
O   X 
X O   
on move:  X
X X O 
O   X 
X O O 
on move:  O
X X O 
O X X 
X O O 
Episode 74, Total Reward: 0
Average Reward: 0.17567567567567569
on move:  O
X     
      
      
on move:  X
X     
      
    O 
on move:  O
X     
    X 
    O 
on move:  X
X     
  O X 
    O 
on move:  O
X X   
  O X 
    O 
on move:  X
X X   
O O X 
    O 
on move:  O
X X X 
O O X 
    O 
Episode 75, Total Reward: 1
Average Reward: 0.18666666666666668
on move:  O
      
      
X     
on move:  X
      
      
X O   
on move:  O
    X 
      
X O   
on move:  X
    X 
      
X O O 
on move:  O
X   X 
      
X O O 
on move:  X
X   X 
    O 
X O O 
on move:  O
X X X 
    O 
X O O 
Episode 76, Total Reward: 1
Average Reward: 0.19736842105263158
on move:  O
    X 
      
      
on move:  X
O   X 
      
      
on move:  O
O   X 
  X   
      
on move:  X
O   X 
  X   
  O   
on move:  O
O   X 
X X   
  O   
on move:  X
O   X 
X X   
  O O 
on move:  O
O X X 
X X   
  O O 
on move:  X
O X X 
X X   
O O O 
Episode 77, Total Reward: -1
Average Reward: 0.18181818181818182
on move:  O
      
X     
      
on move:  X
    O 
X     
      
on move:  O
    O 
X     
    X 
on move:  X
    O 
X     
  O X 
on move:  O
    O 
X   X 
  O X 
on move:  X
    O 
X   X 
O O X 
on move:  O
X   O 
X   X 
O O X 
on move:  X
X   O 
X O X 
O O X 
Episode 78, Total Reward: -1
Average Reward: 0.16666666666666666
on move:  O
  X   
      
      
on move:  X
  X   
      
    O 
on move:  O
X X   
      
    O 
on move:  X
X X O 
      
    O 
on move:  O
X X O 
  X   
    O 
on move:  X
X X O 
  X   
  O O 
on move:  O
X X O 
X X   
  O O 
on move:  X
X X O 
X X O 
  O O 
Episode 79, Total Reward: -1
Average Reward: 0.1518987341772152
on move:  O
X     
      
      
on move:  X
X     
      
  O   
on move:  O
X X   
      
  O   
on move:  X
X X   
    O 
  O   
on move:  O
X X X 
    O 
  O   
Episode 80, Total Reward: 1
Average Reward: 0.1625
on move:  O
  X   
      
      
on move:  X
  X O 
      
      
on move:  O
  X O 
X     
      
on move:  X
  X O 
X   O 
      
on move:  O
  X O 
X X O 
      
on move:  X
  X O 
X X O 
  O   
on move:  O
  X O 
X X O 
  O X 
on move:  X
O X O 
X X O 
  O X 
on move:  O
O X O 
X X O 
X O X 
Episode 81, Total Reward: 0
Average Reward: 0.16049382716049382
on move:  O
      
  X   
      
on move:  X
      
O X   
      
on move:  O
X     
O X   
      
on move:  X
X     
O X   
    O 
on move:  O
X     
O X   
  X O 
on move:  X
X     
O X   
O X O 
on move:  O
X   X 
O X   
O X O 
on move:  X
X   X 
O X O 
O X O 
on move:  O
X X X 
O X O 
O X O 
Episode 82, Total Reward: 1
Average Reward: 0.17073170731707318
on move:  O
  X   
      
      
on move:  X
  X   
      
  O   
on move:  O
  X   
      
X O   
on move:  X
  X   
      
X O O 
on move:  O
X X   
      
X O O 
on move:  X
X X   
  O   
X O O 
on move:  O
X X   
  O X 
X O O 
on move:  X
X X O 
  O X 
X O O 
on move:  O
X X O 
X O X 
X O O 
Episode 83, Total Reward: 1
Average Reward: 0.18072289156626506
on move:  O
X     
      
      
on move:  X
X   O 
      
      
on move:  O
X   O 
    X 
      
on move:  X
X O O 
    X 
      
on move:  O
X O O 
    X 
    X 
on move:  X
X O O 
    X 
  O X 
on move:  O
X O O 
X   X 
  O X 
on move:  X
X O O 
X   X 
O O X 
on move:  O
X O O 
X X X 
O O X 
Episode 84, Total Reward: 1
Average Reward: 0.19047619047619047
on move:  O
      
      
    X 
on move:  X
      
      
O   X 
on move:  O
    X 
      
O   X 
on move:  X
    X 
      
O O X 
on move:  O
    X 
X     
O O X 
on move:  X
    X 
X O   
O O X 
on move:  O
    X 
X O X 
O O X 
Episode 85, Total Reward: 1
Average Reward: 0.2
on move:  O
X     
      
      
on move:  X
X     
    O 
      
on move:  O
X X   
    O 
      
on move:  X
X X   
    O 
O     
on move:  O
X X   
  X O 
O     
on move:  X
X X   
  X O 
O O   
on move:  O
X X X 
  X O 
O O   
Episode 86, Total Reward: 1
Average Reward: 0.20930232558139536
on move:  O
X     
      
      
on move:  X
X   O 
      
      
on move:  O
X   O 
      
    X 
on move:  X
X O O 
      
    X 
on move:  O
X O O 
      
  X X 
on move:  X
X O O 
      
O X X 
on move:  O
X O O 
X     
O X X 
on move:  X
X O O 
X   O 
O X X 
on move:  O
X O O 
X X O 
O X X 
Episode 87, Total Reward: 1
Average Reward: 0.21839080459770116
on move:  O
    X 
      
      
on move:  X
    X 
      
    O 
on move:  O
    X 
      
  X O 
on move:  X
    X 
      
O X O 
on move:  O
  X X 
      
O X O 
on move:  X
  X X 
O     
O X O 
on move:  O
  X X 
O X   
O X O 
Episode 88, Total Reward: 1
Average Reward: 0.22727272727272727
on move:  O
      
      
X     
on move:  X
  O   
      
X     
on move:  O
  O   
X     
X     
on move:  X
  O   
X O   
X     
on move:  O
  O   
X O   
X   X 
on move:  X
  O O 
X O   
X   X 
on move:  O
  O O 
X O X 
X   X 
on move:  X
O O O 
X O X 
X   X 
Episode 89, Total Reward: -1
Average Reward: 0.21348314606741572
on move:  O
    X 
      
      
on move:  X
    X 
O     
      
on move:  O
  X X 
O     
      
on move:  X
  X X 
O     
    O 
on move:  O
X X X 
O     
    O 
Episode 90, Total Reward: 1
Average Reward: 0.2222222222222222
on move:  O
      
      
  X   
on move:  X
  O   
      
  X   
on move:  O
  O   
X     
  X   
on move:  X
  O   
X     
O X   
on move:  O
  O   
X   X 
O X   
on move:  X
  O   
X O X 
O X   
on move:  O
  O   
X O X 
O X X 
on move:  X
  O O 
X O X 
O X X 
Episode 91, Total Reward: -1
Average Reward: 0.2087912087912088
on move:  O
  X   
      
      
on move:  X
  X   
    O 
      
on move:  O
  X   
X   O 
      
on move:  X
  X O 
X   O 
      
on move:  O
  X O 
X   O 
  X   
on move:  X
O X O 
X   O 
  X   
on move:  O
O X O 
X   O 
X X   
on move:  X
O X O 
X   O 
X X O 
Episode 92, Total Reward: -1
Average Reward: 0.1956521739130435
on move:  O
X     
      
      
on move:  X
X     
O     
      
on move:  O
X     
O   X 
      
on move:  X
X O   
O   X 
      
on move:  O
X O   
O   X 
X     
on move:  X
X O O 
O   X 
X     
on move:  O
X O O 
O   X 
X X   
on move:  X
X O O 
O O X 
X X   
on move:  O
X O O 
O O X 
X X X 
Episode 93, Total Reward: 1
Average Reward: 0.20430107526881722
on move:  O
      
      
  X   
on move:  X
      
    O 
  X   
on move:  O
  X   
    O 
  X   
on move:  X
O X   
    O 
  X   
on move:  O
O X   
    O 
  X X 
on move:  X
O X   
    O 
O X X 
on move:  O
O X   
  X O 
O X X 
Episode 94, Total Reward: 1
Average Reward: 0.2127659574468085
on move:  O
      
X     
      
on move:  X
  O   
X     
      
on move:  O
  O   
X     
X     
on move:  X
  O   
X     
X   O 
on move:  O
  O   
X   X 
X   O 
on move:  X
  O   
X   X 
X O O 
on move:  O
X O   
X   X 
X O O 
Episode 95, Total Reward: 1
Average Reward: 0.22105263157894736
on move:  O
      
      
  X   
on move:  X
      
      
  X O 
on move:  O
  X   
      
  X O 
on move:  X
  X   
O     
  X O 
on move:  O
  X   
O     
X X O 
on move:  X
  X O 
O     
X X O 
on move:  O
  X O 
O X   
X X O 
Episode 96, Total Reward: 1
Average Reward: 0.22916666666666666
on move:  O
      
      
    X 
on move:  X
      
      
O   X 
on move:  O
X     
      
O   X 
on move:  X
X     
O     
O   X 
on move:  O
X     
O   X 
O   X 
on move:  X
X     
O O X 
O   X 
on move:  O
X X   
O O X 
O   X 
on move:  X
X X   
O O X 
O O X 
on move:  O
X X X 
O O X 
O O X 
Episode 97, Total Reward: 1
Average Reward: 0.23711340206185566
on move:  O
      
    X 
      
on move:  X
O     
    X 
      
on move:  O
O   X 
    X 
      
on move:  X
O   X 
    X 
    O 
on move:  O
O   X 
X   X 
    O 
on move:  X
O   X 
X   X 
O   O 
on move:  O
O   X 
X   X 
O X O 
on move:  X
O O X 
X   X 
O X O 
on move:  O
O O X 
X X X 
O X O 
Episode 98, Total Reward: 1
Average Reward: 0.24489795918367346
on move:  O
      
      
  X   
on move:  X
    O 
      
  X   
on move:  O
    O 
    X 
  X   
on move:  X
    O 
O   X 
  X   
on move:  O
    O 
O   X 
  X X 
on move:  X
O   O 
O   X 
  X X 
on move:  O
O   O 
O   X 
X X X 
Episode 99, Total Reward: 1
Average Reward: 0.25252525252525254
on move:  O
  X   
      
      
on move:  X
  X   
      
  O   
on move:  O
  X   
    X 
  O   
on move:  X
  X   
    X 
O O   
on move:  O
  X X 
    X 
O O   
on move:  X
  X X 
O   X 
O O   
on move:  O
  X X 
O   X 
O O X 
Episode 100, Total Reward: 1
Average Reward: 0.26
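
The `Average Reward` lines in the log above are consistent with a cumulative mean of all episode rewards so far (for example, 26/100 = 0.26 after episode 100). A minimal sketch of that calculation follows. It assumes the average is the running total divided by the episode count; the helper name `running_average` and its parameters are hypothetical and are not taken from the lab code. The starting total of 14 after 65 episodes is inferred from the logged value 13/66 for episode 66 and is not printed in the log itself.

```python
def running_average(rewards, prior_total=0.0, prior_count=0):
    """Return the cumulative mean after each reward in `rewards`.

    prior_total / prior_count describe episodes played before this
    list starts (hypothetical parameters, used to resume mid-log).
    """
    total, count = prior_total, prior_count
    averages = []
    for reward in rewards:
        total += reward
        count += 1
        averages.append(total / count)
    return averages


# Total rewards of episodes 66-70, copied from the log above.
logged_rewards = [-1, -1, 1, 0, 1]

# Assumption: the rewards of episodes 1-65 sum to 14
# (inferred from the logged average 13/66 for episode 66).
averages = running_average(logged_rewards, prior_total=14, prior_count=65)

for episode, avg in zip(range(66, 71), averages):
    print(f"Episode {episode}, Average Reward: {avg}")
```

Under the same assumption, the rewards logged for episodes 66 to 100 sum to 12, which raises the total from 14 to 26 and gives the final average of 0.26.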