Reinforcement learning: The ideal proportion of exploitation and exploration in Treasure Hunt

What is the difference between exploration and exploitation in machine learning?

Exploitation and exploration are the two approaches an intelligent agent can use to choose its next action. The agent explores the environment when it chooses a random action and exploits when it chooses the highest-valued action from its Q-table.

An intelligent agent explores the environment to find new paths that lead to greater rewards and exploits its Q-table to choose the best action for the current state.

At each step, the agent decides whether to explore the environment or exploit its Q-table. It makes this decision using an epsilon value, which controls the ratio of exploration vs. exploitation.

Code block 1

# Example of exploration vs exploitation
if epsilon > random_number:
    explore()          # choose a random valid action
else:
    exploit_q_table()  # choose the best-known action for the current state

The ideal proportion of exploitation and exploration in Treasure Hunt

The intelligent agent learns the optimal path to the treasure by using two techniques referred to as exploitation and exploration.

In the Treasure Hunt game, the agent determines to either explore the maze or exploit the Q-table through the use of an epsilon value.

Code block 2

# use epsilon value to determine next action
if np.random.rand() < epsilon:
    # get next action by random
    action = random.choice(valid_actions)
else:
    # get next action from previous state
    action = np.argmax(experience.predict(prev_envstate))

The code block above shows how the agent uses an epsilon value to determine the next action.

The epsilon value ranges from 0 to 1 and is initially 0.20 for this game. The if-statement compares epsilon against a randomly generated NumPy number that also falls between 0 and 1. If the random number is less than epsilon, the agent selects a random valid action, such as LEFT, RIGHT, UP, or DOWN.

However, if the random number is not less than epsilon, the agent relies on its Q-table and chooses the action with the highest value for the current state. The np.argmax call selects the index of the highest predicted value returned by the experience instance.
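A minimal sketch of how np.argmax resolves this choice, assuming a hypothetical set of predicted Q-values and an illustrative action ordering (neither is from the original game code):

```python
import numpy as np

# Hypothetical Q-values predicted for the four actions in one state.
# The index-to-action mapping here is an assumption:
# 0 = LEFT, 1 = UP, 2 = RIGHT, 3 = DOWN
q_values = np.array([0.10, 0.55, 0.25, 0.05])

best_action = np.argmax(q_values)  # index of the highest Q-value
print(best_action)  # 1, i.e. UP under the assumed mapping
```

Whatever values the model predicts, np.argmax always returns the index of the largest one, which the agent then interprets as the action to take.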

The agent repeats this choice between a random action and an action from experience every time it acts, throughout every episode.

The Q-table is a matrix whose rows represent all 64 states in the Treasure Hunt game and whose columns represent the four possible actions (left, right, up, or down).
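The shape of that matrix can be sketched directly in NumPy. The variable names below are illustrative, not taken from the original game code:

```python
import numpy as np

n_states = 64   # one row per cell in the maze
n_actions = 4   # left, right, up, down

# A fresh Q-table starts with no knowledge: every state-action value is zero
q_table = np.zeros((n_states, n_actions))
print(q_table.shape)  # (64, 4)
```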

The 0.20 epsilon value encourages the agent to explore roughly 20% of the time, and exploit its Q-table roughly 80% of the time.
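A quick simulation (not part of the original code) confirms that an epsilon of 0.20 produces roughly a 20/80 split over many decisions:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.20

# Each draw below epsilon counts as an "explore" decision
decisions = rng.random(100_000) < epsilon
print(decisions.mean())  # close to 0.20
```

Over a large number of decisions the observed exploration fraction converges on epsilon, which is exactly why the value can be read as a probability of exploring.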

As the agent begins to win more games, its exploration factor is reduced, on the assumption that by then it has explored nearly all possible paths to the treasure.

Code block 3

# decrease exploration from 20% to 5%
if win_rate > 0.9:
    epsilon = 0.05

The code block above shows that once the agent's win rate exceeds 90%, the exploration factor is reduced from 20% to only 5%, since aggressive exploration at that point is unlikely to yield any further benefit.
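A hedged sketch of how this epsilon drop might sit inside a training loop. The loop structure, win_history list, and placeholder outcome below are assumptions for illustration, not the original training code:

```python
epsilon = 0.20
win_history = []

for episode in range(200):
    # ... play one episode here; record whether the agent won ...
    win_history.append(True)  # placeholder outcome for illustration

    # win rate over the most recent (up to) 100 episodes
    recent = win_history[-100:]
    win_rate = sum(recent) / len(recent)

    if win_rate > 0.9:
        epsilon = 0.05  # cut exploration once the agent wins consistently
```

The key design choice is that epsilon stays at 0.20 until the win rate clears the 90% threshold, after which the agent mostly trusts its learned Q-values.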