set.seed(12)
library(reinforcelearn)

A reinforcement learning agent usually consists of three parts: a policy, a value function representation and an algorithm that updates the value function or the policy parameters. The following sections explain how to create an agent in reinforcelearn to solve an environment.

You can create an agent with the function makeAgent. This will create an R6 class object with the corresponding policy, value function and algorithm.

env = makeEnvironment("gridworld", shape = c(3, 3), goal.states = 0L)
agent = makeAgent(policy = "softmax", val.fun = "table", algorithm = "qlearning")

Then you can run the agent in the environment by calling interact for a specified number of steps or episodes.

interact(env, agent, n.episodes = 5L)
#> Episode 1 finished after 1 steps with a return of -1
#> Episode 2 finished after 25 steps with a return of -25
#> Episode 3 finished after 3 steps with a return of -3
#> Episode 4 finished after 15 steps with a return of -15
#> Episode 5 finished after 5 steps with a return of -5
#> $returns
#> [1] -1 -25 -3 -15 -5
#>
#> $steps
#> [1]  1 25  3 15  5

Note that interact returns a list with the returns and the number of steps per episode. It also modifies the environment and agent objects in place, so the environment’s state and the agent’s value function weights will most likely have changed after the interaction.
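
The in-place modification follows from reference semantics: R6 objects are built on environments, which are not copied when passed to a function. The sketch below uses a plain base R environment as a stand-in for the agent. The names `counter` and `stepOnce` are hypothetical and not part of reinforcelearn.

```r
# A base R environment has reference semantics, like an R6 object.
counter = new.env()
counter$steps = 0L

# Hypothetical helper: modifies the object it receives in place.
stepOnce = function(obj) {
  obj$steps = obj$steps + 1L
  invisible(obj)
}

stepOnce(counter)
stepOnce(counter)
counter$steps  # 2: the caller's object changed without any reassignment
```

If you need the state from before an interaction, store the relevant values (for example the result of an accessor function) beforehand rather than relying on the old object.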

Although you can access the agent object directly, this is not recommended because its internal structure is very likely to change in future package versions. Instead use one of the accessor functions, e.g. to get the weights of the action value function.

getValueFunction(agent)
#>           [,1]    [,2]    [,3]    [,4]
#>  [1,]  0.00000  0.0000  0.0000  0.0000
#>  [2,] -0.40951 -0.3081 -0.3439  0.0000
#>  [3,] -0.40951 -0.3900 -0.3710 -0.2000
#>  [4,]  0.00000 -0.1000  0.0000  0.0000
#>  [5,]  0.00000 -0.2900 -0.1900  0.0000
#>  [6,] -0.27100 -0.1000 -0.2900 -0.1900
#>  [7,]  0.00000  0.0000  0.0000  0.0000
#>  [8,]  0.00000  0.0000  0.0000  0.0000
#>  [9,]  0.00000 -0.1000 -0.2000 -0.3439

## Policies

A policy is the agent’s behavior function. We can define the policy with makePolicy.

# Uniform random policy
makePolicy("random")
#> $name
#> [1] "random"
#>
#> $args
#> list()
#>
#> attr(,"class")
#> [1] "Policy"

# Epsilon-greedy policy
makePolicy("epsilon.greedy", epsilon = 0.2)
#> $name
#> [1] "epsilon.greedy"
#>
#> $args
#> $args$epsilon
#> [1] 0.2
#>
#>
#> attr(,"class")
#> [1] "Policy"

# Softmax policy
makePolicy("softmax")
#> $name
#> [1] "softmax"
#>
#> $args
#> list()
#>
#> attr(,"class")
#> [1] "Policy"

makePolicy only records which policy to use; the policy itself is created when we create the agent.
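
To make the difference between the policies concrete, here is a minimal base R sketch of how epsilon-greedy and softmax policies turn action values into action probabilities. The function names `epsilonGreedyProbs` and `softmaxProbs` and the values in `q` are hypothetical; this illustrates the general technique and is not the package's internal implementation.

```r
# Hypothetical action values for one state with four actions.
q = c(-0.4, -0.3, -0.35, 0)

# Epsilon-greedy: probability epsilon is spread uniformly over all actions,
# the remaining 1 - epsilon goes to the action with the highest value.
epsilonGreedyProbs = function(q, epsilon) {
  probs = rep(epsilon / length(q), length(q))
  best = which.max(q)
  probs[best] = probs[best] + 1 - epsilon
  probs
}

# Softmax: probabilities proportional to exp(q). Subtracting max(q)
# avoids overflow and does not change the result.
softmaxProbs = function(q) {
  z = exp(q - max(q))
  z / sum(z)
}

p.eps = epsilonGreedyProbs(q, epsilon = 0.2)  # 0.05 0.05 0.05 0.85
p.soft = softmaxProbs(q)

# Sample an action (1-based index) from the epsilon-greedy probabilities.
action = sample(seq_along(q), size = 1L, prob = p.eps)
```

A uniform random policy corresponds to epsilon-greedy with `epsilon = 1`, where every action has the same probability.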

## Value Functions

Many reinforcement learning algorithms use a value function to learn values of state and action pairs. The value function can be represented with different types of function approximation, e.g. as a table or neural network.

makeValueFunction("table", n.states = 9L, n.actions = 4L)
#> $name
#> [1] "table"
#>
#> $args
#> $args$n.states
#> [1] 9
#>
#> $args$n.actions
#> [1] 4
#>
#>
#> attr(,"class")
#> [1] "ValueFunction"

For a neural network you can use the keras package. To do so, specify the model’s architecture and pass the model on to makeValueFunction.

library(keras)
# layer_dense sets the number of output units with `units`
model = keras_model_sequential() %>%
  layer_dense(units = 10L, input_shape = 4L, activation = "linear") %>%
  compile(optimizer = optimizer_sgd(lr = 0.1), loss = "mae")
makeValueFunction("neural.network", model)

Note that online neural network training is currently very slow. One way to work around this is to not update the value function after every interaction, but to store all interactions in a replay memory and update the neural network only once in a while. Read more about this in the section Experience replay.

Often you need to preprocess the state observation so that the agent can work with it. To do so, pass a function to the preprocess argument of makeAgent; it will be applied to the state observation before the agent learns from it.

For neural network training the outcome of preprocess must be a one-row matrix in order to be able to learn.
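
As an illustration, a preprocessing function for a discrete state could one-hot encode the state number into a one-row matrix. The sketch below is base R only. The name `oneHot` is hypothetical, and it assumes that states are numbered from 0, as the gridworld examples suggest, and that the network's input size equals the number of states.

```r
# One-hot encode a discrete state as a one-row matrix.
# Assumes states are numbered 0, 1, ..., n.states - 1.
n.states = 9L
oneHot = function(state) {
  x = matrix(0, nrow = 1L, ncol = n.states)
  x[1L, state + 1L] = 1
  x
}

oneHot(2L)  # a 1 x 9 matrix with a single 1 in column 3
```

Such a function could then be passed as the preprocess argument of makeAgent, as described above.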

## Algorithms

The algorithm defines how to learn from an interaction with the environment. We can set up an algorithm using the function makeAlgorithm.

makeAlgorithm("qlearning")
#> $name
#> [1] "qlearning"
#>
#> $args
#> list()
#>
#> attr(,"class")
#> [1] "Algorithm"
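
For intuition, the standard tabular Q-learning update can be sketched in a few lines of base R. The function `qLearningUpdate` is hypothetical, and the learning rate `alpha = 0.1` and discount factor `gamma = 1` are assumed values for illustration, not documented package defaults.

```r
# Tabular Q-learning update for one transition (s, a, r, s.next).
# alpha (learning rate) and gamma (discount factor) are assumed values.
qLearningUpdate = function(Q, s, a, r, s.next, alpha = 0.1, gamma = 1) {
  td.target = r + gamma * max(Q[s.next, ])
  Q[s, a] = Q[s, a] + alpha * (td.target - Q[s, a])
  Q
}

Q = matrix(0, nrow = 9L, ncol = 4L)  # 9 states, 4 actions, 1-based indices
Q = qLearningUpdate(Q, s = 2L, a = 1L, r = -1, s.next = 3L)
Q[2L, 1L]  # -0.1
Q = qLearningUpdate(Q, s = 2L, a = 1L, r = -1, s.next = 3L)
Q[2L, 1L]  # -0.19
```

Each update moves the stored value a fraction `alpha` of the way towards the target, which is the observed reward plus the discounted value of the best action in the next state.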

## Agent

Once we have defined the policy, value function and algorithm, we can create the agent by calling makeAgent.

policy = makePolicy("epsilon.greedy", epsilon = 0.2)
val.fun = makeValueFunction("table", n.states = 9L, n.actions = 4L)
algorithm = makeAlgorithm("qlearning")

agent = makeAgent(policy, val.fun, algorithm)

Note that you can also call makeAgent with character arguments, which saves some typing.

agent = makeAgent("epsilon.greedy", "table", "qlearning",
  policy.args = list(epsilon = 0.2))

## Interaction

You can run an interaction between an agent and environment with the interact function.

env = makeEnvironment("gridworld", shape = c(3, 2), goal.states = 0L)
agent = makeAgent("random")

interact(env, agent, n.steps = 3L, visualize = TRUE)
#>  - -
#>  o -
#>  - -
#>
#>  - -
#>  o -
#>  - -
#>
#>  - -
#>  - o
#>  - -
#>
#> $returns
#> numeric(0)
#>
#> $steps
#> integer(0)

It allows you to run an interaction for a specified number of steps or episodes, and you can also specify a maximum number of steps per episode. This makes it easy to step through the environment one action at a time. Note that you can also run an interaction without learning.

env = makeEnvironment("gridworld", shape = c(4, 4), goal.states = 0L,
  initial.state = 15L)
agent = makeAgent("random")

for (i in 1:3) {
  ## Uncomment the next line to wait for an enter press before taking the next action.
  # invisible(readline(prompt = "Press [enter] to take the next action"))
  interact(env, agent, n.steps = 1L, learn = FALSE, visualize = TRUE)
}
#>  - - - -
#>  - - - -
#>  - - - -
#>  - - - o
#>
#>  - - - -
#>  - - - -
#>  - - - -
#>  - - o -
#>
#>  - - - -
#>  - - - -
#>  - - o -
#>  - - - -
#> 

### Experience replay

Experience replay is a technique to learn from multiple past observations at once. To this end, all states, actions, rewards and subsequent states are stored in a list (the so-called replay memory), and at each step a random batch from this memory is replayed.

(memory = makeReplayMemory(size = 2L, batch.size = 1L))
#> $size
#> [1] 2
#>
#> $batch.size
#> [1] 1
#>
#> attr(,"class")
#> [1] "ReplayMemory"

agent = makeAgent("random", replay.memory = memory)

interact(env, agent, n.steps = 2L, visualize = TRUE)
#>  - - - -
#>  - - o -
#>  - - - -
#>  - - - -
#>
#>  - - - -
#>  - - - -
#>  - - o -
#>  - - - -
#>
#> $returns
#> numeric(0)
#>
#> $steps
#> integer(0)

getReplayMemory(agent)
#> [[1]]
#> [[1]]$state
#> [1] 10
#>
#> [[1]]$action
#> [1] 2
#>
#> [[1]]$reward
#> [1] -1
#>
#> [[1]]$next.state
#> [1] 6
#>
#>
#> [[2]]
#> [[2]]$state
#> [1] 6
#>
#> [[2]]$action
#> [1] 3
#>
#> [[2]]$reward
#> [1] -1
#>
#> [[2]]$next.state
#> [1] 10
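
The mechanics of such a memory can be sketched in base R as a list of transitions with a fixed capacity. The helpers `addTransition` and `sampleBatch` are hypothetical. Dropping the oldest transition when the memory is full is an assumption about the replacement strategy, not a documented behaviour of the package.

```r
# A fixed-size replay memory as a plain list (hypothetical helpers).
memory = list()
memory.size = 3L

# Append a transition; drop the oldest one once the memory is full.
addTransition = function(memory, transition, size) {
  memory = c(memory, list(transition))
  if (length(memory) > size) memory = memory[-1L]
  memory
}

# Draw a random batch of stored transitions (without replacement).
sampleBatch = function(memory, batch.size) {
  memory[sample(length(memory), size = batch.size)]
}

for (s in 0:4) {
  transition = list(state = s, action = 1L, reward = -1, next.state = s + 1)
  memory = addTransition(memory, transition, memory.size)
}

length(memory)  # 3: only the three newest transitions are kept
batch = sampleBatch(memory, batch.size = 2L)
```

Sampling at random from the memory breaks the correlation between consecutive transitions, which is one common motivation for experience replay.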

Here is an example of training with experience replay, where the value function is updated only every 21 steps: 20 steps without learning are followed by one step with learning.

env = makeEnvironment("gridworld", shape = c(4, 4), goal.states = c(0, 15))

policy = makePolicy("epsilon.greedy", epsilon = 0.1)
memory = makeReplayMemory(size = 100L, batch.size = 20L)

agent = makeAgent(policy, "table", "qlearning", replay.memory = memory)

for (i in 1:100) {
  interact(env, agent, n.steps = 20L, learn = FALSE)
  interact(env, agent, n.steps = 1L, learn = TRUE)
}
action.vals = getValueFunction(agent)
matrix(getStateValues(action.vals), ncol = 4L)
#>            [,1]      [,2]      [,3]      [,4]
#> [1,]   0.000000 -3.012074 -4.966259 -7.397651
#> [2,]  -2.545300 -5.650029 -7.306089 -4.991932
#> [3,]  -6.271404 -6.434747 -6.101606 -3.336008
#> [4,] -12.746046 -7.714610 -3.327779  0.000000
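
State values like the ones above can be derived from an action-value table by taking the best action value in each state. The base R sketch below assumes this greedy definition (it is not confirmed here that getStateValues computes exactly this), and the table `Q` is hypothetical.

```r
# Hypothetical action-value table: 3 states (rows), 2 actions (columns).
Q = matrix(c(-1, -2,
             -3, -0.5,
              0,  0), ncol = 2L, byrow = TRUE)

# Greedy state value: the best action value in each state.
state.vals = apply(Q, 1L, max)

# Greedy action per state (1-based column index, first one wins ties).
greedy.actions = apply(Q, 1L, which.max)
```

Reshaping the state values into a matrix, as done above, makes it easier to compare them with the layout of the gridworld.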