Agents/Policies

Methods

An agent is the (only) abstraction needed to allow rlstructures to collect interactions at scale. One Agent corresponds to a set of policies (formally \(\pi_z\)).

  • An Agent class represents a policy (or multiple policies through the agent_info argument) acting on a batch of environments

  • An Agent may or may not contain one or more PyTorch modules

  • The __call__(agent_state,observation,agent_info=None,history=None) method takes as input:

    • agent_state is the state of the agent at time t-1 (as a DictTensor)

    • observation comes from the rlstructures.VecEnv environment

    • agent_info corresponds to additional information provided to the agent (the \(z\) in \(\pi_z\)), e.g. the value of epsilon for epsilon-greedy policies

    • history may be a TemporalDictTensor representing a set of previous transitions (e.g. used for implementing Transformer-based methods); it is always None in the default implementation of an agent, and it is provided only if Agent.require_history()==True.

  • Note that agent_state.n_elems()==observation.n_elems(), which is the number of environments the agent is acting on.

As an output, the __call__ method returns a pair action,new_state where:
  • action is the action output by the agent as a DictTensor, with action.n_elems()==observation.n_elems(). This DictTensor is transmitted to the environment through the env.step method. It may also contain any extra information you would like to store in the resulting trajectory, such as debugging information (e.g. the agent timestep).

  • new_state is the updated state of the agent, i.e. its state at time t+1. It is the state transmitted to the agent at the next call when acquiring a trajectory.

  • RL_Agent also implements an initial_state(self,agent_info,B) method responsible for producing the initial agent state (for a batch of size B) at the beginning of an episode.

Please refer to the tutorial examples to see different agent implementations.

Examples

We provide here an example of a simple uniform (random) RL_Agent that keeps track of the timestep as its internal state.

import torch
# DictTensor and RL_Agent are provided by rlstructures
# (exact import paths may vary across rlstructures versions)
from rlstructures import DictTensor, RL_Agent


class UniformAgent(RL_Agent):
    def __init__(self, n_actions):
        super().__init__()
        self.n_actions = n_actions

    def initial_state(self, agent_info, B):
        # Initial internal state: one timestep counter per environment
        return DictTensor({"timestep": torch.zeros(B).long()})

    def __call__(self, state, observation, agent_info=None, history=None):
        B = observation.n_elems()

        # Sample a random action for each of the B environments
        # (softmax over random scores, then a categorical draw)
        scores = torch.randn(B, self.n_actions)
        probabilities = torch.softmax(scores, dim=1)
        actions = torch.distributions.Categorical(probabilities).sample()
        # Increment the timestep counter to build the new agent state
        new_state = DictTensor({"timestep": state["timestep"] + 1})
        return DictTensor({"action": actions}), new_state
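
As a quick sanity check, the agent can also be called by hand on a small batch, outside of any batcher, reusing the imports above. The sketch below is only illustrative: UniformAgent ignores the content of the observation (it only reads observation.n_elems()), so the "frame" field is a made-up placeholder and not a required key.

agent = UniformAgent(n_actions=2)
state = agent.initial_state(agent_info=DictTensor({}), B=3)   # {"timestep": tensor([0, 0, 0])}
observation = DictTensor({"frame": torch.randn(3, 4)})        # placeholder observation for 3 environments
action, new_state = agent(state, observation)
print(action["action"])       # three sampled discrete actions
print(new_state["timestep"])  # tensor([1, 1, 1])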

Agent and Batcher

An Agent and a VecEnv are used together through an RL_Batcher to collect trajectories. Building an RL_Batcher is illustrated below.

First, one has to define agent and environment creation functions:

import gym
from gym.wrappers import TimeLimit
# GymEnv wraps a list of gym environments into a rlstructures VecEnv
# (exact import path may vary across rlstructures versions)
from rlstructures.env_wrappers import GymEnv

def create_env(max_episode_steps=100, seed=None):
    envs = []
    for k in range(4):
        e = gym.make("CartPole-v0")
        e.seed(seed)
        e = TimeLimit(e, max_episode_steps=max_episode_steps)
        envs.append(e)
    return GymEnv(envs, seed=seed)

def create_agent(n_actions):
    return UniformAgent(n_actions)

Then the creation of the batcher is quite simple.

batcher=RL_Batcher(
            n_timesteps=100,
            create_agent=create_agent,
            create_env=create_env,
            agent_args={"n_actions":2},
            env_args={"max_episode_steps":100},
            n_processes=1,
            seeds=[42],
            agent_info=DictTensor({}),
            env_info=DictTensor({})
        )
  • n_timesteps is the number of steps that the batcher will acquire at each call.

  • n_processes is the number of processes created by the batcher.

  • seeds is a list of seed values, one per process, used to control the seeds of the environments in the different processes.

  • agent_info and env_info are examples of the information that can be sent to the Agent/Environment when acquiring trajectories. Since our current Agent and Environment do not make use of such information, we use empty DictTensors here.

With a batcher, we can use three different methods:

  • batcher.reset(agent_info,env_info): resets both the agents and the environments with the corresponding information

  • batcher.execute(agent_info=None): launches the acquisition of trajectories (using agent_info, or the agent_info provided at reset if not specified)

  • batcher.get(): returns the acquired trajectories

Here is an example of use:

batcher.reset()
batcher.execute()
acquired_trajectories,n_still_running_envs=batcher.get()
  • the get function returns a pair (acquired trajectories, number of environments still running): at acquisition time, some environments may stop (i.e. reach the end of their episode). If no more environments are running, one has to call reset again (see the acquisition-loop sketch below).

  • acquired_trajectories is a Trajectories object containing both acquisition-level information in acquired_trajectories.info (a DictTensor) and the sequence of transitions in acquired_trajectories.trajectories (a TemporalDictTensor)
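
Putting the three calls together, a typical acquisition loop looks like the sketch below. It simply follows the rules above: keep calling execute/get while some environments are still running, and call reset again before acquiring new episodes.

batcher.reset()
all_trajectories = []
while True:
    batcher.execute()
    trajectories, n_running = batcher.get()
    all_trajectories.append(trajectories)   # e.g. store them, or compute a loss on them
    if n_running == 0:
        # every environment has reached the end of its episode:
        # batcher.reset() must be called before acquiring new episodes
        break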

Trajectories returned by a batcher

Let us consider acquired_trajectories:

  • Focus on acquired_trajectories.info

    • acquired_trajectories.info.truncate_key("agent_info/") returns the agent_info value used for this acquisition

    • acquired_trajectories.info.truncate_key("env_info/") returns the env_info value used for this acquisition

    • acquired_trajectories.info.truncate_key("agent_state/") returns the state of the agent when starting the acquisition

  • Focus on acquired_trajectories.trajectories

    • acquired_trajectories.trajectories["observation/"+k] is the value of field k returned by the environment at time t

    • acquired_trajectories.trajectories["action/"+k] is the value of field k returned by the agent as action at time t

    • acquired_trajectories.trajectories["_observation/"+k] is the value of field k returned by the environment at time t+1

Note that the final state of an episode is only available in acquired_trajectories.trajectories["_observation/"+k], i.e. as the t+1 observation of the last acquired transition.
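
As an illustration with the UniformAgent defined above, the sketch below reads back a few fields from acquired_trajectories. The "action/action" key and the "timestep" entry of the agent state follow directly from the code in this tutorial; the "frame" field name is an assumption about the keys exposed by the GymEnv observation and may differ in your setup.

info = acquired_trajectories.info
transitions = acquired_trajectories.trajectories   # TemporalDictTensor

# Information attached to the whole acquisition
print(info.truncate_key("agent_info/"))    # agent_info used for this acquisition (empty here)
print(info.truncate_key("env_info/"))      # env_info used for this acquisition (empty here)
print(info.truncate_key("agent_state/"))   # agent state when the acquisition started (contains "timestep")

# Per-transition fields (tensors of shape (n_envs, n_timesteps, ...))
actions = transitions["action/action"]         # actions sampled by UniformAgent at time t
obs_t   = transitions["observation/frame"]     # observation at time t ("frame" is an assumed field name)
obs_tp1 = transitions["_observation/frame"]    # observation at time t+1, including episode final states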