Agents/Policies
===============

* https://github.com/facebookresearch/rlstructures/blob/main/tutorial/tutorial_rlagent.py

Methods
-------

An agent is the (only) abstraction needed to allow `rlstructures` to collect interactions at scale. One Agent corresponds to a set of policies (formally :math:`\pi_z`).

* An Agent class represents a policy (or *multiple policies* through the `agent_info` argument) acting on a **batch of environments**
* An Agent may (or may not) include one or multiple PyTorch modules
* The `__call__(agent_state,observation,agent_info=None,history=None)` method takes as input:

  * `agent_state` is the state of the agent at time `t-1` (as a `DictTensor`)
  * `observation` comes from the `rlstructures.VecEnv` environment
  * `agent_info` corresponds to additional information (the :math:`z` in :math:`\pi_z`) provided to the agent (e.g. the value of epsilon for epsilon-greedy policies)
  * `history` may be a `TemporalDictTensor` representing a set of previous transitions (e.g. used for implementing Transformer-based methods, but its value is always `None` in the default implementation of an agent), and is activated only if `Agent.require_history()==True`
  * Note that `agent_state.n_elems()==observation.n_elems()`, which is the number of environments on which the agent is computed

As an output, the `__call__` method returns a pair `action,new_state` where:

* `action` is the action output by the agent as a `DictTensor`. Note that `action.n_elems()==observation.n_elems()`. This information will be transmitted to the environment through the `env.step` method. Note also that the action may contain any additional information you would like to store in the resulting trajectory, for instance debugging information (e.g. the agent timestep).
* `new_state` is the updated state of the agent at time `t+1`. This new state is the information transmitted to the agent at the next call when acquiring a trajectory.
* `RL_Agent` implements an `initial_state(self,agent_info,B)` method responsible for computing the initial agent state at the beginning of an episode.

Please consider the `tutorial` examples to see different agent implementations.

Examples
--------

We provide here an example of a simple uniform `RL_Agent` that keeps track of the timestep as its internal state.

.. code-block:: python

    class UniformAgent(RL_Agent):
        def __init__(self, n_actions):
            super().__init__()
            self.n_actions = n_actions

        def initial_state(self, agent_info, B):
            # One integer timestep per environment in the batch
            return DictTensor({"timestep": torch.zeros(B).long()})

        def __call__(self, state, observation, agent_info=None, history=None):
            B = observation.n_elems()
            # Sample one action per environment from random scores
            scores = torch.randn(B, self.n_actions)
            probabilities = torch.softmax(scores, dim=1)
            actions = torch.distributions.Categorical(probabilities).sample()
            # Increment the per-environment timestep
            new_state = DictTensor({"timestep": state["timestep"] + 1})
            return DictTensor({"action": actions}), new_state
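As a quick sanity check, such an agent can also be called directly on a hand-built batch of observations before plugging it into a batcher. The snippet below is a minimal sketch, not part of the original tutorial: the `frame` observation field is purely illustrative, and the import path of `DictTensor` is assumed to be the package root.

.. code-block:: python

    import torch
    from rlstructures import DictTensor  # assumed import path

    agent = UniformAgent(n_actions=2)
    B = 3  # number of environments in the batch

    # Initial internal state for B environments (no agent_info needed here)
    state = agent.initial_state(agent_info=DictTensor({}), B=B)

    # A fake batch of observations; the "frame" field name is illustrative only
    observation = DictTensor({"frame": torch.randn(B, 4)})

    action, new_state = agent(state, observation)
    assert action["action"].size()[0] == B      # one action per environment
    assert (new_state["timestep"] == 1).all()   # every timestep was incremented

Calling the agent directly like this is only useful for debugging; in practice the batcher described in the next section is in charge of invoking `__call__` on the observations produced by the environments.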
Agent and Batcher
-----------------

An `Agent` and a `VecEnv` are used together through a `RL_Batcher` to collect trajectories. Building a `RL_Batcher` is illustrated below. First, one has to define agent and environment creation functions:

.. code-block:: python

    def create_env(max_episode_steps=100, seed=None):
        # Build a vectorized environment made of 4 CartPole instances
        envs = []
        for k in range(4):
            e = gym.make("CartPole-v0")
            e.seed(seed)
            e = TimeLimit(e, max_episode_steps=max_episode_steps)
            envs.append(e)
        return GymEnv(envs, seed=seed)

    def create_agent(n_actions):
        return UniformAgent(n_actions)

Then the creation of the batcher is quite simple.

.. code-block:: python

    batcher = RL_Batcher(
        n_timesteps=100,
        create_agent=create_agent,
        create_env=create_env,
        agent_args={"n_actions": 2},
        env_args={"max_episode_steps": 100},
        n_processes=1,
        seeds=[42],
        agent_info=DictTensor({}),
        env_info=DictTensor({}),
    )

* `n_timesteps` is the number of timesteps the batcher will acquire at each call.
* `n_processes` is the number of processes created by the batcher.
* `seeds` is a list of seed values, one per process, used to control the seeds of the environments in the different processes.
* `agent_info` and `env_info` are examples of the information that can be sent to the Agent/Environment when acquiring trajectories. Since our current Agent and Environment do not make use of such information, we use empty `DictTensor` objects here.

With a batcher, we can use three different methods:

* `batcher.reset(agent_info,env_info)`: resets both the agents and the environments with the corresponding information
* `batcher.execute(agent_info=None)`: launches the acquisition of trajectories (using `agent_info`, or the `agent_info` provided at reset if not specified)
* `batcher.get()`: returns the acquired trajectories

Here is an example of use:

.. code-block:: python

    batcher.reset()
    batcher.execute()
    acquired_trajectories, n_still_running_envs = batcher.get()

* The `get` method returns a pair (`acquired trajectories`, `number of environments still running`). Indeed, at acquisition time, some environments may stop. If no more environments are running, then one has to call `reset` again.
* `acquired_trajectories` is a `Trajectories` object containing both global information `acquired_trajectories.info` (as a `DictTensor`) and a sequence of transitions `acquired_trajectories.trajectories` (as a `TemporalDictTensor`).

Trajectories returned by a batcher
----------------------------------

Let us consider `acquired_trajectories`:

* Focus on `acquired_trajectories.info`:

  * `acquired_trajectories.info.truncate_key("agent_info/")` returns the `agent_info` value used for this acquisition
  * `acquired_trajectories.info.truncate_key("env_info/")` returns the `env_info` value used for this acquisition
  * `acquired_trajectories.info.truncate_key("agent_state/")` returns the state of the agent when starting the acquisition

* Focus on `acquired_trajectories.trajectories`:

  * `acquired_trajectories.trajectories["observation/"+k]` is the value of field `k` returned by the environment at time `t`
  * `acquired_trajectories.trajectories["action/"+k]` is the value of field `k` returned by the agent as action at time `t`
  * `acquired_trajectories.trajectories["_observation/"+k]` is the value of field `k` returned by the environment at time `t+1`

Note that the final observation of an episode is only available in `acquired_trajectories.trajectories["_observation/"+k]`, i.e. as the `t+1` observation of the last acquired transition.
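To make this layout concrete, here is a short sketch of how the acquired trajectories might be inspected. It is an assumption-laden illustration rather than part of the tutorial: the `action/action` key follows from the `UniformAgent` above (whose `__call__` returns a field named `action`), while the `lengths` attribute and `keys()` method of `TemporalDictTensor`, and the exact observation fields produced by `GymEnv`, may differ in your version of the library.

.. code-block:: python

    trajectories = acquired_trajectories.trajectories  # a TemporalDictTensor

    # Information attached to the whole acquisition (empty DictTensors in our example)
    agent_info = acquired_trajectories.info.truncate_key("agent_info/")
    env_info = acquired_trajectories.info.truncate_key("env_info/")

    # Actions chosen by the UniformAgent: the field is named "action" because
    # the agent returned DictTensor({"action": actions}) in its __call__ method
    actions = trajectories["action/action"]
    print("actions:", actions.size())  # typically (number of slots, number of timesteps)

    # Discover the observation fields produced by GymEnv (assumed keys() method)
    print("fields:", trajectories.keys())

    # Per-slot trajectory lengths (assumed attribute); some environments may
    # have stopped before n_timesteps steps were acquired
    print("lengths:", trajectories.lengths)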