General utilities package¶
- mbrl.util.common.create_one_dim_tr_model(cfg: omegaconf.dictconfig.DictConfig, obs_shape: Tuple[int, …], act_shape: Tuple[int, …], model_dir: Optional[Union[str, pathlib.Path]] = None)¶
Creates a 1-D transition reward model from a given configuration.
This method creates a new model from the given configuration and wraps it into a mbrl.models.OneDTransitionRewardModel (see its documentation for an explanation of some of the config args under cfg.algorithm). The configuration should be structured as follows:
-cfg
  -dynamics_model
    -model
      -_target_ (str): model Python class
      -in_size (int, optional): input size
      -out_size (int, optional): output size
      -model_arg_1
      ...
      -model_arg_n
  -algorithm
    -learned_rewards (bool): whether rewards should be learned or not
    -target_is_delta (bool): to be passed to the dynamics model wrapper
    -normalize (bool): to be passed to the dynamics model wrapper
  -overrides
    -no_delta_list (list[int], optional): to be passed to the dynamics model wrapper
    -obs_process_fn (str, optional): a Python function to pre-process observations
    -num_elites (int, optional): number of elite members for ensembles
If cfg.dynamics_model.model.in_size is not provided, it will be automatically set to obs_shape[0] + act_shape[0]. If cfg.dynamics_model.model.out_size is not provided, it will be automatically set to obs_shape[0] + int(cfg.algorithm.learned_rewards).
The model will be instantiated using the hydra.utils.instantiate() function.
- Parameters
cfg (omegaconf.DictConfig) – the configuration to read.
obs_shape (tuple of ints) – the shape of the observations (only used if the model input or output sizes are not provided in the configuration).
act_shape (tuple of ints) – the shape of the actions (only used if the model input is not provided in the configuration).
model_dir (str or pathlib.Path) – If provided, the model will attempt to load its weights and normalization information from “model_dir / model.pth” and “model_dir / env_stats.pickle”, respectively.
- Returns
the model created.
- Return type
(mbrl.models.OneDTransitionRewardModel)
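Example usage (a minimal sketch; the mbrl.models.GaussianMLP target and its keyword arguments are illustrative assumptions, and any model class accepted by hydra.utils.instantiate() can be configured the same way):
    import omegaconf
    import mbrl.util.common as common_util

    cfg = omegaconf.OmegaConf.create(
        {
            "dynamics_model": {
                "model": {
                    "_target_": "mbrl.models.GaussianMLP",  # illustrative model class
                    "device": "cpu",
                    "num_layers": 3,
                    # in_size/out_size omitted: they are filled in from
                    # obs_shape, act_shape and algorithm.learned_rewards.
                }
            },
            "algorithm": {
                "learned_rewards": True,
                "target_is_delta": True,
                "normalize": True,
            },
            "overrides": {},
        }
    )
    obs_shape, act_shape = (17,), (6,)  # e.g., HalfCheetah-sized spaces
    dynamics_model = common_util.create_one_dim_tr_model(cfg, obs_shape, act_shape)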
- mbrl.util.common.create_replay_buffer(cfg: omegaconf.dictconfig.DictConfig, obs_shape: Sequence[int], act_shape: Sequence[int], obs_type: Type = <class 'numpy.float32'>, action_type: Type = <class 'numpy.float32'>, reward_type: Type = <class 'numpy.float32'>, load_dir: Optional[Union[str, pathlib.Path]] = None, collect_trajectories: bool = False, rng: Optional[numpy.random._generator.Generator] = None) → mbrl.util.replay_buffer.ReplayBuffer¶
Creates a replay buffer from a given configuration.
The configuration should be structured as follows:
-cfg
  -algorithm
    -dataset_size (int, optional): the maximum size of the train dataset/buffer
  -overrides
    -num_steps (int, optional): how many steps to take in the environment
    -trial_length (int, optional): the maximum length for trials. Only needed if collect_trajectories == True.
The size of the replay buffer can be determined either by providing cfg.algorithm.dataset_size, or by providing cfg.overrides.num_steps. Specifying the dataset size directly takes precedence over the number of steps.
- Parameters
cfg (omegaconf.DictConfig) – the configuration to use.
obs_shape (Sequence of ints) – the shape of observation arrays.
act_shape (Sequence of ints) – the shape of action arrays.
obs_type (type) – the data type of the observations (defaults to np.float32).
action_type (type) – the data type of the actions (defaults to np.float32).
reward_type (type) – the data type of the rewards (defaults to np.float32).
load_dir (optional str or pathlib.Path) – if provided, the function will attempt to populate the buffers from “load_dir/replay_buffer.npz”.
collect_trajectories (bool, optional) – if True, sets the replay buffers to collect trajectory information. Defaults to False.
rng (np.random.Generator, optional) – a random number generator to use when sampling batches. If None (default value), a new default generator will be used.
- Returns
the replay buffer.
- Return type
(mbrl.util.replay_buffer.ReplayBuffer)
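Example usage (a sketch; the buffer capacity comes from overrides.num_steps since algorithm.dataset_size is not given):
    import numpy as np
    import omegaconf
    import mbrl.util.common as common_util

    cfg = omegaconf.OmegaConf.create(
        {
            "algorithm": {},
            "overrides": {"num_steps": 10000, "trial_length": 200},
        }
    )
    rng = np.random.default_rng(seed=0)
    replay_buffer = common_util.create_replay_buffer(
        cfg, obs_shape=(17,), act_shape=(6,), rng=rng
    )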
- mbrl.util.common.get_basic_buffer_iterators(replay_buffer: mbrl.util.replay_buffer.ReplayBuffer, batch_size: int, val_ratio: float, ensemble_size: int = 1, shuffle_each_epoch: bool = True, bootstrap_permutes: bool = False) → Tuple[mbrl.util.replay_buffer.TransitionIterator, Optional[mbrl.util.replay_buffer.TransitionIterator]]¶
Returns training/validation iterators for the data in the replay buffer.
- Parameters
replay_buffer (mbrl.util.ReplayBuffer) – the replay buffer from which data will be sampled.
batch_size (int) – the batch size for the iterators.
val_ratio (float) – the proportion of data to use for validation. If 0., the validation buffer will be set to None.
ensemble_size (int) – the size of the ensemble being trained.
shuffle_each_epoch (bool) – if True, the iterator will shuffle the order each time a loop starts. Otherwise the iteration order will be the same. Defaults to True.
bootstrap_permutes (bool) – if True, the bootstrap iterator will create the bootstrap data using permutations of the original data. Otherwise it will use sampling with replacement. Defaults to False.
- Returns
the training and validation iterators, respectively.
- Return type
(tuple of mbrl.util.replay_buffer.TransitionIterator)
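Example usage (a sketch, assuming the replay_buffer from the previous example already contains transitions):
    import mbrl.util.common as common_util

    train_iter, val_iter = common_util.get_basic_buffer_iterators(
        replay_buffer,
        batch_size=256,
        val_ratio=0.1,
        ensemble_size=5,
        shuffle_each_epoch=True,
    )
    for batch in train_iter:
        pass  # iterate over training batches (bootstrapped for the ensemble)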
- mbrl.util.common.get_sequence_buffer_iterator(replay_buffer: mbrl.util.replay_buffer.ReplayBuffer, batch_size: int, val_ratio: float, sequence_length: int, ensemble_size: int = 1, shuffle_each_epoch: bool = True, max_batches_per_loop_train: Optional[int] = None, max_batches_per_loop_val: Optional[int] = None, use_simple_sampler: bool = False) → Tuple[Union[mbrl.util.replay_buffer.SequenceTransitionIterator, mbrl.util.replay_buffer.SequenceTransitionSampler], Optional[Union[mbrl.util.replay_buffer.SequenceTransitionIterator, mbrl.util.replay_buffer.SequenceTransitionSampler]]]¶
Returns training/validation iterators for the data in the replay buffer.
- Parameters
replay_buffer (mbrl.util.ReplayBuffer) – the replay buffer from which data will be sampled.
batch_size (int) – the batch size for the iterators.
val_ratio (float) – the proportion of data to use for validation. If 0., the validation buffer will be set to None.
sequence_length (int) – the length of the sequences returned by the iterators.
ensemble_size (int) – the number of models in the ensemble.
shuffle_each_epoch (bool) – if True, the iterator will shuffle the order each time a loop starts. Otherwise the iteration order will be the same. Defaults to True.
max_batches_per_loop_train (int, optional) – if given, specifies how many batches to return (at most) over a full loop of the training iterator.
max_batches_per_loop_val (int, optional) – if given, specifies how many batches to return (at most) over a full loop of the validation iterator.
use_simple_sampler (bool) – if True, returns an iterator of type mbrl.util.replay_buffer.SequenceTransitionSampler instead of mbrl.util.replay_buffer.SequenceTransitionIterator.
- Returns
the training and validation iterators, respectively.
- Return type
(tuple of mbrl.util.replay_buffer.SequenceTransitionIterator or mbrl.util.replay_buffer.SequenceTransitionSampler)
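Example usage (a sketch; sequence iterators are typically used for recurrent models, and this assumes the replay buffer was created with collect_trajectories=True so that trajectory information is available):
    import mbrl.util.common as common_util

    train_seq_iter, val_seq_iter = common_util.get_sequence_buffer_iterator(
        replay_buffer,
        batch_size=32,
        val_ratio=0.1,
        sequence_length=16,
        ensemble_size=1,
        max_batches_per_loop_train=100,
    )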
- mbrl.util.common.load_hydra_cfg(results_dir: Union[str, pathlib.Path]) → omegaconf.dictconfig.DictConfig¶
Loads a Hydra configuration from the given directory path.
Tries to load the configuration from “results_dir/.hydra/config.yaml”.
- Parameters
results_dir (str or pathlib.Path) – the path to the directory containing the config.
- Returns
the loaded configuration.
- Return type
(omegaconf.DictConfig)
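Example usage (a sketch; the results path is hypothetical and must contain a .hydra/config.yaml file):
    import omegaconf
    import mbrl.util.common as common_util

    cfg = common_util.load_hydra_cfg("./exp/pets/2021.01.01/120000")
    print(omegaconf.OmegaConf.to_yaml(cfg))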
- mbrl.util.common.rollout_agent_trajectories(env: gym.core.Env, steps_or_trials_to_collect: int, agent: mbrl.planning.core.Agent, agent_kwargs: Dict, trial_length: Optional[int] = None, callback: Optional[Callable] = None, replay_buffer: Optional[mbrl.util.replay_buffer.ReplayBuffer] = None, collect_full_trajectories: bool = False, agent_uses_low_dim_obs: bool = False) → List[float]¶
Rolls out agent trajectories in the given environment.
Rolls out trajectories in the environment using actions produced by the given agent. Optionally, the collected data is stored in a replay buffer.
- Parameters
env (gym.Env) – the environment to step.
steps_or_trials_to_collect (int) – how many steps of the environment to collect. If collect_full_trajectories=True, it indicates the number of trials instead.
agent (mbrl.planning.Agent) – the agent used to generate an action.
agent_kwargs (dict) – any keyword arguments to pass to the agent.act() method.
trial_length (int, optional) – a maximum length for trials (the env will be reset regularly after this many steps). Defaults to None, in which case trials will end when the environment returns done=True.
callback (callable, optional) – a function that will be called using the generated transition data (obs, action, next_obs, reward, done).
replay_buffer (mbrl.util.ReplayBuffer, optional) – a replay buffer to store data to use for training.
collect_full_trajectories (bool) – if True, indicates that replay buffers should collect full trajectories. This only affects the split between training and validation buffers. If collect_full_trajectories=True, the split is done over trials (full trials in each dataset); otherwise, it’s done across steps.
agent_uses_low_dim_obs (bool) – only valid if env is of type mbrl.env.MujocoGymPixelWrapper and replay_buffer is not None. If True, instead of passing the obs produced by env.reset/step to the agent, it will pass obs = env.get_last_low_dim_obs(). This is useful for rolling out an agent trained with low-dimensional obs, while collecting pixel obs in the replay buffer.
- Returns
Total rewards obtained at each complete trial.
- Return type
(list(float))
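Example usage (a sketch for seeding the replay buffer with exploration data; mbrl.planning.RandomAgent is assumed to be available in the planning package):
    import gym
    import mbrl.planning
    import mbrl.util.common as common_util

    env = gym.make("HalfCheetah-v2")
    agent = mbrl.planning.RandomAgent(env)
    total_rewards = common_util.rollout_agent_trajectories(
        env,
        steps_or_trials_to_collect=1000,
        agent=agent,
        agent_kwargs={},
        trial_length=200,
        replay_buffer=replay_buffer,  # created with create_replay_buffer above
    )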
- mbrl.util.common.rollout_model_env(model_env: mbrl.models.model_env.ModelEnv, initial_obs: numpy.ndarray, plan: Optional[numpy.ndarray] = None, agent: Optional[mbrl.planning.core.Agent] = None, num_samples: int = 1) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶
Rolls out an environment model.
Executes a plan on a dynamics model.
- Parameters
model_env (mbrl.models.ModelEnv) – the dynamics model environment to simulate.
initial_obs (np.ndarray) – initial observation to start the episodes.
plan (np.ndarray, optional) – sequence of actions to execute.
agent – an agent to generate a plan before execution starts (as in agent.plan(initial_obs)). If given, takes precedence over plan.
- Returns
the observations, rewards, and actions observed, respectively.
- Return type
(tuple of np.ndarray)
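Example usage (a sketch, assuming env and dynamics_model from the earlier examples; the mbrl.models.ModelEnv constructor arguments and mbrl.env.termination_fns.no_termination are assumptions about the rest of the library):
    import numpy as np
    import mbrl.models
    import mbrl.env.termination_fns as termination_fns
    import mbrl.util.common as common_util

    # Simulate a random plan inside the learned dynamics model.
    model_env = mbrl.models.ModelEnv(
        env, dynamics_model, termination_fns.no_termination, reward_fn=None
    )
    plan = np.stack([env.action_space.sample() for _ in range(20)])
    obs_trajs, rewards, actions = common_util.rollout_model_env(
        model_env, env.reset(), plan=plan, num_samples=5
    )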
- mbrl.util.common.step_env_and_add_to_buffer(env: gym.core.Env, obs: numpy.ndarray, agent: mbrl.planning.core.Agent, agent_kwargs: Dict, replay_buffer: mbrl.util.replay_buffer.ReplayBuffer, callback: Optional[Callable] = None, agent_uses_low_dim_obs: bool = False) → Tuple[numpy.ndarray, float, bool, Dict]¶
Steps the environment with an agent’s action and populates the replay buffer.
- Parameters
env (gym.Env) – the environment to step.
obs (np.ndarray) – the latest observation returned by the environment (used to obtain an action from the agent).
agent (mbrl.planning.Agent) – the agent used to generate an action.
agent_kwargs (dict) – any keyword arguments to pass to the agent.act() method.
replay_buffer (mbrl.util.ReplayBuffer) – the replay buffer containing stored data.
callback (callable, optional) – a function that will be called using the generated transition data (obs, action, next_obs, reward, done).
agent_uses_low_dim_obs (bool) – only valid if env is of type mbrl.env.MujocoGymPixelWrapper. If True, instead of passing the obs produced by env.reset/step to the agent, it will pass obs = env.get_last_low_dim_obs(). This is useful for rolling out an agent trained with low-dimensional obs, while collecting pixel obs in the replay buffer.
- Returns
next observation, reward, done and meta-info, respectively, as generated by env.step(agent.act(obs)).
- Return type
(tuple)
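Example usage (a sketch of a manual data-collection loop, assuming env, agent, and replay_buffer from the previous examples):
    import mbrl.util.common as common_util

    obs = env.reset()
    for _ in range(1000):
        next_obs, reward, done, _ = common_util.step_env_and_add_to_buffer(
            env, obs, agent, {}, replay_buffer
        )
        obs = env.reset() if done else next_obs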
- mbrl.util.common.train_model_and_save_model_and_data(model: mbrl.models.model.Model, model_trainer: mbrl.models.model_trainer.ModelTrainer, cfg: omegaconf.dictconfig.DictConfig, replay_buffer: mbrl.util.replay_buffer.ReplayBuffer, work_dir: Optional[Union[str, pathlib.Path]] = None, callback: Optional[Callable] = None)¶
Convenience function for training a model and saving results.
Runs model_trainer.train(), then saves the resulting model and the data used. If the model has an “update_normalizer” method, it will be called before training, passing replay_buffer.get_all() as input.
- Parameters
model (mbrl.models.Model) – the model to train.
model_trainer (mbrl.models.ModelTrainer) – the model trainer.
cfg (omegaconf.DictConfig) – configuration to use for training. It must contain the following fields:
  -model_batch_size (int)
  -validation_ratio (float)
  -num_epochs_train_model (int, optional)
  -patience (int, optional)
  -bootstrap_permutes (bool, optional)
replay_buffer (mbrl.util.ReplayBuffer) – the replay buffer to use.
work_dir (str or pathlib.Path, optional) – if given, a directory to save the model and buffer to.
callback (callable, optional) – if provided, this function will be called after every training epoch. See mbrl.models.ModelTrainer for the signature.
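Example usage (a sketch, assuming dynamics_model and replay_buffer from the earlier examples; the ModelTrainer constructor arguments optim_lr and weight_decay are assumptions about the mbrl.models.ModelTrainer API):
    import omegaconf
    import mbrl.models
    import mbrl.util.common as common_util

    model_trainer = mbrl.models.ModelTrainer(dynamics_model, optim_lr=1e-3, weight_decay=5e-5)
    train_cfg = omegaconf.OmegaConf.create(
        {
            "model_batch_size": 256,
            "validation_ratio": 0.1,
            "num_epochs_train_model": 50,
            "patience": 10,
        }
    )
    common_util.train_model_and_save_model_and_data(
        dynamics_model, model_trainer, train_cfg, replay_buffer, work_dir="./exp_dir"
    )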
- class mbrl.util.mujoco.freeze_mujoco_env(env: gym.wrappers.time_limit.TimeLimit)¶
Bases: object
Provides a context to freeze a Mujoco environment.
This context allows the user to manipulate the state of a Mujoco environment and return it to its original state upon exiting the context.
Works with mujoco gym and dm_control environments (with dmc2gym).
Example usage:
env = gym.make("HalfCheetah-v2") env.reset() action = env.action_space.sample() # o1_expected, *_ = env.step(action) with freeze_mujoco_env(env): step_the_env_a_bunch_of_times() o1, *_ = env.step(action) # o1 will be equal to what o1_expected would have been
- Parameters
env (gym.wrappers.TimeLimit) – the environment to freeze.
- mbrl.util.mujoco.get_current_state(env: gym.wrappers.time_limit.TimeLimit) → Tuple¶
Returns the internal state of the environment.
Returns a tuple with information that can be passed to set_env_state() to manually set the environment (or a copy of it) to the same state it had when this function was called.
Works with mujoco gym and dm_control environments (with dmc2gym).
- Parameters
env (gym.wrappers.TimeLimit) – the environment.
- Returns
For mujoco gym environments, returns the internal state (position and velocity), and the number of elapsed steps so far. For dm_control environments it returns physics.get_state().copy(), elapsed steps and step_count.
- Return type
(tuple)
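Example usage (a sketch of a save/restore round trip together with set_env_state()):
    import gym
    from mbrl.util.mujoco import get_current_state, set_env_state

    env = gym.make("HalfCheetah-v2")
    env.reset()
    saved_state = get_current_state(env)
    env.step(env.action_space.sample())  # perturb the environment
    set_env_state(saved_state, env)      # restore the saved state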
- mbrl.util.mujoco.make_env(cfg: Union[omegaconf.listconfig.ListConfig, omegaconf.dictconfig.DictConfig]) → Tuple[gym.core.Env, Callable[[torch.Tensor, torch.Tensor], torch.Tensor], Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]]]¶
Creates an environment from a given OmegaConf configuration object.
This method expects the configuration, cfg, to have the following attributes (some are optional):
If cfg.overrides.env_cfg is present, this method instantiates the environment using hydra.utils.instantiate(env_cfg). Otherwise, it expects the attribute cfg.overrides.env, which should be a string description of the environment, where valid options are:
“dmcontrol___<domain>--<task>”: a Deep-Mind Control suite environment with the indicated domain and task (e.g., “dmcontrol___cheetah--run”).
“gym___<env_name>”: a Gym environment (e.g., “gym___HalfCheetah-v2”).
“cartpole_continuous”: a continuous version of gym’s Cartpole environment.
“pets_halfcheetah”: the implementation of HalfCheetah used in Chua et al., PETS paper.
“ant_truncated_obs”: the implementation of Ant environment used in Janner et al., MBPO paper.
“humanoid_truncated_obs”: the implementation of Humanoid environment used in Janner et al., MBPO paper.
cfg.overrides.term_fn: (only for dmcontrol and gym environments) a string indicating the environment’s termination function to use when simulating the environment with the model. It should correspond to the name of a function in mbrl.env.termination_fns.
cfg.overrides.reward_fn: (only for dmcontrol and gym environments) a string indicating the environment’s reward function to use when simulating the environment with the model. If not present, it will try to use cfg.overrides.term_fn. If that’s not present either, it will return a None reward function. If provided, it should correspond to the name of a function in mbrl.env.reward_fns.
cfg.overrides.learned_rewards: (optional) if present, indicates that the reward function will be learned, in which case the method will return a None reward function.
cfg.overrides.trial_length: (optional) if present, indicates the maximum length of trials. Defaults to 1000.
- Parameters
cfg (omegaconf.DictConfig) – the configuration to use.
- Returns
returns the new environment, the termination function to use, and the reward function to use (or None if cfg.overrides.learned_rewards == True).
- Return type
(tuple of env, termination function, reward function)
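Example usage (a sketch; "no_termination" is assumed to be the name of a function available in mbrl.env.termination_fns):
    import omegaconf
    from mbrl.util.mujoco import make_env

    cfg = omegaconf.OmegaConf.create(
        {
            "overrides": {
                "env": "gym___HalfCheetah-v2",
                "term_fn": "no_termination",
                "trial_length": 1000,
            }
        }
    )
    env, term_fn, reward_fn = make_env(cfg)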
- mbrl.util.mujoco.make_env_from_str(env_name: str) → gym.core.Env¶
Creates a new environment from its string description.
- Parameters
env_name (str) –
the string description of the environment. Valid options are:
”dmcontrol___<domain>--<task>”: a Deep-Mind Control suite environment with the indicated domain and task (e.g., “dmcontrol___cheetah--run”).
”gym___<env_name>”: a Gym environment (e.g., “gym___HalfCheetah-v2”).
”cartpole_continuous”: a continuous version of gym’s Cartpole environment.
”pets_halfcheetah”: the implementation of HalfCheetah used in Chua et al., PETS paper.
”ant_truncated_obs”: the implementation of Ant environment used in Janner et al., MBPO paper.
”humanoid_truncated_obs”: the implementation of Humanoid environment used in Janner et al., MBPO paper.
- Returns
the created environment.
- Return type
(gym.Env)
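Example usage (creating a standard gym environment through the string convention described above):
    from mbrl.util.mujoco import make_env_from_str

    env = make_env_from_str("gym___HalfCheetah-v2")
    obs = env.reset()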
- mbrl.util.mujoco.rollout_mujoco_env(env: gym.wrappers.time_limit.TimeLimit, initial_obs: numpy.ndarray, lookahead: int, agent: Optional[mbrl.planning.core.Agent] = None, plan: Optional[numpy.ndarray] = None) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶
Runs the environment for some number of steps and then returns it to its original state.
Works with mujoco gym and dm_control environments (with dmc2gym).
- Parameters
env (gym.wrappers.TimeLimit) – the environment.
initial_obs (np.ndarray) – the latest observation returned by the environment (only needed when agent is not None, to get the first action).
lookahead (int) – the number of steps to run. If plan is not None, it is overridden by len(plan).
agent (mbrl.planning.Agent, optional) – if given, an agent to obtain actions.
plan (sequence of np.ndarray, optional) – if given, a sequence of actions to execute. Takes precedence over agent when both are given.
- Returns
the observations, rewards, and actions observed, respectively.
- Return type
(tuple of np.ndarray)
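Example usage (a sketch: evaluate a candidate action sequence in the true simulator without disturbing its state, assuming env from the previous example):
    import numpy as np
    from mbrl.util.mujoco import rollout_mujoco_env

    obs = env.reset()
    plan = np.stack([env.action_space.sample() for _ in range(10)])
    observations, rewards, actions = rollout_mujoco_env(env, obs, lookahead=10, plan=plan)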
- mbrl.util.mujoco.set_env_state(state: Tuple, env: gym.wrappers.time_limit.TimeLimit)¶
Sets the state of the environment.
Assumes state was generated using get_current_state().
Works with mujoco gym and dm_control environments (with dmc2gym).
- Parameters
state (tuple) – see get_current_state() for a description.
env (gym.wrappers.TimeLimit) – the environment.