General utilities package¶
- mbrl.util.common.create_one_dim_tr_model(cfg: omegaconf.dictconfig.DictConfig, obs_shape: Tuple[int, …], act_shape: Tuple[int, …], model_dir: Optional[Union[str, pathlib.Path]] = None)¶
Creates a 1-D transition reward model from a given configuration.
This method creates a new model from the given configuration and wraps it into a mbrl.models.OneDTransitionRewardModel (see its documentation for an explanation of some of the config args under cfg.algorithm). The configuration should be structured as follows:
-cfg
  -dynamics_model
    -model
      -_target_ (str): model Python class
      -in_size (int, optional): input size
      -out_size (int, optional): output size
      -model_arg_1
      ...
      -model_arg_n
  -algorithm
    -learned_rewards (bool): whether rewards should be learned or not
    -target_is_delta (bool): to be passed to the dynamics model wrapper
    -normalize (bool): to be passed to the dynamics model wrapper
  -overrides
    -no_delta_list (list[int], optional): to be passed to the dynamics model wrapper
    -obs_process_fn (str, optional): a Python function to pre-process observations
    -num_elites (int, optional): number of elite members for ensembles
If cfg.dynamics_model.model.in_size is not provided, it will be automatically set to obs_shape[0] + act_shape[0]. If cfg.dynamics_model.model.out_size is not provided, it will be automatically set to obs_shape[0] + int(cfg.algorithm.learned_rewards).
The model will be instantiated using the hydra.utils.instantiate() function.
- Parameters
cfg (omegaconf.DictConfig) – the configuration to read.
obs_shape (tuple of ints) – the shape of the observations (only used if the model input or output sizes are not provided in the configuration).
act_shape (tuple of ints) – the shape of the actions (only used if the model input is not provided in the configuration).
model_dir (str or pathlib.Path) – If provided, the model will attempt to load its weights and normalization information from “model_dir / model.pth” and “model_dir / env_stats.pickle”, respectively.
- Returns
the model created.
- Return type
(mbrl.models.OneDTransitionRewardModel)
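Example usage (a minimal sketch; the mbrl.models.GaussianMLP target and its keyword arguments are illustrative assumptions, and any model class accepted by hydra.utils.instantiate() can be configured the same way):
    import omegaconf
    import mbrl.util.common as common_util

    cfg = omegaconf.OmegaConf.create(
        {
            "dynamics_model": {
                "model": {
                    "_target_": "mbrl.models.GaussianMLP",  # illustrative model class
                    "device": "cpu",
                    "num_layers": 3,
                    # in_size/out_size omitted: they are filled in from
                    # obs_shape, act_shape and algorithm.learned_rewards.
                }
            },
            "algorithm": {
                "learned_rewards": True,
                "target_is_delta": True,
                "normalize": True,
            },
            "overrides": {},
        }
    )
    obs_shape, act_shape = (17,), (6,)  # e.g., HalfCheetah-sized spaces
    dynamics_model = common_util.create_one_dim_tr_model(cfg, obs_shape, act_shape)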
- mbrl.util.common.create_replay_buffer(cfg: omegaconf.dictconfig.DictConfig, obs_shape: Sequence[int], act_shape: Sequence[int], obs_type: Type = <class 'numpy.float32'>, action_type: Type = <class 'numpy.float32'>, reward_type: Type = <class 'numpy.float32'>, load_dir: Optional[Union[str, pathlib.Path]] = None, collect_trajectories: bool = False, rng: Optional[numpy.random._generator.Generator] = None) → mbrl.util.replay_buffer.ReplayBuffer¶
Creates a replay buffer from a given configuration.
The configuration should be structured as follows:
-cfg
  -algorithm
    -dataset_size (int, optional): the maximum size of the train dataset/buffer
  -overrides
    -num_steps (int, optional): how many steps to take in the environment
    -trial_length (int, optional): the maximum length for trials. Only needed if collect_trajectories == True.
The size of the replay buffer can be determined either by providing cfg.algorithm.dataset_size, or by providing cfg.overrides.num_steps. Specifying the dataset size directly takes precedence over the number of steps.
- Parameters
cfg (omegaconf.DictConfig) – the configuration to use.
obs_shape (Sequence of ints) – the shape of observation arrays.
act_shape (Sequence of ints) – the shape of action arrays.
obs_type (type) – the data type of the observations (defaults to np.float32).
action_type (type) – the data type of the actions (defaults to np.float32).
reward_type (type) – the data type of the rewards (defaults to np.float32).
load_dir (optional str or pathlib.Path) – if provided, the function will attempt to populate the buffers from “load_dir/replay_buffer.npz”.
collect_trajectories (bool, optional) – if True, sets the replay buffers to collect trajectory information. Defaults to False.
rng (np.random.Generator, optional) – a random number generator to use when sampling batches. If None (default value), a new default generator will be used.
- Returns
the replay buffer.
- Return type
(mbrl.util.replay_buffer.ReplayBuffer)
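Example usage (a sketch; the buffer capacity comes from overrides.num_steps since algorithm.dataset_size is not given):
    import numpy as np
    import omegaconf
    import mbrl.util.common as common_util

    cfg = omegaconf.OmegaConf.create(
        {
            "algorithm": {},
            "overrides": {"num_steps": 10000, "trial_length": 200},
        }
    )
    rng = np.random.default_rng(seed=0)
    replay_buffer = common_util.create_replay_buffer(
        cfg, obs_shape=(17,), act_shape=(6,), rng=rng
    )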
- mbrl.util.common.get_basic_buffer_iterators(replay_buffer: mbrl.util.replay_buffer.ReplayBuffer, batch_size: int, val_ratio: float, ensemble_size: int = 1, shuffle_each_epoch: bool = True, bootstrap_permutes: bool = False) → Tuple[mbrl.util.replay_buffer.TransitionIterator, Optional[mbrl.util.replay_buffer.TransitionIterator]]¶
Returns training/validation iterators for the data in the replay buffer.
- Parameters
replay_buffer (mbrl.util.ReplayBuffer) – the replay buffer from which data will be sampled.
batch_size (int) – the batch size for the iterators.
val_ratio (float) – the proportion of data to use for validation. If 0., the validation buffer will be set to None.
ensemble_size (int) – the size of the ensemble being trained.
shuffle_each_epoch (bool) – if True, the iterator will shuffle the order each time a loop starts. Otherwise the iteration order will be the same. Defaults to True.
bootstrap_permutes (bool) – if True, the bootstrap iterator will create the bootstrap data using permutations of the original data. Otherwise it will use sampling with replacement. Defaults to False.
- Returns
the training and validation iterators, respectively.
- Return type
(tuple of mbrl.util.replay_buffer.TransitionIterator)
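Example usage (a sketch, assuming the replay_buffer from the previous example already contains transitions):
    import mbrl.util.common as common_util

    train_iter, val_iter = common_util.get_basic_buffer_iterators(
        replay_buffer,
        batch_size=256,
        val_ratio=0.1,
        ensemble_size=5,
        shuffle_each_epoch=True,
    )
    for batch in train_iter:
        pass  # iterate over training batches (bootstrapped for the ensemble)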
- mbrl.util.common.get_sequence_buffer_iterator(replay_buffer: mbrl.util.replay_buffer.ReplayBuffer, batch_size: int, val_ratio: float, sequence_length: int, ensemble_size: int = 1, shuffle_each_epoch: bool = True, max_batches_per_loop_train: Optional[int] = None, max_batches_per_loop_val: Optional[int] = None, use_simple_sampler: bool = False) → Tuple[Union[mbrl.util.replay_buffer.SequenceTransitionIterator, mbrl.util.replay_buffer.SequenceTransitionSampler], Optional[Union[mbrl.util.replay_buffer.SequenceTransitionIterator, mbrl.util.replay_buffer.SequenceTransitionSampler]]]¶
Returns training/validation iterators for the data in the replay buffer.
- Parameters
replay_buffer (mbrl.util.ReplayBuffer) – the replay buffer from which data will be sampled.
batch_size (int) – the batch size for the iterators.
val_ratio (float) – the proportion of data to use for validation. If 0., the validation buffer will be set to None.
sequence_length (int) – the length of the sequences returned by the iterators.
ensemble_size (int) – the number of models in the ensemble.
shuffle_each_epoch (bool) – if True, the iterator will shuffle the order each time a loop starts. Otherwise the iteration order will be the same. Defaults to True.
max_batches_per_loop_train (int, optional) – if given, specifies how many batches to return (at most) over a full loop of the training iterator.
max_batches_per_loop_val (int, optional) – if given, specifies how many batches to return (at most) over a full loop of the validation iterator.
use_simple_sampler (bool) – if True, returns an iterator of type mbrl.util.replay_buffer.SequenceTransitionSampler instead of mbrl.util.replay_buffer.SequenceTransitionIterator.
- Returns
the training and validation iterators, respectively.
- Return type
(tuple of mbrl.util.replay_buffer.SequenceTransitionIterator or mbrl.util.replay_buffer.SequenceTransitionSampler)
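Example usage (a sketch; sequence iterators are typically used for recurrent models, and this assumes the replay buffer was created with collect_trajectories=True so that trajectory information is available):
    import mbrl.util.common as common_util

    train_seq_iter, val_seq_iter = common_util.get_sequence_buffer_iterator(
        replay_buffer,
        batch_size=32,
        val_ratio=0.1,
        sequence_length=16,
        ensemble_size=1,
        max_batches_per_loop_train=100,
    )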
- mbrl.util.common.load_hydra_cfg(results_dir: Union[str, pathlib.Path]) → omegaconf.dictconfig.DictConfig¶
Loads a Hydra configuration from the given directory path.
Tries to load the configuration from “results_dir/.hydra/config.yaml”.
- Parameters
results_dir (str or pathlib.Path) – the path to the directory containing the config.
- Returns
the loaded configuration.
- Return type
(omegaconf.DictConfig)
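Example usage (a sketch; the results path is hypothetical and must contain a .hydra/config.yaml file):
    import omegaconf
    import mbrl.util.common as common_util

    cfg = common_util.load_hydra_cfg("./exp/pets/2021.01.01/120000")
    print(omegaconf.OmegaConf.to_yaml(cfg))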
- mbrl.util.common.rollout_agent_trajectories(env: gym.core.Env, steps_or_trials_to_collect: int, agent: mbrl.planning.core.Agent, agent_kwargs: Dict, trial_length: Optional[int] = None, callback: Optional[Callable] = None, replay_buffer: Optional[mbrl.util.replay_buffer.ReplayBuffer] = None, collect_full_trajectories: bool = False, agent_uses_low_dim_obs: bool = False) → List[float]¶
Rolls out agent trajectories in the given environment.
Rolls out trajectories in the environment using actions produced by the given agent. Optionally, the collected data is stored in a replay buffer.
- Parameters
env (gym.Env) – the environment to step.
steps_or_trials_to_collect (int) – how many steps of the environment to collect. If collect_full_trajectories=True, it indicates the number of trials instead.
agent (mbrl.planning.Agent) – the agent used to generate an action.
agent_kwargs (dict) – any keyword arguments to pass to the agent.act() method.
trial_length (int, optional) – a maximum length for trials (the env will be reset regularly after this many steps). Defaults to None, in which case trials will end when the environment returns done=True.
callback (callable, optional) – a function that will be called using the generated transition data (obs, action, next_obs, reward, done).
replay_buffer (mbrl.util.ReplayBuffer, optional) – a replay buffer to store data to use for training.
collect_full_trajectories (bool) – if True, indicates that replay buffers should collect full trajectories. This only affects the split between training and validation buffers. If collect_full_trajectories=True, the split is done over trials (full trials in each dataset); otherwise, it’s done across steps.
agent_uses_low_dim_obs (bool) – only valid if env is of type mbrl.env.MujocoGymPixelWrapper and replay_buffer is not None. If True, instead of passing the obs produced by env.reset/step to the agent, it will pass obs = env.get_last_low_dim_obs(). This is useful for rolling out an agent trained with low-dimensional obs, while collecting pixel obs in the replay buffer.
- Returns
Total rewards obtained at each complete trial.
- Return type
(list(float))
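Example usage (a sketch for seeding the replay buffer with exploration data; mbrl.planning.RandomAgent is assumed to be available in the planning package):
    import gym
    import mbrl.planning
    import mbrl.util.common as common_util

    env = gym.make("HalfCheetah-v2")
    agent = mbrl.planning.RandomAgent(env)
    total_rewards = common_util.rollout_agent_trajectories(
        env,
        steps_or_trials_to_collect=1000,
        agent=agent,
        agent_kwargs={},
        trial_length=200,
        replay_buffer=replay_buffer,  # created with create_replay_buffer above
    )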
- mbrl.util.common.rollout_model_env(model_env: mbrl.models.model_env.ModelEnv, initial_obs: numpy.ndarray, plan: Optional[numpy.ndarray] = None, agent: Optional[mbrl.planning.core.Agent] = None, num_samples: int = 1) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶
Rolls out an environment model.
Executes a plan on a dynamics model.
- Parameters
model_env (mbrl.models.ModelEnv) – the dynamics model environment to simulate.
initial_obs (np.ndarray) – initial observation to start the episodes.
plan (np.ndarray, optional) – sequence of actions to execute.
agent – an agent to generate a plan before execution starts (as in agent.plan(initial_obs)). If given, takes precedence over plan.
- Returns
the observations, rewards, and actions observed, respectively.
- Return type
(tuple of np.ndarray)
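Example usage (a sketch, assuming env and dynamics_model from the earlier examples; the mbrl.models.ModelEnv constructor arguments and mbrl.env.termination_fns.no_termination are assumptions about the rest of the library):
    import numpy as np
    import mbrl.models
    import mbrl.env.termination_fns as termination_fns
    import mbrl.util.common as common_util

    # Simulate a random plan inside the learned dynamics model.
    model_env = mbrl.models.ModelEnv(
        env, dynamics_model, termination_fns.no_termination, reward_fn=None
    )
    plan = np.stack([env.action_space.sample() for _ in range(20)])
    obs_trajs, rewards, actions = common_util.rollout_model_env(
        model_env, env.reset(), plan=plan, num_samples=5
    )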
- mbrl.util.common.step_env_and_add_to_buffer(env: gym.core.Env, obs: numpy.ndarray, agent: mbrl.planning.core.Agent, agent_kwargs: Dict, replay_buffer: mbrl.util.replay_buffer.ReplayBuffer, callback: Optional[Callable] = None, agent_uses_low_dim_obs: bool = False) → Tuple[numpy.ndarray, float, bool, Dict]¶
Steps the environment with an agent’s action and populates the replay buffer.
- Parameters
env (gym.Env) – the environment to step.
obs (np.ndarray) – the latest observation returned by the environment (used to obtain an action from the agent).
agent (mbrl.planning.Agent) – the agent used to generate an action.
agent_kwargs (dict) – any keyword arguments to pass to the agent.act() method.
replay_buffer (mbrl.util.ReplayBuffer) – the replay buffer containing stored data.
callback (callable, optional) – a function that will be called using the generated transition data (obs, action, next_obs, reward, done).
agent_uses_low_dim_obs (bool) – only valid if env is of type mbrl.env.MujocoGymPixelWrapper. If True, instead of passing the obs produced by env.reset/step to the agent, it will pass obs = env.get_last_low_dim_obs(). This is useful for rolling out an agent trained with low-dimensional obs, while collecting pixel obs in the replay buffer.
- Returns
next observation, reward, done and meta-info, respectively, as generated by env.step(agent.act(obs)).
- Return type
(tuple)
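Example usage (a sketch of a manual data-collection loop, assuming env, agent, and replay_buffer from the previous examples):
    import mbrl.util.common as common_util

    obs = env.reset()
    for _ in range(1000):
        next_obs, reward, done, _ = common_util.step_env_and_add_to_buffer(
            env, obs, agent, {}, replay_buffer
        )
        obs = env.reset() if done else next_obs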
- mbrl.util.common.train_model_and_save_model_and_data(model: mbrl.models.model.Model, model_trainer: mbrl.models.model_trainer.ModelTrainer, cfg: omegaconf.dictconfig.DictConfig, replay_buffer: mbrl.util.replay_buffer.ReplayBuffer, work_dir: Optional[Union[str, pathlib.Path]] = None, callback: Optional[Callable] = None)¶
Convenience function for training a model and saving results.
Runs model_trainer.train(), then saves the resulting model and the data used. If the model has an “update_normalizer” method, it will be called before training, passing replay_buffer.get_all() as input.
- Parameters
model (mbrl.models.Model) – the model to train.
model_trainer (mbrl.models.ModelTrainer) – the model trainer.
cfg (omegaconf.DictConfig) – configuration to use for training. It must contain the following fields:
  -model_batch_size (int)
  -validation_ratio (float)
  -num_epochs_train_model (int, optional)
  -patience (int, optional)
  -bootstrap_permutes (bool, optional)
replay_buffer (mbrl.util.ReplayBuffer) – the replay buffer to use.
work_dir (str or pathlib.Path, optional) – if given, a directory to save the model and buffer to.
callback (callable, optional) – if provided, this function will be called after every training epoch. See mbrl.models.ModelTrainer for the signature.
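Example usage (a sketch, assuming dynamics_model and replay_buffer from the earlier examples; the ModelTrainer constructor arguments optim_lr and weight_decay are assumptions about the mbrl.models.ModelTrainer API):
    import omegaconf
    import mbrl.models
    import mbrl.util.common as common_util

    model_trainer = mbrl.models.ModelTrainer(dynamics_model, optim_lr=1e-3, weight_decay=5e-5)
    train_cfg = omegaconf.OmegaConf.create(
        {
            "model_batch_size": 256,
            "validation_ratio": 0.1,
            "num_epochs_train_model": 50,
            "patience": 10,
        }
    )
    common_util.train_model_and_save_model_and_data(
        dynamics_model, model_trainer, train_cfg, replay_buffer, work_dir="./exp_dir"
    )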
- class mbrl.util.mujoco.freeze_mujoco_env(env: gym.wrappers.time_limit.TimeLimit)¶
Bases: object
Provides a context to freeze a Mujoco environment.
This context allows the user to manipulate the state of a Mujoco environment and return it to its original state upon exiting the context.
Works with mujoco gym and dm_control environments (with dmc2gym).
Example usage:
env = gym.make("HalfCheetah-v2") env.reset() action = env.action_space.sample() # o1_expected, *_ = env.step(action) with freeze_mujoco_env(env): step_the_env_a_bunch_of_times() o1, *_ = env.step(action) # o1 will be equal to what o1_expected would have been
- Parameters
env (gym.wrappers.TimeLimit) – the environment to freeze.
- mbrl.util.mujoco.get_current_state(env: gym.wrappers.time_limit.TimeLimit) → Tuple¶
Returns the internal state of the environment.
Returns a tuple with information that can be passed to set_env_state() to manually set the environment (or a copy of it) to the same state it had when this function was called.
Works with mujoco gym and dm_control environments (with dmc2gym).
- Parameters
env (gym.wrappers.TimeLimit) – the environment.
- Returns
For mujoco gym environments, returns the internal state (position and velocity), and the number of elapsed steps so far. For dm_control environments it returns physics.get_state().copy(), elapsed steps and step_count.
- Return type
(tuple)
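Example usage (a sketch of a save/restore round trip together with set_env_state()):
    import gym
    from mbrl.util.mujoco import get_current_state, set_env_state

    env = gym.make("HalfCheetah-v2")
    env.reset()
    saved_state = get_current_state(env)
    env.step(env.action_space.sample())  # perturb the environment
    set_env_state(saved_state, env)      # restore the saved state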
- mbrl.util.mujoco.make_env(cfg: Union[omegaconf.listconfig.ListConfig, omegaconf.dictconfig.DictConfig]) → Tuple[gym.core.Env, Callable[[torch.Tensor, torch.Tensor], torch.Tensor], Optional[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]]]¶
Creates an environment from a given OmegaConf configuration object.
This method expects the configuration, cfg, to have the following attributes (some are optional):
If cfg.overrides.env_cfg is present, this method instantiates the environment using hydra.utils.instantiate(env_cfg). Otherwise, it expects the attribute cfg.overrides.env, which should be a string description of the environment, where valid options are:
“dmcontrol___<domain>--<task>”: a Deep-Mind Control suite environment with the indicated domain and task (e.g., “dmcontrol___cheetah--run”).
“gym___<env_name>”: a Gym environment (e.g., “gym___HalfCheetah-v2”).
“cartpole_continuous”: a continuous version of gym’s Cartpole environment.
“pets_halfcheetah”: the implementation of HalfCheetah used in Chua et al., PETS paper.
“ant_truncated_obs”: the implementation of Ant environment used in Janner et al., MBPO paper.
“humanoid_truncated_obs”: the implementation of Humanoid environment used in Janner et al., MBPO paper.
cfg.overrides.term_fn: (only for dmcontrol and gym environments) a string indicating the environment’s termination function to use when simulating the environment with the model. It should correspond to the name of a function in mbrl.env.termination_fns.
cfg.overrides.reward_fn: (only for dmcontrol and gym environments) a string indicating the environment’s reward function to use when simulating the environment with the model. If not present, it will try to use cfg.overrides.term_fn. If that’s not present either, it will return a None reward function. If provided, it should correspond to the name of a function in mbrl.env.reward_fns.
cfg.overrides.learned_rewards: (optional) if present, indicates that the reward function will be learned, in which case the method will return a None reward function.
cfg.overrides.trial_length: (optional) if present, indicates the maximum length of trials. Defaults to 1000.
- Parameters
cfg (omegaconf.DictConfig) – the configuration to use.
- Returns
returns the new environment, the termination function to use, and the reward function to use (or None if cfg.overrides.learned_rewards == True).
- Return type
(tuple of env, termination function, reward function)
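Example usage (a sketch; "no_termination" is assumed to be the name of a function available in mbrl.env.termination_fns):
    import omegaconf
    from mbrl.util.mujoco import make_env

    cfg = omegaconf.OmegaConf.create(
        {
            "overrides": {
                "env": "gym___HalfCheetah-v2",
                "term_fn": "no_termination",
                "trial_length": 1000,
            }
        }
    )
    env, term_fn, reward_fn = make_env(cfg)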
- mbrl.util.mujoco.make_env_from_str(env_name: str) → gym.core.Env¶
Creates a new environment from its string description.
- Parameters
env_name (str) –
the string description of the environment. Valid options are:
”dmcontrol___<domain>--<task>”: a Deep-Mind Control suite environment with the indicated domain and task (e.g., “dmcontrol___cheetah--run”).
”gym___<env_name>”: a Gym environment (e.g., “gym___HalfCheetah-v2”).
”cartpole_continuous”: a continuous version of gym’s Cartpole environment.
”pets_halfcheetah”: the implementation of HalfCheetah used in Chua et al., PETS paper.
”ant_truncated_obs”: the implementation of Ant environment used in Janner et al., MBPO paper.
”humanoid_truncated_obs”: the implementation of Humanoid environment used in Janner et al., MBPO paper.
- Returns
the created environment.
- Return type
(gym.Env)
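Example usage (creating a standard gym environment through the string convention described above):
    from mbrl.util.mujoco import make_env_from_str

    env = make_env_from_str("gym___HalfCheetah-v2")
    obs = env.reset()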
- mbrl.util.mujoco.rollout_mujoco_env(env: gym.wrappers.time_limit.TimeLimit, initial_obs: numpy.ndarray, lookahead: int, agent: Optional[mbrl.planning.core.Agent] = None, plan: Optional[numpy.ndarray] = None) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]¶
Runs the environment for some number of steps and then returns it to its original state.
Works with mujoco gym and dm_control environments (with dmc2gym).
- Parameters
env (gym.wrappers.TimeLimit) – the environment.
initial_obs (np.ndarray) – the latest observation returned by the environment (only needed when agent is not None, to get the first action).
lookahead (int) – the number of steps to run. If plan is not None, it is overridden by len(plan).
agent (mbrl.planning.Agent, optional) – if given, an agent to obtain actions.
plan (sequence of np.ndarray, optional) – if given, a sequence of actions to execute. Takes precedence over agent when both are given.
- Returns
the observations, rewards, and actions observed, respectively.
- Return type
(tuple of np.ndarray)
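Example usage (a sketch: evaluate a candidate action sequence in the true simulator without disturbing its state, assuming env from the previous example):
    import numpy as np
    from mbrl.util.mujoco import rollout_mujoco_env

    obs = env.reset()
    plan = np.stack([env.action_space.sample() for _ in range(10)])
    observations, rewards, actions = rollout_mujoco_env(env, obs, lookahead=10, plan=plan)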
- mbrl.util.mujoco.set_env_state(state: Tuple, env: gym.wrappers.time_limit.TimeLimit)¶
Sets the state of the environment.
Assumes state was generated using get_current_state().
Works with mujoco gym and dm_control environments (with dmc2gym).
- Parameters
state (tuple) – see get_current_state() for a description.
env (gym.wrappers.TimeLimit) – the environment.