neurotorch.rl package¶
Submodules¶
neurotorch.rl.agent module¶
- class neurotorch.rl.agent.Agent(*, env: Env | None = None, observation_space: Space | None = None, action_space: Space | None = None, behavior_name: str | None = None, policy: BaseModel | None = None, policy_predict_method: str = '__call__', policy_kwargs: Dict[str, Any] | None = None, critic: BaseModel | None = None, critic_predict_method: str = '__call__', critic_kwargs: Dict[str, Any] | None = None, **kwargs)¶
Bases:
Module
- __init__(*, env: Env | None = None, observation_space: Space | None = None, action_space: Space | None = None, behavior_name: str | None = None, policy: BaseModel | None = None, policy_predict_method: str = '__call__', policy_kwargs: Dict[str, Any] | None = None, critic: BaseModel | None = None, critic_predict_method: str = '__call__', critic_kwargs: Dict[str, Any] | None = None, **kwargs)¶
Constructor for the Agent class.
- Parameters:
env (Optional[gym.Env]) – The environment.
observation_space (Optional[gym.spaces.Space]) – The observation space. Must be a single space, not a batched one. Must be provided if env is not provided; if env is provided, this argument is ignored.
action_space (Optional[gym.spaces.Space]) – The action space. Must be a single space, not a batched one. Must be provided if env is not provided; if env is provided, this argument is ignored.
behavior_name (Optional[str]) – The name of the behavior.
policy (BaseModel) – The model to use.
policy_kwargs (Optional[Dict[str, Any]]) –
The keyword arguments to pass to the policy if it is created by default. The keywords are:
default_hidden_units (List[int]): The default number of hidden units. Defaults to [256].
default_activation (str): The default activation function. Defaults to “ReLu”.
default_output_activation (str): The default output activation function. Defaults to “Identity”.
default_dropout (float): The default dropout rate. Defaults to 0.1.
all other keywords are passed to the Sequential constructor.
critic (BaseModel) – The value model to use.
critic_kwargs (Optional[Dict[str, Any]]) –
The keyword arguments to pass to the critic if it is created by default. The keywords are:
default_hidden_units (List[int]): The default number of hidden units. Defaults to [256].
default_activation (str): The default activation function. Defaults to “ReLu”.
default_output_activation (str): The default output activation function. Defaults to “Identity”.
default_n_values (int): The default number of values to output. Defaults to 1.
default_dropout (float): The default dropout rate. Defaults to 0.1.
all other keywords are passed to the Sequential constructor.
kwargs – Other keyword arguments.
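Example — a minimal sketch, assuming a gym-compatible environment such as “CartPole-v1” is installed (gymnasium can be substituted for gym if that is what your setup provides); the default policy and critic are then built from the keyword arguments documented above:
import gym
from neurotorch.rl.agent import Agent

# Build an agent directly from an environment; observation_space and
# action_space are then read from the env and do not need to be passed.
env = gym.make("CartPole-v1")
agent = Agent(
    env=env,
    behavior_name="cartpole",
    # Forwarded to the default policy builder (see the keywords above).
    policy_kwargs=dict(default_hidden_units=[64, 64], default_activation="ReLu"),
    # Forwarded to the default critic builder.
    critic_kwargs=dict(default_hidden_units=[64], default_n_values=1),
)
print(agent.observation_spec, agent.action_spec)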
- property action_spec: Dict[str, Any]¶
- property continuous_actions: List[str]¶
- copy(requires_grad: bool | None = None) Agent ¶
Copy the agent.
- Parameters:
requires_grad (Optional[bool]) – Whether to require gradients.
- Returns:
The copied agent.
- Return type:
Agent
- copy_critic(requires_grad: bool | None = None) BaseModel ¶
Copy the critic to a new instance.
- Returns:
The copied critic.
- copy_policy(requires_grad: bool | None = None) BaseModel ¶
Copy the policy to a new instance.
- Returns:
The copied policy.
- decay_continuous_action_variances()¶
- property device: device¶
The device of the agent.
- Returns:
The device of the agent.
- Return type:
torch.device
- property discrete_actions: List[str]¶
- format_batch_discrete_actions(actions: Tensor | Dict[str, Tensor], re_format: str = 'logits', **kwargs) Tensor | Dict[str, Tensor] ¶
Format the batch of actions. If actions is a dict, then it is assumed that the keys are the action names and the values are the actions. In this case, all the values where their keys are in self.discrete_actions will be formatted. If actions is a tensor, then the actions will be formatted if self.discrete_actions is not empty.
TODO: fragment this method into smaller methods.
- Parameters:
actions – The actions.
re_format – The format to reformat the actions to. Can be “logits”, “probs”, “index”, or “one_hot”.
kwargs – Keywords arguments.
- Returns:
The formatted actions.
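For illustration, a sketch that reformats a batch of raw logits (assuming agent is the discrete-action Agent built in the example above, so self.discrete_actions is not empty):
import torch

# Batch of 3 observations, 2 possible discrete actions -> raw logits.
logits = torch.tensor([[0.1, 2.0], [1.5, -0.3], [0.0, 0.0]])

indices = agent.format_batch_discrete_actions(logits, re_format="index")
probs = agent.format_batch_discrete_actions(logits, re_format="probs")
one_hot = agent.format_batch_discrete_actions(logits, re_format="one_hot")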
- forward(*args, **kwargs)¶
Call the agent.
- Returns:
The output of the agent.
- get_actions(obs: ndarray | Tensor | Dict[str, ndarray | Tensor], **kwargs) Any ¶
Get the actions for the given observations.
- Parameters:
obs (Union[np.ndarray, torch.Tensor, Dict[str, Union[np.ndarray, torch.Tensor]]]) – The observations. The observations must be batched.
kwargs – Keywords arguments.
- Keyword Arguments:
re_format (str) – The format to reformat the discrete actions to. Default is “index” which will return the index of the action. For other options see format_batch_discrete_actions().
as_numpy (bool) – Whether to return the actions as numpy arrays. Default is True.
- Returns:
The actions.
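Example — a sketch continuing the Agent above, assuming the gym/gymnasium API where reset() returns (obs, info) and step() returns a 5-tuple; the single observation is wrapped with a leading batch dimension because get_actions expects batched observations:
import numpy as np

obs, info = env.reset()
batched_obs = np.expand_dims(obs, axis=0)  # shape: (1, *obs_shape)

# Discrete actions come back as indices by default (re_format="index").
actions = agent.get_actions(batched_obs, re_format="index", as_numpy=True)
next_obs, reward, done, truncated, info = env.step(actions[0])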
- get_continuous_action_covariances()¶
- get_default_checkpoints_meta_path() str ¶
The path to the checkpoints meta file.
- Returns:
The path to the checkpoints meta file.
- Return type:
str
- get_random_actions(n_samples: int = 1, **kwargs) Any ¶
- get_values(obs: Tensor, **kwargs) Any ¶
Get the values for the given observations.
- Parameters:
obs – The batched observations.
kwargs – Keywords arguments.
- Returns:
The values.
- hard_update(policy)¶
- load_checkpoint(checkpoints_meta_path: str | None = None, load_checkpoint_mode: LoadCheckpointMode = LoadCheckpointMode.BEST_ITR, verbose: bool = True) dict ¶
Load the checkpoint from the checkpoints_meta_path. If the checkpoints_meta_path is None, the default checkpoints_meta_path is used.
- Parameters:
checkpoints_meta_path (Optional[str]) – The path to the checkpoints meta file.
load_checkpoint_mode (LoadCheckpointMode) – The mode to use when loading the checkpoint.
verbose (bool) – Whether to print the loaded checkpoint information.
- Returns:
The loaded checkpoint information.
- Return type:
dict
- property observation_spec: Dict[str, Any]¶
- set_continuous_action_variances_with_itr(itr: int)¶
- set_default_critic_kwargs()¶
- set_default_policy_kwargs()¶
- soft_update(policy, tau)¶
- to(*args, **kwargs)¶
Move and/or cast the parameters and buffers.
This can be called as
- to(device=None, dtype=None, non_blocking=False)
- to(dtype, non_blocking=False)
- to(tensor, non_blocking=False)
- to(memory_format=torch.channels_last)
Its signature is similar to torch.Tensor.to(), but only accepts floating point or complex dtypes. In addition, this method will only cast the floating point or complex parameters and buffers to dtype (if given). The integral parameters and buffers will be moved to device, if that is given, but with dtypes unchanged. When non_blocking is set, it tries to convert/move asynchronously with respect to the host if possible, e.g., moving CPU Tensors with pinned memory to CUDA devices. See below for examples.
Note
This method modifies the module in-place.
- Parameters:
device (torch.device) – the desired device of the parameters and buffers in this module
dtype (torch.dtype) – the desired floating point or complex dtype of the parameters and buffers in this module
tensor (torch.Tensor) – Tensor whose dtype and device are the desired dtype and device for all parameters and buffers in this module
memory_format (torch.memory_format) – the desired memory format for 4D parameters and buffers in this module (keyword only argument)
- Returns:
self
- Return type:
Module
Examples:
>>> # xdoctest: +IGNORE_WANT("non-deterministic")
>>> linear = nn.Linear(2, 2)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]])
>>> linear.to(torch.double)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1913, -0.3420],
        [-0.5113, -0.2325]], dtype=torch.float64)
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA1)
>>> gpu1 = torch.device("cuda:1")
>>> linear.to(gpu1, dtype=torch.half, non_blocking=True)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16, device='cuda:1')
>>> cpu = torch.device("cpu")
>>> linear.to(cpu)
Linear(in_features=2, out_features=2, bias=True)
>>> linear.weight
Parameter containing:
tensor([[ 0.1914, -0.3420],
        [-0.5112, -0.2324]], dtype=torch.float16)
>>> linear = nn.Linear(2, 2, bias=None).to(torch.cdouble)
>>> linear.weight
Parameter containing:
tensor([[ 0.3741+0.j,  0.2382+0.j],
        [ 0.5593+0.j, -0.4443+0.j]], dtype=torch.complex128)
>>> linear(torch.ones(3, 2, dtype=torch.cdouble))
tensor([[0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j],
        [0.6122+0.j, 0.1150+0.j]], dtype=torch.complex128)
neurotorch.rl.buffers module¶
- class neurotorch.rl.buffers.AgentsHistoryMaps(buffer: ReplayBuffer | None = None, **kwargs)¶
Bases:
object
Class to store the mapping between agents and their history maps
- trajectories¶
Mapping between agent ids and their trajectories
- Type:
Dict[int, Trajectory]
- cumulative_rewards¶
Mapping between agent ids and their cumulative rewards
- Type:
Dict[int, float]
- __init__(buffer: ReplayBuffer | None = None, **kwargs)¶
- clear() List[Trajectory] ¶
- property cumulative_rewards_as_array: ndarray¶
The cumulative rewards as an array
- Return type:
ndarray
- property experience_count: int¶
The number of experiences
- Return type:
int
- property max_abs_rewards: float¶
The maximum absolute reward
- Return type:
float
- property mean_cumulative_rewards: float¶
The mean cumulative rewards
- Return type:
float
- propagate_all() List[Trajectory] ¶
Propagate all the trajectories and return the finished ones.
- Returns:
The finished trajectories.
- Return type:
List[Trajectory]
- propagate_and_get_all() List[Trajectory] ¶
Propagate all the trajectories and return all the trajectories.
- Returns:
All the trajectories
- Return type:
List[Trajectory]
- property terminals_count: int¶
The number of terminal steps
- Return type:
int
- update_trajectories_(*, observations, actions, next_observations, rewards, terminals, truncated=None, infos=None, others=None) List[Trajectory] ¶
Updates the trajectories of the agents and returns the trajectories of the agents that have been terminated.
- Parameters:
observations – The observations
actions – The actions
next_observations – The next observations
rewards – The rewards
terminals – The terminal flags
truncated – The truncated flags
infos – The info dictionaries
others – Additional data to attach to the experiences
- Returns:
The terminated trajectories.
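A sketch of feeding the maps with one batched transition, using dummy arrays for two parallel agents; in practice the arrays come from a batched environment step (see env_batch_step in neurotorch.rl.utils below):
import numpy as np
from neurotorch.rl.buffers import AgentsHistoryMaps, ReplayBuffer

maps = AgentsHistoryMaps(buffer=ReplayBuffer(capacity=10_000))

# Two parallel agents with 4-dimensional observations (dummy values).
obs = np.zeros((2, 4), dtype=np.float32)
next_obs = np.ones((2, 4), dtype=np.float32)
actions = np.array([0, 1])
rewards = np.array([1.0, 0.5])
terminals = np.array([False, True])

finished = maps.update_trajectories_(
    observations=obs,
    actions=actions,
    next_observations=next_obs,
    rewards=rewards,
    terminals=terminals,
)
print(len(finished), "trajectory(ies) terminated on this step")
print(maps.experience_count, maps.mean_cumulative_rewards)

# At the end of data collection, flush the remaining trajectories.
remaining = maps.propagate_all()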
- class neurotorch.rl.buffers.BatchExperience(batch: List[Experience], device: device = device(type='cpu'))¶
Bases:
object
- __init__(batch: List[Experience], device: device = device(type='cpu'))¶
An object that contains a batch of experiences as tensors.
- Parameters:
batch – A list of Experience objects.
device – The device to use for the tensors.
- property device¶
- class neurotorch.rl.buffers.Experience(obs: Any, action: Any, reward: float, terminal: bool, next_obs: Any, discounted_reward: float | None = None, advantage: float | None = None, rewards_horizon: List[float] | None = None, others: dict | None = None)¶
Bases:
object
An experience contains the data of one Agent transition:
- Observation
- Action
- Reward
- Terminal flag
- Next Observation
- __init__(obs: Any, action: Any, reward: float, terminal: bool, next_obs: Any, discounted_reward: float | None = None, advantage: float | None = None, rewards_horizon: List[float] | None = None, others: dict | None = None)¶
- property advantage: float¶
- property discounted_reward: float¶
- property metrics¶
- property observation¶
- class neurotorch.rl.buffers.ReplayBuffer(capacity=inf, seed=None, use_priority=False, **kwargs)¶
Bases:
object
- __init__(capacity=inf, seed=None, use_priority=False, **kwargs)¶
- property capacity¶
- clear()¶
- property counter¶
- property empty¶
- extend(iterable: Iterable[Experience]) ReplayBuffer ¶
- property full¶
- get_batch_generator(batch_size: int, n_batches: int | None = None, randomize: bool = True, device='cpu') Iterator[BatchExperience] ¶
Returns a generator of batch_size elements from the buffer.
- get_batch_tensor(batch_size: int, device='cpu') BatchExperience ¶
Returns a batch of batch_size elements from the buffer as a BatchExperience.
- get_random_batch(batch_size: int) List[Experience] ¶
Returns a list of batch_size elements from the buffer.
- increase_capacity(increment: int)¶
- increment_counter(increment: int = 1)¶
- static load(filename: str) ReplayBuffer ¶
- reset_counter()¶
- save(filename: str)¶
- set_seed(seed: int)¶
- start_counter()¶
- stop_counter()¶
- store(element: Experience) ReplayBuffer ¶
Stores an element. If the replay buffer is already full, deletes the oldest element to make space.
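Example — a minimal sketch with dummy transitions (the file name is arbitrary):
import numpy as np
from neurotorch.rl.buffers import Experience, ReplayBuffer

buffer = ReplayBuffer(capacity=1_000)

# Store a few dummy transitions.
for t in range(8):
    buffer.store(Experience(
        obs=np.array([t, 0.0], dtype=np.float32),
        action=t % 2,
        reward=float(t),
        terminal=(t == 7),
        next_obs=np.array([t + 1, 0.0], dtype=np.float32),
    ))

# Iterate over mini-batches as BatchExperience objects on a given device.
for batch in buffer.get_batch_generator(batch_size=4, n_batches=2, randomize=True, device="cpu"):
    print(type(batch).__name__)

buffer.save("buffer.pkl")                  # persist to disk
restored = ReplayBuffer.load("buffer.pkl")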
- class neurotorch.rl.buffers.Trajectory(experiences: List[Experience] | None = None, gamma: float | None = None, **kwargs)¶
Bases:
object
A trajectory is a list of experiences.
- __init__(experiences: List[Experience] | None = None, gamma: float | None = None, **kwargs)¶
- append(experience: Experience)¶
- append_and_propagate(experience: Experience)¶
- compute_horizon_rewards()¶
- property cumulative_reward¶
- is_empty()¶
- make_rewards_horizon()¶
- propagate()¶
- propagate_rewards(gamma: float | None = 0.99)¶
Propagate the rewards to the next experiences.
- propagate_values(lmbda: float | None = 0.95)¶
- property propagated¶
- property terminal¶
- property terminal_reward¶
- property terminated¶
- update_others(others_list: List[dict])¶
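Example — a sketch that builds a two-step trajectory by hand and propagates the rewards (scalar observations are used only to keep the example short; propagate_rewards is assumed to fill the discounted_reward field of each Experience):
from neurotorch.rl.buffers import Experience, Trajectory

traj = Trajectory(gamma=0.99)
traj.append(Experience(obs=0.0, action=1, reward=0.0, terminal=False, next_obs=1.0))
traj.append(Experience(obs=1.0, action=0, reward=1.0, terminal=True, next_obs=2.0))

# Propagate the rewards with the given discount factor.
traj.propagate_rewards(gamma=0.99)
print(traj.cumulative_reward, traj.terminal_reward)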
neurotorch.rl.curriculum module¶
- class neurotorch.rl.curriculum.CompletionCriteria(measure: str, min_lesson_length: int, threshold: float)¶
Bases:
NamedTuple
Completion criteria for a lesson.
- static default_criteria() CompletionCriteria ¶
- measure: str¶
Alias for field number 0
- min_lesson_length: int¶
Alias for field number 1
- threshold: float¶
Alias for field number 2
- class neurotorch.rl.curriculum.Curriculum(name: str = 'Curriculum', description: str = '', lessons: List[Lesson] | None = None)¶
Bases:
object
- property channels¶
- property is_completed: bool¶
Returns True if the curriculum is completed, False otherwise.
- property lessons¶
- property map_repr: Dict[str, str]¶
- on_iteration_end(metrics: Dict[str, float]) CurriculumEndIterationOutput ¶
Called when an iteration ends.
- on_iteration_start()¶
Called when an iteration starts.
- property teacher_buffer: ReplayBuffer | None¶
Returns the current teacher buffer.
- property teachers¶
- update_channels(channels: List)¶
- update_teachers(teachers: List)¶
- update_teachers_and_channels(other: Curriculum)¶
- class neurotorch.rl.curriculum.CurriculumEndIterationOutput(messages: Dict[str, str], lesson_completed: bool)¶
Bases:
NamedTuple
Output of the curriculum when the end of an iteration is reached.
- lesson_completed: bool¶
Alias for field number 1
- messages: Dict[str, str]¶
Alias for field number 0
- class neurotorch.rl.curriculum.Lesson(name, channel, params: Dict[str, float], completion_criteria: CompletionCriteria = CompletionCriteria(measure='Rewards', min_lesson_length=1, threshold=0.9), teacher=None, teacher_strength: float | None = None)¶
Bases:
object
- UNPICKLABLE_ATTRIBUTES = ['_teacher', '_channel']¶
- __init__(name, channel, params: Dict[str, float], completion_criteria: CompletionCriteria = CompletionCriteria(measure='Rewards', min_lesson_length=1, threshold=0.9), teacher=None, teacher_strength: float | None = None)¶
- property channel¶
- check_completion_criteria(metrics: Dict[str, float]) bool ¶
Checks if the lesson is completed.
- property is_completed¶
Returns True if the lesson is completed, False otherwise.
- on_iteration_end(metrics: Dict[str, float]) bool ¶
Called when an iteration ends.
- set_result(result)¶
- start()¶
Starts the lesson.
- property teacher¶
- property teacher_buffer: ReplayBuffer | None¶
Returns the replay buffer for the lesson.
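Example — a minimal sketch; the lesson names, params and channel=None are illustrative placeholders (in a real setup the channel is typically the object used to push the lesson parameters to the environment):
from neurotorch.rl.curriculum import CompletionCriteria, Curriculum, Lesson

# Two lessons of increasing difficulty. A lesson presumably completes once the
# "Rewards" metric reaches the threshold for at least min_lesson_length iterations.
lessons = [
    Lesson(
        name="easy", channel=None, params={"difficulty": 0.1},
        completion_criteria=CompletionCriteria(measure="Rewards", min_lesson_length=5, threshold=0.5),
    ),
    Lesson(
        name="hard", channel=None, params={"difficulty": 1.0},
        completion_criteria=CompletionCriteria(measure="Rewards", min_lesson_length=5, threshold=0.9),
    ),
]
curriculum = Curriculum(name="MyCurriculum", lessons=lessons)

# Inside a training loop:
curriculum.on_iteration_start()
output = curriculum.on_iteration_end(metrics={"Rewards": 0.7})
print(output.lesson_completed, curriculum.is_completed)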
neurotorch.rl.ppo module¶
- class neurotorch.rl.ppo.PPO(agent: Agent | None = None, optimizer: Optimizer | None = None, **kwargs)¶
Bases:
LearningAlgorithm
Apply the Proximal Policy Optimization algorithm to the given model. The algorithm is described in the paper Proximal Policy Optimization Algorithms (https://arxiv.org/abs/1707.06347).
- CHECKPOINT_OPTIMIZER_STATE_DICT_KEY: str = 'optimizer_state_dict'¶
- __init__(agent: Agent | None = None, optimizer: Optimizer | None = None, **kwargs)¶
Constructor of the PPO algorithm.
- Parameters:
agent (Agent) – The agent to train.
optimizer (torch.optim.Optimizer) – The optimizer to use.
kwargs – Additional keyword arguments.
- Keyword Arguments:
clip_ratio (float) – The clipping ratio for the policy loss.
tau (float) – The smoothing factor for the policy update.
gamma (float) – The discount factor.
gae_lambda (float) – The lambda parameter for the generalized advantage estimation (GAE).
critic_weight (float) – The weight of the critic loss.
entropy_weight (float) – The weight of the entropy loss.
critic_criterion (torch.nn.Module) – The loss function to use for the critic.
advantages=returns-values (bool) – This keyword is introduced to fix a bug when using the GAE. If set to True, the advantages are computed as the returns minus the values. If set to False, the advantages are computed as in the PPO paper. The default value is False; it is recommended to try setting it to True if the agent doesn’t seem to learn.
max_grad_norm (float) – The maximum L2 norm of the gradient. Default is 0.5.
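Example — a sketch; the hyper-parameter values are illustrative and agent is assumed to be the Agent built in the example above (Agent is a torch Module, so its parameters() can be handed to the optimizer):
import torch
from neurotorch.rl.ppo import PPO

ppo = PPO(
    agent=agent,
    optimizer=torch.optim.Adam(agent.parameters(), lr=3e-4),
    clip_ratio=0.2,
    gamma=0.99,
    gae_lambda=0.95,
    critic_weight=0.5,
    entropy_weight=0.01,
    max_grad_norm=0.5,
)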
- property agent¶
- property critic¶
- get_actions_from_batch(batch: BatchExperience) Tensor ¶
Get the actions for the provided batch
- get_advantages_from_batch(batch: BatchExperience) Tensor ¶
Computes the advantages for the provided batch
- get_checkpoint_state(trainer, **kwargs) object ¶
Get the state of the callback. This is called when the checkpoint manager saves the state of the trainer. Then this state is saved in the checkpoint file with the name of the callback as the key.
- Parameters:
trainer (Trainer) – The trainer.
- Returns:
The state of the callback.
- Return type:
A pickleable object.
- get_returns_from_batch(batch: BatchExperience) Tensor ¶
Computes the returns for the provided batch
- get_values_from_batch(batch: BatchExperience) Tensor ¶
Computes the values for the provided batch
- property last_policy¶
- load_checkpoint_state(trainer, checkpoint: dict, **kwargs)¶
Loads the state of the callback from a dictionary.
- Parameters:
trainer (Trainer) – The trainer.
checkpoint (dict) – The dictionary containing all the states of the trainer.
- Returns:
None
- on_iteration_begin(trainer, **kwargs)¶
Called when an iteration starts. An iteration is defined as one full pass through the training dataset and the validation dataset.
- Parameters:
trainer (Trainer) – The trainer.
- Returns:
None
- on_optimization_begin(trainer, **kwargs)¶
Called when the optimization phase of an iteration starts. The optimization phase is defined as the moment where the model weights are updated.
- Parameters:
trainer (Trainer) – The trainer.
kwargs – Additional arguments.
- Keyword Arguments:
x – The input data.
y – The target data.
pred – The predicted data.
- Returns:
None
- on_optimization_end(trainer, **kwargs)¶
Called when the optimization phase of an iteration ends. The optimization phase is defined as the moment where the model weights are updated.
- Parameters:
trainer (Trainer) – The trainer.
- Returns:
None
- on_pbar_update(trainer, **kwargs) dict ¶
Called when the progress bar is updated.
- Parameters:
trainer (Trainer) – The trainer.
kwargs – Additional arguments.
- Returns:
None
- on_trajectory_end(trainer, trajectory, **kwargs) List[Dict[str, Any]] ¶
Called when a trajectory ends. This is used in reinforcement learning to update the trajectory loss and metrics. Must return a list of dictionaries containing the trajectory metrics. The list must have the same length as the trajectory. Each item in the list will update the attribute others of the corresponding Experience.
- Parameters:
trainer (Trainer) – The trainer.
trajectory (Trajectory) – The trajectory i.e. the sequence of Experiences.
kwargs – Additional arguments.
- Returns:
A list of dictionaries containing the trajectory metrics.
- property policy¶
- start(trainer, **kwargs)¶
Called when the training starts. This is the first callback called.
- Parameters:
trainer (Trainer) – The trainer.
- Returns:
None
- update_params(batch: BatchExperience) float ¶
Performs a single update of the policy network using the provided optimizer and buffer
neurotorch.rl.rl_academy module¶
- class neurotorch.rl.rl_academy.GenTrajectoriesOutput(buffer, cumulative_rewards, agents_history_maps, trajectories)¶
Bases:
NamedTuple
- agents_history_maps: AgentsHistoryMaps¶
Alias for field number 2
- buffer: ReplayBuffer¶
Alias for field number 0
- cumulative_rewards: ndarray¶
Alias for field number 1
- trajectories: List[Trajectory] | None¶
Alias for field number 3
- class neurotorch.rl.rl_academy.RLAcademy(agent: Agent, *, predict_method: str = '__call__', learning_algorithm: LearningAlgorithm | None = None, callbacks: List[BaseCallback] | CallbacksList | BaseCallback | None = None, verbose: bool = True, **kwargs)¶
Bases:
Trainer
- CUM_REWARDS_METRIC_KEY = 'cum_rewards'¶
- TERMINAL_REWARDS_METRIC_KEY = 'terminal_rewards'¶
- __init__(agent: Agent, *, predict_method: str = '__call__', learning_algorithm: LearningAlgorithm | None = None, callbacks: List[BaseCallback] | CallbacksList | BaseCallback | None = None, verbose: bool = True, **kwargs)¶
Constructor for Trainer.
- Parameters:
model – Model to train.
criterion – Loss function(s) to use. Deprecated, use learning_algorithm instead.
regularization –
Regularization(s) to use. In NeuroTorch, there are two ways to do regularization:
1. Regularization can be specified in the layers with the ‘update_regularization_loss’ method. This regularization will be performed by the same optimizer as the main loss. This way is useful when you want a regularization that depends on the model output or hidden state.
2. Regularization can be specified in the trainer with the ‘regularization’ parameter. This regularization will be performed by a separate optimizer named ‘regularization_optimizer’. This way is useful when you want a regularization that depends only on the model parameters and when you want to control the learning rate of the regularization independently of the main loss.
- Note: This parameter will be deprecated and removed in a future version. The regularization will then be specified in the learning algorithm and/or in the callbacks.
optimizer – Optimizer to use for the main loss. Deprecated. Use learning_algorithm instead.
learning_algorithm – Learning algorithm to use for the main loss. This learning algorithm can be given in the callbacks list as well. If specified, this learning algorithm will be added to the callbacks list. In this case, make sure that the learning algorithm is not added twice. Note that multiple learning algorithms can be used in the callbacks list.
regularization_optimizer – Optimizer to use for the regularization loss.
metrics – Metrics to compute during training.
callbacks – Callbacks to use during training. Each callback will be called at different moments; see the documentation of BaseCallback for more information.
device – Device to use for the training. Default is the device of the model.
verbose – Whether to print information during training.
kwargs – Additional arguments of the training.
- Keyword Arguments:
n_epochs (int) – The number of epochs to train at each iteration. Default is 1.
lr (float) – Learning rate of the main optimizer. Default is 1e-3.
reg_lr (float) – Learning rate of the regularization optimizer. Default is 1e-2.
weight_decay (float) – Weight decay of the main optimizer. Default is 0.0.
exec_metrics_on_train (bool) – Whether to compute metrics on the train dataset. Set it to False to save time by skipping the metric computation on the train dataset. Default is True.
x_transform – Transform to apply to the input data before passing it to the model.
y_transform – Transform to apply to the target data before passing it to the model. For example, this can be used to convert the target data to a one-hot encoding or to long tensor using nt.ToTensor(dtype=torch.long).
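Example — a sketch wiring the academy together, assuming agent and ppo are the instances from the examples above:
from neurotorch.rl.rl_academy import RLAcademy

academy = RLAcademy(
    agent=agent,
    learning_algorithm=ppo,   # any LearningAlgorithm; it is added to the callbacks
    verbose=True,
    n_epochs=3,               # keyword argument documented above
)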
- close()¶
- copy_agent(requires_grad: bool = False) Agent ¶
Copy the agent to a new instance.
- Returns:
The copied agent.
- copy_policy(requires_grad: bool = False) BaseModel ¶
Copy the policy to a new instance.
- Returns:
The copied policy.
- property env¶
- generate_trajectories(*, n_trajectories: int | None = None, n_experiences: int | None = None, buffer: ReplayBuffer | None = None, epsilon: float = 0.0, p_bar_position: int = 0, verbose: bool | None = None, **kwargs) GenTrajectoriesOutput ¶
Generate trajectories using the current policy. If the policy of the agent is in evaluation mode, the actions will be chosen with the argmax method. If the policy is in training mode and a random number drawn is less than epsilon, a random action will be chosen. Otherwise, the action will be sampled from the policy output.
- Parameters:
n_trajectories (int) – Number of trajectories to generate. If not specified, the number of trajectories will be calculated based on the number of experiences.
n_experiences (int) – Number of experiences to generate. If not specified, the number of experiences will be calculated based on the number of trajectories.
buffer (ReplayBuffer) – The buffer to store the experiences.
epsilon (float) – The probability of choosing a random action.
p_bar_position (int) – The position of the progress bar.
verbose (bool) – Whether to show the progress bar.
kwargs – Additional arguments.
- Keyword Arguments:
env (gym.Env) – The environment to generate the trajectories. Will update the “env” of the current_state.
observation – The initial observation. If not specified, the observation will be taken from the objects of the current_state attribute and, if not available, the environment will be reset.
info – The initial info. If not specified, the info will be taken from the objects of the current_state attribute and, if not available, the environment will be reset.
- Returns:
The buffer with the generated experiences, the cumulative rewards, the agents history maps, and the trajectories, returned as a GenTrajectoriesOutput.
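Example — a sketch using the academy built above; epsilon=0.1 means roughly 10% of the actions are random:
out = academy.generate_trajectories(n_trajectories=16, epsilon=0.1)
buffer, cum_rewards = out.buffer, out.cumulative_rewards
print(cum_rewards.mean(), len(out.trajectories or []))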
- reset_agents_history_maps_meta()¶
- static set_default_academy_kwargs(**kwargs) Dict[str, Any] ¶
Set default values for the kwargs of the fit method.
- Parameters:
kwargs –
close_env: Whether to close the environment after the training.
n_epochs: Number of epochs to train each iteration.
init_lr: Initial learning rate.
min_lr: Minimum learning rate.
weight_decay: Weight decay.
init_epsilon: Initial epsilon. Epsilon is the probability of choosing a random action.
epsilon_decay: Epsilon decay.
min_epsilon: Minimum epsilon.
gamma: Discount factor.
tau: Target network update rate.
n_batches: Number of batches to train each iteration.
batch_size: Batch size.
update_freq: Number of steps between each update.
curriculum_strength: Strength of the teacher learning strategy.
- Returns:
- train(env, n_iterations: int | None = None, *, n_epochs: int = 10, load_checkpoint_mode: LoadCheckpointMode | None = None, force_overwrite: bool = False, p_bar_position: int | None = None, p_bar_leave: bool | None = None, **kwargs) TrainingHistory ¶
Train the model.
- Parameters:
train_dataloader (DataLoader) – The dataloader for the training set. It contains the training data.
val_dataloader (Optional[DataLoader]) – The dataloader for the validation set. It contains the validation data.
n_iterations (Optional[int]) – The number of iterations to train the model. An iteration is a pass over the training set and the validation set. If None, the model will be trained until the training is stopped by the user.
n_epochs (int) – The number of epochs to train the model. An epoch is a pass over the training set. The nomenclature here is different from what is usually used elsewhere. Here, an epoch is a pass over the training set, while an iteration is a pass over the training set and the validation set. In other words, if n_iterations=1 and n_epochs=10, the trainer will pass 10 times over the training set and 1 time over the validation set (this will constitute 1 iteration). If n_iterations=10 and n_epochs=1, the trainer will pass 10 times over the training set and 10 times over the validation set (this will constitute 10 iterations). The nuance between those terms is really important when it comes to reinforcement learning. Default is 10.
load_checkpoint_mode (LoadCheckpointMode) – The mode to use when loading the checkpoint.
force_overwrite (bool) – Whether to force overwriting the checkpoint. Be careful when using this option, as it will destroy the previous checkpoint folder. Default is False.
p_bar_position (Optional[int]) – The position of the progress bar. See tqdm documentation for more information.
p_bar_leave (Optional[bool]) – Whether to leave the progress bar. See tqdm documentation for more information.
kwargs – Additional keyword arguments.
- Returns:
The training history.
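Example — a sketch tying the pieces together; the environment is the one used to build the agent and the iteration counts are illustrative:
history = academy.train(
    env,
    n_iterations=100,
    n_epochs=10,
    force_overwrite=False,
)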
neurotorch.rl.utils module¶
- class neurotorch.rl.utils.Linear(input_size: int | Dimension | Iterable[int | Dimension] | Size | None = None, output_size: int | Dimension | Iterable[int | Dimension] | Size | None = None, name: str | None = None, device: device | None = None, **kwargs)¶
Bases:
BaseNeuronsLayer
- __init__(input_size: int | Dimension | Iterable[int | Dimension] | Size | None = None, output_size: int | Dimension | Iterable[int | Dimension] | Size | None = None, name: str | None = None, device: device | None = None, **kwargs)¶
Initialize the layer. See the BaseLayer class for more details.
- Parameters:
input_size (Optional[SizeTypes]) – The input size of the layer.
output_size (Optional[SizeTypes]) – The output size of the layer.
name (Optional[str]) – The name of the layer.
use_recurrent_connection (bool) – Whether to use a recurrent connection. Default is True.
use_rec_eye_mask (bool) – Whether to use a recurrent eye mask. Default is False. This mask is used to set the diagonal of the recurrent connection matrix to zero.
dt (float) – The time step of the layer. Default is 1e-3.
kwargs – Other keyword arguments.
- Keyword Arguments:
regularize (bool) – Whether to regularize the layer. If True, the method update_regularization_loss will be called after each forward pass. Defaults to False.
hh_init (str) – The initialization method for the hidden state. Defaults to “zeros”.
hh_init_mu (float) – The mean of the hidden state initialization when hh_init is random. Defaults to 0.0.
hh_init_std (float) – The standard deviation of the hidden state initialization when hh_init is random. Defaults to 1.0.
hh_init_seed (int) – The seed of the hidden state initialization when hh_init is random. Defaults to 0.
force_dale_law (bool) – Whether to force the Dale’s law in the layer’s weights. Defaults to False.
forward_sign (Union[torch.Tensor, float]) – If force_dale_law is True, this parameter will be used to initialize the forward_sign vector. If it is a float, the forward_sign vector will be initialized with this value as the ratio of inhibitory neurons. If it is a tensor, it will be used as the forward_sign vector.
recurrent_sign (Union[torch.Tensor, float]) – If force_dale_law is True, this parameter will be used to initialize the recurrent_sign vector. If it is a float, the recurrent_sign vector will be initialized with this value as the ratio of inhibitory neurons. If it is a tensor, it will be used as the recurrent_sign vector.
sign_activation (Callable) – The activation function used to compute the sign of the weights i.e. the forward_sign and recurrent_sign vectors. Defaults to torch.nn.Tanh.
- build() Linear ¶
Build the layer. This method must be called after the layer is initialized to make sure that the layer is ready to be used, e.g. the input and output sizes are set, the weights are initialized, etc.
In this method the forward_weights, recurrent_weights and rec_mask attributes are created, and finally the method initialize_weights_() is called.
- Returns:
The layer itself.
- Return type:
Linear
- create_empty_state(batch_size: int = 1, **kwargs) Tuple[Tensor, ...] ¶
Create an empty state for the layer. This method must be implemented by the child class.
- Parameters:
batch_size (int) – The batch size of the state.
- Returns:
The empty state.
- Return type:
Tuple[torch.Tensor, …]
- forward(inputs: Tensor, state: Tuple[Tensor, ...] | None = None, **kwargs)¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- initialize_weights_()¶
Initialize the weights of the layer. This method must be implemented by the child class.
- Returns:
None
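Example — a minimal sketch of the layer on its own (the sizes are arbitrary; whether the call returns a bare tensor or a (tensor, state) tuple is determined by the BaseNeuronsLayer base class):
import torch
from neurotorch.rl.utils import Linear

layer = Linear(input_size=4, output_size=2, name="head")
layer.build()   # must be called after initialization, before the first forward pass
output = layer(torch.randn(8, 4))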
- class neurotorch.rl.utils.TrainingHistoriesMap(curriculum: Curriculum | None = None)¶
Bases:
object
- REPORT_KEY = 'report'¶
- __init__(curriculum: Curriculum | None = None)¶
- append(key, value)¶
- concat(other)¶
- max(key=None)¶
- plot(save_path=None, show=False, lesson_idx: int | str | None = None, **kwargs)¶
- plot_history(history_name: str, save_path=None, show=False, **kwargs)¶
- property report_history: TrainingHistory¶
- class neurotorch.rl.utils.TrajectoryRenderer(trajectory: Trajectory, env: Env | None = None, **kwargs)¶
Bases:
object
- __init__(trajectory: Trajectory, env: Env | None = None, **kwargs)¶
- check_simulate_is_needed()¶
- render(**kwargs) Tuple[Figure, Axes, FuncAnimation] ¶
- simulate()¶
- to_file(file_path: str, fps: int = 30, **kwargs)¶
- to_gif(file_path: str, fps: int = 30, **kwargs)¶
- to_mp4(file_path: str, fps: int = 30, **kwargs)¶
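Example — a sketch; trajectory is assumed to be one of the Trajectory objects returned by RLAcademy.generate_trajectories above (when trajectories are returned), and the environment must be able to produce frames (e.g. created with render_mode="rgb_array"):
from neurotorch.rl.utils import TrajectoryRenderer

trajectory = out.trajectories[0]   # from the generate_trajectories example above
renderer = TrajectoryRenderer(trajectory=trajectory, env=env)
fig, ax, anim = renderer.render()
renderer.to_gif("trajectory.gif", fps=30)   # to_mp4 / to_file work the same way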
- neurotorch.rl.utils.batch_dict_of_items(x: Any) Any ¶
- neurotorch.rl.utils.batch_numpy_actions(actions, env: Env | None = None)¶
- neurotorch.rl.utils.continuous_actions_distribution(actions: Dict | Tensor | ndarray, covariance: Dict | Tensor | ndarray | None = None) Dict | Distribution ¶
Creates a continuous action distribution from the actions and the covariance.
- Parameters:
actions (Union[Dict, torch.Tensor, np.ndarray]) – The actions.
covariance (Optional[Union[Dict, torch.Tensor, np.ndarray]]) – The covariance of the actions. If None, a diagonal covariance is assumed using the variance of the given actions.
- Returns:
The action distribution.
- Return type:
Union[Dict, torch.distributions.Distribution]
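Example — a minimal sketch with a batch of 2-dimensional continuous actions; since covariance is None, a diagonal covariance is built from the variance of the given actions:
import torch
from neurotorch.rl.utils import continuous_actions_distribution

actions = torch.randn(32, 2)                      # batch of continuous actions
dist = continuous_actions_distribution(actions)   # a torch.distributions.Distribution
sample = dist.sample()
log_prob = dist.log_prob(actions)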
- neurotorch.rl.utils.discounted_cumulative_sums(x, discount, axis=-1, **kwargs)¶
- neurotorch.rl.utils.env_batch_render(env: Env, **kwargs) List[Any] ¶
Render the environment in batch mode.
- Parameters:
env (gym.Env) – The environment.
- neurotorch.rl.utils.env_batch_reset(env: Env) Tuple[ndarray, ndarray] ¶
Reset the environment in batch mode.
- Parameters:
env (gym.Env) – The environment.
- Returns:
The batch of observations and the batch of infos.
- Return type:
Tuple[np.ndarray, np.ndarray]
- neurotorch.rl.utils.env_batch_step(env: Env, actions: Any) Tuple[ndarray, ndarray, ndarray, ndarray, ndarray] ¶
Step the environment in batch mode.
- Parameters:
env (gym.Env) – The environment.
actions (Any) – The actions to take.
- Returns:
The batch of observations, rewards, dones, truncated and infos.
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]
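Example — a sketch with a single (non-vectorized) gym environment; the exact shape handling of batch_numpy_actions is an assumption here, it is used only to give the sampled action a batch dimension:
import gym
from neurotorch.rl.utils import (
    batch_numpy_actions, env_batch_reset, env_batch_step,
    get_single_action_space, sample_action_space,
)

env = gym.make("CartPole-v1")
obs, infos = env_batch_reset(env)                 # batched, even for a single env
action = sample_action_space(get_single_action_space(env))
actions = batch_numpy_actions(action, env=env)    # presumably adds the batch dimension
obs, rewards, dones, truncated, infos = env_batch_step(env, actions)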
- neurotorch.rl.utils.format_numpy_actions(actions, env: Env)¶
- neurotorch.rl.utils.get_item_from_batch(x: Any, i: int) Any ¶
- neurotorch.rl.utils.get_single_action_space(env: Env) Space ¶
Return the action space of a single environment.
- Parameters:
env (gym.Env) – The environment.
- Returns:
The action space.
- Return type:
gym.spaces.Space
- neurotorch.rl.utils.get_single_observation_space(env: Env) Space ¶
Return the observation space of a single environment.
- Parameters:
env (gym.Env) – The environment.
- Returns:
The observation space.
- Return type:
gym.spaces.Space
- neurotorch.rl.utils.obs_batch_to_sequence(obs: Tensor | Dict[str, Tensor], as_numpy: bool = False) Sequence[ndarray | Tensor | Dict[str, ndarray | Tensor]] ¶
Convert a batch of observations to a sequence of observations.
- Parameters:
obs (Union[torch.Tensor, Dict[str, torch.Tensor]]) – The batch of observations.
as_numpy (bool) – Whether to convert the observations to numpy arrays.
- Returns:
The sequence of observations.
- Return type:
Sequence[Union[np.ndarray, torch.Tensor, Dict[str, Union[np.ndarray, torch.Tensor]]]]
- neurotorch.rl.utils.obs_sequence_to_batch(obs: Sequence[ndarray | Tensor | Dict[str, ndarray | Tensor]]) Tensor | Dict[str, Tensor] ¶
Convert a sequence of observations to a batch of observations.
- Parameters:
obs (Sequence[Union[np.ndarray, torch.Tensor, Dict[str, Union[np.ndarray, torch.Tensor]]]]) – The sequence of observations.
- Returns:
The batch of observations.
- Return type:
Union[torch.Tensor, Dict[str, torch.Tensor]]
- neurotorch.rl.utils.sample_action_space(action_space: Space, re_format: str = 'raw')¶
Sample an action from the action space.
- Parameters:
action_space (gym.spaces.Space) – The action space.
re_format (str) – The format to return the action in.
- Returns:
The sampled action.
- Return type:
Any
- neurotorch.rl.utils.space_to_continuous_shape(space: Space, flatten_spaces=False) Tuple[int, ...] | Dict[str, Tuple[int, ...]] ¶
- neurotorch.rl.utils.space_to_spec(space: Space) Dict[str, Space] ¶