malib.common package

Submodules

malib.common.distributions module

Probability distributions. Reference: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/distributions.py

class malib.common.distributions.BernoulliDistribution(action_dims: int)[source]

Bases: Distribution

Bernoulli distribution for MultiBinary action spaces.

Parameters:

action_dims – Number of binary actions

actions_from_params(action_logits: Tensor, deterministic: bool = False) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

entropy() Tensor[source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

log_prob(actions: Tensor) Tensor[source]

Returns the log likelihood

Parameters:

actions – the taken actions

Returns:

The log likelihood of the distribution

log_prob_from_params(action_logits: Tensor) Tuple[Tensor, Tensor][source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

proba_distribution(action_logits: Tensor) BernoulliDistribution[source]

Set parameters of the distribution.

Returns:

self

proba_distribution_net(latent_dim: int) Module[source]

Create the layer that represents the distribution: it will be the logits of the Bernoulli distribution.

Parameters:

latent_dim – Dimension of the last layer of the policy network (before the action layer)

Returns:

the layer that outputs the Bernoulli logits (a torch.nn.Module)

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action
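
A minimal usage sketch of the flow above (create the logits layer, bind parameters, then sample); the latent batch and shapes are illustrative, not prescribed by the API:

    import torch
    from malib.common.distributions import BernoulliDistribution

    latent_dim, n_binary_actions = 64, 4
    dist = BernoulliDistribution(n_binary_actions)
    action_net = dist.proba_distribution_net(latent_dim)   # module mapping latent -> logits

    latent = torch.randn(8, latent_dim)        # output of a hypothetical policy trunk
    logits = action_net(latent)
    dist.proba_distribution(logits)            # bind the distribution parameters

    actions = dist.sample()                    # stochastic binary actions
    log_prob = dist.log_prob(actions)
    entropy = dist.entropy()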

class malib.common.distributions.CategoricalDistribution(action_dim: int)[source]

Bases: Distribution

Categorical distribution for discrete actions.

Parameters:

action_dim (int) – Number of discrete actions.

actions_from_params(action_logits: Tensor, deterministic: bool = False) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

entropy() Tensor[source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

log_prob(actions: Tensor) Tensor[source]

Returns the log likelihood

Parameters:

actions – the taken actions

Returns:

The log likelihood of the distribution

log_prob_from_params(action_logits: Tensor, deterministic: bool = False) Tuple[Tensor, Tensor][source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

prob() Tensor[source]

Return a tensor that represents the distribution

Returns:

A distribution tensor

Return type:

torch.Tensor

proba_distribution(action_logits: Tensor, action_mask: Optional[Tensor] = None) CategoricalDistribution[source]

Set parameters of the distribution.

Returns:

self

proba_distribution_net(latent_dim: int) Module[source]

Create the layer that represents the distribution: it will be the logits of the Categorical distribution. You can then get probabilities using a softmax.

Parameters:

latent_dim – Dimension of the last layer of the policy network (before the action layer)

Returns:

the layer that outputs the Categorical logits (a torch.nn.Module)

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action
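
The same flow for discrete actions; the optional action_mask follows the proba_distribution signature above, and the mask convention (1 marks a valid action) is an assumption:

    import torch
    from malib.common.distributions import CategoricalDistribution

    latent_dim, n_actions = 64, 5
    dist = CategoricalDistribution(n_actions)
    action_net = dist.proba_distribution_net(latent_dim)

    logits = action_net(torch.randn(8, latent_dim))
    mask = torch.ones(8, n_actions)            # assumed: 1 marks a valid action
    dist.proba_distribution(logits, action_mask=mask)

    greedy = dist.mode()                       # most likely (deterministic) action
    sampled = dist.sample()                    # stochastic action
    log_prob = dist.log_prob(sampled)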

class malib.common.distributions.DiagGaussianDistribution(action_dim: int)[source]

Bases: Distribution

Gaussian distribution with diagonal covariance matrix, for continuous actions.

Parameters:

action_dim – Dimension of the action space.

actions_from_params(mean_actions: Tensor, log_std: Tensor, deterministic: bool = False) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

entropy() Tensor[source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

log_prob(actions: Tensor) Tensor[source]

Get the log probabilities of actions according to the distribution. Note that you must first call the proba_distribution() method.

Parameters:

actions

Returns:

the log probabilities of the given actions

log_prob_from_params(mean_actions: Tensor, log_std: Tensor) Tuple[Tensor, Tensor][source]

Compute the log probability of taking an action given the distribution parameters.

Parameters:
  • mean_actions

  • log_std

Returns:

actions and their log probabilities

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

prob() Tensor[source]

Return a tensor that represents the distribution

Returns:

A distribution tensor

Return type:

torch.Tensor

proba_distribution(mean_actions: Tensor, log_std: Tensor) DiagGaussianDistribution[source]

Create the distribution given its parameters (mean, std)

Parameters:
  • mean_actions

  • log_std

Returns:

self

proba_distribution_net(latent_dim: int, log_std_init: float = 0.0) Tuple[Module, Parameter][source]

Create the layers and parameter that represent the distribution: one output will be the mean of the Gaussian, and the other parameter will be the standard deviation (the log std in practice, to allow negative values)

Parameters:
  • latent_dim – Dimension of the last layer of the policy (before the action layer)

  • log_std_init – Initial value for the log standard deviation

Returns:

the mean action layer and the log std parameter

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action
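
For continuous actions, proba_distribution_net returns both the mean layer and the log std parameter; a sketch assuming a batch of 8 latent vectors:

    import torch
    from malib.common.distributions import DiagGaussianDistribution

    dist = DiagGaussianDistribution(action_dim=2)
    mean_net, log_std = dist.proba_distribution_net(latent_dim=64, log_std_init=0.0)

    mean_actions = mean_net(torch.randn(8, 64))
    dist.proba_distribution(mean_actions, log_std)
    actions = dist.sample()
    # or perform both steps at once:
    actions, log_prob = dist.log_prob_from_params(mean_actions, log_std)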

class malib.common.distributions.Distribution[source]

Bases: ABC

Abstract base class for distributions.

abstract actions_from_params(*args, **kwargs) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

abstract entropy() Optional[Tensor][source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

get_actions(deterministic: bool = False) Tensor[source]

Return actions according to the probability distribution.

Parameters:

deterministic

Returns:

the selected actions
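
In the stable-baselines3 reference this simply switches between mode() and sample(); a sketch of that presumed behavior (not verified against malib's own implementation):

    def get_actions(self, deterministic: bool = False):
        # presumed behavior, mirroring the stable-baselines3 reference
        if deterministic:
            return self.mode()
        return self.sample()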

abstract log_prob(x: Tensor) Tensor[source]

Returns the log likelihood

Parameters:

x – the taken action

Returns:

The log likelihood of the distribution

abstract log_prob_from_params(*args, **kwargs) Tuple[Tensor, Tensor][source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

abstract mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

abstract prob() Tensor[source]

Return a tensor that represents the distribution

Returns:

A distribution tensor

Return type:

torch.Tensor

abstract proba_distribution(*args, **kwargs) Distribution[source]

Set parameters of the distribution.

Returns:

self

abstract proba_distribution_net(*args, **kwargs) Union[Module, Tuple[Module, Parameter]][source]

Create the layers and parameters that represent the distribution.

Subclasses must define this, but the arguments and return type vary between concrete classes.

abstract sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

class malib.common.distributions.MaskedCategorical(scores, mask=None)[source]

Bases: object

property entropy
log_prob(value)[source]
property logits
static masked_softmax(logits, mask)[source]

This method returns a valid probability distribution for each instance whose corresponding row in the mask matrix is not a zero vector; otherwise, a uniform distribution is returned for that row. This is a technical workaround that allows the Categorical class to be used: if the probabilities did not sum to one, sampling would raise an exception.

property normalized_entropy
property probs
rsample(temperature=None, gumbel_noise=None)[source]
sample()[source]
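
A sketch of direct use; the mask row convention (non-zero marks a valid action) follows masked_softmax above, and the mask dtype is an assumption:

    import torch
    from malib.common.distributions import MaskedCategorical

    scores = torch.randn(4, 6)                       # unnormalized logits
    mask = torch.tensor([[1, 1, 0, 0, 1, 1]] * 4)    # assumed: non-zero marks valid actions
    dist = MaskedCategorical(scores, mask=mask)

    probs = dist.probs            # masked-out entries receive (near-)zero probability
    action = dist.sample()
    log_prob = dist.log_prob(action)
    entropy = dist.entropy        # property, per the listing above
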
class malib.common.distributions.MultiCategoricalDistribution(action_dims: List[int])[source]

Bases: Distribution

MultiCategorical distribution for multi discrete actions.

Parameters:

action_dims – List of sizes of discrete action spaces

actions_from_params(action_logits: Tensor, deterministic: bool = False) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

entropy() Tensor[source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

log_prob(actions: Tensor) Tensor[source]

Returns the log likelihood

Parameters:

actions – the taken actions

Returns:

The log likelihood of the distribution

log_prob_from_params(action_logits: Tensor) Tuple[Tensor, Tensor][source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

proba_distribution(action_logits: Tensor) MultiCategoricalDistribution[source]

Set parameters of the distribution.

Returns:

self

proba_distribution_net(latent_dim: int) Module[source]

Create the layer that represents the distribution: it will be the logits (flattened) of the MultiCategorical distribution. You can then get probabilities using a softmax on each sub-space.

Parameters:

latent_dim – Dimension of the last layer of the policy network (before the action layer)

Returns:

the layer that outputs the flattened MultiCategorical logits (a torch.nn.Module)

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action
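
For multi-discrete actions the logits are flattened across sub-spaces; a sketch with three sub-spaces:

    import torch
    from malib.common.distributions import MultiCategoricalDistribution

    action_dims = [3, 4, 2]
    dist = MultiCategoricalDistribution(action_dims)
    action_net = dist.proba_distribution_net(latent_dim=64)   # outputs sum(action_dims) logits

    logits = action_net(torch.randn(8, 64))
    dist.proba_distribution(logits)
    actions = dist.sample()             # one index per sub-space
    log_prob = dist.log_prob(actions)   # joint log-probability across sub-spaces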

class malib.common.distributions.SquashedDiagGaussianDistribution(action_dim: int, epsilon: float = 1e-06)[source]

Bases: DiagGaussianDistribution

Gaussian distribution with diagonal covariance matrix, followed by a squashing function (tanh) to ensure bounds.

Parameters:
  • action_dim – Dimension of the action space.

  • epsilon – small value to avoid NaN due to numerical imprecision.

entropy() Optional[Tensor][source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

log_prob(actions: Tensor, gaussian_actions: Optional[Tensor] = None) Tensor[source]

Get the log probabilities of actions according to the distribution. Note that you must first call the proba_distribution() method.

Parameters:

actions

Returns:

the log probabilities of the given actions

log_prob_from_params(mean_actions: Tensor, log_std: Tensor) Tuple[Tensor, Tensor][source]

Compute the log probability of taking an action given the distribution parameters.

Parameters:
  • mean_actions

  • log_std

Returns:

actions and their log probabilities

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

proba_distribution(mean_actions: Tensor, log_std: Tensor) SquashedDiagGaussianDistribution[source]

Create the distribution given its parameters (mean, std)

Parameters:
  • mean_actions

  • log_std

Returns:

self

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action
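
Same interface as DiagGaussianDistribution, with actions squashed into (-1, 1) by tanh; a sketch:

    import torch
    from malib.common.distributions import SquashedDiagGaussianDistribution

    dist = SquashedDiagGaussianDistribution(action_dim=2)
    mean_net, log_std = dist.proba_distribution_net(latent_dim=64)

    mean_actions = mean_net(torch.randn(8, 64))
    dist.proba_distribution(mean_actions, log_std)
    actions = dist.sample()              # squashed to (-1, 1)
    log_prob = dist.log_prob(actions)    # includes the tanh change-of-variables correction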

class malib.common.distributions.StateDependentNoiseDistribution(action_dim: int, full_std: bool = True, use_expln: bool = False, squash_output: bool = False, learn_features: bool = False, epsilon: float = 1e-06)[source]

Bases: Distribution

Distribution class for using generalized State Dependent Exploration (gSDE). Paper: https://arxiv.org/abs/2005.05719

It is used to create the noise exploration matrix and compute the log probability of an action with that noise.

Parameters:
  • action_dim – Dimension of the action space.

  • full_std – Whether to use (n_features x n_actions) parameters for the std instead of only (n_features,)

  • use_expln – Use expln() function instead of exp() to ensure a positive standard deviation (cf. paper). It keeps the variance above zero and prevents it from growing too fast. In practice, exp() is usually enough.

  • squash_output – Whether to squash the output using a tanh function; this ensures that bounds are satisfied.

  • learn_features – Whether to learn features for gSDE or not. This will enable gradients to be backpropagated through the features latent_sde in the code.

  • epsilon – small value to avoid NaN due to numerical imprecision.

actions_from_params(mean_actions: Tensor, log_std: Tensor, latent_sde: Tensor, deterministic: bool = False) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

entropy() Optional[Tensor][source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

get_noise(latent_sde: Tensor) Tensor[source]
get_std(log_std: Tensor) Tensor[source]

Get the standard deviation from the learned parameter (log of it by default). This ensures that the std is positive.

Parameters:

log_std

Returns:

the standard deviation

log_prob(actions: Tensor) Tensor[source]

Returns the log likelihood

Parameters:

actions – the taken actions

Returns:

The log likelihood of the distribution

log_prob_from_params(mean_actions: Tensor, log_std: Tensor, latent_sde: Tensor) Tuple[Tensor, Tensor][source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

proba_distribution(mean_actions: Tensor, log_std: Tensor, latent_sde: Tensor) StateDependentNoiseDistribution[source]

Create the distribution given its parameters (mean, std)

Parameters:
  • mean_actions

  • log_std

  • latent_sde

Returns:

self

proba_distribution_net(latent_dim: int, log_std_init: float = -2.0, latent_sde_dim: Optional[int] = None) Tuple[Module, Parameter][source]

Create the layers and parameter that represent the distribution: one output will be the deterministic action, and the other parameter will be the standard deviation of the distribution that controls the weights of the noise matrix.

Parameters:
  • latent_dim – Dimension of the last layer of the policy (before the action layer)

  • log_std_init – Initial value for the log standard deviation

  • latent_sde_dim – Dimension of the last layer of the features extractor for gSDE. By default, it is shared with the policy network.

Returns:

the deterministic action layer and the log std parameter

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

sample_weights(log_std: Tensor, batch_size: int = 1) None[source]

Sample weights for the noise exploration matrix, using a centered Gaussian distribution.

Parameters:
  • log_std

  • batch_size
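
A gSDE sketch; here the policy latent is reused as latent_sde, and resampling the exploration matrix once per rollout (rather than per step) is an assumption carried over from the referenced paper and implementation:

    import torch
    from malib.common.distributions import StateDependentNoiseDistribution

    dist = StateDependentNoiseDistribution(action_dim=2, full_std=True)
    mean_net, log_std = dist.proba_distribution_net(latent_dim=64, log_std_init=-2.0)

    dist.sample_weights(log_std, batch_size=8)     # draw the exploration (noise) matrix

    latent = torch.randn(8, 64)                    # shared policy / gSDE features
    mean_actions = mean_net(latent)
    dist.proba_distribution(mean_actions, log_std, latent_sde=latent)
    actions = dist.sample()
    log_prob = dist.log_prob(actions)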

class malib.common.distributions.TanhBijector(epsilon: float = 1e-06)[source]

Bases: object

Bijective transformation of a probability distribution using a squashing function (tanh). TODO: use Pyro instead (https://pyro.ai/).

Parameters:

epsilon – small value to avoid NaN due to numerical imprecision.

static atanh(x: Tensor) Tensor[source]

Inverse of tanh, taken from Pyro (https://github.com/pyro-ppl/pyro): 0.5 * torch.log((1 + x) / (1 - x))

static forward(x: Tensor) Tensor[source]
static inverse(y: Tensor) Tensor[source]

Inverse tanh.

Parameters:

y

Returns:

the inverse of the tanh transformation, atanh(y)

log_prob_correction(x: Tensor) Tensor[source]
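
A round-trip sketch; forward and inverse are static methods per the listing above:

    import torch
    from malib.common.distributions import TanhBijector

    x = torch.randn(5)
    y = TanhBijector.forward(x)        # tanh(x), bounded in (-1, 1)
    x_rec = TanhBijector.inverse(y)    # atanh(y) = 0.5 * log((1 + y) / (1 - y))
    # log_prob_correction(x) provides the change-of-variables term used by squashed distributions
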
malib.common.distributions.kl_divergence(dist_true: Distribution, dist_pred: Distribution) Tensor[source]

Wrapper for the PyTorch implementation of the full-form KL divergence

Parameters:
  • dist_true – the p distribution

  • dist_pred – the q distribution

Returns:

KL(dist_true||dist_pred)
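
A short sketch; both arguments are expected to be the same Distribution subclass (an assumption consistent with the stable-baselines3 reference):

    import torch
    from malib.common.distributions import CategoricalDistribution, kl_divergence

    p = CategoricalDistribution(4).proba_distribution(torch.randn(8, 4))
    q = CategoricalDistribution(4).proba_distribution(torch.randn(8, 4))
    kl = kl_divergence(p, q)    # KL(p || q)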

malib.common.distributions.make_proba_distribution(action_space: Space, use_sde: bool = False, dist_kwargs: Optional[Dict[str, Any]] = None) Distribution[source]

Return an instance of Distribution for the correct type of action space.

Parameters:
  • action_space (gym.spaces.Space) – The action space.

  • use_sde (bool, optional) – Force the use of StateDependentNoiseDistribution instead of DiagGaussianDistribution. Defaults to False.

  • dist_kwargs (Optional[Dict[str, Any]], optional) – Keyword arguments to pass to the probability distribution. Defaults to None.

Raises:

NotImplementedError – Probability distribution not implemented for the specified action space.

Returns:

The appropriate Distribution object

Return type:

Distribution
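
A sketch of the expected space-to-distribution mapping, following the stable-baselines3 reference (Discrete -> Categorical, Box -> DiagGaussian, MultiDiscrete -> MultiCategorical, MultiBinary -> Bernoulli); the exact mapping in malib is assumed:

    from gym import spaces
    from malib.common.distributions import make_proba_distribution

    dist = make_proba_distribution(spaces.Discrete(5))                      # Categorical
    dist = make_proba_distribution(spaces.Box(low=-1, high=1, shape=(2,)))  # DiagGaussian
    dist = make_proba_distribution(
        spaces.Box(low=-1, high=1, shape=(2,)), use_sde=True
    )                                                                       # gSDE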

malib.common.distributions.sum_independent_dims(tensor: Tensor) Tensor[source]

Continuous actions are usually considered to be independent, so we can sum components of the log_prob or the entropy.

Parameters:

tensor – shape: (n_batch, n_actions) or (n_batch,)

Returns:

shape: (n_batch,)
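
For example, summing per-dimension log-probabilities of independent action dimensions into a joint log-probability:

    import torch
    from malib.common.distributions import sum_independent_dims

    per_dim_log_prob = torch.randn(8, 3)                      # (n_batch, n_actions)
    joint_log_prob = sum_independent_dims(per_dim_log_prob)   # (n_batch,)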

malib.common.manager module

class malib.common.manager.Manager(verbose: bool)[source]

Bases: ABC

cancel_pending_tasks()[source]

Cancel all running tasks.

force_stop()[source]
is_running()[source]
retrive_results()[source]
abstract terminate()[source]

Reclaim occupied resources.

wait() List[Any][source]

Wait for workers to terminate, then retrieve their results.

Returns:

A list of results.

Return type:

List[Any]

property workers: List[RemoteInterface]
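
A sketch of how a concrete subclass is typically driven; TrainingManager is a hypothetical name, the terminate body is a placeholder, and only methods documented above are used:

    from malib.common.manager import Manager

    class TrainingManager(Manager):
        """Hypothetical concrete manager; real subclasses also set up workers."""

        def terminate(self):
            # reclaim occupied resources (e.g., stop remote workers) -- placeholder
            pass

    manager = TrainingManager(verbose=True)
    try:
        results = manager.wait()          # block until workers finish, collect results
    except KeyboardInterrupt:
        manager.cancel_pending_tasks()    # cancel all running tasks
        manager.terminate()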

malib.common.payoff_manager module

malib.common.strategy_spec module

class malib.common.strategy_spec.StrategySpec(identifier: str, policy_ids: Tuple[str], meta_data: Dict[str, Any])[source]

Bases: object

Construct a strategy spec.

Parameters:
  • identifier (str) – Runtime id as identifier.

  • policy_ids (Tuple[PolicyID]) – A tuple of policy ids; it can be empty.

  • meta_data (Dict[str, Any]) – Meta data, for policy construction.

gen_policy(device=None) Policy[source]

Generate a policy instance with the given meta data.

Returns:

A policy instance.

Return type:

Policy

get_meta_data() Dict[str, Any][source]

Return the meta data. Keys in the meta data include:

  • policy_cls: policy class type

  • kwargs: a dict of parameters for policy construction

  • experiment_tag: a string for experiment identification

  • optim_config: optional, a dict for optimizer construction

Returns:

A dict of meta data.

Return type:

Dict[str, Any]

load_from_checkpoint(policy_id: str)[source]
property num_policy: int
register_policy_id(policy_id: str)[source]

Register a new policy id and preset its probability to 0.

Parameters:

policy_id (PolicyID) – Policy id to register.

sample() str[source]

Sample a policy id. Uses uniform sampling if no probability list is preset in the meta data.

Returns:

A sampled policy id.

Return type:

PolicyID

update_prob_list(policy_probs: Dict[str, float])[source]

Update the probability list with the given dict of policy probabilities. Partial assignment is allowed.

Parameters:

policy_probs (Dict[PolicyID, float]) – A dict that indicates which policy probs should be updated.
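
A usage sketch; DummyPolicy is a placeholder for a real policy class, and the meta_data keys follow get_meta_data() above:

    from malib.common.strategy_spec import StrategySpec

    class DummyPolicy:
        """Placeholder for a real malib Policy class."""

    spec = StrategySpec(
        identifier="agent_0",
        policy_ids=(),
        meta_data={"policy_cls": DummyPolicy, "kwargs": {}, "experiment_tag": "demo"},
    )
    spec.register_policy_id("policy_0")              # probability preset to 0
    spec.register_policy_id("policy_1")
    spec.update_prob_list({"policy_0": 0.3, "policy_1": 0.7})
    pid = spec.sample()                              # a policy id drawn from the prob list
    # policy = spec.gen_policy(device="cpu")         # would construct the policy from meta_data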

malib.common.strategy_spec.validate_meta_data(policy_ids: Tuple[str], meta_data: Dict[str, Any])[source]

Validate the meta data: check whether it contains a valid probability list.

Parameters:
  • policy_ids (Tuple[PolicyID]) – A tuple of registered policy ids.

  • meta_data (Dict[str, Any]) – Meta data.