malib.common package

Submodules

malib.common.distributions module

Probability distributions. Reference: https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/distributions.py

class malib.common.distributions.BernoulliDistribution(action_dims: int)[source]

Bases: Distribution

Bernoulli distribution for MultiBinary action spaces.

Parameters:

action_dims – Number of binary actions

actions_from_params(action_logits: Tensor, deterministic: bool = False) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

entropy() Tensor[source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

log_prob(actions: Tensor) Tensor[source]

Returns the log likelihood

Parameters:

actions – the taken actions

Returns:

The log likelihood of the distribution

log_prob_from_params(action_logits: Tensor) Tuple[Tensor, Tensor][source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

proba_distribution(action_logits: Tensor) BernoulliDistribution[source]

Set parameters of the distribution.

Returns:

self

proba_distribution_net(latent_dim: int) Module[source]

Create the layer that represents the distribution: it will be the logits of the Bernoulli distribution.

Parameters:

latent_dim – Dimension of the last layer of the policy network (before the action layer)

Returns:

the layer that outputs the Bernoulli logits (a torch.nn.Module)

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action
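
A minimal usage sketch of the flow above (create the logits layer, bind parameters, then sample); the latent batch and shapes are illustrative, not prescribed by the API:

    import torch
    from malib.common.distributions import BernoulliDistribution

    latent_dim, n_binary_actions = 64, 4
    dist = BernoulliDistribution(n_binary_actions)
    action_net = dist.proba_distribution_net(latent_dim)   # module mapping latent -> logits

    latent = torch.randn(8, latent_dim)        # output of a hypothetical policy trunk
    logits = action_net(latent)
    dist.proba_distribution(logits)            # bind the distribution parameters

    actions = dist.sample()                    # stochastic binary actions
    log_prob = dist.log_prob(actions)
    entropy = dist.entropy()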

class malib.common.distributions.CategoricalDistribution(action_dim: int)[source]

Bases: Distribution

Categorical distribution for discrete actions.

Parameters:

action_dim (int) – Number of discrete actions.

actions_from_params(action_logits: Tensor, deterministic: bool = False) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

entropy() Tensor[source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

log_prob(actions: Tensor) Tensor[source]

Returns the log likelihood

Parameters:

actions – the taken actions

Returns:

The log likelihood of the distribution

log_prob_from_params(action_logits: Tensor, deterministic: bool = False) Tuple[Tensor, Tensor][source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

prob() Tensor[source]

Return a tensor that represents the distribution

Returns:

A distribution tensor

Return type:

torch.Tensor

proba_distribution(action_logits: Tensor, action_mask: Optional[Tensor] = None) CategoricalDistribution[source]

Set parameters of the distribution.

Returns:

self

proba_distribution_net(latent_dim: int) Module[source]

Create the layer that represents the distribution: it will be the logits of the Categorical distribution. You can then get probabilities using a softmax.

Parameters:

latent_dim – Dimension of the last layer of the policy network (before the action layer)

Returns:

the layer that outputs the Categorical logits (a torch.nn.Module)

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action
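
The same flow for discrete actions; the optional action_mask follows the proba_distribution signature above, and the mask convention (1 marks a valid action) is an assumption:

    import torch
    from malib.common.distributions import CategoricalDistribution

    latent_dim, n_actions = 64, 5
    dist = CategoricalDistribution(n_actions)
    action_net = dist.proba_distribution_net(latent_dim)

    logits = action_net(torch.randn(8, latent_dim))
    mask = torch.ones(8, n_actions)            # assumed: 1 marks a valid action
    dist.proba_distribution(logits, action_mask=mask)

    greedy = dist.mode()                       # most likely (deterministic) action
    sampled = dist.sample()                    # stochastic action
    log_prob = dist.log_prob(sampled)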

class malib.common.distributions.DiagGaussianDistribution(action_dim: int)[source]

Bases: Distribution

Gaussian distribution with diagonal covariance matrix, for continuous actions.

Parameters:

action_dim – Dimension of the action space.

actions_from_params(mean_actions: Tensor, log_std: Tensor, deterministic: bool = False) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

entropy() Tensor[source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

log_prob(actions: Tensor) Tensor[source]

Get the log probabilities of actions according to the distribution. Note that you must first call the proba_distribution() method.

Parameters:

actions

Returns:

the log probabilities of the given actions

log_prob_from_params(mean_actions: Tensor, log_std: Tensor) Tuple[Tensor, Tensor][source]

Compute the log probability of taking an action given the distribution parameters.

Parameters:
  • mean_actions

  • log_std

Returns:

actions and their log probabilities

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

prob() Tensor[source]

Return a tensor that represents the distribution

Returns:

A distribution tensor

Return type:

torch.Tensor

proba_distribution(mean_actions: Tensor, log_std: Tensor) DiagGaussianDistribution[source]

Create the distribution given its parameters (mean, std)

Parameters:
  • mean_actions

  • log_std

Returns:

self

proba_distribution_net(latent_dim: int, log_std_init: float = 0.0) Tuple[Module, Parameter][source]

Create the layers and parameter that represent the distribution: one output will be the mean of the Gaussian, and the other parameter will be the standard deviation (the log std in practice, to allow negative values)

Parameters:
  • latent_dim – Dimension of the last layer of the policy (before the action layer)

  • log_std_init – Initial value for the log standard deviation

Returns:

the mean action layer and the log std parameter

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action
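
For continuous actions, proba_distribution_net returns both the mean layer and the log std parameter; a sketch assuming a batch of 8 latent vectors:

    import torch
    from malib.common.distributions import DiagGaussianDistribution

    dist = DiagGaussianDistribution(action_dim=2)
    mean_net, log_std = dist.proba_distribution_net(latent_dim=64, log_std_init=0.0)

    mean_actions = mean_net(torch.randn(8, 64))
    dist.proba_distribution(mean_actions, log_std)
    actions = dist.sample()
    # or perform both steps at once:
    actions, log_prob = dist.log_prob_from_params(mean_actions, log_std)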

class malib.common.distributions.Distribution[source]

Bases: ABC

Abstract base class for distributions.

abstract actions_from_params(*args, **kwargs) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

abstract entropy() Optional[Tensor][source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

get_actions(deterministic: bool = False) Tensor[source]

Return actions according to the probability distribution.

Parameters:

deterministic

Returns:

the selected actions
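
In the stable-baselines3 reference this simply switches between mode() and sample(); a sketch of that presumed behavior (not verified against malib's own implementation):

    def get_actions(self, deterministic: bool = False):
        # presumed behavior, mirroring the stable-baselines3 reference
        if deterministic:
            return self.mode()
        return self.sample()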

abstract log_prob(x: Tensor) Tensor[source]

Returns the log likelihood

Parameters:

x – the taken action

Returns:

The log likelihood of the distribution

abstract log_prob_from_params(*args, **kwargs) Tuple[Tensor, Tensor][source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

abstract mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

abstract prob() Tensor[source]

Return a tensor that represents the distribution

Returns:

A distribution tensor

Return type:

torch.Tensor

abstract proba_distribution(*args, **kwargs) Distribution[source]

Set parameters of the distribution.

Returns:

self

abstract proba_distribution_net(*args, **kwargs) Union[Module, Tuple[Module, Parameter]][source]

Create the layers and parameters that represent the distribution.

Subclasses must define this, but the arguments and return type vary between concrete classes.

abstract sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

class malib.common.distributions.MaskedCategorical(scores, mask=None)[source]

Bases: object

property entropy
log_prob(value)[source]
property logits
static masked_softmax(logits, mask)[source]

This method returns a valid probability distribution for each instance whose corresponding row in the mask matrix is not a zero vector; otherwise, a uniform distribution is returned for that row. This is a technical workaround that allows the Categorical class to be used: if the probabilities did not sum to one, sampling would raise an exception.

property normalized_entropy
property probs
rsample(temperature=None, gumbel_noise=None)[source]
sample()[source]
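
A sketch of direct use; the mask row convention (non-zero marks a valid action) follows masked_softmax above, and the mask dtype is an assumption:

    import torch
    from malib.common.distributions import MaskedCategorical

    scores = torch.randn(4, 6)                       # unnormalized logits
    mask = torch.tensor([[1, 1, 0, 0, 1, 1]] * 4)    # assumed: non-zero marks valid actions
    dist = MaskedCategorical(scores, mask=mask)

    probs = dist.probs            # masked-out entries receive (near-)zero probability
    action = dist.sample()
    log_prob = dist.log_prob(action)
    entropy = dist.entropy        # property, per the listing above
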
class malib.common.distributions.MultiCategoricalDistribution(action_dims: List[int])[source]

Bases: Distribution

MultiCategorical distribution for multi discrete actions.

Parameters:

action_dims – List of sizes of discrete action spaces

actions_from_params(action_logits: Tensor, deterministic: bool = False) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

entropy() Tensor[source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

log_prob(actions: Tensor) Tensor[source]

Returns the log likelihood

Parameters:

actions – the taken actions

Returns:

The log likelihood of the distribution

log_prob_from_params(action_logits: Tensor) Tuple[Tensor, Tensor][source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

proba_distribution(action_logits: Tensor) MultiCategoricalDistribution[source]

Set parameters of the distribution.

Returns:

self

proba_distribution_net(latent_dim: int) Module[source]

Create the layer that represents the distribution: it will be the logits (flattened) of the MultiCategorical distribution. You can then get probabilities using a softmax on each sub-space.

Parameters:

latent_dim – Dimension of the last layer of the policy network (before the action layer)

Returns:

the layer that outputs the flattened MultiCategorical logits (a torch.nn.Module)

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action
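
For multi-discrete actions the logits are flattened across sub-spaces; a sketch with three sub-spaces:

    import torch
    from malib.common.distributions import MultiCategoricalDistribution

    action_dims = [3, 4, 2]
    dist = MultiCategoricalDistribution(action_dims)
    action_net = dist.proba_distribution_net(latent_dim=64)   # outputs sum(action_dims) logits

    logits = action_net(torch.randn(8, 64))
    dist.proba_distribution(logits)
    actions = dist.sample()             # one index per sub-space
    log_prob = dist.log_prob(actions)   # joint log-probability across sub-spaces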

class malib.common.distributions.SquashedDiagGaussianDistribution(action_dim: int, epsilon: float = 1e-06)[source]

Bases: DiagGaussianDistribution

Gaussian distribution with diagonal covariance matrix, followed by a squashing function (tanh) to ensure bounds.

Parameters:
  • action_dim – Dimension of the action space.

  • epsilon – small value to avoid NaN due to numerical imprecision.

entropy() Optional[Tensor][source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

log_prob(actions: Tensor, gaussian_actions: Optional[Tensor] = None) Tensor[source]

Get the log probabilities of actions according to the distribution. Note that you must first call the proba_distribution() method.

Parameters:

actions

Returns:

the log probabilities of the given actions

log_prob_from_params(mean_actions: Tensor, log_std: Tensor) Tuple[Tensor, Tensor][source]

Compute the log probability of taking an action given the distribution parameters.

Parameters:
  • mean_actions

  • log_std

Returns:

actions and their log probabilities

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

proba_distribution(mean_actions: Tensor, log_std: Tensor) SquashedDiagGaussianDistribution[source]

Create the distribution given its parameters (mean, std)

Parameters:
  • mean_actions

  • log_std

Returns:

self

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action
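
Same interface as DiagGaussianDistribution, with actions squashed into (-1, 1) by tanh; a sketch:

    import torch
    from malib.common.distributions import SquashedDiagGaussianDistribution

    dist = SquashedDiagGaussianDistribution(action_dim=2)
    mean_net, log_std = dist.proba_distribution_net(latent_dim=64)

    mean_actions = mean_net(torch.randn(8, 64))
    dist.proba_distribution(mean_actions, log_std)
    actions = dist.sample()              # squashed to (-1, 1)
    log_prob = dist.log_prob(actions)    # includes the tanh change-of-variables correction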

class malib.common.distributions.StateDependentNoiseDistribution(action_dim: int, full_std: bool = True, use_expln: bool = False, squash_output: bool = False, learn_features: bool = False, epsilon: float = 1e-06)[source]

Bases: Distribution

Distribution class for using generalized State Dependent Exploration (gSDE). Paper: https://arxiv.org/abs/2005.05719

It is used to create the noise exploration matrix and compute the log probability of an action with that noise.

Parameters:
  • action_dim – Dimension of the action space.

  • full_std – Whether to use (n_features x n_actions) parameters for the std instead of only (n_features,)

  • use_expln – Use expln() function instead of exp() to ensure a positive standard deviation (cf. paper). It keeps the variance above zero and prevents it from growing too fast. In practice, exp() is usually enough.

  • squash_output – Whether to squash the output using a tanh function; this ensures that bounds are satisfied.

  • learn_features – Whether to learn features for gSDE or not. This will enable gradients to be backpropagated through the features latent_sde in the code.

  • epsilon – small value to avoid NaN due to numerical imprecision.

actions_from_params(mean_actions: Tensor, log_std: Tensor, latent_sde: Tensor, deterministic: bool = False) Tensor[source]

Returns samples from the probability distribution given its parameters.

Returns:

actions

entropy() Optional[Tensor][source]

Returns Shannon’s entropy of the probability distribution

Returns:

the entropy, or None if no analytical form is known

get_noise(latent_sde: Tensor) Tensor[source]
get_std(log_std: Tensor) Tensor[source]

Get the standard deviation from the learned parameter (log of it by default). This ensures that the std is positive.

Parameters:

log_std

Returns:

the standard deviation

log_prob(actions: Tensor) Tensor[source]

Returns the log likelihood

Parameters:

actions – the taken actions

Returns:

The log likelihood of the distribution

log_prob_from_params(mean_actions: Tensor, log_std: Tensor, latent_sde: Tensor) Tuple[Tensor, Tensor][source]

Returns samples and the associated log probabilities from the probability distribution given its parameters.

Returns:

actions and log prob

mode() Tensor[source]

Returns the most likely action (deterministic output) from the probability distribution

Returns:

the most likely (deterministic) action

proba_distribution(mean_actions: Tensor, log_std: Tensor, latent_sde: Tensor) StateDependentNoiseDistribution[source]

Create the distribution given its parameters (mean, std)

Parameters:
  • mean_actions

  • log_std

  • latent_sde

Returns:

self

proba_distribution_net(latent_dim: int, log_std_init: float = -2.0, latent_sde_dim: Optional[int] = None) Tuple[Module, Parameter][source]

Create the layers and parameter that represent the distribution: one output will be the deterministic action, and the other parameter will be the standard deviation of the distribution that controls the weights of the noise matrix.

Parameters:
  • latent_dim – Dimension of the last layer of the policy (before the action layer)

  • log_std_init – Initial value for the log standard deviation

  • latent_sde_dim – Dimension of the last layer of the features extractor for gSDE. By default, it is shared with the policy network.

Returns:

the deterministic action layer and the log std parameter

sample() Tensor[source]

Returns a sample from the probability distribution

Returns:

the stochastic action

sample_weights(log_std: Tensor, batch_size: int = 1) None[source]

Sample weights for the noise exploration matrix, using a centered Gaussian distribution.

Parameters:
  • log_std

  • batch_size
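
A gSDE sketch; here the policy latent is reused as latent_sde, and resampling the exploration matrix once per rollout (rather than per step) is an assumption carried over from the referenced paper and implementation:

    import torch
    from malib.common.distributions import StateDependentNoiseDistribution

    dist = StateDependentNoiseDistribution(action_dim=2, full_std=True)
    mean_net, log_std = dist.proba_distribution_net(latent_dim=64, log_std_init=-2.0)

    dist.sample_weights(log_std, batch_size=8)     # draw the exploration (noise) matrix

    latent = torch.randn(8, 64)                    # shared policy / gSDE features
    mean_actions = mean_net(latent)
    dist.proba_distribution(mean_actions, log_std, latent_sde=latent)
    actions = dist.sample()
    log_prob = dist.log_prob(actions)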

class malib.common.distributions.TanhBijector(epsilon: float = 1e-06)[source]

Bases: object

Bijective transformation of a probability distribution using a squashing function (tanh). TODO: use Pyro instead (https://pyro.ai/).

Parameters:

epsilon – small value to avoid NaN due to numerical imprecision.

static atanh(x: Tensor) Tensor[source]

Inverse of tanh, taken from Pyro (https://github.com/pyro-ppl/pyro): 0.5 * torch.log((1 + x) / (1 - x))

static forward(x: Tensor) Tensor[source]
static inverse(y: Tensor) Tensor[source]

Inverse tanh.

Parameters:

y

Returns:

the inverse of the tanh transformation, atanh(y)

log_prob_correction(x: Tensor) Tensor[source]
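
A round-trip sketch; forward and inverse are static methods per the listing above:

    import torch
    from malib.common.distributions import TanhBijector

    x = torch.randn(5)
    y = TanhBijector.forward(x)        # tanh(x), bounded in (-1, 1)
    x_rec = TanhBijector.inverse(y)    # atanh(y) = 0.5 * log((1 + y) / (1 - y))
    # log_prob_correction(x) provides the change-of-variables term used by squashed distributions
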
malib.common.distributions.kl_divergence(dist_true: Distribution, dist_pred: Distribution) Tensor[source]

Wrapper for the PyTorch implementation of the full-form KL divergence

Parameters:
  • dist_true – the p distribution

  • dist_pred – the q distribution

Returns:

KL(dist_true||dist_pred)
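
A short sketch; both arguments are expected to be the same Distribution subclass (an assumption consistent with the stable-baselines3 reference):

    import torch
    from malib.common.distributions import CategoricalDistribution, kl_divergence

    p = CategoricalDistribution(4).proba_distribution(torch.randn(8, 4))
    q = CategoricalDistribution(4).proba_distribution(torch.randn(8, 4))
    kl = kl_divergence(p, q)    # KL(p || q)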

malib.common.distributions.make_proba_distribution(action_space: Space, use_sde: bool = False, dist_kwargs: Optional[Dict[str, Any]] = None) Distribution[source]

Return an instance of Distribution for the correct type of action space.

Parameters:
  • action_space (gym.spaces.Space) – The action space.

  • use_sde (bool, optional) – Force the use of StateDependentNoiseDistribution instead of DiagGaussianDistribution. Defaults to False.

  • dist_kwargs (Optional[Dict[str, Any]], optional) – Keyword arguments to pass to the probability distribution. Defaults to None.

Raises:

NotImplementedError – Probability distribution not implemented for the specified action space.

Returns:

The appropriate Distribution object

Return type:

Distribution
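
A sketch of the expected space-to-distribution mapping, following the stable-baselines3 reference (Discrete -> Categorical, Box -> DiagGaussian, MultiDiscrete -> MultiCategorical, MultiBinary -> Bernoulli); the exact mapping in malib is assumed:

    from gym import spaces
    from malib.common.distributions import make_proba_distribution

    dist = make_proba_distribution(spaces.Discrete(5))                      # Categorical
    dist = make_proba_distribution(spaces.Box(low=-1, high=1, shape=(2,)))  # DiagGaussian
    dist = make_proba_distribution(
        spaces.Box(low=-1, high=1, shape=(2,)), use_sde=True
    )                                                                       # gSDE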

malib.common.distributions.sum_independent_dims(tensor: Tensor) Tensor[source]

Continuous actions are usually considered to be independent, so we can sum components of the log_prob or the entropy.

Parameters:

tensor – shape: (n_batch, n_actions) or (n_batch,)

Returns:

shape: (n_batch,)
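
For example, summing per-dimension log-probabilities of independent action dimensions into a joint log-probability:

    import torch
    from malib.common.distributions import sum_independent_dims

    per_dim_log_prob = torch.randn(8, 3)                      # (n_batch, n_actions)
    joint_log_prob = sum_independent_dims(per_dim_log_prob)   # (n_batch,)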

malib.common.manager module

class malib.common.manager.Manager(verbose: bool)[source]

Bases: ABC

cancel_pending_tasks()[source]

Cancel all running tasks.

force_stop()[source]
is_running()[source]
retrive_results()[source]
abstract terminate()[source]

Reclaim occupied resources.

wait() List[Any][source]

Wait for workers to terminate, then retrieve their results.

Returns:

A list of results.

Return type:

List[Any]

property workers: List[RemoteInterface]
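
A sketch of how a concrete subclass is typically driven; TrainingManager is a hypothetical name, the terminate body is a placeholder, and only methods documented above are used:

    from malib.common.manager import Manager

    class TrainingManager(Manager):
        """Hypothetical concrete manager; real subclasses also set up workers."""

        def terminate(self):
            # reclaim occupied resources (e.g., stop remote workers) -- placeholder
            pass

    manager = TrainingManager(verbose=True)
    try:
        results = manager.wait()          # block until workers finish, collect results
    except KeyboardInterrupt:
        manager.cancel_pending_tasks()    # cancel all running tasks
        manager.terminate()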

malib.common.payoff_manager module

malib.common.strategy_spec module

class malib.common.strategy_spec.StrategySpec(identifier: str, policy_ids: Tuple[str], meta_data: Dict[str, Any])[source]

Bases: object

Construct a strategy spec.

Parameters:
  • identifier (str) – Runtime id as identifier.

  • policy_ids (Tuple[PolicyID]) – A tuple of policy ids; it can be empty.

  • meta_data (Dict[str, Any]) – Meta data, for policy construction.

gen_policy(device=None) Policy[source]

Generate a policy instance with the given meta data.

Returns:

A policy instance.

Return type:

Policy

get_meta_data() Dict[str, Any][source]

Return the meta data. Keys in the meta data include:

  • policy_cls: policy class type

  • kwargs: a dict of parameters for policy construction

  • experiment_tag: a string for experiment identification

  • optim_config: optional, a dict for optimizer construction

Returns:

A dict of meta data.

Return type:

Dict[str, Any]

load_from_checkpoint(policy_id: str)[source]
property num_policy: int
register_policy_id(policy_id: str)[source]

Register a new policy id and preset its probability to 0.

Parameters:

policy_id (PolicyID) – Policy id to register.

sample() str[source]

Sample a policy id. Uses uniform sampling if no probability list is preset in the meta data.

Returns:

A sampled policy id.

Return type:

PolicyID

update_prob_list(policy_probs: Dict[str, float])[source]

Update the probability list with the given dict of policy probabilities. Partial assignment is allowed.

Parameters:

policy_probs (Dict[PolicyID, float]) – A dict that indicates which policy probs should be updated.
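
A usage sketch; DummyPolicy is a placeholder for a real policy class, and the meta_data keys follow get_meta_data() above:

    from malib.common.strategy_spec import StrategySpec

    class DummyPolicy:
        """Placeholder for a real malib Policy class."""

    spec = StrategySpec(
        identifier="agent_0",
        policy_ids=(),
        meta_data={"policy_cls": DummyPolicy, "kwargs": {}, "experiment_tag": "demo"},
    )
    spec.register_policy_id("policy_0")              # probability preset to 0
    spec.register_policy_id("policy_1")
    spec.update_prob_list({"policy_0": 0.3, "policy_1": 0.7})
    pid = spec.sample()                              # a policy id drawn from the prob list
    # policy = spec.gen_policy(device="cpu")         # would construct the policy from meta_data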

malib.common.strategy_spec.validate_meta_data(policy_ids: Tuple[str], meta_data: Dict[str, Any])[source]

Validate the meta data: check whether it contains a valid probability list.

Parameters:
  • policy_ids (Tuple[PolicyID]) – A tuple of registered policy ids.

  • meta_data (Dict[str, Any]) – Meta data.