sempler.DRFNet

The sempler.DRFNet class implements a procedure to generate realistic semi-synthetic data for causal discovery. The procedure is described in detail in Appendix F of [1].

If you use this procedure in your work, please consider citing:

@article{gamella2022characterization,
  title={Characterization and Greedy Learning of Gaussian Structural Causal Models under Unknown Interventions},
  author={Gamella, Juan L. and Taeb, Armeen and Heinze-Deml, Christina and B\"uhlmann, Peter},
  journal={arXiv preprint arXiv:2211.14897},
  year={2022}
}

As input, the procedure takes a directed acyclic graph over some variables and a dataset of their observations under different environments. The data can also be “observational”, that is, from a single environment. The procedure then fits a non-parametric Bayesian network, in which the conditional distributions entailed by the graph are approximated via distributional random forests [2]. Once the network is fitted, you can sample from this collection of forests to produce a new, semi-synthetic dataset. This dataset respects acyclicity, causal sufficiency, and the conditional independence relationships entailed by the given graph, while its marginal and conditional distributions closely resemble those of the original dataset [1, Figure 4].
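
The fit-then-sample idea above can be illustrated with a small, self-contained sketch. It is not sempler's implementation: the function names are hypothetical, it handles a single environment only, and the conditional distributions are approximated by a crude nearest-neighbour resampler rather than by distributional random forests. What it does show is the ancestral sampling structure: variables are generated one at a time in a topological order of the graph, each conditional on the values already drawn for its parents.

```python
import random


def parents_of(graph, j):
    """Parents of node j, using the convention graph[i][j] != 0 means i -> j."""
    return [i for i in range(len(graph)) if graph[i][j] != 0]


def topological_order(graph):
    """Return the nodes ordered so that every parent precedes its children."""
    remaining = set(range(len(graph)))
    order = []
    while remaining:
        # Nodes all of whose parents have already been placed in the order
        ready = sorted(j for j in remaining
                       if all(i not in remaining for i in parents_of(graph, j)))
        if not ready:
            raise ValueError("graph contains a cycle")
        order.extend(ready)
        remaining.difference_update(ready)
    return order


def sample_conditional(data, j, parents, row, rng, k=5):
    """Stand-in for a fitted conditional (NOT a distributional random forest):
    copy X_j from one of the k observations whose parent values are closest
    to the parent values already sampled in `row`."""
    if not parents:
        return rng.choice(data)[j]  # root node: resample its marginal

    def distance(obs):
        return sum((obs[i] - row[i]) ** 2 for i in parents)

    nearest = sorted(data, key=distance)[:k]
    return rng.choice(nearest)[j]


def ancestral_sample(graph, data, n, seed=None):
    """Generate n new observations, one variable at a time in causal order."""
    rng = random.Random(seed)
    order = topological_order(graph)
    samples = []
    for _ in range(n):
        row = [None] * len(graph)
        for j in order:
            row[j] = sample_conditional(data, j, parents_of(graph, j), row, rng)
        samples.append(row)
    return samples


# Toy example: the chain 0 -> 1 -> 2 with 50 observations
graph = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
gen = random.Random(0)
data = []
for _ in range(50):
    x0 = gen.gauss(0, 1)
    x1 = 2 * x0 + gen.gauss(0, 0.1)
    x2 = -x1 + gen.gauss(0, 0.1)
    data.append([x0, x1, x2])

new_data = ancestral_sample(graph, data, n=10, seed=1)
```

Because each variable depends only on its sampled parents, the generated data follows the structure of the given graph by construction, which is the property the procedure relies on.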

Additional R dependencies

Currently, only an R implementation of distributional random forests [2] is available. To run the procedure you will therefore also need:

  • an R installation (see the R project's installation guide)

  • the R package drf, which you can install by typing install.packages("drf") in an R terminal

  • the Python package rpy2>=3.4.1 (you can automatically install it by running pip install sempler[DRFNet])
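
Before fitting a network, it can be useful to check that these requirements are in place. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and it assumes that the R executable is on the PATH under the name R or Rscript. It does not check the rpy2 version or whether the R package drf is installed; the latter can be checked from an R terminal with "drf" %in% rownames(installed.packages()).

```python
import importlib.util
import shutil


def check_drfnet_dependencies():
    """Report which external requirements appear to be available."""
    return {
        # An R executable on the PATH (assumed to be named "R" or "Rscript")
        "R": shutil.which("R") is not None or shutil.which("Rscript") is not None,
        # The rpy2 Python package (its version is not checked here)
        "rpy2": importlib.util.find_spec("rpy2") is not None,
    }


status = check_drfnet_dependencies()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```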

The class is documented below.

class sempler.DRFNet(graph, data, verbose=False)

Fit a non-parametric Bayesian network with the given adjacency to the data, modelling the conditionals through distributional random forests.

Parameters:
  • graph (numpy.ndarray) – Two dimensional array representing the desired ground truth, where graph[i,j] != 0 implies the edge i -> j.

  • data (list of numpy.ndarray) – The data to which the Bayesian network will be fitted, as a list of two-dimensional arrays containing the samples from each environment, where rows correspond to observations and columns to variables.

  • verbose (bool, default=False) – Whether debugging traces should be printed.

Raises:
  • TypeError : – If the graph or data are of the wrong type.

  • ValueError : – If the given adjacency is not a DAG or the samples in the data are of the wrong size.

Examples

Fitting the network to random data from two environments and a random graph:

>>> import numpy as np
>>> import sempler.generators
>>> from sempler import DRFNet
>>> rng = np.random.default_rng(42)
>>> data = [rng.uniform(size=(100, 5)) for _ in range(2)]
>>> graph = sempler.generators.dag_avg_deg(p=5, k=2, random_state=42)
>>> network = DRFNet(graph, data)
>>> network.graph
array([[0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0]])

sample(n=None, random_state=None)

Generate a sample from the fitted Bayesian network.

Parameters:
  • n (int or list of ints or NoneType, default=None) – The desired sample size. If None, the sample size for each environment matches that of the original data. Otherwise, either a single value (the same number of observations for every environment) or a list specifying the sample size for each environment.

  • random_state (int or NoneType, default=None) – Seed for the random number generator, for reproducibility.

Returns:

sample – The resulting samples, one per environment.

Return type:

list of numpy.ndarray

Raises:
  • TypeError : – If n is of the wrong type.

  • ValueError : – If n is non-positive or has the wrong length (number of environments).
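
The rules above for interpreting n can be summarized in a short sketch. This is an illustration of the documented behaviour, not the library's code; the helper name resolve_sample_sizes and the error messages are hypothetical, and original_sizes stands for the number of observations in each environment of the original data.

```python
def resolve_sample_sizes(n, original_sizes):
    """Turn the `n` argument into one sample size per environment."""

    def is_int(value):
        # bool is a subclass of int, so exclude it explicitly
        return isinstance(value, int) and not isinstance(value, bool)

    if n is None:
        # Match the sample sizes of the original data
        return list(original_sizes)
    if is_int(n):
        # Same number of observations for every environment
        sizes = [n] * len(original_sizes)
    elif isinstance(n, list) and all(is_int(s) for s in n):
        if len(n) != len(original_sizes):
            raise ValueError("n must have one entry per environment")
        sizes = list(n)
    else:
        raise TypeError("n must be None, an int or a list of ints")
    if any(s <= 0 for s in sizes):
        raise ValueError("sample sizes must be positive")
    return sizes


print(resolve_sample_sizes(None, [100, 100]))    # [100, 100]
print(resolve_sample_sizes(3, [100, 100]))       # [3, 3]
print(resolve_sample_sizes([2, 3], [100, 100]))  # [2, 3]
```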

Examples

If no sample size is specified, the sample sizes in the new data match those of the original:

>>> new_data = network.sample()
>>> len(new_data)
2
>>> [len(sample) for sample in new_data]
[100, 100]

Specifying the sample sizes:

>>> new_data = network.sample(3)
>>> [len(sample) for sample in new_data]
[3, 3]
>>> new_data = network.sample([2, 3])
>>> [len(sample) for sample in new_data]
[2, 3]

References

[1] Gamella, J. L., Taeb, A., Heinze-Deml, C., & Bühlmann, P. (2022). Characterization and greedy learning of Gaussian structural causal models under unknown interventions. arXiv preprint arXiv:2211.14897.

[2] Ćevid, D., Michel, L., Näf, J., Meinshausen, N., & Bühlmann, P. (2020). Distributional random forests: Heterogeneity adjustment and multivariate distributional regression. arXiv preprint arXiv:2005.14458.