sempler.DRFNet
The sempler.DRFNet class implements a procedure to generate realistic semi-synthetic data for causal discovery. The procedure is described in detail in Appendix F of [1].
If you use this procedure for your work, please consider citing:
@article{gamella2022characterization,
  title={Characterization and Greedy Learning of Gaussian Structural Causal Models under Unknown Interventions},
  author={Gamella, Juan L. and Taeb, Armeen and Heinze-Deml, Christina and B\"uhlmann, Peter},
  journal={arXiv preprint arXiv:2211.14897},
  year={2022}
}
As input, the procedure takes a directed acyclic graph over some variables and a dataset consisting of their observations under different environments. The data can also be “observational”, that is, from a single environment. The procedure then fits a non-parametric Bayesian network, where the conditional distributions entailed by the graph are approximated via distributional random forests [2]. Once fitted, you can sample from this collection of forests to produce a new, semi-synthetic dataset that respects acyclicity, causal sufficiency, and the conditional independence relationships entailed by the given graph, while its marginal and conditional distributions closely resemble those of the original dataset [1, figure 4].
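At its core, the sampling step is ancestral sampling: variables are visited in a topological order of the graph, and each one is drawn from its fitted conditional given the already-sampled values of its parents. The sketch below illustrates this idea with toy linear-Gaussian conditionals standing in for the fitted forests; the function names and conditionals are illustrative, not sempler's implementation.

```python
import numpy as np

def topological_order(graph):
    """Return a topological ordering of the DAG given as an adjacency
    matrix, where graph[i, j] != 0 means the edge i -> j."""
    remaining = list(range(graph.shape[0]))
    order = []
    while remaining:
        # A node is ready once none of its parents are still unordered
        ready = [j for j in remaining
                 if all(graph[i, j] == 0 for i in remaining)]
        if not ready:
            raise ValueError("graph is not a DAG")
        order += ready
        remaining = [j for j in remaining if j not in ready]
    return order

def ancestral_sample(graph, conditionals, n, rng):
    """Draw n joint observations by sampling each variable from its
    conditional given the already-sampled values of its parents."""
    X = np.zeros((n, graph.shape[0]))
    for j in topological_order(graph):
        parents = np.flatnonzero(graph[:, j])
        X[:, j] = conditionals[j](X[:, parents], rng)
    return X

# Toy linear-Gaussian conditionals standing in for the fitted forests
rng = np.random.default_rng(0)
graph = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])  # chain 0 -> 1 -> 2
conditionals = {
    0: lambda pa, rng: rng.normal(size=len(pa)),
    1: lambda pa, rng: 2 * pa[:, 0] + rng.normal(size=len(pa)),
    2: lambda pa, rng: -pa[:, 0] + rng.normal(size=len(pa)),
}
sample = ancestral_sample(graph, conditionals, n=100, rng=rng)
print(sample.shape)  # (100, 3)
```

In DRFNet, the lambdas above are replaced by distributional random forests fitted to the pooled data of each variable given its parents (and the environment), which is what lets the sampled marginals and conditionals resemble those of the original dataset.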
Additional R dependencies
For now, only an R implementation of distributional random forests [2] is available. Thus, to run the procedure you will additionally need:
- an R installation; you can find an installation guide here
- the R package drf, which you can install by typing install.packages("drf") in an R terminal
- the Python package rpy2>=3.4.1, which you can install automatically by running pip install sempler[DRFNet]
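Because these dependencies are optional, it can be convenient to check for them before constructing a DRFNet instance. The helper below is a hypothetical sketch, not part of sempler: it only checks for an R executable on the PATH and for the rpy2 package, since verifying that the drf R package is installed would require invoking R itself.

```python
import importlib.util
import shutil

def drfnet_dependencies_available():
    """Return True if the optional DRFNet dependencies appear to be
    available: an R executable on the PATH and the rpy2 Python package.
    (Hypothetical helper, not part of sempler; checking the drf R
    package would additionally require invoking R.)"""
    has_r = shutil.which("R") is not None
    has_rpy2 = importlib.util.find_spec("rpy2") is not None
    return has_r and has_rpy2

print(drfnet_dependencies_available())
```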
The class is documented below.
- class sempler.DRFNet(graph, data, verbose=False)
Fit a non-parametric Bayesian network with the given adjacency to the data, modelling the conditionals through distributional random forests.
- Parameters:
graph (numpy.ndarray) – Two-dimensional array representing the desired ground truth, where graph[i,j] != 0 implies the edge i -> j.
data (list of numpy.ndarray) – The data to which the Bayesian network will be fitted, as a list of two-dimensional arrays containing the samples from each environment, where rows correspond to observations and columns to variables.
verbose (bool, default=False) – If debugging traces should be printed.
- Raises:
TypeError – If the graph or data are of the wrong type.
ValueError – If the given adjacency is not a DAG or the samples in the data are of the wrong size.
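For illustration, the documented type and shape checks could be sketched as follows; this is a hypothetical helper, not sempler's actual validation code, and the acyclicity check is omitted for brevity.

```python
import numpy as np

def validate_inputs(graph, data):
    """Sketch of the documented input checks (hypothetical helper,
    not sempler's implementation; acyclicity check omitted)."""
    if not isinstance(graph, np.ndarray) or graph.ndim != 2:
        raise TypeError("graph must be a two-dimensional numpy.ndarray")
    p = graph.shape[0]
    if graph.shape != (p, p):
        raise ValueError("graph must be a square adjacency matrix")
    if not isinstance(data, list) or not all(
            isinstance(X, np.ndarray) for X in data):
        raise TypeError("data must be a list of numpy.ndarray")
    for X in data:
        if X.ndim != 2 or X.shape[1] != p:
            raise ValueError("each environment's sample must have "
                             "one column per variable")

# A valid graph/data pair passes silently
validate_inputs(np.zeros((3, 3)), [np.ones((10, 3)), np.ones((5, 3))])
```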
Examples
Fitting to some random data (from two environments) and a random graph.
>>> import numpy as np
>>> import sempler
>>> import sempler.generators
>>> from sempler import DRFNet
>>> rng = np.random.default_rng(42)
>>> data = [rng.uniform(size=(100, 5)) for _ in range(2)]
>>> graph = sempler.generators.dag_avg_deg(p=5, k=2, random_state=42)
>>> network = DRFNet(graph, data)
>>> network.graph
array([[0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0]])
- sample(n=None, random_state=None)
Generate a sample from the fitted Bayesian network.
- Parameters:
n (int or list of ints or NoneType, default=None) – The desired sample size. If None, the sample size for each environment matches that of the original data. Otherwise, either a single value (the same number of observations for every environment) or a list specifying the sample size for each environment.
random_state (int or NoneType, default=None) – To set the random seed for reproducibility.
- Returns:
sample – The resulting samples, one per environment.
- Return type:
list of numpy.ndarray
- Raises:
TypeError : – If n is of the wrong type.
ValueError : – If n is non-positive or has the wrong length (number of environments).
Examples
If not specifying a sample size, the sample sizes in the new data match those of the original:
>>> new_data = network.sample()
>>> len(new_data)
2
>>> [len(sample) for sample in new_data]
[100, 100]
Specifying the sample sizes:
>>> new_data = network.sample(3)
>>> [len(sample) for sample in new_data]
[3, 3]
>>> new_data = network.sample([2, 3])
>>> [len(sample) for sample in new_data]
[2, 3]
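The documented semantics of the n argument can be summarized in a small sketch (a hypothetical helper, not sempler's code) that resolves n into one sample size per environment:

```python
def resolve_sample_sizes(n, original_sizes):
    """Resolve the `n` argument of sample() into one sample size per
    environment. `original_sizes` holds the per-environment sizes of
    the data the network was fitted to. (Hypothetical helper, not
    sempler's implementation.)"""
    if n is None:
        sizes = list(original_sizes)       # match the original data
    elif isinstance(n, int):
        sizes = [n] * len(original_sizes)  # same size for every environment
    elif isinstance(n, list) and all(isinstance(k, int) for k in n):
        if len(n) != len(original_sizes):
            raise ValueError("n must have one entry per environment")
        sizes = list(n)
    else:
        raise TypeError("n must be None, an int or a list of ints")
    if any(k <= 0 for k in sizes):
        raise ValueError("sample sizes must be positive")
    return sizes

print(resolve_sample_sizes(None, [100, 100]))    # [100, 100]
print(resolve_sample_sizes(3, [100, 100]))       # [3, 3]
print(resolve_sample_sizes([2, 3], [100, 100]))  # [2, 3]
```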
References
[1] Gamella, J. L., Taeb, A., Heinze-Deml, C., & Bühlmann, P. (2022). Characterization and greedy learning of Gaussian structural causal models under unknown interventions. arXiv preprint arXiv:2211.14897.
[2] Ćevid, D., Michel, L., Näf, J., Meinshausen, N., & Bühlmann, P. (2020). Distributional random forests: Heterogeneity adjustment and multivariate distributional regression. arXiv preprint arXiv:2005.14458.