Identification With Surrogates

Causal identification is usually framed as the problem of recovering a causal query \(p(Y | do(a))\) from an observed data distribution \(p(V)\) (Shpitser and Pearl, 2006; Huang and Valtorta, 2006). Recall that \(p(Y(a))\) and \(p(Y | do(a))\) are equivalent; we use do-notation here because it is cleaner.

Recently, however, there has been interest in identification from other distributions. An analyst might have access to a set of experiments (in which certain variables have been intervened on and set to fixed values), and while the causal query might not be identified from any single experiment, it might be identified from them jointly. Lee et al. (2019) provide a sound and complete algorithm for the case where the experiments are of the form \(p(V \setminus X | do(x))\) for some intervened variables \(X=x\).

Lee and Shpitser (2020) (a different Lee!) extend the applicability of this algorithm by showing that it remains sound and complete when the experiments correspond to ancestral subgraphs of the original graph under each experiment's intervention. They also recast the results of Lee et al. (2019) using the one-line ID formulation of Richardson et al. (2017).

The following packages need to be imported.

[21]:

from ananke import graphs
from ananke import identification

Let’s say that we are interested in a system represented by the following graph. We can think of \(X_1\) as a treatment for cardiac disease, \(X_2\) as a treatment for obesity (say a particular diet), \(W\) as blood pressure (which, for the purposes of this toy example, is influenced only by the cardiac treatment), and \(Y\) as the final health outcome. As a first pass, we posit the following causal graph, in which the only unmeasured confounding is between a person’s cardiac treatment assignment \(X_1\) and their diet \(X_2\). We are interested in the causal query \(p(Y | do(x_1, x_2))\).

[22]:

vertices = ["X1", "X2", "W", "Y"]
di_edges = [("X1", "W"), ("W", "Y"), ("X2", "Y")]
bi_edges = [("X1", "X2")]
G = graphs.ADMG(vertices, di_edges, bi_edges)
G.draw(direction="TD")

[22]:

../_images/notebooks_identification_surrogates_3_0.svg

If we query the OneLineID algorithm in Ananke, we will see that this query is indeed identified.

[23]:

one_id = identification.OneLineID(graph=G, treatments=['X1', 'X2'], outcomes=['Y'])
one_id.id()

[23]:

True

However, since we are in a clinical healthcare setting, we should expect more unmeasured confounding. So we update our causal graph to include hidden confounders between the cardiac treatment \(X_1\) and blood pressure \(W\), and between the diet \(X_2\) and the outcome \(Y\).

[24]:

vertices = ["X1", "X2", "W", "Y"]
di_edges = [("X1", "W"), ("W", "Y"), ("X2", "Y")]
bi_edges = [("X1", "X2"), ("X1", "W"), ("X2", "Y")]
G = graphs.ADMG(vertices, di_edges, bi_edges)
G.draw(direction="TD")

[24]:

../_images/notebooks_identification_surrogates_7_0.svg

We can now verify that, under this model, the query is not identified from the observed data distribution \(p(V) := p(Y, W, X_2, X_1)\) using OneLineID:

[25]:

one_id = identification.OneLineID(graph=G, treatments=['X1', 'X2'], outcomes=['Y'])
one_id.id()

[25]:

False
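One way to see what changed between the two graphs is to compare their districts (sets of vertices connected by bidirected paths), which play a central role in the one-line ID formulation. The following is a minimal plain-Python sketch, not Ananke's implementation; the helper `districts` is a hypothetical name introduced only for this illustration.

```python
def districts(vertices, bi_edges):
    """Return the bidirected-connected components of an ADMG as a list of sets."""
    neighbours = {v: set() for v in vertices}
    for a, b in bi_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, result = set(), []
    for v in vertices:
        if v in seen:
            continue
        # Depth-first search along bidirected edges only.
        component, stack = set(), [v]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(component)
    return result


vertices = ["X1", "X2", "W", "Y"]
first_graph = districts(vertices, [("X1", "X2")])
second_graph = districts(vertices, [("X1", "X2"), ("X1", "W"), ("X2", "Y")])
print(first_graph)
print(second_graph)
```

In the first graph, \(W\) and \(Y\) each form their own district. In the second graph, the added bidirected edges merge all four variables into a single district, so the treatments now share a district with their own descendants, and this is the structural change behind the failed identification.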

It seems that we are stuck! However, in the next section we discuss how access to smaller subsets of experimental data on the variables involved may help us identify our causal query.

GID

What if we had access to experiments? We construct two experiments: one where \(X_1\) is fixed to value \(x_1\), and another where \(X_2\) is fixed to value \(x_2\). Perhaps it is not possible to run an experiment in which both \(X_1\) and \(X_2\) are intervened upon (for financial or ethical reasons, say), but these smaller experiments are possible.

[26]:

vertices = ["X1", "X2", "W", "Y"]
di_edges = [("X1", "W"), ("W", "Y"), ("X2", "Y")]
bi_edges = [("X1", "X2"), ("X1", "W"), ("X2", "Y")]
G1 = graphs.ADMG(vertices, di_edges, bi_edges)
G1.fix(["X1"])
G1.draw(direction="TD")

[26]:

../_images/notebooks_identification_surrogates_11_0.svg

[27]:

vertices = ["X1", "X2", "W", "Y"]
di_edges = [("X1", "W"), ("W", "Y"), ("X2", "Y")]
bi_edges = [("X1", "X2"), ("X1", "W"), ("X2", "Y")]
G2 = graphs.ADMG(vertices, di_edges, bi_edges)
G2.fix(["X2"])
G2.draw(direction="TD")

[27]:

../_images/notebooks_identification_surrogates_12_0.svg
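Graphically, intervening on a variable removes every edge with an arrowhead into it: directed edges pointing at it and bidirected edges touching it. The sketch below mimics this on plain edge lists so the effect on each experiment's graph can be read off directly. It is an illustration under that assumption, not Ananke's `fix` method, and `fix_vertex` is a hypothetical helper name.

```python
def fix_vertex(di_edges, bi_edges, vertex):
    """Drop every edge that has an arrowhead into `vertex`."""
    # Directed edges (a, b) mean a -> b, so only edges ending at `vertex` go.
    kept_di = [(a, b) for a, b in di_edges if b != vertex]
    # Bidirected edges have arrowheads at both ends, so any edge touching it goes.
    kept_bi = [(a, b) for a, b in bi_edges if vertex not in (a, b)]
    return kept_di, kept_bi


di_edges = [("X1", "W"), ("W", "Y"), ("X2", "Y")]
bi_edges = [("X1", "X2"), ("X1", "W"), ("X2", "Y")]

g1_di, g1_bi = fix_vertex(di_edges, bi_edges, "X1")  # experiment do(x1)
g2_di, g2_bi = fix_vertex(di_edges, bi_edges, "X2")  # experiment do(x2)
print(g1_di, g1_bi)
print(g2_di, g2_bi)
```

Neither treatment has incoming directed edges, so only bidirected edges are removed: the first experiment keeps the confounding between \(X_2\) and \(Y\), and the second keeps the confounding between \(X_1\) and \(W\).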

It happens that the causal query is indeed identified:

[28]:

g_id = identification.OneLineGID(graph=G, treatments=["X1", "X2"], outcomes=["Y"])

[29]:

g_id.id(experiments=[G1, G2])

[29]:

True

with corresponding identifying functional

[30]:

g_id.functional(experiments=[G1, G2])

[30]:

'ΣW ΦX2,Y p(W,X2,Y | do(X1))ΦX1,W p(W,X1,Y | do(X2))'
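Assuming that Φ in this output denotes the fixing operator \(\phi\) of Richardson et al. 2017, applied to the listed variables, the string can be read as

\[
p(Y \mid do(x_1, x_2)) = \sum_{W} \phi_{X_2, Y}\big(p(W, X_2, Y \mid do(x_1))\big)\, \phi_{X_1, W}\big(p(W, X_1, Y \mid do(x_2))\big).
\]

Under this reading, each experiment contributes one factor: fixing \(X_2\) and \(Y\) in the first experiment leaves a kernel over \(W\), and fixing \(X_1\) and \(W\) in the second leaves a kernel over \(Y\).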

AID

The astute reader will notice that we may not have needed the full experimental distributions. A margin would have sufficed: in the first experiment, we could have marginalized out \(Y\) and \(X_2\) and still achieved identification, because the intrinsic set (and its parents) would have been identified anyway. Based on the results of Lee and Shpitser, 2020, we consider identification from the following experiments. The first graph is an ancestral subgraph with respect to \(G(V(x_1))\). The second graph remains unchanged, as does the graph defining the system we are interested in.

[31]:

vertices = ["X1","W"]
di_edges = [("X1", "W")]
bi_edges = [("X1", "W")]
G1 = graphs.ADMG(vertices, di_edges, bi_edges)
G1.fix(["X1"])
G1.draw(direction="TD")

[31]:

../_images/notebooks_identification_surrogates_19_0.svg
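That \(\{X_1, W\}\) is ancestral in the graph under the intervention on \(X_1\) can be checked by confirming that the set already contains all of its ancestors along directed edges. Below is a small plain-Python sketch of that check. It does not use Ananke, and `ancestors` and `is_ancestral` are hypothetical helper names. Intervening on \(X_1\) removes no directed edges here, so the original directed edges are used.

```python
def ancestors(targets, di_edges):
    """Return `targets` together with all of their ancestors along directed edges."""
    parents = {}
    for a, b in di_edges:
        parents.setdefault(b, set()).add(a)

    result, stack = set(), list(targets)
    while stack:
        node = stack.pop()
        if node in result:
            continue
        result.add(node)
        stack.extend(parents.get(node, ()))
    return result


def is_ancestral(subset, di_edges):
    """A set is ancestral if it is closed under taking ancestors."""
    return ancestors(subset, di_edges) == set(subset)


di_edges = [("X1", "W"), ("W", "Y"), ("X2", "Y")]
print(is_ancestral({"X1", "W"}, di_edges))
print(is_ancestral({"W", "Y"}, di_edges))
```

By contrast, a set such as \(\{W, Y\}\) is not ancestral, since it omits the parents \(X_1\) and \(X_2\).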

[32]:

a_id = identification.OneLineAID(graph=G, treatments=["X1", "X2"], outcomes=["Y"])

[33]:

a_id.id(experiments=[G1, G2])

[33]:

True

[34]:

a_id.functional(experiments=[G1, G2])

[34]:

'ΣW  p(W | do(X1))ΦX1,W p(W,X1,Y | do(X2))'

The causal query remains identified, but the identifying functional has changed. Notably, no fixing operations are needed in the first experiment, since \(p(W | do(x_1))\) is exactly the required kernel.
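As a sanity check, the functional can be evaluated on a toy model where the truth is known. The sketch below assumes that the second factor, after fixing \(X_1\) and \(W\), reduces to \(p(Y | w, do(x_2))\), so that the functional reads \(\sum_w p(w | do(x_1))\, p(Y | w, do(x_2))\). That reduction is our own reading of the output rather than something Ananke reports. The binary structural model, its hidden confounders `U1`, `U2`, `U3`, and all numeric parameters are hypothetical.

```python
from itertools import product

# P(U = 1) for the hidden confounders: U1 (X1 <-> X2), U2 (X1 <-> W), U3 (X2 <-> Y).
P_U = {"U1": 0.3, "U2": 0.6, "U3": 0.4}


def bern(p_one, value):
    """Probability that a Bernoulli(p_one) variable equals `value`."""
    return p_one if value == 1 else 1.0 - p_one


def p_u(u1, u2, u3):
    return bern(P_U["U1"], u1) * bern(P_U["U2"], u2) * bern(P_U["U3"], u3)


def p_x1(x1, u1, u2):  # X1 depends on U1 and U2
    return bern(0.2 + 0.3 * u1 + 0.4 * u2, x1)


def p_w(w, x1, u2):  # W depends on X1 and U2
    return bern(0.1 + 0.5 * x1 + 0.3 * u2, w)


def p_y(y, w, x2, u3):  # Y depends on W, X2 and U3
    return bern(0.05 + 0.4 * w + 0.2 * x2 + 0.3 * u3, y)


def truth(y, x1, x2):
    """p(Y = y | do(x1, x2)) computed directly from the structural model."""
    return sum(
        p_u(u1, u2, u3) * p_w(w, x1, u2) * p_y(y, w, x2, u3)
        for u1, u2, u3, w in product((0, 1), repeat=4)
    )


def exp1_w(w, x1):
    """Experiment 1 margin: p(W = w | do(x1))."""
    return sum(bern(P_U["U2"], u2) * p_w(w, x1, u2) for u2 in (0, 1))


def exp2_joint(w, x1, y, x2):
    """Experiment 2: p(W = w, X1 = x1, Y = y | do(x2))."""
    return sum(
        p_u(u1, u2, u3) * p_x1(x1, u1, u2) * p_w(w, x1, u2) * p_y(y, w, x2, u3)
        for u1, u2, u3 in product((0, 1), repeat=3)
    )


def exp2_y_given_w(y, w, x2):
    """p(Y = y | W = w, do(x2)), computed from the experiment 2 joint alone."""
    numerator = sum(exp2_joint(w, x1, y, x2) for x1 in (0, 1))
    denominator = sum(exp2_joint(w, x1, yy, x2) for x1 in (0, 1) for yy in (0, 1))
    return numerator / denominator


def from_experiments(y, x1, x2):
    """Combine the two experiments: sum over w of p(w | do(x1)) p(y | w, do(x2))."""
    return sum(exp1_w(w, x1) * exp2_y_given_w(y, w, x2) for w in (0, 1))


print(truth(1, 1, 1), from_experiments(1, 1, 1))
```

In this toy model the two quantities agree for every setting of \(y\), \(x_1\) and \(x_2\), even though neither experiment intervenes on both treatments at once.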