Thoughts and Theory · Causality BCN
How Double Machine Learning works for causal inference, from theoretical background to an application example.
Posted June 25, 2021 · 9 minute read
This post is the result of a joint effort with Aleix Ruiz de Villa, Jesús, and the whole Causality BCN club. Note: we assume that the reader is familiar with the basic concepts of causal inference.
If you dabble in the waters of causal inference, you may have heard of the concept of Double Machine Learning. And if you haven't heard of it yet, I'd personally bet you will in the near future. Or even better: you may end up using it without even knowing you're using it. Like any great technology, Double Machine Learning for causal inference may well become ubiquitous. But let me temper this writer's enthusiasm and get back to the matter at hand. In this post I try to explain, briefly but fairly comprehensively, what Double Machine Learning is and how it works. To this end, we will cover the topic from its theoretical background to a typical example of its use in causal inference. In a nutshell, Double Machine Learning provides:
- A general framework for estimating causal effects using machine learning techniques
- Confidence intervals for these estimates
- An estimator that is "root-n consistent," that is, an estimator with good convergence and asymptotic properties.
And where did this whole idea come from? On the one hand, to state the obvious, it comes from the idea of causal inference using machine learning. But if we look a little deeper, two non-obvious key ideas emerge:
- From a statistical point of view, Machine Learning is a set of non-parametric or semi-parametric estimation methods,
- there is a very large body of theoretical work on non-parametric and semi-parametric estimation methods (with bounds, convergence rates, etc.).
Double Machine Learning combines both, drawing inspiration and useful results from the latter to draw causal inferences from the former.
Let's get started. We begin by defining the DAG of the data-generating process we will work with, shown in the figure below:
Furthermore, we define the following partially linear model that sets the relationships between variables from the DAG:
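Written out (a reconstruction following the standard partially linear model of Chernozhukov et al., matching the symbols used below):

```latex
\begin{aligned}
Y &= D\,\theta_0 + g_0(Z) + U, \qquad & \mathbb{E}[U \mid Z, D] &= 0, \qquad &\text{(1.1)}\\
D &= m_0(Z) + V, \qquad & \mathbb{E}[V \mid Z] &= 0. \qquad &\text{(1.2)}
\end{aligned}
```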
where Y is the outcome variable, D is the binary treatment, Z is the covariate vector, and U and V are disturbance terms. Equation 1.1 is the main equation, and θ₀ is the parameter of interest we would like to estimate: the ATE, i.e., the derivative of Y with respect to D. Equation 1.2 captures confounding, that is, the dependence of the treatment on the covariates. This dependence is modeled by the function m₀(Z), and the dependence of the outcome on the covariates by the function g₀(Z). We will see later that Double Machine Learning also works for fully non-linear models, but we start by assuming this partially linear model to make the method easier to explain and interpret. Note also that we assume all the usual identifiability conditions of causal inference, i.e., no hidden confounders, positivity, and consistency.
Then, recalling and completing the definition of Double Machine Learning given in the introduction, the goal of the method is to obtain a root-n consistent estimator and confidence intervals for the (low-dimensional) parameter of interest θ₀, in the presence of the (potentially high-dimensional) nuisance parameter η₀ = (g₀, m₀). In this context, nuisance means that we do not directly care about the quality of our estimate of η₀, as long as we obtain a good (root-n consistent) estimator of θ₀.
But why would we want to use machine learning for this task at all? Mainly for three reasons. First and most obvious, because of the power machine learning methods have for modeling functions and expectations. Second, because machine learning models predict better than traditional statistical methods (i.e., linear regression fitted with OLS) on high-dimensional data, which is becoming the norm in our world. And last but most important, because compared to traditional statistical methods, machine learning methods do not impose such strong assumptions on the functional forms of m₀(Z) and especially g₀(Z), but learn these forms from data. This is a good safeguard against model misspecification, a problem that in our example would lead to a biased estimator even in the absence of unmeasured confounders.
So why not use machine learning to estimate θ₀ directly? For example, we could use an alternating, iterative approach: estimate g₀ using random forests, then estimate θ₀ using OLS, and repeat until convergence (note that using OLS for θ₀ here does not misspecify the model, because the model is linear in θ₀). Well, life is not that simple. The figure below shows the distribution of θ̂ − θ₀ for this approach, compared with a normal distribution centered at 0. As the difference between the two distributions shows, the estimator is biased. Note that g₀ is set to a smooth function with few parameters that should, in principle, be well approximated by random forests.
The key observation for understanding this phenomenon is that g₀(Z) ≠ 𝔼[Y|Z]. Thus, it is generally not possible to obtain a good estimate of g₀(Z) by "regressing" Y on Z, and this in turn makes it impossible to obtain a good estimate of θ₀. Nevertheless, it is perfectly possible to make very good predictions of Y given Z and D. This is why we usually say that machine learning is good for prediction but bad for causal inference. The bias has two sources: regularization and overfitting. Double Machine Learning aims to correct both: regularization bias by means of orthogonalization, and overfitting bias by means of cross-fitting. The following sections explain how these two bias-correction strategies work. A detailed explanation of the sources and shapes of the bias can be found in the authors' original paper or this presentation.
Orthogonality and the Neyman orthogonality condition
To show how orthogonalization works, we first state and briefly explain the Frisch-Waugh-Lovell theorem. This theorem states that, given the linear model Y=β0+β1D+β2Z+U, the following two approaches to estimating β1 produce the same result:
- Linear regression of Y on D and Z using OLS.
- A three-step process: 1) regress D on Z; 2) regress Y on Z; 3) regress the residuals of step 2 on the residuals of step 1 to obtain β1 (all regressions using OLS).
Similarly, returning to our partially linear example, we can proceed as follows:
- Predict D from Z using machine learning.
- Predict Y from Z using machine learning.
- Linear regression of the residuals of 2 on the residuals of 1 to obtain an estimate of θ₀.
This procedure ensures that the model in step 3 is "orthogonalized," resulting in an unbiased, root-n consistent estimator. See the figure below for the distribution of θ̂ − θ₀ under this approach.
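A minimal sketch of this three-step recipe on simulated data (plain OLS stands in for the machine-learning models, the treatment is kept continuous for simplicity, and all coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 5000, 0.5

# Partially linear model: D = m0(Z) + V,  Y = theta0*D + g0(Z) + U
Z = rng.normal(size=(n, 3))
D = 0.7 * Z[:, 0] + 0.3 * Z[:, 1] + rng.normal(size=n)        # confounded treatment
Y = theta0 * D + 2.0 * Z[:, 0] + 1.5 * Z[:, 1] + rng.normal(size=n)

def ols_predict(X, y):
    """Fit OLS of y on X (with intercept) and return in-sample predictions."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ beta

# Steps 1-2: predict D and Y from Z; step 3: residual-on-residual regression
V_hat = D - ols_predict(Z, D)          # treatment residuals
U_hat = Y - ols_predict(Z, Y)          # outcome residuals
theta_hat = (V_hat @ U_hat) / (V_hat @ V_hat)
```

With OLS this is simply the Frisch-Waugh-Lovell theorem in action; with flexible ML learners in steps 1 and 2, the cross-fitting described later is needed on top.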
How can we formalize and generalize this procedure? For this, we first need to introduce the concepts of score functions and moment conditions (for an introduction to the Generalized Method of Moments, see this page). Specifically, the score function used in this case is ψ = (D − m₀(Z)) × (Y − g₀(Z) − (D − m₀(Z))θ), where the multiplied terms are the error terms of the partially linear model, although other choices are possible. We now require the expectation of this score function to be zero, 𝔼[ψ] = 0, which is our moment condition. This is the mathematical expression of wanting our regressor and our error to be orthogonal, which is akin to saying we want them to be uncorrelated. Operationally, this means that once we have estimated g₀ and m₀, we can obtain θ₀ from the moment condition.
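Concretely, setting the expected score to zero and solving for θ gives θ₀ in closed form once the nuisance functions are known:

```latex
\mathbb{E}\big[(D - m_0(Z))\,\big(Y - g_0(Z) - (D - m_0(Z))\,\theta_0\big)\big] = 0
\;\;\Longrightarrow\;\;
\theta_0 = \frac{\mathbb{E}\big[(D - m_0(Z))\,(Y - g_0(Z))\big]}{\mathbb{E}\big[(D - m_0(Z))^2\big]}.
```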
We are finally ready to state and define the Neyman orthogonality condition, as follows:
∂η𝔼[ψ(W;θ₀,η₀)][η−η₀] = 0
where W denotes our data. This equation is interpreted as follows: the left-hand side is the Gateaux (directional) derivative of our score function with respect to the nuisance parameter η, evaluated in the neighborhood of η₀. We require this derivative to vanish. And since the derivative is the instantaneous rate of change, this says that our score function (and thus our estimate of θ₀) should be robust to "small" perturbations of the nuisance parameter η.
In summary, imposing the Neyman orthogonality condition on our score function (and thus on our estimators of θ₀ and η₀) frees the estimator of θ₀ from one of the two sources of bias: regularization bias.
Sample splitting and cross-fitting
Now it is time to get rid of the second source of bias, namely overfitting bias (again, for a detailed explanation of how these biases arise, see this presentation). A possible strategy for this is the so-called sample-splitting approach. It works as follows:
- We randomly divide our data into two subsets.
- We fit the machine learning models for D and Y to the first subset.
- In the second subset, we calculate θ₀ using the models obtained in step 2.
The downside of this strategy is that it reduces efficiency and statistical power, since only half the data is used for estimation. This can be solved with cross-fitting, which is the strategy used in the Double Machine Learning method. It works as follows:
- We randomly divide our data into two subsets.
- We fit the machine learning models for D and Y to the first subset.
- In the second subset, we estimate θ0,1 using the models obtained in step 2.
- We fit the machine learning models for D and Y on the second subset.
- In the first subset, we calculate θ0,2 using the models obtained in step 4.
- The final estimate of θ0 is taken as the average of θ0,1 and θ0,2.
Note that we can gain efficiency by repeating the process over K folds, with K greater than 2.
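The six steps above can be sketched as follows (simulated data again; OLS stands in for the ML learners purely for brevity, and the coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta0 = 4000, 1.0

# Same partially linear setup as before
Z = rng.normal(size=(n, 2))
D = 0.5 * Z[:, 0] + rng.normal(size=n)
Y = theta0 * D + Z[:, 0] - 0.5 * Z[:, 1] + rng.normal(size=n)

def fit_ols(X, y):
    """Return a predictor fitted by OLS (with intercept) on X, y."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta

# Step 1: random split into two subsets
idx = rng.permutation(n)
fold_a, fold_b = idx[: n // 2], idx[n // 2:]

thetas = []
for train, test in [(fold_a, fold_b), (fold_b, fold_a)]:
    m_hat = fit_ols(Z[train], D[train])   # nuisance models fitted on one fold...
    g_hat = fit_ols(Z[train], Y[train])
    V = D[test] - m_hat(Z[test])          # ...and evaluated on the other fold
    U = Y[test] - g_hat(Z[test])
    thetas.append((V @ U) / (V @ V))

theta_hat = np.mean(thetas)               # average of theta_{0,1} and theta_{0,2}
```

With K folds, the same loop simply runs over K train/test splits and averages the K estimates.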
We have finally reached the last section of this post, and it is time to gather all the ingredients explained so far and put them together into an algorithm.
Let us define a fully interactive model, more general than the partially linear one. This model does not assume that D (the treatment) enters additively:
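In the notation used so far (a reconstruction consistent with the partially linear model above), the fully interactive model reads:

```latex
\begin{aligned}
Y &= g_0(D, Z) + U, \qquad & \mathbb{E}[U \mid Z, D] &= 0,\\
D &= m_0(Z) + V, \qquad & \mathbb{E}[V \mid Z] &= 0,
\end{aligned}
```

with D ∈ {0, 1}, so that m₀(Z) = P(D = 1 | Z) is the propensity score.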
Our causal parameter of interest is the ATE, 𝔼[g₀(1, Z) − g₀(0, Z)], and our score function is
the one proposed by Robins and Rotnitzky (1995); it is Neyman orthogonal (and doubly robust), satisfying ∂η𝔼[ψ(W; θ₀, η₀)][η − η₀] = 0.
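This score is the classical augmented inverse-propensity-weighted (AIPW) score; written out, it reads:

```latex
\psi(W; \theta, \eta) = g(1, Z) - g(0, Z)
+ \frac{D\,(Y - g(1, Z))}{m(Z)}
- \frac{(1 - D)\,(Y - g(0, Z))}{1 - m(Z)}
- \theta,
```

so that solving 𝔼[ψ] = 0 for θ yields the ATE estimate.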
In such a setting, Double Machine Learning provides an unbiased, root-n consistent estimator of the ATE and its confidence intervals (at confidence level α), obtained by combining the ingredients above: cross-fitting of the nuisance models m₀ and g₀, and estimation of θ₀ through the orthogonal score.
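A rough, self-contained sketch of such an algorithm on simulated data (two-fold cross-fitting plus the doubly robust score; a Newton-fitted logistic regression and OLS stand in for the ML learners, and the propensity clipping threshold of 0.01 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n, ate = 4000, 1.0

# Interactive model: binary D with propensity m0(Z), constant treatment effect
Z = rng.normal(size=(n, 2))
m0 = 1.0 / (1.0 + np.exp(-0.8 * Z[:, 0]))            # true P(D=1|Z)
D = (rng.uniform(size=n) < m0).astype(float)
Y = ate * D + Z[:, 0] - 0.5 * Z[:, 1] + rng.normal(size=n)

def with_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic(X, y, steps=25):
    """Logistic regression via Newton's method; returns a predict function."""
    X1, beta = with_intercept(X), np.zeros(X.shape[1] + 1)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        w = p * (1.0 - p)
        beta += np.linalg.solve(X1.T @ (X1 * w[:, None]) + 1e-6 * np.eye(len(beta)),
                                X1.T @ (y - p))
    return lambda Xn: 1.0 / (1.0 + np.exp(-with_intercept(Xn) @ beta))

def fit_ols(X, y):
    X1 = with_intercept(X)
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return lambda Xn: with_intercept(Xn) @ beta

idx = rng.permutation(n)
fold_a, fold_b = idx[: n // 2], idx[n // 2:]
psi = np.empty(n)
for train, test in [(fold_a, fold_b), (fold_b, fold_a)]:
    m_hat = fit_logistic(Z[train], D[train])                            # propensity
    g1_hat = fit_ols(Z[train][D[train] == 1], Y[train][D[train] == 1])  # E[Y|D=1,Z]
    g0_hat = fit_ols(Z[train][D[train] == 0], Y[train][D[train] == 0])  # E[Y|D=0,Z]
    m = np.clip(m_hat(Z[test]), 0.01, 0.99)
    g1, g0 = g1_hat(Z[test]), g0_hat(Z[test])
    d, y = D[test], Y[test]
    # Doubly robust score evaluated on the held-out fold
    psi[test] = g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m)

theta_hat = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)    # 95% confidence interval
```

The standard error comes directly from the empirical variance of the score, which is what makes the confidence intervals of the method so convenient.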
And so we have come to the end of our post. Note that there are packages available for Python and R that implement this and other related algorithms, which can be found at https://github.com/DoubleML.