Title: | Double Machine Learning Algorithms |
---|---|
Description: | Implementation of double machine learning (DML) algorithms in R, based on Emmenegger and Buehlmann (2021) "Regularizing Double Machine Learning in Partially Linear Endogenous Models" <arXiv:2101.12525> and Emmenegger and Buehlmann (2021) <arXiv:2108.13657> "Double Machine Learning for Partially Linear Mixed-Effects Models with Repeated Measurements". First part: our goal is to perform inference for the linear parameter in partially linear models with confounding variables. The standard DML estimator of the linear parameter has a two-stage least squares interpretation, which can lead to a large variance and overwide confidence intervals. We apply regularization to reduce the variance of the estimator, which produces narrower confidence intervals that are approximately valid. Nuisance terms can be flexibly estimated with machine learning algorithms. Second part: our goal is to estimate and perform inference for the linear coefficient in a partially linear mixed-effects model with DML. Machine learning algorithms allows us to incorporate more complex interaction structures and high-dimensional variables. |
Authors: | Corinne Emmenegger [aut, cre]
|
Maintainer: | Corinne Emmenegger <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.2 |
Built: | 2025-03-13 04:01:14 UTC |
Source: | https://github.com/cran/dmlalg |
This is a method for the class regsdml
. It returns the estimated
coefficients from
objects of class regsdml
, which typically result from a function
call to regsdml
.
## S3 method for class 'regsdml' coef(object, print_regsDML = NULL, print_safety = NULL, print_DML = NULL, print_regDML = NULL, print_regDML_all_gamma = !is.null(parm), parm = NULL, print_gamma = FALSE, ...)
## S3 method for class 'regsdml' coef(object, print_regsDML = NULL, print_safety = NULL, print_DML = NULL, print_regDML = NULL, print_regDML_all_gamma = !is.null(parm), parm = NULL, print_gamma = FALSE, ...)
object |
An object of class |
print_regsDML |
A boolean. If |
print_safety |
A boolean. If |
print_DML |
A boolean. If |
print_regDML |
A boolean. If |
print_regDML_all_gamma |
A boolean. If |
parm |
A vector containing the indices for which |
print_gamma |
A boolean. If |
... |
Further arguments passed to or from other methods. |
Coefficients of the methods regsDML
, the safety
device,
DML
, regDML
with the optimal
choice of (including the factor
a_N
),
and regDML
with prespecified -values are returned by setting the
respective arguments. It is possible to return the respective
gamma
-values.
If none of the printing arguments are set, only the results of regsDML
are returned if they are available. If they are not available and none of
the printing arguments are set, the results from all available methods
are returned. If print_regsDML = FALSE
, only the results from
those methods are returned that are explicitly specified by the printing
arguments.
regsdml
,
summary.regsdml
,
confint.regsdml
,
vcov.regsdml
print.regsdml
## See example(regsdml) for examples
## See example(regsdml) for examples
This is a method for the class mmdml
.
It computes two-sided
confidence intervals for testing the two-sided component-wise
null hypotheses
with the (approximate) asymptotic Gaussian distribution of the coefficient
estimator. The method can be applied to objects
of class
mmdml
that typically result from a function
call to mmdml
.
## S3 method for class 'mmdml' confint(object, parm = NULL, level = 0.95, ...)
## S3 method for class 'mmdml' confint(object, parm = NULL, level = 0.95, ...)
object |
An object of class |
parm |
A vector containing the indices for which |
level |
A number between 0 and 1 representing
the confidence level. The default is |
... |
Further arguments passed to or from other methods. |
A matrix with columns giving the lower and upper confidence limits for each
entry of .
The columns are labelled as
These will be labelled as
(1-level)/2
% and
1 - (1-level)/2
%, by default 2.5% and 97.5%.
## See example(mmdml) for examples
## See example(mmdml) for examples
This is a method for the class regsdml
.
It computes two-sided
confidence intervals for testing the two-sided component-wise
null hypotheses that tests if a component equals zero
with the (approximate) asymptotic Gaussian distribution of the
coefficient estimator. The method can be applied to objects
of class regsdml
, which typically result from a function
call to regsdml
.
## S3 method for class 'regsdml' confint(object, parm = NULL, level = 0.95, print_regsDML = NULL, print_safety = NULL, print_DML = NULL, print_regDML = NULL, print_regDML_all_gamma = !is.null(parm), print_gamma = FALSE, ...)
## S3 method for class 'regsdml' confint(object, parm = NULL, level = 0.95, print_regsDML = NULL, print_safety = NULL, print_DML = NULL, print_regDML = NULL, print_regDML_all_gamma = !is.null(parm), print_gamma = FALSE, ...)
object |
An object of class |
parm |
A vector containing the indices for which |
level |
A number between 0 and 1 representing
the confidence level. The default is |
print_regsDML |
A boolean. If |
print_safety |
A boolean. If |
print_DML |
A boolean. If |
print_regDML |
A boolean. If |
print_regDML_all_gamma |
A boolean. If |
print_gamma |
A boolean. If |
... |
Further arguments passed to or from other methods. |
Confidence intervals for the methods regsDML
, the safety
device,
DML
, regDML
with the optimal
choice of (including the factor
a_N
),
and regDML
with prespecified -values are returned by setting the
respective arguments.
If none of the printing arguments are set, only the results of regsDML
are returned if they are available. If they are not available and none of
the printing arguments are set, the results from all available methods
are returned. If print_regsDML = FALSE
, only the results from
those methods are returned that are explicitly specified by the printing
arguments.
regsdml
,
summary.regsdml
,
coef.regsdml
,
vcov.regsdml
print.regsdml
## See example(regsdml) for examples.
## See example(regsdml) for examples.
The dmlalg
package contains implementations of
double machine learning (DML) algorithms in R
.
Our goal is to perform inference for the linear parameter in partially linear models with confounding variables. The standard DML estimator of the linear parameter has a two-stage least squares interpretation, which can lead to a large variance and overwide confidence intervals. We apply regularization to reduce the variance of the estimator, which produces narrower confidence intervals that are approximately valid. Nuisance terms can be flexibly estimated with machine learning algorithms.
regsdml
Estimates the linear parameter in a partially linear model with confounding variables with regularized and standard DML methods.
summary.regsdml
A summary
method for objects
fitted with regsdml
.
confint.regsdml
A confint
method for objects
fitted with regsdml
.
coef.regsdml
A coef
method for objects
fitted with regsdml
.
vcov.regsdml
A vcov
method for objects
fitted with regsdml
.
print.regsdml
A print
method for objects
fitted with regsdml
.
Our goal is to estimate and perform inference for the linear coefficient in a partially linear mixed-effects model with DML. Machine learning algorithms allows us to incorporate more complex interaction structures and high-dimensional variables.
mmdml
Estimates the linear parameter in a PLMM with repeated measurements using double machine learning.
confint.mmdml
A confint
method for objects
fitted with mmdml
.
fixef.mmdml
A fixef
method for objects
fitted with mmdml
.
print.mmdml
A print
method for objects
fitted with mmdml
.
ranef.mmdml
A ranef
method for objects
fitted with mmdml
.
residuals.mmdml
A residuals
method for objects
fitted with mmdml
.
sigma.mmdml
A sigma
method for objects
fitted with mmdml
.
summary.mmdml
A summary
method for objects
fitted with mmdml
.
vcov.mmdml
A vcov
method for objects
fitted with mmdml
.
VarCorr.mmdml
A VarCorr
method for objects
fitted with mmdml
.
C. Emmenegger and P. Bühlmann. Regularizing Double Machine Learning in Partially Linear Endogenous Models, 2021. Preprint arXiv:2101.12525.
C. Emmenegger and P. Bühlmann. Double Machine Learning for Partially Linear Mixed-Effects Models with Repeated Measurements. Preprint arXiv:2108.13657.
Generate data from a partially linear mixed-effects model with one or two fixed effects, 2 random effects, and 3 nonparametric variables. The true underlying function of the nonparametric variables are step functions. The random effects and error terms are from a Gaussian distribution.
example_data_mmdml(beta0, N = 10L, n = 5L)
example_data_mmdml(beta0, N = 10L, n = 5L)
beta0 |
Numeric vector of length 1 or 2 representing the linear coefficient/fixed effects of the model. |
N |
Number of experimental units. Equals 10 per default. |
n |
Expected number of observations per experimental unit, needs to be at least 5. Equals 5 per default. |
A data frame with the columns resp
(the response), id
and
cask
(random effects), w1
, w2
, and w3
(nonparametric confounders), and x1
if beta0
is of length
1 and x1
and x2
if beta0
is of length 2.
The random effects are modelled with "(1|id) + (1|cask:id)"
.
## See example(mmdml) for examples
## See example(mmdml) for examples
Methods for the class mmdml
for generics from lme4.
fixef(object, ...) ## S3 method for class 'mmdml' fixef(object, ...) ranef(object, ...) ## S3 method for class 'mmdml' ranef(object, ...) VarCorr(x, sigma = 1, ...) ## S3 method for class 'mmdml' VarCorr(x, ...) vcov(object, ...) ## S3 method for class 'mmdml' vcov(object, ...)
fixef(object, ...) ## S3 method for class 'mmdml' fixef(object, ...) ranef(object, ...) ## S3 method for class 'mmdml' ranef(object, ...) VarCorr(x, sigma = 1, ...) ## S3 method for class 'mmdml' VarCorr(x, ...) vcov(object, ...) ## S3 method for class 'mmdml' vcov(object, ...)
object , x
|
An object of class |
sigma |
|
... |
Further arguments passed to or from other methods. |
fixef.mmdml
:
Extracts the estimator of the linear coefficient , which
is a named and numeric vector.
ranef.mmdml
:
Extracts the random_eff
entry from object
.
VarCorr.mmdml
:
The variance and correlation components are computed with the
sigma
and the theta
entries of x
as in
lmer
.
For each of the S
repetitions, sigma
and theta
computed
on the K
sample splits are aggregated by taking the mean.
Then, the S
mean-aggregated estimates are aggregated by
the median.
The variance and correlation components are computed with these
median-aggregated estimates.
vcov.mmdml
:
It returns the variance-covariance matrix of the estimator of the linear
coefficient is extracted.
It is computed based on the asymptotic Gaussian distribution
of the estimator.
First, for each of the S
repetitions, the variance-covariance
matrices computed
on the K
sample splits are aggregated by taking the mean.
Second, the S
mean-aggregated estimates are aggregated by
adding a term correcting for the randomness in the sample splits
and by taking the median of these corrected terms.
This final corrected and median-aggregated matrix is returned.
## See example(mmdml) for examples
## See example(mmdml) for examples
Our goal is to perform inference for the linear parameter in partially linear mixed-effects models (PLMMs) with repeated measurements using double machine learning (DML).
The function mmdml
estimates the linear parameter
in the PLMM
for the continuous response
with linear covariates
, nonlinear covariates
, unobserved random effects
, and the error term
.
For each
, there are
repeated observations available.
That is,
is an
-dimensional vector.
The matrix
is fixed. The random effects
and the
error terms
are Gaussian distributed, independent,
and independent of
and
, respectively,
for
. The linear and nonlineare covariates
and
are random and independent of all random effects and error terms.
The linear parameter can be estimated with
a linear mixed-effects modeling approach with maximum likelihood
after regressing out
nonparametrically from
and
using machine learning
algorithms.
A linear mixed-effects model is estimated on the partialled-out data
The conditional expectations are estimated with machine learning algorithms
and sample splitting, and cross-fitting is used to regain full efficiency
of the estimator of . This estimator
is asymptotically Gaussian distributed and efficient.
mmdml( w, x, y, z, data = NULL, z_formula = NULL, group = "group", K = 2L, S = 100L, cond_method = rep("forest", 2), params = NULL, parallel = c("no", "multicore", "snow"), ncpus = 1L, cl = NULL, nr_random_eff = if (S > 5) 1L else S, nr_res = nr_random_eff )
mmdml( w, x, y, z, data = NULL, z_formula = NULL, group = "group", K = 2L, S = 100L, cond_method = rep("forest", 2), params = NULL, parallel = c("no", "multicore", "snow"), ncpus = 1L, cl = NULL, nr_random_eff = if (S > 5) 1L else S, nr_res = nr_random_eff )
w |
A vector, matrix, or data frame. Its columns contain observations
of the nonlinear predictors. Alternatively, if the data is
provided in the data frame |
x |
A vector, matrix, or data frame. This is the linear predictor.
Alternatively, if the data is provided in the data frame
|
y |
A vector, matrix, or data frame. This is the response.
Alternatively, if the data is provided in the data frame
|
z |
A vector, matrix, or data frame. It acts as the fixed matrix
assigning the random effects.
Alternatively, if the data is
provided in the data frame |
z_formula |
A string specifying the structure of the random effect
using the notation as in |
group |
A string containing the name of the grouping variable in
|
data |
An optional data frame. If it is specified, its column names
need to coincide with the character vectors specified in |
K |
The number of sample splits used for cross-fitting. |
S |
Number of replications to correct for the random
splitting of the sample. It is set to |
cond_method |
A character vector of length 2 specifying the estimation
methods used to fit the conditional
expectations |
params |
An optional list of length 2. The 2 elements of this list
are lists themselves. These lists specify additional input arguments for
estimating the conditional expectations |
parallel |
One out of |
ncpus |
An integer specifying the number of cores used if
|
cl |
An optional parallel or snow cluster if |
nr_random_eff |
An integer specifying the number of unaggregated sets
of random effect estimates among the |
nr_res |
An integer specifying the number of unaggregated sets
of residual estimates among the |
The estimator of is computed using sample splitting and
cross-fitting.
The subject-specific data (over
) is split
into
K
sets that are equally large
if possible. For each such set, the nuisance parameters
(that is, the conditional expectations and
) are estimated on its complement and evaluated on the
set itself.
Estimators of
and the variance parameters are computed
for each
of the
K
data sets and are then averaged.
If K = 1
, no sample splitting
is performed. In this case, the nuisance parameters are estimated and
predicted on the full sample.
The whole estimation procedure is repeated S
times to
account for the randomness introduced by the random sample splits.
The S
repetitions can be run in parallel by specifying the
arguments parallel
and ncpus
.
The S
estimators of and the variance components
are aggregated by taking the
median of them. The
S
variance-covariance matrices of the estimator
of are aggregated
by first adding a correction term to them that accounts for the random
splitting and by afterwards taking the median of the corrected
variance-covariance matrices. If
, it can happen that this
final matrix is not positive definite anymore, in which case the mean
is considered instead.
Estimates of the conditional random effects and the residuals are computed
in each of the
S
repetitions. A total number of nr_random_eff
and nr_res
of them, respectively, is returned.
Additionally, the random effects estimates from all S
repetitions
are aggregated using the mean and returned.
If the design in at least 0.5 * S
of the S
repetitions is
singular, an error message is displayed.
If the designs in some but less than 0.5 * S
of the S
repetitions are singular, another S
repetitions are performed.
If, in total, at least S
repetitions result in a nonsingular design,
the results are returned together with a warning message.
The default options of the "spline"
, "forest"
,
"ols"
, "lasso"
, "ridge"
, and "elasticnet"
methods are as follows. With the "spline"
method,
the function bs
from the package splines
is employed
with degree = 3
and df = ceiling(N ^ (1 / 5)) + 2
if
N
satisfies (df + 1) * v + 1 > N
, where v
denotes
the number of columns of w
and N
denotes the sample size.
Otherwise, df
is consecutively
reduced by 1
until this condition is satisfied.
The splines are fitted and predicted on different data sets.
If they are extrapolated, a warning message is displayed.
With the "forest"
method, the function randomForest
from
the package randomForest
is employed with nodesize = 5
,
ntree = 500
, na.action = na.omit
, and replace = TRUE
.
With the "ols"
method, the default arguments are used and no
additional arguments are specified.
With the "lasso"
and "ridge"
methods,
the function cv.glmnet
from the package glmnet
performs
10-fold cross validation by default (argument nfolds
)
to find the one-standard-error-rule -parameter.
With the
"elasticnet"
method, the function cv.glmnet
from
the package glmnet
performs 10-fold cross validation
(argument nfolds
) with alpha = 0.5
by default
to find the one-standard-error-rule -parameter.
All default values of the mentioned parameters can be adapted by
specifying the argument
params
.
There are three possibilities to set the argument parallel
, namely
"no"
for serial evaluation (default),
"multicore"
for parallel evaluation using forking,
and "snow"
for parallel evaluation using a parallel
socket cluster. It is recommended to select RNGkind
("L'Ecuyer-CMRG"
) and to set a seed to ensure that the parallel
computing of the package dmlalg
is reproducible.
This ensures that each processor receives a different substream of the
pseudo random number generator stream.
Thus, the results reproducible if the arguments remain unchanged.
There is an optional argument cl
to specify a custom cluster
if parallel = "snow"
.
The response y
needs to be continuous.
The covariate w
may contain factor variables in its columns.
If the variable x
contains factor variables,
the factors should not be included as factor columns of x
.
Instead, dummy encoding should be used for all individual levels of the
factor.
That is, a factor with 4 levels should be encoded with 4 columns where each
column consists of 1 and 0 entries indicating the presence of the
respective level of the factor.
There are confint
, fixef
, print
, ranef
,
residuals
, sigma
, summary
, vcov
,
and VarCorr
methods available
for objects fitted with mmdml
. They are called
confint.mmdml
,
fixef.mmdml
,
print.mmdml
,
ranef.mmdml
,
residuals.mmdml
,
sigma.mmdml
,
summary.mmdml
,
vcov.mmdml
, and
VarCorr.mmdml
, respectively.
A list similar to the output of lmer
from package lme4 containing
the following entries.
beta |
Estimator of the linear coefficient |
vcov |
Variance-covariance matrix of |
sigma |
Please see |
theta |
Please see |
varcor |
Variance correlation components computed with
|
random_eff |
Conditional estimates of the random effects
similarly to |
random_eff_all |
The first |
residuals |
The first |
The other elements ngrps
, nobs
, fitMsgs
, cnms
,
nc
, nms
, useSc
, optinfo
, and methTitle
are as in lmer
.
The gradient and Hessian information of optinfo
is computed
by aggregating the respective information over the S
repetitions
with the median.
C. Emmenegger and P. Bühlmann. Double Machine Learning for Partially Linear Mixed-Effects Models with Repeated Measurements. Preprint arXiv:2108.13657.
confint
,
fixef
,
print
,
ranef
,
residuals
,
sigma
,
summary
,
vcov
,
VarCorr
## generate data RNGkind("L'Ecuyer-CMRG") set.seed(19) data1 <- example_data_mmdml(beta0 = 0.2) data2 <- example_data_mmdml(beta0 = c(0.2, 0.2)) ## fit models ## Caveat: Warning messages are displayed because the small number of ## observations results in a singular random effects model fit1 <- mmdml(w = c("w1", "w2", "w3"), x = "x1", y = "resp", z = c("id", "cask"), data = data1, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3) fit2 <- mmdml(w = c("w1", "w2", "w3"), x = c("x1", "x2"), y = "resp", z = c("id", "cask"), data = data2, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3) ## apply methods confint(fit2) fixef(fit2) print(fit2) ranef(fit2) residuals(fit2) sigma(fit2) summary(fit2) vcov(fit2) VarCorr(fit2)
## generate data RNGkind("L'Ecuyer-CMRG") set.seed(19) data1 <- example_data_mmdml(beta0 = 0.2) data2 <- example_data_mmdml(beta0 = c(0.2, 0.2)) ## fit models ## Caveat: Warning messages are displayed because the small number of ## observations results in a singular random effects model fit1 <- mmdml(w = c("w1", "w2", "w3"), x = "x1", y = "resp", z = c("id", "cask"), data = data1, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3) fit2 <- mmdml(w = c("w1", "w2", "w3"), x = c("x1", "x2"), y = "resp", z = c("id", "cask"), data = data2, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3) ## apply methods confint(fit2) fixef(fit2) print(fit2) ranef(fit2) residuals(fit2) sigma(fit2) summary(fit2) vcov(fit2) VarCorr(fit2)
This is a method for the class mmdml
.
It prints
objects of class mmdml
that typically result from a function
call to mmdml
.
## S3 method for class 'mmdml' print(x, digits = max(3, getOption("digits") - 3), ranef.comp = "Std.Dev.", ...)
## S3 method for class 'mmdml' print(x, digits = max(3, getOption("digits") - 3), ranef.comp = "Std.Dev.", ...)
x |
An object of class |
digits |
Number of significant digits for printing;
also see |
ranef.comp |
A character vector of length one or two
indicating if random-effects parameters should be reported
on the variance and/or standard deviation scale; also see
|
... |
Further arguments passed to or from other methods. |
See lmer
.
## See example(mmdml) for examples
## See example(mmdml) for examples
This is a method for the class regsdml
.
It prints
objects of class regsdml
, which typically result from a function
call to regsdml
.
## S3 method for class 'regsdml' print(x, ...)
## S3 method for class 'regsdml' print(x, ...)
x |
An object of class |
... |
Further arguments passed to or from other methods. |
By default, summary(x)
is called. Please see
summary.regsdml
for further details.
regsdml
,
summary.regsdml
,
confint.regsdml
,
coef.regsdml
,
vcov.regsdml
## Generate some data: set.seed(19) # true linear parameter beta0 <- 1 n <- 40 # observed confounder w <- pi * runif(n, -1, 1) # instrument a <- 3 * tanh(2 * w) + rnorm(n, 0, 1) # unobserved confounder h <- 2 * sin(w) + rnorm(n, 0, 1) # linear covariate x <- -1 * abs(a) - h - 2 * tanh(w) + rnorm(n, 0, 1) # response y <- beta0 * x - 3 * cos(pi * 0.25 * h) + 0.5 * w ^ 2 + rnorm(n, 0, 1) ## Estimate the linear coefficient from x to y ## (The parameters are chosen small enough to make estimation fast): ## Caveat: A spline estimator is extrapolated, which raises a warning message. ## Extrapolation lies in the nature of our method. To omit the warning message ## resulting from the spline estimator, another estimator may be used. fit <- regsdml(a, w, x, y, gamma = exp(seq(-4, 1, length.out = 4)), S = 3, do_regDML_all_gamma = TRUE, cond_method = c("forest", # for E[A|W] "spline", # for E[X|W] "spline"), # for E[Y|W] params = list(list(ntree = 1), NULL, NULL)) print(fit)
## Generate some data: set.seed(19) # true linear parameter beta0 <- 1 n <- 40 # observed confounder w <- pi * runif(n, -1, 1) # instrument a <- 3 * tanh(2 * w) + rnorm(n, 0, 1) # unobserved confounder h <- 2 * sin(w) + rnorm(n, 0, 1) # linear covariate x <- -1 * abs(a) - h - 2 * tanh(w) + rnorm(n, 0, 1) # response y <- beta0 * x - 3 * cos(pi * 0.25 * h) + 0.5 * w ^ 2 + rnorm(n, 0, 1) ## Estimate the linear coefficient from x to y ## (The parameters are chosen small enough to make estimation fast): ## Caveat: A spline estimator is extrapolated, which raises a warning message. ## Extrapolation lies in the nature of our method. To omit the warning message ## resulting from the spline estimator, another estimator may be used. fit <- regsdml(a, w, x, y, gamma = exp(seq(-4, 1, length.out = 4)), S = 3, do_regDML_all_gamma = TRUE, cond_method = c("forest", # for E[A|W] "spline", # for E[X|W] "spline"), # for E[Y|W] params = list(list(ntree = 1), NULL, NULL)) print(fit)
Our goal is to perform inference for the linear parameter in partially linear models with confounding variables. The standard double machine learning (DML) estimator of the linear parameter has a two-stage least squares interpretation, which can lead to a large variance and overwide confidence intervals. We apply regularization to reduce the variance of the estimator, which produces narrower confidence intervals that remain approximately valid.
The function regsdml
estimates the linear parameter
in the partially linear model
of the continuous response
with linear covariates
, nonlinear covariates
, unobserved confounding
variables
, and the error term
. An additional
variable
is required that is not part of the right-hand side
defining
. The variable
acts as an instrument after
is regressed out of it.
The linear parameter can be estimated with
a two-stage least squares (TSLS) approach ("standard" DML) or with
regularized approaches (regDML, regsDML).
All approaches use double
machine learning.
The TSLS approach regresses the residual
on
using the instrument
.
The regularized approaches
minimize an objective function that equals
times
the objective function
of TSLS plus an objective function that partials out
from the residual quantity
.
The different regularization approaches choose different regularization
parameters
.
The conditional expectations act as nuisance parameters and are estimated
with machine learning algorithms.
All approaches use sample splitting and cross-fitting to
estimate
.
regsdml( a, w, x, y, data = NULL, DML = c("DML2", "DML1"), K = 2L, gamma = exp(seq(-4, 10, length.out = 100)), aN = NULL, do_regsDML = TRUE, do_safety = FALSE, do_DML = do_regDML || do_regsDML || do_safety, do_regDML = FALSE, do_regDML_all_gamma = FALSE, safety_factor = 0.7, cond_method = rep("spline", 3), params = NULL, level = 0.95, S = 100L, parallel = c("no", "multicore", "snow"), ncpus = 1L, cl = NULL )
regsdml( a, w, x, y, data = NULL, DML = c("DML2", "DML1"), K = 2L, gamma = exp(seq(-4, 10, length.out = 100)), aN = NULL, do_regsDML = TRUE, do_safety = FALSE, do_DML = do_regDML || do_regsDML || do_safety, do_regDML = FALSE, do_regDML_all_gamma = FALSE, safety_factor = 0.7, cond_method = rep("spline", 3), params = NULL, level = 0.95, S = 100L, parallel = c("no", "multicore", "snow"), ncpus = 1L, cl = NULL )
a |
A vector, matrix, or data frame. It acts as an instrument after
regressing out |
w |
A vector, matrix, or data frame. Its columns contain observations
of the nonlinear predictors. Alternatively, if the data is
provided in the data frame |
x |
A vector, matrix, or data frame. This is the linear predictor.
Alternatively, if the data is provided in the data frame
|
y |
A vector, matrix, or data frame. This is the response.
Alternatively, if the data is provided in the data frame
|
data |
An optional data frame. If it is specified, its column names
need to coincide with the character vectors specified in |
DML |
Either |
K |
The number of sample splits used for cross-fitting. |
gamma |
A vector specifying the grid of regularization parameters over which to optimize. |
aN |
The |
do_regsDML |
A boolean that specifies whether the regsDML estimator
is computed. It is set to |
do_safety |
A boolean that specifies whether a safety device is employed.
The safety device chooses the regularization parameter |
do_DML |
A boolean that specifies whether the standard DML estimator is
computed. It is set to |
do_regDML |
A boolean that specifies whether the regularized DML
estimator regDML with the regularization parameter equal to |
do_regDML_all_gamma |
A boolean that specifies whether the regularized
estimators for all values |
safety_factor |
The factor of the safety method. It is set to |
cond_method |
A character vector of length 3 specifying the estimation
methods used to fit the conditional
expectations |
params |
An optional list of length 3. All 3 elements of this list
are lists themselves. These lists specify additional input arguments for
estimating the conditional expectations |
level |
Level for computing
confidence intervals for testing the two-sided component-wise
null hypotheses that test if a component equals zero
with the (approximate) asymptotic Gaussian distribution. The default is
|
S |
Number of replications to correct for the random
splitting of the sample. It is set to |
parallel |
One out of |
ncpus |
An integer specifying the number of cores used if
|
cl |
An optional parallel or snow cluster if |
The estimator of is computed using sample splitting and
cross-fitting.
Irrespective of which methods are performed,
the data is split into
K
sets that are equally large
if possible. For each such set, the nuisance parameters
(that is, the conditional expectations ,
,
and
) are estimated on its complement and evaluated on the
set itself. If
DML = "DML1"
, then K
individual
estimators are computed for each
of the K
data sets and are then averaged. If DML = "DML2"
,
the nuisance parameter matrices are first assembled before the estimator
of is computed. This enhances stability of the coefficient
estimator compared to
"DML1"
. If K = 1
, no sample splitting
is performed. In this case, the nuisance parameters are estimated and
predicted on the full sample.
The whole estimation procedure can be repeated S
times to
account for the randomness introduced by the random sample splits.
The S
repetitions can be run in parallel by specifying the
arguments parallel
and ncpus
.
The S
estimators of are aggregated by taking the
median of them. The
S
variance-covariance matrices are aggregated
by first adding a correction term to them that accounts for the random
splitting and by afterwards taking the median of the corrected
variance-covariance matrices. If , it can happen that this
final matrix is not positive definite anymore, in which case the mean
is considered instead.
If the design in at least 0.5 * S
of the S
repetitions is
singular, an error message is displayed.
If the designs in some but less than 0.5 * S
of the S
repetitions are singular, another S
repetitions are performed.
If, in total, at least S
repetitions result in a nonsingular design,
the results are returned together with a warning message.
The regularized estimators and their associated mean squared errors
(MSEs) are computed for the regularization parameters of
the grid
gamma
. These estimators are returned if the argument
do_regDML_all_gamma
is set to TRUE
.
The -value whose corresponding regularized estimator
from the
do_regDML_all_gamma
method achieves
the smallest MSE
is multiplied by aN
, leading to .
The
do_regDML_all_gamma
estimator with regularization parameter
is called regDML.
The regsDML estimator equals regDML or DML depending on whose variance
is smaller.
If
is of larger dimension than 1, the MSE computations and
the variance comparison step are performed with the sum of the diagonal
entries of the respective variance-covariance matrices.
If do_safety = TRUE
, a value is chosen such that the
regularized estimator among
do_regDML_all_gamma
with this value of has a variance
that is just not smaller than
safety_factor
times the variance of
DML.
If is of larger dimension than 1, the sum of the diagonal
entries of the respective variance-covariance matrices is taken as
a measure of variance.
If the regularization scheme leads to considerable variance
reductions, it is possible that this safety device cannot be applied.
In this case, a respective message is returned.
The default options of the "spline"
, "forest"
,
"ols"
, "lasso"
, "ridge"
, and "elasticnet"
methods are as follows. With the "spline"
method,
the function bs
from the package splines
is employed
with degree = 3
and df = ceiling(N ^ (1 / 5)) + 2
if
N
satisfies (df + 1) * v + 1 > N
, where v
denotes
the number of columns of w
and N
denotes the sample size.
Otherwise, df
is consecutively
reduced by 1
until this condition is satisfied.
The splines are fitted and predicted on different data sets.
If they are extrapolated, a warning message is displayed.
With the "forest"
method, the function randomForest
from
the package randomForest
is employed with nodesize = 5
,
ntree = 500
, na.action = na.omit
, and replace = TRUE
.
With the "ols"
method, the default arguments are used and no
additional arguments are specified.
With the "lasso"
and "ridge"
methods,
the function cv.glmnet
from the package glmnet
performs
10-fold cross validation by default (argument nfolds
)
to find the one-standard-error-rule -parameter.
With the
"elasticnet"
method, the function cv.glmnet
from
the package glmnet
performs 10-fold cross validation
(argument nfolds
) with alpha = 0.5
by default
to find the one-standard-error-rule -parameter.
All default values of the mentioned parameters can be adapted by
specifying the argument
params
.
There are three possibilities to set the argument parallel
, namely
"no"
for serial evaluation (default),
"multicore"
for parallel evaluation using forking,
and "snow"
for parallel evaluation using a parallel
socket cluster. It is recommended to select RNGkind
("L'Ecuyer-CMRG"
) and to set a seed to ensure that the parallel
computing of the package dmlalg
is reproducible.
This ensures that each processor receives a different substream of the
pseudo random number generator stream.
Thus, the results reproducible if the arguments remain unchanged.
There is an optional argument cl
to specify a custom cluster
if parallel = "snow"
.
The response y
needs to be continuous.
The covariate w
may contain factor variables in its columns.
If the variables a
and x
contain factor variables,
the factors should not be included as factor columns of a
or
x
.
Instead, dummy encoding should be used for all individual levels of the
factor.
That is, a factor with 4 levels should be encoded with 4 columns where each
column consists of 1 and 0 entries indicating the presence of the
respective level of the factor.
There are summary
, confint
, coef
, vcov
,
and print
methods available
for objects fitted with regsdml
. They are called
summary.regsdml
,
confint.regsdml
,
coef.regsdml
,
vcov.regsdml
, and
print.regsdml
, respectively.
A list containing some of the lists
regsDML_statistics
,
regDML_safety_statistics
,
DML_statistics
, regDML_statistics
, and
regDML_all_gamma_statistics
is returned.
The individual sublists contain the following arguments supplemented
by an additional suffix specifying the method they correspond to.
beta |
Estimator of the linear coefficient |
sd |
Standard error estimates of the respective entries
of |
var |
Variance-covariance matrix of |
pval |
p-values for the respective entries of |
CI |
Two-sided confidence intervals
for |
The list regsDML_statistics
contains the following additional entries:
message_regsDML |
Specifies if regsDML selects the regularized estimator or DML. |
gamma_aN |
Chosen optimal regularization parameter if regsDML equals the regularized estimator. This entry is not present if DML is selected. |
If the safety device is applicable, the list regDML_safety_statistics
contains the following additional entries:
message_safety |
Specifies whether the safety device was applicable. |
gamma_safety |
Chosen regularization parameter of the safety device. |
If the safety device is not applicable, the list
regDML_safety_statistics
contains message_safety
as its only entry.
The list regDML_statistics
contains the
following additional entry:
gamma_opt |
Chosen optimal regularization parameter. |
The list regDML_all_gamma_statistics
is a list of the same
length as the grid gamma
, where each individual list is of the
structure just described.
C. Emmenegger and P. Bühlmann. Regularizing Double Machine Learning in Partially Linear Endogenous Models, 2021. Preprint arXiv:2101.12525.
summary.regsdml
,
confint.regsdml
,
coef.regsdml
,
vcov.regsdml
print.regsdml
## Generate some data: RNGkind("L'Ecuyer-CMRG") set.seed(19) # true linear parameter beta0 <- 1 n <- 40 # observed confounder w <- pi * runif(n, -1, 1) # instrument a <- 3 * tanh(2 * w) + rnorm(n, 0, 1) # unobserved confounder h <- 2 * sin(w) + rnorm(n, 0, 1) # linear covariate x <- -1 * abs(a) - h - 2 * tanh(w) + rnorm(n, 0, 1) # response y <- beta0 * x - 3 * cos(pi * 0.25 * h) + 0.5 * w ^ 2 + rnorm(n, 0, 1) ## Estimate the linear coefficient from x to y ## (The parameters are chosen small enough to make estimation fast): ## Caveat: A spline estimator is extrapolated, which raises a warning message. ## Extrapolation lies in the nature of our method. To omit the warning message ## resulting from the spline estimator, another estimator may be used. fit <- regsdml(a, w, x, y, gamma = exp(seq(-4, 1, length.out = 4)), S = 3, do_regDML_all_gamma = TRUE, cond_method = c("forest", # for E[A|W] "spline", # for E[X|W] "spline"), # for E[Y|W] params = list(list(ntree = 1), NULL, NULL)) ## parm = c(2, 3) prints an additional summary for the 2nd and 3rd gamma-values summary(fit, parm = c(2, 3), correlation = TRUE, print_gamma = TRUE) confint(fit, parm = c(2, 3), print_gamma = TRUE) coef(fit) # coefficients vcov(fit) # variance-covariance matrices ## Alternatively, provide the data in a single data frame ## (see also caveat above): data <- data.frame(a = a, w = w, x = x, y = y) fit <- regsdml(a = "a", w = "w", x = "x", y = "y", data = data, gamma = exp(seq(-4, 1, length.out = 4)), S = 3) ## With more realistic parameter choices: if (FALSE) { fit <- regsdml(a, w, x, y, cond_method = c("forest", # for E[A|W] "spline", # for E[X|W] "spline")) # for E[Y|W] summary(fit) confint(fit) ## Alternatively, provide the data in a single data frame: ## (see also caveat above): data <- data.frame(a = a, w = w, x = x, y = y) fit <- regsdml(a = "a", w = "w", x = "x", y = "y", data = data) }
## Generate some data: RNGkind("L'Ecuyer-CMRG") set.seed(19) # true linear parameter beta0 <- 1 n <- 40 # observed confounder w <- pi * runif(n, -1, 1) # instrument a <- 3 * tanh(2 * w) + rnorm(n, 0, 1) # unobserved confounder h <- 2 * sin(w) + rnorm(n, 0, 1) # linear covariate x <- -1 * abs(a) - h - 2 * tanh(w) + rnorm(n, 0, 1) # response y <- beta0 * x - 3 * cos(pi * 0.25 * h) + 0.5 * w ^ 2 + rnorm(n, 0, 1) ## Estimate the linear coefficient from x to y ## (The parameters are chosen small enough to make estimation fast): ## Caveat: A spline estimator is extrapolated, which raises a warning message. ## Extrapolation lies in the nature of our method. To omit the warning message ## resulting from the spline estimator, another estimator may be used. fit <- regsdml(a, w, x, y, gamma = exp(seq(-4, 1, length.out = 4)), S = 3, do_regDML_all_gamma = TRUE, cond_method = c("forest", # for E[A|W] "spline", # for E[X|W] "spline"), # for E[Y|W] params = list(list(ntree = 1), NULL, NULL)) ## parm = c(2, 3) prints an additional summary for the 2nd and 3rd gamma-values summary(fit, parm = c(2, 3), correlation = TRUE, print_gamma = TRUE) confint(fit, parm = c(2, 3), print_gamma = TRUE) coef(fit) # coefficients vcov(fit) # variance-covariance matrices ## Alternatively, provide the data in a single data frame ## (see also caveat above): data <- data.frame(a = a, w = w, x = x, y = y) fit <- regsdml(a = "a", w = "w", x = "x", y = "y", data = data, gamma = exp(seq(-4, 1, length.out = 4)), S = 3) ## With more realistic parameter choices: if (FALSE) { fit <- regsdml(a, w, x, y, cond_method = c("forest", # for E[A|W] "spline", # for E[X|W] "spline")) # for E[Y|W] summary(fit) confint(fit) ## Alternatively, provide the data in a single data frame: ## (see also caveat above): data <- data.frame(a = a, w = w, x = x, y = y) fit <- regsdml(a = "a", w = "w", x = "x", y = "y", data = data) }
A list whose elements correspond to the potentially scaled first
nr_res
sets of
residuals of the S
residuals.
## S3 method for class 'mmdml' residuals(object, scaled = FALSE, ...)
## S3 method for class 'mmdml' residuals(object, scaled = FALSE, ...)
object |
An object of class |
scaled |
A boolean specifying whether scaled residuals should be returned. It is set to FALSE by default. |
... |
Further arguments passed to or from other methods. |
A list whose elements correspond to the first nr_res
sets of
residuals of the S
residuals.
## See example(mmdml) for examples
## See example(mmdml) for examples
Extract the estimated standard deviation of the errors,
the “residual standard deviation”,
from a fitted mmdml
model.
## S3 method for class 'mmdml' sigma(object, ...)
## S3 method for class 'mmdml' sigma(object, ...)
object |
An object of class |
... |
Further arguments passed to or from other methods. |
A number representing the estimated standard deviation.
First, for each of the S
repetitions, the standard deviations computed
on the K
sample splits are aggregated by taking the mean.
Second, the S
mean-aggregated estimates are aggregated by
the median. This final value is returned.
## See example(mmdml) for examples
## See example(mmdml) for examples
This is a method for the class mmdml
. It summarizes
objects of class mmdml
that typically result from a function
call to mmdml
.
## S3 method for class 'mmdml' summary(object, correlation = (p <= getOption("lme4.summary.cor.max")), nr_res = NULL, ...)
## S3 method for class 'mmdml' summary(object, correlation = (p <= getOption("lme4.summary.cor.max")), nr_res = NULL, ...)
object |
An object of class |
correlation |
Boolean indicating if the variance and correlation
components ( |
nr_res |
Boolean indicating how many sets of residuals among the |
... |
Further arguments passed to or from other methods. |
Prints a summary output similar to lmer
from package lme4.
## See example(mmdml) for examples
## See example(mmdml) for examples
This is a method for the class regsdml
. It summarizes
objects of class regsdml
, which typically result from a function
call to regsdml
.
## S3 method for class 'regsdml' summary(object, print_regsDML = NULL, print_safety = NULL, print_DML = NULL, print_regDML = NULL, print_regDML_all_gamma = !is.null(parm), parm = NULL, correlation = FALSE, print_gamma = FALSE, ...)
## S3 method for class 'regsdml' summary(object, print_regsDML = NULL, print_safety = NULL, print_DML = NULL, print_regDML = NULL, print_regDML_all_gamma = !is.null(parm), parm = NULL, correlation = FALSE, print_gamma = FALSE, ...)
object |
An object of class |
print_regsDML |
A boolean. If |
print_safety |
A boolean. If |
print_DML |
A boolean. If |
print_regDML |
A boolean. If |
print_regDML_all_gamma |
A boolean. If |
parm |
A vector containing the indices for which |
correlation |
A boolean. If |
print_gamma |
A boolean. If |
... |
Further arguments passed to or from other methods. |
Summary statistics of the methods regsDML
, the safety
device,
DML
, regDML
with the optimal
choice of (including the factor
a_N
),
and regDML
with prespecified -values are returned by setting the
respective arguments. It is possible to return the respective
gamma
-values and variance-covariance matrices.
If none of the printing arguments are set, only the results of regsDML
are returned if they are available. If they are not available and none of
the printing arguments are set, the results from all available methods
are returned. If print_regsDML = FALSE
, only the results from
those methods are returned that are explicitly specified by the printing
arguments.
regsdml
,
confint.regsdml
,
coef.regsdml
,
vcov.regsdml
print.regsdml
## See example(regsdml) for examples
## See example(regsdml) for examples
This is a method for the class regsdml
. It returns the
variance-covariance matrices of the coefficients from
objects of class regsdml
, which typically result from a function
call to regsdml
.
## S3 method for class 'regsdml' vcov(object, print_regsDML = NULL, print_safety = NULL, print_DML = NULL, print_regDML = NULL, print_regDML_all_gamma = !is.null(parm), parm = NULL, print_gamma = FALSE, ...)
## S3 method for class 'regsdml' vcov(object, print_regsDML = NULL, print_safety = NULL, print_DML = NULL, print_regDML = NULL, print_regDML_all_gamma = !is.null(parm), parm = NULL, print_gamma = FALSE, ...)
object |
An object of class |
print_regsDML |
A boolean. If |
print_safety |
A boolean. If |
print_DML |
A boolean. If |
print_regDML |
A boolean. If |
print_regDML_all_gamma |
A boolean. If |
parm |
A vector containing the indices for which |
print_gamma |
A boolean. If |
... |
Further arguments passed to or from other methods. |
Variance-covariance matrices of the methods regsDML
,
the safety
device, DML
, regDML
with the optimal
choice of (including the factor
a_N
),
and regDML
with prespecified -values are returned by setting the
respective arguments. It is possible to return the respective
gamma
-values.
If none of the printing arguments are set, only the results of regsDML
are returned if they are available. If they are not available and none of
the printing arguments are set, the results from all available methods
are returned. If print_regsDML = FALSE
, only the results from
those methods are returned that are explicitly specified by the printing
arguments.
regsdml
,
summary.regsdml
,
confint.regsdml
,
coef.regsdml
print.regsdml
## See example(regsdml) for examples
## See example(regsdml) for examples