I'm looking for a way to find every transformation of a variable in a formula (in a time series problem) or, alternatively, to find the position of every transformation of a variable in the coefficient vector of the regression fitted from this formula.
Let's assume the following example:
library(xts)
library(dyn)
n <- 100
dates <- seq.Date(from = as.Date("01-01-01"), length.out = n, by = "months")
X1 <- xts(rnorm(n, 0, 2), dates)
X2 <- xts(rnorm(n, 0, 2), dates)
X3 <- xts(rnorm(n, 0, 2), dates)
Y <- xts(rnorm(n, 0, 2), dates)
data <- merge.xts(Y, X1, X2, X3)
fla <- Y ~ lag.xts(Y, 1) + X1 + X2 + diff.xts(log(X1), 1) + exp(X1) + lag.xts(X2 + X3, 1)
model <- dyn$lm(fla, data = data)
summary(model)
And the output:
Call:
lm(formula = dyn(fla), data = data)
Residuals:
    Min      1Q  Median      3Q     Max
-4.6017 -0.7453  0.1471  0.9604  2.4439
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)           -0.21310    0.99148  -0.215    0.832
lag.xts(Y, 1)         -0.06777    0.16848  -0.402    0.691
X1                     0.08185    0.73443   0.111    0.912
X2                    -0.11891    0.15018  -0.792    0.436
diff.xts(log(X1), 1)  -0.01050    0.38079  -0.028    0.978
exp(X1)                0.01028    0.04157   0.247    0.807
lag.xts(X2 + X3, 1)   -0.09484    0.08746  -1.084    0.288
Residual standard error: 1.712 on 26 degrees of freedom
  (67 observations deleted due to missingness)
Multiple R-squared: 0.1129, Adjusted R-squared: -0.09183
F-statistic: 0.5514 on 6 and 26 DF, p-value: 0.7644
There are 4 variables in the regression (Y, X1, X2 and X3). My goal is to find the correspondence between each variable and the positions of its coefficients:
Y -> c(2)
X1 -> c(3, 5, 6)
X2 -> c(4, 7)
X3 -> c(7)
CodePudding user response:
There is no need for text processing. R offers extensive facilities to deal with model terms.
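foo <- \(model) {
  # Call of the form list(Y, lag.xts(Y, 1), X1, ...): the response followed by each term's expression
  vars <- attr(terms(model), "variables")
  # Variables used in each expression; [-1] drops the leading `list` symbol
  sel <- lapply(vars, all.vars)[-1]
  # For every variable, the positions of the expressions that contain it
  res <- lapply(all.vars(terms(model)), \(x) which(sapply(sel, \(y) x %in% y)))
  setNames(res, all.vars(terms(model)))
}
foo(model)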
#$Y
#[1] 1 2
#
#$X1
#[1] 3 5 6
#
#$X2
#[1] 4 7
#
#$X3
#[1] 7
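Note that these indices count the dependent variable as position 1, which is why Y maps to c(1, 2). Since the response occupies the slot that the intercept takes in the coefficient vector, the remaining indices coincide with the coefficient positions for this model.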
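If coefficient positions are wanted directly, one option is to go through the model matrix instead of the variables list. The following is a minimal sketch, assuming a plain lm fit on a data frame (it has not been tested with the dyn$lm/xts setup from the question); coef_positions and the toy data d are hypothetical names introduced for illustration. It relies on the "assign" attribute of model.matrix(), which records the term each coefficient column belongs to (0 for the intercept).

```r
# Toy data; X1 is kept positive so that log(X1) is defined.
set.seed(1)
d <- data.frame(Y = rnorm(50), X1 = runif(50, 1, 2),
                X2 = rnorm(50), X3 = rnorm(50))
m <- lm(Y ~ X1 + X2 + log(X1) + exp(X1) + I(X2 + X3), data = d)

# Hypothetical helper: for each variable, the positions of the
# coefficients whose term involves that variable.
coef_positions <- function(model) {
  tt <- terms(model)
  # Variables appearing in each right-hand-side term
  term_vars <- lapply(attr(tt, "term.labels"),
                      function(lbl) all.vars(str2lang(lbl)))
  # Term index of every coefficient column (0 = intercept)
  asg <- attr(model.matrix(model), "assign")
  vars <- all.vars(tt)
  res <- lapply(vars, function(v) {
    term_ids <- which(vapply(term_vars, function(tv) v %in% tv, logical(1)))
    which(asg %in% term_ids)
  })
  setNames(res, vars)
}

pos <- coef_positions(m)
pos$X1  # positions of X1, log(X1) and exp(X1) in coef(m)
```

In this toy model X1 maps to positions 2, 4 and 5, and Y maps to an empty vector because it does not appear on the right-hand side. Because "assign" is recorded per coefficient column, the same idea should also cover factor terms that expand into several coefficients.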
CodePudding user response:
Try grepl for each variable name with sapply:
sapply(names(data), function(y) which(grepl( y, names(model$coefficients), fixed = TRUE)))
$Y
[1] 2
$X1
[1] 3 5 6
$X2
[1] 4 7
$X3
[1] 7
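One caveat with fixed substring matching is that it cannot tell a variable apart from another whose name contains it. A small sketch with made-up coefficient names (coef_names is hypothetical and not taken from the model above) illustrates this:

```r
# Hypothetical coefficient names, with X1 being a prefix of X10
coef_names <- c("(Intercept)", "X1", "X10", "log(X10)")

# Fixed substring matching also picks up both X10 terms
hits <- which(grepl("X1", coef_names, fixed = TRUE))
hits
```

With the names used in the question (Y, X1, X2, X3), no name is a substring of another, so the approach gives the expected result there.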
CodePudding user response:
We can exploit the set of characters that constitute valid R identifiers (i.e. valid symbol names; see Identifiers for more details) in conjunction with regular expressions to extract the indices you want from names(model$coefficients).
sapply(names(data),
       function(name) {
         # Match the name either as the whole coefficient label, or delimited
         # on both sides by a character that cannot be part of an identifier
         grep(pattern = paste0("^", name, "$|[^[:alnum:]_.]", name, "[^[:alnum:]_.]"),
              names(model$coefficients))
       }
)
Output:
$Y
[1] 2
$X1
[1] 3 5 6
$X2
[1] 4 7
$X3
[1] 7
Explanation
- R only permits alphanumeric characters, underscores and the dot in valid symbol names, so we use a negated character class, [^[:alnum:]_.], to exclude this set to the left and right of the symbol. Note that the POSIX class [:alnum:] must itself sit inside the outer brackets; written as [^:alnum:_.] it would only exclude the literal characters ":", "a", "l", "n", "u", "m", "_" and ".".
- We also allow for the case of a standalone symbol by using the beginning-of-string and end-of-string anchors, ^ and $ respectively.
- Since the use of a function to transform a variable necessarily places an excluded character on both sides of the symbol (using operators like +, - or / leads to invalid model formulae), we need not handle the one-sided edge case.
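The pattern can be checked in isolation on a hand-written vector of coefficient names. The sketch below is illustrative only: coef_names and match_var are hypothetical, and X10 is added to show that a longer identifier sharing a prefix is not matched. It assumes the variable names contain no regex metacharacters; a dot in a name, for instance, would act as a wildcard.

```r
# Hypothetical coefficient names, including X10 as a near-miss for X1
coef_names <- c("(Intercept)", "lag.xts(Y, 1)", "X1", "X10",
                "exp(X1)", "log(X10)", "lag.xts(X2 + X3, 1)")

# Hypothetical helper wrapping the pattern: the name must be the whole
# label, or be surrounded by characters that cannot occur in an identifier
match_var <- function(name, x) {
  grep(paste0("^", name, "$|[^[:alnum:]_.]", name, "[^[:alnum:]_.]"), x)
}

x1_hits  <- match_var("X1", coef_names)   # X1 and exp(X1), but not the X10 terms
x10_hits <- match_var("X10", coef_names)  # X10 and log(X10)
x1_hits
x10_hits
```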
