Find every transformation of a variable in a formula or a regression table

I'm looking for a way to find every transformation of a variable in a formula (in a time series problem) or, alternatively, to find the position of every transformation of a variable in the coefficient vector of the regression fitted from this formula.

Let's assume the following example:

library(xts)
library(dyn)

n <- 100
dates <- seq.Date(from = as.Date("01-01-01"), length.out = n, by = "months")
X1 <- xts(rnorm(n, 0, 2), dates)
X2 <- xts(rnorm(n, 0, 2), dates)
X3 <- xts(rnorm(n, 0, 2), dates)
Y <- xts(rnorm(n, 0, 2), dates)
data <- merge.xts(Y, X1, X2, X3)

fla <- Y ~ lag.xts(Y, 1) + X1 + X2 + diff.xts(log(X1), 1) + exp(X1) + lag.xts(X2 + X3, 1)
model <- dyn$lm(fla, data = data)

summary(model)

And the output:

Call:
lm(formula = dyn(fla), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6017 -0.7453  0.1471  0.9604  2.4439 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          -0.21310    0.99148  -0.215    0.832
lag.xts(Y, 1)        -0.06777    0.16848  -0.402    0.691
X1                    0.08185    0.73443   0.111    0.912
X2                   -0.11891    0.15018  -0.792    0.436
diff.xts(log(X1), 1) -0.01050    0.38079  -0.028    0.978
exp(X1)               0.01028    0.04157   0.247    0.807
lag.xts(X2 + X3, 1)  -0.09484    0.08746  -1.084    0.288

Residual standard error: 1.712 on 26 degrees of freedom
  (67 observations deleted due to missingness)
Multiple R-squared:  0.1129,    Adjusted R-squared:  -0.09183 
F-statistic: 0.5514 on 6 and 26 DF,  p-value: 0.7644

There are four variables in the regression (Y, X1, X2 and X3). My goal is to find the correspondence between each variable and the positions of the coefficients that involve it:

Y -> c(2)
X1 -> c(3, 5, 6)
X2 -> c(4, 7)
X3 -> c(7)

CodePudding user response：

There is no need for text processing. R offers extensive facilities to deal with model terms.

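foo <- \(model) {
  # A call of the form list(Y, lag.xts(Y, 1), X1, ...): the response
  # followed by each variable expression
  vars <- attr(terms(model), "variables")
  
  # Variable names used in each element; [-1] drops the leading `list` symbol
  sel <- lapply(vars, all.vars)[-1]
  
  # For every variable, the positions of the elements that mention it
  res <- lapply(all.vars(terms(model)), \(x) which(sapply(sel, \(y) x %in% y)))
  setNames(res, all.vars(terms(model)))
}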

foo(model)
#$Y
#[1] 1 2
#
#$X1
#[1] 3 5 6
#
#$X2
#[1] 4 7
#
#$X3
#[1] 7

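Note that the indices start with the dependent variable: position 1 is Y itself, not a coefficient. Since the coefficient vector of this model starts with the intercept, the remaining positions line up with the positions in the coefficient vector.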
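If the positions should be tied to the coefficient vector explicitly rather than to the variable list, one option is the "assign" attribute of the model matrix, which records the term behind each coefficient. The following is only a minimal sketch in base R: the helper name coef_positions and the toy data frame are made up for illustration, it assumes a plain lm fit, and it has not been checked against dyn$lm objects.

```r
# Hypothetical helper: map each right-hand-side variable to the positions
# of the coefficients whose term mentions it.
coef_positions <- function(model) {
  labels <- attr(terms(model), "term.labels")
  # Variable names referenced by each term label, e.g. "log(X1)" -> "X1"
  term_vars <- lapply(labels, function(l) all.vars(str2lang(l)))
  # assign[i] is the index of the term behind coefficient i (0 = intercept)
  assign <- attr(model.matrix(model), "assign")
  vars <- unique(unlist(term_vars))
  sapply(vars, function(v) {
    uses_v <- vapply(term_vars, function(tv) v %in% tv, logical(1))
    which(assign %in% which(uses_v))
  }, simplify = FALSE)
}

# Toy data (assumed; not the xts series from the question)
set.seed(1)
d <- data.frame(Y = rnorm(50), X1 = runif(50, 1, 2),
                X2 = rnorm(50), X3 = rnorm(50))
fit <- lm(Y ~ X1 + X2 + log(X1) + exp(X1) + I(X2 + X3), data = d)

pos <- coef_positions(fit)
names(coef(fit))[pos$X1]  # coefficient names that involve X1
```

Because the lookup goes through "assign", it does not rely on the intercept happening to occupy position 1. The response is not listed unless it also appears on the right-hand side.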

CodePudding user response：

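Try grepl for each variable name with sapply. Because fixed = TRUE does plain substring matching, this assumes that no variable name is contained in another name (X1 would also match X10):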

sapply(names(data), function(y) which(grepl( y, names(model$coefficients), fixed = TRUE)))
$Y
[1] 2

$X1
[1] 3 5 6

$X2
[1] 4 7

$X3
[1] 7
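To see where plain substring matching can go wrong, here is a small sketch on a character vector of coefficient names. The names are copied from the summary in the question, except for X10, which is a hypothetical extra regressor added only to show the overlap.

```r
# Coefficient names from the question, plus a hypothetical regressor X10
coef_names <- c("(Intercept)", "lag.xts(Y, 1)", "X1", "X2",
                "diff.xts(log(X1), 1)", "exp(X1)", "lag.xts(X2 + X3, 1)",
                "X10")

# Plain substring search: "X1" is also found inside "X10"
substring_hits <- which(grepl("X1", coef_names, fixed = TRUE))
substring_hits
```

With the variable names used in the question this does not matter, since none of them is a prefix of another.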

CodePudding user response：

We can exploit the set of characters that constitute valid R identifiers (i.e. valid symbol names; see the R documentation on identifiers for more details) in conjunction with regular expressions to extract the indices you want from names(model$coefficients).

sapply(names(data), 
       function(name) {
         grep(pattern = paste0("^", name, "$|[^[:alnum:]_.]", name, "[^[:alnum:]_.]"), 
              names(model$coefficients))
       }
)

Output:

$Y
[1] 2
$X1
[1] 3 5 6
$X2
[1] 4 7
$X3
[1] 7

Explanation

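R only permits alphanumeric characters, underscores and dots in syntactic symbol names, so we use the negated bracket expression [^[:alnum:]_.] to require a character outside this set immediately to the left and right of the symbol. Note that the POSIX class must be written [:alnum:] inside the brackets; [^:alnum:_.] would instead exclude only the literal characters :, a, l, n, u, m, _ and the dot.
We also allow for the case of a standalone symbol with the beginning-of-string and end-of-string special characters ^ and $ respectively.
Since the use of a function to transform a variable necessarily places an excluded character on both sides of the symbol (outside a function call, operators like +, - or / are interpreted as formula operators rather than as arithmetic), we need not handle the one-sided edge case.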
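The boundary idea can be tried on a plain character vector. This is only a sketch: the helper name symbol_positions is made up, and the entries exp(X10) and X1:X2 are hypothetical additions to the names from the question. It groups the anchors with the bracket expression, so a symbol at the start or end of a name (as in an interaction term) is also found.

```r
# Hypothetical helper: positions of the names in which `name` appears as a
# whole symbol, i.e. not embedded in a longer identifier.
symbol_positions <- function(name, coef_names) {
  boundary <- "[^[:alnum:]_.]"
  pattern <- paste0("(^|", boundary, ")", name, "($|", boundary, ")")
  grep(pattern, coef_names)
}

# Names from the question plus two hypothetical ones at the end
coef_names <- c("(Intercept)", "lag.xts(Y, 1)", "X1", "X2",
                "diff.xts(log(X1), 1)", "exp(X1)", "lag.xts(X2 + X3, 1)",
                "exp(X10)", "X1:X2")

res <- sapply(c("Y", "X1", "X2", "X3"), symbol_positions,
              coef_names = coef_names, simplify = FALSE)
res
```

Here X1 is not matched inside exp(X10), while X1:X2 counts for both X1 and X2. Variable names that contain a dot would need escaping first, since the dot is a regular expression metacharacter.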