R lm using subset of my data frame with c(index)

I have a large data frame with N columns. The first n columns are my dependent variables; the remaining m = N - n columns are my explanatory variables.

I need to do a variable selection, i.e., I want to run a linear model with one of my dependent variables against a selection of explanatory variables.

I use the following code, but it does not work.

df = structure(list(y1 = c(-0.159526983540257, 2.16892194367082, 0.695539528415267, 
-0.841375527728487, 0.146186718603554), y2 = c(0.843930369507526, 
1.15189158283099, -0.162651238219114, 0.384543148695671, -0.768095169822086
), y3 = c(0.676606087565373, -1.54403120779262, 0.309217049561983, 
-1.35994467980478, 0.025666048887934), x1 = c(-0.462318888988991, 
0.637219370641707, 0.169306615605319, 0.773825637643689, -1.80512938432685
), x2 = c(0.420644990269304, 0.168496378157891, -0.288787457624397, 
-1.8207116669123, -1.04563859296061), x3 = c(0.529585006756937, 
-0.69696010268217, 0.72760512189806, 1.27475852051601, 0.0547933726620265
), x4 = c(0.995548762574541, -1.42396489630791, 1.34343306027338, 
1.14879495559021, 1.11600859581743), x5 = c(-0.989878720668274, 
-0.823824983427361, -1.58910626627862, -0.987929834373281, -1.75551410908407
), x6 = c(-0.206995723222616, -0.712762437418153, -0.516370544799284, 
0.124635650806358, 1.08149368199072), x7 = c(-0.409575294823497, 
1.5077513417679, -1.17700768734441, -0.159607245758965, 1.11768048557717
)), class = "data.frame", row.names = c(NA, -5L))   

    index=c(5,8,9)

    model = lm(df[,1] ~ df[,c(index)])

Is it possible to subset the data frame in a similar way? I really want to avoid column names, since I may run several different models.

Edit: the length of c(index) may vary each time.

CodePudding user response：

You can use reformulate(), where index_y is the index of the y variable of interest in your data frame df:

model = lm(reformulate(colnames(df)[index], response = colnames(df)[index_y]), data = df)
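
To see what reformulate() builds, here is a minimal sketch. The data frame is made up for illustration (three response columns followed by seven explanatory columns, mirroring the layout in the question), and the values of index and index_y are assumptions.

```r
# Illustrative data: columns y1..y3 followed by x1..x7 (not the asker's data).
set.seed(1)
df <- as.data.frame(matrix(rnorm(20 * 10), ncol = 10))
colnames(df) <- c(paste0("y", 1:3), paste0("x", 1:7))

index   <- c(5, 8, 9)  # positions of the explanatory variables
index_y <- 1           # position of the response

# reformulate() joins the term labels with "+" and returns a formula object.
f <- reformulate(colnames(df)[index], response = colnames(df)[index_y])
print(f)

model <- lm(f, data = df)
names(coef(model))
```

With this layout, positions 5, 8 and 9 correspond to x2, x5 and x6, so the formula is y1 ~ x2 + x5 + x6. Column names are looked up from the index, so they never have to be typed by hand.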

CodePudding user response：

This works:

# Generating data.
n = 1000 # Use n for denoting number of observations, for consistency reasons!
k = 20 # k is used for independent/explanatory variables (also p often)!

set.seed(1986)

X = matrix(rnorm(n * k), ncol = k)
y = runif(n)

df = data.frame(y, X)
head(df)

# Fitting model on a subset of explanatory variables.
index = 4:10
model = lm(y ~ ., data = df[, c(1, index)]) # Column 1 (y) must be part of data.
summary(model)

First of all, let me advise you about notation: n is generally used to denote the number of observations, i.e., the number of rows (not columns), while k or p are used for the number of explanatory variables. Moreover, explanatory and independent variables are the same thing, so you presumably meant that you want to run a linear model with one of your dependent variables against a selection of explanatory variables.

Going back to your question, I suggest relying on the optional data argument of lm() to pass only the data you actually want to use. That way you can use the formula y ~ ., which reads as "regress y on all the other variables in the data I pass in".
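
As a small check of how the dot is expanded, the sketch below fits the same model twice: once with y ~ . on a column subset, once with the terms written out. The data and the chosen column positions are made up for illustration.

```r
# Illustrative data: y in column 1, X1..X5 in columns 2..6.
set.seed(1)
X <- matrix(rnorm(50 * 5), ncol = 5)
colnames(X) <- paste0("X", 1:5)
df <- data.frame(y = runif(50), X)

# "." expands to every column of the data that is not on the left-hand side.
fit_dot <- lm(y ~ ., data = df[, c(1, 3, 5)])  # columns y, X2, X4

# The same model with the terms spelled out.
fit_explicit <- lm(y ~ X2 + X4, data = df)

all.equal(coef(fit_dot), coef(fit_explicit))
```

Both fits should give the same coefficients, since the subset passed through data contains exactly y, X2 and X4.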

As a final warning, I set index = 4:10. Notice that this does not select the fourth to the tenth explanatory variables, but the third to the ninth, because the first column of df is y, i.e., the dependent variable (which you always have to include in data).

EDIT

I see you provided some data to work with. Here is how to adapt the code:

# Fitting model on a subset of explanatory variables.
index = c(5, 8, 9)
model = lm(y1 ~ ., data = df[, c(1, index)]) # Here I am adding the first column (y1)!
summary(model)

Basically, either your index includes the column with the dependent variable (y1 in my case), or you add it in the optional parameter data (as I did in the example).
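
Since the length of index may vary from run to run, the same idea can be wrapped in a small function and applied to several index vectors. This is only a sketch: fit_subset and index_list are hypothetical names, and the data frame is made up to mirror the layout in the question.

```r
# Illustrative data: columns y1..y3 followed by x1..x7 (not the asker's data).
set.seed(1)
df <- as.data.frame(matrix(rnorm(30 * 10), ncol = 10))
colnames(df) <- c(paste0("y", 1:3), paste0("x", 1:7))

# Hypothetical helper: regress the column at position index_y
# on the columns at the positions given in index.
fit_subset <- function(data, index_y, index) {
  sub <- data[, c(index_y, index), drop = FALSE]
  # The response is the first column of sub; build "<response> ~ ." from its name.
  lm(as.formula(paste(colnames(sub)[1], "~ .")), data = sub)
}

# Index vectors of different lengths, one model per vector.
index_list <- list(c(5, 8, 9), c(4, 6), 4:10)
models <- lapply(index_list, function(idx) fit_subset(df, 1, idx))

# Number of estimated coefficients per model (intercept included).
sapply(models, function(m) length(coef(m)))
```

Building the formula from the name of the first column keeps the call free of hard-coded column names, at the cost of assuming the names are syntactically valid R names.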