I am trying to use the rlm function to fit a robust linear model to my training data. Specifically, the data frame trainingData contains 100 predictors (IR wavelengths from 852nm to 1050nm) and one response variable (Fat). However, when I try to fit a robust linear model (rlm) to the data, I get the following error:
"Error in [.data.frame(mf, xvars) : undefined columns selected"
I am trying to model Fat against all the IR wavelengths; both the wavelengths and Fat are contained in the data frame trainingData.
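#Loading the packages used below (caret: tecator data and createDataPartition; MASS: rlm)
library(caret)
library(MASS)
#Loading the "Tecator" data into R
data(tecator)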
#Naming columns for easier interpretation
colnames(absorp) <- c(paste0(seq(852,1050,2),'nm'))
colnames(endpoints) <- c('Water','Fat','Protein')
#Creating training and test sets
index <- createDataPartition(endpoints[,'Fat'], p = .7, list = FALSE)
train.absorp <- as.data.frame(absorp[index,])
test.absorp <- as.data.frame(absorp[-index,])
train.endpoints <- as.data.frame(endpoints[index,])
test.endpoints <- as.data.frame(endpoints[-index,])
#Creating training data frame for fat content prediction
trainingData <- train.absorp
trainingData$Fat <- train.endpoints$Fat
rlmFitAllPredictors <- rlm(Fat ~., data = trainingData)
CodePudding user response:
The issue seems to be related to the column names, which start with digits. If we change them by adding a letter in front, the call works:
names(trainingData)[-ncol(trainingData)] <- paste0("X", names(trainingData)[-ncol(trainingData)])
-testing
rlmFitAllPredictors <- rlm(Fat ~., data = trainingData)
-output structure
> str(rlmFitAllPredictors)
List of 21
$ coefficients : Named num [1:101] 4.56 13058.33 -12528.76 -13401.94 35613.86 ...
..- attr(*, "names")= chr [1:101] "(Intercept)" "X852nm" "X854nm" "X856nm" ...
$ residuals : Named num [1:152] 0.0484 0.013 -0.0547 0.0158 -0.0386 ...
..- attr(*, "names")= chr [1:152] "1" "2" "3" "4" ...
$ wresid : Named num [1:152] 0.0484 0.013 -0.0547 0.0158 -0.0386 ...
..- attr(*, "names")= chr [1:152] "1" "2" "3" "4" ...
$ effects : Named num [1:152] -210.4 50.5 35.7 59.2 52.3 ...
..- attr(*, "names")= chr [1:152] "(Intercept)" "X852nm" "X854nm" "X856nm" ...
$ rank : int 101
$ fitted.values: Named num [1:152] 22.45 40.09 8.45 5.88 25.54 ...
..- attr(*, "names")= chr [1:152] "1" "2" "3" "4" ...
...
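As a variation on the paste0("X", ...) rename above, base R's make.names() produces syntactic names in one step. The following is only a minimal sketch on a made-up data frame (toy, with two hypothetical wavelength columns standing in for trainingData), not the Tecator data itself:

```r
library(MASS)  # provides rlm()

# Toy stand-in for trainingData: two predictors whose names start with digits.
set.seed(1)
toy <- data.frame(a = rnorm(30), b = rnorm(30))
names(toy) <- c("852nm", "854nm")
toy$Fat <- 1 + 2 * toy[["852nm"]] - toy[["854nm"]] + rnorm(30, sd = 0.1)

# make.names() converts non-syntactic names into syntactic ones
# (a name starting with a digit gets an "X" prefix); "Fat" is left alone.
names(toy) <- make.names(names(toy), unique = TRUE)
print(names(toy))

# With syntactic names the formula interface can be used as in the answer.
fit <- rlm(Fat ~ ., data = toy)
print(coef(fit))
```

Keep in mind that any test set must be renamed the same way before calling predict(), otherwise the column names will not match the fitted model.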
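The reason is a mismatch in column names when the names start with digits. Such names are non-syntactic, so the terms object refers to them with backquotes (e.g. `852nm`), while the columns of the model frame keep the plain names (e.g. 852nm). The error message points at the call [.data.frame(mf, xvars), which corresponds to the mf[xvars] lookup inside rlm, so the backquoted names apparently do not match any column. If we add some print statements to a copy of the function, this becomes clear: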
rlm_test <- function (formula, data, weights, ..., subset, na.action, method = c("M",
    "MM", "model.frame"), wt.method = c("inv.var", "case"), model = TRUE,
    x.ret = TRUE, y.ret = FALSE, contrasts = NULL)
{
    mf <- match.call(expand.dots = FALSE)
    mf$method <- mf$wt.method <- mf$model <- mf$x.ret <- mf$y.ret <- mf$contrasts <- mf$... <- NULL
    mf[[1L]] <- quote(stats::model.frame)
    mf <- eval.parent(mf)
    method <- match.arg(method)
    wt.method <- match.arg(wt.method)
    if (method == "model.frame")
        return(mf)
    mt <- attr(mf, "terms")
    # added for debugging: show the terms object
    print(mt)
    y <- model.response(mf)
    offset <- model.offset(mf)
    if (!is.null(offset))
        y <- y - offset
    x <- model.matrix(mt, mf, contrasts)
    # added for debugging: show the first rows of the design matrix
    print("x vars")
    print(head(x, 2))
    xvars <- as.character(attr(mt, "variables"))[-1L]
    if ((yvar <- attr(mt, "response")) > 0L)
        xvars <- xvars[-yvar]
    xlev <- if (length(xvars) > 0L) {
        xlev <- lapply(mf[xvars], levels)
        xlev[!sapply(xlev, is.null)]
    }
    weights <- model.weights(mf)
    if (!length(weights))
        weights <- rep(1, nrow(x))
    fit <- rlm.default(x, y, weights, method = method, wt.method = wt.method,
        ...)
    fit$terms <- mt
    cl <- match.call()
    cl[[1L]] <- as.name("rlm")
    fit$call <- cl
    fit$contrasts <- attr(x, "contrasts")
    fit$xlevels <- .getXlevels(mt, mf)
    fit$na.action <- attr(mf, "na.action")
    if (model)
        fit$model <- mf
    if (!x.ret)
        fit$x <- NULL
    if (y.ret)
        fit$y <- y
    fit$offset <- offset
    if (!is.null(offset))
        fit$fitted.values <- fit$fitted.values + offset
    fit
}
Now test it again on the original data (with the column names that start with digits):
rlm_test(Fat ~., data = trainingData)
Part of the printed output:
...
attr(,"predvars")
list(Fat, `852nm`, `854nm`, `856nm`, `858nm`, `860nm`, `862nm`,
`864nm`, `866nm`, `868nm`, `870nm`, `872nm`, `874nm`, `876nm`,
`878nm`, `880nm`, `882nm`, `884nm`, `886nm`, `888nm`, `890nm`,
`892nm`, `894nm`, `896nm`, `898nm`, `900nm`, `902nm`, `904nm`,
`906nm`, `908nm`, `910nm`, `912nm`, `914nm`, `916nm`, `918nm`,
`920nm`, `922nm`, `924nm`, `926nm`, `928nm`, `930nm`, `932nm`,
`934nm`, `936nm`, `938nm`, `940nm`, `942nm`, `944nm`, `946nm`,
`948nm`, `950nm`, `952nm`, `954nm`, `956nm`, `958nm`, `960nm`,
`962nm`, `964nm`, `966nm`, `968nm`, `970nm`, `972nm`, `974nm`,
`976nm`, `978nm`, `980nm`, `982nm`, `984nm`, `986nm`, `988nm`,
`990nm`, `992nm`, `994nm`, `996nm`, `998nm`, `1000nm`, `1002nm`,
`1004nm`, `1006nm`, `1008nm`, `1010nm`, `1012nm`, `1014nm`,
`1016nm`, `1018nm`, `1020nm`, `1022nm`, `1024nm`, `1026nm`,
`1028nm`, `1030nm`, `1032nm`, `1034nm`, `1036nm`, `1038nm`,
`1040nm`, `1042nm`, `1044nm`, `1046nm`, `1048nm`, `1050nm`)
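The same comparison can be made without MASS at all, using only base R. The sketch below builds a small made-up data frame (df, with one hypothetical column named 852nm), extracts the variable names the way the function above does, and compares them with the column names of the model frame. It is meant as an illustration of the lookup, under the assumption that your R version behaves like the one that produced the error in the question:

```r
# Tiny made-up data frame with one non-syntactic predictor name.
df <- data.frame(y = c(1, 2, 3, 4), x = c(2, 1, 4, 3))
names(df)[2] <- "852nm"

# Build the model frame and terms object as the formula interface would.
mf <- model.frame(y ~ ., data = df)
mt <- attr(mf, "terms")

# Variable names as extracted inside rlm_test() above.
xvars <- as.character(attr(mt, "variables"))[-1L]
print(xvars)

# Column names actually present in the model frame.
print(names(mf))

# Any FALSE here means mf[xvars] would fail with "undefined columns selected".
print(xvars %in% names(mf))
```

If the last line prints a FALSE for the 852nm entry, the extracted name and the stored column name differ, which is the mismatch that renaming the columns avoids.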
