My input dependent variable has NA values but was used to build linear models for several groups. How can I use the linear model coefficients to derive predicted values for each row among each group?
Example data
library(dplyr)
df <- iris
df[c(1:2,51:52,101:102),3] <- NA # add NA values to dependent variable
# Creates a linear model for each Species group
fitted_models <- df %>%
group_by(Species) %>%
do(model = lm(Petal.Length~Sepal.Length Sepal.Width, data = .))
The below code works so long as there are no NA values in the dependent variables. However, the below code throws the following error message when you try to use it with the data with NA values (i.e., df): Error: Incompatible lengths: 50, 48.
library(tidyr)
library(purrr)
# works
fitted_models <- iris %>%
nest(data = -Species) %>%
mutate(fit = map(data, ~ lm(Petal.Length ~ Sepal.Length Sepal.Width, data = .x)),
fitted.values = map(fit, "fitted.values")) %>%
unnest(cols = c(data, fitted.values)) %>%
select(-fit)
# does not work
fitted_models <- df %>%
nest(data = -Species) %>%
mutate(fit = map(data, ~ lm(Petal.Length ~ Sepal.Length Sepal.Width, data = .x)),
fitted.values = map(fit, "fitted.values")) %>%
unnest(cols = c(data, fitted.values)) %>%
select(-fit)
The ideal output (see below) would use the coefficients from each of the Species models to derive predicted values for each row, including those with NA in the dependent variable.
Species Sepal.Length Sepal.Width Petal.Length Petal.Width predicted.values
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 NA 0.2 1.47
2 setosa 4.9 3 NA 0.2 1.46
3 setosa 4.7 3.2 1.3 0.2 1.42
4 setosa 4.6 3.1 1.5 0.2 1.41
5 setosa 5 3.6 1.4 0.2 1.46
6 setosa 5.4 3.9 1.7 0.4 1.51
7 setosa 4.6 3.4 1.4 0.3 1.40
8 setosa 5 3.4 1.5 0.2 1.46
9 setosa 4.4 2.9 1.4 0.2 1.38
10 setosa 4.9 3.1 1.5 0.1 1.45
# ... with 140 more rows
CodePudding user response:
This version with predict() and purrr::map2() works.
f <- function (.fit, .new_data) {
predict(.fit, newdata = .new_data)
}
df %>%
nest(data = -Species) %>%
mutate(
fit = map(data, ~ lm(Petal.Length ~ Sepal.Length Sepal.Width, data = .x)),
yhat = map2(.x = fit, .y = data, f)
) %>%
unnest(cols = c(data, yhat)) %>%
select(-fit)
Output:
# A tibble: 150 × 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width yhat
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 NA 0.2 1.48
2 setosa 4.9 3 NA 0.2 1.46
3 setosa 4.7 3.2 1.3 0.2 1.42
4 setosa 4.6 3.1 1.5 0.2 1.41
5 setosa 5 3.6 1.4 0.2 1.46
6 setosa 5.4 3.9 1.7 0.4 1.51
7 setosa 4.6 3.4 1.4 0.3 1.40
8 setosa 5 3.4 1.5 0.2 1.46
9 setosa 4.4 2.9 1.4 0.2 1.39
10 setosa 4.9 3.1 1.5 0.1 1.46
# … with 140 more rows
I wish I could figure it out with the fitted.values variable that's already calculated. I tried adding a row id before nesting, so the two data.frames could be left-joined (because data has all 50 rows). But then I thought the two unnesting and one join operations are more conceptual work than rerunning the linear equation used in predict().
