R GAM producing different results via group argument

I have hourly data (24 hours a day across a year) in 7 groups. When I fit a GAM (mgcv::gam) with the by= argument to get 7 separate fitted curves, some of the fits look odd. However, when I subset the data to just one of those groups and refit the GAM without by=Group, the fit looks much better and makes sense.

This is a toy example where the difference between the two methods is not as dramatic; in my real data the fit from the by= model is much worse. Why would this be?

require(data.table)
require(mgcv)
require(ggplot2)

## create two groups of data, A & B
dtA <- data.table(t = rep(1:12,each=100) , N = c(runif(200, 0.0, 1.0),runif(200, 2.0, 3.0),runif(200, 5.0, 7.0),runif(200, 4.0, 5.0),runif(200, 1.0, 2.0),runif(200, 0.0, 1.0)), Group="A")

dtB <- data.table(t = rep(1:12,each=100) , N = c(runif(200, 20.0, 22.0),runif(200, 14.0, 16.0),runif(200, 6.0, 7.0),runif(200, 5.0, 6.0),runif(200, 12.0, 15.0),runif(200, 17.0, 20.0)), Group="B")

## put the data together, set the group as a factor
dt_gp <- rbindlist(list(dtA,dtB), use.names = TRUE)
dt_gp[, Group := factor(Group, levels=c("A","B"))]

## create the gam , using the by grouping, and then fit to a blank table
gam1 <- gam(N ~ s(t,k=8, bs="cc", by=Group), data = dt_gp)

dt_fit1 <- data.table(t=rep(c(1:12),2), Group=rep(c("A","B"), each=12))
dt_fit1[, Group := factor(Group, levels=c("A","B"))]

fits1 = predict(gam1, newdata=dt_fit1, type='response', se.fit=TRUE)
predicts1 = as.data.table(data.frame(dt_fit1, fits1))

## now subset GpA data and run and recreate GAM and fitted line. 
dt <- dt_gp[Group=="A"]
dt[,Group:=NULL]

gam2   <- gam(N ~ s(t,k=8, bs="cc"), data = dt)

dt_fit2 <- data.table(t=1:12)

fits2 = predict(gam2, newdata=dt_fit2, type='response', se.fit=TRUE)
predicts2 = as.data.table(data.frame(dt_fit2, fits2))

## plot to see difference (add Group to 2nd prediction for facet in plot)
predicts2[,Group:="A"]
ggplot() +
  geom_line(data=predicts1, aes(x=t, y=fit), colour="blue") +
  geom_line(data=predicts2, aes(x=t, y=fit), colour="red") +
  geom_point(data=dt_gp, aes(x=t,y=N), colour="grey50") +
  facet_wrap(~Group, nrow=2, scales="free_y") +
  ggtitle("GAM on numbers grouped by A & B (numbers in A identical in both cases)") +
  theme_bw() +
  theme(axis.text.x = element_text(size=12),
        axis.text.y = element_text(size=12),
        axis.title = element_text(size=16),
        legend.title=element_blank())

The red line is the fit to the subsetted data; the blue lines are the fits from the grouped model. Does the by= grouping in mgcv::gam() not separate the data? The more I make A and B differ, the worse the blue lines fit the raw data points.

CodePudding user response：

From the documentation for the s function in mgcv:

In the factor by variable case a replicate of the smooth is produced for each factor level (these smooths will be centered, so the factor usually needs to be added as a main effect as well). See gam.models for further details.
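
One way to read that note, written out as a sketch rather than as a description of mgcv's internals: each per-level smooth is centered, so without a main effect the only constant in the model is a single intercept shared by all groups. The symbols below (g(i) for the group of observation i, f_g for that group's smooth, alpha_g for a group offset) are notation introduced here for illustration.

```latex
\begin{aligned}
\text{without main effect:}\quad
  & N_i = \beta_0 + f_{g(i)}(t_i) + \varepsilon_i,
  \qquad \sum_i f_g(t_i) = 0 \ \text{for each } g \\
\text{with main effect:}\quad
  & N_i = \beta_0 + \alpha_{g(i)} + f_{g(i)}(t_i) + \varepsilon_i
\end{aligned}
```

In the first form every group's curve has to oscillate around the same level beta_0, so groups with very different means (like A and B in the question) cannot both be fitted well. The alpha_g term in the second form lets each group have its own level.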

So it looks like you'll want to include Group in your formula outside the call to s as well, e.g.,

gam1 <- gam(N ~ Group + s(t,k=8, bs="cc", by=Group), data = dt_gp)
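
To check the effect of the main effect, here is a minimal sketch that rebuilds the question's toy data and compares three fits on group A: by= only, by= plus the Group main effect, and group A fitted on its own. It uses a plain data.frame instead of data.table, and the seed and the names dat, gam_by_only, gam_by_main, gam_A_alone and err_* are assumptions introduced for this illustration.

```r
library(mgcv)

set.seed(1)  # assumption: seed added only so the sketch is reproducible

# Same toy design as the question: 12 time points, 100 observations each
t_vals <- rep(1:12, each = 100)
nA <- c(runif(200, 0, 1), runif(200, 2, 3), runif(200, 5, 7),
        runif(200, 4, 5), runif(200, 1, 2), runif(200, 0, 1))
nB <- c(runif(200, 20, 22), runif(200, 14, 16), runif(200, 6, 7),
        runif(200, 5, 6), runif(200, 12, 15), runif(200, 17, 20))
dat <- data.frame(t = rep(t_vals, 2), N = c(nA, nB),
                  Group = factor(rep(c("A", "B"), each = 1200)))

# by= only: centered smooths sharing one intercept
gam_by_only <- gam(N ~ s(t, k = 8, bs = "cc", by = Group), data = dat)
# by= plus Group as a main effect: each group gets its own level
gam_by_main <- gam(N ~ Group + s(t, k = 8, bs = "cc", by = Group), data = dat)
# group A fitted on its own, as in the question's second model
gam_A_alone <- gam(N ~ s(t, k = 8, bs = "cc"), data = dat[dat$Group == "A", ])

# Predict group A on the same grid from all three models
grid_A <- data.frame(t = 1:12, Group = factor("A", levels = c("A", "B")))
pred_by_only <- as.numeric(predict(gam_by_only, newdata = grid_A))
pred_by_main <- as.numeric(predict(gam_by_main, newdata = grid_A))
pred_alone   <- as.numeric(predict(gam_A_alone, newdata = grid_A))

# Mean absolute distance from the stand-alone group A fit
err_by_only <- mean(abs(pred_by_only - pred_alone))
err_by_main <- mean(abs(pred_by_main - pred_alone))
print(c(by_only = err_by_only, by_main = err_by_main))
```

The model with the main effect should track the stand-alone fit far more closely than the by=-only model. The two need not match exactly, since the grouped model estimates its smoothing parameters and residual variance jointly across groups.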