I would like to fit a model of the form weight ~ bs(height) + id, where id is a factor (so I have several observations per id). There are perhaps 100k levels of id. Does anyone know the best way to fit this? I have tried the mgcv package, but I do not think it was designed for problems this large and sparse. I can define my own knots and work out the best smoothing parameter, if necessary. A solution in Python would be great too, and even just the right phrase to search for would be helpful.
2 Answers
I don't think your problem is really to do with splines, but with fitting 100,000 degrees of freedom for id. Depending on what you actually want to do with it, this seems like a good situation for a random-effects model. In R, you can fit them using a variety of functions, with lme in package nlme probably being a good place to start.
Also, the GAM label usually refers to the use of smoothing splines, not fixed-knot splines as fit by bs and ns. If you're happy to use bs and its cousins, you don't need gam; glm will suffice.
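As a rough sketch of the two points above combined, the following fits a fixed-knot B-spline in height with a random intercept per id using lme. The data frame FOO, its simulated contents, the number of ids, and the choice df = 5 are all assumptions made for illustration; they are not taken from the question.

```r
library(nlme)     # lme()
library(splines)  # bs()

set.seed(1)

# Hypothetical stand-in data: 200 ids with 5 observations each
n_id <- 200
FOO <- data.frame(id = factor(rep(seq_len(n_id), each = 5)))
FOO$height <- runif(nrow(FOO), 150, 200)
id_effect <- rnorm(n_id, sd = 3)  # true per-id offsets
FOO$weight <- 60 + 10 * sin(FOO$height / 10) +
  id_effect[as.integer(FOO$id)] + rnorm(nrow(FOO))

# Fixed-knot B-spline in height (fixed effect), random intercept for each id
fit <- lme(weight ~ bs(height, df = 5), random = ~ 1 | id, data = FOO)

fixef(fit)        # intercept plus the spline coefficients
VarCorr(fit)      # between-id and residual variance estimates
head(ranef(fit))  # predicted intercept for the first few ids
```

The id term now costs a single variance parameter rather than one coefficient per level. Whether this runs comfortably with 100k levels would need to be checked on the real data.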
You can fit this with mgcv using a model like this:
mod <- gam(weight ~ s(height, bs = "cr") + s(id, bs = "re"), data = FOO)
If that causes trouble when fitting, try bam(), which is designed for large data sets:
mod <- bam(weight ~ s(height, bs = "cr") + s(id, bs = "re"), data = FOO)
The bs = "re" specifies a random effect for id, as you might use in a mixed-effects model. This exploits the close relationship between the penalised regression view of the GAM and mixed models, where some of the penalties can be viewed as random-effects terms.
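For a quick check of what the random-effect term produces, here is a minimal sketch on simulated data. FOO and its contents are hypothetical stand-ins, and the small number of ids is only there to keep the example fast.

```r
library(mgcv)

set.seed(1)

# Hypothetical stand-in data: 100 ids with 5 observations each
n_id <- 100
FOO <- data.frame(id = factor(rep(seq_len(n_id), each = 5)))
FOO$height <- runif(nrow(FOO), 150, 200)
id_effect <- rnorm(n_id, sd = 3)  # true per-id offsets
FOO$weight <- 60 + 10 * sin(FOO$height / 10) +
  id_effect[as.integer(FOO$id)] + rnorm(nrow(FOO))

mod <- bam(weight ~ s(height, bs = "cr") + s(id, bs = "re"), data = FOO)

# One row per smooth term: s(height) and s(id)
summary(mod)$s.table

# The "re" term has one (shrunken) coefficient per level of id
id_coefs <- coef(mod)[grepl("^s\\(id\\)", names(coef(mod)))]
length(id_coefs)
```

The per-id coefficients are penalised towards zero, which is what makes them behave like predicted random intercepts rather than freely estimated fixed effects.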