I’m currently fitting GAMs to predict how bulk density changes with depth, for several combinations of land cover and soil texture.
My model has the following structure:
gam(
  BD ~ s(DEPTH_M, by = LCxTEXT_mod) +
    LCxTEXT_mod +
    s(SOURCE, bs = "re"),
  data = df_filtered,
  family = tw(link = "log"),
  method = "REML"
)
One issue I’m struggling with is that a few observations at greater depth, where data are very sparse, seem to strongly influence the tail of the smooth. Even after removing the points with the highest leverage and reducing k based on k.check(), the end of the curve still changes substantially depending on whether those deep observations are included.
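To make the sensitivity check reproducible, here is a minimal sketch of the with/without comparison on simulated data. Everything except the model formula is an assumption for illustration: the simulated data frame `sim`, the two factor levels, the hypothetical `depth_cutoff`, and `k = 5`. The random effect for SOURCE is left out to keep the example short.

```r
library(mgcv)

set.seed(42)
# Simulated stand-in for df_filtered: many shallow points, very few deep ones
n_shallow <- 300
n_deep <- 8
sim <- data.frame(
  DEPTH_M = c(runif(n_shallow, 0, 1), runif(n_deep, 1.5, 2)),
  LCxTEXT_mod = factor(sample(c("A", "B"), n_shallow + n_deep, replace = TRUE))
)
mu <- exp(0.2 + 0.25 * sqrt(sim$DEPTH_M))
sim$BD <- rTweedie(mu, p = 1.5, phi = 0.01)  # positive, Tweedie-distributed response

# Same structure as the model above, minus the SOURCE random effect
fit_model <- function(d) {
  gam(BD ~ s(DEPTH_M, by = LCxTEXT_mod, k = 5) + LCxTEXT_mod,
      data = d, family = tw(link = "log"), method = "REML")
}

depth_cutoff <- 1  # hypothetical boundary between the dense and sparse regions
fit_all <- fit_model(sim)
fit_shallow <- fit_model(subset(sim, DEPTH_M <= depth_cutoff))

# Predict both fits on a common grid covering the full depth range
grid <- expand.grid(
  DEPTH_M = seq(0, 2, length.out = 50),
  LCxTEXT_mod = levels(sim$LCxTEXT_mod)
)
pred_all <- predict(fit_all, grid, type = "response", se.fit = TRUE)
pred_shallow <- predict(fit_shallow, grid, type = "response", se.fit = TRUE)

# Difference between the two fitted curves, next to the full fit's standard error
grid$diff <- as.numeric(pred_all$fit - pred_shallow$fit)
grid$se_all <- as.numeric(pred_all$se.fit)
print(tail(grid[grid$LCxTEXT_mod == "A", ]))
```

Beyond `depth_cutoff` the shallow-only fit is extrapolating, so a large `diff` there shows how much the curve's tail depends on the deep points rather than which fit is right.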
I’m wondering whether there are recommended strategies to make GAMs less sensitive to such sparsely populated regions, or systematic ways to down‑weight or filter these data so that they don’t overly drive the smooth at depth.
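To make the down-weighting idea concrete, one option would be to pass prior weights through the `weights` argument of `gam()`. The sketch below is only an illustration on simulated data: the cutoff, the exponential decay, and the names `depth_cutoff`, `decay_rate`, and `w_norm` are all hypothetical, and whether such a scheme is statistically defensible is part of the question.

```r
library(mgcv)

set.seed(7)
# Simulated stand-in data: dense shallow observations, sparse deep ones
n_shallow <- 300
n_deep <- 8
sim <- data.frame(DEPTH_M = c(runif(n_shallow, 0, 1), runif(n_deep, 1.5, 2)))
sim$BD <- rTweedie(exp(0.2 + 0.25 * sqrt(sim$DEPTH_M)), p = 1.5, phi = 0.01)

# Hypothetical scheme: full weight down to the cutoff, exponential decay below it
depth_cutoff <- 1
decay_rate <- 2
raw_w <- ifelse(sim$DEPTH_M <= depth_cutoff, 1,
                exp(-decay_rate * (sim$DEPTH_M - depth_cutoff)))

# Rescale to mean 1 so the overall magnitude of the log-likelihood is unchanged
sim$w_norm <- raw_w / mean(raw_w)

fit_weighted <- gam(BD ~ s(DEPTH_M, k = 5),
                    data = sim, weights = w_norm,
                    family = tw(link = "log"), method = "REML")
fit_unweighted <- gam(BD ~ s(DEPTH_M, k = 5),
                      data = sim,
                      family = tw(link = "log"), method = "REML")

# Compare the two fits at the deep end of the curve
deep_grid <- data.frame(DEPTH_M = seq(1.5, 2, length.out = 5))
print(cbind(deep_grid,
            weighted = as.numeric(predict(fit_weighted, deep_grid, type = "response")),
            unweighted = as.numeric(predict(fit_unweighted, deep_grid, type = "response"))))
```

Since the weights here are arbitrary, it would be worth refitting under a few different decay rates to see how much the tail depends on that choice.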
Thanks in advance!
