How to reduce GAM sensitivity to sparse data

Question

I’m currently fitting GAM models to predict how bulk density evolves with depth, for several combinations of land cover and soil texture.

My model has the following structure:

gam(
  BD ~ s(DEPTH_M, by = LCxTEXT_mod) +
       LCxTEXT_mod +
       s(SOURCE, bs = "re"),
  data = df_filtered,
  family = tw(link = "log"),
  method = "REML"
)

One issue I’m struggling with is that a few observations at greater depth, where data are very sparse, seem to strongly influence the tail of the smooth. Even after removing the points with the highest leverage and reducing k based on k.check(), the end of the curve still changes substantially depending on whether those deep observations are included.

I’m wondering whether there are recommended strategies to make GAMs less sensitive to such sparsely populated regions, or systematic ways to down‑weight or filter these data so that they don’t overly drive the smooth at depth.

Thanks in advance!

Peter Flom · Accepted Answer · 2025-12-05 12:25:30Z

When dealing with extreme points, many people want to "fix" the problems those points raise. The first thing is to be sure it's really a problem. Is it a bug or a feature?

That's really a substantive question, not a statistical one, and I know nothing about this topic; I'm not even sure if it's geology or geography or something else. But maybe you want that curve?

If you really don't want it, one thing you could do is take log(depth) instead of depth. That assumes no depths are exactly 0 (it seems so on your graph, but it's hard to be sure). If some are 0, then you could try taking the square root of depth, but that's less intuitive to interpret.
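A minimal sketch of the log-depth refit, in case it helps. I don't have your data, so `df_sim` is a simulated stand-in for `df_filtered` with the same column names, and `LOG_DEPTH`, `new_depth`, and the level names are hypothetical. Only the model formula mirrors the one in the question.

```r
library(mgcv)

set.seed(1)
n <- 400

# Hypothetical stand-in for df_filtered: right-skewed depths, so deep samples are sparse
df_sim <- data.frame(
  DEPTH_M     = rlnorm(n, meanlog = -1, sdlog = 0.8),
  LCxTEXT_mod = factor(sample(c("forest_clay", "crop_sand"), n, replace = TRUE)),
  SOURCE      = factor(sample(paste0("src", 1:8), n, replace = TRUE))
)
df_sim$BD <- with(
  df_sim,
  1.1 + 0.25 * log1p(DEPTH_M) + 0.1 * (LCxTEXT_mod == "crop_sand")
) + rnorm(n, sd = 0.08)

# log() needs strictly positive depths; fall back to sqrt() if any are 0
stopifnot(all(df_sim$DEPTH_M > 0))
df_sim$LOG_DEPTH <- log(df_sim$DEPTH_M)

# Same structure as the model in the question, with the smooth on log depth
m_log <- gam(
  BD ~ s(LOG_DEPTH, by = LCxTEXT_mod) +
       LCxTEXT_mod +
       s(SOURCE, bs = "re"),
  data = df_sim,
  family = tw(link = "log"),
  method = "REML"
)

# Predict over a grid on the original depth scale, leaving out the SOURCE random effect
new_depth <- data.frame(
  DEPTH_M     = seq(min(df_sim$DEPTH_M), max(df_sim$DEPTH_M), length.out = 50),
  LCxTEXT_mod = factor("forest_clay", levels = levels(df_sim$LCxTEXT_mod)),
  SOURCE      = factor("src1", levels = levels(df_sim$SOURCE))
)
new_depth$LOG_DEPTH <- log(new_depth$DEPTH_M)
pred_log <- predict(m_log, newdata = new_depth, type = "response",
                    exclude = "s(SOURCE)")
```

Plotting `pred_log` against `new_depth$DEPTH_M` keeps the curve on the original depth axis, so the result stays easy to interpret even though the smooth was fitted on the log scale.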

If I understand correctly, you would prefer to keep the model as it is and attribute the issue to limited data or to the limitations of the method, rather than trying to correct it.
– Aurélien Lengrand, 2025-12-05 13:54:35 +00:00

No. What I am saying is that you, as a subject matter expert, have to figure out whether you want the "correction" of the model. There is nothing wrong with what you have from a statistical point of view and, in many cases, outliers are important. There is also a tendency, on the part of subject matter experts, to give away their expertise. Do you want to take logs of depth? Go ahead. The curve will be less influenced by extreme points because they will be less extreme. Do you want to keep the model as is? Also fine.
– Peter Flom, 2025-12-05 14:00:21 +00:00

I agree with Peter, but am more bullish; often, I don't think it matters too much which transform you use, you just want to spread out the observations of the covariate. However, one would likely do this spreading out, in this case, to better reflect known variation in depth (soil types, etc.), as there is likely much more going on in the uppermost layers and less at depth, in general, if my recollection of soil sampling is not faulty.
– Gavin Simpson, 2025-12-05 19:15:42 +00:00
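To make the "spreading out" point concrete, here is a rough base-R check on simulated depths (not the asker's data). The helper `tail_share()` is hypothetical: it measures what fraction of the covariate's range is occupied only by the deepest 5% of samples. A smaller share means the sparse deep points sit less far out on their own.

```r
set.seed(42)

# Hypothetical right-skewed depths in metres (assumption, not the real data)
depth <- rlnorm(500, meanlog = -1, sdlog = 0.8)

# Fraction of the covariate's range covered only by the top `top` share of values
tail_share <- function(x, top = 0.05) {
  cutoff <- quantile(x, 1 - top)
  unname((max(x) - cutoff) / (max(x) - min(x)))
}

shares <- c(
  raw  = tail_share(depth),
  sqrt = tail_share(sqrt(depth)),
  log  = tail_share(log(depth))
)
print(round(shares, 2))
```

Running the same helper on the real `DEPTH_M` column would show how much each transform compresses the sparse tail before refitting anything.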
