$\begingroup$

I’m currently fitting GAM models to predict how bulk density evolves with depth, for several combinations of land cover and soil texture.

My model has the following structure:

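library(mgcv)

# One depth smooth per land cover x texture class, plus a random intercept per source.
# LCxTEXT_mod and SOURCE are assumed to be factors.
m <- gam(
  BD ~ s(DEPTH_M, by = LCxTEXT_mod) +
       LCxTEXT_mod +
       s(SOURCE, bs = "re"),
  data = df_filtered,
  family = tw(link = "log"),
  method = "REML"
)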


One issue I’m struggling with is that a few observations at greater depth, where data are very sparse, seem to strongly influence the tail of the smooth. Even after removing the points with the highest leverage and reducing k based on k.check(), the end of the curve still changes substantially depending on whether those deep observations are included.

I’m wondering whether there are recommended strategies to make GAMs less sensitive to such sparsely populated regions, or systematic ways to down‑weight or filter these data so that they don’t overly drive the smooth at depth.

Thanks in advance!

$\endgroup$

1 Answer

$\begingroup$

When dealing with extreme points, many people want to "fix" the problems they raise. The first step is to be sure it really is a problem: is it a bug or a feature?

That's really a substantive question, not a statistical one, and I know nothing about this topic; I'm not even sure if it's geology or geography or something else. But maybe you want that curve?

If you really don't want it, one thing you could do is use log(depth) instead of depth. That assumes no depths are exactly 0 (it looks that way on your graph, but it's hard to be sure). If some are 0, you could try the square root of depth instead, but that's less intuitive to interpret.
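
For what it's worth, here is a minimal sketch of the log(depth) idea in R. It runs on simulated data, not your data: `df_sim`, the new column `LOG_DEPTH`, and the helpers `fit_gam()` and `tail_share()` are names I made up for illustration, and the simulated values are not meant to resemble real soils. Only the model formula is taken from your question.

```r
library(mgcv)

set.seed(42)

# Simulated stand-in for df_filtered (hypothetical values, not real soil data).
n <- 400
df_sim <- data.frame(
  DEPTH_M     = rexp(n, rate = 3) + 0.01,  # strictly positive, few deep points
  LCxTEXT_mod = factor(sample(c("A", "B"), n, replace = TRUE)),
  SOURCE      = factor(sample(paste0("S", 1:8), n, replace = TRUE))
)
mu <- exp(0.2 + 0.1 * log(df_sim$DEPTH_M + 1))
df_sim$BD <- rTweedie(mu, p = 1.5, phi = 0.01)

# log() needs strictly positive depths, so check before transforming.
stopifnot(all(df_sim$DEPTH_M > 0))
df_sim$LOG_DEPTH <- log(df_sim$DEPTH_M)

# Hypothetical helper: same model structure as in the question,
# with the depth covariate swapped by name.
fit_gam <- function(depth_var, data) {
  form <- as.formula(paste0(
    "BD ~ s(", depth_var, ", by = LCxTEXT_mod) + ",
    "LCxTEXT_mod + s(SOURCE, bs = \"re\")"
  ))
  gam(form, data = data, family = tw(link = "log"), method = "REML")
}

m_raw <- fit_gam("DEPTH_M", df_sim)
m_log <- fit_gam("LOG_DEPTH", df_sim)

# Hypothetical helper: share of the covariate range covered only by the
# deepest 5% of observations. A smaller share means the deep points sit
# closer to the bulk of the data on that scale.
tail_share <- function(x) {
  q95 <- unname(quantile(x, 0.95))
  (max(x) - q95) / (max(x) - min(x))
}

share_raw <- tail_share(df_sim$DEPTH_M)
share_log <- tail_share(df_sim$LOG_DEPTH)
```

Comparing `share_raw` with `share_log` shows how much of the covariate range is held up by the deepest few points on each scale, and `plot(m_raw, pages = 1)` versus `plot(m_log, pages = 1)` shows the fitted smooths. Keep in mind that in `m_log` the wiggliness penalty applies on the log(depth) scale, so the smooths are functions of log depth and need back-transforming for display against depth in metres.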

$\endgroup$
  • $\begingroup$ If I understand correctly, you would prefer to keep the model as it is and attribute the issue to limited data or to the limitations of the method, rather than trying to correct it. $\endgroup$ Commented yesterday
  • $\begingroup$ No. What I am saying is that you, as a subject matter expert, have to figure out whether you want the "correction" of the model. There is nothing wrong with what you have from a statistical point of view and, in many cases, outliers are important. There is also a tendency, on the part of subject matter experts, to give away their expertise. Do you want to take logs of depth? Go ahead. The curve will be less influenced by extreme points because they will be less extreme. Do you want to keep the model as is? Also fine. $\endgroup$ Commented yesterday
  • $\begingroup$ I agree with Peter, but am more bullish; often, I don't think it matters too much which transform you use, you just want to spread out the observations of the covariate. However, in this case one would likely do this spreading out to better reflect known variation with depth (soil types, etc.), as there is likely much more going on in the uppermost layers and less at depth, in general, if my recollection of soil sampling is not faulty. $\endgroup$ Commented yesterday
