
I have to find an answer to the following question, but I am struggling:

Consider a leaf of a decision tree that consists of object-label pairs $(x_{1}, y_{1}), \dots, (x_{n}, y_{n})$.

The prediction $\hat{y}$ of this leaf is defined to minimize the loss on the training samples.

Find the optimal prediction in the leaf, for a regression tree, i.e. $y_{i} \in \mathbb{R}$, and squared percentage error loss $\mathcal{L}(y, \hat{y}) = \cfrac{\left(y - \hat{y} \right)^{2}}{y^2}$.

I tried just taking the derivative of the loss function and setting it to 0, which only yields $y=\hat{y}$, which cannot be the result. Intuitively, something like the mean value of the observations present in the leaf should come out, right?

  • Hint: yes, if there is only a single $y$ in the terminal leaf, then of course the optimal prediction is equal to that value: $\hat{y}=y$. However, we will usually have multiple training samples $y_1, \dots, y_n$ in the terminal leaf, which we want to summarize using a single prediction value $\hat{y}$. Which will minimize the loss? (This will depend on how the loss is summarized over multiple training observations; you can probably assume it's averaged. Also, let's hope very much that none of the $y_i=0$.) Commented Dec 13, 2021 at 7:38
  • Incidentally, that the answer will probably not be the mean of the training observations is related to one issue with the Mean Absolute Percentage Error (MAPE), which is quite similar to your loss function (just without the squaring). Commented Dec 13, 2021 at 7:52
  • Thanks @StephanKolassa. Having read the article, do you mean that the problem is that my loss function is not differentiable at $y_i = \hat{y}_i$? Commented Dec 13, 2021 at 8:20
  • No, your loss is quite nicely differentiable, in contrast to the MAPE. But note that there is no subscript $i$ for your prediction $\hat{y}$. You have multiple training instances $y_1, \dots, y_n$ in your leaf and want to summarize them with a single prediction $\hat{y}$. Just set up the loss function, averaging over the training samples, and differentiate. Commented Dec 13, 2021 at 8:26
  • @Stephan Kolassa: What do you mean by "averaging over the training samples"? If I just differentiate the sum of my loss function with respect to $\hat{y}$, I get exactly $\hat{y} = \sum\limits_{i = 1}^n y_i/n$ Commented Dec 13, 2021 at 8:32

1 Answer

Let us assume that we have training observations $y_1, \dots, y_n$ in the leaf, all of which are nonzero. Let us further assume that we summarize losses using their sum, which for the purpose of minimization is equivalent to taking their average:
$$ L(\hat{y}) = \sum_{i=1}^n \frac{(y_i-\hat{y})^2}{y_i^2}. $$
To minimize the loss, we take the derivative with respect to $\hat{y}$ and set it to zero:
$$ \frac{d}{d\hat{y}}L = \frac{d}{d\hat{y}}\sum_{i=1}^n \frac{(y_i-\hat{y})^2}{y_i^2} = \sum_{i=1}^n \frac{-2(y_i-\hat{y})}{y_i^2}\stackrel{!}{=}0, $$
or
$$ 0= \sum_{i=1}^n \frac{y_i-\hat{y}}{y_i^2} = \sum_{i=1}^n \frac{y_i}{y_i^2}-\sum_{i=1}^n \frac{\hat{y}}{y_i^2}= \sum_{i=1}^n \frac{1}{y_i}-\hat{y}\sum_{i=1}^n \frac{1}{y_i^2}, $$
resulting in
$$ \hat{y}=\frac{\sum_{i=1}^n \frac{1}{y_i}}{\sum_{i=1}^n \frac{1}{y_i^2}}. $$
The second derivative $\sum_{i=1}^n \frac{2}{y_i^2}$ is positive, so this stationary point is indeed the minimizer. Note that $\hat{y}$ is a weighted mean of the $y_i$ with weights $1/y_i^2$: observations close to zero get very large weights and pull the optimal prediction towards zero.

As an example, let us simulate $y_1, \dots, y_n\sim U[0,1]$ and find the optimal $\hat{y}$ both numerically and using our formula:

nn <- 100
set.seed(1)
yy <- runif(nn)   # training observations in the leaf (all nonzero with probability 1)

# minimize the summed squared percentage error numerically over [0, 1] ...
yhat_numerical <- optimize(f = function(yhat) sum((yhat - yy)^2 / yy^2), interval = c(0, 1))$minimum
# ... and evaluate the closed-form solution derived above
yhat_theory <- sum(1 / yy) / sum(1 / yy^2)

Happily, both agree:

> yhat_numerical 
[1] 0.04473853
> yhat_theory 
[1] 0.04473853

Also, the optimal prediction is far away from the "intuitive" prediction, which is just the mean of the training samples:

> mean(yy)
[1] 0.5178471
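
To quantify the difference, we can evaluate the summed squared percentage error at both candidate predictions (a quick check reusing the yy simulated above; spe_loss is just an ad-hoc helper name, nothing built in):

spe_loss <- function(yhat, y) sum((y - yhat)^2 / y^2)   # total squared percentage error

spe_loss(yhat_theory, yy)   # loss at the derived optimum
spe_loss(mean(yy), yy)      # loss at the sample mean

The loss at the mean is necessarily larger, since $\hat{y}$ is the unique minimizer, and with small uniform values in yy it should be larger by a wide margin: the $1/y_i^2$ weighting penalizes overshooting the small $y_i$ severely, which is exactly what drags the optimal prediction towards zero.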

This illustrates that the "best" point forecast depends on the error or accuracy measure (Kolassa, 2020, IJF).
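
For contrast, repeating the numerical minimization with the ordinary squared error $\sum_i (y_i - \hat{y})^2$ (a minimal sketch reusing the same yy; yhat_squared_error is just a name chosen here) recovers the sample mean, which is the prediction the question's intuition points to:

yhat_squared_error <- optimize(f = function(yhat) sum((yhat - yy)^2), interval = c(0, 1))$minimum
yhat_squared_error   # numerically very close to mean(yy)

So the mean is the optimal leaf prediction under squared error, while the squared percentage error calls for the reciprocal-weighted prediction derived above.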
