-
-
Notifications
You must be signed in to change notification settings - Fork 26.1k
Fix sample weight handling in SAG(A) #31675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Running the test as follows import numpy as np
from scipy.stats import kstest
from sklearn.linear_model.tests.test_sag import sag, squared_dloss, get_step_size
from sklearn.datasets import make_regression, make_classification
from sklearn.utils._testing import assert_allclose_dense_sparse
alpha=1
n_features = 6
rng = np.random.RandomState(0)
X, y = make_classification(n_samples=1000,random_state=77,n_features=n_features)
weights = rng.randint(0,5,size=X.shape[0])
X_repeated = np.repeat(X,weights,axis=0)
y_repeated = np.repeat(y,weights,axis=0)
weights_w_all = np.zeros([n_features,50])
weights_r_all = np.zeros([n_features,50])
step_size_w=get_step_size(X,alpha,True,sample_weight=weights)
step_size_r= get_step_size(X_repeated,alpha,True)
for random_state in np.arange(50):
weights_w, int_w = sag(X,y,step_size=step_size_w,sample_weight=weights,alpha=alpha,dloss=squared_dloss,random_state=random_state)
weights_w_all[:,random_state] = weights_w
weights_r, int_r = sag(X_repeated,y_repeated,step_size=step_size_r,alpha=alpha,dloss=squared_dloss,random_state=random_state)
weights_r_all[:,random_state] = weights_r
print(kstest(weights_r_all[0],weights_w_all[0])) Now gives the result
|
@snath-xoc at the meeting, you mentioned that the statistical test would not pass for some datasets. Could you please post an example and add a TODO item to the PR to investigate this problem. Also, whenever penalty is non-zero, the problem is strictly convex and the solution show be unique. So it should be possible to write deterministic tests (with various random seed values) instead of statistical tests to:
This might require setting |
if sample_weight is not None: | ||
gradient *= sample_weight[idx] | ||
|
||
update = entry * gradient + alpha * weights |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should keep the gradient
multiplied by the sample_weight
even if you change the sampling of idx, otherwise you are not computing the correct loss gradient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loss with weights
coefficient of the linear model (skipping the intercept for now):
can be put in several ways as the finite sum minimized by sag(a):
Let
-
$f_i(w) = s_i l_i(w) + s_i \frac{\alpha}{2} \Vert w \Vert^2 $ . The gradient updated and stored in memory is thenupdate = sample_weight[idx] * (entry * gradient + alpha * weights)
-
$f_i(w) = s_i l_i(w) + \frac{S}{n} \frac{\alpha}{2} \Vert w \Vert^2$ . This gives the code in main where only the data$i$ loss term is weighted. -
$f_i(w) = s_i l_i(w)$ . The gradient updated and stored in memory is thenupdate = sample_weight[idx] * entry * gradient
, ie only the data$i$ loss term. The regularizer is then taken into account when doing the gradient descent step onweights
, for exampleweights -= alpha*step_size*weights
.
I feel 2. (the current code) is a bit weird, for instance there is maybe a weird scaling of alpha
as weights
at the end, so it shouldn't matter too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have a slight preference for option 1. with the meaning of adding a term proportional to n_seen
as the weighted sum of seen indices, which will converge to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should keep the
gradient
multiplied by thesample_weight
even if you change the sampling of idx, otherwise you are not computing the correct loss gradient.
I am not entirely sure about this, in the case of sampling with frequency weighted probabilities, is this not equivalent to uniformly sampling from the repeated set of data in which case we can maintain the update step as is? Not sure if I have misunderstood something here?
In the above case it seems like we are considering weights to be more of the form of precision weights no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the confusion comes from the comparison with SGD.
Let's step back. For the weighted version consider a function
The (FG) full gradient step is
where the correct gradient is
SG (stochastic gradient) and SAG (stochastic average gradient) are two stochastic methods to approximate the FG but in different ways.
SG: the descent step uses the gradient at a randomly selected data point. In the unweighted case, you must sample uniformly the data points. Otherwise the gradient (averaged over some iterations) would be biased and your descent does not converge toward the right minima. In the weighted case therefore you have two choices: either you sample uniformly
or you sample
Notice that, in expectation, and averaged over some iterations, the gradient approximates the FG
SAG(A): the descent step uses an approximation of the average gradient
This a crucial difference with SG. In the unweighted case for example, to quote the sag paper:
In standard SG methods, it is crucial to sample the functions fi uniformly, at least asymptotically, in order to yield an unbiased gradient estimate and subsequently achieve convergence to the optimal value (alternately, the bias induced by non-uniform sampling would need to be asymptotically corrected). In SAG iterations, however, the weight of each gradient is constant and equal to 1/n, regardless of the frequency at which the corresponding function is sampled. We might thus consider sampling the functions fi non-uniformly, without needing to correct for this bias.
So then a question is which sampling procedure to use ? See above and the sag(a) paper, but I think it involves looking at the Lipschitz smoothness constant.
def get_step_size(X, alpha, fit_intercept, classification=True, sample_weight=None): | ||
if sample_weight is None: | ||
X_prod = np.sum(X * X, axis=1) | ||
else: | ||
X = X[sample_weight != 0] | ||
X_prod = np.sum(X * X, axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think removing the zero sample_weight is enough.
If I understood correctly the sag(a) paper the recommended step size is step_size = 1 / L
where
It will depend on how we decompose the weighted sum into individual
-
$L_i = s_i \kappa \Vert x_i \Vert^2 + s_i \alpha$ . -
$L_i = s_i \kappa \Vert x_i \Vert^2 + \frac{S}{n} \alpha$ .
where
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the end, it depends on how we allocate the regularizer:
- evenly wrt the indices, each
$i$ has the same regularizer$\frac{S}{n} \frac{\alpha}{2}\Vert w \Vert^2$ - evenly wrt the
sample_weight
, each$i$ has a regularizer$s_i \frac{\alpha}{2}\Vert w \Vert^2$ proportional to itssample_weight
.
Maybe we should try both strategies and see which one leads to better convergence ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we first need to figure out the step size computation before trying non-uniform sampling, for example by following 5.5 Effect of non-uniform sampling of the sag paper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm so I guess you are suggesting a variable step size only for the special case of sample weights?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I wasn't clear, I meant first find/test the correct formula for step_size
with sample_weight
(see options 1 and 2 above).
Later we can try nonuniform sampling, which I think would be more difficult than just proportional to the sample_weight, see the discussion in 5.5 around the Lipschitz constants.
Regarding #31675 (comment).
I think we should start by adding a convergence check to the reference Python code similar to what is done for the main Cython implementation. EDIT: here is a minimal reproducer for the convergence bug: # %%
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.exceptions import ConvergenceWarning
import warnings
warnings.filterwarnings("error", category=ConvergenceWarning)
rng = np.random.RandomState(42)
X = rng.randn(1000, 10)
y = rng.randint(0, 2, size=X.shape[0])
# Use sample weights to trigger the convergence warning
# sample_weight = rng.randint(0, 10, size=X.shape[0]).astype(np.float64)
sample_weight = np.full(shape=X.shape[0], fill_value=5, dtype=np.float64)
# Uncomment any of the two following lines and the convergence problems go away.
# sample_weight /= sample_weight.max() # reduce magnitude
# sample_weight = None
common_params = {
"tol": 1e-12,
"max_iter": 10_000,
"C": 1e10,
"random_state": 42,
}
# %%
for solver in ["lbfgs", "saga"]:
lr = LogisticRegression(solver=solver, **common_params).fit(X, y, sample_weight=sample_weight)
print(lr.solver, lr.n_iter_[0]) lbfgs 7 # no convergence problem with LBFGS
Traceback (most recent call last):
Cell In[37], line 3
lr = LogisticRegression(solver=solver, **common_params).fit(X, y, sample_weight=sample_weight)
File ~/code/scikit-learn/sklearn/base.py:1365 in wrapper
return fit_method(estimator, *args, **kwargs)
File ~/code/scikit-learn/sklearn/linear_model/_logistic.py:1384 in fit
fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, prefer=prefer)(
File ~/code/scikit-learn/sklearn/utils/parallel.py:82 in __call__
return super().__call__(iterable_with_config_and_warning_filters)
File ~/miniforge3/envs/dev/lib/python3.13/site-packages/joblib/parallel.py:1986 in __call__
return output if self.return_generator else list(output)
File ~/miniforge3/envs/dev/lib/python3.13/site-packages/joblib/parallel.py:1914 in _get_sequential_output
res = func(*args, **kwargs)
File ~/code/scikit-learn/sklearn/utils/parallel.py:147 in __call__
return self.function(*args, **kwargs)
File ~/code/scikit-learn/sklearn/linear_model/_logistic.py:560 in _logistic_regression_path
w0, n_iter_i, warm_start_sag = sag_solver(
File ~/code/scikit-learn/sklearn/linear_model/_sag.py:348 in sag_solver
warnings.warn(
ConvergenceWarning: The max_iter was reached which means the coef_ did not converge EDIT: the weights can be constant and the problem is still present as long as their value is large enough. |
import numpy as np
from scipy.stats import kstest
from sklearn.linear_model.tests.test_sag import sag, squared_dloss, get_step_size
from sklearn.datasets import make_regression, make_classification
from sklearn.utils._testing import assert_allclose_dense_sparse
alpha=1
n_features = 6
rng = np.random.RandomState(0)
X, y = make_classification(n_samples=1000,random_state=77,n_features=n_features,n_classes=5,n_informative=4)
weights = rng.randint(0,5,size=X.shape[0])
X_repeated = np.repeat(X,weights,axis=0)
y_repeated = np.repeat(y,weights,axis=0)
weights_w_all = np.zeros([n_features,50])
weights_r_all = np.zeros([n_features,50])
step_size_w=get_step_size(X,alpha,True,sample_weight=weights)
step_size_r= get_step_size(X_repeated,alpha,True)
for random_state in np.arange(50):
weights_w, int_w = sag(X,y,step_size=step_size_w,sample_weight=weights,alpha=alpha,dloss=squared_dloss,random_state=random_state)
weights_w_all[:,random_state] = weights_w
weights_r, int_r = sag(X_repeated,y_repeated,step_size=step_size_r,alpha=alpha,dloss=squared_dloss,random_state=random_state)
weights_r_all[:,random_state] = weights_r
print(kstest(weights_r_all[0],weights_w_all[0])) this gives results
Although it still isn't too bad I do think as @antoinebaker pointed out there is still some things that aren't completely correct |
We need to check if the step size fix of #31675 (comment) can fix the convergence problem when EDIT: I see that #31837 is doing it. |
Reference Issues/PRs
Fixes issue on sample weight handling within SAG(A) #31536.
What does this implement/fix? Explain your changes.
SAG(A) now accounts for sample weights by:
TO DO:
get_auto_step_size