-
Notifications
You must be signed in to change notification settings - Fork 956
fix: don't create autostart workspace builds with no available provisioners #19067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
} | ||
|
||
// Check if any provisioners are active (not stale) | ||
now := dbtime.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be using the passed in time from runOnce
rather than getting it's own time
if testing.Testing() { | ||
staleInterval = TestingStaleInterval | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use the passed in time from runOnce
I don't think changing the interval would be necessary, you could write the test in a way that the default interval works fine
@@ -52,6 +56,9 @@ type Executor struct { | |||
experiments codersdk.Experiments | |||
|
|||
metrics executorMetrics | |||
|
|||
// Skip provisioner availability check (should only be true for *some* tests) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It says it should only be true for some tests, but it seems it's true for almost every test using coderdtest
. How many tests are actually broken by this change? If it's only a few, I'd rather we just fix those tests (e.g. by adding provisioners) rather than add a skip flag.
provisioner is not available Signed-off-by: Callum Styan <callumstyan@gmail.com>
transition so that we don't accumulate autostart jobs that will eventually have to be cleaned up by the reaper routine Signed-off-by: Callum Styan <callumstyan@gmail.com>
DB/provisioner to be available, so add a way to skip the check Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
…ndling This commit addresses the requirements to: 1. Comment out provisionerCloser.Close() in TestExecutorAutostartSkipsWhenNoProvisionersAvailable 2. Remove the need for SkipProvisionerCheck by refactoring tests Changes: - Remove SkipProvisionerCheck field and logic from Executor - Remove SkipProvisionerCheck option from coderdtest.Options - Add proper provisioner daemon tags to match template requirements - Add helper functions for waiting for provisioner availability - Update tests to use real provisioner availability checks - Fix timing issues in provisioner staleness tests - Comment out provisioner close call to test proper behavior The test now correctly validates that provisioners remain available when not explicitly closed, and uses the real hasAvailableProvisioners logic instead of bypassing it. Signed-off-by: Callum Styan <callumstyan@gmail.com>
9971794
to
78a8f5e
Compare
autobuild path have a valid provisioner available and pass an appropriate time.Time to the channel for triggering the build based on whether the provisioner is supposed to be valid or stale Signed-off-by: Callum Styan <callumstyan@gmail.com>
Okay @deansheather thanks for your patience, I think this is finally ready for another review. Took quite a bit of digging with the help of Blink to find all the bits and pieces that had to be updated to get the tests working without the The TL; DR of the changes is that we in tests which deal with autobuilds we need:
|
@@ -131,6 +134,36 @@ func (e *Executor) Run() { | |||
}() | |||
} | |||
|
|||
func (e *Executor) hasAvailableProvisioners(ctx context.Context, tx database.Store, t time.Time, ws database.Workspace, templateVersionJob database.ProvisionerJob) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Online
is a better name here, since we're just checking that there are provisioners, not that they're necessarily available right now for a new build.
queryParams := database.GetProvisionerDaemonsByOrganizationParams{ | ||
OrganizationID: ws.OrganizationID, | ||
WantTags: templateVersionJob.Tags, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you should put something up here like logger := e.log.With(slog.F("tags", templateVersionJob.Tags))
for those log statements below
const TestingStaleInterval = time.Second * 5 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be only used in tests and not in this file. We should try to avoid test-only constants like this anyways, in favor of letting it be configured as a new argument to NewExecutor
|
||
// Wait for the provisioner to become stale (heartbeat stops) | ||
// The stale interval is 5 seconds, so we need to wait a bit longer | ||
time.Sleep(6 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We typically avoid sleeps wherever possible in tests. You could try:
- Manipulating the database to set the provisioners to have a last_seen_at property six seconds in the past
- Manipulating the tick time you pass into the executor to use a time six seconds in the past
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you'd also be able to avoid the 5s test timeout constant if you did this instead
// The stale interval is 5 seconds, so we need to wait a bit longer | ||
time.Sleep(6 * time.Second) | ||
|
||
// Debug: check what's in the database |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind a few debug log statements typically, but this test has a lot which impacts readability, and I personally don't think most of them are that useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think if the test avoids using sleeps and switches to manipulation of the db/ticks it'll be more reliable and won't need all of these debugs, since you won't have to worry about a real provisioner daemon sending updates
t.Logf("Daemon %d: ID=%s, Name=%s, LastSeen=%v", i, daemon.ID, daemon.Name, daemon.LastSeenAt) | ||
} | ||
|
||
// Since we commented out the close call, the provisioner should NOT become stale |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about this comment, it looks like the close call isn't commented out
// Debug: Test the exact query that hasAvailableProvisioners uses | ||
queryParams := database.GetProvisionerDaemonsByOrganizationParams{ | ||
OrganizationID: workspace.OrganizationID, | ||
WantTags: templateVersion.Job.Tags, | ||
} | ||
t.Logf("Query params: OrgID=%s, WantTags=%v", queryParams.OrganizationID, queryParams.WantTags) | ||
t.Logf("Test query detailed params: org_id_type=%T, org_id_value=%s, want_tags_type=%T, want_tags_value=%v", | ||
queryParams.OrganizationID, queryParams.OrganizationID, queryParams.WantTags, queryParams.WantTags) | ||
|
||
// Test 1: Transaction isolation - try with different transaction contexts | ||
t.Logf("=== Testing transaction isolation ===") | ||
|
||
// Query with context.Background() (what we used above) | ||
matchingDaemons1, err := db.GetProvisionerDaemonsByOrganization(context.Background(), queryParams) | ||
require.NoError(t, err) | ||
t.Logf("With context.Background(): %d daemons", len(matchingDaemons1)) | ||
|
||
// Query with a fresh context | ||
ctx2 := context.Background() | ||
matchingDaemons2, err2 := db.GetProvisionerDaemonsByOrganization(ctx2, queryParams) | ||
require.NoError(t, err2) | ||
t.Logf("With fresh context: %d daemons", len(matchingDaemons2)) | ||
|
||
// Query within a transaction (like hasAvailableProvisioners might use) | ||
err = db.InTx(func(tx database.Store) error { | ||
matchingDaemons3, err := tx.GetProvisionerDaemonsByOrganization(ctx2, queryParams) | ||
if err != nil { | ||
return err | ||
} | ||
t.Logf("Within transaction: %d daemons", len(matchingDaemons3)) | ||
return nil | ||
}, nil) | ||
require.NoError(t, err) | ||
|
||
// Test 2: Timing issue - query right before and after hasAvailableProvisioners | ||
t.Logf("=== Testing timing issue ===") | ||
|
||
// Query right before the autostart trigger | ||
beforeDaemons, err := db.GetProvisionerDaemonsByOrganization(context.Background(), queryParams) | ||
require.NoError(t, err) | ||
t.Logf("Right before autostart: %d daemons", len(beforeDaemons)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You run the same query 4 times in a row here. If you'd like to assert the value before you trigger the run, just query once and check the result (e.g. the Query right before the autostart trigger
query you have).
Testing different combinations of contexts/transactions is out of the realm of the lifecycle executor test, and would be better for the database package (that being said, I don't think it'd be a high-value test in the database package anyway).
// Query with context.Background() (what we used above) | ||
matchingDaemons1, err := db.GetProvisionerDaemonsByOrganization(context.Background(), queryParams) | ||
require.NoError(t, err) | ||
t.Logf("With context.Background(): %d daemons", len(matchingDaemons1)) | ||
|
||
// Query with a fresh context | ||
ctx2 := context.Background() | ||
matchingDaemons2, err2 := db.GetProvisionerDaemonsByOrganization(ctx2, queryParams) | ||
require.NoError(t, err2) | ||
t.Logf("With fresh context: %d daemons", len(matchingDaemons2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are identical since they both use a background context and the same parameters.
ticker <- build.Job.CompletedAt.Add(failureTTL * 2) | ||
tickTime := build.Job.CompletedAt.Add(failureTTL * 2) | ||
|
||
p, err := coderdtest.GetProvisionerForWorkspace(t, db, time.Now(), ws.OrganizationID, database.ProvisionerJob{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
provisionerdserver.StaleInterval
is 90s, so is this necessary? Most tests have a timeout of <30s.
@@ -1590,3 +1592,117 @@ func DeploymentValues(t testing.TB, mut ...func(*codersdk.DeploymentValues)) *co | |||
} | |||
return cfg | |||
} | |||
|
|||
// GetProvisionerForWorkspace returns the first valid provisioner for a workspace + template tags. | |||
func GetProvisionerForWorkspace(t *testing.T, tx database.Store, curTime time.Time, orgID uuid.UUID, templateVersionJob database.ProvisionerJob) (database.ProvisionerDaemon, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function name says it's for a workspace but takes a template version. Maybe you should change the signature to:
func GetProvisionerForTags(t *testing.T, tx database.Store, curTime time.Time, orgID uuid.UUID, tags []string) (database.ProvisionerDaemon, error) {
This should fix #17941
With just the test addition the test will fail since the workspace build is triggered, we attempt to move the workspace into the
start
state:require.Equal(t, codersdk.WorkspaceTransitionStop, workspace.LatestBuild.Transition)
With the 2nd commit we skip the calls to
wsbuilder.New
andbuilder.Build
and so the transition is never scheduled.However, many tests don't fully setup provisioner or wait for the provisioner + DB to be fully available, so the third commit adds a somewhat smelly way to skip the provisioner check in tests that don't need to test that functionality.