Skip to content

Conversation

@ssncferreira
Copy link
Contributor

@ssncferreira ssncferreira commented Nov 10, 2025

Problem

Fix race condition in prebuilds reconciler. Previously, a job notification event was sent to a Go channel before the provisioning database transaction completed. The notification is consumed by a separate goroutine that publishes to PostgreSQL's LISTEN/NOTIFY, using a separate database connection. This creates a potential race: if a provisioner daemon receives the notification and queries for the job before the provisioning transaction commits, it won't find the job in the database.

This manifested as a flaky test failure in TestReinitializeAgent, where provisioners would occasionally miss newly created jobs. The test uses a 25-second timeout context, while the acquirer's backup polling mechanism checks for jobs every 30 seconds. This made the race condition visible in tests, though in production the backup polling would eventually pick up the job. The solution presented here guarantees that a job notification is only sent after the provisioning database transaction commits.

Changes

  • The provision() and provisionDelete() functions now return the provisioner job instead of sending notifications internally.
  • A new publishProvisionerJob() helper centralizes the notification logic and is called after each transaction completes.

Closes: coder/internal#963

@ssncferreira ssncferreira force-pushed the ssncferreira/fix-prebuild-job-notification-race branch 5 times, most recently from c31db3d to 2c77682 Compare November 10, 2025 13:48
@ssncferreira ssncferreira force-pushed the ssncferreira/fix-prebuild-job-notification-race branch from 2c77682 to f161c82 Compare November 10, 2025 14:28
@ssncferreira ssncferreira marked this pull request as ready for review November 10, 2025 14:33
Comment on lines +746 to +748
// Publish provisioner job event to notify the acquirer that a new job was posted
c.publishProvisionerJob(ctx, provisionerJob, prebuiltWorkspaceID)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only do this if provisionerJob is not nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provisionerJob should never be nil, we already check it in provision and return an error in that case:

if provisionerJob == nil {
// This should not happen, builder.Build() should either return a job or an error.
// Returning an error to fail fast if we hit this unexpected case.
return nil, xerrors.Errorf("provision succeeded but returned no job")
}

Nevertheless, I'm also checking in publishProvisionerJob if the job is nil, just in case:

if provisionerJob == nil {
return
}

// Publish provisioner job events to notify the acquirer that new jobs were posted
for workspaceID, job := range provisionerJobs {
c.publishProvisionerJob(ctx, job, workspaceID)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the same channel name is signaled multiple times with identical payload strings within the same transaction, only one instance of the notification event is delivered to listeners.

Link

We are now publishing these job events outside of the transaction, so we don't get this automatic de-duping. I'm not sure if this will cause issues with larger deployments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, we were never publishing the events within the transaction. Previously, we would send the job to the provisionNotifyCh channel, which would then publish the job outside the transaction https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/reconcile.go#L164
Therefore, this deduplication was not happening in the previous implementation as well. The only difference here is that we guarantee that we only publish the job after the transaction commits, ensuring provisioners can see the job when they query the database.

Also, there seems to be a rule to avoid calling publish within a transaction: https://github.com/coder/coder/blob/8274251f/scripts/rules.go#L143

Nevertheless, this is a good point about deduping. Because it seems the job is published as (link):

msg, err := json.Marshal(JobPosting{
    OrganizationID:  job.OrganizationID,
    ProvisionerType: job.Provisioner,
    Tags:            job.Tags,
})

Meaning that all of these canceled jobs would have the same payload. So theoretically, we only need to send one job event, right? 🤔

Copy link
Member

@johnstcn johnstcn Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Therefore, this deduplication was not happening in the previous implementation as well.

Fair enough 👍 Just wanted to make sure this was called out.

Also, there seems to be a rule to avoid calling publish within a transaction: https://github.com/coder/coder/blob/8274251f/scripts/rules.go#L143

I completely forgot about that 😁 good callout!

Meaning that all of these canceled jobs would have the same payload. So theoretically, we only need to send one job event, right? 🤔

That's how I understand it, yes! provisionerdserver.Acquirer listens to all provisioner_job_posted events and 'wakes up' its provisioner if the tags match.

EDIT: now that I think about it more, I'm not sure that pubsub events actually impact canceled jobs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to only send 1 job event notification: fef8176

I think we still need to send the notification because these orphan-delete jobs are still handled by provisioners if any are available. Only when no provisioner is available is the job marked as completed automatically (and the workspace deleted). If provisioners exist, the notification ensures they pick up jobs immediately instead of waiting up to 30 seconds for the backup polling mechanism.

@ssncferreira ssncferreira merged commit ca94588 into main Nov 12, 2025
29 checks passed
@ssncferreira ssncferreira deleted the ssncferreira/fix-prebuild-job-notification-race branch November 12, 2025 10:36
@github-actions github-actions bot locked and limited conversation to collaborators Nov 12, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

flake: TestReinitializeAgent/#0

4 participants