fix: send prebuild job notification after job build db commit #20693

ssncferreira · 2025-11-10T12:29:04Z

Problem

Fix race condition in prebuilds reconciler. Previously, a job notification event was sent to a Go channel before the provisioning database transaction completed. The notification is consumed by a separate goroutine that publishes to PostgreSQL's LISTEN/NOTIFY, using a separate database connection. This creates a potential race: if a provisioner daemon receives the notification and queries for the job before the provisioning transaction commits, it won't find the job in the database.

This manifested as a flaky test failure in TestReinitializeAgent, where provisioners would occasionally miss newly created jobs. The test uses a 25-second timeout context, while the acquirer's backup polling mechanism checks for jobs every 30 seconds. This made the race condition visible in tests, though in production the backup polling would eventually pick up the job. The solution presented here guarantees that a job notification is only sent after the provisioning database transaction commits.

Changes

The provision() and provisionDelete() functions now return the provisioner job instead of sending notifications internally.
A new publishProvisionerJob() helper centralizes the notification logic and is called after each transaction completes.

Closes: coder/internal#963

johnstcn · 2025-11-10T14:44:40Z

enterprise/coderd/prebuilds/reconcile.go

+	// Publish provisioner job event to notify the acquirer that a new job was posted
+	c.publishProvisionerJob(ctx, provisionerJob, prebuiltWorkspaceID)
+


Should we only do this if provisionerJob is not nil?

provisionerJob should never be nil, we already check it in provision and return an error in that case:

coder/enterprise/coderd/prebuilds/reconcile.go

Lines 930 to 934 in f161c82

if provisionerJob == nil {

// This should not happen, builder.Build() should either return a job or an error.

// Returning an error to fail fast if we hit this unexpected case.

return nil, xerrors.Errorf("provision succeeded but returned no job")

}

Nevertheless, I'm also checking in publishProvisionerJob if the job is nil, just in case:

coder/enterprise/coderd/prebuilds/reconcile.go

Lines 947 to 949 in f161c82

if provisionerJob == nil {

return

}

johnstcn · 2025-11-10T14:52:13Z

enterprise/coderd/prebuilds/reconcile.go

+	// Publish provisioner job events to notify the acquirer that new jobs were posted
+	for workspaceID, job := range provisionerJobs {
+		c.publishProvisionerJob(ctx, job, workspaceID)
+	}


If the same channel name is signaled multiple times with identical payload strings within the same transaction, only one instance of the notification event is delivered to listeners.

Link

We are now publishing these job events outside of the transaction, so we don't get this automatic de-duping. I'm not sure if this will cause issues with larger deployments?

AFAIK, we were never publishing the events within the transaction. Previously, we would send the job to the provisionNotifyCh channel, which would then publish the job outside the transaction https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/reconcile.go#L164
Therefore, this deduplication was not happening in the previous implementation as well. The only difference here is that we guarantee that we only publish the job after the transaction commits, ensuring provisioners can see the job when they query the database.

Also, there seems to be a rule to avoid calling publish within a transaction: https://github.com/coder/coder/blob/8274251f/scripts/rules.go#L143

Nevertheless, this is a good point about deduping. Because it seems the job is published as (link):

msg, err := json.Marshal(JobPosting{ OrganizationID: job.OrganizationID, ProvisionerType: job.Provisioner, Tags: job.Tags, })

Meaning that all of these canceled jobs would have the same payload. So theoretically, we only need to send one job event, right? 🤔

Therefore, this deduplication was not happening in the previous implementation as well.

Fair enough 👍 Just wanted to make sure this was called out.

Also, there seems to be a rule to avoid calling publish within a transaction: https://github.com/coder/coder/blob/8274251f/scripts/rules.go#L143

I completely forgot about that 😁 good callout!

Meaning that all of these canceled jobs would have the same payload. So theoretically, we only need to send one job event, right? 🤔

That's how I understand it, yes! provisionerdserver.Acquirer listens to all provisioner_job_posted events and 'wakes up' its provisioner if the tags match.

EDIT: now that I think about it more, I'm not sure that pubsub events actually impact canceled jobs.

Updated to only send 1 job event notification: fef8176

I think we still need to send the notification because these orphan-delete jobs are still handled by provisioners if any are available. Only when no provisioner is available is the job marked as completed automatically (and the workspace deleted). If provisioners exist, the notification ensures they pick up jobs immediately instead of waiting up to 30 seconds for the backup polling mechanism.

…uild-job-notification-race

github-actions bot assigned ssncferreira Nov 10, 2025

ssncferreira force-pushed the ssncferreira/fix-prebuild-job-notification-race branch 5 times, most recently from c31db3d to 2c77682 Compare November 10, 2025 13:48

fix: send prebuild job notification after job build db commit

f161c82

ssncferreira force-pushed the ssncferreira/fix-prebuild-job-notification-race branch from 2c77682 to f161c82 Compare November 10, 2025 14:28

ssncferreira marked this pull request as ready for review November 10, 2025 14:33

ssncferreira requested review from SasSwart and johnstcn November 10, 2025 14:34

johnstcn reviewed Nov 10, 2025

View reviewed changes

johnstcn approved these changes Nov 11, 2025

View reviewed changes

ssncferreira added 2 commits November 11, 2025 13:00

chore: only send 1 job event for canceled pending prebuilds

fef8176

Merge remote-tracking branch 'origin/main' into ssncferreira/fix-preb…

bf6e023

…uild-job-notification-race

SasSwart approved these changes Nov 12, 2025

View reviewed changes

ssncferreira merged commit ca94588 into main Nov 12, 2025
29 checks passed

ssncferreira deleted the ssncferreira/fix-prebuild-job-notification-race branch November 12, 2025 10:36

github-actions bot locked and limited conversation to collaborators Nov 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: send prebuild job notification after job build db commit #20693

fix: send prebuild job notification after job build db commit #20693

Uh oh!

ssncferreira commented Nov 10, 2025 •

edited

Loading

Uh oh!

johnstcn Nov 10, 2025

Uh oh!

ssncferreira Nov 10, 2025

Uh oh!

johnstcn Nov 10, 2025

Uh oh!

ssncferreira Nov 10, 2025

Uh oh!

johnstcn Nov 10, 2025 •

edited

Loading

Uh oh!

ssncferreira Nov 11, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		// Publish provisioner job event to notify the acquirer that a new job was posted
		c.publishProvisionerJob(ctx, provisionerJob, prebuiltWorkspaceID)

	if provisionerJob == nil {
	// This should not happen, builder.Build() should either return a job or an error.
	// Returning an error to fail fast if we hit this unexpected case.
	return nil, xerrors.Errorf("provision succeeded but returned no job")
	}

fix: send prebuild job notification after job build db commit #20693

fix: send prebuild job notification after job build db commit #20693

Uh oh!

Conversation

ssncferreira commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

Uh oh!

johnstcn Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

ssncferreira Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

johnstcn Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

ssncferreira Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

johnstcn Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ssncferreira Nov 11, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ssncferreira commented Nov 10, 2025 •

edited

Loading

johnstcn Nov 10, 2025 •

edited

Loading