-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix: send prebuild job notification after job build db commit #20693
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: send prebuild job notification after job build db commit #20693
Conversation
c31db3d to
2c77682
Compare
2c77682 to
f161c82
Compare
| // Publish provisioner job event to notify the acquirer that a new job was posted | ||
| c.publishProvisionerJob(ctx, provisionerJob, prebuiltWorkspaceID) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only do this if provisionerJob is not nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
provisionerJob should never be nil, we already check it in provision and return an error in that case:
coder/enterprise/coderd/prebuilds/reconcile.go
Lines 930 to 934 in f161c82
| if provisionerJob == nil { | |
| // This should not happen, builder.Build() should either return a job or an error. | |
| // Returning an error to fail fast if we hit this unexpected case. | |
| return nil, xerrors.Errorf("provision succeeded but returned no job") | |
| } |
Nevertheless, I'm also checking in publishProvisionerJob if the job is nil, just in case:
coder/enterprise/coderd/prebuilds/reconcile.go
Lines 947 to 949 in f161c82
| if provisionerJob == nil { | |
| return | |
| } |
| // Publish provisioner job events to notify the acquirer that new jobs were posted | ||
| for workspaceID, job := range provisionerJobs { | ||
| c.publishProvisionerJob(ctx, job, workspaceID) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the same channel name is signaled multiple times with identical payload strings within the same transaction, only one instance of the notification event is delivered to listeners.
We are now publishing these job events outside of the transaction, so we don't get this automatic de-duping. I'm not sure if this will cause issues with larger deployments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, we were never publishing the events within the transaction. Previously, we would send the job to the provisionNotifyCh channel, which would then publish the job outside the transaction https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/reconcile.go#L164
Therefore, this deduplication was not happening in the previous implementation as well. The only difference here is that we guarantee that we only publish the job after the transaction commits, ensuring provisioners can see the job when they query the database.
Also, there seems to be a rule to avoid calling publish within a transaction: https://github.com/coder/coder/blob/8274251f/scripts/rules.go#L143
Nevertheless, this is a good point about deduping. Because it seems the job is published as (link):
msg, err := json.Marshal(JobPosting{
OrganizationID: job.OrganizationID,
ProvisionerType: job.Provisioner,
Tags: job.Tags,
})
Meaning that all of these canceled jobs would have the same payload. So theoretically, we only need to send one job event, right? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Therefore, this deduplication was not happening in the previous implementation as well.
Fair enough 👍 Just wanted to make sure this was called out.
Also, there seems to be a rule to avoid calling publish within a transaction: https://github.com/coder/coder/blob/8274251f/scripts/rules.go#L143
I completely forgot about that 😁 good callout!
Meaning that all of these canceled jobs would have the same payload. So theoretically, we only need to send one job event, right? 🤔
That's how I understand it, yes! provisionerdserver.Acquirer listens to all provisioner_job_posted events and 'wakes up' its provisioner if the tags match.
EDIT: now that I think about it more, I'm not sure that pubsub events actually impact canceled jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to only send 1 job event notification: fef8176
I think we still need to send the notification because these orphan-delete jobs are still handled by provisioners if any are available. Only when no provisioner is available is the job marked as completed automatically (and the workspace deleted). If provisioners exist, the notification ensures they pick up jobs immediately instead of waiting up to 30 seconds for the backup polling mechanism.
Problem
Fix race condition in prebuilds reconciler. Previously, a job notification event was sent to a Go channel before the provisioning database transaction completed. The notification is consumed by a separate goroutine that publishes to PostgreSQL's LISTEN/NOTIFY, using a separate database connection. This creates a potential race: if a provisioner daemon receives the notification and queries for the job before the provisioning transaction commits, it won't find the job in the database.
This manifested as a flaky test failure in
TestReinitializeAgent, where provisioners would occasionally miss newly created jobs. The test uses a 25-second timeout context, while the acquirer's backup polling mechanism checks for jobs every 30 seconds. This made the race condition visible in tests, though in production the backup polling would eventually pick up the job. The solution presented here guarantees that a job notification is only sent after the provisioning database transaction commits.Changes
provision()andprovisionDelete()functions now return the provisioner job instead of sending notifications internally.publishProvisionerJob()helper centralizes the notification logic and is called after each transaction completes.Closes: coder/internal#963