Skip to content

Conversation

@SasSwart
Copy link
Contributor

@SasSwart SasSwart commented Nov 11, 2025

relates to: coder/internal#1094

This is number 2 of 5 pull requests in an effort to add agent script ordering. It adds a drpc API that is exposed via a local socket. This API serves access to a lightweight DAG based dependency manager that was inspired by systemd.

In follow-up PRs:

  • This unit manager will be plumbed into the workspace agent struct.
  • CLI commands will use this agentsocket api to express dependencies between coder scripts

I used an LLM to produce some of these changes, but I have conducted thorough self review and consider this contribution to be ready for an external reviewer.

@SasSwart SasSwart changed the title Jjs/internal 1095 branch2 socket server feat(agent): add agent socket API Nov 11, 2025
@SasSwart SasSwart changed the base branch from main to jjs/internal-1095-branch1-unit-manager November 11, 2025 14:47
@SasSwart SasSwart force-pushed the jjs/internal-1095-branch2-socket-server branch from bf9e8fd to 1daf67e Compare November 12, 2025 13:39
@mafredri mafredri self-requested a review November 14, 2025 12:02
@SasSwart SasSwart force-pushed the jjs/internal-1095-branch2-socket-server branch from a17863a to 9fdc98d Compare November 14, 2025 13:02
@SasSwart SasSwart marked this pull request as ready for review November 14, 2025 13:03
// Request whether a unit is ready to be started. That is, all dependencies are satisfied.
rpc SyncReady(SyncReadyRequest) returns (SyncReadyResponse);
// Get the status of a unit and list its dependencies.
rpc SyncStatus(SyncStatusRequest) returns (SyncStatusResponse);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the flow going to be:

  • SyncWant(fooscript, barscript)
  • SyncWant(fooscript, bazscript)
  • while ! SyncReady(fooscript)
  • SyncStart(fooscript)
  • ...
  • SyncComplete(fooscript)

I assume SyncStatus is more useful or debugging than actual script writing (which is fine).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, wdyt about supporting watch vs polling?

Copy link
Member

@mafredri mafredri Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thought, what happens if I reverse the order of want and start? So start -> want want -> ready. (I imagine users will feel it's natural to start immediately once your script gets going, it may be wrong use but still.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your described flow is correct. Although SyncStart should probably do the polling server side.

In terms of watch vs poll, we agreed with Cian to only poll for now. We can implement watch later.

To your last thought:

  • Want should almost always be called before Start.
  • Wants that are added after Start will only be applied on the next Start.
  • We could add a guard to say that dependencies cannot be added to a unit that has been started.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further consideration: we do need to think about what happens when a script fails (without being marked as such) and gets retried. The status will be stuck in "started".


// Fall back to /tmp with user-specific path
uid := os.Getuid()
return filepath.Join("/tmp", fmt.Sprintf("coder-agent-%d.sock", uid)), nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I wonder if this might create conflicts between tests. Especially later on with CLI support and CLI tests. Consider adding a CLI flag to configure it.)

Base automatically changed from jjs/internal-1095-branch1-unit-manager to main November 19, 2025 17:03
@SasSwart SasSwart force-pushed the jjs/internal-1095-branch2-socket-server branch from f7245fa to 79d783a Compare November 20, 2025 07:23
Further slim down the agentsocket proxy

simplify the agentsocket server interface

avoid two race conditions

use t.Cleanup where applicable
func (s *Server) acceptConnections() {
// In an edge case, Stop() might race with acceptConnections() and set s.listener to nil.
// Therefore, we grab a copy of the listener under a lock. We might still get a nil listener,
// but then we know close has already run and we can return early.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outdated comment (no Stop). Entire comment is not needed after my suggestion for type Service.


// getDefaultSocketPath returns the default socket path for Unix-like systems
func getDefaultSocketPath() (string, error) {
// Try XDG_RUNTIME_DIR first
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave this for future refactor, but it's a good idea to handle these types of things in the agent, and pass them down rather than have each component do their own thing.

}

// Try to connect to see if it's actually listening
conn, err := net.Dial("unix", path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use net.Dialer{}.DialContext(ctx, ...) here so we don't block forever if the socket is in a bad state or there's a bad agent listening on it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@DanielleMaywood DanielleMaywood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stamping with a nit about a comment


server.listener = listener

// This context is canceled by s.Stop() when the server is stopped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This context is canceled by s.Stop() when the server is stopped.
// This context is canceled by s.Close() when the server is stopped.

I think this comment is outdated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I got to that one 😄. On it. Thanks!

@SasSwart SasSwart merged commit 2840fdc into main Nov 21, 2025
30 checks passed
@SasSwart SasSwart deleted the jjs/internal-1095-branch2-socket-server branch November 21, 2025 11:09
@github-actions github-actions bot locked and limited conversation to collaborators Nov 21, 2025
Copy link
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post-merge LGTM! Thank you both

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants