Skip to content

chore: add usage tracking package #19095

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

deansheather
Copy link
Member

@deansheather deansheather commented Jul 30, 2025

Not used in coderd yet, see stack.

Adds two new packages:

  • coderd/usage: provides an interface for the "Collector" as well as a stub implementation for AGPL
  • enterprise/coderd/usage: provides an interface for the "Publisher" as well as a Tallyman implementation

Relates to coder/internal#814

Copy link
Member Author

deansheather commented Jul 30, 2025

@deansheather deansheather marked this pull request as ready for review July 30, 2025 06:05
Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some non-blocking comments below but might need to take another pass.

@@ -3913,6 +3913,13 @@ func (q *querier) InsertTemplateVersionWorkspaceTag(ctx context.Context, arg dat
return q.db.InsertTemplateVersionWorkspaceTag(ctx, arg)
}

func (q *querier) InsertUsageEvent(ctx context.Context, arg database.InsertUsageEventParams) error {
if err := q.authorizeContext(ctx, policy.ActionCreate, rbac.ResourceSystem); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably create a separate RBAC resource and role for these events.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are currently not directly CRUDable by any users, even admins. I was going to originally add this to it's own resource, but after I realized that I decided that it could be added down the track if we add APIs for using this information

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even so, ResourceSystem has become a dumping ground of a bunch of random stuff that is unrelated. The whole point of RBAC is to separate sensitive stuff from unsensitive, and this is way less important than, say, crypto keys (which are currently ResourceSystem, so it needs its own resource.

Comment on lines +33 to +34
-- The parenthesis around @now::timestamptz are necessary to
-- avoid sqlc from generating an extra argument.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL

const (
CoderLicenseJWTHeader = "Coder-License-JWT"

tallymanURL = "https://tallyman-ingress.coder.com"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see an argument for making this a var so it's configurable at build time.

if !allFailed {
// These are all going to have the same message, so don't log
// them. We already logged the overall error above.
p.log.Warn(ctx, "tallyman rejected usage event", slog.F("id", event.ID), slog.F("message", rejectedEvent.Message), slog.F("permanent", rejectedEvent.Permanent))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we instead collect the failed IDs and warn once? Depending on the number of events, this could spam.

} else {
// It's not good if this path gets hit, but we'll handle it as if it
// was a temporary rejection.
p.log.Warn(ctx, "tallyman did not include a usage event in the response, considering it temporarily rejected", slog.F("id", event.ID))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I'd prefer if we didn't log in a tight loop.

Comment on lines +297 to +298
err = p.db.UpdateUsageEventsPostPublish(ctx, dbUpdate)
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: suggest

if err := p.b.UpdateUsageEventsPostPublish(ctx, dbUpdate); err != nil { ... }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obligatory reminder to check migration number before merging!


# Usage tracking code requires intimate knowledge of Tallyman and Metronome, as
# well as guidance from revenue.
coderd/usage/ @deansheather
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, I'd like to ensure we have 2 code owners for each thing, so that we have some ability to make progress if someone is out

@@ -0,0 +1,26 @@
CREATE TYPE usage_event_type AS ENUM (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd normally be in favor of an enum here, and I'm sorry I didn't think of it at RFC time, but...

We are restricted from using ALTER TYPE ... ADD VALUE in later migrations and instead have to convert everything to an intermediate text column, then back to the enum. Since the usage events table is likely to be very large, this could be a costly migration.

Instead we could make the event_type a text and for now just add a CHECK constraint that it equals 'dc_managed_agents_v1'. Later migrations can use NOT VALID when modifying this check constraint to avoid a costly scan. WDYT?

-- always permanently reject these events anyways.
-- The parenthesis around @now::timestamptz are necessary to
-- avoid sqlc from generating an extra argument.
potential_event.created_at > (@now::timestamptz) - INTERVAL '30 days'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if PG's query planner is smart enough to figure out the optimal order of these clauses, but this constraint probably is less selective than published_at IS NULL in a working system, so should probably go last.


CREATE INDEX idx_usage_events_created_at ON usage_events (created_at);
CREATE INDEX idx_usage_events_publish_started_at ON usage_events (publish_started_at);
CREATE INDEX idx_usage_events_published_at ON usage_events (published_at);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want a single index over a tuple and the order matters. Having 3 indexes is much less useful for a query that uses all 3 fields.

It should be published_at first, to allow the query to ignore already published events, which will be the vast majority of the table. Next is publish_started_at, which we use to filter out in-progress events. Lastly created_at since we order by this and exclude anything older than 30 days.

// Note that the following event types should not be updated once they are
// merged into the product. Please consult Dean before making any changes.
type Event interface {
usageEvent() // to prevent external types from implementing this interface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the motivation for preventing external types from implementing?

// the count of all existing managed agents (count=N)
// - A new managed agent is created (count=1)
type DCManagedAgentsV1 struct {
Count uint64 `json:"count"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we'll want a lot more than just the count so that customers can understand their usage, e.g. by template, organization, user.

startErr := make(chan error)
go func() {
err := publisher.Start()
testutil.RequireSend(ctx, t, startErr, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't call Require methods in goroutines, only the main test goroutine.

handler func(req usage.TallymanIngestRequestV1) any
)
ingestURL := fakeServer(t, tallymanHandler(t, licenseJWT, func(req usage.TallymanIngestRequestV1) any {
callCount := atomic.AddInt64(&calls, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to prevent test races, we need to ensure that the publish calls are completed before we attempt to read this value. If the test prevents such races, atomic is superfluous because we will not concurrently access it. Making it atomic doesn't do anything to prevent racy tests and is confusing because it signals to people that there might be concurrent accesses, when we need to ensure there are none.

// publishOnce publishes up to tallymanPublishBatchSize usage events to
// tallyman. It returns the number of successfully published events.
func (p *tallymanPublisher) publishOnce(ctx context.Context, deploymentID uuid.UUID) (int, error) {
licenseJwt, err := p.getBestLicenseJWT(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be doing this for every publish. Can we connect it up to the rest of the license code so we don't add more queries? Or at least use the same query strategy of querying every 10 minutes and listening for published changes?

var (
acceptedEvents = make(map[string]*TallymanIngestAcceptedEventV1)
rejectedEvents = make(map[string]*TallymanIngestRejectedEventV1)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the multiline var makes this harder to read IMO and takes up twice the number of lines as

acceptedEvents := make(map[string]*TallymanIngestAcceptedEventV1)
rejectedEvents := make(map[string]*TallymanIngestRejectedEventV1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants