-
Notifications
You must be signed in to change notification settings - Fork 956
chore: add usage tracking package #19095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some non-blocking comments below but might need to take another pass.
@@ -3913,6 +3913,13 @@ func (q *querier) InsertTemplateVersionWorkspaceTag(ctx context.Context, arg dat | |||
return q.db.InsertTemplateVersionWorkspaceTag(ctx, arg) | |||
} | |||
|
|||
func (q *querier) InsertUsageEvent(ctx context.Context, arg database.InsertUsageEventParams) error { | |||
if err := q.authorizeContext(ctx, policy.ActionCreate, rbac.ResourceSystem); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably create a separate RBAC resource and role for these events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are currently not directly CRUDable by any users, even admins. I was going to originally add this to it's own resource, but after I realized that I decided that it could be added down the track if we add APIs for using this information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even so, ResourceSystem
has become a dumping ground of a bunch of random stuff that is unrelated. The whole point of RBAC is to separate sensitive stuff from unsensitive, and this is way less important than, say, crypto keys (which are currently ResourceSystem
, so it needs its own resource.
-- The parenthesis around @now::timestamptz are necessary to | ||
-- avoid sqlc from generating an extra argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL
const ( | ||
CoderLicenseJWTHeader = "Coder-License-JWT" | ||
|
||
tallymanURL = "https://tallyman-ingress.coder.com" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see an argument for making this a var
so it's configurable at build time.
if !allFailed { | ||
// These are all going to have the same message, so don't log | ||
// them. We already logged the overall error above. | ||
p.log.Warn(ctx, "tallyman rejected usage event", slog.F("id", event.ID), slog.F("message", rejectedEvent.Message), slog.F("permanent", rejectedEvent.Permanent)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we instead collect the failed IDs and warn once? Depending on the number of events, this could spam.
} else { | ||
// It's not good if this path gets hit, but we'll handle it as if it | ||
// was a temporary rejection. | ||
p.log.Warn(ctx, "tallyman did not include a usage event in the response, considering it temporarily rejected", slog.F("id", event.ID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I'd prefer if we didn't log in a tight loop.
err = p.db.UpdateUsageEventsPostPublish(ctx, dbUpdate) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: suggest
if err := p.b.UpdateUsageEventsPostPublish(ctx, dbUpdate); err != nil { ... }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obligatory reminder to check migration number before merging!
|
||
# Usage tracking code requires intimate knowledge of Tallyman and Metronome, as | ||
# well as guidance from revenue. | ||
coderd/usage/ @deansheather |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general, I'd like to ensure we have 2 code owners for each thing, so that we have some ability to make progress if someone is out
@@ -0,0 +1,26 @@ | |||
CREATE TYPE usage_event_type AS ENUM ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd normally be in favor of an enum here, and I'm sorry I didn't think of it at RFC time, but...
We are restricted from using ALTER TYPE ... ADD VALUE
in later migrations and instead have to convert everything to an intermediate text column, then back to the enum. Since the usage events table is likely to be very large, this could be a costly migration.
Instead we could make the event_type
a text and for now just add a CHECK constraint that it equals 'dc_managed_agents_v1'. Later migrations can use NOT VALID when modifying this check constraint to avoid a costly scan. WDYT?
-- always permanently reject these events anyways. | ||
-- The parenthesis around @now::timestamptz are necessary to | ||
-- avoid sqlc from generating an extra argument. | ||
potential_event.created_at > (@now::timestamptz) - INTERVAL '30 days' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if PG's query planner is smart enough to figure out the optimal order of these clauses, but this constraint probably is less selective than published_at IS NULL
in a working system, so should probably go last.
|
||
CREATE INDEX idx_usage_events_created_at ON usage_events (created_at); | ||
CREATE INDEX idx_usage_events_publish_started_at ON usage_events (publish_started_at); | ||
CREATE INDEX idx_usage_events_published_at ON usage_events (published_at); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want a single index over a tuple and the order matters. Having 3 indexes is much less useful for a query that uses all 3 fields.
It should be published_at first, to allow the query to ignore already published events, which will be the vast majority of the table. Next is publish_started_at, which we use to filter out in-progress events. Lastly created_at since we order by this and exclude anything older than 30 days.
// Note that the following event types should not be updated once they are | ||
// merged into the product. Please consult Dean before making any changes. | ||
type Event interface { | ||
usageEvent() // to prevent external types from implementing this interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the motivation for preventing external types from implementing?
// the count of all existing managed agents (count=N) | ||
// - A new managed agent is created (count=1) | ||
type DCManagedAgentsV1 struct { | ||
Count uint64 `json:"count"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we'll want a lot more than just the count so that customers can understand their usage, e.g. by template, organization, user.
startErr := make(chan error) | ||
go func() { | ||
err := publisher.Start() | ||
testutil.RequireSend(ctx, t, startErr, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't call Require
methods in goroutines, only the main test goroutine.
handler func(req usage.TallymanIngestRequestV1) any | ||
) | ||
ingestURL := fakeServer(t, tallymanHandler(t, licenseJWT, func(req usage.TallymanIngestRequestV1) any { | ||
callCount := atomic.AddInt64(&calls, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to prevent test races, we need to ensure that the publish calls are completed before we attempt to read this value. If the test prevents such races, atomic
is superfluous because we will not concurrently access it. Making it atomic doesn't do anything to prevent racy tests and is confusing because it signals to people that there might be concurrent accesses, when we need to ensure there are none.
// publishOnce publishes up to tallymanPublishBatchSize usage events to | ||
// tallyman. It returns the number of successfully published events. | ||
func (p *tallymanPublisher) publishOnce(ctx context.Context, deploymentID uuid.UUID) (int, error) { | ||
licenseJwt, err := p.getBestLicenseJWT(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't be doing this for every publish. Can we connect it up to the rest of the license code so we don't add more queries? Or at least use the same query strategy of querying every 10 minutes and listening for published changes?
var ( | ||
acceptedEvents = make(map[string]*TallymanIngestAcceptedEventV1) | ||
rejectedEvents = make(map[string]*TallymanIngestRejectedEventV1) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the multiline var
makes this harder to read IMO and takes up twice the number of lines as
acceptedEvents := make(map[string]*TallymanIngestAcceptedEventV1)
rejectedEvents := make(map[string]*TallymanIngestRejectedEventV1)
Not used in coderd yet, see stack.
Adds two new packages:
coderd/usage
: provides an interface for the "Collector" as well as a stub implementation for AGPLenterprise/coderd/usage
: provides an interface for the "Publisher" as well as a Tallyman implementationRelates to coder/internal#814