Incremental latency quantiles computation #6310

Merged
shuyangli merged 3 commits into main from sl/incremental-latency-quantile
Feb 13, 2026
Conversation

@shuyangli (Member) commented Feb 12, 2026

Instead of calculating and storing exact percentiles in a materialized view, this assigns each request latency to bucket floor(log2(latency) * 64) (so the total number of buckets is practically bounded) to give us reasonable estimates of p99.9 latency. We can use the midpoint of the bucket as the estimated latency; if the tail latency is 1 minute, our worst-case error from estimation is ~300ms.
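As a rough illustration (function names here are made up, not the actual implementation), the bucketing and midpoint estimate look like this in Rust:

```rust
// Sketch of the log2 bucketing: bucket_id = floor(log2(latency_ms) * 64), so
// adjacent bucket bounds differ by a factor of 2^(1/64) ≈ 1.011. Buckets are
// ~1.1% wide, which is where the ~300ms worst-case error at a 1-minute tail
// latency comes from.

fn bucket_id(latency_ms: f64) -> i64 {
    (latency_ms.log2() * 64.0).floor() as i64
}

/// Lower/upper bounds of a bucket, in milliseconds.
fn bucket_bounds(id: i64) -> (f64, f64) {
    let lower = 2f64.powf(id as f64 / 64.0);
    let upper = 2f64.powf((id + 1) as f64 / 64.0);
    (lower, upper)
}

/// Midpoint estimate for any latency that fell into this bucket.
fn bucket_midpoint(id: i64) -> f64 {
    let (lower, upper) = bucket_bounds(id);
    (lower + upper) / 2.0
}

fn main() {
    let id = bucket_id(60_000.0); // 1 minute
    let (lower, upper) = bucket_bounds(id);
    println!(
        "bucket {id}: [{lower:.0}, {upper:.0}] ms, midpoint {:.0} ms, max error ≈ {:.0} ms",
        bucket_midpoint(id),
        (upper - lower) / 2.0
    );
}
```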

At query time, quantiles are computed from histogram CDFs (a sketch follows the steps below):

  1. Aggregate bucket counts over the requested window per (model, metric, bucket_id).
  2. Build cumulative counts ordered by bucket_id.
  3. For each target quantile, compute rank target = 1 + quantile * (total_count - 1).
  4. Pick the first bucket where cumulative count >= rank target.
  5. Interpolate within that bucket in log-space between bucket bounds to estimate the quantile value.
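
A minimal Rust sketch of steps 2-5, assuming the bucket counts from step 1 have already been aggregated (names are illustrative, not the actual implementation):

```rust
/// Approximate a quantile from aggregated (bucket_id, count) pairs: build the
/// cumulative counts, find the rank target, then interpolate in log-space
/// within the first bucket whose cumulative count reaches that rank.
fn approx_quantile(mut buckets: Vec<(i64, u64)>, quantile: f64) -> Option<f64> {
    buckets.sort_by_key(|&(id, _)| id);
    let total: u64 = buckets.iter().map(|&(_, count)| count).sum();
    if total == 0 {
        return None;
    }
    // Step 3: rank target = 1 + quantile * (total_count - 1).
    let target = 1.0 + quantile * (total as f64 - 1.0);

    let mut cumulative = 0u64;
    for &(id, count) in &buckets {
        let prev = cumulative;
        cumulative += count;
        // Step 4: first bucket where the cumulative count reaches the rank target.
        if cumulative as f64 >= target {
            // Step 5: interpolate between the bucket's log2 bounds (id/64 and (id+1)/64).
            let lower_log2 = id as f64 / 64.0;
            let upper_log2 = (id + 1) as f64 / 64.0;
            let fraction = (target - prev as f64) / count as f64;
            return Some(2f64.powf(lower_log2 + fraction * (upper_log2 - lower_log2)));
        }
    }
    None
}

// e.g. p99.9 over the aggregated window:
// let p999 = approx_quantile(bucket_counts, 0.999);
```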

We should also consider not returning p0.1 and p99.9 when the count is small, because the error will be high.

A step towards #5691.


Note

Medium Risk
Touches core Postgres schema and background refresh scheduling for dashboard latency metrics; mistakes could lead to stale/incorrect quantiles or increased DB load during refresh windows.

Overview
Replaces the model_latency_quantiles* materialized views with incrementally maintained sparse latency histograms (minute + hour rollups) and regular views that compute approximate quantiles from those histograms.

Adds new Postgres tables/functions to bucket latencies (log2 buckets) and refresh rollups incrementally with persisted watermarks, and updates pg_cron setup/validation + e2e tests + fixture scripts to run the new tensorzero_refresh_model_latency_histograms_incremental job instead of REFRESH MATERIALIZED VIEW.
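
The incremental refresh itself is a Postgres function (tensorzero_refresh_model_latency_histograms_incremental) driven by pg_cron, but the persisted-watermark pattern it follows can be illustrated with a small in-memory sketch; every type and field name below is hypothetical:

```rust
use std::collections::HashMap;

// Hypothetical shapes, for illustration only.
struct RawLatencyRow {
    timestamp_s: u64, // event time, seconds since epoch
    model: String,
    metric: String, // e.g. total response time vs. TTFT
    bucket_id: i64, // log2 bucket, as described in the PR
}

#[derive(Default)]
struct MinuteHistogramRollup {
    // (minute_start_s, model, metric, bucket_id) -> count
    counts: HashMap<(u64, String, String, i64), u64>,
    // Persisted watermark: rows before this have already been rolled up.
    watermark_s: u64,
}

impl MinuteHistogramRollup {
    /// Roll up only the rows that arrived since the last watermark, then
    /// advance it, so repeated refreshes never rescan already-processed data.
    fn refresh(&mut self, raw: &[RawLatencyRow], now_s: u64) {
        for row in raw
            .iter()
            .filter(|r| r.timestamp_s >= self.watermark_s && r.timestamp_s < now_s)
        {
            let minute_start = row.timestamp_s - row.timestamp_s % 60;
            let key = (minute_start, row.model.clone(), row.metric.clone(), row.bucket_id);
            *self.counts.entry(key).or_insert(0) += 1;
        }
        self.watermark_s = now_s;
    }
}

fn main() {
    let mut rollup = MinuteHistogramRollup::default();
    let rows = vec![RawLatencyRow {
        timestamp_s: 120,
        model: "example-model".into(),
        metric: "response_time_ms".into(),
        bucket_id: 1015,
    }];
    rollup.refresh(&rows, 180);
    assert_eq!(rollup.counts.len(), 1);
}
```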

Written by Cursor Bugbot for commit ebf6027. This will update automatically on new commits.

@shuyangli force-pushed the sl/incremental-latency-quantile branch 4 times, most recently from 6492b8a to ebf6027 on February 12, 2026 20:21
@shuyangli marked this pull request as ready for review February 12, 2026 20:31
@shuyangli (Member Author)

@BugBot review

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@shuyangli (Member Author)

@amishler - do you have any other suggestions on how we could incrementally compute latency quantiles, so queries don't have to scan the whole table? We can't take on tdigest extensions, and ideally we can implement this fully in SQL.

@amishler (Member) commented Feb 12, 2026

@amishler - do you have any other suggestions on how we could incrementally compute latency quantiles, so queries don't have to scan the whole table? We can't take on tdigest extensions, and ideally we can implement this fully in SQL.

For quantiles you need to maintain the full distribution, so the two options in principle to avoid computing over the whole table are (1) lossy or (2) lossless compression of the latency distribution. Lossy is what this approach does. Lossless would essentially mean maintaining a frequency table with counts per millisecond value - similar to bucketing in that you collapse rows into counts, but you don't bucket over the x-axis. That obviously only saves space if you have a lot of duplicate ms values, and I don't know if it's feasible space-wise for rollups over longer time scales like an hour. You could consider other lossless compression schemes like Huffman encoding but I have no idea if they can be implemented easily in sql.
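
For concreteness, the lossless option described above amounts to something like this (a toy in-memory sketch, not a proposal for the actual schema):

```rust
use std::collections::HashMap;

/// Collapse rows into exact per-millisecond counts instead of log2 buckets.
/// Quantiles computed from this table are exact, but the number of distinct
/// keys is unbounded, so it only saves space when many latencies share the
/// same millisecond value.
fn frequency_table(latencies_ms: &[u64]) -> HashMap<u64, u64> {
    let mut counts = HashMap::new();
    for &ms in latencies_ms {
        *counts.entry(ms).or_insert(0) += 1;
    }
    counts
}
```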

Basically any lossy approach involves bucketing. The log-scaled buckets seem sensible if the latency distribution is heavily right-skewed (for example, approximately log-normal), which usually seems to be the case in practice. You could potentially come up with more principled buckets for specific distributions + specific quantiles of interest, but you'd have to know the distribution in advance. Without that, I'd favor this approach.

@amishler (Member) left a comment


From a stats perspective, this approach makes sense. The relative error compared to the actual empirical quantiles is ~1% across all buckets, which is nice. See my comment inline also.

Agree that small/large quantiles shouldn't be reported when counts are small. This is more a statistical issue than a data compression issue: empirical quantiles in the tails generally have higher variances than in the middle of the distribution, so you need more data to estimate the true quantiles well.

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ebf602739c


@shuyangli force-pushed the sl/incremental-latency-quantile branch from ebf6027 to e45f5cc on February 13, 2026 01:03
@shuyangli force-pushed the sl/incremental-model-provider-stats branch from 9aced72 to 108658f on February 13, 2026 01:03
Base automatically changed from sl/incremental-model-provider-stats to main February 13, 2026 03:41
@shuyangli force-pushed the sl/incremental-latency-quantile branch from e45f5cc to e3996ef on February 13, 2026 04:01
@virajmehta (Member) left a comment


Generally good, just some comments about maintainability. I wish we could test this code properly; not sure if that's possible.

@virajmehta assigned shuyangli and unassigned virajmehta Feb 13, 2026
@shuyangli force-pushed the sl/incremental-latency-quantile branch 6 times, most recently from 7c3f334 to 1f83cb2 on February 13, 2026 17:53
@shuyangli requested a review from virajmehta February 13, 2026 17:55
@shuyangli assigned virajmehta and unassigned shuyangli Feb 13, 2026
@shuyangli (Member Author)

Shifted the bucketing and quantiles logic into Rust; now the migration is only for rolling up raw data into the minute/hour tables.

virajmehta previously approved these changes Feb 13, 2026
@virajmehta added this pull request to the merge queue Feb 13, 2026
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 13, 2026
@shuyangli added this pull request to the merge queue Feb 13, 2026
Merged via the queue into main with commit 12b204d Feb 13, 2026
63 checks passed
@shuyangli deleted the sl/incremental-latency-quantile branch February 13, 2026 23:18