Optimize performance_datum queries in alert generation by camd · Pull Request #9550 · mozilla/treeherder

camd · 2026-05-23T18:29:29Z

Bug 2041948

generate_new_alerts_in_series and generate_new_test_alerts_in_series run on every signature during ingestion. This PR addresses their PerformanceDatum / PerformanceDatumReplicate reads:

The series query filtered only on signature + push_timestamp, so Postgres could not seek the composite (repository, signature, push_timestamp) index (its leading column, repository, was unconstrained). Add repository=signature.repository; a signature has exactly one repository, so results are unchanged.
Trimmed over-fetching on the series query with .only("id", "push_id", "push_timestamp", "value") -- the consuming loop only reads those 4 attributes, so this drops the SELECT from 10 columns to 4 (including three unused VARCHARs).
Removed a dead .select_related("summary__push__time") on the latest-alert query. Django silently ignores select_related when .values_list() is chained, so it produced identical SQL -- misleading dead code.
De-duplicated the two near-identical functions into a shared _load_revision_data() + _collect_replicates_map().

`series` query impact (the headline win)

This is currently the #1 query by total load in prod Cloud SQL Query Insights -- not because any single call is slow (~30 ms avg), but because it is called ~8,858×/hour (once per signature on ingestion). Total load ≈ avg_exec_time × times_called, so cutting per-call cost cuts the dominant load proportionally.

Without repository=, Postgres uses the signature_id-only index and reads the signature's entire history, discarding everything outside the 2-week window. Adding repository= switches it to the composite index, which range-seeks straight to the window. Measured on the staging replica (sig 5182079, repo 77: 9,462 rows all-time, 1,788 in the 2-week window):

	before (10 cols, `signature_id` index)	after (4 cols, composite index)
index	`performance_datum_signature_id_a9af3d6b`	`performance_reposit_c9d328_idx` (repo, sig, push_timestamp)
rows discarded by filter	7,674 (scans all history)	0 (range-seek to window)
buffers	9,490	1,755 (~5.4× fewer)
row width	64	28 (`.only()`)
exec time	15.4 ms	3.6 ms (~4.3× faster)

The gain scales with how much history a signature has beyond the alert window, so the worst offenders (which drive the prod average up) benefit most.

Note on the replicate lookup shape

An earlier revision of this PR replaced the replicate lookup's IN (SELECT ... WHERE EXISTS (...)) with a plain join, on the assumption the EXISTS was redundant. It is logically redundant but load-bearing for the query planner, as @junngo flagged in review. Measured on the staging replica (autoland, sig 5182079, ~2.4k datums, ~118k replicate rows over the 2-week window):

shape	plan	time
plain `INNER JOIN`	Parallel Seq Scan on `performance_datum`	~22.2 s
`IN(Subquery(datum_ids))`, no `EXISTS`	same seq-scan plan	~65.2 s
materialized literal `IN (...)`	Parallel Seq Scan on `performance_datum_replicate`	>120 s
original `IN(Subquery(... EXISTS ...))`	full index path	~5.4 s cold / ~53 ms warm

The original shape resolves the small datum-id set via the composite index first, then probes performance_datum_replicate by its FK index, instead of letting the planner full-scan a giant table. It has been kept as-is (reverted to the original form); only the other optimizations above are new.

Includes a characterization test pinning the replicate-map behavior.

generate_new_alerts_in_series and generate_new_test_alerts_in_series run on every signature during ingestion. Their PerformanceDatum reads were not using the (repository, signature, push_timestamp) index: - The `series` query filtered only on signature + push_timestamp, so Postgres could not seek the composite index (its leading column, repository, was unconstrained). Add repository=signature.repository; a signature has exactly one repository, so results are unchanged. - The replicate lookup used `IN (SELECT ... WHERE EXISTS (...))`. The EXISTS was redundant -- a replicate row's parent datum is in the set by definition. Replace with a plain join via the new shared helper _collect_replicates_map(), removing the duplicated block in both functions and the now-unused Exists/OuterRef/Subquery imports. Add a characterization test pinning the replicate-map behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-05-23T18:44:56Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.94%. Comparing base (6e680ff) to head (71d3dba).
⚠️ Report is 20 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #9550   +/-   ##
=======================================
  Coverage   82.94%   82.94%           
=======================================
  Files         613      613           
  Lines       35372    35378    +6     
  Branches     3208     3206    -2     
=======================================
+ Hits        29338    29344    +6     
  Misses       5880     5880           
  Partials      154      154

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Trim the PerformanceDatum series query to the 4 columns actually used (.only id/push_id/push_timestamp/value) and drop the no-op select_related on the latest-alert-timestamp lookup in both generate_new_alerts_in_series and generate_new_test_alerts_in_series. Extract the now-identical data-loading prologue into _load_revision_data, parameterized by the latest-alert-timestamp queryset and the RevisionDatum class, so the two alerting paths share one copy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

junngo

At that time, we spent quite a bit of effort tuning the datum replicate query shape.
The Subquery and Exists conditions may look logically redundant, but they helped the DB optimizer choose the intended index path for PerformanceDatumReplicate.
I’m happy to hear any ideas for improving it further, and I’ll also try to double-check whether there’s a better query shape here.

junngo · 2026-05-25T18:31:44Z

+        performance_datum__signature=signature,
+        performance_datum__repository=signature.repository,
+        performance_datum__push_timestamp__gte=alert_after_ts,
+    ).values_list("performance_datum_id", "value")


I’m concerned this may introduce a query plan regression.
This query [0] takes over 2 minutes per call in Redash. From the query plan [1], performance_datum uses the intended index quickly, but performance_datum_replicate still performs a full scan.

[0] SELECT pdr.performance_datum_id, pdr.value FROM performance_datum_replicate pdr INNER JOIN performance_datum pd ON pdr.performance_datum_id = pd.id WHERE pd.push_timestamp >= '2026-05-11' and pd.repository_id = 77 and pd.signature_id = 5763463 [1] Execution Time: 138,480 ms ≈ 2m 18s Gather └─ Hash Join Join: pdr.performance_datum_id = pd.id ├─ Outer: Parallel Seq Scan on performance_datum_replicate pdr │ ├─ .... │ ├─ Total scanned: ~1.36 billion rows │ └─ Inner: Hash └─ Index Scan on performance_datum pd ├─ Index: performance_reposit_c9d328_idx ├─ Cond: │ repository_id = 77 │ signature_id = 5763463 │ push_timestamp >= 2026-05-11 ├─ Actual rows: 55,900 └─ Actual time: ~49ms

@junngo : Thank you so much for this information! I did a re-analysis and you're totally right. You did some deep, and truly excellent work to get this query optimized! Thanks for bearing with me on this!

junngo · 2026-05-25T19:05:48Z

-        )
-        replicates = PerformanceDatumReplicate.objects.filter(
-            performance_datum_id__in=Subquery(datum_with_replicates.values("id"))
-        ).values_list("performance_datum_id", "value")


The original query takes about 100 ms per call, which looks reasonable to me.

Execution Time: 107 ms Nested Loop ├─ Outer: Aggregate / Gather │ └─ Nested Loop Semi │ ├─ Bitmap Heap Scan on performance_datum │ │ ├─ Index: performance_reposit_c9d328_idx │ │ ├─ Actual rows: ~55,494 total │ │ └─ Uses repository/signature/push_timestamp condition │ │ │ └─ Index Only Scan on performance_datum_replicate u0 │ ├─ Index: performance_datum_replicate_performance_datum_id_fe2ed518 │ ├─ Cond: performance_datum_id = v0.id │ ├─ Loops: 55,494 │ └─ Heap Fetches: 0 │ └─ Inner: Index Scan on performance_datum_replicate └─ Not executed because no matching datum ids with replicates

The plain join introduced in the previous commit lets PostgreSQL seq-scan a multi-hundred-million-row table. Measured on staging (autoland, ~2.4k datums, ~118k replicates): plain join ~22s, dropping only the EXISTS ~65s, vs the restored two-step IN(Subquery(...EXISTS...)) shape ~53ms warm / ~5.4s cold. The EXISTS/Subquery form looks redundant but is load-bearing: it resolves the small datum-id set via the (repository, signature, push_timestamp) composite index first, then probes performance_datum_replicate by its FK index, instead of letting the planner full-scan a giant table. Keeps the over-fetch trim, dead select_related removal, and the _load_revision_data dedup refactor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

camd · 2026-05-31T18:53:08Z

@junngo Thanks for bearing with me on this. Would you take a look at the changes I've made to the PR now?

I did some exploring and there may still be room for tuning this query. Since it's the one called most / second, even a small gain will have a good impact. I had Claude make a description below:

These are the exact SQL statements Django generates, before vs after, for the two changed queries. Drop in your own repository_id / signature_id / push_timestamp and run each under EXPLAIN (ANALYZE, BUFFERS) to compare plans. (Values below are placeholders; e.g. your earlier prod sample was repository_id = 77, signature_id = 5763463, push_timestamp >= '2026-05-11'.)

1. `series` query

Before — filters on signature_id + push_timestamp only, selects all 10 columns. Postgres can't seek the composite index (leading repository_id unconstrained), so it uses the signature_id-only index and scans the signature's full history, discarding everything outside the window:

EXPLAIN (ANALYZE, BUFFERS)
SELECT "performance_datum"."id", "performance_datum"."repository_id",
       "performance_datum"."signature_id", "performance_datum"."value",
       "performance_datum"."push_timestamp", "performance_datum"."application_version",
       "performance_datum"."job_id", "performance_datum"."push_id",
       "performance_datum"."os_name", "performance_datum"."platform_version"
FROM "performance_datum"
WHERE "performance_datum"."signature_id" = <signature_id>
  AND "performance_datum"."push_timestamp" >= '<push_timestamp>';

After — adds repository_id (a signature has exactly one repository, so results are identical) and trims to the 4 columns the loop actually reads. Now it range-seeks the composite (repository_id, signature_id, push_timestamp) index:

EXPLAIN (ANALYZE, BUFFERS)
SELECT "performance_datum"."id", "performance_datum"."push_id",
       "performance_datum"."push_timestamp", "performance_datum"."value"
FROM "performance_datum"
WHERE "performance_datum"."repository_id" = <repository_id>
  AND "performance_datum"."signature_id" = <signature_id>
  AND "performance_datum"."push_timestamp" >= '<push_timestamp>';

2. Replicate lookup

Plain join (briefly introduced in this PR, now reverted) — lets the planner full-scan a giant table:

EXPLAIN (ANALYZE, BUFFERS)
SELECT "performance_datum_replicate"."performance_datum_id", "performance_datum_replicate"."value"
FROM "performance_datum_replicate"
INNER JOIN "performance_datum"
  ON ("performance_datum_replicate"."performance_datum_id" = "performance_datum"."id")
WHERE "performance_datum"."push_timestamp" >= '<push_timestamp>'
  AND "performance_datum"."repository_id" = <repository_id>
  AND "performance_datum"."signature_id" = <signature_id>;

Restored original — the IN (SELECT … WHERE EXISTS (…)) shape you tuned. Resolves the small datum-id set via the composite index first, then probes performance_datum_replicate by its FK index:

EXPLAIN (ANALYZE, BUFFERS)
SELECT "performance_datum_replicate"."performance_datum_id", "performance_datum_replicate"."value"
FROM "performance_datum_replicate"
WHERE "performance_datum_replicate"."performance_datum_id" IN (
  SELECT V0."id"
  FROM "performance_datum" V0
  WHERE V0."push_timestamp" >= '<push_timestamp>'
    AND V0."repository_id" = <repository_id>
    AND V0."signature_id" = <signature_id>
    AND EXISTS (
      SELECT 1 AS "a"
      FROM "performance_datum_replicate" U0
      WHERE U0."performance_datum_id" = V0."id"
      LIMIT 1
    )
);

The current branch ships the "After" series query and the "Restored original" replicate lookup.

camd self-assigned this May 23, 2026

camd requested review from beatrice-acasandrei and esanuandra as code owners May 23, 2026 18:29

camd requested review from Archaeopteryx and gmierz May 23, 2026 18:31

camd mentioned this pull request May 25, 2026

Add read-replica routing for Perfherder read-only endpoints #9528

Open

6 tasks

junngo reviewed May 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize performance_datum queries in alert generation#9550

Optimize performance_datum queries in alert generation#9550
camd wants to merge 3 commits into
masterfrom
camd/query-optimization-performance-datum

camd commented May 23, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented May 23, 2026 •

edited

Loading

Uh oh!

junngo left a comment

Uh oh!

junngo May 25, 2026

Uh oh!

camd May 31, 2026

Uh oh!

junngo May 25, 2026

Uh oh!

camd commented May 31, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

camd commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

series query impact (the headline win)

Note on the replicate lookup shape

Uh oh!

codecov-commenter commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

junngo left a comment

Choose a reason for hiding this comment

Uh oh!

junngo May 25, 2026

Choose a reason for hiding this comment

Uh oh!

camd May 31, 2026

Choose a reason for hiding this comment

Uh oh!

junngo May 25, 2026

Choose a reason for hiding this comment

Uh oh!

camd commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. series query

2. Replicate lookup

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

camd commented May 23, 2026 •

edited

Loading

`series` query impact (the headline win)

codecov-commenter commented May 23, 2026 •

edited

Loading

camd commented May 31, 2026 •

edited

Loading

1. `series` query