Feat/bot leaderboard/v2.3 followup #4916

Open
colesussmeier wants to merge 8 commits into
feat/bot-leaderboard/v2.3 from
feat/bot-leaderboard/v2.3-followup
Conversation

@colesussmeier

Batch update for several parameter implementations, bug fixes, and logic updates

colesussmeier and others added 8 commits June 19, 2026 15:14
Always return tuple[float, float | None] instead of conditionally
returning either a bare float or a tuple, so callers have a single
shape to unpack. The second element stays None unless
include_discrimination is set.

Co-authored-by: Cursor <cursoragent@cursor.com>
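
A minimal sketch of the single-shape return described in the commit above. The function name `summarize_skill` and the spread-based "discrimination" value are placeholders for illustration, not the code in this PR; only the `tuple[float, float | None]` shape and the `include_discrimination` flag come from the commit message.

```python
from __future__ import annotations


def summarize_skill(
    values: list[float], include_discrimination: bool = False
) -> tuple[float, float | None]:
    """Always return (skill, discrimination); discrimination is None unless requested."""
    skill = sum(values) / len(values)
    discrimination: float | None = None
    if include_discrimination:
        # Placeholder statistic (assumption): the real discrimination
        # measure is not shown in this thread.
        discrimination = max(values) - min(values)
    return skill, discrimination


# Callers unpack the same shape regardless of the flag.
skill, discrimination = summarize_skill([1.0, 2.0, 3.0])
```

Because the second element is always present, callers that do not need it can write `skill, _ = summarize_skill(...)` without branching on the return type.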
Replace the AIB project-id list duplicated inside gather_data with a
single module-level AIB_PROJECT_IDS constant.

Co-authored-by: Cursor <cursoragent@cursor.com>
Factor the question project filter into a project_filter Q object and
add an aib_minibench_only flag that restricts the leaderboard to AIB
and Minibench questions. Tag CSV output with the _AIBMiniB suffix.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add a min_human_forecasters threshold: on community questions with
fewer than that many distinct human forecasters, keep the question but
drop the Community Aggregate head-to-head matches. Do the same for
minibench questions, which have no real human crowd (also skip building
the aggregate for them in gather_data). Tag CSV output with _MinHF.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the NotImplementedError with the per-year split for third-party
bots: rewrite their head-to-head ids to year-tagged strings ("name
(YYYY)"), parallel to the cp/pro aggregate split. This also drops them
from non_metac_bot_ids membership so the per-year history bypasses the
recency filter. Guard with an assert that include_non_metac_bots is set.

Co-authored-by: Cursor <cursoragent@cursor.com>
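
The id rewrite in the commit above could look roughly like the following. This is a sketch under assumptions: `split_bot_ids_by_year` and its parameters are hypothetical names, and only the `"name (YYYY)"` format and the assert on `include_non_metac_bots` are taken from the commit message.

```python
def year_tag(name: str, year: int) -> str:
    """Format a year-tagged player id, e.g. 'some-bot (2025)'."""
    return f"{name} ({year})"


def split_bot_ids_by_year(user_ids, years, bot_names, include_non_metac_bots=True):
    """Rewrite ids of the named third-party bots to year-tagged strings.

    user_ids and years are parallel lists (one entry per head-to-head match);
    ids not in bot_names pass through unchanged.
    """
    assert include_non_metac_bots, "per-year split requires include_non_metac_bots"
    return [
        year_tag(uid, year) if uid in bot_names else uid
        for uid, year in zip(user_ids, years)
    ]


tagged = split_bot_ids_by_year(
    ["bot-a", 42, "bot-a"], [2024, 2024, 2025], bot_names={"bot-a"}
)
```

Since the rewritten ids are new strings, they no longer match the original bot ids, which is consistent with the commit's note that they drop out of `non_metac_bot_ids` membership.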
Add participation_parent_key to map year-split player ids
("... (YYYY)") to their parent, and apply min_participation_count to
the parent's combined question set. This keeps an established
aggregate/bot from being dropped just because individual per-year
slices are sparse.

Co-authored-by: Cursor <cursoragent@cursor.com>
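
One way to read the parent-key idea above, as a hedged sketch: the regular expression, the `(player_id, question_id)` pair layout, and the helper `players_meeting_min_participation` are assumptions for illustration. Only the name `participation_parent_key`, the `"... (YYYY)"` suffix, and the rule of applying `min_participation_count` to the parent's combined question set come from the commit message.

```python
import re
from collections import defaultdict

# Matches ids of the form "<parent> (YYYY)".
_YEAR_SUFFIX = re.compile(r"^(?P<parent>.+) \(\d{4}\)$")


def participation_parent_key(player_id):
    """Map a year-split id to its parent; other ids map to themselves."""
    if isinstance(player_id, str):
        match = _YEAR_SUFFIX.match(player_id)
        if match:
            return match.group("parent")
    return player_id


def players_meeting_min_participation(pairs, min_participation_count):
    """pairs: list of (player_id, question_id). Count distinct questions per parent."""
    questions_by_parent = defaultdict(set)
    for player_id, question_id in pairs:
        questions_by_parent[participation_parent_key(player_id)].add(question_id)
    return {
        player_id
        for player_id, _ in pairs
        if len(questions_by_parent[participation_parent_key(player_id)])
        >= min_participation_count
    }


pairs = [
    ("Community Aggregate (2024)", 1),
    ("Community Aggregate (2025)", 2),
    ("Community Aggregate (2025)", 3),
    (7, 1),
]
kept = players_meeting_min_participation(pairs, min_participation_count=3)
```

In this example the 2024 slice has a single question on its own, but it survives because the parent's combined set has three.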
Add combine_year_split_players, which collapses per-year community/pro
aggregates and non_metac_bots_by_year bots into a single combined entry
(contribution-count-weighted mean skill, CI via SE propagation, summed
counts), mirroring the front-end re-aggregation. Apply it to the
leaderboard DB save and CSV output while keeping the per-year fit
intact for the discrimination and distribution diagnostics.

Co-authored-by: Cursor <cursoragent@cursor.com>
Set the Command.handle() run configuration to the current v2.3 defaults
(include_minibench, min_human_forecasters, non_metac_bots_by_year, bot
recency/score windows, ALS off, etc.) and wire the new
aib_minibench_only / min_human_forecasters kwargs through the call.
Move the explanatory comments off the function signature.

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0f8d31f3-12f2-4e59-abfe-2fadd831f814

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@github-actions

Contributor

🚀 Preview Environment

Your preview environment is ready!

Resource Details
🌐 Preview URL https://metaculus-pr-4916-feat-bot-leaderboard-v2-3-foll-preview.mtcl.cc
📦 Docker Image ghcr.io/metaculus/metaculus:feat-bot-leaderboard-v2.3-followup-bb7f574
🗄️ PostgreSQL NeonDB branch preview/pr-4916-feat-bot-leaderboard-v2-3-foll
Redis Fly Redis mtc-redis-pr-4916-feat-bot-leaderboard-v2-3-foll

Details

  • Commit: 8df5af98df968ca1dda8d769a4bf7804a78baa0d
  • Branch: feat/bot-leaderboard/v2.3-followup
  • Fly App: metaculus-pr-4916-feat-bot-leaderboard-v2-3-foll

ℹ️ Preview Environment Info

Isolation:

  • PostgreSQL and Redis are fully isolated from production
  • Each PR gets its own database branch and Redis instance
  • Changes pushed to this PR will trigger a new deployment

Limitations:

  • Background workers and cron jobs are not deployed in preview environments
  • If you need to test background jobs, use Heroku staging environments

Cleanup:

  • This preview will be automatically destroyed when the PR is closed

@lsabor lsabor left a comment

Contributor

great!

.exclude(post__default_project__slug__startswith="minibench")
.annotate(
    human_forecaster_count=Count(
        "user_forecasts__author",

Contributor

Doesn't matter in this case, because the filter query below already joins the forecasts and User tables via user_forecasts__author__is_bot=False. If that weren't the case, you'd want to replace "user_forecasts__author" with "user_forecasts__author_id" to skip the join and use the integer field directly.

Comment on lines +1173 to +1190
# Drop only the community-aggregate matches on low-human / minibench questions,
if drop_cp_question_ids:
    keep = [
        i
        for i, (qid, u1, u2) in enumerate(
            zip(question_ids, user1_ids, user2_ids)
        )
        if not (
            qid in drop_cp_question_ids
            and ("Community Aggregate" in (u1, u2))
        )
    ]
    user1_ids = [user1_ids[i] for i in keep]
    user2_ids = [user2_ids[i] for i in keep]
    question_ids = [question_ids[i] for i in keep]
    scores = [scores[i] for i in keep]
    coverages = [coverages[i] for i in keep]
    timestamps = [timestamps[i] for i in keep]

Contributor

You should drop those question ids before the gather_data step: just filter out questions with those ids. That lets you skip the gather_data calculations on those questions instead of calculating the results and tossing them later.
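
For reference while discussing this thread, here is a standalone sketch of what the quoted hunk does: it keeps each question but removes the rows where "Community Aggregate" is one of the two players. The function name `drop_aggregate_matches` and the dict-of-columns layout are assumptions made so the example runs on its own; the PR code operates on separate parallel lists.

```python
def drop_aggregate_matches(
    columns, drop_cp_question_ids, aggregate_id="Community Aggregate"
):
    """Filter parallel match columns, dropping only aggregate matches on flagged questions.

    columns: dict of equal-length lists that includes at least
    "question_ids", "user1_ids", and "user2_ids".
    """
    keep = [
        i
        for i, (qid, u1, u2) in enumerate(
            zip(columns["question_ids"], columns["user1_ids"], columns["user2_ids"])
        )
        if not (qid in drop_cp_question_ids and aggregate_id in (u1, u2))
    ]
    # Apply the same index selection to every column so they stay aligned.
    return {name: [values[i] for i in keep] for name, values in columns.items()}


columns = {
    "question_ids": [1, 1, 2],
    "user1_ids": ["Community Aggregate", 5, "Community Aggregate"],
    "user2_ids": [5, 6, 7],
    "scores": [0.1, 0.2, 0.3],
}
filtered = drop_aggregate_matches(columns, drop_cp_question_ids={1})
```

Note that the bot-vs-bot match on question 1 survives here, which is the difference between this row-level filter and excluding the question entirely before gather_data.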

Comment on lines +66 to +73
- score: contribution-count-weighted mean of the per-year skills, with
  weight = max(distinct-question count, 1).
- CI: per-year half-widths are converted to SEs and propagated through the
  same normalized weights -- se_combined = sqrt(Σ (wᵢ/W)² · seᵢ²), then
  score ± z·se_combined. Combining estimates *narrows* the interval. CI is
  dropped unless every member has both bounds.
- match count / distinct questions / coverage: combined across the year
  slices (which are disjoint in questions).

Contributor

Seems reasonable and I can't suggest an alternative. But just want to double check that it's "close enough" that we don't just have to recalculate the whole thing with and without the split.

Speaking of which, that is an option - if you wanted to validate the proximity of this combiner, you could take the same data with and without splitting and then recombine the output of the split and see the difference. If you've already done it, mentioning it here as the justification of this move would be reasonable.
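
To make the combination rule quoted in this thread concrete, here is a minimal sketch of the arithmetic in the docstring. The `YearSlice` fields, the function name `combine_year_slices`, and the 95% z value are assumptions for illustration; the PR's `combine_year_split_players` is not shown here and may differ. Coverage is left out because the docstring does not say how it is combined.

```python
import math
from dataclasses import dataclass
from typing import Optional

Z_95 = 1.959963984540054  # assumed two-sided 95% normal quantile


@dataclass
class YearSlice:
    score: float
    ci_lower: Optional[float]
    ci_upper: Optional[float]
    question_count: int
    match_count: int


def combine_year_slices(slices, z=Z_95):
    """Combine per-year entries into a single entry per the quoted docstring."""
    weights = [max(s.question_count, 1) for s in slices]
    total = sum(weights)
    score = sum(w * s.score for w, s in zip(weights, slices)) / total

    ci = None
    if all(s.ci_lower is not None and s.ci_upper is not None for s in slices):
        # Half-width -> SE, then propagate through the normalized weights.
        variance = sum(
            (w / total) ** 2 * ((s.ci_upper - s.ci_lower) / (2 * z)) ** 2
            for w, s in zip(weights, slices)
        )
        half_width = z * math.sqrt(variance)
        ci = (score - half_width, score + half_width)

    return {
        "score": score,
        "ci": ci,
        "question_count": sum(s.question_count for s in slices),
        "match_count": sum(s.match_count for s in slices),
    }


combined = combine_year_slices(
    [YearSlice(10.0, 8.0, 12.0, 1, 4), YearSlice(20.0, 18.0, 22.0, 3, 9)]
)
```

With per-year half-widths of 2.0 each and weights 1 and 3, the combined half-width is 2·sqrt(0.625), which is narrower than either input. This assumes the per-year estimates are independent, which is the property the reviewer's with/without-split comparison would check.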

2 participants