Feat/bot leaderboard/v2.3 followup #4916
Always return tuple[float, float | None] instead of conditionally returning either a bare float or a tuple, so callers have a single shape to unpack. The second element stays None unless include_discrimination is set. Co-authored-by: Cursor <cursoragent@cursor.com>
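A minimal sketch of the single-shape return described in this commit. The function name `summarize` and the use of a population standard deviation as the second element are hypothetical stand-ins, since the real function body is not shown in this conversation; only the `tuple[float, float | None]` shape and the `include_discrimination` flag come from the commit message.

```python
from __future__ import annotations

from statistics import mean, pstdev


def summarize(
    values: list[float], include_discrimination: bool = False
) -> tuple[float, float | None]:
    """Hypothetical stand-in that always returns a 2-tuple."""
    estimate = mean(values)
    # Placeholder statistic: the real discrimination measure is not shown in the PR.
    discrimination = pstdev(values) if include_discrimination else None
    return estimate, discrimination


# Callers unpack the same shape whether or not the flag is set.
estimate, discrimination = summarize([1.0, 2.0, 3.0])
```

With one shape, call sites no longer need an `isinstance` check before unpacking.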
Replace the AIB project-id list duplicated inside gather_data with a single module-level AIB_PROJECT_IDS constant. Co-authored-by: Cursor <cursoragent@cursor.com>
Factor the question project filter into a project_filter Q object and add an aib_minibench_only flag that restricts the leaderboard to AIB and Minibench questions. Tag CSV output with the _AIBMiniB suffix. Co-authored-by: Cursor <cursoragent@cursor.com>
Add a min_human_forecasters threshold: on community questions with fewer than that many distinct human forecasters, keep the question but drop the Community Aggregate head-to-head matches. Do the same for minibench questions, which have no real human crowd (also skip building the aggregate for them in gather_data). Tag CSV output with _MinHF. Co-authored-by: Cursor <cursoragent@cursor.com>
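The threshold described in this commit could be sketched in plain Python as follows. The helper name `low_human_question_ids` and the `(question_id, author_id, is_bot)` row layout are assumptions for illustration; the PR itself computes the count with a Django `annotate(...)` query.

```python
from collections import defaultdict


def low_human_question_ids(forecasts, min_human_forecasters):
    """Return ids of questions with fewer distinct human forecasters than the threshold.

    `forecasts` is an iterable of (question_id, author_id, is_bot) rows
    (a hypothetical layout, not the project's actual schema).
    """
    humans = defaultdict(set)
    for question_id, author_id, is_bot in forecasts:
        humans[question_id]  # register the question even if it has no human forecasts
        if not is_bot:
            humans[question_id].add(author_id)
    return {
        question_id
        for question_id, authors in humans.items()
        if len(authors) < min_human_forecasters
    }


forecasts = [
    (1, 10, False),
    (1, 10, False),  # same human forecasting twice counts once
    (1, 11, True),   # bots do not count
    (2, 10, False),
    (2, 12, False),
]
flagged = low_human_question_ids(forecasts, min_human_forecasters=2)
```

In this toy data question 1 has one distinct human and is flagged, while question 2 has two and is not.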
Replace the NotImplementedError with the per-year split for third-party
bots: rewrite their head-to-head ids to year-tagged strings ("name
(YYYY)"), parallel to the cp/pro aggregate split. This also drops them
from non_metac_bot_ids membership so the per-year history bypasses the
recency filter. Guard with an assert that include_non_metac_bots is set.
Co-authored-by: Cursor <cursoragent@cursor.com>
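A rough sketch of the year-tagged id rewrite, assuming each match already carries a year. The helper names `year_tagged_id` and `split_by_year`, and the way the year is supplied, are hypothetical; only the `"name (YYYY)"` format comes from the commit message.

```python
def year_tagged_id(name: str, year: int) -> str:
    """Build the per-year id in the "name (YYYY)" format."""
    return f"{name} ({year})"


def split_by_year(player_ids, years, third_party_names):
    """Rewrite only third-party bot ids; leave every other player id untouched."""
    return [
        year_tagged_id(player_id, year) if player_id in third_party_names else player_id
        for player_id, year in zip(player_ids, years)
    ]


tagged = split_by_year(
    ["bot-a", "some-other-player", "bot-a"],
    [2024, 2024, 2025],
    third_party_names={"bot-a"},
)
```

Because the rewritten ids are new strings, they no longer match the original bot ids, which is consistent with the commit's note that they fall out of `non_metac_bot_ids` membership.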
Add participation_parent_key to map year-split player ids
("... (YYYY)") to their parent, and apply min_participation_count to
the parent's combined question set. This keeps an established
aggregate/bot from being dropped just because individual per-year
slices are sparse.
Co-authored-by: Cursor <cursoragent@cursor.com>
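One way the parent mapping could work is sketched below. The regular expression, the helper `players_meeting_minimum`, and the dict-of-sets input are assumptions; only the name `participation_parent_key`, the `"... (YYYY)"` format, and the idea of applying `min_participation_count` to the parent's combined question set come from the commit message.

```python
import re
from collections import defaultdict

# Assumed format: anything followed by a space and a four-digit year in parentheses.
_YEAR_SUFFIX = re.compile(r"^(?P<parent>.+) \(\d{4}\)$")


def participation_parent_key(player_id):
    """Map "name (YYYY)" to "name"; return any other id unchanged."""
    if isinstance(player_id, str):
        match = _YEAR_SUFFIX.match(player_id)
        if match:
            return match.group("parent")
    return player_id


def players_meeting_minimum(questions_by_player, min_participation_count):
    """Keep a player if its parent's combined question set is large enough."""
    combined = defaultdict(set)
    for player_id, question_ids in questions_by_player.items():
        combined[participation_parent_key(player_id)] |= set(question_ids)
    return {
        player_id
        for player_id in questions_by_player
        if len(combined[participation_parent_key(player_id)]) >= min_participation_count
    }


questions_by_player = {
    "bot-a (2024)": {1, 2},
    "bot-a (2025)": {3},
    "bot-b": {1},
}
kept = players_meeting_minimum(questions_by_player, min_participation_count=3)
```

Here both `bot-a` slices survive because the parent covers three questions in total, even though each slice alone falls short.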
Add combine_year_split_players, which collapses per-year community/pro aggregates and non_metac_bots_by_year bots into a single combined entry (contribution-count-weighted mean skill, CI via SE propagation, summed counts), mirroring the front-end re-aggregation. Apply it to the leaderboard DB save and CSV output while keeping the per-year fit intact for the discrimination and distribution diagnostics. Co-authored-by: Cursor <cursoragent@cursor.com>
Set the Command.handle() run configuration to the current v2.3 defaults (include_minibench, min_human_forecasters, non_metac_bots_by_year, bot recency/score windows, ALS off, etc.) and wire the new aib_minibench_only / min_human_forecasters kwargs through the call. Move the explanatory comments off the function signature. Co-authored-by: Cursor <cursoragent@cursor.com>
    .exclude(post__default_project__slug__startswith="minibench")
    .annotate(
        human_forecaster_count=Count(
            "user_forecasts__author",
This doesn't matter here, because the filter below (user_forecasts__author__is_bot=False) already joins the forecasts and User tables. If that weren't the case, you'd want to replace "user_forecasts__author" with "user_forecasts__author_id" to skip the join and count on the integer field directly.
    # Drop only the community-aggregate matches on low-human / minibench questions,
    if drop_cp_question_ids:
        keep = [
            i
            for i, (qid, u1, u2) in enumerate(
                zip(question_ids, user1_ids, user2_ids)
            )
            if not (
                qid in drop_cp_question_ids
                and ("Community Aggregate" in (u1, u2))
            )
        ]
        user1_ids = [user1_ids[i] for i in keep]
        user2_ids = [user2_ids[i] for i in keep]
        question_ids = [question_ids[i] for i in keep]
        scores = [scores[i] for i in keep]
        coverages = [coverages[i] for i in keep]
        timestamps = [timestamps[i] for i in keep]
You should drop those question ids before the gather_data step: just filter out the questions with those ids. That lets you skip the gather_data calculations for those questions instead of computing them and tossing the results later.
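For reference, the match-level filter in the quoted snippet can be sketched on toy rows. The tuple layout and the helper name `filter_matches` are hypothetical; the condition mirrors the quoted code. Per the commit message, the flagged question is kept and only its Community Aggregate matches are dropped, so any earlier question-level filter would presumably need to preserve that behavior.

```python
COMMUNITY_AGGREGATE = "Community Aggregate"


def filter_matches(rows, drop_cp_question_ids):
    """Drop a row only if its question is flagged AND one side is the aggregate.

    `rows` holds (question_id, user1_id, user2_id, score) tuples, a hypothetical
    row-wise layout standing in for the parallel lists in the quoted code.
    """
    return [
        row
        for row in rows
        if not (
            row[0] in drop_cp_question_ids
            and COMMUNITY_AGGREGATE in (row[1], row[2])
        )
    ]


rows = [
    (1, "bot-a", COMMUNITY_AGGREGATE, 0.2),   # flagged question, aggregate match: dropped
    (1, "bot-a", "bot-b", -0.1),              # flagged question, bot-vs-bot match: kept
    (2, "bot-a", COMMUNITY_AGGREGATE, 0.3),   # unflagged question: kept
]
kept_rows = filter_matches(rows, drop_cp_question_ids={1})
```

Keeping the fields together in one row also avoids re-indexing six parallel lists with the same `keep` indices.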
    - score: contribution-count-weighted mean of the per-year skills, with
      weight = max(distinct-question count, 1).
    - CI: per-year half-widths are converted to SEs and propagated through the
      same normalized weights -- se_combined = sqrt(Σ (wᵢ/W)² · seᵢ²), then
      score ± z·se_combined. Combining estimates *narrows* the interval. CI is
      dropped unless every member has both bounds.
    - match count / distinct questions / coverage: combined across the year
      slices (which are disjoint in questions).
Seems reasonable and I can't suggest an alternative. But just want to double check that it's "close enough" that we don't just have to recalculate the whole thing with and without the split.
Speaking of which, that is an option - if you wanted to validate the proximity of this combiner, you could take the same data with and without splitting and then recombine the output of the split and see the difference. If you've already done it, mentioning it here as the justification of this move would be reasonable.
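The combination rule quoted in the docstring could be sketched like this, which may also help when comparing against a non-split fit. The function name `combine_year_slices`, the dict keys, and the choice of a 95% normal quantile for z are assumptions; the weights, the SE propagation formula, and the rule for dropping the CI follow the docstring.

```python
import math

Z_95 = 1.959963984540054  # assumption: the CIs are 95% normal intervals


def combine_year_slices(slices, z=Z_95):
    """Combine per-year entries into (score, ci_lower, ci_upper).

    Each slice is a dict with keys "score", "ci_lower", "ci_upper" and
    "question_count" (hypothetical field names).
    """
    weights = [max(s["question_count"], 1) for s in slices]
    total = sum(weights)
    score = sum(w * s["score"] for w, s in zip(weights, slices)) / total

    # The CI is dropped unless every member has both bounds.
    if any(s["ci_lower"] is None or s["ci_upper"] is None for s in slices):
        return score, None, None

    variance = 0.0
    for w, s in zip(weights, slices):
        se = (s["ci_upper"] - s["ci_lower"]) / (2 * z)  # half-width -> standard error
        variance += (w / total) ** 2 * se**2
    se_combined = math.sqrt(variance)
    return score, score - z * se_combined, score + z * se_combined


example = [
    {"score": 1.0, "ci_lower": 1.0 - Z_95, "ci_upper": 1.0 + Z_95, "question_count": 10},
    {"score": 3.0, "ci_lower": 3.0 - Z_95, "ci_upper": 3.0 + Z_95, "question_count": 10},
]
combined_score, combined_lower, combined_upper = combine_year_slices(example)
```

With two equally weighted slices of unit standard error, the combined standard error is sqrt(0.5), so the interval is narrower than either input, as the docstring notes. This propagation treats the per-year estimates as independent, which is an assumption a split versus non-split comparison could check.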
Batch update for several parameter implementations, bug fixes, and logic updates