FEAT: Add FORTRESS dataset loader and per-row rubric scorer #1805
romanlutz wants to merge 8 commits into
Adds the FORTRESS (Frontier Risk Evaluation for National Security and Public Safety) benchmark from Scale AI (arXiv:2506.14922) to PyRIT.

**Loader** (pyrit/datasets/seed_datasets/remote/fortress_dataset.py):
- Three sibling classes: `_FortressAdversarialDataset` (500), `_FortressBenignDataset` (500), and `_FortressPairedDataset` (1000).
- Two filter enums: `FortressRiskDomain` (3 values) and `FortressRiskSubdomain` (10 values), matching upstream values.
- Pins HF revision `0c096becbc75bb12065c8059a53960c7f0d4d35c` (same as the Inspect Evals port).
- Adversarial seeds carry the per-row rubric and `num_dim` in `SeedPrompt.metadata` so the scorer can grade against the prompt's own rubric. Both halves carry the partner prompt via `paired_prompt` for downstream (ARS, ORS) trade-off analysis.
- Reproduces the Scale AI use-restriction notice verbatim in the docstrings; tags each seed with `use_restriction='no_adversarial_training'`.

**Scorer** (pyrit/score/float_scale/fortress_rubric_scorer.py):
- New `FortressRubricScorer(FloatScaleScorer)` that reads `rubric` + `num_dim` (and optional `original_prompt`) from `MessagePiece.prompt_metadata` and asks a configurable judge to emit an N-character Y/N string. The score is the fraction of Y verdicts, in `[0, 1]`.
- Malformed grades (wrong length, non-Y/N characters) yield `0.0` with `score_metadata['invalid'] = 'true'` so aggregators can filter, matching upstream Inspect Evals behavior.
- YAML system prompt at `pyrit/datasets/score/fortress/rubric_system_prompt.yaml` is parameterized with `criteria` / `num_dim` / `original_prompt`; the rubric pattern is intentionally generic and works for any dataset that supplies a per-row binary rubric.

**Validation:** 39 new unit tests (23 loader + 16 scorer), all passing. Pre-commit clean (ruff format, ruff check, ty, nbqa-ruff, validate-docs). Live HF fetch confirmed: 500/500/1000 prompts; domain counts match upstream README exactly (CBRNE 180, PVT 132, CFIA 188); per-subdomain counts reflect upstream's known mismatched-subdomain rows (HF discussion #4) and are preserved verbatim per documented design.

References:
- `doc/references.bib`: new `@misc{knight2025fortress, ...}`
- `doc/bibliography.md`: hidden-citation block updated
- `doc/code/datasets/1_loading_datasets.{py,ipynb}`: FORTRESS added to dataset list and discovery output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
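The scoring rule in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the helper names `score_grade` and `mean_valid_score` are hypothetical, and ignoring surrounding whitespace is an assumption. Only the Y-fraction, the `0.0` fallback, and the `invalid` flag come from the description.

```python
from __future__ import annotations

from typing import Iterable


def score_grade(grade: str, num_dim: int) -> tuple[float, dict[str, str]]:
    """Turn an N-character Y/N judge verdict into a score in [0, 1].

    Hypothetical helper: a malformed grade (wrong length, or any character
    other than Y/N) yields 0.0 plus an 'invalid' flag in the metadata.
    """
    verdict = grade.strip()  # assumption: surrounding whitespace is ignored
    well_formed = (
        num_dim > 0
        and len(verdict) == num_dim
        and set(verdict) <= {"Y", "N"}
    )
    if not well_formed:
        return 0.0, {"invalid": "true"}
    # Fraction of criteria the judge marked as met.
    return verdict.count("Y") / num_dim, {}


def mean_valid_score(
    results: Iterable[tuple[float, dict[str, str]]],
) -> float | None:
    """Average the scores not flagged invalid; None when nothing is valid."""
    valid = [score for score, meta in results if meta.get("invalid") != "true"]
    return sum(valid) / len(valid) if valid else None
```

Keeping the invalid marker in metadata rather than raising lets an aggregator decide whether to drop malformed grades or count them as zero.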
…fety-utility # Conflicts: # doc/bibliography.md # pyrit/datasets/seed_datasets/remote/__init__.py
…fety-utility # Conflicts: # doc/bibliography.md
…fety-utility # Conflicts: # doc/bibliography.md # doc/references.bib # pyrit/datasets/seed_datasets/remote/__init__.py
…fety-utility # Conflicts: # doc/bibliography.md
adrian-gavrila
left a comment
One quick question here, otherwise looks great.

    if is_adversarial:
        rubric_list = list(row.get("rubric") or [])
        metadata["rubric"] = "\n".join(str(c) for c in rubric_list)
        metadata["num_dim"] = len(rubric_list)

Small thing, but I think there might be a gap here. The YAML tells the judge it'll see the original prompt, and the scorer reads original_prompt from metadata, but I don't see anywhere that actually sets it (we set paired_prompt, rubric, and num_dim). So the grader's prompt section ends up empty. Upstream's model_graded_qa does pass the prompt through as {question}, so this might drift a bit from the reference. Is the plan to wire it up later in FortressScenario, or should the adversarial half just set metadata["original_prompt"] = str(row["adversarial_prompt"]) here? Possible I'm missing where it gets populated.
Good catch — you're right, this is a real gap. The metadata flow is intact (construct_response_from_request propagates the request's prompt_metadata onto the assistant piece), but the loader was never seeding original_prompt, so the judge was getting an empty # Original prompt block.
Fixed in f9723be: _build_seed_prompt now sets metadata['original_prompt'] = str(value) on the adversarial half. It's redundant with SeedPrompt.value at construction time, but the property the rubric scorer cares about is the as-authored question — independent of any converters that might transform the user-facing value before it reaches the target. Tests assert seed.metadata['original_prompt'] == seed.value for adversarial pieces and that the key is absent on benign pieces.
Thanks for the careful review!
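For readers following the thread, the metadata wiring discussed above might look roughly like this. It is a sketch, not the code in f9723be: `build_fortress_metadata` is a hypothetical stand-in for the metadata part of `_build_seed_prompt`, and the `benign_prompt` column name is an assumption (only `rubric` and `adversarial_prompt` appear in the review).

```python
from __future__ import annotations

from typing import Any, Mapping


def build_fortress_metadata(
    row: Mapping[str, Any], *, is_adversarial: bool
) -> dict[str, Any]:
    """Hypothetical sketch of the per-seed metadata for one upstream row."""
    adversarial = str(row["adversarial_prompt"])
    benign = str(row["benign_prompt"])  # assumed column name
    metadata: dict[str, Any] = {
        "use_restriction": "no_adversarial_training",
        # Each half carries its partner prompt.
        "paired_prompt": benign if is_adversarial else adversarial,
    }
    if is_adversarial:
        rubric_list = list(row.get("rubric") or [])
        metadata["rubric"] = "\n".join(str(c) for c in rubric_list)
        metadata["num_dim"] = len(rubric_list)
        # The fix under discussion: keep the as-authored question for the
        # judge, independent of converters applied to the prompt later.
        metadata["original_prompt"] = adversarial
    return metadata
```

The benign half deliberately omits `original_prompt`, which matches the test expectation described in the reply.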
adrian-gavrila
left a comment
LGTM. One question inline about the original_prompt wiring, but not blocking.
Addresses PR feedback from @adrian-gavrila.

The FortressRubricScorer YAML system prompt expects an `{{ original_prompt }}` field so the judge has the original adversarial question alongside the model's response. The scorer reads `original_prompt` from `MessagePiece.prompt_metadata` with an empty-string fallback, but the loader was not setting it, so the judge rendered an empty `# Original prompt` block, drifting from upstream Inspect Evals, which threads the question through `model_graded_qa`.

Fix: set `metadata['original_prompt'] = str(value)` on the adversarial half of `_build_seed_prompt`. This is redundant with `SeedPrompt.value` at construction time, but the property the rubric scorer cares about is the *as-authored* question, independent of any converters that might transform the user-facing value before it reaches the target. Storing it in metadata makes the scorer robust to that transformation.

Tests assert `seed.metadata['original_prompt'] == seed.value` for adversarial pieces and confirm the key is absent on benign pieces.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
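The scorer-side effect of the missing key can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `TEMPLATE` text is invented rather than copied from `rubric_system_prompt.yaml`, `render_judge_prompt` is a hypothetical name, and plain string replacement stands in for whatever templating the scorer actually uses.

```python
from __future__ import annotations

from typing import Any, Mapping

# Invented template text; only the three parameter names come from the PR.
TEMPLATE = (
    "# Original prompt\n{{ original_prompt }}\n\n"
    "# Criteria ({{ num_dim }})\n{{ criteria }}\n"
)


def render_judge_prompt(prompt_metadata: Mapping[str, Any]) -> str:
    """Fill the judge template from metadata (hypothetical helper)."""
    values = {
        "criteria": str(prompt_metadata["rubric"]),
        "num_dim": str(prompt_metadata["num_dim"]),
        # Empty-string fallback: without the loader fix this section
        # renders blank instead of raising.
        "original_prompt": str(prompt_metadata.get("original_prompt", "")),
    }
    rendered = TEMPLATE
    for name, value in values.items():
        rendered = rendered.replace("{{ " + name + " }}", value)
    return rendered
```

With the fallback, a seed that lacks `original_prompt` still produces a gradable prompt, which is why the gap did not surface as an error.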
…RESS

- Add `@override` to `dataset_name`/`fetch_dataset_async` on the three FORTRESS dataset classes and to `_build_identifier`/`_score_value_with_llm_async`/`_score_piece_async` on FortressRubricScorer; the ty pre-commit hook (with `[tool.ty.rules] all = 'error'`) flags missing-override-decorator on new files.
- Extend RECOMMENDED_TAGS with 'national_security' and 'calibration' as deliberate additions for the FORTRESS benchmark, and replace 'over_refusal' with the canonical 'refusal' tag (which already covers over-/false-refusal).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…plit enum

Addresses PR feedback. The three loaders all produced flat SeedDatasets from the same upstream HF artifact and only differed in which column they read; the 'paired' loader was a flat union of the other two, not a SeedGroup-based pairing. Single `_FortressDataset` with `split: FortressSplit = ADVERSARIAL` (BENIGN, ALL) matches the SGXSTest pattern and is simpler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
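A minimal sketch of the split-enum idea, assuming the three member names given in the commit message. The member values, the `prompts_for_split` helper, and the `benign_prompt` column name are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from enum import Enum
from typing import Any, Mapping


class FortressSplit(Enum):
    # Member names follow the commit message; the values are assumptions.
    ADVERSARIAL = "adversarial"
    BENIGN = "benign"
    ALL = "all"


def prompts_for_split(
    row: Mapping[str, Any],
    split: FortressSplit = FortressSplit.ADVERSARIAL,
) -> list[tuple[bool, str]]:
    """Return (is_adversarial, prompt) pairs for one upstream row.

    Hypothetical helper: one loader reads different columns depending on
    the split, instead of three near-identical loader classes.
    """
    selected: list[tuple[bool, str]] = []
    if split in (FortressSplit.ADVERSARIAL, FortressSplit.ALL):
        selected.append((True, str(row["adversarial_prompt"])))
    if split in (FortressSplit.BENIGN, FortressSplit.ALL):
        selected.append((False, str(row["benign_prompt"])))  # assumed column
    return selected
```

Under this shape, `ALL` is simply the union of the other two splits, which mirrors the observation that the old paired loader was a flat union.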
Duplicate of my other review on this PR; dismissing to avoid confusion.
What
Adds the FORTRESS (Frontier Risk Evaluation for National Security and Public Safety) benchmark from Scale AI (arXiv:2506.14922, HF: ScaleAI/fortress_public) to PyRIT as a first-class dataset loader plus a generic per-row binary-rubric scorer.
FORTRESS is paired (each adversarial prompt ships a benign rephrasing on the same topic) and rubric-based (each adversarial prompt ships its own 4-7 binary Y/N criteria). Both properties are novel in the PyRIT catalog and motivated landing the dataset as a first-class loader rather than folding it into an existing one.
Why
`SelfAskGeneralTrueFalseScorer` is single-criterion; `SelfAskScaleScorer` returns a single Likert float. Neither handles the per-row rubric pattern. `FortressRubricScorer` is intentionally generic so any future dataset that ships a per-row binary rubric in metadata can reuse it unchanged.
What's in
Loader - `pyrit/datasets/seed_datasets/remote/fortress_dataset.py`
- Three sibling classes: `_FortressAdversarialDataset` (500), `_FortressBenignDataset` (500), `_FortressPairedDataset` (1000).
- Two filter enums: `FortressRiskDomain` (3) and `FortressRiskSubdomain` (10), matching upstream stored values.
- Pins HF revision `0c096becbc75bb12065c8059a53960c7f0d4d35c` (same as the Inspect Evals reference port).
- Adversarial seeds carry the per-row rubric and `num_dim` in `SeedPrompt.metadata`; both halves carry the partner via `paired_prompt` so downstream consumers can compute the paired (ARS, ORS) metric.
- Each seed is tagged with `metadata["use_restriction"] = "no_adversarial_training"`.
Scorer - `pyrit/score/float_scale/fortress_rubric_scorer.py`
- `FortressRubricScorer(FloatScaleScorer)` reads `rubric` + `num_dim` (and optional `original_prompt`) from `MessagePiece.prompt_metadata`, asks a configurable judge for an N-character Y/N string, and returns the fraction of Y verdicts in [0, 1].
- Malformed grades yield `0.0` with `score_metadata["invalid"] = "true"` so aggregators can filter (matches upstream Inspect Evals behavior).
- YAML system prompt at `pyrit/datasets/score/fortress/rubric_system_prompt.yaml`, parameterized with `criteria` / `num_dim` / `original_prompt`.
Docs / registry
- `doc/references.bib`: new `@misc{knight2025fortress, ...}`
- `doc/bibliography.md`: hidden-citation block updated
- `doc/code/datasets/1_loading_datasets.{py,ipynb}`: FORTRESS added to dataset list and discovery output
- `pyrit/datasets/seed_datasets/remote/__init__.py` and `pyrit/score/__init__.py`: registrations
What's NOT in (intentional follow-up)
`FortressScenario` (wiring loader + scorer + over-refusal scorer into a single scenario emitting per-(sub)domain ARS/ORS), an aggregate scorecard primitive for paired-population reduction, and a notebook example are explicitly deferred to follow-up PRs.
Validation
`tests/unit/datasets/` + `tests/unit/score/` (1545 passed).
Use-restriction reminder
Dataset is CC-BY-4.0 with the Scale AI use restriction: "This dataset should be used for safety evaluations only, and it is prohibited to use these data for any adversarial training or research." Loader docstrings reproduce this verbatim and tag each seed; no programmatic enforcement.