FEAT: Add FORTRESS dataset loader and per-row rubric scorer by romanlutz · Pull Request #1805 · microsoft/PyRIT

romanlutz · 2026-05-25T20:58:43Z

What

Adds the FORTRESS (Frontier Risk Evaluation for National Security and Public Safety) benchmark from Scale AI (arXiv:2506.14922, HF: ScaleAI/fortress_public) to PyRIT as a first-class dataset loader plus a generic per-row binary-rubric scorer.

FORTRESS is paired (each adversarial prompt ships a benign rephrasing on the same topic) and rubric-based (each adversarial prompt ships its own 4-7 binary Y/N criteria). Both properties are novel in the PyRIT catalog and motivated landing the dataset as a first-class loader rather than folding it into an existing one.

Why

SelfAskGeneralTrueFalseScorer is single-criterion; SelfAskScaleScorer returns a single Likert float. Neither handles the per-row rubric pattern. FortressRubricScorer is intentionally generic so any future dataset that ships a per-row binary rubric in metadata can reuse it unchanged.

What's in

Loader - pyrit/datasets/seed_datasets/remote/fortress_dataset.py

Three sibling classes: _FortressAdversarialDataset (500), _FortressBenignDataset (500), _FortressPairedDataset (1000).
Two filter enums: FortressRiskDomain (3) and FortressRiskSubdomain (10), matching upstream stored values.
HF revision pinned to 0c096becbc75bb12065c8059a53960c7f0d4d35c (same as the Inspect Evals reference port).
Adversarial seeds carry the per-row rubric + num_dim in SeedPrompt.metadata; both halves carry the partner via paired_prompt so downstream consumers can compute the paired (ARS, ORS) metric.
Reproduces the Scale AI use-restriction notice verbatim in docstrings; tags each seed metadata["use_restriction"] = "no_adversarial_training".
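
The metadata layout described in the loader bullets above can be sketched as follows. This is a minimal illustration, not the code in fortress_dataset.py: the helper name `build_metadata` and the `benign_prompt` column name are assumptions, while `paired_prompt`, `rubric`, `num_dim`, and the `use_restriction` tag are taken from the description.

```python
from __future__ import annotations

from typing import Any

USE_RESTRICTION = "no_adversarial_training"


def build_metadata(row: dict[str, Any], *, is_adversarial: bool) -> dict[str, Any]:
    """Build the metadata dict for one half of a FORTRESS row (illustrative only)."""
    # "benign_prompt" is an assumed upstream column name.
    partner_column = "benign_prompt" if is_adversarial else "adversarial_prompt"
    metadata: dict[str, Any] = {
        # Each half points at its partner so paired metrics can be computed later.
        "paired_prompt": str(row[partner_column]),
        "use_restriction": USE_RESTRICTION,
    }
    if is_adversarial:
        # Only the adversarial half carries the per-row rubric.
        rubric_list = list(row.get("rubric") or [])
        metadata["rubric"] = "\n".join(str(c) for c in rubric_list)
        metadata["num_dim"] = len(rubric_list)
    return metadata
```

Under these assumptions the benign half carries only `paired_prompt` and `use_restriction`, so a rubric scorer has nothing to grade against on that half.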

Scorer - pyrit/score/float_scale/fortress_rubric_scorer.py

New FortressRubricScorer(FloatScaleScorer) reads rubric + num_dim (and optional original_prompt) from MessagePiece.prompt_metadata, asks a configurable judge for an N-character Y/N string, and returns the fraction of Y verdicts in [0, 1].
Malformed grades (wrong length, non-Y/N chars) yield 0.0 with score_metadata["invalid"] = "true" so aggregators can filter (matches upstream Inspect Evals behavior).
YAML system prompt at pyrit/datasets/score/fortress/rubric_system_prompt.yaml, parameterized with criteria / num_dim / original_prompt.
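
The grading rule in the scorer bullets above can be sketched in plain Python. This is a hedged illustration rather than the code in fortress_rubric_scorer.py: the function names and the returned tuple shape are hypothetical, and stripping surrounding whitespace before validation is an assumption. Only the N-character Y/N format, the fraction-of-Y score, and the 0.0 fallback with an `invalid` flag come from the description.

```python
from __future__ import annotations


def grade_to_score(grade: str, num_dim: int) -> tuple[float, dict[str, str]]:
    """Convert a judge's Y/N string into a score in [0, 1] plus score metadata."""
    verdicts = grade.strip()  # assumption: surrounding whitespace is tolerated
    # Malformed grade: wrong length, or any character other than Y/N.
    if num_dim <= 0 or len(verdicts) != num_dim or set(verdicts) - {"Y", "N"}:
        return 0.0, {"invalid": "true"}
    # Valid grade: the score is the fraction of criteria judged "Y".
    return verdicts.count("Y") / num_dim, {}


def mean_valid_score(results: list[tuple[float, dict[str, str]]]) -> float | None:
    """Average only the scores that are not flagged invalid."""
    valid = [score for score, meta in results if meta.get("invalid") != "true"]
    return sum(valid) / len(valid) if valid else None
```

Because a malformed grade also scores 0.0, an aggregator that ignores the `invalid` flag would count judge failures as fully safe responses; filtering on the flag, as `mean_valid_score` does, avoids that.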

Docs / registry

doc/references.bib: new @misc{knight2025fortress, ...}
doc/bibliography.md: hidden-citation block updated
doc/code/datasets/1_loading_datasets.{py,ipynb}: FORTRESS added to dataset list and discovery output
pyrit/datasets/seed_datasets/remote/__init__.py and pyrit/score/__init__.py: registrations

What's NOT in (intentional follow-up)

FortressScenario (wiring loader + scorer + over-refusal scorer into a single scenario emitting per-(sub)domain ARS/ORS), an aggregate scorecard primitive for paired-population reduction, and a notebook example are explicitly deferred to follow-up PRs.

Validation

39 new unit tests (23 loader + 16 scorer), all green.
Full tests/unit/datasets/ + tests/unit/score/ suites pass (1545 passed).
Pre-commit clean (ruff format, ruff check, ty, nbqa-ruff, validate-docs).
Live HF fetch confirmed: 500 / 500 / 1000 prompts; domain counts match upstream README exactly (CBRNE 180, PVT 132, CFIA 188); per-subdomain counts reflect the README-documented mismatched-subdomain rows (HF discussion #4) and are preserved verbatim per documented design.

Use-restriction reminder

Dataset is CC-BY-4.0 with the Scale AI use restriction: "This dataset should be used for safety evaluations only, and it is prohibited to use these data for any adversarial training or research." Loader docstrings reproduce this verbatim and tag each seed; no programmatic enforcement.

Adds the FORTRESS (Frontier Risk Evaluation for National Security and Public Safety) benchmark from Scale AI (arXiv:2506.14922) to PyRIT.

**Loader** (pyrit/datasets/seed_datasets/remote/fortress_dataset.py):
- Three sibling classes: `_FortressAdversarialDataset` (500), `_FortressBenignDataset` (500), and `_FortressPairedDataset` (1000).
- Two filter enums: `FortressRiskDomain` (3 values) and `FortressRiskSubdomain` (10 values), matching upstream values.
- Pins HF revision `0c096becbc75bb12065c8059a53960c7f0d4d35c` (same as the Inspect Evals port).
- Adversarial seeds carry the per-row rubric and `num_dim` in `SeedPrompt.metadata` so the scorer can grade against the prompt's own rubric. Both halves carry the partner prompt via `paired_prompt` for downstream (ARS, ORS) trade-off analysis.
- Reproduces the Scale AI use-restriction notice verbatim in the docstrings; tags each seed with `use_restriction='no_adversarial_training'`.

**Scorer** (pyrit/score/float_scale/fortress_rubric_scorer.py):
- New `FortressRubricScorer(FloatScaleScorer)` that reads `rubric` + `num_dim` (and optional `original_prompt`) from `MessagePiece.prompt_metadata` and asks a configurable judge to emit an N-character Y/N string. The score is the fraction of Y verdicts, in `[0, 1]`.
- Malformed grades (wrong length, non-Y/N characters) yield `0.0` with `score_metadata['invalid'] = 'true'` so aggregators can filter, matching upstream Inspect Evals behavior.
- YAML system prompt at `pyrit/datasets/score/fortress/rubric_system_prompt.yaml` is parameterized with `criteria` / `num_dim` / `original_prompt`; the rubric pattern is intentionally generic and works for any dataset that supplies a per-row binary rubric.

**Validation:** 39 new unit tests (23 loader + 16 scorer), all passing. Pre-commit clean (ruff format, ruff check, ty, nbqa-ruff, validate-docs). Live HF fetch confirmed: 500/500/1000 prompts; domain counts match upstream README exactly (CBRNE 180, PVT 132, CFIA 188); per-subdomain counts reflect upstream's known mismatched-subdomain rows (HF discussion #4) and are preserved verbatim per documented design.

References:
- `doc/references.bib`: new `@misc{knight2025fortress, ...}`
- `doc/bibliography.md`: hidden-citation block updated
- `doc/code/datasets/1_loading_datasets.{py,ipynb}`: FORTRESS added to dataset list and discovery output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

adrian-gavrila

One quick question here, otherwise looks great.

adrian-gavrila · 2026-06-03T21:20:45Z

+        if is_adversarial:
+            rubric_list = list(row.get("rubric") or [])
+            metadata["rubric"] = "\n".join(str(c) for c in rubric_list)
+            metadata["num_dim"] = len(rubric_list)


Small thing, but I think there might be a gap here. The YAML tells the judge it'll see the original prompt, and the scorer reads original_prompt from metadata, but I don't see anywhere that actually sets it (we set paired_prompt, rubric, and num_dim). So the grader's prompt section ends up empty. Upstream's model_graded_qa does pass the prompt through as {question}, so this might drift a bit from the reference. Is the plan to wire it up later in FortressScenario, or should the adversarial half just set metadata["original_prompt"] = str(row["adversarial_prompt"]) here? Possible I'm missing where it gets populated.

romanlutz · Jun 4, 2026

Good catch, you're right: this is a real gap. The metadata flow is intact (construct_response_from_request propagates the request's prompt_metadata onto the assistant piece), but the loader was never seeding original_prompt, so the judge was getting an empty # Original prompt block.

Fixed in f9723be: _build_seed_prompt now sets metadata['original_prompt'] = str(value) on the adversarial half. It's redundant with SeedPrompt.value at construction time, but the property the rubric scorer cares about is the as-authored question — independent of any converters that might transform the user-facing value before it reaches the target. Tests assert seed.metadata['original_prompt'] == seed.value for adversarial pieces and that the key is absent on benign pieces.

Thanks for the careful review!
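
The reasoning in this reply, that the as-authored question should survive any converter applied to the user-facing value, can be illustrated with a small sketch. Everything here is hypothetical: `Seed` is a stand-in dataclass rather than PyRIT's `SeedPrompt`, and `base64_convert` is a toy transformation rather than a PyRIT converter. Only the `original_prompt` metadata key comes from the thread.

```python
from __future__ import annotations

import base64
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class Seed:
    """Hypothetical stand-in for a seed prompt; not PyRIT's SeedPrompt."""

    value: str
    metadata: dict[str, Any] = field(default_factory=dict)


def build_adversarial_seed(value: str) -> Seed:
    # Store the as-authored question alongside the user-facing value.
    return Seed(value=value, metadata={"original_prompt": str(value)})


def base64_convert(seed: Seed) -> Seed:
    """Toy converter: rewrites the user-facing value, leaves metadata untouched."""
    encoded = base64.b64encode(seed.value.encode("utf-8")).decode("ascii")
    return replace(seed, value=encoded)
```

After `base64_convert`, `value` no longer matches what the dataset author wrote, but `metadata["original_prompt"]` still does, which is the property a rubric judge would need under these assumptions.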

adrian-gavrila

LGTM. One question inline about the original_prompt wiring, but not blocking.

@adrian-gavrila

Addresses PR feedback from @adrian-gavrila.

The FortressRubricScorer YAML system prompt expects an `{{ original_prompt }}` field so the judge has the original adversarial question alongside the model's response. The scorer reads `original_prompt` from `MessagePiece.prompt_metadata` with an empty-string fallback, but the loader was not setting it, meaning the judge rendered an empty `# Original prompt` block, drifting from upstream Inspect Evals which threads the question through `model_graded_qa`.

Fix: set `metadata['original_prompt'] = str(value)` on the adversarial half of `_build_seed_prompt`. This is redundant with `SeedPrompt.value` at construction time, but the property the rubric scorer cares about is the *as-authored* question, independent of any converters that might transform the user-facing value before it reaches the target. Storing it in metadata makes the scorer robust to that transformation.

Tests assert `seed.metadata['original_prompt'] == seed.value` for adversarial pieces and confirm the key is absent on benign pieces.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@override

…RESS

- Add `@override` to `dataset_name`/`fetch_dataset_async` on the three FORTRESS dataset classes and to `_build_identifier`/`_score_value_with_llm_async`/`_score_piece_async` on `FortressRubricScorer`; the ty pre-commit hook (with `[tool.ty.rules] all = 'error'`) flags missing-override-decorator on new files.
- Extend `RECOMMENDED_TAGS` with 'national_security' and 'calibration' as deliberate additions for the FORTRESS benchmark, and replace 'over_refusal' with the canonical 'refusal' tag (which already covers over-/false-refusal).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
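
For readers unfamiliar with the decorator this commit adds, here is a minimal sketch of the pattern. The two classes are hypothetical stand-ins, not PyRIT's real base classes; `override` itself is the standard decorator from `typing` (Python 3.12+) or the `typing_extensions` backport.

```python
try:
    from typing import override  # Python 3.12+
except ImportError:
    from typing_extensions import override  # backport for older interpreters


class SeedLoader:
    """Hypothetical base class standing in for a dataset loader base."""

    def dataset_name(self) -> str:
        raise NotImplementedError


class FortressLoader(SeedLoader):
    """Hypothetical subclass; the decorator marks the method as an override."""

    @override
    def dataset_name(self) -> str:
        return "fortress"
```

The decorator does not change runtime behavior; it lets a type checker report an error if the base class stops defining `dataset_name`, and a strict rule set can additionally require it on every overriding method.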

…plit enum

Addresses PR feedback. The three loaders all produced flat SeedDatasets from the same upstream HF artifact and differed only in which column they read; the 'paired' loader was a flat union of the other two, not a SeedGroup-based pairing. A single _FortressDataset with a `split: FortressSplit` parameter (default ADVERSARIAL; also BENIGN and ALL) matches the SGXSTest pattern and is simpler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
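
The single-loader design described in this commit can be sketched as an enum that selects which column(s) to read. This is an illustration under stated assumptions: the enum member names come from the commit message, but the enum string values, the `benign_prompt` column name, and the `select_prompts` helper are hypothetical.

```python
from __future__ import annotations

from enum import Enum
from typing import Any


class FortressSplit(Enum):
    ADVERSARIAL = "adversarial"  # string values are assumed, not from the PR
    BENIGN = "benign"
    ALL = "all"


# Which row columns each split reads; "benign_prompt" is an assumed column name.
_COLUMNS: dict[FortressSplit, tuple[str, ...]] = {
    FortressSplit.ADVERSARIAL: ("adversarial_prompt",),
    FortressSplit.BENIGN: ("benign_prompt",),
    FortressSplit.ALL: ("adversarial_prompt", "benign_prompt"),
}


def select_prompts(
    rows: list[dict[str, Any]],
    split: FortressSplit = FortressSplit.ADVERSARIAL,
) -> list[str]:
    """Return a flat list of prompts for the requested split."""
    return [str(row[column]) for row in rows for column in _COLUMNS[split]]
```

With this shape, ALL is simply the flat union of the other two splits, which matches the commit's observation that the former 'paired' loader never produced grouped pairs.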

Duplicate of my other review on this PR; dismissing to avoid confusion.

Roman Lutz and others added 5 commits May 25, 2026 13:56

532dec5 · Merge remote-tracking branch 'origin/main' into romanlutz/fortress-safety-utility
Conflicts: doc/bibliography.md, pyrit/datasets/seed_datasets/remote/__init__.py

bfaedc5 · Merge remote-tracking branch 'origin/main' into romanlutz/fortress-safety-utility
Conflicts: doc/bibliography.md

4436f96 · Merge remote-tracking branch 'origin/main' into romanlutz/fortress-safety-utility
Conflicts: doc/bibliography.md, doc/references.bib, pyrit/datasets/seed_datasets/remote/__init__.py

3573e11 · Merge remote-tracking branch 'origin/main' into romanlutz/fortress-safety-utility
Conflicts: doc/bibliography.md

adrian-gavrila approved these changes Jun 3, 2026

adrian-gavrila previously approved these changes Jun 3, 2026

Roman Lutz and others added 3 commits June 3, 2026 20:57

romanlutz wants to merge 8 commits into microsoft:main from romanlutz:romanlutz/fortress-safety-utility