Implement Secure ZIP Extraction for Remote Datasets #1879

@RuslanSemchenko

Is your feature request related to a problem? Please describe.

Several remote dataset loaders download ZIP archives and extract them with zipfile.ZipFile.extractall() without first validating the archive entries.

Examples include:

  • pyrit/datasets/seed_datasets/remote/figstep_dataset.py
  • pyrit/datasets/seed_datasets/remote/vlguard_dataset.py

Current extraction logic relies entirely on Python's built-in ZIP handling and does not explicitly validate extracted paths, file counts, or archive metadata.

Since PyRIT supports Python versions 3.10 through 3.14, adding application-level validation would provide consistent security guarantees across all supported runtimes.

Describe the solution you'd like

Introduce a shared helper for secure ZIP extraction that:

  • Validates archive member paths before extraction.

  • Prevents path traversal attempts.

  • Ensures extracted files remain inside the intended destination directory.

  • Optionally enforces limits on:

    • file count,
    • individual file size,
    • total extracted size.

This helper could then be reused by all dataset downloaders that process remote ZIP archives.
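
A minimal sketch of what such a helper might look like, using only the standard library. The names `safe_extract_zip` and `UnsafeZipError`, the keyword arguments, and the default limits are all hypothetical and not part of PyRIT today. The size checks rely on the sizes declared in the archive metadata.

```python
import zipfile
from pathlib import Path


class UnsafeZipError(Exception):
    """Raised when an archive fails validation (hypothetical name)."""


def safe_extract_zip(
    zip_path,
    dest_dir,
    *,
    max_files=10_000,                        # assumed default, tune per dataset
    max_file_size=500 * 1024 * 1024,         # assumed default: 500 MiB per file
    max_total_size=2 * 1024 * 1024 * 1024,   # assumed default: 2 GiB in total
):
    """Validate every archive member, then extract into dest_dir."""
    dest = Path(dest_dir).resolve()

    with zipfile.ZipFile(zip_path) as zf:
        members = zf.infolist()
        if len(members) > max_files:
            raise UnsafeZipError(f"too many entries: {len(members)} > {max_files}")

        total_size = 0
        for info in members:
            # Normalize Windows-style separators before inspecting the path.
            name = info.filename.replace("\\", "/")

            # Reject absolute paths and drive-letter prefixes such as "C:/".
            if name.startswith("/") or (len(name) > 1 and name[1] == ":"):
                raise UnsafeZipError(f"absolute path in archive: {info.filename!r}")

            # Reject any parent-directory component.
            if ".." in name.split("/"):
                raise UnsafeZipError(f"path traversal in archive: {info.filename!r}")

            # Confirm the resolved target stays inside the destination.
            target = (dest / name).resolve()
            if not target.is_relative_to(dest):
                raise UnsafeZipError(f"entry escapes destination: {info.filename!r}")

            # Enforce per-file and cumulative limits on declared sizes.
            if info.file_size > max_file_size:
                raise UnsafeZipError(f"entry too large: {info.filename!r}")
            total_size += info.file_size
            if total_size > max_total_size:
                raise UnsafeZipError("total uncompressed size exceeds limit")

        # Nothing is written to disk until every entry has passed validation.
        dest.mkdir(parents=True, exist_ok=True)
        zf.extractall(dest)

    return dest
```

A loader could then call `safe_extract_zip(archive_path, cache_dir)` in place of a direct `extractall()` call. Rejecting suspicious entries outright, rather than silently rewriting their names, keeps the behavior identical on every supported Python version and makes a malformed archive visible to the caller.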

Describe alternatives you've considered, if relevant

The alternative is to keep relying on the protections built into Python's zipfile module. Its extraction routines strip absolute-path prefixes and ".." components from member names, but they do not limit file count or decompressed size. Explicit validation would provide defense-in-depth and reduce reliance on interpreter-specific behavior.

Additional context

PyRIT currently supports:

requires-python = ">=3.10, <3.15"

The same extraction pattern appears in multiple dataset loaders, suggesting that a centralized safe extraction utility would improve maintainability and security across the project.
