Fall back to .tar.gz sdists when no wheel is available #285

Merged: simonw merged 1 commit into main from claude/pypi-tools-targz-fallback-63x2oj on Jun 9, 2026

simonw (Owner) commented Jun 9, 2026

Both PyPI browsing tools (zip-wheel-explorer and pypi-changelog) previously
only worked with .whl files. Packages that ship only a source distribution
(e.g. apple-fm-sdk) failed with "no wheel found".

Add a fallback to the .tar.gz source distribution:

  • Detect gzip via magic bytes and decompress with the native
    DecompressionStream, then parse the tar with a small parser that handles
    ustar prefixes, GNU long names and PAX extended headers.
  • zip-wheel-explorer: prefer a wheel, else use the sdist; browse its files.
  • pypi-changelog: include sdist-only versions and strip the leading
    "<name>-<version>/" directory so paths align across versions when diffing.

https://claude.ai/code/session_018HTvSZHvL1T3EEZ7P8vbVy
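
The two selection rules in the description (prefer a wheel, otherwise use the sdist, then strip the sdist's top-level directory) can be sketched as follows. This is an illustration rather than the code from the PR: `pickDistribution` and `stripSdistRoot` are hypothetical names, and the input is assumed to have the shape of the `urls` array in PyPI's JSON API, where each entry carries `packagetype`, `filename`, and `url`.

```javascript
// Hypothetical helper: choose which file to download for one release.
// `files` is assumed to look like the `urls` array from PyPI's JSON API.
function pickDistribution(files) {
  const wheel = files.find((f) => f.packagetype === "bdist_wheel");
  if (wheel) return { kind: "wheel", file: wheel };
  const sdist = files.find(
    (f) => f.packagetype === "sdist" && f.filename.endsWith(".tar.gz")
  );
  if (sdist) return { kind: "sdist", file: sdist };
  return null; // neither a wheel nor a .tar.gz sdist is available
}

// Hypothetical helper: drop the shared top-level directory of an sdist
// so that paths from different versions line up when diffing.
function stripSdistRoot(paths) {
  if (paths.length === 0) return paths;
  const firstSlash = paths[0].indexOf("/");
  if (firstSlash === -1) return paths;
  const root = paths[0].slice(0, firstSlash + 1);
  // Only strip when every path really lives under the same root.
  if (!paths.every((p) => p.startsWith(root))) return paths;
  return paths.map((p) => p.slice(root.length)).filter((p) => p !== "");
}
```

Stripping only when every path shares the same root keeps the helper harmless for archives that do not follow the usual sdist layout.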


claude (Bot, Contributor) commented Jun 9, 2026

Claude finished @simonw's task.


Code Review Complete

I've conducted a thorough review of this PR that adds .tar.gz fallback support for PyPI tools. Here's my detailed analysis:

Todo List

  • Read and analyze pypi-changelog.html changes
  • Read and analyze zip-wheel-explorer.html changes
  • Review tar parsing implementation for correctness
  • Analyze gzip decompression approach
  • Check for security vulnerabilities
  • Assess performance considerations
  • Look for potential bugs and edge cases
  • Provide comprehensive review feedback

Positive Aspects

1. Excellent Architecture & Design

  • Smart fallback strategy: Prioritizes wheels (binary distributions) but gracefully falls back to source distributions when needed
  • Clean separation of concerns: Tar parsing, gzip decompression, and file handling are well-separated
  • Consistent user experience: Both tools now handle the same distribution types seamlessly

2. Robust Tar Parser Implementation

  • Comprehensive format support: Handles ustar prefixes, GNU long names (L entries), and PAX extended headers (x/g entries)
  • Proper standard compliance: Follows tar format specifications correctly
  • Smart path normalization: The path stripping in pypi-changelog.html:414 ensures consistent diffing across versions
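
For readers unfamiliar with the three naming mechanisms mentioned above, here is a minimal tar parser sketch. It is not the `parseTar()` from the PR: the field offsets follow the ustar header layout, but the structure and helper names are assumptions, global PAX headers are skipped, header checksums are not verified, and base-256 size fields are not handled.

```javascript
const BLOCK = 512; // tar archives are a sequence of 512-byte blocks
const decoder = new TextDecoder();

// Read a NUL-terminated string field from a block.
function readString(block, offset, length) {
  const field = block.subarray(offset, offset + length);
  const end = field.indexOf(0);
  return decoder.decode(end === -1 ? field : field.subarray(0, end));
}

// Numeric header fields are octal ASCII padded with spaces or NULs.
function readOctal(block, offset, length) {
  const text = readString(block, offset, length).trim();
  return text ? parseInt(text, 8) : 0;
}

// PAX records look like "<len> <key>=<value>\n"; <len> counts the whole record.
function parsePax(data) {
  const out = {};
  let pos = 0;
  while (pos < data.length) {
    const space = data.indexOf(0x20, pos);
    if (space === -1) break;
    const len = parseInt(decoder.decode(data.subarray(pos, space)), 10);
    if (!(len > 0)) break;
    const record = decoder.decode(data.subarray(space + 1, pos + len - 1));
    const eq = record.indexOf("=");
    if (eq !== -1) out[record.slice(0, eq)] = record.slice(eq + 1);
    pos += len;
  }
  return out;
}

function parseTar(bytes) {
  const entries = [];
  let offset = 0;
  let longName = null; // set by a GNU "L" entry, applies to the next entry
  let pax = {};        // set by a PAX "x" entry, applies to the next entry
  while (offset + BLOCK <= bytes.length) {
    const header = bytes.subarray(offset, offset + BLOCK);
    if (header.every((b) => b === 0)) break; // end-of-archive marker
    const type = String.fromCharCode(header[156] || 0x30); // NUL means "0"
    const isMeta = type === "L" || type === "x" || type === "g";
    let size = readOctal(header, 124, 12);
    if (!isMeta && pax.size !== undefined) size = parseInt(pax.size, 10);
    const data = bytes.subarray(offset + BLOCK, offset + BLOCK + size);
    offset += BLOCK + Math.ceil(size / BLOCK) * BLOCK;

    if (type === "L") { longName = readString(data, 0, data.length); continue; }
    if (type === "x") { pax = parsePax(data); continue; }
    if (type === "g") continue; // global PAX headers are ignored in this sketch

    let name = readString(header, 0, 100);
    const prefix = readString(header, 345, 155);
    // Only POSIX ustar headers ("ustar" followed by NUL) define a prefix field.
    if (prefix && readString(header, 257, 6) === "ustar") name = `${prefix}/${name}`;
    if (longName !== null) name = longName;
    if (pax.path !== undefined) name = pax.path;
    entries.push({ name, type, size, data });
    longName = null;
    pax = {};
  }
  return entries;
}
```

The precedence used here (PAX `path`, then GNU long name, then prefix plus name) is one reasonable choice; the entries returned share memory with the input buffer because `subarray` does not copy.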

3. Modern Browser API Usage

  • Native gzip decompression: Uses DecompressionStream for efficient, secure decompression
  • Progressive enhancement: Graceful fallback with clear error messages for unsupported browsers
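
A minimal sketch of the detection and decompression step, assuming an environment that provides `DecompressionStream`, `Blob`, and `Response`. The function names `isGzip`, `gunzip`, and `maybeGunzip` are illustrative and not taken from the PR.

```javascript
// The first two bytes of every gzip stream are 0x1f 0x8b.
function isGzip(bytes) {
  return bytes.length >= 2 && bytes[0] === 0x1f && bytes[1] === 0x8b;
}

// Decompress a gzip buffer with the platform's native DecompressionStream.
async function gunzip(bytes) {
  if (typeof DecompressionStream === "undefined") {
    // Clear message for environments without the API.
    throw new Error("This browser does not support DecompressionStream");
  }
  const stream = new Blob([bytes])
    .stream()
    .pipeThrough(new DecompressionStream("gzip"));
  return new Uint8Array(await new Response(stream).arrayBuffer());
}

// Pass non-gzip data through untouched so callers can handle both cases.
async function maybeGunzip(bytes) {
  return isGzip(bytes) ? gunzip(bytes) : bytes;
}
```

Wrapping the stream in a `Response` is a compact way to collect all output chunks into one buffer, at the cost of holding the whole decompressed archive in memory.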

⚠️ Security & Safety Considerations

1. Path Traversal Protection

Current state: ✅ Already secure

  • No explicit path validation, but the tools only process files in-memory for display
  • No file system writes occur, so path traversal isn't exploitable in this context
  • Both tools display file paths as-is, which is appropriate for their use case

2. Resource Exhaustion Protection

Areas for improvement:

  • Memory bombs: Large tar files could consume excessive memory (see performance section)
  • Zip bombs: While JSZip has some protections, additional safeguards could help
  • Decompression bombs: Gzip streams with extreme compression ratios could exhaust memory

Recommendation: Consider adding file size limits during extraction
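
One way to act on this recommendation is to cap the number of bytes read from the decompressed stream. The sketch below is an assumption about how that could look; `readWithLimit` and the choice of limit are hypothetical and not part of the PR.

```javascript
// Read a ReadableStream of Uint8Array chunks, aborting once more than
// `maxBytes` have been produced. Guards against decompression bombs.
async function readWithLimit(stream, maxBytes) {
  const reader = stream.getReader();
  const chunks = [];
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    total += value.byteLength;
    if (total > maxBytes) {
      await reader.cancel(); // stop the upstream decompressor
      throw new Error(`Decompressed data exceeds ${maxBytes} bytes`);
    }
    chunks.push(value);
  }
  // Join the collected chunks into one contiguous buffer.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

The stream passed in could be the output of `pipeThrough(new DecompressionStream("gzip"))`, so the limit applies to decompressed size rather than download size.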

3. Content Safety

Current state: ✅ Good protection

  • Proper HTML escaping with escapeHtml() function (pypi-changelog.html:207, zip-wheel-explorer.html:388)
  • Binary file detection prevents rendering non-text content
  • No eval() or dynamic code execution
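
The review does not say how the tools detect binary files. A common heuristic, shown here purely as an assumption, is to look for NUL bytes in the first few kilobytes; `looksBinary` is a hypothetical name.

```javascript
// Heuristic: treat content as binary if a NUL byte appears in the
// first `sampleSize` bytes. UTF-8 text never contains NUL, although
// UTF-16 text does and would be misclassified by this check.
function looksBinary(bytes, sampleSize = 8192) {
  return bytes.subarray(0, sampleSize).includes(0);
}
```

Sampling only the start of the file keeps the check cheap for large entries, at the cost of missing NUL bytes that appear later.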

🚀 Performance Analysis

1. Strengths

  • Parallel processing: Both tools fetch multiple versions concurrently
  • Efficient parsing: Single-pass tar parser with minimal copying
  • Smart content handling: Binary files are detected and skipped for processing

2. Potential Issues

  • Large file handling: Files exceeding 512KB are marked as large but still fully loaded into memory
  • No streaming: Entire archives are loaded before processing begins
  • Memory usage: All file contents held in memory simultaneously

Suggestion: Consider implementing streaming extraction for very large packages

🐛 Potential Issues Found

1. Error Handling Gaps

In zip-wheel-explorer.html:410-413:

for (const [name, file] of Object.entries(zip.files)) {
    const content = await file.async('text');
    const rawBytes = await file.async('uint8array');
    // ...
}

Issue: The loop shown has no per-entry error handling, so if file.async() rejects for one entry (for example a corrupt or binary file), processing of the remaining entries may stop as well.

Recommendation: Wrap individual file extraction in try-catch blocks
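
A sketch of what per-entry error handling could look like. It assumes each value in `files` exposes `dir` and an `async(type)` method in the way JSZip file objects do; `readEntries` is a hypothetical name, and decoding the bytes once replaces the two separate `async` calls in the excerpt.

```javascript
// Read every entry, recording failures per file instead of aborting.
async function readEntries(files) {
  const results = [];
  for (const [name, file] of Object.entries(files)) {
    if (file.dir) continue; // directories have no content to read
    try {
      const rawBytes = await file.async("uint8array");
      // Decode from the bytes already in hand rather than reading twice.
      const content = new TextDecoder().decode(rawBytes);
      results.push({ name, rawBytes, content, error: null });
    } catch (err) {
      results.push({ name, rawBytes: null, content: null, error: String(err) });
    }
  }
  return results;
}
```

Callers can then render the readable entries and show an inline error for the ones that failed.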

2. Version Comparison Robustness

pypi-changelog.html:276-297 has a custom version comparison that might not handle all PEP 440 edge cases perfectly.

Minor concern: Complex pre-release, post-release, and dev release versions might not sort correctly in all cases.
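
To make the concern concrete, here is a simplified comparison that follows the PEP 440 ordering of dev, pre, final, and post releases. It is a sketch rather than the code at pypi-changelog.html:276-297: it only accepts the normalized form `N(.N)*[{a|b|rc}N][.postN][.devN]`, ignores epochs and local version labels, and falls back to plain string comparison for anything else.

```javascript
const PRE_ORDER = { a: 0, b: 1, rc: 2 };
const VERSION_RE =
  /^(\d+(?:\.\d+)*)(?:(a|b|rc)(\d+))?(?:\.post(\d+))?(?:\.dev(\d+))?$/;

// Build a sort key: release numbers, then [preKind, preNum, post, dev].
function versionKey(version) {
  const m = VERSION_RE.exec(version.trim().toLowerCase());
  if (!m) return null;
  const release = m[1].split(".").map(Number);
  const hasPre = m[2] !== undefined;
  const hasPost = m[4] !== undefined;
  const hasDev = m[5] !== undefined;
  let pre;
  if (hasPre) pre = [PRE_ORDER[m[2]], Number(m[3])];
  else if (!hasPost && hasDev) pre = [-Infinity, 0]; // 1.0.dev1 sorts before 1.0a1
  else pre = [Infinity, 0]; // final and post releases sort after pre-releases
  const post = hasPost ? Number(m[4]) : -Infinity; // no post sorts first
  const dev = hasDev ? Number(m[5]) : Infinity;    // no dev sorts last
  return { release, rest: [pre[0], pre[1], post, dev] };
}

function compareVersions(a, b) {
  const ka = versionKey(a);
  const kb = versionKey(b);
  if (!ka || !kb) return a < b ? -1 : a > b ? 1 : 0; // unparsed: string order
  const n = Math.max(ka.release.length, kb.release.length);
  for (let i = 0; i < n; i++) {
    // Missing release segments count as zero, so 1.0 equals 1.0.0.
    const d = (ka.release[i] ?? 0) - (kb.release[i] ?? 0);
    if (d !== 0) return Math.sign(d);
  }
  for (let i = 0; i < ka.rest.length; i++) {
    if (ka.rest[i] !== kb.rest[i]) return ka.rest[i] < kb.rest[i] ? -1 : 1;
  }
  return 0;
}
```

The function can be passed directly to `Array.prototype.sort`, for example `versions.sort(compareVersions)`.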

3. Partial Gzip Header Validation

Both tools detect gzip via magic bytes (0x1f 0x8b), but don't validate the full gzip header structure.

Low priority: Current approach is sufficient for the use case, but full header validation would be more robust.
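
If fuller validation were wanted, the fixed part of the gzip header could be checked as sketched below. The checks follow the gzip member header layout (magic bytes, compression method 8 for deflate, reserved flag bits 5 to 7 set to zero); `validateGzipHeader` is a hypothetical name and the optional header fields are not parsed.

```javascript
// Validate the 10-byte fixed gzip header without decompressing anything.
function validateGzipHeader(bytes) {
  if (bytes.length < 10) {
    return { ok: false, reason: "shorter than the 10-byte fixed header" };
  }
  if (bytes[0] !== 0x1f || bytes[1] !== 0x8b) {
    return { ok: false, reason: "bad magic bytes" };
  }
  if (bytes[2] !== 8) {
    return { ok: false, reason: "compression method is not deflate" };
  }
  if (bytes[3] & 0xe0) {
    return { ok: false, reason: "reserved flag bits are set" };
  }
  return { ok: true, reason: null };
}
```

Returning a reason string rather than a bare boolean makes it easier to surface a useful error message to the user.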

📝 Code Quality Assessment

1. Excellent Practices

  • Consistent code style across both files
  • Clear function naming and logical organization
  • Proper async/await usage throughout
  • Good separation of UI and data logic

2. Minor Suggestions

  • Constants: Magic numbers like 512 (tar block size) could be named constants
  • Documentation: Complex functions like parseTar() could benefit from JSDoc comments
  • Input validation: URL inputs could be validated more strictly

🧪 Test Coverage Concerns

Current state: No automated tests visible in these HTML files

Recommendations:

  • Unit tests for the tar parser with various tar file formats
  • Tests for edge cases like corrupted archives, empty files, and unusual paths
  • Integration tests with real PyPI packages that only ship source distributions
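
Unit tests for a tar parser need small archives to parse. The sketch below builds a minimal ustar archive in memory, including valid header checksums, so tests do not depend on fixture files. `makeHeader` and `makeTar` are hypothetical helpers; only regular files with names up to 100 bytes are supported.

```javascript
const encoder = new TextEncoder();

function writeField(block, text, offset) {
  block.set(encoder.encode(text), offset);
}

// Build one 512-byte ustar header for a regular file.
function makeHeader(name, size) {
  const h = new Uint8Array(512);
  writeField(h, name, 0);                                  // name
  writeField(h, "0000644", 100);                           // mode
  writeField(h, "0000000", 108);                           // uid
  writeField(h, "0000000", 116);                           // gid
  writeField(h, size.toString(8).padStart(11, "0"), 124);  // size, octal
  writeField(h, "00000000000", 136);                       // mtime
  writeField(h, "        ", 148);  // checksum counts as 8 spaces while summing
  h[156] = 0x30;                                           // typeflag "0"
  writeField(h, "ustar", 257);                             // magic, NUL follows
  writeField(h, "00", 263);                                // version
  const sum = h.reduce((a, b) => a + b, 0);
  // Stored as six octal digits, a NUL, then a space.
  writeField(h, sum.toString(8).padStart(6, "0") + "\0 ", 148);
  return h;
}

// Build an archive from an object mapping file names to text content.
function makeTar(files) {
  const parts = [];
  for (const [name, text] of Object.entries(files)) {
    if (encoder.encode(name).length > 100) {
      throw new Error("name too long for this sketch");
    }
    const data = encoder.encode(text);
    const padded = new Uint8Array(Math.ceil(data.length / 512) * 512);
    padded.set(data);
    parts.push(makeHeader(name, data.length), padded);
  }
  parts.push(new Uint8Array(1024)); // two zero blocks mark end of archive
  const out = new Uint8Array(parts.reduce((n, p) => n + p.length, 0));
  let offset = 0;
  for (const p of parts) {
    out.set(p, offset);
    offset += p.length;
  }
  return out;
}
```

A test could then call `makeTar({ "pkg-1.0/PKG-INFO": "Name: pkg" })`, feed the result to the parser under test, and assert on the returned names and contents. Truncating the returned buffer gives a quick way to simulate a corrupted archive.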

🏆 Overall Assessment

This is a high-quality implementation that successfully addresses a real user need. The code is well-structured, follows security best practices, and handles the complex tar format correctly.

Risk Level: 🟢 Low Risk

  • No critical security vulnerabilities
  • Graceful error handling prevents crashes
  • Limited attack surface (client-side only)

Recommendation: ✅ Approve with Minor Suggestions

The implementation is production-ready. The suggestions above are mostly optimizations and defensive programming practices rather than blocking issues.

Key Strengths:

  • Solves a real problem elegantly
  • Maintains backward compatibility
  • Adds significant value to both tools
  • Code quality is consistent with the rest of the codebase

simonw (Owner, Author) commented Jun 9, 2026

apple-fm-sdk works in both tools now.

simonw merged commit d64b067 into main on Jun 9, 2026
2 of 4 checks passed