Skip to content

Add pg_gpu.windowed_pca — winpca-style per-window PCA#113

Open
stsmall wants to merge 1 commit into
mainfrom
add-windowed-pca
Open

Add pg_gpu.windowed_pca — winpca-style per-window PCA#113
stsmall wants to merge 1 commit into
mainfrom
add-windowed-pca

Conversation

@stsmall

@stsmall stsmall commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Summary

A new windowed_pca function (and WindowedPCAResult dataclass) modeled on winpca. It's a thin wrapper over the existing local_pca (lostruct) machinery that re-frames the output for the winpca workflow — plot PC1/PC2 trajectories per sample across the chromosome.

from pg_gpu import HaplotypeMatrix, windowed_pca

hm = HaplotypeMatrix.from_zarr(VCZ, region='X:1-10000000', streaming='never')
result = windowed_pca(
    hm, window_size=100_000, step_size=50_000,
    n_components=10,
    maf_threshold=0.05, ld_prune=True, biallelic_only=True,
    scaler='patterson',
)
# result.windows           — pandas DataFrame, one row per window (chrom/start/end/center/n_variants/ev_1..k)
# result.coords            — ndarray (n_windows, n_samples, n_components), per-sample
# result.sample_ids        — list of n_samples labels
# result.component_labels  — ['PC1', 'PC2', ...]

How it relates to local_pca

local_pca (lostruct) windowed_pca (winpca)
Purpose Find regions where structure changes (MDS on inter-window PC distances) Plot per-window PC trajectories per sample
Pre-filter pipeline none built in biallelic + MAF + LD-prune (defaults match scikit-allel)
Default scaler None (lostruct convention) 'patterson' (scikit-allel / winpca convention)
Output shape eigvecs (n_windows, k, n_haplotypes) coords (n_windows, n_samples, n_components) (folded across haplotype-pair axis under pg_gpu's [hap0_of_all, hap1_of_all] layout)

windowed_pca calls local_pca under the hood and transposes/folds the eigenvectors. Equivalence is asserted by a test (see test_coords_match_local_pca_eigvecs_transpose).

Filtering caveat

Per-window adaptive MAF / LD-prune is not supported — the filters apply once globally before windowing. This is documented in the function docstring. A per-window implementation would duplicate local_pca's iteration loop; happy to add if there's demand.

Tests

9 pytest cases (tests/test_windowed_pca.py) using the shipped examples/data/gamb.X.8-12Mb.n100.derived.zarr fixture:

  • Shape / metadata / sample id checks
  • Per-window eigenvalues are non-increasing
  • Overlapping windows produce more rows than non-overlapping
  • Pre-filter-off equivalence with local_pca up to transpose+fold
  • MAF filter monotonically reduces per-window variant count
  • Streaming input is rejected with ValueError

All 9 pass in ~14s.

Test plan

  • pixi run -e default pytest tests/test_windowed_pca.py -v
  • Spot-check the trajectory plot on a real dataset (e.g. Ag1000G Ag3.0 X)
  • Confirm from pg_gpu import windowed_pca, WindowedPCAResult succeeds

🤖 Generated with Claude Code

A thin wrapper that re-frames local_pca for the winpca workflow
(https://github.com/MoritzBlumer/winpca): per-window biallelic / MAF /
LD-prune filtering and the Patterson scaler that scikit-allel / winpca
use by default. The output is shaped (n_windows, n_samples, n_components)
— transpose of LocalPCAResult.eigvecs, folded across the haplotype-pair
axis under pg_gpu's [hap0_of_all, hap1_of_all] layout — ready for the
per-sample PC trajectory plots winpca renders.

local_pca remains the right tool for the lostruct workflow (find
regions where structure changes); windowed_pca is for plotting PC1/PC2
trajectories across the chromosome.

Filters (biallelic_only, maf_threshold, ld_prune) apply once globally
before windowing; per-window adaptive MAF / LD is documented as
unsupported.

Tests: 9 pytest cases covering shape, metadata columns, eigenvalue
ordering, MAF filter monotonicity, equivalence with local_pca up to
the documented transpose+fold, and a streaming-input guard. All pass
against the shipped examples/data/gamb.X.8-12Mb.n100.derived.zarr
fixture.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@stsmall

stsmall commented May 27, 2026

Copy link
Copy Markdown
Collaborator Author

Adapted the code from here https://github.com/MoritzBlumer/winpca to use pg_gpu instead of scikit-allel. This was used extensively in the Af1000g paper and Sonia's culex paper.
Screenshot 2026-05-26 at 20 58 17

@nspope

nspope commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Hmm ... seems to me this would be better as a tutorial or example workflow, rather than a new function --- as it's composing specific filters/LD pruning with the existing local PCA, then creating a new results class that's more-or-less redundant with LocalPCAResult (which is already straightforward to get without the MDS, via "windowed_analysis"). I guess it's a question of convenience for a particular workflow, vs having redundant stuff in the API. I could go either way. What do you think @andrewkern ?

@stsmall

stsmall commented May 27, 2026

Copy link
Copy Markdown
Collaborator Author

... it does have redundancy w/ localpca but offers a usable 'shortcut' for anyone wanting to do these analyses. A tutorial may be well received, and API access seemed directly useful to those unwilling to tinker.

@andrewkern

Copy link
Copy Markdown
Member

yeah i agree with Nate here-- this is a fine example script or tutorial, but we don't need a whole other module and class to maintain. can you turn this in to an example script?

@stsmall

stsmall commented May 27, 2026

Copy link
Copy Markdown
Collaborator Author

ok. I can add this to the walkthrough (e.g., PR #112). The callout can be 'do WinPCA w/ size, step and pop', and the code will expose the local_pca call and the input params for flexibility.

@stsmall stsmall closed this May 27, 2026
@stsmall stsmall deleted the add-windowed-pca branch May 27, 2026 15:16
@andrewkern andrewkern restored the add-windowed-pca branch June 5, 2026 04:43
@andrewkern andrewkern reopened this Jun 5, 2026
@andrewkern

Copy link
Copy Markdown
Member

@stsmall i'm re-openning this PR.

can you refactor this to be a helper function within pg_gpu/decompositions.py?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants