Add pg_gpu.windowed_pca — winpca-style per-window PCA#113
Conversation
A thin wrapper that re-frames local_pca for the winpca workflow (https://github.com/MoritzBlumer/winpca): per-window biallelic / MAF / LD-prune filtering and the Patterson scaler that scikit-allel / winpca use by default. The output is shaped (n_windows, n_samples, n_components) — transpose of LocalPCAResult.eigvecs, folded across the haplotype-pair axis under pg_gpu's [hap0_of_all, hap1_of_all] layout — ready for the per-sample PC trajectory plots winpca renders. local_pca remains the right tool for the lostruct workflow (find regions where structure changes); windowed_pca is for plotting PC1/PC2 trajectories across the chromosome. Filters (biallelic_only, maf_threshold, ld_prune) apply once globally before windowing; per-window adaptive MAF / LD is documented as unsupported. Tests: 9 pytest cases covering shape, metadata columns, eigenvalue ordering, MAF filter monotonicity, equivalence with local_pca up to the documented transpose+fold, and a streaming-input guard. All pass against the shipped examples/data/gamb.X.8-12Mb.n100.derived.zarr fixture. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Adapted the code from here https://github.com/MoritzBlumer/winpca to use pg_gpu instead of scikit-allel. This was used extensively in the Af1000g paper and Sonia's culex paper. |
|
Hmm ... seems to me this would be better as a tutorial or example workflow, rather than a new function --- as it's composing specific filters/LD pruning with the existing local PCA, then creating a new results class that's more-or-less redundant with LocalPCAResult (which is already straightforward to get without the MDS, via "windowed_analysis"). I guess it's a question of convenience for a particular workflow, vs having redundant stuff in the API. I could go either way. What do you think @andrewkern ? |
|
... it does have redundancy w/ localpca but offers a usable 'shortcut' for anyone wanting to do these analyses. A tutorial may be well received, and API access seemed directly useful to those unwilling to tinker. |
|
yeah i agree with Nate here-- this is a fine example script or tutorial, but we don't need a whole other module and class to maintain. can you turn this in to an example script? |
|
ok. I can add this to the walkthrough (e.g., PR #112). The callout can be 'do WinPCA w/ size, step and pop', and the code will expose the local_pca call and the input params for flexibility. |
|
@stsmall i'm re-openning this PR. can you refactor this to be a helper function within |

Summary
A new
windowed_pcafunction (andWindowedPCAResultdataclass) modeled on winpca. It's a thin wrapper over the existinglocal_pca(lostruct) machinery that re-frames the output for the winpca workflow — plot PC1/PC2 trajectories per sample across the chromosome.How it relates to
local_pcalocal_pca(lostruct)windowed_pca(winpca)scalerNone(lostruct convention)'patterson'(scikit-allel / winpca convention)eigvecs (n_windows, k, n_haplotypes)coords (n_windows, n_samples, n_components)(folded across haplotype-pair axis under pg_gpu's[hap0_of_all, hap1_of_all]layout)windowed_pcacallslocal_pcaunder the hood and transposes/folds the eigenvectors. Equivalence is asserted by a test (seetest_coords_match_local_pca_eigvecs_transpose).Filtering caveat
Per-window adaptive MAF / LD-prune is not supported — the filters apply once globally before windowing. This is documented in the function docstring. A per-window implementation would duplicate
local_pca's iteration loop; happy to add if there's demand.Tests
9 pytest cases (
tests/test_windowed_pca.py) using the shippedexamples/data/gamb.X.8-12Mb.n100.derived.zarrfixture:local_pcaup to transpose+foldValueErrorAll 9 pass in ~14s.
Test plan
pixi run -e default pytest tests/test_windowed_pca.py -vfrom pg_gpu import windowed_pca, WindowedPCAResultsucceeds🤖 Generated with Claude Code