[fsdp] Exclude fully-padding microbatches from metric aggregation (parity with #1817)#1863
Open
kurtislin wants to merge 2 commits into
Open
[fsdp] Exclude fully-padding microbatches from metric aggregation (parity with #1817)#1863kurtislin wants to merge 2 commits into
kurtislin wants to merge 2 commits into
Conversation
NovaSky-AI#1817 excluded fully-padding microbatches from metric aggregation for the Megatron backend. Apply the same skip to the shared forward_backward loops in worker.py used by FSDP, mirroring megatron_worker.py: padding microbatches still run forward/backward (per-rank collective counts stay equal), only the metric append is skipped.
Contributor
There was a problem hiding this comment.
Code Review
This pull request introduces changes to skip fully-padding microbatches during metric aggregation in both the policy and critic workers, preventing dummy zero-valued metrics from skewing mean-reduced metrics. It also adds corresponding unit tests. The review feedback correctly identifies that loss_fn_outputs are currently extracted before the padding check in the policy worker, which would pollute the aggregated outputs with dummy data. The reviewer suggests moving the padding check before this extraction and updating the tests to verify that these dummy outputs are successfully excluded.
Move the padding skip above the loss_fn_outputs extraction so dummy per-sample entries from padding microbatches are not returned (review feedback). Test asserts only real-sample outputs remain.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
#1817 excluded fully-padding microbatches from metric aggregation, but only in megatron_worker.py. The shared forward_backward loops in worker.py (used by FSDP) still append them, so with
max_tokens_per_microbatch > 0mean-reduced metrics (policy_entropy,policy_kl,loss_metrics/*) are dragged toward 0.This PR mirrors the Megatron-side skip into both loops (policy + critic), same placement and comment. Note:
critic_lossis mean-reduced, so its reported value was previously biased toward 0 and changes with this fix.Testing
New CPU test
tests/backends/skyrl_train/workers/test_forward_backward_padding_metrics.pyforces one padding microbatch and asserts mean metrics are undiluted (1.0, not 2/3) and summed metrics unchanged. Fails without the fix; adjacent suites pass (34 tests).