ci: use a unique identifier per integration-suite attempt #1286
Closed
jchrostek-dd wants to merge 1 commit into
Integration-suite deploys intermittently failed at the CDK deploy step (before any test runs) with:

  Automatic import of existing resource /aws/lambda/integ-<sha>-<suite>-... needs a DeletionPolicy of 'Retain' or 'RetainExceptOnCreate'.

The test stack was named only after the commit SHA, and the job retries. When a retry's teardown deleted a function's log group while another attempt's function was still being invoked, the Lambda service recreated the group as an unmanaged, never-expire group. That group survived 'cdk destroy' and blocked the next attempt, which reused the same name and tried to auto-import it (--import-existing-resources) -- but the construct uses RemovalPolicy.DESTROY, so the import was rejected.

Append CI_JOB_ID to the identifier so every attempt (including retries) uses unique stack, function, and log group names. Names never recur, so a leftover group can no longer collide with a later deploy. CI_JOB_ID is unique per retry and available in both script and after_script, so teardown still targets the right stack.
Overview
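Integration-suite deploys intermittently fail at the CDK deploy step, before any test runs, with:

Automatic import of existing resource /aws/lambda/integ-<sha>-<suite>-... needs a DeletionPolicy of 'Retain' or 'RetainExceptOnCreate'.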
Root cause. The test stack is named only after the commit SHA (`integ-<sha>-<suite>`) and the job has `retry: 2`, so retries reuse the same name. The CDK `LogGroup` is already wired into each function, but `LoggingConfig` only points the function at a log group name; it doesn't keep that name alive. When a retry's `delete-stack` (run in `after_script`, even on failure) deletes a function's log group while another attempt's function is still being invoked, the Lambda service recreates the group as an unmanaged, never-expire group. That group is never owned by CloudFormation, so it survives `cdk destroy` and blocks the next attempt, which reuses the same name and tries to auto-import it (`--import-existing-resources`). The construct uses `RemovalPolicy.DESTROY`, not `Retain`, so the import is rejected. ~990 such orphans had accumulated (~600 from the auth suite, whose slow SnapStart test retries most often).

Fix. Append `CI_JOB_ID` to the identifier so every attempt (automatic retry, manual job retry, or pipeline retry) gets unique stack, function, and log group names. Names never recur, so a leftover group from a prior attempt can never collide with a later deploy. `CI_JOB_ID` is unique per job instance (unlike `CI_PIPELINE_ID`, which is reused on a pipeline retry) and is available in both `script` and `after_script`, so teardown still deletes the right stack.

CI-only change; no application code or IAM changes.
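A minimal sketch of the naming scheme described above, not the actual change in this PR. The helper name `make_identifier`, the `SUITE` variable, and the exact position of the job ID in the name are assumptions for illustration; `CI_COMMIT_SHORT_SHA` and `CI_JOB_ID` are GitLab predefined variables, and the fallback values only exist so the snippet runs outside CI.

```shell
# Hypothetical helper: build a per-attempt identifier from the commit SHA,
# the suite name, and the job ID. The layout is an assumption based on the
# description above (integ-<sha>-<suite> plus -<CI_JOB_ID>).
make_identifier() {
  sha="$1"
  suite="$2"
  job_id="$3"
  printf 'integ-%s-%s-%s\n' "$sha" "$suite" "$job_id"
}

# In CI these come from the job environment; the defaults are placeholders
# for local runs only.
IDENTIFIER="$(make_identifier \
  "${CI_COMMIT_SHORT_SHA:-40fb9899}" \
  "${SUITE:-auth}" \
  "${CI_JOB_ID:-1234567890}")"

echo "$IDENTIFIER"
```

Because a retried job gets a new `CI_JOB_ID`, two attempts on the same commit and suite produce different identifiers. Computing the value the same way in `script` and `after_script` keeps deploy and teardown pointed at the same stack.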
Testing
- [...] `logGroup` prop (the originally assumed fix) does not prevent it; eliminating name reuse does.
- [...] `integ-<sha>-otlp-response-validation-lambda`) stays under Lambda's 64-char limit after adding `-<CI_JOB_ID>` (~57 chars).
- [...] `gomplate --config .gitlab/config.yaml` and validated the generated YAML parses.
- [...] `integ-40fb9899-auth-{java,node}` groups from the referenced failing job.
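The length check mentioned above can be sketched as follows. This is an illustration under stated assumptions: the helper `name_fits` is hypothetical, the 10-digit job ID is a made-up sample value, and the job ID is assumed to be appended at the end of the name (its position does not affect the length).

```shell
# Lambda function names are limited to 64 characters.
MAX_LAMBDA_NAME_LEN=64

# Hypothetical helper: succeeds if the given name fits within the limit.
name_fits() {
  [ "${#1}" -le "$MAX_LAMBDA_NAME_LEN" ]
}

# Sample name: 8-character short SHA plus an assumed 10-digit job ID.
name="integ-40fb9899-otlp-response-validation-lambda-1234567890"

if name_fits "$name"; then
  echo "ok: ${#name} chars"
else
  echo "too long: ${#name} chars" >&2
fi
```

With these sample values the name is 57 characters, which leaves 7 characters of headroom if job IDs grow longer.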