ci: use a unique identifier per integration-suite attempt #1286

Closed
jchrostek-dd wants to merge 1 commit into main from john/lambda-ext-auth-loggroup-orphan

jchrostek-dd (Contributor) commented Jun 24, 2026

Overview

Integration-suite deploys intermittently fail at the CDK deploy step, before any test runs, with:

Automatic import of existing resource /aws/lambda/integ-<sha>-<suite>-... needs a DeletionPolicy of 'Retain' or 'RetainExceptOnCreate'.

Root cause. The test stack is named only after the commit SHA (integ-<sha>-<suite>) and the job has retry: 2, so retries reuse the same name. The CDK LogGroup is already wired into each function, but LoggingConfig only points the function at a log group name; it doesn't keep that name alive. When a retry's delete-stack (run in after_script, even on failure) deletes a function's log group while another attempt's function is still being invoked, the Lambda service recreates the group as an unmanaged, never-expire group. That group is never owned by CloudFormation, so it survives cdk destroy and blocks the next attempt. The next attempt reuses the same name and tries to auto-import the group (--import-existing-resources), but the construct uses RemovalPolicy.DESTROY, not Retain, so the import is rejected. About 990 such orphans had accumulated, roughly 600 of them from the auth suite, whose slow SnapStart test retries most often.
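
To make the orphan signature concrete, here is a minimal sketch of how such groups could be picked out of a listing. It assumes records shaped like the CloudWatch Logs DescribeLogGroups response, where a never-expire group has no retentionInDays key, and it assumes the CloudFormation-managed groups are created with a retention period. The function name find_orphan_candidates and the sample records are hypothetical and not part of this PR.

```python
from typing import Iterable, List

# Prefix taken from the error message quoted above.
LAMBDA_PREFIX = "/aws/lambda/integ-"


def find_orphan_candidates(log_groups: Iterable[dict]) -> List[str]:
    """Return names of integ log groups that have no retention policy.

    A missing "retentionInDays" key means the group never expires, which is
    how a group recreated by the Lambda service would appear.
    """
    return sorted(
        group["logGroupName"]
        for group in log_groups
        if group["logGroupName"].startswith(LAMBDA_PREFIX)
        and "retentionInDays" not in group
    )


# Made-up sample records for illustration only.
sample = [
    {"logGroupName": "/aws/lambda/integ-0000aaaa-suite-a-fn"},
    {"logGroupName": "/aws/lambda/integ-0000aaaa-suite-b-fn", "retentionInDays": 1},
    {"logGroupName": "/aws/lambda/unrelated-function"},
]
print(find_orphan_candidates(sample))
```

Only the first sample record qualifies: the second has a retention period and the third is outside the integ prefix. A real check would also need to confirm that no live stack owns the group before treating it as an orphan.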

Fix. Append CI_JOB_ID to the identifier so every attempt — automatic retry, manual job retry, or pipeline retry — gets unique stack, function, and log group names. Names never recur, so a leftover group from a prior attempt can never collide with a later deploy. CI_JOB_ID is unique per job instance (unlike CI_PIPELINE_ID, which is reused on a pipeline retry) and is available in both script and after_script, so teardown still deletes the right stack.
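
The naming change can be sketched as follows. This is an illustration, not the pipeline template from this PR: it assumes the commit part comes from GitLab's CI_COMMIT_SHORT_SHA variable and that names follow an integ-<identifier>-<suite> layout. The helpers build_identifier and stack_name and the job ID values are hypothetical.

```python
import os
from typing import Mapping


def build_identifier(env: Mapping[str, str] = os.environ) -> str:
    """Combine commit SHA and job ID so each job attempt gets a fresh identifier."""
    sha = env["CI_COMMIT_SHORT_SHA"]  # assumption: the pipeline uses the short SHA
    job_id = env["CI_JOB_ID"]  # differs for every job instance, including retries
    return f"{sha}-{job_id}"


def stack_name(identifier: str, suite: str) -> str:
    """Hypothetical naming scheme: integ-<identifier>-<suite>."""
    return f"integ-{identifier}-{suite}"


# Two attempts on the same commit; the job IDs are made-up values.
first = build_identifier({"CI_COMMIT_SHORT_SHA": "40fb9899", "CI_JOB_ID": "1000000001"})
retry = build_identifier({"CI_COMMIT_SHORT_SHA": "40fb9899", "CI_JOB_ID": "1000000002"})
print(stack_name(first, "auth"))
print(stack_name(retry, "auth"))
```

Because the retry runs as a new job with a new CI_JOB_ID, the two stack names differ even though the commit is the same, so a leftover group from the first attempt cannot share a name with anything the retry deploys.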

CI-only change; no application code or IAM changes.

Testing

  • Verified in the serverless sandbox that the orphan is caused by the Lambda service recreating a deleted log group on invocation (deleting a managed group and invoking the function recreates it as never-expire), so wiring the logGroup prop — the originally-assumed fix — does not prevent it; eliminating name reuse does.
  • Confirmed the longest function name (integ-<sha>-otlp-response-validation-lambda) stays under Lambda's 64-char limit after adding -<CI_JOB_ID> (~57 chars).
  • Rendered the pipeline template with gomplate --config .gitlab/config.yaml and validated the generated YAML parses.
  • Immediate unblock: deleted the orphaned integ-40fb9899-auth-{java,node} groups from the referenced failing job.
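
The length check in the second bullet can be reproduced with a short sketch. It assumes an 8-character short SHA and a 10-digit job ID, and it places the job ID right after the SHA; that position is a guess, and only the total length matters here. The helper function_name is hypothetical.

```python
LAMBDA_NAME_LIMIT = 64  # Lambda function name limit cited in the list above


def function_name(sha: str, job_id: str, suffix: str) -> str:
    """Hypothetical layout; the position of the job ID does not affect the length."""
    return f"integ-{sha}-{job_id}-{suffix}"


# Longest suffix named in the PR, with an assumed 10-digit job ID.
name = function_name("40fb9899", "1234567890", "otlp-response-validation-lambda")
print(len(name), len(name) <= LAMBDA_NAME_LIMIT)  # prints "57 True"
```

Under these assumptions the name leaves 7 characters of headroom, so the job ID could grow to 17 digits before the longest function name reaches the limit.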

jchrostek-dd force-pushed the john/lambda-ext-auth-loggroup-orphan branch from 73a3a9d to e7eecf3 on June 24, 2026 11:39
jchrostek-dd changed the title from "ci: delete orphaned log groups before integration-suite deploy" to "ci: use a unique identifier per integration-suite attempt" on Jun 24, 2026

Integration-suite deploys intermittently failed at the CDK deploy step
(before any test runs) with:

  Automatic import of existing resource /aws/lambda/integ-<sha>-<suite>-...
  needs a DeletionPolicy of 'Retain' or 'RetainExceptOnCreate'.

The test stack was named only after the commit SHA, and the job retries.
When a retry's teardown deleted a function's log group while another attempt's
function was still being invoked, the Lambda service recreated the group as an
unmanaged, never-expire group. That group survived 'cdk destroy' and blocked
the next attempt, which reused the same name and tried to auto-import it
(--import-existing-resources) -- but the construct uses RemovalPolicy.DESTROY,
so the import was rejected.

Append CI_JOB_ID to the identifier so every attempt (including retries) uses
unique stack, function, and log group names. Names never recur, so a leftover
group can no longer collide with a later deploy. CI_JOB_ID is unique per retry
and available in both script and after_script, so teardown still targets the
right stack.

jchrostek-dd force-pushed the john/lambda-ext-auth-loggroup-orphan branch from e7eecf3 to 4d89b29 on June 24, 2026 11:40
jchrostek-dd deleted the john/lambda-ext-auth-loggroup-orphan branch on June 24, 2026 11:50