Enable ymir todo trigger#540
Conversation
There was a problem hiding this comment.
Code Review
This pull request replaces the global SILENT_RUN configuration with a silent-by-default model that can be bypassed per-issue using a new ymir_todo label, which initiates a user_triggered run. It also introduces atomic label swapping in the fetcher to prevent duplicate processing, increases the fetcher cron frequency to 5 minutes, and implements critical label write retries. Feedback on these changes highlights several key issues: the user_triggered flag is not propagated to downstream backport and rebase agents; the JQL query allows non-Red Hat employees to trigger runs on early adopter CVEs; immediate re-queuing on label-write failures can cause a tight infinite loop; and the fetcher's 10-minute active deadline exceeds its 5-minute cron interval, potentially blocking scheduled runs.
1f78be6 to
09ed19b
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a maintainer-facing trigger label, ymir_todo, allowing Red Hat Employees to explicitly request end-to-end triage runs. It refactors the pipeline to be silent by default, suppressing intermediate failure labels and comments unless a run is user-triggered (user_triggered=True). The Jira Issue Fetcher is updated to atomically swap trigger labels for ymir_triage_in_progress before enqueuing tasks in Redis to prevent duplicate processing. Feedback on the changes identifies a high-severity bug where fresh issues with no Ymir labels are repeatedly enqueued due to an incorrect elif block. Additionally, improvements are suggested to safely handle potential null values in the Jira API responses (such as author, groups, and created fields) to prevent runtime exceptions, and to add an early return in set_jira_labels to avoid redundant API calls when no label updates are needed.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new maintainer-facing trigger label ymir_todo to initiate end-to-end runs on demand, replacing the global SILENT_RUN environment variable with a per-task user_triggered flag. When an issue is triggered via ymir_todo by a verified Red Hat Employee, the fetcher atomically swaps the label for ymir_triage_in_progress and enqueues it as user-triggered, which bypasses silence filters to post progress acknowledgments and result comments. Additionally, the fetcher's cron schedule is shortened to 5 minutes to process these requests promptly, and critical Jira label writes now include retry logic with exponential backoff to prevent duplicate processing. There are no review comments to assess, so I have no feedback to provide.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request replaces the global SILENT_RUN environment variable with a per-issue, user-triggered workflow using a new ymir_todo label. When added by a verified Red Hat Employee, this label triggers end-to-end processing with active feedback (acknowledgements, results, and intermediate failure labels), while default runs remain silent. The fetcher atomically swaps trigger labels for ymir_triage_in_progress before enqueueing tasks to Redis. Feedback on these changes highlights two key issues: first, the triage agent's acknowledgement comment is posted prematurely and could be duplicated if subsequent critical label writes fail and the task is re-queued; second, expanding the changelog in the main issue API response is limited to 100 entries, which may cause author verification to fail on issues with long histories. It is recommended to move the acknowledgement comment after the successful label write and to use the dedicated paginated changelog endpoint.
|
/gemini review |
dc38ead to
044dcc0
Compare
There was a problem hiding this comment.
Code Review
This pull request replaces the global SILENT_RUN environment variable with a per-issue, user-triggered execution flow initiated by the ymir_todo label. The Jira issue fetcher now verifies that the ymir_todo label was added by a Red Hat Employee before atomically swapping it for ymir_triage_in_progress and enqueuing the task with a user_triggered flag. Agents are silent by default, suppressing comments and intermediate failure labels unless the run is user-triggered. Feedback on these changes highlights that the newly introduced _label_added_by_rh_employee method makes synchronous HTTP requests without retry or backoff mechanisms, which could cause transient network errors or rate limits to silently fail the verification. It is recommended to implement a retrying GET helper decorated with exponential backoff.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new maintainer-facing trigger label, ymir_todo, allowing Red Hat employees to explicitly request end-to-end runs on Jira issues. It replaces the global SILENT_RUN environment variable with a silent-by-default design, where user-facing comments and intermediate failure labels are suppressed unless a run is explicitly user_triggered (via ymir_todo). Additionally, the Jira issue fetcher now runs more frequently (every 5 minutes) and performs atomic label flips (ymir_todo / ymir_retry_needed to ymir_triage_in_progress) before enqueueing tasks to Redis to prevent duplicate processing. It also verifies that the ymir_todo label was added by a member of the "Red Hat Employee" group by paginating through the issue's changelog. There are no review comments to address, and the implementation is robust and well-tested.
a347be8 to
2950233
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request replaces the global SILENT_RUN environment variable with a per-issue user_triggered flag (activated by the new ymir_todo label) to allow on-demand, verbose e2e runs. It updates the Jira Issue Fetcher to verify that the ymir_todo label was added by a Red Hat Employee, atomically flip trigger labels to ymir_triage_in_progress to prevent duplicate enqueues, and configures a new, more frequent CronJob for processing these manual requests. Additionally, git authentication was updated to support private GitLab forks under a custom FORK_NAMESPACE. Feedback on the changes highlights an issue in the fetcher's author verification logic where transient network or API errors are caught and treated as a non-employee action, which would incorrectly result in the silent deletion of the ymir_todo label; it is recommended to propagate and handle requests.RequestException as a transient failure to skip the issue without removing the label.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
0f48faf to
bf1b14f
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the pipeline's notification and execution model to be silent by default, introducing a new user-triggered workflow via the ymir_todo Jira label. The SILENT_RUN environment variable is removed in favor of a user_triggered flag propagated through tasks, which bypasses silence filters to post comments and intermediate failure labels. A new OpenShift CronJob is added to fetch ymir_todo issues promptly, verifying that the label was added by a Red Hat Employee. Additionally, the private GitLab helper is updated to support authentication for forks under a configured namespace. The review feedback correctly identifies a critical bug in the triage agent where writing a terminal error label prematurely would cause task retries to be skipped, and offers a robust improvement to handle potential typos in the FORK_NAMESPACE configuration.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
dec6133 to
300c388
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request replaces the SILENT_RUN environment variable with a default-silent behavior, introducing a new ymir_todo trigger label that allows Red Hat employees to opt into interactive runs with active feedback. The Jira Issue Fetcher is updated to atomically swap trigger labels for ymir_triage_in_progress and verify the trigger's author via the Jira changelog. A new cron job is introduced to sweep for these requests, and GitLab authentication is updated to support custom fork namespaces. The review feedback correctly identifies a high-severity issue in the triage agent where failing to write the in-progress label re-queues the task without incrementing its attempt count, which could cause an infinite loop and block the queue; a suggestion is provided to use the retry helper instead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request replaces the global SILENT_RUN configuration with a per-issue user_triggered flag, enabling a silent-by-default behavior unless explicitly triggered by a Red Hat Employee via the new ymir_todo label. The Jira Issue Fetcher has been updated to verify the label's author through the issue's changelog and atomically swap trigger labels for ymir_triage_in_progress before enqueuing. Additionally, a new cron job has been introduced to sweep for these requests promptly. The review feedback highlights two critical issues: first, a potential infinite retry loop in the fetcher if the author verification fails with a permanent HTTP error (such as a 404 or 403); second, an unhandled exception in the triage agent's error-handling path that could crash the agent if writing the terminal error label fails.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
TomasTomecek
left a comment
There was a problem hiding this comment.
Very nice work Maja, opus is very happy with your PR :)
Overall the PR is solid — the dedup logic is carefully designed
and the critical-write retry is a good addition. The ack-after-retry bug is the only clear correctness issue.
The recently-added ymir_todo label provides a per-issue opt-in for maintainers who want feedback (user_triggered=True bypasses every silence filter). DRY_RUN is unchanged and remains the only env-var knob. user_triggered remains the only escape hatch for verbose output, and lives entirely inside the task payload rather than in process-level configuration. Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The user_triggered flag (set on the Task when the issue carried
ymir_todo) was correctly read by the triage agent but lost when triage
auto-chained to a downstream agent.
Also drops an orphaned `silent_run = os.getenv("SILENT_RUN", ...)` read
in backport_agent.run_workflow that was left over from the
SILENT_RUN-removal commit (its only consumer, the change_jira_status
gate, was simplified at that time to use dry_run only).
Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When set_jira_labels(critical=True) exhausts its internal retries (the ymir_triage_in_progress dedup-anchor write), the triage agent re-queues the task and continues. With brpop and a near-empty queue, the agent would immediately re-pop the same task and hammer the Jira API in a tight loop while it's down. Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cron schedule is */5 * * * * (every 5 minutes) and the job runs with concurrencyPolicy: Forbid. The previous activeDeadlineSeconds was 600 (10 minutes) — twice the cron interval, so a hung fetcher could block the next scheduled run for up to 10 minutes and cause scheduled sweeps to be skipped. Lower the deadline to 240 seconds (4 minutes), comfortably under the 5-minute schedule interval. A hung job is now terminated before the next sweep is due, so Forbid never has to skip a run. Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tasks.set_jira_labels: early-return when both labels_to_add and
labels_to_remove are empty.
* jira_issue_fetcher._label_added_by_rh_employee: switch from
`.get(key, default)` to `.get(key) or default` for author / created /
groups / items. Jira returns explicit `null` for these fields on
some shapes (notably `author` on bot-edits and partially populated
user records); the previous form only handled the missing-key case
and would crash with AttributeError on a real-world null.
* jira_issue_fetcher.push_issues_to_queue: drop the
`elif not ymir_labels: remove_issues_for_retry.add(...)` branch.
The branch was a duplicate-push hazard: a fresh issue with no
ymir_* labels but already sitting in some Redis queue or result list
was being rescued from the dedup skip every sweep, so the same task
landed in triage_queue repeatedly.
Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the bypass: if ymir_triage_in_progress is in current_labels at the start of main(), the fetcher just flipped this task to us (or a previous agent run crashed mid-process and left the marker behind). Either way, we should proceed rather than skip — we are the triage. Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch the user-with-groups GET from string interpolation
(`?accountId={...}&expand=groups`) to `params={...}` so requests
auto-percent-encodes the values. Matches the pattern already used by
ymir/tools/privileged/jira.py:823 for the same endpoint.
Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oudly The catch-all `except Exception` swallowed transient HTTP failures, parse errors, and auth failures into one undifferentiated WARNING. Narrow the catch to the specific exception classes the helper can actually produce (request/network errors, JSON parse errors, missing keys, attribute errors on null shapes) and log 401/403 at ERROR so it shows up as something the operator should act on. Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When _label_added_by_rh_employee returns False, the previous code added the issue to existing_keys (so it wasn't enqueued that sweep) but left the ymir_todo label on the Jira issue. Add a best-effort label removal in the same branch so the bogus ymir_todo is consumed. Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ands
* jira_issue_fetcher._label_added_by_rh_employee now walks the
dedicated GET /rest/api/3/issue/{key}/changelog endpoint with
startAt/maxResults pagination instead of the inline
?expand=changelog form.
* triage_agent.main moves the user-triggered ack comment from "before
the critical label write" to "after the critical label write
succeeds" and gates it on task.attempts == 0. Now the
ack lands exactly once, only after the dedup anchor is in place.
Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_is_private_gitlab` only matched paths under `/redhat/`, so URLs like `https://gitlab.com/redhat-ymir-agent/centos_rpms_vim.git` returned False. As a result, `_get_git_auth_args` produced no auth args and `git push` to the fork fell back to an interactive credential prompt, which then failed in the container with "could not read Username". Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…weep jira-issue-fetcher */30 generic batch filter jira-issue-fetcher-todo */5 labels = "ymir_todo" Common knobs (IGNORED_COMPONENTS, MAX_ISSUES, LOGLEVEL) stay in the existing jira-issue-fetcher-env ConfigMap; each cron mounts that plus a per-cron ConfigMap that only carries QUERY. Both share the same ymir-agent identity, the same Redis-queue dedup, and the same atomic label flip, so a single ymir_todo issue that also matches the batch filter is still picked up exactly once. Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_label_added_by_rh_employee no longer swallows requests.RequestException: it only catches parse errors (ValueError, KeyError, AttributeError) and treats those as non-RH-employee. Transient HTTP/network failures now propagate to the caller, which skips the issue for the current sweep (adding it to existing_keys) so a flaky Jira lookup is retried next sweep instead of being permanently classified as an external label. Assisted-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
An ERROR resolution is dispatched to the retry queue, not a terminal outcome, so it must keep ymir_triage_in_progress as its dedup anchor. Previously the resolution-label block fired for ERROR too, swapping ymir_triage_in_progress for ymir_triage_errored; the next fetcher sweep would then skip the issue and the retry could never reclaim it. Exclude Resolution.ERROR from the label write so the in-progress label survives until the retry produces a real terminal resolution. Assisted-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
A FORK_NAMESPACE value with a leading or trailing slash (e.g.
"/redhat/rhel/bot-branches/") would break the prefix match: the
constructed "/{fork_namespace}/" gained a doubled slash and never
matched parsed.path, so fork URLs were misclassified as non-private
and the push token was not injected. Strip "/" from both ends and
default to "" so the check is tolerant of how the env var is written.
The previous exception handler re-LPUSHed the task and slept 60s in an unbounded loop. Route the failure through the existing retry() helper instead. It increments task.attempts and, on exhaustion, pushes the task to ERROR_LIST with an ErrorData payload describing the in-progress label failure. Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the unbidden-comment set to cover all four no-MR resolutions. Addresses review comments on PR packit#540 (lbarcziova). Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fc29c28 to
5a63045
Compare
The previous "post the ack on attempts==0" guard had a real bug: when the in-progress label write failed before the ack was posted, retry() incremented task.attempts, so the next attempt's attempts==0 check was False and the ack was skipped — the user never received an acknowledgement. Extract the ack-posting logic into post_user_ack_once(), which gates on task.metadata['ack_posted'] and only persists the flag after comment_in_jira returns successfully. A failed post leaves the flag unset so the next retry can try again. Re-queued tasks pick up the flag automatically because task.metadata round-trips through Redis as part of the task JSON. Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the unbidden-comment set to cover all four no-MR resolutions. Addresses review comments on PR packit#540 (lbarcziova). Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Agents were unconditionally moving Jira issues to "In Progress" on
rebase/backport start, and the issue-verification agent was driving
issues through "Release Pending" / "Closed". Operators have asked for
this to be opt-in.
Add a JIRA_ALLOW_STATUS_CHANGES env var (default false). When unset
or false, every status-change call short-circuits and the
status-change comment posted by issue-verification ("Changing status
from X => Y") is suppressed alongside it so the comment stream doesn't
claim a transition that didn't happen.
Guards land in three places:
- tasks.change_jira_status — the shared helper used by the backport
and rebase agents.
- issue_verification_agent._change_status — directly, so the
accompanying comment is also skipped.
- The dry-run paths in backport/rebase agents now log "skipping"
rather than "would change", since the helper itself further gates.
Production note: OpenShift agents will stop changing statuses on
deploy until JIRA_ALLOW_STATUS_CHANGES is flipped to "true" in the
agents-env ConfigMap. The configmap ships with an inline comment
pointing operators at the toggle, and openshift/README.md has the
exact oc patch command.
Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5a63045 to
94362c9
Compare
|
@majamassarini Hi Maja, can you also update the preliminary_testing agent so it doesn't set the Preliminary Testing = Pass unless JIRA_ALLOW_STATUS_CHANGES is set to true (although that variable might benefit from renaming then). Preliminary Testing = Pass is quite an important milestone, although it is not defined by Jira status change. Setting that field allows the build into a compose, triggers erratum creation and Jira transition to Integration. Technically, it's just a field change but in fact it is just as important as Jira status chnage (it triggers one). |
Setting the field admits the build into a compose, triggers erratum creation, and moves the issue to Integration. Assisted-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@martinky82 @lbarcziova I have addressed all the topics PTAL 🙏🏻 |
When we use
ymir_todoUX experience looks like these: