Skip to content

refactor(orchestrator): prepare running ic-gateway as a side-car to the replica in cloud engines#10499

Open
pierugo-dfinity wants to merge 30 commits into
masterfrom
pierugo/orchestrator/cloud-engine-ic-gateway
Open

refactor(orchestrator): prepare running ic-gateway as a side-car to the replica in cloud engines#10499
pierugo-dfinity wants to merge 30 commits into
masterfrom
pierugo/orchestrator/cloud-engine-ic-gateway

Conversation

@pierugo-dfinity

@pierugo-dfinity pierugo-dfinity commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Cloud engines should be self-contained, but today their nodes rely on external routing to reach the replica. This PR prepares running ic-gateway as a side-car alongside the replica on each node, so the gateway terminates external traffic locally and proxies it to the replica.
Launching ic-gateway is still gated behind a flag and only happens on cloud engines, so there's no behavioral change for now. To enable the functionality, it should be enough to flip that flag.

Refactoring the orchestrator's process manager

The orchestrator's process management was written around a single process (the replica), with a parallel, partially-duplicated path for ic-boundary. Adding a third managed process the same way would have meant copying that logic again.
Instead, process management is now generic. The upgrade loop no longer knows what it's running — it just starts and stops a manager that owns the set of node processes. Adding a future process is a matter of describing how it's built and registering it; no changes to the upgrade loop.

Scope: this refactor is confined to the upgrade loop, i.e. processes whose lifetime is tied to a subnet's. Unassigned nodes and API BNs are unaffected.

Testing

A new system test cloud_engine_ic_gateway_test (now still manual because the feature flag is disabled) hits api/v2/status on port 80 of each node, confirming ic-gateway is up and correctly proxying to the replica.
Moreover, new unit tests were added to test the existing ic-boundary logic that restarts the process on domain name changes.

Implementation details for reviewers
  • The upgrade loop now holds a MultipleProcessesManager and calls start_all/stop_all. It's process-agnostic, except for stop_replica which is (arguably, see comment) still needed during recoveries.
  • All per-process logic (including ic-boundary) moved into processes.rs. Each Process declares how it's built via build(&Self::Config, Self::Args), separating static config from dynamic arguments that can change across the orchestrator's lifetime, likely derived from the registry.
  • ProcessManager<Process> centralizes the logging, metrics, and OrchestratorResult handling that was previously duplicated. MultipleProcessesManager holds several of these and decides what to start/stop (e.g. ic-gateway only on cloud engines).
  • ic-boundary is an exception as it has some additional logic when the node's domain changes. Keep that logic by introducing IcBoundaryManager, a wrapper over ProcessManager<IcBoundaryProcess> exposing ensure_ic_boundary_running_and_restarted_on_domain_change.

To add a new process: define NewProcessConfig and NewProcess implementing Process, add a ProcessManager<NewProcess> field to MultipleProcessesManager, and wire it into the start/stop methods.

@pierugo-dfinity pierugo-dfinity added the CI_ALL_BAZEL_TARGETS Runs all bazel targets label Jun 17, 2026
@github-actions github-actions Bot added the feat label Jun 17, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Not ready to approve

MultipleProcessesManager::start_all does not actually stop ic-gateway when the launch gate is disabled, contradicting the intended behavior and potentially leaving the sidecar running unexpectedly.

Pull request overview

Refactors the orchestrator’s subnet-scoped process supervision from “replica-only + special-case ic-boundary” into a generic multi-process manager, and introduces ic-gateway as an (currently gated) sidecar process for CloudEngine nodes so public API traffic can be terminated locally on port 80 and proxied to the replica.

Changes:

  • Replace replica-specific process management in the upgrade loop with MultipleProcessesManager / per-process ProcessManager.
  • Add ic-gateway process definition + GuestOS packaging/env config; add a new (manual) system test verifying /api/v2/status via port 80 on CloudEngine nodes.
  • Rename orchestrator start-attempts metric to be per-process (orchestrator_processes_start_attempts_total) and update affected tests.
File summaries
File Description
rs/tests/driver/src/driver/test_env_api.rs Adjust metrics threshold matching to support prefix-based metric checks.
rs/tests/driver/src/driver/group.rs Update default orchestrator metrics name and document prefix-based matching expectations.
rs/tests/consensus/subnet_splitting_test.rs Update expected orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_enable_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_provision_write_access_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_local_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_enable_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_large_no_upgrade_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_enable_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/replica_determinism_test.rs Update expected orchestrator metric name.
rs/tests/consensus/orchestrator/node_reassignment_test.rs Update expected orchestrator metric name.
rs/tests/consensus/orchestrator/cloud_engine_ic_gateway_test.rs New system test validating CloudEngine health on port 80 through ic-gateway.
rs/tests/consensus/orchestrator/Cargo.toml Add reqwest/serde_cbor deps and register new system-test binary.
rs/tests/consensus/orchestrator/BUILD.bazel Add Bazel target for the new system test (tagged manual).
rs/tests/consensus/cup_explorer_test.rs Update expected orchestrator metric name.
rs/orchestrator/src/upgrade.rs Switch upgrade loop to start/stop a multi-process manager instead of replica-only management.
rs/orchestrator/src/registry_helper.rs Add subnet-type lookup helper for process policy decisions.
rs/orchestrator/src/processes.rs New module: process definitions + generic ProcessManager + MultipleProcessesManager including gated ic-gateway.
rs/orchestrator/src/process_manager.rs Refactor to ProcessRunner and ensure child processes are spawned as their own process-group leaders.
rs/orchestrator/src/orchestrator.rs Wire new process configs/managers; pass env file paths via args; integrate with dashboard/boundary management.
rs/orchestrator/src/metrics.rs Replace replica-only start metric with per-process start/stop counters.
rs/orchestrator/src/lib.rs Register the new processes module.
rs/orchestrator/src/error.rs Remove now-unused domain-name-missing error variant.
rs/orchestrator/src/dashboard.rs Display both replica and ic-gateway PIDs via MultipleProcessesManager.
rs/orchestrator/src/boundary_node.rs Delegate ic-boundary lifecycle to IcBoundaryManager.
rs/orchestrator/src/args.rs Add required args for boundary/gateway env files; make ic_binary_directory non-optional.
rs/ic_os/release/BUILD.bazel Add ic-gateway to released/stripped binaries.
ic-os/guestos/envs/recovery/BUILD.bazel Increase size thresholds to accommodate image growth.
ic-os/guestos/envs/prod/BUILD.bazel Increase size thresholds to accommodate image growth.
ic-os/guestos/defs.bzl Include ic-gateway binary in GuestOS image dependencies.
ic-os/components/guestos/share/ic-gateway.env New env file configuring ic-gateway (port 80, proxy to 8080, domain).
ic-os/components/guestos/ic-replica.service Add orchestrator flags for boundary/gateway env file paths.
ic-os/components/guestos.bzl Install ic-gateway.env into /opt/ic/share.
Cargo.lock Add lock entries for new test dependencies.
Cargo.Bazel.json.lock Add ic-gateway as a generated binary crate.
bazel/rust.MODULE.bazel Add crate annotation to generate ic-gateway binary via Bazel.

Copilot's findings

  • Files reviewed: 41/43 changed files
  • Comments generated: 4

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rs/orchestrator/src/processes.rs
Comment thread rs/orchestrator/src/processes.rs
Comment thread rs/orchestrator/src/processes.rs Outdated
Comment on lines +1141 to 1147
// Assume the metrics to check are prefix-free. This allows to specify a metric name
// prefix to check all metrics with that prefix.
let max_value = metrics_to_check
.get(name.split('(').next().unwrap())
.copied()
.iter()
.find(|(metric_name, _)| name.starts_with(**metric_name))
.map(|(_, max_value)| *max_value)
.unwrap_or_default();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is precisely why the documentation of default_[replica/orchestrator]_metrics mentions that the metrics should be prefix-free. Though I'm happy to revisit if the owners (IDX) think otherwise.

// Restarting the replica is enough to pass the unsigned CUP forward.
// If we fail, restart the current process instead.
if let Err(e) = self.stop_replica() {
if let Err(e) = self.processes_manager.write().unwrap().stop_replica() {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stopping only the replica should be enough, but I am tempted to just restart all children instead for simplicity. WDYT?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be ok

SubnetAssignment::Assigned(_)
)
&& self.has_local_cup.is_some()
// TODO(CON-1630): After mocking the process management, we can remove the condition below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After mocking the process management

This was done in #9528, so now we can remove the TODO (orthogonal to the changes of this PR).

Comment on lines -2737 to -2740
Some(RegistryVersion::from(10)),
Some(RegistryVersion::from(50)),
Some(RegistryVersion::from(100)),
Some(RegistryVersion::from(150))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below, I removed some (imo redundant) test cases to decrease the cardinality of the Cartesian product which was starting to get a bit big.

@pierugo-dfinity pierugo-dfinity changed the title feat(orchestrator): run ic-gateway as a side-car to the replica in cloud engines refactor(orchestrator): prepare running ic-gateway as a side-car to the replica in cloud engines Jun 22, 2026
@github-actions github-actions Bot added refactor and removed feat labels Jun 22, 2026
@pierugo-dfinity pierugo-dfinity marked this pull request as ready for review June 22, 2026 11:33
@pierugo-dfinity pierugo-dfinity requested review from a team as code owners June 22, 2026 11:33
@zeropath-ai

zeropath-ai Bot commented Jun 22, 2026

Copy link
Copy Markdown

No security or compliance issues detected. Reviewed everything up to 45aeab2.

Security Overview
Detected Code Changes

The diff is too large to display a summary of code changes.

@zeropath-ai

zeropath-ai Bot commented Jun 22, 2026

Copy link
Copy Markdown

No security or compliance issues detected. Reviewed everything up to 45aeab2.

Security Overview
Detected Code Changes

The diff is too large to display a summary of code changes.

Comment on lines +493 to +504
match self.registry.get_subnet_type(subnet_id, registry_version)? {
None
| Some(SubnetType::Unspecified)
| Some(SubnetType::Application)
| Some(SubnetType::System)
| Some(SubnetType::VerifiedApplication) => {
self.ic_gateway_manager.stop()?;
}
Some(SubnetType::CloudEngine) => {
self.ic_gateway_manager.ensure_running(replica_version)?;
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's fine to return an error here (even if the replica above worked) because this is called near the very end of the upgrade loop? I think it's a good idea to not let a failing gateway interfere with the replica too much

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because this is called near the very end of the upgrade loop?

Yes, I was careful to start the replica before anything else.

I think it's a good idea to not let a failing gateway interfere with the replica too much

I agree. Though failing to start ic-gateway should imo still be considered an error as it would be a sign of broken/unexpected internal state.

Comment thread rs/orchestrator/src/processes.rs Outdated
Comment on lines +517 to +520
self.ic_gateway_manager.stop()?;
self.replica_manager.stop()?;

Ok(())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here: if stopping the gateway fails, we wouldn't even try to stop the replica. Maybe something like this?

Suggested change
self.ic_gateway_manager.stop()?;
self.replica_manager.stop()?;
Ok(())
let ic_gateway_result = self.ic_gateway_manager.stop();
let replica_result = self.replica_manager.stop();
ic_gateway_result.and(replica_result)

(Assuming returning an error at all is a good idea if only the gateway fails)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also thought about that. I first though of stopping the replica first, to be as least intrusive as possible (similar to start_all), but then it felt wrong because I thought it would be better to be consistent in that started processes should be stopped in reverse order (here, ic-gateway depends on the replica, so we start it second and stop it first).
Your suggestion works too, but I think in that case, we should be consistent with start_all and have the same logic there (of trying to start ic-gateway even if the replica failed to start).

Assuming returning an error at all is a good idea if only the gateway fails

I think it is because returning an error prevents from removing the state and CUP, meaning that on the next iteration of the loop, we will try to stop the processes again. If we did not return an error, we would remove the CUP on disk and not retry, potentially leaving an ic-gateway running indefinitely.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever we do, we should probably still gate it: 45aeab2

Comment thread rs/orchestrator/src/processes.rs Outdated
}

// ---------------------------------------------------------------------------
// MultipleProcessManager

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// MultipleProcessManager
// MultipleProcessesManager

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Comment thread rs/orchestrator/src/process_manager.rs
// Restarting the replica is enough to pass the unsigned CUP forward.
// If we fail, restart the current process instead.
if let Err(e) = self.stop_replica() {
if let Err(e) = self.processes_manager.write().unwrap().stop_replica() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be ok

Comment thread rs/orchestrator/src/upgrade.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants