refactor(orchestrator): prepare running ic-gateway as a side-car to the replica in cloud engines#10499
refactor(orchestrator): prepare running ic-gateway as a side-car to the replica in cloud engines#10499pierugo-dfinity wants to merge 30 commits into
ic-gateway as a side-car to the replica in cloud engines#10499Conversation
There was a problem hiding this comment.
⚠️ Not ready to approve
MultipleProcessesManager::start_all does not actually stop ic-gateway when the launch gate is disabled, contradicting the intended behavior and potentially leaving the sidecar running unexpectedly.
Pull request overview
Refactors the orchestrator’s subnet-scoped process supervision from “replica-only + special-case ic-boundary” into a generic multi-process manager, and introduces ic-gateway as an (currently gated) sidecar process for CloudEngine nodes so public API traffic can be terminated locally on port 80 and proxied to the replica.
Changes:
- Replace replica-specific process management in the upgrade loop with
MultipleProcessesManager/ per-processProcessManager. - Add
ic-gatewayprocess definition + GuestOS packaging/env config; add a new (manual) system test verifying/api/v2/statusvia port 80 on CloudEngine nodes. - Rename orchestrator start-attempts metric to be per-process (
orchestrator_processes_start_attempts_total) and update affected tests.
File summaries
| File | Description |
|---|---|
| rs/tests/driver/src/driver/test_env_api.rs | Adjust metrics threshold matching to support prefix-based metric checks. |
| rs/tests/driver/src/driver/group.rs | Update default orchestrator metrics name and document prefix-based matching expectations. |
| rs/tests/consensus/subnet_splitting_test.rs | Update expected orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_same_nodes_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_same_nodes_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_same_nodes_enable_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_provision_write_access_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_local_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_enable_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_large_no_upgrade_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_enable_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/replica_determinism_test.rs | Update expected orchestrator metric name. |
| rs/tests/consensus/orchestrator/node_reassignment_test.rs | Update expected orchestrator metric name. |
| rs/tests/consensus/orchestrator/cloud_engine_ic_gateway_test.rs | New system test validating CloudEngine health on port 80 through ic-gateway. |
| rs/tests/consensus/orchestrator/Cargo.toml | Add reqwest/serde_cbor deps and register new system-test binary. |
| rs/tests/consensus/orchestrator/BUILD.bazel | Add Bazel target for the new system test (tagged manual). |
| rs/tests/consensus/cup_explorer_test.rs | Update expected orchestrator metric name. |
| rs/orchestrator/src/upgrade.rs | Switch upgrade loop to start/stop a multi-process manager instead of replica-only management. |
| rs/orchestrator/src/registry_helper.rs | Add subnet-type lookup helper for process policy decisions. |
| rs/orchestrator/src/processes.rs | New module: process definitions + generic ProcessManager + MultipleProcessesManager including gated ic-gateway. |
| rs/orchestrator/src/process_manager.rs | Refactor to ProcessRunner and ensure child processes are spawned as their own process-group leaders. |
| rs/orchestrator/src/orchestrator.rs | Wire new process configs/managers; pass env file paths via args; integrate with dashboard/boundary management. |
| rs/orchestrator/src/metrics.rs | Replace replica-only start metric with per-process start/stop counters. |
| rs/orchestrator/src/lib.rs | Register the new processes module. |
| rs/orchestrator/src/error.rs | Remove now-unused domain-name-missing error variant. |
| rs/orchestrator/src/dashboard.rs | Display both replica and ic-gateway PIDs via MultipleProcessesManager. |
| rs/orchestrator/src/boundary_node.rs | Delegate ic-boundary lifecycle to IcBoundaryManager. |
| rs/orchestrator/src/args.rs | Add required args for boundary/gateway env files; make ic_binary_directory non-optional. |
| rs/ic_os/release/BUILD.bazel | Add ic-gateway to released/stripped binaries. |
| ic-os/guestos/envs/recovery/BUILD.bazel | Increase size thresholds to accommodate image growth. |
| ic-os/guestos/envs/prod/BUILD.bazel | Increase size thresholds to accommodate image growth. |
| ic-os/guestos/defs.bzl | Include ic-gateway binary in GuestOS image dependencies. |
| ic-os/components/guestos/share/ic-gateway.env | New env file configuring ic-gateway (port 80, proxy to 8080, domain). |
| ic-os/components/guestos/ic-replica.service | Add orchestrator flags for boundary/gateway env file paths. |
| ic-os/components/guestos.bzl | Install ic-gateway.env into /opt/ic/share. |
| Cargo.lock | Add lock entries for new test dependencies. |
| Cargo.Bazel.json.lock | Add ic-gateway as a generated binary crate. |
| bazel/rust.MODULE.bazel | Add crate annotation to generate ic-gateway binary via Bazel. |
Copilot's findings
- Files reviewed: 41/43 changed files
- Comments generated: 4
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Assume the metrics to check are prefix-free. This allows to specify a metric name | ||
| // prefix to check all metrics with that prefix. | ||
| let max_value = metrics_to_check | ||
| .get(name.split('(').next().unwrap()) | ||
| .copied() | ||
| .iter() | ||
| .find(|(metric_name, _)| name.starts_with(**metric_name)) | ||
| .map(|(_, max_value)| *max_value) | ||
| .unwrap_or_default(); |
There was a problem hiding this comment.
This is precisely why the documentation of default_[replica/orchestrator]_metrics mentions that the metrics should be prefix-free. Though I'm happy to revisit if the owners (IDX) think otherwise.
| // Restarting the replica is enough to pass the unsigned CUP forward. | ||
| // If we fail, restart the current process instead. | ||
| if let Err(e) = self.stop_replica() { | ||
| if let Err(e) = self.processes_manager.write().unwrap().stop_replica() { |
There was a problem hiding this comment.
Stopping only the replica should be enough, but I am tempted to just restart all children instead for simplicity. WDYT?
| SubnetAssignment::Assigned(_) | ||
| ) | ||
| && self.has_local_cup.is_some() | ||
| // TODO(CON-1630): After mocking the process management, we can remove the condition below. |
There was a problem hiding this comment.
After mocking the process management
This was done in #9528, so now we can remove the TODO (orthogonal to the changes of this PR).
| Some(RegistryVersion::from(10)), | ||
| Some(RegistryVersion::from(50)), | ||
| Some(RegistryVersion::from(100)), | ||
| Some(RegistryVersion::from(150)) |
There was a problem hiding this comment.
Here and below, I removed some (imo redundant) test cases to decrease the cardinality of the Cartesian product which was starting to get a bit big.
ic-gateway as a side-car to the replica in cloud enginesic-gateway as a side-car to the replica in cloud engines
|
✅ No security or compliance issues detected. Reviewed everything up to 45aeab2. Security Overview
Detected Code ChangesThe diff is too large to display a summary of code changes. |
|
✅ No security or compliance issues detected. Reviewed everything up to 45aeab2. Security Overview
Detected Code ChangesThe diff is too large to display a summary of code changes. |
| match self.registry.get_subnet_type(subnet_id, registry_version)? { | ||
| None | ||
| | Some(SubnetType::Unspecified) | ||
| | Some(SubnetType::Application) | ||
| | Some(SubnetType::System) | ||
| | Some(SubnetType::VerifiedApplication) => { | ||
| self.ic_gateway_manager.stop()?; | ||
| } | ||
| Some(SubnetType::CloudEngine) => { | ||
| self.ic_gateway_manager.ensure_running(replica_version)?; | ||
| } | ||
| } |
There was a problem hiding this comment.
I guess it's fine to return an error here (even if the replica above worked) because this is called near the very end of the upgrade loop? I think it's a good idea to not let a failing gateway interfere with the replica too much
There was a problem hiding this comment.
because this is called near the very end of the upgrade loop?
Yes, I was careful to start the replica before anything else.
I think it's a good idea to not let a failing gateway interfere with the replica too much
I agree. Though failing to start ic-gateway should imo still be considered an error as it would be a sign of broken/unexpected internal state.
| self.ic_gateway_manager.stop()?; | ||
| self.replica_manager.stop()?; | ||
|
|
||
| Ok(()) |
There was a problem hiding this comment.
Also here: if stopping the gateway fails, we wouldn't even try to stop the replica. Maybe something like this?
| self.ic_gateway_manager.stop()?; | |
| self.replica_manager.stop()?; | |
| Ok(()) | |
| let ic_gateway_result = self.ic_gateway_manager.stop(); | |
| let replica_result = self.replica_manager.stop(); | |
| ic_gateway_result.and(replica_result) |
(Assuming returning an error at all is a good idea if only the gateway fails)
There was a problem hiding this comment.
Yes, I also thought about that. I first though of stopping the replica first, to be as least intrusive as possible (similar to start_all), but then it felt wrong because I thought it would be better to be consistent in that started processes should be stopped in reverse order (here, ic-gateway depends on the replica, so we start it second and stop it first).
Your suggestion works too, but I think in that case, we should be consistent with start_all and have the same logic there (of trying to start ic-gateway even if the replica failed to start).
Assuming returning an error at all is a good idea if only the gateway fails
I think it is because returning an error prevents from removing the state and CUP, meaning that on the next iteration of the loop, we will try to stop the processes again. If we did not return an error, we would remove the CUP on disk and not retry, potentially leaving an ic-gateway running indefinitely.
There was a problem hiding this comment.
Whatever we do, we should probably still gate it: 45aeab2
| } | ||
|
|
||
| // --------------------------------------------------------------------------- | ||
| // MultipleProcessManager |
There was a problem hiding this comment.
| // MultipleProcessManager | |
| // MultipleProcessesManager |
| // Restarting the replica is enough to pass the unsigned CUP forward. | ||
| // If we fail, restart the current process instead. | ||
| if let Err(e) = self.stop_replica() { | ||
| if let Err(e) = self.processes_manager.write().unwrap().stop_replica() { |
Cloud engines should be self-contained, but today their nodes rely on external routing to reach the replica. This PR prepares running
ic-gatewayas a side-car alongside the replica on each node, so the gateway terminates external traffic locally and proxies it to the replica.Launching
ic-gatewayis still gated behind a flag and only happens on cloud engines, so there's no behavioral change for now. To enable the functionality, it should be enough to flip that flag.Refactoring the orchestrator's process manager
The orchestrator's process management was written around a single process (the replica), with a parallel, partially-duplicated path for
ic-boundary. Adding a third managed process the same way would have meant copying that logic again.Instead, process management is now generic. The upgrade loop no longer knows what it's running — it just starts and stops a manager that owns the set of node processes. Adding a future process is a matter of describing how it's built and registering it; no changes to the upgrade loop.
Scope: this refactor is confined to the upgrade loop, i.e. processes whose lifetime is tied to a subnet's. Unassigned nodes and API BNs are unaffected.
Testing
A new system test
cloud_engine_ic_gateway_test(now stillmanualbecause the feature flag is disabled) hitsapi/v2/statuson port 80 of each node, confirmingic-gatewayis up and correctly proxying to the replica.Moreover, new unit tests were added to test the existing
ic-boundarylogic that restarts the process on domain name changes.Implementation details for reviewers
MultipleProcessesManagerand callsstart_all/stop_all. It's process-agnostic, except forstop_replicawhich is (arguably, see comment) still needed during recoveries.ic-boundary) moved intoprocesses.rs. EachProcessdeclares how it's built viabuild(&Self::Config, Self::Args), separating static config from dynamic arguments that can change across the orchestrator's lifetime, likely derived from the registry.ProcessManager<Process>centralizes the logging, metrics, andOrchestratorResulthandling that was previously duplicated.MultipleProcessesManagerholds several of these and decides what to start/stop (e.g.ic-gatewayonly on cloud engines).ic-boundaryis an exception as it has some additional logic when the node's domain changes. Keep that logic by introducingIcBoundaryManager, a wrapper overProcessManager<IcBoundaryProcess>exposingensure_ic_boundary_running_and_restarted_on_domain_change.To add a new process: define
NewProcessConfigandNewProcessimplementingProcess, add aProcessManager<NewProcess>field toMultipleProcessesManager, and wire it into the start/stop methods.