refactor(orchestrator): prepare running `ic-gateway` as a side-car to the replica in cloud engines by pierugo-dfinity · Pull Request #10499 · dfinity/ic

pierugo-dfinity · 2026-06-17T13:18:55Z

Cloud engines should be self-contained, but today their nodes rely on external routing to reach the replica. This PR prepares running ic-gateway as a side-car alongside the replica on each node, so the gateway terminates external traffic locally and proxies it to the replica.
Launching ic-gateway is still gated behind a flag and only happens on cloud engines, so there's no behavioral change for now. To enable the functionality, it should be enough to flip that flag.

Refactoring the orchestrator's process manager

The orchestrator's process management was written around a single process (the replica), with a parallel, partially-duplicated path for ic-boundary. Adding a third managed process the same way would have meant copying that logic again.
Instead, process management is now generic. The upgrade loop no longer knows what it's running — it just starts and stops a manager that owns the set of node processes. Adding a future process is a matter of describing how it's built and registering it; no changes to the upgrade loop.

Scope: this refactor is confined to the upgrade loop, i.e. processes whose lifetime is tied to a subnet's. Unassigned nodes and API BNs are unaffected.

Testing

A new system test cloud_engine_ic_gateway_test (now still manual because the feature flag is disabled) hits api/v2/status on port 80 of each node, confirming ic-gateway is up and correctly proxying to the replica.
Moreover, new unit tests were added to test the existing ic-boundary logic that restarts the process on domain name changes.

Implementation details for reviewers

The upgrade loop now holds a MultipleProcessesManager and calls start_all/stop_all. It's process-agnostic, except for stop_replica which is (arguably, see comment) still needed during recoveries.
All per-process logic (including ic-boundary) moved into processes.rs. Each Process declares how it's built via build(&Self::Config, Self::Args), separating static config from dynamic arguments that can change across the orchestrator's lifetime, likely derived from the registry.
ProcessManager<Process> centralizes the logging, metrics, and OrchestratorResult handling that was previously duplicated. MultipleProcessesManager holds several of these and decides what to start/stop (e.g. ic-gateway only on cloud engines).
ic-boundary is an exception as it has some additional logic when the node's domain changes. Keep that logic by introducing IcBoundaryManager, a wrapper over ProcessManager<IcBoundaryProcess> exposing ensure_ic_boundary_running_and_restarted_on_domain_change.

To add a new process: define NewProcessConfig and NewProcess implementing Process, add a ProcessManager<NewProcess> field to MultipleProcessesManager, and wire it into the start/stop methods.

…pping

Copilot

⚠️ Not ready to approve

MultipleProcessesManager::start_all does not actually stop ic-gateway when the launch gate is disabled, contradicting the intended behavior and potentially leaving the sidecar running unexpectedly.

Pull request overview

Refactors the orchestrator’s subnet-scoped process supervision from “replica-only + special-case ic-boundary” into a generic multi-process manager, and introduces ic-gateway as an (currently gated) sidecar process for CloudEngine nodes so public API traffic can be terminated locally on port 80 and proxied to the replica.

Changes:

Replace replica-specific process management in the upgrade loop with MultipleProcessesManager / per-process ProcessManager.
Add ic-gateway process definition + GuestOS packaging/env config; add a new (manual) system test verifying /api/v2/status via port 80 on CloudEngine nodes.
Rename orchestrator start-attempts metric to be per-process (orchestrator_processes_start_attempts_total) and update affected tests.

File summaries

File	Description
rs/tests/driver/src/driver/test_env_api.rs	Adjust metrics threshold matching to support prefix-based metric checks.
rs/tests/driver/src/driver/group.rs	Update default orchestrator metrics name and document prefix-based matching expectations.
rs/tests/consensus/subnet_splitting_test.rs	Update expected orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_with_chain_keys_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_enable_chain_keys_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_with_chain_keys_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_provision_write_access_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_local_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_enable_chain_keys_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_large_no_upgrade_with_chain_keys_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_with_chain_keys_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_enable_chain_keys_test.rs	Update removed orchestrator metric name.
rs/tests/consensus/replica_determinism_test.rs	Update expected orchestrator metric name.
rs/tests/consensus/orchestrator/node_reassignment_test.rs	Update expected orchestrator metric name.
rs/tests/consensus/orchestrator/cloud_engine_ic_gateway_test.rs	New system test validating CloudEngine health on port 80 through `ic-gateway`.
rs/tests/consensus/orchestrator/Cargo.toml	Add `reqwest`/`serde_cbor` deps and register new system-test binary.
rs/tests/consensus/orchestrator/BUILD.bazel	Add Bazel target for the new system test (tagged `manual`).
rs/tests/consensus/cup_explorer_test.rs	Update expected orchestrator metric name.
rs/orchestrator/src/upgrade.rs	Switch upgrade loop to start/stop a multi-process manager instead of replica-only management.
rs/orchestrator/src/registry_helper.rs	Add subnet-type lookup helper for process policy decisions.
rs/orchestrator/src/processes.rs	New module: process definitions + generic `ProcessManager` + `MultipleProcessesManager` including gated `ic-gateway`.
rs/orchestrator/src/process_manager.rs	Refactor to `ProcessRunner` and ensure child processes are spawned as their own process-group leaders.
rs/orchestrator/src/orchestrator.rs	Wire new process configs/managers; pass env file paths via args; integrate with dashboard/boundary management.
rs/orchestrator/src/metrics.rs	Replace replica-only start metric with per-process start/stop counters.
rs/orchestrator/src/lib.rs	Register the new `processes` module.
rs/orchestrator/src/error.rs	Remove now-unused domain-name-missing error variant.
rs/orchestrator/src/dashboard.rs	Display both replica and `ic-gateway` PIDs via `MultipleProcessesManager`.
rs/orchestrator/src/boundary_node.rs	Delegate `ic-boundary` lifecycle to `IcBoundaryManager`.
rs/orchestrator/src/args.rs	Add required args for boundary/gateway env files; make `ic_binary_directory` non-optional.
rs/ic_os/release/BUILD.bazel	Add `ic-gateway` to released/stripped binaries.
ic-os/guestos/envs/recovery/BUILD.bazel	Increase size thresholds to accommodate image growth.
ic-os/guestos/envs/prod/BUILD.bazel	Increase size thresholds to accommodate image growth.
ic-os/guestos/defs.bzl	Include `ic-gateway` binary in GuestOS image dependencies.
ic-os/components/guestos/share/ic-gateway.env	New env file configuring `ic-gateway` (port 80, proxy to 8080, domain).
ic-os/components/guestos/ic-replica.service	Add orchestrator flags for boundary/gateway env file paths.
ic-os/components/guestos.bzl	Install `ic-gateway.env` into `/opt/ic/share`.
Cargo.lock	Add lock entries for new test dependencies.
Cargo.Bazel.json.lock	Add `ic-gateway` as a generated binary crate.
bazel/rust.MODULE.bazel	Add crate annotation to generate `ic-gateway` binary via Bazel.

Copilot's findings

Files reviewed: 41/43 changed files
Comments generated: 4

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pierugo-dfinity · 2026-06-19T08:25:58Z

+                // Assume the metrics to check are prefix-free. This allows to specify a metric name
+                // prefix to check all metrics with that prefix.
                let max_value = metrics_to_check
-                    .get(name.split('(').next().unwrap())
-                    .copied()
+                    .iter()
+                    .find(|(metric_name, _)| name.starts_with(**metric_name))
+                    .map(|(_, max_value)| *max_value)
                    .unwrap_or_default();


This is precisely why the documentation of default_[replica/orchestrator]_metrics mentions that the metrics should be prefix-free. Though I'm happy to revisit if the owners (IDX) think otherwise.

pierugo-dfinity · 2026-06-22T08:22:43Z

            // Restarting the replica is enough to pass the unsigned CUP forward.
            // If we fail, restart the current process instead.
-            if let Err(e) = self.stop_replica() {
+            if let Err(e) = self.processes_manager.write().unwrap().stop_replica() {


Stopping only the replica should be enough, but I am tempted to just restart all children instead for simplicity. WDYT?

Should be ok

pierugo-dfinity · 2026-06-22T08:25:57Z

                SubnetAssignment::Assigned(_)
-            )
-            && self.has_local_cup.is_some()
-            // TODO(CON-1630): After mocking the process management, we can remove the condition below.


After mocking the process management

This was done in #9528, so now we can remove the TODO (orthogonal to the changes of this PR).

pierugo-dfinity · 2026-06-22T08:30:07Z

-            Some(RegistryVersion::from(10)),
            Some(RegistryVersion::from(50)),
-            Some(RegistryVersion::from(100)),
-            Some(RegistryVersion::from(150))


Here and below, I removed some (imo redundant) test cases to decrease the cardinality of the Cartesian product which was starting to get a bit big.

zeropath-ai · 2026-06-22T11:34:33Z

✅ No security or compliance issues detected. Reviewed everything up to 45aeab2.

Security Overview

🔎 Scanned files: 43 changed file(s)
🔗 Scan Link: https://zeropath.com/app/repositories/6992424c-495e-4621-b441-efd6821a4be6?scanId=88e490e1-e397-4ed7-90d5-0fcaf7cf601d&codeScanTypes=PrScan&tab=issues

Detected Code Changes

The diff is too large to display a summary of code changes.

zeropath-ai · 2026-06-22T11:35:02Z

✅ No security or compliance issues detected. Reviewed everything up to 45aeab2.

Security Overview

🔎 Scanned files: 43 changed file(s)
🔗 Scan Link: https://zeropath.com/app/repositories/f5e9d2bb-29a4-4b64-902a-391d65fd8b2d?scanId=709d6d3e-9694-4fdd-b810-4be225af48bc&codeScanTypes=PrScan&tab=issues

Detected Code Changes

The diff is too large to display a summary of code changes.

eichhorl · 2026-06-22T12:21:56Z

+            match self.registry.get_subnet_type(subnet_id, registry_version)? {
+                None
+                | Some(SubnetType::Unspecified)
+                | Some(SubnetType::Application)
+                | Some(SubnetType::System)
+                | Some(SubnetType::VerifiedApplication) => {
+                    self.ic_gateway_manager.stop()?;
+                }
+                Some(SubnetType::CloudEngine) => {
+                    self.ic_gateway_manager.ensure_running(replica_version)?;
+                }
+            }


I guess it's fine to return an error here (even if the replica above worked) because this is called near the very end of the upgrade loop? I think it's a good idea to not let a failing gateway interfere with the replica too much

because this is called near the very end of the upgrade loop?

Yes, I was careful to start the replica before anything else.

I think it's a good idea to not let a failing gateway interfere with the replica too much

I agree. Though failing to start ic-gateway should imo still be considered an error as it would be a sign of broken/unexpected internal state.

eichhorl · 2026-06-22T12:26:51Z

+        self.ic_gateway_manager.stop()?;
+        self.replica_manager.stop()?;
+
+        Ok(())


Also here: if stopping the gateway fails, we wouldn't even try to stop the replica. Maybe something like this?

Suggested change

self.ic_gateway_manager.stop()?;

self.replica_manager.stop()?;

Ok(())

let ic_gateway_result = self.ic_gateway_manager.stop();

let replica_result = self.replica_manager.stop();

ic_gateway_result.and(replica_result)

(Assuming returning an error at all is a good idea if only the gateway fails)

Yes, I also thought about that. I first though of stopping the replica first, to be as least intrusive as possible (similar to start_all), but then it felt wrong because I thought it would be better to be consistent in that started processes should be stopped in reverse order (here, ic-gateway depends on the replica, so we start it second and stop it first).
Your suggestion works too, but I think in that case, we should be consistent with start_all and have the same logic there (of trying to start ic-gateway even if the replica failed to start).

Assuming returning an error at all is a good idea if only the gateway fails

I think it is because returning an error prevents from removing the state and CUP, meaning that on the next iteration of the loop, we will try to stop the processes again. If we did not return an error, we would remove the CUP on disk and not retry, potentially leaving an ic-gateway running indefinitely.

Whatever we do, we should probably still gate it: 45aeab2

eichhorl · 2026-06-22T12:29:17Z

+}
+
+// ---------------------------------------------------------------------------
+// MultipleProcessManager


Suggested change

// MultipleProcessManager

// MultipleProcessesManager

blind-oracle · 2026-06-22T13:09:32Z

            // Restarting the replica is enough to pass the unsigned CUP forward.
            // If we fail, restart the current process instead.
-            if let Err(e) = self.stop_replica() {
+            if let Err(e) = self.processes_manager.write().unwrap().stop_replica() {


Should be ok

pierugo-dfinity added 11 commits June 17, 2026 09:40

feat: cloud engine firewall

460bfa6

test: system test

7a8ba86

feat: run ic-boundary next to the replica

a70d671

refactor: make Upgrade agnostic to processes running

9a2a31d

feat: add ic-gateway

932808e

feat: run ic-gateway only for now

de65d14

refactor: minor refactor

01b9362

docs: remove gov-team change

2af2935

refactor: deduplicate/generalize code

d6ee4b5

feat: do not attempt to stop when not running. Log + Metrics when sto…

e630926

…pping

feat: propagate orchestrator's changed metric's name

e50142e

pierugo-dfinity added the CI_ALL_BAZEL_TARGETS Runs all bazel targets label Jun 17, 2026

github-actions Bot added the feat label Jun 17, 2026

pierugo-dfinity and others added 12 commits June 17, 2026 13:19

re-trigger CI

11138a7

Automatically fixing code for linting and formatting issues

b73a95b

style: check metric of all orchestrator processes

8a5a43e

feat: support checking metrics prefixes

fa1dd85

feat: ic-gateway metrics

3adc572

feat: stop ic-gateway first

852e891

fix: fix image size estimates

9295ec5

fix: spawn child as leader of new process group

78de3f9

docs: add TODO

20e0162

feat: firewall changes in separate PR

dd9deff

feat: gate launching ic-gateway behind flag

b0a493c

docs

93a758e

pierugo-dfinity requested a review from Copilot June 19, 2026 07:45

Copilot started reviewing on behalf of pierugo-dfinity June 19, 2026 07:46 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

pierugo-dfinity added 2 commits June 19, 2026 08:14

docs

aefebff

style: inline args.ic_binary_directory

7ba43d7

pierugo-dfinity added 2 commits June 22, 2026 08:21

docs

603cfb5

style: rename Fake struct

afc74aa

pierugo-dfinity commented Jun 22, 2026

View reviewed changes

test: add IcBoundaryManager unit tests

e12eccc

pierugo-dfinity changed the title ~~feat(orchestrator): run ic-gateway as a side-car to the replica in cloud engines~~ refactor(orchestrator): prepare running ic-gateway as a side-car to the replica in cloud engines Jun 22, 2026

github-actions Bot added refactor and removed feat labels Jun 22, 2026

pierugo-dfinity marked this pull request as ready for review June 22, 2026 11:33

pierugo-dfinity requested review from a team as code owners June 22, 2026 11:33

github-actions Bot added @idx @node @core-protocol labels Jun 22, 2026

eichhorl reviewed Jun 22, 2026

View reviewed changes

pierugo-dfinity added 2 commits June 22, 2026 12:53

docs

9ab62e6

feat: gate stopping

45aeab2

blind-oracle reviewed Jun 22, 2026

View reviewed changes

-        self.ic_gateway_manager.stop()?;
-        self.replica_manager.stop()?;
-        Ok(())
+        let ic_gateway_result = self.ic_gateway_manager.stop();
+        let replica_result = self.replica_manager.stop();
+        ic_gateway_result.and(replica_result)

Conversation

pierugo-dfinity commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Refactoring the orchestrator's process manager

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

⚠️ Not ready to approve

Copilot's findings

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zeropath-ai Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zeropath-ai Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pierugo-dfinity commented Jun 17, 2026 •

edited

Loading

zeropath-ai Bot commented Jun 22, 2026 •

edited

Loading

zeropath-ai Bot commented Jun 22, 2026 •

edited

Loading