Merged
9 changes: 5 additions & 4 deletions docs/integration/hcg-tier2-rollout-runbook.md
@@ -3,9 +3,9 @@

# HCG tier-2 — rollout & rollback runbook

-**Version:** 0.3 (live policy promoted, Phase E in-progress)
-**Date:** 2026-06-09 (rev. from 2026-06-08)
-**Status:** Phase E deliverables E1 (deploy spec) + E5 (rollback runbook) drafted; live gateway policy (`config/gateway-policy-boj.yaml`) promoted from the worked example (§1.5). Owner-input markers (`!OWNER:`) remain to be filled before any traffic-shift action is taken.
+**Version:** 0.4 (policy-deny smoke script landed, Phase E in-progress)
+**Date:** 2026-06-10 (rev. from 2026-06-09)
+**Status:** Phase E deliverables E1 (deploy spec) + E5 (rollback runbook) drafted; live gateway policy (`config/gateway-policy-boj.yaml`) promoted from the worked example (§1.5); `scripts/hcg-policy-smoke.sh` lands as the checked-in §1.5 operator pre-check (deny-path covers gateway-alone; `--with-backend` adds allow-path coverage). Owner-input markers (`!OWNER:`) remain to be filled before any traffic-shift action is taken.
**ADR:** [`docs/decisions/0004-adopt-http-capability-gateway.md`](../decisions/0004-adopt-http-capability-gateway.md)
**Plan:** [`docs/integration/http-capability-gateway-plan.md`](http-capability-gateway-plan.md) (§ Phase E)
**Contract:** [`docs/integration/http-capability-gateway-boj-contract.md`](http-capability-gateway-boj-contract.md)
@@ -88,7 +88,7 @@ These cannot be inferred from the code/contract; the owner must fill them before
- [x] `container/gateway-deploy.k9.ncl` exists in the gateway repo (plan §E1) — http-capability-gateway#38 (2026-06-03). Five-level k9-svc pedigree (Snout / Scent / Leash / Gut / Muscle) modelled on `boj-server:container/deploy.k9.ncl`; per-environment `BACKEND_URL` (`http://127.0.0.1:7700` staging, `http://unix:/run/boj/gnosis.sock:/` production); trust source `"header"` staging → `"mtls"` production after §2.4 rehearsal; `max_unavailable = 0`; `failure_mode = "fail-closed"` matching the `[SEAMS] gateway-boj-gnosis` declaration.
- [x] Gateway policy file in place: `config/gateway-policy-boj-example.yaml`, covering all BoJ surface routes (`/.well-known/boj-node-pubkey`, `/health`, `/menu`, `/cartridges`, `/cartridge/:name`, `/cartridge/:name/invoke`, `/cartridge/:name/sse`, plus any added since contract v1.0). Re-verified 2026-05-28 against `BojRest.Router`; the `POST /cartridge/:name/sse` route (router.ex line 130, wired since the SSE landing — ADR-0013 §6, STATE entry 2026-05-18) was the only drift since contract v1.0 and is now governed by the `cartridge-sse-post` rule alongside `cartridge-invoke-post` (boj-server#165).
- [x] Live policy file (`config/gateway-policy-boj.yaml`) promoted from the example. Content-identical to the example at promotion time; future BoJ-surface evolution lands in the live file and the example remains as the worked-example artefact (Phase A A3). Both §2.1 staging and §3.1 production load the live file via `POLICY_PATH`.
-- [ ] Gateway has been smoke-tested in isolation with the policy, returning expected allow/deny on each route. Sequence: stand the gateway up against `gateway-policy-boj-example.yaml`, exercise one allow + one deny per route from §1.5 above; confirm `POST /cartridge/:name/sse` with `X-Trust-Level: authenticated` proxies through and with `X-Trust-Level: untrusted` returns 403 (deferred to this step by boj-server#165's test plan). Out of band of code review — operator pre-check before §2.1.
+- [ ] Gateway has been smoke-tested in isolation with the policy, returning expected allow/deny on each route. Run `scripts/hcg-policy-smoke.sh --gateway-url <staging-gateway-url>` against the gateway loaded with `config/gateway-policy-boj.yaml`; the script exercises a no-trust-header deny probe for every non-public route plus default-deny verb canaries (DELETE/PUT/PATCH on `/cartridges` and `/health`) and is fully gateway-internal — BoJ does **not** need to be reachable for this run. Once BoJ is up behind the gateway, re-run with `--with-backend` from a trusted-proxy IP (loopback by default) to also cover the allow path on authenticated/internal routes including the `POST /cartridge/:name/sse` authenticated/untrusted pair carried over from boj-server#165's test plan. Attach the script's PASS/FAIL summary to the cut-over ticket; a single FAIL is a stop-the-rollout condition (gateway loaded the policy but is not enforcing as declared, or BoJ is unreachable from the gateway, or the script is being run from a non-trusted-proxy IP and the trust header is being stripped).

---

@@ -312,3 +312,4 @@ Also update `[HTTP_CAPABILITY_GATEWAY]` section per plan §E acceptance: `status
- `http-capability-gateway/docs/perf-contract.md` — Phase D perf-contract.
- `elixir/lib/boj_rest/trust_policy.ex` — `satisfies?/3` Phase C enforcement.
- `.machine_readable/contractiles/trust/Trustfile.a2ml` — `[CLOUDFLARE_EDGE_SECURITY].rate_limiting.tier_2_gateway` (current `PENDING` site; §6.4 flip target) + `[SEAMS]` (Phase C gateway↔BoJ-gnosis declaration).
+- `scripts/hcg-policy-smoke.sh` — §1.5 operator pre-check: deny-path smoke (gateway-alone) + optional `--with-backend` allow-path smoke against the live policy.
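
The runbook item above asks for the script's PASS/FAIL summary to be attached to the cut-over ticket. One possible way to capture it is sketched below. `run_and_log` and the log path are hypothetical names introduced only for illustration; the function relies on nothing beyond standard `tee` and bash's `PIPESTATUS` array.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of this PR): run a command, mirror its
# stdout and stderr into a log file, and return the command's own exit
# status rather than tee's.
run_and_log() {
    local log="$1"; shift
    "$@" 2>&1 | tee "$log"
    # PIPESTATUS[0] is the exit status of the first pipeline element,
    # i.e. the wrapped command.
    return "${PIPESTATUS[0]}"
}

# Usage (assumed paths; adjust to the environment):
#   run_and_log /tmp/hcg-smoke.log \
#       ./scripts/hcg-policy-smoke.sh --gateway-url http://127.0.0.1:8080
#   echo "exit=$?"   # 0 = all probes matched, 1 = mismatch, 64 = bad usage
```

Returning the wrapped command's status keeps the stop-the-rollout signal intact: a plain `script | tee log` would report `tee`'s status unless `pipefail` is set.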
245 changes: 245 additions & 0 deletions scripts/hcg-policy-smoke.sh
@@ -0,0 +1,245 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: MPL-2.0
# Copyright (c) 2026 Jonathan D.A. Jewell <j.d.a.jewell@open.ac.uk>
#
# hcg-policy-smoke.sh — Exercise the HCG tier-2 live Verb Governance
# Spec from outside the gateway. Returns non-zero on any unexpected
# response, so it can be invoked from the §1.5 / §2.1 prerequisite
# checklist in `docs/integration/hcg-tier2-rollout-runbook.md`.
#
# The default mode probes the *deny* path for every non-public route in
# `config/gateway-policy-boj.yaml` plus a default-deny verb canary for
# DELETE/PUT/PATCH. The deny path is fully gateway-internal — it does
# not require BoJ to be reachable, so this script is the cheapest way
# to confirm policy enforcement before staging cut-over.
#
# With `--with-backend`, the script additionally sends an authenticated
# (or internal) probe per route and asserts the gateway *forwarded* it
# (response did not come from the gateway's own deny path). Allow
# probes require BoJ to be reachable from the gateway's BACKEND_URL,
# and the script itself must run from an IP listed in the gateway's
# `:trusted_proxies` config (loopback by default) so that the
# X-Trust-Level header is not stripped by the gateway's
# `strip_untrusted_headers` plug.
#
# Usage:
#   ./scripts/hcg-policy-smoke.sh --gateway-url http://127.0.0.1:8080
#   ./scripts/hcg-policy-smoke.sh --gateway-url https://stage:8443 \
#       --insecure --with-backend
#
# Exit codes: 0 = all probes matched expectations, 1 = at least one
# mismatch, 64 = bad usage.
#
# Cross-refs:
#   docs/integration/hcg-tier2-rollout-runbook.md      §1.5 / §2.1
#   docs/integration/http-capability-gateway-plan.md   §Phase E
#   config/gateway-policy-boj.yaml                     source of truth
#   standards#100                                      tracking issue

set -euo pipefail

GATEWAY_URL=""
WITH_BACKEND=0
INSECURE=0
TRUST_HEADER_NAME="X-Trust-Level"

usage() {
    cat >&2 <<'EOF'
hcg-policy-smoke.sh — Exercise the HCG live policy.

USAGE:
  hcg-policy-smoke.sh --gateway-url URL [--with-backend] [--insecure]
                      [--trust-header NAME]

OPTIONS:
  --gateway-url URL     Base URL of the gateway (required), e.g.
                        http://127.0.0.1:8080 or https://stage:8443.
  --with-backend        Additionally probe the allow path on each route
                        (requires BoJ reachable at the gateway's
                        BACKEND_URL, and this script to run from a
                        trusted-proxy IP).
  --insecure            Pass `-k` to curl (self-signed staging TLS).
  --trust-header NAME   Override the trust-level header name. Defaults
                        to the gateway default `X-Trust-Level`; set this
                        only if `:trust_level_header` was customised.
  -h, --help            Show this help.

EXAMPLES:
  # Deny-only smoke against a local gateway with no BoJ behind it:
  ./scripts/hcg-policy-smoke.sh --gateway-url http://127.0.0.1:8080

  # Full smoke against staging, BoJ up, self-signed TLS:
  ./scripts/hcg-policy-smoke.sh --gateway-url https://stage:8443 \
      --insecure --with-backend

Designed to be run by the operator for the last open item in rollout
runbook §1.5 (replacing the out-of-band manual probe sequence) and as
the §2.1 post-stand-up sanity check.
EOF
    exit 64
}

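while [ $# -gt 0 ]; do
    case "$1" in
        --gateway-url)
            # Guard the value: a bare `shift 2` with one argument left
            # fails under `set -e` and exits 1 instead of 64.
            [ $# -ge 2 ] || { echo "--gateway-url requires a value" >&2; usage; }
            GATEWAY_URL="$2"; shift 2 ;;
        --with-backend) WITH_BACKEND=1; shift ;;
        --insecure)     INSECURE=1; shift ;;
        --trust-header)
            [ $# -ge 2 ] || { echo "--trust-header requires a value" >&2; usage; }
            TRUST_HEADER_NAME="$2"; shift 2 ;;
        -h|--help)      usage ;;
        *)              echo "unknown arg: $1" >&2; usage ;;
    esac
done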

[ -n "$GATEWAY_URL" ] || usage
command -v curl >/dev/null || { echo "curl: not found" >&2; exit 1; }

GATEWAY_URL="${GATEWAY_URL%/}" # strip trailing slash
CURL_BASE=(curl -sS -o /dev/null -w '%{http_code}' --max-time 10)
[ "$INSECURE" = "1" ] && CURL_BASE+=(-k)

PASS=0
FAIL=0
FAIL_LINES=()

# probe VERB PATH EXPECTATION LABEL [trust_level]
#
# EXPECTATION is one of two keywords, checked against the three-digit
# status code: "deny" requires a 4xx; "allow_or_upstream" requires a
# real HTTP response that is not a gateway-origin 4xx (2xx, 3xx, 5xx).
# curl's "000" (no HTTP response at all) fails both.
#
# trust_level (optional) is sent as the X-Trust-Level header. Without
# it the gateway treats the caller as untrusted, which is the deny-path
# input.
probe() {
    local verb="$1" path="$2" pattern="$3" label="$4" trust="${5:-}"
    local url="${GATEWAY_URL}${path}"
    local args=("${CURL_BASE[@]}" -X "$verb")
    if [ -n "$trust" ]; then
        args+=(-H "${TRUST_HEADER_NAME}: ${trust}")
    fi
    # Some routes are POST; send an empty JSON body so Plug.Parsers
    # does not 400 on missing content-type.
    if [ "$verb" = "POST" ]; then
        args+=(-H "Content-Type: application/json" --data '{}')
    fi
    args+=("$url")

    local code
    # "${args[@]}" must stay quoted so multi-word elements (the JSON
    # Content-Type header in particular) reach curl as single
    # arguments. Unquoted, word-splitting breaks "Content-Type:
    # application/json" in two, curl treats "application/json" as a
    # second URL, and %{http_code} is written twice.
    # On a transport failure curl still prints "000"; `|| true` keeps
    # `set -e` from aborting so the probe is recorded as a FAIL.
    code="$("${args[@]}" 2>/dev/null || true)"
    case "$pattern" in
        deny)
            if [[ "$code" =~ ^4[0-9][0-9]$ ]]; then
                printf ' PASS %-65s %s\n' "$label" "$code"
                PASS=$((PASS + 1))
                return
            fi
            ;;
        allow_or_upstream)
            # The gateway forwarded the request iff the response is a
            # real HTTP status other than a 4xx. 2xx/3xx mean BoJ
            # replied; 5xx is upstream-down (also forwarded). The
            # gateway's own circuit-breaker 503 is indistinguishable
            # from an upstream 503 at this level, which is fine;
            # neither indicates a policy regression. "000" means the
            # gateway itself was unreachable and must not pass.
            if [[ "$code" =~ ^[235][0-9][0-9]$ ]]; then
                printf ' PASS %-65s %s\n' "$label" "$code"
                PASS=$((PASS + 1))
                return
            fi
            ;;
    esac
    printf ' FAIL %-65s %s (expected %s)\n' "$label" "$code" "$pattern"
    FAIL=$((FAIL + 1))
    FAIL_LINES+=("$label got=$code expected=$pattern")
}

echo "==> HCG policy deny smoke against ${GATEWAY_URL}"
echo " (config/gateway-policy-boj.yaml; no X-Trust-Level header)"

# Authenticated routes — gateway must 4xx without a trust header.
# Internal+stealth routes — also 4xx (status code shape depends on
# `:stealth_profiles` runtime config; 4xx covers both stealth and
# bare 403).
probe GET /status deny "auth:status-get"
probe GET /menu deny "auth:menu-get"
probe GET /matrix deny "auth:matrix-get"
probe GET /cartridges deny "auth:cartridges-list-get"
probe GET /cartridge/probe deny "auth:cartridge-detail-get"
probe POST /cartridge/probe/invoke deny "auth:cartridge-invoke-post"
probe POST /cartridge/probe/sse deny "auth:cartridge-sse-post"
probe POST /graphql deny "auth:graphql-post"
probe POST /grpc/svc/method deny "auth:grpc-method-post"
probe GET /sse deny "auth:sse-get"
probe POST /order deny "auth:order-post"
probe POST /order-ticket deny "auth:order-ticket-post"
probe GET /umoja/status deny "auth:umoja-status-get"
probe GET /umoja/transport deny "auth:umoja-transport-get"
probe GET /umoja/peers deny "auth:umoja-peers-get"
probe GET /coprocessor/status deny "auth:coprocessor-status-get"
probe GET /sla/status deny "auth:sla-status-get"
probe GET /community/submissions deny "auth:community-submissions-get"
probe POST /community/submit deny "auth:community-submit-post"

probe POST /cartridge/probe/load deny "internal:cartridge-load-post"
probe POST /cartridge/probe/unload deny "internal:cartridge-unload-post"
probe POST /cartridge/probe/reload deny "internal:cartridge-reload-post"
probe POST /umoja/peers deny "internal:umoja-peers-post"
probe POST /coprocessor/select deny "internal:coprocessor-select-post"
probe GET /sdp/status deny "internal:sdp-status-get"

# Default-deny verb canaries — global_verbs is [GET, POST], so any
# DELETE/PUT/PATCH on a known path must be denied via the no-match
# (or unknown-method) path. Verifies the verb-governance core invariant
# of ADR-0004.
probe DELETE /cartridges deny "verb-canary:DELETE /cartridges"
probe PUT /health deny "verb-canary:PUT /health"
probe PATCH /cartridges deny "verb-canary:PATCH /cartridges"

if [ "$WITH_BACKEND" = "1" ]; then
    echo
    echo "==> HCG policy allow smoke (--with-backend)"
    echo " (X-Trust-Level: authenticated/internal; requires BoJ up)"

    # Authenticated routes — gateway forwards under X-Trust-Level: authenticated.
    # We assert "not a 4xx"; BoJ's own 200 or 500 is fine. A BoJ-origin
    # 4xx (for example a 404 for an unknown cartridge) cannot be told
    # apart from a gateway deny at this level and is reported as FAIL.
    probe GET /status allow_or_upstream "auth-allow:status-get" authenticated
    probe GET /menu allow_or_upstream "auth-allow:menu-get" authenticated
    probe GET /cartridges allow_or_upstream "auth-allow:cartridges-list-get" authenticated
    probe GET /cartridge/probe allow_or_upstream "auth-allow:cartridge-detail-get" authenticated
    probe POST /cartridge/probe/invoke allow_or_upstream "auth-allow:cartridge-invoke-post" authenticated
    probe POST /cartridge/probe/sse allow_or_upstream "auth-allow:cartridge-sse-post" authenticated

    # Public routes — should forward without any trust header.
    probe GET /health allow_or_upstream "public-allow:health-get" ""
    probe GET /.well-known/boj-node-pubkey allow_or_upstream "public-allow:node-pubkey-get" ""

    # Internal+stealth routes — gateway forwards only under
    # X-Trust-Level: internal.
    probe POST /cartridge/probe/load allow_or_upstream "internal-allow:cartridge-load-post" internal
    probe POST /cartridge/probe/unload allow_or_upstream "internal-allow:cartridge-unload-post" internal
    probe POST /cartridge/probe/reload allow_or_upstream "internal-allow:cartridge-reload-post" internal
    probe GET /sdp/status allow_or_upstream "internal-allow:sdp-status-get" internal
fi

echo
echo "────────────────────────────────────────────────────────────────────────"
echo "HCG policy smoke: PASS=${PASS} FAIL=${FAIL}"
if [ "$FAIL" -gt 0 ]; then
    echo
    echo "Mismatches:"
    for line in "${FAIL_LINES[@]}"; do
        echo " - ${line}"
    done
    echo
    echo "Investigate before flipping the §1.5 checkbox. A 4xx miss on a"
    echo "deny probe means the policy was loaded but is not enforcing as"
    echo "declared; a 4xx on an allow probe means the trust header was"
    echo "stripped (run from a trusted-proxy IP) or BoJ is unreachable."
    exit 1
fi
echo "All probes matched policy. Safe to proceed with §2.1 staging cut-over."
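
The pass/fail rule inside `probe` can be exercised offline, without a gateway. The sketch below is a hypothetical helper, `classify_status`, that is not part of the script above. It assumes the same two expectation keywords (`deny`, `allow_or_upstream`) and, as an explicit design assumption, treats curl's `000` (no HTTP response at all) as a failure in both modes.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of hcg-policy-smoke.sh): classify a
# three-digit HTTP status against an expectation keyword.
# Prints PASS or FAIL; returns 0 on PASS and 1 on FAIL.
classify_status() {
    local code="$1" expectation="$2"
    case "$expectation" in
        deny)
            # A denial is any 4xx.
            [[ "$code" =~ ^4[0-9][0-9]$ ]] && { echo PASS; return 0; }
            ;;
        allow_or_upstream)
            # Forwarded means a real 2xx, 3xx or 5xx. "000" and empty
            # strings do not count as forwarded.
            [[ "$code" =~ ^[235][0-9][0-9]$ ]] && { echo PASS; return 0; }
            ;;
    esac
    echo FAIL
    return 1
}

classify_status 403 deny              || true   # prints PASS
classify_status 200 deny              || true   # prints FAIL
classify_status 502 allow_or_upstream || true   # prints PASS
classify_status 000 allow_or_upstream || true   # prints FAIL
```

Matching the allow path positively (`^[235]..$`) rather than as "not 4xx" is what keeps a connection failure from being counted as a forwarded request.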