diff --git a/docs/integration/hcg-tier2-rollout-runbook.md b/docs/integration/hcg-tier2-rollout-runbook.md
index 2b5c27f5..b68e95f2 100644
--- a/docs/integration/hcg-tier2-rollout-runbook.md
+++ b/docs/integration/hcg-tier2-rollout-runbook.md
@@ -3,9 +3,9 @@
 # HCG tier-2 — rollout & rollback runbook
 
-**Version:** 0.3 (live policy promoted, Phase E in-progress)
-**Date:** 2026-06-09 (rev. from 2026-06-08)
-**Status:** Phase E deliverables E1 (deploy spec) + E5 (rollback runbook) drafted; live gateway policy (`config/gateway-policy-boj.yaml`) promoted from the worked example (§1.5). Owner-input markers (`!OWNER:`) remain to be filled before any traffic-shift action is taken.
+**Version:** 0.4 (policy-deny smoke script landed, Phase E in-progress)
+**Date:** 2026-06-10 (rev. from 2026-06-09)
+**Status:** Phase E deliverables E1 (deploy spec) + E5 (rollback runbook) drafted; live gateway policy (`config/gateway-policy-boj.yaml`) promoted from the worked example (§1.5); `scripts/hcg-policy-smoke.sh` lands as the checked-in §1.5 operator pre-check (deny-path covers gateway-alone; `--with-backend` adds allow-path coverage). Owner-input markers (`!OWNER:`) remain to be filled before any traffic-shift action is taken.
 **ADR:** [`docs/decisions/0004-adopt-http-capability-gateway.md`](../decisions/0004-adopt-http-capability-gateway.md)
 **Plan:** [`docs/integration/http-capability-gateway-plan.md`](http-capability-gateway-plan.md) (§ Phase E)
 **Contract:** [`docs/integration/http-capability-gateway-boj-contract.md`](http-capability-gateway-boj-contract.md)
 
@@ -88,7 +88,7 @@ These cannot be inferred from the code/contract; the owner must fill them before
 - [x] `container/gateway-deploy.k9.ncl` exists in the gateway repo (plan §E1) — http-capability-gateway#38 (2026-06-03). Five-level k9-svc pedigree (Snout / Scent / Leash / Gut / Muscle) modelled on `boj-server:container/deploy.k9.ncl`; per-environment `BACKEND_URL` (`http://127.0.0.1:7700` staging, `http://unix:/run/boj/gnosis.sock:/` production); trust source `"header"` staging → `"mtls"` production after §2.4 rehearsal; `max_unavailable = 0`; `failure_mode = "fail-closed"` matching the `[SEAMS] gateway-boj-gnosis` declaration.
 - [x] Gateway policy file in place: `config/gateway-policy-boj-example.yaml`, covering all BoJ surface routes (`/.well-known/boj-node-pubkey`, `/health`, `/menu`, `/cartridges`, `/cartridge/:name`, `/cartridge/:name/invoke`, `/cartridge/:name/sse`, plus any added since contract v1.0). Re-verified 2026-05-28 against `BojRest.Router`; the `POST /cartridge/:name/sse` route (router.ex line 130, wired since the SSE landing — ADR-0013 §6, STATE entry 2026-05-18) was the only drift since contract v1.0 and is now governed by the `cartridge-sse-post` rule alongside `cartridge-invoke-post` (boj-server#165).
 - [x] Live policy file (`config/gateway-policy-boj.yaml`) promoted from the example. Content-identical to the example at promotion time; future BoJ-surface evolution lands in the live file and the example remains as the worked-example artefact (Phase A A3). Both §2.1 staging and §3.1 production load the live file via `POLICY_PATH`.
-- [ ] Gateway has been smoke-tested in isolation with the policy, returning expected allow/deny on each route. Sequence: stand the gateway up against `gateway-policy-boj-example.yaml`, exercise one allow + one deny per route from §1.5 above; confirm `POST /cartridge/:name/sse` with `X-Trust-Level: authenticated` proxies through and with `X-Trust-Level: untrusted` returns 403 (deferred to this step by boj-server#165's test plan). Out of band of code review — operator pre-check before §2.1.
+- [ ] Gateway has been smoke-tested in isolation with the policy, returning expected allow/deny on each route. Run `scripts/hcg-policy-smoke.sh --gateway-url <url>` against the gateway loaded with `config/gateway-policy-boj.yaml`; the script exercises a no-trust-header deny probe for every non-public route plus default-deny verb canaries (DELETE/PUT/PATCH on `/cartridges` and `/health`) and is fully gateway-internal — BoJ does **not** need to be reachable for this run. Once BoJ is up behind the gateway, re-run with `--with-backend` from a trusted-proxy IP (loopback by default) to also cover the allow path on authenticated/internal routes including the `POST /cartridge/:name/sse` authenticated/untrusted pair carried over from boj-server#165's test plan. Attach the script's PASS/FAIL summary to the cut-over ticket; a single FAIL is a stop-the-rollout condition (gateway loaded the policy but is not enforcing as declared, or BoJ is unreachable from the gateway, or the script is being run from a non-trusted-proxy IP and the trust header is being stripped).
 
 ---
 
@@ -312,3 +312,4 @@ Also update `[HTTP_CAPABILITY_GATEWAY]` section per plan §E acceptance: `status
 - `http-capability-gateway/docs/perf-contract.md` — Phase D perf-contract.
 - `elixir/lib/boj_rest/trust_policy.ex` — `satisfies?/3` Phase C enforcement.
 - `.machine_readable/contractiles/trust/Trustfile.a2ml` — `[CLOUDFLARE_EDGE_SECURITY].rate_limiting.tier_2_gateway` (current `PENDING` site; §6.4 flip target) + `[SEAMS]` (Phase C gateway↔BoJ-gnosis declaration).
+- `scripts/hcg-policy-smoke.sh` — §1.5 operator pre-check: deny-path smoke (gateway-alone) + optional `--with-backend` allow-path smoke against the live policy.
diff --git a/scripts/hcg-policy-smoke.sh b/scripts/hcg-policy-smoke.sh
new file mode 100755
index 00000000..75f8f447
--- /dev/null
+++ b/scripts/hcg-policy-smoke.sh
@@ -0,0 +1,245 @@
+#!/usr/bin/env bash
+# SPDX-License-Identifier: MPL-2.0
+# Copyright (c) 2026 Jonathan D.A. Jewell
+#
+# hcg-policy-smoke.sh — Exercise the HCG tier-2 live Verb Governance
+# Spec from outside the gateway. Returns non-zero on any unexpected
+# response, so it can be invoked from the §1.5 / §2.1 prerequisite
+# checklist in `docs/integration/hcg-tier2-rollout-runbook.md`.
+#
+# The default mode probes the *deny* path for every non-public route in
+# `config/gateway-policy-boj.yaml` plus a default-deny verb canary for
+# DELETE/PUT/PATCH. The deny path is fully gateway-internal — it does
+# not require BoJ to be reachable, so this script is the cheapest way
+# to confirm policy enforcement before staging cut-over.
+#
+# With `--with-backend`, the script additionally sends an authenticated
+# (or internal) probe per route and asserts the gateway *forwarded* it
+# (response did not come from the gateway's own deny path). Allow
+# probes require BoJ to be reachable from the gateway's BACKEND_URL,
+# and the script itself must run from an IP listed in the gateway's
+# `:trusted_proxies` config (loopback by default) so that the
+# X-Trust-Level header is not stripped by the gateway's
+# `strip_untrusted_headers` plug.
+#
+# Usage:
+#   ./scripts/hcg-policy-smoke.sh --gateway-url http://127.0.0.1:8080
+#   ./scripts/hcg-policy-smoke.sh --gateway-url https://stage:8443 \
+#       --insecure --with-backend
+#
+# Exit codes: 0 = all probes matched expectations, 1 = at least one
+# mismatch, 64 = bad usage.
+#
+# Cross-refs:
+#   docs/integration/hcg-tier2-rollout-runbook.md      §1.5 / §2.1
+#   docs/integration/http-capability-gateway-plan.md   §Phase E
+#   config/gateway-policy-boj.yaml                     source of truth
+#   standards#100                                      tracking issue
+
+set -euo pipefail
+
+GATEWAY_URL=""
+WITH_BACKEND=0
+INSECURE=0
+TRUST_HEADER_NAME="X-Trust-Level"
+
+usage() {
+    cat >&2 <<'EOF'
+hcg-policy-smoke.sh — Exercise the HCG live policy.
+
+USAGE:
+    hcg-policy-smoke.sh --gateway-url URL [--with-backend] [--insecure]
+                        [--trust-header NAME]
+
+OPTIONS:
+    --gateway-url URL     Base URL of the gateway (required), e.g.
+                          http://127.0.0.1:8080 or https://stage:8443.
+    --with-backend        Additionally probe the allow path on each route
+                          (requires BoJ reachable at the gateway's
+                          BACKEND_URL, and this script to run from a
+                          trusted-proxy IP).
+    --insecure            Pass `-k` to curl (self-signed staging TLS).
+    --trust-header NAME   Override the trust-level header name. Defaults
+                          to the gateway default `X-Trust-Level`; set this
+                          only if `:trust_level_header` was customised.
+    -h, --help            Show this help.
+
+EXAMPLES:
+    # Deny-only smoke against a local gateway with no BoJ behind it:
+    ./scripts/hcg-policy-smoke.sh --gateway-url http://127.0.0.1:8080
+
+    # Full smoke against staging, BoJ up, self-signed TLS:
+    ./scripts/hcg-policy-smoke.sh --gateway-url https://stage:8443 \
+        --insecure --with-backend
+
+Designed to be run by the operator from the rollout runbook §1.5 last
+open item (replacing the out-of-band manual probe sequence) and §2.1
+post-stand-up sanity check.
+EOF
+    exit 64
+}
+
+while [ $# -gt 0 ]; do
+    case "$1" in
+        --gateway-url)  GATEWAY_URL="${2:-}"; shift 2 || usage ;;
+        --with-backend) WITH_BACKEND=1; shift ;;
+        --insecure)     INSECURE=1; shift ;;
+        --trust-header) TRUST_HEADER_NAME="${2:-}"; shift 2 || usage ;;
+        -h|--help)      usage ;;
+        *)              echo "unknown arg: $1" >&2; usage ;;
+    esac
+done
+
+[ -n "$GATEWAY_URL" ] || usage
+command -v curl >/dev/null || { echo "curl: not found" >&2; exit 1; }
+
+GATEWAY_URL="${GATEWAY_URL%/}" # strip trailing slash
+CURL_BASE=(curl -sS -o /dev/null -w '%{http_code}' --max-time 10)
+[ "$INSECURE" = "1" ] && CURL_BASE+=(-k)
+
+PASS=0
+FAIL=0
+FAIL_LINES=()
+
+# probe VERB PATH EXPECTED_PATTERN LABEL [trust_level]
+#
+# EXPECTED_PATTERN is a keyword, not a regex: "deny" passes on any 4xx
+# status code; "allow_or_upstream" passes on 2xx, 3xx, or 5xx, i.e. any
+# HTTP response that is not a gateway-origin 4xx.
+#
+# trust_level (optional) is sent as the X-Trust-Level header. Without
+# it the gateway treats the caller as untrusted, which is the deny-path
+# input.
+probe() {
+    local verb="$1" path="$2" pattern="$3" label="$4" trust="${5:-}"
+    local url="${GATEWAY_URL}${path}"
+    local args=("${CURL_BASE[@]}" -X "$verb")
+    if [ -n "$trust" ]; then
+        args+=(-H "${TRUST_HEADER_NAME}: ${trust}")
+    fi
+    # Some routes are POST; send an empty JSON body so Plug.Parsers
+    # does not 400 on missing content-type.
+    if [ "$verb" = "POST" ]; then
+        args+=(-H "Content-Type: application/json" --data '{}')
+    fi
+    args+=("$url")
+
+    local code
+    # Quote "${args[@]}" so multi-word array elements (the JSON
+    # Content-Type header in particular) stay as single arguments to
+    # curl — without quoting, word-splitting turned "Content-Type:
+    # application/json" into two args and curl saw "application/json"
+    # as a second URL, double-writing %{http_code}.
+    code="$("${args[@]}" 2>/dev/null || true)"
+    case "$pattern" in
+        deny)
+            if [[ "$code" =~ ^4[0-9][0-9]$ ]]; then
+                printf ' PASS %-65s %s\n' "$label" "$code"
+                PASS=$((PASS + 1))
+                return
+            fi
+            ;;
+        allow_or_upstream)
+            # The gateway forwarded the request iff the response is a 2xx,
+            # 3xx, or 5xx: 2xx/3xx mean BoJ replied; 5xx is upstream-down
+            # (also forwarded). curl's "000" (no HTTP response) is a FAIL.
+            # The gateway's own circuit-breaker 503 is indistinguishable
+            # from an upstream 503 at this level, which is fine; neither
+            # indicates a policy regression.
+            if [[ "$code" =~ ^[235][0-9][0-9]$ ]]; then
+                printf ' PASS %-65s %s\n' "$label" "$code"
+                PASS=$((PASS + 1))
+                return
+            fi
+            ;;
+    esac
+    printf ' FAIL %-65s %s (expected %s)\n' "$label" "$code" "$pattern"
+    FAIL=$((FAIL + 1))
+    FAIL_LINES+=("$label got=$code expected=$pattern")
+}
+
+echo "==> HCG policy deny smoke against ${GATEWAY_URL}"
+echo " (config/gateway-policy-boj.yaml; no X-Trust-Level header)"
+
+# Authenticated routes — gateway must 4xx without a trust header.
+# Internal+stealth routes — also 4xx (status code shape depends on
+# `:stealth_profiles` runtime config; 4xx covers both stealth and
+# bare 403).
+probe GET    /status                 deny "auth:status-get"
+probe GET    /menu                   deny "auth:menu-get"
+probe GET    /matrix                 deny "auth:matrix-get"
+probe GET    /cartridges             deny "auth:cartridges-list-get"
+probe GET    /cartridge/probe        deny "auth:cartridge-detail-get"
+probe POST   /cartridge/probe/invoke deny "auth:cartridge-invoke-post"
+probe POST   /cartridge/probe/sse    deny "auth:cartridge-sse-post"
+probe POST   /graphql                deny "auth:graphql-post"
+probe POST   /grpc/svc/method        deny "auth:grpc-method-post"
+probe GET    /sse                    deny "auth:sse-get"
+probe POST   /order                  deny "auth:order-post"
+probe POST   /order-ticket           deny "auth:order-ticket-post"
+probe GET    /umoja/status           deny "auth:umoja-status-get"
+probe GET    /umoja/transport        deny "auth:umoja-transport-get"
+probe GET    /umoja/peers            deny "auth:umoja-peers-get"
+probe GET    /coprocessor/status     deny "auth:coprocessor-status-get"
+probe GET    /sla/status             deny "auth:sla-status-get"
+probe GET    /community/submissions  deny "auth:community-submissions-get"
+probe POST   /community/submit       deny "auth:community-submit-post"
+
+probe POST   /cartridge/probe/load   deny "internal:cartridge-load-post"
+probe POST   /cartridge/probe/unload deny "internal:cartridge-unload-post"
+probe POST   /cartridge/probe/reload deny "internal:cartridge-reload-post"
+probe POST   /umoja/peers            deny "internal:umoja-peers-post"
+probe POST   /coprocessor/select     deny "internal:coprocessor-select-post"
+probe GET    /sdp/status             deny "internal:sdp-status-get"
+
+# Default-deny verb canaries — global_verbs is [GET, POST], so any
+# DELETE/PUT/PATCH on a known path must be denied via the no-match
+# (or unknown-method) path. Verifies the verb-governance core invariant
+# of ADR-0004.
+probe DELETE /cartridges             deny "verb-canary:DELETE /cartridges"
+probe PUT    /health                 deny "verb-canary:PUT /health"
+probe PATCH  /cartridges             deny "verb-canary:PATCH /cartridges"
+
+if [ "$WITH_BACKEND" = "1" ]; then
+    echo
+    echo "==> HCG policy allow smoke (--with-backend)"
+    echo " (X-Trust-Level: authenticated/internal; requires BoJ up)"
+
+    # Authenticated routes — gateway forwards under X-Trust-Level: authenticated.
+    # We assert "not a gateway-origin 4xx"; BoJ's own 200/404/500 is fine.
+    probe GET  /status                 allow_or_upstream "auth-allow:status-get" authenticated
+    probe GET  /menu                   allow_or_upstream "auth-allow:menu-get" authenticated
+    probe GET  /cartridges             allow_or_upstream "auth-allow:cartridges-list-get" authenticated
+    probe GET  /cartridge/probe        allow_or_upstream "auth-allow:cartridge-detail-get" authenticated
+    probe POST /cartridge/probe/invoke allow_or_upstream "auth-allow:cartridge-invoke-post" authenticated
+    probe POST /cartridge/probe/sse    allow_or_upstream "auth-allow:cartridge-sse-post" authenticated
+
+    # Public routes — should forward without any trust header.
+    probe GET  /health                      allow_or_upstream "public-allow:health-get" ""
+    probe GET  /.well-known/boj-node-pubkey allow_or_upstream "public-allow:node-pubkey-get" ""
+
+    # Internal+stealth routes — gateway forwards only under
+    # X-Trust-Level: internal.
+    probe POST /cartridge/probe/load   allow_or_upstream "internal-allow:cartridge-load-post" internal
+    probe POST /cartridge/probe/unload allow_or_upstream "internal-allow:cartridge-unload-post" internal
+    probe POST /cartridge/probe/reload allow_or_upstream "internal-allow:cartridge-reload-post" internal
+    probe GET  /sdp/status             allow_or_upstream "internal-allow:sdp-status-get" internal
+fi
+
+echo
+echo "────────────────────────────────────────────────────────────────────────"
+echo "HCG policy smoke: PASS=${PASS} FAIL=${FAIL}"
+if [ "$FAIL" -gt 0 ]; then
+    echo
+    echo "Mismatches:"
+    for line in "${FAIL_LINES[@]}"; do
+        echo " - ${line}"
+    done
+    echo
+    echo "Investigate before flipping the §1.5 checkbox. A 4xx miss on a"
+    echo "deny probe means the policy was loaded but is not enforcing as"
+    echo "declared; a 4xx on an allow probe means the trust header was"
+    echo "stripped (run from a trusted-proxy IP) or BoJ is unreachable."
+    exit 1
+fi
+echo "All probes matched policy. Safe to proceed with §2.1 staging cut-over."