Skip to content

feat(mqtt): add mapache user + drop user_data from ignore_changes#70

Merged
BK1031 merged 1 commit into
mainfrom
bk1031/mqtt-mapache-user
Jun 5, 2026
Merged

feat(mqtt): add mapache user + drop user_data from ignore_changes#70
BK1031 merged 1 commit into
mainfrom
bk1031/mqtt-mapache-user

Conversation

@BK1031

@BK1031 BK1031 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Two coupled changes:

1. New `mapache` MQTT user

Adds `mqtt_user_mapache` (default `"mapache"`) with a 32-char `random_password`, written as a third line in `/etc/nanomq_pwd.conf` via user-data. Intended for the mapache services fleet beyond gr26 — query, foreman, any future MQTT publishers. Distinct from `gr26` so the service-fleet credential can rotate independently of the CAN-ingest pipeline.

Read via:
```bash
terraform -chdir=infra/environments/prod output -raw mqtt_password_mapache
```

2. Drop `user_data` from `lifecycle.ignore_changes`

Old guard:
```hcl
lifecycle {
ignore_changes = [user_data, ami]
}
```

…meant adding users, ACL tweaks, or any user-data edit required an explicit `terraform taint` / `-replace=...` to actually take effect. Easy to forget, and we've already hit the trap twice (tcm26 user, clickhouse chown). nanomq carries no persistent state, so a ~90s replacement on legitimate config changes is acceptable — paho clients auto-reconnect, no data is lost.

`ami` stays in `ignore_changes` so unrelated AL2023 base-image churn doesn't silently roll the instance.

Postgres + ClickHouse modules keep `user_data` ignored — they own data volumes and stop/start downtime is a bigger ask there. Justification is symmetric to why `prevent_destroy = true` exists on those EBS volumes but not here.

Apply consequence

`terraform plan` will show `module.mqtt.aws_instance.this` must be replaced because user-data changed (new mapache user line). EIP, SG, all three `random_password` values survive in state. Expect ~60–120s of broker downtime during the EC2 swap; paho clients in gr26 reconnect automatically.

Test plan

  • `terraform validate` + `fmt -check` on module + prod env
  • `terraform plan`: 1 replace on `module.mqtt.aws_instance.this`, in-place updates on the EIP + volume_attachment, no destroys elsewhere
  • After apply: SSM into the new gr-mqtt EC2 and `sudo cat /etc/nanomq_pwd.conf` shows three lines (gr26, tcm26, mapache)
  • `mqtt pub -h gr-mqtt.gauchoracing.com -u mapache -pw "$(terraform output -raw mqtt_password_mapache)" -t test -m hello` succeeds
  • Existing gr26 / tcm26 credentials still work (passwords unchanged in TF state)
  • gr26 pods reconnect within 30s post-replace (paho `SetAutoReconnect`)

Two coupled changes:

1. New mqtt_user_mapache (default "mapache") with a 32-char
   random_password, written as a third line in /etc/nanomq_pwd.conf via
   user-data. Intended for the mapache services fleet beyond gr26 —
   query, foreman, any future MQTT publishers. Keeps the credential
   distinct from gr26 so it can rotate independently of CAN ingest.

2. Drop user_data from lifecycle.ignore_changes (keep ami). The old
   guard meant adding users, ACL tweaks, or any user-data edit required
   an explicit `terraform taint` / `-replace` to actually deploy — easy
   to forget, and we've already hit the trap twice. nanomq carries no
   persistent state, so a ~90s replacement on legitimate config changes
   is acceptable; paho clients auto-reconnect. Postgres + ClickHouse
   modules keep user_data ignored because they own data volumes.

Side effect: the apply that lands this PR will replace the running
gr-mqtt instance (new user-data → instance recreate). EIP, SG, and all
three random_password values survive in state.
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Terraform plan: prod

step result
fmt success
init success
validate success
plan success
plan output
module.origin_cert.tls_private_key.this: Refreshing state... [id=4c5b5e0a4a4f13723ba2aebe888a7cb50529fcc0]
module.mqtt.random_password.mqtt_tcm26: Refreshing state... [id=none]
module.postgres.random_password.postgres: Refreshing state... [id=none]
module.mqtt.random_password.mqtt: Refreshing state... [id=none]
module.clickhouse.random_password.admin: Refreshing state... [id=none]
data.cloudflare_zone.gauchoracing: Reading...
module.origin_cert.tls_cert_request.this: Refreshing state... [id=9398eb1d26e54eb59b18d476ced10810e80172e5]
module.origin_cert.cloudflare_origin_ca_certificate.this: Refreshing state... [id=307819530070461722629184968406649377122389744032]
module.eks.module.eks.module.kms.data.aws_caller_identity.current[0]: Reading...
module.clickhouse.data.aws_ami.al2023_arm64: Reading...
module.eks.module.eks.data.aws_iam_policy_document.assume_role_policy[0]: Reading...
module.mqtt.data.aws_ami.al2023_arm64: Reading...
module.clickhouse.aws_ebs_volume.data: Refreshing state... [id=vol-0e312e8d71875ec89]
module.eks.module.eks.aws_cloudwatch_log_group.this[0]: Refreshing state... [id=/aws/eks/gr-prod/cluster]
module.eks.module.eks.module.kms.data.aws_partition.current[0]: Reading...
module.postgres.aws_ebs_volume.data: Refreshing state... [id=vol-02b203218b57629b4]
module.eks.module.eks.module.kms.data.aws_partition.current[0]: Read complete after 0s [id=aws]
module.eks.module.eks.data.aws_iam_policy_document.assume_role_policy[0]: Read complete after 0s [id=2830595799]
module.eks.module.eks.data.aws_iam_policy_document.node_assume_role_policy[0]: Reading...
module.vpc.module.vpc.aws_vpc.this[0]: Refreshing state... [id=vpc-06e13a97395396a3b]
module.eks.module.eks.data.aws_iam_policy_document.node_assume_role_policy[0]: Read complete after 0s [id=3518401652]
module.eks.module.eks.data.aws_caller_identity.current[0]: Reading...
module.eks.module.eks.module.kms.data.aws_caller_identity.current[0]: Read complete after 0s [id=211125506628]
module.eks.module.eks.data.aws_partition.current[0]: Reading...
module.eks.module.eks.data.aws_partition.current[0]: Read complete after 0s [id=aws]
module.postgres.data.aws_ami.al2023_arm64: Reading...
data.cloudflare_zone.gauchoracing: Read complete after 0s [id=5ac5ae9c6086e4b55c5e1b21ca963d94]
module.eks.module.eks.data.aws_caller_identity.current[0]: Read complete after 0s [id=211125506628]
module.eks.module.eks.aws_iam_role.this[0]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002]
module.eks.module.eks.aws_iam_role.eks_auto[0]: Refreshing state... [id=gr-prod-eks-auto-20260601094833482500000004]
cloudflare_ruleset.ssl_overrides: Refreshing state... [id=5a5ed8237d6f48418c172979ffe5da81]
module.eks.module.eks.data.aws_iam_session_context.current[0]: Reading...
module.eks.module.eks.data.aws_iam_policy_document.custom[0]: Reading...
module.eks.module.eks.data.aws_iam_policy_document.custom[0]: Read complete after 0s [id=513122117]
module.eks.module.eks.aws_iam_policy.custom[0]: Refreshing state... [id=arn:aws:iam::211125506628:policy/gr-prod-cluster-20260601094833480900000001]
module.eks.module.eks.data.aws_iam_session_context.current[0]: Read complete after 0s [id=arn:aws:sts::211125506628:assumed-role/github-actions-terraform/GitHubActions]
module.eks.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSBlockStoragePolicy"]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::aws:policy/AmazonEKSBlockStoragePolicy]
module.eks.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSComputePolicy"]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::aws:policy/AmazonEKSComputePolicy]
module.eks.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSClusterPolicy"]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.postgres.data.aws_ami.al2023_arm64: Read complete after 1s [id=ami-0a2a049c945b84826]
module.eks.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSLoadBalancingPolicy"]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::aws:policy/AmazonEKSLoadBalancingPolicy]
module.clickhouse.data.aws_ami.al2023_arm64: Read complete after 1s [id=ami-0a2a049c945b84826]
module.eks.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSNetworkingPolicy"]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::aws:policy/AmazonEKSNetworkingPolicy]
module.mqtt.data.aws_ami.al2023_arm64: Read complete after 1s [id=ami-0a2a049c945b84826]
module.eks.module.eks.aws_iam_role_policy_attachment.eks_auto["AmazonEC2ContainerRegistryPullOnly"]: Refreshing state... [id=gr-prod-eks-auto-20260601094833482500000004/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPullOnly]
module.eks.module.eks.aws_iam_role_policy_attachment.eks_auto["AmazonEKSWorkerNodeMinimalPolicy"]: Refreshing state... [id=gr-prod-eks-auto-20260601094833482500000004/arn:aws:iam::aws:policy/AmazonEKSWorkerNodeMinimalPolicy]
module.eks.module.eks.module.kms.data.aws_iam_policy_document.this[0]: Reading...
module.eks.module.eks.module.kms.data.aws_iam_policy_document.this[0]: Read complete after 0s [id=922405470]
module.eks.module.eks.aws_iam_role_policy_attachment.custom[0]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::211125506628:policy/gr-prod-cluster-20260601094833480900000001]
module.eks.module.eks.module.kms.aws_kms_key.this[0]: Refreshing state... [id=7768801a-b38a-4c26-8bc3-7bf6fe2aac86]
module.vpc.module.vpc.aws_default_security_group.this[0]: Refreshing state... [id=sg-0a592b2169fd42df8]
module.vpc.module.vpc.aws_default_route_table.default[0]: Refreshing state... [id=rtb-08f817bde5f65eb92]
module.mqtt.aws_security_group.this: Refreshing state... [id=sg-0f5a2dc492283dafe]
module.eks.module.eks.aws_security_group.cluster[0]: Refreshing state... [id=sg-0cac44db03a686436]
module.vpc.module.vpc.aws_default_network_acl.this[0]: Refreshing state... [id=acl-0fd92b5b8eb95b2f9]
module.vpc.module.vpc.aws_internet_gateway.this[0]: Refreshing state... [id=igw-0ace83040de106603]
module.clickhouse.aws_security_group.this: Refreshing state... [id=sg-0dee964416e7aaeb5]
module.eks.module.eks.aws_security_group.node[0]: Refreshing state... [id=sg-0b19db83dbe18cbf1]
module.vpc.module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0c18db918f54ad033]
module.vpc.module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-09fbaccd0b3aaab85]
module.vpc.module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-022e58c410c24d794]
module.vpc.module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-06540460bfd7d06a2]
module.vpc.module.vpc.aws_route_table.public[0]: Refreshing state... [id=rtb-0815845194166b58b]
module.vpc.module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0182a0562244a6fac]
module.vpc.module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-0a5a8299f6da9bc59]
module.vpc.module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0264aaaa19faa70f5]
module.postgres.aws_security_group.this: Refreshing state... [id=sg-08a5b2e02e0540520]
module.eks.module.eks.module.kms.aws_kms_alias.this["cluster"]: Refreshing state... [id=alias/eks/gr-prod]
module.origin_cert.aws_acm_certificate.this: Refreshing state... [id=arn:aws:acm:us-west-2:211125506628:certificate/d10d5205-6d4b-4798-a152-293c69174660]
module.vpc.module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-015b9b6ae09534761]
module.mqtt.aws_security_group_rule.ingress_cidr[0]: Refreshing state... [id=sgrule-1294033340]
module.clickhouse.aws_security_group_rule.ingress_cidr["8123"]: Refreshing state... [id=sgrule-467233960]
module.clickhouse.aws_security_group_rule.ingress_cidr["9000"]: Refreshing state... [id=sgrule-2026118668]
module.eks.module.eks.aws_iam_policy.cluster_encryption[0]: Refreshing state... [id=arn:aws:iam::211125506628:policy/gr-prod-cluster-ClusterEncryption20260601094855072000000006]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_10251_webhook"]: Refreshing state... [id=sgrule-1337801498]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_9443_webhook"]: Refreshing state... [id=sgrule-3115694346]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_kubelet"]: Refreshing state... [id=sgrule-2079615841]
module.eks.module.eks.aws_security_group_rule.node["ingress_nodes_ephemeral"]: Refreshing state... [id=sgrule-504933240]
module.eks.module.eks.aws_security_group_rule.node["ingress_self_coredns_tcp"]: Refreshing state... [id=sgrule-1577514230]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_4443_webhook"]: Refreshing state... [id=sgrule-118149494]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_6443_webhook"]: Refreshing state... [id=sgrule-3460871904]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_443"]: Refreshing state... [id=sgrule-500562133]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_8443_webhook"]: Refreshing state... [id=sgrule-3709110977]
module.eks.module.eks.aws_security_group_rule.node["ingress_self_coredns_udp"]: Refreshing state... [id=sgrule-4200159001]
module.eks.module.eks.aws_security_group_rule.node["egress_all"]: Refreshing state... [id=sgrule-4004824215]
module.vpc.module.vpc.aws_route.public_internet_gateway[0]: Refreshing state... [id=r-rtb-0815845194166b58b1080289494]
module.vpc.module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0c836864d0eccc7fc]
module.vpc.module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-0b736817e30c38444]
module.vpc.module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-07ea61af9805b07cb]
module.eks.module.eks.aws_security_group_rule.cluster["ingress_nodes_443"]: Refreshing state... [id=sgrule-535925259]
module.vpc.module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-041ffe3137ec2ff7a]
module.vpc.module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-01d455080d37c3108]
module.vpc.module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-09913d53f1ff40442]
module.postgres.aws_security_group_rule.ingress_cidr[0]: Refreshing state... [id=sgrule-3721507580]
module.eks.module.eks.aws_iam_role_policy_attachment.cluster_encryption[0]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::211125506628:policy/gr-prod-cluster-ClusterEncryption20260601094855072000000006]
module.vpc.module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-019992cce709b8681]
module.postgres.aws_security_group_rule.ingress_sg["sg-0b19db83dbe18cbf1"]: Refreshing state... [id=sgrule-1130224857]
module.mqtt.aws_security_group_rule.ingress_sg["sg-0b19db83dbe18cbf1"]: Refreshing state... [id=sgrule-1062873908]
module.clickhouse.aws_security_group_rule.ingress_sg["sg-0b19db83dbe18cbf1-8123"]: Refreshing state... [id=sgrule-4177740833]
module.clickhouse.aws_security_group_rule.ingress_sg["sg-0b19db83dbe18cbf1-9000"]: Refreshing state... [id=sgrule-2466486093]
module.mqtt.aws_instance.this: Refreshing state... [id=i-0bf98528bc8e9dab0]
module.clickhouse.aws_instance.this: Refreshing state... [id=i-0f862cc6460b5d98a]
module.vpc.module.vpc.aws_route.private_nat_gateway[0]: Refreshing state... [id=r-rtb-0c18db918f54ad0331080289494]
module.postgres.aws_instance.this: Refreshing state... [id=i-013aab40e28a0b6b7]
module.eks.module.eks.aws_eks_cluster.this[0]: Refreshing state... [id=gr-prod]
module.eks.module.eks.aws_eks_access_entry.this["arn-aws-iam--211125506628-role-github-actions-terraform"]: Refreshing state... [id=gr-prod:arn:aws:iam::211125506628:role/github-actions-terraform]
module.eks.module.eks.aws_eks_access_entry.this["arn-aws-iam--211125506628-user-admin-cli"]: Refreshing state... [id=gr-prod:arn:aws:iam::211125506628:user/admin-cli]
module.eks.module.eks.time_sleep.this[0]: Refreshing state... [id=2026-06-01T10:46:10Z]
module.eks.module.eks.data.tls_certificate.this[0]: Reading...
module.eks.module.eks.data.tls_certificate.this[0]: Read complete after 0s [id=f97f646c2cd14cc0db0f757f0fccc96abbbe2af5]
module.eks.module.eks.aws_iam_openid_connect_provider.oidc_provider[0]: Refreshing state... [id=arn:aws:iam::211125506628:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/21512EE80634956C7C9D0B9647C70224]
module.eks.module.eks.aws_eks_access_policy_association.this["arn-aws-iam--211125506628-user-admin-cli_admin"]: Refreshing state... [id=gr-prod#arn:aws:iam::211125506628:user/admin-cli#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.eks.module.eks.aws_eks_access_policy_association.this["arn-aws-iam--211125506628-role-github-actions-terraform_admin"]: Refreshing state... [id=gr-prod#arn:aws:iam::211125506628:role/github-actions-terraform#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.argocd.helm_release.argocd: Refreshing state... [id=argocd]
module.mqtt.aws_eip.this[0]: Refreshing state... [id=eipalloc-000796d397533c6d8]
cloudflare_dns_record.gr_mqtt: Refreshing state... [id=53badd7d1ab83280dc57c671ee486f90]
module.clickhouse.aws_volume_attachment.data: Refreshing state... [id=vai-843739595]
module.clickhouse.aws_eip.this[0]: Refreshing state... [id=eipalloc-0970fced638b7d8d9]
cloudflare_dns_record.gr_clickhouse: Refreshing state... [id=d6f3d3c09db5c56703a7b107d6ac3f29]
module.postgres.aws_eip.this[0]: Refreshing state... [id=eipalloc-06d6c59b1e0a49482]
module.postgres.aws_volume_attachment.data: Refreshing state... [id=vai-3584522434]
cloudflare_dns_record.gr_postgres: Refreshing state... [id=a69d68353c27a536e84d7448af48e3f0]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

Terraform will perform the following actions:

  # module.mqtt.aws_instance.this will be updated in-place
  ~ resource "aws_instance" "this" {
        id                                   = "i-0bf98528bc8e9dab0"
      ~ public_dns                           = "ec2-184-33-5-10.us-west-2.compute.amazonaws.com" -> (known after apply)
      ~ public_ip                            = "184.33.5.10" -> (known after apply)
        tags                                 = {
            "Name" = "gr-mqtt"
            "Role" = "nanomq"
        }
      ~ user_data                            = (sensitive value)
        # (38 unchanged attributes hidden)

        # (9 unchanged blocks hidden)
    }

  # module.mqtt.random_password.mqtt_mapache will be created
  + resource "random_password" "mqtt_mapache" {
      + bcrypt_hash = (sensitive value)
      + id          = (known after apply)
      + length      = 32
      + lower       = true
      + min_lower   = 0
      + min_numeric = 0
      + min_special = 0
      + min_upper   = 0
      + number      = true
      + numeric     = true
      + result      = (sensitive value)
      + special     = false
      + upper       = true
    }

Plan: 1 to add, 1 to change, 0 to destroy.

Changes to Outputs:
  + mqtt_password_mapache     = (sensitive value)

─────────────────────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't
guarantee to take exactly these actions if you run "terraform apply" now.

@BK1031 BK1031 merged commit a815510 into main Jun 5, 2026
1 check passed
@BK1031 BK1031 deleted the bk1031/mqtt-mapache-user branch June 5, 2026 18:44
BK1031 added a commit that referenced this pull request Jun 5, 2026
The AWS provider's default for aws_instance.user_data is
user_data_replace_on_change = false. terraform applies a user_data
diff via ModifyInstanceAttribute, which *stores* the new user-data but
doesn't re-execute cloud-init — the new file lands only on the next
stop+start. So the apply for PR #70 (which added the mapache user)
updated state in place and never touched the running /etc/nanomq_pwd.conf.

Setting user_data_replace_on_change = true makes user-data a force-
replace attribute again, which is what the previous PR's removal of
user_data from lifecycle.ignore_changes was actually trying to achieve.

This change won't trigger anything on its own (state and code match
the running box now that the running box was manually patched via SSM).
The next legitimate user-data edit will trigger a clean instance
replacement instead of the silent no-op trap.
BK1031 added a commit that referenced this pull request Jun 5, 2026
The dedicated `gr26` MQTT user (PR #45) and the `mapache` fleet user
(PR #70) both exist on the gr-mqtt broker. Standardizing the in-cluster
gr26 service onto the fleet credential so all mapache services share a
single broker identity that rotates together — keeps the `gr26` user
free for the on-vehicle ingest path if we ever want it there.

- MQTT_USER: gr26 → mapache
- MQTT_PASSWORD secretKeyRef.key: MQTT_PASSWORD → MQTT_MAPACHE_PASSWORD
- mapache-secrets already has both keys populated (MQTT_PASSWORD untouched,
  MQTT_MAPACHE_PASSWORD added manually via `kubectl patch` before this PR)

Rollout: ArgoCD sync → new pods come up with the new env, paho connects
fresh under the mapache identity. Old gr26-user MQTT sessions hold their
TCP socket until pod termination then drop cleanly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant