Skip to content

fix(clickhouse-ec2): disable text_log and lower log level in provisioning#79

Merged
jacobjurek merged 1 commit into
mainfrom
fix/clickhouse-disable-noisy-logs
Jun 7, 2026
Merged

fix(clickhouse-ec2): disable text_log and lower log level in provisioning#79
jacobjurek merged 1 commit into
mainfrom
fix/clickhouse-disable-noisy-logs

Conversation

@jacobjurek

Copy link
Copy Markdown
Contributor

Summary

The hosted ClickHouse EC2 instance was provisioned with stock image defaults: trace log level, system.text_log enabled, and no TTL on the system log tables. On this box text_log alone grew ~155 GiB in ~2 days and filled the data volume, after which every INSERT failed with Code: 243 NOT_ENOUGH_SPACE — blocking the query service's writes (query_token on POST /query/token, query_log on every GET /query/signals). Reads were unaffected. Actual telemetry (signal/gr26_can/ping) was only ~2.2 GiB; ~99% of the disk was the database logging about itself.

The running server has been fixed by hand, but that fix lived only in the container's writable layer — config.d was never bind-mounted (only users.d and the data volume are), so it would silently revert on the next image bump (clickhouse_version) or instance replacement. This bakes the fix into provisioning so it can't recur on a rebuild.

Changes

infra/modules/clickhouse-ec2/user-data.sh.tftpl, mirroring the existing users.d pattern:

  • Write /etc/clickhouse-server/config.d/quiet-logs.xml:
    • <logger><level>information</level> — drops from the noisy stock trace (the root cause of the volume).
    • <text_log remove="1"/> — stop materializing the server log into a table entirely (the on-disk clickhouse-server.log still exists).
  • Bind-mount config.d (:ro) into the container so the override is applied.
  • chown 101:101 so the container's clickhouse uid can read it (no secrets here, so unlike admin.xml it stays world-readable rather than chmod 600).

Out of scope / follow-ups

  • System-log TTLs (system.query_log 7d, asynchronous_insert_log 3d, with ttl_only_drop_parts=1) are applied separately via SQL. They persist on the preserved EBS data volume, so they don't belong in user-data.
  • Disk-usage alarm: recommend a CloudWatch alarm on the data volume (~75%). This incident hit 100% silently because reads kept working — an alarm turns a future outage into a ticket. Needs the CloudWatch agent (default EC2 metrics don't include disk %).

Testing

Validated equivalent config on the live container: config.d/quiet-logs.xml merged correctly (preprocessed config shows <level>information</level> and no text_log block), the server restarted clean, and text_log stopped being written.

🤖 Generated with Claude Code

…ning

Stock ClickHouse ships at `trace` log level with system.text_log enabled
and no TTL. On the hosted instance text_log grew ~155 GiB in ~2 days and
filled the data volume, after which every INSERT failed with
NOT_ENOUGH_SPACE (Code 243) — query_token/query_log writes from the query
service were blocked; reads were unaffected.

The running server was fixed by hand, but that fix lived only in the
container's writable layer because config.d was never bind-mounted, so it
would not survive an image bump or instance replacement. Bake it into
user-data instead, mirroring the existing users.d pattern:

- Write /etc/clickhouse-server/config.d/quiet-logs.xml setting the logger
  level to `information` and removing text_log.
- Bind-mount config.d (:ro) into the container so the override is applied.
- chown 101:101 so the container's clickhouse uid can read it (no secrets,
  so no chmod 600 like admin.xml).

System-log table TTLs (system.query_log etc.) are applied separately via
SQL; they persist on the preserved EBS data volume.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jacobjurek jacobjurek requested a review from BK1031 as a code owner June 7, 2026 21:56
@github-actions

github-actions Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Terraform plan: prod

step result
fmt success
init success
validate success
plan success
plan output
module.mqtt.random_password.mqtt_mapache: Refreshing state... [id=none]
module.mqtt.random_password.mqtt: Refreshing state... [id=none]
module.mqtt.random_password.mqtt_tcm26: Refreshing state... [id=none]
module.postgres.random_password.postgres: Refreshing state... [id=none]
module.clickhouse.random_password.admin: Refreshing state... [id=none]
module.origin_cert.tls_private_key.this: Refreshing state... [id=4c5b5e0a4a4f13723ba2aebe888a7cb50529fcc0]
module.origin_cert.tls_cert_request.this: Refreshing state... [id=9398eb1d26e54eb59b18d476ced10810e80172e5]
data.cloudflare_zone.gauchoracing: Reading...
module.origin_cert.cloudflare_origin_ca_certificate.this: Refreshing state... [id=307819530070461722629184968406649377122389744032]
module.clickhouse.data.aws_ami.al2023_arm64: Reading...
module.mqtt.data.aws_ami.al2023_arm64: Reading...
module.eks.module.eks.data.aws_iam_policy_document.assume_role_policy[0]: Reading...
module.eks.module.eks.data.aws_iam_policy_document.node_assume_role_policy[0]: Reading...
module.eks.module.eks.module.kms.data.aws_partition.current[0]: Reading...
module.eks.module.eks.aws_cloudwatch_log_group.this[0]: Refreshing state... [id=/aws/eks/gr-prod/cluster]
module.eks.module.eks.module.kms.data.aws_caller_identity.current[0]: Reading...
module.clickhouse.aws_ebs_volume.data: Refreshing state... [id=vol-0e312e8d71875ec89]
module.vpc.module.vpc.aws_vpc.this[0]: Refreshing state... [id=vpc-06e13a97395396a3b]
module.eks.module.eks.data.aws_iam_policy_document.node_assume_role_policy[0]: Read complete after 0s [id=3518401652]
module.eks.module.eks.module.kms.data.aws_partition.current[0]: Read complete after 0s [id=aws]
module.eks.module.eks.data.aws_iam_policy_document.assume_role_policy[0]: Read complete after 0s [id=2830595799]
module.eks.module.eks.data.aws_caller_identity.current[0]: Reading...
module.postgres.aws_ebs_volume.data: Refreshing state... [id=vol-02b203218b57629b4]
module.origin_cert.aws_acm_certificate.this: Refreshing state... [id=arn:aws:acm:us-west-2:211125506628:certificate/d10d5205-6d4b-4798-a152-293c69174660]
module.eks.module.eks.module.kms.data.aws_caller_identity.current[0]: Read complete after 0s [id=211125506628]
module.postgres.data.aws_ami.al2023_arm64: Reading...
data.cloudflare_zone.gauchoracing: Read complete after 0s [id=5ac5ae9c6086e4b55c5e1b21ca963d94]
module.eks.module.eks.data.aws_partition.current[0]: Reading...
module.eks.module.eks.data.aws_partition.current[0]: Read complete after 0s [id=aws]
module.eks.module.eks.aws_iam_role.eks_auto[0]: Refreshing state... [id=gr-prod-eks-auto-20260601094833482500000004]
module.eks.module.eks.data.aws_caller_identity.current[0]: Read complete after 0s [id=211125506628]
module.eks.module.eks.aws_iam_role.this[0]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002]
module.eks.module.eks.data.aws_iam_session_context.current[0]: Reading...
cloudflare_ruleset.ssl_overrides: Refreshing state... [id=5a5ed8237d6f48418c172979ffe5da81]
module.eks.module.eks.data.aws_iam_session_context.current[0]: Read complete after 0s [id=arn:aws:sts::211125506628:assumed-role/github-actions-terraform/GitHubActions]
module.eks.module.eks.data.aws_iam_policy_document.custom[0]: Reading...
module.eks.module.eks.data.aws_iam_policy_document.custom[0]: Read complete after 0s [id=513122117]
module.eks.module.eks.aws_iam_role_policy_attachment.eks_auto["AmazonEC2ContainerRegistryPullOnly"]: Refreshing state... [id=gr-prod-eks-auto-20260601094833482500000004/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPullOnly]
module.eks.module.eks.aws_iam_role_policy_attachment.eks_auto["AmazonEKSWorkerNodeMinimalPolicy"]: Refreshing state... [id=gr-prod-eks-auto-20260601094833482500000004/arn:aws:iam::aws:policy/AmazonEKSWorkerNodeMinimalPolicy]
module.eks.module.eks.aws_iam_policy.custom[0]: Refreshing state... [id=arn:aws:iam::211125506628:policy/gr-prod-cluster-20260601094833480900000001]
module.eks.module.eks.module.kms.data.aws_iam_policy_document.this[0]: Reading...
module.eks.module.eks.module.kms.data.aws_iam_policy_document.this[0]: Read complete after 0s [id=922405470]
module.eks.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSLoadBalancingPolicy"]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::aws:policy/AmazonEKSLoadBalancingPolicy]
module.eks.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSNetworkingPolicy"]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::aws:policy/AmazonEKSNetworkingPolicy]
module.eks.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSBlockStoragePolicy"]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::aws:policy/AmazonEKSBlockStoragePolicy]
module.eks.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSClusterPolicy"]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::aws:policy/AmazonEKSClusterPolicy]
module.eks.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSComputePolicy"]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::aws:policy/AmazonEKSComputePolicy]
module.eks.module.eks.module.kms.aws_kms_key.this[0]: Refreshing state... [id=7768801a-b38a-4c26-8bc3-7bf6fe2aac86]
module.eks.module.eks.aws_iam_role_policy_attachment.custom[0]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::211125506628:policy/gr-prod-cluster-20260601094833480900000001]
module.mqtt.data.aws_ami.al2023_arm64: Read complete after 1s [id=ami-0a2a049c945b84826]
module.clickhouse.data.aws_ami.al2023_arm64: Read complete after 1s [id=ami-0a2a049c945b84826]
module.postgres.data.aws_ami.al2023_arm64: Read complete after 1s [id=ami-0a2a049c945b84826]
module.eks.module.eks.module.kms.aws_kms_alias.this["cluster"]: Refreshing state... [id=alias/eks/gr-prod]
module.eks.module.eks.aws_iam_policy.cluster_encryption[0]: Refreshing state... [id=arn:aws:iam::211125506628:policy/gr-prod-cluster-ClusterEncryption20260601094855072000000006]
module.eks.module.eks.aws_iam_role_policy_attachment.cluster_encryption[0]: Refreshing state... [id=gr-prod-cluster-20260601094833481300000002/arn:aws:iam::211125506628:policy/gr-prod-cluster-ClusterEncryption20260601094855072000000006]
module.vpc.module.vpc.aws_default_route_table.default[0]: Refreshing state... [id=rtb-08f817bde5f65eb92]
module.vpc.module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-0c18db918f54ad033]
module.mqtt.aws_security_group.this: Refreshing state... [id=sg-0f5a2dc492283dafe]
module.vpc.module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0182a0562244a6fac]
module.clickhouse.aws_security_group.this: Refreshing state... [id=sg-0dee964416e7aaeb5]
module.vpc.module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-0a5a8299f6da9bc59]
module.vpc.module.vpc.aws_default_security_group.this[0]: Refreshing state... [id=sg-0a592b2169fd42df8]
module.vpc.module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-0264aaaa19faa70f5]
module.postgres.aws_security_group.this: Refreshing state... [id=sg-08a5b2e02e0540520]
module.vpc.module.vpc.aws_default_network_acl.this[0]: Refreshing state... [id=acl-0fd92b5b8eb95b2f9]
module.vpc.module.vpc.aws_internet_gateway.this[0]: Refreshing state... [id=igw-0ace83040de106603]
module.vpc.module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-09fbaccd0b3aaab85]
module.vpc.module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-022e58c410c24d794]
module.vpc.module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-06540460bfd7d06a2]
module.eks.module.eks.aws_security_group.cluster[0]: Refreshing state... [id=sg-0cac44db03a686436]
module.vpc.module.vpc.aws_route_table.public[0]: Refreshing state... [id=rtb-0815845194166b58b]
module.eks.module.eks.aws_security_group.node[0]: Refreshing state... [id=sg-0b19db83dbe18cbf1]
module.vpc.module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-015b9b6ae09534761]
module.mqtt.aws_security_group_rule.ingress_cidr[0]: Refreshing state... [id=sgrule-1294033340]
module.postgres.aws_security_group_rule.ingress_cidr[0]: Refreshing state... [id=sgrule-3721507580]
module.clickhouse.aws_security_group_rule.ingress_cidr["8123"]: Refreshing state... [id=sgrule-467233960]
module.clickhouse.aws_security_group_rule.ingress_cidr["9000"]: Refreshing state... [id=sgrule-2026118668]
module.clickhouse.aws_instance.this: Refreshing state... [id=i-0f862cc6460b5d98a]
module.postgres.aws_instance.this: Refreshing state... [id=i-013aab40e28a0b6b7]
module.mqtt.aws_instance.this: Refreshing state... [id=i-0bf98528bc8e9dab0]
module.vpc.module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-0c836864d0eccc7fc]
module.vpc.module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-07ea61af9805b07cb]
module.vpc.module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-0b736817e30c38444]
module.vpc.module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-01d455080d37c3108]
module.vpc.module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-09913d53f1ff40442]
module.vpc.module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-041ffe3137ec2ff7a]
module.vpc.module.vpc.aws_route.public_internet_gateway[0]: Refreshing state... [id=r-rtb-0815845194166b58b1080289494]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_9443_webhook"]: Refreshing state... [id=sgrule-3115694346]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_kubelet"]: Refreshing state... [id=sgrule-2079615841]
module.eks.module.eks.aws_security_group_rule.node["ingress_nodes_ephemeral"]: Refreshing state... [id=sgrule-504933240]
module.eks.module.eks.aws_security_group_rule.node["ingress_self_coredns_udp"]: Refreshing state... [id=sgrule-4200159001]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_4443_webhook"]: Refreshing state... [id=sgrule-118149494]
module.eks.module.eks.aws_security_group_rule.node["egress_all"]: Refreshing state... [id=sgrule-4004824215]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_10251_webhook"]: Refreshing state... [id=sgrule-1337801498]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_6443_webhook"]: Refreshing state... [id=sgrule-3460871904]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_8443_webhook"]: Refreshing state... [id=sgrule-3709110977]
module.eks.module.eks.aws_security_group_rule.node["ingress_cluster_443"]: Refreshing state... [id=sgrule-500562133]
module.eks.module.eks.aws_security_group_rule.node["ingress_self_coredns_tcp"]: Refreshing state... [id=sgrule-1577514230]
module.eks.module.eks.aws_security_group_rule.cluster["ingress_nodes_443"]: Refreshing state... [id=sgrule-535925259]
module.vpc.module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-019992cce709b8681]
module.postgres.aws_security_group_rule.ingress_sg["sg-0b19db83dbe18cbf1"]: Refreshing state... [id=sgrule-1130224857]
module.mqtt.aws_security_group_rule.ingress_sg["sg-0b19db83dbe18cbf1"]: Refreshing state... [id=sgrule-1062873908]
module.clickhouse.aws_security_group_rule.ingress_sg["sg-0b19db83dbe18cbf1-8123"]: Refreshing state... [id=sgrule-4177740833]
module.clickhouse.aws_security_group_rule.ingress_sg["sg-0b19db83dbe18cbf1-9000"]: Refreshing state... [id=sgrule-2466486093]
module.vpc.module.vpc.aws_route.private_nat_gateway[0]: Refreshing state... [id=r-rtb-0c18db918f54ad0331080289494]
module.eks.module.eks.aws_eks_cluster.this[0]: Refreshing state... [id=gr-prod]
module.eks.module.eks.aws_eks_access_entry.this["arn-aws-iam--211125506628-role-github-actions-terraform"]: Refreshing state... [id=gr-prod:arn:aws:iam::211125506628:role/github-actions-terraform]
module.eks.module.eks.data.tls_certificate.this[0]: Reading...
module.eks.module.eks.aws_eks_access_entry.this["arn-aws-iam--211125506628-user-admin-cli"]: Refreshing state... [id=gr-prod:arn:aws:iam::211125506628:user/admin-cli]
module.eks.module.eks.time_sleep.this[0]: Refreshing state... [id=2026-06-01T10:46:10Z]
module.eks.module.eks.data.tls_certificate.this[0]: Read complete after 0s [id=f97f646c2cd14cc0db0f757f0fccc96abbbe2af5]
module.eks.module.eks.aws_iam_openid_connect_provider.oidc_provider[0]: Refreshing state... [id=arn:aws:iam::211125506628:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/21512EE80634956C7C9D0B9647C70224]
module.eks.module.eks.aws_eks_access_policy_association.this["arn-aws-iam--211125506628-user-admin-cli_admin"]: Refreshing state... [id=gr-prod#arn:aws:iam::211125506628:user/admin-cli#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.eks.module.eks.aws_eks_access_policy_association.this["arn-aws-iam--211125506628-role-github-actions-terraform_admin"]: Refreshing state... [id=gr-prod#arn:aws:iam::211125506628:role/github-actions-terraform#arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy]
module.mqtt.aws_eip.this[0]: Refreshing state... [id=eipalloc-000796d397533c6d8]
module.argocd.helm_release.argocd: Refreshing state... [id=argocd]
cloudflare_dns_record.gr_mqtt: Refreshing state... [id=53badd7d1ab83280dc57c671ee486f90]
module.clickhouse.aws_volume_attachment.data: Refreshing state... [id=vai-843739595]
module.clickhouse.aws_eip.this[0]: Refreshing state... [id=eipalloc-0970fced638b7d8d9]
module.postgres.aws_volume_attachment.data: Refreshing state... [id=vai-3584522434]
module.postgres.aws_eip.this[0]: Refreshing state... [id=eipalloc-06d6c59b1e0a49482]
cloudflare_dns_record.gr_clickhouse: Refreshing state... [id=d6f3d3c09db5c56703a7b107d6ac3f29]
cloudflare_dns_record.gr_postgres: Refreshing state... [id=a69d68353c27a536e84d7448af48e3f0]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

@BK1031

BK1031 commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

You'll probably need to manually make the edit on the ec2

@jacobjurek jacobjurek merged commit 51eb090 into main Jun 7, 2026
1 check passed
@jacobjurek jacobjurek deleted the fix/clickhouse-disable-noisy-logs branch June 7, 2026 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants