From 60db49a2c8f44423d1a8a38b518016a3987ca0d2 Mon Sep 17 00:00:00 2001
From: Komh
Date: Sat, 30 May 2026 05:07:09 +0000
Subject: [PATCH] [configure] Capturing kube-scheduler decisions on Alauda Container Platform

Rerun (batch7, 2026-05-30): full 8-phase pipeline on lab-base. terminal_route=convert_adapted.
---
 ...by_Raising_kube_scheduler_Log_Verbosity.md | 146 ++++++++++++++++++
 1 file changed, 146 insertions(+)
 create mode 100644 docs/en/solutions/Capturing_Scheduler_Decisions_by_Raising_kube_scheduler_Log_Verbosity.md

diff --git a/docs/en/solutions/Capturing_Scheduler_Decisions_by_Raising_kube_scheduler_Log_Verbosity.md b/docs/en/solutions/Capturing_Scheduler_Decisions_by_Raising_kube_scheduler_Log_Verbosity.md
new file mode 100644
index 00000000..e00719da
--- /dev/null
+++ b/docs/en/solutions/Capturing_Scheduler_Decisions_by_Raising_kube_scheduler_Log_Verbosity.md
@@ -0,0 +1,146 @@
+---
+title: Capturing kube-scheduler decisions on Alauda Container Platform
+component: observability
+scenario: how-to
+tags: [kube-scheduler, kubelet, control-plane, leader-election, logging]
+date_created: 2026-05-30
+date_updated: 2026-05-30
+---
+
+# Capturing kube-scheduler decisions on Alauda Container Platform
+
+## Issue
+
+When pods churn across nodes during scale-out, drain, or eviction, the
+question is usually "why did the scheduler pick that node?" The answer
+lives in the `kube-scheduler` container log on the leader replica. On
+Alauda Container Platform the `kube-scheduler` runs as a kubeadm-style
+static pod named `kube-scheduler-` in the `kube-system`
+namespace (one per control-plane node, owned by the `Node` object and
+annotated `kubernetes.io/config.source=file`) [ev:c1]. Operators familiar
+with the upstream pattern still need the ACP-specific locations and
+verbosity knob to read the same diagnostic signal.
+
+## Root Cause
+
+At any moment only a single `kube-scheduler` replica is active: the
+holder of the `kube-system/kube-scheduler` Lease (API version
+`coordination.k8s.io/v1`). The other replicas stand by until the lease
+expires (upstream defaults: 15 s lease duration, 10 s renew deadline,
+2 s retry period) [ev:c2_a]. Per-pod scheduling-decision log lines are emitted
+only by the current leader's container, so log collection that fans out
+across all replicas without filtering on the lease holder looks sparse
+on non-leader pods even during heavy scheduling load [ev:c2_b].
+
+The kube-scheduler binary that ships on ACP
+(`registry.alauda.cn:60080/tkestack/kube-scheduler:v1.34.5-1`) gates
+each diagnostic line behind a klog verbosity level. At the default
+verbosity, the binary emits `Successfully bound pod to node` from
+`schedule_one.go` once a binding completes, plus errors and lifecycle
+events, and nothing else per pod [ev:c4]. Filter/score predicate outcomes,
+the candidate-node trace, and the `About to try and schedule pod` /
+`Attempting to bind pod to node` events live at higher verbosity levels
+in the same binary, and are silent unless the verbosity flag is raised
+[ev:c5].
+
+## Resolution
+
+The kube-scheduler verbosity is controlled by the `--v=N` flag passed
+to the `kube-scheduler` binary. On ACP the static-pod manifest lives at
+`/etc/kubernetes/manifests/kube-scheduler.yaml` on each control-plane
+node; the kubelet watches that directory and restarts the container
+whenever the manifest file changes, so editing the manifest is enough
+for the change to take effect; no operator reconciles this path.
+To raise verbosity for an investigation, append a `--v=N` entry to
+`spec.containers[0].command` on each control-plane node [ev:c1]:
+
+```yaml
+spec:
+  containers:
+  - command:
+    - kube-scheduler
+    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
+    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
+    - --kubeconfig=/etc/kubernetes/scheduler.conf
+    - --leader-elect=true
+    - --config=/etc/kubernetes/scheduler-config.yaml
+    - --profiling=false
+    - --v=4  # add this line
+    image: registry.alauda.cn:60080/tkestack/kube-scheduler:v1.34.5-1
+```
+
+The mapping between the verbosity number and the lines that become
+visible follows the upstream klog convention baked into the binary:
+`--v=2` (default) shows successful binding; `--v=3` adds the
+`Attempting to bind pod to node` lines; `--v=4` adds the `About to try
+and schedule pod` lines and per-plugin filter/score detail [ev:c5].
+
+After the investigation, remove the `--v=N` line from the manifest on
+each control-plane node; the kubelet then restarts the container with
+the original flag set, returning to default verbosity. Because each
+manifest is edited independently per node, repeat the change on every
+control-plane node that runs a `kube-scheduler` static pod [ev:c1].
+
+## Diagnostic Steps
+
+Locate the `kube-scheduler` pods and identify the current leader.
+The pod naming on ACP follows the kubeadm convention
+`kube-scheduler-` in `kube-system`, with one static
+pod per control-plane node [ev:c1]:
+
+```bash
+kubectl get pod -n kube-system -l component=kube-scheduler -o wide
+```
+
+The active leader holds the `kube-system/kube-scheduler` Lease, and the
+holder identity in the Lease object is the authoritative source for
+which replica is currently leading [ev:c2_a]:
+
+```bash
+kubectl get lease -n kube-system kube-scheduler \
+  -o jsonpath='{.spec.holderIdentity}{"\n"}'
+```
+
+The leader replica's container log also carries a one-shot
+`leaderelection.go:271] successfully acquired lease
+kube-system/kube-scheduler` line written when it wins the election,
+which serves as a secondary confirmation when log retention reaches
+back to the lease acquisition [ev:c3]:
+
+```bash
+kubectl logs -n kube-system \
+  | grep -i 'successfully acquired lease kube-system/kube-scheduler'
+```
+
+Once the leader is known, search its log for the per-pod decision events.
+At default verbosity, look for `Successfully bound pod to node` lines
+from `schedule_one.go`; after raising verbosity to `--v=3` or `--v=4`,
+additional `Attempting to bind pod to node` (from `default_binder.go`)
+and `About to try and schedule pod` lines surface [ev:c4][ev:c5]:
+
+```bash
+kubectl logs -n kube-system \
+  | grep -E 'schedule_one\.go|default_binder\.go'
+```
+
+The log format is the standard klog structured form with a source-file
+annotation (for example `schedule_one.go:346`) followed by a message
+and key=value pairs including `pod="/"` and
+`node=""`, matching the per-pod decision template typical
+of upstream kube-scheduler [ev:c4].
+
+If a pod stays in `Pending` and the `kube-scheduler` leader log shows
+no related activity for that pod, the cause is not a scheduling
+decision: a successful bind emits the `Scheduled` Event
+(`Reason=Scheduled`, `Message=Successfully assigned / to
+`) from the scheduler side, whereas pod-lifecycle failures such as
+`ImagePullBackOff` or volume-mount errors are downstream of the
+scheduler and surface as kubelet events rather than scheduler log
+entries. Investigate those paths separately rather than continuing to
+chase the scheduler log [ev:c8]:
+
+```bash
+kubectl describe pod -n
+kubectl get events -n \
+  --field-selector involvedObject.name=
+```
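
The klog line shape described under Diagnostic Steps (source-file annotation, message, then `pod=` and `node=` pairs) lends itself to a quick field extraction. The following is a minimal sketch and is not part of the patched file. `extract_binding` is a hypothetical helper name, and the sample line is hand-written to match the documented format rather than captured from a cluster.

```bash
#!/bin/sh
# Sketch: pull the pod= and node= values out of klog-style scheduler lines.
# extract_binding is a hypothetical helper, not part of kubectl or klog.
extract_binding() {
  # Reads log lines on stdin and prints "<pod> <node>" for every line that
  # carries both a pod="..." and a node="..." pair; other lines are dropped.
  sed -n 's/.*pod="\([^"]*\)".*node="\([^"]*\)".*/\1 \2/p'
}

# Illustrative input only: written by hand to follow the format above.
sample='I0530 05:07:09.000000       1 schedule_one.go:346] "Successfully bound pod to node" pod="demo/web-0" node="worker-1"'

printf '%s\n' "$sample" | extract_binding
# prints: demo/web-0 worker-1
```

In practice the leader pod's `kubectl logs` output would be piped into the function in place of the sample line. Lines that lack either key, such as the lease-acquisition message, produce no output.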