feat(k8s): add k8s_wait_for_condition tool #58

Open
mesutoezdil wants to merge 42 commits into kagent-dev:main from mesutoezdil:feat/k8s-wait-for-condition

@mesutoezdil mesutoezdil commented May 7, 2026

Adds a new MCP tool k8s_wait_for_condition that wraps kubectl wait and blocks until a Kubernetes resource reaches a specified condition or the timeout expires.

Agents that deploy resources currently have to poll with repeated kubectl get calls in a loop. Each iteration is a full LLM turn, wasting tokens and adding latency.

With this tool, a single blocking call replaces the loop:

Before:

[turn 1] kgp -n default   -> Pending
[turn 2] kgp -n default   -> Pending
[turn 3] kgp -n default   -> Running

After:

[turn 1] k8s_wait_for_condition deployment/myapp condition=Available -> condition met

| Parameter | Required | Default | Description |
|---|---|---|---|
| resource_type | yes | | deployment, pod, job, etc. |
| resource_name | yes | | Name of the resource |
| condition | yes | | Available, Ready, Complete, etc. |
| namespace | no | default | Namespace of the resource |
| timeout_seconds | no | 60 | Max wait time in seconds |

Seven unit tests cover: success path, custom namespace and timeout, missing required parameters, zero timeout rejection, and kubectl timeout propagation.
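
As a rough illustration of how the parameters in the description could map onto a `kubectl wait` invocation, here is a minimal sketch. The `WaitParams` struct and `buildWaitArgs` function are hypothetical names, not taken from this PR's code; rejecting negative timeouts is an assumption that goes beyond the zero-timeout case mentioned above. Only the `--for=condition=...`, `-n`, and `--timeout` flags of `kubectl wait` are used.

```go
package main

import (
	"errors"
	"fmt"
)

// WaitParams is a hypothetical struct mirroring the tool's parameters.
type WaitParams struct {
	ResourceType   string
	ResourceName   string
	Condition      string
	Namespace      string // optional; empty means "default"
	TimeoutSeconds *int   // optional; nil means 60
}

// buildWaitArgs validates the parameters and returns the argument list
// for a `kubectl wait` call (without the leading "kubectl").
func buildWaitArgs(p WaitParams) ([]string, error) {
	if p.ResourceType == "" || p.ResourceName == "" || p.Condition == "" {
		return nil, errors.New("resource_type, resource_name and condition are required")
	}
	ns := p.Namespace
	if ns == "" {
		ns = "default"
	}
	timeout := 60
	if p.TimeoutSeconds != nil {
		// Assumption: zero and negative values are both rejected.
		if *p.TimeoutSeconds <= 0 {
			return nil, errors.New("timeout_seconds must be greater than zero")
		}
		timeout = *p.TimeoutSeconds
	}
	return []string{
		"wait", p.ResourceType + "/" + p.ResourceName,
		"--for=condition=" + p.Condition,
		"-n", ns,
		fmt.Sprintf("--timeout=%ds", timeout),
	}, nil
}

func main() {
	args, err := buildWaitArgs(WaitParams{
		ResourceType: "deployment",
		ResourceName: "myapp",
		Condition:    "Available",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(args)
	// prints: [wait deployment/myapp --for=condition=Available -n default --timeout=60s]
}
```

Returning an argument slice rather than a shell string keeps each value a separate argv entry, which avoids shell quoting issues when the slice is later passed to a command runner.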

Closes #56

dimetron and others added 30 commits July 7, 2025 17:35
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
* fix all linter errors
* add buildx

---------

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
- added telemetry
- security validations
- structured logging
- e2e tests

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
* - 🐛 Fix stdio implementation
- 🚀 Add quickstart guide for agentgateway
- 📝 Update cursor MCP documentation

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* - 🐛 Fix stdio implementation
- 🚀 Add quickstart guide for agentgateway
- 📝 Update cursor MCP documentation

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* add homebrew path

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* increase default timeout

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* quickstart updated

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

---------

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
* set json format optional

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* set json format optional

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* set json format optional

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

---------

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
* updated dependencies

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* ci go-version: "1.25"

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* fix agentgateway config

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

---------

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
Signed-off-by: Sara Qasmi <saraqasmi@Saras-MacBook-Pro.local>
Co-authored-by: Sara Qasmi <saraqasmi@Saras-MacBook-Pro.local>
* dependencies update

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* readme

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* go mod

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* check latest GO version

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* actions/setup-go@v6

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* actions/setup-go@v6

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

---------

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
* TOOLS_ISTIO_VERSION ?= 1.28.3
TOOLS_KUBECTL_VERSION ?= 1.35.0
TOOLS_HELM_VERSION ?= 4.1.0
TOOLS_CILIUM_VERSION ?= 0.19.0

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* helm-unittest install --verify=false

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

---------

Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>
* Add Kubescape integration
- Introduced Kubescape tool support, including registration of various tools for health checks, vulnerability manifests, and configuration scans.
- Implemented specific error handling for Kubescape-related operations, providing detailed suggestions based on error types.

Signed-off-by: Ben <ben@armosec.io>

* Enhance Kubescape tool by adding runtime observability features

- Introduced checks for ApplicationProfiles and NetworkNeighborhoods CRDs in health checks, with corresponding recommendations for enabling runtime observability.
- Added handlers for listing and retrieving ApplicationProfiles and NetworkNeighborhoods, capturing runtime behavior and network communication patterns of workloads.

Signed-off-by: Ben <ben@armosec.io>

* Fix linter errors: remove unused SBOM functions and suppress deprecated test warnings

Signed-off-by: Ben <ben@armosec.io>

* ci: increase golangci-lint timeout to 5m to prevent context deadline errors
Signed-off-by: Ben <ben@armosec.io>

* Updating timeouts for golint
Signed-off-by: Ben <ben@armosec.io>

---------

Signed-off-by: Ben <ben@armosec.io>
…ent-dev#43)

* feat(helm): add enabledTools and extraArgs configuration options

Add support for configuring tool-server CLI arguments via Helm values:

- `tools.enabledTools`: List of tool providers to enable (maps to --tools flag)
- `tools.extraArgs`: Additional command-line arguments for future flags

Example usage:
```yaml
tools:
  enabledTools:
    - k8s
    - helm
    - prometheus
  extraArgs:
    - "--some-future-flag"
```

This is a non-breaking change - empty lists (default) preserve current behavior.

Signed-off-by: Matteo Mori <matteo.mori@rvu.co.uk>

* refactor(helm): rename tools.extraArgs to tools.args

Simplify the Helm values key name for additional CLI arguments.

Signed-off-by: Matteo Mori <matteo.mori@rvu.co.uk>

---------

Signed-off-by: Matteo Mori <matteo.mori@rvu.co.uk>
…ev#41)

* feat: add --read-only flag to disable write operations

Add a new `--read-only` CLI flag that disables tools which perform
write operations (delete, patch, scale, create, apply, etc.).

This enables deploying the MCP server in read-only mode for:
- Observability-only use cases (monitoring, troubleshooting)
- Environments with read-only service accounts
- Compliance requirements separating read/write capabilities

Tools are categorized as read-only or write operations:
- K8s: 8 read-only, 14 write tools
- Helm: 3 read-only, 3 write tools
- Istio: 9 read-only, 3 write tools
- Cilium: ~25 read-only, ~15 write tools
- Argo: 4 read-only, 4 write tools
- Prometheus/Kubescape/Utils: all read-only (unchanged)

Signed-off-by: Matteo Mori <matteo.mori@rvu.co.uk>

* fix: disable shell tool in read-only mode

The utils provider exposes a `shell` tool that executes arbitrary
commands, bypassing read-only restrictions. In read-only mode, this
tool is now disabled.

Also pass readOnly to all providers (kubescape, prometheus, utils)
for consistency with the existing providers.

Signed-off-by: Matteo Mori <matteo.mori@rvu.co.uk>
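
The read-only filtering described in these two commits can be sketched roughly as follows. This is an illustrative model only: the `tool` struct, the `registerable` function, and the tool names other than `shell` are hypothetical, and the real providers may categorize and register tools differently.

```go
package main

import "fmt"

// tool is a hypothetical descriptor; Write marks tools that mutate state
// or, like shell, can run arbitrary commands.
type tool struct {
	Name  string
	Write bool
}

// registerable returns the names of tools that should be registered.
// In read-only mode every tool flagged as Write is skipped.
func registerable(all []tool, readOnly bool) []string {
	var names []string
	for _, t := range all {
		if readOnly && t.Write {
			continue
		}
		names = append(names, t.Name)
	}
	return names
}

func main() {
	all := []tool{
		{Name: "k8s_get_resources"},
		{Name: "k8s_scale", Write: true},
		{Name: "shell", Write: true}, // arbitrary commands bypass read-only
	}
	fmt.Println(registerable(all, true))  // prints: [k8s_get_resources]
	fmt.Println(registerable(all, false)) // prints: [k8s_get_resources k8s_scale shell]
}
```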

---------

Signed-off-by: Matteo Mori <matteo.mori@rvu.co.uk>
MatteoMori8 and others added 11 commits February 12, 2026 12:52
…gent-dev#44)

Upgrade all Go dependencies to latest versions and bump bundled CLI tools
(kubectl 1.35.1, helm 4.1.1) to address HIGH severity vulnerabilities
flagged by security scanning.

Pin kubescape/storage to v0.0.239 (latest compatible release) as v0.2.0
removed APIs we depend on.

8 remaining HIGHs cannot be addressed as they originate from upstream
pre-compiled binaries (istioctl 1.28.3, kubectl-argo-rollouts 1.8.3)
which are already at their latest releases:

  ✅ TOOLS_ARGO_ROLLOUTS_VERSION=1.8.3 == v1.8.3
  ✅ TOOLS_CILIUM_VERSION=0.19.0 == v0.19.0
  ✅ TOOLS_ISTIO_VERSION=1.28.3 == 1.28.3
  ❌ TOOLS_HELM_VERSION=4.1.0 != v4.1.1       (bumped)
  ❌ TOOLS_KUBECTL_VERSION=1.35.0 != v1.35.1   (bumped)

Signed-off-by: Matteo Mori <matteo.mori@rvu.co.uk>
* feat: add token support for kubectl commands

Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>

* use pre-v4 helm version

Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>

* Add configuration to disable service token automount

Signed-off-by: Jeremy Alvis <jeremy.alvis@solo.io>

* Remove automountServiceAccountToken config

Signed-off-by: Jeremy Alvis <jeremy.alvis@solo.io>

* helm config for using default service account

Signed-off-by: Jeremy Alvis <jeremy.alvis@solo.io>

* Add tools.k8s.tokenPassthrough for requiring token from auth header

Signed-off-by: Jeremy Alvis <jeremy.alvis@solo.io>

* Fix helm version

Signed-off-by: Jeremy Alvis <jeremy.alvis@solo.io>

* Remove automountServiceAccountToken from helm test

Signed-off-by: Jeremy Alvis <jeremy.alvis@solo.io>

* Redact tokens

Signed-off-by: Jeremy Alvis <jeremy.alvis@solo.io>

---------

Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Signed-off-by: Jeremy Alvis <jeremy.alvis@solo.io>
Co-authored-by: Jeremy Alvis <jeremy.alvis@solo.io>
* feat(metrics): implement Prometheus observability with dedicated server

Replace generateRuntimeMetrics() with prometheus/client_golang and add
flexible metrics server architecture supporting same-port or dedicated
port deployment.

Changes:
- Add internal/metrics package with custom Prometheus registry
- Configurable metrics port via --metrics-port flag (default: 8084)
- Two-server architecture with proper WaitGroup coordination
- Graceful shutdown for both main and metrics servers
- Export kagent_tools_mcp_server_info (version metadata)
- Export kagent_tools_mcp_registered_tools (tool providers)
- Include Go runtime metrics (goroutines, memory, GC stats)
- Include process metrics (CPU, memory, file descriptors)

Architecture improvement: Move http.Server instantiation outside
goroutines to prevent race condition between assignment and shutdown.

Test coverage: 5 unit tests validating registry, collectors, and metrics.

Signed-off-by: MatteoMori <morimatteo14@gmail.com>

* feat(metrics): auto-register tool metrics using ListTools() diff

Use MCPServer.ListTools() to automatically detect which tools each
provider registers, eliminating the need to modify individual tool
packages.

The approach snapshots the tool list before and after each provider's
RegisterTools() call, then records the newly added tools in Prometheus
with the correct tool_provider label.

This means:
- Zero changes required in any pkg/ file
- Future tools are automatically tracked
- No risk of forgetting to add a metric for a new tool

Signed-off-by: MatteoMori <morimatteo14@gmail.com>
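
The before/after snapshot approach in this commit amounts to a set difference over tool names. The sketch below models each snapshot as a plain string slice; it does not assume anything about the actual return type of `MCPServer.ListTools()`, and the `newTools` function and the tool names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// newTools returns the names present in after but not in before, sorted.
// These are the tools a provider added during its RegisterTools() call.
func newTools(before, after []string) []string {
	seen := make(map[string]struct{}, len(before))
	for _, name := range before {
		seen[name] = struct{}{}
	}
	var added []string
	for _, name := range after {
		if _, ok := seen[name]; !ok {
			added = append(added, name)
		}
	}
	sort.Strings(added)
	return added
}

func main() {
	before := []string{"helm_list"}                                 // snapshot before the provider registers
	after := []string{"helm_list", "k8s_scale", "k8s_get_resources"} // snapshot after
	fmt.Println(newTools(before, after))
	// prints: [k8s_get_resources k8s_scale]
}
```

Each name in the result would then be recorded with the provider's `tool_provider` label.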

* feat(metrics): instrument tool handlers with invocation counters

Add kagent_tools_mcp_invocations_total and
kagent_tools_mcp_invocations_failure_total counters using the
wrapper/middleware pattern. All handlers are centrally instrumented
in wrapToolHandlersWithMetrics with zero changes to pkg/ files.
Update README with Observability section and CLI flags reference.

Signed-off-by: MatteoMori <morimatteo14@gmail.com>

* feat(observability): add Helm chart support and Grafana dashboard

Add comprehensive Prometheus Operator integration via Helm chart:
- ServiceMonitor resource for automatic target discovery
- Dedicated metrics service (kagent-tools-metrics)
- Deployment args for --metrics-port configuration
- Configurable scrape interval, timeout, and labels

Include Grafana dashboard with 8 panels visualizing:
- Server version and health metrics
- Tool invocation rates by provider
- Success/failure rates and trends
- Top invoked tools table with heat mapping

Add CLAUDE.md with architecture documentation covering:
- Tool provider pattern and MCP server lifecycle
- Observability architecture (metrics wrapper pattern)
- Development commands and key implementation patterns
- Helm chart structure and troubleshooting guide

Signed-off-by: MatteoMori <morimatteo14@gmail.com>

* fix(metrics): default metrics-port to 0 (same as --port)

Previously --metrics-port defaulted to 8084, causing a mismatch when
the server ran on any other port (e.g. E2E tests use port 18190). The
metrics server would start on 8084 instead of sharing the main port,
so /metrics was unreachable at the expected address.

Change the default to 0, resolved at runtime as "same as --port".
Update Helm templates to fall back to the main targetPort when
tools.metrics.port is unset.

Signed-off-by: MatteoMori <morimatteo14@gmail.com>
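
The "0 means same as --port" resolution described in this commit could look roughly like the sketch below. The `resolveMetricsPort` function name is hypothetical, and 9090 is an arbitrary example value; 18190 is the E2E port mentioned in the commit message.

```go
package main

import "fmt"

// resolveMetricsPort returns the port the metrics endpoint should use.
// A metricsPort of 0 means "share the main server port".
func resolveMetricsPort(mainPort, metricsPort int) int {
	if metricsPort == 0 {
		return mainPort
	}
	return metricsPort
}

func main() {
	fmt.Println(resolveMetricsPort(18190, 0))    // prints: 18190
	fmt.Println(resolveMetricsPort(18190, 9090)) // prints: 9090
}
```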
* fix(metrics): count result.IsError as invocation failure

The failure counter previously only incremented on non-nil Go errors.
Handlers in this codebase signal tool-level failures by returning
NewToolResultError(...), nil — result.IsError=true, err=nil — a pattern
used 214 times across pkg/. This meant the failure metric was always 0
for tool-level errors.

Fix the wrapper condition to check both:
  err != nil || (result != nil && result.IsError)

Add three tests in cmd/metrics_wrap_test.go:
  - IsError=true increments failure counter (regression test)
  - Successful call does not increment failure counter
  - Real Go error increments failure counter

Remove CLAUDE.md from the repository.

Signed-off-by: MatteoMori <morimatteo14@gmail.com>
---------

Signed-off-by: MatteoMori <morimatteo14@gmail.com>
…rade (kagent-dev#47)

* fix(helm): use fullname in selector labels to prevent mismatch on upgrade

Use kagent.fullname instead of kagent.name in selectorLabels so that
changing nameOverride does not alter the app.kubernetes.io/name selector
label. Deployment spec.selector.matchLabels is immutable in Kubernetes,
so any label change causes a Service/Deployment selector mismatch after
helm upgrade, leaving the Service with zero endpoints.

With this fix, both the old config (fullnameOverride: kagent-tools) and
the new config (nameOverride: tools) resolve to the same fullname
"kagent-tools" for the default release name, keeping selectors stable
across upgrades.

Fixes kagent-dev/kagent#1427

Signed-off-by: Jaison Paul <paul.jaison@gmail.com>

* fix(e2e): update label selectors to match fullname-based selector labels

Update E2E tests to use app.kubernetes.io/instance label selector instead of
app.kubernetes.io/name since the PR changes selectorLabels to use kagent.fullname.

The fullname template returns the release name (kagent-tools-e2e), so the tests
now use app.kubernetes.io/instance=<releaseName> which remains stable and matches
the updated selector labels in the Helm chart.

This fixes the E2E test failures where pods weren't being found because the
label selector no longer matched after the selectorLabels change.

Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>

---------

Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Eitan Yarmush <eitan.yarmush@solo.io>
)

Renames all helper templates from kagent.* to kagent-tools.* prefix to
prevent naming conflicts with the parent kagent chart. When Helm renders
subcharts, template definitions are global, causing the parent chart's
helpers to override the subchart's helpers with the same names.

This fixes:
- Selector label mismatch when using nameOverride (was using parent's
  logic instead of subchart's fullname logic)
- Helm upgrade failures due to immutable selector field changes
- Enables proper use of nameOverride instead of requiring
  fullnameOverride workaround

All helper references updated across all template files:
- _helpers.tpl: Renamed 10 helper definitions
- deployment.yaml, service.yaml, serviceaccount.yaml: Updated references
- clusterrole.yaml, clusterrolebinding.yaml: Updated references
- servicemonitor.yaml, NOTES.txt: Updated references

Backward compatible: existing fullnameOverride usage continues to work.

Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
…resource (kagent-dev#50)

Signed-off-by: Felipe Vicens <felipejose.vicensgonzalez@telefonica.com>
…t-dev#52)

Bump google.golang.org/grpc v1.78.0 -> v1.79.3 to fix CRITICAL
CVE-2026-33186 (authorization bypass). Bump all bundled CLI tools
to latest releases (kubectl 1.35.3, helm 4.1.3, istioctl 1.28.5,
argo-rollouts 1.8.4, cilium 0.19.2) to reduce CVE surface area.

Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
* namespaced rbac

Signed-off-by: Jet Chiang <pokyuen.jetchiang-ext@solo.io>

* oops forgot i renamed it

Signed-off-by: Jet Chiang <pokyuen.jetchiang-ext@solo.io>

---------

Signed-off-by: Jet Chiang <pokyuen.jetchiang-ext@solo.io>
Signed-off-by: Dmytro Rashko <dmitriy.rashko@amdocs.com>

* Fix incorrect cilium-dbg subcommands
* Bump outdated tools:
- Argo Rollouts: 1.8.4 → 1.9.0
- Istio: 1.28.5 → 1.29.1
…c.namespaces (kagent-dev#57)

* namespaced rbac update with kagent

Signed-off-by: Jet Chiang <pokyuen.jetchiang-ext@solo.io>

* use proper health check

Signed-off-by: Jet Chiang <pokyuen.jetchiang-ext@solo.io>

---------

Signed-off-by: Jet Chiang <pokyuen.jetchiang-ext@solo.io>
@mesutoezdil mesutoezdil force-pushed the feat/k8s-wait-for-condition branch 4 times, most recently from b8a975c to 33efb25, May 7, 2026 21:37
Wraps kubectl wait so agents can block on a resource condition in one
call instead of polling with repeated kubectl get turns.

Closes kagent-dev#56

Co-authored-by: alexis-brettes <133014848+alexis-brettes@users.noreply.github.com>
Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
@mesutoezdil

Author

@EItanya any news?

@EItanya EItanya left a comment

Contributor

Makes sense to me. @dimetron what do you think?

Also I gotta figure out why the jobs aren't running :(

@mesutoezdil mesutoezdil force-pushed the feat/k8s-wait-for-condition branch from d157f4f to e1d968e, June 14, 2026 15:47
@mesutoezdil mesutoezdil requested a review from dimetron as a code owner June 14, 2026 15:47
@mesutoezdil mesutoezdil force-pushed the feat/k8s-wait-for-condition branch from e1d968e to 2a72881, June 14, 2026 15:50

Development

Successfully merging this pull request may close these issues.

[FEATURE] Built-in tool to wait for Kubernetes resource conditions

10 participants