Skip to content

Spark on k8s : changes for using supporting multi tenant(multi namespace) k8s cluster#462

Open
ashokkumarrathore wants to merge 6 commits into
apache:masterfrom
ashokkumarrathore:livy_nmsp_fix
Open

Spark on k8s : changes for using supporting multi tenant(multi namespace) k8s cluster#462
ashokkumarrathore wants to merge 6 commits into
apache:masterfrom
ashokkumarrathore:livy_nmsp_fix

Conversation

@ashokkumarrathore

@ashokkumarrathore ashokkumarrathore commented Jan 1, 2025

Copy link
Copy Markdown

What changes were proposed in this pull request?

Related issue : #461

How was this patch tested?

Tested by deploying updated server in a k8s cluster with namespaces and running a job.

Comment thread server/src/main/scala/org/apache/livy/utils/SparkKubernetesApp.scala Outdated
@github-actions

Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has had no activity for at least 3 months. If you are still working on this change or plan to move it forward, please leave a comment or push a new commit so we know to keep it open. Otherwise, this PR will be closed automatically in about one month. Thank you for your contribution to Apache Livy!

@github-actions github-actions Bot added the stale label Feb 21, 2026
@ashokkumarrathore ashokkumarrathore changed the title Spark on k8s : changes for using namspaces with Kubernetes client APIs. Spark on k8s : changes for using namespaces with Kubernetes client APIs. Jun 19, 2026
@ashokkumarrathore ashokkumarrathore changed the title Spark on k8s : changes for using namespaces with Kubernetes client APIs. Spark on k8s : changes for using supporting multi tenant(multi namespace) k8s cluster Jun 19, 2026
Livy on Kubernetes previously assumed a single namespace and listed/killed
applications across all namespaces (inAnyNamespace). In a multi-tenant
cluster the Livy service account is typically scoped to a subset of
namespaces, so cluster-wide calls fail.

This change makes the namespace a first-class, per-session property:

* SparkApp.getNamespace resolves the target namespace from the session
  Spark conf (spark.kubernetes.namespace), falling back to
  $SPARK_HOME/conf/spark-defaults.conf and finally the 'default'
  namespace. It only touches the filesystem on Kubernetes and returns an
  empty namespace for YARN/local, closing the input stream and tolerating
  a missing file or unset SPARK_HOME.

* The namespace is threaded through SparkApp.create and persisted in the
  batch and interactive recovery metadata, so a recovered session keeps
  the namespace it was created with.

* SparkKubernetesApp scopes every client call with .inNamespace(...) and
  tracks the set of namespaces it has seen in a thread-safe set
  (ConcurrentHashMap-backed) so leaked-application cleanup iterates only
  over namespaces Livy actually has access to.

* Bumps kubernetes-client to 6.8.1 and netty to 4.1.108.Final for
  security fixes.

* Adds SparkAppSpec covering the namespace-resolution precedence chain
  and updates existing recovery-metadata fixtures for the new field.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ashokkumarrathore

Copy link
Copy Markdown
Author

Rebased onto the latest master and resolved all conflicts — the PR is now mergeable. Summary of what changed since the last review:

Rebase / conflict resolution

  • Rebased onto current master (the branch was ~42 commits behind).
  • Dropped changes that master has since made independently: the commons.lang3 import and the SQLInterpreterSpec assertion (kept master's broader version).
  • Removed redundant per-module pom overrides (maven-shade-plugin, commons-lang3, netty-all, hadoop-common) now that the parent pom manages those versions; one of them would otherwise have downgraded maven-shade-plugin 3.5.0 → 3.4.1.

Version bumps (security) — kept the higher version across master/PR: kubernetes-client 6.8.1, netty 4.1.108.Final. Verified the code compiles and unit tests pass against the fabric8 6.x client.

Bug fix in SparkApp.getNamespace — the previous version unconditionally read $SPARK_HOME/conf/spark-defaults.conf and called .get on sparkHome(), which threw FileNotFoundException/NoSuchElementException and broke batch-session creation (and several existing tests) for non-Kubernetes deployments. It now:

  • resolves a namespace only on Kubernetes (empty string otherwise, no filesystem access),
  • treats a missing or key-less spark-defaults.conf as "no value" and falls back to the default namespace,
  • closes the input stream.

This also addresses @jahstreet's earlier question about whether the namespace can be empty initially.

Tests — added SparkAppSpec covering the full namespace-resolution precedence chain (session conf → spark-defaults.conf → default), plus the non-Kubernetes and missing-SPARK_HOME paths. All touched suites pass locally (SparkAppSpec, BatchSessionSpec, SessionManagerSpec, SparkKubernetesAppSpec); scalastyle is clean.

Commits were squashed into one. @gyogal — could you kick off the check jobs when you get a chance? Thanks!

Note on the watched-namespace truncation idea (@jahstreet): I've left that out of this PR to keep it focused; happy to follow up in a separate change since picking a safe truncation point needs more thought.

@ashokkumarrathore ashokkumarrathore requested a review from gyogal June 19, 2026 11:02
vallenki added 5 commits June 19, 2026 18:10
SparkKubernetesApp.getApplicationReport previously issued a single LIST
returning every pod tagged with the session's app tag — on large jobs
this deserializes hundreds of executor pods into memory each poll cycle,
even though application state is derived only from the driver pod.

Two changes:
- Narrow the driver lookup to spark-role=driver at the API server so
  the LIST returns a single pod.
- Gate the separate executor LIST on a new config flag
  livy.server.kubernetes.executor-tracking.enabled (default true, so
  existing deployments see no behavior change). Operators running large
  clusters can disable it to skip the executor LIST entirely. When
  disabled, per-executor Grafana/Loki log URLs and per-executor entries
  in session diagnostics are omitted; driver state tracking is
  unaffected.

Adds two tests: default-enabled invariant and diagnostics graceful
degradation when executors is empty.
Sessions deleted via DELETE / idle GC / heartbeat watchdog / retention
GC had their recovery file in the state store resurrected by the next
SparkKubernetesApp monitor tick: stopSession() called app.kill(), but
kill() left the app on the global appQueue, and the in-flight tick
fell through changeState(KILLED|FAILED) to listener.appIdKnown(), which
unconditionally re-saved recovery metadata after SessionManager.delete
had just removed it. End state: in-memory map clean, state store retains
a zombie file. Same shape exists for batch sessions on DELETE / FAILED /
KILLED, with a one-tick race window. The leaked files also skewed our
state-store metrics so we did not get an accurate picture of state.

Fix:
- SparkKubernetesApp.kill() now also removes the app from the shared
  monitor queue, alongside flipping the killed flag.
- monitorSparkKubernetesApp early-returns after changeState(KILLED) and
  changeState(FAILED) so the body cannot fall through to appIdKnown
  after a terminal state observation.
- killed becomes AtomicBoolean for visibility from the unsynchronized
  monitor read.

The SparkApp trait, InteractiveSession.stopSession, and
BatchSession.stopSession are unchanged.

Tests:
- SparkKubernetesAppSpec verifies kill() removes from appQueue and is
  idempotent for the queue removal.
- BatchSessionSpec verifies stopSession invokes kill() on the app.
SparkKubernetesApp.monitorSparkKubernetesApp re-runs getAppFromTag on
every poll tick. Once an earlier tick has set appPromise via
trySuccess(app), a later tick that hits an exception in getAppFromTag
called the non-idempotent Promise.failure(e) on the already-completed
promise. That throws IllegalStateException("Promise already
completed."), which escapes the inner try and lands in the outer
NonFatal handler, transitioning the app to FAILED. The session
listener (BatchSession / InteractiveSession stateChanged) then flips
the session to Dead, killing a healthy app.

The race was always latent but only became reachable in practice after
livy.server.kubernetes.app-lookup.thread-pool.size was raised above 1:
the higher k8s API pressure makes withRetry exhaustion in
getAppFromTag (e.g., during istio-proxy restarts) frequent enough to
hit the catch block after the promise is already set.

Switch the catch block to Promise.tryFailure(e) so a late failure on
an already-completed promise is dropped instead of throwing. This
matches the symmetric trySuccess(app) used on the success path two
lines below. Behavior on the cold path is unchanged: tryFailure
stores the failure exactly like failure when the promise is still
pending, and no caller depends on the throw (the catch block returns
immediately).

Adds two regression tests in SparkKubernetesAppSpec that pin the
idempotency contract in both directions.
@ashokkumarrathore

Copy link
Copy Markdown
Author

Pushed an update that folds in five additional Kubernetes/session fixes that were developed on top of this branch (authored by @vallenki). All are rebased onto current master, compile cleanly against kubernetes-client 6.8.1, pass scalastyle/checkstyle, and have unit tests:

  1. Livy Interactive session fixes — adds livy.rsc.driver.address.use-hostname (RSCConf) so the driver's reported hostname is used instead of the socket IP, which is needed for Kubernetes + Istio setups.
  2. Reduce Kubernetes API load in polling — narrows the driver lookup to spark-role=driver at the API server (single-pod LIST instead of all executor pods) and gates the executor LIST behind a new flag livy.server.kubernetes.executor-tracking.enabled (default true, so no behavior change). Large clusters can disable it to skip per-executor LISTs.
  3. SessionManager refresh + /recovery operator endpoint — a superuser-guarded (livy.superusers, returns 403 otherwise) REST endpoint to re-scan the recovery state store and import sessions written by another Livy server, without a restart.
  4. Fix recovery-file leak after session deletionkill() now removes the app from the shared monitor queue and the monitor early-returns after a terminal state, so a deleted session's recovery file is no longer resurrected by an in-flight monitor tick. killed becomes AtomicBoolean for cross-thread visibility.
  5. Fix Promise.failure race that falsely failed healthy apps — switches the catch path to Promise.tryFailure, matching the trySuccess on the success path, so a late failure on an already-completed promise is dropped instead of flipping a healthy app to FAILED. This became reachable once app-lookup.thread-pool.size > 1.

Combined diff is ~545 insertions across 16 files. Happy to split any of these into separate PRs if you'd prefer to keep this one scoped purely to multi-namespace support — just let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants