Spark on k8s : changes for using supporting multi tenant(multi namespace) k8s cluster#462
Spark on k8s : changes for using supporting multi tenant(multi namespace) k8s cluster#462ashokkumarrathore wants to merge 6 commits into
Conversation
|
This pull request has been automatically marked as stale because it has had no activity for at least 3 months. If you are still working on this change or plan to move it forward, please leave a comment or push a new commit so we know to keep it open. Otherwise, this PR will be closed automatically in about one month. Thank you for your contribution to Apache Livy! |
Livy on Kubernetes previously assumed a single namespace and listed/killed applications across all namespaces (inAnyNamespace). In a multi-tenant cluster the Livy service account is typically scoped to a subset of namespaces, so cluster-wide calls fail. This change makes the namespace a first-class, per-session property: * SparkApp.getNamespace resolves the target namespace from the session Spark conf (spark.kubernetes.namespace), falling back to $SPARK_HOME/conf/spark-defaults.conf and finally the 'default' namespace. It only touches the filesystem on Kubernetes and returns an empty namespace for YARN/local, closing the input stream and tolerating a missing file or unset SPARK_HOME. * The namespace is threaded through SparkApp.create and persisted in the batch and interactive recovery metadata, so a recovered session keeps the namespace it was created with. * SparkKubernetesApp scopes every client call with .inNamespace(...) and tracks the set of namespaces it has seen in a thread-safe set (ConcurrentHashMap-backed) so leaked-application cleanup iterates only over namespaces Livy actually has access to. * Bumps kubernetes-client to 6.8.1 and netty to 4.1.108.Final for security fixes. * Adds SparkAppSpec covering the namespace-resolution precedence chain and updates existing recovery-metadata fixtures for the new field. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c96086c to
2e77626
Compare
|
Rebased onto the latest Rebase / conflict resolution
Version bumps (security) — kept the higher version across Bug fix in
This also addresses @jahstreet's earlier question about whether the namespace can be empty initially. Tests — added Commits were squashed into one. @gyogal — could you kick off the check jobs when you get a chance? Thanks!
|
SparkKubernetesApp.getApplicationReport previously issued a single LIST returning every pod tagged with the session's app tag — on large jobs this deserializes hundreds of executor pods into memory each poll cycle, even though application state is derived only from the driver pod. Two changes: - Narrow the driver lookup to spark-role=driver at the API server so the LIST returns a single pod. - Gate the separate executor LIST on a new config flag livy.server.kubernetes.executor-tracking.enabled (default true, so existing deployments see no behavior change). Operators running large clusters can disable it to skip the executor LIST entirely. When disabled, per-executor Grafana/Loki log URLs and per-executor entries in session diagnostics are omitted; driver state tracking is unaffected. Adds two tests: default-enabled invariant and diagnostics graceful degradation when executors is empty.
Sessions deleted via DELETE / idle GC / heartbeat watchdog / retention GC had their recovery file in the state store resurrected by the next SparkKubernetesApp monitor tick: stopSession() called app.kill(), but kill() left the app on the global appQueue, and the in-flight tick fell through changeState(KILLED|FAILED) to listener.appIdKnown(), which unconditionally re-saved recovery metadata after SessionManager.delete had just removed it. End state: in-memory map clean, state store retains a zombie file. Same shape exists for batch sessions on DELETE / FAILED / KILLED, with a one-tick race window. The leaked files also skewed our state-store metrics so we did not get an accurate picture of state. Fix: - SparkKubernetesApp.kill() now also removes the app from the shared monitor queue, alongside flipping the killed flag. - monitorSparkKubernetesApp early-returns after changeState(KILLED) and changeState(FAILED) so the body cannot fall through to appIdKnown after a terminal state observation. - killed becomes AtomicBoolean for visibility from the unsynchronized monitor read. The SparkApp trait, InteractiveSession.stopSession, and BatchSession.stopSession are unchanged. Tests: - SparkKubernetesAppSpec verifies kill() removes from appQueue and is idempotent for the queue removal. - BatchSessionSpec verifies stopSession invokes kill() on the app.
SparkKubernetesApp.monitorSparkKubernetesApp re-runs getAppFromTag on
every poll tick. Once an earlier tick has set appPromise via
trySuccess(app), a later tick that hits an exception in getAppFromTag
called the non-idempotent Promise.failure(e) on the already-completed
promise. That throws IllegalStateException("Promise already
completed."), which escapes the inner try and lands in the outer
NonFatal handler, transitioning the app to FAILED. The session
listener (BatchSession / InteractiveSession stateChanged) then flips
the session to Dead, killing a healthy app.
The race was always latent but only became reachable in practice after
livy.server.kubernetes.app-lookup.thread-pool.size was raised above 1:
the higher k8s API pressure makes withRetry exhaustion in
getAppFromTag (e.g., during istio-proxy restarts) frequent enough to
hit the catch block after the promise is already set.
Switch the catch block to Promise.tryFailure(e) so a late failure on
an already-completed promise is dropped instead of throwing. This
matches the symmetric trySuccess(app) used on the success path two
lines below. Behavior on the cold path is unchanged: tryFailure
stores the failure exactly like failure when the promise is still
pending, and no caller depends on the throw (the catch block returns
immediately).
Adds two regression tests in SparkKubernetesAppSpec that pin the
idempotency contract in both directions.
|
Pushed an update that folds in five additional Kubernetes/session fixes that were developed on top of this branch (authored by @vallenki). All are rebased onto current
Combined diff is ~545 insertions across 16 files. Happy to split any of these into separate PRs if you'd prefer to keep this one scoped purely to multi-namespace support — just let me know. |
What changes were proposed in this pull request?
Related issue : #461
How was this patch tested?
Tested by deploying updated server in a k8s cluster with namespaces and running a job.