Changing submission logic to accommodate Spark shell applications #402

sahilprasad · 2017-07-28T19:50:20Z

Allows client and cluster mode for shell applications. Specifically, this was done to allow executing a Jupyter notebook server that is able to interact with the k8s cluster.

With this change, I was able to use kubectl run with an image that runs a Jupyter notebook with the jupyter notebook command. After port-forwarding to the container port that the server is running on, I can access and use the notebook as usual, and if I provide appropriate configuration values (spark.master, spark.kubernetes.driver.docker.image, etc.), Spark tasks interact as expected with the Kubernetes cluster, and dynamic allocation behaves as expected.

Along with Jupyter, I've tested and confirmed that the PySpark and Scala shells work, and with slight modifications, so does in-cluster spark-submit.

This PR is less intended for merging and more for figuring out how best to facilitate this in-cluster workflow with Jupyter and client-mode applications. I know that there have been previous discussions around this, such as #211, and that there are limitations (executor IPs must be routable and other networking issues) but I would love to hear any thoughts around alternative approaches or solutions!

erikerlandson · 2017-07-28T20:10:07Z

Can you post some example instructions (and the jupyter notebook image you're using) on this PR discussion?

erikerlandson · 2017-07-28T20:18:22Z

Can you expand on what you mean by "tested pyspark and scala shells"?

sahilprasad · 2017-07-28T21:20:08Z

@erikerlandson Here's a link to doc I put together a few days ago.

Regarding the shells, I ran a small in-cluster job with the spark-shell and pyspark scripts while exec'd into a driver pod and did not notice any abnormalities.

erikerlandson · 2017-07-31T23:00:39Z

@sahilprasad does spark-shell also work from outside the cluster?

sahilprasad · 2017-08-01T19:11:00Z

@erikerlandson no, when I try running spark-shell from outside the cluster, I get errors regarding missing service account tokens and executors not being able to find the driver.

erikerlandson · 2017-08-01T19:23:51Z

My main concern is that enabling spark-shell doesn't distinguish between running inside the cluster, where it works, and outside, where it won't. However I like the jupyter notebook capability. A possible compromise is implement some kind of detection to distinguish these cases, so an informative error can be thrown if it is executed external to a cluster.

sahilprasad · 2017-08-01T21:20:41Z

@erikerlandson I agree that the compromise is the way to go. Would it be preferable to automatically detect in-cluster execution, perhaps by checking for the KUBERNETES_SERVICE_{HOST, PORT} environment variables? Or an additional flag, such as --in-cluster?

erikerlandson · 2017-08-01T21:45:19Z

I'm thinking auto detection - if somebody tries to run it from the outside, it should automatically fail on them, and explain why

erikerlandson · 2017-08-03T14:56:52Z

I don't want to close this, but we are going to hold off from including it in the initial 2.2 release

sahilprasad · 2017-08-03T23:15:23Z

@erikerlandson Sounds good. I'll push something in the next few days for review.

mccheah · 2017-08-04T21:14:32Z

How does this resolve? Does the submitting shell still eventually call org.apache.spark.deploy.kubernetes.Client? Or does it just create a KubernetesClusterSchedulerBackend instance in memory?

sahilprasad · 2017-08-07T04:05:22Z

@mccheah It eventually calls org.apache.spark.deploy.kubernetes.Client

mccheah · 2017-08-14T20:21:53Z

If we're calling the submission client then that implies we're not running in client mode. Client mode runs the scheduler backend and the user application directly in the process that runs spark-submit.

sahilprasad · 2017-08-14T22:11:45Z

@mccheah Then is there any way to get around the submission client to achieve the sort of pseudo client mode that I need for Jupyter? The in-cluster use case is all I'm going for, so any advice you have on this front, or towards functionality closer to true client mode, would be awesome!

mccheah · 2017-08-14T22:29:02Z

Can the submission client just be run inside the cluster in cluster mode? I'm not entirely certain why the submission client has to be run inside a docker container. Another option is just to support client mode in a sense - that's just a matter of getting the Java/Scala code to instantiate a KubernetesClusterSchedulerBackend inside the SparkContext directly as opposed to having the indirection layer of the submission client. This is very tricky right now since the scheduler backend is opinionated to being bootstrapped by the submission client.

foxish · 2017-08-15T23:22:49Z

@sahilprasad, have you tried jupyterhub and https://github.com/jupyterhub/kubespawner?

sahilprasad · 2017-08-16T00:38:22Z

@mccheah I'm not aware of a way to run PySpark applications inside of a Jupyter notebook without client deploy mode. Running the submission client in-cluster is more of a workaround to avoid the sort of networking and dependency resolution issues a remote, or out-of-cluster, client mode would involve. It seems to me that if shell applications work in-cluster and as-is, this would be worth supporting.

@foxish Yeah! Although I don't quite remember the end result, I experimented with those specifically with this project a few weeks ago, and was able to get a similar outcome in that I was able to run a PySpark application on a proxied Jupyter notebook being run on a k8s cluster. Successfully running it did involve allowing client mode through — exactly the changes of this PR.

mccheah · 2017-08-16T00:42:59Z

I think we want to modify SparkSubmit such that if it's in Kubernetes client mode:

If it's a Python application, make the Python runner as the main class. Otherwise, make the user's provided main class the main class. See https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L637 versus https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L369
When the SparkContext is created, it needs to know to make a KubernetesClusterSchedulerBackend instance based on the master URI.

mccheah · 2017-08-16T00:46:01Z

The basis of the client mode submission option is that there isn't an intermediate client that spawns another driver, but that the process running the SparkSubmit class is the class that runs the user's class directly. I think the current approach is unnecessarily creating a driver pod when in fact the pod that's running SparkSubmit should itself be the driver.

sahilprasad · 2017-08-16T21:07:11Z

@mccheah Got it. Would we need something equivalent to a KubernetesClientSchedulerBackend to accomplish this? Or is there an easier way?

mccheah · 2017-08-16T21:12:21Z

It would be good to look into writing a KubernetesClientSchedulerBackend which shares many of its components with KubernetesClusterSchedulerBackend via a shared parent class or a shared inner module (inheritance vs. composition). YARN has a similar model with its YarnClientSchedulerBackend and YarnClusterSchedulerBackend that extend from YarnSchedulerBackend.

sahilprasad · 2017-08-16T21:25:39Z

@mccheah Sounds good. I'll model it after the YARN scheduler backends. Should we continue discussion and review on this PR, or close it off in favor of a more poignant PR?

mccheah · 2017-08-16T21:29:20Z

We should close this and open another one with the more canonical approach.

sahilprasad · 2017-08-16T21:42:51Z

Will open a WIP PR when ready.

Merge upstream

Changing submission logic to accommodate Spark shell applications

6956f46

sahilprasad closed this Aug 16, 2017

sahilprasad mentioned this pull request Aug 23, 2017

In-cluster client mode #456

Open

echarles mentioned this pull request Nov 10, 2017

Add support to run Spark interpreter on a Kubernetes cluster apache/zeppelin#2637

Closed

ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019

Merge pull request apache-spark-on-k8s#402 from palantir/rk/new-master

f8dd15f

Merge upstream

Changing submission logic to accommodate Spark shell applications #402

Changing submission logic to accommodate Spark shell applications #402

Uh oh!

Conversation

sahilprasad commented Jul 28, 2017

Uh oh!

erikerlandson commented Jul 28, 2017

Uh oh!

erikerlandson commented Jul 28, 2017

Uh oh!

sahilprasad commented Jul 28, 2017

Uh oh!

erikerlandson commented Jul 31, 2017

Uh oh!

sahilprasad commented Aug 1, 2017

Uh oh!

erikerlandson commented Aug 1, 2017

Uh oh!

sahilprasad commented Aug 1, 2017

Uh oh!

erikerlandson commented Aug 1, 2017

Uh oh!

erikerlandson commented Aug 3, 2017

Uh oh!

sahilprasad commented Aug 3, 2017

Uh oh!

mccheah commented Aug 4, 2017

Uh oh!

sahilprasad commented Aug 7, 2017

Uh oh!

mccheah commented Aug 14, 2017

Uh oh!

sahilprasad commented Aug 14, 2017

Uh oh!

mccheah commented Aug 14, 2017

Uh oh!

foxish commented Aug 15, 2017

Uh oh!

sahilprasad commented Aug 16, 2017

Uh oh!

mccheah commented Aug 16, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mccheah commented Aug 16, 2017

Uh oh!

sahilprasad commented Aug 16, 2017

Uh oh!

mccheah commented Aug 16, 2017

Uh oh!

sahilprasad commented Aug 16, 2017

Uh oh!

mccheah commented Aug 16, 2017

Uh oh!

sahilprasad commented Aug 16, 2017

Uh oh!

Uh oh!

mccheah commented Aug 16, 2017 •

edited

Loading