
Changing submission logic to accommodate Spark shell applications #402

Closed · wants to merge 1 commit

Conversation

sahilprasad

Allows client and cluster mode for shell applications. Specifically, this was done to allow executing a Jupyter notebook server that is able to interact with the k8s cluster.

With this change, I was able to use kubectl run with an image that runs a Jupyter notebook with the jupyter notebook command. After port-forwarding to the container port that the server is running on, I can access and use the notebook as usual, and if I provide appropriate configuration values (spark.master, spark.kubernetes.driver.docker.image, etc.), Spark tasks interact as expected with the Kubernetes cluster, and dynamic allocation behaves as expected.
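For reference, a minimal sketch of the kind of configuration this involves, written as Scala you could paste into a shell session (the Jupyter setup sets the equivalent values through PySpark). The master URL and image names are illustrative placeholders, and the executor image property is assumed to mirror the driver property named above:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only: the master URL points at the in-cluster API
// server, and both image names are placeholders for locally built images.
val conf = new SparkConf()
  .setMaster("k8s://https://kubernetes.default.svc")
  .set("spark.kubernetes.driver.docker.image", "example/spark-driver:v2.2")
  .set("spark.kubernetes.executor.docker.image", "example/spark-executor:v2.2")
  .set("spark.dynamicAllocation.enabled", "true")

val spark = SparkSession.builder.config(conf).getOrCreate()
```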

Along with Jupyter, I've tested and confirmed that the PySpark and Scala shells work, and with slight modifications, so does in-cluster spark-submit.

This PR is less intended for merging and more for figuring out how best to facilitate this in-cluster workflow with Jupyter and client-mode applications. I know that there have been previous discussions around this, such as #211, and that there are limitations (executor IPs must be routable and other networking issues) but I would love to hear any thoughts around alternative approaches or solutions!

@erikerlandson
Member

Can you post some example instructions (and the jupyter notebook image you're using) on this PR discussion?

@erikerlandson
Member

Can you expand on what you mean by "tested pyspark and scala shells"?

@sahilprasad
Author

@erikerlandson Here's a link to a doc I put together a few days ago.

Regarding the shells, I ran a small in-cluster job with the spark-shell and pyspark scripts while exec'd into a driver pod and did not notice any abnormalities.

@erikerlandson
Member

@sahilprasad does spark-shell also work from outside the cluster?

@sahilprasad
Author

@erikerlandson no, when I try running spark-shell from outside the cluster, I get errors regarding missing service account tokens and executors not being able to find the driver.

@erikerlandson
Member

My main concern is that enabling spark-shell doesn't distinguish between running inside the cluster, where it works, and outside, where it won't. However, I like the Jupyter notebook capability. A possible compromise is to implement some kind of detection to distinguish these cases, so that an informative error can be thrown if it is executed outside a cluster.

@sahilprasad
Author

@erikerlandson I agree that the compromise is the way to go. Would it be preferable to automatically detect in-cluster execution, perhaps by checking for the KUBERNETES_SERVICE_{HOST, PORT} environment variables? Or an additional flag, such as --in-cluster?
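For concreteness, here is a minimal sketch of the env-var approach, assuming the check lives somewhere in the submission-side validation path; the error message and placement are hypothetical:

```scala
// Kubernetes injects these variables into every pod, so their presence is a
// reasonable (if imperfect) signal that we are running inside a cluster.
def runningInsideKubernetes: Boolean =
  sys.env.contains("KUBERNETES_SERVICE_HOST") &&
    sys.env.contains("KUBERNETES_SERVICE_PORT")

// Hypothetical validation at submission time:
if (!runningInsideKubernetes) {
  throw new IllegalStateException(
    "Client-mode shells against Kubernetes are only supported from inside " +
      "the cluster: KUBERNETES_SERVICE_HOST/PORT not found in the environment.")
}
```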

@erikerlandson
Member

I'm thinking auto-detection: if somebody tries to run it from the outside, it should fail automatically and explain why.

@erikerlandson
Member

I don't want to close this, but we are going to hold off on including it in the initial 2.2 release.

@sahilprasad
Author

@erikerlandson Sounds good. I'll push something in the next few days for review.

@mccheah

mccheah commented Aug 4, 2017

How does this resolve? Does the submitting shell still eventually call org.apache.spark.deploy.kubernetes.Client, or does it just create a KubernetesClusterSchedulerBackend instance in memory?

@sahilprasad
Author

@mccheah It eventually calls org.apache.spark.deploy.kubernetes.Client.

@mccheah

mccheah commented Aug 14, 2017

If we're calling the submission client then that implies we're not running in client mode. Client mode runs the scheduler backend and the user application directly in the process that runs spark-submit.

@sahilprasad
Author

@mccheah Then is there any way to get around the submission client to achieve the sort of pseudo client mode that I need for Jupyter? The in-cluster use case is all I'm going for, so any advice you have on this front, or towards functionality closer to true client mode, would be awesome!

@mccheah

mccheah commented Aug 14, 2017

Can the submission client just be run inside the cluster in cluster mode? I'm not entirely certain why the submission client has to be run inside a Docker container. Another option is to support client mode in a sense: that's just a matter of having the Java/Scala code instantiate a KubernetesClusterSchedulerBackend inside the SparkContext directly, rather than going through the indirection layer of the submission client. This is very tricky right now, since the scheduler backend assumes it is bootstrapped by the submission client.

@foxish
Member

foxish commented Aug 15, 2017

@sahilprasad, have you tried jupyterhub and https://github.com/jupyterhub/kubespawner?

@sahilprasad
Author

@mccheah I'm not aware of a way to run PySpark applications inside of a Jupyter notebook without client deploy mode. Running the submission client in-cluster is more of a workaround to avoid the sort of networking and dependency resolution issues a remote, or out-of-cluster, client mode would involve. It seems to me that if shell applications work in-cluster and as-is, this would be worth supporting.

@foxish Yeah! Although I don't quite remember the end result, I experimented with those on this project a few weeks ago and got a similar outcome: I was able to run a PySpark application on a proxied Jupyter notebook running on a k8s cluster. Getting it working did involve allowing client mode through, which is exactly the change this PR makes.

@mccheah

mccheah commented Aug 16, 2017

I think we want to modify SparkSubmit such that if it's in Kubernetes client mode:

  1. If it's a Python application, make the Python runner the main class; otherwise, make the user's provided main class the main class. See https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L637 versus https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L369
  2. When the SparkContext is created, it needs to know to create a KubernetesClusterSchedulerBackend instance based on the master URI (see the sketch below).
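A rough, self-contained sketch of the dispatch in step 2; all names here are hypothetical, and the real change would live in SparkContext's scheduler-creation path:

```scala
// Hypothetical names throughout. The idea: inspect the master URI and deploy
// mode, then decide whether this JVM hosts the scheduler backend directly
// (client mode) or defers to the submission client (cluster mode).
sealed trait SubmissionPath
case object ViaSubmissionClient extends SubmissionPath // spawns a driver pod
case object InProcessDriver extends SubmissionPath     // this process is the driver

def choosePath(master: String, deployMode: String): SubmissionPath =
  if (master.startsWith("k8s://") && deployMode == "client") InProcessDriver
  else ViaSubmissionClient
```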

@mccheah

mccheah commented Aug 16, 2017

The basis of the client mode submission option is that there isn't an intermediate client that spawns another driver; rather, the process running the SparkSubmit class runs the user's class directly. I think the current approach unnecessarily creates a driver pod, when in fact the pod that's running SparkSubmit should itself be the driver.

@sahilprasad
Author

@mccheah Got it. Would we need something equivalent to a KubernetesClientSchedulerBackend to accomplish this? Or is there an easier way?

@mccheah

mccheah commented Aug 16, 2017

It would be good to look into writing a KubernetesClientSchedulerBackend which shares many of its components with KubernetesClusterSchedulerBackend via a shared parent class or a shared inner module (inheritance vs. composition). YARN has a similar model with its YarnClientSchedulerBackend and YarnClusterSchedulerBackend that extend from YarnSchedulerBackend.
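Roughly, the layout could mirror YARN's. Everything below except KubernetesClusterSchedulerBackend is a hypothetical name, and the real classes would extend Spark's CoarseGrainedSchedulerBackend rather than a bare abstract class:

```scala
// Shared executor-pod lifecycle (building pod specs, talking to the API
// server) would live in the parent, used identically by both deploy modes.
abstract class KubernetesSchedulerBackend {
  protected def requestExecutorPods(count: Int): Unit = ()
}

// Client mode: the submitting JVM is itself the driver; no driver pod exists.
class KubernetesClientSchedulerBackend extends KubernetesSchedulerBackend

// Cluster mode: the driver runs in a pod created by the submission client.
class KubernetesClusterSchedulerBackend extends KubernetesSchedulerBackend
```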

@sahilprasad
Author

@mccheah Sounds good. I'll model it after the YARN scheduler backends. Should we continue discussion and review on this PR, or close it off in favor of a more focused PR?

@mccheah

mccheah commented Aug 16, 2017

We should close this and open another one with the more canonical approach.

@sahilprasad
Author

Will open a WIP PR when ready.
