Spark on Kubernetes
– Poonam Vilas Chaudhari
Being an analytics company, we at Abzooba use Spark extensively for machine learning, data ingestion, ETL & large data processing. Spark enables in-memory processing of large-scale data. Spark jobs can be long-running/short-lived/scheduled as per need. Memory requirements also differ to run these kinds of jobs.
Our product, xpresso.ai, is based on Kubernetes for automatic deployment, auto-scaling, and managing the applications. Why maintain a dedicated Hadoop cluster just for running Spark jobs? Can we use the existing Kubernetes cluster? With all these questions, our exploration journey towards Spark on Kubernetes has started.
Spark jobs can run on clusters managed by Kubernetes(version >= 1.6) This feature makes use of the native Kubernetes scheduler that has been added to Spark. Spark-submit can be directly used to submit a Spark application to a Kubernetes cluster.
How does it work?
- Spark creates a Spark driver running within a Kubernetes pod.
- The driver creates executors that are also running as pods and connects to them and executes application code.
- When the application completes, the executor pods are terminated and cleaned up but, the driver pod persists logs and remains in a “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
- The driver and executor pod scheduling is handled by Kubernetes.
Following are the prerequisites, to try this feature on existing Kubernetes cluster:
- Kubernetes master URL (https://<ip address>:<port no>)
- Kubernetes dashboard URL
- Valid service account which has access to create pods
Spark 2.3 or above ships docker file with its distribution which can create base image for spark. It can be created using below command:
Here, Pushing the image to the repository is mandatory. Kubernetes needs to fetch the spark image and distribute it among all pods.
Spark submit on Kubernetes cluster can be done using below command:
Jobs can run in cluster/client mode. Jobs can also be submitted using YAML file and run using kubctl apply
Spark jobs submitted to the Kubernetes cluster can be seen from the cluster dashboard.
- Spark application UI can only be accessed locally using kubectl port-forward. Spark UI lives until the driver is alive.
- Kubernetes does not provide native support to the Spark history server. It can be done manually by collecting event logs by setting spark.eventLog.dir property and deploy the history server in your cluster with access to event logs location
- The result of Spark on Kubernetes is much slower than Spark on Yarn. The kubelet work directory can only be mounted on one disk so that the Spark scratch space only uses ONE disk. While running in Yarn mode, the spark scratch space to use as many disks as yarn.local.dir configures.
- Spark on Kubernetes is currently experimental but it can be a good choice in the future if you are trying to move to Kubernetes for all deployments.
- It would be very helpful in cloud-native applications because all leading providers like AWS, Microsoft Azure & GCP provide managed Kubernetes service. It will reduce the dependency on cloud-specific components.