Spark on Kubernetes
– Poonam Chaudhari
As an analytics company, Abzooba uses Spark extensively for machine learning, data ingestion, ETL, and large-scale data processing. Spark enables in-memory processing of large-scale data. Spark jobs can be long-running, short-lived, or scheduled as needed, and their memory requirements differ accordingly.
Our product, xpresso.ai, is built on Kubernetes for automated deployment, auto-scaling, and application management. Why maintain a dedicated Hadoop cluster just to run Spark jobs? Can we use the existing Kubernetes cluster instead? These questions started our exploration of Spark on Kubernetes.
Spark jobs can run on clusters managed by Kubernetes (version >= 1.6). This feature uses the native Kubernetes scheduler that has been added to Spark. spark-submit can be used directly to submit a Spark application to a Kubernetes cluster.
How does it actually work?
- Spark creates a Spark driver running within a Kubernetes pod.
- The driver creates executors, which also run as pods, connects to them, and executes the application code.
- When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists its logs and remains in a “completed” state in the Kubernetes API until it is eventually garbage collected or manually cleaned up.
- The driver and executor pod scheduling is handled by Kubernetes.
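One way to watch this lifecycle is to list the driver and executor pods while a job runs. The sketch below only assembles the kubectl command line, so it does not need a cluster to run. It assumes the pods carry the spark-role label with the values driver and executor, and that the job runs in the default namespace. The helper name list_spark_pods_command is made up for this illustration.

```python
import shlex


def list_spark_pods_command(role, namespace="default"):
    """Build a kubectl command that lists Spark pods of one role.

    Assumption: pods are labelled spark-role=driver or spark-role=executor.
    The namespace default is a placeholder; adjust it for your cluster.
    """
    if role not in ("driver", "executor"):
        raise ValueError("role must be 'driver' or 'executor'")
    return [
        "kubectl", "get", "pods",
        "--namespace", namespace,
        "--selector", f"spark-role={role}",
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so no cluster is needed.
    for role in ("driver", "executor"):
        cmd = list_spark_pods_command(role)
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

After the application finishes, the executor query should come back empty while the driver pod is still listed as completed.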
The following are the prerequisites for trying this feature on an existing Kubernetes cluster:
- Kubernetes master URL (https://<ip address>:<port no>)
- Kubernetes dashboard URL
- A valid service account with permission to create pods
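If no suitable service account exists yet, one can be created with kubectl. The sketch below assembles the two commands without running them. The account name spark, the default namespace, the binding name, and the choice of the built-in edit ClusterRole are all assumptions for illustration; your cluster's access policy may call for a narrower role.

```python
import shlex


def service_account_commands(name="spark", namespace="default"):
    """Return kubectl commands (as argv lists) that create a service
    account and bind it to the built-in 'edit' ClusterRole.

    Assumption: 'edit' is acceptable in your cluster; it allows creating
    pods, which the Spark driver needs in order to launch executors.
    """
    return [
        ["kubectl", "create", "serviceaccount", name,
         "--namespace", namespace],
        ["kubectl", "create", "clusterrolebinding", f"{name}-role",
         "--clusterrole=edit",
         f"--serviceaccount={namespace}:{name}"],
    ]


if __name__ == "__main__":
    # Print the commands for review rather than executing them.
    for cmd in service_account_commands():
        print(" ".join(shlex.quote(arg) for arg in cmd))
```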
Spark 2.3 and above ships a Dockerfile with its distribution, which can be used to build a base image for Spark. The image can be built and pushed with the bin/docker-image-tool.sh script included in the distribution.
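A minimal sketch of the build-and-push step, using the bin/docker-image-tool.sh script from the Spark distribution. The code only assembles the two invocations. The repository and tag values are placeholders, and the helper name image_tool_commands is made up for this illustration.

```python
import shlex


def image_tool_commands(repo, tag, spark_home="."):
    """Return the build and push invocations of docker-image-tool.sh.

    repo and tag are placeholders for your own registry and version;
    spark_home is the root of the unpacked Spark distribution.
    """
    tool = f"{spark_home}/bin/docker-image-tool.sh"
    base = [tool, "-r", repo, "-t", tag]
    # 'build' creates the image locally, 'push' uploads it to the registry.
    return [base + ["build"], base + ["push"]]


if __name__ == "__main__":
    # Hypothetical registry and tag, printed for review only.
    for cmd in image_tool_commands("registry.example.com/analytics", "v2.3.0"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Run both commands from a machine that has Docker installed and is logged in to the target registry.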
Pushing the image to a repository is mandatory: Kubernetes needs to fetch the Spark image for every driver and executor pod.
A Spark application is submitted to the Kubernetes cluster with spark-submit, with --master set to the Kubernetes master URL prefixed by k8s://.
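A sketch of what such a submission might look like, assembled as an argument list so it can be inspected before running. The master URL reuses the placeholder from the prerequisites. The image name, jar path, executor count, and service account name are assumptions for illustration. The local:// scheme means the jar is already present inside the image.

```python
import shlex


def spark_submit_command(master_url, image, app_jar, main_class,
                         name="spark-pi", executors=2,
                         service_account="spark", deploy_mode="cluster"):
    """Build a spark-submit argv for a Kubernetes cluster.

    All argument values are supplied by the caller; the defaults here
    are illustrative, not recommendations.
    """
    return [
        "bin/spark-submit",
        "--master", f"k8s://{master_url}",
        "--deploy-mode", deploy_mode,
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName="
                  f"{service_account}",
        app_jar,
    ]


if __name__ == "__main__":
    cmd = spark_submit_command(
        master_url="https://<ip address>:<port no>",       # placeholder
        image="registry.example.com/analytics/spark:v2.3.0",  # hypothetical
        app_jar="local:///opt/spark/examples/jars/spark-examples.jar",  # hypothetical path
        main_class="org.apache.spark.examples.SparkPi",
    )
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The service account setting matters in cluster mode because the driver pod itself asks Kubernetes to create the executor pods.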
A job can run in cluster or client mode. A job can also be submitted using a YAML file and run with kubectl apply.
A Spark job submitted to the Kubernetes cluster can be seen in the cluster dashboard.
- The Spark application UI can only be accessed locally, using kubectl port-forward. The Spark UI is available only while the driver is alive.
- Kubernetes does not provide native support for the Spark history server. It can be set up manually by collecting event logs via the spark.eventLog.dir property and deploying the history server in your cluster with access to the event log location.
- Spark on Kubernetes runs much slower than Spark on YARN. The kubelet work directory can only be mounted on one disk, so the Spark scratch space uses only one disk. In YARN mode, the Spark scratch space is spread across as many disks as yarn.nodemanager.local-dirs configures.
- Spark on Kubernetes is currently experimental, but it can be a good choice in the future if you are moving to Kubernetes for all deployments.
- It would be very helpful for cloud-native applications, because all leading providers such as AWS, Microsoft Azure, and GCP offer a managed Kubernetes service. This reduces the dependency on cloud-specific components.
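The Spark UI and history server points in the list above can be sketched as follows. The first helper builds the kubectl port-forward command for a driver pod, assuming the UI listens on Spark's default port 4040. The second returns the event-log settings to pass to spark-submit, together with the matching setting for the history server. The pod name and log directory are placeholders, and both helper names are made up for this illustration.

```python
import shlex


def port_forward_command(driver_pod, local_port=4040, namespace="default"):
    """Build a kubectl command that exposes the driver's Spark UI locally.

    Assumption: the UI runs on Spark's default port 4040 in the pod.
    """
    return ["kubectl", "port-forward", driver_pod,
            f"{local_port}:4040", "--namespace", namespace]


def event_log_settings(log_dir):
    """Return (spark-submit args, history server config) for one log dir.

    log_dir must be a location both the job and the history server can
    reach, for example shared or object storage.
    """
    submit_args = [
        "--conf", "spark.eventLog.enabled=true",
        "--conf", f"spark.eventLog.dir={log_dir}",
    ]
    history_conf = {"spark.history.fs.logDirectory": log_dir}
    return submit_args, history_conf


if __name__ == "__main__":
    # Hypothetical pod name and log location, printed for review only.
    cmd = port_forward_command("spark-pi-driver")
    print(" ".join(shlex.quote(arg) for arg in cmd))
    args, conf = event_log_settings("hdfs:///spark-events")
    print(" ".join(shlex.quote(arg) for arg in args))
    print(conf)
```

With the port-forward running, the UI is reachable at http://localhost:4040 for as long as the driver pod is alive.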