Following Docker best practices, this repo uses a multi-stage Dockerfile for Spark jobs. It is meant as a starting point for projects deploying Apache Spark jobs to Kubernetes clusters, and can be used with the `spark-submit` command or with `spark-operator`. Since many deployments run in cloud environments, the jars needed for S3 and GCS access are included.
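The multi-stage pattern roughly separates downloading Spark from the final runtime image. The sketch below is illustrative only (stage names, base images, and the Spark version arg are assumptions; the repo's actual Dockerfile may differ) — the `HADOOP_VERSION_DEFAULT` build arg is the one shown in the build command below:

```dockerfile
# Illustrative multi-stage sketch, not the repo's actual Dockerfile.
ARG SPARK_VERSION_DEFAULT=3.1.1
ARG HADOOP_VERSION_DEFAULT=3.2

# Stage 1: fetch the Spark distribution (and any cloud-storage jars)
FROM alpine:3 AS builder
ARG SPARK_VERSION_DEFAULT
ARG HADOOP_VERSION_DEFAULT
RUN wget -qO- "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION_DEFAULT}/spark-${SPARK_VERSION_DEFAULT}-bin-hadoop${HADOOP_VERSION_DEFAULT}.tgz" \
    | tar -xz -C /opt \
 && mv "/opt/spark-${SPARK_VERSION_DEFAULT}-bin-hadoop${HADOOP_VERSION_DEFAULT}" /opt/spark

# Stage 2: slim runtime image containing only the JRE and Spark
FROM eclipse-temurin:11-jre AS runtime
COPY --from=builder /opt/spark /opt/spark
ENV SPARK_HOME=/opt/spark
WORKDIR /app
```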
Build the Docker image by running:

```shell
docker build -t my-spark-image:latest .
```

To override the Spark or Hadoop version in the build stage:

```shell
docker build --build-arg HADOOP_VERSION_DEFAULT=2.7 -t my-spark-image:latest .
```
The application jar file should be added to the `/app` directory. The `spark-submit` command looks like this:
```shell
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///app/<jar-name>.jar
```
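The same job can alternatively be described declaratively for `spark-operator`. A minimal `SparkApplication` manifest might look like the sketch below (the resource values, name, and Spark version are illustrative, not taken from this repo):

```yaml
# Illustrative SparkApplication for spark-operator; adjust image, versions,
# and resources to match your build.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: my-spark-image:latest
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///app/<jar-name>.jar
  sparkVersion: "3.1.1"
  driver:
    serviceAccount: spark-sa
    cores: 1
    memory: 512m
  executor:
    instances: 1
    cores: 1
    memory: 512m
```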
For Python jobs, the application files should be added to the `/app` directory. The `spark-submit` command looks like this:
```shell
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///app/src/main/pi.py
```
Install minikube and kubectl (macOS):

```shell
brew install minikube
brew install kubectl
```
Start the minikube cluster and enable the registry addon:

```shell
minikube start --insecure-registry "10.0.0.0/24" --memory 8192 --cpus 4
minikube addons enable registry
```
Enable pushing images to the minikube Docker registry:

```shell
docker run --rm -it --network=host alpine ash -c "apk add socat && socat TCP-LISTEN:5000,reuseaddr,fork TCP:$(minikube ip):5000"
```

More info here.
For Spark to work, the service account running the application needs permission to start pods in the cluster. The required permissions are defined in the `rbac.yaml` file. Execute:

```shell
kubectl apply -f rbac.yaml
```
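For reference, a typical `rbac.yaml` for this setup creates the `spark-sa` service account and binds it to a role that can manage pods; the actual file in this repo may differ (the binding to the built-in `edit` ClusterRole and the `default` namespace are assumptions):

```yaml
# Illustrative RBAC setup for Spark on Kubernetes.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-sa-edit
  namespace: default
subjects:
  - kind: ServiceAccount
    name: spark-sa
    namespace: default
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```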
Build the Docker image.

Scala:

```shell
docker build -t localhost:5000/spark-local -f Dockerfile .
```

Python:

```shell
docker build -t localhost:5000/spark-local-py -f Dockerfile-python .
```

Push the images to the registry:

```shell
docker push localhost:5000/spark-local
docker push localhost:5000/spark-local-py
```
Execute the jar:

```shell
spark-submit \
  --master k8s://https://$(minikube ip):8443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  --conf spark.kubernetes.container.image=localhost:5000/spark-local \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
```
Execute the `.py` file:

```shell
spark-submit \
  --master k8s://https://$(minikube ip):8443 \
  --deploy-mode cluster \
  --name spark-pi-py \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  --conf spark.kubernetes.container.image=localhost:5000/spark-local-py \
  local:///opt/spark/examples/src/main/python/pi.py
```