Spark on Kubernetes

    Kubernetes supports native Apache Spark applications starting from v1.8 (this requires a Spark build with Kubernetes support, e.g. v2.3). Jobs can be submitted to the Kubernetes cluster directly with the spark-submit command, for example to compute Pi.
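
    The jar-based (Scala) example is not reproduced in this section; the following is only a sketch of what such a submission looks like, reusing the jar path and flags from the Python example below. The example class and the non-Python kubespark image names are assumptions based on the standard Spark examples layout.

    # sketch only: class name and driver/executor image names are assumed
    bin/spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --kubernetes-namespace <k8s-namespace> \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.4.0 \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.4.0 \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar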

    Or, using the Python version:

    bin/spark-submit \
      --deploy-mode cluster \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --kubernetes-namespace <k8s-namespace> \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver-py:v2.2.0-kubernetes-0.4.0 \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor-py:v2.2.0-kubernetes-0.4.0 \
      --jars local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar \
      --py-files local:///opt/spark/examples/src/main/python/sort.py \
      local:///opt/spark/examples/src/main/python/pi.py 10

    The Kubernetes examples repository provides a detailed Spark deployment walkthrough. Because the steps are fairly involved, some parts are simplified here so that you have less to configure during installation.

    • A Kubernetes cluster; see the cluster deployment guide
    • kube-dns working properly (see the quick check after this list)
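
    A minimal way to confirm kube-dns is healthy, assuming the default kube-system deployment and the upstream k8s-app=kube-dns label:

    # check that the kube-dns pods are Running (label is the upstream default)
    $ kubectl get pods -n kube-system -l k8s-app=kube-dns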

    Create a namespace

    namespace-spark-cluster.yaml

    apiVersion: v1
    kind: Namespace
    metadata:
      name: "spark-cluster"
      labels:
        name: "spark-cluster"

    $ kubectl create -f examples/staging/spark/namespace-spark-cluster.yaml
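
    You can confirm the namespace exists before continuing:

    $ kubectl get namespace spark-cluster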

    The original article switches kubectl's working context over to spark-cluster. For convenience we do not do that here; instead, every manifest deployed below explicitly sets its namespace to spark-cluster.
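
    For reference, switching the current kubectl context to default to the spark-cluster namespace (the approach the original takes) would look roughly like this:

    # point the current context at the spark-cluster namespace
    $ kubectl config set-context $(kubectl config current-context) --namespace=spark-cluster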

    Create a replication controller to run the Spark Master service.

    spark-master-controller.yaml

    kind: ReplicationController
    apiVersion: v1
    metadata:
      name: spark-master-controller
      namespace: spark-cluster
    spec:
      replicas: 1
      selector:
        component: spark-master
      template:
        metadata:
          labels:
            component: spark-master
        spec:
          containers:
          - name: spark-master
            image: gcr.io/google_containers/spark:1.5.2_v1
            command: ["/start-master"]
            ports:
            - containerPort: 7077
            - containerPort: 8080
            resources:
              requests:
                cpu: 100m

    $ kubectl create -f spark-master-controller.yaml

    Create the master service

    spark-master-service.yaml

    kind: Service
    apiVersion: v1
    metadata:
      name: spark-master
      namespace: spark-cluster
    spec:
      ports:
      - port: 7077
        targetPort: 7077
        name: spark
      - port: 8080
        targetPort: 8080
        name: http
      selector:
        component: spark-master
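
    The create command for this Service is not shown in the original; following the pattern used for the other manifests, it would be:

    $ kubectl create -f spark-master-service.yaml

    Then verify that the master pod is running and check its logs:
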
    $ kubectl get pod -n spark-cluster
    spark-master-controller-qtwm8   1/1   Running   0   6d

    $ kubectl logs spark-master-controller-qtwm8 -n spark-cluster
    17/08/07 02:34:54 INFO Master: Registered signal handlers for [TERM, HUP, INT]
    17/08/07 02:34:54 INFO SecurityManager: Changing modify acls to: root
    17/08/07 02:34:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
    17/08/07 02:34:55 INFO Slf4jLogger: Slf4jLogger started
    17/08/07 02:34:55 INFO Remoting: Starting remoting
    17/08/07 02:34:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
    17/08/07 02:34:55 INFO Master: Starting Spark master at spark://spark-master:7077
    17/08/07 02:34:55 INFO Master: Running Spark version 1.5.2
    17/08/07 02:34:56 INFO Utils: Successfully started service 'MasterUI' on port 8080.
    17/08/07 02:34:56 INFO MasterWebUI: Started MasterWebUI at http://10.2.6.12:8080
    17/08/07 02:34:56 INFO Utils: Successfully started service on port 6066.
    17/08/07 02:34:56 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
    17/08/07 02:34:56 INFO Master: I have been elected leader! New state: ALIVE

    Once the master has been created and is running, we can check the state of the Spark cluster through the web UI that ships with Spark. To expose it, we will deploy a specialized proxy.

    spark-ui-proxy-controller.yaml

    kind: ReplicationController
    apiVersion: v1
    metadata:
      name: spark-ui-proxy-controller
      namespace: spark-cluster
    spec:
      replicas: 1
      selector:
        component: spark-ui-proxy
      template:
        metadata:
          labels:
            component: spark-ui-proxy
        spec:
          containers:
          - name: spark-ui-proxy
            image: elsonrodriguez/spark-ui-proxy:1.0
            ports:
            - containerPort: 80
            resources:
              requests:
                cpu: 100m
            args:
            - spark-master:8080
            livenessProbe:
              httpGet:
                path: /
                port: 80
              initialDelaySeconds: 120
              timeoutSeconds: 5

    $ kubectl create -f spark-ui-proxy-controller.yaml

    Expose the proxy with a Service. The original article uses a LoadBalancer-type Service; here we change it to NodePort. If your Kubernetes cluster runs on a cloud provider, you can also follow the original article's approach.

    spark-ui-proxy-service.yaml

    kind: Service
    apiVersion: v1
    metadata:
      name: spark-ui-proxy
      namespace: spark-cluster
    spec:
      ports:
      - port: 80
        targetPort: 80
        nodePort: 30080
      selector:
        component: spark-ui-proxy
      type: NodePort

    $ kubectl create -f spark-ui-proxy-service.yaml

    After everything is deployed, you can use kubectl proxy to check the status of your Spark cluster.
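
    For example (8001 is kubectl proxy's default port, matching the URL below):

    $ kubectl proxy --port=8001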

    The cluster can then be viewed at http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-master:8080/. If kubectl proxy is interrupted this no longer works, but since we configured a NodePort earlier, the UI can also be reached on port 30080 of any node (for example http://10.201.2.34:30080).

    Deploy the Spark workers

    Make sure the Master is running before deploying the workers.
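
    A quick way to confirm this, using the component=spark-master label from the manifest above:

    $ kubectl get pods -n spark-cluster -l component=spark-master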

    spark-worker-controller.yaml

    kind: ReplicationController
    apiVersion: v1
    metadata:
      name: spark-worker-controller
      namespace: spark-cluster
    spec:
      replicas: 2
      selector:
        component: spark-worker
      template:
        metadata:
          labels:
            component: spark-worker
        spec:
          containers:
          - name: spark-worker
            image: gcr.io/google_containers/spark:1.5.2_v1
            command: ["/start-worker"]
            ports:
            - containerPort: 8081
            resources:
              requests:
                cpu: 100m

    $ kubectl create -f spark-worker-controller.yaml
    replicationcontroller "spark-worker-controller" created

    $ kubectl get pod -n spark-cluster
    spark-master-controller-qtwm8     1/1   Running   0   6d
    spark-worker-controller-4rxrs     1/1   Running   0   6d
    spark-worker-controller-z6f21     1/1   Running   0   6d
    spark-ui-proxy-controller-d4br2   1/1   Running   4   6d

    You can also check the workers via the web UI service created above.

    At this point the Spark cluster is essentially up and running.
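
    If you need more worker capacity later, the worker replication controller can simply be scaled up; a minimal sketch (the target replica count here is arbitrary):

    # scale the workers from 2 to 4 replicas
    $ kubectl scale rc spark-worker-controller --replicas=4 -n spark-cluster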

    We can use the Zeppelin UI to run jobs directly from a web notebook. For details, see the Zeppelin UI and Spark architecture documentation.

    zeppelin-controller.yaml

    kind: ReplicationController
    apiVersion: v1
    metadata:
      name: zeppelin-controller
      namespace: spark-cluster
    spec:
      replicas: 1
      selector:
        component: zeppelin
      template:
        metadata:
          labels:
            component: zeppelin
        spec:
          containers:
          - name: zeppelin
            image: gcr.io/google_containers/zeppelin:v0.5.6_v1
            ports:
            - containerPort: 8080
            resources:
              requests:
                cpu: 100m

    $ kubectl create -f zeppelin-controller.yaml
    replicationcontroller "zeppelin-controller" created

    Then deploy a Service in the same way.

    zeppelin-service.yaml

    kind: Service
    apiVersion: v1
    metadata:
      name: zeppelin
      namespace: spark-cluster
    spec:
      ports:
      - port: 80
        targetPort: 8080
        nodePort: 30081
      selector:
        component: zeppelin
      type: NodePort
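
    The create command is again not shown in the original; following the same pattern it would be:

    $ kubectl create -f zeppelin-service.yaml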

    As you can see, we set the NodePort to 30081, so the Zeppelin UI can likewise be reached on port 30081 of any node.
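
    Alternatively, you can skip the NodePort and port-forward directly to the Zeppelin pod, then browse to http://localhost:8080. The pod name below is the one from this walkthrough; substitute your own:

    # forward local port 8080 to the Zeppelin pod's port 8080
    $ kubectl port-forward -n spark-cluster zeppelin-controller-8f14f 8080:8080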

    Access pyspark from the command line (remember to replace the pod name with your own):

    $ kubectl exec -it zeppelin-controller-8f14f -n spark-cluster pyspark
    Python 2.7.9 (default, Mar 1 2015, 12:57:24)
    [GCC 4.9.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    17/08/14 01:59:22 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
          /_/

    Using Python version 2.7.9 (default, Mar 1 2015 12:57:24)
    SparkContext available as sc, HiveContext available as sqlContext.
    >>>

    Common Zeppelin issues

    • The Zeppelin image is very large, so pulling it takes some time. The image size problem is being worked on; see issue #17231 for details.
    • On GKE, kubectl port-forward can be unstable over long periods. If you see Zeppelin's status showing as disconnected, the port-forward has probably failed and needs to be restarted. See #12179 for details.

    References