GPU

    Starting with Kubernetes v1.8, GPU support is implemented through the DevicePlugin mechanism. Before using it, the following must be configured:

    • kubelet/kube-apiserver/kube-controller-manager: --feature-gates="DevicePlugins=true"
    • Install the NVIDIA drivers on all Nodes, including the NVIDIA CUDA Toolkit, cuDNN, etc.
    • Configure the kubelet to use the docker container engine (docker is the default); other container engines do not yet support this feature

    NVIDIA plugin

    The NVIDIA device plugin requires nvidia-docker.

    Install nvidia-docker
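    A minimal installation sketch for Ubuntu/Debian hosts, following the repository setup described in NVIDIA's nvidia-docker documentation (adjust the distribution string and package manager for other systems):

    # Add NVIDIA's package repository (Ubuntu/Debian example)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update

    # Install nvidia-docker2 and reload the Docker daemon configuration
    sudo apt-get install -y nvidia-docker2
    sudo pkill -SIGHUP dockerd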

    Set nvidia as the default Docker runtime

    # cat /etc/docker/daemon.json
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

    Deploy the NVIDIA device plugin

    # For Kubernetes v1.8
    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml
    # For Kubernetes v1.9
    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
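    Once the device plugin DaemonSet is running, GPU Nodes should advertise the nvidia.com/gpu resource in their allocatable capacity. A quick way to check (the column names are arbitrary):

    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"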

    GCE/GKE GPU plugin

    # Install NVIDIA drivers on Container-Optimized OS:
    kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/daemonset.yaml
    # Install NVIDIA drivers on Ubuntu (experimental):
    kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/nvidia-driver-installer/ubuntu/daemonset.yaml
    # Install the device plugin:
    kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml

    Example of requesting nvidia.com/gpu resources
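    A minimal Pod sketch that requests one GPU through the nvidia.com/gpu resource, modeled on the example in the upstream Kubernetes documentation (the Pod name and image are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vector-add
        image: "k8s.gcr.io/cuda-vector-add:v0.1"
        resources:
          limits:
            nvidia.com/gpu: 1 # requests one GPU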

    Kubernetes v1.6 and v1.7

    Using GPUs in Kubernetes v1.6 and v1.7 requires the following to be configured in advance:

    • Install the NVIDIA drivers on all Nodes, including the NVIDIA CUDA Toolkit, cuDNN, etc.
    • Enable --feature-gates="Accelerators=true" on the apiserver and kubelet
    • Configure the kubelet to use the docker container engine (docker is the default); other container engines do not yet support this feature

    Use the resource name alpha.kubernetes.io/nvidia-gpu to specify the number of GPUs requested, for example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: tensorflow
    spec:
      restartPolicy: Never
      containers:
      - image: gcr.io/tensorflow/tensorflow:latest-gpu
        name: gpu-container-1
        command: ["python"]
        env:
        - name: LD_LIBRARY_PATH
          value: /usr/lib/nvidia
        args:
        - -u
        - -c
        - from tensorflow.python.client import device_lib; print device_lib.list_local_devices()
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1 # requests one GPU
        volumeMounts:
        - mountPath: /usr/local/nvidia/bin
          name: bin
        - mountPath: /usr/lib/nvidia
          name: lib
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so
          name: libcuda-so
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
          name: libcuda-so-1
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.375.66
          name: libcuda-so-375-66
      volumes:
      - name: bin
        hostPath:
          path: /usr/lib/nvidia-375/bin
      - name: lib
        hostPath:
          path: /usr/lib/nvidia-375
      - name: libcuda-so
        hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so
      - name: libcuda-so-1
        hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
      - name: libcuda-so-375-66
        hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so.375.66
    $ kubectl create -f pod.yaml
    pod "tensorflow" created
    $ kubectl logs tensorflow
    ...
    [name: "/cpu:0"
    device_type: "CPU"
    memory_limit: 268435456
    locality {
    }
    incarnation: 9675741273569321173
    , name: "/gpu:0"
    device_type: "GPU"
    memory_limit: 11332668621
    locality {
      bus_id: 1
    }
    incarnation: 7807115828340118187
    physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0"
    ]

    Notes

    • GPU resources must be requested in resources.limits; specifying them only in resources.requests has no effect
    • A container can request one or more whole GPUs, but not a fraction of a GPU
    • A GPU cannot be shared between containers
    • By default, all Nodes are assumed to have the same GPU model installed

    If the Nodes in the cluster have different GPU models installed, Node Affinity can be used to schedule Pods onto Nodes with a specific GPU model.

    # Label your nodes with the accelerator type they have.
    kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
    kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100

    Then set a Node Affinity when creating the Pod, as sketched below:
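    A minimal sketch, assuming the accelerator labels from the previous step (the Pod name and image are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-tesla-k80
      containers:
      - name: cuda-vector-add
        image: "k8s.gcr.io/cuda-vector-add:v0.1"
        resources:
          limits:
            nvidia.com/gpu: 1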

    The NVIDIA CUDA Toolkit, cuDNN, etc. must be pre-installed on all Nodes. To access /usr/lib/nvidia-375, the CUDA libraries need to be passed into the container as hostPath volumes:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: nvidia-smi
      labels:
        name: nvidia-smi
    spec:
      template:
        metadata:
          labels:
            name: nvidia-smi
        spec:
          containers:
          - name: nvidia-smi
            image: nvidia/cuda
            command: ["nvidia-smi"]
            imagePullPolicy: IfNotPresent
            resources:
              limits:
                alpha.kubernetes.io/nvidia-gpu: 1
            volumeMounts:
            - mountPath: /usr/local/nvidia/bin
              name: bin
            - mountPath: /usr/lib/nvidia
              name: lib
          volumes:
          - name: bin
            hostPath:
              path: /usr/lib/nvidia-375/bin
          - name: lib
            hostPath:
              path: /usr/lib/nvidia-375
          restartPolicy: Never
    $ kubectl create -f job.yaml
    job "nvidia-smi" created
    $ kubectl get job
    NAME         DESIRED   SUCCESSFUL   AGE
    nvidia-smi   1         1            14m
    $ kubectl get pod -a
    NAME               READY     STATUS      RESTARTS   AGE
    nvidia-smi-kwd2m   0/1       Completed   0          14m
    $ kubectl logs nvidia-smi-kwd2m
    Fri Jun 16 19:49:53 2017
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 0000:00:04.0     Off |                    0 |
    | N/A   74C    P0    80W / 149W |      0MiB / 11439MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    Install CUDA:

    # Check for CUDA and try to install.
    if ! dpkg-query -W cuda; then
      # The 16.04 installer works with 16.10.
      curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
      dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
      apt-get update
      apt-get install cuda -y
    fi

    Install cuDNN:

    First register at https://developer.nvidia.com/cudnn and download cuDNN v5.1, then install it with the commands sketched below.
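    A minimal sketch, assuming the downloaded archive is named cudnn-8.0-linux-x64-v5.1.tgz and CUDA 8.0 is installed under /usr/local/cuda-8.0 (adjust the file name and paths to your download and CUDA location):

    # Unpack the cuDNN archive and copy the headers and libraries into the CUDA tree
    tar zxvf cudnn-8.0-linux-x64-v5.1.tgz
    sudo cp -P cuda/include/cudnn.h /usr/local/cuda-8.0/include
    sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-8.0/lib64
    sudo chmod a+r /usr/local/cuda-8.0/include/cudnn.h /usr/local/cuda-8.0/lib64/libcudnn*

    After the drivers and libraries are installed, nvidia-smi should report the GPU status: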

    $ nvidia-smi
    Fri Jun 16 19:33:35 2017
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 0000:00:04.0     Off |                    0 |
    | N/A   74C    P0    80W / 149W |      0MiB / 11439MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+