PyTorch Training

This Kubeflow component has stable status. See the Kubeflow versioning policies.

This guide walks you through using PyTorch with Kubeflow.

If you haven't already done so, please follow the deployment guide to deploy Kubeflow.

An alpha version of PyTorch support was introduced in Kubeflow 0.2.0. To use it, you must be running a Kubeflow version between 0.2.0 and 0.3.5.

Check that the PyTorch custom resource is installed:

```shell
kubectl get crd
```

The output should include `pytorchjobs.kubeflow.org`:

```
NAME                        AGE
...
pytorchjobs.kubeflow.org    4d
...
```

If it is not included, you can add it as follows:

```shell
export KF_DIR=<your Kubeflow installation directory>
cd ${KF_DIR}/kustomize
kubectl apply -f pytorch-job-crds.yaml
kubectl apply -f pytorch-operator.yaml
```

You can create a PyTorch training job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file based on your requirements.
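A minimal PyTorchJob config might look like the following sketch. The image, port, and replica counts are taken from the MNIST example used throughout this guide; the exact manifest lives in the example repository, so treat this as an illustration rather than the authoritative file:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-tcp-dist-mnist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            ports:
            - containerPort: 23456
              name: pytorchjob-port
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
```

The operator uses the `Master` and `Worker` replica specs to create one pod per replica and wires them together for distributed training.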

```shell
kubectl create -f pytorch_job_mnist.yaml
```

You should now be able to see the created pods matching the specified number of replicas.

```shell
kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist
```

Training should run for about 10 epochs and take 5-10 minutes on a CPU cluster. Inspect the logs to see the training progress.
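One way to fetch the logs is to look up the master pod by label and follow its output. This is a sketch: the `pytorch_job_name` label matches the selector used above, while the `pytorch-replica-type` and `pytorch-replica-index` labels are assumed to be the ones the operator applies to job pods:

```shell
# Look up the master pod of the job by its labels (label names assumed
# from the operator's conventions), then stream its logs.
PODNAME=$(kubectl get pods \
  -l pytorch_job_name=pytorch-tcp-dist-mnist,pytorch-replica-type=master,pytorch-replica-index=0 \
  -o name)
kubectl logs -f ${PODNAME}
```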

```shell
kubectl get -o yaml pytorchjobs pytorch-tcp-dist-mnist
```

See the status section of the output to monitor the job status. Here is sample output when the job has completed successfully:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-12-16T21:39:09Z
  generation: 1
  name: pytorch-tcp-dist-mnist
  namespace: default
  resourceVersion: "15532"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
  uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
status:
  completionTime: 2018-12-16T21:43:27Z
  conditions:
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:39:09Z
    message: PyTorchJob pytorch-tcp-dist-mnist is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:40:45Z
    message: PyTorchJob pytorch-tcp-dist-mnist is running.
    reason: PyTorchJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:43:27Z
    message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
    reason: PyTorchJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master: {}
```
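Instead of reading through the full YAML, you can query the terminal condition directly with kubectl's jsonpath output format. This is a sketch for the example job above; the filter expression selects the `Succeeded` condition from `status.conditions`:

```shell
# Print the status of the Succeeded condition; this prints "True"
# once the job has completed successfully.
kubectl get pytorchjobs pytorch-tcp-dist-mnist \
  -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'
```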