PyTorch Training

This Kubeflow component has stable status. See the Kubeflow versioning policies.

This guide walks you through using PyTorch with Kubeflow.

If you haven't already done so, please follow the deployment guide to deploy Kubeflow.

An alpha version of PyTorch support was introduced in Kubeflow 0.2.0. To use it, you must be running a Kubeflow version between 0.2.0 and 0.3.5.

Check that the PyTorch custom resource is installed:

```shell
kubectl get crd
```

The output should include `pytorchjobs.kubeflow.org`:

```
NAME                        AGE
...
pytorchjobs.kubeflow.org    4d
...
```

If it is not included, you can add it as follows:

```shell
export KF_DIR=<your Kubeflow installation directory>
cd ${KF_DIR}/kustomize
kubectl apply -f pytorch-job-crds.yaml
kubectl apply -f pytorch-operator.yaml
```

You can create a PyTorch training job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file based on your requirements.
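A minimal PyTorchJob config might look like the following sketch. The image, port, and replica counts are taken from the MNIST example used throughout this guide; the exact manifest lives in the example repository, so treat this as an illustration rather than the authoritative file:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-tcp-dist-mnist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            ports:
            - containerPort: 23456
              name: pytorchjob-port
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
```

The operator uses the `Master` and `Worker` replica specs to create one pod per replica and wires them together for distributed training.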

```shell
kubectl create -f pytorch_job_mnist.yaml
```

You should now be able to see the created pods matching the specified number of replicas.

```shell
kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist
```

Training should run for about 10 epochs and take 5-10 minutes on a CPU cluster. Inspect the logs to see the training progress.
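One way to fetch the logs is to look up the master pod by label and follow its output. This is a sketch: the `pytorch_job_name` label matches the selector used above, while the `pytorch-replica-type` and `pytorch-replica-index` labels are assumed to be the ones the operator applies to job pods:

```shell
# Look up the master pod of the job by its labels (label names assumed
# from the operator's conventions), then stream its logs.
PODNAME=$(kubectl get pods \
  -l pytorch_job_name=pytorch-tcp-dist-mnist,pytorch-replica-type=master,pytorch-replica-index=0 \
  -o name)
kubectl logs -f ${PODNAME}
```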

```shell
kubectl get -o yaml pytorchjobs pytorch-tcp-dist-mnist
```

See the status section of the output to monitor the job status. Here is sample output when the job has completed successfully:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-12-16T21:39:09Z
  generation: 1
  name: pytorch-tcp-dist-mnist
  namespace: default
  resourceVersion: "15532"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
  uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
status:
  completionTime: 2018-12-16T21:43:27Z
  conditions:
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:39:09Z
    message: PyTorchJob pytorch-tcp-dist-mnist is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:40:45Z
    message: PyTorchJob pytorch-tcp-dist-mnist is running.
    reason: PyTorchJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:43:27Z
    message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
    reason: PyTorchJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master: {}
```
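Instead of reading through the full YAML, you can query the terminal condition directly with kubectl's jsonpath output format. This is a sketch for the example job above; the filter expression selects the `Succeeded` condition from `status.conditions`:

```shell
# Print the status of the Succeeded condition; this prints "True"
# once the job has completed successfully.
kubectl get pytorchjobs pytorch-tcp-dist-mnist \
  -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'
```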