PyTorch Training
This Kubeflow component has stable status. See the Kubeflow versioning policies.
This guide walks you through using PyTorch with Kubeflow.
If you haven't already done so, please follow the Getting Started Guide to deploy Kubeflow.
An alpha version of PyTorch support was introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow between 0.2.0 and 0.3.5 to use this version.
Check that the PyTorch custom resource is installed:
kubectl get crd
The output should include pytorchjobs.kubeflow.org:
NAME                       AGE
...
pytorchjobs.kubeflow.org   4d
...
If it is not included, you can add it as follows:
export KF_DIR=<your Kubeflow installation directory>
cd ${KF_DIR}/kustomize
kubectl apply -f pytorch-job-crds.yaml
kubectl apply -f pytorch-operator.yaml
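The CRD check above can also be scripted. The following is a minimal sketch, not part of the Kubeflow tooling: it assumes the listing shown earlier comes from `kubectl get crd` and that `kubectl` is on the PATH. The helper names `crd_names` and `pytorch_crd_installed` are hypothetical.

```python
import subprocess
from typing import List

# CRD name taken from the expected output shown above.
CRD_NAME = "pytorchjobs.kubeflow.org"


def crd_names(kubectl_output: str) -> List[str]:
    """Return the first column of a CRD listing, skipping the header row."""
    names = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if fields and fields[0] != "NAME":
            names.append(fields[0])
    return names


def pytorch_crd_installed() -> bool:
    """Run `kubectl get crd` (assumed command) and look for the PyTorchJob CRD."""
    result = subprocess.run(
        ["kubectl", "get", "crd"],
        check=True,
        capture_output=True,
        text=True,
    )
    return CRD_NAME in crd_names(result.stdout)
```

Calling `pytorch_crd_installed()` against a live cluster returns `False` when the CRD is missing, in which case the `kubectl apply` commands above apply.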
You can create a PyTorch job by defining a PyTorchJob config file. See the manifests for the distributed MNIST example. You may change the config file based on your requirements.
kubectl create -f pytorch_job_mnist.yaml
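To give a rough idea of what such a config file contains, the sketch below builds a reduced PyTorchJob manifest in Python and writes it as JSON, which `kubectl create -f` also accepts. The field values (API version, image, replica counts, restart policy) are copied from the sample output later on this page. The function names are hypothetical, and the actual MNIST example manifest may set additional fields that are omitted here.

```python
import json

# Image name as it appears in the sample output on this page.
IMAGE = "gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0"


def replica_spec(replicas: int) -> dict:
    """Build one entry of pytorchReplicaSpecs with a single container."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [{"name": "pytorch", "image": IMAGE}],
            }
        },
    }


def build_job(name: str, workers: int) -> dict:
    """Build a reduced PyTorchJob manifest: one Master plus `workers` Workers."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1),
                "Worker": replica_spec(workers),
            }
        },
    }


def write_job(path: str, job: dict) -> None:
    """Write the manifest as JSON so it can be passed to `kubectl create -f`."""
    with open(path, "w") as handle:
        json.dump(job, handle, indent=2)
```

For example, `write_job("pytorch_job_mnist.json", build_job("pytorch-tcp-dist-mnist", 3))` produces a file that could be submitted with `kubectl create -f pytorch_job_mnist.json`.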
You should now be able to see the created pods, matching the specified number of replicas:
kubectl get pods -l pytorch_job_name=pytorch-tcp-dist-mnist
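The replica check can be automated by asking `kubectl` for JSON output (adding `-o json` to the command above) and counting the pods. This is a sketch under that assumption; the helper names are hypothetical. For the example job, the expected count would be 4 (1 Master plus 3 Workers, per the sample output on this page).

```python
import json
from collections import Counter


def pod_phases(pods_json: str) -> Counter:
    """Count pods by status.phase in `kubectl get pods -o json` output."""
    items = json.loads(pods_json).get("items", [])
    return Counter(
        pod.get("status", {}).get("phase", "Unknown") for pod in items
    )


def all_replicas_present(pods_json: str, expected: int) -> bool:
    """True when the number of listed pods equals the expected replica total."""
    return sum(pod_phases(pods_json).values()) == expected
```

A pod that is listed is not necessarily running yet, so `pod_phases` is useful for distinguishing `Pending` pods from `Running` ones.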
Training should run for about 10 epochs and take 5 to 10 minutes on a CPU cluster. You can inspect the pod logs to follow training progress.
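One way to inspect the logs is to select the job's pods by the same `pytorch_job_name` label used above. The sketch below only assembles and runs a `kubectl logs` command; the helper names are hypothetical, and selecting all pods of the job (rather than only the master pod) is an assumption made here for simplicity.

```python
import subprocess
from typing import List


def logs_command(job_name: str, follow: bool = False) -> List[str]:
    """Build a `kubectl logs` command for every pod labelled with the job name."""
    cmd = [
        "kubectl", "logs",
        "--selector", f"pytorch_job_name={job_name}",
        "--tail=-1",  # print all lines; with a selector kubectl shows only the last few by default
    ]
    if follow:
        cmd.append("--follow")  # keep streaming while training runs
    return cmd


def show_logs(job_name: str, follow: bool = False) -> None:
    """Run the command; requires kubectl on the PATH and access to the cluster."""
    subprocess.run(logs_command(job_name, follow), check=True)
```

For the example job this would be `show_logs("pytorch-tcp-dist-mnist", follow=True)`.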
To monitor the job, fetch the PyTorchJob resource:
kubectl get -o yaml pytorchjobs pytorch-tcp-dist-mnist
See the status section of the output for the job status. Here is sample output for a job that completed successfully:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-12-16T21:39:09Z
  generation: 1
  name: pytorch-tcp-dist-mnist
  namespace: default
  resourceVersion: "15532"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/pytorchjobs/pytorch-tcp-dist-mnist
  uid: 059391e8-017b-11e9-bf13-06afd8f55a5c
spec:
  cleanPodPolicy: None
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
            name: pytorch
            ports:
            - containerPort: 23456
              name: pytorchjob-port
            resources: {}
status:
  completionTime: 2018-12-16T21:43:27Z
  conditions:
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:39:09Z
    message: PyTorchJob pytorch-tcp-dist-mnist is created.
    reason: PyTorchJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:40:45Z
    message: PyTorchJob pytorch-tcp-dist-mnist is running.
    reason: PyTorchJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2018-12-16T21:39:09Z
    lastUpdateTime: 2018-12-16T21:43:27Z
    message: PyTorchJob pytorch-tcp-dist-mnist is successfully completed.
    reason: PyTorchJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master: {}
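The conditions list in the status section can be read programmatically. The sketch below assumes the job was fetched with `-o json` instead of `-o yaml`, and that conditions are appended in chronological order as in the sample above, so the last condition whose status is "True" describes the current state. The function names are hypothetical.

```python
import json
from typing import List, Optional


def current_condition(conditions: List[dict]) -> Optional[str]:
    """Return the type of the last condition whose status is "True", if any."""
    active = [c["type"] for c in conditions if c.get("status") == "True"]
    return active[-1] if active else None


def job_state(job_json: str) -> Optional[str]:
    """Extract the current state from `kubectl get -o json pytorchjobs <name>` output."""
    job = json.loads(job_json)
    return current_condition(job.get("status", {}).get("conditions", []))
```

Applied to the sample output, where Created is "True", Running is "False", and Succeeded is "True", `current_condition` returns `"Succeeded"`.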