MXNet Training

Alpha

This Kubeflow component has alpha status with limited support. See the Kubeflow versioning policies. The Kubeflow team is interested in your feedback about the usability of the feature.

This guide walks you through using MXNet with Kubeflow.

If you haven’t already done so, please follow the Kubeflow getting-started guide to deploy Kubeflow.

Initial MXNet support was introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow newer than 0.2.0.

Verify that MXNet support is included in your Kubeflow deployment

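Check that the MXNet custom resource is installed by listing the custom resource definitions with kubectl get crd. The output should include mxjobs.kubeflow.org: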

  NAME                  AGE
  ...
  mxjobs.kubeflow.org   4d
  ...
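If you want to automate this check, the listing can be scanned for the MXJob entry. The following is a minimal sketch, assuming the listing has the two-column shape shown above; the function name `has_mxjob_crd` is illustrative and not part of Kubeflow.

```python
def has_mxjob_crd(crd_listing: str) -> bool:
    """Return True if a `kubectl get crd` listing contains the MXJob CRD."""
    for line in crd_listing.splitlines():
        fields = line.split()
        # The first column of each row is the CRD name.
        if fields and fields[0] == "mxjobs.kubeflow.org":
            return True
    return False


# Sample listing in the same shape as the output shown in this guide.
sample = """NAME                  AGE
mxjobs.kubeflow.org   4d
"""
print(has_mxjob_crd(sample))
```

In practice the listing could come from `subprocess.run(["kubectl", "get", "crd"], capture_output=True, text=True).stdout`.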

If it is not included, you can install the operator with kustomize. Alternatively, you can deploy the operator with default settings, without using kustomize, by running the following from the mxnet-operator repo:

  git clone https://github.com/kubeflow/mxnet-operator.git
  cd mxnet-operator
  kubectl create -f manifests/crd-v1beta1.yaml
  kubectl create -f manifests/rbac.yaml
  kubectl create -f manifests/deployment.yaml

You create a training job by defining an MXJob with jobMode set to MXTrain and then creating it with kubectl create -f, passing the path to your job definition.
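As a rough illustration of what such a job definition contains, the sketch below builds an MXTrain job in Python and prints it as JSON, which kubectl also accepts. The field names and values mirror the sample job output shown later in this guide. The helper names `replica` and `build_mxjob` are illustrative assumptions, not part of mxnet-operator.

```python
import json


def replica(image, args=None, gpus=0):
    """Build one replica spec, mirroring the fields in the sample job output."""
    container = {
        "name": "mxnet",
        "image": image,
        "ports": [{"containerPort": 9091, "name": "mxjob-port"}],
    }
    if args:
        container["command"] = ["python"]
        container["args"] = list(args)
    if gpus:
        container["resources"] = {"limits": {"nvidia.com/gpu": str(gpus)}}
    return {
        "replicas": 1,
        "restartPolicy": "Never",
        "template": {"spec": {"containers": [container]}},
    }


def build_mxjob(name, image, train_args):
    """Assemble an MXJob in MXTrain mode with one Scheduler, Server and Worker."""
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "MXJob",
        "metadata": {"name": name},
        "spec": {
            "jobMode": "MXTrain",
            "mxReplicaSpecs": {
                "Scheduler": replica(image),
                "Server": replica(image),
                "Worker": replica(image, args=train_args, gpus=1),
            },
        },
    }


manifest = build_mxjob(
    "mxnet-job",
    "mxjob/mxnet:gpu",
    [
        "/incubator-mxnet/example/image-classification/train_mnist.py",
        "--num-epochs", "10",
        "--num-layers", "2",
        "--kv-store", "dist_device_sync",
    ],
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file (for example a hypothetical mxnet-job.json) lets you submit it with kubectl create -f in the same way as a YAML definition.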

Creating a TVM tuning job (AutoTVM)

TVM is an end-to-end deep learning compiler stack, and you can easily run AutoTVM with mxnet-operator. You create an auto-tuning job by defining an MXTune job and then creating it with

  kubectl create -f examples/v1beta1/tune/mx_job_tune_gpu.yaml

Before you use the auto-tuning example, some preparatory work is needed. First, to let TVM tune your network, create a Docker image that includes the TVM module. Second, write an auto-tuning script that specifies which network to tune and sets the auto-tuning parameters; for more details, please see the TVM auto-tuning documentation. Finally, write a startup script that launches the auto-tuning program. mxnet-operator sets all the parameters as environment variables, so the startup script needs to read these variables and pass them to the auto-tuning script. An example is provided under examples/v1beta1/tune/; in that example, the tuning result is saved in a log file such as resnet-18.log. Refer to it for details.
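The role of the startup script can be sketched as follows. This is only an illustration under stated assumptions: the `TUNE_` prefix, the variable names and the script name `auto-tuning.py` are hypothetical, because this guide does not list the variables that mxnet-operator actually sets. Check the example under examples/v1beta1/tune/ for the real names.

```python
import sys

# Hypothetical prefix: the variable names set by mxnet-operator are not given
# in this guide, so treat this value as a placeholder.
ENV_PREFIX = "TUNE_"


def build_tuning_command(environ, script="auto-tuning.py"):
    """Turn prefixed environment variables into CLI flags for the tuning script."""
    command = [sys.executable, script]
    for key in sorted(environ):
        if key.startswith(ENV_PREFIX):
            # TUNE_N_TRIAL becomes --n-trial, and so on.
            flag = "--" + key[len(ENV_PREFIX):].lower().replace("_", "-")
            command += [flag, environ[key]]
    return command


# Example with made-up variables; a real startup script would pass os.environ.
example_env = {"TUNE_NETWORK": "resnet-18", "TUNE_N_TRIAL": "100", "HOME": "/root"}
command = build_tuning_command(example_env)
print(command)
```

A real startup script would then launch the tuning program, for instance with `subprocess.run(command, check=True)`.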

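To check the status of a job, retrieve it with kubectl get -o yaml mxjobs followed by the job name. Here is sample output for an example job: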

  apiVersion: kubeflow.org/v1beta1
  kind: MXJob
  metadata:
    creationTimestamp: 2019-03-19T09:24:27Z
    generation: 1
    name: mxnet-job
    namespace: default
    resourceVersion: "3681685"
    selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/mxjobs/mxnet-job
    uid: cb11013b-4a28-11e9-b7f4-704d7bb59f71
  spec:
    cleanPodPolicy: All
    jobMode: MXTrain
    mxReplicaSpecs:
      Scheduler:
        replicas: 1
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
              ports:
              - containerPort: 9091
                name: mxjob-port
              resources: {}
      Server:
        replicas: 1
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
              ports:
              - containerPort: 9091
                name: mxjob-port
              resources: {}
      Worker:
        replicas: 1
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - args:
              - /incubator-mxnet/example/image-classification/train_mnist.py
              - --num-epochs
              - "10"
              - --num-layers
              - "2"
              - --kv-store
              - dist_device_sync
              command:
              - python
              image: mxjob/mxnet:gpu
              name: mxnet
              ports:
              - containerPort: 9091
                name: mxjob-port
              resources:
                limits:
                  nvidia.com/gpu: "1"
  status:
    completionTime: 2019-03-19T09:25:11Z
    conditions:
    - lastTransitionTime: 2019-03-19T09:24:27Z
      lastUpdateTime: 2019-03-19T09:24:27Z
      message: MXJob mxnet-job is created.
      reason: MXJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2019-03-19T09:24:27Z
      lastUpdateTime: 2019-03-19T09:24:29Z
      message: MXJob mxnet-job is running.
      reason: MXJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: 2019-03-19T09:24:27Z
      lastUpdateTime: 2019-03-19T09:25:11Z
      message: MXJob mxnet-job is successfully completed.
      reason: MXJobSucceeded
      status: "True"
      type: Succeeded
    mxReplicaStatuses:
      Scheduler: {}
      Server: {}
    startTime: 2019-03-19T09:24:29Z
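The conditions list in the status section is what tells you whether the job finished. The following minimal sketch, assuming the job object has already been loaded into a Python dictionary, checks for a Succeeded condition whose status is "True". The function name `job_succeeded` is illustrative.

```python
def job_succeeded(job):
    """Return True if the MXJob has a Succeeded condition with status "True"."""
    conditions = job.get("status", {}).get("conditions", [])
    return any(
        cond.get("type") == "Succeeded" and cond.get("status") == "True"
        for cond in conditions
    )


# Conditions copied from the sample output, reduced to the fields used here.
sample_job = {
    "status": {
        "conditions": [
            {"type": "Created", "status": "True"},
            {"type": "Running", "status": "False"},
            {"type": "Succeeded", "status": "True"},
        ]
    }
}
print(job_succeeded(sample_job))
```

One way to obtain the dictionary is to request JSON instead of YAML (kubectl get -o json mxjobs followed by the job name) and parse the result with `json.loads`.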


Last modified 04.02.2020: