Getting Started with Katib

    This guide shows how to get started with Katib and run a few examples using the command line and the Katib user interface (UI) to perform hyperparameter tuning.

    For an overview of the concepts around Katib and hyperparameter tuning, check the introduction to Katib.

    Let’s set up and configure Katib on your Kubernetes cluster with Kubeflow.

    You can skip this step if you have already installed Kubeflow. Your Kubeflow deployment includes Katib.

    To install Katib as part of Kubeflow, follow the Kubeflow installation guide.

    If you want to install Katib separately from Kubeflow, or to get a later version of Katib, run the following commands to install Katib directly from its repository on GitHub and deploy Katib to your cluster:
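    As a sketch, the make-based flow from the Katib repository looks like this (the make deploy target here is an assumption, mirroring the make undeploy teardown shown at the end of this guide; check the repository's Makefile for the exact target):

```shell
# Clone the Katib repository and deploy its manifests to the current
# kubectl context. Requires kustomize >= 3.2, as noted below.
git clone https://github.com/kubeflow/katib
cd katib
make deploy
```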

    Note: You should have kustomize version >= 3.2 to install Katib.

    If you used the deployment script to deploy Katib, you can skip this step. The script deploys a PersistentVolumeClaim (PVC) and PersistentVolume (PV) on your cluster.

    You can skip this step if you’re using Kubeflow on Google Kubernetes Engine (GKE) or if your Kubernetes cluster includes a StorageClass for dynamic volume provisioning. For more information, check the Kubernetes documentation on dynamic provisioning and persistent volumes (PVs).

    If you’re using Katib outside GKE and your cluster doesn’t include a StorageClass for dynamic volume provisioning, you must create a PV to bind to the PVC required by Katib.

    After deploying Katib to your cluster, create the PV by running kubectl apply against a YAML file that defines the properties of the PV.
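    If you need to write that YAML file yourself, a minimal hostPath sketch might look like the following (the name, capacity, and path here are assumptions; match them to the PVC your Katib installation creates):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  hostPath:
    path: /tmp/katib
```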

    You can use the Katib user interface (UI) to submit experiments and to monitor your results. The Katib home page within Kubeflow looks like this:

    If you installed Katib as part of Kubeflow, you can access the Katib UI from the Kubeflow UI:

    1. Open the Kubeflow UI. Check the guide to accessing the central dashboard.
    2. Click Katib in the left-hand menu.

    Alternatively, you can set up port-forwarding for the Katib UI service:

```shell
kubectl port-forward svc/katib-ui -n kubeflow 8080:80
```

    Then you can access the Katib UI at this URL:

```
http://localhost:8080/katib/
```

    Check the contributor guide if you want to contribute to the Katib UI.

    This section introduces some examples that you can run to try Katib.

    You can create an experiment for Katib by defining the experiment in a YAML configuration file. The YAML file defines the configurations for the experiment, including the hyperparameter feasible space, optimization parameter, optimization goal, suggestion algorithm, and so on.

    This example uses the YAML file for the random algorithm example.

    The random algorithm example uses an MXNet neural network to train an image classification model on the MNIST dataset. You can check the training container source code in the Katib repository. The experiment runs twelve training jobs with various hyperparameters and saves the results.

    If you installed Katib as part of Kubeflow, you can’t run experiments in the Kubeflow namespace. Run the following commands to change the namespace and launch an experiment using the random algorithm example:

    1. Download random-example.yaml:

```shell
curl https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/random-example.yaml --output random-example.yaml
```

    2. Edit random-example.yaml and change the following line to use your Kubeflow user profile namespace:

```yaml
namespace: kubeflow
```

    3. (Optional) Note: Katib’s experiments don’t work with Istio sidecar injection. If your Kubeflow installation uses Istio, you have to disable sidecar injection. To do that, specify the annotation sidecar.istio.io/inject: "false" in your experiment’s trial template.

      For the provided random example with the Kubernetes Job trial template, the annotation should be under .trialTemplate.trialSpec.spec.template.metadata.annotations. For Kubeflow TFJob or other training operators, check the trial template guide for how to set the annotation.

    4. Deploy the example by running kubectl apply -f random-example.yaml.

    This example embeds the hyperparameters as arguments. You can embed hyperparameters in another way (for example, using environment variables) by changing the template defined in the trialTemplate.trialSpec section of the YAML file. The template uses the ${trialParameters.&lt;name&gt;} format and substitutes the parameters defined in trialTemplate.trialParameters. Follow the trial template guide to learn more.
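    To illustrate that substitution (this is a sketch of the templating idea, not Katib's actual implementation), each ${trialParameters.&lt;name&gt;} placeholder is replaced with the value the suggestion algorithm assigned to the referenced parameter:

```python
import re

def render_trial_spec(template: str, trial_parameters: dict) -> str:
    """Substitute ${trialParameters.<name>} placeholders with assigned values."""
    def substitute(match):
        return str(trial_parameters[match.group(1)])
    return re.sub(r"\$\{trialParameters\.(\w+)\}", substitute, template)

# The command-line fragment from the random example's trial spec:
args = "--lr=${trialParameters.learningRate} --optimizer=${trialParameters.optimizer}"
print(render_trial_spec(args, {"learningRate": 0.03, "optimizer": "sgd"}))
# → --lr=0.03 --optimizer=sgd
```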

    This example randomly generates the following hyperparameters:

    • --lr: Learning rate. Type: double.
    • --num-layers: Number of layers in the neural network. Type: integer.
    • --optimizer: Optimization method to change the neural network attributes. Type: categorical.
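    To make the feasible space concrete, here is a rough sketch of how a random search draws one assignment per trial (illustrative only, not the Katib suggestion service; the num-layers minimum is an assumption, since the example output shows only its maximum):

```python
import random

def sample_trial(rng: random.Random) -> dict:
    """Draw one hyperparameter assignment from the example's feasible space."""
    return {
        "lr": rng.uniform(0.01, 0.03),                     # double in [0.01, 0.03]
        "num-layers": rng.randint(2, 5),                   # int (min of 2 assumed)
        "optimizer": rng.choice(["sgd", "adam", "ftrl"]),  # categorical
    }

rng = random.Random(0)
trials = [sample_trial(rng) for _ in range(12)]  # maxTrialCount: 12
```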

    Check the experiment status:

```shell
kubectl -n <YOUR_USER_PROFILE_NAMESPACE> get experiment random-example -o yaml
```

    The output of the above command should look similar to this:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  creationTimestamp: "2020-10-23T21:27:53Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  name: random-example
  namespace: "<YOUR_USER_PROFILE_NAMESPACE>"
  resourceVersion: "147081981"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/<YOUR_USER_PROFILE_NAMESPACE>/experiments/random-example
  uid: fb3776e8-0f83-4783-88b6-80d06867ca0b
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.99
    metricStrategies:
    - name: Validation-accuracy
      value: max
    - name: Train-accuracy
      value: max
    objectiveMetricName: Validation-accuracy
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "5"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: lr
    - description: Number of training model layers
      name: numberLayers
      reference: num-layers
    - description: Training model optimizer (sdg, adam or ftrl)
      name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=64
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-e294a90
              name: training-container
            restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2020-10-23T21:27:53Z"
    lastUpdateTime: "2020-10-23T21:27:53Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2020-10-23T21:28:13Z"
    lastUpdateTime: "2020-10-23T21:28:13Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    observation:
      metrics:
      - latest: "0.993170"
        max: "0.993170"
        min: "0.920293"
        name: Train-accuracy
      - latest: "0.978006"
        max: "0.978603"
        min: "0.959295"
        name: Validation-accuracy
    parameterAssignments:
    - name: lr
      value: "0.02889324678979306"
    - name: num-layers
      value: "5"
    - name: optimizer
      value: sgd
  runningTrialList:
  - random-example-26d5wzn2
  - random-example-98fpd29m
  - random-example-x2vjlzzv
  startTime: "2020-10-23T21:27:53Z"
  succeededTrialList:
  - random-example-n9c4j4cv
  - random-example-qfb68jpb
  - random-example-s96tq48v
  - random-example-smpc6ws2
  trials: 7
  trialsRunning: 3
  trialsSucceeded: 4
```

    When the last value in status.conditions.type is Succeeded, the experiment is complete. You can check information about the best trial in status.currentOptimalTrial.

    • .currentOptimalTrial.bestTrialName is the trial name.

    • .currentOptimalTrial.observation.metrics lists the max, min, and latest recorded values for the objective and additional metrics.

    • .currentOptimalTrial.parameterAssignments is the corresponding hyperparameter set.

    In addition, status shows the experiment’s trials with their current status.
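    If you script against the experiment status, the fields described above can be pulled out of the parsed kubectl output. A small sketch, operating on plain dicts such as a YAML parser would produce (field names taken from the status shown earlier):

```python
def summarize_experiment(experiment: dict) -> dict:
    """Extract the completion flag and best-trial details from an Experiment
    object, as parsed from `kubectl get experiment ... -o yaml`."""
    status = experiment.get("status", {})
    conditions = status.get("conditions", [])
    optimal = status.get("currentOptimalTrial", {})
    return {
        "succeeded": bool(conditions) and conditions[-1].get("type") == "Succeeded",
        "best_metrics": {m["name"]: m["max"]
                         for m in optimal.get("observation", {}).get("metrics", [])},
        "best_parameters": {p["name"]: p["value"]
                            for p in optimal.get("parameterAssignments", [])},
    }

# A trimmed-down version of the status shown above:
experiment = {"status": {
    "conditions": [{"type": "Created"}, {"type": "Running"}],
    "currentOptimalTrial": {
        "observation": {"metrics": [{"name": "Validation-accuracy", "max": "0.978603"}]},
        "parameterAssignments": [{"name": "optimizer", "value": "sgd"}],
    },
}}
summary = summarize_experiment(experiment)
# → succeeded is False because the last condition type is still Running
```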

    View the results of the experiment in the Katib UI:

    1. Open the Katib UI as described above.

    2. Click Hyperparameter Tuning on the Katib home page.

    3. Open the Katib menu panel on the left, then open the HP section and click Monitor:

      The Katib menu panel

    4. You should be able to view the list of experiments:

    5. Click the name of the experiment, random-example.

    6. There should be a graph showing the level of validation and train accuracy for various combinations of the hyperparameter values (learning rate, number of layers, and optimizer):

      Graph produced by the random example

    7. You can click a trial name to view the metrics for that particular trial:

      Trials that ran during the experiment

    If you installed Katib as part of Kubeflow, you can’t run experiments in the Kubeflow namespace. Run the following commands to launch an experiment using Kubeflow’s TensorFlow training job operator, TFJob:

    1. Download tfjob-example.yaml:

```shell
curl https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/tfjob-example.yaml --output tfjob-example.yaml
```

    2. Edit tfjob-example.yaml and change the following line to use your Kubeflow user profile namespace:

```yaml
namespace: kubeflow
```

    3. (Optional) Note: Katib’s experiments don’t work with Istio sidecar injection. If your Kubeflow installation uses Istio, you have to disable sidecar injection. To do that, specify the annotation sidecar.istio.io/inject: "false" in your experiment’s trial template. Check the trial template guide for how to set the annotation for the provided TFJob example.

    4. Deploy the example:

```shell
kubectl apply -f tfjob-example.yaml
```

    5. Check the status of the experiment, as described for the random algorithm example.

    Follow the steps as described for the random algorithm example to obtain the results of the experiment in the Katib UI.

    If you installed Katib as part of Kubeflow, you can’t run experiments in the Kubeflow namespace. Run the following commands to launch an experiment using Kubeflow’s PyTorch training job operator, PyTorchJob:

    1. Download pytorchjob-example.yaml:

```shell
curl https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/pytorchjob-example.yaml --output pytorchjob-example.yaml
```

    2. Edit pytorchjob-example.yaml and change the following line to use your Kubeflow user profile namespace:

```yaml
namespace: kubeflow
```

    3. (Optional) Note: Katib’s experiments don’t work with Istio sidecar injection. If your Kubeflow installation uses Istio, you have to disable sidecar injection. To do that, specify the annotation sidecar.istio.io/inject: "false" in your experiment’s trial template. For the provided PyTorchJob example, setting the annotation is similar to the TFJob case.

    4. Deploy the example:

```shell
kubectl apply -f pytorchjob-example.yaml
```

    5. Check the status of the experiment, as described for the random algorithm example.

    Follow the steps as described for the random algorithm example to get the results of the experiment in the Katib UI.

    To remove Katib from your Kubernetes cluster, run:

```shell
make undeploy
```