Getting started with Katib

    This page gets you started with Katib. Follow this guide to perform any additional setup you may need, depending on your environment, and to run a few examples using the command line and the Katib user interface (UI).

    For an overview of the concepts around Katib and hyperparameter tuning, read the introduction to Katib.

    This section describes some configurations that you may need to add to your Kubernetes cluster, depending on the way you’re using Kubeflow and Katib.

    You can skip this step if you have already installed Kubeflow. Your Kubeflow deployment includes Katib.

    To install Katib as part of Kubeflow, follow the Kubeflow installation guide.

    If you want to install Katib separately from Kubeflow, or to get a later version of Katib, run the following commands to install Katib directly from its repository on GitHub and deploy Katib to your cluster:

    If you used the above script to deploy Katib, you can skip this step, because the script deploys the PVC and PV on your cluster.

    You can skip this step if you’re using Kubeflow on Google Kubernetes Engine (GKE) or if your Kubernetes cluster includes a StorageClass for dynamic volume provisioning. For more information, see the Kubernetes documentation on dynamic provisioning and persistent volumes.

    If you’re using Katib outside GKE and your cluster doesn’t include a StorageClass for dynamic volume provisioning, you must create a persistent volume (PV) to bind to the persistent volume claim (PVC) required by Katib.
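
    One way to decide whether this step applies is to ask the cluster whether any StorageClass exists. The sketch below wraps that check in shell functions. The names has_storage_class and pv_hint are hypothetical, and the presence of a StorageClass is only a rough signal: whether Katib's PVC is provisioned dynamically also depends on how that class is configured (for example, whether it is the default).

```shell
# Hypothetical helper: succeeds if the cluster reports at least one StorageClass.
# `kubectl get storageclass -o name` prints one line per StorageClass
# and nothing on stdout when there are none.
has_storage_class() {
  [ -n "$(kubectl get storageclass -o name 2>/dev/null)" ]
}

# Hypothetical helper: print which path of this section probably applies.
pv_hint() {
  if has_storage_class; then
    echo "StorageClass found: you can probably skip creating a PV"
  else
    echo "No StorageClass found: create a PV for Katib"
  fi
}
```

    Calling pv_hint against your cluster prints one of the two messages. Treat the result as a hint, not a guarantee.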

    After deploying Katib to your cluster, run the following command to create the PV:

    The above kubectl apply command uses a YAML file that defines the properties of the PV.

    You can use the Katib user interface (UI) to submit experiments and to monitor your results.

    If you installed Katib as part of Kubeflow, you can access the Katib UI from the Kubeflow UI:

    1. Open the Kubeflow UI. See the guide to accessing the central dashboard.
    2. Click Katib in the left-hand menu.

    Alternatively, you can set up port forwarding for the Katib UI service:

    kubectl port-forward svc/katib-ui -n kubeflow 8080:80

    Then you can access the Katib UI at this URL:

    http://localhost:8080/katib/

    This section introduces some examples that you can run to try Katib.

    You can create an experiment for Katib by defining the experiment in a YAML configuration file. The YAML file defines the configurations for the experiment, including the hyperparameter feasible space, optimization parameter, optimization goal, suggestion algorithm, and so on.

    This example uses the YAML file for the random algorithm example.

    Run the following commands to launch an experiment using the random algorithm example:

    1. Download the example:

      curl https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/random-example.yaml --output random-example.yaml
    2. Edit random-example.yaml and change the following line to use your Kubeflow user profile namespace:

      namespace: kubeflow
    3. Deploy the example:

      kubectl apply -f random-example.yaml
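
    Step 2 above can also be scripted. The following is a minimal sketch, assuming the downloaded file contains a namespace line whose value is kubeflow, as described in that step. The helper name set_namespace is hypothetical, and my-profile in the usage comment is a placeholder for your own namespace. Only lines whose value is exactly kubeflow are rewritten, so templated values such as {{.NameSpace}} are left alone.

```shell
# Hypothetical helper: replace "namespace: kubeflow" with the given namespace.
# Usage: set_namespace <file> <namespace>
#   e.g. set_namespace random-example.yaml my-profile
set_namespace() (
  file="$1"
  ns="$2"
  tmp="$(mktemp)" || exit 1
  # Keep the original indentation (\1) and change only the value.
  sed "s/^\([[:space:]]*\)namespace:[[:space:]]*kubeflow[[:space:]]*\$/\1namespace: ${ns}/" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  exit "$status"
)
```

    The function body runs in a subshell, so its variables do not leak into your session. The namespace is interpolated into a sed expression, which is fine for Kubernetes namespace names (lowercase letters, digits, and hyphens) but not for arbitrary strings.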

    This example embeds the hyperparameters as arguments. You can embed hyperparameters in another way (for example, using environment variables) by using the template defined in the TrialTemplate.GoTemplate.RawTemplate section of the YAML file. The template uses the Go template format.
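
    To make the argument style concrete, the sketch below prints the list entries you get when each hyperparameter is written as a "name=value" command argument. It is only an illustration in shell and not Katib's template engine. The name render_args is hypothetical, and the sample values are placeholders.

```shell
# Hypothetical illustration: print one YAML list entry per name=value pair,
# in the form - "<name>=<value>".
render_args() {
  for pair in "$@"; do
    printf '%s\n' "- \"$pair\""
  done
}

render_args --lr=0.02 --num-layers=4 --optimizer=sgd
# prints:
# - "--lr=0.02"
# - "--num-layers=4"
# - "--optimizer=sgd"
```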

    This example randomly generates the following hyperparameters:

    • --lr: Learning rate. Type: double.
    • --num-layers: Number of layers in the neural network. Type: integer.
    • --optimizer: Optimizer. Type: categorical.
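
    As a rough picture of what random generation means here, the sketch below draws one value per hyperparameter from the feasible space listed in the experiment description later on this page: a learning rate between 0.01 and 0.03, between 2 and 5 layers, and one of sgd, adam, or ftrl. It is an illustration only and not Katib's suggestion algorithm. The function name sample_trial and its seed argument are hypothetical.

```shell
# Hypothetical illustration: draw one random assignment from the feasible space.
# Usage: sample_trial <seed>
sample_trial() {
  awk -v seed="$1" 'BEGIN {
    srand(seed)
    lr = 0.01 + rand() * 0.02          # double in [0.01, 0.03)
    layers = 2 + int(rand() * 4)       # integer in 2..5
    split("sgd adam ftrl", opts, " ")
    opt = opts[1 + int(rand() * 3)]    # one of the three categories
    printf "--lr=%.6f --num-layers=%d --optimizer=%s\n", lr, layers, opt
  }'
}

sample_trial 42
```

    Each call prints one line of arguments in the same --name=value form that the example passes to the training script. The exact values depend on the seed and on your awk implementation.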

    Check the experiment status:

    kubectl -n <your user profile namespace> describe experiment random-example

    The output of the above command should look similar to this:

    Name: random-example
    Namespace: <your user namespace>
    Labels: controller-tools.k8s.io=1.0
    Annotations: <none>
    API Version: kubeflow.org/v1alpha3
    Kind: Experiment
    Metadata:
      Creation Timestamp: 2019-12-22T22:53:25Z
      Finalizers:
        update-prometheus-metrics
      Generation: 2
      Resource Version: 720692
      Self Link: /apis/kubeflow.org/v1alpha3/namespaces/<your user namespace>/experiments/random-example
      UID: dc6bc15a-250d-11ea-8cae-42010a80010f
    Spec:
      Algorithm:
        Algorithm Name: random
        Algorithm Settings: <nil>
      Max Failed Trial Count: 3
      Max Trial Count: 12
      Metrics Collector Spec:
        Collector:
          Kind: StdOut
      Objective:
        Additional Metric Names:
          accuracy
        Objective Metric Name: Validation-accuracy
        Type: maximize
      Parameters:
        Feasible Space:
          Max: 0.03
          Min: 0.01
        Name: --lr
        Parameter Type: double
        Feasible Space:
          Max: 5
          Min: 2
        Name: --num-layers
        Parameter Type: int
        Feasible Space:
          List:
            sgd
            adam
            ftrl
        Name: --optimizer
        Parameter Type: categorical
      Resume Policy: LongRunning
      Trial Template:
        Go Template:
          Raw Template: apiVersion: batch/v1
    kind: Job
    metadata:
      name: {{.Trial}}
      namespace: {{.NameSpace}}
    spec:
      template:
        spec:
          containers:
          - name: {{.Trial}}
            image: docker.io/kubeflowkatib/mxnet-mnist-example
            command:
            - "python"
            - "/mxnet/example/image-classification/train_mnist.py"
            - "--batch-size=64"
            {{- with .HyperParameters}}
            {{- range .}}
            - "{{.Name}}={{.Value}}"
            {{- end}}
            {{- end}}
    Status:
      Conditions:
        Last Transition Time: 2019-12-22T22:53:25Z
        Last Update Time: 2019-12-22T22:53:25Z
        Message: Experiment is created
        Status: True
        Type: Created
        Last Transition Time: 2019-12-22T22:55:10Z
        Last Update Time: 2019-12-22T22:55:10Z
        Message: Experiment is running
        Reason: ExperimentRunning
        Status: True
        Type: Running
      Current Optimal Trial:
        Observation:
          Metrics:
            Name: Validation-accuracy
            Value: 0.981091
        Parameter Assignments:
          Name: --lr
          Value: 0.025139701133432946
          Name: --num-layers
          Value: 4
          Name: --optimizer
          Value: sgd
      Start Time: 2019-12-22T22:53:25Z
      Trials: 12
      Trials Running: 2
      Trials Succeeded: 10
    Events: <none>

    When the last value in Status.Conditions.Type is Succeeded, the experiment is complete.
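
    The completion check can be automated by polling the condition types. The sketch below is one way to do it, assuming the Experiment resource exposes its conditions at .status.conditions, as the describe output suggests. The function name wait_for_experiment, the 10-second default interval, and treating a final Failed type as an error are assumptions, not documented Katib behavior.

```shell
# Hypothetical helper: poll until the last condition type is Succeeded or Failed.
# Usage: wait_for_experiment <namespace> <experiment> [interval-seconds]
wait_for_experiment() {
  ns="$1"; name="$2"; interval="${3:-10}"
  while :; do
    # Prints the condition types separated by spaces, e.g. "Created Running".
    types="$(kubectl -n "$ns" get experiment "$name" \
      -o jsonpath='{.status.conditions[*].type}')" || return 1
    last="${types##* }"   # keep only the last type
    case "$last" in
      Succeeded) echo "experiment $name succeeded"; return 0 ;;
      Failed)    echo "experiment $name failed" >&2; return 1 ;;
    esac
    sleep "$interval"
  done
}
```

    For example, wait_for_experiment <your user profile namespace> random-example returns once the experiment finishes. The loop has no timeout, so interrupt it with Ctrl+C if the experiment stalls.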

    View the results of the experiment in the Katib UI:

    1. Open the Katib UI as described above.

    2. Click Hyperparameter Tuning on the Katib home page.

    3. Open the Katib menu panel on the left, then open the HP section and click Monitor:

      The Katib menu panel

    4. You should see the list of experiments:

    5. Click the name of the experiment, random-example.

    6. You should see a graph showing the level of validation and train accuracy for various combinations of the hyperparameter values (learning rate, number of layers, and optimizer):

      Graph produced by the random example

    7. Click a trial name to see the metrics for that trial:

      Trials that ran during the experiment

    Run the following commands to launch an experiment using Kubeflow’s TensorFlow training job operator, TFJob:

    1. Download the tfjob-example.yaml file:

      curl https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/tfjob-example.yaml --output tfjob-example.yaml
    2. Edit tfjob-example.yaml and change the following line to use your Kubeflow user profile namespace:

      namespace: kubeflow
    3. Deploy the example:

      kubectl apply -f tfjob-example.yaml
    4. Check the status of the experiment:

      kubectl -n <your user profile namespace> describe experiment tfjob-example

    Follow the steps as described for the random algorithm example above, to see the results of the experiment in the Katib UI.

    Run the following commands to launch an experiment using Kubeflow’s PyTorch training job operator, PyTorchJob:

    1. Download the pytorchjob-example.yaml file:

      curl https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha3/pytorchjob-example.yaml --output pytorchjob-example.yaml
    2. Edit pytorchjob-example.yaml and change the following line to use your Kubeflow user profile namespace:

      namespace: kubeflow
    3. Deploy the example:

      kubectl apply -f pytorchjob-example.yaml
    4. Check the status of the experiment:

      kubectl -n <your user profile namespace> describe experiment pytorchjob-example

    Follow the steps described for the random algorithm example above to see the results of the experiment in the Katib UI.

    Delete the installed components:

    bash ./scripts/v1alpha3/undeploy.sh

    • For details of how to configure and run your experiment, see the guide to running an experiment.

    • See how you can change installation of Katib component in the .