Hyperparameter Tuning (Katib)

    The Katib project is inspired by Google Vizier. Katib is a scalable and flexible hyperparameter tuning framework that is tightly integrated with Kubernetes. It does not depend on any specific deep learning framework (such as TensorFlow, MXNet, or PyTorch).

    To run Katib jobs, you must install the required packages as shown in this section. You can do so by following the Kubeflow deployment guide, or by installing Katib directly from its repository:
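    If you choose the repository route, a minimal sketch looks like the following. It assumes the v1alpha2 scripts layout, in which a deploy.sh script (the counterpart of the undeploy.sh script used for cleanup at the end of this section) installs the Katib components:

    git clone https://github.com/kubeflow/katib
    cd katib
    ./scripts/v1alpha2/deploy.sh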

    If you want to use Katib outside Google Kubernetes Engine (GKE) and you don’t have a StorageClass for dynamic volume provisioning in your cluster, you must create a persistent volume (PV) to bind your persistent volume claim (PVC).

    This is the YAML file for a PV:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: katib-mysql
      labels:
        type: local
        app: katib
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteOnce
      hostPath:
        path: /data/katib

    After deploying the Katib package, run the following command to create the PV:

    kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha2/pv/pv.yaml

    After deploying everything, you can run some examples.

    You can create an Experiment for Katib by defining an Experiment config file. For example, the following command uses the random-search example config from the Katib repository:

    kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/random-example.yaml

    Running this command launches an Experiment. It runs a series of training jobs to train models using different hyperparameters and saves the results.

    The configurations for the experiment (hyperparameter feasible space, optimization parameters, optimization goal, suggestion algorithm, and so on) are defined in random-example.yaml; a trimmed sketch of that file follows the list below.

    This demo randomly generates 3 hyperparameters:

    • Learning Rate (--lr) - type: double
    • Number of NN Layers (--num-layers) - type: int
    • Optimizer (--optimizer) - type: categorical
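    The following trimmed sketch shows roughly what random-example.yaml contains. The field names follow the v1alpha2 Experiment schema and the values are taken from the experiment description shown below; treat it as an illustration rather than the exact file (the trial template section is omitted):

    apiVersion: kubeflow.org/v1alpha2
    kind: Experiment
    metadata:
      namespace: kubeflow
      name: random-example
    spec:
      objective:
        type: maximize
        goal: 0.99
        objectiveMetricName: Validation-accuracy
        additionalMetricNames:
          - accuracy
      algorithm:
        algorithmName: random
      parallelTrialCount: 10
      maxTrialCount: 100
      maxFailedTrialCount: 3
      parameters:
        - name: --lr
          parameterType: double
          feasibleSpace:
            max: "0.03"    # the real file also sets a minimum
        - name: --num-layers
          parameterType: int
          feasibleSpace:
            min: "2"
            max: "5"
        - name: --optimizer
          parameterType: categorical
          feasibleSpace:
            list:
              - sgd
              - adam
              - ftrl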

    Check the experiment status:

    $ kubectl -n kubeflow describe experiment random-example
    Name:         random-example
    Namespace:    kubeflow
    Labels:       controller-tools.k8s.io=1.0
    Annotations:  <none>
    API Version:  kubeflow.org/v1alpha2
    Kind:         Experiment
    Metadata:
      Creation Timestamp:  2019-01-18T16:30:46Z
      Finalizers:
        clean-data-in-db
      Generation:        5
      Resource Version:  1777650
      Self Link:         /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/experiments/random-example
      UID:               687a67f9-1b3e-11e9-a0c2-c6456c1f5f0a
    Spec:
      Algorithm:
        Algorithm Name:  random
        Algorithm Settings:
      Max Failed Trial Count:  3
      Max Trial Count:         100
      Additional Metric Names:
        accuracy
      Goal:                   0.99
      Objective Metric Name:  Validation-accuracy
      Type:                   maximize
      Parallel Trial Count:   10
      Parameters:
        Feasible Space:
          Max:           0.03
        Name:            --lr
        Parameter Type:  double
        Feasible Space:
          Max:           5
          Min:           2
        Name:            --num-layers
        Parameter Type:  int
        Feasible Space:
          List:
            sgd
            adam
            ftrl
        Name:            --optimizer
        Parameter Type:  categorical
      Trial Template:
        Go Template:
          Template Spec:
            Config Map Name:       trial-template
            Config Map Namespace:  kubeflow
            Template Path:         mnist-trial-template
    Status:
      Completion Time:  2019-06-20T00:12:07Z
      Conditions:
        Last Transition Time:  2019-06-19T23:20:56Z
        Last Update Time:      2019-06-19T23:20:56Z
        Message:               Experiment is created
        Reason:                ExperimentCreated
        Status:                True
        Type:                  Created
        Last Transition Time:  2019-06-20T00:12:07Z
        Last Update Time:      2019-06-20T00:12:07Z
        Message:               Experiment is running
        Reason:                ExperimentRunning
        Status:                False
        Type:                  Running
        Last Transition Time:  2019-06-20T00:12:07Z
        Last Update Time:      2019-06-20T00:12:07Z
        Message:               Experiment has succeeded because max trial count has reached
        Reason:                ExperimentSucceeded
        Type:                  Succeeded
      Current Optimal Trial:
        Observation:
          Metrics:
            Name:   Validation-accuracy
            Value:  0.982483983039856
        Parameter Assignments:
          Name:   --lr
          Value:  0.026666666666666665
          Name:   --num-layers
          Value:  2
          Name:   --optimizer
          Value:  sgd
      Start Time:         2019-06-19T23:20:55Z
      Trials:             100
      Trials Succeeded:   100
    Events:  <none>

    The demo starts an experiment and runs trials with different combinations of the three hyperparameters. When the experiment's status condition changes to Succeeded (as in the output above), the experiment is finished.
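    If you only want the most recent condition rather than the full description, a jsonpath query such as the following works; it is a sketch that assumes the status layout shown above, where status.conditions holds the list of conditions:

    kubectl -n kubeflow get experiment random-example -o jsonpath='{.status.conditions[-1:].type}'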

    To run the TensorFlow operator example, you must first create a volume.

    If you are using GKE and the default StorageClass, you must create this PVC:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: tfevent-volume
      namespace: kubeflow
      labels:
        type: local
        app: tfjob
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
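    You can apply this manifest from a local file, or (assuming the example layout referenced by the cleanup commands at the end of this section) directly from the Katib repository:

    kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pvc.yaml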

    If you are not using GKE and you don’t have a StorageClass for dynamic volume provisioning in your cluster, you must create both a PVC and a PV, as shown below.
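    A minimal way to do this is to apply the example PV and PVC manifests that the cleanup commands at the end of this section also reference:

    kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pv.yaml
    kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pvc.yaml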

    Now you can run the TensorFlow operator example:

    kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfjob-example.yaml

    You can check the status of the experiment:

    kubectl -n kubeflow describe experiment tfjob-example

    This is an example for the PyTorch operator:

    kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/pytorchjob-example.yaml
    kubectl -n kubeflow describe experiment pytorchjob-example

    You can monitor your results in the Katib UI. If you installed Kubeflow using the deployment guide, you can access the Katib UI at

    https://<your kubeflow endpoint>/katib/

    For example, if you deployed Kubeflow on GKE, your endpoint typically looks like https://<deployment-name>.endpoints.<project-id>.cloud.goog/.

    Otherwise, you can set up port forwarding for the Katib UI service:

    kubectl port-forward svc/katib-ui -n kubeflow 8080:80

    Now you can access the Katib UI at this URL: http://localhost:8080/katib/.

    Delete the installed components:

    ./scripts/v1alpha2/undeploy.sh

    If you created a PV for Katib, delete it:

    kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha2/pv/pv.yaml

    If you created a PV and PVC for the TensorFlow operator, delete them:

    kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pvc.yaml
    kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pv.yaml

    Katib has a metrics collector to take metrics from each trial. Katib collects metrics from the stdout of each trial. Metrics should be printed in the following format: {metrics name}={value}. For example, when your objective value name is loss and the additional metrics are recall and precision, your training container should print output like this:

    epoch 1:
    loss=0.3
    recall=0.5
    precision=0.4
    epoch 2:
    loss=0.2
    recall=0.55

    Katib periodically launches CronJobs to collect metrics from pods.
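    If you want to see these collectors in action, you can list the CronJobs in the Kubeflow namespace; the exact job names depend on your experiment and Katib version, so this is just a quick way to confirm they exist:

    kubectl -n kubeflow get cronjobs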