Auto Hyperparameter Tuning

    Katib is a Kubernetes-native system for hyperparameter tuning and neural architecture search. Inspired by Google Vizier, Katib supports multiple machine learning frameworks, for example, TensorFlow, Apache MXNet, PyTorch, and XGBoost. We compared Katib with several other automatic hyperparameter tuning systems, and we prefer its Kubernetes-native architecture.

    However, Katib, like hyperparameter tuning as it is usually framed in the academic literature, is not sufficient for our use case.

    The Paradox

    To define a training job, a.k.a. an experiment, in Katib, users need to specify the search range of each hyperparameter.

    Ironically, specifying this information is an extra burden on users, since our goal is to relieve them from specifying hyperparameters in the first place.

    For boosting tree models, especially models trained with XGBoost, there is a small group of effective hyperparameters, and we can empirically determine their ranges. We noticed that the following two are the most important; a sketch of this search space follows the list.

    • max_depth in the range [2, 10], and
    • num_round in the range [50, 100].
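
    For illustration, this empirical search space could be written down as a small configuration such as the following sketch; the dictionary layout is an assumption for illustration only, not Katib's actual parameter format.

        # Hedged sketch of the empirical XGBoost search space described above.
        # The dictionary layout is illustrative, not Katib's parameter format.
        XGBOOST_SEARCH_SPACE = {
            "max_depth": {"type": "int", "min": 2, "max": 10},
            "num_round": {"type": "int", "min": 50, "max": 100},
        }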

    With the introduction of auto hyperparameter tuning, we hope that users no longer need to specify the num_round and max_depth values in training statements such as the one sketched below.
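
    The original example statement is not reproduced here; as a rough sketch, with placeholder table, column, and model names and placeholder attribute values, it might look like the following.

        SELECT * FROM train_table           -- placeholder training table
        TO TRAIN xgboost.gbtree
        WITH
            objective = "binary:logistic",  -- still specified by the user
            max_depth = 5,                  -- to be picked by auto tuning instead
            num_round = 100                 -- to be picked by auto tuning instead
        LABEL class
        INTO my_xgboost_model;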

    For deep learning models, the situation is more complicated. Each model has its own set of hyperparameters, and their ranges might vary significantly. Our proposed solution is to utilize the model zoo. In particular, users might train a model defined in the zoo with various datasets, in various experiments, with manually specified hyperparameters. After training, some users might publish their trained models, including the estimated parameters and the specified hyperparameters. These published hyperparameter values can then inform hyperparameter tuning. We are working on such a Bayesian approach, which doesn't require explicit specification of hyperparameter ranges, and we plan to contribute it to Katib.

    The System Design

    SQLFlow has worked by converting a SQL program into a Python program, known as a submitter, and then executing that submitter. However, we recently realized that the submitter idea is insufficient for cloud deployment: since Kubernetes might preempt the SQLFlow server, the server could lose the execution status of submitters.

    This observation urges us to make the following changes.

    1. Introduce a workflow engine, namely Argo.
    2. Make SQLFlow generate a workflow instead of a Python program.
    3. Have the SQLFlow server submit the workflow to Argo for execution.
    4. Let Argo manage the status of workflow executions.

    Argo takes workflows in the form of YAML files, and writing such YAML files manually is error-prone. So we created Couler as an intermediate programmatic representation of workflows.
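
    For reference, a Couler program describes workflow steps as Python function calls and lets Couler render the Argo YAML. The following minimal sketch is based on Couler's published examples; the exact API may differ between versions.

        import couler.argo as couler

        # One workflow step that runs a container. Couler collects such calls
        # and renders them into an Argo workflow YAML, so nobody has to write
        # the YAML by hand. (Submission to Argo is omitted in this sketch.)
        couler.run_container(
            image="docker/whalesay",
            command=["cowsay"],
            args=["hello SQLFlow"],
        )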

    We need to develop a new codegen, codegen_couler.go, for SQLFlow. codegen_couler.go converts the parsed SQL program, a.k.a. the intermediate representation, or IR, into a Couler program.

    SQLFlow parses each SQL program into an IR, which is a list of statement IRs. codegen_couler.go converts this IR into a Couler program. We also need to add a Couler function, couler.sqlflow.train, for the generated Couler program to call.

    Consider an example program like the training statement sketched earlier, but with the max_depth and num_round attributes left out. Such a statement indicates that Katib should be used to train the model xgboost.gbtree. Then codegen_couler.go might generate a Couler program along the lines of the following sketch.
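
    In this sketch, the module layout couler.sqlflow.katib, the image name, the data source string, and the elided SQL text are placeholders; only the call to couler.sqlflow.katib.train is taken from this design.

        import couler.sqlflow.katib as katib  # hypothetical module layout

        # One workflow step that runs a Katib tuning job for the statement above.
        katib.train(
            model="xgboost.gbtree",
            hyperparameters={"max_depth": (2, 10), "num_round": (50, 100)},
            image="sqlflow/sqlflow:latest",                # placeholder image
            sql="SELECT ... TO TRAIN xgboost.gbtree ...",  # the user's statement
            datasource="mysql://user:password@tcp(127.0.0.1:3306)/",  # placeholder
        )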

    The arguments of couler.sqlflow.katib.train are as follows; a hedged sketch of such a function appears after the list.

    • hyperparameters specifies the hyperparameters to tune for the model given in model.
    • image specifies the container image for the Katib tuning job.
    • sql is the SQL statement input by the user.
    • datasource is the data source of the training and validation data.
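
    To make the argument list concrete, here is a hedged sketch of what a function like couler.sqlflow.katib.train could do. The function body, the environment variable names, and the step representation are assumptions, not the actual SQLFlow implementation; only the parameter fields (name, parameterType, feasibleSpace) follow Katib's experiment specification.

        def train(model, hyperparameters, image, sql, datasource):
            """Hypothetical sketch: fill one Katib tuning step into the workflow.

            `hyperparameters` maps each tunable name to its search range, e.g.
            {"max_depth": (2, 10), "num_round": (50, 100)}.
            """
            # Translate the search ranges into Katib-style parameter specs.
            parameters = [
                {
                    "name": name,
                    "parameterType": "int",
                    "feasibleSpace": {"min": str(lo), "max": str(hi)},
                }
                for name, (lo, hi) in hyperparameters.items()
            ]
            # The tuning job Pods run `image`; the SQL statement and the data
            # source tell them where to read the training and validation data.
            # (The environment variable names below are placeholders.)
            return {
                "image": image,
                "command": ["python", "-m", "runtime.couler.katib.xgboost_train"],
                "env": {
                    "SQLFLOW_SQL": sql,
                    "SQLFLOW_DATASOURCE": datasource,
                    "SQLFLOW_MODEL": model,
                },
                "parameters": parameters,
            }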

    Run Tuning Job on Katib

    In each Katib tuning job, users first need to define the tuning parameters (i.e., each hyperparameter's name, type, and range) for a model. At runtime, Katib picks different values for those hyperparameters and starts a Pod for each value set. The tuning job Pods, which run a customized container image, must follow Katib's input format and take the hyperparameter values from Katib, then train and evaluate the model.

    For example, users may define the following command for the tuning job Pod:

    python -m runtime.couler.katib.xgboost_train

    The actual command at runtime will be:

    python -m runtime.couler.katib.xgboost_train --max_depth 5 ...

    where the --max_depth flag is appended by Katib.
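
    As a rough sketch of what such an entry point could look like: the module path follows the command above, while the flag parsing, the placeholder train_and_evaluate helper, and the metric-printing format are assumptions about how the Pod integrates with Katib's metrics collection.

        # Hypothetical sketch of runtime/couler/katib/xgboost_train.py.
        import argparse


        def train_and_evaluate(max_depth, num_round):
            """Placeholder for the real training and evaluation code, which
            would read the data source, call XGBoost, and return a validation
            metric."""
            return 0.0


        def main():
            parser = argparse.ArgumentParser()
            # Katib appends the picked hyperparameter values as command-line
            # flags, e.g. `--max_depth 5 --num_round 80`.
            parser.add_argument("--max_depth", type=int, default=6)
            parser.add_argument("--num_round", type=int, default=100)
            args = parser.parse_args()

            metric = train_and_evaluate(args.max_depth, args.num_round)

            # Print the objective metric so that a Katib metrics collector can
            # read it from stdout; the exact output format depends on the
            # collector configuration.
            print("validation_auc=%f" % metric)


        if __name__ == "__main__":
            main()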

    The pipeline is as follows:

    • SQLFlow parses the input SQL statement and extracts the tuning hyperparameters, the image, and the model.
    • codegen_couler.go generates couler_submitter.py, which invokes couler.sqlflow.katib.train in the submitter program.
    • SQLFlow executes couler_submitter.py, whose call to couler.sqlflow.katib.train fills a Katib step into the Argo workflow.
    • Argo executes the workflow YAML and creates the tuning job on Katib.