Targeted Users

    1. Modelers, those who create new models, including deep learning researchers and engineers, and
    2. SQLFlow users.

    The high-level API must meet the requirements of these users.

    Modelers usually craft their Keras models on their personal computers, test the model with small datasets, and then would like to submit a distributed training job with big datasets on the cloud.

    Suppose that one is working on a model in the local directory $HOME/work, where each .py file might contain one or more Keras model classes. We would love to allow the user to submit an ElasticDL training job from the command line like the following to train a model defined as a class MyKerasModel.
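
    A sketch of such a command follows. Only the --input_fn option and the loss and optimizer options are mentioned later in this section; the other flag spellings, the loss and optimizer values, and the data and output locations are assumptions for illustration.

    elasticdl train \
        --model_zoo=$HOME/work \
        --model_def=fintech.MyKerasModel \
        --input_fn=fintech.credit_data_processor \
        --loss=mean_squared_error \
        --optimizer=Adam \
        --data="gs://bucket-name/tony/credit/train/*.recordio" \
        --output="gs://bucket-name/tony/my_trained_model"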

    The above command line

    1. builds a Docker image containing (1) $HOME/work mapped to /model_zoo/custom, (2) ElasticDL, (3) dependencies of ElasticDL,
    2. submits an ElasticDL job to the Kubernetes cluster as described in $HOME/.kube/config,
    3. prints a URL to the dashboard so that users can inspect the progress/status of the job in their Web browser.

    Please be aware that in the class fintech.MyKerasModel, in addition to overriding the call method, we also need to provide methods like

    • default_loss that returns a loss operator,
    • default_optimizer that returns an optimizer operator,
    • default_input that takes a record (string) as its input and returns something that can be batched and consumed by MyKerasModel.call. In the above example, the user chooses an input function other than MyKerasModel.default_input.

    Because the above example command line specifies --input_fn explicitly, the training job is not going to use MyKerasModel.default_input, but fintech.credit_data_processor instead. Similarly, the command line options loss and optimizer override MyKerasModel.default_loss and MyKerasModel.default_optimizer.
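
    As a minimal sketch, assuming TensorFlow Keras, such a model class might look like the following; the layer choices, constructor parameters, and method bodies are illustrative assumptions, and only the method names come from the list above.

    import tensorflow as tf

    class MyKerasModel(tf.keras.Model):
        def __init__(self, hidden_units=(10, 5)):
            super().__init__()
            self.hidden = [tf.keras.layers.Dense(n, activation="relu")
                           for n in hidden_units]
            self.out = tf.keras.layers.Dense(1)

        def call(self, inputs):
            # The usual Keras forward pass.
            x = inputs
            for layer in self.hidden:
                x = layer(x)
            return self.out(x)

        def default_loss(self):
            # Returns a loss operator.
            return tf.keras.losses.MeanSquaredError()

        def default_optimizer(self):
            # Returns an optimizer operator.
            return tf.keras.optimizers.Adam()

        def default_input(self, record):
            # Takes a record (string) and returns something that can be
            # batched and consumed by MyKerasModel.call; CSV decoding is
            # just an assumed example here.
            return tf.io.decode_csv(record, record_defaults=[[0.0]] * 4)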

    Another important command line is the one for prediction:

    elasticdl predict \
        --data="gs://bucket-name/tony/imagenet/test/*.recordio" \
        --trained_model="gs://bucket-name/tony/my_trained_model" \

    SQLFlow users provide the information required by training or prediction by writing a SQL statement with extended syntax. The syntax for training extends the SELECT statement with the TRAIN clause. For example:

    SELECT name, role, salary FROM employee
    TRAIN regressor.DNN
    WITH hidden_units=[10, 100, 20, 5], learning_rate=0.01
    INTO my_trained_model;

    Please be aware that to minimize the syntax extension, SQLFlow doesn’t allow users to specify a directory of models; instead, users can only use pre-built models – regressor.DNN in the above example.

    SQLFlow is a gRPC server that takes the above SQL statement and translates it into a Python program known as a submitter. It is the responsibility of the submitter to call elasticdl.train to launch an ElasticDL job on a Kubernetes cluster.

    SQLFlow often runs in Docker containers, and it is usually intractable to build a Docker image from within a Docker container, so the submitter requires a pre-built Docker image containing (1) /model_zoo, (2) ElasticDL, (3) dependencies of ElasticDL. The class regressor.DNN is a class defined in some Python source files in /model_zoo.

    To predict using a pre-trained model and to write the results into a column of a table, we can do

    SELECT name, role FROM testdata
    PREDICT testdata.predicted_salary
    USING my_trained_model;

    Both the command line tool elasticdl provided for modelers and the submitter program generated by SQLFlow need to call an API that launches ElasticDL jobs. Hence this design.

    We hope the ElasticDL API will support not only batch learning, but also online learning, adversarial learning, reinforcement learning, and federated learning. However, for the moment, let us start with batch learning.

    We propose a function elasticdl.train that can be called like the following:
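
    In the modeler use case above, the call might look like this; the model_zoo path and the data and output locations are illustrative assumptions:

    elasticdl.train(
        model_zoo="/home/tony/work",
        model_def="fintech.MyKerasModel",
        input_fn="fintech.credit_data_processor",
        data="gs://bucket-name/tony/credit/train/*.recordio",
        output="gs://bucket-name/tony/my_trained_model")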

    or

    elasticdl.train(
        model_zoo="https://github.com/sql-machine-learning/models",
        model_def="regressor.DNN",
        input_fn="sqlflow.elasticdl_input_function",
        params="hidden_units=[10, 100, 20, 5], learning_rate=0.01",
        data="gs://sqlflow/job-xxyyzz/train/*.recordio",
        output="gs://sqlflow/job-xxyyzz/my_trained_model")

    Please be aware that most parameters of elasticdl.train are of string-type because the command line options and SQL statements are all strings.
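
    For example, a string-typed params value could be turned back into keyword arguments for the model class constructor by a small helper like the following; parse_params is hypothetical and not part of ElasticDL:

    import ast

    def parse_params(params):
        # Parse "hidden_units=[10, 100, 20, 5], learning_rate=0.01" into
        # {"hidden_units": [10, 100, 20, 5], "learning_rate": 0.01}.
        call = ast.parse("f({})".format(params), mode="eval").body
        return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}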

    We propose a function elasticdl.predict that can be called like the following:

    elasticdl.predict(
        data='gs://bucket-name/tony/imagenet/test/*.recordio',
        trained_model='gs://bucket-name/tony/my_trained_model',
        output='gs://bucket-name/tony/imagenet-eval.recordio')

    or

    elasticdl.predict(
        data="gs://sqlflow/job-xxyyzz/predict/*.recordio",
        trained_model="gs://sqlflow/job-xxyyzz/my_trained_model",
        # The output location below is an assumed example.
        output="gs://sqlflow/job-xxyyzz/predict_result.recordio")

    When the ElasticDL client or the SQLFlow server calls elasticdl.train, this function calls the Docker API to build a Docker image and then submits the job. The building process should add a model zoo into the Docker image. The function elasticdl.train has a parameter model_zoo, whose value could be one of the following:

    1. A local directory, for example, the modeler's $HOME/work directory.

    2. A URL pointing to a Git repo, for example,

      elasticdl.train(
          model_zoo="https://git.company.com/sql-machine-learning/models", ...
      )
    In either case, the model zoo may include a requirements.txt file that lists the dependencies of its models, and the image-building process should install them with a Dockerfile step like RUN pip install -r /model_zoo/requirements.txt.
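
    A minimal sketch of such a Dockerfile follows; the base image, the assumption that ElasticDL is pip-installable, and the COPY source are all illustrative:

    FROM tensorflow/tensorflow:2.1.0
    # ElasticDL and its dependencies (assumed to be pip-installable).
    RUN pip install elasticdl
    # The model zoo: a local directory or a cloned Git repo.
    COPY . /model_zoo
    # Dependencies of the models in the zoo.
    RUN pip install -r /model_zoo/requirements.txt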

    Suppose that a Keras model class is referred to as regressor.DNN, as in elasticdl.train(model_def="regressor.DNN", ...); the corresponding Python file should be /model_zoo/regressor.py. Similarly, a class regressor.wide_and_deep.MagicalWAD should be defined in the Python file /model_zoo/regressor/wide_and_deep.py.
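
    The following sketch shows how such a model_def string could be resolved into a class inside /model_zoo; load_model_class is a hypothetical helper, not part of ElasticDL:

    import importlib
    import sys

    def load_model_class(model_def, model_zoo="/model_zoo"):
        # E.g. "regressor.wide_and_deep.MagicalWAD" resolves to the class
        # MagicalWAD defined in /model_zoo/regressor/wide_and_deep.py.
        sys.path.insert(0, model_zoo)
        module_name, class_name = model_def.rsplit(".", 1)
        module = importlib.import_module(module_name)
        return getattr(module, class_name)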

    A call to elasticdl.predict looks like the following:

    elasticdl.predict(
        data='/filestore/yiwang/imagenet/test/*.recordio',
        trained_model='/filestore/tony/my_keras_model',
        output='/filestore/yiwang/imagenet-eval.recordio')

    It needs to

    1. build and push a Docker image, and
    2. launch a distributed ElasticDL job of the type “predict”.

    The Docker image must contain the model zoo used to train the model trained_model='/filestore/tony/my_keras_model'.

    A key question is what information must be in the directory /filestore/tony/my_keras_model.

    1. A Docker image ID.

      We need this ID to refer to the Docker image built during the call to elasticdl.train. In this image, we have the model zoo used to train the model. Then, elasticdl.predict could build the Docker image for the distributed prediction job from this image ID.

      This image ID must be a pullable ID so that the ElasticDL command line tool can docker pull it as the base image. An example pullable ID is docker-pullable://reg.docker.alibaba-inc.com/asdi/aswf-py3@sha256:e8ca09705eed07cdfd060b6b9d27a802.

    2. Model class constructor parameters, like hidden_units=[10, 100, 20].

    3. Other parameters passed to elasticdl.train, including

      • model_def
      • input_function
      • optimizer
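
    Conceptually, elasticdl.train would save this information into the output directory and elasticdl.predict would read it back. A rough sketch follows; the metadata.json file name and the use of JSON are assumptions for illustration only, and the actual format is the wrapper message defined next.

    import json
    import os

    def save_metadata(output_dir, image_id, params, model_def, input_fn, optimizer):
        # Hypothetical helper: record everything elasticdl.predict needs.
        meta = {
            "image_id": image_id,    # pullable Docker image ID
            "params": params,        # model class constructor parameters
            "model_def": model_def,
            "input_fn": input_fn,
            "optimizer": optimizer,
        }
        with open(os.path.join(output_dir, "metadata.json"), "w") as f:
            json.dump(meta, f)

    def load_metadata(trained_model_dir):
        with open(os.path.join(trained_model_dir, "metadata.json")) as f:
            return json.load(f)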

    We define a new wrapper message: