Training and Validation
SQLFlow generates a temporary table following the user-specific table, trains and evaluates a model.
Notice, we talk about the train process in this post.
Generate a Temporary Table
Splitting the training table into training data and validation data is the key point. We suppose SQLFlow are dealing with the following SQL to train an ML model:
The data comes from the standard select part , and let’s say the query result looks like the following
| col1 | col2 | col3 | | —— | —— | —— | | <data> | <data> | <data> | | <data> | <data> | <data> | | | … | | We want to split the result into 80% training data and 20% validation data.
Note the RAND()
function returns a random number between 0 (inclusive) and 1. The result temporary table looks like the following.
We can generate the corresponding SQL using the following code template
Naming temporary table
Because multi-users run SQLFlow with their own isolated data set at the same time, SQLFlow generates an elaborate name for the temporary table to avoid conflict.
For Maxcompute SQLFlow saves the temporary table into the current project which must be specified by the user. Meanwhile, we specify the to 14 days to release the temporary table automatically.
We fetch the training/validation data using two different queries respectively.
The query for training data can be written as , which fetches row1 and row3 etc.. The query for validation data can be written as SELECT * FROM temp_table WHERE sqlflow_random >= 0.8
, which fetches the rest of the rows.
In SQLFlow, we modify the user-specific data set to our temp_table restricted to sqlflow_random >= 0.8
to train a model, then restricted the temp_table to sqlflow_random < 0.8
to validate that model. This context is built after the temporary data set accomplished, passed to in extendedSelect
.
Codegen
For TensorFlow submitter, SQLFlow generate training data set and validation data set according to extendedSelect.training
and extendedSelect.validation
.
In the end, SQLFlow remove the temporary table to release resources.
For Maxcompute SQLFlow specify the to 14 days in the create statement, so as to release the temporary table automatically.
-
So, SQLFlow need to know when the training job is completed. Whether it is a synchronized job or an asynchronous job.
Notes
- If the column sqlflow_random already exists, SQLFlow chooses to quit Notice, column name started with an underscore is invalid in the hive