Compatibility with Various SQL Engines
SQLFlow calls Go’s standard database API. The submitter programs generated by SQLFlow call Python’s database API. Both APIs abstract the interface to various SQL engines; however, they alone are insufficient for SQLFlow to work. In this document, we examine all interactions between SQLFlow and the SQL engine so as to identify what SQLFlow authors have to abstract in addition to calling Go’s and Python’s database APIs.
The basic idea of SQLFlow is to extend the SELECT statement of SQL with the TO TRAIN and TO PREDICT clauses. For more discussion, please refer to the syntax design. SQLFlow translates such “extended SQL statements” into submitter programs, which forward the part from SELECT up to TO TRAIN or TO PREDICT, which we call the “standard part”, to the SQL engine. SQLFlow also accepts SELECT statements without TO TRAIN or TO PREDICT clauses and forwards such “standard statements” to the engine. Note that the “standard part” and “standard statements” are not truly standardized. For example, various engines use different syntax for FULL OUTER JOIN:
- MySQL does not support `FULL OUTER JOIN` directly; users typically emulate it with a UNION of a LEFT JOIN and a RIGHT JOIN.
- Hive supports `FULL OUTER JOIN` directly.
Fortunately, as SQLFlow forwards the above parts to the engine, it doesn’t have to care much about the differences above.
Metadata Retrieval
To verify the semantics of users’ inputs, SQLFlow needs to retrieve the schema of tables. For example, a user might write an extended SELECT statement whose COLUMN clause misspells the field name age as “agee”. SQLFlow must be able to find that out by checking the clause against the table schema.
Different SQL engines provide different commands to retrieve the table schema:

- MySQL: `DESCRIBE employee;`
- Hive: `DESCRIBE FORMATTED employee;`
- ODPS: `DESC employee;`
The returned data formats vary too. Our solution to these differences is simply not to use those commands; instead, SQLFlow retrieves the table schema by running a query like `SELECT * FROM employee LIMIT 1;` and inferring field types through the `DatabaseTypeName` mechanism provided by the SQL engine drivers beneath Go’s standard database API.
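The following is a minimal Go sketch of this approach. The helper name inferFieldTypes, the MySQL driver choice, and the connection string are illustrative assumptions; in practice, the driver for whichever engine SQLFlow connects to would be plugged in.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // one example driver; other engines plug in similarly
)

// inferFieldTypes is a hypothetical helper sketching how the table schema can
// be retrieved without DESCRIBE-style statements: run a one-row query and ask
// the driver for each column's engine-specific type name.
func inferFieldTypes(db *sql.DB, table string) (map[string]string, error) {
	rows, err := db.Query(fmt.Sprintf("SELECT * FROM %s LIMIT 1", table))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	columnTypes, err := rows.ColumnTypes()
	if err != nil {
		return nil, err
	}
	fieldTypes := make(map[string]string)
	for _, ct := range columnTypes {
		// DatabaseTypeName returns the engine's own type name, e.g. "BIGINT" or "VARCHAR".
		fieldTypes[ct.Name()] = ct.DatabaseTypeName()
	}
	return fieldTypes, nil
}

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/testdb") // hypothetical DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	types, err := inferFieldTypes(db, "employee")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(types)
}
```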
A SQLFlow prediction job writes its prediction results into a table. It prepares the prediction table by:

- dropping the previous prediction table: `DROP TABLE IF EXISTS my_table;`
- creating the table with a schema: `CREATE TABLE my_table (name1 type1, name2 type2);`

Most SQL engines, including MySQL, Hive, and ODPS, support both statements.
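A minimal Go sketch of this preparation step, assuming a `*sql.DB` handle; the helper name createPredictionTable is hypothetical:

```go
package example

import (
	"database/sql"
	"fmt"
)

// createPredictionTable drops any previous prediction table and recreates it
// with the given column definitions, e.g. "name VARCHAR(255), income DOUBLE".
// Both statements are supported by MySQL, Hive, and ODPS.
func createPredictionTable(db *sql.DB, table, columnDefs string) error {
	if _, err := db.Exec(fmt.Sprintf("DROP TABLE IF EXISTS %s", table)); err != nil {
		return err
	}
	_, err := db.Exec(fmt.Sprintf("CREATE TABLE %s (%s)", table, columnDefs))
	return err
}
```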
Translate Database Column Type to TensorFlow Feature Column Type
After retrieving the database column type name through `DatabaseTypeName`, we can derive the TensorFlow feature column type via a mapping such as `{"FLOAT", "DOUBLE"} -> tf.numeric_column`.
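A minimal Go sketch of such a mapping follows. Only the FLOAT/DOUBLE-to-numeric-column rule comes from the text above; the function name featureColumnFor and the remaining cases are illustrative assumptions.

```go
package example

import "fmt"

// featureColumnFor maps an engine-reported DatabaseTypeName to the TensorFlow
// feature column constructor that the generated submitter program should emit.
func featureColumnFor(dbTypeName string) (string, error) {
	switch dbTypeName {
	case "FLOAT", "DOUBLE":
		return "tf.feature_column.numeric_column", nil
	case "INT", "BIGINT":
		return "tf.feature_column.numeric_column", nil // assumption: integers also map to numeric columns
	case "VARCHAR", "CHAR", "TEXT", "STRING":
		return "tf.feature_column.categorical_column_with_hash_bucket", nil // assumption
	default:
		return "", fmt.Errorf("unsupported database column type: %s", dbTypeName)
	}
}
```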
Save Model
SQLFlow saves a trained ML model by serializing the model directory and dumping it into a table. It first creates the table with `CREATE TABLE IF NOT EXISTS %s (id INT AUTO_INCREMENT, block BLOB, PRIMARY KEY (id))` and then inserts the serialized data as blob rows. Also, note that Hive and ODPS don’t support AUTO_INCREMENT, so we need to implement the auto-increment logic in sqlfs ourselves.
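A minimal Go sketch of the saving side, assuming a hypothetical writeBlocks helper that assigns ids explicitly so the same code also works on engines without AUTO_INCREMENT, such as Hive and ODPS:

```go
package example

import (
	"database/sql"
	"fmt"
)

// writeBlocks stores the serialized model as fixed-size blocks in the given
// table, assigning the id column explicitly instead of relying on
// AUTO_INCREMENT.
func writeBlocks(db *sql.DB, table string, model []byte) error {
	const blockSize = 1 << 20 // 1 MiB per row; an illustrative choice
	stmt := fmt.Sprintf("INSERT INTO %s (id, block) VALUES (?, ?)", table)
	for id := 0; len(model) > 0; id++ {
		n := blockSize
		if n > len(model) {
			n = len(model)
		}
		if _, err := db.Exec(stmt, id, model[:n]); err != nil {
			return err
		}
		model = model[n:]
	}
	return nil
}
```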
Load Model
SQLFlow loads a trained ML model by reading the rows of a table and deserializing the blobs back into a model directory. It reads the rows by running `SELECT block FROM %s ORDER BY id`, which is supported by most databases.
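A minimal Go sketch of the loading side, mirroring the writer above; the helper name readBlocks is hypothetical:

```go
package example

import (
	"database/sql"
	"fmt"
)

// readBlocks reads the blob rows back in id order and concatenates them into
// the original serialized model bytes.
func readBlocks(db *sql.DB, table string) ([]byte, error) {
	rows, err := db.Query(fmt.Sprintf("SELECT block FROM %s ORDER BY id", table))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var model []byte
	for rows.Next() {
		var block []byte
		if err := rows.Scan(&block); err != nil {
			return nil, err
		}
		model = append(model, block...)
	}
	return model, rows.Err()
}
```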
Data Operations in Python
Thanks to the Python database API, connecting to different databases follows a similar pattern.
Insert Prediction Result into Prediction Table
The Python database API provides `executemany(sql, values)` to insert multiple rows at once, so the submitter program only needs to prepare the right insertion statement for each engine. Please be aware that MySQL uses `INSERT INTO` to insert rows, while Hive and ODPS use `INSERT INTO TABLE`.
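Since SQLFlow generates the submitter program, the engine-specific prefix can be chosen at generation time. A minimal Go sketch of that choice; the function name insertIntoPrefix and the driver names are illustrative assumptions:

```go
package example

import "fmt"

// insertIntoPrefix returns the engine-specific prefix of the INSERT statement
// that the generated submitter program completes with a VALUES clause and
// passes to the Python database API's executemany.
func insertIntoPrefix(driver, table string) string {
	switch driver {
	case "hive", "maxcompute": // Hive and ODPS require the TABLE keyword
		return fmt.Sprintf("INSERT INTO TABLE %s", table)
	default: // e.g. MySQL
		return fmt.Sprintf("INSERT INTO %s", table)
	}
}
```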