Feature Derivation

Target: After the feature type inference routine, we need to know the following two pieces of information for each column:

How to transform the column data to tensors, including .
Which type of feature column fits the column, and the parameters for the feature column call.

We assume all the selected columns will be used as either COLUMN or LABEL.

When a training table contains many columns that should all be used for training, as in https://www.kaggle.com/mlg-ulb/creditcardfraud, having to list every column name in the COLUMN clause is unfriendly. When we write SELECT *, we can assume that all columns are used for training, so the COLUMN clause is no longer needed:

For columns that need preprocessing, we can add the preprocessing descriptions in the COLUMN clause. For the credit card fraud dataset, assume that only one column must be processed by a function before being fed to the model, so the SQL statement should look like:

For more complex cases, where columns have quite different data formats, like:

If the column represents a “dense tensor”, we can get the shape by reading some of the values and confirming that the shapes are all the same.
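The shape check described above could be sketched roughly as follows. This is only an illustration: the function name `infer_dense_shape`, the comma delimiter, and the choice to return `None` when the sampled lengths disagree are assumptions, not part of the design.

```python
from typing import List, Optional


def infer_dense_shape(samples: List[str], delimiter: str = ",") -> Optional[List[int]]:
    """Hypothetical helper: infer a 1-D dense tensor shape from sampled values.

    Returns [length] when every sampled string has the same number of
    delimiter-separated fields, and None when the lengths differ or no
    samples were given.
    """
    # Collect the distinct field counts seen in the sample.
    lengths = {len(value.split(delimiter)) for value in samples}
    if len(lengths) != 1:
        return None
    return [lengths.pop()]
```

For example, `infer_dense_shape(["1,2,3", "4,5,6"])` yields `[3]`, while a sample with rows of different lengths yields `None`.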

You can also write the full description of every column like below:

For CSV values, we also need to infer the tensor data type (int or float) by reading some of the training data. Note that we always parse float values as float32 rather than float64, since float32 seems sufficient for most cases.
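A minimal sketch of this type inference, under stated assumptions: the helper name `infer_csv_dtype` is hypothetical, and the `"int64"` label for integer data is a guess, since the text only fixes float32 for float values.

```python
from typing import List


def infer_csv_dtype(samples: List[str], delimiter: str = ",") -> str:
    """Hypothetical helper: infer the tensor dtype of sampled CSV strings.

    Returns "int64" (an assumed label) if every field parses as an integer,
    and "float32" as soon as any field only parses as a float.
    Raises ValueError if a field is not numeric at all.
    """
    dtype = "int64"
    for row in samples:
        for field in row.split(delimiter):
            try:
                int(field)
            except ValueError:
                # Not an integer: it must at least be a float,
                # otherwise float() raises ValueError to the caller.
                float(field)
                dtype = "float32"
    return dtype
```

A single float field is enough to make the whole column float32, so `infer_csv_dtype(["1,2", "3,4.5"])` gives `"float32"`.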

The Feature Derivation Routine

We need to SELECT part of the training data, for example 1000 rows, and go through the following routine:

If the column data type is numeric (int, bigint, float, double), the value can be parsed directly to a tensor of shape [1].
If the column data type is string (VARCHAR or TEXT):
1. If the string is not in one of the supported serialized formats (only CSV is supported currently):
  1. If every row of the column’s string data can be parsed to a float or int value, treat it as a tensor of shape [1].
  2. If the string values cannot be parsed to int or float, treat the column as an enum type and use categorical columns to process the strings to tensors.
  3. If the enum values in the above step have very little in common (for example, only 5% of the values appear twice or more), use categorical_column_with_hash_bucket.
2. If the string is in CSV format:
  1. If the column already appears in the COLUMN clause, then continue.
  2. If the rows contain CSV data of different lengths, then return a parsing error to the client and stop.
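The steps listed so far can be sketched as a single decision function. This is a rough illustration, not the actual implementation. The name `derive_feature`, the string labels it returns, and the CSV test (two or more comma-separated numeric fields) are all assumptions. So are the reading of the 5% rule as "the fraction of sampled rows whose value appears at least twice" and the `dense[n]` result for equal-length CSV rows, which borrows from the dense tensor case described earlier. The CSV check runs first here, which is equivalent because the two string branches are mutually exclusive.

```python
from collections import Counter
from typing import List

NUMERIC_SQL_TYPES = {"INT", "BIGINT", "FLOAT", "DOUBLE"}
STRING_SQL_TYPES = {"VARCHAR", "TEXT"}


def _is_number(text: str) -> bool:
    """True if the text parses as an int or float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_csv(text: str) -> bool:
    """Assumed CSV test: two or more comma-separated numeric fields."""
    fields = text.split(",")
    return len(fields) > 1 and all(_is_number(f) for f in fields)


def derive_feature(sql_type: str, samples: List[str],
                   in_column_clause: bool = False,
                   repeat_threshold: float = 0.05) -> str:
    """Return a hypothetical label describing how to treat the column."""
    if sql_type.upper() in NUMERIC_SQL_TYPES:
        return "numeric[1]"
    if sql_type.upper() not in STRING_SQL_TYPES:
        raise ValueError(f"unsupported column type: {sql_type}")
    if not samples:
        raise ValueError("no sample rows to derive from")

    # Step 2: the strings are in CSV format.
    if all(_is_csv(s) for s in samples):
        if in_column_clause:
            return "declared in COLUMN"  # keep the user's description
        lengths = {len(s.split(",")) for s in samples}
        if len(lengths) > 1:
            raise ValueError("CSV rows have different lengths")
        return f"dense[{lengths.pop()}]"  # assumption, see lead-in

    # Step 1.1: every value parses as a single number.
    if all(_is_number(s) for s in samples):
        return "numeric[1]"

    # Steps 1.2 and 1.3: enum-like strings.
    counts = Counter(samples)
    repeated_rows = sum(n for n in counts.values() if n >= 2)
    if repeated_rows / len(samples) <= repeat_threshold:
        return "categorical_column_with_hash_bucket"
    return "categorical"
```

With this sketch, `derive_feature("TEXT", ["a", "b", "a", "b"])` returns `"categorical"`, and a column of 100 distinct strings returns `"categorical_column_with_hash_bucket"`.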