Feature Derivation
Target: after the feature type inference routine, we need to know two things about each column:
- How to transform the column data to tensors, including .
- Which type of feature column should be applied to the column, and the parameters for the feature column call.
We assume all the selected columns will be used as either COLUMN or LABEL.
When a training table contains many columns that should all be used for training, like https://www.kaggle.com/mlg-ulb/creditcardfraud, it is unfriendly to require every column name in the COLUMN clause. Since we would like to use all the columns, when we write SELECT * we can assume that every column is used for training, and the COLUMN clause is no longer needed:
For columns that may need preprocessing, we can add the preprocessing descriptions in the COLUMN clause. For the credit card fraud dataset, assume only one column needs to be processed by a function before being fed to the model, so the SQL statement should look like:
For more complex cases, where the columns have quite different data formats, like:
If the column represents a “dense tensor”, we can get the shape by reading some of the values and confirming that the shapes are all the same.
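The shape check described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `infer_dense_shape` and the comma delimiter are assumptions made for the example.

```python
from typing import List, Optional


def infer_dense_shape(samples: List[str], delimiter: str = ",") -> Optional[List[int]]:
    """Return [n] if every sampled string has exactly n fields, else None.

    Hypothetical helper: the name and the delimiter default are assumptions.
    """
    # Collect the distinct field counts seen across the sampled rows.
    lengths = {len(s.split(delimiter)) for s in samples}
    if len(lengths) != 1:
        # No samples, or rows of different lengths: no single dense shape.
        return None
    return [lengths.pop()]


print(infer_dense_shape(["1,2,3", "4,5,6"]))  # [3]
print(infer_dense_shape(["1,2,3", "4,5"]))    # None
```

Returning `None` instead of raising keeps the decision about how to report inconsistent rows with the caller.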
You can also write the full description of every column, as below:
For CSV values, we also need to infer the tensor data type, int or float, by reading some of the training data. Note that we always parse float values as float32 rather than float64, since float32 seems sufficient for most cases.
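A possible way to infer the data type from sampled CSV strings is sketched below. The function name `infer_csv_dtype` is hypothetical, and reporting integers as `int64` is an assumption; the text above only fixes the float case to float32.

```python
from typing import List


def infer_csv_dtype(samples: List[str], delimiter: str = ",") -> str:
    """Return "int64" if every field parses as an int, else "float32".

    Raises ValueError if any field is not numeric at all.
    Hypothetical helper; the "int64" choice is an assumption.
    """
    all_int = True
    for row in samples:
        for field in row.split(delimiter):
            field = field.strip()
            try:
                int(field)
            except ValueError:
                # Not an int: it must at least parse as a float,
                # otherwise float() raises ValueError for the caller.
                float(field)
                all_int = False
    # Floats are always reported as float32, never float64.
    return "int64" if all_int else "float32"


print(infer_csv_dtype(["1,2,3", "4,5,6"]))  # int64
print(infer_csv_dtype(["1,2.5,3"]))         # float32
```

A single float field anywhere in the sample is enough to make the whole column float32, since one tensor cannot mix data types.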
The Feature Derivation Routine
We need to SELECT part of the training data, for example 1000 rows, and go through the routine below:
- If the column data type is numeric (int, bigint, float, double), it can be parsed directly into a tensor of shape [1].
- If the column data type is string (VARCHAR or TEXT):
  - If the string is not in one of the supported serialized formats (only CSV is supported currently):
    - If all the rows of the column’s string data can be parsed to a float or int value, treat it as a tensor of shape [1].
    - If the string value cannot be parsed to int or float, treat it as an enum type and use categorical columns to process the strings into tensors.
    - If the enum values in the above step have very little in common (for example, only 5% of the data appears twice or more), use categorical_column_with_hash_bucket.
  - If the string is of CSV format:
    - If the column already appeared in the COLUMN clause, then continue.
    - If the rows contain CSV data of different lengths, then return a parsing error to the client and stop.
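The routine above could be sketched in Python roughly as follows. Everything here is illustrative: the function `derive_feature`, the returned dictionaries, the rule used to detect CSV (at least one sampled row with more than one comma-separated numeric field), and the reading of the 5% heuristic as "the fraction of rows whose value occurs at least twice" are assumptions, not the project's actual implementation.

```python
from collections import Counter
from typing import Dict, List

NUMERIC_SQL_TYPES = {"INT", "BIGINT", "FLOAT", "DOUBLE"}
STRING_SQL_TYPES = {"VARCHAR", "TEXT"}


def _is_number(s: str) -> bool:
    try:
        float(s)
        return True
    except ValueError:
        return False


def _is_csv(s: str) -> bool:
    # Assumed CSV rule: more than one field, and every field is numeric.
    fields = s.split(",")
    return len(fields) > 1 and all(_is_number(f) for f in fields)


def derive_feature(sql_type: str, samples: List[str],
                   in_column_clause: bool = False,
                   repeat_threshold: float = 0.05) -> Dict:
    """Hypothetical sketch of the derivation routine for one column."""
    sql_type = sql_type.upper()
    if sql_type in NUMERIC_SQL_TYPES:
        return {"kind": "numeric", "shape": [1]}
    if sql_type not in STRING_SQL_TYPES:
        raise ValueError("unsupported column type: " + sql_type)
    if not samples:
        raise ValueError("no sample rows to derive from")

    if any(_is_csv(s) for s in samples):
        if in_column_clause:
            # The user already described this column; nothing to derive.
            return {"kind": "user_defined"}
        lengths = {len(s.split(",")) for s in samples}
        if len(lengths) != 1:
            raise ValueError("CSV rows have different lengths")
        return {"kind": "dense", "shape": [lengths.pop()]}

    if all(_is_number(s) for s in samples):
        return {"kind": "numeric", "shape": [1]}

    # Enum-like strings: measure how many rows carry a repeated value.
    counts = Counter(samples)
    repeated_rows = sum(c for c in counts.values() if c >= 2)
    if repeated_rows / len(samples) <= repeat_threshold:
        return {"kind": "categorical_hash_bucket"}
    return {"kind": "categorical"}


print(derive_feature("VARCHAR", ["1,2,3", "4,5,6"]))
print(derive_feature("TEXT", ["a", "b", "a", "a"]))
```

The sketch checks the CSV branch before the plain-string branch, which is equivalent to the order in the list because the two branches are mutually exclusive.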