5-2 feature_column

Feature columns are used to convert categorical features into one-hot encodings, to create bucketized features from continuous features, to generate crossed features from multiple features, and so on.

To create feature columns, call the functions in the tf.feature_column module. The nine most frequently used functions in this module are shown in the figure below. All of them return a Categorical-Column or a Dense-Column object, except bucketized_column, whose class inherits from both of these classes.

numeric_column, the most frequently used function.

bucketized_column, generated from a numeric column; it splits one numeric column into multiple bucket features according to the given boundaries, and it is one-hot encoded.

categorical_column_with_identity, one-hot encoded; equivalent to bucketizing where each bucket is a single integer.

categorical_column_with_vocabulary_list, one-hot encoded; the dictionary is specified by the list.

categorical_column_with_vocabulary_file, one-hot encoded; the dictionary is specified by the file.

categorical_column_with_hash_bucket, used when the range of integers or the dictionary is large.

indicator_column, generated from a Categorical-Column; one-hot encoded.

embedding_column, generated from a Categorical-Column; the parameters of the embedding vectors are learned during training. The recommended dimension of the embedding vector is the fourth root of the number of categories.

crossed_column, built from any categorical columns except categorical_column_with_hash_bucket.
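
To make the encodings in the list above concrete, here is a minimal pure-Python sketch of the transformations behind bucketized, vocabulary, hashed and crossed columns, plus the fourth-root rule for the embedding dimension. It is an illustration only, not how tf.feature_column is implemented: every function name (one_hot, bucketize, vocabulary_index, hash_bucket, cross, suggested_embedding_dim) is a hypothetical helper, the boundaries and vocabulary are made-up values, and MD5 merely stands in for whatever hash function TensorFlow uses internally.

```python
import bisect
import hashlib


def one_hot(index, depth):
    """Return a list of length `depth` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(depth)]


def bucketize(value, boundaries):
    """Map a number to a bucket index; n boundaries give n + 1 buckets."""
    # bisect_right sends a value equal to a boundary to the bucket on its
    # right, so each bucket covers the half-open interval [lower, upper).
    return bisect.bisect_right(boundaries, value)


def vocabulary_index(token, vocabulary):
    """Map a category to its position in an explicit vocabulary list."""
    return vocabulary.index(token)


def hash_bucket(token, num_buckets):
    """Map an arbitrary value to one of `num_buckets` ids (collisions possible)."""
    # Assumption of this sketch: MD5 is used only because it is stable
    # across runs; it is not the hash that TensorFlow applies.
    digest = hashlib.md5(str(token).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def cross(tokens, num_buckets):
    """Combine several categorical values into a single hashed category id."""
    return hash_bucket("_X_".join(str(t) for t in tokens), num_buckets)


def suggested_embedding_dim(num_categories):
    """Fourth root of the number of categories, rounded, and at least 1."""
    return max(1, round(num_categories ** 0.25))


age_boundaries = [18, 25, 30]                 # made-up boundaries
age_bucket = bucketize(27, age_boundaries)    # 27 falls into [25, 30)
print(one_hot(age_bucket, len(age_boundaries) + 1))                 # [0, 0, 1, 0]
print(one_hot(vocabulary_index("female", ["male", "female"]), 2))   # [0, 1]
print(cross([age_bucket, "female"], 15))      # some id between 0 and 14
print(suggested_embedding_dim(10000))         # 10
```

The hashed and crossed variants trade exactness for a fixed output size: two different inputs may land in the same bucket, which is why the number of buckets is a tuning choice.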

2. Demonstration of feature column

Here is a complete example that solves the Titanic survival problem using feature columns.

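import datetime
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers

# Minimal logging helper (assumed implementation): prints a timestamped banner.
def printlog(info):
    nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    print("\n" + "=" * 80 + nowtime)
    print(info + "\n")

#================================================================================
# 1. Constructing data pipeline
#================================================================================
printlog("step1: prepare dataset...")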
dftrain_raw = pd.read_csv("../data/titanic/train.csv")
dftest_raw = pd.read_csv("../data/titanic/test.csv")
dfraw = pd.concat([dftrain_raw,dftest_raw])
def prepare_dfdata(dfraw):
    dfdata = dfraw.copy()
    dfdata.columns = [x.lower() for x in dfdata.columns]
    dfdata = dfdata.rename(columns={'survived':'label'})
    dfdata = dfdata.drop(['passengerid','name'],axis = 1)
    for col,dtype in dict(dfdata.dtypes).items():
        # Check whether the column has missing values.
        if dfdata[col].hasnans:
            # Add an indicator column marking the rows that were missing.
            dfdata[col + '_nan'] = pd.isna(dfdata[col]).astype('int32')
            # Fill numeric columns with the mean and all other columns with ''.
            if pd.api.types.is_numeric_dtype(dtype):
                dfdata[col] = dfdata[col].fillna(dfdata[col].mean())
            else:
                dfdata[col] = dfdata[col].fillna('')
    return(dfdata)
dfdata = prepare_dfdata(dfraw)
dftrain = dfdata.iloc[0:len(dftrain_raw),:]
dftest = dfdata.iloc[len(dftrain_raw):,:]
# Importing data from dataframe
def df_to_dataset(df, shuffle=True, batch_size=32):
    dfdata = df.copy()
    if 'label' not in dfdata.columns:
        ds = tf.data.Dataset.from_tensor_slices(dfdata.to_dict(orient = 'list'))
    else: 
        labels = dfdata.pop('label')
        ds = tf.data.Dataset.from_tensor_slices((dfdata.to_dict(orient = 'list'), labels))  
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dfdata))
    ds = ds.batch(batch_size)
    return ds
ds_train = df_to_dataset(dftrain)
ds_test = df_to_dataset(dftest)
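
The dataset built above yields batches whose features form a dictionary keyed by column name, and these names are what the key argument of a feature column refers to. The pure-Python sketch below mimics that slicing, shuffling and batching on toy data. It only illustrates the data layout and is not how tf.data works internally; the helper names slice_rows and make_batches and the sample values are hypothetical.

```python
import random


def slice_rows(columns):
    """Turn a column-oriented dict into a list of per-row dicts."""
    num_rows = len(next(iter(columns.values())))
    return [{name: values[i] for name, values in columns.items()}
            for i in range(num_rows)]


def make_batches(columns, labels, batch_size, shuffle=True, seed=None):
    """Yield (features, labels) batches; features stay keyed by column name."""
    rows = list(zip(slice_rows(columns), labels))
    if shuffle:
        # Shuffling all rows corresponds to a shuffle buffer as large as the data.
        random.Random(seed).shuffle(rows)
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        features = {name: [row[name] for row, _ in chunk] for name in columns}
        yield features, [label for _, label in chunk]


# Hypothetical toy data, not the real Titanic values.
columns = {"age": [22.0, 38.0, 26.0], "sex": ["male", "female", "female"]}
labels = [0, 1, 1]
for features, batch_labels in make_batches(columns, labels, batch_size=2, shuffle=False):
    print(features, batch_labels)
# {'age': [22.0, 38.0], 'sex': ['male', 'female']} [0, 1]
# {'age': [26.0], 'sex': ['female']} [1]
```

Because each batch keeps the column names as keys, a feature column only needs the name of its source column to find its input.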

#================================================================================
# 3. Defining the model
#================================================================================
printlog("step3: define model...")
model = tf.keras.Sequential([
  layers.DenseFeatures(feature_columns), # Feed the feature columns into tf.keras.layers.DenseFeatures
  layers.Dense(64, activation='relu'),
  layers.Dense(64, activation='relu'),
  layers.Dense(1, activation='sigmoid')
])

#================================================================================
# 5. Evaluating the model
#================================================================================
printlog("step5: eval model...")
model.summary()
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
def plot_metric(history, metric):
    train_metrics = history.history[metric]
    val_metrics = history.history['val_'+metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and validation '+ metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_"+metric, 'val_'+metric])
    plt.show()
plot_metric(history,"accuracy")

Please leave comments in the WeChat official account “Python与算法之美” (Elegance of Python and Algorithms) if you want to communicate with the author about the content. The author will do their best to reply, given the limited time available.

You are also welcome to join the readers' group chat by replying 加群 (join group) in the WeChat official account.