Tabular data
Helper functions to get data into a DataLoaders in the tabular application, and the higher-level class TabularDataLoaders
The main class to get your data ready for model training is TabularDataLoaders, together with its factory methods. Check out the tabular tutorial for examples of use.
This class should not be used directly; prefer one of the factory methods instead. All of those factory methods accept the following arguments:
- cat_names: the names of the categorical variables
- cont_names: the names of the continuous variables
- y_names: the names of the dependent variables
- y_block: the TransformBlock to use for the target
- bs: the batch size
- val_bs: the batch size for the validation DataLoader (defaults to bs)
- shuffle_train: whether to shuffle the training DataLoader
- n: overrides the number of elements in the dataset
- device: the PyTorch device to use (defaults to default_device())
TabularDataLoaders.from_df
Create from df in path using procs
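from fastai.tabular.all import *
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                 y_names="salary", valid_idx=list(range(800,1000)), bs=64)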
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | 11th | Separated | Adm-clerical | Unmarried | Black | False | 55.0 | 213894.000562 | 7.0 | <50k |
1 | Private | HS-grad | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 53.0 | 228500.001385 | 9.0 | >=50k |
2 | Private | HS-grad | Married-civ-spouse | Tech-support | Husband | White | False | 38.0 | 256864.000909 | 9.0 | >=50k |
3 | Private | Bachelors | Married-civ-spouse | Tech-support | Husband | White | False | 40.0 | 247879.997190 | 13.0 | >=50k |
4 | Private | Some-college | Divorced | Craft-repair | Not-in-family | White | False | 41.0 | 40151.001925 | 10.0 | >=50k |
5 | Private | HS-grad | Married-civ-spouse | Sales | Husband | White | False | 37.0 | 110713.001599 | 9.0 | >=50k |
6 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 38.0 | 278924.000902 | 13.0 | >=50k |
7 | Self-emp-not-inc | 11th | Married-civ-spouse | Farming-fishing | Husband | White | False | 60.0 | 220341.999356 | 7.0 | <50k |
8 | ? | 9th | Never-married | ? | Not-in-family | White | False | 30.0 | 104965.001013 | 5.0 | <50k |
9 | ? | HS-grad | Never-married | ? | Not-in-family | White | False | 21.0 | 105311.997415 | 9.0 | <50k |
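The valid_idx argument in the call above lists the row positions that go to the validation set; every other row is used for training. The following is a minimal pandas-only sketch of that split, not the library's own implementation. The 1000-row DataFrame is a hypothetical stand-in for the real table, and the names train_idx, train_df and valid_df are introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real table: 1000 rows, one column.
df = pd.DataFrame({"age": range(1000)})

# Same index list as in the example above: positions 800 through 999.
valid_idx = list(range(800, 1000))

# Every position that is not a validation position is a training position.
valid_set = set(valid_idx)
train_idx = [i for i in range(len(df)) if i not in valid_set]

train_df = df.iloc[train_idx]
valid_df = df.iloc[valid_idx]

print(len(train_df), len(valid_df))
```

With a fixed index list like this, the split is deterministic: the same 200 rows are held out on every run.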
TabularDataLoaders.from_csv
Create from csv file in path using procs
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                  y_names="salary", valid_idx=list(range(800,1000)), bs=64)
External structured data files can contain unexpected spaces, e.g. after a comma. We can see that in the first row of adult.csv: "49, Private,101320, ...". Trimming is often needed. Pandas has a convenient parameter, skipinitialspace, that is exposed by TabularDataLoaders.from_csv(). Without it, a category label used later for inference, such as workclass: Private, will be wrongly categorized as 0 or "#na#" if the training label was read as " Private". Let’s test this feature.
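The effect of skipinitialspace can be seen with pandas alone, without downloading the dataset. The sketch below is illustrative only: it reads a tiny in-memory CSV whose single data row is the first three fields of the adult.csv row quoted above, and the variable names (raw, untrimmed, trimmed) are made up for this example.

```python
import io
import pandas as pd

# First three fields of the adult.csv row quoted above; note the space after the first comma.
raw = "age,workclass,fnlwgt\n49, Private,101320\n"

# Default parsing keeps the leading space as part of the string value.
untrimmed = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops whitespace that directly follows a delimiter.
trimmed = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(repr(untrimmed["workclass"][0]))  # ' Private'
print(repr(trimmed["workclass"][0]))    # 'Private'

# A clean label supplied at inference time does not match the untrimmed category.
print("Private" in set(untrimmed["workclass"]))  # False
print("Private" in set(trimmed["workclass"]))    # True
```

The last two lines show why the mismatch matters: a lookup of the clean label "Private" fails against categories that were learned with the leading space.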