Tabular data

Helper functions to get data into a DataLoaders for the tabular application, and the higher-level class TabularDataLoaders

The main class to get your data ready for model training is TabularDataLoaders, together with its factory methods. Check out the tabular tutorial for examples of use.

This class should not be used directly; prefer one of the factory methods instead. All of these factory methods accept the following arguments:

cat_names: the names of the categorical variables
cont_names: the names of the continuous variables
y_names: the names of the dependent variables
y_block: the TransformBlock to use for the target
bs: the batch size
val_bs: the batch size for the validation DataLoader (defaults to bs)
shuffle_train: whether to shuffle the training DataLoader
n: overrides the number of elements in the dataset
device: the PyTorch device to use (defaults to default_device())

`TabularDataLoaders.from_df`

Create a TabularDataLoaders from the DataFrame df in path, using the preprocessing steps in procs.

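from fastai.tabular.all import *

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                 y_names="salary", valid_idx=list(range(800,1000)), bs=64)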

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	Private	11th	Separated	Adm-clerical	Unmarried	Black	False	55.0	213894.000562	7.0	<50k
1	Private	HS-grad	Married-civ-spouse	Machine-op-inspct	Husband	White	False	53.0	228500.001385	9.0	>=50k
2	Private	HS-grad	Married-civ-spouse	Tech-support	Husband	White	False	38.0	256864.000909	9.0	>=50k
3	Private	Bachelors	Married-civ-spouse	Tech-support	Husband	White	False	40.0	247879.997190	13.0	>=50k
4	Private	Some-college	Divorced	Craft-repair	Not-in-family	White	False	41.0	40151.001925	10.0	>=50k
5	Private	HS-grad	Married-civ-spouse	Sales	Husband	White	False	37.0	110713.001599	9.0	>=50k
6	Private	Bachelors	Married-civ-spouse	Exec-managerial	Husband	White	False	38.0	278924.000902	13.0	>=50k
7	Self-emp-not-inc	11th	Married-civ-spouse	Farming-fishing	Husband	White	False	60.0	220341.999356	7.0	<50k
8	?	9th	Never-married	?	Not-in-family	White	False	30.0	104965.001013	5.0	<50k
9	?	HS-grad	Never-married	?	Not-in-family	White	False	21.0	105311.997415	9.0	<50k
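
The education-num_na column in the table above is not listed in cat_names or cont_names; it appears to be produced by the FillMissing step. As a rough illustration of what the three procs do conceptually, here is a sketch in plain pandas. It is not fastai's implementation: the toy DataFrame is made up, and the choices of median fill, code 0 reserved for "#na#", and population standard deviation are assumptions about the defaults.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
df = pd.DataFrame({
    "workclass": ["Private", "Private", "Self-emp-inc", "Private"],
    "education-num": [7.0, None, 9.0, 13.0],
})

# FillMissing-like step (assumed behaviour): flag missing rows in a new
# boolean column, then fill the gaps with the column median.
df["education-num_na"] = df["education-num"].isna()
median = df["education-num"].median()
df["education-num"] = df["education-num"].fillna(median)

# Categorify-like step (assumed behaviour): map each label to an integer
# code, keeping code 0 for the placeholder "#na#".
vocab = ["#na#"] + sorted(df["workclass"].unique())
codes = df["workclass"].map({label: i for i, label in enumerate(vocab)})

# Normalize-like step (assumed behaviour): shift to zero mean and scale
# to unit standard deviation.
col = df["education-num"]
normalized = (col - col.mean()) / col.std(ddof=0)

print(df)
print(codes.tolist())
```

The extra boolean column lets a model distinguish a value that was really observed from one that was filled in.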

`TabularDataLoaders.from_csv`

Create a TabularDataLoaders from the csv file in path, using the preprocessing steps in procs.

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names, 
                                  y_names="salary", valid_idx=list(range(800,1000)), bs=64)

External structured data files can contain unexpected spaces, e.g. after a comma. We can see that in the first row of adult.csv: "49, Private,101320, ...". Trimming is often needed. Pandas has a convenient parameter, skipinitialspace, that is exposed by TabularDataLoaders.from_csv. Without it, a category label used for inference later, such as workclass:Private, will be wrongly categorized as 0 or "#na#" if the training label was read as " Private". Let’s test this feature.
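
A minimal sketch of the effect in plain pandas, using a hypothetical two-row CSV held in memory rather than adult.csv. The vocabulary dictionaries are an illustrative stand-in for a label-to-code mapping, not fastai's Categorify.

```python
import io
import pandas as pd

# Hypothetical CSV text with a space after each comma, as in adult.csv.
raw = "age,workclass,salary\n49, Private, <50k\n38, Self-emp-inc, >=50k\n"

# Default parsing keeps the leading space as part of each string value.
df_raw = pd.read_csv(io.StringIO(raw))
# skipinitialspace=True drops whitespace that follows the delimiter.
df_trim = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

def build_vocab(labels):
    """Map each label to an integer code, reserving 0 for '#na#'."""
    return {label: i for i, label in enumerate(["#na#"] + sorted(labels))}

vocab_raw = build_vocab(df_raw["workclass"].unique())
vocab_trim = build_vocab(df_trim["workclass"].unique())

# At inference time the label arrives without the stray space.
print(repr(df_raw["workclass"][0]), vocab_raw.get("Private", 0))
print(repr(df_trim["workclass"][0]), vocab_trim.get("Private", 0))
```

With the untrimmed vocabulary the lookup for "Private" falls back to code 0, while the trimmed vocabulary resolves it to a real category.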
