Tabular data

    Helper functions to get data into a DataLoaders for the tabular application, and the higher-level class TabularDataLoaders.

    The main class for getting your data ready for model training is TabularDataLoaders, together with its factory methods. Check out the tabular tutorial for examples of use.

    This class should not be used directly; prefer one of the factory methods instead. All of these factory methods accept the following arguments:

    • cat_names: the names of the categorical variables
    • cont_names: the names of the continuous variables
    • y_names: the names of the dependent variables
    • y_block: the TransformBlock to use for the target
    • bs: the batch size
    • val_bs: the batch size for the validation DataLoader (defaults to bs)
    • shuffle_train: whether to shuffle the training DataLoader
    • n: overrides the number of elements in the dataset
    • device: the PyTorch device to use (defaults to default_device())

    TabularDataLoaders.from_df

    Create from df in path using procs

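    from fastai.tabular.all import *  # provides pd, untar_data, URLs and the tabular procs
    path = untar_data(URLs.ADULT_SAMPLE)
    df = pd.read_csv(path/'adult.csv', skipinitialspace=True)
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
    cont_names = ['age', 'fnlwgt', 'education-num']
    procs = [Categorify, FillMissing, Normalize]
    dls = TabularDataLoaders.from_df(df, path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                     y_names="salary", valid_idx=list(range(800,1000)), bs=64)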

       workclass         education     marital-status      occupation         relationship   race   education-num_na  age   fnlwgt         education-num  salary
    0  Private           11th          Separated           Adm-clerical       Unmarried      Black  False             55.0  213894.000562  7.0            <50k
    1  Private           HS-grad       Married-civ-spouse  Machine-op-inspct  Husband        White  False             53.0  228500.001385  9.0            >=50k
    2  Private           HS-grad       Married-civ-spouse  Tech-support       Husband        White  False             38.0  256864.000909  9.0            >=50k
    3  Private           Bachelors     Married-civ-spouse  Tech-support       Husband        White  False             40.0  247879.997190  13.0           >=50k
    4  Private           Some-college  Divorced            Craft-repair       Not-in-family  White  False             41.0  40151.001925   10.0           >=50k
    5  Private           HS-grad       Married-civ-spouse  Sales              Husband        White  False             37.0  110713.001599  9.0            >=50k
    6  Private           Bachelors     Married-civ-spouse  Exec-managerial    Husband        White  False             38.0  278924.000902  13.0           >=50k
    7  Self-emp-not-inc  11th          Married-civ-spouse  Farming-fishing    Husband        White  False             60.0  220341.999356  7.0            <50k
    8  ?                 9th           Never-married       ?                  Not-in-family  White  False             30.0  104965.001013  5.0            <50k
    9  ?                 HS-grad       Never-married       ?                  Not-in-family  White  False             21.0  105311.997415  9.0            <50k
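
The from_df call above passes valid_idx=list(range(800,1000)). As a rough, pandas-only illustration of what holding out those row positions means, the sketch below splits a stand-in DataFrame by position. It assumes valid_idx is interpreted as row positions; the 1000-row frame and its "row" column are placeholders rather than the adult.csv data, and this is not fastai's own splitting code.

```python
import pandas as pd

# Stand-in frame with 1000 rows; the values are placeholders, not adult.csv.
df = pd.DataFrame({"row": range(1000)})

# Row positions held out for validation, as in the from_df call.
valid_idx = list(range(800, 1000))

# Validation rows are the listed positions; training rows are everything else.
valid = df.iloc[valid_idx]
train = df.drop(index=df.index[valid_idx])

print(len(train), len(valid))
```

With a larger table, only positions 800 to 999 would land in the validation set and every remaining row would be used for training.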

    TabularDataLoaders.from_csv

    Create from csv file in path using procs

    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
    cont_names = ['age', 'fnlwgt', 'education-num']
    procs = [Categorify, FillMissing, Normalize]
    dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, procs=procs, cat_names=cat_names, cont_names=cont_names,
                                      y_names="salary", valid_idx=list(range(800,1000)), bs=64)

    External structured data files can contain unexpected spaces, e.g. after a comma. We can see this in the first row of adult.csv: "49, Private,101320, ...". Trimming is often needed. Pandas has a convenient parameter, skipinitialspace, which is exposed by TabularDataLoaders.from_csv(). Otherwise, category labels used later for inference, such as workclass:Private, will be wrongly categorized as 0 or "#na#" if the training label was read as " Private". Let’s test this feature.
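
The effect of the leading space can be reproduced with pandas alone. The sketch below parses a one-row snippet, built from the first three fields of the adult.csv row quoted above, with and without skipinitialspace. The vocab dictionary is a hypothetical stand-in for a category vocabulary built at training time; it is not how Categorify is implemented, only an illustration of why an untrimmed label fails to match.

```python
import io
import pandas as pd

# First three fields of the adult.csv row quoted above; note the space after the first comma.
raw = "age,workclass,fnlwgt\n49, Private,101320\n"

# Default parsing keeps the leading space as part of the label.
untrimmed = pd.read_csv(io.StringIO(raw))
# skipinitialspace=True drops whitespace that follows a delimiter.
trimmed = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(repr(untrimmed["workclass"][0]))
print(repr(trimmed["workclass"][0]))

# Hypothetical vocabulary built from the untrimmed training labels (index 0 = unknown).
vocab = {label: i for i, label in enumerate(untrimmed["workclass"].unique(), start=1)}

# Looking up the clean label at inference time misses and falls back to 0.
print(vocab.get("Private", 0))
```

Reading the file with skipinitialspace=True at training time avoids the mismatch, because the stored label is then "Private" without the leading space.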