Tabular core

Basic functions to preprocess tabular data before assembling it into a DataLoaders object.

make_date

Make sure df[date_field] is of the right date type.

    df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
    make_date(df, 'date')
    test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))

add_datepart

add_datepart(df, field_name, prefix=None, drop=True, time=False)

Helper function that adds columns relevant to a date in the column field_name of df.

For example, given a series of dates, we can generate features such as Year, Month, Day, Dayofweek, Is_month_start, etc., as shown below:

    df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
    df = add_datepart(df, 'date')
    df.head()
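
To make the kind of features listed above concrete, here is a minimal plain-Python sketch that derives a few of them from a single date. It is an illustration only, not fastai's implementation: the helper name `date_parts` is hypothetical, only a subset of the columns is computed, and keeping missing dates as `None` is an assumption of this sketch.

```python
from datetime import date, timedelta

def date_parts(d):
    """Return calendar features for one date (hypothetical helper, not fastai's API)."""
    names = ["Year", "Month", "Day", "Dayofweek", "Dayofyear",
             "Is_month_start", "Is_month_end"]
    if d is None:
        # Assumption of this sketch: a missing date yields missing features.
        return dict.fromkeys(names)
    next_day = d + timedelta(days=1)
    return {
        "Year": d.year,
        "Month": d.month,
        "Day": d.day,
        "Dayofweek": d.weekday(),               # Monday is 0
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_start": d.day == 1,
        "Is_month_end": next_day.month != d.month,
    }

raw = ["2019-12-04", None, "2019-11-15", "2019-10-24"]
rows = [date_parts(date.fromisoformat(s) if s else None) for s in raw]
print(rows[0])
```

Each dictionary corresponds to one row of new columns; in the real function the names are additionally prefixed.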

add_elapsed_times

add_elapsed_times(df, field_names, date_field, base_field)

For each event in field_names, add to df the elapsed time according to date_field, grouped by base_field.

    df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                       'event': [False, True, False, True], 'base': [1,1,2,2]})
    df = add_elapsed_times(df, ['event'], 'date', 'base')
    df.head()

             date  event  base  Afterevent  Beforeevent  event_bw  event_fw
    0  2019-12-04  False     1           5            0       1.0       0.0
    1  2019-11-29   True     1           0            0       1.0       1.0
    2  2019-11-15  False     2          22            0       1.0       0.0
    3  2019-10-24   True     2           0            0       1.0       1.0
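
The Afterevent column can be read as "days since the most recent event within the same group". The following plain-Python sketch reproduces that column for the four rows above. It is an illustration under that reading, not fastai's code, and the function name `days_since_last_event` is hypothetical.

```python
from datetime import date

def days_since_last_event(rows):
    """rows: list of (date, event, group) tuples.

    Returns, in input order, the number of days since the latest event in the
    same group, or None if the group has had no event yet.
    """
    last_seen = {}
    out = [None] * len(rows)
    # Walk the rows chronologically, but store results at their original position.
    for i in sorted(range(len(rows)), key=lambda i: rows[i][0]):
        day, event, group = rows[i]
        if event:
            last_seen[group] = day
        if group in last_seen:
            out[i] = (day - last_seen[group]).days
    return out

rows = [
    (date(2019, 12, 4), False, 1),
    (date(2019, 11, 29), True, 1),
    (date(2019, 11, 15), False, 2),
    (date(2019, 10, 24), True, 2),
]
after = days_since_last_event(rows)
print(after)  # [5, 0, 22, 0]
```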

cont_cat_split

cont_cat_split(df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given df.

This function decides whether a column is continuous or categorical based on the cardinality of its values. If the cardinality is above the max_card parameter (or the column has a float datatype), the column is added to cont_names; otherwise it is added to cat_names. An example is below:

    df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                       'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                       'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                       'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                       'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
    cont_names, cat_names = cont_cat_split(df)

    cont_names: ['cont1', 'f16']
    cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']

    df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                       'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                       'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                       'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                       'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                       })
    df = add_datepart(df, 'd1_date', drop=False)
    # Assign the result back: the inplace argument is not available in recent pandas versions
    df['cat1'] = df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
    cont_names, cat_names = cont_cat_split(df, max_card=0)

    cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
    cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
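
The splitting rule described above can be sketched without pandas. The version below assumes the rule is "float columns, and integer columns with more than max_card distinct values, are continuous; everything else is categorical". The function name `split_cont_cat` and the `big` column are hypothetical and only serve the illustration.

```python
def split_cont_cat(columns, max_card=20):
    """columns: dict mapping a column name to a list of values.

    Returns (cont_names, cat_names) under the cardinality rule assumed above.
    """
    cont, cat = [], []
    for name, values in columns.items():
        is_float = all(isinstance(v, float) for v in values)
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = all(isinstance(v, int) and not isinstance(v, bool) for v in values)
        if is_float or (is_int and len(set(values)) > max_card):
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat

columns = {
    "cat1": [1, 2, 3, 4],            # few distinct integers: categorical
    "cont1": [1.0, 2.0, 3.0, 2.0],   # floats: continuous
    "cat2": ["a", "b", "b", "a"],    # strings: categorical
    "big": list(range(100)),         # 100 distinct integers: continuous
}
cont, cat = split_cont_cat(columns)
print(cont, cat)
```

With max_card=0, every integer column has more than zero distinct values, which is why all integer columns in the second example above end up in cont_names.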

df_shrink_dtypes(df, skip=[], obj2cat=True, int2uint=False)

Return any possible smaller data types for DataFrame columns. Allows object->category, int->uint, and exclusion.

For example we will make a sample DataFrame with int, float, bool, and object datatypes:

    df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
                       'date':['2019-12-04','2019-11-29','2019-11-15',]})
    df.dtypes

    i         int64
    f       float64
    e          bool
    date     object
    dtype: object

We can then call df_shrink_dtypes to find the smallest possible datatype that can support the data:

    dt = df_shrink_dtypes(df)
    dt

    {'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
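
For integer columns, "smallest possible datatype" comes down to comparing the column's minimum and maximum against the range of each candidate type. A minimal sketch of that idea follows; the helper `smallest_int_type` is hypothetical and only mirrors the int2uint option by name.

```python
# (name, min, max) for the fixed-width integer types, narrowest first
INT_TYPES = [(f"int{b}", -2 ** (b - 1), 2 ** (b - 1) - 1) for b in (8, 16, 32, 64)]
UINT_TYPES = [(f"uint{b}", 0, 2 ** b - 1) for b in (8, 16, 32, 64)]

def smallest_int_type(values, int2uint=False):
    """Return the name of the narrowest integer type that can hold all values."""
    lo, hi = min(values), max(values)
    # Unsigned types are only considered when requested and when no value is negative.
    candidates = UINT_TYPES if (int2uint and lo >= 0) else INT_TYPES
    for name, tmin, tmax in candidates:
        if tmin <= lo and hi <= tmax:
            return name
    raise OverflowError("values do not fit in a 64-bit integer")

print(smallest_int_type([-100, 0, 100]))                # int8
print(smallest_int_type([0, 10, 254]))                  # int16
print(smallest_int_type([0, 10, 254], int2uint=True))   # uint8
```

The value 254 exceeds the int8 maximum of 127, so a signed column holding it needs int16, while an unsigned column fits in uint8.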

df_shrink

df_shrink(df) attempts to make a DataFrame use less memory by fitting numeric columns into the smallest possible datatypes. In addition:

  • boolean, category, and datetime64[ns] dtype columns are ignored.
  • ‘object’ type columns are categorified, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
  • int2uint=True fits int types to uint types if all data in the column is >= 0.
  • columns can be excluded by name using skip=['col1','col2'].

To get only the new column data types without actually casting a DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().

    df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
                       'date':['2019-12-04','2019-11-29','2019-11-15']})
    df2 = df_shrink(df, skip=['date'])

Let’s compare the two:

    df.dtypes

    i         int64
    f       float64
    u         int64
    date     object
    dtype: object

    df2.dtypes

    i          int8
    f       float32
    u         int16
    date     object
    dtype: object

We can see that the datatypes changed, and even further we can look at their relative memory usages:

    Initial Dataframe: 224 bytes
    Reduced Dataframe: 173 bytes
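
These two totals can be checked by hand from the per-element sizes of each dtype. The sketch below assumes a shallow measurement, in which an object column counts as one 8-byte reference per row, and treats whatever remains as a fixed overhead (presumably the index). Both points are assumptions of this sketch, not something stated by the output above.

```python
# Bytes per element for the dtypes that appear in the example
ITEMSIZE = {"int8": 1, "int16": 2, "float32": 4,
            "int64": 8, "float64": 8, "object": 8}

def column_bytes(dtypes, n_rows):
    """Total bytes used by the column data alone, ignoring any index."""
    return sum(ITEMSIZE[t] * n_rows for t in dtypes)

n_rows = 3
before = column_bytes(["int64", "float64", "int64", "object"], n_rows)
after = column_bytes(["int8", "float32", "int16", "object"], n_rows)

print(before, after)               # 96 45
print(224 - before, 173 - after)   # 128 128: the same fixed overhead in both totals
```

The column data shrinks from 96 to 45 bytes. On a DataFrame this small the fixed overhead dominates, which is why the overall saving looks modest.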

Here’s another example using the ADULT_SAMPLE dataset:

    path = untar_data(URLs.ADULT_SAMPLE)
    df = pd.read_csv(path/'adult.csv')
    new_df = df_shrink(df, int2uint=True)

We reduced the overall memory used by 79%!

class Tabular

Tabular(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: CollBase

A DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __getitem__

  • df: A DataFrame of your data
  • cat_names: Your categorical x variables
  • cont_names: Your continuous x variables
  • y_names: Your dependent y variables
    • Note: Mixing y types such as regression and classification is not currently supported; however, multiple regression or multiple classification outputs are.
  • y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
  • splits: How to split your data
  • do_setup: Whether Tabular will run the data through the procs upon initialization
  • device: cuda or cpu
  • inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this
  • reduce_memory: fastai will attempt to reduce the overall memory usage of the input DataFrame with df_shrink

class TabularPandas

TabularPandas(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: Tabular

A Tabular object with transforms

TabularProc(enc=None, dec=None, split_idx=None, order=None) :: InplaceTransform

Base class to write a non-lazy tabular processor for dataframes

These transforms are applied as soon as the data is available rather than as data is called from the DataLoader.

Categorify(enc=None, dec=None, split_idx=None, order=None) :: TabularProc

Transform the categorical variables to something similar to pd.Categorical

While visually in the DataFrame you will not see a change, the classes are stored in to.procs.categorify as we can see below on a dummy DataFrame:

    df = pd.DataFrame({'a':[0,1,2,0,2]})
    to = TabularPandas(df, Categorify, 'a')
    to.show()

       a
    0  0
    1  1
    2  2
    3  0
    4  2

    cat = to.procs.categorify
    cat.classes

    {'a': ['#na#', 0, 1, 2]}
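
The class list above suggests how the encoding works: index 0 is reserved for `#na#` and the observed values follow in sorted order. The plain-Python sketch below rebuilds that vocabulary for the dummy column. Sending unseen values to index 0 is an assumption of this sketch, and the helper names are hypothetical.

```python
def build_vocab(values, na="#na#"):
    """Return (classes, o2i): the class list and a value-to-index mapping."""
    classes = [na] + sorted({v for v in values if v is not None})
    o2i = {c: i for i, c in enumerate(classes)}
    return classes, o2i

def encode(values, o2i):
    # Assumption: anything not seen when the vocabulary was built maps to 0 (#na#).
    return [o2i.get(v, 0) for v in values]

classes, o2i = build_vocab([0, 1, 2, 0, 2])
codes = encode([0, 1, 2, 0, 2], o2i)
unseen = encode([7, None], o2i)

print(classes)  # ['#na#', 0, 1, 2]
print(codes)    # [1, 2, 3, 1, 3]
print(unseen)   # [0, 0]
```

Decoding is the reverse lookup, `classes[code]`, which explains why the displayed DataFrame looks unchanged.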

class FillStrategy

Namespace containing the various filling strategies.

Currently, filling with the median, a constant, and the mode are supported.

class FillMissing

FillMissing(fill_strategy=median, add_col=True, fill_vals=None) :: TabularProc

Fill the missing values in continuous columns.
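
A minimal sketch of the median strategy, assuming that add_col=True means "also record which rows were missing" (as the education-num_na column in the outputs on this page suggests). The function `fill_missing` is hypothetical and works on a plain list rather than a DataFrame column.

```python
from statistics import median

def fill_missing(values, add_col=True):
    """Replace None with the median of the observed values.

    Returns (filled, na_flags, fill_value); na_flags is None when add_col is False.
    """
    fill_value = median(v for v in values if v is not None)
    filled = [fill_value if v is None else v for v in values]
    na_flags = [v is None for v in values] if add_col else None
    return filled, na_flags, fill_value

filled, na_flags, fill_value = fill_missing([12.0, 14.0, None, 15.0, None])
print(fill_value)  # 14.0
print(filled)      # [12.0, 14.0, 14.0, 15.0, 14.0]
print(na_flags)    # [False, False, True, False, True]
```

Keeping the flags lets a model distinguish a genuine median value from one that was filled in.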

class ReadTabBatch

ReadTabBatch(to) :: ItemTransform

Transform values into a Tensor with the ability to decode

TabDataLoader(dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, wif=None, before_iter=None, after_item=None, before_batch=, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

A transformed DataLoader for Tabular data

Integration example

For a more in-depth explanation, see the tabular tutorial.

    path = untar_data(URLs.ADULT_SAMPLE)
    df = pd.read_csv(path/'adult.csv')
    df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
    df_test.drop('salary', axis=1, inplace=True)
    df_main.head()

       age  workclass         fnlwgt  education    education-num  marital-status      occupation       relationship   race                sex     capital-gain  capital-loss  hours-per-week  native-country  salary
    0  49   Private           101320  Assoc-acdm   12.0           Married-civ-spouse  NaN              Wife           White               Female  0             1902          40              United-States   >=50k
    1  44   Private           236746  Masters      14.0           Divorced            Exec-managerial  Not-in-family  White               Male    10520         0             45              United-States   >=50k
    2  38   Private           96185   HS-grad      NaN            Divorced            NaN              Unmarried      Black               Female  0             0             32              United-States   <50k
    3  38   Self-emp-inc      112847  Prof-school  15.0           Married-civ-spouse  Prof-specialty   Husband        Asian-Pac-Islander  Male    0             0             40              United-States   >=50k
    4  42   Self-emp-not-inc  82297   7th-8th      NaN            Married-civ-spouse  Other-service    Wife           Black               Female  0             0             50              United-States   <50k

    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
    cont_names = ['age', 'fnlwgt', 'education-num']
    procs = [Categorify, FillMissing, Normalize]
    splits = RandomSplitter()(range_of(df_main))

    to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)

    dls = to.dataloaders()
    dls.valid.show_batch()

    to.show()

          workclass         education     marital-status      occupation       relationship   race   education-num_na  age   fnlwgt    education-num  salary
    279   Private           HS-grad       Never-married       #na#             Own-child      White  True              20.0  155775.0  10.0           <50k
    6459  Private           HS-grad       Divorced            Craft-repair     Not-in-family  White  False             55.0  35551.0   9.0            <50k
    5544  Private           Assoc-voc     Divorced            Tech-support     Not-in-family  Black  False             53.0  479621.0  11.0           <50k
    3500  ?                 10th          Never-married       ?                Not-in-family  White  False             19.0  182590.0  6.0            <50k
    3788  Self-emp-not-inc  Bachelors     Married-civ-spouse  Sales            Husband        White  False             31.0  340880.0  13.0           <50k
    4002  Self-emp-not-inc  Some-college  Never-married       Sales            Own-child      White  False             30.0  196342.0  10.0           <50k
    204   ?                 HS-grad       Married-civ-spouse  #na#             Husband        White  True              60.0  174073.0  10.0           <50k
    9097  Private           HS-grad       Married-civ-spouse  Adm-clerical     Husband        White  False             39.0  83893.0   9.0            >=50k
    5972  Private           Bachelors     Married-civ-spouse  Exec-managerial  Husband        White  False             48.0  105838.0  13.0           >=50k
    5661  Private           HS-grad       Never-married       Adm-clerical     Own-child      White  False             26.0  262656.0  9.0            <50k
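
The `splits` passed to TabularPandas above are two lists of row indices, one for training and one for validation. The sketch below shows one way such a random split can be produced. The 20% validation share and the helper name `random_split` are assumptions made for illustration, not a description of RandomSplitter's internals.

```python
import random

def random_split(n, valid_pct=0.2, seed=None):
    """Shuffle range(n) and return (train_idxs, valid_idxs)."""
    rng = random.Random(seed)
    idxs = list(range(n))
    rng.shuffle(idxs)
    cut = int(valid_pct * n)
    # The first `cut` shuffled indices form the validation set, the rest the training set.
    return idxs[cut:], idxs[:cut]

train_idx, valid_idx = random_split(10000, seed=42)
print(len(train_idx), len(valid_idx))  # 8000 2000
```

Because the indices are shuffled, the rows shown by to.show() carry non-consecutive labels such as 279 and 6459.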

We can decode any set of transformed data by calling to.decode_row with our raw data:

    row = to.items.iloc[0]
    to.decode_row(row)

    age                          20.0
    workclass                 Private
    fnlwgt                   155775.0
    education                 HS-grad
    education-num                10.0
    marital-status      Never-married
    occupation                   #na#
    relationship            Own-child
    race                        White
    sex                          Male
    capital-gain                    0
    capital-loss                    0
    hours-per-week                 30
    native-country      United-States
    salary                       <50k
    education-num_na             True
    Name: 279, dtype: object

We can make new test datasets based on the training data with to.new():

Note: Since machine learning models can’t magically understand categories they were never trained on, the data should reflect this. If there are different missing values in your test data, you should address this before training.

    to_tst = to.new(df_test)
    to_tst.process()
    to_tst.items.head()

           age        workclass  fnlwgt     education  education-num  marital-status  occupation  relationship  race  sex     capital-gain  capital-loss  hours-per-week  native-country  education-num_na
    10000  0.455476   5          1.326789   10         1.178200       3               2           1             2     Male    0             0             40              Philippines     1
    10001  -0.936297  5          1.240484   12         -0.420714      3               15          1             4     Male    0             0             40              United-States   1
    10002  1.041486   5          0.146895   2          -1.220171      1               9           2             5     Female  0             0             37              United-States   1
    10003  0.528727   5          -0.282639  12         -0.420714      7               2           5             5     Female  0             0             43              United-States   1
    10004  0.748481   6          1.428478   9          0.378743       3               5           1             5     Male    0             0             60              United-States   1

We can then convert it to a DataLoader:

    tst_dl = dls.valid.new(to_tst)
    tst_dl.show_batch()

       workclass     education     marital-status      occupation        relationship   race                education-num_na  age        fnlwgt         education-num
    0  Private       Bachelors     Married-civ-spouse  Adm-clerical      Husband        Asian-Pac-Islander  False             45.000000  338105.001967  13.0
    1  Private       HS-grad       Married-civ-spouse  Transport-moving  Husband        Other               False             26.000000  328663.005601  9.0
    2  Private       11th          Divorced            Other-service     Not-in-family  White               False             53.000000  209021.999795  7.0
    3  Private       HS-grad       Widowed             Adm-clerical      Unmarried      White               False             46.000000  162029.999497  9.0
    4  Self-emp-inc  Assoc-voc     Married-civ-spouse  Exec-managerial   Husband        White               False             49.000000  349229.997780  11.0
    5  Local-gov     Some-college  Married-civ-spouse  Exec-managerial   Husband        White               False             34.000000  124827.002450  10.0
    6  Self-emp-inc  Some-college  Married-civ-spouse  Sales             Husband        White               False             53.000000  290640.001644  10.0
    7  Private       Some-college  Never-married       Sales             Own-child      White               False             19.000000  106272.998740  10.0
    8  Private       Some-college  Married-civ-spouse  Protective-serv   Husband        Black               False             72.000001  53684.003462   10.0
    9  Private       Some-college  Never-married       Sales             Own-child      White               False             20.000000  505980.007069  10.0

Other target types

One-hot encoded label

    def _mock_multi_label(df):
        sal,sex,white = [],[],[]
        for row in df.itertuples():
            sal.append(row.salary == '>=50k')
            sex.append(row.sex == ' Male')
            white.append(row.race == ' White')
        df['salary'] = np.array(sal)
        df['male'] = np.array(sex)
        df['white'] = np.array(white)
        return df

    path = untar_data(URLs.ADULT_SAMPLE)
    df = pd.read_csv(path/'adult.csv')
    df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
    df_main = _mock_multi_label(df_main)

    df_main.head()

    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
    cont_names = ['age', 'fnlwgt', 'education-num']
    procs = [Categorify, FillMissing, Normalize]
    splits = RandomSplitter()(range_of(df_main))
    y_names=["salary", "male", "white"]

    %time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names=y_names, y_block=MultiCategoryBlock(encoded=True, vocab=y_names), splits=splits)

    CPU times: user 60 ms, sys: 0 ns, total: 60 ms
    Wall time: 59.4 ms

       workclass    education     marital-status      occupation         relationship    race                education-num_na  age        fnlwgt         education-num  salary  male   white
    0  Private      HS-grad       Married-civ-spouse  Sales              Husband         White               False             47.000000  186533.999848  9.0            True    True   True
    1  Private      Some-college  Never-married       Adm-clerical       Not-in-family   White               False             32.000000  115631.001216  10.0           False   False  True
    2  Federal-gov  Some-college  Widowed             Exec-managerial    Not-in-family   White               False             60.000001  27466.003873   10.0           False   False  True
    3  Private      HS-grad       Never-married       Other-service      Not-in-family   White               False             49.000000  129639.997602  9.0            False   False  True
    4  Local-gov    Prof-school   Married-civ-spouse  Prof-specialty     Husband         White               False             37.000000  265038.001582  15.0           True    True   True
    5  Private      Bachelors     Never-married       Handlers-cleaners  Other-relative  White               False             23.000001  256755.002929  13.0           False   False  True
    6  Private      HS-grad       Never-married       Machine-op-inspct  Not-in-family   White               False             39.000000  185052.999958  9.0            False   False  True
    7  Private      HS-grad       Never-married       Handlers-cleaners  Own-child       White               False             28.000000  189346.000139  9.0            False   True   True
    8  Private      10th          Married-civ-spouse  Other-service      Husband         Asian-Pac-Islander  False             35.000000  176122.999494  6.0            False   True   False
    9  Private      5th-6th       Never-married       Machine-op-inspct  Other-relative  White               False             25.000000  521399.996882  3.0            False   True   True

Not one-hot encoded

    def _mock_multi_label(df):
        targ = []
        for row in df.itertuples():
            labels = []
            if row.salary == '>=50k': labels.append('>50k')
            if row.sex == ' Male': labels.append('male')
            # The 'white' label appears in the target column of the output below
            if row.race == ' White': labels.append('white')
            targ.append(' '.join(labels))
        df['target'] = np.array(targ)
        return df

    path = untar_data(URLs.ADULT_SAMPLE)
    df = pd.read_csv(path/'adult.csv')
    df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
    df_main = _mock_multi_label(df_main)

    df_main.head()

       age  workclass         fnlwgt  education    education-num  marital-status      occupation       relationship   race                sex     capital-gain  capital-loss  hours-per-week  native-country  salary  target
    0  49   Private           101320  Assoc-acdm   12.0           Married-civ-spouse  NaN              Wife           White               Female  0             1902          40              United-States   >=50k   >50k white
    1  44   Private           236746  Masters      14.0           Divorced            Exec-managerial  Not-in-family  White               Male    10520         0             45              United-States   >=50k   >50k male white
    2  38   Private           96185   HS-grad      NaN            Divorced            NaN              Unmarried      Black               Female  0             0             32              United-States   <50k
    3  38   Self-emp-inc      112847  Prof-school  15.0           Married-civ-spouse  Prof-specialty   Husband        Asian-Pac-Islander  Male    0             0             40              United-States   >=50k   >50k male
    4  42   Self-emp-not-inc  82297   7th-8th      NaN            Married-civ-spouse  Other-service    Wife           Black               Female  0             0             50              United-States   <50k
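
The two target layouts in this section carry the same information: a space-delimited label string can be expanded into one boolean per label. A small sketch of that conversion, using the five target values shown above, follows. The function `multi_hot` and the fixed vocabulary order are assumptions of this sketch.

```python
def multi_hot(targets, vocab):
    """Turn space-delimited label strings into rows of booleans, one per vocab entry."""
    rows = []
    for target in targets:
        labels = set(target.split())   # an empty string yields an empty set
        rows.append([label in labels for label in vocab])
    return rows

vocab = [">50k", "male", "white"]
targets = [">50k white", ">50k male white", "", ">50k male", ""]
encoded = multi_hot(targets, vocab)

for target, row in zip(targets, encoded):
    print(repr(target), row)
```

Note that the string must be split into labels first: iterating over it directly would yield single characters rather than labels.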

    @MultiCategorize
    def encodes(self, to:Tabular):
        #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
        return to

    @MultiCategorize
    def decodes(self, to:Tabular):
        #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
        return to

    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
    cont_names = ['age', 'fnlwgt', 'education-num']
    procs = [Categorify, FillMissing, Normalize]
    splits = RandomSplitter()(range_of(df_main))

    %time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="target", y_block=MultiCategoryBlock(), splits=splits)

    CPU times: user 68 ms, sys: 0 ns, total: 68 ms
    Wall time: 65 ms

    to.procs[2].vocab

    ['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']

Regression

    path = untar_data(URLs.ADULT_SAMPLE)
    df = pd.read_csv(path/'adult.csv')
    df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
    df_main = _mock_multi_label(df_main)

    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
    cont_names = ['fnlwgt', 'education-num']
    procs = [Categorify, FillMissing, Normalize]
    splits = RandomSplitter()(range_of(df_main))

    %time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names='age', splits=splits)

    CPU times: user 60 ms, sys: 4 ms, total: 64 ms
    Wall time: 63.3 ms

    to.procs[-1].means

    {'fnlwgt': 192492.332875, 'education-num': 10.075499534606934}
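
The stored means are what make normalized values decodable again. A minimal sketch of the round trip on one column follows; the sample values are hypothetical, and using the population standard deviation is an assumption of this sketch rather than a statement about the Normalize proc.

```python
from statistics import mean, pstdev

def fit(values):
    """Return the mean and population standard deviation of the training values."""
    return mean(values), pstdev(values)

def normalize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

def denormalize(values, mu, sigma):
    return [v * sigma + mu for v in values]

train = [9.0, 10.0, 13.0, 8.0]     # hypothetical education-num values
mu, sigma = fit(train)
encoded = normalize(train, mu, sigma)
decoded = denormalize(encoded, mu, sigma)

print(mu)        # 10.0
print(encoded)   # zero mean, unit variance
print(decoded)   # recovers the original values up to floating point error
```

The statistics are fitted once and then reused for any later data, so that validation and test rows are scaled the same way as the training rows.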

    dls = to.dataloaders()
    dls.valid.show_batch()

       workclass     education     marital-status      occupation         relationship   race   education-num_na  fnlwgt         education-num  age
    0  Private       9th           Married-civ-spouse  Machine-op-inspct  Husband        White  False             288185.002301  5.0            25.0
    1  Self-emp-inc  HS-grad       Married-civ-spouse  Craft-repair       Husband        White  False             383492.997753  9.0            44.0
    2  Private       HS-grad       Married-civ-spouse  Craft-repair       Husband        White  False             84136.001920   9.0            40.0
    3  Private       Bachelors     Never-married       Handlers-cleaners  Own-child      White  True              31778.002656   10.0           28.0
    4  Private       Some-college  Married-civ-spouse  Adm-clerical       Husband        Black  False             193036.000001  10.0           34.0
    5  Private       10th          Divorced            Machine-op-inspct  Not-in-family  Black  False             131713.998819  6.0            29.0
    6  Private       HS-grad       Married-civ-spouse  Machine-op-inspct  Husband        White  False             275632.002074  9.0            30.0
    7  Private       HS-grad       Married-civ-spouse  Other-service      Husband        White  False             107236.003015  9.0            27.0
    8  Private       HS-grad       Married-civ-spouse  Machine-op-inspct  Husband        Black  False             83878.997816   9.0            28.0
    9  Private       7th-8th       Never-married       Handlers-cleaners  Own-child      White  False             255476.000025  4.0            29.0

    class TensorTabular(fastuple):
        def get_ctxs(self, max_n=10, **kwargs):
            n_samples = min(self[0].shape[0], max_n)
            df = pd.DataFrame(index = range(n_samples))
            return [df.iloc[i] for i in range(n_samples)]

        def display(self, ctxs): display_df(pd.DataFrame(ctxs))

    class TabularLine(pd.Series):
        "A line of a dataframe that knows how to show itself"
        def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)

    class ReadTabLine(ItemTransform):
        def __init__(self, proc): self.proc = proc

        def encodes(self, row):
            cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
            return TensorTabular(tensor(cats).long(),tensor(conts).float())

        def decodes(self, o):
            to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
            to = self.proc.decode(to)
            return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))

    class ReadTabTarget(ItemTransform):
        def __init__(self, proc): self.proc = proc
        def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
        def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])

    # enc = tds[1]
    # test_eq(enc[0][0], tensor([2,1]))
    # test_close(enc[0][1], tensor([-0.628828]))
    # test_eq(enc[1], 1)
    # dec = tds.decode(enc)
    # assert isinstance(dec[0], TabularLine)
    # test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
    # test_eq(dec[1], 'a')
    # test_stdout(lambda: print(show_at(tds, 1)), """a 1
    # b_na False
    # b 1
    # category a

©2021 fast.ai. All rights reserved.
Site last generated: Mar 31, 2021