Tabular core

Basic functions to preprocess tabular data before assembling it in a DataLoaders.

`make_date`[source]

Make sure df[date_field] is of the right date type.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))
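As a rough illustration of what such a conversion involves, here is a minimal pandas-only sketch. The helper name ensure_datetime is hypothetical, and this is not fastai's implementation of make_date.

```python
import pandas as pd

def ensure_datetime(df, date_field):
    "Hypothetical helper: convert df[date_field] to a datetime dtype in place."
    # Skip the conversion when the column is already a datetime column.
    if not pd.api.types.is_datetime64_any_dtype(df[date_field]):
        df[date_field] = pd.to_datetime(df[date_field])

dates = pd.DataFrame({'date': ['2019-12-04', '2019-11-29']})
ensure_datetime(dates, 'date')
```

Calling the helper a second time on the same column leaves it unchanged, because the dtype check short-circuits.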

`add_datepart`[source]

add_datepart(df, field_name, prefix=None, drop=True, time=False)

Helper function that adds columns relevant to a date in the column field_name of df.

For example, if we have a series of dates, we can generate features such as Year, Month, Day, Dayofweek, Is_month_start, etc., as shown below:

df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
df.head()
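To give a feel for where such features come from, the sketch below derives a few of them by hand with the pandas .dt accessor. It only assumes pandas and covers a small subset of the columns; it is not how add_datepart itself is written.

```python
import pandas as pd

dates = pd.DataFrame({'date': pd.to_datetime(['2019-12-04', None, '2019-11-15'])})
col = dates['date']

# Each feature is read straight off the .dt accessor.
# The missing date produces NaN in the numeric columns.
dates['Year'] = col.dt.year
dates['Month'] = col.dt.month
dates['Dayofweek'] = col.dt.dayofweek          # Monday=0 ... Sunday=6
dates['Is_month_start'] = col.dt.is_month_start
```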

`add_elapsed_times`[source]

add_elapsed_times(df, field_names, date_field, base_field)

Add in df for each event in field_names the elapsed time according to date_field grouped by base_field

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df.head()

	date	event	base	Afterevent	event_bw	event_fw
0	2019-12-04	False	1	5	1.0	0.0
1	2019-11-29	True	1	0	1.0	1.0
2	2019-11-15	False	2	22	1.0	0.0
3	2019-10-24	True	2	0	1.0	1.0
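The Afterevent column can be read as "days since the most recent event within the same base group". The following pandas-only sketch reproduces that one column under this reading. The column name days_since_event is made up for the illustration, and the approach is an assumption about the semantics, not fastai's code.

```python
import pandas as pd

events = pd.DataFrame({
    'date': pd.to_datetime(['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']),
    'event': [False, True, False, True],
    'base': [1, 1, 2, 2],
})

# Work in chronological order inside each group.
events = events.sort_values(['base', 'date'])

# Keep the date only on event rows, then carry it forward within each group.
last_event = events['date'].where(events['event']).groupby(events['base']).ffill()

# Elapsed days between each row and the latest event seen so far in its group.
events['days_since_event'] = (events['date'] - last_event).dt.days
events = events.sort_index()
```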

`cont_cat_split`[source]

cont_cat_split(df, max_card=20, dep_var=None)

Helper function that returns column names of cont and cat variables from given df.

This function decides whether a column is continuous or categorical based on its datatype and the cardinality of its values. An integer column with more than max_card unique values, or any column with a float datatype, is added to cont_names; every other column is added to cat_names. An example is below:

df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'], 
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'), 
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'), 
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
cont_names, cat_names = cont_cat_split(df)

cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']

df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                    'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                    'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                    'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                    'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                    })
df = add_datepart(df, 'd1_date', drop=False)
df['cat1'] = df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
cont_names, cat_names = cont_cat_split(df, max_card=0)

cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
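The rule illustrated by these two examples can be sketched in a few lines of pandas. The function split_by_cardinality below is a hypothetical stand-in written for this page, so details such as the handling of the dependent variable are left out.

```python
import pandas as pd

def split_by_cardinality(df, max_card=20):
    "Hypothetical helper: floats and high-cardinality ints are continuous."
    cont_names, cat_names = [], []
    for name in df.columns:
        col = df[name]
        is_float = pd.api.types.is_float_dtype(col)
        is_big_int = pd.api.types.is_integer_dtype(col) and col.nunique() > max_card
        (cont_names if is_float or is_big_int else cat_names).append(name)
    return cont_names, cat_names

toy = pd.DataFrame({'id': list(range(30)), 'grade': [1, 2, 3] * 10,
                    'price': [0.5] * 30, 'colour': ['r', 'g', 'b'] * 10})
cont, cat = split_by_cardinality(toy)
```

Here id has 30 distinct integers and price is a float, so both land in cont, while grade and colour stay categorical.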

df_shrink_dtypes(df, skip=[], obj2cat=True, int2uint=False)

Return any possible smaller data types for DataFrame columns. Allows object->category, int->uint, and exclusion.

For example we will make a sample DataFrame with int, float, bool, and object datatypes:

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
                   'date':['2019-12-04','2019-11-29','2019-11-15',]})
df.dtypes

i         int64
f       float64
e          bool
date     object
dtype: object

We can then call df_shrink_dtypes to find the smallest possible datatype that can support the data:

dt = df_shrink_dtypes(df)
dt

{'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
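One way to picture the integer part of this search is to walk through the integer types from narrowest to widest and stop at the first one whose range covers the data. The sketch below does that with np.iinfo. The helper smallest_int_dtype is hypothetical and only approximates the behaviour described here.

```python
import numpy as np

def smallest_int_dtype(values, allow_unsigned=False):
    "Hypothetical helper: first integer dtype whose range covers all values."
    lo, hi = min(values), max(values)
    candidates = [np.int8, np.int16, np.int32, np.int64]
    # Unsigned types are only an option when nothing is negative.
    if allow_unsigned and lo >= 0:
        candidates = [np.uint8, np.uint16, np.uint32, np.uint64]
    for dt in candidates:
        info = np.iinfo(dt)
        if info.min <= lo and hi <= info.max:
            return np.dtype(dt)
    raise ValueError("no integer dtype fits these values")

signed = smallest_int_dtype([-100, 0, 100])
unsigned = smallest_int_dtype([0, 10, 254], allow_unsigned=True)
```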

`df_shrink`

df_shrink(df) attempts to make a DataFrame use less memory by fitting numeric columns into the smallest possible datatypes. In addition:

- boolean, category, and datetime64[ns] dtype columns are ignored.
- ‘object’ type columns are categorified, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
- int2uint=True fits int types to uint types if all data in the column is >= 0.
- columns can be excluded by name using skip=['col1','col2'].

To get only the new column data types without actually casting a DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
                  'date':['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])

Let’s compare the two:

df.dtypes

i         int64
f       float64
u         int64
date     object
dtype: object

df2.dtypes

i          int8
f       float32
u         int16
date     object
dtype: object

We can see that the datatypes changed, and even further we can look at their relative memory usages:

Initial Dataframe: 224 bytes
Reduced Dataframe: 173 bytes
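The code that produced these figures is not shown. One plausible way to make such a comparison, assuming DataFrame.memory_usage is an acceptable measure, is sketched below on a small frame that is cast by hand. The byte counts it prints will not match the numbers above, since the frame and the index handling differ.

```python
import pandas as pd

wide = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0]})
narrow = wide.astype({'i': 'int8', 'f': 'float32'})

# memory_usage() reports bytes per column; index=False leaves the index out.
before = wide.memory_usage(index=False).sum()
after = narrow.memory_usage(index=False).sum()
print(f"Initial Dataframe: {before} bytes")
print(f"Reduced Dataframe: {after} bytes")
```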

Here’s another example using the ADULT_SAMPLE dataset:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)

We reduced the overall memory used by 79%!

`class` `Tabular`

Tabular(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: CollBase

A DataFrame wrapper that knows which cols are cont/cat/y, and returns rows in __getitem__

df: A DataFrame of your data
cat_names: Your categorical x variables
cont_names: Your continuous x variables
y_names: Your dependent y variables
- Note: Mixed y’s such as Regression and Classification are not currently supported; however, multiple regression or classification outputs are
y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
splits: How to split your data
do_setup: Whether Tabular will run the data through the procs upon initialization
device: cuda or cpu
inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this
reduce_memory: If True, fastai will attempt to reduce the overall memory usage of the inputted DataFrame with df_shrink
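In the examples on this page, splits is produced by a splitter and amounts to a pair of index collections, one for training and one for validation. The sketch below builds such a pair with NumPy. The function random_split, its valid_pct argument, and its seed are illustrative assumptions and not the library's RandomSplitter.

```python
import numpy as np

def random_split(n_items, valid_pct=0.2, seed=0):
    "Hypothetical stand-in for a splitter: returns (train_idxs, valid_idxs)."
    rng = np.random.default_rng(seed)
    idxs = rng.permutation(n_items)        # shuffled row positions
    cut = int(valid_pct * n_items)         # size of the validation set
    return idxs[cut:].tolist(), idxs[:cut].tolist()

train_idxs, valid_idxs = random_split(10)
```

Every row index appears in exactly one of the two lists, which is the property the processors rely on when they fit their statistics on the training rows only.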

`class` `TabularPandas`

TabularPandas(df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True) :: Tabular

A Tabular object with transforms

TabularProc(enc=None, dec=None, split_idx=None, order=None) :: InplaceTransform

Base class to write a non-lazy tabular processor for dataframes

These transforms are applied as soon as the data is available rather than as data is called from the DataLoader

Categorify(enc=None, dec=None, split_idx=None, order=None) :: TabularProc

Transform the categorical variables to something similar to pd.Categorical

While visually in the DataFrame you will not see a change, the classes are stored in to.procs.categorify as we can see below on a dummy DataFrame:

df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.show()

	a
0	0
1	1
2	2
3	0
4	2

cat = to.procs.categorify
cat.classes

{'a': ['#na#', 0, 1, 2]}
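The shape of that mapping, with '#na#' occupying position 0 and the observed classes following, can be imitated with pd.Categorical. This is a sketch of the idea under that assumption and not the code Categorify runs.

```python
import pandas as pd

values = pd.Series(['a', 'b', 'c', 'a', 'c', None])
categorical = pd.Categorical(values)

# Reserve slot 0 for missing values, then list the observed classes.
classes = ['#na#'] + list(categorical.categories)

# pandas encodes missing values as -1, so shifting by one maps them to '#na#'.
codes = categorical.codes + 1
```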

`class` `FillStrategy`[source]

Namespace containing the various filling strategies.

Currently, filling with the median, a constant, and the mode are supported.

`class` `FillMissing`[source]

FillMissing(fill_strategy=median, add_col=True, fill_vals=None) :: TabularProc

Fill the missing values in continuous columns.
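For intuition, a median fill with an indicator column can be sketched in plain pandas as follows. The helper fill_median is hypothetical. It mimics the effect of the median strategy together with add_col=True on a single column, and the `_na` suffix is taken from the education-num_na column that appears in the outputs on this page.

```python
import pandas as pd

def fill_median(df, col, add_col=True):
    "Hypothetical helper: median-fill one column, optionally flagging filled rows."
    missing = df[col].isna()
    # Record which rows were missing before they are overwritten.
    if add_col and missing.any():
        df[f'{col}_na'] = missing
    df[col] = df[col].fillna(df[col].median())
    return df

edu = fill_median(pd.DataFrame({'education-num': [12.0, 14.0, None, 15.0]}),
                  'education-num')
```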

`class` `ReadTabBatch`

ReadTabBatch(to) :: ItemTransform

Transform values into a Tensor with the ability to decode

TabDataLoader(dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

A transformed DataLoader for Tabular data

Integration example

For a more in-depth explanation, see the tabular tutorial.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))

to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)

dls = to.dataloaders()
dls.valid.show_batch()

to.show()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
279	Private	HS-grad	Never-married	#na#	Own-child	White	True	20.0	155775.0	10.0	<50k
6459	Private	HS-grad	Divorced	Craft-repair	Not-in-family	White	False	55.0	35551.0	9.0	<50k
5544	Private	Assoc-voc	Divorced	Tech-support	Not-in-family	Black	False	53.0	479621.0	11.0	<50k
3500	?	10th	Never-married	?	Not-in-family	White	False	19.0	182590.0	6.0	<50k
3788	Self-emp-not-inc	Bachelors	Married-civ-spouse	Sales	Husband	White	False	31.0	340880.0	13.0	<50k
4002	Self-emp-not-inc	Some-college	Never-married	Sales	Own-child	White	False	30.0	196342.0	10.0	<50k
204	?	HS-grad	Married-civ-spouse	#na#	Husband	White	True	60.0	174073.0	10.0	<50k
9097	Private	HS-grad	Married-civ-spouse	Adm-clerical	Husband	White	False	39.0	83893.0	9.0	>=50k
5972	Private	Bachelors	Married-civ-spouse	Exec-managerial	Husband	White	False	48.0	105838.0	13.0	>=50k
5661	Private	HS-grad	Never-married	Adm-clerical	Own-child	White	False	26.0	262656.0	9.0	<50k

We can decode any set of transformed data by calling to.decode_row with our raw data:

row = to.items.iloc[0]
to.decode_row(row)

age                           20.0
workclass                  Private
fnlwgt                    155775.0
education                  HS-grad
education-num                 10.0
marital-status       Never-married
occupation                    #na#
relationship             Own-child
race                         White
sex                           Male
capital-gain                     0
capital-loss                     0
hours-per-week                  30
native-country       United-States
salary                        <50k
education-num_na              True
Name: 279, dtype: object
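Decoding reverses the processing steps: integer codes are looked up in the stored classes and normalized values are scaled back with the stored statistics. The toy round trip below shows the principle with a made-up vocabulary and made-up statistics. None of the names belong to fastai.

```python
classes = ['#na#', 'Private', 'Self-emp-inc']   # toy vocabulary (assumption)
mean, std = 38.5, 13.0                           # toy training statistics (assumption)

def encode(workclass, age):
    "Map the category to its code (0 if unseen) and standardize the number."
    code = classes.index(workclass) if workclass in classes else 0
    return code, (age - mean) / std

def decode(code, norm_age):
    "Invert encode: look the code up and undo the standardization."
    return classes[code], norm_age * std + mean

restored = decode(*encode('Private', 20.0))
```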

We can make new test datasets based on the training data with to.new():

Note: Since machine learning models can’t magically understand categories they were never trained on, the test data should only contain categories seen during training. If there are different missing values in your test data, you should address this before training.

to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	hours-per-week	native-country	education-num_na
10000	0.455476	5	1.326789	10	1.178200	3	2	1	2	Male	40	Philippines	1
10001	-0.936297	5	1.240484	12	-0.420714	3	15	1	4	Male	40	United-States	1
10002	1.041486	5	0.146895	2	-1.220171	1	9	2	5	Female	37	United-States	1
10003	0.528727	5	-0.282639	12	-0.420714	7	2	5	5	Female	43	United-States	1
10004	0.748481	6	1.428478	9	0.378743	3	5	1	5	Male	60	United-States	1

We can then convert it to a DataLoader:

tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num
0	Private	Bachelors	Married-civ-spouse	Adm-clerical	Husband	Asian-Pac-Islander	False	45.000000	338105.001967	13.0
1	Private	HS-grad	Married-civ-spouse	Transport-moving	Husband	Other	False	26.000000	328663.005601	9.0
2	Private	11th	Divorced	Other-service	Not-in-family	White	False	53.000000	209021.999795	7.0
3	Private	HS-grad	Widowed	Adm-clerical	Unmarried	White	False	46.000000	162029.999497	9.0
4	Self-emp-inc	Assoc-voc	Married-civ-spouse	Exec-managerial	Husband	White	False	49.000000	349229.997780	11.0
5	Local-gov	Some-college	Married-civ-spouse	Exec-managerial	Husband	White	False	34.000000	124827.002450	10.0
6	Self-emp-inc	Some-college	Married-civ-spouse	Sales	Husband	White	False	53.000000	290640.001644	10.0
7	Private	Some-college	Never-married	Sales	Own-child	White	False	19.000000	106272.998740	10.0
8	Private	Some-college	Married-civ-spouse	Protective-serv	Husband	Black	False	72.000001	53684.003462	10.0
9	Private	Some-college	Never-married	Sales	Own-child	White	False	20.000000	505980.007069	10.0

Other target types

one-hot encoded label

def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male']   = np.array(sex)
    df['white']  = np.array(white)
    return df

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)

df_main.head()

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names=["salary", "male", "white"]

%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names=y_names, y_block=MultiCategoryBlock(encoded=True, vocab=y_names), splits=splits)

CPU times: user 60 ms, sys: 0 ns, total: 60 ms
Wall time: 59.4 ms

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary	male	white
0	Private	HS-grad	Married-civ-spouse	Sales	Husband	White	False	47.000000	186533.999848	9.0	True	True	True
1	Private	Some-college	Never-married	Adm-clerical	Not-in-family	White	False	32.000000	115631.001216	10.0	False	False	True
2	Federal-gov	Some-college	Widowed	Exec-managerial	Not-in-family	White	False	60.000001	27466.003873	10.0	False	False	True
3	Private	HS-grad	Never-married	Other-service	Not-in-family	White	False	49.000000	129639.997602	9.0	False	False	True
4	Local-gov	Prof-school	Married-civ-spouse	Prof-specialty	Husband	White	False	37.000000	265038.001582	15.0	True	True	True
5	Private	Bachelors	Never-married	Handlers-cleaners	Other-relative	White	False	23.000001	256755.002929	13.0	False	False	True
6	Private	HS-grad	Never-married	Machine-op-inspct	Not-in-family	White	False	39.000000	185052.999958	9.0	False	False	True
7	Private	HS-grad	Never-married	Handlers-cleaners	Own-child	White	False	28.000000	189346.000139	9.0	False	True	True
8	Private	10th	Married-civ-spouse	Other-service	Husband	Asian-Pac-Islander	False	35.000000	176122.999494	6.0	False	True	False
9	Private	5th-6th	Never-married	Machine-op-inspct	Other-relative	White	False	25.000000	521399.996882	3.0	False	True	True
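The row-by-row loop in _mock_multi_label can also be written with vectorized comparisons. The sketch below runs on a two-row stand-in frame and assumes, as the comparisons against ' Male' and ' White' suggest, that the raw values carry a leading space.

```python
import pandas as pd

adult = pd.DataFrame({'salary': ['>=50k', '<50k'],
                      'sex': [' Male', ' Female'],
                      'race': [' White', ' Black']})

# .str.strip() removes the surrounding whitespace before comparing.
adult['male'] = adult['sex'].str.strip() == 'Male'
adult['white'] = adult['race'].str.strip() == 'White'
adult['salary'] = adult['salary'] == '>=50k'
```

Each target column ends up as a boolean Series, which is the already-encoded form that MultiCategoryBlock(encoded=True, ...) is given in this example.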

Not one-hot encoded

def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male':   labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)

df_main.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary	target
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k	>50k white
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k	>50k male white
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k	>50k male
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k
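A space-delimited target column like this one can be expanded into one indicator column per label with Series.str.get_dummies. The snippet below applies it to the five target values shown in the table, purely as an illustration of the relationship between the two target formats. It is not part of the fastai pipeline.

```python
import pandas as pd

target = pd.Series(['>50k white', '>50k male white', '', '>50k male', ''])

# Split each string on spaces and create one 0/1 column per distinct label.
multi_hot = target.str.get_dummies(sep=' ')
```

Rows with an empty target string simply get zeros in every column.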

@MultiCategorize
def encodes(self, to:Tabular): 
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to
@MultiCategorize
def decodes(self, to:Tabular): 
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))

%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="target", y_block=MultiCategoryBlock(), splits=splits)

CPU times: user 68 ms, sys: 0 ns, total: 68 ms
Wall time: 65 ms

to.procs[2].vocab

['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']

Regression

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))

%time to = TabularPandas(df_main, procs, cat_names, cont_names, y_names='age', splits=splits)

CPU times: user 60 ms, sys: 4 ms, total: 64 ms
Wall time: 63.3 ms

to.procs[-1].means

{'fnlwgt': 192492.332875, 'education-num': 10.075499534606934}
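These stored means are what a standardization step needs. As a sketch, assuming the statistics are taken from the training rows only and applied to every row, the computation looks like the following. The toy values and the split are made up, and details such as the exact standard deviation formula may differ from what Normalize does.

```python
import numpy as np

fnlwgt = np.array([101320., 236746., 96185., 112847., 82297.])
train_idxs = [0, 1, 2, 3]                    # toy split: the last row is validation

# Fit on the training rows, then transform all rows with the same statistics.
mean, std = fnlwgt[train_idxs].mean(), fnlwgt[train_idxs].std()
normalized = (fnlwgt - mean) / std
```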

dls = to.dataloaders()
dls.valid.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	fnlwgt	education-num	age
0	Private	9th	Married-civ-spouse	Machine-op-inspct	Husband	White	False	288185.002301	5.0	25.0
1	Self-emp-inc	HS-grad	Married-civ-spouse	Craft-repair	Husband	White	False	383492.997753	9.0	44.0
2	Private	HS-grad	Married-civ-spouse	Craft-repair	Husband	White	False	84136.001920	9.0	40.0
3	Private	Bachelors	Never-married	Handlers-cleaners	Own-child	White	True	31778.002656	10.0	28.0
4	Private	Some-college	Married-civ-spouse	Adm-clerical	Husband	Black	False	193036.000001	10.0	34.0
5	Private	10th	Divorced	Machine-op-inspct	Not-in-family	Black	False	131713.998819	6.0	29.0
6	Private	HS-grad	Married-civ-spouse	Machine-op-inspct	Husband	White	False	275632.002074	9.0	30.0
7	Private	HS-grad	Married-civ-spouse	Other-service	Husband	White	False	107236.003015	9.0	27.0
8	Private	HS-grad	Married-civ-spouse	Machine-op-inspct	Husband	Black	False	83878.997816	9.0	28.0
9	Private	7th-8th	Never-married	Handlers-cleaners	Own-child	White	False	255476.000025	4.0	29.0

class TensorTabular(fastuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]
    def display(self, ctxs): display_df(pd.DataFrame(ctxs))
class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)
class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())
    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))
class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])

# enc = tds[1]
# test_eq(enc[0][0], tensor([2,1]))
# test_close(enc[0][1], tensor([-0.628828]))
# test_eq(enc[1], 1)
# dec = tds.decode(enc)
# assert isinstance(dec[0], TabularLine)
# test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
# test_eq(dec[1], 'a')
# test_stdout(lambda: print(show_at(tds, 1)), """a               1
# b_na        False
# b               1
# category        a

©2021 fast.ai. All rights reserved.
Site last generated: Mar 31, 2021
