text.data

    This module contains the TextDataset class, which is the main dataset you should use for your NLP tasks. It automatically does the preprocessing steps described in text.transform. It also contains all the functions to quickly get a TextDataBunch ready.

    Quickly assemble your data

    You should get your data in one of the following formats to make the most of the fastai library and use one of the factory methods of one of the TextDataBunch classes:

    • raw text files in folders train, valid, test in an ImageNet style,
    • a csv where some column(s) gives the label(s) and the following one the associated text,
    • a dataframe structured the same way,
    • tokens and labels arrays,
    • ids, vocabulary (correspondence id to word) and labels.

    If you are assembling the data for a language model, you should define your labels as always 0 to respect those formats. The first time you create a DataBunch with one of those functions, your data will be preprocessed automatically. You can save it, so that the next time you call it is almost instantaneous.
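    For instance, here is a minimal sketch (pandas only, with made-up texts) of a dataframe laid out this way for a language model:

    import pandas as pd

    texts = ["this movie was great", "terrible acting, decent plot"]  # hypothetical raw texts
    # label column first, text column second; labels are all 0 for a language model
    df = pd.DataFrame({'label': [0] * len(texts), 'text': texts})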

    Below are the classes that help assemble the raw data in a DataBunch suitable for NLP.

    class TextLMDataBunch[test]

    Some other tests where TextLMDataBunch is used:

    • pytest -sv tests/test_text_data.py::test_from_csv_and_from_df
    • pytest -sv tests/test_text_data.py::test_should_load_backwards_lm_1
    • pytest -sv tests/test_text_data.py::test_should_load_backwards_lm_2

    To run tests please refer to this guide.

    Create a DataBunch suitable for training a language model.

    All the texts in the datasets are concatenated and the labels are ignored. Instead, the target is the next word in the sentence.

    create[source][test]

    create(train_ds, valid_ds, test_ds=None, path:PathOrStr='.', no_check:bool=False, bs=64, val_bs:int=None, num_workers:int=0, device:device=None, collate_fn:Callable='data_collate', dl_tfms:Optional[Collection[Callable]]=None, bptt:int=70, backwards:bool=False, **dl_kwargs) → DataBunch

    No tests found for create. To contribute a test please refer to this guide and this discussion.

    Create a DataBunch in path from the datasets for language modelling. Passes **dl_kwargs on to DataLoader().
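    create is rarely called directly; the factory methods described below build the datasets and forward arguments such as bs and bptt down to it. A minimal sketch, assuming the IMDB sample used in the Example section:

    from fastai.text import *

    path = untar_data(URLs.IMDB_SAMPLE)
    # bs and bptt are forwarded down to TextLMDataBunch.create
    data_lm = TextLMDataBunch.from_csv(path, 'texts.csv', bs=64, bptt=70)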

    class TextClasDataBunch[test]

    TextClasDataBunch(train_dl:DataLoader, valid_dl:DataLoader, fix_dl:DataLoader=None, test_dl:Optional[DataLoader]=None, device:device=None, dl_tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate', no_check:bool=False) :: TextDataBunch

    Tests found for TextClasDataBunch:

    Some other tests where TextClasDataBunch is used:

    • pytest -sv tests/test_text_data.py::test_backwards_cls_databunch
    • pytest -sv tests/test_text_data.py::test_from_csv_and_from_df
    • pytest -sv tests/test_text_data.py::test_from_ids_exports_classes
    • pytest -sv tests/test_text_data.py::test_from_ids_works_for_equally_length_sentences
    • pytest -sv tests/test_text_data.py::test_from_ids_works_for_variable_length_sentences
    • pytest -sv tests/test_text_data.py::test_load_and_save_test

    To run tests please refer to this guide.

    Create a DataBunch suitable for training an RNN classifier.

    create[test]

    create(train_ds, valid_ds, test_ds=None, path:PathOrStr='.', bs:int=32, val_bs:int=None, pad_idx=1, pad_first=True, device:device=None, no_check:bool=False, backwards:bool=False, dl_tfms:Optional[Collection[Callable]]=None, **dl_kwargs) → DataBunch

    No tests found for create. To contribute a test please refer to this guide and this discussion.

    Function that transforms the datasets into a DataBunch for classification. Passes **dl_kwargs on to DataLoader().

    All the texts are grouped by length (with a bit of randomness for the training set) then padded so that all the samples in a batch have the same length.
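    A minimal sketch, again on the IMDB sample, showing how bs, pad_idx and pad_first are forwarded from the factory method down to create:

    from fastai.text import *

    path = untar_data(URLs.IMDB_SAMPLE)
    # pad_idx and pad_first are forwarded down to TextClasDataBunch.create
    data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', bs=32, pad_idx=1, pad_first=True)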

    class TextDataBunch[source][test]

    TextDataBunch(train_dl:DataLoader, valid_dl:DataLoader, fix_dl:DataLoader=None, test_dl:Optional[DataLoader]=None, device:device=None, dl_tfms:Optional[Collection[Callable]]=None, path:PathOrStr='.', collate_fn:Callable='data_collate', no_check:bool=False) :: DataBunch

    No tests found for TextDataBunch. To contribute a test please refer to this guide and this discussion.

    General class to get a DataBunch for NLP. Subclassed by TextLMDataBunch and TextClasDataBunch.

    Warning: This class can only work directly if all the texts have the same length.

    Factory methods (TextDataBunch)

    All those classes have the following factory methods.

    from_folder[source][test]

    from_folder(path:PathOrStr, train:str='train', valid:str='valid', test:Optional[str]=None, classes:ArgStar=None, tokenizer:Tokenizer=None, vocab:Vocab=None, chunksize:int=10000, max_vocab:int=60000, min_freq:int=2, mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False, **kwargs)

    Tests found for from_folder:

    Some other tests where from_folder is used:

    • pytest -sv tests/test_text_data.py::test_filter_classes
    • pytest -sv tests/test_text_data.py::test_from_folder

    To run tests please refer to this guide.

    Create a TextDataBunch from text files in folders.

    The path is scanned for train, valid and optionally test folders. Text files in the train and valid folders should be placed in subdirectories according to their classes (not applicable for a language model). tokenizer will be used to parse those texts into tokens.

    You can pass a specific vocab for the numericalization step (if you are building a classifier from a language model you fine-tuned, for instance). kwargs will be split between the TextDataset function and the class initialization; there you can specify parameters such as max_vocab, chunksize, min_freq, n_labels (see the TextDataset documentation) or bs, bptt and pad_idx (see the LM data and classifier data sections).
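    As an illustration, a sketch assuming a hypothetical ImageNet-style folder layout with neg and pos subdirectories:

    from fastai.text import *

    # hypothetical layout: path/train/neg, path/train/pos, path/valid/neg, path/valid/pos
    path = Path('data/my_reviews')
    data_clas = TextClasDataBunch.from_folder(path, train='train', valid='valid',
                                              classes=['neg', 'pos'], max_vocab=60000, min_freq=2)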

    from_csv[source][test]

    from_csv(path:PathOrStr, csv_name, valid_pct:float=0.2, test:Optional[str]=None, tokenizer:Tokenizer=None, vocab:Vocab=None, classes:StrList=None, delimiter:str=None, header='infer', text_cols:IntsOrStrs=1, label_cols:IntsOrStrs=0, label_delim:str=None, chunksize:int=10000, max_vocab:int=60000, min_freq:int=2, mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False, **kwargs) → DataBunch

    Tests found for from_csv:

    • pytest -sv tests/test_text_data.py::test_from_csv_and_from_df

    To run tests please refer to this guide.

    Create a TextDataBunch from texts in csv files. kwargs are passed to the dataloader creation.

    This method will look for csv_name, and optionally a test csv file, in path. These will be opened with pd.read_csv, using delimiter. You can specify which are the text_cols and label_cols; by default a single label column is assumed to come before a single text column. If your csv has no header, you must specify these as indices. If you’re training a language model and don’t have labels, you must specify the text_cols. If there are several text_cols, the texts will be concatenated together with an optional field token. If there are several label_cols, the labels will be assumed to be one-hot encoded and classes will default to label_cols (you can ignore that argument for a language model). label_delim can be used to specify the separator between multiple labels in a column.

    You can pass a tokenizer to be used to parse the texts into tokens and/or a specific vocab for the numericalization step (if you are building a classifier from a language model you fine-tuned, for instance). Otherwise you can specify parameters such as max_vocab, min_freq, chunksize for the Tokenizer and Numericalizer (processors). Other parameters (e.g. bs, val_bs and num_workers) will be passed to LabelLists.databunch() (see the LM data and classifier data sections for more info).
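    For example, a sketch on the IMDB sample, whose csv has a label and a text column:

    from fastai.text import *

    path = untar_data(URLs.IMDB_SAMPLE)
    # 'label' and 'text' are the column names in the IMDB sample csv
    data = TextClasDataBunch.from_csv(path, 'texts.csv', text_cols='text', label_cols='label',
                                      valid_pct=0.2, bs=32)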

    from_df[source][test]

    from_df(path:PathOrStr, train_df:DataFrame, valid_df:DataFrame, test_df:OptDataFrame=None, tokenizer:Tokenizer=None, vocab:Vocab=None, classes:StrList=None, text_cols:IntsOrStrs=1, label_cols:IntsOrStrs=0, label_delim:str=None, chunksize:int=10000, max_vocab:int=60000, min_freq:int=2, mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False, **kwargs) → DataBunch

    Tests found for from_df:

    • pytest -sv tests/test_text_data.py::test_from_csv_and_from_df

    Some other tests where from_df is used:

    • pytest -sv tests/test_text_data.py::test_backwards_cls_databunch
    • pytest -sv tests/test_text_data.py::test_load_and_save_test
    • pytest -sv tests/test_text_data.py::test_regression
    • pytest -sv tests/test_text_data.py::test_should_load_backwards_lm_1
    • pytest -sv tests/test_text_data.py::test_should_load_backwards_lm_2

    To run tests please refer to this guide.

    Create a TextDataBunch from DataFrames. kwargs are passed to the dataloader creation.

    This method will use train_df, valid_df and optionally test_df to build the TextDataBunch in path. You can specify text_cols and label_cols; by default a single label column comes before a single text column. If you’re training a language model and don’t have labels, you must specify the text_cols. If there are several text_cols, the texts will be concatenated together with an optional field token. If there are several label_cols, the labels will be assumed to be one-hot encoded and classes will default to label_cols (you can ignore that argument for a language model).
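    A sketch using the IMDB sample, assuming its is_valid column is used to split the rows:

    from fastai.text import *

    path = untar_data(URLs.IMDB_SAMPLE)
    df = pd.read_csv(path/'texts.csv')
    # the IMDB sample has an 'is_valid' column we can use to split the rows
    train_df = df[df['is_valid'] == False]
    valid_df = df[df['is_valid'] == True]
    data = TextClasDataBunch.from_df(path, train_df, valid_df, text_cols='text', label_cols='label')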

    from_tokens[source][test]

    from_tokens(path:PathOrStr, trn_tok:Tokens, trn_lbls:Collection[Union[int, float]], val_tok:Tokens, val_lbls:Collection[Union[int, float]], vocab:Vocab=None, tst_tok:Tokens=None, classes:ArgStar=None, max_vocab:int=60000, min_freq:int=3, **kwargs) → DataBunch

    No tests found for from_tokens. To contribute a test please refer to this guide and this discussion.

    Create a DataBunch from tokens and labels. kwargs are passed to the dataloader creation.

    This function will create a DataBunch from trn_tok, trn_lbls, val_tok, val_lbls and maybe tst_tok.

    You can pass a specific vocab for the numericalization step (if you are building a classifier from a language model you fine-tuned, for instance). kwargs will be split between the TextDataset function and the class initialization; there you can specify parameters such as max_vocab, chunksize, min_freq, n_labels, tok_suff and lbl_suff (see the TextDataset documentation) or bs, bptt and pad_idx (see the LM data and classifier data sections).
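    A minimal sketch with made-up, already tokenized texts:

    from fastai.text import *

    # hypothetical pre-tokenized data: lists of token lists plus one label per text
    trn_tok = [['xxbos', 'great', 'movie'], ['xxbos', 'terrible', 'acting']]
    trn_lbls = [1, 0]
    val_tok = [['xxbos', 'decent', 'plot']]
    val_lbls = [1]
    data = TextClasDataBunch.from_tokens('.', trn_tok, trn_lbls, val_tok, val_lbls,
                                         max_vocab=60000, min_freq=1)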

    from_ids[source][test]

    Tests found for from_ids:

    • pytest -sv tests/test_text_data.py::test_from_ids_exports_classes
    • pytest -sv tests/test_text_data.py::test_from_ids_works_for_equally_length_sentences
    • pytest -sv tests/test_text_data.py::test_from_ids_works_for_variable_length_sentences

    To run tests please refer to this guide.

    Create a DataBunch from ids, labels and a vocab. kwargs are passed to the dataloader creation.

    Texts are already preprocessed into train_ids, train_lbls, valid_ids, valid_lbls and maybe test_ids. You can specify the corresponding classes if applicable. You must specify a path and the vocab so that the RNNLearner class can later infer the corresponding sizes in the model it will create. kwargs will be passed to the class initialization.
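    A minimal sketch with made-up ids and a tiny vocab (the keyword names follow the description above and are meant as an illustration, not the exact signature):

    from fastai.text import *

    # hypothetical numericalized data: each text is an array of token ids into vocab.itos
    vocab = Vocab(['xxunk', 'xxpad', 'xxbos', 'good', 'bad', 'movie'])
    train_ids = [np.array([2, 3, 5]), np.array([2, 4, 5])]
    train_lbls = [1, 0]
    valid_ids = [np.array([2, 3, 5])]
    valid_lbls = [1]
    data = TextClasDataBunch.from_ids(path='.', vocab=vocab,
                                      train_ids=train_ids, train_lbls=train_lbls,
                                      valid_ids=valid_ids, valid_lbls=valid_lbls,
                                      classes=[0, 1])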

    To avoid losing time preprocessing the text data more than once, you should save and load your TextDataBunch using DataBunch.save and load_data.
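    For example (a sketch assuming the data_clas and path objects from the examples on this page, and fastai 1.0.44 or later for load_data):

    # save the preprocessed data once...
    data_clas.save('data_clas.pkl')
    # ...and reload it almost instantly later
    data_clas = load_data(path, 'data_clas.pkl', bs=32)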

    load[test]

    load(path:PathOrStr, cache_name:PathOrStr='tmp', processor:PreProcessor=None, **kwargs)

    No tests found for load. To contribute a test please refer to this guide and this discussion.

    Load a TextDataBunch from path/cache_name. kwargs are passed to the dataloader creation.

    Warning: This method should only be used to load back a TextDataBunch saved in v1.0.43 or earlier; it is now deprecated.

    Example

    Untar the IMDB sample dataset if not already done:

    path = untar_data(URLs.IMDB_SAMPLE)
    path

    PosixPath('/home/ubuntu/.fastai/data/imdb_sample')

    Since it comes in the form of csv files, we will use the corresponding text_data method. Here is an overview of what your file should look like:

    pd.read_csv(path/'texts.csv').head()

    And here is a simple way of creating your DataBunch for language modelling or classification.
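    For instance, a sketch along the lines of the original example, with the classifier reusing the language model's vocabulary:

    data_lm = TextLMDataBunch.from_csv(path, 'texts.csv')
    data_clas = TextClasDataBunch.from_csv(path, 'texts.csv', vocab=data_lm.train_ds.vocab, bs=32)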

    Behind the scenes, the previous functions will create a training, validation and maybe test TextList that will be tokenized and numericalized (if needed) using PreProcessor.

    class Text[test]

    Text(ids, text) :: ItemBase

    No tests found for Text. To contribute a test please refer to this guide and this discussion.

    Basic item for text data in numericalized ids.

    class TextList[source][test]

    TextList(items:Iterator[T_co], vocab:Vocab=None, pad_idx:int=1, sep=' ', **kwargs) :: ItemList

    Tests found for TextList:

    Some other tests where TextList is used:

    • pytest -sv tests/test_text_data.py::test_filter_classes
    • pytest -sv tests/test_text_data.py::test_from_folder
    • pytest -sv tests/test_text_data.py::test_regression

    To run tests please refer to this guide.

    Basic ItemList for text data.

    vocab contains the correspondence between ids and tokens, pad_idx is the id used for padding. You can pass a custom processor in the kwargs to change the defaults for tokenization or numericalization. It should have the following form:

    processor = [TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor(max_vocab=30000)]

    To use sentencepiece instead of spaCy (this requires installing sentencepiece separately) you would pass

    processor = SPProcessor()

    See below for all the arguments those tokenizers can take.

    label_for_lm[test]

    label_for_lm(**kwargs)

    No tests found for label_for_lm. To contribute a test please refer to this guide and this discussion.

    A special labelling method for language models.
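    In the data block API, label_for_lm is the labelling step you call after splitting a TextList; a minimal sketch on the IMDB sample:

    from fastai.text import *

    path = untar_data(URLs.IMDB_SAMPLE)
    data_lm = (TextList.from_csv(path, 'texts.csv', cols='text')
               .split_by_rand_pct(0.2)
               .label_for_lm()
               .databunch(bs=64))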

    from_folder[test]

    from_folder(path:PathOrStr='.', extensions:StrList={'.txt'}, vocab:Vocab=None, processor:PreProcessor=None, **kwargs) → TextList

    Tests found for from_folder:

    Some other tests where from_folder is used:

    • pytest -sv tests/test_text_data.py::test_filter_classes
    • pytest -sv tests/test_text_data.py::test_from_folder

    To run tests please refer to this guide.

    Get the list of files in path that have a text suffix. recurse determines if we search subfolders.

    show_xys[test]

    show_xys(xs, ys, max_len:int=70)

    No tests found for show_xys. To contribute a test please refer to this guide and this discussion.

    Show the xs (inputs) and ys (targets). max_len is the maximum number of tokens displayed.

    show_xyzs[test]

    show_xyzs(xs, ys, zs, max_len:int=70)

    No tests found for show_xyzs. To contribute a test please refer to this guide and this discussion.

    Show xs (inputs), ys (targets) and zs (predictions). max_len is the maximum number of tokens displayed.

    class OpenFileProcessor[test]

    OpenFileProcessor(ds:Collection[T_co]=None) :: PreProcessor

    No tests found for OpenFileProcessor. To contribute a test please refer to this guide and this discussion.

    PreProcessor that opens the filenames and reads the texts.

    open_text[test]

    class TokenizeProcessor[test]

    TokenizeProcessor(ds:ItemList=None, tokenizer:Tokenizer=None, chunksize:int=10000, mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False) :: PreProcessor

    No tests found for TokenizeProcessor. To contribute a test please refer to this guide and this discussion.

    PreProcessor that tokenizes the texts in ds.

    tokenizer is applied to chunks of chunksize texts at a time. If mark_fields=True, a field token is added between the parts of each text (used when the texts come from several columns of a dataframe). Depending on include_bos and include_eos, BOS and EOS tokens will be automatically added at the beginning or the end of each text. See more about tokenizers in the transform documentation.

    class NumericalizeProcessor[source][test]

    NumericalizeProcessor(ds:ItemList=None, vocab:Vocab=None, max_vocab:int=60000, min_freq:int=3) :: PreProcessor

    No tests found for NumericalizeProcessor. To contribute a test please refer to this guide and this discussion.

    PreProcessor that numericalizes the tokens in ds.

    Uses vocab for this (if not None), otherwise creates one from the tokens with max_vocab and min_freq.

    class SPProcessor[source][test]

    SPProcessor(ds:ItemList=None, pre_rules:ListRules=None, post_rules:ListRules=None, vocab_sz:int=None, max_vocab_sz:int=30000, model_type:str='unigram', max_sentence_len:int=20480, lang='en', char_coverage=None, tmp_dir='tmp', mark_fields:bool=False, include_bos:bool=True, include_eos:bool=False, sp_model=None, sp_vocab=None, n_cpus:int=None, enc='utf8') :: PreProcessor

    No tests found for SPProcessor. To contribute a test please refer to this guide and this discussion.

    PreProcessor that tokenizes and numericalizes with sentencepiece.

    pre_rules and post_rules default to defaults.text_pre_rules and defaults.text_post_rules respectively. vocab_sz defaults to the minimum of max_vocab_sz and one quarter of the number of words in the training texts (rounded to the nearest multiple of 8). model_type is passed to sentencepiece, so it can be unigram (default), bpe, char, or word. Other sentencepiece parameters are lang, max_sentence_len and char_coverage (which defaults to 1.0 for European languages and 0.99 for others).

    mark_fields=True will add field tokens between text columns (if the texts come from several columns of a dataframe), and depending on include_bos and include_eos, BOS and EOS tokens will be automatically added at the beginning or the end of each text. The sentencepiece model used for tokenization will be saved in path/tmp_dir, where path is given by the data this processor is applied to.

    If you already have a trained tokenizer, you can pass along the model and vocab files with sp_model and sp_vocab.
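    A sketch of plugging SPProcessor into the data block API (sentencepiece must be installed; max_vocab_sz is just an illustrative value):

    from fastai.text import *

    path = untar_data(URLs.IMDB_SAMPLE)
    # train a sentencepiece unigram model capped at 10,000 pieces on the training texts
    data_lm = (TextList.from_csv(path, 'texts.csv', cols='text',
                                 processor=SPProcessor(max_vocab_sz=10000))
               .split_by_rand_pct(0.1)
               .label_for_lm()
               .databunch(bs=64))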

    Language Model data

    A language model is trained to guess what the next word is inside a flow of words. We don’t feed it the different texts separately but concatenate them all together in a big array. To create the batches, we split this array into bs chunks of continuous text. Note that fastai departs from the usual NLP convention of putting the sequence length first: the batch size is the first dimension and the sequence length is the second. Below you can read the chunks of text row by row.

    [Table: one batch of language model data from the IMDB sample, shown as a dataframe with one row per batch element (0-14) and one column per token position (0-16); each row is a chunk of contiguous, tokenized text containing special tokens such as xxunk and xxmaj.]

    Warning: If you are used to another convention, beware! fastai always uses batch as a first dimension, even in NLP.

    This is all done internally when we use TextLMDataBunch, by wrapping the dataset in the following pre-loader before calling a DataLoader.

    class LanguageModelPreLoader[test]

    LanguageModelPreLoader(dataset:LabelList, lengths:Collection[int]=None, bs:int=32, bptt:int=70, backwards:bool=False, shuffle:bool=False) :: Callback

    No tests found for LanguageModelPreLoader. To contribute a test please refer to this guide and this discussion.

    Transforms the tokens in dataset to a stream of contiguous batches for language modelling.

    LanguageModelPreLoader is an internal class used for training a language model. It takes the sentences passed as a jagged array of numericalised sentences in dataset and returns contiguous batches to the pytorch dataloader with batch size bs and a sequence length bptt.

    • lengths can be provided for the jagged training data, otherwise lengths is calculated internally
    • backwards=True will reverse the sentences
    • shuffle=True will shuffle the order of the sentences at the start of each epoch, except the first

    The following description is useful for understanding the implementation of LanguageModelPreLoader:

    • idx: instance of CircularIndex that indexes items while taking the following into account: 1) shuffle, 2) direction of indexing, 3) wrapping around to the head (reading forward) or tail (reading backwards) of the ragged array as needed in order to fill the last batch(es)

    • ro: index of the first rag of each row in the batch to be extracted. It is returned pointing at the next rag to be extracted

    • ri: Reading forward: index to the first token to be extracted in the current rag (ro). Reading backwards: one position after the last token to be extracted in the rag

    • overlap: overlap between batches is 1, because we only predict the next token
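    A quick way to see the result (a sketch assuming the data_lm object built in the Example section): each batch is bs rows of bptt contiguous tokens, and the target y is simply the input x shifted one token to the right within the same stream.

    x, y = next(iter(data_lm.train_dl))
    print(x.shape, y.shape)                # e.g. torch.Size([64, 70]) torch.Size([64, 70])
    print((x[:, 1:] == y[:, :-1]).all())   # the targets are the inputs shifted by one token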

    Classifier data

    When preparing the data for a classifier, we keep the different texts separate, which poses another challenge for the creation of batches: since they don’t all have the same length, we can’t easily collate them together in batches. To help with this we use two different techniques:

    • padding: each text is padded with the PAD token so that all the texts we picked end up with the same length
    • sorting the texts (ish): to avoid having together a very long text with a very short one (which would then have a lot of PAD tokens), we regroup the texts by order of length. For the training set, we still add some randomness to avoid showing the same batches at every step of the training.

    Here is an example of batch with padding (the padding index is 1, and the padding is applied before the sentences start).

    path = untar_data(URLs.IMDB_SAMPLE)
    data = TextClasDataBunch.from_csv(path, 'texts.csv')
    iter_dl = iter(data.train_dl)
    _ = next(iter_dl)
    x,y = next(iter_dl)
    x[-10:,:20]

    tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
           device='cuda:0')

    This is all done internally when we use TextClasDataBunch, by using the following classes:

    class SortSampler[test]

    SortSampler(data_source:NPArrayList, key:KeyFunc) :: Sampler

    Tests found for SortSampler:

    • pytest -sv tests/test_text_data.py::test_sampler
    • pytest -sv tests/test_text_data.py::test_sort_sampler

    To run tests please refer to this guide.

    Go through the text data by order of length.

    This pytorch Sampler is used for the validation and (if applicable) the test set.

    class SortishSampler[test]

    SortishSampler(data_source:NPArrayList, key:KeyFunc, bs:int) :: Sampler

    Tests found for SortishSampler:

    To run tests please refer to this guide.

    Go through the text data by order of length with a bit of randomness.

    This pytorch Sampler is generally used for the training set.
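    To give an idea of how TextClasDataBunch.create wires these pieces together, here is a rough sketch (not the exact library code) of a training DataLoader built with SortishSampler and pad_collate, assuming train_ds is a labelled text dataset:

    from fastai.text import *
    from functools import partial

    # sort (roughly) by text length so each batch needs little padding
    train_sampler = SortishSampler(train_ds.x, key=lambda i: len(train_ds[i][0].data), bs=32)
    train_dl = DataLoader(train_ds, batch_size=32, sampler=train_sampler,
                          collate_fn=partial(pad_collate, pad_idx=1, pad_first=True))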

    pad_collate[source][test]

    pad_collate(samples:BatchSamples, pad_idx:int=1, pad_first:bool=True, backwards:bool=False) → Tuple[LongTensor, LongTensor]

    No tests found for pad_collate. To contribute a test please refer to this guide and this discussion.

    Function that collates samples and adds padding. Flips token order if needed.

    This will collate the samples in batches while adding padding with pad_idx. If pad_first=True, padding is applied at the beginning (before the sentence starts), otherwise it’s applied at the end.
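    A small sketch of what pad_collate does to two numericalized samples of different lengths (the ids and labels are made up):

    from fastai.text import *

    samples = [(tensor([2, 10, 11, 12, 13]), 0),   # 5 tokens
               (tensor([2, 20, 21]), 1)]           # 3 tokens
    x, y = pad_collate(samples, pad_idx=1, pad_first=True)
    print(x)   # the shorter sequence is padded with 1s at the front
    print(y)   # tensor([0, 1])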

