Transforms and DataBlocks.
TODOs:
- verify that CLM works as well, and maybe rename masking_func, as it would be used for more than masking
- add permutation LM
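The examples below assume the usual imports are already in place. A minimal sketch of what that would look like (the commented line is an assumption: adjust it to whichever module of this library exports TransformersTextBlock and TransformersLMBlock):
import pandas as pd
import datasets
from transformers import AutoTokenizer
from fastai.text.all import *  # DataBlock, CategoryBlock, ItemGetter, ColSplitter, RandomSplitter, untar_data, URLs
# from <this library's data module> import TransformersTextBlock, TransformersLMBlock  # assumed, not shown in this section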
path = untar_data(URLs.IMDB_SAMPLE)
texts = pd.read_csv(path/'texts.csv')
model_name = 'distilbert-base-uncased'
max_len = 128
bs = 8
val_bs = 16
tokenizer = AutoTokenizer.from_pretrained(model_name)
dblock = DataBlock(blocks=[TransformersTextBlock(tokenizer=tokenizer),
                           CategoryBlock()],
                   get_x=ItemGetter('text'),
                   get_y=ItemGetter('label'),
                   splitter=ColSplitter())
dls = dblock.dataloaders(texts, bs=bs, val_bs=val_bs)
dls.show_batch(max_n=4)
HuggingFace models can compute the loss themselves. To use the loss computed by the model, pass with_labels=True to the text block. The show_batch output looks the same, but the labels are now moved into the dict object, which is the first element of a batch; a quick check of this follows the next example.
dblock = DataBlock(blocks=[TransformersTextBlock(tokenizer=tokenizer, with_labels=True),
                           CategoryBlock()],
                   get_x=ItemGetter('text'),
                   get_y=ItemGetter('label'),
                   splitter=ColSplitter())
dls = dblock.dataloaders(texts, bs=8)
dls.show_batch(max_n=4)
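To see where the labels end up, peek at a raw batch. A quick check, assuming the first batch element behaves like a dict of tensors:
b = dls.one_batch()
print(b[0].keys())  # with with_labels=True, 'labels' is expected here alongside 'input_ids' and 'attention_mask'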
path = untar_data(URLs.IMDB_SAMPLE)
model_name = 'distilbert-base-uncased'
max_length = 128
bs = 8
val_bs = 16
tokenizer = AutoTokenizer.from_pretrained(model_name)
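The map call below relies on a tokenize helper that isn't defined in this section. A minimal sketch, assuming the text lives in the csv's 'text' column and chunking is left to group_texts:
def tokenize(batch):
    # convert raw text to token ids; no truncation here, since group_texts
    # below splits everything into fixed-size blocks
    return tokenizer(batch['text'])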
ds = datasets.Dataset.from_csv((path/'texts.csv').as_posix())
ds = ds.map(tokenize, remove_columns=ds.column_names)
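group_texts is the other helper assumed here. A sketch following the standard Hugging Face language-modelling preparation: concatenate the tokenized examples and split them into chunks of block_size (which is set just below):
def group_texts(examples):
    # concatenate all sequences in the batch, then split them into block_size chunks
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # drop the tail that doesn't fill a whole block
    total_length = (total_length // block_size) * block_size
    return {k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated.items()}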
block_size = max_length
lm_ds = ds.map(group_texts, batched=True, batch_size=1000)
dblock = DataBlock(blocks=[TransformersLMBlock(tokenizer=tokenizer)],
                   splitter=RandomSplitter())
dls = dblock.dataloaders(lm_ds, bs=bs, val_bs=val_bs)
dls.show_batch(max_n=4)
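As a final sanity check, the shapes coming out of the LM dataloaders can be inspected directly; as above, this assumes the first batch element is a dict of tensors:
b = dls.one_batch()
print({k: v.shape for k, v in b[0].items()})  # the sequence dimension is expected to equal block_size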