from transformers import AutoTokenizer, AutoModelForMaskedLM
from datasets import load_dataset, concatenate_datasets

from fastai.text.all import DataBlock, IndexSplitter, noop, perplexity
from fasthugs.learner import TransLearner
from fasthugs.data import TransformersLMBlock

Setup

model_name = 'distilroberta-base'
# data
max_length = 128
bs = 16
val_bs = bs*4
# training
lr = 3e-5

Data preprocessing

In this example notebook we use HuggingFace datasets for preprocessing (as shown in the example notebook here).

ds_name = 'imdb'
dataset = load_dataset(ds_name)
Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/4ea52f2e58a08dbc12c2bd52d0d92b30b88c00230b4522801b3636782f625c5b)
dataset = dataset['train'].select(range(2000))
dataset.column_names
['label', 'text']
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize(batch):
    # Keep the special-tokens mask so the collator won't mask <s>, </s>, etc.
    return tokenizer(batch['text'], return_attention_mask=True, return_special_tokens_mask=True, verbose=False)
dataset.info.task_templates = []  # clear task templates so the label column can be removed without errors
dataset = dataset.map(tokenize, batched=True, batch_size=100, remove_columns=dataset.column_names, num_proc=4)
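
A quick sanity check on the tokenized dataset (illustrative; the exact length depends on the review, and the fields are the ones requested in the tokenizer call above):

# Peek at one tokenized review: nothing is truncated or chunked yet
sample = dataset[0]
sorted(sample.keys()), len(sample['input_ids'])
# (['attention_mask', 'input_ids', 'special_tokens_mask'], <varies per review>)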



block_size = max_length

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad instead if the model supported it.
    # You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_dataset = dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
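
To make the chunking concrete, here is a toy trace of the same logic with a hypothetical block_size of 4 (the real run above uses block_size = 128):

def group_texts_demo(examples, block_size=4):
    # Same logic as group_texts above, shrunk for readability
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated.items()}

group_texts_demo({'input_ids': [[1, 2, 3], [4, 5, 6, 7, 8, 9]]})
# {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]} -- the trailing 9 is dropped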



Training

import random
N = len(lm_dataset)
idx = list(range(N))
random.shuffle(idx)
split = int(N*0.9)  # hold out a random 10% of the chunks for validation
train_idx = idx[:split]
valid_idx = idx[split:]
dblock = DataBlock(blocks=[TransformersLMBlock(tokenizer=tokenizer)],
                   splitter=IndexSplitter(valid_idx))
dls = dblock.dataloaders(lm_dataset, bs=bs, val_bs=val_bs, num_workers=4)
dls.show_batch()
text
0 daughter. Richard Conte of "The Godfather"<mask> a Sicilian crime boss who wants to bury the hatchet with the Delon character, but<mask> rest of his hard-n<mask> associates want the hit-<mask> dead. Like most crime thrillers in the 1960s and<mask><mask>, "Big Guns" subscribes to the cinematic morality<mask> crime does not pay. Interestingly, the one man who has nothing to do with the murder of the wife and son<mask><mask> hero survives while another<mask>rays the hero<mask> extreme prejudice.<mask>ari does not waste a second in this 90-minute shoot<mask>em up. Apart from the
1 This is by far the best triple Madness match i have<mask> seen. It had close falls<mask> plenty of finishers, stolen fin<mask>, raw energy, intensity and fast pace. No one could<mask> who would<mask> out<mask> this one. If your going to buy this<mask><mask><mask>iphany it strictly for this match. (ending<mask> watch for yourself!)<br /><br stabilizedOverall<mask> was a solid<mask>V<mask> plenty<mask> extra goodies<mask> keep you watching again and again. Although this is hard to<mask> (<mask> had to pay a little more than usual for this<mask>) it<mask> definitely worth your money.</s><s>This<mask>
2 .<mask> Eye really shows his skills at storytelling.<br /><br />Red Eye also works<mask> because of its young<mask> talented cast. Rachel McAdams gives a very engaging performance and<mask> character<mask> hard to hate<mask> You may even end up<mask> for her out loud. Cillian<mask> gives a very creepy and<mask> performance as theottest. The way he acts charming at first but<mask> turns psycho is especially<mask>. The supporting actors are also pretty good<mask> include Brain Cox and Jay<mask> Mays.<br /><br />The<mask> is<mask> very<mask> and it has this overall creepy vibe to it. The setting works well since there
3 learns that Tony is actually part of two families<mask> in one family<mask> is a loving father yet not-so-perfect<mask>husband, and in the other family he is a ruthless wiseguy<mask> After analysis, Dr. Melfi concludes that Tony's problems actually derive from his mother L<mask>, who's<mask> to have borderline<mask>personality disorder. Gandolfini is rightfully praised<mask> the main character; yet Bracco and March<mask> aren't nearly as recognized for their equally and talented performances as the psychiatrist and mother, respectively. Falco, Imperioli and De<mask>te<mask> are acclaimed for their brilliant supporting roles<mask> Van Zand
4 past, we have always simply branded killers "psychopaths" and assumed that<mask> they were biologically wired for disaster or had media influence, but as Zero Day shows sometimes the motives are deeper than that, and we can never<mask> understand why tragedies such<mask><mask> shootings<mask> until we have seen it from<mask> perspective of the killers.</s><s>I rented Zero Day from the local video store last<mask>.<mask> had never heard of the film and I had my reservations about<mask>.<mask> from looking at the box I knew the film<mask> an Indie film<mask> therefore the quality was<mask> to be less than a mainstream film. <<mask><mask>><br
5 telling.<br /><br />The other direct<mask><mask> the Pulp Magazine. The inexpensive<mask> prose story publications that carried a great deal of<mask> of the same deposits characters<mask> on going, though not necessarily serialized, tales. The<mask> medium had<mask> around<mask><mask> decades and introduced us to Edgar Rice Borrough's TAR<mask>AN and Johnston McCulley's ZORRO<mask> The 1930's brought<mask> a bumper crop as feature characters<mask> THE SHADOW, THE AVENGER, G8's BATTLE<mask>ES and THE SPIDER,MASTER of MEN all found their way to the news stands<mask> among many
6 his obsessive compulsive cleaning sprees and<mask> phobias and sends her a suicide telegram.Sheipient Oscar and<mask> him know what happened.Felix turns up at Oscar's during his weekly poker game with<mask> friends Vinnie(John Fielder)<mask> the policeman(<mask>bert Edelman)Roy(David Sheiner)and Speed(Larry Haines).After some side splitting<mask>ics<mask>'s agreed Felix will<mask> gib Oscar.<br /><br />The rest of the film centres on how theseULT are such completely different characters.As well as looking at if Oscar<mask><mask><mask>'s truly weird and unique habits and
7 />No one<mask> naturally disturbed but ultimately intrigued about the nightmarish<mask> of Pete being abducted and sexually abused for years until he was<mask> rescued<mask> a<mask> named Donna (Collette giving an excellent performance) who has adopted the boy but her correspondence with No<mask> reveals that<mask> is<mask> from AIDS.<mask> No<mask> wants<mask> meet the fans but is suddenly in doubt to their possibly devious ulterior motives when the seed is planted by<mask> estranged lover<mask> (Cannavale) whose sudden departure from their New York City apartment<mask> No one in an emotional tailspin that has only now grown into a temp<mask> in a<mask>ac
8 <mask> even though the script does leave a lot of room for<mask>. Most laughs<mask> from the difference between<mask>vira and the people of good morals, but there are a couple of good visual<mask>ags as well.<mask> all direction is<mask>, but it never<mask> to<mask> anything more than that. In all, a good, intentionally campy,<mask>. If you like this<mask><mask> thing, that is.</s><s>I found<mask> episode to be one of funniest<mask>'ve seen in a long time<mask><mask><mask> park creators have done the best spoof<mask><mask> Romero<mask> I have ever seen.They have truly touched on Romero
b = dls.one_batch()
b[0]['input_ids'], b[0]['labels']
(tensor([[ 3809, 48709,   100,  ...,  1250,     8, 50264],
         [   31, 50264, 27499,  ...,    10,  9209, 15259],
         [    4,   653,    38,  ...,     7,   224,    14],
         ...,
         [    5, 46646, 31794,  ...,     5,   527,     8],
         [50264, 25477,     7,  ...,    59,   657, 50264],
         [   85, 50264, 50264,  ...,    65,     9,    39]]),
 tensor([[-100, -100, -100,  ..., -100, -100, 5905],
         [-100,    5, -100,  ..., -100, -100, -100],
         [-100, -100, -100,  ..., -100,  224, -100],
         ...,
         [-100, -100, -100,  ..., -100, -100, -100],
         [1256, -100, -100,  ..., -100, -100,    6],
         [-100,   18,   99,  ..., -100, -100, -100]]))

The labels are constructed by DataCollatorForLanguageModeling: masked positions keep the original token id, while every other position is set to -100 so the loss ignores it. The model computes the masked-LM loss internally, and that loss is what drives training.
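
For reference, the same masking can be reproduced standalone with transformers (a minimal sketch assuming the default 15% masking rate; of the selected tokens, 80% become <mask>, 10% a random token, and 10% stay unchanged):

from transformers import DataCollatorForLanguageModeling

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([{'input_ids': lm_dataset[i]['input_ids']} for i in range(2)])
batch['input_ids'].shape, batch['labels'].shape  # both (2, 128) with block_size = 128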

model = AutoModelForMaskedLM.from_pretrained(model_name)
learn = TransLearner(dls, model, loss_func=noop, metrics=perplexity).to_fp16()  # loss_func=noop: the model already returns its own loss

As masking is done randomly on the fly, the validation score may vary from run to run.
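
If you need a repeatable score, one option is to seed the RNGs first with fastai's set_seed (a sketch; exact reproducibility may still depend on your CUDA settings):

from fastai.text.all import set_seed

set_seed(42, reproducible=True)  # seeds python, numpy and torch RNGs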

learn.validate()
(#2) [2.9136292934417725,18.423542022705078]
learn.fit_flat_cos(2, 3e-5)
epoch  train_loss  valid_loss  perplexity  time
0      2.392086    2.245003    9.440444    01:36
1      2.307463    2.124191    8.366127    01:39
learn.validate()
(#2) [2.1733407974243164,8.787592887878418]