from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, MarianConfig, AutoTokenizer, AutoConfig
from fastai.text.all import *
from fastai.callback.wandb import *

from fasthugs.learner import TransLearner
from fasthugs.data import TransformersTextBlock, TextGetter

Setup

Let's define the main settings for the run in one place:

model_name = "Helsinki-NLP/opus-mt-fr-en"

max_len = 512
bs = 8
val_bs = bs*2

lr = 2e-5
df = pd.read_csv('./questions_easy.csv')
df.head()
  | en | fr
0 | What is light ? | Qu’est-ce que la lumière?
1 | Who are we? | Où sommes-nous?
2 | Where did we come from? | D'où venons-nous?
3 | What would we do without it? | Que ferions-nous sans elle ?
4 | What is the absolute location (latitude and longitude) of Badger, Newfoundland and Labrador? | Quelle sont les coordonnées (latitude et longitude) de Badger, à Terre-Neuve-etLabrador?

Dataloaders

tokenizer = AutoTokenizer.from_pretrained(model_name)
@ItemTransform
def untuple1(x):
    return (*x[0], )
dblock = DataBlock(
    blocks = [TransformersTextBlock(tokenizer=tokenizer, do_targets=True, with_labels=True)],
    get_x=TextGetter('fr', 'en'),
    item_tfms=untuple1,
    splitter=RandomSplitter())
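The `untuple1` transform above unpacks the first element of the item it receives into a flat tuple. A minimal plain-Python sketch of that behavior, without the fastai `@ItemTransform` decorator, is shown here. The shape of the wrapped item (a one-element tuple holding a pair of token-id lists) is an assumption made for illustration; the source does not show what `TransformersTextBlock` actually yields.

```python
def untuple1_plain(x):
    # Same body as `untuple1`, minus the fastai @ItemTransform decorator.
    return (*x[0], )

# Hypothetical item: a 1-tuple wrapping a (source ids, target ids) pair.
wrapped = (([101, 7, 102], [101, 9, 102]),)

flat = untuple1_plain(wrapped)
print(flat)
```

The outer one-element tuple is dropped, so downstream code sees the inner fields directly.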
%%time
bs = 16  # overrides the bs = 8 set in Setup; the validation batch size below is bs*2 = 32
dls = dblock.dataloaders(df, bs=bs, val_bs=bs*2, shuffle=True)
CPU times: user 20.8 s, sys: 1.19 s, total: 21.9 s
Wall time: 29.4 s
dls.show_batch(max_n=4)
text text_
0 ▁Dans un tel▁cas,▁où il sagit d apprécier si un▁nom commercial a un▁fondement▁juridique▁antérieur à▁celui d une▁marque aux▁fins de larticle 16,▁paragraphe 1,▁troisième phrase, de laccord▁ADPIC,▁peut-on▁considérer▁comme▁décisif: i) le▁fait que, dans lÉtat▁où la▁marque est▁enregistrée et sa protection▁réclamée, le▁nom commercial▁ait▁été, du▁moins dans une▁certaine▁mesure,▁connu dans les▁milieux▁professionnels▁int éressés de lÉtat▁concerné▁avant la date à▁laquelle lenregistrement de la▁marque y a▁été▁demandé; ou que, dans les relations▁commerciales▁intéressant lÉtat▁où la▁marque est enregistr ée et sa protection▁réclamée, le▁nom commercial▁ait▁été▁utilisé▁avant la date à▁laquelle lenregistrement de la▁marque a▁été▁demandé dans▁cet▁État; ou▁tout▁autre▁facteur qui▁permette de▁déterminer si le▁nom commercial▁doit▁être▁considéré▁comme un droit▁antérieur▁existant au▁sens de larticle 16,▁paragraphe 1,▁troisième phrase, de laccord▁ADPIC? When assessing, in such a case, whether a trade name has a legal basis prior to a trade mark for the purposes of the third sentence of Article 16(1) of the TRIPs Agreement, may it thus be considered as decisive: (i) whether the trade name was well known at least to some extent among the relevant trade circles in the State in which the trade mark is registered and in which protection is sought for it, before the point in time at which registration of the trade mark was applied for in the State in question; or whether the trade name was used in commerce directed to the State in which the trade mark is registered and in which protection is sought for it, before the point in time at which registration of the trade mark was applied for in the State in question; or what other factor may decide whether the trade name is to be regarded as an existing prior right within the meaning of the third sentence of Article 16(1) of the TRIPs Agreement?
1 ▁Quelles▁sont les▁statistiques▁officielles▁disponibles sur les▁délits en▁matière de▁propriété▁intellectuelle (en particulier,▁quel est le volume de▁produits de▁contrefaçon▁détecté par les▁douanes,▁quels types de▁délits▁sont▁signalés à la police, comment▁ces▁derniers▁sont-ils▁classés,▁quelle est la proportion de▁délits qui font lobjet de▁poursuites par la police et▁sont▁signalés à dautres institutions,▁combien de▁procédures▁judiciaires▁sont▁lancées et▁quels▁sont les▁résultats▁obtenus)? What official statistics are available on IP crime (ie how much infringing product is detected by Customs, what types and level of IP crime are reported to police, how is this classified, what proportion is acted on by police and referred to other agencies, how many prosecutions are undertaken, and what are the results)?
2 Pour▁tirer une conclusion sur ce▁problème, il▁convient de▁répondre aux questions▁suivantes : -▁Alors▁qu▁théorie,▁chaque▁autorité▁nationale de concurrence/régulation▁serait en▁habilitée à▁traiter des▁problèmes de▁tarification transfrontalière à condition▁qu▁ils▁concernent les▁importations et,▁éventuellement, les▁exportations, est-il acceptable que 15▁processus▁décisionnels▁potentiellement▁conflictuels▁traitent au▁même moment▁cette question? Whilst, in theory, each national regulatory/competition authority might have jurisdiction to deal with cross-border tarification issues insofar as they concern imports and, possibly exports, is it acceptable to have 15 potentially conflicting decision-making processes contemporaneously treating this issue?
3 L'article 5 du▁règlement a-t-il▁été▁amendé d'une▁manière non▁conforme aux▁exigences▁procédurales▁visées par l'article 251 CE,▁lors de l'examen du▁projet de▁texte par le▁comité de consultation et, dans l'affirmative, l'article 5 du▁règlement est-il▁invalide et, dans l'affirmative,▁cette▁circonstance (combinée à▁tout▁autre▁élément pertinent)▁affecte-t-elle la▁validité du▁règlement dans son ensemble? Whether the amendment of Article 5 of the Regulation during consideration of the draft text by the Conciliation Committee was done in a manner that is inconsistent with the procedural requirements provided for in Article 251 EC and, if so, whether Article 5 of the Regulation is invalid and, if so, whether this (in conjunction with any other relevant factors) affects the validity of the Regulation as a whole?

Tracking

This section sets up Weights & Biases (W&B) tracking for the run. Details on the tracking setup and on the leaderboard are still to be established.

import wandb

# Assumption: `ds_name` is not defined anywhere above; this value is taken from the CSV file name.
ds_name = 'questions_easy'

WANDB_NAME = f'{ds_name}-{model_name}'
GROUP = f'{ds_name}-{model_name}-simple-{lr:.0e}'
NOTES = f'finetuning {model_name} with RAdam lr={lr:.0e}'
CONFIG = {}
TAGS = [model_name, ds_name, 'radam']
wandb.init(reinit=True, project="fasthugs", entity="fastai_community",
           name=WANDB_NAME, group=GROUP, notes=NOTES, tags=TAGS, config=CONFIG);
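`CONFIG` is left empty in the cell above. Below is a minimal sketch of filling it from the values defined in Setup, so that the hyperparameters are stored with the run. The dictionary keys are illustrative choices, not names required by W&B; `wandb.init(config=...)` accepts a plain dict.

```python
# Settings copied from the Setup section of this notebook.
model_name = "Helsinki-NLP/opus-mt-fr-en"
max_len = 512
bs = 8
lr = 2e-5

# Illustrative key names (an assumption, not part of the original notebook).
CONFIG = {
    "model_name": model_name,
    "max_len": max_len,
    "bs": bs,
    "val_bs": bs * 2,
    "lr": lr,
}

# The same format spec that GROUP and NOTES use for the learning rate.
lr_tag = f"{lr:.0e}"
print(CONFIG)
print(lr_tag)
```

If a setting is changed later in the notebook (as `bs` is in the Dataloaders section), the dictionary should be built after that change so the logged values match what was actually used.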

Training

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
learn = TransLearner(dls, model, metrics=CorpusBLEUMetric(), loss_func=noop)
learn.validate()
(#2) [2.0951087474823,0.31762597662887515]
learn.fit_one_cycle(2, 1e-4)
epoch  train_loss  valid_loss  corpus_bleu  time
0      1.231185    1.039507    0.655427     06:56
1      1.018177    0.993561    0.674918     06:57
learn.validate()
(#2) [0.993560791015625,0.6749184356623896]
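The second value returned by `learn.validate()` is the corpus BLEU score reported by `CorpusBLEUMetric`. As a rough illustration of what a corpus-level BLEU measures, here is a simplified plain-Python sketch: clipped n-gram matches for n = 1..4 are pooled over all sentence pairs, the four precisions are combined by a geometric mean, and a brevity penalty is applied when the predictions are shorter than the targets. This is not fastai's implementation and is not expected to reproduce the numbers above; the function names `ngrams` and `corpus_bleu` are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token sequence, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(preds, targs, max_n=4):
    # Pool clipped n-gram matches and n-gram counts over the whole corpus.
    corrects = [0] * max_n
    counts = [0] * max_n
    pred_len = targ_len = 0
    for pred, targ in zip(preds, targs):
        pred_len += len(pred)
        targ_len += len(targ)
        for n in range(1, max_n + 1):
            p_cnt = Counter(ngrams(pred, n))
            t_cnt = Counter(ngrams(targ, n))
            corrects[n - 1] += sum(min(c, t_cnt[g]) for g, c in p_cnt.items())
            counts[n - 1] += sum(p_cnt.values())
    # Without at least one match at every order the geometric mean is zero.
    if min(counts) == 0 or min(corrects) == 0:
        return 0.0
    precisions = [c / t for c, t in zip(corrects, counts)]
    # Brevity penalty: only applied when predictions are shorter overall.
    bp = math.exp(1 - targ_len / pred_len) if pred_len < targ_len else 1.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

pred = "the cat sat on the mat".split()
targ = "the cat sat on a mat".split()
score = corpus_bleu([pred], [targ])
print(round(score, 4))
```

Because the counts are pooled before the precisions are computed, the score is a property of the whole validation set rather than an average of per-sentence scores.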
df.iloc[10,1]
'Quelle est la province ayant la plus forte densité de population ?'
inp = tokenizer(df.iloc[10, 1], return_tensors='pt')
pred = learn.generate(inp['input_ids'].to(dls.device))
tokenizer.decode(pred[0].cpu(), skip_special_tokens=True)
'Which province has the highest population density?'