Work in progress.

class LossCallback[source]

LossCallback(after_create=None, before_fit=None, before_epoch=None, before_train=None, before_batch=None, after_pred=None, after_loss=None, before_backward=None, before_step=None, after_cancel_step=None, after_step=None, after_cancel_batch=None, after_batch=None, after_cancel_train=None, after_train=None, before_validate=None, after_cancel_validate=None, after_validate=None, after_cancel_epoch=None, after_epoch=None, after_cancel_fit=None, after_fit=None) :: Callback

Base class for loss-computing callbacks

ALUM

Adversarial training for large neural language models (ALUM), as presented in https://arxiv.org/abs/2004.08994.

hook_out[source]

hook_out(m, inp, out)
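hook_out has the standard PyTorch forward-hook signature (m, inp, out). In ALUM-style training such a hook is typically registered on the embedding layer so its output can be captured and later perturbed. A minimal sketch of that pattern, with illustrative names rather than the library's exact implementation:

import torch
from torch import nn

stored = {}

def hook_out(m, inp, out):
    # keep a reference to the module's forward output, e.g. the embedding tensor
    stored['embed'] = out

emb = nn.Embedding(100, 16)
handle = emb.register_forward_hook(hook_out)
_ = emb(torch.randint(0, 100, (2, 5)))   # stored['embed'] now holds the embeddings
handle.remove()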

KL[source]

KL(inp, targ, reduction='sum')

SymmetrizedKL[source]

SymmetrizedKL(inp, targ, reduction='sum')
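Both criteria compare two sets of logits as probability distributions; the symmetrized version is the smoothness-inducing loss used by ALUM. A sketch of what these signatures plausibly compute, assuming inp and targ are raw logits and the reference side is detached (the library's exact numerics may differ):

import torch
import torch.nn.functional as F

def kl(inp, targ, reduction='sum'):
    # KL divergence between softmax(targ) and softmax(inp); targ acts as the fixed reference
    return F.kl_div(F.log_softmax(inp, dim=-1),
                    F.softmax(targ.detach(), dim=-1),
                    reduction=reduction)

def symmetrized_kl(inp, targ, reduction='sum'):
    # sum of both directions; gradients flow through the non-detached argument of each term
    return kl(inp, targ, reduction=reduction) + kl(targ, inp, reduction=reduction)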

adv_project[source]

adv_project(grad, norm_type='inf', eps=1e-06)
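adv_project turns the gradient with respect to the perturbation into a bounded update direction before the ascent step. One common way to implement such a projection, sketched here under the assumption that eps only guards against division by zero:

import torch

def project(grad, norm_type='inf', eps=1e-6):
    # normalize the gradient so the resulting direction has roughly unit norm
    if norm_type == 'l2':
        return grad / (torch.norm(grad, dim=-1, keepdim=True) + eps)
    if norm_type == 'l1':
        return grad.sign()
    # default: infinity norm
    return grad / (grad.abs().max(dim=-1, keepdim=True)[0] + eps)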

Algorithm

Input: $T$: total number of iterations, $\mathcal X$: the dataset, $f(x; \theta)$: model parameterized by $\theta$, $\sigma^2$: variance for random initialization of perturbation $\delta$, $\epsilon$: perturbation bound, $K$: number of iterations for updating $\delta$, $\eta$: learning rate for updating $\delta$, $\tau$: global learning rate, $\alpha$: adversarial loss weight, $\Pi$: projection operation.

01: for $t = 1,...,T$ do
02: $\quad$ for $(x,y) \in \mathcal X$ do
03: $\quad \quad$ $\delta \sim \mathcal{N} (0, \sigma^2 I)$
04: $\quad \quad$ for $m = 1,...,K$ do
05: $\quad \quad \quad$ $g_{adv} \leftarrow \nabla_\delta \ell(f(x;\theta), f(x+\delta; \theta))$
06: $\quad \quad \quad$ $\delta \leftarrow \Pi_{\|\delta\|_\infty \le \epsilon}(\delta + \eta g_{adv})$
07: $\quad \quad$ end for
08: $\quad \quad$ $g_\theta \leftarrow \nabla_\theta \ell(f(x;\theta), y) + \alpha \nabla_\theta \ell(f(x;\theta), f(x+\delta;\theta))$
09: $\quad \quad$ $\theta \leftarrow \theta - \tau g_\theta$
10: $\quad$ end for
11: end for

Output: $\theta$
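The inner loop (lines 03-08) maps almost directly onto PyTorch. The sketch below is illustrative only: a generic model stands in for f, the plain gradient drives the ascent step as in line 06, and the projection is a clamp onto the infinity-norm ball; compute_adversarial_loss below implements the same idea for HF models.

import torch
import torch.nn.functional as F

def sym_kl(p, q):
    # symmetrized KL between two sets of logits, reference side detached
    return (F.kl_div(F.log_softmax(p, -1), F.softmax(q.detach(), -1), reduction='sum') +
            F.kl_div(F.log_softmax(q, -1), F.softmax(p.detach(), -1), reduction='sum'))

def alum_adversarial_loss(model, x, sigma=1e-5, eta=1e-3, eps=1e-4, K=1):
    logits = model(x)
    delta = torch.randn_like(x) * sigma                           # line 03: delta ~ N(0, sigma^2 I)
    for _ in range(K):                                            # line 04
        delta.requires_grad_(True)
        l = F.kl_div(F.log_softmax(model(x + delta), -1),
                     F.softmax(logits.detach(), -1), reduction='sum')
        g_adv, = torch.autograd.grad(l, delta)                    # line 05: gradient w.r.t. delta
        delta = (delta + eta * g_adv).clamp(-eps, eps).detach()   # line 06: ascend, project onto the inf-ball
    return sym_kl(model(x + delta), logits)                       # adversarial term of line 08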

compute_adversarial_loss[source]

compute_adversarial_loss(model:Module, embed:Tensor, logits:Tensor, special_tokens_mask=None, token_type_mask=None, noise_var:float=1e-05, step_size:float=0.001, k:int=1, noise_gamma:float=1e-06, criterion=SymmetrizedKL)

Computes the adversarial loss on an iteratively refined perturbation

class ALUMCallback[source]

ALUMCallback(m:Module, alpha:float=1.0, start_epoch:int=0, criterion=None, mask_special_tokens:bool=False, one_token_type=False, special_tokens_mask=None, token_type_mask=None, noise_var:float=1e-05, step_size:float=0.001, k:int=1, noise_gamma:float=1e-06) :: LossCallback

ALUM callback for HuggingFace pretrained models

model = nn.Sequential(
    nn.Linear(1,10, bias=False),
    nn.Linear(10,1, bias=False)
)
learn = synth_learner(model=model, cbs=ALUMCallback(model[0]))
learn.fit(2, 1e-3)
epoch train_loss valid_loss time
0 11.263165 10.850156 00:00
1 10.396017 9.291542 00:00
Starting virtual adversarial training at epoch 0
Your model is probably not supported, make sure model interface is compatible with HF pretrained models

update_ema_model[source]

update_ema_model(ema_model:Module, model:Module, mom:float=0.99)

Updates ema_model parameters with online model parameters using momentum mom
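A sketch of what this momentum update usually amounts to, assuming mom weights the EMA model's previous parameters (consistent with the 0.99 default); whether buffers are also synchronized is an implementation detail not shown here:

import torch

@torch.no_grad()
def ema_update(ema_model, model, mom=0.99):
    # ema_param <- mom * ema_param + (1 - mom) * online_param
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(mom).add_(p, alpha=1 - mom)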

Algorithm

Notation:
$g_i(\tilde{x_i}, \bar{\theta}_s) = \frac{1}{|\mathcal{B}|}\sum_{x_i \in \mathcal{B}} \nabla_{\tilde{x_i}} \ell_s (f(x_i; \bar{\theta}_s), f(\tilde{x_i}; \bar{\theta}_s))$;
$\mathrm{AdamUpdate}_{\mathcal B}$ - Adam update for optimizing $\theta_{t+1} = \mathrm{argmin}_\theta \mathcal F(\theta) + \mu \mathcal{D}_{\mathrm{Breg}}(\theta, \tilde{\theta}_t)$;
$\Pi_{\mathcal A}$ - projection onto $\mathcal A$

Input: $T$: total number of iterations, $\mathcal X$: the dataset, $\theta_0$: pre-trained model parameters, $S$: total number of iterations for the Bregman proximal point method, $\sigma^2$: variance for random initialization of perturbation, $T_{\bar{x}}$: number of iterations for updating $\tilde{x_i}$, $\eta$: learning rate for updating $\tilde{x_i}$, $\epsilon$: perturbation bound, $\beta$: momentum parameter.

01: $\tilde{\theta_1} \leftarrow \theta_0$
02: for $t = 1,...,T$ do
03: $\quad$ $\bar{\theta}_1 \leftarrow \theta_{t-1}$
04: $\quad$ for $s = 1,...,S$ do
05: $\quad \quad$ Sample $\mathcal{B}$ from $\mathcal X$
06: $\quad \quad$ $\tilde{x_i} \leftarrow x_i + \nu_i$ where $\nu_i \sim \mathcal{N} (0, \sigma^2)$
07: $\quad \quad$ for $m = 1,...,T_{\bar{x}}$ do
08: $\quad \quad \quad$ $\tilde{g_i} \leftarrow \frac{g_i(\tilde{x_i},\bar{\theta}_s)}{\|g_i(\tilde{x_i},\bar{\theta}_s)\|_\infty}$
09: $\quad \quad \quad$ $\tilde{x_i} \leftarrow \Pi_{\|\tilde{x_i}-x_i\|_\infty \le \epsilon}(\tilde{x_i} + \eta \tilde{g_i})$
10: $\quad \quad$ end for
11: $\quad \quad$ $\bar{\theta}_{s+1} \leftarrow AdamUpdate_\mathcal{B} (\bar{\theta}_s)$
12: $\quad$ end for
13: $\quad$ $\theta_t \leftarrow \bar{\theta}_{S}$
14: $\quad$ $\tilde{\theta}_{t+1} \leftarrow (1-\beta) \bar{\theta}_{S} + \beta \tilde{\theta}_t$
15: end for

Output: $\theta_T$
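At each step the objective being minimized combines the task loss, the alpha-weighted smoothness-inducing adversarial term, and the mu-weighted Bregman proximal term that keeps the current parameters close to the EMA parameters $\tilde{\theta}$. A schematic sketch with a toy classifier and illustrative names (adv_logits would come from a perturbation loop like the one sketched for ALUM above):

import torch
import torch.nn.functional as F

def sym_kl(p, q):
    return (F.kl_div(F.log_softmax(p, -1), F.softmax(q.detach(), -1), reduction='sum') +
            F.kl_div(F.log_softmax(q, -1), F.softmax(p.detach(), -1), reduction='sum'))

def smart_step_loss(model, ema_model, x, y, adv_logits, alpha=1.0, mu=1.0):
    logits = model(x)
    task = F.cross_entropy(logits, y)      # standard task loss
    smooth = sym_kl(logits, adv_logits)    # smoothness-inducing adversarial term
    with torch.no_grad():
        ema_logits = ema_model(x)          # predictions under the EMA parameters theta_tilde
    breg = sym_kl(logits, ema_logits)      # Bregman proximal point term
    return task + alpha * smooth + mu * breg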

class SMARTCallback[source]

SMARTCallback(m:Module, alpha:float=1.0, mu:float=1.0, start_epoch:int=0, criterion=None, mask_special_tokens:bool=False, one_token_type=False, special_tokens_mask=None, token_type_mask=None, noise_var:float=1e-05, step_size:float=0.001, k:int=1, noise_gamma:float=1e-06) :: LossCallback

SMART callback for HuggingFace pretrained models.

Combines smoothness-inducing adversarial training and momentum accelerated Bregman proximal point optimization.

model = nn.Sequential(
    nn.Linear(1,10, bias=False),
    nn.Linear(10,1, bias=False)
)
learn = synth_learner(model=model, cbs=SMARTCallback(model[0]))
learn.fit(2)
epoch train_loss valid_loss time
0 12.511720 9.756591 00:00
1 11.275883 8.746477 00:00
Starting virtual adversarial training at epoch 0
Your model is probably not supported, make sure model interface is compatible with HF pretrained models

class VATCallback[source]

VATCallback(start_iter=None) :: LossCallback

VAT callback (draft)

Fin