Generic model for testing new building blocks. WIP...

Helpers

wrap_sublayer[source]

wrap_sublayer(sublayer:Module, method:str, d_model)

Wraps a sublayer with a skip connection defined by method. Currently supported (a sketch of each scheme follows the list):

  • postnorm
  • prenorm
  • admin
  • rezero
  • ...
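
A minimal sketch of how these schemes differ, assuming a sublayer that maps [bs, sl, d_model] to the same shape. ResidualWrapper and its parameter names are illustrative only, not the library's wrap_sublayer implementation, and the admin branch simplifies the ADMIN initialization to a vector of ones.

import torch
from torch import nn

class ResidualWrapper(nn.Module):
    "Illustrative skip-connection wrapper; not the library implementation"
    def __init__(self, sublayer: nn.Module, method: str, d_model: int):
        super().__init__()
        self.sublayer, self.method = sublayer, method
        if method in ('postnorm', 'prenorm', 'admin'):
            self.norm = nn.LayerNorm(d_model)
        if method == 'rezero':
            # learnable gate initialized to zero: the block starts as the identity
            self.alpha = nn.Parameter(torch.zeros(1))
        if method == 'admin':
            # ADMIN rescales the skip branch with a learnable profile (simplified to ones here)
            self.omega = nn.Parameter(torch.ones(d_model))

    def forward(self, x, **kwargs):
        if self.method == 'postnorm':   # residual add, then LayerNorm
            return self.norm(x + self.sublayer(x, **kwargs))
        if self.method == 'prenorm':    # LayerNorm inside the residual branch
            return x + self.sublayer(self.norm(x), **kwargs)
        if self.method == 'rezero':     # scaled residual branch, no normalization
            return x + self.alpha * self.sublayer(x, **kwargs)
        if self.method == 'admin':      # postnorm with a rescaled skip connection
            return self.norm(self.omega * x + self.sublayer(x, **kwargs))
        raise ValueError(f'Unknown method: {self.method}')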

Bricks

Architecture specific layers, blocks and containers.

Encoder

class XEncoderBlock[source]

XEncoderBlock(d_model:int, n_heads:int=8, attn_module:Module=Attention, ff_module:Module=FeedForward, d_ff:int=None, attn_dropout:float=0.1, ff_dropout:float=0.1, causal:bool=False, attn_bias:bool=False, residual_type:str='postnorm', shared_qk:bool=False, **kwargs) :: Module

Experimental encoder block

bs = 4
sl = 128
d = 64
x = torch.randn(bs, sl, d)
m = XEncoderBlock(d)
out = m(x)
assert (out.size() == (bs, sl, d))
out.shape
torch.Size([4, 128, 64])
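
The documented options can be combined; for example (a quick sketch reusing the tensors above, exercising the residual_type, shared_qk and causal arguments from the signature):

m = XEncoderBlock(d, residual_type='prenorm', shared_qk=True, causal=True)
out = m(x)
assert (out.size() == (bs, sl, d))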

class XEncoder[source]

XEncoder(d_model, n_layers=6, n_heads=8, d_ff=None, attn_module:Module=Attention, ff_module:Module=FeedForward, ff_dropout=0.1, attn_dropout=0.1, attn_bias=False, causal=False, residual_type:str='postnorm', shared_qk:bool=False, final_norm=None, **kwargs) :: Module

Stack of XEncoderBlocks

x = torch.randn(bs, sl, d)
m = XEncoder(d)
out = m(x)
assert (out.size() == (bs, sl, d))
out.shape
torch.Size([4, 128, 64])
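# ReZero scales each residual branch by a learnable factor initialized to zero,
# so an untrained stack acts as the identity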
m = XEncoder(d, residual_type='rezero')
out = m(x)
assert (out.size() == (bs, sl, d))
assert (out == x).all()

Decoder

class XDecoderBlock[source]

XDecoderBlock(d_model, n_heads=8, d_ff=None, attn_dropout=0.1, ff_dropout=0.1, mask=None, attn_bias=False, residual_type='postnorm') :: Module

Standard transformer decoder block. Consists of self-attention, encoder-decoder attention and position-wise feed-forward layers

class XDecoderBlockV2[source]

XDecoderBlockV2(d_model, n_heads=8, mask=None, d_ff=None, attn_dropout=0.1, ff_dropout=0.1, attn_bias=False, residual_type='postnorm') :: Module

Transformer decoder block that uses a single additive attention layer in place of self-attention followed by cross-attention
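
A usage sketch for both decoder blocks, assuming their forward takes the target sequence plus an encoder context of matching dimension, as XDecoder does below:

x = torch.randn(bs, sl, d)
context = torch.randn(bs, sl, d)
for block in (XDecoderBlock(d), XDecoderBlockV2(d)):
    out = block(x, context)
    assert (out.size() == (bs, sl, d))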

class XDecoder[source]

XDecoder(d_model, n_layers=6, n_heads=8, d_ff=None, attn_dropout=0.1, ff_dropout=0.1, residual_type='postnorm', comb_attn=False, attn_bias=False, final_norm=None) :: Module

Stack of XDecoderBlock layers

x = torch.randn(bs, sl, d)
context = torch.randn(bs, sl, d)
m = XDecoder(d)
out = m(x, context)
assert (out.size() == (bs, sl, d))
out.shape
torch.Size([4, 128, 64])

Models

Language model

class XTransformerLM[source]

XTransformerLM(vocab_sz:int, d_model:int, n_layers:int=6, n_heads:int=8, d_ff:int=None, attn_module:Module=Attention, ff_module:Module=FeedForward, attn_dropout:float=0.1, ff_dropout:float=0.1, emb_dropout:float=0.1, tie_weights:bool=True, causal:bool=True, pos_enc:str='absolute', max_seq_len:int=512, axial_shape:tuple=None, axial_emb_dims:tuple=None, pad_idx:int=None, residual_type:str='postnorm', attn_bias:bool=False, shared_qk:bool=False) :: Module

Basic Transformer for language modelling

Parameters:

* vocab_sz: int - vocabulary size
* d_model: int - inner dimension of the model
* n_layers: int (default: 6)
* n_heads: int (default: 8)
* d_ff: int - inner dimension of the pointwise FeedForward net, if None defaults to 4*d_model
* attn_dropout: float - attention dropout
* ff_dropout: float - feed-forward dropout
* emb_dropout: float - embedding dropout
* causal: bool (default: True) - if True, causal masking is applied automatically
* max_seq_len: int (default: 512)
* tie_weights: bool - if True, the target embedding weights are reused for the output projection
* residual_type: str - one of {'postnorm', 'prenorm', 'admin', 'rezero'}
* attn_bias: bool - whether to allow biases in attention projection layers
* pad_idx: int - padding token id, required for automatic generation of the padding mask
* pos_enc: str from {'absolute', 'fixed', 'axial'} - type of positional encoding to use
* axial_shape: tuple - [optional] should be factors of max_seq_len
* axial_emb_dims: tuple - [optional] axial embedding components, should sum to d_model

Inputs:

* x - input ids, shape [bs, sl]
* mask - optional boolean mask, shape [bs, sl]

Returns:

* logits - target token logits, shape [bs, sl, vocab_sz]
bs = 4
sl = 128
d = 64
vocab_sz = 256
x = torch.randint(vocab_sz, (bs, sl))
model = XTransformerLM(vocab_sz, d, n_layers=2, causal=False)
out = model(x)
assert (out.size() == (bs, sl, vocab_sz))
out.shape
torch.Size([4, 128, 256])
#add tests for various configs here
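
A few hedged examples of the kind of configuration tests intended here, exercising documented options; for the axial case, axial_shape must factor max_seq_len and axial_emb_dims must sum to d_model:

# causal LM with prenorm residuals
model = XTransformerLM(vocab_sz, d, n_layers=2, causal=True, residual_type='prenorm')
assert (model(x).size() == (bs, sl, vocab_sz))

# axial positional encoding: 16*8 == sl and 32+32 == d
model = XTransformerLM(vocab_sz, d, n_layers=2, max_seq_len=sl, pos_enc='axial',
                       axial_shape=(16, 8), axial_emb_dims=(32, 32))
assert (model(x).size() == (bs, sl, vocab_sz))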

Encoder-Decoder model

class XTransformer[source]

XTransformer(enc_vocab_sz, dec_vocab_sz, d_model, n_enc_layers=6, n_dec_layers=6, n_heads=8, d_ff=None, pad_idx=None, tie_weights=True, shared_emb=False, attn_dropout=0.1, ff_dropout=0.1, emb_dropout=0.1, prenorm=False, attn_bias=False, comb_attn=False, pos_enc='absolute', max_seq_len=512, axial_shape=None, axial_emb_dims=None) :: Module

Basic Transformer Encoder-Decoder model

Parameters:

* enc_vocab_sz: int - source vocab size
* dec_vocab_sz: int - target vocab size
* d_model: int - inner dimension of the model
* n_enc_layers: int (default: 6)
* n_dec_layers: int (default: 6)
* n_heads: int (default: 8)
* d_ff: int - inner dimension of the pointwise FeedForward net, if None defaults to 4*d_model
* attn_dropout: float - attention dropout
* ff_dropout: float - feed-forward dropout
* emb_dropout: float - embedding dropout
* max_seq_len: int (default: 512)
* prenorm: bool - whether to use PreNorm or PostNorm
* attn_bias: bool - whether to allow biases in attention projection layers
* pad_idx: int - padding token id; if provided and no mask/context_mask are passed to the forward method, it is used to generate padding masks
* tie_weights: bool - if True, the target embedding weights are reused for the output projection
* shared_emb: bool - if True encoder and decoder will use shared embedding layer
* pos_enc: str from {'absolute', 'fixed', 'axial'} - type of positional encoding to use
* axial_shape: tuple - [optional] should be factors of max_seq_len
* axial_emb_dims: tuple - [optional] axial embedding components, should sum to d_model

Inputs:

* src - source input ids, shape [bs, src_sl]
* tgt - target input ids, shape [bs, tgt_sl]
* src_mask - optional boolean source mask, shape [bs, src_sl]
* tgt_mask - optional boolean target mask, shape [bs, tgt_sl]

Returns:

* logits - target token logits, shape [bs, tgt_sl, tgt_vocab_sz]
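
A usage sketch following the documented inputs and returns; the vocabulary sizes and layer counts here are illustrative:

src_vocab_sz, tgt_vocab_sz = 256, 260
src = torch.randint(src_vocab_sz, (bs, sl))
tgt = torch.randint(tgt_vocab_sz, (bs, sl))
model = XTransformer(src_vocab_sz, tgt_vocab_sz, d, n_enc_layers=2, n_dec_layers=2)
logits = model(src, tgt)
assert (logits.size() == (bs, sl, tgt_vocab_sz))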

Config for experiments

class XConfig[source]

XConfig(vocab_sz=256, d_model=512, n_layers=12, n_heads=8, attn_module=Attention, ff_module=FeedForward, d_ff=None, attn_dropout=0.1, ff_dropout=0.1, emb_dropout=0.1, tie_weights=True, causal=True, pos_enc='absolute', max_seq_len=512, axial_shape=None, axial_emb_dims=None, pad_idx=None, attn_bias=False, shared_qk=False, residual_type='postnorm') :: ConfigBase

Config for enwik8 Experiment. See https://arampacha.github.io/reformer_fastai/experiment.enwik8-baseline.html for details

model = XTransformerLM.from_config(XConfig(n_layers=2, residual_type='rezero'))
model
XTransformerLM(
  (emb): TransformerEmbedding(
    (emb): Embedding(256, 512)
    (dropout): Dropout(p=0.1, inplace=False)
    (pos_enc): AbsolutePositionalEmbedding(
      (emb): Embedding(512, 512)
    )
  )
  (encoder): XEncoder(
    (layers): ModuleList(
      (0): XEncoderBlock(
        (attn): ReZero(
          (sublayer): Attention(
            (in_proj): AttnInProjV2(
              (to_q): Linear(in_features=512, out_features=512, bias=False)
              (to_kv): Linear(in_features=512, out_features=1024, bias=False)
            )
            (attn): ScaledDotProdAttention(
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (out_proj): Linear(in_features=512, out_features=512, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (scale): Scale()
        )
        (ff): ReZero(
          (sublayer): FeedForward(
            (net): Sequential(
              (fc1): Linear(in_features=512, out_features=2048, bias=True)
              (act): GELU()
              (drop1): Dropout(p=0.1, inplace=False)
              (fc2): Linear(in_features=2048, out_features=512, bias=True)
              (drop2): Dropout(p=0.1, inplace=False)
            )
          )
          (scale): Scale()
        )
      )
      (1): XEncoderBlock(
        (attn): ReZero(
          (sublayer): Attention(
            (in_proj): AttnInProjV2(
              (to_q): Linear(in_features=512, out_features=512, bias=False)
              (to_kv): Linear(in_features=512, out_features=1024, bias=False)
            )
            (attn): ScaledDotProdAttention(
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (out_proj): Linear(in_features=512, out_features=512, bias=False)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (scale): Scale()
        )
        (ff): ReZero(
          (sublayer): FeedForward(
            (net): Sequential(
              (fc1): Linear(in_features=512, out_features=2048, bias=True)
              (act): GELU()
              (drop1): Dropout(p=0.1, inplace=False)
              (fc2): Linear(in_features=2048, out_features=512, bias=True)
              (drop2): Dropout(p=0.1, inplace=False)
            )
          )
          (scale): Scale()
        )
      )
    )
  )
  (proj): Linear(in_features=512, out_features=256, bias=True)
)