6 Comments
Leo:

Dropout is mentioned on page 8 of the paper (https://arxiv.org/pdf/1706.03762.pdf):

"Residual Dropout: We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1."
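
For concreteness, a minimal sketch of that residual dropout, assuming the post-norm arrangement the quote describes (the class and argument names are mine, not from the paper or the post):

```python
import tensorflow as tf

# Minimal sketch of the "Residual Dropout" in the quote: dropout is applied to
# the sub-layer output, which is then added to the sub-layer input and normalized.
class SublayerConnection(tf.keras.layers.Layer):
    def __init__(self, rate=0.1):  # P_drop = 0.1 for the base model
        super().__init__()
        self.dropout = tf.keras.layers.Dropout(rate)
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, x, sublayer_output, training=False):
        # x is the sub-layer input; sublayer_output is e.g. the attention or FFN output.
        return self.norm(x + self.dropout(sublayer_output, training=training))
```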

Fan:

Thanks for the information. The extra dropout I mentioned here is taken directly from the TensorFlow library. This special dropout is used within the attention layer, which isn't described in the original paper: https://github.com/keras-team/keras/blob/v2.13.1/keras/layers/attention/multi_head_attention.py#L535.

But the comment in that code says it `is taken from the original Transformer paper`, which is odd.

The sentence you quoted, `we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks`, refers to the dropout layer applied after the positional embedding, not the one inside the attention layer.
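
To make that concrete, here is a minimal sketch of the embedding-level dropout the paper describes; `embed` and `pos_encoding` are assumed to exist and the function name is mine, not the post's:

```python
import tensorflow as tf

# Minimal sketch of the embedding-level dropout from the paper: dropout acts on
# the sum of the token embeddings and the positional encodings, before the
# first encoder/decoder layer.
dropout = tf.keras.layers.Dropout(0.1)  # P_drop = 0.1 in the base model

def encode_inputs(token_ids, embed, pos_encoding, training=False):
    x = embed(token_ids)                      # (batch, length, d_model)
    x = x + pos_encoding[: tf.shape(x)[1]]    # add the positional encodings
    return dropout(x, training=training)      # the dropout discussed above
```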

Leo:

Yes, based on the paper, dropout is used in three places: two in the residual connections and one on the sums of the embeddings and positional encodings.

Somehow, the Annotated Transformer (https://nlp.seas.harvard.edu/annotated-transformer/#full-model) also implements dropout inside the attention computation; see the function `def attention(query, key, value, mask=None, dropout=None)`.
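
For reference, this is roughly the shape of that dropout-inside-attention (a TensorFlow sketch in the spirit of that `attention` function, not its actual PyTorch code): the dropout acts on the softmaxed attention weights before they multiply the values.

```python
import tensorflow as tf

# Sketch of scaled dot-product attention with the extra dropout discussed above.
# The dropout acts on the attention weights, not on the residual branch or the
# embedding sums that the paper's text describes.
def scaled_dot_product_attention(q, k, v, mask=None, dropout=None, training=False):
    d_k = tf.cast(tf.shape(k)[-1], q.dtype)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    if mask is not None:
        scores += (1.0 - mask) * -1e9          # mask out disallowed positions
    weights = tf.nn.softmax(scores, axis=-1)   # attention probabilities
    if dropout is not None:
        weights = dropout(weights, training=training)  # "dropping out tokens to attend to"
    return tf.matmul(weights, v), weights
```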

Leo:

Thanks for this great post and the detailed explanation!

I have two questions about the positional embedding:

1. Why is `assert length % 2 == 0` required? Why can't the sequence length be an odd number?

2. `denom = tf.math.pow(n, -i / half_dim)` should be `denom = 1 / tf.math.pow(n, -i / half_dim)`, right?

Fan:

Thanks for the good questions.

1. This is my fault. The assertion should be on the embedding dimension, not the sequence length: the positional encoding interleaves sin and cos values, so the dimension needs to be even. I fixed it in the latest commit.

2. Notice that there is a negative sign on the `i`, so the formula is equivalent to `1 / tf.math.pow(n, i / half_dim)` (see the sketch below).
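
To make both answers concrete, here is a minimal sketch of the sinusoidal encoding; the names `n`, `half_dim`, and `denom` follow this thread, not necessarily the post's exact code:

```python
import tensorflow as tf

# Sketch of the sinusoidal positional encoding discussed above. The sin and cos
# values are interleaved along the feature axis, which is why `dim` must be even.
def positional_encoding(length, dim, n=10000.0):
    assert dim % 2 == 0, "sin/cos values are interleaved, so dim must be even"
    half_dim = dim // 2
    pos = tf.cast(tf.range(length), tf.float32)[:, tf.newaxis]   # (length, 1)
    i = tf.cast(tf.range(half_dim), tf.float32)[tf.newaxis, :]   # (1, half_dim)
    denom = tf.math.pow(n, -i / half_dim)  # == 1 / n**(i / half_dim), thanks to the minus sign
    angles = pos * denom                   # (length, half_dim)
    # Interleave sin and cos: columns 0, 2, 4, ... get sin; 1, 3, 5, ... get cos.
    return tf.reshape(tf.stack([tf.sin(angles), tf.cos(angles)], axis=-1), (length, dim))
```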

Leo:

Oh, right, I missed the negative sign! Thanks for your response.
