6 Comments
Leo:

Dropout is mentioned on page 8 of the paper (https://arxiv.org/pdf/1706.03762.pdf):

"Residual Dropout: We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1."
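
For concreteness, a minimal sketch of that residual dropout, assuming the post-norm arrangement the quote describes (the class and argument names are mine, not from the paper or the post):

```python
import tensorflow as tf

# Minimal sketch of the "Residual Dropout" in the quote: dropout is applied to
# the sub-layer output, which is then added to the sub-layer input and normalized.
class SublayerConnection(tf.keras.layers.Layer):
    def __init__(self, rate=0.1):  # P_drop = 0.1 for the base model
        super().__init__()
        self.dropout = tf.keras.layers.Dropout(rate)
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, x, sublayer_output, training=False):
        # x is the sub-layer input; sublayer_output is e.g. the attention or FFN output.
        return self.norm(x + self.dropout(sublayer_output, training=training))
```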

Fan:

Thanks for the information. The extra dropout I mentioned here is taken directly from the TensorFlow library. This special dropout is used within the attention layer, which isn't described in the original paper: https://github.com/keras-team/keras/blob/v2.13.1/keras/layers/attention/multi_head_attention.py#L535.

But the comment in that code says it `is taken from the original Transformer paper`, which is odd.

The sentence you quoted, `we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks`, refers to the dropout layer applied after the positional embedding, not the one inside the attention layer.
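
To make that concrete, here is a minimal sketch of the embedding-level dropout the paper describes; `embed` and `pos_encoding` are assumed to exist and the function name is mine, not the post's:

```python
import tensorflow as tf

# Minimal sketch of the embedding-level dropout from the paper: dropout acts on
# the sum of the token embeddings and the positional encodings, before the
# first encoder/decoder layer.
dropout = tf.keras.layers.Dropout(0.1)  # P_drop = 0.1 in the base model

def encode_inputs(token_ids, embed, pos_encoding, training=False):
    x = embed(token_ids)                      # (batch, length, d_model)
    x = x + pos_encoding[: tf.shape(x)[1]]    # add the positional encodings
    return dropout(x, training=training)      # the dropout discussed above
```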

Leo:

Yes, based on the paper, dropout is used in three places: two in the residual connections and one on the sums of the embeddings and positional encodings.

Somehow, the Annotated Transformer (https://nlp.seas.harvard.edu/annotated-transformer/#full-model) also implements dropout inside the attention computation; see the function `def attention(query, key, value, mask=None, dropout=None)`.
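
For reference, this is roughly the shape of that dropout-inside-attention (a TensorFlow sketch in the spirit of that `attention` function, not its actual PyTorch code): the dropout acts on the softmaxed attention weights before they multiply the values.

```python
import tensorflow as tf

# Sketch of scaled dot-product attention with the extra dropout discussed above.
# The dropout acts on the attention weights, not on the residual branch or the
# embedding sums that the paper's text describes.
def scaled_dot_product_attention(q, k, v, mask=None, dropout=None, training=False):
    d_k = tf.cast(tf.shape(k)[-1], q.dtype)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    if mask is not None:
        scores += (1.0 - mask) * -1e9          # mask out disallowed positions
    weights = tf.nn.softmax(scores, axis=-1)   # attention probabilities
    if dropout is not None:
        weights = dropout(weights, training=training)  # "dropping out tokens to attend to"
    return tf.matmul(weights, v), weights
```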

Leo:

Thanks for this great post and the detailed explanation!

I have two questions about the positional embedding:

1. Why is `assert length % 2 == 0` required? Why can't the sequence length be an odd number?

2. `denom = tf.math.pow(n, -i / half_dim)` should be `denom = 1 / tf.math.pow(n, -i / half_dim)`, right?

Fan:

Thanks for the good questions.

1. This is my fault. The assertion should be on the embedding dimension, not the sequence length: the positional encoding interleaves sin and cos values, so the dimension needs to be even. I fixed it in the latest commit.

2. Notice that there is a negative sign on the `i`, so the formula is equivalent to `1 / tf.math.pow(n, i / half_dim)` (see the sketch below).
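
To make both answers concrete, here is a minimal sketch of the sinusoidal encoding; the names `n`, `half_dim`, and `denom` follow this thread, not necessarily the post's exact code:

```python
import tensorflow as tf

# Sketch of the sinusoidal positional encoding discussed above. The sin and cos
# values are interleaved along the feature axis, which is why `dim` must be even.
def positional_encoding(length, dim, n=10000.0):
    assert dim % 2 == 0, "sin/cos values are interleaved, so dim must be even"
    half_dim = dim // 2
    pos = tf.cast(tf.range(length), tf.float32)[:, tf.newaxis]   # (length, 1)
    i = tf.cast(tf.range(half_dim), tf.float32)[tf.newaxis, :]   # (1, half_dim)
    denom = tf.math.pow(n, -i / half_dim)  # == 1 / n**(i / half_dim), thanks to the minus sign
    angles = pos * denom                   # (length, half_dim)
    # Interleave sin and cos: columns 0, 2, 4, ... get sin; 1, 3, 5, ... get cos.
    return tf.reshape(tf.stack([tf.sin(angles), tf.cos(angles)], axis=-1), (length, dim))
```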

Leo:

Oh, right, I missed the negative sign! Thanks for your response.
