6 Comments
Oct 1, 2023 (edited)

Dropout is mentioned on page 8 (https://arxiv.org/pdf/1706.03762.pdf):

Residual Dropout: We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1.
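
For illustration, a minimal sketch of that residual-dropout pattern (dropout on the sub-layer output, then the residual add and layer norm), assuming a Keras-style "Add & Norm" wrapper; the class and argument names here are illustrative, not from the post:

```python
import tensorflow as tf

class ResidualDropout(tf.keras.layers.Layer):
    """Sketch of an "Add & Norm" wrapper with residual dropout (names are illustrative)."""

    def __init__(self, sublayer, p_drop=0.1):
        super().__init__()
        self.sublayer = sublayer                      # e.g. self-attention or FFN block
        self.dropout = tf.keras.layers.Dropout(p_drop)
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, x, training=False):
        # Dropout is applied to the sub-layer output *before* it is added to the
        # sub-layer input and normalized, as described in the quoted passage.
        y = self.dropout(self.sublayer(x), training=training)
        return self.norm(x + y)
```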


Thanks for this great post and the detailed explanation!

I have two questions about the positional embedding:

1. Why is `assert length % 2 == 0` required? Why can't the sequence length be an odd number?

2. Shouldn't `denom = tf.math.pow(n, -i / half_dim)` be `denom = 1 / tf.math.pow(n, -i / half_dim)`?
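
For reference, a minimal sketch of the standard sinusoidal encoding these questions refer to, assuming the evenness constraint is on the embedding dimension (sin/cos pairs) and that `denom` is multiplied with the position; variable names are illustrative and may not match the post's code:

```python
import tensorflow as tf

def positional_encoding(length, d_model, n=10000.0):
    # Sin/cos come in pairs, so the *embedding* dimension must be even;
    # this is an assumption about what the post's assert enforces.
    assert d_model % 2 == 0
    half_dim = d_model // 2
    pos = tf.range(length, dtype=tf.float32)[:, tf.newaxis]    # (length, 1)
    i = tf.range(half_dim, dtype=tf.float32)[tf.newaxis, :]    # (1, half_dim)
    denom = tf.math.pow(n, -i / half_dim)                      # n**(-i/half_dim) already equals 1 / n**(i/half_dim)
    angle = pos * denom                                        # pos / n**(2i/d_model)
    # Concatenating the sin and cos halves is one common layout; the paper interleaves them.
    return tf.concat([tf.sin(angle), tf.cos(angle)], axis=-1)  # (length, d_model)
```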
