Yet another tutorial for the Transformer
Dropout is mentioned on page 8 of the paper (https://arxiv.org/pdf/1706.03762.pdf):
> **Residual Dropout** We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of `P_drop = 0.1`.
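For concreteness, here is a minimal TensorFlow/Keras sketch of the two dropout placements the quoted passage describes. The `sublayer`, `embedding_layer`, and `pos_encoding` arguments are assumptions for illustration, not code from the post:

```python
import tensorflow as tf

d_model, p_drop = 512, 0.1  # base-model values from the paper

dropout = tf.keras.layers.Dropout(p_drop)
norm = tf.keras.layers.LayerNormalization(epsilon=1e-6)

def residual_block(x, sublayer, training=False):
    # Residual dropout: applied to the sub-layer output *before*
    # it is added to the sub-layer input and normalized.
    return norm(x + dropout(sublayer(x), training=training))

def embed(tokens, embedding_layer, pos_encoding, training=False):
    # Embedding dropout: applied to the sum of the token embeddings
    # and the positional encodings, at the bottom of both stacks.
    x = embedding_layer(tokens) * tf.math.sqrt(tf.cast(d_model, tf.float32))
    return dropout(x + pos_encoding[:, :tf.shape(x)[1], :], training=training)
```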
Thanks for this great post! Very detailed explanation.
I have two questions about the positional embedding:
1. Why is `assert length % 2 == 0` required? Why can't the length be an odd number?
2. Shouldn't `denom = tf.math.pow(n, -i / half_dim)` be `denom = 1 / tf.math.pow(n, -i / half_dim)`? (See the sketch after this list.)
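For context on both questions, here is a minimal sketch of the sinusoidal positional encoding, reusing the names `n` and `half_dim` from the quoted snippet; the rest of the function is an assumption about the post's code, not the post itself. In the common formulation the even-ness constraint lives on the model dimension, because the sin and cos channels come in pairs, one pair per frequency; if the post's assert is really on the sequence length, that would be an artifact of its particular implementation.

```python
import tensorflow as tf

def positional_encoding(length, depth, n=10000.0):
    # Each frequency i fills one sin channel and one cos channel,
    # so the dimension being split in half must be even.
    assert depth % 2 == 0
    half_dim = depth // 2
    i = tf.range(half_dim, dtype=tf.float32)                   # (half_dim,)
    pos = tf.range(length, dtype=tf.float32)[:, tf.newaxis]    # (length, 1)
    # n**(-i / half_dim) == 1 / n**(i / half_dim): the negative
    # exponent already makes this the reciprocal of the denominator,
    # so the position is *multiplied* by it.
    denom = tf.math.pow(n, -i / half_dim)                      # (half_dim,)
    angle = pos * denom                                        # (length, half_dim)
    return tf.concat([tf.sin(angle), tf.cos(angle)], axis=-1)  # (length, depth)
```

Under this reading, the answer to question 2 would be no: if the code multiplies `pos` by `denom`, adding `1 /` would flip the sign of the exponent twice and scale the angles by `n**(i / half_dim)` instead of its reciprocal. The variable name `denom` is just misleading; the expression computes the reciprocal of the denominator from the paper's formula.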