Thanks for this great post.
I think the dropout layer is missing from the residual connections in both the encoder and the decoder in this implementation. Something like this would add it:
```
attention_output = self.norm(
    self.add([
        input,
        # apply dropout to the attention output before the residual add
        self.dropout(self.attention(input, input, input, training=training), training=training),
    ]),
    training=training,
)
```
Thanks for the comment. I checked the implementation in the official TensorFlow tutorial, https://www.tensorflow.org/text/tutorials/transformer#the_feed_forward_network, and there is no dropout layer between the attention and the residual connection there either. But we can add a dropout layer wherever we want; this is flexible.
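For example, a small wrapper layer could make the dropout rate configurable (a rough sketch, not code from the post or the tutorial; the class and argument names are just illustrative):
```
import tensorflow as tf

class AddAndNorm(tf.keras.layers.Layer):
    """Residual add + layer norm, with optional dropout on the sub-layer output."""

    def __init__(self, dropout_rate=0.0, **kwargs):
        super().__init__(**kwargs)
        # rate=0.0 effectively disables dropout, so the same layer covers both cases
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        self.add = tf.keras.layers.Add()
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, x, sublayer_output, training=False):
        # drop units of the sub-layer output, then add the residual and normalize
        return self.norm(self.add([x, self.dropout(sublayer_output, training=training)]))
```
With `dropout_rate=0.0` this matches the TF tutorial's behaviour; a nonzero rate matches the paper.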
Yep, the TF implementation doesn't have dropout.
The Annotated Transformer (https://nlp.seas.harvard.edu/annotated-transformer/#encoder-and-decoder-stacks) has dropout, and the original paper also mentions dropout in the residual connection.
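The paper applies dropout to the output of each sub-layer before it is added to the sub-layer input and normalized, which in Keras-style pseudocode looks roughly like this (names are illustrative):
```
# Sub-layer connection as described in the paper:
#   LayerNorm(x + Dropout(Sublayer(x)))
def sublayer_connection(x, sublayer, dropout, layernorm, training=False):
    return layernorm(x + dropout(sublayer(x), training=training))
```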
It's not a big deal :)