3 Comments
Leo:

Thanks for this great post.

I think the dropout layer is missing from the residual connection in both the encoder and the decoder in this implementation. It could look something like this:

```
# assumes a dropout layer, e.g. self.dropout = tf.keras.layers.Dropout(rate), is defined in __init__
attention_output = self.norm(
    self.add([
        input,
        self.dropout(
            self.attention(input, input, input, training=training),
            training=training,
        ),
    ]),
)
```

Fan:

Thanks for the comment. I checked the implementation in the official TensorFlow tutorial https://www.tensorflow.org/text/tutorials/transformer#the_feed_forward_network. There is no dropout layer between the attention and residual connection layers there. But I think we can add a dropout layer wherever we want; this is flexible.
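
For reference, here is a minimal sketch of a post-norm residual block with the dropout slotted in before the add, using standard Keras layers; the class name `ResidualSelfAttention` and the default hyperparameters are illustrative, not taken from the post:

```
import tensorflow as tf

class ResidualSelfAttention(tf.keras.layers.Layer):
    """Self-attention sublayer: attention -> dropout -> residual add -> layer norm."""

    def __init__(self, num_heads=8, key_dim=64, dropout_rate=0.1):
        super().__init__()
        self.attention = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=key_dim)
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        self.add = tf.keras.layers.Add()
        self.norm = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, training=False):
        attention_output = self.attention(x, x, x)              # self-attention
        attention_output = self.dropout(attention_output,
                                        training=training)      # dropout before the residual add
        return self.norm(self.add([x, attention_output]))       # residual connection + layer norm
```

This follows the paper's description of applying dropout to the output of each sub-layer before it is added to the sub-layer input and normalized.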

Leo:

Yep, the TF implementation doesn't have dropout.

https://nlp.seas.harvard.edu/annotated-transformer/#encoder-and-decoder-stacks does have dropout, and the original paper also mentions dropout in the residual connection.
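
For comparison, a rough Keras-style sketch of the residual wrapper the Annotated Transformer uses (x + dropout(sublayer(norm(x))), a pre-norm variant); this is a loose translation of its PyTorch `SublayerConnection`, so treat the names and defaults as illustrative:

```
import tensorflow as tf

class SublayerConnection(tf.keras.layers.Layer):
    """Pre-norm residual wrapper: x + dropout(sublayer(norm(x)))."""

    def __init__(self, dropout_rate=0.1):
        super().__init__()
        self.norm = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, sublayer, training=False):
        # `sublayer` is any callable, e.g. lambda t: attention(t, t, t)
        return x + self.dropout(sublayer(self.norm(x)), training=training)
```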

It's not a big deal :)
