Fan, can you share your experiment setup behind the claim that "these 2 params do have a big impact on the final performance, around 5% improvement on the recall rate metrics"? The reason I ask is that I failed to make them work in my setup, which is: (1) each tower is just an embedding lookup table, (2) the features are just user id and movie title, (3) the MovieLens 100K dataset. There are two variants: (1) sampling bias correction as the baseline, and (2) baseline + L2 normalization + temperature. I used Keras Tuner to tune the learning rate and temperature. However, I can't make variant 2 (baseline + L2 + temperature) outperform variant 1 even with extensive tuning. Here is the colab link in case you are interested: https://drive.google.com/file/d/1i3suC8hE0zK3p5slM5TQKlzSjvMLtsqK/view?usp=sharing. I wonder if it's because the model is too simple and the dataset is too small.
Sorry, I just saw your comment. The result isn't from MovieLens data; it comes from a real industry model trained on a real production dataset and features, which are much more complex than MovieLens. Your assumption should be right. To get a more reliable result, you could try tuning on the Criteo dataset, which is much bigger than MovieLens.
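For readers following along, here is a minimal sketch (not the code from the colab above) of what the "baseline + L2 + temperature" variant can look like in a TensorFlow Recommenders style two-tower model. The tower definitions, the `temperature` value, and the `candidate_probability` feature name are illustrative assumptions; only the `candidate_sampling_probability` argument of the retrieval task is the library's own hook for the sampling-bias (logQ) correction.

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs


class TwoTowerModel(tfrs.Model):
    """Sketch of 'sampling bias correction + L2 normalization + temperature'."""

    def __init__(self, user_model, item_model, temperature=0.05):
        super().__init__()
        self.user_model = user_model      # e.g. an embedding lookup over user ids
        self.item_model = item_model      # e.g. an embedding lookup over movie titles
        self.temperature = temperature
        # In-batch softmax retrieval task; per-batch sampling probabilities
        # are passed in for the logQ correction.
        self.task = tfrs.tasks.Retrieval()

    def compute_loss(self, features, training=False):
        user_emb = self.user_model(features["user_id"])
        item_emb = self.item_model(features["movie_title"])

        # Variant 2: L2-normalize both towers so the dot product becomes a
        # cosine similarity, then sharpen the logits with a temperature < 1.
        user_emb = tf.math.l2_normalize(user_emb, axis=-1) / self.temperature
        item_emb = tf.math.l2_normalize(item_emb, axis=-1)

        return self.task(
            user_emb,
            item_emb,
            # Streaming frequency estimate of each in-batch candidate,
            # used to correct the in-batch negative sampling bias.
            candidate_sampling_probability=features["candidate_probability"],
        )
```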
No worries, thanks for the reply. Happy Chinese New Year!
Thanks for the great explanation; it answers questions about streaming frequency estimation that have been on my mind for years. There is still one more puzzle, and can you help me confirm my understanding: is it possible that a batch contains the same item multiple times, and that all of those duplicated items in the same batch get the same item frequency estimate?
Yes. Actually, the paper doesn't consider this situation. In my implementation, duplicated candidates are processed with the same step parameters, so their estimation results are the same.
Yep, the original paper uses sampling without replacement in its simulation section. :-( Your code actually made me realize that the batch update handles the duplication.
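To make the duplicate-handling point concrete, here is a minimal NumPy sketch of the streaming frequency estimator: array `A` holds the step at which a hashed item was last seen, array `B` holds a moving-average estimate of the gap between consecutive occurrences, and the estimated sampling probability is `1 / B`. The bucket count, hash, and `alpha` are illustrative assumptions. Because the batch update reads `A` once before writing it, duplicated items in the same batch see the same previous step and therefore receive the same estimate, which is the behavior discussed above.

```python
import numpy as np


class StreamingFrequencyEstimator:
    """Streaming item-frequency estimation with a batch update.

    A[h(y)]: last training step at which item y was seen.
    B[h(y)]: moving-average estimate of the gap between occurrences of y.
    Estimated sampling probability of y in a batch: 1 / B[h(y)].
    """

    def __init__(self, num_buckets=2**20, alpha=0.01):
        self.alpha = alpha
        self.A = np.zeros(num_buckets, dtype=np.int64)    # last-seen step
        self.B = np.ones(num_buckets, dtype=np.float64)   # estimated gap

    def _hash(self, item_ids):
        # Illustrative hash; any hash into [0, num_buckets) works.
        return np.asarray(item_ids) % len(self.A)

    def batch_update(self, item_ids, step):
        """Update with one batch and return estimated probabilities.

        A is read once before it is written, so duplicated items in the
        same batch see the same previous step, get the same gap, and end
        up with the same frequency estimate.
        """
        idx = self._hash(item_ids)
        gap = step - self.A[idx]                  # identical for duplicates
        self.B[idx] = (1 - self.alpha) * self.B[idx] + self.alpha * gap
        self.A[idx] = step
        return 1.0 / self.B[idx]


# Example: item 3 appears twice in the batch; both copies get the same estimate.
est = StreamingFrequencyEstimator()
probs = est.batch_update(np.array([3, 7, 3]), step=100)
assert probs[0] == probs[2]
```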