Dive into Twitter's recommendation system IV - SimClusters
Community-Based Representations for Heterogeneous Recommendations at Twitter
In this post, let’s go through the most advanced candidate retrieval system at Twitter: SimClusters1. For the previous posts, please refer to
It’s the backbone of many product features at Twitter, including the home feed, trends, who to follow, topic tweet recommendations, and personalized trends
It currently serves 35% of Home Timeline tweets, and it’s the primary source of out-of-network tweets
It successfully addresses problems unique to Twitter:
Varying cardinality, which affects the scale of the problem
Varying shelf life, which constrains the latency of recommendation generation
The core algorithm mimics sparse non-negative matrix factorization (NMF), but traditional MF methods cannot scale to Twitter’s data volume
It proposes a two-stage solution:
Stage 1: community discovery based on Metropolis-Hastings sampling
Stage 2: batch and streaming jobs that ingest the Stage 1 output and compute representations for different products
There are too many heuristic hyperparameters and settings in this algorithm, and I believe there is still much room for improvement in the future
The overall architecture is listed below.
Paper reading
Introduction
Here Twitter introduces the two primary dimensions of their recommendation system:
The cardinality of the items being recommended. As we can see below, the cardinality varies a lot
The shelf life of the computed recommendations. For example, tweet recommendations become stale quicker and must be generated in a real-time online system
As we saw with RealGraph and GraphJet, those systems are designed to solve specific problems but don’t generalize well. This brings us to the motivation for building SimClusters:
build a general system that helps us advance the accuracy of all or most of the Twitter products which require personalization and recommendations
The core idea of this paper is:
from the user–user graph, construct a general-purpose representation based on community structure, where each community is characterized by a set of influencers that many people in that community follow
SimClusters have the following features:
Universal representations: Users and all kinds of content can be represented in the same vector space
Computational scale: SimClusters work well with ∼10^9 users, ∼10^11 edges between them, and 10^8 new Tweets every day with ∼10^9 user engagements per day
Accuracy beyond the head: SimClusters are accurate beyond the head content. My comment🤔: it’s reasonable because all the results are based on community, which is also a kind of content-based recommendation that generalizes well. But they don’t show the evidence in this paper.
Item and graph churn: SimClusters can adapt to dynamic items that rapidly rise and diminish in popularity
Interpretability: Every dimension in the vector is a community. This is much more interpretable than MF or graph embeddings. My comment🤔: how about the accuracy? I guess it’s worse, but MF and graph embeddings cannot run at Twitter’s scale anyway
Efficient nearest neighbor search: It’s easy to maintain inverted indices for retrieval tasks. My comment🤔: ANN search could also be applied here, but they don’t mention it
Overview of SimClusters
The overview matches what I described above. Let’s go into more detail here:
Stage 1: Generate user community scores based on the user-user graph.
It is made available both in offline data warehouses and in low-latency online stores, indexed by user id.
This first stage is run in a batch-distributed setting
Compared to the user-tweet graph, the user-user graph evolves slowly and has higher coverage. So an offline batch job is suitable.
Stage 2: Several different jobs calculate the representations for specific recommendation targets. These jobs can be either batch or streaming, depending on the shelf life of the target. All the jobs are modularized and decoupled from each other; the only dependency is on offline datasets or online key-value stores. My comment🤔: This is the first time I realized that depending only on data is also a kind of decoupling
Community Discovery
Twitter still follows the idea from HITS. The users are split into a bipartite graph:
Hubs: left-side users who are interested in specific communities
Authorities: right-side users who are influential in specific communities
Naturally, the set of right-side users is much smaller than the set of left-side users because of the power-law distribution. To be specific:
we find that we’re able to cover the majority of edges (numbering ∼10^11) in the full graph by including the top ∼10^7 most followed users in the right side, while the left side continues to include all users, which is ∼10^9.
Actually, from a matrix-factorization point of view, this 3-step approach closely mirrors one way of performing SVD via the eigen-decomposition. But existing NMF (sparse non-negative matrix factorization) methods cannot scale to Twitter’s huge dataset.
Step 1: Similarity Graph of Right Nodes
In this step, a smaller unipartite undirected graph 𝐺 is constructed over the nodes of the right partition. The right partition is used because it’s much smaller and less noisy.
The algorithm is quite simple:
Define the edge weight between two right-side users as the cosine similarity of their follower sets on the left side of the bipartite graph
Low-similarity pairs are discarded, and only the top-K most similar users are kept for each node
The scale of this computation is enormous. It is solved by WHIMP2, which uses a combination of wedge sampling and Locality-Sensitive Hashing (LSH)
The output is an undirected graph with ∼10^7 nodes and ∼10^9 edges, which shrinks the problem to a shared-memory, multi-core scale
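To make the similarity computation concrete, here is a minimal brute-force sketch of this step (the function and parameter names are my own, not Twitter’s): it computes cosine similarity between right nodes from their shared followers and keeps the top-K neighbors above a threshold. Production replaces this quadratic all-pairs pass with WHIMP’s wedge sampling + LSH.

```python
import numpy as np
from scipy import sparse

def right_node_similarity_graph(A, top_k=10, min_sim=0.1):
    """Brute-force sketch of Step 1 (hypothetical names).
    A: sparse follow matrix, rows = left users, cols = right users.
    Returns, for each right node, its top_k most similar right nodes by
    cosine similarity over shared followers. Production uses WHIMP
    (wedge sampling + LSH) instead of this O(n^2) all-pairs pass."""
    # Normalize columns so a dot product of two columns equals their cosine similarity.
    col_norms = np.sqrt(np.asarray(A.power(2).sum(axis=0)).ravel())
    col_norms[col_norms == 0] = 1.0
    A_norm = A.multiply(1.0 / col_norms.reshape(1, -1)).tocsr()
    sim = (A_norm.T @ A_norm).toarray()      # right-node x right-node similarities
    np.fill_diagonal(sim, 0.0)               # drop self-similarity

    edges = {}
    for u in range(sim.shape[0]):
        neighbors = np.argsort(-sim[u])[:top_k]
        edges[u] = [(int(v), round(float(sim[u, v]), 3))
                    for v in neighbors if sim[u, v] >= min_sim]
    return edges

# Toy usage: 6 left users following 4 right users.
A = sparse.csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]))
print(right_node_similarity_graph(A, top_k=2))
```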
Step 2: Communities of Right Nodes
The next step is to discover the communities. The objective function is defined below:
Z(𝑢) denotes the set of communities to which the vertex 𝑢 has been assigned
N (𝑢) denotes the set of neighbors of vertex 𝑢
The contribution of the first term (neighbors that share a community) is up-weighted using the parameter 𝛼
In simple words, the objective is to find the community assignment that maximizes the overlap of communities among neighbors
The algorithm is listed below:
This is a Metropolis-Hastings approach: for at most T epochs, a new community set is proposed for every vertex u
If the newly proposed community set is better, it replaces the previous one
If it’s worse, the new set is still accepted with a certain probability, to avoid getting stuck in local optima
The naive approach proposes a new community set by random sampling over all communities, but it cannot scale
Neighborhood-aware MH:
It is extremely unlikely that a node should belong to a community that none of its neighbors currently belong to
It is unnecessary to assign a node to more than a small number of communities
So the candidate communities are picked from the communities of neighbors
And a threshold is set as an upper bound on the number of communities a node can be assigned to
Finally, the subset 𝑠 is sampled with probability proportional to 𝑒^𝑓(𝑢,𝑠), i.e., a softmax over the candidate subsets (see the sketch below)
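Below is a toy sketch of this neighborhood-aware proposal loop. All the names, and the exact form of the objective f, are my own simplification of the paper’s description (my f rewards neighbors that share a community, up-weighted by 𝛼, and non-neighbors that don’t); treat it as an illustration, not the paper’s implementation.

```python
import math
import random
from itertools import combinations

def f(u, S, Z, neighbors, alpha=2.0):
    """Simplified stand-in for the paper's objective f(u, S): reward neighbors
    that share at least one community with u (up-weighted by alpha) and
    non-neighbors that share none. Z maps each vertex to its community set."""
    score = 0.0
    for v, Zv in Z.items():
        if v == u:
            continue
        shares = bool(S & Zv)
        if v in neighbors[u]:
            score += alpha if shares else 0.0
        else:
            score += 0.0 if shares else 1.0
    return score

def propose(u, Z, neighbors, max_communities=2):
    """Neighborhood-aware proposal: candidates are the communities of u's
    neighbors, and at most max_communities are assigned. The subset s is
    sampled with probability proportional to exp(f(u, s)) (a softmax)."""
    candidates = set().union(*(Z[v] for v in neighbors[u])) if neighbors[u] else {0}
    subsets = [set(c) for r in range(1, max_communities + 1)
               for c in combinations(sorted(candidates), r)]
    scores = [f(u, s, Z, neighbors) for s in subsets]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]      # numerically stable softmax
    return random.choices(subsets, weights=weights, k=1)[0]

def run_mh(vertices, neighbors, num_communities=10, epochs=10):
    Z = {u: {random.randrange(num_communities)} for u in vertices}   # random init
    for _ in range(epochs):
        for u in vertices:
            new = propose(u, Z, neighbors)
            old_s, new_s = f(u, Z[u], Z, neighbors), f(u, new, Z, neighbors)
            # Accept improvements; accept worse sets with prob exp(new - old).
            if new_s >= old_s or random.random() < math.exp(new_s - old_s):
                Z[u] = new
    return Z
```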
Step 3: Communities of Left Nodes
The final step is quite simple:
assign a left-node to communities by looking at the communities that its neighbors (which will all be right-nodes and hence already have assignments) have been assigned to.
This is done by multiplying the adjacency matrix and setting U = 𝑡𝑟𝑢𝑛𝑐𝑎𝑡𝑒 (A · V), where the 𝑡𝑟𝑢𝑛𝑐𝑎𝑡𝑒 function keeps only up to a certain number of non-zeros per row to save on storage.
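A sparse-matrix sketch of this step (names are mine): multiply the follow matrix by the right-node community matrix and keep only the largest few scores per left user.

```python
import numpy as np
from scipy import sparse

def assign_left_nodes(A, V, keep=3):
    """Step 3 sketch: U = truncate(A @ V).
    A: sparse (left users x right users) follow matrix.
    V: sparse (right users x communities) membership matrix from Step 2.
    truncate keeps only the `keep` largest scores per row to bound storage."""
    U = (A @ V).tocsr()
    rows, cols, vals = [], [], []
    for i in range(U.shape[0]):
        row = U.getrow(i).toarray().ravel()
        for j in np.argsort(-row)[:keep]:
            if row[j] > 0:
                rows.append(i)
                cols.append(int(j))
                vals.append(float(row[j]))
    return sparse.csr_matrix((vals, (rows, cols)), shape=U.shape)
```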
Item Representations
The next stage is to build the representations for all items. This is also straightforward:
compute an item’s representation by aggregating the representations of all the users who engaged with it
An exponentially time-decayed average is used as the aggregate function: the contribution of a user who engaged with the item decays exponentially with how long ago the engagement happened
Items with a shorter shelf life get a shorter half-life
Two summary statistics are kept for each cell in W: the current average itself and the timestamp of its last update
R and C sets are also recorded for the top communities and vice versa
My comment🤔: this is a real-time time-decay pipeline, very commonly used for computing real-time user interests
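Here is a minimal sketch of that bookkeeping (the class and parameter names are hypothetical): each (item, community) cell stores a decayed score plus the timestamp of its last update, and the half-life would be chosen according to the item type’s shelf life. Note that I accumulate a decayed sum rather than the paper’s decayed average; the update mechanics are the same.

```python
import time

class DecayedItemRepresentation:
    """Sketch of the streaming item-representation update: for each
    (item, community) cell we keep the current decayed score and the
    timestamp of the last update."""

    def __init__(self, half_life_s=3600.0):
        self.half_life_s = half_life_s      # shorter shelf life -> shorter half-life
        self.cells = {}                     # community_id -> (score, last_update_ts)

    def update(self, user_interest, now=None):
        """Fold one user engagement into the item representation.
        user_interest: dict community_id -> the user's score in that community."""
        now = time.time() if now is None else now
        for c, score in user_interest.items():
            old, ts = self.cells.get(c, (0.0, now))
            decay = 0.5 ** ((now - ts) / self.half_life_s)   # exponential time decay
            self.cells[c] = (old * decay + score, now)

    def snapshot(self, now=None):
        """Read the representation, decaying each cell up to the current time."""
        now = time.time() if now is None else now
        return {c: s * 0.5 ** ((now - ts) / self.half_life_s)
                for c, (s, ts) in self.cells.items()}
```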
Deployment Details
In production, there are ∼10^5 communities in the representations
The communities are extracted from the similarity graph of the top ∼10^7 users by follower count
The final graph contains nearly 70% of the edges in the input bipartite graph
Within Stage 1:
Step 1 (similarity calculation) is the most expensive step, taking about 2 days to run end-to-end on Hadoop MapReduce
For step 2, the community matrix is updated incrementally, which means it is only built from scratch the very first time
For stage 2:
Two batch jobs, one for user influence representations and one for Topic representations
The user influence representations tell us which communities a user is influential in; they have better coverage and are denser. No details are given about how they are generated. My comment🤔: it should be another round of aggregation, e.g., the most-followed people within each community
The topic representations are generated from the user interest representations and the user-topic engagement graph. My comment🤔: the user-topic graph should itself be an aggregation of user-tweet-topic behaviors
Two streaming jobs, one for Tweet representations and one for Trend representations. Both are generated by consuming the streaming user-tweet engagements
Applications
As I pointed out in the overall architecture picture, the different intermediate representations are fed to downstream products
Similar Tweets on Tweet Details page
Two candidate sources are added:
Similar tweets based on tweet representations from SimClusters, using cosine similarity (see the sketch after this list). The new candidates have a 25% higher engagement rate
Retrieve Tweets whose SimClusters representation has high cosine similarity with the user influence representation of the author of the main Tweet on the page. Overall engagement increased by +7%
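As a rough sketch of how such a candidate source could work with sparse SimClusters vectors (the names below are hypothetical, not Twitter’s actual implementation): an inverted index from community to tweets limits scoring to tweets that share at least one community with the query, and candidates are then ranked by cosine similarity.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse SimClusters vectors,
    each a dict of community_id -> score."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_inverted_index(tweet_reprs):
    """tweet_reprs: dict tweet_id -> sparse representation."""
    index = defaultdict(list)
    for tweet_id, rep in tweet_reprs.items():
        for community in rep:
            index[community].append(tweet_id)
    return index

def similar_tweets(query_repr, tweet_reprs, index, top_k=10):
    # Only tweets sharing at least one community with the query are candidates,
    # which is what makes the sparse representations cheap to search.
    candidates = {t for c in query_repr for t in index.get(c, [])}
    scored = [(t, cosine(query_repr, tweet_reprs[t])) for t in candidates]
    return sorted(scored, key=lambda x: -x[1])[:top_k]
```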
Tweet Recommendations in Home Page
Again, two candidate sources are added:
Retrieve tweets whose real-time representation has the highest dot-product with the viewing user’s interest representation
Item-based collaborative filtering: uses the same underlying implementation as the “Similar Tweets” application. Identify Tweets similar to those Tweets which the user has recently liked
The engagement rate of new candidates is 33% higher than that for candidates generated by GraphJet
Overall Twitter engagement increased by 1%
Added as features for ranking:
The model trained with these features was able to increase the engagement rate of recommended content by 4.7% relatively
Ranking of Personalized Trends
Trends are scored for a given user using the dot-product of the user’s interest representation with the real-time representation of the Trend
8% increase in user engagement with the Trends themselves
12% increase in engagement on the landing page subsequent to a click
Topic Tweet Recommendations
Identifying tweets whose SimClusters representation has high cosine similarity with the representation of the query topic, and then applying the textual matching rules
Positive results are reported, but no detailed numbers are mentioned
Ranking Who To Follow Recommendations
Adding new features based on the SimClusters representations of the viewing user and the candidate user
Increase of 7% in the follow rate
My comment🤔: they don’t mention which representations are used; I think both the interest and influence representations are useful here
Applications in progress
Notifications quality filter
Build SimClusters representations for users based on the user–user block graph, and use these representations as features to train a model for filtering out abusive and spammy replies. (👏 good idea)
Supervised embeddings from feature combinations
Use them as input features to two-tower models for engagement prediction. (👏 they are trying two-tower lol)
Real-time Event notifications
Identify candidate users for real-time events based on the event representation and user interest representation
Related Work
In this part, Twitter shares its thoughts on different models
Graph embeddings or VAEs fit separate parameters for each user or item, so they cannot scale
Hybrid models, such as Factorization Machines and Deep Neural Networks (DNNs), have fewer parameters
But they require either well-defined feature vectors or pre-trained embeddings from auxiliary tasks as the input representations of users and items
My comment🤔: this is not a strong reason. As we know, they already developed tons of user features in RealGraph. For item features, they could use NLP models to generate embeddings for tweets
Graph Convolutional Networks (GCNs) perform well in domains where the items have a good set of pre-existing features, e.g., where the items are images. But GCNs perform worse in the absence of useful content features and cannot deal with the short half-life of items either
My comment🤔: this is not a strong reason either. NLP embeddings are strong tweet features. The main issue is the short half-life of tweets, but this can still be solved with a two-stage approach (sketched after this list):
Build the GCN on old tweet data on a daily or weekly basis
Map new tweets to similar old tweets (NLP embeddings can measure the similarity) and use the average of those old tweets’ embeddings as the representation of the new tweet
I believe a GCN could produce better community representations
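A minimal sketch of that two-step idea (entirely my own proposal, not from the paper; all names are hypothetical): precompute GCN embeddings for old tweets offline, then give each new tweet the average GCN embedding of its nearest old tweets in NLP-embedding space.

```python
import numpy as np

def represent_new_tweet(new_text_emb, old_text_embs, old_gcn_embs, k=5):
    """Hypothetical workaround for short-lived items.
    old_text_embs: (N, d_text) NLP embeddings of old tweets (computed offline).
    old_gcn_embs:  (N, d_gcn) GCN embeddings of the same old tweets (daily/weekly batch).
    new_text_emb:  (d_text,) NLP embedding of a brand-new tweet.
    Returns the average GCN embedding of the k nearest old tweets in NLP space."""
    sims = old_text_embs @ new_text_emb / (
        np.linalg.norm(old_text_embs, axis=1) * np.linalg.norm(new_text_emb) + 1e-9)
    nearest = np.argsort(-sims)[:k]
    return old_gcn_embs[nearest].mean(axis=0)
```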
My Final Words
We can see that SimClusters has pros in interpretability, generalization, and real-time scaling. But it’s still just a mimic of the MF algorithm, and from a modeling perspective, many improvements can be made in the future.
It focuses on graph learning and representation, and we know the graph relationship is Twitter’s most important signal; all of Twitter’s previous work also focuses on the graph. This makes sense, but I still want to ask: what about the tweet itself? Is the tweet as important as the user? If so, how about trying a two-tower model like YouTube’s?
There are too many heuristic rules in this algorithm, which basically means an end-to-end approach could work much better, without subtle manual tuning
GCN is promising. I hope Twitter can put more resources into at least an offline evaluation of it
Weekly Digest
The Illustrated Machine Learning website. A very cool website that illustrates core machine-learning concepts
AI's Hardware Problem. Deep learning's memory wall problem
Towards Complex Reasoning: the Polaris of Large Language Models. A summary of methods toward models of strong complex reasoning capabilities
Google "We Have No Moat, And Neither Does OpenAI". Leaked Internal Google Document Claims Open Source AI Will Outcompete Google and OpenAI
Science is a strong-link problem. Some problems are strong-link problems: overall quality depends on how good the best stuff is, and the bad stuff barely matters
What’s Next
Let’s move on to Twitter’s final paper, MaskNet: Introducing Feature-Wise Multiplication to CTR Ranking Models by Instance-Guided Mask.3
https://dl.acm.org/doi/pdf/10.1145/3394486.3403370
https://arxiv.org/abs/1703.01054
https://arxiv.org/pdf/2102.07619.pdf