Graph Embedding with GraphSAGE

Graph embedding refers to learning low-dimensional representations of nodes in a network; the representation encodes structural information and node features. Such a representation can then be used for various downstream machine learning tasks like node classification, link prediction, visualization, etc. The downstream tasks often work with dynamic or evolving networks such as social networks, recommendation networks etc. The majority of the graph embedding methods assume fixed graph structure and thus are unsuitable for dynamic networks. The GraphSAGE embedding method overcomes this limitation by incorporating two changes, sampling and aggregating features, in the graph convolutional networks (GCNs) that are used for fixed structure networks. These changes, explained below, make GraphSAGE not only computationally efficient to scale to very large graphs but also permit embeddings to be generated for those nodes of a graph not seen before. This inductive capability to work with unseen data makes GraphSAGE a highly valuable node embedding tool suitable for a large number of applications. In this blog post, I will highlight the aggregation and sampling used in GraphSAGE and provide an example of its usage. Before reading on, you may want to go over my post on graph convolutional networks

How Does GraphSAGE Perform Aggregation?

We will assume each node has a feature vector associated with it. The associated feature vector with a node, for example, may be the profile of the person represented by the node. In absence of explicit features, structural features of nodes such as the node degree are used in GraphSAGE. 

The key component of GraphSAGE algorithm is that embedding of a node is obtained by aggregating embeddings in a layer-wise manner by slowly expanding the local neighborhood of the node. At layer-0, the embedding of a node is simply its feature vector. Thus, the embedding of the target node A at layer-0 in the graph shown below is $\bf X_A$.

Figure 1. Input Graph for Embedding

The layer-1 embedding of a node is the aggregate embedding of all of its neighboring nodes away at one link. Similarly, the layer-2 embedding of a node is obtained by aggregating embeddings of its 2-neighborhood nodes. Figure 2 below shows this layer-wise embedding process for the target node A of the input graph of Figure 1.

Figure 2. Layer-wise Embedding of Target Node A

The actual aggregation is done by learning aggregation functions for different layers. The square boxes in Figure 2 denote these functions. The configuration of Figure 2 defines a graph convolutional neural network where the hidden layer outputs can be expressed as follows for layers 1-K.

For k = 2, we get the embedding vector $\bold{Z}_A for the target node in the above graph. The matrices W and B are the trainable matrices. These trainable matrices are learned by defining a suitable loss function. Both supervised and unsupervised learnings are possible.

While the above description for aggregation appears similar to that used in graph convolutional networks, the difference is that the aggregation function is learned during training in GraphSAGE and it is predefined and fixed in GCNs. It is this difference that makes GraphSAGE an inductive learner as opposed to GCNs being transductive learners.

Neighborhood Sampling in GraphSAGE

In graph convolution networks, every neighboring node of the target node at the specified neighborhood size participates in message passing and contributes towards the computation of the embedded vector of the target node. When the neighborhood size is enlarged, the number of nodes contributing to the embedded vector computation can grow exponentially for certain graphs. This problem gets exacerbated  when the neighborhood of a target node includes a hub or celebrity node having millions of connections. To avoid exponential computation growth, GraphSAGE algorithm randomly selects only a sample of neighboring nodes. This allows GraphSAGE to be used for extremely large graphs with billions of nodes. You can see an illustration of neighborhood sampling in Figure 3 below where the No icon shows the nodes not being sampled.

Figure 3. Illustration of Neighborhood Sampling

Now, let's use GraphSage to generate node embeddings for link prediction.

Applying GraphSAGE to Perform Link Prediction

In link prediction, the objective is to predict whether a link exists between a pair of nodes. Link prediction has many applications; for example, it is used for recommending friends on social media networks. 

We will use the Cora dataset, widely used in graph machine learning similar to the usage of MNIST dataset in deep learning.  The Cora dataset consists of 2708 scientific publications classified into one of seven classes. Each publication is a node in the network, and is described by a 0/1-valued word vector of 1433 dimensions. The binary vector values represent the absence/presence of the corresponding word from the dictionary of 1433 unique words. The citation network has 10566 edges.

The implementation will be done using the  Deep Graph Learning (DGL) library, built for easy implementation of graph neural network models on top of existing deep learning frameworks such as PyTorch, MXNet and TensorFlow. Each node of a network is represented by a unique integer, called its node ID, in DGL. An edge is represented by a pair of integers corresponding to the IDs of its end nodes. You will need to install DGL before proceeding any further.

We begin by importing all necessary libraries. We also specify PyTorch as the backend for DGL.

import itertools
import os
os.environ["DGLBACKEND"] = "pytorch"
import dgl
import numpy as np
import scipy.sparse as sp
import torch
import torch.nn as nn
import torch.nn.functional as F
%matplotlib inline

We are going to treat the link prediction problem as a binary classification problem. When a pair of nodes is connected, there is an edge linking them. We treat this as an example of positive class; the absence of a link between a node pair is treated as negative class example. A sample of positive and negative class examples will be extracted from Cora dataset with the rest of the data being treated as test data. We begin by loading the data. We pick 10% of the edges as positive class examples and the same number of edges as negative examples.

dataset =
g = dataset[0]
# Split edge set for training and testing
u, v = g.edges()

eids = np.arange(g.num_edges())
eids = np.random.permutation(eids)
test_size = int(len(eids) * 0.1)
train_size = g.num_edges() - test_size
test_pos_u, test_pos_v = u[eids[:test_size]], v[eids[:test_size]]
train_pos_u, train_pos_v = u[eids[test_size:]], v[eids[test_size:]]

# Find all negative edges and split them for training and testing
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())))
adj_neg = 1 - adj.todense() - np.eye(g.num_nodes())
neg_u, neg_v = np.where(adj_neg != 0)

neg_eids = np.random.choice(len(neg_u), g.num_edges())
test_neg_u, test_neg_v = (
train_neg_u, train_neg_v = (
Next, we define our GraphSAGE model. We will use two convolution layers.

# build a two-layer GraphSAGE model
class GraphSAGE(nn.Module):
def __init__(self, in_feats, h_feats):
super(GraphSAGE, self).__init__()
self.conv1 = SAGEConv(in_feats, h_feats, "mean")
self.conv2 = SAGEConv(h_feats, h_feats, "mean")

def forward(self, g, in_feat):
h = self.conv1(g, in_feat)
h = F.relu(h)
h = self.conv2(g, h)
return h

Before training, we need to convert our original graph to a subgraph without test nodes. 

# Create a subgraph removing test nodes
train_g = dgl.remove_edges(g, eids[:test_size])

Now, the question is how do we perform link prediction using the embeddings that will be generated by our GraphSAGE model. Remember that link prediction operates on node pairs. Since we can describe a pair of nodes with an edge connecting them, we can set up our training and test data based on edges. To do so, we create four graphs: positive train graph, positive test graph, negative train graph, and negative test graph. All these graphs have same number of nodes as in the original data; however, the edges in each graph correspond to node pairs of positive and negative training and test examples. In DGL, we can do this by the following code.

# Create positive and negative train and test graphs
train_pos_g = dgl.graph((train_pos_u, train_pos_v), num_nodes=g.num_nodes())
train_neg_g = dgl.graph((train_neg_u, train_neg_v), num_nodes=g.num_nodes())

test_pos_g = dgl.graph((test_pos_u, test_pos_v), num_nodes=g.num_nodes())
test_neg_g = dgl.graph((test_neg_u, test_neg_v), num_nodes=g.num_nodes())

One way to predict the likelihood of a link between a pair of nodes is to compute the similarity of the respective node vectors. We will use the dot product to compute node similarities. The corresponding DGL code for this is as follows.

import dgl.function as fn

class DotPredictor(nn.Module):
def forward(self, g, h):
with g.local_scope():
g.ndata["h"] = h
# Compute a new edge feature named 'score' by a dot-product between the
# source node feature 'h' and destination node feature 'h'.
g.apply_edges(fn.u_dot_v("h", "h", "score"))
# u_dot_v returns a 1-element vector for each edge so you need to squeeze it.
return g.edata["score"][:, 0]

We are now ready to perform training. For training, we need to specify a loss function. We will use the binary cross-entropy loss. The performance evaluation metric is chosen as AUC (area under the curve). The following code snippets take care of the loss function and AUC.

model = GraphSAGE(train_g.ndata["feat"].shape[1], 16)
pred = DotPredictor()

def compute_loss(pos_score, neg_score):
scores =[pos_score, neg_score])
labels =
[torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]
return F.binary_cross_entropy_with_logits(scores, labels)

def compute_auc(pos_score, neg_score):
scores =[pos_score, neg_score]).numpy()
labels =
[torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]
return roc_auc_score(labels,scores)

We can now set up the training loop and begin training.

# We will use Adam optimizer
optimizer = torch.optim.Adam(
itertools.chain(model.parameters(), pred.parameters()), lr=0.01)

# training loop
all_logits = []
for e in range(100):
# forward
h = model(train_g, train_g.ndata["feat"])
pos_score = pred(train_pos_g, h)
neg_score = pred(train_neg_g, h)
loss = compute_loss(pos_score, neg_score)

# backward

if e % 5 == 0:
print("In epoch {}, loss: {}".format(e, loss))

# Compute AUC
from sklearn.metrics import roc_auc_score

with torch.no_grad():
pos_score = pred(test_pos_g, h)
neg_score = pred(test_neg_g, h)
print("AUC", compute_auc(pos_score, neg_score))

The result of training is shown below.
In epoch 0, loss: 0.7079574465751648
In epoch 5, loss: 0.6902216076850891
In epoch 10, loss: 0.6735929250717163
In epoch 15, loss: 0.6262624263763428
In epoch 20, loss: 0.5796554684638977
In epoch 25, loss: 0.555548369884491
In epoch 30, loss: 0.5190547108650208
In epoch 35, loss: 0.5012964606285095
In epoch 40, loss: 0.48360729217529297
In epoch 45, loss: 0.46244215965270996
In epoch 50, loss: 0.4450105130672455
In epoch 55, loss: 0.4255025088787079
In epoch 60, loss: 0.41155683994293213
In epoch 65, loss: 0.396867573261261
In epoch 70, loss: 0.3816315531730652
In epoch 75, loss: 0.36709561944007874
In epoch 80, loss: 0.3521302044391632
In epoch 85, loss: 0.33679577708244324
In epoch 90, loss: 0.32077470421791077
In epoch 95, loss: 0.3040713667869568
AUC 0.8397133936793872
The AUC value of 0.839 indicates that the model is performing well.

That's all for GraphSAGE. Let me know how you liked it.

CCA for Finding Latent Relationships and Dimensionality Reduction

Canonical Correlation Analysis (CCA) is a powerful statistical technique. In machine learning and multimedia information retrieval, CCA plays a vital role in uncovering intricate relationships between different sets of variables. In this blog post, we will look into this powerful technique and show how it can be used for finding hidden correlations as well as for dimensionality reduction.

To understand CCA's capabilities, let’s take a look at two sets of observations, X and Y, shown below. These two sets of observations are made on the same set of objects and each observation represents a different variable.

Two Sets of Observations

When computing the pairwise correlation between the column vectors of X and Y, we obtain the following set of values, where the entry at (i,j) represents the correlation between the i-th column of X and the j-th column of Y.

The resulting correlation values give us some insight between the two sets of measurements. The correlation values show moderate to almost no correlation between the columns of the two datasets except a relatively higher correlation between the second column of X and the third column of Y.

Hidden Relationship

It looks like there is not much of a relationship between X and Y. Is that so? Let's wait before concluding that X and Y do not have much of a relationship.

Lets transform X and Y into one-dimensional arrays, a and b, using the vectors [-0.427 -0.576 0.696] and [0 0 -1].

a = X[-0.427 -0.576 0.696]T

b = Y[0 0 -1]T

Now, let's calculate the correlation between a and b. Wow! we get a correlation value of 0.999, meaning that the two projections of X and Y are very strongly correlated. In other words, there is a very strong hidden relationship present in our two sets of observations. Wow! How did we end up getting a and b?” The answer to this is the canonical correlation analysis. 

What is Canonical Correlation Analysis?

Canonical correlation analysis is a technique that looks for pairs of basis vectors for two sets of variables X and Y such that the correlation between the projections of X and Y  onto these basis vectors are mutually maximized. In other words, the transformed arrays show much higher correlation to bring out any hidden relationship. The number of pairs of such basis vectors is limited to the smallest dimensionality of X and Y. For example, if X is an array of size nxp and Y of size nxq, then the number of basis vectors cannot exceed min{p,q}.

Assume wx and wy be the pair of basis vectors projecting X and Y into a and b given by a=Xwx, and b=Ywy. The projections a and b are called the scores or the canonical variates. The correlation between the projections, after some algebraic manipulation, can be expressed as:

$\Large \rho = \frac{\bf{w}_{x}^T \bf{C}_{xy}\bf{w}_{y}}{\sqrt{\bf{w}_{x}^T \bf{C}_{xx}\bf{w}_{x}\bf{w}_{y}^T \bf{C}_{yy}\bf{w}_{y}}}$,

where CxxCxy and Cyy  are three covariance matrices. The canonical correlations between X and Y are found by solving the eigenvalue equations

$ \bf{C}_{xx}^{-1}\bf{C}_{xy}\bf{C}_{yy}^{-1}\bf{C}_{yx}\bf{w}_x = \rho^2 \bf{w}_x$

$ \bf{C}_{yy}^{-1}\bf{C}_{yx}\bf{C}_{xx}^{-1}\bf{C}_{xy}\bf{w}_y = \rho^2 \bf{w}_y$

The eigenvalues in the above solution correspond to the squared canonical correlations and the corresponding eigenvectors yield the needed basis vectors. The number of non-zero solutions to these equations are limited to the smallest dimensionality of X and Y.

CCA Example

Let’s take a look at an example using the wine dataset from the sklearn library. We will divide the13 features of the dataset into X and Y sets of observations. The class labels in our example will act as hidden or latent feature. First, we will load the data, split it into X and Y and perform feature normalization.

from sklearn.datasets import load_wine
import numpy as np
wine = load_wine()
X =[:, :6]# Form X using first six features
Y =[:, 6:]# Form Y using the remaining seven features
# Perform feature normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
Y = scaler.fit_transform(Y)

Next, we import CCA object and fit the data. After that we obtain the canonical variates. In the code below, we are calculating 3 projections, X_c and Y_c, each for X and Y.

from sklearn.cross_decomposition import CCA
cca = CCA(n_components=3), Y)
X_c, Y_c = cca.transform(X, Y)
We can now calculate the canonical correlation coefficients to see what correlation values are obtained.
cca_corr = np.corrcoef(X_c.T, Y_c.T).diagonal(offset=3)

[0.90293514 0.73015495 0.51667522]

The highest canonical correlation value is 0.9029, indicating a strong hidden relationship between the two sets of vectors. Let us now try to visualize whether these correlations have captured any hidden relationship or not. In the present example, the underlying latent information not available to CCA is the class-membership of different measurements in X and Y. To check this, I have plotted the scatter plots of the three sets of x-y canonical variates where each variate pair is colored using the class label not accessible to CCA. These plots are shown below. It is clear that the canonical variates associated with the highest correlation coefficient show the existence of three groups in the scatter plot. This means that CCA is able to discern the presence of a hidden variable that reflects the class membership of the different observations.


Canonical Correlation Analysis (CCA) is a valuable statistical technique that enables us to uncover hidden relationships between two sets of variables. By identifying the most significant patterns and correlations, CCA helps gain valuable insights with numerous potential applications. CCA can be also used for dimensionality reduction. In machine and deep learning, CCA has been used for cross-modal learning and cross-modal retrieval.

An Intro to Graph Convolutional Networks

Graph neural networks (GNNs) are deep learning networks that operate on graph data. These networks are increasingly getting popular as numerous real-world applications are easily modeled as graphs. Graphs are unlike images, text, time-series that are used in deep learning models. Graphs are of arbitrary size and complex topological structure. We represent graphs as a set of nodes and edges. In many instances, each node is associated with a feature vector. The adjacency matrix of a graph defines the presence of edges between the nodes. The ordering of nodes in a graph is arbitrary. These factors make it hard to use the existing deep learning architectures and call for an architecture suited to graphs as inputs.

Permutation Invariance Architecture

Since the nodes in a graph are arbitrarily ordered, it is possible that two adjacency matrices might be representing the same graph. So whatever architecture we plan for graph computation, it should be invariant to the ordering of nodes. This requirement is termed permutation invariance. Given a graph with adjacency matrix A1 and the corresponding nodes feature matrix X1 and another matrix pair of A2 and X2 representing the same graph but with different ordering of nodes, the permutation invariance implies that any function f that maps the graph to an embedding vector R of dimension d must satisfy the relationship f(A1,X1) = f(A2,X2), i.e. the function should yield the same embedding irrespective of how the nodes are numbered.

Message Passing Architecture

The deep learning architecture that works with graphs is based on message passing. The resulting deep learning model is also known as Graph Convolutional Networks (GCNs). The underlying graph for this architecture consists of an adjacency matrix and a matrix of node vectors. Each node in the graph starts by sharing its message, the associated node vector, with all its neighbors. Thus, every node receives one message per adjacent node as illustrated below.

Alt text

As each node receives messages from its neighbors, the node feature vectors for every node are updated. The GCN uses the following updating rule where $\bf{h}_u$ refers to the node vector for node u.

$\bf{h}_u^{(k+1)} = \sigma\left(\hat{\bf{D}}^{-1/2}\hat{\bf{A}}\hat{\bf{D}}^{-1/2}\bf{h}_u^{(k)}\bf{W}^{(k)}\right)$,

where $\bf{W}^{(k)}$ is the weight parameters that transforms the node features into messages $(\bf{H}^{(k)}\bf{W}^{(k)})$. We add the identity matrix to the adjacency matrix $\bf{A}$ so that each node sends its own message to itself also: $\hat{\bf{A}}=\bf{A}+\bf{I}$. In place of summing, we perform messages averaging via the degree matrix $\hat{\bf{D}}$. We can also represent the above updating rule through the following expression:

$\bf{h}_u^k = \sigma\Bigl(\bf{W}_k \sum_{v\in{N(u)\cup N(v))}} \frac{\bf{h}_v^{k-1}}{(|N(u)||N(v)|)^{-1/2}} \Bigl)$,  

where N(.) represents the number of neighbors of a node including itself. The denominator here is viewed as performing the symmetric normalization of the sum accumulated in the numerator. The normalization is important to avoid large aggregated values for nodes with many neighbors. A nonlinearity, denoted by σ, is used in the updating relationship. The ReLU is generally the nonlinearity used in GCNs.

A Simple Example to Illustrate Message Passing Computation

We will use the graph shown above for our illustration. The adjacency matrix of the graph is

The $\hat {\bf{A}}$ and $\hat {\bf{D}}$ matrices are then:

Let the matrix of node features be the following

The $\hat \bf{D}^{-1/2}$ matrix results in $\text{diag}(1/2^{1/2}, 1/4^{1/2}, 1/3^{1/2}, 1/3^{1/2})$. Assuming $\bf{W}$ to be an identity matrix of size 2, we carry out the multiplications to obtain the updated matrix of node feature vectors, before passing through the nonlinearity, as

[[0.707 1.560], [3.387 4.567], [3.910 4.866], [3.910 4.866]].

The graph convolution layer just described above limits the message passing only to immediate neighbors. With multiple such layers we have a GCN with  message passing between nodes beyond the local neighborhood. The figure below presents a visualization of a GCN with two layers of message passing. 

Training GCNs

How do we find the weights used in a graph neural network? This is done by defining a loss function and using SGD to find the optimal weight values. The choice of the loss function depends upon the task that the GCN is to be used for. We will consider node classification task for our discussion. I plan to cover the training for other tasks such as link prediction and graph classification in future posts.

Node classification implies assigning a label or category to a node based on a training set of nodes with known labels. Given $\bf{y}_u$ as the one-hot encoding of true label of node u and its node embedding output by the GNN as $\bf{z}_u$, the negative log-likelihood loss, defined below, is the most common choice to learn the weights:

$L =\sum_{u\in{\bf {V}_{\text{train}}}} - log(softmax(\bf{z}_u,\bf{y}_u)$

The above expression implies the presence of a set of training nodes. These training nodes are a subset of all nodes participating in message passing. The remaining nodes are used as test nodes to determine how well the learning has taken. It is important to remember that all nodes, whether training nodes or not , are involved in message passing to generate embeddings; only a subset of nodes marked as training nodes are used in the loss function to optimize the weight parameters. The nodes not designated as training nodes are called transductive test nodes. Since the transductive nodes do participate in embeddings generation, this setup is different from supervised learning and is termed as semi-supervised learning

Applying GNN for Node Classification

Let's now train a GCN for a node classification task. I am going to use the scikit-network library. You can find a brief introduction to this library at this blog post. The library is developed along the lines of the popular scikit-learning library for machine learning and thus provides a familiar environment to work with for graph machine learning.

Let's get started by importing all the necessary libraries.

# import necessary libraries
from IPython.display import SVG
import numpy as np
from scipy import sparse
from sknetwork.classification import get_accuracy_score
from sknetwork.gnn import GNNClassifier
from sknetwork.visualization import svg_graph

We will use the art_philo_science toy dataset. It consists of a selection of 30 Wikipedia articles with links between them. Each article is described by some words used in their summary, among a list of 11 words. Each article belongs to one of the following 3 categories: arts, philosophy or science. The goal is to retrieve the category of some articles (the test set) from the category of the other articles (the train set).

from import art_philo_science
graph = art_philo_science(metadata=True)
adjacency = graph.adjacency
features = graph.biadjacency
names = graph.names
names_features = graph.names_col
names_labels = graph.names_labels
labels_true = graph.labels
position = graph.position

Just for illustration, let's print the names of all the 30 nodes.

print(names)# These are nodes     
['Isaac Newton' 'Albert Einstein' 'Carl Linnaeus' 'Charles Darwin'
'Ptolemy' 'Gottfried Wilhelm Leibniz' 'Carl Friedrich Gauss'
'Galileo Galilei' 'Leonhard Euler' 'John von Neumann' 'Leonardo da Vinci'
'Richard Wagner' 'Ludwig van Beethoven' 'Bob Dylan' 'Igor Stravinsky'
'The Beatles' 'Wolfgang Amadeus Mozart' 'Richard Strauss' 'Raphael'
'Pablo Picasso' 'Aristotle' 'Plato' 'Augustine of Hippo' 'Thomas Aquinas'
'Immanuel Kant' 'Bertrand Russell' 'David Hume' 'René Descartes'
'John Stuart Mill' 'Socrates']

Let's also print the names of the features.

print(names_features)# These are the features.     
['contribution' 'theory' 'invention' 'time' 'modern' 'century' 'study'
 'logic' 'school' 'author' 'compose']

We will use a single hidden layer.

hidden_dim = 5
n_labels = 3
gnn = GNNClassifier(dims=[hidden_dim, n_labels],
print(gnn)# Details of the gnn   
    Convolution(layer_type: conv, out_channels: 5, activation: ReLu,
    use_bias: True, normalization: both, self_embeddings: True)
    Convolution(layer_type: conv, out_channels: 3, activation: Cross entropy,
    use_bias: True, normalization: both, self_embeddings: True))

We now select nodes for use as training nodes and start training.

# Training set
labels = labels_true.copy()
train_mask = np.random.random(size=len(labels)) < 0.5
labels[train_mask] = -1
# Training
labels_pred = gnn.fit_predict(adjacency, features, labels,
                    n_epochs=200, random_state=42, history=True)
In epoch   0, loss: 1.053, train accuracy: 0.462
In epoch  20, loss: 0.834, train accuracy: 0.692
In epoch  40, loss: 0.819, train accuracy: 0.692
In epoch  60, loss: 0.831, train accuracy: 0.692
In epoch  80, loss: 0.839, train accuracy: 0.692
In epoch 100, loss: 0.839, train accuracy: 0.692
In epoch 120, loss: 0.825, train accuracy: 0.692
In epoch 140, loss: 0.771, train accuracy: 0.769
In epoch 160, loss: 0.557, train accuracy: 1.000
In epoch 180, loss: 0.552, train accuracy: 1.000

We now test for accuracy on test nodes.

# Accuracy on test set
test_mask = ~train_mask
get_accuracy_score(labels_true[test_mask], labels_pred[test_mask]) 

The accuracy is 100%, not a surprise because it is a toy problem.

Issues with GCNs

One major limitation of GCNs discussed in this post is that these networks can generate embeddings only for fixed or static networks. Thus, the node classification task shown above cannot be used for evolving networks as exemplified by social networks. Another issue relates to computational load for large graphs with nodes having very large neighborhoods. In subsequent posts, we are going to look at some other GNN models. So be on a look out.