Showing posts with label Node Embeddings. Show all posts
Showing posts with label Node Embeddings. Show all posts

Graph Classification using Graph Neural Networks

Graph classification focuses on assigning labels or categories to entire graphs or networks. Unlike traditional classification tasks that deal with individual data instances, graph classification considers the entire graph structure, including nodes, edges and their properties. A graph classifier uses a mapping function that can accurately predict the class or label of an unseen graph based on its structural properties. The mapping function is learned during training using supervised learning.

Why Do We Need Graph Classification?

The importance of graph classification lies in graph data being ubiquitous in today's interconnected world. Graph based methods including graph classification have emerged as methodology of  choice in numerous applications across various domains, including:

1. Bioinformatics: Classifying protein-protein interaction networks or gene regulatory networks can provide insights into disease mechanisms and aid in drug discovery. In fact, the most well-known success story of graph neural networks is the discovery of antibiotics to treat drug-resistant diseases, widely reported in early 2020.

2. Social Network Analysis: Categorizing social networks can help identify communities, detect anomalies (e.g., fake accounts), and understand information diffusion patterns.

3. Cybersecurity: Classifying computer networks can assist in detecting malicious activities, identifying vulnerabilities, and preventing cyber attacks.

4. Chemistry: Classifying molecular graphs can aid in predicting chemical properties, synthesizing new compounds, and understanding chemical reactions.

How Do We Build a Graph Classifier? 

There are two main approaches that we can use to build graph classifiers: kernel-based methods and neural network-based methods.

1. Kernel-based Methods:

These methods rely on defining similarity measures (kernels) between pairs of graphs, which capture their structural and topological properties. Popular kernel-based methods include the random walk kernel, the shortest-path kernel, and the Weisfeiler-Lehman kernel. Once the kernel is defined, traditional kernel-based machine learning algorithms, such as Support Vector Machines (SVMs), can be applied for classification.

2. Neural Network- based Methods:

These methods typically involve learning low-dimensional representations (embeddings) of the graphs through specialized neural network architectures, such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs). The learned embeddings capture the structural information of the graphs and can be used as input to standard classifiers, like feed-forward neural networks or logistic regression models. For details on GCNs and node embeddings, please visit my earlier post on graph convolutional networks

Both kernel-based and neural network-based methods have their strengths and weaknesses, and the choice depends on factors such as the size and complexity of the graphs, the availability of labeled data, and computational resources. Given that graph neural networks are getting more mileage, we will complete this blog post by going over steps needed for building a GNN classifier.

Steps for Building a GNN Classifier


We are going to use the MUTAG dataset which is part of the TUDatasets, an extensive collection of graph datasets and easily accessible in PyTorch Geometric library for building graph applications. The MUTAG dataset is a small dataset of 188 graphs representing two classes of graphs. Each graph node is characterized by seven features. Two of the example graphs from this dataset are shown below.

Two example graphs from MUTAG dataset

We will use 150 graphs for training and the remaining 38 for testing. The division into the training and test sets is done using the available utilities in the PyTorch Geometric. 


Due to the smaller graph sizes in the dataset, mini-batching of graphs is a desirable step in graph classification for better utilization of GPU resources. The mini-batching is done by diagonally stacking adjacency matrices of the graphs in a batch to create a giant graph that acts as an input to the GNN for learning. The node features of the graphs are concatenated to form the corresponding giant node feature vector. The idea of mini-batching is illustrated below.

Illustration of mini-batching

Graph Classifier

We are going to use a three-stage classifier for this task. The first stage will consists of generating node embeddings using a message-passing graph convolutional network (GCN). The second stage is an embeddings aggregation stage. This stage is also known as the readout layer. The function of this stage is to aggregate node embeddings into a single vector. We will simply take the average of node embeddings to create readout vectors. PyTorch Geometric has a built-in function for this purpose operating at the mini-batch level that we will use. The final stage is the actual classifier that looks at mapped/readout vectors to learn classification rule. In the present case, we will simply use a linear thresholding classifier to perform binary classification. We need to specify a suitable loss function so that the network can be trained to learn the proper weights.

A complete implementation of all of the steps described above is available at this Colab notebook. The results show that training a GCN with three convolutional layers results in test accuracy of 76% in just a few epochs.

Takeaways Before You Leave

Graph Neural Networks (GNNs) offer ways to perform various graph-related prediction/classification tasks. We can use GNNs make predictions about nodes, for example designate a node as a friend or fraud. We can use them to predict whether a link should exist between a pair of nodes or not in social networks to make friends' suggestions. And of course, we can use GNNs to perform classification of entire graphs with applications mentioned earlier.

Another useful library for deep learning with graphs is Deep Graph Library (DGL). You can find a graph classifier implementation using DGL at the following link, if you like. 

An Intro to Graph Convolutional Networks

Graph neural networks (GNNs) are deep learning networks that operate on graph data. These networks are increasingly getting popular as numerous real-world applications are easily modeled as graphs. Graphs are unlike images, text, time-series that are used in deep learning models. Graphs are of arbitrary size and complex topological structure. We represent graphs as a set of nodes and edges. In many instances, each node is associated with a feature vector. The adjacency matrix of a graph defines the presence of edges between the nodes. The ordering of nodes in a graph is arbitrary. These factors make it hard to use the existing deep learning architectures and call for an architecture suited to graphs as inputs.

Permutation Invariance Architecture

Since the nodes in a graph are arbitrarily ordered, it is possible that two adjacency matrices might be representing the same graph. So whatever architecture we plan for graph computation, it should be invariant to the ordering of nodes. This requirement is termed permutation invariance. Given a graph with adjacency matrix A1 and the corresponding nodes feature matrix X1 and another matrix pair of A2 and X2 representing the same graph but with different ordering of nodes, the permutation invariance implies that any function f that maps the graph to an embedding vector R of dimension d must satisfy the relationship f(A1,X1) = f(A2,X2), i.e. the function should yield the same embedding irrespective of how the nodes are numbered.

Message Passing Architecture

The deep learning architecture that works with graphs is based on message passing. The resulting deep learning model is also known as Graph Convolutional Networks (GCNs). The underlying graph for this architecture consists of an adjacency matrix and a matrix of node vectors. Each node in the graph starts by sharing its message, the associated node vector, with all its neighbors. Thus, every node receives one message per adjacent node as illustrated below.

Alt text

As each node receives messages from its neighbors, the node feature vectors for every node are updated. The GCN uses the following updating rule where $\bf{h}_u$ refers to the node vector for node u.

$\bf{h}_u^{(k+1)} = \sigma\left(\hat{\bf{D}}^{-1/2}\hat{\bf{A}}\hat{\bf{D}}^{-1/2}\bf{h}_u^{(k)}\bf{W}^{(k)}\right)$,

where $\bf{W}^{(k)}$ is the weight parameters that transforms the node features into messages $(\bf{H}^{(k)}\bf{W}^{(k)})$. We add the identity matrix to the adjacency matrix $\bf{A}$ so that each node sends its own message to itself also: $\hat{\bf{A}}=\bf{A}+\bf{I}$. In place of summing, we perform messages averaging via the degree matrix $\hat{\bf{D}}$. We can also represent the above updating rule through the following expression:

$\bf{h}_u^k = \sigma\Bigl(\bf{W}_k \sum_{v\in{N(u)\cup N(v))}} \frac{\bf{h}_v^{k-1}}{(|N(u)||N(v)|)^{-1/2}} \Bigl)$,  

where N(.) represents the number of neighbors of a node including itself. The denominator here is viewed as performing the symmetric normalization of the sum accumulated in the numerator. The normalization is important to avoid large aggregated values for nodes with many neighbors. A nonlinearity, denoted by σ, is used in the updating relationship. The ReLU is generally the nonlinearity used in GCNs.

A Simple Example to Illustrate Message Passing Computation

We will use the graph shown above for our illustration. The adjacency matrix of the graph is

The $\hat {\bf{A}}$ and $\hat {\bf{D}}$ matrices are then:

Let the matrix of node features be the following

The $\hat \bf{D}^{-1/2}$ matrix results in $\text{diag}(1/2^{1/2}, 1/4^{1/2}, 1/3^{1/2}, 1/3^{1/2})$. Assuming $\bf{W}$ to be an identity matrix of size 2, we carry out the multiplications to obtain the updated matrix of node feature vectors, before passing through the nonlinearity, as

[[0.707 1.560], [3.387 4.567], [3.910 4.866], [3.910 4.866]].

The graph convolution layer just described above limits the message passing only to immediate neighbors. With multiple such layers we have a GCN with  message passing between nodes beyond the local neighborhood. The figure below presents a visualization of a GCN with two layers of message passing. 

Training GCNs

How do we find the weights used in a graph neural network? This is done by defining a loss function and using SGD to find the optimal weight values. The choice of the loss function depends upon the task that the GCN is to be used for. We will consider node classification task for our discussion. I plan to cover the training for other tasks such as link prediction and graph classification in future posts.

Node classification implies assigning a label or category to a node based on a training set of nodes with known labels. Given $\bf{y}_u$ as the one-hot encoding of true label of node u and its node embedding output by the GNN as $\bf{z}_u$, the negative log-likelihood loss, defined below, is the most common choice to learn the weights:

$L =\sum_{u\in{\bf {V}_{\text{train}}}} - log(softmax(\bf{z}_u,\bf{y}_u)$

The above expression implies the presence of a set of training nodes. These training nodes are a subset of all nodes participating in message passing. The remaining nodes are used as test nodes to determine how well the learning has taken. It is important to remember that all nodes, whether training nodes or not , are involved in message passing to generate embeddings; only a subset of nodes marked as training nodes are used in the loss function to optimize the weight parameters. The nodes not designated as training nodes are called transductive test nodes. Since the transductive nodes do participate in embeddings generation, this setup is different from supervised learning and is termed as semi-supervised learning

Applying GNN for Node Classification

Let's now train a GCN for a node classification task. I am going to use the scikit-network library. You can find a brief introduction to this library at this blog post. The library is developed along the lines of the popular scikit-learning library for machine learning and thus provides a familiar environment to work with for graph machine learning.

Let's get started by importing all the necessary libraries.

# import necessary libraries
from IPython.display import SVG
import numpy as np
from scipy import sparse
from sknetwork.classification import get_accuracy_score
from sknetwork.gnn import GNNClassifier
from sknetwork.visualization import svg_graph

We will use the art_philo_science toy dataset. It consists of a selection of 30 Wikipedia articles with links between them. Each article is described by some words used in their summary, among a list of 11 words. Each article belongs to one of the following 3 categories: arts, philosophy or science. The goal is to retrieve the category of some articles (the test set) from the category of the other articles (the train set).

from import art_philo_science
graph = art_philo_science(metadata=True)
adjacency = graph.adjacency
features = graph.biadjacency
names = graph.names
names_features = graph.names_col
names_labels = graph.names_labels
labels_true = graph.labels
position = graph.position

Just for illustration, let's print the names of all the 30 nodes.

print(names)# These are nodes     
['Isaac Newton' 'Albert Einstein' 'Carl Linnaeus' 'Charles Darwin'
'Ptolemy' 'Gottfried Wilhelm Leibniz' 'Carl Friedrich Gauss'
'Galileo Galilei' 'Leonhard Euler' 'John von Neumann' 'Leonardo da Vinci'
'Richard Wagner' 'Ludwig van Beethoven' 'Bob Dylan' 'Igor Stravinsky'
'The Beatles' 'Wolfgang Amadeus Mozart' 'Richard Strauss' 'Raphael'
'Pablo Picasso' 'Aristotle' 'Plato' 'Augustine of Hippo' 'Thomas Aquinas'
'Immanuel Kant' 'Bertrand Russell' 'David Hume' 'René Descartes'
'John Stuart Mill' 'Socrates']

Let's also print the names of the features.

print(names_features)# These are the features.     
['contribution' 'theory' 'invention' 'time' 'modern' 'century' 'study'
 'logic' 'school' 'author' 'compose']

We will use a single hidden layer.

hidden_dim = 5
n_labels = 3
gnn = GNNClassifier(dims=[hidden_dim, n_labels],
print(gnn)# Details of the gnn   
    Convolution(layer_type: conv, out_channels: 5, activation: ReLu,
    use_bias: True, normalization: both, self_embeddings: True)
    Convolution(layer_type: conv, out_channels: 3, activation: Cross entropy,
    use_bias: True, normalization: both, self_embeddings: True))

We now select nodes for use as training nodes and start training.

# Training set
labels = labels_true.copy()
train_mask = np.random.random(size=len(labels)) < 0.5
labels[train_mask] = -1
# Training
labels_pred = gnn.fit_predict(adjacency, features, labels,
                    n_epochs=200, random_state=42, history=True)
In epoch   0, loss: 1.053, train accuracy: 0.462
In epoch  20, loss: 0.834, train accuracy: 0.692
In epoch  40, loss: 0.819, train accuracy: 0.692
In epoch  60, loss: 0.831, train accuracy: 0.692
In epoch  80, loss: 0.839, train accuracy: 0.692
In epoch 100, loss: 0.839, train accuracy: 0.692
In epoch 120, loss: 0.825, train accuracy: 0.692
In epoch 140, loss: 0.771, train accuracy: 0.769
In epoch 160, loss: 0.557, train accuracy: 1.000
In epoch 180, loss: 0.552, train accuracy: 1.000

We now test for accuracy on test nodes.

# Accuracy on test set
test_mask = ~train_mask
get_accuracy_score(labels_true[test_mask], labels_pred[test_mask]) 

The accuracy is 100%, not a surprise because it is a toy problem.

Issues with GCNs

One major limitation of GCNs discussed in this post is that these networks can generate embeddings only for fixed or static networks. Thus, the node classification task shown above cannot be used for evolving networks as exemplified by social networks. Another issue relates to computational load for large graphs with nodes having very large neighborhoods. In subsequent posts, we are going to look at some other GNN models. So be on a look out. 

Mapping Nodes to Vectors: An Intro to Node Embedding

In an earlier post, I had stated that the recent advances in Natural Language Processing (NLP) technology can be, to a large extent, attributed to the use of very high-dimensional vectors for language representation. These high-dimensional, 764 dimensions is common, vector representations are called embeddings and are aimed at capturing semantic meaning and relationships between linguistic items. Given that graphs are everywhere, it is not surprising to see the ideas of word and sentence embeddings being extended to graphs in the form of node embeddings. 

What are Node Embedding?

Node embeddings are encodings of the properties and relationships of nodes in a low-dimensional vector space. This enables nodes with similar properties or connectivity patterns to have similar vector representations. Using node embeddings can improve performance on various graph analytics tasks such as node classification, link prediction, and clustering.  

Methods for Node Embeddings

There are several common techniques for learning node embeddings:

- DeepWalk and Node2Vec use random walks on the graph to generate node sequences, which are then fed into a word2vec skip-gram model to get embeddings. The random walk strategies allow capturing both local and broader network structure.

- Graph convolutional networks like GCN, GAT learn embeddings by propagating and transforming node features across graph edges. The neural network architecture captures both feature information and graph structure.

- Matrix factorization methods like GraRep and HOPE factorize some matrix representation of the graph to produce embeddings. For example, factorizing the adjacency matrix or a higher-order matrix encoding paths.

We are going to look at Node2Vec embeddings.

Node2Vec Embedding Steps

Before going over the Node2Vec embedding steps, let's first try to understand random walks on graphs. The random walks are used to convert the relationships present in the graph to a representation similar to sentences. 

A random walk is a series of steps beginning from some node in the graph and moving to its neighboring nodes selected in a random manner. A random walk of length k means k random steps from an initially selected node. An example of a random walk of length 3 is shown below in Figure 1 where edges in green show an instance of the random walk from node marked X. Each random walk results in a sequence of nodes, analogous to sentences in a corpus.

Fig.1. An example of a random walk of length 3.

Using random walks, an overall pipeline of steps for generating node embeddings is shown below in Figure 2. The random walks method first generates a set of node sequence of specified length for every node. Next, the node sequences are passed on to a language embedding model that generates a low-dimensional, continuous vector representation using SkipGram model. These vectors can be then used to for any down-stream task using node similarities. It is also possible to map the embeddings to two-dimensional space for visualization using popular dimensionality reduction techniques.

Fig.2. Pipeline of steps for generating node embeddings

The DeepWalk node embedding method used random walks to gather node context information. Node2VEC instead uses biased random walks that provide better node context information. Both breadth-first and depth-first strategies are used in biased random walk to explore the neighborhood of a node. The biased random walk is carried out using two parameters, p and q. The parameter p models transition probabilities to return back to the previous node while the parameter  defines the “ratio” of BFS and DFS. Figure 3 below illustrates biased random walk. Starting with node v, it shows a walk of length 4. Blue arrows indicate the breadth-first search (BFS) directions and red arrows show the depth-first search (DFS) directions. The transition probabilities are governed by the search parameters as shown.


Example of Generating Embeddings

I will use Zachary's Karate Club dataset to illustrate the generation of node embeddings. It is a small dataset that has been widely used in studies of graph based methods. The embeddings will be generated using the publicly available implementation of   node2vec. 

# Import necessary libraries
import networkx as nx
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
from node2vec import Node2Vec
# Load Karate Club data
G = nx.karate_club_graph()
cols = ["blue" if G.nodes[n]["club"]=='Officer' else "red" for n in G.nodes()]
nx.draw_networkx(G, node_color=cols)
df_edges=nx.to_pandas_edgelist(G)  #creating an edge list data frame
node2vec = Node2Vec(G, dimensions=16, walk_length=30, num_walks=100, workers=4,p=1.0,q=0.5)
model.wv.save_word2vec_format('./karate.emb')#Save embeddings
#Create a dataframe of embeddings for visualization
df= pd.read_csv('karate.emb',sep=' ',skiprows=[0],header=None) # creating a dataframe from the embedded graph 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
33 0.092881 -0.116476 0.109677 0.216758 -0.216537 -0.273324 0.897161 0.099800 -0.199117 -0.012935 0.095815 -0.022268 -0.246529 0.166690 -0.315830 -0.080442
0 -0.103973 0.217445 -0.321607 -0.188096 0.191852 0.158175 0.668216 0.027499 0.438449 -0.021251 0.072351 0.126618 -0.257281 -0.223262 -0.590806 -0.194275
32 0.072306 -0.285375 0.167020 0.190507 -0.130808 -0.225752 0.810336 0.078936 -0.430512 0.188738 0.072464 -0.023813 -0.209342 -0.106820 -0.501649 0.008207
2 0.189961 0.159125 -0.195857 0.293849 -0.102961 -0.130784 0.730172 -0.042656 0.085806 -0.365449 -0.013965 0.187779 -0.158719 0.019433 -0.280343 -0.301981
1 0.099712 0.355814 -0.091135 0.015275 0.107660 0.123524 0.606924 0.004724 0.201826 -0.244784 0.389210 0.387045 -0.031511 -0.156609 -0.425399 -0.469062
Now, let's project embeddings to visualize.
df.columns = df.columns.astype(str)
transform = TSNE
trans = transform(n_components=2)
node_embeddings_2d = trans.fit_transform(df)
plt.figure(figsize=(7, 7))
    node_embeddings_2d[:, 0],
    node_embeddings_2d[:, 1], c = labels )
plt.title("{} visualization of node embeddings".format(transform.__name__))


The visualization shows three clusters of node embeddings. This implies possibly three communities in the data. This is different from the known structure of the Karate club data having two communities with one member with an ambiguous assignment. However, we need to keep in mind that no node properties were used while creating embeddings. Another point to note is that node2vec algorithm does have several parameters and their settings do impact the results.

The following paper provides an excellent review of different node embedding methods mentioned in this blog. You may want to check it out.