Integrated Knowledge Solutions: DGL

Showing posts with label DGL. Show all posts

Graph Classification using Graph Neural Networks

Graph classification focuses on assigning labels or categories to entire graphs or networks. Unlike traditional classification tasks that deal with individual data instances, graph classification considers the entire graph structure, including nodes, edges and their properties. A graph classifier uses a mapping function that can accurately predict the class or label of an unseen graph based on its structural properties. The mapping function is learned during training using supervised learning.

Why Do We Need Graph Classification?

The importance of graph classification lies in graph data being ubiquitous in today's interconnected world. Graph based methods including graph classification have emerged as methodology of choice in numerous applications across various domains, including:

1. Bioinformatics: Classifying protein-protein interaction networks or gene regulatory networks can provide insights into disease mechanisms and aid in drug discovery. In fact, the most well-known success story of graph neural networks is the discovery of antibiotics to treat drug-resistant diseases, widely reported in early 2020.

2. Social Network Analysis: Categorizing social networks can help identify communities, detect anomalies (e.g., fake accounts), and understand information diffusion patterns.

3. Cybersecurity: Classifying computer networks can assist in detecting malicious activities, identifying vulnerabilities, and preventing cyber attacks.

4. Chemistry: Classifying molecular graphs can aid in predicting chemical properties, synthesizing new compounds, and understanding chemical reactions.

How Do We Build a Graph Classifier?

There are two main approaches that we can use to build graph classifiers: kernel-based methods and neural network-based methods.

1. Kernel-based Methods:

These methods rely on defining similarity measures (kernels) between pairs of graphs, which capture their structural and topological properties. Popular kernel-based methods include the random walk kernel, the shortest-path kernel, and the Weisfeiler-Lehman kernel. Once the kernel is defined, traditional kernel-based machine learning algorithms, such as Support Vector Machines (SVMs), can be applied for classification.

2. Neural Network- based Methods:

These methods typically involve learning low-dimensional representations (embeddings) of the graphs through specialized neural network architectures, such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs). The learned embeddings capture the structural information of the graphs and can be used as input to standard classifiers, like feed-forward neural networks or logistic regression models. For details on GCNs and node embeddings, please visit my earlier post on graph convolutional networks.

Both kernel-based and neural network-based methods have their strengths and weaknesses, and the choice depends on factors such as the size and complexity of the graphs, the availability of labeled data, and computational resources. Given that graph neural networks are getting more mileage, we will complete this blog post by going over steps needed for building a GNN classifier.

Steps for Building a GNN Classifier

Dataset

We are going to use the MUTAG dataset which is part of the TUDatasets, an extensive collection of graph datasets and easily accessible in PyTorch Geometric library for building graph applications. The MUTAG dataset is a small dataset of 188 graphs representing two classes of graphs. Each graph node is characterized by seven features. Two of the example graphs from this dataset are shown below.

Two example graphs from MUTAG dataset

We will use 150 graphs for training and the remaining 38 for testing. The division into the training and test sets is done using the available utilities in the PyTorch Geometric.

Mini-Batching

Due to the smaller graph sizes in the dataset, mini-batching of graphs is a desirable step in graph classification for better utilization of GPU resources. The mini-batching is done by diagonally stacking adjacency matrices of the graphs in a batch to create a giant graph that acts as an input to the GNN for learning. The node features of the graphs are concatenated to form the corresponding giant node feature vector. The idea of mini-batching is illustrated below.

Illustration of mini-batching

Graph Classifier

We are going to use a three-stage classifier for this task. The first stage will consists of generating node embeddings using a message-passing graph convolutional network (GCN). The second stage is an embeddings aggregation stage. This stage is also known as the readout layer. The function of this stage is to aggregate node embeddings into a single vector. We will simply take the average of node embeddings to create readout vectors. PyTorch Geometric has a built-in function for this purpose operating at the mini-batch level that we will use. The final stage is the actual classifier that looks at mapped/readout vectors to learn classification rule. In the present case, we will simply use a linear thresholding classifier to perform binary classification. We need to specify a suitable loss function so that the network can be trained to learn the proper weights.

A complete implementation of all of the steps described above is available at this Colab notebook. The results show that training a GCN with three convolutional layers results in test accuracy of 76% in just a few epochs.

Takeaways Before You Leave

Graph Neural Networks (GNNs) offer ways to perform various graph-related prediction/classification tasks. We can use GNNs make predictions about nodes, for example designate a node as a friend or fraud. We can use them to predict whether a link should exist between a pair of nodes or not in social networks to make friends' suggestions. And of course, we can use GNNs to perform classification of entire graphs with applications mentioned earlier.

Another useful library for deep learning with graphs is Deep Graph Library (DGL). You can find a graph classifier implementation using DGL at the following link, if you like.

Graph Embedding with GraphSAGE

Graph embedding refers to learning low-dimensional representations of nodes in a network; the representation encodes structural information and node features. Such a representation can then be used for various downstream machine learning tasks like node classification, link prediction, visualization, etc. The downstream tasks often work with dynamic or evolving networks such as social networks, recommendation networks etc. The majority of the graph embedding methods assume fixed graph structure and thus are unsuitable for dynamic networks. The GraphSAGE embedding method overcomes this limitation by incorporating two changes, sampling and aggregating features, in the graph convolutional networks (GCNs) that are used for fixed structure networks. These changes, explained below, make GraphSAGE not only computationally efficient to scale to very large graphs but also permit embeddings to be generated for those nodes of a graph not seen before. This inductive capability to work with unseen data makes GraphSAGE a highly valuable node embedding tool suitable for a large number of applications. In this blog post, I will highlight the aggregation and sampling used in GraphSAGE and provide an example of its usage. Before reading on, you may want to go over my post on graph convolutional networks.

How Does GraphSAGE Perform Aggregation?

We will assume each node has a feature vector associated with it. The associated feature vector with a node, for example, may be the profile of the person represented by the node. In absence of explicit features, structural features of nodes such as the node degree are used in GraphSAGE.

The key component of GraphSAGE algorithm is that embedding of a node is obtained by aggregating embeddings in a layer-wise manner by slowly expanding the local neighborhood of the node. At layer-0, the embedding of a node is simply its feature vector. Thus, the embedding of the target node A at layer-0 in the graph shown below is $\bf X_A$.

Figure 1. Input Graph for Embedding

The layer-1 embedding of a node is the aggregate embedding of all of its neighboring nodes away at one link. Similarly, the layer-2 embedding of a node is obtained by aggregating embeddings of its 2-neighborhood nodes. Figure 2 below shows this layer-wise embedding process for the target node A of the input graph of Figure 1.

Figure 2. Layer-wise Embedding of Target Node A

The actual aggregation is done by learning aggregation functions for different layers. The square boxes in Figure 2 denote these functions. The configuration of Figure 2 defines a graph convolutional neural network where the hidden layer outputs can be expressed as follows for layers 1-K.

For k = 2, we get the embedding vector $\bold{Z}_A for the target node in the above graph. The matrices W and B are the trainable matrices. These trainable matrices are learned by defining a suitable loss function. Both supervised and unsupervised learnings are possible.

While the above description for aggregation appears similar to that used in graph convolutional networks, the difference is that the aggregation function is learned during training in GraphSAGE and it is predefined and fixed in GCNs. It is this difference that makes GraphSAGE an inductive learner as opposed to GCNs being transductive learners.

Neighborhood Sampling in GraphSAGE

In graph convolution networks, every neighboring node of the target node at the specified neighborhood size participates in message passing and contributes towards the computation of the embedded vector of the target node. When the neighborhood size is enlarged, the number of nodes contributing to the embedded vector computation can grow exponentially for certain graphs. This problem gets exacerbated when the neighborhood of a target node includes a hub or celebrity node having millions of connections. To avoid exponential computation growth, GraphSAGE algorithm randomly selects only a sample of neighboring nodes. This allows GraphSAGE to be used for extremely large graphs with billions of nodes. You can see an illustration of neighborhood sampling in Figure 3 below where the No icon shows the nodes not being sampled.

Figure 3. Illustration of Neighborhood Sampling

Now, let's use GraphSage to generate node embeddings for link prediction.

Applying GraphSAGE to Perform Link Prediction

In link prediction, the objective is to predict whether a link exists between a pair of nodes. Link prediction has many applications; for example, it is used for recommending friends on social media networks.

We will use the Cora dataset, widely used in graph machine learning similar to the usage of MNIST dataset in deep learning. The Cora dataset consists of 2708 scientific publications classified into one of seven classes. Each publication is a node in the network, and is described by a 0/1-valued word vector of 1433 dimensions. The binary vector values represent the absence/presence of the corresponding word from the dictionary of 1433 unique words. The citation network has 10566 edges.

The implementation will be done using the Deep Graph Learning (DGL) library, built for easy implementation of graph neural network models on top of existing deep learning frameworks such as PyTorch, MXNet and TensorFlow. Each node of a network is represented by a unique integer, called its node ID, in DGL. An edge is represented by a pair of integers corresponding to the IDs of its end nodes. You will need to install DGL before proceeding any further.

We begin by importing all necessary libraries. We also specify PyTorch as the backend for DGL.

       
import itertools
import os
os.environ["DGLBACKEND"] = "pytorch"
import dgl
import dgl.data
import numpy as np
import scipy.sparse as sp
import torch
import torch.nn as nn
import torch.nn.functional as F%matplotlib inline

We are going to treat the link prediction problem as a binary classification problem. When a pair of nodes is connected, there is an edge linking them. We treat this as an example of positive class; the absence of a link between a node pair is treated as negative class example. A sample of positive and negative class examples will be extracted from Cora dataset with the rest of the data being treated as test data. We begin by loading the data. We pick 10% of the edges as positive class examples and the same number of edges as negative examples.

       
dataset = dgl.data.CoraGraphDataset()
g = dataset[0]
# Split edge set for training and testing
u, v = g.edges()

eids = np.arange(g.num_edges())
eids = np.random.permutation(eids)
test_size = int(len(eids) * 0.1)
train_size = g.num_edges() - test_size
test_pos_u, test_pos_v = u[eids[:test_size]], v[eids[:test_size]]
train_pos_u, train_pos_v = u[eids[test_size:]], v[eids[test_size:]]

# Find all negative edges and split them for training and testing
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())))
adj_neg = 1 - adj.todense() - np.eye(g.num_nodes())
neg_u, neg_v = np.where(adj_neg != 0)

neg_eids = np.random.choice(len(neg_u), g.num_edges())
test_neg_u, test_neg_v = (
    neg_u[neg_eids[:test_size]],
    neg_v[neg_eids[:test_size]],
)
train_neg_u, train_neg_v = (
    neg_u[neg_eids[test_size:]],
    neg_v[neg_eids[test_size:]],
)

Next, we define our GraphSAGE model. We will use two convolution layers.

# build a two-layer GraphSAGE model
class GraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, "mean")
        self.conv2 = SAGEConv(h_feats, h_feats, "mean")

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h

Before training, we need to convert our original graph to a subgraph without test nodes.

# Create a subgraph removing test nodes

train_g = dgl.remove_edges(g, eids[:test_size])

Now, the question is how do we perform link prediction using the embeddings that will be generated by our GraphSAGE model. Remember that link prediction operates on node pairs. Since we can describe a pair of nodes with an edge connecting them, we can set up our training and test data based on edges. To do so, we create four graphs: positive train graph, positive test graph, negative train graph, and negative test graph. All these graphs have same number of nodes as in the original data; however, the edges in each graph correspond to node pairs of positive and negative training and test examples. In DGL, we can do this by the following code.

# Create positive and negative train and test graphs

train_pos_g = dgl.graph((train_pos_u, train_pos_v), num_nodes=g.num_nodes())
train_neg_g = dgl.graph((train_neg_u, train_neg_v), num_nodes=g.num_nodes())

test_pos_g = dgl.graph((test_pos_u, test_pos_v), num_nodes=g.num_nodes())
test_neg_g = dgl.graph((test_neg_u, test_neg_v), num_nodes=g.num_nodes())

One way to predict the likelihood of a link between a pair of nodes is to compute the similarity of the respective node vectors. We will use the dot product to compute node similarities. The corresponding DGL code for this is as follows.

import dgl.function as fn

class DotPredictor(nn.Module):
    def forward(self, g, h):
        with g.local_scope():
            g.ndata["h"] = h
            # Compute a new edge feature named 'score' by a dot-product between the
            # source node feature 'h' and destination node feature 'h'.
            g.apply_edges(fn.u_dot_v("h", "h", "score"))
            # u_dot_v returns a 1-element vector for each edge so you need to squeeze it.
            return g.edata["score"][:, 0]

We are now ready to perform training. For training, we need to specify a loss function. We will use the binary cross-entropy loss. The performance evaluation metric is chosen as AUC (area under the curve). The following code snippets take care of the loss function and AUC.

model = GraphSAGE(train_g.ndata["feat"].shape[1], 16)
pred = DotPredictor()


def compute_loss(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat(
        [torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]
    )
    return F.binary_cross_entropy_with_logits(scores, labels)


def compute_auc(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score]).numpy()
    labels = torch.cat(
        [torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]
    ).numpy()
    return roc_auc_score(labels,scores)

We can now set up the training loop and begin training.

# We will use Adam optimizer
optimizer = torch.optim.Adam(
    itertools.chain(model.parameters(), pred.parameters()), lr=0.01)

# training loop
all_logits = []
for e in range(100):
    # forward
    h = model(train_g, train_g.ndata["feat"])
    pos_score = pred(train_pos_g, h)
    neg_score = pred(train_neg_g, h)
    loss = compute_loss(pos_score, neg_score)

    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if e % 5 == 0:
        print("In epoch {}, loss: {}".format(e, loss))

# Compute AUC
from sklearn.metrics import roc_auc_score

with torch.no_grad():
    pos_score = pred(test_pos_g, h)
    neg_score = pred(test_neg_g, h)
    print("AUC", compute_auc(pos_score, neg_score))

The result of training is shown below.

In epoch 0, loss: 0.7079574465751648
In epoch 5, loss: 0.6902216076850891
In epoch 10, loss: 0.6735929250717163
In epoch 15, loss: 0.6262624263763428
In epoch 20, loss: 0.5796554684638977
In epoch 25, loss: 0.555548369884491
In epoch 30, loss: 0.5190547108650208
In epoch 35, loss: 0.5012964606285095
In epoch 40, loss: 0.48360729217529297
In epoch 45, loss: 0.46244215965270996
In epoch 50, loss: 0.4450105130672455
In epoch 55, loss: 0.4255025088787079
In epoch 60, loss: 0.41155683994293213
In epoch 65, loss: 0.396867573261261
In epoch 70, loss: 0.3816315531730652
In epoch 75, loss: 0.36709561944007874
In epoch 80, loss: 0.3521302044391632
In epoch 85, loss: 0.33679577708244324
In epoch 90, loss: 0.32077470421791077
In epoch 95, loss: 0.3040713667869568
AUC 0.8397133936793872

The AUC value of 0.839 indicates that the model is performing well.

That's all for GraphSAGE. Let me know how you liked it.

Integrated Knowledge Solutions

Pages