Graph neural networks (GNNs) are deep learning models that operate on graph data. These networks are increasingly popular because numerous real-world applications are naturally modeled as graphs. Unlike the images, text, and time series commonly fed to deep learning models, graphs have arbitrary size and complex topological structure. We represent a graph as a set of nodes and edges; in many instances, each node is associated with a feature vector, and the adjacency matrix of the graph defines the presence of edges between the nodes. Moreover, the ordering of nodes in a graph is arbitrary. These factors make it hard to use existing deep learning architectures and call for an architecture suited to graphs as inputs.

### Permutation Invariance Architecture

Since the nodes of a graph can be ordered arbitrarily, two different adjacency matrices might represent the same graph. Whatever architecture we design for graph computation should therefore be invariant to the ordering of nodes. This requirement is termed *permutation invariance*. Given a graph with adjacency matrix $\bf{A}_1$ and corresponding node feature matrix $\bf{X}_1$, and another pair $\bf{A}_2$ and $\bf{X}_2$ representing the same graph but with a different ordering of nodes, permutation invariance implies that any function *f* mapping the graph to an embedding vector in $\bf{R}^d$ must satisfy

$f(\bf{A}_1, \bf{X}_1) = f(\bf{A}_2, \bf{X}_2)$,

i.e. the function should yield the same embedding irrespective of how the nodes are numbered.
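As a quick sanity check, the sketch below builds a small graph, relabels its nodes with a random permutation matrix, and confirms that an order-insensitive function (here, sum-pooling after one round of neighbor aggregation) yields the same output for both orderings. The graph, features, and function `f` are hypothetical, chosen only to illustrate the property:

```python
import numpy as np

# A toy 4-node graph and its node features
A1 = np.array([[0, 1, 0, 0],
               [1, 0, 1, 1],
               [0, 1, 0, 1],
               [0, 1, 1, 0]], dtype=float)
X1 = np.arange(8, dtype=float).reshape(4, 2)

# Relabel the nodes with a random permutation matrix P
rng = np.random.default_rng(0)
P = np.eye(4)[rng.permutation(4)]
A2 = P @ A1 @ P.T   # same graph, different node ordering
X2 = P @ X1

def f(A, X):
    # One round of neighbor aggregation (with self-loops),
    # then sum over nodes: the sum makes f order-independent.
    return ((A + np.eye(len(A))) @ X).sum(axis=0)

print(np.allclose(f(A1, X1), f(A2, X2)))  # True
```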

### Message Passing Architecture

The deep learning architecture that works with graphs is based on message passing. The resulting model is also known as a Graph Convolutional Network (GCN). The input graph for this architecture consists of an adjacency matrix and a matrix of node feature vectors. Each node starts by sharing its message, the associated node vector, with all its neighbors. Thus, every node receives one message per adjacent node, as illustrated below.

As each node receives messages from its neighbors, the node feature vectors are updated. The GCN uses the following updating rule, written here in matrix form, where row *u* of $\bf{H}^{(k)}$ is the feature vector $\bf{h}_u^{(k)}$ of node *u* at layer *k*:

$\bf{H}^{(k+1)} = \sigma\left(\hat{\bf{D}}^{-1/2}\hat{\bf{A}}\hat{\bf{D}}^{-1/2}\bf{H}^{(k)}\bf{W}^{(k)}\right)$,

where $\bf{W}^{(k)}$ is the weight matrix that transforms the node features into messages $(\bf{H}^{(k)}\bf{W}^{(k)})$. We add the identity matrix to the adjacency matrix $\bf{A}$ so that each node also sends a message to itself: $\hat{\bf{A}}=\bf{A}+\bf{I}$. Instead of simply summing the incoming messages, we average them using the degree matrix $\hat{\bf{D}}$ of $\hat{\bf{A}}$. The same updating rule can be written per node as:

$\bf{h}_u^{(k)} = \sigma\Bigl(\bf{W}^{(k)} \sum_{v\in N(u)} \frac{\bf{h}_v^{(k-1)}}{\sqrt{|N(u)|\,|N(v)|}} \Bigr)$,

where $N(u)$ denotes the set of neighbors of node *u*, including *u* itself, and $|N(u)|$ its size. The denominator can be viewed as performing a *symmetric normalization* of the sum accumulated in the numerator. This normalization is important to avoid large aggregated values for nodes with many neighbors. A nonlinearity, denoted by σ, completes the update; ReLU is the nonlinearity generally used in GCNs.
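In code, one message-passing layer amounts to a few matrix products. Below is a minimal NumPy sketch of the update rule; the function name and the use of ReLU are my own choices for illustration, not a library API:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(len(A))              # add self-loops
    d = A_hat.sum(axis=1)                   # degrees, self-loops included
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^{-1/2}
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Two connected nodes with identity features and weights:
# each node averages its own and its neighbor's feature vector.
A = np.array([[0., 1.], [1., 0.]])
print(gcn_layer(A, np.eye(2), np.eye(2)))  # each row is [0.5, 0.5]
```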

### A Simple Example to Illustrate Message Passing Computation

We will use the graph shown above for our illustration. The adjacency matrix of the graph is

Let the matrix of node features be the following

The matrix $\hat{\bf{D}}^{-1/2}$ works out to $\text{diag}(1/2^{1/2}, 1/4^{1/2}, 1/3^{1/2}, 1/3^{1/2})$. Assuming $\bf{W}$ to be an identity matrix of size 2, we carry out the multiplications to obtain the updated matrix of node feature vectors, before passing through the nonlinearity, as

$\begin{bmatrix} 0.707 & 1.560 \\ 3.387 & 4.567 \\ 3.910 & 4.866 \\ 3.910 & 4.866 \end{bmatrix}$.
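The adjacency and feature matrices appear as figures in the original post and are not reproduced here. The sketch below uses assumed matrices: a 4-node graph consistent with the stated degree matrix $\text{diag}(2, 4, 3, 3)$, and hypothetical node features chosen so that the computation reproduces the numbers quoted above (up to rounding):

```python
import numpy as np

# Assumed 4-node graph consistent with D_hat = diag(2, 4, 3, 3):
# node 1 -- node 2; node 2 -- nodes 3 and 4; node 3 -- node 4.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# Hypothetical node features (chosen to match the quoted result)
X = np.array([[0., 1.], [2., 3.], [4., 5.], [6., 7.]])

A_hat = A + np.eye(4)                                 # self-loops
D_inv_sqrt = np.diag(1 / np.sqrt(A_hat.sum(axis=1)))  # D^{-1/2}
H = D_inv_sqrt @ A_hat @ D_inv_sqrt @ X               # W = identity
print(np.round(H, 3))  # matches the matrix above up to rounding
```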

The graph convolution layer just described limits message passing to immediate neighbors. By stacking multiple such layers, we obtain a GCN in which information propagates between nodes beyond the local neighborhood. The figure below presents a visualization of a GCN with two layers of message passing.
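To make the stacking concrete, here is a minimal NumPy sketch of a two-layer forward pass; the graph, feature dimensions, and randomly initialized weights are all hypothetical:

```python
import numpy as np

def normalize(A):
    # Symmetrically normalized adjacency with self-loops:
    # D^{-1/2} (A + I) D^{-1/2}
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * np.outer(d_inv_sqrt, d_inv_sqrt)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = rng.normal(size=(4, 2))    # node features
W1 = rng.normal(size=(2, 5))   # hypothetical layer-1 weights
W2 = rng.normal(size=(5, 3))   # hypothetical layer-2 weights

N = normalize(A)
H1 = np.maximum(0, N @ X @ W1)   # layer 1: each node sees 1-hop neighbors
H2 = np.maximum(0, N @ H1 @ W2)  # layer 2: information from 2 hops away
print(H2.shape)  # (4, 3)
```

After two layers, each node's embedding depends on its 2-hop neighborhood, which is why deeper GCNs have larger receptive fields.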

### Training GCNs

How do we find the weights used in a graph neural network? We define a loss function and use stochastic gradient descent (SGD) to find the optimal weight values. The choice of loss function depends upon the task the GCN is to be used for. We will consider the *node classification* task for our discussion; I plan to cover training for other tasks, such as link prediction and graph classification, in future posts.

Node classification means assigning a label or category to a node, based on a training set of nodes with known labels. Given $\bf{y}_u$ as the one-hot encoding of the true label of node *u*, and its node embedding output by the GNN as $\bf{z}_u$, the negative log-likelihood loss, defined below, is the most common choice for learning the weights:

$L = -\sum_{u\in{\bf{V}_{\text{train}}}} \log\left(\text{softmax}(\bf{z}_u)^\top \bf{y}_u\right)$,

where the softmax converts the embedding $\bf{z}_u$ into class probabilities and the inner product with the one-hot vector $\bf{y}_u$ picks out the probability of the true class.
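To make the loss concrete, here is a small NumPy sketch; the logits $\bf{z}_u$ for three training nodes are made-up numbers for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# Hypothetical GNN outputs (logits) for 3 training nodes, 3 classes
Z = np.array([[2.0, 0.5, 0.1],
              [0.2, 1.7, 0.3],
              [0.1, 0.4, 2.2]])
Y = np.eye(3)  # one-hot labels: node i has true class i

# L = -sum_u log(softmax(z_u) . y_u): the dot product with the
# one-hot label selects the predicted probability of the true class.
loss = -sum(np.log(softmax(z) @ y) for z, y in zip(Z, Y))
print(loss)
```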

The above expression implies the presence of a set of training nodes, which are a subset of all nodes participating in message passing. The remaining nodes are used as test nodes to determine how well the learning has worked. It is important to remember that all nodes, whether designated as training nodes or not, are involved in message passing to generate embeddings; only the subset marked as training nodes enters the loss function used to optimize the weight parameters. The nodes not designated as training nodes are called *transductive test nodes*. Since the transductive nodes do participate in embedding generation, this setup differs from supervised learning and is termed *semi-supervised learning*.

### Applying GNN for Node Classification

Let's now train a GCN for a node classification task. I am going to use the scikit-network library. You can find a brief introduction to this library at this blog post. The library is developed along the lines of the popular scikit-learn library for machine learning and thus provides a familiar environment for graph machine learning.

Let's get started by importing all the necessary libraries.

```
# import necessary libraries
from IPython.display import SVG
import numpy as np
from scipy import sparse
from sknetwork.classification import get_accuracy_score
from sknetwork.gnn import GNNClassifier
from sknetwork.visualization import svg_graph
```

We will use the art_philo_science toy dataset. It consists of a selection of 30 Wikipedia articles with links between them. Each article is described by the words used in its summary, drawn from a list of 11 words. Each article belongs to one of 3 categories: arts, philosophy, or science. The goal is to retrieve the category of some articles (the test set) from the category of the other articles (the train set).

```
from sknetwork.data import art_philo_science
graph = art_philo_science(metadata=True)
adjacency = graph.adjacency
features = graph.biadjacency
names = graph.names
names_features = graph.names_col
names_labels = graph.names_labels
labels_true = graph.labels
position = graph.position
```

Just for illustration, let's print the names of all the 30 nodes.

`print(names)  # These are the nodes`

```
['Ptolemy' 'Gottfried Wilhelm Leibniz' 'Carl Friedrich Gauss'
 'Galileo Galilei' 'Leonhard Euler' 'John von Neumann' 'Leonardo da Vinci'
 'Richard Wagner' 'Ludwig van Beethoven' 'Bob Dylan' 'Igor Stravinsky'
 'The Beatles' 'Wolfgang Amadeus Mozart' 'Richard Strauss' 'Raphael'
 'Pablo Picasso' 'Aristotle' 'Plato' 'Augustine of Hippo' 'Thomas Aquinas'
 'Immanuel Kant' 'Bertrand Russell' 'David Hume' 'René Descartes'
 'John Stuart Mill' 'Socrates']
```

Let's also print the names of the features.

`print(names_features)  # These are the features`

`['contribution' 'theory' 'invention' 'time' 'modern' 'century' 'study' 'logic' 'school' 'author' 'compose']`

We will use a single hidden layer.

```
hidden_dim = 5
n_labels = 3
gnn = GNNClassifier(dims=[hidden_dim, n_labels],
                    layer_types='Conv',
                    activations='ReLu',
                    verbose=True)
```

`print(gnn)  # Details of the gnn`

```
GNNClassifier(
    Convolution(layer_type: conv, out_channels: 5, activation: ReLu,
                use_bias: True, normalization: both, self_embeddings: True)
    Convolution(layer_type: conv, out_channels: 3, activation: Cross entropy,
                use_bias: True, normalization: both, self_embeddings: True))
```

We now select nodes for use as training nodes and start training.

```
# Training set: keep the labels of about half of the nodes
# and hide the rest by marking them with -1
labels = labels_true.copy()
np.random.seed(42)
train_mask = np.random.random(size=len(labels)) < 0.5
labels[~train_mask] = -1

# Training
labels_pred = gnn.fit_predict(adjacency, features, labels,
                              n_epochs=200, random_state=42,
                              history=True)
```

```
In epoch 0, loss: 1.053, train accuracy: 0.462
In epoch 20, loss: 0.834, train accuracy: 0.692
In epoch 40, loss: 0.819, train accuracy: 0.692
In epoch 60, loss: 0.831, train accuracy: 0.692
In epoch 80, loss: 0.839, train accuracy: 0.692
In epoch 100, loss: 0.839, train accuracy: 0.692
In epoch 120, loss: 0.825, train accuracy: 0.692
In epoch 140, loss: 0.771, train accuracy: 0.769
In epoch 160, loss: 0.557, train accuracy: 1.000
In epoch 180, loss: 0.552, train accuracy: 1.000
```

We now test for accuracy on test nodes.

```
# Accuracy on test set
test_mask = ~train_mask
get_accuracy_score(labels_true[test_mask], labels_pred[test_mask])
```

`1.0`

The accuracy is 100%, which is no surprise given that this is a toy problem.

## Issues with GCNs

One major limitation of the GCNs discussed in this post is that they can generate embeddings only for fixed, static graphs. Thus, the node classification approach shown above cannot be applied to evolving networks, such as social networks. Another issue is the computational load for large graphs whose nodes have very large neighborhoods. In subsequent posts, we are going to look at some other GNN models that address these issues. So be on the lookout.