Mapping Nodes to Vectors: An Intro to Node Embedding

In an earlier post, I had stated that the recent advances in Natural Language Processing (NLP) technology can be, to a large extent, attributed to the use of very high-dimensional vectors for language representation. These high-dimensional, 764 dimensions is common, vector representations are called embeddings and are aimed at capturing semantic meaning and relationships between linguistic items. Given that graphs are everywhere, it is not surprising to see the ideas of word and sentence embeddings being extended to graphs in the form of node embeddings. 

What are Node Embedding?

Node embeddings are encodings of the properties and relationships of nodes in a low-dimensional vector space. This enables nodes with similar properties or connectivity patterns to have similar vector representations. Using node embeddings can improve performance on various graph analytics tasks such as node classification, link prediction, and clustering.  

Methods for Node Embeddings

There are several common techniques for learning node embeddings:

- DeepWalk and Node2Vec use random walks on the graph to generate node sequences, which are then fed into a word2vec skip-gram model to get embeddings. The random walk strategies allow capturing both local and broader network structure.

- Graph convolutional networks like GCN, GAT learn embeddings by propagating and transforming node features across graph edges. The neural network architecture captures both feature information and graph structure.

- Matrix factorization methods like GraRep and HOPE factorize some matrix representation of the graph to produce embeddings. For example, factorizing the adjacency matrix or a higher-order matrix encoding paths.

We are going to look at Node2Vec embeddings.

Node2Vec Embedding Steps

Before going over the Node2Vec embedding steps, let's first try to understand random walks on graphs. The random walks are used to convert the relationships present in the graph to a representation similar to sentences. 

A random walk is a series of steps beginning from some node in the graph and moving to its neighboring nodes selected in a random manner. A random walk of length k means k random steps from an initially selected node. An example of a random walk of length 3 is shown below in Figure 1 where edges in green show an instance of the random walk from node marked X. Each random walk results in a sequence of nodes, analogous to sentences in a corpus.

Fig.1. An example of a random walk of length 3.

Using random walks, an overall pipeline of steps for generating node embeddings is shown below in Figure 2. The random walks method first generates a set of node sequence of specified length for every node. Next, the node sequences are passed on to a language embedding model that generates a low-dimensional, continuous vector representation using SkipGram model. These vectors can be then used to for any down-stream task using node similarities. It is also possible to map the embeddings to two-dimensional space for visualization using popular dimensionality reduction techniques.


Fig.2. Pipeline of steps for generating node embeddings

The DeepWalk node embedding method used random walks to gather node context information. Node2VEC instead uses biased random walks that provide better node context information. Both breadth-first and depth-first strategies are used in biased random walk to explore the neighborhood of a node. The biased random walk is carried out using two parameters, p and q. The parameter p models transition probabilities to return back to the previous node while the parameter  defines the “ratio” of BFS and DFS. Figure 3 below illustrates biased random walk. Starting with node v, it shows a walk of length 4. Blue arrows indicate the breadth-first search (BFS) directions and red arrows show the depth-first search (DFS) directions. The transition probabilities are governed by the search parameters as shown.

Fig.3.

Example of Generating Embeddings

I will use Zachary's Karate Club dataset to illustrate the generation of node embeddings. It is a small dataset that has been widely used in studies of graph based methods. The embeddings will be generated using the publicly available implementation of   node2vec. 

       
# Import necessary libraries
import networkx as nx
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
from node2vec import Node2Vec
# Load Karate Club data
G = nx.karate_club_graph()
cols = ["blue" if G.nodes[n]["club"]=='Officer' else "red" for n in G.nodes()]
nx.draw_networkx(G, node_color=cols)
       
df_edges=nx.to_pandas_edgelist(G)  #creating an edge list data frame
node2vec = Node2Vec(G, dimensions=16, walk_length=30, num_walks=100, workers=4,p=1.0,q=0.5)
model.wv.save_word2vec_format('./karate.emb')#Save embeddings
#Create a dataframe of embeddings for visualization
df= pd.read_csv('karate.emb',sep=' ',skiprows=[0],header=None) # creating a dataframe from the embedded graph 
df.set_index(0,inplace=True)
df.index.name='node_id'
df.head()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
node_id
33 0.092881 -0.116476 0.109677 0.216758 -0.216537 -0.273324 0.897161 0.099800 -0.199117 -0.012935 0.095815 -0.022268 -0.246529 0.166690 -0.315830 -0.080442
0 -0.103973 0.217445 -0.321607 -0.188096 0.191852 0.158175 0.668216 0.027499 0.438449 -0.021251 0.072351 0.126618 -0.257281 -0.223262 -0.590806 -0.194275
32 0.072306 -0.285375 0.167020 0.190507 -0.130808 -0.225752 0.810336 0.078936 -0.430512 0.188738 0.072464 -0.023813 -0.209342 -0.106820 -0.501649 0.008207
2 0.189961 0.159125 -0.195857 0.293849 -0.102961 -0.130784 0.730172 -0.042656 0.085806 -0.365449 -0.013965 0.187779 -0.158719 0.019433 -0.280343 -0.301981
1 0.099712 0.355814 -0.091135 0.015275 0.107660 0.123524 0.606924 0.004724 0.201826 -0.244784 0.389210 0.387045 -0.031511 -0.156609 -0.425399 -0.469062
Now, let's project embeddings to visualize.
       
df.columns = df.columns.astype(str)
transform = TSNE
trans = transform(n_components=2)
node_embeddings_2d = trans.fit_transform(df)
plt.figure(figsize=(7, 7))
plt.axes().set(aspect="equal")
plt.scatter(
    node_embeddings_2d[:, 0],
    node_embeddings_2d[:, 1], c = labels )
plt.title("{} visualization of node embeddings".format(transform.__name__))
plt.show()

 

The visualization shows three clusters of node embeddings. This implies possibly three communities in the data. This is different from the known structure of the Karate club data having two communities with one member with an ambiguous assignment. However, we need to keep in mind that no node properties were used while creating embeddings. Another point to note is that node2vec algorithm does have several parameters and their settings do impact the results.

The following paper provides an excellent review of different node embedding methods mentioned in this blog. You may want to check it out.

Google's Bard Can Code and Compute for You

Large language models (LLMs) continue to fascinate us with their capabilities to answer our questions, generate presentations and essays for us and many other assorted tasks. These models are also good at generating code for user specified tasks. However, almost all of them do not run the code for us; they simply give us the code that we can copy and execute.  Recently, Google has given its large language model, Bard, the computational capabilities as well. Bard thus not only provides the code but also executes it while answering user's questions.




I wanted to check this feature of Bard. Below is what happened when I asked Bard a question that involved some computation.




























Not only generating the code for entropy calculation and running it, Bard went on to explain entropy and its answer.



Google characterizes computing by Bard in response to user questions as "writing code on the fly" method. The company says, "So far, we've seen this method improve the accuracy of Bard’s responses to computation-based word and math problems in our internal challenge datasets by approximately 30%." 

The "writing code on the fly by Bard" is not limited to word or math problems; you can ask Bard to do coding and execution for larger problems too. As an example, I asked Bard to generate and execute code for building a regression model with my data being specified via a URL. I was pleasantly surprised to see Bard execute the code, get the results including the plots. You can read about my this interaction with Bard at this post.






















Exploring Canonical Correlation Analysis (CCA): Uncovering Hidden Relationships

Canonical Correlation Analysis (CCA) is a statistical technique that enables us to uncover hidden associations between two sets of variables. Whether it's in the fields of psychology, economics, genetics, marketing or machine learning, CCA proves to be a powerful tool for gaining valuable insights. In this blog post, we will try to understand CCA. But first let’s take a look at two sets of observations, X and Y, shown below. These two sets of observations are made on the same set of objects and each observation represents a different variable.


Let’s calculate pairwise correlation between the column vectors of X and Y. The resulting correlation values should give us some insight between the two sets of measurements. These values are shown below where the entry at (i,j) represents the correlation between the i-th column of X and the j-th column of Y.

The correlation values show moderate to almost no correlation between the columns of the two datasets except a relatively higher correlation between the second column of X and the third column of Y.

Is There a Hidden Relationship?

It looks like there is not much of a relationship between X and Y. But wait! Lets transform X and Y into one-dimensional arrays, a and b, using the vectors [-0.427 -0.576 0.696] and [0 0 1].

a = X[-0.427 -0.576 0.696]T

b = Y[0 0 1]T

Now, let's calculate the correlation between a and b. Wow! we get a correlation value of -0.999, meaning that the two projections of X and Y are very strongly correlated. In other words, there is a very strong hidden relationship present in our two sets of observations. An obvious question on your part at this stage is “how did you get the two vectors above used for getting a and b?” The answer to this is the canonical correlation analysis. 

What is Canonical Correlation Analysis?

Canonical correlation analysis is the problem of finding pairs of basis vectors for two sets of variables X and Y such that the correlation between the projections of the variables onto these basis vectors are mutually maximized. The number of pairs of such basis vectors is limited to the smallest dimensionality of X and Y. Assume $\bf{w}_x$ and $\bf{w}_y$ be the pair of basis vectors that projects X and Y into a and b given by $\bf a =  X w{_x}$, and $\bf b =  Y w{_y}$. The projections a and b are called the scores or the canonical variates. The correlation between the projections, after some algebraic manipulation, can be expressed as:


$\Large \rho = \frac{\bf{w}_{x}^T \bf{C}_{xy}\bf{w}_{y}}{\sqrt{\bf{w}_{x}^T \bf{C}_{xx}\bf{w}_{x}\bf{w}_{y}^T \bf{C}_{yy}\bf{w}_{y}}}$,

where Cxx, Cxy and Cyy  are three covariance matrices. The canonical correlations between X and Y are found by solving the eigenvalue equations

$ \bf{C}_{xx}^{-1}\bf{C}_{xy}\bf{C}_{yy}^{-1}\bf{C}_{yx}\bf{w}_x = \rho^2 \bf{w}_x$

$ \bf{C}_{yy}^{-1}\bf{C}_{yx}\bf{C}_{xx}^{-1}\bf{C}_{xy}\bf{w}_y = \rho^2 \bf{w}_y$

The eigenvalues in the above solution correspond to the squared canonical correlations and the corresponding eigenvectors yield the needed basis vectors. The number of non-zero solutions to these equations are limited to the smallest dimensionality of X and Y.

CCA Example

Let’s take a look at an example using the wine dataset from the sklearn library. We will divide the13 features of the dataset into X and Y sets of observations. The class labels in our example will act as hidden or latent feature. First, we will load the data, split it into X and Y and perform feature normalization.

from sklearn.datasets import load_wine
import numpy as np
wine = load_wine()
X = wine.data[:, :6]# Form X using first six features
Y = wine.data[:, 6:]# Form Y using the remaining seven features
# Perform feature normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
Y = scaler.fit_transform(Y)

Next, we import CCA object and fit the data. After that we obtain the canonical variates. In the code below, we are calculating 3 projections, X_c and Y_c, each for X and Y.

from sklearn.cross_decomposition import CCA
cca = CCA(n_components=3)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)
We can now calculate the canonical correlation coefficients to see what correlation values are obtained.
cca_corr = np.corrcoef(X_c.T, Y_c.T).diagonal(offset=3)
print(cca_corr)

[0.90293514 0.73015495 0.51667522]

The highest canonical correlation value is 0.9029, indicating a strong hidden relationship between the two sets of vectors. Let us now try to visualize whether these correlations have captured any hidden relationship or not. In the present example, the underlying latent information not available to CCA is the class-membership of different measurements in X and Y. To check this, I have plotted the scatter plots of the three sets of x-y canonical variates where each variate pair is colored using the class label not accessible to CCA. These plots are shown below. It is clear that the canonical variates associated with the highest correlation coefficient show the existence of three groups in the scatter plot. This means that CCA is able to discern the presence of a hidden variable that reflects the class membership of the different observations.


Summary

Canonical Correlation Analysis (CCA) is a valuable statistical technique that enables us to uncover hidden relationships between two sets of variables. By identifying the most significant patterns and correlations, CCA helps gain valuable insights with numerous potential applications. CCA can be also used for dimensionality reduction. In machine and deep learning, CCA has been used for cross-modal learning and cross-modal retrieval.