Integrated Knowledge Solutions: sentence embeddings


Mapping Nodes to Vectors: An Intro to Node Embedding

In an earlier post, I stated that the recent advances in Natural Language Processing (NLP) technology can be attributed, to a large extent, to the use of very high-dimensional vectors for language representation. These high-dimensional vector representations (768 dimensions is common) are called embeddings and are aimed at capturing semantic meaning and relationships between linguistic items. Given that graphs are everywhere, it is not surprising to see the ideas of word and sentence embeddings being extended to graphs in the form of node embeddings.

What are Node Embeddings?

Node embeddings are encodings of the properties and relationships of nodes in a low-dimensional vector space. This enables nodes with similar properties or connectivity patterns to have similar vector representations. Using node embeddings can improve performance on various graph analytics tasks such as node classification, link prediction, and clustering.

Methods for Node Embeddings

There are several common techniques for learning node embeddings:

- DeepWalk and Node2Vec use random walks on the graph to generate node sequences, which are then fed into a word2vec skip-gram model to get embeddings. The random walk strategies allow capturing both local and broader network structure.

- Graph neural networks such as GCN (graph convolutional network) and GAT (graph attention network) learn embeddings by propagating and transforming node features across graph edges. The neural network architecture captures both feature information and graph structure.

- Matrix factorization methods like GraRep and HOPE factorize some matrix representation of the graph to produce embeddings. For example, factorizing the adjacency matrix or a higher-order matrix encoding paths.

We are going to look at Node2Vec embeddings.

Node2Vec Embedding Steps

Before going over the Node2Vec embedding steps, let's first try to understand random walks on graphs. The random walks are used to convert the relationships present in the graph to a representation similar to sentences.

A random walk is a series of steps beginning from some node in the graph and moving to its neighboring nodes selected in a random manner. A random walk of length k means k random steps from an initially selected node. An example of a random walk of length 3 is shown below in Figure 1 where edges in green show an instance of the random walk from node marked X. Each random walk results in a sequence of nodes, analogous to sentences in a corpus.

Fig.1. An example of a random walk of length 3.
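The random walk described above can be sketched in a few lines of plain Python. This is only an illustration: the small graph, its node names, and the helper `random_walk` are made up for the example and are not part of the Karate Club data or of any library.

```python
import random

# Toy undirected graph as an adjacency list (hypothetical example data)
GRAPH = {
    "X": ["A", "B"],
    "A": ["X", "B", "C"],
    "B": ["X", "A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def random_walk(graph, start, length, rng):
    """Take `length` uniformly random steps from `start`; returns up to length + 1 nodes."""
    walk = [start]
    for _ in range(length):
        neighbors = graph[walk[-1]]
        if not neighbors:  # isolated node: nothing to step to
            break
        walk.append(rng.choice(neighbors))
    return walk

rng = random.Random(0)  # fixed seed so the walk is repeatable
walk = random_walk(GRAPH, "X", 3, rng)
print(walk)  # a list of 4 nodes beginning with "X"
```

Calling `random_walk` several times for every node yields the "sentences" that are later handed to the skip-gram model.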

Using random walks, an overall pipeline of steps for generating node embeddings is shown below in Figure 2. The random walks method first generates a set of node sequences of a specified length for every node. Next, the node sequences are passed on to a language embedding model that generates a low-dimensional, continuous vector representation using the SkipGram model. These vectors can then be used for any downstream task that relies on node similarities. It is also possible to map the embeddings to two-dimensional space for visualization using popular dimensionality reduction techniques.

Fig.2. Pipeline of steps for generating node embeddings

The DeepWalk node embedding method uses uniform random walks to gather node context information. Node2Vec instead uses biased random walks that provide better node context information. Both breadth-first and depth-first strategies are used in a biased random walk to explore the neighborhood of a node. The biased random walk is controlled by two parameters, p and q. The parameter p (the return parameter) governs the probability of returning to the previous node, while the parameter q (the in-out parameter) sets the balance between BFS-like and DFS-like exploration: q > 1 keeps the walk close to the previous node, and q < 1 pushes it outward. Figure 3 below illustrates a biased random walk. Starting with node v, it shows a walk of length 4. Blue arrows indicate the breadth-first search (BFS) directions and red arrows show the depth-first search (DFS) directions. The transition probabilities are governed by the search parameters as shown.

Fig.3.
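To make the role of p and q concrete, here is a minimal sketch of the biased step. It assumes the weighting from the node2vec paper (1/p for stepping back to the previous node, 1 for a node that is also a neighbor of the previous node, 1/q otherwise); the toy graph and the function names are hypothetical, and the node2vec package handles all of this internally.

```python
import random

# Toy undirected graph as an adjacency list (hypothetical example data)
GRAPH = {
    "X": ["A", "B"],
    "A": ["X", "B", "C"],
    "B": ["X", "A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def transition_weights(graph, prev, curr, p, q):
    """Unnormalized weights for leaving `curr`, having arrived from `prev`."""
    weights = {}
    for nxt in graph[curr]:
        if nxt == prev:            # step back to where we came from
            weights[nxt] = 1.0 / p
        elif nxt in graph[prev]:   # stays close to prev (BFS-like)
            weights[nxt] = 1.0
        else:                      # moves away from prev (DFS-like)
            weights[nxt] = 1.0 / q
    return weights

def biased_walk(graph, start, length, p, q, rng):
    """Walk of `length` steps; the first step is uniform, later steps are biased."""
    walk = [start]
    for _ in range(length):
        curr = walk[-1]
        if not graph[curr]:
            break
        if len(walk) == 1:
            walk.append(rng.choice(graph[curr]))
        else:
            w = transition_weights(graph, walk[-2], curr, p, q)
            walk.append(rng.choices(list(w), weights=list(w.values()))[0])
    return walk

weights = transition_weights(GRAPH, "X", "A", p=2.0, q=0.5)
print(weights)  # {'X': 0.5, 'B': 1.0, 'C': 2.0}
walk = biased_walk(GRAPH, "X", 4, p=2.0, q=0.5, rng=random.Random(0))
```

With q = 0.5, the node C, which is two hops from the previous node X, gets the largest weight, so the walk tends to move outward.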

Example of Generating Embeddings

I will use Zachary's Karate Club dataset to illustrate the generation of node embeddings. It is a small dataset that has been widely used in studies of graph based methods. The embeddings will be generated using the publicly available implementation of node2vec.

       
# Import necessary libraries
import networkx as nx
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
from node2vec import Node2Vec
# Load Karate Club data
G = nx.karate_club_graph()
cols = ["blue" if G.nodes[n]["club"]=='Officer' else "red" for n in G.nodes()]
nx.draw_networkx(G, node_color=cols)

       
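df_edges = nx.to_pandas_edgelist(G)  # edge list as a data frame (for inspection only)
# Set up the biased random walks: p is the return parameter, q the in-out parameter
node2vec = Node2Vec(G, dimensions=16, walk_length=30, num_walks=100, workers=4, p=1.0, q=0.5)
# Train the skip-gram model on the walks. The original listing omitted this step,
# so the window and min_count values here are assumed settings.
model = node2vec.fit(window=10, min_count=1)
model.wv.save_word2vec_format('./karate.emb')  # save embeddings
# Create a dataframe of embeddings for visualization
df = pd.read_csv('karate.emb', sep=' ', skiprows=[0], header=None)  # first row is a header line, so skip it
df.set_index(0, inplace=True)
df.index.name = 'node_id'
df.head()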

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
33	0.092881	-0.116476	0.109677	0.216758	-0.216537	-0.273324	0.897161	0.099800	-0.199117	-0.012935	0.095815	-0.022268	-0.246529	0.166690	-0.315830	-0.080442
0	-0.103973	0.217445	-0.321607	-0.188096	0.191852	0.158175	0.668216	0.027499	0.438449	-0.021251	0.072351	0.126618	-0.257281	-0.223262	-0.590806	-0.194275
32	0.072306	-0.285375	0.167020	0.190507	-0.130808	-0.225752	0.810336	0.078936	-0.430512	0.188738	0.072464	-0.023813	-0.209342	-0.106820	-0.501649	0.008207
2	0.189961	0.159125	-0.195857	0.293849	-0.102961	-0.130784	0.730172	-0.042656	0.085806	-0.365449	-0.013965	0.187779	-0.158719	0.019433	-0.280343	-0.301981
1	0.099712	0.355814	-0.091135	0.015275	0.107660	0.123524	0.606924	0.004724	0.201826	-0.244784	0.389210	0.387045	-0.031511	-0.156609	-0.425399	-0.469062

Now, let's project embeddings to visualize.

       
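from sklearn.manifold import TSNE
df.columns = df.columns.astype(str)
# Cluster the embeddings to color the plot. The original listing did not show how
# `labels` was computed; k-means with three clusters is an assumed choice.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)
transform = TSNE
trans = transform(n_components=2)
node_embeddings_2d = trans.fit_transform(df)
plt.figure(figsize=(7, 7))
plt.axes().set(aspect="equal")
plt.scatter(
    node_embeddings_2d[:, 0],
    node_embeddings_2d[:, 1], c=labels)
plt.title("{} visualization of node embeddings".format(transform.__name__))
plt.show()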

The visualization shows three clusters of node embeddings. This implies possibly three communities in the data. This is different from the known structure of the Karate club data having two communities with one member with an ambiguous assignment. However, we need to keep in mind that no node properties were used while creating embeddings. Another point to note is that node2vec algorithm does have several parameters and their settings do impact the results.

The following paper provides an excellent review of different node embedding methods mentioned in this blog. You may want to check it out.

Understanding Graph Embedding Methods and Their Applications, Mengjia Xu

Embeddings Beyond Words: Intro to Sentence Embeddings

It wouldn't be an exaggeration to say that the recent advances in Natural Language Processing (NLP) technology can be attributed, to a large extent, to the use of very high-dimensional vectors for language representation. These high-dimensional vector representations (768 dimensions is common) are called embeddings and are aimed at capturing semantic meaning and relationships between linguistic items.

Although the idea of using vector representation for words has been around for many years, the interest in word embedding took a quantum jump with Tomáš Mikolov’s Word2vec algorithm in 2013. Since then, many methods for generating word embeddings, for example GloVe and BERT, have been developed. Before moving on further, let's see briefly how word embedding methods work.

Word Embedding: How is it Performed?

I am going to explain how word embedding is done using the Word2vec method. This method uses a linear encoder-decoder network with a single hidden layer. The input layer of the encoder is set to have as many neurons as there are words in the vocabulary for training. The hidden layer size is set to the dimensionality of the resulting word vectors. The size of the output layer is same as the input layer. The input words to the encoder are encoded using one-hot vector encoding where the size of the vector corresponds to the vocabulary. The figure below shows the arrangement for learning embeddings.

The embeddings are learned by adjusting weights so that for a target word, say fox in a piece of text "The quick brown fox jumped over the fence", the probability for the designated context word, say jumped, is high. There are two major variations to this basic technique. In a variation known as the continuous bag of words (CBOW), multiple context words are used as input to predict the target word. Thus, the system may use brown, jumped and fence as the context words for fox. In another scheme, known as the skip-gram model, the use of target and context words is reversed. Thus, the target word is fed on the input side and the weights are modified to increase probabilities for the prediction of context words. In both of these cases, the above architecture needs modification. You can read details about the architecture changes as well as look at a simple example in the blog post that I did a while ago.
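The difference between the two variations is easiest to see in the training examples each one produces. The sketch below only builds those examples and does not train a network; the function names and the window size of 1 are choices made for this illustration.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: each (target, context) pair asks the model to predict context from target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """Continuous bag of words: the surrounding words jointly predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox jumped over the fence".split()
pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)
print(pairs[:3])    # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(examples[3])  # (['brown', 'jumped'], 'fox')
```

A larger window simply adds more distant words to the context of each target.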

Sentence Embeddings

While word embeddings are useful, we often work with longer text to perform tasks such as text classification, sentiment analysis, and topic detection. Thus, it is logical to extend the idea of word embeddings to sentences. One simple way to accomplish this is to take the average of the embeddings of the words in a sentence. However, such an approach doesn't take word order into account and thus results in vectors that aren't very good at capturing the sentence meaning. Instead, sentence embeddings are obtained by using transformer models such as BERT (Bidirectional Encoder Representations from Transformers), which make use of an attention mechanism to gauge the importance of different words in a sentence. BERT outputs a contextualized embedding for each token in the given input text. In order to create a fixed-sized sentence embedding out of this, a pooling step such as mean pooling is applied, i.e., the output embeddings for all tokens are averaged to yield a fixed-sized vector. Sentence-BERT, or simply SBERT, is a package that you can use to create sentence embeddings without worrying about pooling.
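Mean pooling itself is a small computation. The sketch below uses made-up two-dimensional token vectors and a hypothetical `mean_pool` helper; it also assumes the common practice of skipping padding positions through an attention mask, which SBERT-style libraries take care of for you.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; padding (mask 0) is ignored."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]

# Three token positions, the last one being padding (toy values)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

Whatever the number of tokens, the result always has the same dimensionality as a single token vector.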

One issue facing BERT/SBERT is that of encountering an out-of-vocabulary word, that is, a word that wasn't part of the vocabulary built from the training corpus. In such a case, an embedding for the word doesn't exist. BERT/SBERT solve this by using a WordPiece tokenizer, which breaks every word into one or more subword tokens. As an example, the word snowboarding may be tokenized into three tokens: snow, ##board, ##ing, where the ## prefix marks a piece that continues a word. This ensures that an embedding can be created for any new word. SBERT models also limit the input length: each model has a maximum sequence length (128 tokens is a common setting, though it varies by model), and tokens beyond the limit are simply discarded.
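The way a WordPiece-style tokenizer splits an unseen word can be sketched as a greedy longest-match search over a subword vocabulary. The tiny vocabulary below is invented for the example, and the function is a simplified stand-in rather than the tokenizer shipped with BERT; pieces that continue a word carry the "##" prefix.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of `word` into pieces found in `vocab`."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:              # not at the beginning of the word
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                   # try a shorter piece
        if piece is None:              # no piece fits: give up on the whole word
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary
VOCAB = {"snow", "board", "##board", "##ing", "##s"}
pieces = wordpiece_tokenize("snowboarding", VOCAB)
print(pieces)  # ['snow', '##board', '##ing']
```

Each piece has its own embedding, so a new word still receives a representation built from known pieces.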

Sentence Embedding Libraries

Other than SBERT, there are many libraries that one can use. Some of these are:

TensorFlow Hub - Provides pre-trained encoders like BERT and other transformer models. Makes it easy to generate sentence embeddings.

InferSent - Facebook AI research model for sentence embeddings trained on natural language inference data.

Universal Sentence Encoder (USE) - Google model trained on a variety of data sources to generate general purpose sentence embeddings.

Flair - NLP library with models like Flair embeddings trained on unlabeled data which can provide sentence representations.

Doc2Vec - Extension of Word2Vec that can learn embeddings for sentences and documents.

Skip-Thoughts - Unsupervised model trained to predict surrounding sentences based on context.

GenSim - Includes implementations of models like Doc2Vec for generating sentence and paragraph embeddings.

SentenceTransformers - Library for state-of-the-art sentence embeddings based on transformers. Includes pretrained models like BERT and RoBERTa.

The choice of model depends on your use case. For general purposes, pretrained universal encoders like USE and SBERT provide robust sentence vectors. For domain-specific tasks, fine-tuning transformer models like BERT often produces the best performance.

One word of caution while using embeddings. Never mix embeddings generated by two different libraries. Embeddings produced via each method/framework are unique to that method and the training corpus.

An Example of Sentence Embedding for Measuring Similarity

Let's take a look at using sentence embedding to capture semantic similarity between pairs of sentences. We will use SBERT for this purpose. First, we install and import the necessary libraries and decide upon the sentence transformer model to be used.

! pip install sentence-transformers

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')

Next, we specify the sentences that we are using.

sentences = [
    "The sky is blue and beautiful",
    "Love this blue and beautiful sky!",
    "The brown fox is quick and the blue dog is lazy!",
    "The dog is lazy but the brown fox is quick!",
    "the bees decided to have a mutiny against their queen",
    "the sign said there was road work ahead so she decided to speed up",
    "on a scale of one to ten, what's your favorite flavor of color?",
    "flying stinging insects rebelled in opposition to the matriarch"
]

embeddings = model.encode(sentences)
embeddings.shape

(8, 768)

So, the embedding results in eight vectors of 768 dimensions. Next, we import a utility from the sentence transformer library and compute cosine similarities between different pairs. Remember, a cosine similarity value close to one indicates a very high degree of similarity, while values near zero indicate almost no similarity.

from sentence_transformers import util
#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)
print(cos_sim)

tensor([[ 1.0000,  0.7390,  0.2219,  0.1689,  0.1008,  0.1191,  0.2174,  0.0628],
        [ 0.7390,  1.0000,  0.1614,  0.1152,  0.0218,  0.0713,  0.2854, -0.0181],
        [ 0.2219,  0.1614,  1.0000,  0.9254,  0.1245,  0.2171,  0.1068,  0.0962],
        [ 0.1689,  0.1152,  0.9254,  1.0000,  0.1018,  0.2463,  0.0463,  0.0706],
        [ 0.1008,  0.0218,  0.1245,  0.1018,  1.0000,  0.2005,  0.0153,  0.6084],
        [ 0.1191,  0.0713,  0.2171,  0.2463,  0.2005,  1.0000,  0.0116,  0.1011],
        [ 0.2174,  0.2854,  0.1068,  0.0463,  0.0153,  0.0116,  1.0000, -0.0492],
        [ 0.0628, -0.0181,  0.0962,  0.0706,  0.6084,  0.1011, -0.0492,  1.0000]])

Looking at the resulting similarity values, we see that the sentence#1 and sentence#2 pair has a high degree of similarity. Sentence#3 and sentence#4 also generate a very high value of cosine similarity. Interestingly, sentence#5 and sentence#8 are also deemed to have a good semantic similarity, although they do not share any descriptive words. Thus, the sentence embedding is doing a pretty good job of capturing sentence semantics.
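For reference, the cosine similarity reported above is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python version is sketched below; the helper name `cosine_sim` and the toy vectors are made up for the illustration, and in practice the library utility should be preferred.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between vectors u and v (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine_sim([1.0, 0.0], [-1.0, 0.0])        # opposite directions
print(round(same_direction, 4), orthogonal, opposite)
```

Because only the direction of the vectors matters, scaling an embedding does not change its similarity to other embeddings.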

Comparison with TF-IDF Vectorization

The Information Retrieval (IR) community has long represented text as vectors for matching documents. The approach, known as the bag-of-words model, uses a set of words or terms to characterize text. Each word or term is assigned a weight following the TF-IDF weighting scheme. In this scheme, the weight assigned to a word is based upon: (i) how often it appears in the document being vectorized, the term frequency (TF) component of the weighting scheme, and (ii) how rare the word is in the entire document collection, the inverse document frequency (IDF) component of the weighting scheme. The vector size is governed by the number of terms used from the entire document collection, i.e. the vocabulary size. You can read details about TF-IDF vectorization in this blog post.
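The weighting scheme can be worked through by hand on a tiny collection. The sketch below assumes the textbook form, weight = TF x log(N / DF), with whitespace tokenization; the three short documents and the helper `tfidf_vectors` are invented for the example. Library implementations such as scikit-learn's TfidfVectorizer use a smoothed IDF and normalize each vector, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts in this document
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

docs = ["blue sky", "blue dog", "lazy dog"]  # toy collection
vocab, vectors = tfidf_vectors(docs)
print(vocab)       # ['blue', 'dog', 'lazy', 'sky']
print(vectors[0])  # 'sky' (rare) gets a larger weight than 'blue' (common)
```

Terms absent from a document get weight zero, which is why two sentences with no words in common end up with a cosine similarity of zero.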

Let's see how well the TF-IDF vectorization captures similarities between documents in comparison with the sentence embedding. We will use the same set of sentences to perform vectorization and similarity calculations as shown below.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Unigrams and bigrams, with English stop words removed
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
tfidf = vectorizer.fit_transform(sentences)
similarity = cosine_similarity(tfidf, tfidf)
np.set_printoptions(precision=4)
print(similarity)

[[1.     0.5818 0.0962 0.     0.     0.     0.     0.    ]
 [0.5818 1.     0.0772 0.     0.     0.     0.     0.    ]
 [0.0962 0.0772 1.     0.7654 0.     0.     0.     0.    ]
 [0.     0.     0.7654 1.     0.     0.     0.     0.    ]
 [0.     0.     0.     0.     1.     0.0761 0.     0.    ]
 [0.     0.     0.     0.     0.0761 1.     0.     0.    ]
 [0.     0.     0.     0.     0.     0.     1.     0.    ]
 [0.     0.     0.     0.     0.     0.     0.     1.    ]]

Looking at the above results, we see that TF-IDF vectorization is unable to detect the similarity between sentence#5 and sentence#8, which the sentence embedding was able to pick up despite the absence of common descriptive words in the sentence pair.

Thus, TF-IDF vectorizer is good as long as there are shared descriptive words. But the sentence embedding is able to capture semantic similarities without even shared descriptive words. This is possible because the high-dimensional embedded vectors learn relationships between different words and their context during training and utilize those relationships during similarity computation as well as for other NLP tasks.

Now you might be wondering whether the embedding concept can be applied to images and graphs. The answer is yes and I hope to dwell on these in my future posts.
