Embeddings Beyond Words: Intro to Sentence Embeddings

It wouldn't be an exaggeration to say that the recent advances in Natural Language Processing (NLP) technology can be, to a large extent, attributed to the use of very high-dimensional vectors for language representation. These high-dimensional, 764 dimensions is common, vector representations are called embeddings and are aimed at capturing semantic meaning and relationships between linguistic items.

Although the idea of using vector representation for words has been around for many years, the interest in word embedding took a quantum jump with Tomáš Mikolov’s Word2vec algorithm in 2013. Since then, many methods for generating word embeddings, for example GloVe and BERT, have been developed. Before moving on further, let's see briefly how word embedding methods work.

Word Embedding: How is it Performed?

I am going to explain how word embedding is done using the Word2vec method. This method uses a linear encoder-decoder network with a single hidden layer. The input layer of the encoder is set to have as many neurons as there are words in the vocabulary for training. The hidden layer size is set to the dimensionality of the resulting word vectors. The size of the output layer is same as the input layer. The input words to the encoder are encoded using one-hot vector encoding where the size of the vector corresponds to the vocabulary. The figure below shows the arrangement for learning embeddings.
















The embeddings are learned by adjusting weights so that for a target word, say fox in a piece of text "The quick brown fox jumped over the fence", the probability for the designated context word, say jumped is high. There are two major variations to this basic technique. In a variation known as the continuous bag of words (CBW), multiple context words are used. Thus, the system may use brown, jumped and fence as the context words. In another scheme, known as the skip-gram model, the use of target and context words is reversed. Thus, the target word is fed on the input side and the weights are modified to increase probabilities for the prediction of context words. In both of these cases, the above architecture needs modification. You can read details about the architecture changes as well as look at a simple example in the blog post that I did a while ago.

Sentence Embeddings

While word embeddings are useful, we are often working with text to perform tasks such as text classification, sentiment analysis, and topic detection etc. Thus, it would be logical to extend the idea of word embeddings to sentences.  One simple way to accomplish this is to take the average of embeddings of different words in a sentence. However, such an approach doesn't take into account the word order and thus results in vectors that aren't very good at capturing the sentence meaning. Instead, the sentence embeddings are obtained by using transformer models such BERT (Bidirectional Encoder Representations from Transformers) which make use of attention mechanism to gauge the importance of different words in a sentence. BERT outputs for each token in the given input text its contextualized embedding. In order to create a fixed-sized sentence embedding out of this, the model applies mean pooling, i.e., the output embeddings for all tokens are pooled to yield a fixed-sized vector. The Sentence-BERT or simply SBERT is a package that you can use to create sentence embeddings without worrying about pooling. 

One issue facing BERT/SBERT is that of encountering an out of vocabulary word, that is a word that wasn't part of the text corpus used to train BERT. In such a case, an embedding for such a word doesn't exist. BERT/SBERT solve this by using a WordPiece tokenizer which breaks every word into one or more tokens. As an example, the word snowboarding will be tokenized through three tokens: snow, board, ing. This ensures embedding being created for any new word. SBERT permits creating a single vector embedding for sequences containing no more than 128 tokens. Sequence tokens beyond 128 are simply discarded.

Sentence Embedding Libraries

Other than SBERT, there are many libraries that one can use. Some of these are:

  • TensorFlow Hub - Provides pre-trained encoders like BERT and other transformer models. Makes it easy to generate sentence embeddings.
  • InferSent - Facebook AI research model for sentence embeddings trained on natural language inference data.
  • Universal Sentence Encoder (USE) - Google model trained on a variety of data sources to generate general purpose sentence embeddings.
  • Flair - NLP library with models like Flair embeddings trained on unlabeled data which can provide sentence representations.
  • Doc2Vec - Extension of Word2Vec that can learn embeddings for sentences and documents.
  • Stanford SkipThoughts - Unupervised model trained to predict surrounding sentences based on context.
  • GenSim - Includes implementations of models like Doc2Vec for generating sentence and paragraph embeddings.
  • SentenceTransformers - Library for state-of-the-art sentence embeddings based on transformers. Includes pretrained models like BERT and RoBERTa.

The choice of model depends on your use case. For general purposes, pretrained universal encoders like USE and SBERT provide robust sentence vectors. For domain-specific tasks, fine-tuning transformer models like BERT often produces the best performance.

One word of caution while using embeddings. Never mix embeddings generated by two different libraries. Embeddings produced via each method/framework are unique to that method and the training corpus.

An Example of Sentence Embedding for Measuring Similarity

Let's take a look at using sentence embedding to capture semantic similarity between pairs of sentences. We will use SBERT for this purpose. First, we install and import the necessary libraries and decide upon the sentence transformer model to be used.

! pip install sentence-transformers

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')

Next, we specify the sentences that we are using.

sentences = [
"The sky is blue and beautiful",
"Love this blue and beautiful sky!",
"The brown fox is quick and the blue dog is lazy!",
"The dog is lazy but the brown fox is quick!",
"the bees decided to have a mutiny against their queen",
"the sign said there was road work ahead so she decided to speed up",
"on a scale of one to ten, what's your favorite flavor of color?",
"flying stinging insects rebelled in opposition to the matriarch"
]

embeddings = model.encode(sentences)
embeddings.shape

(8, 768)

So, the embedding results in eight vectors of 768 dimensions. Next, we import a utility from sentence transformer library and compute cosine similarities between different pairs. Remember, the cosine similarity value close to one indicates very high degree of similarity and low values are indicative of almost no similarity.


from sentence_transformers import util
#Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)
print(cos_sim)
tensor([[ 1.0000, 0.7390, 0.2219, 0.1689, 0.1008, 0.1191, 0.2174, 0.0628], [ 0.7390, 1.0000, 0.1614, 0.1152, 0.0218, 0.0713, 0.2854, -0.0181], [ 0.2219, 0.1614, 1.0000, 0.9254, 0.1245, 0.2171, 0.1068, 0.0962], [ 0.1689, 0.1152, 0.9254, 1.0000, 0.1018, 0.2463, 0.0463, 0.0706], [ 0.1008, 0.0218, 0.1245, 0.1018, 1.0000, 0.2005, 0.0153, 0.6084], [ 0.1191, 0.0713, 0.2171, 0.2463, 0.2005, 1.0000, 0.0116, 0.1011], [ 0.2174, 0.2854, 0.1068, 0.0463, 0.0153, 0.0116, 1.0000, -0.0492], [ 0.0628, -0.0181, 0.0962, 0.0706, 0.6084, 0.1011, -0.0492, 1.0000]])

Looking at the resulting similarity values, we see that the sentence#1 and sentence#2 pair has a high degree of similarity. Sentence#3 and sentence#4 also generate a very high value of cosine similarity. Interestingly, sentence#5 and sentence#8 are also deemed to have a good semantic similarity, although they do not share any descriptive words. Thus, the sentence embedding is doing a pretty good job of capturing sentence semantics.


Comparison with TF-IDF Vectorization

Information Retrieval (IR) community for a long time has been representing text as vectors for matching documents. The approach, known as the bag-of-words model, uses a set of words or terms to characterize text.  Each word or term is assigned a weight following the  TF-IDF weighting scheme. In this scheme, the weight assigned to a word is based upon: (i) how often it appears in the document being vectorized, the term frequency (TF) component of the weighting scheme, and (ii) how rare is the word in the entire document collection, the inverse document frequency (IDF) component of the weighting scheme. The vector size is governed by the number of terms used from the entire document collection, i.e. the vocabulary size. You can read details about TF-IDF vectorization in this blog post.

Let's see how well the TF-IDF vectorization captures similarities between document in comparison with the sentence embedding. We will use the same set of sentences to perform vectorization and similarity calculations as shown below.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
vectorizer = TfidfVectorizer(ngram_range = (1,2),stop_words='english')
tfidf = vectorizer.fit_transform(sentences)
similarity =cosine_similarity(tfidf,tfidf)
np.set_printoptions(precision=4)
print(similarity)

[[1. 0.5818 0.0962 0. 0. 0. 0. 0. ] [0.5818 1. 0.0772 0. 0. 0. 0. 0. ] [0.0962 0.0772 1. 0.7654 0. 0. 0. 0. ] [0. 0. 0.7654 1. 0. 0. 0. 0. ] [0. 0. 0. 0. 1. 0.0761 0. 0. ] [0. 0. 0. 0. 0.0761 1. 0. 0. ] [0. 0. 0. 0. 0. 0. 1. 0. ] [0. 0. 0. 0. 0. 0. 0. 1. ]]


Looking at the above results, we see that TF-IDF vectorization is unable to determine similarity between 
sentence#5 and sentence#8 which the sentence embedding was able to pick up despite of the absence of the common descriptive words in the sentence pair.

Thus, TF-IDF vectorizer is good as long as there are shared descriptive words. But the sentence embedding is able to capture semantic similarities without even shared descriptive words. This is possible because the high-dimensional embedded vectors learn relationships between different words and their context during training and utilize those relationships during similarity computation as well as for other NLP tasks.

Now you might be wondering whether the embedding concept can be applied to images and graphs. The answer is yes and I hope to dwell on these in my future posts.


Claude 2: A New Member of the Growing Family of Large Language Models

AI has advanced rapidly in recent years, with large language models (LLMs) like ChatGPT creating enormous excitement. These models can generate remarkably human-like text albeit with certain limitations. In this post, we'll look at a new member of the family of large language models, Anthropic's Claude 2, and highlight some of its features.

Claude 2 Overview

Claude2 was released in February 2023. Claude 2 utilizes a context window of approximately 4,000 tokens during conversations. This allows it to actively reference the last 1,000-2,000 words spoken in order to strengthen contextual awareness and continuity. The context window is dynamically managed, expanding or contracting slightly based on factors like conversation complexity. This context capacity exceeds ChatGPT's approximately 1,000 token window, enabling Claude 2 to sustain longer, more intricate dialogues while retaining appropriate context. In addition to conversational context, Claude 2 can take in multiple documents to incorporate information from different sources. 

Claude2's distinguishing features are Constitutional AI and Constitutional Instructive Reward techniques. The incorporation of these two techniques is claimed to improve safety and reliability. As a result, Claude 2 is seen to provide helpful, harmless, and honest responses compared to other models; its performance on a wide range of conversational queries is over 99% accuracy. In benchmarks, ChatGPT produces inconsistent or incorrect responses approximately 5-10% of the time. 

 What is Constitutional AI?

The Constitutional AI technique constrains Claude 2 to behave according to a "constitution" defined by its designers at Anthropic. The "constitution" takes the form of a modular library of neural network modules that encodes rules guiding allowed model outputs. The constitutional rule modules are designed using a combination of techniques like supervised learning from human feedback, adversarial training to surface edge cases, and reinforcement learning optimized for consistency and oversight. The modules operate on Claude 2's internal representations, blocking or altering potential model outputs that violate defined constitution policies. These policies prohibit overtly harmful responses and mitigate risks identified during Claude 2's training. This technique constrains Claude 2 to behave according to a "constitution" defined by its designers at Anthropic. The constitution sets guidelines for providing helpful, honest, harmless information. Concrete rules prohibit harmful responses, while allowing Claude 2 to politely decline inappropriate requests. This establishes ethical boundaries unmatched by other LLMs. 

What is Constitutional Instructive Reward Technique?

Constitutional Instructive Reward technique builds on Constitutional AI by further optimizing Claude 2's training process. Anthropic generates a large dataset of hypothetical conversational scenarios that might challenge model integrity. The Constitutional AI modules provide feedback on which responses are acceptable versus violations. This dataset then trains an auxiliary Constitutional AI Advisor model through self-supervised learning.

The Constitutional AI Advisor produces reward signals that feed back into Claude 2's reinforcement learning loop. This focuses the overarching optimization toward mitigating identified risks and providing helpful instructions to users. The Advisor guides Claude 2 toward more nuanced integrity not encapsulated by the core Constitutional AI modules. It also provides explainability, since the Advisor outputs can be inspected to identify why specific responses qualified as unwise or unethical.

Useful Claude2 Metrics

The information below is gleaned from Anthropic's publications.

- Claude 2 can generate approximately 300 tokens (2,000 words) per second on a modern GPU. This enables rapid response times for conversational queries.

- Its average query response latency is under 500 milliseconds, allowing for smooth and natural dialogue flow.

- The model is optimized to run efficiently on commercially available GPUs like the Nvidia A100. On this hardware, it can process over 10 queries per second concurrently.

- Claude 2 requires only 50 GPU hours to train, this improves sustainability.

- In a benchmark test on the SuperGLUE natural language toolkit, Claude 2 achieved a 94% score while running up to 24x faster than GPT-3.

- Cloud-deployed versions of Claude 2 scale to handle over 100,000 users simultaneously.

In summary, Claude2 is a welcome addition to the growing family of LLMs with some distinct performance superiority over other models. The best thing about Claude2 is that it is totally free.



Difference Between Semi-Supervised Learning and Self-Supervised Learning

There are many styles of training machine learning models including the familiar supervised and unsupervised learning to active learning, semi-supervised learning and self-supervised learning. In this post, I will explain the difference between semi-supervised and self-supervised styles of learning. To get started, let us first recap what is supervised learning, the most popular machine learning methodology to build predictive models. Supervised learning uses annotated or labeled data to train predictive models. A label attached to a data vector is nothing but the response that the predictive model should generate  for that data vector as input during the model training. For example, we will label pictures of cats and dogs with labels cat and dog to train a Cat versus Dog classifier. We assume a large enough training data set with labels is available when building a classifier.

When there are no labels attached to the training data, then the learning style is known as unsupervised learning. In unsupervised learning the aim is to partition the data into different groups based upon similarities of the training vectors. The k-means clustering is the most well-known unsupervised learning technique. Often, the number of the data groups to be formed is specified by the user.

Semi-Supervised Learning

In a real world setting, training examples with labels need to be acquired for a predictive modeling task. Labeling or annotating examples is expensive and time-consuming; many application domains require expert annotators. Thus, we often need ways to work with a small labeled training data set. In certain situations, we may be able to acquire, in addition to a small labeled training data set, additional training examples without labels with labeling being too expensive to perform. In such cases, it is possible to label the unlabeled examples using the small available set of labeled examples. This type of learning is referred as semi-supervised learning and it falls somewhere between supervised and unsupervised learning. 

The term semi-supervised classification is often used to describe the process of labeling training examples using a small set of labeled examples for classification modeling. A similar idea is also used in clustering in an approach known as the semi-supervised clustering. In semi-supervised clustering, the goal is to group a given set of examples into different clusters with the condition that certain examples must be clustered together and certain examples must be put in different clusters. In other words, some kind of constraints are imposed on resulting clusters in terms of cluster memberships of certain specified examples. For an example of semi-supervised classification, you can check this blog post. In another blog post, you can read about constrained k-means clustering as a technique for semi-supervised clustering.

Transfer Learning

In certain situations we have a small set of labeled examples but cannot acquire more training examples even without the labels. One possible solution in such situations is transfer learning. In transfer learning, we take a trained predictive model that was trained on a related task and re-train it with our available labeled data. The re-training fine-tunes the parameters of the trained model to make it perform well for our predictive task. Transfer learning is popular in deep learning where many trained predictive models are publicly available. While performing transfer learning, we often employ data augmentation to the available labeled examples to create additional examples with labels. The common data augmentation operations include translation, rotation, cropping and resizing, and blurring.

Self-Supervised Learning

The Self-supervised learning is essentially unsupervised learning wherein the labels, the desired predictions, are provided by the data itself and hence the name self-supervised learning. The objective of the self-supervised learning is to learn the latent characteristics of the data that could be useful in many ways. Although the self-supervised learning has been around for a long time, for example as in autoencoders, its current popularity is primarily due its use in training the large language models. 

The example below shows how the desired output is defined via self-learning. In the example, the words in green are masked and the model is trained to predict the masked words using the surrounding words. Thus, the masked words function as labels. The masking of the words is done in a random fashion for the given corpus and thus no manual labeling is needed.




The idea of random masking is not the only way to self-generate labels; several variations at the word level as well as the sentence level are possible and have been successfully used in different language modeling efforts. For example, self-learning can be employed to predict the neighboring sentences that come before and after a selected sentence in a given document. 

The tasks defined to perform self-supervised learning are called pretext tasks because these tasks are not the end-goal and the results of these tasks are used for building the final systems. 

Self-generation of labels for prediction is easily extended to images to define a variety of pretext tasks for self-supervised learning. As an example, images can be subjected to rotations of (90 degrees, 180 degrees etc.) and the pretext task is defined to predict the rotation applied to the images. Such a pretext task can make the model learn the canonical orientation of image objects. Data augmentation is also commonly used in self-supervised learning to create image variations. 

All in all, self-supervised learning is a valuable concept that eliminates the need for external annotation. The success of large language models can be majorly attributed to this style of machine learning.