Showing posts with label NLP. Show all posts

Embeddings Beyond Words: Intro to Sentence Embeddings

It wouldn't be an exaggeration to say that the recent advances in Natural Language Processing (NLP) technology can be, to a large extent, attributed to the use of very high-dimensional vectors for language representation. These high-dimensional vector representations (768 dimensions is common) are called embeddings and are aimed at capturing semantic meaning and relationships between linguistic items.

Although the idea of using vector representation for words has been around for many years, the interest in word embedding took a quantum jump with Tomáš Mikolov’s Word2vec algorithm in 2013. Since then, many methods for generating word embeddings, for example GloVe and BERT, have been developed. Before moving on further, let's see briefly how word embedding methods work.

Word Embedding: How is it Performed?

I am going to explain how word embedding is done using the Word2vec method. This method uses a linear encoder-decoder network with a single hidden layer. The input layer of the encoder has as many neurons as there are words in the training vocabulary. The hidden layer size is set to the dimensionality of the resulting word vectors. The output layer is the same size as the input layer. The input words are encoded as one-hot vectors whose length equals the vocabulary size. The figure below shows the arrangement for learning embeddings.
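In code, the forward pass through such a network can be sketched as follows. This is a minimal illustration with NumPy, not the actual Word2vec implementation: the toy vocabulary, the 3-dimensional hidden layer, the random weights, and the names `W_in`, `W_out`, and `forward` are all assumptions made for the example.

```python
import numpy as np

# Toy vocabulary (an assumption for illustration only)
vocab = ["the", "quick", "brown", "fox", "jumped", "over", "fence"]
word_to_index = {word: i for i, word in enumerate(vocab)}

vocab_size = len(vocab)   # size of the input and output layers
embedding_dim = 3         # size of the hidden layer (toy value)

rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, embedding_dim))   # input -> hidden weights
W_out = rng.normal(size=(embedding_dim, vocab_size))  # hidden -> output weights

def one_hot(word):
    """Return the one-hot vector for a word in the vocabulary."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

def forward(word):
    """Linear hidden layer followed by a softmax over the vocabulary."""
    hidden = one_hot(word) @ W_in              # picks out one row of W_in
    scores = hidden @ W_out
    exp_scores = np.exp(scores - scores.max()) # numerically stable softmax
    probs = exp_scores / exp_scores.sum()
    return hidden, probs

hidden, probs = forward("fox")
print(hidden)           # the (untrained) embedding of "fox"
print(probs.round(3))   # predicted probability of each vocabulary word
```

Because the input is one-hot, multiplying it by `W_in` simply selects one row, so after training the rows of `W_in` serve as the word embeddings.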

The embeddings are learned by adjusting weights so that for a target word, say fox, in a piece of text such as "The quick brown fox jumped over the fence", the probability of the designated context word, say jumped, is high. There are two major variations of this basic technique. In the variation known as the continuous bag of words (CBOW), multiple context words are used to predict the target word. Thus, the system may use brown, jumped and fence as the context words. In the other scheme, known as the skip-gram model, the roles of target and context words are reversed: the target word is fed on the input side and the weights are modified to increase the predicted probabilities of the context words. In both of these cases, the above architecture needs modification. You can read details about the architecture changes as well as look at a simple example in the blog post that I did a while ago.
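To make the difference between the two schemes concrete, here is a small sketch of how training examples could be generated from a sentence. The window size of 2 and the function names `skipgram_pairs` and `cbow_examples` are assumptions for illustration; real implementations add refinements such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each target word is paired with every word in its window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))  # (input, word to predict)
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: all words in the window jointly predict the target word."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))         # (inputs, word to predict)
    return examples

tokens = "the quick brown fox jumped over the fence".split()

print([pair for pair in skipgram_pairs(tokens) if pair[0] == "fox"])
print(cbow_examples(tokens)[3])
```

For the word fox, skip-gram yields four separate (fox, context) pairs, while CBOW yields a single example whose input is the list of four neighbouring words.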

Sentence Embeddings

While word embeddings are useful, we are often working with longer pieces of text to perform tasks such as text classification, sentiment analysis, and topic detection. Thus, it is logical to extend the idea of word embeddings to sentences. One simple way to accomplish this is to take the average of the embeddings of the words in a sentence. However, such an approach doesn't take word order into account and thus results in vectors that aren't very good at capturing the sentence meaning. Instead, sentence embeddings are obtained using transformer models such as BERT (Bidirectional Encoder Representations from Transformers), which use an attention mechanism to gauge the importance of different words in a sentence. BERT outputs a contextualized embedding for each token in the given input text. To create a fixed-size sentence embedding out of these, the model applies mean pooling, i.e., the output embeddings of all tokens are averaged to yield a fixed-size vector. Sentence-BERT, or simply SBERT, is a package that you can use to create sentence embeddings without worrying about pooling.
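The mean pooling step described above can be sketched in a few lines of NumPy. The token embeddings here are made-up 2-dimensional numbers rather than real BERT outputs, and the function name `mean_pool` is an assumption; the attention mask marks which positions are real tokens (1) and which are padding (0), so that padding does not distort the average.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings per sentence, ignoring padded positions.

    token_embeddings: array of shape (batch, tokens, dim)
    attention_mask:   array of shape (batch, tokens), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(float)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Two "sentences" of up to three tokens each, with 2-dimensional embeddings
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings)
```

Each sentence ends up with one vector of the same size regardless of how many tokens it contains.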

One issue facing BERT/SBERT is that of encountering an out-of-vocabulary word, that is, a word that wasn't part of the text corpus used to train BERT. No embedding exists for such a word. BERT/SBERT solve this by using a WordPiece tokenizer, which breaks every word into one or more tokens. As an example, the word snowboarding will be tokenized into three tokens: snow, board, ing. This ensures that an embedding can be created for any new word. SBERT permits creating a single vector embedding for sequences up to a model-specific maximum length, for example 128 tokens. Tokens beyond the limit are simply discarded.
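The idea behind WordPiece tokenization can be illustrated with a greedy longest-match sketch. The toy vocabulary below is an assumption made for this example; a real BERT vocabulary has roughly 30,000 entries. Following the BERT convention, pieces that continue a word are written with a leading `##`.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split a word into the longest pieces found in the vocabulary."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched at this position
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"snow", "##board", "##ing", "sky", "blue"}

print(wordpiece_tokenize("snowboarding", toy_vocab))
# ['snow', '##board', '##ing']
```

Since every piece has its own embedding, a word never seen during training still gets a representation built from its pieces.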

Sentence Embedding Libraries

Other than SBERT, there are many libraries that one can use. Some of these are:

  • TensorFlow Hub - Provides pre-trained encoders like BERT and other transformer models. Makes it easy to generate sentence embeddings.
  • InferSent - Facebook AI research model for sentence embeddings trained on natural language inference data.
  • Universal Sentence Encoder (USE) - Google model trained on a variety of data sources to generate general purpose sentence embeddings.
  • Flair - NLP library with models like Flair embeddings trained on unlabeled data which can provide sentence representations.
  • Doc2Vec - Extension of Word2Vec that can learn embeddings for sentences and documents.
  • Skip-Thoughts - Unsupervised model trained to predict surrounding sentences based on context.
  • GenSim - Includes implementations of models like Doc2Vec for generating sentence and paragraph embeddings.
  • SentenceTransformers - Library for state-of-the-art sentence embeddings based on transformers. Includes pretrained models like BERT and RoBERTa.

The choice of model depends on your use case. For general purposes, pretrained universal encoders like USE and SBERT provide robust sentence vectors. For domain-specific tasks, fine-tuning transformer models like BERT often produces the best performance.

One word of caution while using embeddings. Never mix embeddings generated by two different libraries. Embeddings produced via each method/framework are unique to that method and the training corpus.

An Example of Sentence Embedding for Measuring Similarity

Let's take a look at using sentence embedding to capture semantic similarity between pairs of sentences. We will use SBERT for this purpose. First, we install and import the necessary libraries and decide upon the sentence transformer model to be used.

! pip install sentence-transformers

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')

Next, we specify the sentences that we are using.

sentences = [
    "The sky is blue and beautiful",
    "Love this blue and beautiful sky!",
    "The brown fox is quick and the blue dog is lazy!",
    "The dog is lazy but the brown fox is quick!",
    "the bees decided to have a mutiny against their queen",
    "the sign said there was road work ahead so she decided to speed up",
    "on a scale of one to ten, what's your favorite flavor of color?",
    "flying stinging insects rebelled in opposition to the matriarch"
]

embeddings = model.encode(sentences)
print(embeddings.shape)

(8, 768)

So, the embedding results in eight vectors of 768 dimensions. Next, we import a utility from the sentence transformer library and compute cosine similarities between all pairs. Remember, a cosine similarity value close to one indicates a very high degree of similarity, while low values indicate almost no similarity.

from sentence_transformers import util
# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)
print(cos_sim)

tensor([[ 1.0000,  0.7390,  0.2219,  0.1689,  0.1008,  0.1191,  0.2174,  0.0628],
        [ 0.7390,  1.0000,  0.1614,  0.1152,  0.0218,  0.0713,  0.2854, -0.0181],
        [ 0.2219,  0.1614,  1.0000,  0.9254,  0.1245,  0.2171,  0.1068,  0.0962],
        [ 0.1689,  0.1152,  0.9254,  1.0000,  0.1018,  0.2463,  0.0463,  0.0706],
        [ 0.1008,  0.0218,  0.1245,  0.1018,  1.0000,  0.2005,  0.0153,  0.6084],
        [ 0.1191,  0.0713,  0.2171,  0.2463,  0.2005,  1.0000,  0.0116,  0.1011],
        [ 0.2174,  0.2854,  0.1068,  0.0463,  0.0153,  0.0116,  1.0000, -0.0492],
        [ 0.0628, -0.0181,  0.0962,  0.0706,  0.6084,  0.1011, -0.0492,  1.0000]])

Looking at the resulting similarity values, we see that the sentence#1 and sentence#2 pair has a high degree of similarity. Sentence#3 and sentence#4 also generate a very high value of cosine similarity. Interestingly, sentence#5 and sentence#8 are also deemed to have a good semantic similarity, although they do not share any descriptive words. Thus, the sentence embedding is doing a pretty good job of capturing sentence semantics.
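For readers curious about what the cosine similarity computation involves, it can be written out directly with NumPy. This is a sketch of the standard formula (dot product divided by the product of vector lengths), not the library's own code; the function name `cosine_similarity_matrix` and the toy 2-dimensional vectors are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)  # scale rows to length 1
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy vectors: rows 0 and 1 point the same way, row 2 is perpendicular to them
toy = np.array([
    [1.0, 0.0],
    [2.0, 0.0],
    [0.0, 3.0],
    [1.0, 1.0],
])

sim = cosine_similarity_matrix(toy, toy)
print(sim.round(4))
```

Note that only the direction of a vector matters: rows 0 and 1 have different lengths but a similarity of one.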

Comparison with TF-IDF Vectorization

The Information Retrieval (IR) community has long represented text as vectors for matching documents. The approach, known as the bag-of-words model, uses a set of words or terms to characterize text. Each word or term is assigned a weight following the TF-IDF weighting scheme. In this scheme, the weight assigned to a word is based upon: (i) how often it appears in the document being vectorized, the term frequency (TF) component of the weighting scheme, and (ii) how rare the word is in the entire document collection, the inverse document frequency (IDF) component of the weighting scheme. The vector size is governed by the number of terms used from the entire document collection, i.e. the vocabulary size. You can read details about TF-IDF vectorization in this blog post.
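The weighting scheme can be worked through by hand on a tiny collection. The sketch below uses the textbook form, weight = term count × log(N / document frequency), which is an assumption for illustration; scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each vector, so its numbers will differ. The three toy documents and the function name `tfidf_vectors` are also made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)  # term frequency within this document
        vectors.append([counts[term] * idf[term] for term in vocab])
    return vocab, vectors

docs = ["blue sky", "blue sea", "lazy dog"]
vocab, vectors = tfidf_vectors(docs)

print(vocab)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

The word blue appears in two of the three documents, so it receives a lower weight than sky, which appears in only one.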

Let's see how well the TF-IDF vectorization captures similarities between documents in comparison with the sentence embedding. We will use the same set of sentences to perform vectorization and similarity calculations as shown below.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Use unigrams and bigrams, and drop English stop words
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
tfidf = vectorizer.fit_transform(sentences)
similarity = cosine_similarity(tfidf, tfidf)
print(similarity.round(4))

[[1.     0.5818 0.0962 0.     0.     0.     0.     0.    ]
 [0.5818 1.     0.0772 0.     0.     0.     0.     0.    ]
 [0.0962 0.0772 1.     0.7654 0.     0.     0.     0.    ]
 [0.     0.     0.7654 1.     0.     0.     0.     0.    ]
 [0.     0.     0.     0.     1.     0.0761 0.     0.    ]
 [0.     0.     0.     0.     0.0761 1.     0.     0.    ]
 [0.     0.     0.     0.     0.     0.     1.     0.    ]
 [0.     0.     0.     0.     0.     0.     0.     1.    ]]

Looking at the above results, we see that TF-IDF vectorization is unable to determine the similarity between sentence#5 and sentence#8, which the sentence embedding was able to pick up despite the absence of common descriptive words in the sentence pair.

Thus, the TF-IDF vectorizer works well as long as there are shared descriptive words. The sentence embedding, however, is able to capture semantic similarities even without shared descriptive words. This is possible because the high-dimensional embedded vectors learn relationships between different words and their context during training and utilize those relationships during similarity computation as well as for other NLP tasks.

Now you might be wondering whether the embedding concept can be applied to images and graphs. The answer is yes and I hope to dwell on these in my future posts.

Exploring Large Language Models: Types and Applications

Large language models (LLMs) are currently the craze. Who hasn't heard of ChatGPT, which can deliver all kinds of responses to user prompts, be it a recipe, suggestions for a vacation, or an essay on a topic for a term paper? It is all possible because of the underlying large language models.

So what are large language models? How do these models work? What can we do with these models? Let's try to answer these questions without going into too much technical detail.

What are Large Language Models?

We will begin by first trying to understand what a language model is. Think about using your cell phone for messaging. As you enter text, your cell phone tries to guess the word you are typing, see the figure below. Under the hood, a language model is computing probabilities for the next character/word and is displaying the top three or five most probable characters/words.

There are a few types of language models such as rule-based models, statistical language models, and the recurrent neural networks (RNNs). The rule-based models rely on predefined linguistic rules and heuristics to perform their calculations. These models require experts to manually create and fine-tune rules, making them inflexible and limited in handling complex language patterns. 

Statistical language models use probabilistic methods to estimate the likelihood of a sequence of words. These models utilize n-grams, which are sequences of n words, to predict the probability of the next word based on the previous ones. While statistical models offer improved language processing capabilities, they still struggle with understanding context and long-range dependencies.
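A bigram model (n = 2) is the simplest case of the statistical approach and mirrors the phone keyboard example: count which words follow which, then turn the counts into probabilities. The tiny corpus and the function names `train_bigram_model` and `next_word_probabilities` below are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for every word, how often each other word follows it."""
    tokens = text.lower().split()
    counts = defaultdict(Counter)
    for prev_word, next_word in zip(tokens, tokens[1:]):
        counts[prev_word][next_word] += 1
    return counts

def next_word_probabilities(counts, word, top_k=3):
    """Return the top_k most probable next words after `word`."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return [(w, c / total) for w, c in followers.most_common(top_k)]

corpus = "the cat sat on the mat the cat ate the fish"
bigram_model = train_bigram_model(corpus)

# Suggestions after typing "the", most probable first
print(next_word_probabilities(bigram_model, "the"))
```

The model only ever looks one word back, which is exactly why such models struggle with context and long-range dependencies.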

RNNs are neural networks with memory; these are designed to process sequential data, making them ideal for modeling language. The internal memory enables them to consider context from previous words while predicting the next word. However, standard RNNs are unable to capture long-term dependencies due to a training bottleneck, the "vanishing gradient" problem. 

Large language models are deep learning models that use the transformer architecture to learn the dependencies among words. These models typically have billions of parameters, in some cases more than 100 billion, all set by training. A number of features of the transformer architecture have made it the architecture of choice for sequential data. Even images can be used with the transformer architecture by treating them as sequences of small blocks of pixels. The foremost feature of the transformer architecture is the self-attention mechanism, which weighs the importance of different words in a given context. It thus allows the transformer to capture dependencies across the entire input sequence, making it highly effective in language modeling tasks. Another important feature is that the architecture looks at all the input words of a sentence at the same time, which is key to the use of the attention mechanism.
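The self-attention computation itself is compact. Below is a sketch of single-head scaled dot-product attention in NumPy, with random numbers standing in for learned weights and word vectors; the sizes (4 words, 8 dimensions) and the names `W_q`, `W_k`, `W_v`, and `self_attention` are assumptions for illustration. Real transformers use many such heads and stack many layers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # every word scored against every word
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                          # 4 words, 8-dimensional vectors
X = rng.normal(size=(seq_len, d_model))          # stand-in for word vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.round(3))   # row i: how much word i attends to each word
print(output.shape)
```

All rows of the weight matrix are computed in one matrix product, which reflects the point made above that the architecture looks at all input words at the same time.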

The transformer architecture consists of two main components: the encoder and the decoder. Both are composed of multiple layers of self-attention and feedforward neural networks. The encoder receives an input sequence and produces a sequence of hidden states. The decoder accepts a target sequence and uses the encoder’s output to generate a sequence of predictions. Exceedingly large amounts of text data, sourced from books, websites, Wikipedia, and a multitude of other sources, are used to train the transformer model. The training follows the self-supervised learning modality. A typical approach to self-supervised learning is to mask a certain amount of text and train the transformer to predict the masked text. Besides masking, next sentence prediction is also used for training. It is the self-supervised learning approach that has made the training of large language models practical by removing the need for expert annotators.
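The masking idea shows why no human annotation is needed: the labels come from the text itself. The sketch below hides each token with a fixed probability and records the hidden word as the prediction target. The 15% rate, the `[MASK]` token, and the function name `mask_tokens` are assumptions for illustration, and this is a simplification of what models such as BERT actually do (for instance, BERT sometimes substitutes a random word instead of the mask).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens; labels hold the original word at masked positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(token)   # the model must recover this word
        else:
            inputs.append(token)
            labels.append(None)    # nothing to predict at this position
    return inputs, labels

example_tokens = "the quick brown fox jumped over the fence".split()
inputs, labels = mask_tokens(example_tokens, mask_prob=0.3)

print(inputs)
print(labels)
```

Running this over a large text collection yields an essentially unlimited supply of training examples.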

Pre-trained LLMs

There are a multitude of pre-trained large language models that have been released for use. Before listing some of the popular pre-trained models, let's categorize them in terms of their architecture and usage.

  • Encoder-only Models
  • Decoder-only Models
  • Encoder-Decoder Models

The encoder-only models are the models that are trained to predict masked or missing words. The pre-trained models produce a high-dimensional vector representation of the input text, known as embeddings. [You can read about embeddings at the post "Words as Vectors".] These models can be fine-tuned for a variety of NLP tasks, such as sentiment analysis, named entity recognition, and question answering. These models are also called auto-encoding models.

The decoder-only models, as one would expect, use only the decoder part of the transformer architecture. These models are generally trained by having the model predict the next word of the input text. They are best suited for text generation and are also called autoregressive models.

The encoder-decoder models use both the encoder and the decoder components of the transformer architecture. The pre-training of these models replaces a chunk of the input text by a single mask and the model is trained to predict the entire chunk of the masked input text. These models are also known as sequence-to-sequence models. These models are suitable for text summarization, translation, or generative question answering tasks.
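The chunk-masking pre-training described above can be sketched as a simple list operation: a span of the input is replaced by one placeholder, and the target is that placeholder followed by the hidden words. The sentinel name `<extra_id_0>` follows the convention used by T5; the function name `corrupt_span` and the example sentence are assumptions for illustration, and real pre-training masks several randomly chosen spans per input.

```python
def corrupt_span(tokens, start, length, sentinel="<extra_id_0>"):
    """Replace tokens[start:start+length] with a single sentinel token.

    Returns the corrupted input and the target the model should generate.
    """
    corrupted = tokens[:start] + [sentinel] + tokens[start + length:]
    target = [sentinel] + tokens[start:start + length]
    return corrupted, target

tokens = "Thank you for inviting me to your party last week".split()
corrupted, target = corrupt_span(tokens, start=2, length=2)

print(" ".join(corrupted))  # Thank you <extra_id_0> me to your party last week
print(" ".join(target))     # <extra_id_0> for inviting
```

The encoder reads the corrupted input and the decoder generates the target, which is what makes these models sequence-to-sequence.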

In many cases, you want to adapt a pre-trained model for your specific task in a particular domain, for example finance. This is done by applying transfer learning to the pre-trained model with a dataset specific to the application domain. Such models are called fine-tuned models.

Examples of Large Language Models (LLMs)

Below is a non-exhaustive list of LLMs.

1. GPTs

The GPT (Generative Pre-trained Transformer) series of models from OpenAI is perhaps the best-known family of LLMs. The release of ChatGPT, based on GPT-3.5, in November 2022 created an artificial intelligence storm. These are decoder-only models and are being used for text generation, summarization and question answering. GPT-4, the most recent model in the series, is being used in Microsoft's Bing Chat.

2. LaMDA

LaMDA, which stands for Language Model for Dialogue Applications, is an LLM from Google. It was trained on dialogue and thus exhibits superior conversational performance. It is mainly being used internally at Google, and an earlier version of Google Bard was based on this model.

3. PaLM-2

This model was released by Google in May 2023. It is a state-of-the-art language model trained with text from over 100 languages, scientific papers, and code from numerous public sources. As a result, PaLM-2 is claimed to offer improved multilingual, reasoning, and programming capabilities. The current version of Google Bard is based on PaLM-2.

4. LLaMA

This model was released by Meta in February 2023. It is an auto-regressive language model and comes in different sizes: 7B, 13B, 33B and 65B parameters. It is good for question answering and reading comprehension tasks.


5. BERT

BERT from Google stands for Bidirectional Encoder Representations from Transformers. It is an encoder-only type LLM. BERT uses bidirectional context to generate representations for words. What this means is that in the sentence "I bought an apple phone", the unidirectional context for encoding the word "apple" is "I bought an" while the bidirectional context brings in the next word "phone" also. Clearly, the bidirectional context provides a more targeted representation. BERT has been used for question answering, sentiment analysis, and text classification. DistilBERT is a compressed version of BERT with fewer parameters that retains most of BERT's performance.

6. T5

This is an encoder-decoder transformer model from Google. It is suitable for tasks including machine translation, question answering, abstractive summarization, and text classification.

An Example of LLM Usage

Here, we are going to take a look at using LLMs for our daily tasks. The example that we are going to look at is about using ChatGPT to get code for building an app to perform next-day stock price prediction. We will give a prompt to ChatGPT specifying what we want. The prompt and the response from ChatGPT are shown below. If you want to do this on your own, you will need to get an account with OpenAI.

In the subsequent prompts, I ask ChatGPT to give me code for downloading stock data. Then I prompt ChatGPT to make a python app out of it. All of this is performed satisfactorily and the app works fine. You can read about the responses from ChatGPT and get the complete code at "Create a Simple Stock Price Prediction App using ChatGPT" blogpost. 

Issues with LLMs Usage

Many organizations, including Microsoft, have been quick to deploy LLMs in their products. At the same time, a large group of researchers has been concerned with the potential harms that can result from LLMs becoming more and more powerful. Some issues that have emerged from the current LLMs are:

1. Incorrect and Made-up Answers

Instances of incorrect and fabricated yet convincing responses have been reported by many. Thus, the responses from LLMs shouldn't be taken at face value and must be reviewed before usage.

2. Data Privacy and Confidentiality 

One needs to observe caution as any sensitive, confidential, and proprietary information used in prompts may end up being included in responses to other users. 

3. Model Bias

The LLMs have been found to exhibit bias which arises from the data from the wild that is used for training them. Bias exhibited by a model in use can create legal issues. 

4. Intellectual Property and Copyright Issues

Since models like ChatGPT have been trained using data from the web, the training data includes copyrighted material available on the web. This can result in copyright violations.

5. Fraud and Scamming Risk

Given that it is easy to generate fake data and misinformation, scams using LLMs are definitely going to increase. As consumers, we need to be on alert for such possibilities.

Going Forward with LLMs

LLMs are set to greatly impact almost every facet of society. There will be large benefits from the use of LLMs, and at the same time certain challenges are emerging in dealing with the spread of fake yet convincing-looking information. While the thrust of LLM development so far has been on producing bigger and bigger models, it appears that the focus is shifting to making LLMs more efficient and more accurate in their responses. We are also going to see models being made domain specific.

I hope you enjoyed reading this exploration of LLMs. If you want to learn more, I suggest you visit the Hugging Face Transformers library, where you will find information on many transformer models as well as demos that show their usage.