Claude 2: A New Member of the Growing Family of Large Language Models

AI has advanced rapidly in recent years, with large language models (LLMs) like ChatGPT creating enormous excitement. These models can generate remarkably human-like text albeit with certain limitations. In this post, we'll look at a new member of the family of large language models, Anthropic's Claude 2, and highlight some of its features.

Claude 2 Overview

Claude 2 was released by Anthropic in July 2023. It supports a context window of about 100,000 tokens (roughly 75,000 words) during conversations, allowing it to reference very long dialogues and documents in order to maintain contextual awareness and continuity. This capacity far exceeds the roughly 4,000-token window of the standard ChatGPT (GPT-3.5) model, enabling Claude 2 to sustain longer, more intricate dialogues while retaining appropriate context. In addition to conversational context, Claude 2 can take in multiple documents at once and incorporate information from the different sources.

Claude 2's distinguishing features are its Constitutional AI and Constitutional Instructive Reward training techniques, which Anthropic claims improve safety and reliability. As a result, Claude 2 is reported to provide more helpful, harmless, and honest responses than comparable models across a wide range of conversational queries.

What is Constitutional AI?

The Constitutional AI technique constrains Claude 2 to behave according to a "constitution" defined by its designers at Anthropic: a written set of principles that guides allowed model outputs. During training, the model is prompted to critique and revise its own draft responses against these principles, and the revised responses are used for supervised fine-tuning. A reinforcement learning phase then optimizes the model further using feedback derived from the same principles, surfacing and mitigating risks identified during training. The constitution sets guidelines for providing helpful, honest, and harmless information: concrete rules prohibit overtly harmful responses while allowing Claude 2 to politely decline inappropriate requests. This establishes explicit ethical boundaries that most other LLMs lack.
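
The critique-and-revise loop at the heart of this training can be sketched as follows. This is a toy illustration, not Anthropic's implementation: the principles, the keyword-based checks, and the canned refusal are invented stand-ins for steps that the real technique performs with the model itself.

```python
# Toy sketch of the critique-and-revise loop used in the supervised phase of
# Constitutional AI. The principles, keyword checks, and canned refusal below
# are invented stand-ins: in the real technique, the model itself critiques
# and rewrites its own draft responses.

CONSTITUTION = [
    # (principle, predicate that flags a violating draft)
    ("Avoid giving instructions for dangerous activities",
     lambda text: "how to build a weapon" in text.lower()),
    ("Do not insult the user",
     lambda text: "you are stupid" in text.lower()),
]

def critique(draft: str) -> list[str]:
    """Return the principles that the draft violates."""
    return [principle for principle, check in CONSTITUTION if check(draft)]

def revise(draft: str, violations: list[str]) -> str:
    """Replace a violating draft with a polite refusal (standing in for
    asking the model to rewrite its own answer)."""
    if not violations:
        return draft
    return "I can't help with that, but I'm happy to assist with something else."

def constitutional_response(draft: str) -> str:
    """Critique a draft against the constitution, then revise if needed."""
    return revise(draft, critique(draft))
```

A compliant draft passes through unchanged, while a violating one is replaced, mirroring how revision leaves good answers intact.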

What is Constitutional Instructive Reward Technique?

The Constitutional Instructive Reward technique builds on Constitutional AI by further optimizing Claude 2's training process. Anthropic generates a large dataset of hypothetical conversational scenarios designed to challenge the model's integrity, and the constitutional principles provide feedback on which responses are acceptable and which are violations. This dataset then trains an auxiliary Constitutional AI Advisor model.

The Constitutional AI Advisor produces reward signals that feed back into Claude 2's reinforcement learning loop, focusing the overarching optimization on mitigating identified risks while remaining helpful to users. The Advisor guides Claude 2 toward nuances of integrity not captured by the core constitutional rules. It also aids explainability, since the Advisor's outputs can be inspected to see why a specific response was judged unwise or unethical.
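
How an advisor model might convert constitutional judgments into reward signals can be sketched like this. Everything here is hypothetical: the scoring rule and candidate responses are invented, and a real reinforcement learning loop updates model weights over many steps rather than picking among fixed candidates.

```python
# Hypothetical sketch of an advisor turning constitution compliance into
# scalar rewards for reinforcement learning. The scoring rule and candidates
# are invented for illustration only.

def advisor_reward(response: str) -> float:
    """Score a response: -1 for a rule violation, 0 for an empty or
    unhelpful answer, +1 for a helpful, compliant answer."""
    if "forbidden" in response:     # stands in for a constitutional check
        return -1.0
    if response.strip() == "":
        return 0.0
    return 1.0

def pick_best(candidates: list[str]) -> str:
    """Greatly simplified policy improvement: prefer the candidate with the
    highest advisor reward, as an RL loop would over many training steps."""
    return max(candidates, key=advisor_reward)
```

The per-response scores are also inspectable, which is the explainability benefit mentioned above.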

Useful Claude 2 Metrics

The figures below are gleaned from public reporting on the model and should be treated as approximate rather than official.

- Claude 2 can generate approximately 300 tokens (roughly 225 words) per second on a modern GPU. This enables rapid response times for conversational queries.

- Its average query response latency is under 500 milliseconds, allowing for smooth and natural dialogue flow.

- The model is optimized to run efficiently on commercially available GPUs like the Nvidia A100. On this hardware, it can process over 10 queries per second concurrently.

- Claude 2 reportedly requires as few as 50 GPU hours to fine-tune, which improves sustainability.

- On the SuperGLUE natural language understanding benchmark, Claude 2 reportedly achieved a score of 94% while running up to 24x faster than GPT-3.

- Cloud-deployed versions of Claude 2 scale to handle over 100,000 users simultaneously.

In summary, Claude 2 is a welcome addition to the growing family of LLMs, with some distinct performance advantages over other models. Another attraction is that Claude 2 can be tried free of charge through Anthropic's chat interface, with paid API access available for heavier use.



Difference Between Semi-Supervised Learning and Self-Supervised Learning

There are many styles of training machine learning models, ranging from the familiar supervised and unsupervised learning to active learning, semi-supervised learning, and self-supervised learning. In this post, I will explain the difference between the semi-supervised and self-supervised styles of learning. To get started, let us first recap supervised learning, the most popular machine learning methodology for building predictive models. Supervised learning uses annotated, or labeled, data to train predictive models. The label attached to a data vector is simply the response that the predictive model should generate for that data vector as input during training. For example, we label pictures of cats and dogs with the labels "cat" and "dog" to train a cat-versus-dog classifier, and we assume a large enough labeled training set is available when building such a classifier.
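
A supervised classifier in miniature might look like the following sketch, which uses made-up two-dimensional feature vectors in place of real cat and dog images, and a nearest-centroid rule in place of a deep network:

```python
import numpy as np

# Miniature supervised learning example: a nearest-centroid "cat vs. dog"
# classifier. The 2-D feature vectors and labels are made up for
# illustration; real image classifiers learn features from pixels.

X_train = np.array([[1.0, 0.2], [0.9, 0.1],   # "cat" examples
                    [0.1, 0.9], [0.2, 1.0]])  # "dog" examples
y_train = np.array(["cat", "cat", "dog", "dog"])

def fit_centroids(X, y):
    """Training: compute the mean feature vector of each class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(centroids, x):
    """Prediction: return the class whose centroid is nearest to x."""
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

centroids = fit_centroids(X_train, y_train)
```

For a new point near the cat examples, such as `np.array([0.95, 0.15])`, `predict` returns `"cat"`; the labels supplied at training time are what make this supervised.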

When no labels are attached to the training data, the learning style is known as unsupervised learning. In unsupervised learning, the aim is to partition the data into groups based on the similarities of the training vectors. K-means clustering is the best-known unsupervised learning technique; the number of groups to be formed is usually specified by the user.
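
For concreteness, here is a bare-bones k-means implementation; note that no labels appear anywhere, since the partition is driven purely by similarity. The toy data points are invented:

```python
import numpy as np

# Bare-bones k-means to illustrate unsupervised learning: the algorithm
# partitions points by similarity alone, with no labels involved.

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centers at k distinct random data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=2), axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centers = kmeans(X, k=2)
```

With `k=2`, the two tight groups in `X` end up in separate clusters, which is exactly the user-specified number of groups mentioned above.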

Semi-Supervised Learning

In a real-world setting, labeled training examples must be acquired for a predictive modeling task. Labeling or annotating examples is expensive and time-consuming, and many application domains require expert annotators. Thus, we often need ways to work with a small labeled training set. In some situations we can acquire, in addition to a small labeled set, many more training examples without labels, because labeling them would be too expensive. In such cases it is possible to label the unlabeled examples using the small available set of labeled examples. This type of learning is referred to as semi-supervised learning, and it falls somewhere between supervised and unsupervised learning.

The term semi-supervised classification describes the process of labeling training examples using a small set of labeled examples for classification modeling. A similar idea is used in clustering, in an approach known as semi-supervised clustering. Here, the goal is to group a given set of examples into clusters subject to the condition that certain examples must be clustered together and certain examples must be placed in different clusters; in other words, constraints are imposed on the cluster memberships of specified examples. For an example of semi-supervised classification, you can check this blog post. In another blog post, you can read about constrained k-means clustering as a technique for semi-supervised clustering.
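
One common semi-supervised recipe, self-training, can be sketched in a few lines under toy assumptions: a nearest-centroid classifier stands in for a real model, and every unlabeled point is pseudo-labeled each round, whereas practical self-training keeps only high-confidence pseudo-labels. All data here are invented:

```python
import numpy as np

# Self-training sketch for semi-supervised classification: fit on the few
# labeled points, pseudo-label the unlabeled points with the current model,
# and refit on the combined set.

X_lab = np.array([[0.0, 0.0], [5.0, 5.0]])     # two labeled examples
y_lab = np.array([0, 1])
X_unlab = np.array([[0.2, 0.1], [4.8, 5.1],    # unlabeled examples
                    [0.1, 0.3], [5.1, 4.9]])

def centroids(X, y):
    """Mean feature vector of each class."""
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def self_train(X_lab, y_lab, X_unlab, rounds=3):
    X, y = X_lab, y_lab
    for _ in range(rounds):
        C = centroids(X, y)
        # pseudo-label each unlabeled point with its nearest centroid's class
        pseudo = np.linalg.norm(X_unlab[:, None] - C, axis=2).argmin(axis=1)
        # refit on labeled + pseudo-labeled data
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, pseudo])
    return centroids(X, y)

final_centroids = self_train(X_lab, y_lab, X_unlab)
```

The refit centroids are pulled toward the unlabeled points, so the final model reflects far more data than the two original labels.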

Transfer Learning

In certain situations we have a small set of labeled examples but cannot acquire more training examples, even unlabeled ones. One possible solution in such situations is transfer learning. In transfer learning, we take a predictive model that was trained on a related task and re-train it with our available labeled data. The re-training fine-tunes the parameters of the trained model so that it performs well on our predictive task. Transfer learning is popular in deep learning, where many trained models are publicly available. While performing transfer learning, we often apply data augmentation to the available labeled examples to create additional labeled examples. Common data augmentation operations include translation, rotation, cropping, resizing, and blurring.
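
These augmentation operations are easy to illustrate on a toy array standing in for an image; the key property is that every variant inherits the original example's label:

```python
import numpy as np

# Label-preserving augmentations sketched on a toy 4x4 array standing in for
# an image. Library transforms do the same job on real images; the point is
# that each variant keeps the original example's label.

image = np.arange(16).reshape(4, 4)

augmented = [
    np.fliplr(image),    # horizontal flip
    np.rot90(image),     # 90-degree rotation
    image[1:3, 1:3],     # center crop (typically resized back afterwards)
]
# Each array in `augmented` is paired with the original image's label.
```

From one labeled example we now have four, which is the multiplier effect that makes augmentation valuable when labels are scarce.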

Self-Supervised Learning

Self-supervised learning is essentially unsupervised learning wherein the labels, i.e. the desired predictions, are provided by the data itself; hence the name. The objective of self-supervised learning is to learn latent characteristics of the data that can be useful in many ways. Although self-supervised learning has been around for a long time, for example in autoencoders, its current popularity is primarily due to its use in training large language models.

As a simple example of how the desired output is defined via self-supervision, take the sentence "The quick brown fox jumps over the lazy dog" and mask two words at random: "The quick [MASK] fox jumps over the [MASK] dog." The model is trained to predict the masked words using the surrounding words; thus, the masked words function as labels. Because the masking is done at random over the given corpus, no manual labeling is needed.




Random masking is not the only way to self-generate labels; several variations at the word level as well as the sentence level are possible and have been used successfully in different language modeling efforts. For example, self-supervision can be employed to predict the neighboring sentences that come before and after a selected sentence in a given document.
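
The random-masking scheme itself can be sketched in a few lines. The whitespace tokenizer, masking probability, and seed are illustrative choices; the point is that no manual annotation appears anywhere, because the masked words serve as the labels:

```python
import random

# Sketch of how random masking turns raw text into labeled training pairs
# for a masked-language-model pretext task.

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)      # the hidden word is the prediction target
        else:
            inputs.append(tok)
            labels.append(None)     # nothing to predict at this position
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens)
```

Each `(inputs, labels)` pair is a ready-made training example, generated for free from the corpus.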

The tasks defined to perform self-supervised learning are called pretext tasks because they are not the end goal; their results are used to build the final systems.

Self-generation of labels for prediction extends easily to images, where a variety of pretext tasks can be defined for self-supervised learning. As an example, images can be rotated (by 90 degrees, 180 degrees, etc.) and the pretext task is to predict the rotation applied to each image. Such a pretext task can make the model learn the canonical orientation of image objects. Data augmentation is also commonly used in self-supervised learning to create image variations.
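
The rotation pretext task can be sketched as follows; the 3x3 array stands in for a real image, and the label is generated by the data pipeline itself:

```python
import numpy as np

# Rotation-prediction pretext task: rotate an image by a random multiple of
# 90 degrees and use the rotation index as a free, self-generated label.

def rotation_example(image, rng):
    k = int(rng.integers(0, 4))       # 0, 1, 2, or 3 quarter-turns
    return np.rot90(image, k), k      # (rotated image, label)

rng = np.random.default_rng(0)
image = np.arange(9).reshape(3, 3)
rotated, label = rotation_example(image, rng)
```

A model trained to recover `label` from `rotated` must implicitly learn what "upright" looks like, which is the canonical-orientation knowledge mentioned above.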

All in all, self-supervised learning is a valuable concept that eliminates the need for external annotation. The success of large language models can be largely attributed to this style of machine learning.

Retrieval Augmented Generation: What is it and Why do we need it?

What is Retrieval Augmented Generation?

Generative AI is currently garnering lots of attention. While the responses provided by large language models (LLMs) are satisfactory in most situations, we sometimes want better-focused responses when employing LLMs in specific domains. Retrieval-augmented generation (RAG) offers one way to improve the output of generative AI systems. RAG enhances an LLM's capabilities by providing it with additional knowledge context through information retrieval. Thus, RAG combines the strengths of retrieval-based methods, which select relevant information, and generation-based methods, which produce coherent and fluent text.

RAG works in the following way:

  1. Retrieval: The process starts with retrieving relevant documents, passages, or pieces of information from a pre-defined corpus or database. These retrieved sources contain content that is related to the topic or context for which you want to generate text.
  2. Generation: After the relevant content is retrieved, the generation step takes over: the retrieved information is used as input or context to guide the generation of coherent, contextually relevant text. Most commonly this means inserting the retrieved content into the model's prompt, though a large language model such as GPT-3 can also be fine-tuned on it.
  3. Combination: The generated text is produced while taking into consideration both the retrieved information and the language model's inherent creative abilities. This allows the generated text to be more informative, accurate, and contextually appropriate.
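
The three steps above can be sketched end to end as follows. The corpus, the bag-of-words "embeddings," and the prompt template are all toy stand-ins for a real embedding model, vector database, and LLM:

```python
import numpy as np

# End-to-end sketch of the retrieve-then-generate flow: retrieval by
# cosine-similarity nearest-neighbor search, then combination of the
# retrieved context with the query into a generation prompt.

corpus = [
    "Claude 2 was released by Anthropic in 2023.",
    "K-means clustering partitions data into groups.",
    "RAG combines retrieval with text generation.",
]

def words(text):
    return {w.lower().strip(".?!,") for w in text.split()}

vocab = sorted(set().union(*(words(doc) for doc in corpus)))

def embed(text):
    """Toy bag-of-words vector over the corpus vocabulary."""
    present = words(text)
    return np.array([1.0 if w in present else 0.0 for w in vocab])

def retrieve(query, k=1):
    """Step 1: cosine-similarity nearest-neighbor search over the corpus."""
    q = embed(query)
    sims = [q @ embed(doc) /
            (np.linalg.norm(q) * np.linalg.norm(embed(doc)) + 1e-9)
            for doc in corpus]
    return [corpus[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(query):
    """Steps 2-3: combine retrieved context with the query for generation."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

For instance, `build_prompt("What does RAG combine?")` retrieves the third document and places it ahead of the question, so the (real) generator would answer from that context rather than from memory alone.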

How is RAG Useful?

Retrieval-augmented generation is useful for several reasons:

  1. Content Quality: By incorporating information from retrieved sources, the generated text can be more accurate, relevant, and factually sound. This is particularly important for applications where accuracy and credibility are crucial.
  2. Data Augmentation: Retrieval-augmented generation can be used to expand the dataset for fine-tuning language models. By combining the model's generative capabilities with real-world information, it can learn to produce more contextually relevant and diverse text.
  3. Expertise Integration: In domains that require domain-specific knowledge or expertise, retrieval-augmented generation can ensure that the generated content aligns with expert knowledge.
  4. Abstractive Summarization: When generating summaries, retrieval-augmented approaches can help ensure that the generated summary captures the most important and relevant information from the source documents.
  5. Question Answering: In question answering tasks, retrieval-augmented generation can improve the accuracy of generated answers by incorporating relevant information from a corpus of documents.
  6. Content Personalization: For chatbots and content generation systems, retrieval-augmented generation can enable more personalized and contextually relevant responses by incorporating information retrieved from a user's history or relevant documents.

The success of the RAG approach depends greatly on how semantically relevant the retrieved documents are to the user's request. Retrieving meaningful chunks of text is done by nearest-neighbor search, implemented in a vector database with the text represented by embeddings. Look for my next post to learn about this aspect of RAG implementation.

It's important to note that retrieval-augmented generation is a research-intensive area and involves challenges such as selecting the right retrieval sources, managing biases in retrieved content, and effectively integrating retrieved information with the language model's creative capabilities. However, it holds promise for improving the quality and utility of generated text across various NLP applications.