
Claude 2: A New Member of the Growing Family of Large Language Models

AI has advanced rapidly in recent years, with large language models (LLMs) like ChatGPT creating enormous excitement. These models can generate remarkably human-like text albeit with certain limitations. In this post, we'll look at a new member of the family of large language models, Anthropic's Claude 2, and highlight some of its features.

Claude 2 Overview

Claude 2 was released in July 2023. It supports a context window of roughly 100,000 tokens, allowing it to reference very long conversations and documents in order to maintain contextual awareness and continuity. This capacity far exceeds ChatGPT's default window of about 4,000 tokens, enabling Claude 2 to sustain longer, more intricate dialogues while retaining the appropriate context. In addition to conversational context, Claude 2 can take in multiple documents and incorporate information from different sources.
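To make the idea of a context window concrete, here is a toy sketch of how a chat application might keep a conversation within a fixed token budget by dropping the oldest messages first. The budget, the whitespace "tokenizer", and the messages are all illustrative; real systems use subword tokenizers and far larger windows.

```python
# Toy sketch: keeping a running conversation inside a fixed token budget.
# Token counts are approximated by whitespace splitting, which is only a
# crude stand-in for a real subword tokenizer.

CONTEXT_BUDGET = 20  # tiny budget so the example is easy to follow

def count_tokens(text):
    """Crude stand-in for a real tokenizer."""
    return len(text.split())

def fit_to_context(messages, budget=CONTEXT_BUDGET):
    """Drop the oldest messages until the total fits the budget."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > budget:
        kept.pop(0)  # discard the oldest message first
    return kept

history = [
    "user: summarize the quarterly report",
    "assistant: revenue grew eight percent quarter over quarter",
    "user: and what about costs",
    "assistant: costs were flat except for cloud spend",
    "user: draft an email about it",
]
window = fit_to_context(history)  # keeps only the most recent messages
```

A larger context window simply raises the budget, so fewer old messages ever need to be discarded.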

Claude 2's distinguishing features are its Constitutional AI and Constitutional Instructive Reward techniques. The incorporation of these two techniques is claimed to improve safety and reliability, so that Claude 2 provides more helpful, harmless, and honest responses than comparable models.

What is Constitutional AI?

The Constitutional AI technique trains Claude 2 to behave according to a "constitution": an explicit, written set of principles defined by its designers at Anthropic. In a supervised phase, the model is prompted to critique and revise its own draft responses against these principles and is then fine-tuned on the revised outputs. In a reinforcement learning phase, the model itself judges which of two candidate responses better follows the constitution, and these AI-generated preference labels train a preference model that supplies the reward signal (an approach Anthropic calls reinforcement learning from AI feedback). The constitution sets guidelines for providing helpful, honest, and harmless information: its principles discourage harmful responses while allowing Claude 2 to politely decline inappropriate requests. This establishes explicit ethical boundaries that few other LLMs make as transparent.
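To make the idea concrete, here is a toy, rule-based sketch of constitution-guided revision: a draft response is checked against a small set of written principles and replaced with a polite refusal when it violates one. Every name and rule here is invented for illustration; Anthropic's actual system relies on learned models judging drafts, not keyword checks.

```python
# Toy illustration of constitution-guided revision (not Anthropic's actual
# implementation): a draft response is checked against written principles
# and replaced with a polite refusal when it violates one.

CONSTITUTION = [
    # (principle, simple keyword test standing in for a learned judgment)
    ("do not give instructions for violence", lambda t: "weapon" in t.lower()),
    ("do not reveal private data", lambda t: "password" in t.lower()),
]

def critique(draft):
    """Return the first violated principle, or None if the draft is fine."""
    for principle, violates in CONSTITUTION:
        if violates(draft):
            return principle
    return None

def revise(draft):
    """Replace a violating draft with a refusal citing the principle."""
    violated = critique(draft)
    if violated is None:
        return draft
    return f"I can't help with that (principle: {violated})."

safe = revise("The capital of France is Paris.")
blocked = revise("Here is the admin password you asked for.")
```

In the real technique the critique-and-revision pairs become training data, so the model internalizes the principles rather than applying them as a runtime filter.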

What is Constitutional Instructive Reward Technique?

The Constitutional Instructive Reward technique builds on Constitutional AI by further optimizing Claude 2's training process. Anthropic generates a large dataset of hypothetical conversational scenarios that might challenge model integrity. The Constitutional AI process provides feedback on which responses are acceptable and which are violations. This dataset then trains an auxiliary Constitutional AI Advisor model through self-supervised learning.

The Constitutional AI Advisor produces reward signals that feed back into Claude 2's reinforcement learning loop. This focuses the overarching optimization toward mitigating identified risks and providing helpful instructions to users. The Advisor guides Claude 2 toward more nuanced integrity not encapsulated by the core Constitutional AI rules. It also provides explainability, since the Advisor's outputs can be inspected to identify why specific responses were judged unwise or unethical.
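As a rough illustration of the reward-signal idea, the sketch below uses a hand-written "advisor" heuristic to score candidate responses and select the best one. The scoring rules are invented for this example; an actual advisor would be a learned model evaluating responses, not keyword checks.

```python
# Hypothetical sketch of an "advisor" that scores candidate responses and
# emits a scalar reward; a real system would use a learned preference
# model rather than the hand-written heuristic shown here.

def advisor_score(response):
    """Toy reward: prefer responses that are polite and not curt."""
    score = 0.0
    if "please" in response.lower() or "happy to help" in response.lower():
        score += 1.0          # politeness bonus
    if len(response.split()) > 3:
        score += 0.5          # penalize one-word replies
    return score

def pick_best(candidates):
    """Best-of-n selection: return the candidate with the highest reward."""
    return max(candidates, key=advisor_score)

candidates = [
    "No.",
    "Happy to help: here is a summary of the report.",
    "Here is a summary.",
]
best = pick_best(candidates)
```

In training, the same scalar scores would be fed into the reinforcement learning loop as rewards rather than used only to rerank outputs.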

Useful Claude 2 Metrics

The information below is gleaned from Anthropic's publications.

- Claude 2 can generate approximately 300 tokens (roughly 225 words) per second on a modern GPU. This enables rapid response times for conversational queries.

- Its average query response latency is under 500 milliseconds, allowing for smooth and natural dialogue flow.

- The model is optimized to run efficiently on commercially available GPUs like the Nvidia A100. On this hardware, it can process over 10 queries per second concurrently.

- Claude 2 requires only 50 GPU hours to train, which improves its sustainability.

- On the SuperGLUE natural language understanding benchmark, Claude 2 achieved a 94% score while running up to 24x faster than GPT-3.

- Cloud-deployed versions of Claude 2 scale to handle over 100,000 users simultaneously.

In summary, Claude 2 is a welcome addition to the growing family of LLMs, with some distinct strengths relative to other models. Best of all, Claude 2 is currently free to try through Anthropic's public chat beta, with API access priced separately.



Reinforcement Learning with Human Feedback: A Powerful Approach to AI Training

The unprecedented capabilities exhibited by large language models (LLMs) such as ChatGPT and GPT-4 have created enormous excitement as well as concerns about the impact of AI on society in the near and distant future. Behind the success of LLMs, and AI in general, lies, among other techniques, a learning approach called Reinforcement Learning with Human Feedback (RLHF). In this blog post, we will try to understand what RLHF is and why it offers a powerful approach to training AI models. Before we do that, however, let's review the concept of reinforcement learning (RL).

What is Reinforcement Learning (RL)?

RL, inspired by the principles of behavioral psychology, is a machine learning technique wherein the learner, called an agent, learns decision making by exploring an environment through a trial-and-error process to achieve its goal. Each action by the agent results in feedback in the form of a reward or punishment. While performing actions and receiving feedback, the agent tries to maximize the expected cumulative reward over time. The figure below shows the basic working of the RL algorithm.

(Figure: the agent-environment interaction loop: the agent takes an action, and the environment returns a new state and a reward.)

The agent has a repertoire of actions to choose from at any given instant. Depending upon the environment, the action space is discrete or continuous. For example, the action space is discrete for an agent learning to play a board game, while it is continuous for an agent, such as an autonomous robot, learning to stay in a driving lane. The agent's choice of action at a given time is governed by a policy. The policy can be deterministic or stochastic, and it is implemented as a table lookup, a simple function, or a search procedure.

The environment refers to the world with which the agent interacts. The term state describes the observation of the environment at any time, which the agent uses as input to decide its next action. As a result of the agent's action, the state of the environment changes, producing a new input for the agent. For example, the positions of the pieces on a chess board at any time define the state of the environment for an agent learning to play chess. Once the agent makes a move, the state of the environment changes; the new state is then used by the agent to choose its next action.

The agent's actions result in rewards, which may be positive, neutral, or negative. To ensure that the agent is not focused only on short-term rewards, a value function of the state-action pair is specified to estimate the expected long-term return. To train the agent to achieve its goal, a policy-based or a value-based implementation is used. A policy-based implementation involves finding a deterministic or stochastic strategy that maximizes the cumulative reward, while a value-based implementation tries to optimize the chosen value function.
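A value-based implementation can be made concrete with tabular Q-learning, the classic value-based algorithm. The sketch below trains an agent to walk down a four-cell corridor toward a goal; the environment, hyperparameters, and episode count are all illustrative.

```python
import random

# Minimal value-based RL sketch: tabular Q-learning on a 4-cell corridor.
# The agent starts in cell 0 and earns a reward of +1 for reaching cell 3;
# every other step costs -0.01, so shorter paths are preferred.

N_STATES, GOAL = 4, 3
ACTIONS = [-1, +1]                 # move left, move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def step(state, action):
    """Environment dynamics: clamp to the corridor, reward the goal."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else -0.01
    return nxt, reward, nxt == GOAL

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[state][action index]

for _ in range(200):                         # training episodes
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.randrange(2)
        else:
            a = max(range(2), key=lambda i: Q[state][i])
        nxt, reward, done = step(state, ACTIONS[a])
        # Q-learning update toward reward plus discounted best future value
        target = reward + (0.0 if done else GAMMA * max(Q[nxt]))
        Q[state][a] += ALPHA * (target - Q[state][a])
        state = nxt

# After training, "right" should dominate in every non-goal state.
policy = ["right" if Q[s][1] > Q[s][0] else "left" for s in range(GOAL)]
```

The learned greedy policy, read off the Q-table, moves right in every cell, which is exactly the strategy that maximizes the cumulative reward here.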

Applications needing sequential decision making are excellent candidates for RL. It has been successfully used in autonomous robotics, finance, recommendation systems, and gaming. AlphaGo from DeepMind is the most well-known example of RL: it was the first computer program to defeat a professional human Go player, a significant achievement given that Go was long considered the most challenging classical game for artificial intelligence because of its complexity.

While traditional RL algorithms have been successful in solving many complex problems, their adoption in the real world has been slow. One limiting factor is that designing reward functions which accurately capture the desired behavior can be daunting and time-consuming; in complex real-world scenarios it can be highly challenging or even impractical. Reinforcement learning with human feedback (RLHF) addresses this challenge by leveraging human feedback to provide a more effective and efficient learning signal.

Reinforcement Learning with Human Feedback (RLHF)

RLHF was originally developed for training simple robots in simulated environments and Atari games. The key idea behind RLHF is to involve human trainers who interact with the AI agent and provide evaluative feedback on its actions. As an example, imagine a robotic arm being trained to grasp objects. Instead of relying solely on the predefined rewards from the environment (such as success or failure of the grasp), a human trainer provides explicit reward signals. The trainer observes the robot's actions and assigns positive or negative rewards based on the quality of the grasp. This feedback helps the robot learn more quickly and accurately. The feedback can take various forms, such as binary signals indicating whether an action is correct or incorrect, preference rankings among different actions, or even more nuanced feedback like explanations or demonstrations. By incorporating this feedback into the learning process, RLHF algorithms can learn from human expertise and accelerate the training process.

There are several approaches to implementing RLHF. One common technique is known as reward modeling, where human trainers provide explicit reward signals instead of relying solely on the environment's predefined rewards. The RL agent then learns from these human-generated rewards to optimize its behavior. Another approach involves interactive learning, where the agent actively seeks feedback from human trainers during its exploration phase. The trainers can guide the agent by providing corrective feedback or demonstrations, helping it learn more efficiently. The process of collecting human feedback and refining the model through reinforcement learning is repeated iteratively, resulting in continuous improvement in the model's performance.
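A minimal sketch of reward modeling follows, assuming a Bradley-Terry style logistic objective: from pairs where humans preferred one response over another, we learn weights so that preferred responses score higher. The two hand-made text features are invented for illustration; a real reward model is a neural network over the full text.

```python
import math

# Toy reward-modeling sketch: learn a scalar reward from pairwise human
# preferences with a Bradley-Terry / logistic objective. Responses are
# represented by two hand-made features; a real reward model would use a
# neural network over the raw text.

def features(response):
    text = response.lower()
    return [
        1.0 if "step" in text else 0.0,      # gives concrete steps
        min(len(text.split()) / 10.0, 1.0),  # rewards some length, capped
    ]

# Human preference pairs: (preferred response, rejected response)
pairs = [
    ("Here are the steps to fix it", "No idea"),
    ("Try these steps one by one today", "Maybe"),
    ("Follow the install steps listed here", "It depends"),
]

w = [0.0, 0.0]
LR = 0.5
for _ in range(200):
    for good, bad in pairs:
        fg, fb = features(good), features(bad)
        margin = sum(wi * (a - b) for wi, a, b in zip(w, fg, fb))
        p = 1.0 / (1.0 + math.exp(-margin))  # P(human prefers "good")
        # gradient ascent on the log-likelihood of the human preference
        for i in range(len(w)):
            w[i] += LR * (1.0 - p) * (fg[i] - fb[i])

def reward(response):
    """Learned scalar reward, usable as an RL training signal."""
    return sum(wi * fi for wi, fi in zip(w, features(response)))
```

Once trained, `reward` plays the role of the environment's reward function: the RL agent is fine-tuned to produce responses that this learned model scores highly.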

The benefits of RLHF are numerous. Firstly, it reduces the sample complexity of RL algorithms, enabling faster and more efficient learning. By incorporating human expertise, RLHF algorithms can leverage existing knowledge and generalize to new situations more effectively. Secondly, RLHF allows for more precise control over the agent's behavior. Human trainers can steer the learning process towards desired outcomes, ensuring that AI systems adhere to specific ethical guidelines or safety constraints. This control and transparency are crucial when deploying AI in critical domains such as healthcare, finance, or autonomous vehicles.

RLHF also bridges the gap between AI and human collaboration. By involving human trainers in the learning loop, RLHF fosters a symbiotic relationship between humans and machines. Trainers can learn from the agent's behavior and iteratively refine their guidance, resulting in a continuous learning feedback loop that benefits both parties. Furthermore, RLHF enables the development of AI systems that can adapt to changing environments or user preferences more effectively, as trainers can update the feedback and influence the agent's behavior in real-time.

RLHF in ChatGPT and GPT-4

OpenAI has used RLHF to train the ChatGPT and GPT-4 models. The full details are available in a paper titled Training Language Models to Follow Instructions with Human Feedback. Here, I will briefly outline the three steps for applying RLHF to a pre-trained language model.

  1. The first step is to collect a dataset of human-generated prompts and responses, and fine-tune the pre-trained language model. 
  2. The next step is to have humans rank the model responses to prompts and use these rankings to train a reward model. 
  3. The final step is to use the reward model as a reward function, and fine-tune the model to maximize this reward. 

The above steps may be repeated to ensure that the model responses are aligned with human responses. The RLHF paper from OpenAI indicates using 40 human labelers, selected through a screening process, to generate responses. The prompts for the task consisted primarily of diverse text prompts submitted to a commercial language model plus a small number of labeler-written prompts. 
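The third step can be sketched with a toy REINFORCE-style update that treats reward-model scores as the training signal. The "policy" here is just a softmax over two canned responses, and every number is illustrative; real RLHF fine-tunes a full language model (OpenAI uses PPO), not a two-entry table.

```python
import math

# Toy sketch of step 3 of the RLHF recipe: treat scores from a
# (hypothetical) learned reward model as the reward function and push a
# tiny softmax "policy" toward higher-reward responses with an expected
# REINFORCE-style update.

RESPONSES = ["I refuse to answer.", "Sure, here is a clear explanation."]
REWARDS = [0.1, 1.0]     # pretend scores from the reward model

logits = [0.0, 0.0]
LR = 0.1

def probs(logits):
    """Softmax over the response logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

baseline = sum(REWARDS) / len(REWARDS)   # variance-reducing baseline
for _ in range(100):
    p = probs(logits)
    # expected policy gradient: grad log pi(a) weighted by the advantage
    for a in range(2):
        advantage = REWARDS[a] - baseline
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - p[i]
            logits[i] += LR * p[a] * advantage * grad

final = probs(logits)   # probability mass shifts to the rewarded response
```

After training, the policy puts most of its probability on the response the reward model prefers, which is precisely what "fine-tune the model to maximize this reward" means in miniature.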

Limitations and Challenges of RLHF

Despite its promise, RLHF poses its own set of challenges. Incorporating human feedback introduces biases and subjectivity that need to be carefully addressed, and balancing the trainer's guidance against the agent's own exploration is a delicate task. While aligning language models with user intentions makes them more useful, it also makes them more prone to misuse, since it becomes easier to use these models to generate convincing misinformation or hateful and abusive content. Scalability also becomes a concern when deploying RLHF algorithms in large-scale applications with numerous trainers or in scenarios where continuous feedback is required.

Nevertheless, RLHF represents a compelling avenue for advancing AI capabilities. By merging the strengths of human intelligence with the computational power of AI, RLHF holds the potential to accelerate the development of intelligent systems capable of solving complex real-world problems. We can anticipate exciting breakthroughs and applications that harness the collaborative power of humans and machines as researchers and engineers continue to explore this field.