How ChatGPT Works


A brief explanation of the thought process and intuition that lie behind the chatbot that you keep hearing about.

This brief introduction to the machine learning models that underpin ChatGPT begins with Large Language Models, moves on to the self-attention mechanism that made it possible to train GPT-3, and ends with Reinforcement Learning From Human Feedback, the method that set ChatGPT apart.

Large Language Models

ChatGPT is an extension of the Large Language Model (LLM) class of machine learning and Natural Language Processing models. LLMs digest massive amounts of text data and infer relationships between the words in that text. These models have grown over the past few years as computing power has increased. LLMs become more capable as the size of their input datasets and parameter space grows.

The most fundamental training task for language models is predicting a word in a sequence of words. This usually takes the form of either next-token prediction or masked language modeling.

Arbitrary example of next-token prediction and masked language modeling. Image by the author.
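To make the two objectives concrete, here is a minimal sketch of how training examples could be built from a single sentence. The sentence, the `[MASK]` placeholder string, and the helper names are illustrative assumptions, not the preprocessing of any particular model.

```python
def next_token_examples(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, position, mask="[MASK]"):
    """Hide one token; the model must recover it from context on both sides."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = mask
    return masked, target


tokens = "Jacob hates reading".split()

# Next-token prediction: the model only sees what came before the target.
for context, target in next_token_examples(tokens):
    print(context, "->", target)

# Masked language modeling: the model sees the words around the blank.
print(masked_example(tokens, 1))
```

In the first objective the context is always to the left of the target; in the second the hidden word can sit anywhere in the sentence.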

In this basic sequencing technique, often deployed through a Long Short-Term Memory (LSTM) model, the model fills in the blank with the statistically most probable word given the surrounding context. This sequential modeling structure has two main drawbacks.

First, the model is unable to value some of the surrounding words more than others. In the example above, while "reading" may most frequently be associated with "hates," "Jacob" may be such an avid reader in the training data that the model should give more weight to "Jacob" than to "reading" and select "loves" rather than "hates."

Second, the input data is processed individually and sequentially rather than as a whole corpus. This means that when an LSTM is trained, the window of context is fixed and extends beyond an individual input for only several steps in the sequence. This limits the complexity of the relationships between words and the meanings that can be derived.

Transformers were developed by a group at Google Brain in 2017 as a response to this problem. Transformers can process all input data simultaneously, in contrast to LSTMs. The model can give different parts of the input data different weights in relation to any position in the language sequence by using a self-attention mechanism. This feature made it possible to process significantly larger datasets and make significant advancements in the process of incorporating meaning into LLMs.

GPT and Self-Attention

OpenAI first introduced Generative Pre-training Transformer (GPT) models in 2018 with GPT-1. The models continued to evolve with GPT-2 in 2019, GPT-3 in 2020, and InstructGPT and ChatGPT in 2022. Before human feedback was incorporated into the system, the greatest advancement in the evolution of the GPT models was the ability to train GPT-3 on significantly more data than GPT-2, which gave it a more diverse knowledge base and the ability to perform a wider range of tasks.

Comparison of GPT-2 (left) and GPT-3 (right). Image by the author.

All GPT models build on the transformer architecture. The original transformer pairs an encoder that processes the input sequence with a decoder that generates the output sequence, both built around multi-head self-attention; GPT models keep only the decoder stack. Multi-head self-attention enables the model to weight parts of the sequence differently in order to infer meaning and context. The attention is also masked so that each token attends only to the tokens before it, which is what allows the model to be trained on next-token prediction.

The self-attention mechanism that powers GPT works by converting tokens (pieces of text, which can be a word, a sentence, or another grouping of text) into vectors that represent the importance of each token in the input sequence. To do this, the model:

1. Creates a query, key, and value vector for each token in the input sequence.

2. Calculates the similarity between the query vector from step 1 and the key vector of every other token by taking the dot product of the two vectors.

3. Generates normalized weights by feeding the output of step 2 into a softmax function.

4. Generates a final vector, representing the importance of the token within the sequence, by multiplying the weights generated in step 3 by the value vectors of each token.
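The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not GPT's implementation: the projection matrices `w_q`, `w_k`, `w_v` are random stand-ins for learned weights, the dimensions are arbitrary, the division by the square root of the key dimension is the scaling used in the original transformer paper rather than something described above, and the causal mask that GPT applies is left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    # Step 1: project every token embedding into query, key, and value vectors.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Step 2: dot product of each query with each key (scaled; see lead-in).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Step 3: softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)
    # Step 4: each output vector is a weighted sum of the value vectors.
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8  # arbitrary toy sizes
x = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)             # one output vector per token
print(weights.sum(axis=-1))  # each row of weights sums to 1
```

Row `i` of `weights` shows how strongly token `i` attends to every token in the sequence.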

GPT's "multi-head" attention mechanism is an extension of self-attention. Rather than performing the attention steps once, the model runs the mechanism several times in parallel, each time with a new linear projection of the query, key, and value vectors. Expanding self-attention in this way lets the model capture sub-meanings and more complex relationships in the input data.
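A rough sketch of the multi-head idea, under the same caveats as any toy example: the weights are random rather than learned, the sizes are arbitrary, and the output projection `w_o` and the square-root scaling follow the original transformer paper rather than anything stated above. Each head works on its own slice of the projected vectors, and the heads are concatenated at the end.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Every head computes its own attention weights independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)
    # Concatenate the heads back into one vector per token, then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o


rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2  # arbitrary toy sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # same shape as the input: one vector per token
```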

Screenshot of ChatGPT, generated by the author.

Despite the significant advancements in natural language processing made by GPT-3, the model is limited in its ability to align with user intentions. For instance, GPT-3 may produce outputs that:

Lack helpfulness, meaning they do not follow the user's explicit instructions.

Contain hallucinations, stating facts that do not exist or are incorrect.

Lack interpretability, making it difficult for humans to understand how the model arrived at a particular decision or prediction.

Include toxic or biased content that is harmful or offensive and spreads misinformation.

To address some of these inherent issues of standard LLMs, ChatGPT introduced novel training methods.

ChatGPT is a spinoff of InstructGPT, which pioneered a novel method for incorporating human feedback into the training process to better align model outputs with user intent. Reinforcement Learning from Human Feedback (RLHF) is explained in detail in OpenAI's 2022 paper Training language models to follow instructions with human feedback. A simplified version is given below.


Step 1: Supervised Fine Tuning (SFT) Model

The first step was to fine-tune the GPT-3 model by hiring forty contractors to create a supervised training dataset, in which each input has a known output for the model to learn from. The inputs, or prompts, were collected from actual user entries into the OpenAI API. The labelers then wrote a suitable response to each prompt, creating a known output for each input. GPT-3 was then fine-tuned on this new supervised dataset to create GPT-3.5, also called the SFT model.


To maximize diversity in the prompts dataset, no more than 200 prompts could come from any given user ID, and prompts that shared long common prefixes were removed. Finally, all prompts containing personally identifiable information (PII) were removed.
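The filtering rules could look something like the sketch below. It is a guess at the general shape, not OpenAI's pipeline: the function name, the `(user_id, prompt)` record format, and the 50-character threshold for a "long" common prefix are all assumptions, and PII filtering is left out.

```python
from collections import defaultdict


def filter_prompts(records, max_per_user=200, prefix_len=50):
    """Keep at most `max_per_user` prompts per user and drop prompts whose
    first `prefix_len` characters match a prompt that was already kept.

    `records` is an iterable of (user_id, prompt) pairs in collection order.
    `prefix_len` is a hypothetical threshold chosen for illustration.
    """
    per_user = defaultdict(int)
    seen_prefixes = set()
    kept = []
    for user_id, prompt in records:
        if per_user[user_id] >= max_per_user:
            continue  # this user has already contributed the maximum
        prefix = prompt[:prefix_len]
        if prefix in seen_prefixes:
            continue  # shares a long prefix with (or duplicates) a kept prompt
        seen_prefixes.add(prefix)
        per_user[user_id] += 1
        kept.append((user_id, prompt))
    return kept


template = "Write a short story about the following subject: " + "-" * 10
records = [
    ("user_1", template + "dragons"),
    ("user_2", template + "robots"),   # same long prefix, so it is dropped
    ("user_1", "Summarize this paragraph for a second grader."),
]
print(filter_prompts(records))
```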


After aggregating prompts from the OpenAI API, labelers were also asked to create sample prompts to fill out categories for which there was only a small amount of real sample data. The categories of interest were:

Plain prompts: any arbitrary request.

Few-shot prompts: instructions that contain multiple query and response pairs.

User-based prompts: prompts that correspond to a specific use case requested for the OpenAI API.

When generating responses, labelers were instructed to make every effort to infer what the user's instruction was. The paper describes the three primary ways that prompts request information:

Direct: "Tell me about..."

Few-shot: Given these two examples of a story, write another story about the same subject.

Continuation: Given the beginning of a story, finish it.

Combining prompts from the OpenAI API with those written by labelers produced 13,000 input and output samples, which were used to train the supervised model.

The image on the left was taken from OpenAI et al.'s "Training language models to follow instructions with human feedback," 2022: http://arxiv.org/pdf/2203.02155.pdf The author added additional context in red (right).

Step 2: Reward Model

After the SFT model is trained in step 1, it generates responses that are better aligned with user prompts. The next refinement is to train a reward model, whose input is a series of prompts and responses and whose output is a scalar value called a reward. The reward model is required for reinforcement learning, in which a model learns to produce outputs that maximize its reward (see step 3).

To train the reward model, labelers are shown four to nine SFT model outputs for a single input prompt. They are asked to rank these outputs from best to worst, which creates combinations of output rankings as follows.

Example of response ranking combinations. Image by the author.
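As a small illustration of how one ranking expands into pairwise comparisons, the sketch below lists every (preferred, rejected) pair for a ranked list. The response labels are placeholders and `ranking_pairs` is a hypothetical helper name.

```python
from itertools import combinations
from math import comb


def ranking_pairs(ranked_responses):
    """Given responses ordered best to worst, return every
    (preferred, rejected) pair implied by the ranking."""
    # combinations() preserves input order, so the first element of each
    # pair is always the one that was ranked higher.
    return list(combinations(ranked_responses, 2))


pairs = ranking_pairs(["A", "B", "C", "D"])  # "A" is best, "D" is worst
print(pairs)

# Number of comparisons for the smallest and largest ranking groups.
print(comb(4, 2), comb(9, 2))  # 6 pairs for 4 outputs, 36 pairs for 9
```

A single ranking of nine outputs therefore yields 36 comparisons, all of which come from the same prompt.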

Including each combination in the model as a separate datapoint led to overfitting (failure to extrapolate beyond the observed data). To solve the problem, the model was trained with each group of rankings as a single batch datapoint.
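One way to picture "a single batch datapoint" is a loss that averages over all pairs from one ranking before any update is made. The sketch below assumes the pairwise comparison loss described in the InstructGPT paper, the negative log-sigmoid of the reward difference; the reward scores are made-up numbers standing in for the outputs of a reward model.

```python
import math
from itertools import combinations


def ranking_loss(rewards_best_to_worst):
    """Average pairwise loss for one ranking group.

    `rewards_best_to_worst` holds the scalar rewards the model assigned to
    the responses, listed in the order the labeler ranked them.
    """
    pairs = list(combinations(rewards_best_to_worst, 2))
    total = 0.0
    for r_preferred, r_rejected in pairs:
        diff = r_preferred - r_rejected
        # log(1 + exp(-diff)) is the same as -log(sigmoid(diff)).
        total += math.log1p(math.exp(-diff))
    # All pairs from the same prompt contribute to one averaged loss value.
    return total / len(pairs)


agrees = ranking_loss([2.0, 1.0, 0.5, -1.0])     # scores follow the ranking
disagrees = ranking_loss([-1.0, 0.5, 1.0, 2.0])  # scores invert the ranking
print(agrees, disagrees)
```

The loss is small when the reward model scores the responses in the same order as the labeler and large when it scores them in the opposite order.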

The image on the left was taken from OpenAI et al.'s "Training language models to follow instructions with human feedback," 2022: http://arxiv.org/pdf/2203.02155.pdf The author added additional context in red (right).

Step 3: Reinforcement Learning Model

In the final stage, the model is presented with a random prompt and returns a response. The response is generated using the "policy" that the model has learned so far. A policy is the strategy the machine has learned to use to achieve its goal; in this case, maximizing its reward. The reward model developed in step 2 then determines a scalar reward value for the prompt and response pair. The reward is then fed back into the model to update the policy.

In 2017, Schulman et al. introduced Proximal Policy Optimization (PPO), the method used to update the model's policy as each response is generated. PPO here incorporates a per-token Kullback–Leibler (KL) penalty from the SFT model. The KL divergence measures how similar two distribution functions are and penalizes extreme distances. In this case, the KL penalty limits how far the responses can drift from the outputs of the SFT model trained in step 1, to avoid over-optimizing the reward model and departing too far from the human intention dataset.
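The effect of the KL penalty on the reward can be sketched as below. This is not the PPO update itself, only an illustration of how the penalty reduces the reward: the log-probabilities are invented numbers, the coefficient `beta` is a hypothetical value, and summing the per-token log-ratios of the sampled response is used as a simple estimate of the divergence.

```python
def penalized_reward(reward, logp_policy, logp_sft, beta=0.02):
    """Reward-model score minus a per-token KL-style penalty.

    reward      : scalar from the reward model for the full response
    logp_policy : log-probability of each sampled token under the policy
    logp_sft    : log-probability of the same tokens under the SFT model
    beta        : penalty strength (hypothetical value for illustration)
    """
    # log(pi_policy / pi_sft) per token; positive when the policy is more
    # confident in a token than the SFT model was.
    log_ratios = [lp - ls for lp, ls in zip(logp_policy, logp_sft)]
    return reward - beta * sum(log_ratios)


# Policy identical to the SFT model: no penalty is applied.
print(penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))

# Policy has drifted toward tokens the SFT model found unlikely: the
# reward is reduced in proportion to the drift.
print(penalized_reward(1.0, [-0.1, -0.2], [-3.0, -4.0]))
```

A larger `beta` keeps the policy closer to the SFT model; a smaller one lets it chase the reward model more aggressively.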

The image on the left was taken from OpenAI et al.'s "Training language models to follow instructions with human feedback," 2022: http://arxiv.org/pdf/2203.02155.pdf The author added additional context in red (right).

Steps 2 and 3 of the process can be repeated, although in practice this has not been done extensively.

Screenshot of ChatGPT, generated by the author.

Evaluation of the Model

The model is evaluated on a test set that is set aside during training and that the model has not seen. A series of evaluations is run on the test set to determine whether the model is better aligned than its predecessor, GPT-3.

Helpfulness: the model's ability to infer and follow user instructions. Labelers preferred outputs from InstructGPT over those from GPT-3 85 ± 3% of the time.

Truthfulness: the model's tendency to hallucinate. When evaluated on the TruthfulQA dataset, the PPO model produced outputs with slight increases in truthfulness and informativeness.

Harmlessness: the model's ability to avoid inappropriate, derogatory, and denigrating content. Harmlessness was tested using the RealToxicityPrompts dataset. The test was performed under three conditions:

Instructed to provide respectful responses: resulted in a significant decrease in toxic responses.

Instructed to provide responses without any instruction about respectfulness: no significant change in toxicity.

Instructed to provide toxic responses: responses were in fact significantly more toxic than those of the GPT-3 model.

For more information on the methods used to create InstructGPT and ChatGPT, read the original OpenAI paper, Training language models to follow instructions with human feedback, 2022: http://arxiv.org/pdf/2203.02155.pdf
