
How large language models work, a visual intro to transformers | Chapter 5, Deep Learning



Introduction

The initials GPT stand for Generative Pre-trained Transformer. The first word, "Generative," indicates that these models are designed to generate new text. The term "Pre-trained" signifies that the models undergo a learning process using vast amounts of data, and the prefix suggests there is room for further fine-tuning on specific tasks with additional training. However, the most significant part is the word Transformer, which refers to a specific type of neural network, a core invention behind the current surge in AI.

In this article, we will explore the inner workings of the Transformer model, visually illustrating the data flow and the processes at each step. Transformers appear in many kinds of models, from speech recognition to text-to-speech and even image generation, the latter exemplified by tools like DALL-E and Midjourney. The original Transformer model, introduced by Google in 2017, was designed for translating text between languages. However, the variant we will focus on, the model underlying applications like ChatGPT, predicts the text that follows an initial sequence. This prediction is expressed as a probability distribution over many possible next chunks of text.

Initially, predicting the next word may seem like a different task from generating long passages of text. However, with a predictive model, you can supply an initial snippet, sample one token from the generated distribution, append it to the text, and repeat the process to produce cohesive output. For example, running this loop with GPT-2 often yields results that lack coherence, whereas GPT-3, a much larger model, produces coherent stories even from unusual prompts.
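
To make the loop concrete, here is a minimal sketch in Python. It assumes a hypothetical predict_next_token_probs function that returns a probability for every token in the vocabulary; the sketch simply samples from that distribution and appends the result, over and over.

```python
import random

def generate(prompt_tokens, predict_next_token_probs, max_new_tokens=50):
    """Repeatedly predict a distribution over next tokens, sample one, and append it."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = predict_next_token_probs(tokens)   # hypothetical model call: token -> probability
        candidates = list(probs.keys())
        weights = list(probs.values())
        next_token = random.choices(candidates, weights=weights, k=1)[0]
        tokens.append(next_token)                  # feed the sampled token back in
    return tokens
```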

Data flows through a Transformer in a structured manner. The input is divided into smaller pieces known as tokens, which can represent words, punctuation, or even parts of words. Each token is associated with a vector, a list of numbers that encodes its meaning. These vectors pass through an attention block, allowing them to communicate and update their meanings based on context. For instance, the word "model" demonstrates differing meanings based on its context in "machine learning model" versus "fashion model."
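
As a toy illustration of tokenization, using a made-up five-word vocabulary and naive whitespace splitting rather than the subword schemes real models use, the mapping from text to token IDs might look like this:

```python
# Toy vocabulary: real tokenizers use subword pieces, so a long word
# may be split into several tokens.
vocab = {"a": 0, "machine": 1, "learning": 2, "model": 3, "fashion": 4}

def tokenize(text):
    """Split on whitespace and map each piece to its integer token ID."""
    return [vocab[word] for word in text.lower().split()]

print(tokenize("a machine learning model"))   # [0, 1, 2, 3]
print(tokenize("a fashion model"))            # [0, 4, 3]
```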

After the attention mechanism processes the vectors, they flow through a multi-layer perceptron, where every vector is transformed in parallel by the same operation. This alternation continues through many layers until the final vector in the sequence has absorbed the meaning needed to predict what comes next. The model's ultimate goal is to turn that vector into a probability distribution over potential next tokens.
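
The following rough sketch, with made-up sizes and a ReLU-style nonlinearity rather than any particular model's exact choice, shows the key idea: the same two-layer perceptron is applied to every token vector independently.

```python
import numpy as np

def mlp_block(x, w_in, b_in, w_out, b_out):
    """Apply the same two-layer perceptron to every token vector in parallel.

    x has shape (num_tokens, d_model); the hidden layer is wider, and the
    result is projected back down to d_model.
    """
    hidden = np.maximum(0.0, x @ w_in + b_in)   # linear layer + simple nonlinearity
    return hidden @ w_out + b_out               # project back to the model dimension

d_model, d_hidden, num_tokens = 8, 32, 4        # toy sizes for readability
rng = np.random.default_rng(0)
x = rng.normal(size=(num_tokens, d_model))
out = mlp_block(
    x,
    rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden),
    rng.normal(size=(d_hidden, d_model)), np.zeros(d_model),
)
print(out.shape)                                # (4, 8): one updated vector per token
```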

The initial step is to transform text into tokens and convert these into vectors through an embedding matrix. Words close in meaning tend to cluster together in high-dimensional space, with dimensions typically reaching up to 12,288 in models like GPT-3. The embedding captures semantic meanings, allowing vectors to adjust based on surrounding context.
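
A minimal sketch of the embedding lookup, with tiny made-up sizes so the shapes stay readable (GPT-3's embedding dimension is 12,288, far larger than the 8 used here), looks like this:

```python
import numpy as np

vocab_size, embedding_dim = 5, 8
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

token_ids = [0, 1, 2, 3]                     # e.g. "a machine learning model"
token_vectors = embedding_matrix[token_ids]  # each ID selects one row of the matrix
print(token_vectors.shape)                   # (4, 8): one vector per token
```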

At the end of the process, another matrix maps the last vector to a list of scores, one for each potential next token, and the softmax function turns those scores into a probability distribution. Softmax transforms a list of arbitrary numbers into a normalized distribution, emphasizing the highest values.
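
Here is a small Python sketch of the softmax computation itself; subtracting the maximum is only for numerical stability and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn an arbitrary list of numbers into a probability distribution."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()          # values land in (0, 1) and sum to 1

print(softmax([2.0, 1.0, 0.1]))       # the largest input gets the largest probability
```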

Lastly, the article touches on the temperature parameter in sampling probabilities, which controls the randomness of predictions. A higher temperature leads to more diverse outcomes, while a lower temperature tends to favor the most predictable word.
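
A short sketch of how temperature fits in: dividing the scores by the temperature before applying softmax sharpens or flattens the resulting distribution (the example values here are made up).

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Scale the scores by 1/temperature before applying softmax."""
    logits = np.asarray(logits, dtype=float) / temperature
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

scores = [2.0, 1.0, 0.1]
print(softmax_with_temperature(scores, temperature=0.5))   # sharper: favors the top score
print(softmax_with_temperature(scores, temperature=2.0))   # flatter: more diverse sampling
```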

With this foundational understanding, readers are encouraged to delve into more complex processes, like the attention mechanism, in subsequent sections.


Keywords

  • Transformers
  • Generative Pre-trained Transformer
  • Neural Networks
  • Attention Mechanism
  • Tokens
  • Word Embeddings
  • Probability Distribution
  • Softmax

FAQ

Q1: What is a Transformer model?
A: A Transformer model is a type of neural network architecture that uses mechanisms like attention to process and generate sequences of data, commonly used in natural language processing.

Q2: What does the term "pre-trained" mean in the context of GPT?
A: "Pre-trained" indicates that the model has already been trained on a large dataset before being fine-tuned for specific tasks.

Q3: How does the attention mechanism work?
A: The attention mechanism allows different parts of the input data to focus on other relevant parts, updating their meanings based on context.

Q4: What role do embeddings play in language models?
A: Embeddings convert each token (word or part of a word) into a high-dimensional vector that captures semantic meanings and relationships with other words.

Q5: What is the softmax function used for?
A: The softmax function normalizes a list of arbitrary numbers into a probability distribution, ensuring all values are between 0 and 1 and sum to 1.
