What if I told you that the same mechanism powering ChatGPT, Claude, and virtually every cutting-edge AI system today traces back to a single 2017 research paper? Eight researchers, one groundbreaking idea, and the world of AI was never the same.
In our last article, Understanding Transformer LLMs From Scratch | Tokenizers, we explored how LLMs convert raw text into a numerical language that machines can process. Think of that as teaching the model to read. Now comes the more fascinating question: how does it actually “think”?
This is where the Transformer enters the picture.
In this article, we’re pulling back the curtain on the architecture that started a revolution. We’ll take a high-level tour through Transformer blocks and unpack the self-attention mechanism — the secret ingredient that gives modern LLMs their almost uncanny ability to understand context, nuance, and meaning across long stretches of text.
If you’ve ever wondered what’s really happening under the hood when an LLM generates a response, you’re about to find out. Let’s dive in.
Transformer Blocks
The Transformer blocks conduct the computations that produce embeddings based on the attention mechanism. These embeddings are then used in the LM head. This is a very high-level overview for now; we’ll dive into it much more closely later on.
Let’s start by going over a quick recap of what we have covered so far.
Language Model Head (LM Head)
The LM head is a neural network that takes the computed embedding from the Transformer blocks and makes a final prediction of the next likely word. This prediction is a probability distribution, basically answering the question: “Of all the words (tokens) in the tokenizer vocabulary, what is the probability of each being the next word?” In simpler terms, the LM head assigns a probability to every token in the tokenizer vocabulary, and the token with the highest probability gets output as the next word.
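To make this concrete, here is a minimal NumPy sketch of what an LM head does: project the final embedding onto the vocabulary and softmax the result into a probability distribution. The matrix `W_vocab` and all the sizes here are toy values for illustration, not a real model’s weights.

```python
import numpy as np

def lm_head(final_embedding, W_vocab):
    """Project the last hidden state onto the vocabulary, then softmax."""
    logits = final_embedding @ W_vocab      # shape: (vocab_size,)
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()                  # one probability per token

# Toy example: hidden size 4, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
embedding = rng.standard_normal(4)          # "final" embedding from the blocks
W = rng.standard_normal((4, 5))             # made-up projection matrix

probs = lm_head(embedding, W)
next_token_id = int(np.argmax(probs))       # the most likely next token
print(probs.sum())                          # probabilities sum to 1
```

The softmax at the end is what turns raw scores (logits) into the probability distribution the decoding strategy will pick from.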

An overview of this whole process of a text generation model is illustrated in the diagram below:

NOTE: We can attach different types of “heads” to the Transformer to do different things. For LLMs, we’ll focus on heads that allow us to predict the next likely token (“word”).
Decoding Strategy
From the diagram above, we have seen how the LM head produces a probability distribution over every token in the tokenizer vocabulary. The process of selecting one of these tokens from the probability distribution is what we call the decoding strategy.
One of the commonly used decoding strategies is greedy decoding, which simply selects the token with the highest score or probability. An alternative approach is to introduce some randomness so that we do not always select the token with the highest probability, but sometimes the second or third most likely one. With this strategy, other tokens have a chance of being selected, through simple sampling from the probability distribution. Setting the temperature of an LLM to zero when making API calls corresponds to greedy decoding; higher temperatures increase the randomness, introducing some level of creativity into the model.
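The two strategies can be sketched in a few lines. The function below is a toy illustration, not any particular library’s API: temperature zero falls back to greedy decoding, while higher temperatures flatten the distribution before sampling from it.

```python
import numpy as np

def sample_token(probs, temperature=1.0, rng=None):
    """Pick the next token id from a probability distribution.

    temperature == 0 : greedy (always the highest-probability token)
    temperature  > 1 : flatter distribution, more randomness
    """
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(probs))             # greedy decoding
    logits = np.log(probs) / temperature         # rescale the scores
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()                      # re-normalize
    return int(rng.choice(len(probs), p=probs))  # sample from distribution

probs = np.array([0.6, 0.3, 0.1])                # toy 3-token vocabulary
print(sample_token(probs, temperature=0))        # greedy: always token 0
```

With a non-zero temperature, repeated calls will occasionally return tokens 1 or 2 as well, which is exactly the “creativity” knob described above.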
Processing Streams And Parallel Token Processing
You might be thinking: if the LLM has to process all the previous “words” (tokens) before it can predict a new one, this should take a significant amount of time, right? Yes, it would, but the Transformer LLM has a trick of its own. It utilizes parallel computing, in the sense that all the tokens are processed at the same time, not in a sequential manner.
You can think of it as each token being passed through its own processing pipeline, with multiple pipelines all running in parallel; hence, tokens are processed in parallel with one another. Models have a limited number of these pipelines. In fact, there is a name for these “pipelines” or “processing streams”: the total number of them is what we refer to as the context window size, or context length, of a model. If a model can process 100 tokens at the same time, we say the model has a context length of 100 tokens. In the real world, models have far more than this; some have a context length of up to 1,000,000 tokens, meaning the model can process up to 1M tokens at a single time.

NOTE:
The above diagram is not exactly what happens: not all the output embeddings are passed into the LM head, only the last one. We’ll go over this more in future articles; for the time being and for simplicity, you can think of it like this. But why calculate all the other output vectors only to discard them? Hold on to this question; we’ll look into it in more detail later on when we discuss the attention mechanism.
What happens if the text you pass into the model is longer than the context length? The input is first checked to see whether its token count exceeds the context length and, depending on the result, the tokens are either processed through the Transformer blocks or an error is raised.
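As a toy sketch, that check might look like the function below. The function name and the 100-token limit are made up for illustration; real libraries handle this in their own ways (truncation, sliding windows, or an error).

```python
def check_context(token_ids, context_length=100):
    """Reject inputs longer than the model's context window (toy check)."""
    if len(token_ids) > context_length:
        raise ValueError(
            f"Input has {len(token_ids)} tokens, "
            f"but the context length is {context_length}."
        )
    return token_ids

check_context(list(range(50)))       # 50 tokens: fits, passes through
# check_context(list(range(200)))    # 200 tokens: would raise ValueError
```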
Don’t worry too much about all of this; I’m just dumping ideas here for the time being, and it will all come together in perfect harmony. You’ll come back to this specific article, I can promise you that :)…
Keys and Values (KV-Caching)
So, for each iteration to predict the next token, we would have to recompute all the previous tokens as well, then only pick the last one to perform a forward pass through the LM head. This is computationally intensive, don’t you think? Luckily, a caching mechanism was introduced in Transformers to store previous token computations, so in future iterations they do not have to be recomputed each time. This speeds up processing and saves compute, reducing inference time. The optimization technique is called the key-value cache, a.k.a. KV-caching. Keys and values are essential components of the self-attention mechanism; we’ll dive into them later on.
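Here is a toy sketch of the idea, with made-up vector sizes: on each decoding step, we compute keys and values only for the new token and append them to the cache, instead of recomputing them for the whole sequence.

```python
import numpy as np

class KVCache:
    """Toy key-value cache: store each token's K and V once, reuse later."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def as_arrays(self):
        return np.stack(self.keys), np.stack(self.values)

# On each decoding step we compute K/V only for the NEW token,
# instead of recomputing them for every previous token.
rng = np.random.default_rng(0)
cache = KVCache()
for step in range(3):                    # three generated tokens
    new_k = rng.standard_normal(4)       # key for the new token only
    new_v = rng.standard_normal(4)       # value for the new token only
    cache.append(new_k, new_v)
    K, V = cache.as_arrays()             # all keys/values seen so far
print(K.shape)                           # grows by one row per token
```

Real implementations cache per layer and per attention head, but the principle is the same: trade memory for repeated computation.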

Anatomy Of The Transformer Blocks
The Transformer block is the most critical part of a Transformer LLM, in my opinion. Most of the computation happens here, and the results we get from the LLM largely depend on this section of the architecture.
The stack-of-Transformers section of a Transformer LLM has many Transformer blocks stacked one after the other in a sequential layout: the output of one block is the input to the next. This stack can consist of as many as hundreds of blocks. In the original “Attention Is All You Need” paper, the encoder and the decoder each had a stack of only 6 blocks.
Why is this section so important that we need to stack so many Transformer blocks? Let’s take a deeper look into it.
This section of the architecture consists of two main components:
- Self-Attention layer
- Feedforward Neural Network
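As a preview, the two components can be sketched in a few lines of NumPy. This is a bare-bones, single-head illustration with random toy weights; real blocks add multi-head attention, causal masking, and layer normalization, all of which we’ll cover later.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over the whole sequence (no masking)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled dot products
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over tokens
    return w @ V                                     # mix the value vectors

def feedforward(x, W1, W2):
    """Position-wise feed-forward network: expand, ReLU, project back."""
    return np.maximum(x @ W1, 0) @ W2

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    """Attention then feed-forward, each with a residual connection."""
    X = X + self_attention(X, Wq, Wk, Wv)
    X = X + feedforward(X, W1, W2)
    return X

rng = np.random.default_rng(0)
d = 8
X = rng.standard_normal((5, d))                      # 5 tokens, 8-dim each
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W1, W2 = rng.standard_normal((d, 4 * d)), rng.standard_normal((4 * d, d))
out = transformer_block(X, Wq, Wk, Wv, W1, W2)
print(out.shape)                                     # same shape in, same out
```

Notice the output has the same shape as the input: that is what lets us stack blocks one after another, feeding each block’s output into the next.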

Conclusion
Congratulations on making it to the end of this blog post! I hope you now have a clearer high-level overview of how the Transformer blocks work in Large Language Models (LLMs). In the next article, we’ll take a closer look at the different components of the Transformer blocks, specifically the self-attention mechanism.
I know this article was more of a brain dump of ideas, but trust me, we’ll come back to all these concepts. Stay tuned!
Other platforms where you can reach out to me:
Happy coding! And see you next time, the world keeps spinning.