How Do Large Language Models Actually Work?
A clear, visual explanation of how AI systems like ChatGPT and Claude generate text. No PhD required - just analogies, diagrams, and honest explanations of what these systems can and cannot do.
What is a Large Language Model?
A Large Language Model (LLM) is a type of artificial intelligence trained to understand and generate human language. Think of it as a very sophisticated autocomplete system.
When you type on your phone and it suggests the next word, that is a tiny language model. LLMs like GPT-4 and Claude are the same idea, but trained on billions of text examples and with billions of adjustable parameters.
The key insight: LLMs do not truly understand language. They are extremely good at predicting what text should come next based on patterns they learned during training.
The Prediction Machine Analogy
Imagine you read millions of books and conversations. After a while, you would get very good at predicting what words typically come next in any sentence.
Example:
"The cat sat on the..."
You would probably guess "mat" because that pattern appears frequently in English text.
LLMs do this at massive scale, considering all the context to make remarkably good predictions about what text should come next.
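This pattern-counting intuition can be made concrete. The sketch below is a toy illustration only (real LLMs learn far richer patterns with neural networks, and the tiny corpus here is invented): it counts which word follows a phrase in some text and predicts the most frequent continuation.

```python
from collections import Counter

def predict_next_word(corpus, prefix):
    """Toy next-word predictor: count what follows `prefix` in the corpus
    and return the most frequent continuation."""
    words = corpus.lower().split()
    prefix_words = prefix.lower().split()
    n = len(prefix_words)
    counts = Counter(
        words[i + n]
        for i in range(len(words) - n)
        if words[i:i + n] == prefix_words
    )
    return counts.most_common(1)[0][0] if counts else None

# Invented mini-corpus for illustration
corpus = (
    "the cat sat on the mat . "
    "the dog sat on the mat . "
    "the cat sat on the sofa ."
)
print(predict_next_word(corpus, "sat on the"))  # "mat" (2 of 3 continuations)
```

An LLM does essentially this, except that instead of exact phrase matches it uses a neural network that generalises across billions of examples.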
How Text Generation Works
When you ask an LLM a question, here is what actually happens behind the scenes, step by step.
User Input
You type a question or prompt
The LLM receives your text as a string of characters. This could be a question, an instruction, or any text you want it to respond to.
Tokenization
Text is broken into tokens
The model cannot read letters directly. It splits text into "tokens" - pieces that might be words, parts of words, or single characters.
Neural Processing
Tokens flow through the neural network
Each token becomes a list of numbers (an embedding) that travels through billions of mathematical operations across the many layers of the network.
Next Token Prediction
Model predicts the most likely next token
Based on everything it learned during training, the model calculates a probability score for every possible next token.
Token Selection
A token is chosen and added to the response
The model selects a token (influenced by the temperature setting), and it becomes part of the response. The process then repeats, one token at a time, until the response is complete.
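The five steps above form a loop that can be sketched in a few lines. Everything here is a stand-in for illustration: the `fake_model` probabilities and its tiny vocabulary are invented, and a real model scores tens of thousands of candidate tokens at every step.

```python
import random

def fake_model(tokens):
    """Stand-in for the neural network: returns made-up probability
    scores for a tiny vocabulary, given the tokens so far."""
    return {" mat": 0.6, " sofa": 0.25, " moon": 0.1, "<end>": 0.05}

def generate(prompt_tokens, max_new_tokens=5, seed=0):
    random.seed(seed)
    tokens = list(prompt_tokens)          # steps 1-2: input, already tokenized
    for _ in range(max_new_tokens):
        probs = fake_model(tokens)        # steps 3-4: score candidate tokens
        choices, weights = zip(*probs.items())
        next_token = random.choices(choices, weights=weights)[0]  # step 5
        if next_token == "<end>":
            break
        tokens.append(next_token)         # append and repeat
    return "".join(tokens)

print(generate(["The", " cat", " sat", " on", " the"]))
```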
Understanding Tokens
Tokens are the building blocks of LLM processing. Here is how the sentence "Hello, how are you?" might be tokenized:
```
// Example tokenization
Input: "Hello, how are you?"

Tokens: ["Hello", ",", " how", " are", " you", "?"]

Token IDs: [9906, 11, 703, 527, 499, 30]

// Each token maps to a number the model can process
// Notice: spaces often attach to the following word
```

Different models use different tokenization schemes. Some split more aggressively (more tokens), others keep larger chunks together. This affects both cost and capability.
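As a toy illustration of the space-attachment behaviour, here is a small regex-based splitter. It is not how production tokenizers work - real schemes such as byte-pair encoding learn their vocabulary from data and also split rare words into sub-word pieces - but it reproduces the pattern shown above:

```python
import re

def toy_tokenize(text):
    """Toy tokenizer: a word keeps its leading space, punctuation
    splits off on its own. Real tokenizers (e.g. BPE) learn their
    vocabulary from data instead of using fixed rules like this."""
    return re.findall(r" ?\w+|[^\w\s]", text)

tokens = toy_tokenize("Hello, how are you?")
print(tokens)  # ['Hello', ',', ' how', ' are', ' you', '?']

# Map each distinct token to an ID, as a model's vocabulary would
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
print([vocab[t] for t in tokens])
```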
Temperature: Controlling Creativity
When the LLM predicts the next token, it does not always pick the most likely option. The temperature setting controls how much randomness to allow.
Low temperature means picking the most probable tokens almost every time (safe, predictable). High temperature means sometimes choosing less likely tokens (creative, unpredictable).
Real world impact: If you ask the same question twice with high temperature, you will get different answers. With low temperature, answers will be nearly identical.
Low temperature: The model almost always picks the highest probability token. Responses are consistent and predictable.
Medium temperature: A balance between creativity and consistency. Some variety while staying mostly on topic.
High temperature: Lower probability tokens have a better chance of being selected. More creative but less predictable.
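The effect of temperature can be shown numerically. In the sketch below (the raw scores for the four candidate tokens are made up for illustration), the model's scores are divided by the temperature before being turned into probabilities. Note how low temperature sharpens the distribution towards one token, while high temperature flattens it:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores (logits) into probabilities.
    Lower temperature -> sharper, more deterministic distribution;
    higher temperature -> flatter, more random distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for candidate next tokens after "The cat sat on the..."
logits = {" mat": 4.0, " sofa": 3.0, " roof": 2.0, " moon": 1.0}

for temp in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(list(logits.values()), temp)
    print(f"T={temp}:", {tok: round(p, 3) for tok, p in zip(logits, probs)})
```

At T=0.2 the top token gets over 99% of the probability; at T=2.0 the distribution is much flatter, so lower-ranked tokens are sampled far more often.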
The Context Window
LLMs have a limited "memory" for each conversation. This is the context window - how many tokens the model can consider at once.
Small Context
~6,000 words. Good for simple Q&A but struggles with long documents or conversations.
Standard (GPT-4, Llama 3)
~100,000 words. Can handle long documents, extended conversations, and complex context.
Large (Claude, Gemini)
~150,000+ words. Can process entire books, large codebases, or very long research papers.
Why Context Window Matters
- Longer context = more information the model can reference
- When exceeded, oldest messages are "forgotten"
- Larger context windows typically cost more per query
- Your input + the response both count toward the limit
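A chat application has to keep each conversation inside the context window. The sketch below shows one simple strategy under stated assumptions: a hypothetical budget of 20 tokens and a crude one-token-per-word count (real systems use the model's actual tokenizer). The oldest messages are dropped first, which is exactly the "forgetting" described above:

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: roughly one token per word."""
    return len(text.split())

def fit_to_context(messages, max_tokens):
    """Keep the most recent messages that fit within the token budget;
    older messages are 'forgotten', as in a real chat session."""
    kept, used = [], 0
    for msg in reversed(messages):            # newest first
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))               # restore chronological order

history = [
    "Hi, I need help with my order",
    "Sure, what is the order number?",
    "It is 12345, placed last Tuesday",
    "Thanks, I can see it now. It shipped yesterday.",
]
print(fit_to_context(history, max_tokens=20))  # oldest two messages dropped
```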
Why LLMs Make Mistakes
LLMs are powerful but not perfect. Understanding their limitations helps you use them effectively.
Training Data Gaps
The model may have incomplete or outdated information. It was trained on data up to a certain date and may not know recent events.
Pattern Matching Gone Wrong
LLMs work by recognising patterns. Sometimes they generate plausible-sounding text that follows learned patterns but is factually incorrect.
No Real Understanding
LLMs do not truly "understand" information like humans do. They predict likely text sequences without verifying factual accuracy.
What is a "Hallucination"?
When an LLM generates confident-sounding but factually incorrect information, we call it a "hallucination". This is not the AI lying - it is generating text that follows learned patterns without verifying facts.
Example hallucination:
"The Sydney Opera House was designed by Frank Lloyd Wright and completed in 1959."
(Actually designed by Jørn Utzon and completed in 1973)
How to Reduce Hallucinations
- Use RAG to ground responses in verified data
- Ask for sources and verify them independently
- Use lower temperature for factual queries
- Provide context within your prompt when possible
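The last point, providing context in the prompt, is straightforward to implement. The sketch below (the snippet text and prompt wording are invented for illustration) pastes verified facts into the prompt so the model can ground its answer in them rather than relying on learned patterns alone:

```python
def build_grounded_prompt(question, snippets):
    """Assemble a prompt that instructs the model to answer only from
    the supplied context - a simple way to reduce hallucinations."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical verified snippets, e.g. retrieved by a RAG pipeline
snippets = [
    "The Sydney Opera House was designed by Jørn Utzon.",
    "It was formally opened on 20 October 1973.",
]
print(build_grounded_prompt("Who designed the Sydney Opera House?", snippets))
```

This is the core idea behind RAG: retrieval supplies the snippets, and the prompt constrains the model to them.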
Different LLMs Compared
Not all LLMs are the same. Here is a neutral comparison of leading models as of 2025.
| Feature | GPT-4 | Claude | Llama 3 | Gemini |
|---|---|---|---|---|
| Open Source | No | No | Yes | No |
| Context Window | 128K tokens | 200K tokens | 128K tokens | 1M+ tokens |
| Best For | General tasks | Analysis, writing | Self-hosting | Multimodal |
| Provider | OpenAI | Anthropic | Meta | Google |
| API Available | Yes | Yes | Yes (via hosts) | Yes |
Open Source Models
Models like Llama can be downloaded and run on your own hardware. You control the data, there are no API costs per query, but you need technical expertise and hardware.
Closed Source Models
Models like GPT-4 and Claude are accessed through APIs. Easy to use, always up-to-date, but you pay per query and data leaves your systems.
Key Takeaways
- LLMs predict text one token at a time based on learned patterns
- They do not truly "understand" - they recognise statistical patterns in language
- Temperature controls how creative vs deterministic the output is
- Context window limits how much text the model can consider at once
- Hallucinations happen because LLMs generate plausible text, not verified facts
- Different LLMs have different strengths - there is no single "best" model
Frequently Asked Questions
What does LLM stand for?
LLM stands for Large Language Model. "Large" refers to the billions of parameters (adjustable values) in the neural network. "Language Model" describes its function: predicting and generating human language.
How do LLMs learn?
LLMs learn through a process called training, where they read billions of text examples from the internet, books, and other sources. During training, they adjust billions of internal parameters to get better at predicting what text comes next. This process requires massive computing power and can take weeks or months.
What is the difference between GPT and an LLM?
GPT (Generative Pre-trained Transformer) is a specific type of LLM created by OpenAI. LLM is the general category that includes GPT, Claude, Llama, Gemini, and many others. It is like how "car" is the category and "Tesla" is a specific brand.
Why do LLMs sometimes make things up?
LLMs do not actually "know" facts - they predict likely text based on patterns learned during training. When they encounter questions about topics not well-covered in training data, or when patterns are ambiguous, they generate plausible-sounding but incorrect text. This is called "hallucination".
What is a context window?
The context window is how much text an LLM can "see" at once. It includes your input plus the response being generated. A 128K token context window means the model can process roughly 100,000 words at once. Larger context windows allow for longer conversations and documents.
Can LLMs learn from conversations?
Standard LLMs do not learn from individual conversations - they only use the training data they were initially trained on. However, they do remember the current conversation within the context window. Some systems use techniques like fine-tuning or RAG to give LLMs access to updated information.
What is the difference between open source and closed source LLMs?
Open source LLMs (like Llama) make their model weights publicly available, allowing anyone to download, modify, and run them. Closed source LLMs (like GPT-4 and Claude) only offer access through APIs - you cannot see or modify the underlying model.
Continue Learning
Now that you understand how LLMs work, explore these related topics.
What is RAG?
Learn how Retrieval-Augmented Generation gives LLMs access to your business data.
Read Guide
Chatbot Architecture
Understand how modern AI chatbots are built and how to choose the right approach.
Explore
Ready to Put LLMs to Work for Your Business?
Now that you understand how LLMs work, let us show you how they can transform your customer service. Free consultation with zero jargon.