LLM Essentials
Master the fundamentals of Large Language Models to power your AI agents
Understanding Large Language Models
Large Language Models (LLMs) are the core technology powering modern AI agents. Understanding how they work, their capabilities, and their limitations is essential for effective AI agent development.
Key Insight
LLMs are not just text generators—they're reasoning engines that can process information, make decisions, and solve problems when properly prompted.
How LLMs Work
At a conceptual level, LLMs function through these key mechanisms:
- Pattern Recognition: LLMs identify statistical patterns in language from massive training datasets
- Contextual Understanding: They interpret input text based on surrounding context
- Prediction: They predict the most likely next tokens based on learned patterns
- Generation: They assemble these predictions into coherent responses
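The prediction step above can be made concrete in a few lines of code. The sketch below uses the small open GPT-2 model via the Hugging Face transformers library purely as an illustration (the model choice and prompt are assumptions, not recommendations); any causal LLM produces the same kind of next-token distribution.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small open model purely to illustrate next-token prediction
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "AI agents are"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits                          # a score for every token in the vocabulary
next_token_probs = torch.softmax(logits[0, -1], dim=-1)       # distribution over the next token

# The five most likely continuations and their probabilities
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}  p={float(prob):.3f}")

Generation is simply this step in a loop: pick a token from the distribution, append it to the input, and predict again.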
The Transformer Architecture
Modern LLMs are based on the transformer architecture, which uses:
- Self-Attention: Allows the model to weigh the importance of different words in relation to each other
- Parallel Processing: Enables efficient training on massive datasets
- Positional Encoding: Helps the model understand word order
- Multi-Head Attention: Allows the model to focus on different aspects of the input simultaneously
While understanding the technical details isn't necessary for most applications, knowing these concepts helps explain LLM behaviour.
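For readers who do want a peek under the hood, the heart of self-attention is a single formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a minimal NumPy sketch of one attention head, with toy dimensions and random inputs standing in for real embeddings.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V                   # weighted mix of value vectors per token

seq_len, d_k = 4, 8                      # 4 tokens, 8-dimensional vectors (toy sizes)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
print(self_attention(Q, K, V).shape)     # (4, 8): one updated vector per token

Multi-head attention runs several of these in parallel with different learned projections and concatenates the results.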
Key LLM Capabilities
Modern LLMs can perform a wide range of tasks that make them valuable for AI agents:
| Capability | Description | Agent Applications |
|---|---|---|
| Text Generation | Creating coherent, contextually relevant text | Content creation, responses, summaries |
| Reasoning | Working through problems step-by-step | Decision making, planning, troubleshooting |
| Instruction Following | Executing specific directions | Task automation, workflow execution |
| Information Extraction | Identifying key data points from text | Data processing, form filling, analysis |
| Transformation | Converting between formats and styles | Translation, summarisation, simplification |
| Tool Use | Generating structured outputs for external tools | API calls, database queries, function execution |
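The last row, tool use, is worth a concrete example because it underpins most agent architectures. The sketch below uses the OpenAI Python SDK (it assumes an OPENAI_API_KEY in the environment, and get_weather is a hypothetical tool defined only for illustration) to have the model return a structured call instead of prose.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A hypothetical tool schema: the model never runs this, it only decides to call it
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "What's the weather like in Paris right now?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model chose the tool rather than answering in prose
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))

Your agent code is responsible for actually executing the call and feeding the result back to the model.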
The Token Economy
Understanding tokens is crucial for effective and cost-efficient use of LLMs: they determine how much text fits in a prompt and how much each call costs.
What Are Tokens?
Tokens are the basic units of text that LLMs process. They're not exactly words or characters, but pieces of text that the model treats as single units.
Token Examples
The sentence "I want to build an AI agent" might be tokenised as:
["I", "want", "to", "build", "an", "AI", "agent"]
But more complex words might be split into multiple tokens:
"Tokenisation" → ["Token", "isation"]
And frequent whole words are often single tokens, usually with their leading space attached:
" agent" → [" agent"]
Token Counting Rules of Thumb
- In English, 1 token ≈ 4 characters
- 1 token ≈ ¾ of a word
- 100 tokens ≈ 75 words
- 1,500 tokens ≈ 1 page of text
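These are only rules of thumb. For exact numbers, you can inspect the splits directly; the snippet below uses tiktoken with the cl100k_base encoding (the one used by the GPT-3.5/GPT-4 family; other tokenisers split text differently).

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

for text in ["I want to build an AI agent", "Tokenisation", "New York"]:
    token_ids = encoding.encode(text)
    pieces = [encoding.decode([tid]) for tid in token_ids]  # each token as readable text
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")

Notice that most tokens carry their leading space, which is why word counts and token counts never line up exactly.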
Token Counting Tools
Use these tools to accurately count tokens for your prompts and responses:
- OpenAI Tokenizer - Official OpenAI tool
- Hugging Face Tokenizer Playground - Works with multiple models
Context Windows
The context window is the maximum number of tokens an LLM can process at once, including both the input prompt and the generated output.
| Model | Context Window (tokens) | Approximate Text Equivalent |
|---|---|---|
| GPT-3.5 Turbo | 16,385 | ~12 pages |
| GPT-4 Turbo | 128,000 | ~85 pages |
| Claude 3 Opus | 200,000 | ~133 pages |
| Mistral Large | 32,768 | ~22 pages |
| Llama 3 70B | 8,192 | ~5 pages |
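Because the prompt and the response share this budget, it pays to check the fit before making a call. A minimal sketch follows, with window sizes copied from the table above and fits_in_context as an illustrative helper rather than a library function.

import tiktoken

CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4-turbo": 128_000,
}

def fits_in_context(prompt: str, model: str, max_output_tokens: int = 1_000) -> bool:
    """Return True if the prompt plus a reserved output budget fits the window."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # sensible default encoding
    prompt_tokens = len(encoding.encode(prompt))
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOWS[model]

print(fits_in_context("Summarise this report: ...", "gpt-3.5-turbo"))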
Context Window Limitations
Problem: LLMs can only "see" text within their context window. Anything beyond this limit is invisible to the model.
Impact: This creates a "goldfish memory" problem where the model forgets information that falls outside the window.
Solution: Implement memory systems in your agents to store and retrieve information beyond the context window.
Token Economy Optimisation Strategies
Cost Optimisation Techniques:
- Prompt Compression: Remove unnecessary details and examples
- Two-Stage Processing: Use cheaper models for initial processing, premium models for final output
- Chunking: Break large documents into manageable pieces
- Summarisation: Condense previous conversation turns
- Caching: Store responses for common queries
Implementing a Token-Efficient Agent
import tiktoken

class TokenEfficientAgent:
    def __init__(self, model_name="gpt-3.5-turbo", max_history_tokens=4000):
        self.conversation_history = []
        self.max_history_tokens = max_history_tokens
        self.token_count = 0
        try:
            self.encoding = tiktoken.encoding_for_model(model_name)
        except KeyError:
            print(f"Warning: Model {model_name} not found. Using cl100k_base encoding.")
            self.encoding = tiktoken.get_encoding("cl100k_base")

    def _count_tokens(self, text):
        """Count tokens using the specified model's encoder."""
        return len(self.encoding.encode(text))

    def add_message(self, role, content):
        """Add a message to the conversation history with accurate token counting."""
        # Accurate token count for the message content
        message_tokens = self._count_tokens(content)
        # Add overhead tokens per message (approximation, varies by model)
        overhead_tokens_per_message = 4
        total_tokens = message_tokens + overhead_tokens_per_message

        message = {"role": role, "content": content, "tokens": total_tokens}
        self.conversation_history.append(message)
        self.token_count += total_tokens

        # Prune history if needed
        self._prune_history()

    def _prune_history(self):
        """Ensure conversation history stays within token limits."""
        # Account for tokens needed for the response
        buffer_tokens = 1000  # Reserve tokens for the next model response
        target_tokens = self.max_history_tokens - buffer_tokens

        if self.token_count <= target_tokens:
            return

        # Always keep the system message if present
        system_message = None
        if self.conversation_history and self.conversation_history[0]["role"] == "system":
            system_message = self.conversation_history.pop(0)
            self.token_count -= system_message["tokens"]

        # Remove oldest messages (excluding system message) until under the limit
        while self.token_count > target_tokens and len(self.conversation_history) > 0:
            # Remove from the beginning (oldest conversation turns)
            removed = self.conversation_history.pop(0)
            self.token_count -= removed["tokens"]

        # Re-add system message if it existed
        if system_message:
            self.conversation_history.insert(0, system_message)
            self.token_count += system_message["tokens"]
            # If adding the system message back puts us over, remove another message
            if self.token_count > target_tokens and len(self.conversation_history) > 1:
                removed = self.conversation_history.pop(1)  # Remove message after the system prompt
                self.token_count -= removed["tokens"]

    def get_messages_for_api(self):
        """Get messages formatted for API call."""
        return [{"role": m["role"], "content": m["content"]} for m in self.conversation_history]
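A quick usage sketch: the actual LLM call is omitted, since get_messages_for_api simply returns the pruned history in the shape most chat APIs expect.

agent = TokenEfficientAgent(model_name="gpt-3.5-turbo", max_history_tokens=4000)
agent.add_message("system", "You are a concise research assistant.")
agent.add_message("user", "Summarise the key points of the transformer architecture.")

messages = agent.get_messages_for_api()  # pass this list to your chat completion call
print(f"History holds {len(messages)} messages, ~{agent.token_count} tokens")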
Temperature and Sampling Controls
Temperature and related parameters control the randomness and creativity of LLM outputs. Understanding these controls is essential for tailoring responses to different tasks.
Temperature Explained
Temperature controls the randomness of the model's token selection. It's typically set between 0.0 and 1.0 (or sometimes higher), with higher values producing more diverse and creative outputs.
Temperature Settings Guide:
| Setting | Effect | Best For |
|---|---|---|
| 0.0 - 0.3 | Highly deterministic, focused on most likely tokens | Factual responses, code generation, structured data extraction |
| 0.4 - 0.7 | Balanced between determinism and creativity | General conversation, explanations, summaries |
| 0.8 - 1.0+ | Highly creative, diverse, potentially less coherent | Brainstorming, creative writing, role-playing |
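Under the hood, temperature simply rescales the model's raw scores before they become probabilities: probabilities = softmax(logits / T). The toy sketch below (made-up logits, no model required) shows how a low temperature concentrates probability on the top token while a high one spreads it out.

import numpy as np

def next_token_distribution(logits, temperature):
    scaled = np.array(logits) / temperature   # temperature rescales the raw scores
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()                    # softmax over the scaled scores

logits = [4.0, 3.0, 1.0, 0.5]                 # raw scores for four candidate tokens
for t in (0.2, 0.7, 1.2):
    print(t, np.round(next_token_distribution(logits, t), 3))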
Other Sampling Parameters
- Top-p (Nucleus Sampling): Considers only the most probable tokens that make up a certain cumulative probability mass (e.g., top_p=0.9 considers tokens covering 90% probability). Often used instead of or alongside temperature.
- Top-k: Considers only the top k most probable tokens. Less common now than top-p.
- Frequency Penalty: Reduces a token's likelihood in proportion to how often it has already appeared, discouraging verbatim repetition.
- Presence Penalty: Applies a flat penalty to any token that has already appeared at all, encouraging the model to introduce new words and topics.
Best Practice: Adjust One Parameter at a Time
Typically, adjust either temperature OR top-p, not both. Start with temperature and only explore top-p if temperature alone doesn't give the desired control.
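In practice these controls are just request parameters. A hedged sketch with the OpenAI Python SDK follows (it assumes an OPENAI_API_KEY in the environment; the parameter values are illustrative, not recommendations).

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Suggest three names for a budgeting app."}],
    temperature=0.9,        # creative task, so lean towards diversity
    # top_p=0.9,            # the alternative control; adjust one of these, not both
    frequency_penalty=0.3,  # discourage repeating the same words
    presence_penalty=0.0,
)
print(response.choices[0].message.content)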
Major LLM Providers and Models
Several major players offer powerful LLMs via APIs. Here's an overview of the landscape in 2025:
1. OpenAI
- Key Models: GPT-4 Turbo, GPT-3.5 Turbo
- Strengths: Strong general reasoning, instruction following, code generation
- Considerations: Can be expensive, closed-source
2. Anthropic
- Key Models: Claude 3 (Opus, Sonnet, Haiku)
- Strengths: Strong performance on benchmarks, large context window, focus on safety
- Considerations: Newer player, API access might be less widespread
3. Google
- Key Models: Gemini (Pro, Ultra)
- Strengths: Multimodal capabilities, integration with Google ecosystem
- Considerations: API offerings still evolving, performance varies
4. Mistral AI
- Key Models: Mistral Large, Mistral Small, Mixtral (Open Weights)
- Strengths: High performance open-weight models, competitive API offerings
- Considerations: European focus, rapidly evolving
5. Meta (via Hugging Face, etc.)
- Key Models: Llama 3 (70B, 8B)
- Strengths: Leading open-weight models, widely available for self-hosting
- Considerations: Typically require self-hosting or use via third-party APIs
Choosing the Right Model
Model selection depends on:
- Task Complexity: Use more capable models (like GPT-4, Claude Opus) for complex reasoning.
- Cost Sensitivity: Use cheaper models (like GPT-3.5, Claude Haiku, Mistral Small) for high-volume or simpler tasks.
- Speed Requirements: Smaller models are generally faster.
- Context Length Needs: Choose models with large context windows if handling long documents.
- Open vs. Closed Source: Consider open-weight models for customization or self-hosting needs.
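These criteria can be encoded as a simple routing rule, the same idea as the two-stage processing mentioned earlier. The sketch below is a toy router; the model names and thresholds are assumptions to tune for your own workload.

CHEAP_MODEL = "gpt-3.5-turbo"
CAPABLE_MODEL = "gpt-4-turbo"

def choose_model(task: str, needs_reasoning: bool = False) -> str:
    """Route long or reasoning-heavy tasks to the capable model, everything else to the cheap one."""
    if needs_reasoning or len(task) > 4_000:
        return CAPABLE_MODEL
    return CHEAP_MODEL

print(choose_model("Classify this support ticket as billing or technical."))
print(choose_model("Plan a three-step migration of our billing database.", needs_reasoning=True))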
Limitations and Challenges of LLMs
While powerful, LLMs have inherent limitations that agent developers must understand and mitigate:
1. Hallucinations
Problem: LLMs can generate plausible but factually incorrect or nonsensical information.
Mitigation: Use Retrieval-Augmented Generation (RAG) to ground responses in factual data, implement fact-checking steps, lower temperature settings for factual tasks.
2. Knowledge Cutoff
Problem: An LLM's knowledge is frozen at the time its training data was collected.
Mitigation: Use RAG to provide up-to-date information, integrate with real-time data sources (like web search APIs).
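The core move in RAG is simple: fetch relevant text first, then instruct the model to answer only from it. A minimal sketch follows; retrieve_documents is a hypothetical stand-in for your search API or vector store, and the returned policy text is invented for illustration.

def retrieve_documents(query: str) -> list[str]:
    """Hypothetical retrieval step: replace with your search API or vector store."""
    return ["Acme Corp's refund window is 30 days from delivery (policy v2, 2025)."]

def build_grounded_prompt(question: str) -> str:
    """Stuff retrieved snippets into the prompt and constrain the model to them."""
    context = "\n".join(f"- {doc}" for doc in retrieve_documents(question))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("How long do customers have to request a refund?"))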
3. Bias
Problem: LLMs can reflect biases present in their training data.
Mitigation: Carefully craft prompts to avoid leading questions, implement content filtering and moderation, use bias detection tools, ensure human oversight for sensitive applications.
4. Prompt Sensitivity
Problem: Small changes in prompts can lead to large variations in output quality.
Mitigation: Develop robust prompt templates, use few-shot examples, systematically test and iterate on prompts.
5. Cost and Latency
Problem: Powerful LLMs can be expensive and slow to respond.
Mitigation: Optimize token usage, use smaller/cheaper models where appropriate, implement caching, explore fine-tuning or distilled models for specific tasks.
Next Steps: Mastering Prompt Engineering
Understanding LLM fundamentals is the first step. The next crucial skill is prompt engineering—the art and science of crafting effective inputs to guide LLMs towards desired outputs.
Key Takeaways from This Section:
- LLMs work by predicting tokens based on patterns learned from vast datasets
- Tokens are the basic units LLMs process, impacting cost and context limits
- Context windows define how much information an LLM can handle at once
- Temperature and sampling controls influence the creativity vs. determinism of outputs
- Major providers like OpenAI, Anthropic, Google, and Mistral offer diverse models
- LLMs have limitations like hallucinations and knowledge cutoffs that require mitigation
In the next section, we delve into Prompt Engineering Essentials, providing techniques to unlock the full potential of LLMs within your AI agents.
Continue to Prompt Engineering →