LLM Essentials

Master the fundamentals of Large Language Models to power your AI agents

Understanding Large Language Models

Large Language Models (LLMs) are the core technology powering modern AI agents. Understanding how they work, their capabilities, and their limitations is essential for effective AI agent development.

Key Insight

LLMs are not just text generators—they're reasoning engines that can process information, make decisions, and solve problems when properly prompted.

How LLMs Work

At a conceptual level, LLMs function through these key mechanisms:

  1. Pattern Recognition: LLMs identify statistical patterns in language from massive training datasets
  2. Contextual Understanding: They interpret input text based on surrounding context
  3. Prediction: They predict the most likely next tokens based on learned patterns
  4. Generation: They assemble these predictions into coherent responses
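
To make the predict-and-generate loop concrete, here is a toy sketch in Python. The function next_token_distribution is a made-up stand-in for a real model's output layer, not an actual API; a real LLM produces a probability distribution over tens of thousands of tokens at each step.

import random

# Toy illustration of the predict-and-generate loop described above.
# `next_token_distribution` stands in for a real model's output layer.
def next_token_distribution(context):
    """Return a made-up probability distribution over a tiny vocabulary."""
    return {"agents": 0.5, "models": 0.3, "tools": 0.15, "<end>": 0.05}

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        # Sample the next token in proportion to its predicted probability
        next_token = random.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

print(generate(["I", "want", "to", "build", "AI"]))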

The Transformer Architecture

Modern LLMs are based on the transformer architecture, which uses:

  • Self-Attention: Allows the model to weigh the importance of different words in relation to each other
  • Parallel Processing: Enables efficient training on massive datasets
  • Positional Encoding: Helps the model understand word order
  • Multi-Head Attention: Allows the model to focus on different aspects of the input simultaneously

While understanding the technical details isn't necessary for most applications, knowing these concepts helps explain LLM behaviour.
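
For readers who do want a feel for the mechanics, here is a minimal NumPy sketch of scaled dot-product self-attention, the operation behind the first bullet above. It is a simplification: real models add multiple heads, masking, positional encodings, and learned projections trained on huge datasets.

import numpy as np

# Minimal sketch of scaled dot-product self-attention. Shapes are tiny for readability.
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how strongly each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (4, 8): one updated vector per token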

Key LLM Capabilities

Modern LLMs can perform a wide range of tasks that make them valuable for AI agents:

Capability | Description | Agent Applications
Text Generation | Creating coherent, contextually relevant text | Content creation, responses, summaries
Reasoning | Working through problems step-by-step | Decision making, planning, troubleshooting
Instruction Following | Executing specific directions | Task automation, workflow execution
Information Extraction | Identifying key data points from text | Data processing, form filling, analysis
Transformation | Converting between formats and styles | Translation, summarisation, simplification
Tool Use | Generating structured outputs for external tools | API calls, database queries, function execution
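
As a rough illustration of the Tool Use row, the sketch below parses a structured JSON "tool call" from a model response (hard-coded here for simplicity) and dispatches it. The weather tool and the response string are illustrative placeholders; production systems typically use provider function-calling features plus schema validation.

import json

# Dispatch a structured "tool call" produced by an LLM.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

llm_response = '{"tool": "get_weather", "arguments": {"city": "London"}}'

call = json.loads(llm_response)                     # parse the model's structured output
result = TOOLS[call["tool"]](**call["arguments"])   # execute the requested tool
print(result)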

The Token Economy

Understanding tokens is crucial for using LLMs effectively and cost-efficiently: tokens are the fundamental units that LLMs process, and they directly determine both how much text a model can handle and how much each request costs.

What Are Tokens?

Tokens are the basic units of text that LLMs process. They're not exactly words or characters, but pieces of text that the model treats as single units.

Token Examples

The sentence "I want to build an AI agent" might be tokenised as:

["I", "want", "to", "build", "an", "AI", "agent"]

But more complex words might be split into multiple tokens:

"Tokenisation" → ["Token", "isation"]

And multi-word phrases are usually split as well, with the space typically attached to the token that follows it:

"New York" → ["New", " York"]

Token Counting Rules of Thumb

For English text, a useful approximation is that one token is roughly 4 characters, or about three-quarters of a word, so 100 tokens correspond to roughly 75 words. Exact counts vary by tokeniser and by language.

Token Counting Tools

For accurate counts, use the tokeniser that matches your model rather than estimating. OpenAI's open-source tiktoken library (used in the code example later in this section) covers GPT-family models, and most other providers publish their own tokenisers or token-counting endpoints.

Context Windows

The context window is the maximum number of tokens an LLM can process at once, including both the input prompt and the generated output.

Model | Context Window (tokens) | Approximate Text Equivalent
GPT-3.5 Turbo | 16,385 | ~12 pages
GPT-4 Turbo | 128,000 | ~85 pages
Claude 3 Opus | 200,000 | ~133 pages
Mistral Large | 32,768 | ~22 pages
Llama 3 70B | 8,192 | ~5 pages

Context Window Limitations

Problem: LLMs can only "see" text within their context window. Anything beyond this limit is invisible to the model.

Impact: This creates a "goldfish memory" problem where the model forgets information that falls outside the window.

Solution: Implement memory systems in your agents to store and retrieve information beyond the context window.
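
One possible shape for such a memory system is sketched below. It assumes keyword overlap is an acceptable stand-in for the embedding-based similarity search used in production retrieval systems; the class and its methods are illustrative, not a standard API.

class SimpleKeywordMemory:
    """Naive long-term memory: store old turns outside the prompt and
    retrieve only the most relevant ones when building a new prompt."""

    def __init__(self):
        self.records = []  # each record is the text of an earlier conversation turn

    def store(self, text):
        self.records.append(text)

    def retrieve(self, query, top_k=3):
        # Score stored turns by word overlap with the query (a stand-in for
        # vector similarity search in a real implementation).
        query_words = set(query.lower().split())
        scored = sorted(
            self.records,
            key=lambda r: len(query_words & set(r.lower().split())),
            reverse=True,
        )
        return scored[:top_k]

memory = SimpleKeywordMemory()
memory.store("The user's deployment target is AWS Lambda.")
memory.store("The user prefers Python over JavaScript.")
print(memory.retrieve("Which language should the agent use?"))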

Token Economy Optimisation Strategies

Cost Optimisation Techniques:

  1. Prompt Compression: Remove unnecessary details and examples
  2. Two-Stage Processing: Use cheaper models for initial processing, premium models for final output
  3. Chunking: Break large documents into manageable pieces
  4. Summarisation: Condense previous conversation turns
  5. Caching: Store responses for common queries
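
As one illustration, here is a minimal sketch of strategy 5 (caching). The function call_llm is a hypothetical placeholder for whichever provider SDK you actually use; the point is that identical requests produce identical cache keys.

import hashlib
import json

_cache = {}

def call_llm(model, messages, temperature):
    """Placeholder for a real provider call (e.g. an OpenAI or Anthropic SDK request)."""
    raise NotImplementedError("Wire this up to your provider's SDK.")

def cached_completion(model, messages, temperature=0.0):
    # Identical requests hash to the same key; low temperature makes reuse safe.
    key = hashlib.sha256(
        json.dumps(
            {"model": model, "messages": messages, "temperature": temperature},
            sort_keys=True,
        ).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(model=model, messages=messages, temperature=temperature)
    return _cache[key]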

Implementing a Token-Efficient Agent

import tiktoken

class TokenEfficientAgent:
    def __init__(self, model_name="gpt-3.5-turbo", max_history_tokens=4000):
        self.conversation_history = []
        self.max_history_tokens = max_history_tokens
        self.token_count = 0
        try:
            self.encoding = tiktoken.encoding_for_model(model_name)
        except KeyError:
            print(f"Warning: Model {model_name} not found. Using cl100k_base encoding.")
            self.encoding = tiktoken.get_encoding("cl100k_base")
    
    def _count_tokens(self, text):
        """Count tokens using the specified model's encoder."""
        return len(self.encoding.encode(text))

    def add_message(self, role, content):
        """Add a message to the conversation history with accurate token counting."""
        # Accurate token count for the message content
        message_tokens = self._count_tokens(content)
        # Add overhead tokens per message (approximation, varies by model)
        overhead_tokens_per_message = 4 
        total_tokens = message_tokens + overhead_tokens_per_message
        
        message = {"role": role, "content": content, "tokens": total_tokens}
        self.conversation_history.append(message)
        self.token_count += total_tokens
        
        # Prune history if needed
        self._prune_history()
    
    def _prune_history(self):
        """Ensure conversation history stays within token limits."""
        # Account for tokens needed for the response
        buffer_tokens = 1000 # Reserve tokens for the next model response
        target_tokens = self.max_history_tokens - buffer_tokens

        if self.token_count <= target_tokens:
            return
            
        # Always keep the system message if present
        system_message = None
        if self.conversation_history and self.conversation_history[0]["role"] == "system":
            system_message = self.conversation_history.pop(0)
            self.token_count -= system_message["tokens"]
        
        # Remove oldest messages (excluding system message) until under the limit
        while self.token_count > target_tokens and len(self.conversation_history) > 0:
            # Remove from the beginning (oldest conversation turns)
            removed = self.conversation_history.pop(0)
            self.token_count -= removed["tokens"]
        
        # Re-add system message if it existed
        if system_message:
            self.conversation_history.insert(0, system_message)
            self.token_count += system_message["tokens"]
            # If adding the system message back puts us over, keep trimming
            while self.token_count > target_tokens and len(self.conversation_history) > 1:
                removed = self.conversation_history.pop(1)  # Remove oldest non-system message
                self.token_count -= removed["tokens"]
    
    def get_messages_for_api(self):
        """Get messages formatted for API call."""
        return [{"role": m["role"], "content": m["content"]} for m in self.conversation_history]

Temperature and Sampling Controls

Temperature and related parameters control the randomness and creativity of LLM outputs. Understanding these controls is essential for tailoring responses to different tasks.

Temperature Explained

Temperature controls the randomness of the model's token selection. It's typically set between 0.0 and 1.0 (or sometimes higher), with higher values producing more diverse and creative outputs.

Temperature Settings Guide:

Setting | Effect | Best For
0.0 - 0.3 | Highly deterministic, focused on most likely tokens | Factual responses, code generation, structured data extraction
0.4 - 0.7 | Balanced between determinism and creativity | General conversation, explanations, summaries
0.8 - 1.0+ | Highly creative, diverse, potentially less coherent | Brainstorming, creative writing, role-playing
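
As an illustration, here is how temperature might be set with the OpenAI Python SDK; most other provider APIs expose an equivalent parameter with similar semantics. The prompts are arbitrary examples chosen to match the table above.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Low temperature for a factual extraction task
factual = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Extract the invoice number from: 'Invoice INV-2041, due 1 May.'"}],
    temperature=0.2,
)

# Higher temperature for open-ended brainstorming
creative = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Suggest five names for a travel-planning AI agent."}],
    temperature=0.9,
)

print(factual.choices[0].message.content)
print(creative.choices[0].message.content)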

Other Sampling Parameters

Beyond temperature, most APIs expose additional controls:

  • Top-p (nucleus sampling): Samples only from the smallest set of tokens whose cumulative probability reaches p (e.g. 0.9), cutting off the unlikely tail
  • Top-k: Restricts sampling to the k most likely tokens (offered by some providers)
  • Frequency and presence penalties: Discourage the model from repeating tokens or topics it has already used
  • Max tokens: Caps the length of the generated output

Best Practice: Adjust One Parameter at a Time

Typically, adjust either temperature OR top-p, not both. Start with temperature and only explore top-p if temperature alone doesn't give the desired control.

Major LLM Providers and Models

Several major players offer powerful LLMs via APIs. Here's an overview of the landscape in 2025:

1. OpenAI

GPT-family models such as GPT-3.5 Turbo and GPT-4 Turbo, available through the OpenAI API and Azure OpenAI.

2. Anthropic

The Claude family (e.g. Claude 3 Opus, Sonnet, and Haiku), noted for large context windows.

3. Google

The Gemini family of models, offered through Google AI Studio and Vertex AI.

4. Mistral AI

Hosted models such as Mistral Large and Mistral Small, alongside open-weight releases.

5. Meta (via Hugging Face, etc.)

The open-weight Llama family (e.g. Llama 3), popular for self-hosting and fine-tuning.

Choosing the Right Model

Model selection depends on:

  • Task Complexity: Use more capable models (like GPT-4, Claude Opus) for complex reasoning.
  • Cost Sensitivity: Use cheaper models (like GPT-3.5, Claude Haiku, Mistral Small) for high-volume or simpler tasks.
  • Speed Requirements: Smaller models are generally faster.
  • Context Length Needs: Choose models with large context windows if handling long documents.
  • Open vs. Closed Source: Consider open-weight models for customization or self-hosting needs.
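
One way to encode these trade-offs is a simple routing function. The sketch below is illustrative only: the model names are taken from earlier in this section, and the thresholds are arbitrary assumptions, not recommendations.

# Route requests to a model tier based on task profile.
MODEL_TIERS = {
    "simple":  "gpt-3.5-turbo",            # cheap and fast: classification, short answers
    "complex": "gpt-4-turbo",              # stronger reasoning: planning, multi-step tasks
    "long":    "claude-3-opus-20240229",   # large context window for long documents
}

def choose_model(task_complexity, prompt_tokens):
    if prompt_tokens > 100_000:
        return MODEL_TIERS["long"]
    if task_complexity == "high":
        return MODEL_TIERS["complex"]
    return MODEL_TIERS["simple"]

print(choose_model("low", prompt_tokens=800))      # high-volume, simple task
print(choose_model("high", prompt_tokens=3_000))   # complex reasoning task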

Limitations and Challenges of LLMs

While powerful, LLMs have inherent limitations that agent developers must understand and mitigate:

1. Hallucinations

Problem: LLMs can generate plausible but factually incorrect or nonsensical information.

Mitigation: Use Retrieval-Augmented Generation (RAG) to ground responses in factual data, implement fact-checking steps, lower temperature settings for factual tasks.
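
A minimal sketch of the RAG pattern referred to above: retrieve relevant facts first, then instruct the model to answer only from them. Here retrieve_documents is a hypothetical retriever standing in for a vector store or search API, and the example documents are invented.

def build_grounded_prompt(question, retrieve_documents):
    documents = retrieve_documents(question, top_k=3)
    context = "\n".join(f"- {doc}" for doc in documents)
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

fake_store = ["Refunds are available within 30 days.", "Shipping takes 3-5 business days."]
prompt = build_grounded_prompt(
    "What is the refund window?",
    retrieve_documents=lambda q, top_k: fake_store[:top_k],
)
print(prompt)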

2. Knowledge Cutoff

Problem: An LLM's knowledge is frozen at the time its training data was collected.

Mitigation: Use RAG to provide up-to-date information, integrate with real-time data sources (like web search APIs).

3. Bias

Problem: LLMs can reflect biases present in their training data.

Mitigation: Carefully craft prompts to avoid leading questions, implement content filtering and moderation, use bias detection tools, ensure human oversight for sensitive applications.

4. Prompt Sensitivity

Problem: Small changes in prompts can lead to large variations in output quality.

Mitigation: Develop robust prompt templates, use few-shot examples, systematically test and iterate on prompts.
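
A small sketch of such a template with two few-shot examples; the classification task and the examples are purely illustrative, but the pattern of fixing the instructions and demonstrations while slotting in only the variable input is what makes prompts more stable.

FEW_SHOT_TEMPLATE = """Classify the support ticket as 'billing', 'technical', or 'other'.

Ticket: "I was charged twice this month."
Category: billing

Ticket: "The app crashes when I upload a file."
Category: technical

Ticket: "{ticket}"
Category:"""

def build_classification_prompt(ticket):
    return FEW_SHOT_TEMPLATE.format(ticket=ticket)

print(build_classification_prompt("How do I reset my password?"))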

5. Cost and Latency

Problem: Powerful LLMs can be expensive and slow to respond.

Mitigation: Optimize token usage, use smaller/cheaper models where appropriate, implement caching, explore fine-tuning or distilled models for specific tasks.

Next Steps: Mastering Prompt Engineering

Understanding LLM fundamentals is the first step. The next crucial skill is prompt engineering—the art and science of crafting effective inputs to guide LLMs towards desired outputs.

Key Takeaways from This Section:

  • LLMs work by predicting tokens based on patterns learned from vast datasets
  • Tokens are the basic units LLMs process, impacting cost and context limits
  • Context windows define how much information an LLM can handle at once
  • Temperature and sampling controls influence the creativity vs. determinism of outputs
  • Major providers like OpenAI, Anthropic, Google, and Mistral offer diverse models
  • LLMs have limitations like hallucinations and knowledge cutoffs that require mitigation

In the next section, we delve into Prompt Engineering Essentials, providing techniques to unlock the full potential of LLMs within your AI agents.

Continue to Prompt Engineering →