LLM Essentials
Master the fundamentals of Large Language Models to power your AI agents
Understanding Large Language Models
Large Language Models (LLMs) are the core technology powering modern AI agents. Understanding how they work, their capabilities, and their limitations is essential for effective AI agent development.
Key Insight
LLMs are not just text generators—they're reasoning engines that can process information, make decisions, and solve problems when properly prompted.
How LLMs Work
At a conceptual level, LLMs function through these key mechanisms:
- Pattern Recognition: LLMs identify statistical patterns in language from massive training datasets
- Contextual Understanding: They interpret input text based on surrounding context
- Prediction: They predict the most likely next tokens based on learned patterns
- Generation: They assemble these predictions into coherent responses
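The prediction step above can be made concrete in a few lines of code. The sketch below uses the small open GPT-2 model via the Hugging Face transformers library purely as an illustration (the model choice and prompt are assumptions, not recommendations); any causal LLM produces the same kind of next-token distribution.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small open model purely to illustrate next-token prediction
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "AI agents are"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits                          # a score for every token in the vocabulary
next_token_probs = torch.softmax(logits[0, -1], dim=-1)       # distribution over the next token

# The five most likely continuations and their probabilities
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}  p={float(prob):.3f}")

Generation is simply this step in a loop: pick a token from the distribution, append it to the input, and predict again.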
The Transformer Architecture
Modern LLMs are based on the transformer architecture, which uses:
- Self-Attention: Allows the model to weigh the importance of different words in relation to each other
- Parallel Processing: Enables efficient training on massive datasets
- Positional Encoding: Helps the model understand word order
- Multi-Head Attention: Allows the model to focus on different aspects of the input simultaneously
While understanding the technical details isn't necessary for most applications, knowing these concepts helps explain LLM behaviour.
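For readers who do want a peek under the hood, the heart of self-attention is a single formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a minimal NumPy sketch of one attention head, with toy dimensions and random inputs standing in for real embeddings.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V                   # weighted mix of value vectors per token

seq_len, d_k = 4, 8                      # 4 tokens, 8-dimensional vectors (toy sizes)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
print(self_attention(Q, K, V).shape)     # (4, 8): one updated vector per token

Multi-head attention runs several of these in parallel with different learned projections and concatenates the results.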
Key LLM Capabilities
Modern LLMs can perform a wide range of tasks that make them valuable for AI agents:
| Capability | Description | Agent Applications |
|---|---|---|
| Text Generation | Creating coherent, contextually relevant text | Content creation, responses, summaries |
| Reasoning | Working through problems step-by-step | Decision making, planning, troubleshooting |
| Instruction Following | Executing specific directions | Task automation, workflow execution |
| Information Extraction | Identifying key data points from text | Data processing, form filling, analysis |
| Transformation | Converting between formats and styles | Translation, summarisation, simplification |
| Tool Use | Generating structured outputs for external tools | API calls, database queries, function execution |
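The last row, tool use, is worth a concrete example because it underpins most agent architectures. The sketch below uses the OpenAI Python SDK (it assumes an OPENAI_API_KEY in the environment, and get_weather is a hypothetical tool defined only for illustration) to have the model return a structured call instead of prose.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A hypothetical tool schema: the model never runs this, it only decides to call it
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "What's the weather like in Paris right now?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:  # the model chose the tool rather than answering in prose
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))

Your agent code is responsible for actually executing the call and feeding the result back to the model.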
The Token Economy
Understanding tokens is crucial for effective and cost-efficient use of LLMs: they determine how much text fits in a prompt and how much each call costs.
What Are Tokens?
Tokens are the basic units of text that LLMs process. They're not exactly words or characters, but pieces of text that the model treats as single units.
Token Examples
The sentence "I want to build an AI agent" might be tokenised as:
["I", "want", "to", "build", "an", "AI", "agent"]
But more complex words might be split into multiple tokens:
"Tokenisation" → ["Token", "isation"]
And frequent whole words are often single tokens, usually with their leading space attached:
" agent" → [" agent"]
Token Counting Rules of Thumb
- In English, 1 token ≈ 4 characters
- 1 token ≈ ¾ of a word
- 100 tokens ≈ 75 words
- 1,500 tokens ≈ 1 page of text
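These are only rules of thumb. For exact numbers, you can inspect the splits directly; the snippet below uses tiktoken with the cl100k_base encoding (the one used by the GPT-3.5/GPT-4 family; other tokenisers split text differently).

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

for text in ["I want to build an AI agent", "Tokenisation", "New York"]:
    token_ids = encoding.encode(text)
    pieces = [encoding.decode([tid]) for tid in token_ids]  # each token as readable text
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")

Notice that most tokens carry their leading space, which is why word counts and token counts never line up exactly.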
Token Counting Tools
Use these tools to accurately count tokens for your prompts and responses:
- OpenAI Tokenizer - Official OpenAI tool
- Hugging Face Tokenizer Playground - Works with multiple models
Context Windows
The context window is the maximum number of tokens an LLM can process at once, including both the input prompt and the generated output.
| Model | Context Window (tokens) | Approximate Text Equivalent |
|---|---|---|
| GPT-3.5 Turbo | 16,385 | ~12 pages |
| GPT-4 Turbo | 128,000 | ~85 pages |
| Claude 3 Opus | 200,000 | ~133 pages |
| Mistral Large | 32,768 | ~22 pages |
| Llama 3 70B | 8,192 | ~5 pages |
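Because the prompt and the response share this budget, it pays to check the fit before making a call. A minimal sketch follows, with window sizes copied from the table above and fits_in_context as an illustrative helper rather than a library function.

import tiktoken

CONTEXT_WINDOWS = {
    "gpt-3.5-turbo": 16_385,
    "gpt-4-turbo": 128_000,
}

def fits_in_context(prompt: str, model: str, max_output_tokens: int = 1_000) -> bool:
    """Return True if the prompt plus a reserved output budget fits the window."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # sensible default encoding
    prompt_tokens = len(encoding.encode(prompt))
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOWS[model]

print(fits_in_context("Summarise this report: ...", "gpt-3.5-turbo"))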
Context Window Limitations
Problem: LLMs can only "see" text within their context window. Anything beyond this limit is invisible to the model.
Impact: This creates a "goldfish memory" problem where the model forgets information that falls outside the window.
Solution: Implement memory systems in your agents to store and retrieve information beyond the context window.
Token Economy Optimisation Strategies
Cost Optimisation Techniques:
- Prompt Compression: Remove unnecessary details and examples
- Two-Stage Processing: Use cheaper models for initial processing, premium models for final output
- Chunking: Break large documents into manageable pieces
- Summarisation: Condense previous conversation turns
- Caching: Store responses for common queries
Implementing a Token-Efficient Agent
import tiktoken

class TokenEfficientAgent:
    def __init__(self, model_name="gpt-3.5-turbo", max_history_tokens=4000):
        self.conversation_history = []
        self.max_history_tokens = max_history_tokens
        self.token_count = 0
        try:
            self.encoding = tiktoken.encoding_for_model(model_name)
        except KeyError:
            print(f"Warning: Model {model_name} not found. Using cl100k_base encoding.")
            self.encoding = tiktoken.get_encoding("cl100k_base")

    def _count_tokens(self, text):
        """Count tokens using the specified model's encoder."""
        return len(self.encoding.encode(text))

    def add_message(self, role, content):
        """Add a message to the conversation history with accurate token counting."""
        # Accurate token count for the message content
        message_tokens = self._count_tokens(content)
        # Add overhead tokens per message (approximation, varies by model)
        overhead_tokens_per_message = 4
        total_tokens = message_tokens + overhead_tokens_per_message

        message = {"role": role, "content": content, "tokens": total_tokens}
        self.conversation_history.append(message)
        self.token_count += total_tokens

        # Prune history if needed
        self._prune_history()

    def _prune_history(self):
        """Ensure conversation history stays within token limits."""
        # Account for tokens needed for the response
        buffer_tokens = 1000  # Reserve tokens for the next model response
        target_tokens = self.max_history_tokens - buffer_tokens

        if self.token_count <= target_tokens:
            return

        # Always keep the system message if present
        system_message = None
        if self.conversation_history and self.conversation_history[0]["role"] == "system":
            system_message = self.conversation_history.pop(0)
            self.token_count -= system_message["tokens"]

        # Remove oldest messages (excluding system message) until under the limit
        while self.token_count > target_tokens and len(self.conversation_history) > 0:
            # Remove from the beginning (oldest conversation turns)
            removed = self.conversation_history.pop(0)
            self.token_count -= removed["tokens"]

        # Re-add system message if it existed
        if system_message:
            self.conversation_history.insert(0, system_message)
            self.token_count += system_message["tokens"]
            # If adding the system message back puts us over, remove another message
            if self.token_count > target_tokens and len(self.conversation_history) > 1:
                removed = self.conversation_history.pop(1)  # Remove message after the system prompt
                self.token_count -= removed["tokens"]

    def get_messages_for_api(self):
        """Get messages formatted for API call."""
        return [{"role": m["role"], "content": m["content"]} for m in self.conversation_history]
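A quick usage sketch: the actual LLM call is omitted, since get_messages_for_api simply returns the pruned history in the shape most chat APIs expect.

agent = TokenEfficientAgent(model_name="gpt-3.5-turbo", max_history_tokens=4000)
agent.add_message("system", "You are a concise research assistant.")
agent.add_message("user", "Summarise the key points of the transformer architecture.")

messages = agent.get_messages_for_api()  # pass this list to your chat completion call
print(f"History holds {len(messages)} messages, ~{agent.token_count} tokens")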
Temperature and Sampling Controls
Temperature and related parameters control the randomness and creativity of LLM outputs. Understanding these controls is essential for tailoring responses to different tasks.
Temperature Explained
Temperature controls the randomness of the model's token selection. It's typically set between 0.0 and 1.0 (or sometimes higher), with higher values producing more diverse and creative outputs.
Temperature Settings Guide:
| Setting | Effect | Best For |
|---|---|---|
| 0.0 - 0.3 | Highly deterministic, focused on most likely tokens | Factual responses, code generation, structured data extraction |
| 0.4 - 0.7 | Balanced between determinism and creativity | General conversation, explanations, summaries |
| 0.8 - 1.0+ | Highly creative, diverse, potentially less coherent | Brainstorming, creative writing, role-playing |
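Under the hood, temperature simply rescales the model's raw scores before they become probabilities: probabilities = softmax(logits / T). The toy sketch below (made-up logits, no model required) shows how a low temperature concentrates probability on the top token while a high one spreads it out.

import numpy as np

def next_token_distribution(logits, temperature):
    scaled = np.array(logits) / temperature   # temperature rescales the raw scores
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()                    # softmax over the scaled scores

logits = [4.0, 3.0, 1.0, 0.5]                 # raw scores for four candidate tokens
for t in (0.2, 0.7, 1.2):
    print(t, np.round(next_token_distribution(logits, t), 3))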
Other Sampling Parameters
- Top-p (Nucleus Sampling): Considers only the most probable tokens that make up a certain cumulative probability mass (e.g., top_p=0.9 considers tokens covering 90% probability). Often used instead of or alongside temperature.
- Top-k: Considers only the top k most probable tokens. Less common now than top-p.
- Frequency Penalty: Reduces a token's likelihood in proportion to how often it has already appeared, discouraging verbatim repetition.
- Presence Penalty: Applies a flat penalty to any token that has already appeared at all, encouraging the model to introduce new words and topics.
Best Practice: Adjust One Parameter at a Time
Typically, adjust either temperature OR top-p, not both. Start with temperature and only explore top-p if temperature alone doesn't give the desired control.
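In practice these controls are just request parameters. A hedged sketch with the OpenAI Python SDK follows (it assumes an OPENAI_API_KEY in the environment; the parameter values are illustrative, not recommendations).

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Suggest three names for a budgeting app."}],
    temperature=0.9,        # creative task, so lean towards diversity
    # top_p=0.9,            # the alternative control; adjust one of these, not both
    frequency_penalty=0.3,  # discourage repeating the same words
    presence_penalty=0.0,
)
print(response.choices[0].message.content)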
Major LLM Providers and Models
Several major players offer powerful LLMs via APIs. Here's an overview of the landscape in 2025:
1. OpenAI
- Key Models: GPT-4 Turbo, GPT-3.5 Turbo
- Strengths: Strong general reasoning, instruction following, code generation
- Considerations: Can be expensive, closed-source
2. Anthropic
- Key Models: Claude 3 (Opus, Sonnet, Haiku)
- Strengths: Strong performance on benchmarks, large context window, focus on safety
- Considerations: Newer player, API access might be less widespread
3. Google
- Key Models: Gemini (Pro, Ultra)
- Strengths: Multimodal capabilities, integration with Google ecosystem
- Considerations: API offerings still evolving, performance varies
4. Mistral AI
- Key Models: Mistral Large, Mistral Small, Mixtral (Open Weights)
- Strengths: High performance open-weight models, competitive API offerings
- Considerations: European focus, rapidly evolving
5. Meta (via Hugging Face, etc.)
- Key Models: Llama 3 (70B, 8B)
- Strengths: Leading open-weight models, widely available for self-hosting
- Considerations: Typically require self-hosting or use via third-party APIs
Choosing the Right Model
Model selection depends on:
- Task Complexity: Use more capable models (like GPT-4, Claude Opus) for complex reasoning.
- Cost Sensitivity: Use cheaper models (like GPT-3.5, Claude Haiku, Mistral Small) for high-volume or simpler tasks.
- Speed Requirements: Smaller models are generally faster.
- Context Length Needs: Choose models with large context windows if handling long documents.
- Open vs. Closed Source: Consider open-weight models for customization or self-hosting needs.
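These criteria can be encoded as a simple routing rule, the same idea as the two-stage processing mentioned earlier. The sketch below is a toy router; the model names and thresholds are assumptions to tune for your own workload.

CHEAP_MODEL = "gpt-3.5-turbo"
CAPABLE_MODEL = "gpt-4-turbo"

def choose_model(task: str, needs_reasoning: bool = False) -> str:
    """Route long or reasoning-heavy tasks to the capable model, everything else to the cheap one."""
    if needs_reasoning or len(task) > 4_000:
        return CAPABLE_MODEL
    return CHEAP_MODEL

print(choose_model("Classify this support ticket as billing or technical."))
print(choose_model("Plan a three-step migration of our billing database.", needs_reasoning=True))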
Limitations and Challenges of LLMs
While powerful, LLMs have inherent limitations that agent developers must understand and mitigate:
1. Hallucinations
Problem: LLMs can generate plausible but factually incorrect or nonsensical information.
Mitigation: Use Retrieval-Augmented Generation (RAG) to ground responses in factual data, implement fact-checking steps, lower temperature settings for factual tasks.
2. Knowledge Cutoff
Problem: An LLM's knowledge is frozen at the time its training data was collected.
Mitigation: Use RAG to provide up-to-date information, integrate with real-time data sources (like web search APIs).
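The core move in RAG is simple: fetch relevant text first, then instruct the model to answer only from it. A minimal sketch follows; retrieve_documents is a hypothetical stand-in for your search API or vector store, and the returned policy text is invented for illustration.

def retrieve_documents(query: str) -> list[str]:
    """Hypothetical retrieval step: replace with your search API or vector store."""
    return ["Acme Corp's refund window is 30 days from delivery (policy v2, 2025)."]

def build_grounded_prompt(question: str) -> str:
    """Stuff retrieved snippets into the prompt and constrain the model to them."""
    context = "\n".join(f"- {doc}" for doc in retrieve_documents(question))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("How long do customers have to request a refund?"))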
3. Bias
Problem: LLMs can reflect biases present in their training data.
Mitigation: Carefully craft prompts to avoid leading questions, implement content filtering and moderation, use bias detection tools, ensure human oversight for sensitive applications.
4. Prompt Sensitivity
Problem: Small changes in prompts can lead to large variations in output quality.
Mitigation: Develop robust prompt templates, use few-shot examples, systematically test and iterate on prompts.
5. Cost and Latency
Problem: Powerful LLMs can be expensive and slow to respond.
Mitigation: Optimize token usage, use smaller/cheaper models where appropriate, implement caching, explore fine-tuning or distilled models for specific tasks.
Next Steps: Mastering Prompt Engineering
Understanding LLM fundamentals is the first step. The next crucial skill is prompt engineering—the art and science of crafting effective inputs to guide LLMs towards desired outputs.
Key Takeaways from This Section:
- LLMs work by predicting tokens based on patterns learned from vast datasets
- Tokens are the basic units LLMs process, impacting cost and context limits
- Context windows define how much information an LLM can handle at once
- Temperature and sampling controls influence the creativity vs. determinism of outputs
- Major providers like OpenAI, Anthropic, Google, and Mistral offer diverse models
- LLMs have limitations like hallucinations and knowledge cutoffs that require mitigation
In the next section, we delve into Prompt Engineering Essentials, providing techniques to unlock the full potential of LLMs within your AI agents.
Continue to Prompt Engineering →