Anatomy of an Agent: How a HeartBeat Agent Thinks, Remembers, and Acts
A technical deep-dive into the architecture behind HeartBeatAgents: LLM routing, three-layer memory, tool orchestration, and the lifecycle of a request.
Building an AI agent that works reliably in production requires solving problems that most chatbot architectures never encounter. A chatbot processes a message and returns a response. An agent processes a message, decides whether to respond immediately or gather more information first, invokes tools across multiple external systems, stores new knowledge for future use, and then responds, sometimes across a different channel than the one the message arrived on. The architectural distance between these two systems is enormous.
This post walks through the internal architecture of a HeartBeat Agent: how it receives a request, how it reasons about that request, how it accesses memory and tools, and how it produces a final response. Every layer described here is running in production today.
The Request Lifecycle
When a message arrives, whether from Slack, Discord, email, WhatsApp, or any other connected channel, it enters the system through our Channel Adapter layer. Each adapter normalizes the incoming message into a standard internal format: sender identity, channel context, message content, any attachments, and metadata like thread ID or reply context. This normalization is critical because the agent's reasoning layer should never need to know whether a message came from Slack or Telegram. It processes intent, not protocol.
The normalized message is then enriched with context. The system retrieves the sender's identity from our user graph, pulls the active conversation thread if one exists, and loads the agent's standing orders, the persistent instructions that define its role, personality, boundaries, and objectives. Standing orders are the closest analog to a job description. They tell the agent what it is, what it should do, and what it should never do.
LLM Routing: The Right Model for the Right Task
HeartBeatAgents does not use a single language model. We maintain active integrations with 10+ providers: OpenAI (GPT-4o, GPT-4 Turbo), Anthropic (Claude 4, Sonnet, Haiku), Google (Gemini 2.5 Pro, Gemini Flash), Mistral (Large, Medium), Cohere (Command R+), Meta (Llama 3 via hosted endpoints), DeepSeek (DeepSeek V3), OpenRouter (100+ models), Ollama (local inference with any open-weight model), and custom OpenAI-compatible endpoints.
Our routing layer selects the appropriate model for each sub-task within a request. The selection considers four factors:
- Task complexity. A simple factual lookup does not require the same model as a multi-step reasoning chain with tool calls. We classify tasks on a complexity spectrum and route accordingly.
- Required capabilities. Some tasks require code generation (DeepSeek Coder excels here). Some require long-context processing (Claude and Gemini handle 100K+ tokens). Some require structured output (GPT-4o with function calling is highly reliable). The router matches task requirements to model strengths.
- Latency budget. If a user is waiting in a real-time chat, we optimize for speed. If the task is an asynchronous workflow step, we optimize for quality. Smaller, faster models handle the former; larger, more capable models handle the latter.
- Cost efficiency. Running every task through the most expensive model is wasteful. Our router ensures that 70-80% of routine tasks are handled by cost-efficient models, while complex tasks get the full power of frontier models. This typically reduces LLM costs by 40-60% compared to single-model architectures.
Agents can also be configured with a primary provider preference. If your organization has an enterprise agreement with Anthropic, for example, you can set Claude as the default with fallback to other providers. The routing layer respects these preferences while maintaining the ability to fail over if a provider experiences downtime.
The Three-Layer Memory System
Memory is what separates an agent from a chatbot. Our memory architecture has three distinct layers, each serving a different purpose and using a different access pattern.
Episodic Memory
Episodic memory stores conversation history: the raw record of what was said, by whom, and when. Every message exchanged between an agent and a user is stored with full metadata: timestamps, channel context, sentiment signals, and tool invocations that occurred during the conversation. When a user returns after days or weeks, the agent can retrieve relevant prior conversations and resume with full context.
Episodic memories are indexed by both recency and relevance. A conversation from yesterday is weighted higher than one from three months ago, but if the three-month-old conversation is semantically similar to the current query, it surfaces. We use a hybrid retrieval approach: BM25 for keyword matching combined with dense vector similarity for semantic matching, re-ranked by a cross-encoder for final relevance scoring.
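The recency/relevance trade-off above can be sketched with a toy scoring function. The blend weights and the 30-day half-life are illustrative assumptions; the real system computes BM25 and dense similarity from indexes and re-ranks with a cross-encoder:

```python
def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: yesterday outweighs three months ago."""
    return 0.5 ** (age_days / half_life_days)

def hybrid_score(bm25: float, dense_sim: float, age_days: float,
                 alpha: float = 0.5) -> float:
    """Blend keyword (BM25) and semantic (dense) relevance, then apply
    a softened recency factor. The 0.5 floor on the recency factor is
    what lets a semantically strong old memory still surface."""
    relevance = alpha * bm25 + (1 - alpha) * dense_sim
    return relevance * (0.5 + 0.5 * recency_weight(age_days))
```

In this sketch, a highly relevant three-month-old conversation outscores a barely relevant one from yesterday, which is exactly the retrieval behavior described above.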
Semantic Memory
Semantic memory stores extracted facts and knowledge. When an agent processes a conversation, our extraction pipeline identifies factual statements: "The customer's contract renews in March," "The API uses OAuth 2.0 with PKCE," "This user prefers email communication over Slack." It stores them as discrete, retrievable knowledge entries. Each entry is vector-embedded and tagged with a confidence score, a source conversation reference, and an expiration policy.
Semantic memories are automatically deduplicated and updated. If a customer changes their preferred contact method, the new fact supersedes the old one rather than coexisting with it. This prevents the common failure mode where agents contradict themselves because they retrieved conflicting historical statements.
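A minimal sketch of that supersession behavior, assuming facts are keyed by a (subject, attribute) pair; the class and field names are hypothetical, not the production store:

```python
class SemanticStore:
    """Toy semantic memory: one current value per (subject, attribute)."""

    def __init__(self):
        self._facts = {}  # (subject, attribute) -> (value, confidence)

    def upsert(self, subject: str, attribute: str,
               value: str, confidence: float) -> None:
        # A newer statement replaces the old value for the same key,
        # so the agent never retrieves two contradictory facts.
        self._facts[(subject, attribute)] = (value, confidence)

    def lookup(self, subject: str, attribute: str):
        return self._facts.get((subject, attribute))
```

The production pipeline also handles near-duplicate phrasing via embedding similarity before deciding two statements describe the same fact, but the invariant is the same: one current value per fact.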
Procedural Memory
Procedural memory is the most novel layer. It stores learned behavioral patterns: sequences of actions that an agent has found effective. When an agent successfully resolves a particular type of request, say processing a refund by checking the order in Stripe, verifying the refund policy, and then initiating the refund, the system extracts the action sequence as a procedure. The next time a similar request arrives, the agent can retrieve and follow the learned procedure rather than reasoning from scratch.
Procedural memories improve over time. If an agent discovers a more efficient approach to a task, the new procedure replaces the old one. If a procedure fails, its confidence score decreases. This creates a genuine learning loop: agents get measurably better at their jobs the longer they run.
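The confidence loop can be sketched with a simple exponential update; the learning rate, the retrieval threshold, and the update rule itself are assumptions for illustration:

```python
class Procedure:
    """A learned action sequence with a trust score."""

    def __init__(self, steps, confidence: float = 0.5):
        self.steps = list(steps)
        self.confidence = confidence

    def record_outcome(self, success: bool, lr: float = 0.2) -> None:
        """Move confidence toward 1.0 on success, toward 0.0 on failure."""
        target = 1.0 if success else 0.0
        self.confidence += lr * (target - self.confidence)

def best_procedure(candidates, min_confidence: float = 0.3):
    """Retrieve the most trusted applicable procedure, or None,
    in which case the agent reasons from scratch."""
    viable = [p for p in candidates if p.confidence >= min_confidence]
    return max(viable, key=lambda p: p.confidence, default=None)
```

Repeated successes push a procedure toward high confidence; repeated failures push it below the retrieval threshold, at which point the agent falls back to first-principles reasoning and may learn a replacement.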
Tool Orchestration
Tools are how agents interact with external systems. Each integration (Google Workspace, GitHub, HubSpot, Stripe, Jira) exposes a set of tools that the agent can invoke. A Google Workspace integration provides tools like google_calendar_create_event, google_docs_read_document, and gmail_send_email. A Stripe integration provides stripe_get_customer, stripe_list_charges, and stripe_create_refund.
When the LLM decides that a tool invocation is needed, our orchestration layer handles execution. This is not a simple function call. The orchestrator manages authentication (refreshing OAuth tokens as needed), rate limiting (respecting each provider's API limits), error handling (retrying transient failures, surfacing permanent failures to the agent), and result formatting (converting raw API responses into context the LLM can reason about).
Critically, tool calls can be chained. A single user request might require the agent to read a Google Sheet, look up a customer in HubSpot based on data from that sheet, check their subscription in Stripe, and then send them an email via Gmail. The orchestrator manages this entire chain, passing results from each step to the next and maintaining a transaction-like context so the agent can reason about the full picture.
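The chaining described above can be sketched as a small executor. The retry policy, the shape of the shared context, and the way each step derives its arguments from prior results are illustrative assumptions:

```python
import time

def execute_chain(steps, tools, max_retries: int = 2, backoff: float = 0.5):
    """Run tool calls in order, feeding each result into a shared context
    so later steps can use earlier outputs.

    steps: list of (tool_name, build_args) where build_args maps the
           context so far to that tool's keyword arguments.
    tools: dict of tool_name -> callable.
    """
    context = {}
    for name, build_args in steps:
        args = build_args(context)              # derive args from prior results
        for attempt in range(max_retries + 1):
            try:
                context[name] = tools[name](**args)
                break
            except TimeoutError:                # transient: retry with backoff
                if attempt == max_retries:
                    raise                       # permanent: surface to the agent
                time.sleep(backoff * 2 ** attempt)
    return context
```

A refund-style chain would then be expressed as a list of steps, with each step's arguments computed from the accumulated context, which is what gives the agent a transaction-like view of the whole operation.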
Standing Orders and Guardrails
Every agent operates under standing orders, persistent instructions that survive across conversations. Standing orders define:
- Role and persona. "You are a senior customer support agent for Acme Corp. You are professional, empathetic, and concise."
- Operational boundaries. "You may issue refunds up to $100 without approval. Refunds above $100 require human escalation."
- Knowledge boundaries. "You are an expert on our product catalog and pricing. If asked about competitor products, politely redirect."
- Behavioral rules. "Always confirm the customer's identity before accessing account information. Never share internal metrics."
Standing orders are injected into every LLM call as system-level context. They are not suggestions. They are constraints enforced at the architectural level. Our guardrail system validates every agent response against the standing orders before it is sent, catching potential violations and either correcting them automatically or escalating to a human reviewer.
Putting It Together
The full processing pipeline, from message receipt to response delivery, typically completes in 1-4 seconds for straightforward requests and 5-15 seconds for complex multi-tool chains. Here is the sequence:
- Channel adapter receives and normalizes the incoming message.
- Context enrichment loads user identity, conversation history, and standing orders.
- Memory retrieval pulls relevant episodic, semantic, and procedural memories.
- The LLM router selects the appropriate model and constructs the prompt with full context.
- The LLM reasons about the request and produces a plan, which may include tool calls.
- The tool orchestrator executes any required tool calls, handling auth, retries, and chaining.
- Tool results are fed back to the LLM for final response generation.
- The guardrail system validates the response against standing orders.
- The response is delivered through the appropriate channel adapter.
- New memories are extracted and stored for future retrieval.
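The sequence above can be condensed into a pipeline sketch, where each stage is a callable threading a shared state forward; the stage names and state shape are illustrative:

```python
# One hypothetical stage name per step in the sequence above.
PIPELINE = [
    "normalize", "enrich", "retrieve_memories", "route_and_prompt",
    "reason", "execute_tools", "generate_response", "validate",
    "deliver", "store_memories",
]

def run_pipeline(message, stages):
    """Thread a mutable state dict through each stage in order,
    recording every step so the run is observable and auditable."""
    state = {"message": message, "log": []}
    for name in PIPELINE:
        state = stages[name](state)   # each stage returns updated state
        state["log"].append(name)     # every step is logged
    return state
```

Structuring the lifecycle as an ordered list of stages is also what makes the per-step logging straightforward: the audit trail falls out of the loop rather than being bolted onto each component.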
Every step is logged, observable, and auditable. Every tool call, every memory retrieval, every LLM invocation is recorded with full metadata for debugging, compliance, and continuous improvement.
This is the architecture that powers every HeartBeat Agent in production. It is the result of two years of engineering focused on a single question: what does it take to build an AI system that you can trust with real work?