How We Built Agentic Memory in Case.dev

We built AI memory with PostgreSQL, pgvector, and two LLM calls—one to extract facts, one to deduplicate them semantically.

Max Sonderby


Every AI application eventually hits the same wall: conversations are stateless. Your carefully crafted assistant forgets everything the moment a session ends. Users repeat themselves. Context disappears. The experience feels hollow.

We needed memory for Case.dev — a way for AI agents to remember facts across conversations. After evaluating third-party services like Mem0, we decided to build our own. Here's how we did it with nothing but PostgreSQL, pgvector, and two well-placed LLM calls.

The Architecture

Memory systems need to do three things well:

  1. Extract meaningful facts from noisy conversations
  2. Deduplicate intelligently (semantic, not just exact match)
  3. Retrieve relevant memories at query time

We landed on a design we call the two-LLM-call pattern. The first call extracts facts. The second decides what to do with each fact against existing memories. Everything else is just PostgreSQL doing what PostgreSQL does best.

The Schema

CREATE TABLE memories (
  id              TEXT PRIMARY KEY,
  
  -- Generic indexed tag fields for flexible filtering
  tag_1           TEXT,
  tag_2           TEXT,
  -- ... through tag_12
  
  content         TEXT NOT NULL,           -- The actual memory
  content_hash    TEXT NOT NULL,           -- SHA-256 for exact dedup
  embedding       VECTOR(1536),            -- pgvector embedding
  
  category        TEXT DEFAULT 'fact',     -- User-defined classification
  source          TEXT DEFAULT 'chat',     -- chat | manual | import | api
  confidence      REAL DEFAULT 1.0,        -- Quality signal 0.0-1.0
  expires_at      TIMESTAMPTZ,             -- Optional TTL
  metadata        JSONB DEFAULT '{}',      -- Flexible non-indexed data
  
  created_at      TIMESTAMPTZ DEFAULT NOW(),
  updated_at      TIMESTAMPTZ DEFAULT NOW()
);

-- The magic: HNSW index for fast approximate nearest neighbor search
CREATE INDEX memories_embedding_idx 
  ON memories USING hnsw (embedding vector_cosine_ops);

-- Exact dedup fast path
CREATE UNIQUE INDEX memories_hash_idx 
  ON memories (content_hash);

A few design decisions worth noting:

Generic tag fields, not domain columns. We ship 12 indexed tag_N fields instead of opinionated columns like client_id or patient_id. A legal app might use tag_1=client_id, tag_2=matter_id. Healthcare might use tag_1=patient_id, tag_2=encounter_id. We don't dictate; we provide primitives.

SHA-256 content hash. Before any expensive vector operations, we check if an exact match already exists. This catches true duplicates instantly.

HNSW over IVFFlat. pgvector supports both index types. HNSW gives better query performance at the cost of slightly more memory and slower index builds. For a memory system where reads vastly outnumber writes, HNSW wins.

1536 dimensions. We use OpenAI's text-embedding-3-small. Good quality, reasonable cost, and the dimension count matters less than you'd think once you have proper indexing.

Step 1: Fact Extraction

When a conversation comes in, we don't store raw messages. We extract discrete, atomic facts:

const FACT_EXTRACTION_PROMPT = `You are a memory extraction system.

Extract discrete, atomic facts from conversation messages. Focus on:
- User preferences and requirements
- Important entities (people, companies, deadlines)
- Decisions made
- Relationships between entities

Rules:
1. Each fact should be a single, standalone piece of information
2. Use third person ("The user prefers..." not "You prefer...")
3. Include confidence based on how clearly stated the fact is

Output JSON: [{ "content": "...", "category": "...", "confidence": 0.9 }]`;

A conversation like:

User: I'm working with Acme Corp on the Johnson merger. Sarah Chen is my main contact there—she prefers email over calls. We need everything done by March 15th.

Becomes:

[
  { "content": "The user is working with Acme Corp on the Johnson merger", "category": "fact", "confidence": 0.95 },
  { "content": "Sarah Chen is the user's main contact at Acme Corp", "category": "contact", "confidence": 0.95 },
  { "content": "Sarah Chen prefers email communication over phone calls", "category": "preference", "confidence": 0.90 },
  { "content": "The Johnson merger has a deadline of March 15th", "category": "deadline", "confidence": 0.95 }
]

This decomposition is crucial. Atomic facts are easier to deduplicate, update, and retrieve than tangled conversation logs.
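The extraction call itself is just `callLLM(FACT_EXTRACTION_PROMPT, ...)` plus parsing, but the parsing deserves a little defense: models occasionally wrap JSON in code fences or emit out-of-range confidence values. A minimal sketch (the exact cleanup rules here are our assumptions, not guarantees about any particular model):

```typescript
interface ExtractedFact {
  content: string;
  category: string;
  confidence: number;
}

// Defensive parse of the extractor's JSON output. Strips optional
// markdown fences, drops malformed entries, and clamps confidence to [0, 1].
function parseFacts(raw: string): ExtractedFact[] {
  const stripped = raw
    .replace(/^```(?:json)?\s*/, '')
    .replace(/```\s*$/, '')
    .trim();
  const parsed = JSON.parse(stripped);
  if (!Array.isArray(parsed)) return [];
  return parsed
    .filter((f: any) => typeof f?.content === 'string' && f.content.length > 0)
    .map((f: any) => ({
      content: f.content.trim(),
      category: typeof f.category === 'string' ? f.category : 'fact',
      confidence: Math.min(1, Math.max(0, Number(f.confidence) || 0)),
    }));
}
```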

Step 2: Intelligent Deduplication

Here's where it gets interesting. For each extracted fact, we need to decide: is this new information, an update to something we know, a contradiction, or a duplicate?

The Fast Path: Exact Hash Match

import { createHash } from 'node:crypto';

function hashContent(content: string): string {
  // Normalize case and whitespace so trivially different strings dedupe
  return createHash('sha256')
    .update(content.toLowerCase().trim())
    .digest('hex');
}

// Check exact duplicate before anything expensive
const existing = await db.select()
  .from(memories)
  .where(eq(memories.contentHash, hashContent(fact.content)))
  .limit(1);

if (existing.length > 0) {
  // Exact match found, skip
  return { event: 'NONE', id: existing[0].id };
}

The Semantic Path: Vector Similarity + LLM Decision

If no exact match, we generate an embedding and find semantically similar memories:

// Generate embedding for the new fact
const embedding = await generateEmbedding(fact.content);

// Find similar memories (cosine similarity >= 0.85)
const similar = await db
  .select({
    memory: memories,
    similarity: sql`1 - (embedding <=> ${vectorLiteral(embedding)})`
  })
  .from(memories)
  .orderBy(sql`embedding <=> ${vectorLiteral(embedding)}`)
  .limit(20);

const candidates = similar.filter(r => r.similarity >= 0.85);

Note the pgvector syntax: <=> is cosine distance, so 1 - distance = similarity.
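The `vectorLiteral` helper used above isn't shown. pgvector accepts vectors as text in `[x,y,z]` form, so one minimal version looks like this (the drizzle `sql`-tag wiring and the `::vector` cast are omitted; this sketch covers only the serialization step we assume):

```typescript
// Serialize an embedding into the '[x,y,z]' text form pgvector accepts.
// The result is meant to be bound as a query parameter and cast with ::vector.
function toVectorText(embedding: number[]): string {
  if (embedding.length === 0) {
    throw new Error('empty embedding');
  }
  if (embedding.some((x) => !Number.isFinite(x))) {
    throw new Error('embedding contains non-finite values');
  }
  return `[${embedding.join(',')}]`;
}
```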

The LLM Decision

This is the clever bit. We ask an LLM to decide what to do, but we don't give it memory IDs. LLMs hallucinate IDs. Instead, we use array indices:

const MEMORY_UPDATE_PROMPT = `Given a NEW fact and EXISTING memories, decide:
- ADD: New information not in any existing memory
- UPDATE: Refines an existing memory (return index + merged content)
- DELETE: Contradicts an existing memory
- NONE: Already captured (duplicate)

For UPDATE, merge intelligently:
- Existing: "User prefers email"
- New: "User prefers email for non-urgent, phone for urgent"
- Merged: "User prefers email for non-urgent matters and phone for urgent issues"

Output: { "action": "...", "memory_index": null|0|1|..., "merged_content": "..." }`;

// Present memories by index, not ID
const indexedMemories = candidates.map((m, idx) => ({
  index: idx,
  content: m.memory.content,
  similarity: m.similarity.toFixed(2)
}));

const decision = await callLLM(MEMORY_UPDATE_PROMPT, `
  NEW FACT: "${fact.content}"
  EXISTING MEMORIES: ${JSON.stringify(indexedMemories)}
`);

The response might be:

{ "action": "UPDATE", "memory_index": 2, "merged_content": "Sarah Chen prefers email for non-urgent matters and phone calls for urgent issues" }

We map memory_index back to the actual memory ID ourselves. The LLM never sees UUIDs.
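One guard worth adding when doing that mapping (this helper is our illustration, not a required part of the pattern): validate the returned index before trusting it, and fall back to treating the fact as an ADD when the index is out of range.

```typescript
interface UpdateDecision {
  action: 'ADD' | 'UPDATE' | 'DELETE' | 'NONE';
  memory_index: number | null;
  merged_content?: string;
}

// Map the LLM's index back to a real memory ID, refusing anything
// out of range. Returning null signals "treat this as ADD instead".
function resolveTargetId(
  decision: UpdateDecision,
  candidateIds: string[],
): string | null {
  if (decision.action === 'ADD' || decision.action === 'NONE') return null;
  const idx = decision.memory_index;
  if (idx === null || !Number.isInteger(idx) || idx < 0 || idx >= candidateIds.length) {
    return null;
  }
  return candidateIds[idx];
}
```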

Executing the Decision

switch (decision.action) {
  case 'ADD':
    await db.insert(memories).values({
      content: fact.content,
      contentHash: hashContent(fact.content),
      embedding,
      category: fact.category,
      confidence: fact.confidence,
    });
    break;

  case 'UPDATE': {
    const target = candidates[decision.memory_index];
    const mergedEmbedding = await generateEmbedding(decision.merged_content);
    await db.update(memories)
      .set({
        content: decision.merged_content,
        contentHash: hashContent(decision.merged_content),
        embedding: mergedEmbedding,
      })
      .where(eq(memories.id, target.memory.id));
    break;
  }

  case 'DELETE':
    await db.delete(memories)
      .where(eq(memories.id, candidates[decision.memory_index].memory.id));
    break;

  case 'NONE':
    // Duplicate, do nothing
    break;
}

Retrieval: Semantic Search

Retrieval is straightforward. Generate an embedding for the query, find the nearest neighbors:

async function searchMemory(query: string, options: SearchOptions) {
  const queryEmbedding = await generateEmbedding(query);
  
  const results = await db
    .select({
      memory: memories,
      score: sql`1 - (embedding <=> ${vectorLiteral(queryEmbedding)})`
    })
    .from(memories)
    .where(and(
      // Optional tag filters
      options.tag1 ? eq(memories.tag1, options.tag1) : undefined,
      // Exclude expired
      sql`(expires_at IS NULL OR expires_at > NOW())`
    ))
    .orderBy(sql`embedding <=> ${vectorLiteral(queryEmbedding)}`)
    .limit(options.topK || 10);

  return results;
}

The combination of vector similarity and tag filtering is powerful. You can ask "What do I know about deadlines for client X?" by searching with query="deadlines" and tag_1="client_x_id".

Why This Filtering Method Matters

The legal world organizes information differently from firm to firm. A solo practitioner thinks in terms of clients and deadlines, while an AmLaw 100 firm structures everything around matters, billing codes, and practice groups. Every corporate counsel has their own way of slicing the same information.

Our memory API lets you define the taxonomy that matches how your customers and users actually work, rather than forcing them into a rigid schema designed for someone else.

The History Table

Every mutation gets logged:

CREATE TABLE memory_history (
  id               TEXT PRIMARY KEY,
  memory_id        TEXT NOT NULL,        -- No FK cascade - survives deletion
  event            TEXT NOT NULL,        -- ADD | UPDATE | DELETE | NONE
  previous_content TEXT,
  new_content      TEXT,
  metadata         JSONB,
  created_at       TIMESTAMPTZ DEFAULT NOW()
);

The key insight: memory_id has no foreign key constraint. When a memory is deleted, its history remains. This gives you a complete audit trail and the ability to answer "what did we used to know about X?"
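Answering that question is then a plain query (`mem_123` is a made-up ID for illustration):

```sql
-- Full lifecycle of one memory, including events after its deletion
SELECT event, previous_content, new_content, created_at
FROM memory_history
WHERE memory_id = 'mem_123'
ORDER BY created_at;
```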

Performance Considerations

Index everything you filter on. Each tag field should have an index:

CREATE INDEX memories_tag1_idx ON memories (tag_1);
CREATE INDEX memories_tag2_idx ON memories (tag_2);
-- ... and so on

Set HNSW parameters appropriately. The defaults work, but for larger datasets:

CREATE INDEX memories_embedding_idx ON memories 
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

Higher m means more connections per node (better recall, more memory). Higher ef_construction means more work at index time (better quality).
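There is also a query-time counterpart: pgvector's hnsw.ef_search setting controls how many candidates are examined per search (the value below is just an example):

```sql
-- Higher ef_search = better recall at the cost of query latency (default is 40)
SET hnsw.ef_search = 100;
```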

Batch when possible. If you're processing many facts, batch the embedding calls. OpenAI's embedding endpoint accepts arrays.
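A sketch of that batching (the chunk size of 100 is arbitrary, and the client parameter stands in for an OpenAI SDK instance; it's typed structurally here to keep the sketch self-contained):

```typescript
// Split an array into fixed-size chunks.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Minimal shape of the embeddings client this sketch assumes.
interface EmbeddingsClient {
  embeddings: {
    create(args: { model: string; input: string[] }): Promise<{ data: { embedding: number[] }[] }>;
  };
}

// Embed many facts with one request per chunk instead of one per fact.
async function embedAll(client: EmbeddingsClient, texts: string[]): Promise<number[][]> {
  const all: number[][] = [];
  for (const batch of chunk(texts, 100)) {
    const res = await client.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch, // the embeddings endpoint accepts an array of inputs
    });
    all.push(...res.data.map((d) => d.embedding));
  }
  return all;
}
```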

What We Learned

The two-LLM pattern works. Extraction + decision is more robust than trying to do everything in one call. Each call has a focused job.

Index-based references prevent hallucination. Never let an LLM generate IDs. Give it indices, map them yourself.

Generic fields beat domain models. We tried designing memory schemas for "legal apps" with client_id and matter_id columns. It was wrong. Tags are better. Let users decide what their dimensions mean.

SHA-256 is a cheap first filter. Most duplicates are exact. Check the hash before burning tokens on embeddings.

History is non-negotiable. You will need to debug. You will need to audit. Keep everything.

The API

The end result is a simple REST API:

# Add memories from conversation
POST /memory/v1
{ "messages": [...], "tag_1": "client_123" }

# Semantic search
POST /memory/v1/search
{ "query": "deadlines", "tag_1": "client_123", "top_k": 5 }

# List with filters
GET /memory/v1?tag_1=client_123&category=deadline

# Delete
DELETE /memory/v1/:id

Giving Agents Memory: Tool Definitions

An API is useless if agents can't call it. Here's how we expose memory to AI agents using Vercel AI SDK's tool() helper.

The Tool Definitions

We define three tools: memory_search, memory_add, and memory_forget.

import { tool } from 'ai';
import { z } from 'zod';

export const memorySearch = tool({
  description:
    'Search your memory for relevant facts, preferences, or context. Use this when you need to recall previously stored information.',
  inputSchema: z.object({
    query: z.string().describe('Natural language query to search memories'),
    limit: z.number().default(5),
  }),
  execute: async ({ query, limit }) => {
    const response = await fetch(`${API_BASE}/memory/v1/search`, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ query, top_k: limit }),
    });

    const data = await response.json();
    return {
      memories: data.results.map((m: any) => ({
        id: m.id,
        content: m.content,
        category: m.category,
        score: m.score,
      })),
    };
  },
});

export const memoryAdd = tool({
  description:
    'Store an important fact, preference, or piece of context for future reference.',
  inputSchema: z.object({
    content: z.string().describe('The fact to remember'),
    category: z.enum(['preference', 'fact', 'deadline', 'decision', 'context']).default('fact'),
  }),
  execute: async ({ content, category }) => {
    const response = await fetch(`${API_BASE}/memory/v1`, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        messages: [{ role: 'user', content }],
        category,
      }),
    });

    const data = await response.json();
    return {
      success: true,
      stored: data.results.map((m: any) => ({
        id: m.id,
        content: m.content,
        event: m.event,
      })),
    };
  },
});

export const memoryForget = tool({
  description:
    'Remove a stored memory that is no longer accurate or relevant.',
  inputSchema: z.object({
    query: z.string().describe('Describe what memory to forget'),
  }),
  execute: async ({ query }) => {
    // First, search for the memory
    const searchResponse = await fetch(`${API_BASE}/memory/v1/search`, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ query, top_k: 1 }),
    });

    const searchData = await searchResponse.json();
    const memoryToDelete = searchData.results?.[0];

    if (!memoryToDelete) {
      return { success: false, error: 'No matching memory found' };
    }

    // Delete it
    await fetch(`${API_BASE}/memory/v1/${memoryToDelete.id}`, {
      method: 'DELETE',
      headers: { 'Authorization': `Bearer ${apiKey}` },
    });

    return {
      success: true,
      deleted: { id: memoryToDelete.id, content: memoryToDelete.content },
    };
  },
});

The Memory Loop

We use a three-phase pattern in our chat routes:

Phase 1: Pre-fetch relevant memories before generating a response.

// Before calling streamText, search for relevant context
let memoryContext = '';
const lastUserMessage = messages.filter(m => m.role === 'user').pop();

if (lastUserMessage) {
  const response = await fetch(`${API_BASE}/memory/v1/search`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      query: lastUserMessage.content,
      top_k: 10,
    }),
  });

  const data = await response.json();
  if (data.results?.length > 0) {
    memoryContext = `\n\n[Relevant Context]
You remember the following facts:
${data.results.map((m: any) => `- ${m.content}`).join('\n')}

Use this context to provide personalized responses.`;
  }
}

Phase 2: Include memory tools in the agent's toolset.

const stream = streamText({
  model: modelInstance,
  system: systemPrompt + memoryContext,  // Inject pre-fetched memories
  messages,
  tools: {
    memory_search: memorySearch,
    memory_add: memoryAdd,
    memory_forget: memoryForget,
    // ... other tools
  },
});

Phase 3: Auto-store the conversation after completion.

// In onFinish callback
if (userText && assistantText) {
  await fetch(`${API_BASE}/memory/v1`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      messages: [
        { role: 'user', content: userText },
        { role: 'assistant', content: assistantText },
      ],
      category: 'fact',
    }),
  });
}

This gives you passive memory (automatically learning from conversations) and active memory (agent explicitly storing and retrieving).

Teaching the Agent to Use Memory Naturally

Tools are only useful if the agent knows when to use them. Here's guidance from our system prompt:

**Memory**: You remember things across conversations—like a colleague would.

Before doing any real work, recall what you know. Preferences? 
Context that should shape your approach? Let that inform your work silently.

**How to behave:**
- Just remember things. Don't say "I'll remember that"—it sounds robotic.
- Use what you remember naturally: "Since you mentioned you prefer..." 
  not "According to my memory..."
- When the user corrects something, quietly update your memory.

**What to remember:**
- Preferences, key dates, important facts, decisions made

**What NOT to remember:**
- Trivial or obvious things
- Anything the user asks you to forget

**Tone:**
- Never say: "I'll store this in memory", "Let me check my memory"
- The goal: feel like a colleague who naturally remembers your history together

The key insight: memory should be invisible. A good assistant doesn't announce "I'm saving this to my database." It just remembers.

Conclusion

Building memory doesn't require exotic infrastructure. PostgreSQL with pgvector handles the vector math. Two focused LLM calls handle the intelligence. Generic tag fields handle the flexibility. A history table handles the auditing.

The whole thing is ~1,000 lines of TypeScript and two database tables. No Redis. No Pinecone. No managed vector service. Just Postgres doing what it does best, with a little help from embeddings.

Sometimes the best architecture is the boring one.

Max Sonderby

Head of Product & Engineering

Building case.dev in San Francisco