May 07, 2026 · 6 min read

Direct-Context RAG: When You Don't Need a Vector Database

Building the AutomaQue CRM chatbot, I skipped embeddings entirely. Here's the math, the implementation, and when to upgrade.

rag · claude-api · bigquery · architecture

I built a CRM. I wanted users to ask it questions in plain English (“Which deals are most likely to close this month?” “Who haven’t I followed up with in 30+ days?”) and get real answers grounded in their actual data.

The default 2026 answer is RAG with embeddings: chunk the data, embed the chunks, store them in a vector database, embed the query, retrieve the top-K matches, stuff them into the prompt. Pinecone or pgvector or Weaviate or BigQuery vector search.

I didn’t do any of that.

For AutomaQue CRM, I built a chatbot that snapshots the entire database, passes it as a single chunk of context to Claude, and lets the model do the “retrieval” itself. No embeddings. No vector store. No retrieval pipeline. Here’s why — and when you should make the same call.

Figure: AutomaQue CRM answering against the live CRM snapshot

Why Everyone Reaches for Vectors First

The vector-database playbook came from a real constraint: in 2022, GPT-3.5 had a 4K context window. If your knowledge base was bigger than four thousand tokens (it always was), you had no choice. Chunk, embed, retrieve.

That constraint is gone. Claude Sonnet 4.5 has a 200K context window. Claude Opus 4.7 with the 1M extension has a million. The question stopped being “how do I fit my data into the prompt?” and became “do I need to fit everything, or just the relevant slice?”

For most personal-scale use cases — a CRM, a personal knowledge base, a small docs site — the entire dataset fits comfortably in context. And when it does, the embedding pipeline is dead weight.

The Math: How Big Is Your Data, Really?

Run the numbers before you reach for a vector store. Rough conversion: 1 token ≈ 4 characters of English text. So 200K tokens is roughly 800K characters — about 130,000 English words.

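For a CRM, that comfortably covers:

  • 500 contacts at ~30 words each (name, company, tags, notes): ~15,000 words
  • 100 deals at ~20 words each: ~2,000 words
  • 1,000 activities at ~15 words each: ~15,000 words
  • A few hundred notes at a paragraph each: ~15,000 words

That’s a fully loaded personal CRM of roughly 47,000 words, or about 70K tokens by the ratio above. Dumped flat as JSON, the keys and punctuation add overhead on top, but the snapshot should still sit under 100K tokens. With Claude Sonnet 4.5’s 200K window you have headroom to spare.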
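To make the arithmetic concrete, here is a minimal sketch of the same back-of-the-envelope budget. The `RecordGroup` type, the `estimateTokens` helper, and the sample counts are illustrative assumptions, not measurements from AutomaQue. The 6-characters-per-word figure is also an assumption, chosen to match the 800K characters ≈ 130,000 words conversion above.

```typescript
// Back-of-the-envelope token budget. All names and numbers are illustrative.
interface RecordGroup {
  label: string;
  count: number;          // how many records of this kind
  wordsPerRecord: number; // rough average per record
}

const CHARS_PER_TOKEN = 4; // rule of thumb for English text
const CHARS_PER_WORD = 6;  // assumption: ~5 letters plus a space

function estimateTokens(groups: RecordGroup[]): number {
  const words = groups.reduce((sum, g) => sum + g.count * g.wordsPerRecord, 0);
  return Math.ceil((words * CHARS_PER_WORD) / CHARS_PER_TOKEN);
}

// Hypothetical small CRM; substitute your own counts.
const sample: RecordGroup[] = [
  { label: "contacts", count: 500, wordsPerRecord: 30 },
  { label: "deals", count: 100, wordsPerRecord: 20 },
  { label: "activities", count: 1000, wordsPerRecord: 15 },
];

const tokens = estimateTokens(sample);
console.log(`${tokens} estimated tokens, ${200_000 - tokens} left in a 200K window`);
```

The estimate ignores JSON keys, quotes, and punctuation, so treat it as a lower bound and confirm against a real token count.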

Don’t guess at this — measure it. The Anthropic SDK exposes a token-counting endpoint. Run it against a snapshot of your real data before deciding you need a retrieval pipeline.
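A minimal sketch of that check follows. To keep it runnable without network access, the counter is passed in as a function: in a real app it would wrap the SDK’s token-counting call, and `roughCounter` here is only a 4-characters-per-token fallback. `TokenCounter`, `toReport`, `checkSnapshotFits`, and the 150K budget are hypothetical names and values chosen for illustration.

```typescript
type TokenCounter = (text: string) => Promise<number>;

// Fallback heuristic only; a real counter would call the token-counting endpoint.
const roughCounter: TokenCounter = async (text) => Math.ceil(text.length / 4);

interface FitReport {
  tokens: number;
  budget: number;
  fits: boolean;
}

function toReport(tokens: number, budget: number): FitReport {
  return { tokens, budget, fits: tokens <= budget };
}

async function checkSnapshotFits(
  snapshot: unknown,
  countTokens: TokenCounter = roughCounter,
  budget = 150_000, // assumed budget: a 200K window minus room for the conversation
): Promise<FitReport> {
  const tokens = await countTokens(JSON.stringify(snapshot));
  return toReport(tokens, budget);
}

// Usage with a toy snapshot
checkSnapshotFits({ contacts: [{ name: "Ada", company: "Acme" }] })
  .then((report) => console.log(report));
```

Injecting the counter also makes it easy to run the same check in CI against an exported snapshot, so you notice when the data starts approaching the budget.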

The Implementation

The chat endpoint is a single API route. On every turn, it loads a snapshot of contacts, deals, activities, and notes (queried from BigQuery and cached server-side for 60 seconds), serializes it as JSON, and sends it to Claude in the system prompt with prompt caching enabled. The conversation so far, ending with the user’s latest message, goes in the messages array.

Figure: The same CRM state that feeds the direct-context assistant

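// /api/chat
import Anthropic from "@anthropic-ai/sdk";
// Assumed helper (not shown; the module path is a placeholder): returns the
// CRM snapshot from BigQuery behind a 60s server-side cache.
import { getCachedSnapshot } from "./snapshot";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function POST(req: Request) {
  const { messages } = await req.json();
  const snapshot = await getCachedSnapshot();

  // messages.stream() returns the stream object directly, so no await is needed.
  const stream = anthropic.messages.stream({
    model: "claude-sonnet-4-5",
    max_tokens: 2048,
    // The snapshot lives in the system prompt so the prompt cache can reuse it across turns.
    system: [{
      type: "text",
      text: `You are a CRM assistant. Here is the full CRM state as JSON:
${JSON.stringify(snapshot)}
Answer questions using only this data. Cite specific records.`,
      cache_control: { type: "ephemeral" },
    }],
    messages,
  });

  // Forward the event stream to the client as it arrives.
  return new Response(stream.toReadableStream());
}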

The 60-second server-side cache on the snapshot is doing a lot of work. Within a single chat session, the user typically sends 3-5 questions in quick succession; we hit BigQuery once and serve the rest from memory. Anthropic’s prompt cache then handles the downstream cost — the same JSON blob gets cached for repeat calls so subsequent turns pay roughly 10% of the input cost of the first.
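For illustration, a 60-second snapshot cache might look like the sketch below. `createSnapshotCache`, `fetchSnapshot`, and the injectable clock are hypothetical; the post doesn’t show the real `getCachedSnapshot`, and the BigQuery query stays behind the `fetchSnapshot` parameter rather than being guessed at.

```typescript
type Fetcher<T> = () => Promise<T>;

// Wraps a slow fetcher with a time-based (TTL) cache.
function createSnapshotCache<T>(
  fetchSnapshot: Fetcher<T>,
  ttlMs = 60_000,
  now: () => number = Date.now, // injectable so tests can control time
): Fetcher<T> {
  let cached: { value: T; expiresAt: number } | null = null;
  let inFlight: Promise<T> | null = null;

  return async () => {
    if (cached && now() < cached.expiresAt) return cached.value;
    if (inFlight) return inFlight; // concurrent callers share one fetch

    inFlight = fetchSnapshot().then(
      (value) => {
        cached = { value, expiresAt: now() + ttlMs };
        inFlight = null;
        return value;
      },
      (err) => {
        inFlight = null; // let the next call retry after a failure
        throw err;
      },
    );
    return inFlight;
  };
}

// Usage: the fetcher here is a stand-in for the real BigQuery query.
let queryCount = 0;
const getSnapshot = createSnapshotCache(async () => ({ fetchedOnQuery: ++queryCount }));
getSnapshot().then((snap) => console.log(snap));
```

A side effect worth noting: the prompt cache only hits when the prompt prefix is identical, so serving the same snapshot object for 60 seconds also keeps the serialized JSON stable between turns.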

What You Give Up

Two things, both real:

Cost per call. Even with caching, you’re paying to ship the whole dataset through the model on every cache miss. For a CRM with maybe 50 chat sessions a day, this is rounding-error money. For a system serving 50,000 users a day, it would dominate.
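A rough way to put numbers on this is sketched below. Everything in `examplePricing` is an assumption to check against current pricing: the per-million-token input rate, and the multipliers for cache writes and cache reads. The model also assumes every turn after the first hits the prompt cache, and it ignores output tokens and the conversation itself. `Pricing` and `sessionInputCost` are hypothetical names.

```typescript
interface Pricing {
  inputPerMTok: number;         // USD per million uncached input tokens
  cacheWriteMultiplier: number; // first turn pays a premium to write the cache
  cacheReadMultiplier: number;  // later turns read the cached prefix at a discount
}

// Assumed example rates; verify before relying on them.
const examplePricing: Pricing = {
  inputPerMTok: 3,
  cacheWriteMultiplier: 1.25,
  cacheReadMultiplier: 0.1,
};

// Input-side cost of the snapshot over one chat session.
function sessionInputCost(
  snapshotTokens: number,
  turns: number,
  p: Pricing = examplePricing,
): number {
  if (turns <= 0) return 0;
  const perToken = p.inputPerMTok / 1_000_000;
  const firstTurn = snapshotTokens * perToken * p.cacheWriteMultiplier;
  const laterTurns = (turns - 1) * snapshotTokens * perToken * p.cacheReadMultiplier;
  return firstTurn + laterTurns;
}

const perSession = sessionInputCost(80_000, 4);
console.log(`per session: $${perSession.toFixed(3)}`);
console.log(`per day at 50 sessions: $${(perSession * 50).toFixed(2)}`);
```

Plugging in your own measured snapshot size and session count shows quickly whether the snapshot cost is negligible or a line item.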

Hard latency floor. Sending 80K tokens of context takes longer than sending 2K. First token might land in 1-2 seconds instead of 200ms. For a chat UI with streaming, this is fine. For a sub-second autocomplete, it’s not.

Notice what you don’t give up: retrieval quality. The model has the whole dataset, so there is no retrieval step that can miss the relevant record. This eliminates an entire class of RAG bug: queries that fail because the right chunk didn’t make it into the top-K.

When to Upgrade to Vectors

Three triggers, and only three:

  1. Data outgrows context. The day your snapshot exceeds ~150K tokens (leaving headroom for conversation), it’s time. Not before.
  2. Latency stops being acceptable. If first-token time is breaking the UX, retrieve a slice instead of shipping everything.
  3. Cost stops penciling. If per-conversation cost crosses your unit-economics line, retrieval gets you back under it.

Critically: in all three cases, the migration is a backend change. The chat endpoint signature stays the same. The frontend doesn’t know the difference. You’ve bought yourself a clean upgrade path without paying for it on day one.
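One way to keep that upgrade path open is to hide context assembly behind a small interface. The sketch below is an assumption about how such a seam could look, not the AutomaQue code: `ContextProvider`, `directContext`, `keywordSliceContext`, and `buildSystemPrompt` are hypothetical names, and the keyword filter is only a stand-in for a real embedding-based retriever.

```typescript
type CrmRecord = { id: string; text: string };

// The chat endpoint depends only on this interface.
interface ContextProvider {
  getContext(question: string): Promise<string>;
}

// Today: ship everything, ignoring the question.
function directContext(records: CrmRecord[]): ContextProvider {
  return { getContext: async () => JSON.stringify(records) };
}

// Later: ship a slice. A naive keyword match stands in for vector retrieval.
function keywordSliceContext(records: CrmRecord[], topK = 20): ContextProvider {
  return {
    getContext: async (question) => {
      const terms = question.toLowerCase().split(/\W+/).filter((t) => t.length > 2);
      const slice = records
        .map((r) => ({
          r,
          score: terms.filter((t) => r.text.toLowerCase().includes(t)).length,
        }))
        .filter((s) => s.score > 0)
        .sort((a, b) => b.score - a.score)
        .slice(0, topK)
        .map((s) => s.r);
      return JSON.stringify(slice);
    },
  };
}

// The endpoint builds its prompt the same way regardless of the provider.
async function buildSystemPrompt(provider: ContextProvider, question: string): Promise<string> {
  const context = await provider.getContext(question);
  return `You are a CRM assistant. Here is the CRM state as JSON:\n${context}`;
}
```

Swapping `directContext` for a retrieval-backed provider changes one constructor call; the request and response shapes of the chat route stay the same.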

The Generalization

This isn’t really a post about RAG. It’s a post about defaults aging out. Best practices that were forced by 2022’s constraints aren’t best practices in 2026. Vector databases were a brilliant workaround for tiny context windows. Tiny context windows are now history.

The lesson generalizes. Every time you reach for a complex pattern, ask the simpler question first: does the constraint that justified this pattern still exist? Sometimes it does. Often, it doesn’t, and you’ve been carrying complexity for a problem the platform already solved.

For AutomaQue CRM, skipping the vector store removed an entire infrastructure layer, simplified the chat endpoint to under 30 lines, and made the whole system easier to reason about. The day my data outgrows the context window, I’ll add embeddings. Until then, the simpler system wins.
