Wiki + RAG, picked by the LLM

Most "chat with your docs" tools force a single retrieval mode, usually RAG. pocket llm ships two, a maintained wiki and vector-search RAG, and the model picks between them per question via tool calls.

The two modes

search_wiki — structured, deep

Maintained markdown articles in wiki/. Each article has a title, type, summary, and links to other articles (Karpathy LLM-wiki style). The LLM navigates the wiki the way a human would navigate a hand-written knowledge base.

Strengths:

  • Traceable answers: every claim cites an article.
  • Compounds over time: articles link to each other, and coverage grows by adding articles, not by enlarging chunks.
  • Human-editable: setting frozen: true on an article pins it so regeneration won't overwrite it.
  • git diff wiki/ is the safety net — review what the LLM produced before it ships.

Weaknesses:

  • The slow path. Generating the wiki runs every source through the LLM. A 50-document set takes 5–30 minutes with a 4B-parameter model running on CPU.
  • Coverage gaps until the wiki catches up to data/.
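
As a rough illustration of the article fields and the frozen pin described above, here is a minimal Python sketch. The field names (title, type, summary, links, frozen) come from the text; the `---`-delimited `key: value` header, the comma-separated links, and the helper names `parse_article` and `should_regenerate` are assumptions made for this example, not pocket llm's actual format or API.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """One wiki article. Field names follow the prose; the layout is assumed."""
    title: str
    type: str
    summary: str
    links: list[str] = field(default_factory=list)
    frozen: bool = False
    body: str = ""


def parse_article(text: str) -> Article:
    """Parse an assumed '---'-delimited 'key: value' header plus a markdown body."""
    _, header, body = text.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    links = [s.strip() for s in meta.get("links", "").split(",") if s.strip()]
    return Article(
        title=meta.get("title", ""),
        type=meta.get("type", ""),
        summary=meta.get("summary", ""),
        links=links,
        frozen=meta.get("frozen", "false").lower() == "true",
        body=body.strip(),
    )


def should_regenerate(article: Article) -> bool:
    """A frozen article is pinned, so regeneration must leave it untouched."""
    return not article.frozen


# Hypothetical article used only to exercise the sketch.
SAMPLE = """---
title: Return policy
type: policy
summary: How returns and refunds work.
links: shipping.md, warranty.md
frozen: true
---
Customers may return items within the stated window.
"""

article = parse_article(SAMPLE)
```

A generator built along these lines would call `should_regenerate` before rewriting a file, so hand-edited articles survive the next run and still show up unchanged in git diff wiki/.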

search_rag — fast, fuzzy

Vector search over chunks of data/. Standard RAG: chunk every source, embed each chunk with bge-small, query at chat time.

Strengths:

  • Instant — no LLM step at index time.
  • Wide recall — fuzzy queries find chunks that nobody bothered to write a wiki article about.

Weaknesses:

  • Quality varies with chunk-window heuristics.
  • No citation discipline — chunks are bags of words, not arguments.
  • Plateau-prone — once the chunks exist, RAG can't improve without re-chunking.
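
The chunk, embed, query loop described above can be sketched in a few lines of Python. This is only an illustration: the `embed` function below is a toy hashed bag-of-words vector standing in for bge-small, and the window sizes, the `build_index` helper, and the sample sources are all assumptions rather than pocket llm's real settings.

```python
import hashlib
import math
import re

DIM = 1024  # toy dimension; a real embedding model defines its own


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode()).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(sources: dict[str, str]) -> list[tuple[str, str, list[float]]]:
    """Chunk and embed every source once, at index time. No LLM is involved."""
    return [(name, c, embed(c)) for name, text in sources.items() for c in chunk(text)]


def search_rag(index, query: str, top_k: int = 8):
    """Rank chunks by cosine similarity to the query (vectors are unit length)."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), name, text) for name, text, vec in index]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]


# Hypothetical sources used only to exercise the sketch.
SOURCES = {
    "threads-march.txt": "Dana said the return window is 30 days for unopened items.",
    "shipping.txt": "Orders ship within two business days from the main warehouse.",
}
index = build_index(SOURCES)
hits = search_rag(index, "return policy 30 days", top_k=1)
```

The sketch also makes the plateau visible: the only tunables are the chunk window and the embedding, so retrieval quality is fixed once `build_index` has run.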

Tool calling at chat time

The chat LLM is given both tools and decides:

```json
{
  "name": "search_wiki",
  "input": { "query": "what's our return policy", "type": "policy" }
}
```

or

```json
{
  "name": "search_rag",
  "input": { "query": "return policy 30 days", "top_k": 8 }
}
```

For a question like "what's our return policy", the model usually picks search_wiki (there's likely a return-policy.md article). For "who mentioned 30 days in the threads from March", it'll pick search_rag (fuzzy, free-text).

When both could work, the model is biased to try search_wiki first and fall back to search_rag on a miss.
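
To make the routing concrete, here is a hedged Python sketch of how a host program might execute one of these tool calls, plus a wiki-first fallback. In pocket llm the model itself makes the choice; the `wiki_then_rag` function hard-codes that preference in ordinary code purely for illustration. The in-memory `WIKI` and `CHUNKS` data, the toy word matching, and every function name other than the two tool names are assumptions.

```python
import json
from typing import Callable, Optional

# Hypothetical in-memory stand-ins for the two retrieval backends.
WIKI = {"return-policy.md": {"type": "policy", "text": "See the return policy article."}}
CHUNKS = ["In March, Sam mentioned 30 days in the support thread."]


def search_wiki(query: str, type: Optional[str] = None) -> list[str]:
    """Toy match: return articles whose file-name slug shares a word with the query.

    The parameter is named `type` so the tool call's JSON input maps onto it directly.
    """
    words = set(query.lower().replace("'", " ").split())
    return [
        name for name, art in WIKI.items()
        if (type is None or art["type"] == type)
        and words & set(name[: -len(".md")].split("-"))
    ]


def search_rag(query: str, top_k: int = 8) -> list[str]:
    """Toy match: return chunks sharing at least one word with the query."""
    words = set(query.lower().split())
    hits = [
        c for c in CHUNKS
        if words & set(c.lower().replace(",", " ").replace(".", " ").split())
    ]
    return hits[:top_k]


TOOLS: dict[str, Callable[..., list[str]]] = {
    "search_wiki": search_wiki,
    "search_rag": search_rag,
}


def dispatch(tool_call_json: str) -> list[str]:
    """Run one tool call of the form {"name": ..., "input": {...}}."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["input"])


def wiki_then_rag(query: str) -> tuple[str, list[str]]:
    """Try the wiki first; fall back to RAG when the wiki returns nothing."""
    hits = search_wiki(query)
    if hits:
        return "search_wiki", hits
    return "search_rag", search_rag(query)
```

Calling `wiki_then_rag("who mentioned 30 days in March")` misses the toy wiki and lands on the RAG path, which mirrors the fallback behaviour described above.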

Why both

Most "RAG + something" projects fail because the "something" is a synonym for "RAG with extra steps." Here the modes have genuinely different strengths:

  • Wiki = depth + citation + git review.
  • RAG = breadth + speed + zero setup.

They're co-deployed, not competing. The LLM picks per query.

Other tools the chat LLM has

Beyond retrieval, the model gets:

  • list_wiki — list all articles by type.
  • read_wiki — read one article in full.
  • Skill tools — every SKILL.md in skills/ registers itself as a tool the LLM can call. The SKILL.md format is the same one used by Claude Code, Codex, and Microsoft Agent Framework.
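
Skill registration could look roughly like the Python sketch below, which walks a skills/ directory and turns each SKILL.md into a tool entry. It assumes each SKILL.md opens with a `---`-delimited header carrying `name` and `description`, reads only flat `key: value` pairs rather than full YAML, and returns a made-up tool-entry shape; `read_front_matter` and `discover_skills` are hypothetical names, not pocket llm functions.

```python
import tempfile
from pathlib import Path


def read_front_matter(text: str) -> dict[str, str]:
    """Read flat 'key: value' pairs from a leading '---' block (not full YAML)."""
    if not text.startswith("---"):
        return {}
    header = text.split("---", 2)[1]
    pairs = (line.partition(":") for line in header.strip().splitlines())
    return {key.strip(): value.strip() for key, _, value in pairs if key.strip()}


def discover_skills(root: Path) -> list[dict[str, str]]:
    """Build one tool entry per SKILL.md found anywhere under `root`."""
    tools = []
    for path in sorted(root.rglob("SKILL.md")):
        meta = read_front_matter(path.read_text(encoding="utf-8"))
        if "name" in meta:  # skip files without a usable header
            tools.append({
                "name": meta["name"],
                "description": meta.get("description", ""),
                "path": str(path),
            })
    return tools


# Exercise the sketch against a throwaway directory with one hypothetical skill.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "skills" / "summarize"
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text(
        "---\nname: summarize\ndescription: Summarize a document.\n---\nSteps go here.\n",
        encoding="utf-8",
    )
    registered = discover_skills(Path(tmp) / "skills")
```

The description is what the chat LLM would see when deciding whether to call the skill, so in a setup like this it pays to keep it short and specific.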

Full chat-tool spec on GitHub →

The instructions file is portable

wiki-instructions.md is the same prompt template that:

  • Feeds the local model when you run wiki generate.
  • Can be fed by hand to Claude Code (or any other LLM tool) for higher-quality output.

Whichever path produces the article, build doesn't care — the markdown file is the contract. This means you can use the local model as a baseline and selectively upgrade specific articles through a stronger LLM, without forking your workflow.

pocket llm — local-first, offline, no telemetry. MIT licensed.