Wiki + RAG, picked by the LLM
Most "chat with your docs" tools force a single retrieval mode — usually RAG. pocket llm ships both, and the model picks per question via tool calls.
The two modes
search_wiki — structured, deep
Maintained markdown articles in wiki/. Each article has a title, type, summary, and links to other articles (Karpathy LLM-wiki style). The LLM navigates the wiki the way a human would navigate a hand-written knowledge base.
Strengths:
- Deterministic answers — every claim cites an article.
- Compounds over time — articles link each other; coverage grows in the form of more articles, not bigger chunks.
- Human-editable. Setting frozen: true pins an article against regen, and git diff wiki/ is the safety net: review what the LLM produced before it ships.
Weaknesses:
- It is the slow path: generating the wiki runs the LLM over the sources. A 50-document set takes 5–30 minutes with a 4B-parameter model running on CPU.
- Coverage gaps until the wiki catches up to data/.
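To make the article shape concrete, here is a minimal sketch of reading one article and honoring the frozen flag. It assumes the metadata lives in a '---'-delimited header of key: value lines with comma-separated links; that layout, the field names beyond title/type/summary/links/frozen, and the helper names parse_article and should_regenerate are illustrative assumptions, not the project's confirmed format. The example article text is made up.

```python
from dataclasses import dataclass, field


@dataclass
class WikiArticle:
    title: str
    type: str
    summary: str
    links: list[str] = field(default_factory=list)
    frozen: bool = False
    body: str = ""


def parse_article(text: str) -> WikiArticle:
    """Parse an assumed '---' header of 'key: value' lines, then the body."""
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    if not stripped or stripped[0] != "---":
        raise ValueError("article has no metadata header")
    end = stripped.index("---", 1)  # position of the closing delimiter
    meta = {}
    for line in stripped[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    links = [item.strip() for item in meta.get("links", "").split(",") if item.strip()]
    return WikiArticle(
        title=meta.get("title", ""),
        type=meta.get("type", ""),
        summary=meta.get("summary", ""),
        links=links,
        frozen=meta.get("frozen", "false").lower() == "true",
        body="\n".join(lines[end + 1:]).strip(),
    )


def should_regenerate(article: WikiArticle) -> bool:
    """A frozen article is left alone by regeneration (hypothetical helper)."""
    return not article.frozen


# Made-up article used only for illustration.
EXAMPLE = """---
title: Return policy
type: policy
summary: Customers can return items within 30 days.
links: shipping.md, refunds.md
frozen: true
---
Body text here.
"""

article = parse_article(EXAMPLE)
```

Keeping the metadata in plain text is what lets git diff wiki/ show both content changes and a flipped frozen flag in the same review.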
search_rag — fast, fuzzy
Vector search over chunks of data/. Standard RAG: chunk every source, embed each chunk with bge-small, query at chat time.
Strengths:
- Instant — no LLM step at index time.
- Wide recall — fuzzy queries find chunks that nobody bothered to write a wiki article about.
Weaknesses:
- Quality varies with chunk-window heuristics.
- No citation discipline — chunks are bags of words, not arguments.
- Plateau-prone — once the chunks exist, RAG can't improve without re-chunking.
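The chunk, embed, and query loop described for search_rag can be sketched as follows. This is a toy under stated assumptions: toy_embed is a hashed bag-of-words stand-in so the example runs without downloading a model, whereas the real pipeline would embed with bge-small; the window size, overlap, and function names are also hypothetical.

```python
import hashlib
import math


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Stand-in embedding: hash each word into a bucket, then L2-normalize."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int.from_bytes(hashlib.sha256(word.encode()).digest(), "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(docs: dict[str, str]) -> list[tuple[str, str, list[float]]]:
    """Index time: chunk every source and embed each chunk. No LLM involved."""
    return [(name, c, toy_embed(c)) for name, text in docs.items() for c in chunk(text)]


def search_rag(index, query: str, top_k: int = 8):
    """Chat time: embed the query and rank chunks by cosine similarity."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), name, c) for name, c, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Made-up sources used only for illustration.
docs = {
    "a.txt": "returns are accepted within thirty days of purchase",
    "b.txt": "the office coffee machine is broken again",
}
index = build_index(docs)
hits = search_rag(index, "returns thirty days", top_k=1)
```

The sketch also shows why quality depends on chunking: the size and overlap values decide which words end up in the same vector, and nothing downstream can repair a bad split.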
Tool calling at chat time
The chat LLM is given both tools and decides:
{
  "name": "search_wiki",
  "input": { "query": "what's our return policy", "type": "policy" }
}
or
{
  "name": "search_rag",
  "input": { "query": "return policy 30 days", "top_k": 8 }
}
For a question like "what's our return policy", the model usually picks search_wiki (there's likely a return-policy.md article). For "who mentioned 30 days in the threads from March", it'll pick search_rag (fuzzy, free-text).
When both could work, the model is biased toward search_wiki first and falls back to search_rag on a miss.
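On the application side, handling these tool calls amounts to a lookup and a function call. The sketch below assumes the call arrives as a dict with the name and input keys shown above; the in-memory WIKI and CHUNKS data, the matching rules, and the wiki_then_rag helper are hypothetical stand-ins. In the real system the wiki-first preference is the model's own choice, so the helper only illustrates the same ordering in code.

```python
# Hypothetical in-memory data standing in for wiki/ and the chunk index.
WIKI = [{"file": "return-policy.md", "type": "policy", "title": "Return policy"}]
CHUNKS = ["From the March thread: someone proposed extending returns to 30 days."]


def search_wiki(query, type=None):
    """Return article files whose title words all appear in the query."""
    # `type` mirrors the JSON input key, so it shadows the builtin on purpose.
    words = set(query.lower().replace("'", " ").split())
    return [
        a["file"]
        for a in WIKI
        if (type is None or a["type"] == type)
        and set(a["title"].lower().split()) <= words
    ]


def search_rag(query, top_k=8):
    """Return chunks that share at least one word with the query."""
    words = set(query.lower().split())
    hits = [c for c in CHUNKS if words & set(c.lower().split())]
    return hits[:top_k]


TOOLS = {"search_wiki": search_wiki, "search_rag": search_rag}


def dispatch(call):
    """Route one tool call of the form {"name": ..., "input": {...}}."""
    handler = TOOLS[call["name"]]
    return handler(**call["input"])


def wiki_then_rag(query):
    """Try the wiki first; fall back to RAG when the wiki has no hit."""
    hits = dispatch({"name": "search_wiki", "input": {"query": query}})
    if hits:
        return "search_wiki", hits
    return "search_rag", dispatch({"name": "search_rag", "input": {"query": query}})


wiki_result = dispatch(
    {"name": "search_wiki", "input": {"query": "what's our return policy", "type": "policy"}}
)
fallback_tool, fallback_hits = wiki_then_rag("who mentioned 30 days")
```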
Why both
Most "RAG + something" projects fail because the "something" is a synonym for "RAG with extra steps." Here the modes have genuinely different strengths:
- Wiki = depth + citation + git review.
- RAG = breadth + speed + zero setup.
They're co-deployed, not competing. The LLM picks per query.
Other tools the chat LLM has
Beyond retrieval, the model gets:
- list_wiki: list all articles by type.
- read_wiki: read one article in full.
- Skill tools: every SKILL.md in skills/ registers itself as a tool the LLM can call. Industry-standard format, same as Claude Code / Codex / Microsoft Agent Framework.
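Skill registration can be pictured as a directory scan. The sketch below assumes each SKILL.md starts with a '---'-delimited header carrying name and description fields, and that a skill without a name falls back to its folder name; the load_skills function and the returned dict shape are hypothetical, not the project's actual loader.

```python
from pathlib import Path


def load_skills(skills_dir: Path) -> dict[str, dict]:
    """Map each skill name to a tool definition read from its SKILL.md header."""
    tools = {}
    for skill_file in sorted(skills_dir.rglob("SKILL.md")):
        text = skill_file.read_text(encoding="utf-8")
        lines = [line.strip() for line in text.splitlines()]
        meta = {}
        # Assumed header layout: 'key: value' lines between two '---' lines.
        if lines and lines[0] == "---" and "---" in lines[1:]:
            end = lines.index("---", 1)
            for line in lines[1:end]:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        name = meta.get("name") or skill_file.parent.name
        tools[name] = {
            "name": name,
            "description": meta.get("description", ""),
            "path": str(skill_file),
        }
    return tools
```

Calling load_skills(Path("skills")) at startup would yield one entry per skill, which could then be offered to the chat LLM alongside search_wiki and search_rag.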
Full chat-tool spec on GitHub →
The instructions file is portable
wiki-instructions.md is the same prompt template that:
- Feeds the local model when you run wiki generate.
- Can be fed to claude-code (or any other LLM) by hand for higher-quality output.
Whichever path produces the article, build doesn't care — the markdown file is the contract. This means you can use the local model as a baseline and selectively upgrade specific articles through a stronger LLM, without forking your workflow.