Architecture overview

   user shell                                     browser
        │                                            │
        ▼                                            ▼
  ┌───────────┐                               ┌─────────────┐
  │ cobra CLI │                               │  dashboard  │  (static, embedded via embed.FS)
  └─────┬─────┘                               └──────┬──────┘
        │                                            │  HTTP (localhost, no auth)
        ▼                                            ▼
  ┌───────────────────────────────────────────────────────────────┐
  │         apps/cli/internal/service/   ←  FUNCTION LAYER        │
  │  (CLI- and HTTP-agnostic Go funcs: Init, Doctor, Build, …)    │
  └────┬───────────────────┬───────────────────┬──────────────────┘
       ▼                   ▼                   ▼
  ┌─────────┐         ┌──────────┐      ┌─────────────┐
  │ wiki +  │         │  skills  │      │ apps/llama  │  (cgo, in-process)
  │ RAG     │         └──────────┘      │   │         │
  └─────────┘                          │   ▼         │
                                       │ libllama.a  │  (statically linked)
                                       └─────────────┘

One binary, one process. No subprocess, no port allocation, no HTTP round-trip per token. llama.cpp is linked statically into the Go binary via cgo. The dashboard assets are baked in via embed.FS.

Layered design

  1. apps/cli/internal/cli/ (cobra commands) — thin presentation layer. Parses flags, builds option structs, calls into service/.
  2. apps/cli/internal/tui/ (Bubble Tea) — interactive prompts (model picker so far). Also a thin layer above service/.
  3. apps/cli/internal/http/ (HTTP handlers) — used by serve. Pure presentation; routes to service/.
  4. apps/cli/internal/service/ (the function layer): all business logic. CLI-agnostic, HTTP-agnostic. Returns errors and values. Never calls os.Exit, never prints to stdout.
  5. Foundation packages: manifest/, project/, wiki/, wikigen/, rag/, skills/, packzip/. Each owns one slice of pure-Go logic.

Why function-layer matters →

Three apps, one binary

  • apps/cli — the Go binary local-agents. Hosts the embedded HTTP server when serve is run.
  • apps/llama — cgo bindings to llama.cpp (git submodule, statically linked).
  • apps/dashboard — Vite + React 18 + TS SPA. Built to dist/ and embedded via embed.FS into the CLI binary.

Deep dive →

Two retrieval modes

At chat time the LLM can call two tools:

  • search_wiki — structured navigation over articles in wiki/. Deep, deterministic, citation-friendly.
  • search_rag — vector search over chunks of data/. Fast, fuzzy, fall-through for queries the wiki doesn't cover yet.

The model decides per question. Both modes co-exist instead of competing. Why hybrid →

Where the model lives

Platform  Default path                                                      Env override
linux     $XDG_CONFIG_HOME/local-agents/ (default ~/.config/local-agents/)  $LOCAL_AGENTS_HOME
darwin    ~/.config/local-agents/                                           $LOCAL_AGENTS_HOME
windows   %USERPROFILE%\.config\local-agents\                               %LOCAL_AGENTS_HOME%

Layout:

.config/local-agents/
├── model/<id>.gguf       # the chat model
└── embed/<id>.onnx       # the embedding model (bge-small by default)

Machine-wide cache. Same blob serves every project on the host.

Why cgo, not subprocess

We considered running a prebuilt llama-server as a child process. The trade-offs came out in favor of cgo because:

  • Subprocess management is non-trivial: ports, health checks, crash recovery, version compatibility.
  • An HTTP round-trip per token isn't free at our volumes.
  • share --self-contained would have to bundle a llama-server binary per OS/arch — multiplying zip size.
  • One static binary is a strictly better UX: "download one file, run it."

The cost: a cgo build step + a C toolchain on dev machines + a CI build per OS/arch. All accepted.

Read the specs

The full design lives under docs/specs/ in the repo — browse on GitHub →

pocket llm — local-first, offline, no telemetry. MIT licensed.