Architecture overview
       user shell              browser
            │                     │
            ▼                     ▼
      ┌───────────┐        ┌─────────────┐
      │ cobra CLI │        │  dashboard  │  (static, embedded via embed.FS)
      └─────┬─────┘        └──────┬──────┘
            │                     │ HTTP (localhost, no auth)
            ▼                     ▼
    ┌───────────────────────────────────────────────────────────────┐
    │  apps/cli/internal/service/   ← FUNCTION LAYER                │
    │  (CLI- and HTTP-agnostic Go funcs: Init, Doctor, Build, …)    │
    └───┬──────────────────┬───────────────┬────────────────────────┘
        ▼                  ▼               ▼
    ┌─────────┐      ┌──────────┐   ┌─────────────┐
    │ wiki +  │      │  skills  │   │ apps/llama  │ (cgo, in-process)
    │  RAG    │      └──────────┘   │      │      │
    └─────────┘                     │      ▼      │
                                    │ libllama.a  │ (statically linked)
                                    └─────────────┘

One binary, one process. No subprocess, no port allocation, no HTTP round-trip per token. llama.cpp is linked statically into the Go binary via cgo. The dashboard assets are baked in via embed.FS.
Layered design
- apps/cli/internal/cli/ (cobra commands): thin presentation layer. Parses flags, builds option structs, calls into service/.
- apps/cli/internal/tui/ (Bubble Tea): interactive prompts (only the model picker so far). Also a thin layer above service/.
- apps/cli/internal/http/ (HTTP handlers): used by serve. Pure presentation; routes to service/.
- apps/cli/internal/service/: the function layer. All business logic. CLI-agnostic, HTTP-agnostic. Returns errors and values. Never calls os.Exit, never prints to stdout.
- Foundation packages: manifest/, project/, wiki/, wikigen/, rag/, skills/, packzip/. Each owns one slice of pure-Go logic.
Three apps, one binary
- apps/cli: the Go binary local-agents. Hosts the embedded HTTP server when serve is run.
- apps/llama: cgo bindings to llama.cpp (git submodule, statically linked).
- apps/dashboard: Vite + React 18 + TS SPA. Built to dist/ and embedded via embed.FS into the CLI binary.
Two retrieval modes
At chat time the LLM has two tool calls:
- search_wiki: structured navigation over articles in wiki/. Deep, deterministic, citation-friendly.
- search_rag: vector search over chunks of data/. Fast, fuzzy, fall-through for queries the wiki doesn't cover yet.
The model decides per question. Both modes co-exist instead of competing. Why hybrid →
Where the model lives
| Platform | Default path | Env override |
|---|---|---|
| linux | $XDG_CONFIG_HOME/local-agents/ (default ~/.config/local-agents/) | $LOCAL_AGENTS_HOME |
| darwin | ~/.config/local-agents/ | $LOCAL_AGENTS_HOME |
| windows | %USERPROFILE%\.config\local-agents\ | %LOCAL_AGENTS_HOME% |
Layout:
.config/local-agents/
├── model/<id>.gguf # the chat model
└── embed/<id>.onnx # the embedding model (bge-small by default)

Machine-wide cache. The same blob serves every project on the host.
Why cgo, not subprocess
We considered running a prebuilt llama-server as a child process. We chose cgo instead because:
- Subprocess management is non-trivial: ports, health checks, crash recovery, version compatibility.
- HTTP round-trip per token isn't free at our volumes.
- share --self-contained would have to bundle a llama-server binary per OS/arch, multiplying zip size.
- One static binary is a strictly better UX: "download one file, run it."
The cost: a cgo build step + a C toolchain on dev machines + a CI build per OS/arch. All accepted.
Read the specs
The full design lives under docs/specs/ in the repo — browse on GitHub →