What local-first actually costs

Post #3 in the Lumogis launch series. Post #1 made the case for keeping your data home. Post #2 showed the assistant working there. This one is the bill. If the first two talked you into it, you should know exactly what you are signing up for before you run docker compose up.

Most posts about self-hosted AI sell the upside and go quiet on the rest. That is a tell. A project that believes in the trade will name the losses first, because the losses are real and you will hit them in the first hour. Here they are, in full.

The bargain

Running the assistant on your own hardware means giving up three comforts you have stopped noticing. You give up the frontier-grade model, because what fits on your machine is smaller than what sits in a browser tab. You give up zero setup, because nobody has provisioned anything for you. And you give up someone else's operations team, because that team is now you.

In return you get locality and a corpus that compounds, the things the first two posts were about. Whether that is a good deal depends entirely on who you are, which is the last section. First, the costs in detail.

The honest catch: a clean install does not act

Lumogis is a self-hosted personal AI that runs on your own hardware as a set of Docker containers: an orchestrator, a vector store, a database, and a local model runner, all sitting next to your files rather than across a vendor's boundary. A fresh install of Lumogis pulls two local models automatically on first boot: nomic-embed-text for embeddings and llama3.2:3b for chat. A 3B model is genuinely useful for grounded, retrieval-backed questions about your own files. It is not a frontier model, and pretending its open-ended reasoning is in the same class as the hosted giants would be dishonest. For "what did this document say," it holds up. For "think hard about this gnarly problem," you will feel the gap.

There is a sharper catch, and it is the one most likely to trip you up after reading post #2. That post described an assistant that searches, opens files, and acts through a tool loop. The default local model is registered without tool-calling. So on a clean install with no API keys, you get retrieval-grounded chat, but not the acting assistant. Getting the full loop is a deliberate step, not a default. The rest of this post is mostly about what that step costs.

What it actually takes to run the acting assistant

The tools it gains

Turning the loop on gives the model a small, permissioned toolset over your own data. search_files semantically searches your indexed documents. read_file opens a specific match, bounded to your configured ingest roots so it cannot wander off into the rest of your disk. query_entity resolves a name across your files and past sessions. Through the MCP surface the same capabilities are reachable as memory.search, entity.lookup, context.build, and a few more, so an external agent can use them too without you pasting context into its window.

Every one of those calls passes the Ask/Do permission check. That is the whole "acts" surface in the open build: read and retrieve by default, with anything that writes gated behind explicit Do-mode capabilities you add deliberately. The assistant cannot reach further than you let it, which is the point, but it also means the acting is only as broad as what you wire up.
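
The two guards described above, a read bounded to the ingest roots and a write gated behind Do mode, can be sketched in a few lines. This is an illustration of the idea, not Lumogis's actual implementation: `Mode`, `READ_TOOLS`, `is_within_roots`, `is_allowed`, and the `write_note` tool name are all hypothetical.

```python
from enum import Enum
from pathlib import Path


class Mode(Enum):
    ASK = "ask"  # read and retrieve only
    DO = "do"    # may also run write-capable tools that were explicitly enabled


# Hypothetical classification of the read-only tools named in the post.
READ_TOOLS = {"search_files", "read_file", "query_entity"}


def is_within_roots(candidate: str, ingest_roots: list[str]) -> bool:
    """Return True if candidate resolves to a path inside one of the roots.

    Resolving first collapses ".." segments and symlinks, so a path that
    merely starts with a root string cannot escape it.
    """
    resolved = Path(candidate).resolve()
    return any(resolved.is_relative_to(Path(root).resolve()) for root in ingest_roots)


def is_allowed(tool: str, mode: Mode, enabled_do_tools: set[str]) -> bool:
    """Read tools always pass; anything else needs Do mode and an explicit opt-in."""
    if tool in READ_TOOLS:
        return True
    return mode is Mode.DO and tool in enabled_do_tools
```

With these definitions, `is_within_roots("/data/notes/../../etc/passwd", ["/data/notes"])` is false, and a hypothetical `write_note` tool passes only when the mode is `Mode.DO` and `write_note` is in the enabled set. Denying by default and listing what is allowed is the design choice that matters; the names are incidental.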

The model you need

Two paths get you a tool-using assistant.

The first is a cloud key. Lumogis ships a Claude entry in its model registry that activates the moment you add an API key, and a frontier hosted model is both the most reliable tool-caller and the lowest hardware requirement, since the heavy compute is someone else's. The trade is a per-token bill and the fact that the composed turn leaves your network. That is the privacy cost from post #1, bounded to the excerpts Core selected rather than your whole archive, but it is a real boundary crossing and you should make it with eyes open.

The second is a tools-capable local model. You pull one through Ollama and enable tools for it in the registry. As of mid-2026 the consensus pick for local tool calling is the Qwen line. For a simple single-tool loop, a 7B-to-8B Qwen at 4-bit quantisation runs on about 8 GB of VRAM and works reliably. For multi-step chains, where the model calls a tool, reads the result, and decides what to do next, step up to a 14B-to-27B Qwen or a 24B model such as Mistral Small or Devstral Small, which want roughly 12 to 24 GB. The honest limit: small models stay coherent for one or two tool calls and start to wander on longer chains, so size the model to how much chaining you actually want.
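
For a rough sense of where those VRAM figures come from, here is a back-of-the-envelope sketch. The weight size is plain arithmetic (parameters times bits per weight, divided by eight). The flat `overhead_gb` allowance for the KV cache and runtime buffers is an assumption, and the real figure grows with context length and varies by runtime.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.0,
                     overhead_gb: float = 2.0) -> float:
    """Rough memory estimate: quantised weights plus a flat allowance.

    overhead_gb is an assumed allowance, not a measured value.
    """
    # 1e9 parameters * (bits / 8) bytes each is (params_billion * bits / 8) GB.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


for size in (8, 14, 24, 27):
    print(f"{size}B at 4-bit: ~{estimate_vram_gb(size):.1f} GB")
```

Treat the results as a floor rather than a target. Multi-step tool loops carry longer contexts, which is why the practical recommendation above leaves headroom beyond the weights alone.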

The reality check across both paths: a local model is private and has no per-token cost, but it is not frontier-grade reasoning. For your-data retrieval it is excellent. For hard open-ended problems the hosted models still win. Plenty of people run both, a local model for the private bulk and a cloud key for the occasional hard question, and switch per query.

The hardware stack

Three rough tiers, depending on which model path you took.

A small always-on box is enough if you lean on a cloud model for the tool loop, or run only the smallest local models. A mini PC or efficient small-form-factor machine with 16 GB or so of memory runs the Lumogis stack, the embedder, and a small Qwen.

A single 24 GB-class GPU is the tier where a capable local model becomes a genuinely good acting assistant. A used 24 GB GPU has been the long-standing value pick for local inference. This is the tier that buys you reliable multi-step tool loops without a cloud key.

Apple Silicon is the efficiency outlier. Its unified memory lets you run larger models than a single consumer GPU of the same price, slower per token but with more capacity, and it idles at a fraction of the power. For a machine that runs 24/7, that efficiency matters more than peak speed.

If you already own a capable machine, the marginal hardware cost is zero, which is the cheapest path by a wide margin. If you are buying, prices move constantly, so check current ones, but the order of magnitude is a hobbyist purchase, not a server budget.

The running cost, with the maths shown

A 24/7 assistant is idle most of the day with short bursts during queries, so the dominant cost is idle power, not peak draw. The maths is plain arithmetic: average watts times 24 times 365, divided by 1000, gives kilowatt-hours per year. Multiply by your electricity rate.

A small efficient box averages somewhere around 25 W across a real day. That is roughly 220 kWh a year. At a German household rate near 33 cents per kWh that is about 70 euros a year; at a US rate near 15 cents, about 33 dollars.

A box with a 24 GB GPU idles higher and spikes hard while generating, but because it sits idle most of the day the daily average tends to land around 80 to 120 W depending on how aggressively the GPU parks at idle and how often you query. Call it 100 W. That is roughly 880 kWh a year, about 290 euros at the German rate or 130 dollars at the US rate.

Apple Silicon running a mid-sized model through unified memory can hold a 24/7 average closer to the small-box tier, in the 15-to-25 W range, despite running a bigger model. That is why it is the quiet favourite for always-on setups.

These are rule-of-thumb figures from typical device draw and current electricity rates, not measurements of Lumogis. Put a cheap plug-in power meter on whatever you actually run and you will know your real number within a day. The point is the scale: per year, somewhere between the cost of a streaming subscription and a phone bill, not a data-centre invoice, and entirely yours to control.
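
The arithmetic in this section is short enough to run yourself with your own numbers. A minimal sketch, using the post's example wattages and rates as inputs (the function names are just illustrative):

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(avg_watts: float) -> float:
    """Average watts over a year, converted to kilowatt-hours."""
    return avg_watts * HOURS_PER_YEAR / 1000


def annual_cost(avg_watts: float, rate_per_kwh: float) -> float:
    """Yearly cost in whatever currency the rate is expressed in."""
    return annual_kwh(avg_watts) * rate_per_kwh


for label, watts in (("small box", 25), ("24 GB GPU box", 100)):
    print(f"{label}: {annual_kwh(watts):.0f} kWh/yr, "
          f"{annual_cost(watts, 0.33):.0f} EUR at 0.33/kWh, "
          f"{annual_cost(watts, 0.15):.0f} USD at 0.15/kWh")
```

The unrounded results are 219 kWh and 876 kWh, which the figures above round to 220 and 880. Swap in the reading from your own power meter and your own tariff to get a number that is actually yours.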

The knowledge cost

The least visible cost is what you need to know, and it is the one people underestimate. You do not need to be a developer. You do need to be comfortable at a confident homelab level: running docker compose up and reading the logs it prints, editing a .env file, pulling a model through Ollama, and editing one YAML file to register it. Putting the assistant on your phone while you are out means understanding Tailscale or a tunnel, because exposing it raw to the internet is the one thing the docs explicitly tell you not to do. When something runs out of memory, you need to know the fix is a smaller model or fewer features rather than a support ticket.

None of this is exotic, but it is real, and it is measured in an evening or two of learning rather than minutes. If editing a YAML file and reading a log is your idea of a bad evening, that is a genuine cost, and an honest one to weigh before you start.

You are the operator now

Past setup, the ongoing operational load is the largest real cost and the least exciting one. The good news first: database migrations run automatically every time the orchestrator boots, so schema upgrades are the one chore taken off your plate. Everything else is yours.

Backups today are on demand rather than scheduled. There is a backup endpoint and a per-user export, so your data is recoverable, but the convenience layer is still being built. There is not yet a one-click scheduled backup running quietly in the background. The quick backup stores your text and metadata and re-embeds on restore rather than copying the vectors, which keeps it small but means a restore is not instant. Restore is currently a short operator procedure rather than a single button. Making this smoother is active work, so expect it to improve, but as it stands today you should either run the on-demand backup as a habit or wire up your own schedule around it. I am flagging this plainly because the worst place to over-promise is the feature people only test the day they need it.
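
Wiring up your own schedule can be as small as the sketch below. Everything specific in it is an assumption: the `LUMOGIS_BACKUP_URL` environment variable is hypothetical, the endpoint path and any authentication must come from the Lumogis documentation, and the use of POST is a guess rather than a documented fact. A cron job or systemd timer hitting the same endpoint would do the job equally well.

```python
import os
import time
import urllib.request
from datetime import datetime, timedelta
from typing import Optional

# Placeholder: set this to the real backup endpoint from the docs.
BACKUP_URL = os.environ.get("LUMOGIS_BACKUP_URL", "")  # hypothetical variable
INTERVAL = timedelta(hours=24)


def backup_due(last_run: Optional[datetime], now: datetime,
               interval: timedelta = INTERVAL) -> bool:
    """A backup is due if none has run yet or the interval has elapsed."""
    return last_run is None or now - last_run >= interval


def trigger_backup(url: str) -> int:
    """Call the backup endpoint and return the HTTP status (POST is assumed)."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.status


def run_forever(url: str = BACKUP_URL, poll_seconds: int = 600) -> None:
    """Check every poll_seconds and trigger a backup whenever one is due."""
    last_run: Optional[datetime] = None
    while True:
        now = datetime.now()
        if backup_due(last_run, now):
            try:
                trigger_backup(url)
                last_run = now
            except OSError as exc:  # network and HTTP errors; retry next poll
                print(f"backup failed, will retry: {exc}")
        time.sleep(poll_seconds)
```

Call `run_forever()` from a small always-on process to start the loop. Whatever scheduler you use, test a restore once before you rely on it, since a schedule only proves that backups are being written.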

Version upgrades are the ordinary pull-and-restart you know from any Compose stack. And putting the assistant anywhere beyond your own network is on you: it is LAN-only by default, the docs steer you towards Tailscale or a Cloudflare Tunnel, and they warn against forwarding a port straight to the open internet. Health checks and logs are there, and you will use them, because something will break at an inconvenient hour and you will be the one reading the output.

None of this is a flaw to apologise for. It is the actual shape of owning your infrastructure. But it is hours, and you should budget them before you start, not discover them at 11pm.

The cost that is falling

One line item on this bill shrinks while you hold it, and it happens to be the weakest part of the bargain today: model quality. The open-model landscape in 2026 is moving fast in exactly the direction that helps a setup like this.

Google's Gemma 4, released in April under a fully permissive Apache 2.0 licence, put frontier-flavoured multimodal capability into sizes that fit mainstream machines, with the 31B dense model landing near the top of the open leaderboards. The 12B variant that followed in June runs on a 16 GB laptop and, by Google's own account, nearly matches the twice-as-large 26B. On the Qwen side, the newer coder models use mixture-of-experts designs that run on a single 24 GB GPU while activating only a slice of their full weights per token, and they now land within striking distance of hosted frontier models on practical coding and tool-use tasks.

The pattern underneath is the part that matters: architecture and training are improving faster than raw parameter count, so the capability you can run on a fixed amount of memory keeps climbing, and the licensing keeps getting more permissive. Tool calling in particular, the exact thing the acting assistant depends on, now works on local models at close to cloud quality for straightforward calls. It is not frontier parity on the hardest reasoning, and the rankings churn monthly, so treat any specific model as a this-month recommendation. But the direction is not ambiguous.

What that means for your bill: the hardware you buy this year runs a better local assistant next year for free, just by pulling new weights. None of the other costs behave like that. Electricity does not drop because you waited. Your operator hours do not shrink on their own. Model quality does. You are buying into the one cost on the list with a downward slope.

Who this is for, and who it is not

Not for you if you want the smartest possible answer with no setup, if you will not run containers, or if your archive is disposable enough that a vendor holding it is fine. For those needs a hosted chatbot is the correct tool and you should use it without guilt.

For you if you already self-host, if you want the assistant's reach into your files and credentials to stay on hardware you control, and if you will trade some model IQ, a few evenings of operations, and a modest power bill for a setup that does not change its terms on you next quarter. The compounding part is the payoff: every document you add makes the next question cheaper to answer, and none of that value is rented.

If naming these costs talks you out of it, the post did its job. If it did not, you now know the full price going in, which is the only honest way to start.

What the price buys

So weigh it honestly. On one side of the ledger: real money for hardware and a modest power bill, an evening or two of learning, ongoing operator hours, and a local model that today still trails the hosted frontier on the hardest questions. Those are not trivial, and this whole post exists so you meet them on purpose rather than by surprise.

On the other side sit three things money cannot rent. Privacy stops being a clause in a terms-of-service document and becomes a physical property of where your bytes live, which is hardware you can point at. Control over which model runs, what it is allowed to touch, when it updates, and whether anything leaves the box stays with you rather than a vendor's roadmap. And durability, because a stack you host does not get sunset, rate-limited, quietly retrained into something you like less, or repriced overnight; the version that works today keeps working for as long as you keep running it. Underneath all three is the compounding payoff: every document you add makes the next answer cheaper, and none of that accumulated value is rented.

That is the real trade. Convenience and raw model power, which anyone can rent from a browser tab, set against privacy, control, and ownership, which no one will rent you and which you can only hold by running the thing yourself. For a throwaway question the rented version wins easily, and you should use it without guilt. For the archive of your life, the place your contracts and your insurance and your years of notes actually live, owning the machinery that reads it is worth a power bill and a few evenings. And the costliest part of the trade, the model gap, is the one closing on its own while the value of what you keep only compounds.

Post #1 was your data stays home. Post #2 was your assistant works there too. Post #3 is the bill, itemised, and the case that it is a fair one to pay. Clone the repo, run Compose, and measure the rest on your own hardware.


Lumogis is AGPL-3.0-only. Documentation: README, quickstart, remote access.
