Lumogis Thoughts

What local-first actually costs

Thomas Kohlborn — Mon, 08 Jun 2026 13:12:45 GMT

Post #3 in the Lumogis launch series. Post #1 made the case for keeping your data home. Post #2 showed the assistant working there. This one is the bill. If the first two talked you into it, you should know exactly what you are signing up for before you run docker compose up.

Most posts about self-hosted AI sell the upside and go quiet on the rest. That is a tell. A project that believes in the trade will name the losses first, because the losses are real and you will hit them in the first hour. Here they are, in full.

The bargain

Running the assistant on your own hardware means giving up three comforts you have stopped noticing. You give up the frontier-grade model, because what fits on your machine is smaller than what sits in a browser tab. You give up zero setup, because nobody has provisioned anything for you. And you give up someone else's operations team, because that team is now you.

In return you get locality and a corpus that compounds, the things the first two posts were about. Whether that is a good deal depends entirely on who you are, which is the last section. First, the costs in detail.

The honest catch: a clean install does not act

Lumogis is a self-hosted personal AI that runs on your own hardware as a set of Docker containers: an orchestrator, a vector store, a database, and a local model runner, all sitting next to your files rather than across a vendor's boundary. A fresh install of Lumogis pulls two local models automatically on first boot: nomic-embed-text for embeddings and llama3.2:3b for chat. A 3B model is genuinely useful for grounded, retrieval-backed questions about your own files. It is not a frontier model, and pretending its open-ended reasoning is in the same class as the hosted giants would be dishonest. For "what did this document say," it holds up. For "think hard about this gnarly problem," you will feel the gap.

There is a sharper catch, and it is the one most likely to trip you up after reading post #2. That post described an assistant that searches, opens files, and acts through a tool loop. The default local model is registered without tool-calling. So on a clean install with no API keys, you get retrieval-grounded chat, but not the acting assistant. Getting the full loop is a deliberate step, not a default. The rest of this post is mostly about what that step costs.

What it actually takes to run the acting assistant

The tools it gains

Turning the loop on gives the model a small, permissioned toolset over your own data. search_files semantically searches your indexed documents. read_file opens a specific match, bounded to your configured ingest roots so it cannot wander off into the rest of your disk. query_entity resolves a name across your files and past sessions. Through the MCP surface the same capabilities are reachable as memory.search, entity.lookup, context.build, and a few more, so an external agent can use them too without you pasting context into its window.

Every one of those calls passes the Ask/Do permission check. That is the whole "acts" surface in the open build: read and retrieve by default, with anything that writes gated behind explicit Do-mode capabilities you add deliberately. The assistant cannot reach further than you let it, which is the point, but it also means the acting is only as broad as what you wire up.
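The read_file boundary is the easiest of these guarantees to picture in code. What follows is a minimal sketch of a path-containment check, not the Lumogis implementation: the function name and the example roots are made up for illustration, and the real check may differ in detail.

```python
from pathlib import Path


def is_under_ingest_root(candidate: str, ingest_roots: list[str]) -> bool:
    """Return True only if candidate resolves to a path inside one of the roots.

    Hypothetical helper for illustration; not taken from the Lumogis source.
    """
    # resolve() collapses ".." segments and follows symlinks, so a request like
    # "/docs/../secrets" is judged by where it actually points.
    resolved = Path(candidate).resolve()
    for root in ingest_roots:
        root_resolved = Path(root).resolve()
        # is_relative_to (Python 3.9+) compares whole path components, so
        # "/docs-private" does not count as being inside "/docs".
        if resolved.is_relative_to(root_resolved):
            return True
    return False


roots = ["/srv/lumogis-example/docs"]  # assumed example root
print(is_under_ingest_root("/srv/lumogis-example/docs/insurance/policy.pdf", roots))
print(is_under_ingest_root("/srv/lumogis-example/docs/../secrets/key.txt", roots))
```

Resolving before comparing is the design choice that matters: a plain string prefix test would accept both the ".." trick and a sibling folder that merely shares a prefix.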

The model you need

Two paths get you a tool-using assistant.

The first is a cloud key. Lumogis ships a Claude entry in its model registry that activates the moment you add an API key, and a frontier hosted model is both the most reliable tool-caller and the lowest hardware requirement, since the heavy compute is someone else's. The trade is a per-token bill and the fact that the composed turn leaves your network. That is the privacy cost from post #1, bounded to the excerpts Core selected rather than your whole archive, but it is a real boundary crossing and you should make it with eyes open.

The second is a tools-capable local model. You pull one through Ollama and enable tools for it in the registry. As of mid-2026 the consensus pick for local tool calling is the Qwen line. For a simple single-tool loop, a 7B-to-8B Qwen at 4-bit quantisation runs on about 8 GB of VRAM and works reliably. For multi-step chains, where the model calls a tool, reads the result, and decides what to do next, step up to a 14B-to-27B Qwen, or to a 24B Mistral Small or Devstral Small, which wants roughly 12 to 24 GB. The honest limit: small models stay coherent for one or two tool calls and start to wander on longer chains, so size the model to how much chaining you actually want.
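If you want to sanity-check those memory figures before you buy or pull anything, the back-of-envelope sum is short. This is a rough sketch, not a measurement: the 2 GB overhead for context cache and runtime buffers is an assumption, and real 4-bit quantisations often average slightly more than 4 bits per weight.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead_gb: float = 2.0) -> float:
    """Rule-of-thumb VRAM need: quantised weights plus a flat overhead.

    overhead_gb is an assumed allowance for KV cache and runtime buffers;
    it grows with context length, so treat the result as a floor.
    """
    # One billion parameters at N bits each is N/8 gigabytes of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


for size in (8, 14, 24, 27):
    print(f"{size}B at 4-bit: about {estimate_vram_gb(size)} GB")
```

The estimate for an 8B model lands under 8 GB and the 24B-to-27B range lands in the mid-teens, which is why a 24 GB card leaves comfortable room for longer contexts.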

The reality check across both paths: a local model is private and has no per-token cost, but it is not frontier-grade reasoning. For your-data retrieval it is excellent. For hard open-ended problems the hosted models still win. Plenty of people run both, a local model for the private bulk and a cloud key for the occasional hard question, and switch per query.

The hardware stack

Three rough tiers, depending on which model path you took.

A small always-on box is enough if you lean on a cloud model for the tool loop, or run only the smallest local models. A mini PC or efficient small-form-factor machine with 16 GB or so of memory runs the Lumogis stack, the embedder, and a small Qwen.

A single 24 GB-class GPU is the tier where a capable local model becomes a genuinely good acting assistant. A used 24 GB GPU has been the long-standing value pick for local inference. This is the tier that buys you reliable multi-step tool loops without a cloud key.

Apple Silicon is the efficiency outlier. Its unified memory lets you run larger models than a single consumer GPU of the same price, slower per token but with more capacity, and it idles at a fraction of the power. For a machine that runs 24/7, that efficiency matters more than peak speed.

If you already own a capable machine, the marginal hardware cost is zero, which is the cheapest path by a wide margin. If you are buying, prices move constantly, so check current ones, but the order of magnitude is a hobbyist purchase, not a server budget.

The running cost, with the maths shown

A 24/7 assistant is idle most of the day with short bursts during queries, so the dominant cost is idle power, not peak draw. The maths is plain arithmetic: average watts times 24 times 365, divided by 1000, gives kilowatt-hours per year. Multiply by your electricity rate.

A small efficient box averages somewhere around 25 W across a real day. That is roughly 220 kWh a year. At a German household rate near 33 cents per kWh that is about 70 euros a year; at a US rate near 15 cents, about 33 dollars.

A box with a 24 GB GPU idles higher and spikes hard while generating, but because it sits idle most of the day the daily average tends to land around 80 to 120 W depending on how aggressively the GPU parks at idle and how often you query. Call it 100 W. That is roughly 880 kWh a year, about 290 euros at the German rate or 130 dollars at the US rate.

Apple Silicon running a mid-sized model through unified memory can hold a 24/7 average closer to the small-box tier, in the 15-to-25 W range, despite running a bigger model. That is why it is the quiet favourite for always-on setups.

These are rule-of-thumb figures from typical device draw and current electricity rates, not measurements of Lumogis. Clip a cheap plug-in power meter onto whatever you actually run and you will know your real number within a day. The point is the scale: somewhere between a streaming subscription and a phone bill per year, not a data-centre invoice, and entirely yours to control.
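The arithmetic above fits in a few lines if you want to plug in your own meter reading and tariff. The wattages and rates below are the same rule-of-thumb inputs used in this section, not measurements.

```python
def annual_kwh(avg_watts: float) -> float:
    """Average watts times 24 hours times 365 days, divided by 1000."""
    return avg_watts * 24 * 365 / 1000


def annual_cost(avg_watts: float, rate_per_kwh: float) -> float:
    """Yearly cost in whatever currency the rate is quoted in."""
    return annual_kwh(avg_watts) * rate_per_kwh


for label, watts in [("small box", 25), ("24 GB GPU box", 100)]:
    kwh = annual_kwh(watts)
    eur = annual_cost(watts, 0.33)  # German household rate used above
    usd = annual_cost(watts, 0.15)  # US rate used above
    print(f"{label}: {kwh:.0f} kWh/year, about {eur:.0f} EUR or {usd:.0f} USD")
# small box: 219 kWh/year, about 72 EUR or 33 USD
# 24 GB GPU box: 876 kWh/year, about 289 EUR or 131 USD
```

Swap in the average from your own plug-in meter and your own rate, and the output is your real number.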

The knowledge cost

The least visible cost is what you need to know, and it is the one people underestimate. You do not need to be a developer. You do need to be comfortable at a confident homelab level: running docker compose up and reading the logs it prints, editing a .env file, and editing one YAML file to register a model and pull it through Ollama. Putting the assistant on your phone while you are out means understanding Tailscale or a tunnel, because exposing it raw to the internet is the one thing the docs explicitly tell you not to do. When something runs out of memory, you need to know the fix is a smaller model or fewer features rather than a support ticket.

None of this is exotic, but it is real, and it is measured in an evening or two of learning rather than minutes. If editing a YAML file and reading a log is your idea of a bad evening, that is a genuine cost, and an honest one to weigh before you start.

You are the operator now

Past setup, the ongoing operational load is the largest real cost and the least exciting one. The good news first: database migrations run automatically every time the orchestrator boots, so schema upgrades are the one chore taken off your plate. Everything else is yours.

Backups today are on demand rather than scheduled. There is a backup endpoint and a per-user export, so your data is recoverable, but the convenience layer is still being built. There is not yet a one-click scheduled backup running quietly in the background. The quick backup stores your text and metadata and re-embeds on restore rather than copying the vectors, which keeps it small but means a restore is not instant. Restore is currently a short operator procedure rather than a single button. Making this smoother is active work, so expect it to improve, but as it stands today you should either run the on-demand backup as a habit or wire up your own schedule around it. I am flagging this plainly because the worst place to over-promise is the feature people only test the day they need it.
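Wiring up your own schedule can be as small as a script that cron or a systemd timer runs every hour. The sketch below shows only the "is a backup due" logic. The trigger argument is a placeholder for whatever call hits the backup endpoint in your install; this post does not give its path or auth, so none is assumed here, and both function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional


def backup_due(last_backup: Optional[datetime], now: datetime,
               interval: timedelta = timedelta(days=1)) -> bool:
    """A backup is due if none has run yet or the last one is older than interval."""
    return last_backup is None or now - last_backup >= interval


def run_if_due(last_backup: Optional[datetime], now: datetime,
               trigger: Callable[[], None],
               interval: timedelta = timedelta(days=1)) -> Optional[datetime]:
    """Call trigger() when a backup is due and return the updated timestamp.

    trigger is supplied by you (hypothetical here): for example a function
    that sends the on-demand backup request to your own Lumogis instance.
    """
    if backup_due(last_backup, now, interval):
        trigger()
        return now
    return last_backup


# Example with a stand-in trigger that only records that it was called.
calls = []
stamp = run_if_due(None, datetime(2026, 6, 8, 3, 0), lambda: calls.append("backup"))
stamp = run_if_due(stamp, datetime(2026, 6, 8, 5, 0), lambda: calls.append("backup"))
print(len(calls), stamp)
```

Persist the returned timestamp somewhere simple, such as a small file next to the script, so the schedule survives reboots.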

Version upgrades are the ordinary pull-and-restart you know from any Compose stack. And putting the assistant anywhere beyond your own network is on you: it is LAN-only by default, the docs steer you towards Tailscale or a Cloudflare Tunnel, and they warn against forwarding a port straight to the open internet. Health checks and logs are there, and you will use them, because something will break at an inconvenient hour and you will be the one reading the output.

None of this is a flaw to apologise for. It is the actual shape of owning your infrastructure. But it is hours, and you should budget them before you start, not discover them at 11pm.

The cost that is falling

One line item on this bill shrinks while you hold it, and it happens to be the weakest part of the bargain today: model quality. The open-model landscape in 2026 is moving fast in exactly the direction that helps a setup like this.

Google's Gemma 4, released in April under a fully permissive Apache 2.0 licence, put frontier-flavoured multimodal capability into sizes that fit mainstream machines, with the 31B dense model landing near the top of the open leaderboards. The 12B variant that followed in June runs on a 16 GB laptop and, by Google's own account, nearly matches the twice-as-large 26B. On the Qwen side, the newer coder models use mixture-of-experts designs that run on a single 24 GB GPU while activating only a slice of their full weights per token, and they now land within striking distance of hosted frontier models on practical coding and tool-use tasks.

The pattern underneath is the part that matters: architecture and training are improving faster than raw parameter count, so the capability you can run on a fixed amount of memory keeps climbing, and the licensing keeps getting more permissive. Tool calling in particular, the exact thing the acting assistant depends on, now works on local models at close to cloud quality for straightforward calls. It is not frontier parity on the hardest reasoning, and the rankings churn monthly, so treat any specific model as a this-month recommendation. But the direction is not ambiguous.

What that means for your bill: the hardware you buy this year runs a better local assistant next year for free, just by pulling new weights. None of the other costs behave like that. Electricity does not drop because you waited. Your operator hours do not shrink on their own. Model quality does. You are buying into the one cost on the list with a downward slope.

Who this is for, and who it is not

Not for you if you want the smartest possible answer with no setup, if you will not run containers, or if your archive is disposable enough that a vendor holding it is fine. For those needs a hosted chatbot is the correct tool and you should use it without guilt.

For you if you already self-host, if you want the assistant's reach into your files and credentials to stay on hardware you control, and if you will trade some model IQ, a few evenings of operations, and a modest power bill for a setup that does not change its terms on you next quarter. The compounding part is the payoff: every document you add makes the next question cheaper to answer, and none of that value is rented.

If naming these costs talks you out of it, the post did its job. If it did not, you now know the full price going in, which is the only honest way to start.

What the price buys

So weigh it honestly. On one side of the ledger: real money for hardware and a modest power bill, an evening or two of learning, ongoing operator hours, and a local model that today still trails the hosted frontier on the hardest questions. Those are not trivial, and this whole post exists so you meet them on purpose rather than by surprise.

On the other side sit three things money cannot rent. Privacy stops being a clause in a terms-of-service document and becomes a physical property of where your bytes live, which is hardware you can point at. Control over which model runs, what it is allowed to touch, when it updates, and whether anything leaves the box stays with you rather than a vendor's roadmap. And durability, because a stack you host does not get sunset, rate-limited, quietly retrained into something you like less, or repriced overnight; the version that works today keeps working for as long as you keep running it. Underneath all three is the compounding payoff: every document you add makes the next answer cheaper, and none of that accumulated value is rented.

That is the real trade. Convenience and raw model power, which anyone can rent from a browser tab, set against privacy, control, and ownership, which no one will rent you and which you can only hold by running the thing yourself. For a throwaway question the rented version wins easily, and you should use it without guilt. For the archive of your life, the place your contracts and your insurance and your years of notes actually live, owning the machinery that reads it is worth a power bill and a few evenings. And the costliest part of the trade, the model gap, is the one closing on its own while the value of what you keep only compounds.

Post #1 was your data stays home. Post #2 was your assistant works there too. Post #3 is the bill, itemised, and the case that it is a fair one to pay. Clone the repo, run Compose, and measure the rest on your own hardware.

Lumogis is AGPL-3.0-only. Documentation: README, quickstart, remote access.

A chatbot reads. An assistant acts.

Thomas Kohlborn — Sat, 06 Jun 2026 19:23:40 GMT

Post #2 in the Lumogis launch series. Post #1 argued that the AI should come to your data, not the other way around. The fair objection is that the model in your browser tab is smarter than anything you can run at home. True, and beside the point. This post is about what "smarter" leaves out.

You have done this. You open a chat product, you have a question about your own life, and the first thing you do is go find the file. You export the PDF, you paste the paragraph, you upload the spreadsheet. The model is brilliant and the model knows nothing, so you spend the first two minutes of every conversation being its research assistant.

That is the tell. A chatbot reads what you hand it. It has no standing relationship with your documents, your past conversations, or the names that recur across both. Each session starts cold, and you warm it up by hand.

An assistant is the other shape. It already has your archive indexed. You ask a question and it goes and finds the answer, opens the file that matches, and replies from what was actually written down. The difference is not the size of the weights. It is where the assistant runs and what it is allowed to touch.

The ceiling is not intelligence

Drop the smartest available model into a sandbox and it is still a stranger to your data. The session is stateless. Your mortgage, your insurance renewal, the email thread from last spring: none of it exists in that window until you carry it in.

Even "upload a file" does not change the relationship, it just relocates it. The bytes go to the vendor's storage. They index them on their side. Your corpus becomes a guest in their environment, retrieved by their retriever, reached by their tools. For a throwaway question that is a reasonable bargain. For a living archive, one where documents change, conversations accumulate, and credentials sit next to calendars and notifications, it is the wrong place to put any of it.

The fix is not a better model. It is moving the machinery that reads and acts to the same side of the wall as the data.

Where Lumogis puts the machinery

Lumogis runs as a control plane on your own hardware. Core is a FastAPI orchestrator. Postgres keeps metadata, permissions, and entity records. Qdrant keeps the embeddings for your documents and your past conversations. Ollama does the embedding by default and can run the model locally too if you want nothing leaving the box at all.

You point ingest at folders you already use. Core watches them, pulls out the text, chunks it, embeds it, and writes the vectors locally. The index of your life lives on your disk.

Then the chat loop stops being a recitation from training data. When the model supports tools, Core runs a bounded tool-calling loop: the model asks for a tool, Core runs it against your stores and your filesystem, hands back the result, and the model continues until it can answer. A permission check sits in front of every one of those calls. Read access is the default Ask mode. Anything that writes is Do mode, turned on deliberately and per user. The assistant cannot quietly reach further than you let it.
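The shape of that loop is easier to see in code than in a paragraph. This is a toy sketch under stated assumptions, not Core's implementation: the action dictionaries, the scripted stand-in for the model, and the write tool named hypothetical_write_tool are all invented for illustration. Only the ideas come from the description above: a step limit, and a mode check in front of every call.

```python
from typing import Any, Callable

Action = dict[str, Any]  # {"type": "tool", "tool": ..., "args": {...}} or {"type": "answer", "text": ...}


def run_tool_loop(model_step: Callable[[list], Action],
                  tools: dict[str, tuple[str, Callable[..., str]]],
                  allowed_modes: set[str],
                  max_steps: int = 5) -> str:
    """Run a bounded tool loop with a permission check before each call."""
    transcript: list[tuple[str, str]] = []
    for _ in range(max_steps):  # the bound: never more than max_steps turns
        action = model_step(transcript)
        if action["type"] == "answer":
            return action["text"]
        name = action["tool"]
        if name not in tools:
            result = f"error: unknown tool {name}"
        else:
            mode, fn = tools[name]
            if mode not in allowed_modes:
                # The gate: a "do" tool is refused unless Do mode was granted.
                result = f"denied: {name} needs {mode} mode"
            else:
                result = fn(**action.get("args", {}))
        transcript.append((name, result))
    return "stopped: step limit reached"


def scripted_model(transcript: list) -> Action:
    """Stand-in for a real model: search once, then answer from the result."""
    if not transcript:
        return {"type": "tool", "tool": "search_files",
                "args": {"query": "insurance renewal"}}
    return {"type": "answer", "text": f"Based on: {transcript[-1][1]}"}


tools = {
    "search_files": ("ask", lambda query: f"policy.pdf matches '{query}'"),
    "hypothetical_write_tool": ("do", lambda path: f"wrote {path}"),
}
print(run_tool_loop(scripted_model, tools, allowed_modes={"ask"}))
```

Handing a refusal back to the model as an ordinary tool result, rather than raising, lets it explain the limit to the user instead of failing silently.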

Watching it work

Here is the whole path, using nothing but the open build.

Drop a PDF into the inbox folder. The watcher notices it, waits for the write to finish, and feeds it to the pipeline. Core extracts the text with pdfminer, falling back to OCR on scanned pages when that is switched on. If the file has not changed since last time, a content hash skips it, so re-ingesting a folder is cheap rather than wasteful. The text gets chunked into roughly 512-token pieces along sentence boundaries with a little overlap, each piece is embedded through Ollama, and the vectors land in Qdrant tagged with your user and the file's path. Names found in the text are pulled out and recorded in Postgres and a separate entities collection. No part of that touched a cloud.
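Two steps in that pipeline are worth a sketch: the content-hash skip and the sentence-boundary chunking. The code below is a simplified illustration, not Core's pipeline. SHA-256, the in-memory seen dictionary, whitespace word counts as a stand-in for tokens, and the function names are all assumptions made here.

```python
import hashlib
import re


def needs_ingest(path: str, data: bytes, seen: dict[str, str]) -> bool:
    """Return False when the file's content hash matches the one recorded last time."""
    digest = hashlib.sha256(data).hexdigest()  # assumed hash; the post does not name one
    if seen.get(path) == digest:
        return False
    seen[path] = digest
    return True


def chunk_sentences(text: str, max_tokens: int = 512,
                    overlap_sentences: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly max_tokens, with a small overlap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())  # crude token count: whitespace-separated words
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) forward so neighbouring chunks overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
            count = sum(len(s.split()) for s in current)
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


seen: dict[str, str] = {}
print(needs_ingest("inbox/policy.pdf", b"same bytes", seen))  # True, first sighting
print(needs_ingest("inbox/policy.pdf", b"same bytes", seen))  # False, unchanged
print(chunk_sentences("One two three. Four five six. Seven eight nine.", max_tokens=6))
```

The overlap is what stops a fact that straddles a chunk boundary from being split across two vectors that each miss half of it.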

Now ask, in Lumogis Web: when does my home insurance renew?

Before the model even starts, Core can fold in your recent session memory and, if you have auto-RAG on, a few relevant chunks fetched under the same access rules as search. That is a convenience, not a crutch, and the loop runs fine without it.

The model calls search_files. Core embeds your question, searches Qdrant with dense vectors and BM25 fused together, optionally reranks the candidates with a cross-encoder, and returns the strongest chunks with their paths and scores, filtered to what you are allowed to see: your own scope, plus whatever household material you have chosen to share. The model picks the most promising hit and calls read_file on it. Core reads up to a few thousand characters from disk, but only if that path lives under an ingest root you configured. Ask for anything outside those roots and it is refused. If you named the insurer, the model can call query_entity, which looks the name up in Postgres, falls back to a semantic match in Qdrant, and reports everywhere that name has surfaced across your files and conversations.
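"Fused together" can mean several things. One common way to merge a dense ranking with a BM25 ranking is reciprocal rank fusion, sketched below. This post does not say which fusion method Core uses, so treat this as an illustration of the general idea; the chunk identifiers and the constant k = 60 are conventional example values, not Lumogis settings.

```python
def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up result lists for the insurance question.
dense = ["policy.pdf#3", "letter.pdf#1", "notes.md#7"]
bm25 = ["policy.pdf#3", "invoice.pdf#2", "letter.pdf#1"]

for doc_id, score in reciprocal_rank_fusion([dense, bm25]):
    print(f"{doc_id}: {score:.4f}")
```

A chunk that both retrievers rank highly rises to the top, while a chunk only one of them found still survives further down, which is the reason to combine semantic and keyword search in the first place.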

Search, open, read, answer. The model supplied the reasoning. Your machine supplied the facts.

The same archive, through other agents

The loop above is not the only way in. Core publishes a curated surface over MCP at /mcp on the same port, so any agent that speaks it can reach your memory and search without you pasting context into its window. Point a client like Claude Desktop at it and call context.build, which runs document search and session retrieval locally, merges the results, and returns a context string capped to a budget along with its sources. Or reach for memory.search, entity.lookup, and the rest of the tools Core declares in its manifest at GET /capabilities.
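The merge-and-cap step behind context.build can be pictured like this. It is a simplified sketch, not the tool's actual code: the tuple layout, the character-based budget, and the function name build_context are assumptions, since the post only says the context is capped to a budget and returned with its sources.

```python
Hit = tuple[str, float, str]  # (source, score, text): an assumed shape for a retrieved item


def build_context(doc_hits: list[Hit], session_hits: list[Hit],
                  budget_chars: int = 2000) -> tuple[str, list[str]]:
    """Merge document and session hits by score and cap the result to a budget."""
    merged = sorted(doc_hits + session_hits, key=lambda hit: hit[1], reverse=True)
    parts: list[str] = []
    sources: list[str] = []
    used = 0
    for source, _score, text in merged:
        separator = 1 if parts else 0  # the newline that will join this part
        remaining = budget_chars - used - separator
        if remaining <= 0:
            break
        snippet = text[:remaining]  # truncate the last item to fit the budget
        parts.append(snippet)
        used += len(snippet) + separator
        if source not in sources:
            sources.append(source)
    return "\n".join(parts), sources


docs = [("policy.pdf", 0.9, "Renewal date: 1 March. " * 3)]
sessions = [("session:earlier-chat", 0.5, "You asked about the insurer last month.")]
context, sources = build_context(docs, sessions, budget_chars=80)
print(len(context), sources)
```

Returning the sources alongside the string is what lets the calling agent cite where an answer came from without ever seeing the rest of the archive.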

The outside model still never receives your archive. It receives what Core retrieved and what the agent asked for, scoped to your user, and nothing more.

If you want to grow the surface, the unified catalog at GET /api/v1/me/tools reports everything Core can see: the built-in tools, the MCP ones, and any healthy capability services you have registered alongside it. With the catalog on, the chat loop folds those extra tools in for the length of a single request, with bearer trust and a permission check on each call. Capability containers can bring their own write actions. Stock Core ships the read path and the Ask/Do gate rather than a fixed menu of automations, so the household decides what its assistant is permitted to do.

Why none of this travels

Three kinds of locality hold the whole thing together.

Your corpus is local. The chunks and embeddings are computed on your hardware and Qdrant never syncs to a vendor's index. A cloud product would have to make you upload the same material all over again, on its terms, to match what you already have.

Your credentials are local. Connector secrets, model keys, notification endpoints, calendar logins, all sit encrypted in Postgres. MCP tokens are minted and revoked per user. When a tool reaches out, it carries credentials Core resolved on the box, not a share of some pooled SaaS OAuth grant.

Your execution is local. search_files, read_file, and query_entity run inside Core against your own paths and stores. The MCP tools call the very same services. The optional capability services are containers you deploy next to Core, not infrastructure hidden in someone else's account.

You can still send the final answer to a cloud model if you want the sharper weights. That is a clean, visible trade: only the excerpts Core assembled for that one turn cross your network, never the archive behind them. Or you keep the whole completion on the box with a local model. Either way the assistant's reach into your data stays at home.

The comparison that matters

Raw intelligence is the wrong scoreboard. A cloud chatbot with no tools is a brilliant generalist working from memory and whatever you remembered to paste. A modest local model with retrieval and a permissioned tool loop is a specialist wired into your files, your sessions, and your entity index, and it gets more useful every time you add to the pile, because the next question is already cheaper to answer.

That is the wager Lumogis makes. We are opening the AGPL build for people who would rather read the mechanism than the manifesto. Clone it, run Compose, drop a file in the inbox, and watch the orchestrator ingest it, search it, and work your question through the loop. The code paths in this post are the whole product.

Post #1 was your data stays home. Post #2 is your assistant works there too.

Lumogis is AGPL-3.0-only. Documentation: README, capabilities overview, architecture.

The AI comes to your data. Not the other way around.

Thomas Kohlborn — Thu, 14 May 2026 16:41:08 GMT

If a stranger walked up to me on the street and asked to read my notes, my work documents, the questions I've been turning over in my head for the past six months — I'd say no. Obviously.

But every day I open an AI assistant and do exactly that.

I use AI for everything. Thinking through decisions, drafting, researching, remembering things I'd otherwise lose. It's genuinely useful and I'm not giving it up. But there's always this low-level discomfort in the background. A feeling that I'm sharing things I probably shouldn't, with a company I don't fully understand, under terms I didn't really read.

I'm a person who thrives on convenience. And for a long time, convenience was winning.

The deal is always the same. You want a smarter assistant, so you upload your notes. You connect your calendar. You paste in the document you've been wrestling with for weeks. And somewhere in the background, all of it lands on a server you don't control, indexed by a company whose business model you're not entirely sure about.

Nobody forces you. You click agree. But the trade is always the same: intelligence in exchange for access.

I didn't want to make that trade anymore.

So I built Lumogis.

It's a self-hosted, local-first AI platform you run yourself, on your own hardware, in your own home, under your own terms. Your documents, notes, and conversations stay on your machine. The indexes that make retrieval possible stay on your machine too. When you ask a question, Lumogis finds the relevant context locally, assembles it, and sends only that composed prompt to an LLM or, if you prefer, keeps inference fully local with Ollama.

Nothing gets bulk-uploaded. Nothing gets handed to a SaaS indexer. You can read every line of how it works, because it's AGPL-3.0 and the source is all there.

I'm not anti-cloud. I still use cloud models. I like convenience as much as the next person. That's exactly how I ended up here.

But I think there's a meaningful difference between choosing to send something out and having no other option. Most AI tooling today only offers the second. The moment you want memory, retrieval, context across sessions, you have to trust someone else with the raw material of your thinking.

That felt wrong to me. Lumogis is the answer I built for myself, and I'm making it available to anyone who feels the same way.

It's early. There's a lot still to build. But the foundation is solid: Docker Compose, Qdrant for vectors, Postgres for metadata, a FastAPI orchestrator, and a clean web UI. It works today for household use, one person or a whole family on a LAN, each with their own identity and memory, with an audit log on every action that matters.

If you've ever felt that low-level discomfort, if you've ever clicked agree while knowing you probably shouldn't, I'd love for you to take a look.

⭐ Star Lumogis on GitHub and follow along. We're just getting started.