Most operators who ask us about AI have already been sold two ideas by someone with a quota. First, that they need it. Second, that it’s too complicated to understand, so they should just trust the person selling it. The first is sometimes true. The second is a way to keep you from asking the one question that decides everything: when your software does something intelligent, where does that thinking physically happen, and does the location change anything for your business?

That question has a real answer, and you can understand it without a computer science degree. This guide walks through it — the hardware, the money, and the law — and lands you in one of two buckets. The honest spoiler: almost everyone belongs in the same bucket, and it’s the cheaper, simpler one. The interesting part is learning to recognise the small number of businesses that don’t, because if you’re one of them, getting this wrong is expensive.

Let’s take the mystery out of it.

What “running a model” actually means

When an application reads a supplier’s invoice and pulls the line items off it, or answers a customer’s question at midnight, or flags the one purchase order that doesn’t look right, there’s a model doing that work. A model is a very large mathematical function — billions of numbers, called parameters, that encode patterns learned from training. To produce an answer, the computer has to load all those numbers into fast memory and run a flood of arithmetic across them. That act is called inference: the model inferring an answer from your input.

Inference is hungry in a specific way, and understanding the appetite explains the entire rest of this guide. The model’s parameters have to sit in memory the processor can reach instantly. The processor then streams those billions of numbers through arithmetic, over and over, for every chunk of text it produces. Two things therefore determine whether inference is fast, slow, or impossible: how much fast memory you have, and how quickly the processor can move numbers through it.

This is where the hardware acronyms come in, and they’re worth knowing, because they’re the difference between a model that answers instantly and one that crawls.

CPUs, GPUs, and the memory that decides everything

Your computer has a CPU — the central processor. It’s a brilliant generalist: a handful of powerful cores that do one complex thing after another, very fast. It runs your operating system, your spreadsheet, your web browser. Ask a CPU to run a small AI model and it can, slowly. Ask it to run a large one and it grinds, because inference isn’t one complex task done in sequence — it’s millions of simple multiplications that all want to happen at once.

That “all at once” is what a GPU — graphics processing unit — was born for. A GPU has thousands of small cores instead of a few large ones. It was originally built to colour millions of screen pixels simultaneously for video games, and that same talent — do the same simple operation across enormous batches of numbers in parallel — turns out to be exactly what running a neural network needs. For AI inference, a GPU isn’t a luxury; it’s the difference between usable and not.

But raw GPU speed isn’t the bottleneck most people hit first. For local AI inference, VRAM — or unified memory on Apple Silicon — is the single most important spec. VRAM is the GPU’s own dedicated memory, and it’s a hard wall. The model’s parameters have to fit inside it. If they fit, the GPU runs the model at full speed. If you exceed your VRAM, you have to offload part of the model to the CPU and ordinary system memory, which severely cripples inference speed. There’s no graceful degradation — you’re either in fast memory or you’ve fallen off a cliff.
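A rough way to see that cliff in numbers: when a model generates text, each chunk it produces (a token) requires reading essentially every parameter once, so the time per token is about the model's size divided by memory bandwidth. The sketch below applies that rule of thumb. The function name and the bandwidth figures (900 GB/s for GPU memory, 60 GB/s for system memory) are illustrative assumptions, not measurements of any particular hardware, and real throughput will be lower.

```python
def tokens_per_second(model_gb: float, vram_gb: float,
                      gpu_gb_per_s: float, cpu_gb_per_s: float) -> float:
    """Rule-of-thumb upper bound on generation speed.

    Assumes each generated token reads every parameter once, so the time per
    token is (gigabytes read) / (memory bandwidth), summed over both pools.
    """
    if min(model_gb, vram_gb, gpu_gb_per_s, cpu_gb_per_s) <= 0:
        raise ValueError("all arguments must be positive")
    in_vram = min(model_gb, vram_gb)   # portion served from fast GPU memory
    spilled = model_gb - in_vram       # portion offloaded to system memory
    seconds_per_token = in_vram / gpu_gb_per_s + spilled / cpu_gb_per_s
    return 1.0 / seconds_per_token


# Illustrative, assumed bandwidths: 900 GB/s GPU memory, 60 GB/s system memory.
fits_entirely = tokens_per_second(15, vram_gb=24, gpu_gb_per_s=900, cpu_gb_per_s=60)
spills_3_gb = tokens_per_second(15, vram_gb=12, gpu_gb_per_s=900, cpu_gb_per_s=60)
print(f"fits in VRAM:   {fits_entirely:.0f} tokens/s")  # about 60
print(f"3 GB offloaded: {spills_3_gb:.0f} tokens/s")    # about 16
```

Under these assumed numbers, pushing a fifth of the model out of VRAM cuts speed by roughly a factor of four, because the slow pool dominates the time per token.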

So how much memory does a model need? It depends on two numbers: how many parameters the model has, and how aggressively it’s been compressed. Parameters are counted in billions — a “7B” model has seven billion. Compression is called quantization: storing each parameter with less numerical precision to shrink it, trading a little accuracy for a lot of memory savings. A rough, practical map of the landscape in 2026:

  • A 7-billion-parameter model needs roughly 8–9 GB of VRAM at around 8-bit quantization, and a 14B model about 15 GB. Most 70B models need around 39 GB even at more aggressive 4-bit quantization, which is more than a single high-end consumer card holds.
  • The practical minimum for any useful local AI work is 16 GB of system RAM and a GPU with at least 6 GB of VRAM, or an Apple Silicon Mac. That is enough for a 3B–7B model, with the 7B end needing 4-bit quantization to fit in 6 GB.
  • The genuinely large open models available in 2026 (Llama 4, Qwen 3.5, DeepSeek V4) demand server hardware at full scale: stacks of data-centre GPUs with hundreds of gigabytes of pooled memory.
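The figures in the list above follow from simple arithmetic: parameters times bits per parameter, plus some working room. Here is a minimal sketch of that estimate. The 10% overhead factor and the function names are assumptions chosen for illustration, and so are the quantization levels used in the examples (8-bit for the smaller models, 4-bit for the 70B); actual usage also depends on context length and the runtime you use.

```python
def estimate_memory_gb(params_billion: float, bits_per_param: int,
                       overhead: float = 1.1) -> float:
    """Approximate memory needed to hold a model for inference.

    One billion parameters at 8 bits each is one gigabyte of weights.
    `overhead` is an assumed allowance for working memory, not a measured value.
    """
    weights_gb = params_billion * bits_per_param / 8
    return weights_gb * overhead


def fits(params_billion: float, bits_per_param: int, available_gb: float) -> bool:
    """True if the estimated footprint fits in the given VRAM or unified memory."""
    return estimate_memory_gb(params_billion, bits_per_param) <= available_gb


for name, size, bits in [("7B", 7, 8), ("14B", 14, 8), ("70B", 70, 4)]:
    print(f"{name} at {bits}-bit: about {estimate_memory_gb(size, bits):.1f} GB")

print(fits(70, 4, available_gb=24))  # single 24 GB consumer card
print(fits(70, 4, available_gb=64))  # 64 GB of unified memory
```

The estimates land close to the figures quoted above, which is the point: the memory requirement is not mysterious, and anyone quoting you hardware should be able to show this arithmetic for the model they propose.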

Apple Silicon deserves a note because it changes the math for small shops. Apple’s chips use unified memory, meaning the system RAM doubles as VRAM: an M-series Mac with 48 GB can run models that would otherwise need two 24 GB GPUs, and a 64 GB Max-tier machine can run 70B models that would otherwise need several consumer cards. A MacBook on a desk can do surprisingly serious AI work. The catch is the ceiling: a laptop runs medium models, not frontier ones, and never several clients’ worth at once.

Hold onto one fact from all this: the frontier models — the genuinely smartest ones — do not fit on hardware you’d put in an office. They live in data centres on purpose. That single fact is the hinge the rest of this guide turns on.

The two places your AI can live

With the mechanics in hand, the choice becomes simple to state. The model doing your thinking runs in one of two places.

Hosted. The model lives in a data centre, maintained by a large provider, and your application sends it a request and gets an answer back over the network. You don’t own the hardware, the VRAM, or the electricity bill. You rent the thinking by the use. This is how the overwhelming majority of applied AI works — including most of the AI you already touch without noticing.

Localized. The model runs on hardware you or your builder control — a server in a Canadian data centre, or in some cases a machine on your own premises. The data and the inference both stay inside that boundary. Nothing crosses to a third party. This is more to build and more to keep running, and it exists for one real reason, which is the heart of this guide.

That’s the whole decision. Not whether you need AI — you’ve usually settled that before you call anyone. The decision is where the model lives. And it turns out to be a question about your data and your obligations, not about your ambition or how advanced you want to sound.

Why hosted is right for almost everyone

For most businesses, hosted is the correct answer — and not as a budget compromise. Three reasons, in plain terms.

The hosted models are simply better, and they stay current. Remember that the smartest models don’t fit on office hardware. The best open-weight models you can self-host are genuinely competitive with the frontier on many tasks, but the frontier models still lead on complex reasoning, long documents, and following instructions precisely — and if your use case needs that top tier, hosted is your only option. A hosted model also improves on a schedule you never have to manage. When you run a model on your own box, you freeze it at whatever you installed, and you inherit the job of updating it.

The money lands in the right place. Hosted intelligence is billed per use: you pay for the work the application does, with no bill when it’s idle. Self-hosting inverts that: you pay for the hardware whether it’s working or not. That idle baseline is the cost most “local is cheaper” pitches quietly ignore. The break-even is not close for a normal business. Below roughly 500,000 tokens a day on a small model, or a few million on a large one, hosted APIs are cheaper once you count the whole picture; only above those sustained volumes does owning hardware start to win. To put that in human terms: a “token” is about three-quarters of a word, so 500,000 tokens a day is roughly 375,000 words of AI work, every day, indefinitely. Most operators reading customer messages or processing invoices never come close.
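The break-even logic can be worked through in a few lines. This is a sketch under stated assumptions: the monthly self-hosting cost (300) and the hosted price (20 per million tokens) are made-up round numbers chosen so the arithmetic is easy to follow, not quotes from any provider, and the function names are hypothetical. Only the three-quarters-of-a-word rule comes from the text above.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: a token is about three-quarters of a word


def tokens_to_words(tokens: float) -> float:
    return tokens * WORDS_PER_TOKEN


def hosted_monthly_cost(tokens_per_day: float, price_per_million: float,
                        days: int = 30) -> float:
    """Pay-per-use: zero tokens means a zero bill."""
    return tokens_per_day * days / 1_000_000 * price_per_million


def break_even_tokens_per_day(self_hosted_monthly_cost: float,
                              price_per_million: float, days: int = 30) -> float:
    """Daily volume at which the hosted bill equals the fixed cost of owning.

    `self_hosted_monthly_cost` should include hardware, power, and upkeep,
    because that cost is paid whether the machine is busy or idle.
    """
    return self_hosted_monthly_cost / price_per_million * 1_000_000 / days


# Hypothetical inputs: 300 per month all-in to self-host, 20 per million tokens hosted.
threshold = break_even_tokens_per_day(300, 20)
print(f"break-even: {threshold:,.0f} tokens/day")
print(f"that is about {tokens_to_words(threshold):,.0f} words/day")
print(f"hosted bill at 50,000 tokens/day: {hosted_monthly_cost(50_000, 20):.2f}")
```

Plug in your own real volume and a real all-in ownership figure. If your daily token count sits well under the threshold, the hosted bill will be a fraction of the fixed cost of the box.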

The build is simpler, so it ships sooner and costs less. A localized model is a second system to stand up, secure, monitor, and keep alive at 2 a.m. when a GPU faults. The GPU is only the start; the real cost is everything around it, and once that total cost of ownership is counted, APIs are the cheaper option for any team below the break-even volume. For the work most operators actually need (reading documents, automating intake, flagging exceptions, search that understands your business), hosted gets you a working release in weeks, not a hardware procurement project.

None of this is the interesting part, though. The interesting part is the single reason the other bucket exists at all.

The one reason to localize: your data can’t leave

There is exactly one question that moves a business from hosted to localized, and it has nothing to do with speed, cost, or how clever the AI is. It’s this: can your data touch a server outside Canada — even briefly, even encrypted — or is it under a rule that says it cannot?

For most businesses the honest answer is yes, with ordinary safeguards. A field-service company’s job tickets, a winery’s inventory counts, a hospitality operator’s booking list — all sensitive, all worth protecting, none of it under a law forbidding it from crossing a border. And here the law is more nuanced than the scaremongering suggests. Canada has no blanket law requiring all data to stay within its borders; PIPEDA permits cross-border transfers as long as the data receives comparable protection abroad and customers are informed. So for the ordinary case, a well-built hosted application can satisfy the requirement honestly. You don’t need a server in your back office to be compliant.

For some businesses, though, the answer is a hard no, and the reasons are real, not theatrical. PIPEDA keeps the organisation that transfers data accountable for it: a cross-border transfer is acceptable only if the data stays under comparable protection, and for data processed in the US that means additional safeguards, typically contractual ones, rather than reliance on US law alone. The deeper issue is who else can reach the data once it’s on foreign soil. Data processed through US infrastructure carries potential exposure under the US CLOUD Act, which is precisely why many Canadian organisations keep sensitive data within Canadian jurisdiction. On top of federal law, two things tighten the screws further:

  • Contracts and sector rules. Government contracts and healthcare regulations often mandate Canadian residency regardless of what PIPEDA technically permits — and many clients write residency into contracts anyway to simplify breach reporting and reduce foreign-subpoena exposure. A clinic governed by provincial health law, a vendor serving a government buyer, a firm whose own customer contract names data residency — for these, “the law allows it with safeguards” is beside the point, because the contract says no.
  • Penalties that make caution rational. Bill C-27, the proposed federal privacy reform, set out penalties of up to C$25 million, far above the equivalent US caps; exposure on that scale turns a data-residency mistake from a theoretical risk into a budget-destroying one. In Quebec, Law 25 goes further still, requiring a privacy impact assessment before any transfer of personal information outside the province, with fines up to 4% of worldwide revenue.

That is the line. Not how advanced your AI is — how regulated your data is. If your data can leave the country under ordinary protection, hosted is right for you and you can stop reading the worried part. If it genuinely cannot — because a regulator, a province, or a signed contract says so — you’re in the localized bucket, and that’s a real, solvable engineering problem, not a dead end.

If you’re in the localized bucket, here’s what it really involves

Suppose you are the clinic. Your data can’t leave Canadian infrastructure, the model included. What does localizing actually look like, now that you understand the hardware?

It means running an open-weight model — Llama, Qwen, Mistral and their kin — on a machine inside the boundary. The honest engineering picture, with the numbers from earlier doing the work:

  • A medium model (in the 7B–14B range) that handles document extraction, intake classification, or exception-flagging will run on a single capable GPU or a well-specced Apple Silicon machine — 16 to 24 GB of VRAM or unified memory. This is genuinely attainable hardware, not a server farm. A single 24 GB card remains the sweet spot for serious single-GPU inference in 2026.
  • A large model (70B and up), if the job truly needs that much capability, climbs into multi-GPU territory or high-memory Apple Silicon, and the cost and operational weight climb with it. Most 70B models need around 39 GB or more, beyond what a single consumer card holds.
  • Either way, someone has to keep it running: patching, monitoring, securing, and replacing hardware when it fails. That operational tail is the real cost of local, more than the GPU sticker price.

The honest framing matters here, and we’ll always give it to you straight: running a capable model on your own infrastructure is more to build and more to maintain than calling a hosted one. The right model size depends entirely on the job — over-buying hardware for a task a 14B model handles is just lighting money on fire. We’d rather scope the actual requirement, tell you what it genuinely costs, and help you decide whether your obligation truly demands it, than sell you a server you didn’t need. Often the conversation ends with a hosted build and a clear, documented reason it satisfies the rule. A hybrid pattern is also real and underused: keep the sensitive baseline on local Canadian infrastructure and route only non-sensitive overflow to hosted models — capturing the cost efficiency where it’s safe and the residency guarantee where it’s required. Sometimes the answer is fully local, and then it’s exactly the right call. Either way, you’ll know precisely why.
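The hybrid pattern described above comes down to one routing rule. Here is a minimal sketch of it; the names (`Target`, `Request`, `route`) are hypothetical, and the hard part, deciding whether a given request actually contains regulated data, is assumed to happen upstream. The sketch treats an unknown classification as regulated, which is a design choice of this example rather than anything the guide prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Target(Enum):
    LOCAL = "local"    # model inside the Canadian boundary
    HOSTED = "hosted"  # third-party hosted model


@dataclass(frozen=True)
class Request:
    text: str
    # True/False if classified upstream; None means nobody has checked yet.
    contains_regulated_data: Optional[bool]


def route(request: Request, local_has_capacity: bool) -> Target:
    """Regulated data never leaves; only non-sensitive overflow goes out."""
    if request.contains_regulated_data is not False:
        # Regulated, or unclassified: fail closed and keep it local,
        # even if that means waiting for local capacity.
        return Target.LOCAL
    return Target.LOCAL if local_has_capacity else Target.HOSTED


print(route(Request("patient intake form", True), local_has_capacity=False))
print(route(Request("public product FAQ", False), local_has_capacity=False))
print(route(Request("unlabelled upload", None), local_has_capacity=False))
```

Note that the busy-server case is where a careless implementation leaks: the rule has to hold when local capacity runs out, not only when it is convenient.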

For what it’s worth, this is the same promise we make about everything else, carried all the way through. We build and host in Canada as a matter of course. Localized inference is just that same boundary extended to cover the part of the system that thinks.

Where the application comes in — and why it was always the point

Here’s the thing the “do you need AI” sales pitch always skips. In both buckets, hosted and localized, the thing you are actually buying is an application.

The model — wherever it runs — does nothing on its own. It reads an invoice because an application handed it the invoice, captured the structured answer, put it in front of the right person, wrote it into your records, and flagged the exception for review. The intelligence is one component inside a system that runs a real workflow. A winery doesn’t want “an AI.” It wants the system that takes a distributor’s PDF, pulls the order off it, updates inventory, and never makes anyone re-type a line — and intelligence is one part of that system, sitting underneath, doing a specific job with its work shown and checkable.

This is why “hosted or localized” is never really the question in isolation. The real question is: what should this application do, and where does the thinking part need to live to satisfy your obligations? The application is the product. The model’s location is a deployment decision underneath it — a decision that matters enormously when your data is regulated and is nearly invisible to you when it isn’t. Custom application development is how either version gets built; the AI is the intelligence engineered into it, not a chatbot bolted onto the side.
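To make "the model is a component inside the application" concrete, here is a small sketch of the winery example. Everything in it is hypothetical: `LineItem`, `process_order`, and especially `stub_extractor`, which stands in for the model by parsing simple "name: quantity" lines instead of reading a real PDF. What matters is that the workflow code takes the extractor as a parameter and never knows where it runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LineItem:
    name: str
    quantity: int


@dataclass
class OrderResult:
    applied: List[LineItem] = field(default_factory=list)
    flagged: List[LineItem] = field(default_factory=list)  # needs human review


# Any callable with this shape will do: hosted API client or local model.
Extractor = Callable[[str], List[LineItem]]


def stub_extractor(document_text: str) -> List[LineItem]:
    """Stand-in for the model call; a real one would read the distributor's PDF."""
    items = []
    for line in document_text.splitlines():
        name, _, qty = line.partition(":")
        if name.strip() and qty.strip().isdigit():
            items.append(LineItem(name.strip(), int(qty.strip())))
    return items


def process_order(document_text: str, extract: Extractor,
                  inventory: Dict[str, int]) -> OrderResult:
    """The application: extract, update records, flag exceptions."""
    result = OrderResult()
    for item in extract(document_text):
        in_stock = inventory.get(item.name, 0)
        if item.name in inventory and item.quantity <= in_stock:
            inventory[item.name] = in_stock - item.quantity
            result.applied.append(item)
        else:
            result.flagged.append(item)  # unknown product or not enough stock
    return result


stock = {"Pinot Noir": 100}
outcome = process_order("Pinot Noir: 12\nMystery Wine: 3", stub_extractor, stock)
print(stock, [i.name for i in outcome.flagged])
```

Swapping a hosted model for a localized one changes which function is passed as `extract`. The workflow, the records, and the review step stay the same.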

Five questions to ask before anyone sells you local AI

If someone is pitching you an on-premise AI box, or a “private AI server,” these five questions cut through it fast. They’re the same ones we’d ask on your behalf.

  1. What’s the actual obligation? Name the law, the provincial act, or the contract clause that requires residency. “It’s more secure” is not an obligation. If nobody can point to the rule, you’re probably in the hosted bucket and being upsold.
  2. What model size does the job truly need? A 14B model running document extraction needs a fraction of the hardware a 70B model does. If the pitch jumps straight to a multi-GPU rig without naming the task, the hardware is driving the conversation instead of the work.
  3. Who maintains it on a Tuesday? Patching, monitoring, security updates, and the GPU that fails without warning — someone owns that, forever. If that answer is vague, the running cost is hidden.
  4. What’s the all-in cost versus hosted, at your real volume? Get the idle hardware cost, the electricity, and the maintenance counted — not just the per-token comparison. Below the break-even volume, hosted usually wins even before you count the hassle.
  5. Could a hybrid do it? Sensitive data local, everything else hosted. Often the cheapest shape that still satisfies the rule, and the one a hardware vendor is least likely to mention.

If the answers to these point at a genuine residency obligation and a workload that justifies the hardware, localized is right and worth doing well. If they don’t, you’ve just saved yourself a server.
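If it helps to see the five questions as a single decision, here is one way to write them down. This is a sketch of the reasoning in this section, not a compliance tool: the field names and the wording of the warnings are invented for illustration, and the break-even figure is something you would supply from your own costing.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PitchAnswers:
    named_obligation: Optional[str]  # Q1: the law, act, or clause; None if nobody can name it
    task_named: bool                 # Q2: is the model size tied to a specific task?
    maintainer_named: bool           # Q3: does someone own patching and failures?
    daily_tokens: int                # Q4: your real sustained volume
    break_even_daily_tokens: int     # Q4: volume where owned hardware starts to win
    hybrid_satisfies_rule: bool      # Q5: can sensitive data stay local, the rest hosted?


def assess(a: PitchAnswers) -> Tuple[str, List[str]]:
    """Return a bucket ('hosted', 'hybrid', 'localized') and open concerns."""
    concerns = []
    if not a.task_named:
        concerns.append("hardware is being sized before the task is named")
    if not a.maintainer_named:
        concerns.append("nobody owns maintenance, so the running cost is hidden")
    if a.daily_tokens < a.break_even_daily_tokens:
        concerns.append("below break-even volume: cost alone does not justify local")

    if a.named_obligation is None:
        return "hosted", concerns      # no rule anyone can point to
    if a.hybrid_satisfies_rule:
        return "hybrid", concerns      # sensitive data local, the rest hosted
    return "localized", concerns


bucket, concerns = assess(PitchAnswers(
    named_obligation=None, task_named=False, maintainer_named=True,
    daily_tokens=40_000, break_even_daily_tokens=500_000,
    hybrid_satisfies_rule=False,
))
print(bucket, concerns)
```

The ordering is deliberate: the obligation decides the bucket, and the other answers only tell you whether the proposal in front of you has been thought through.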

The short version

If you skipped to the end, here’s the whole guide in six lines:

  • The real question isn’t whether you need AI. It’s where the model that does the thinking is allowed to run.
  • That comes down to your data and your obligations, not your ambition. The smartest models live in data centres because they physically don’t fit anywhere smaller.
  • Hosted is right for almost everyone: better models, lower total cost until very high volume, faster to ship, nothing to maintain.
  • Localized exists for one reason — data that genuinely can’t leave Canadian infrastructure, because a regulator, a province, or a signed contract says so. The penalties for getting that wrong are real.
  • If that’s you, it’s a build we can scope and stand up in-country, at a model size matched to the actual job — and sometimes a hybrid is the smartest shape.
  • In both cases you’re buying an application that runs a real workflow. The model is a part inside it, not the product.

Not sure which bucket you’re in? That’s the first thing a Blueprint sorts out — a paid, fixed scoping engagement that ends with a real plan and a real number. Bring the obligation you’re worried about. We’ll tell you straight whether it moves you across the line, and we’ll show our work either way.