Mentatcurated
concept also: local LLMs, on-device inference, running models locally

Local inference

Running a model on your own hardware rather than calling a hosted API — private, free at the margin, and offline-capable, at the cost of some quality and speed.

In a nutshell

Local inference means running a model on your own machine instead of calling a hosted API — so the data never leaves your device, it's free at the margin, and it works offline. It became practical through two advances: small models that punch above their weight, and quantisation that shrinks them to fit consumer hardware. The trade is some quality and speed for privacy and control.

Local inference became practical through two advances: small models that punch above their size, and quantisation that shrinks them to fit consumer hardware. Tools now hide the setup behind a single command.

It's the democratisation track — anyone can run capable AI privately — and the proliferation worry, since the same accessibility cuts both ways. Both are true at once.

Where it came from

Year2023
Sourcellama.cpp (Georgi Gerganov)
Why it matteredShowed a quantised LLM could run on a laptop CPU, kicking off the local-model wave.

Related concepts