Running LLMs Locally: An Honest Review
I tried running LLMs locally on an RTX 4070 Ti. Here's the gap between expectations and reality.
The API Bill Made Me Flinch
When my monthly API costs hit around $110, I thought "this can't keep going." My RTX 4070 Ti was sitting under my desk doing nothing when I wasn't gaming, so why not put it to work? I installed Ollama, downloaded a model, and got my first terminal response in exactly 23 minutes. (I timed it.)
Then reality set in.
Installation Is Genuinely Easy Now
I'll give it that. In 2024, you had to deal with GGUF file conversion, quantization, all sorts of headaches. Now it's ollama pull llama3.3:70b-q4 and you're done. Just wait for the 40GB download.
Installation alone is arguably easier than npm install.
This Is Where Things Went Sideways
The problem is speed. My 4070 Ti has 12GB of VRAM, but the 70B model at q4 quantization weighs about 40GB. It doesn't fit in VRAM, so the overflow gets offloaded to system RAM and runs on the CPU.
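A quick back-of-the-envelope check makes the mismatch obvious. (The ~4.5 effective bits per weight for q4-style quants is my approximation -- K-quant formats land a bit above a flat 4 bits -- not an official figure.)

```typescript
// Why a 70B q4 model can't live in 12GB of VRAM: rough footprint math.
// bitsPerWeight ≈ 4.5 is an assumption for q4-style quantization.
const params = 70e9;        // 70B parameters
const bitsPerWeight = 4.5;  // q4-ish effective bits per weight (approximate)
const modelGB = (params * bitsPerWeight) / 8 / 1e9;

const vramGB = 12;          // RTX 4070 Ti
const offloadGB = Math.max(0, modelGB - vramGB);

console.log(`model ≈ ${modelGB.toFixed(1)} GB vs ${vramGB} GB VRAM`);
console.log(`≈ ${offloadGB.toFixed(1)} GB spills to system RAM → CPU-bound decoding`);
```

More than two-thirds of the weights end up in system RAM, and decoding speed collapses to whatever the CPU and memory bus can feed.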
Result? Token generation at 3-5 tokens per second. Claude 4's API does 30-50. That's roughly 10x slower. Asking it to write a simple function takes 40 seconds for a complete response. At that point, I'm faster typing it myself. Welp.
Dropping to the 8B model gets you up to 25 tokens/sec, but code quality drops noticeably.
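To put those throughput numbers in terms of actual wait time, here's a rough sketch. The ~200-token answer length is my assumption for a small function, not a benchmark.

```typescript
// Time-to-complete for a ~200-token code answer at the throughputs I saw.
// responseTokens = 200 is an assumed typical length, not a measurement.
const responseTokens = 200;
const rates: Record<string, number> = {
  "70B q4, CPU offload": 4, // midpoint of my 3-5 tok/s
  "8B, fully in VRAM": 25,
  "hosted API": 40,         // midpoint of the 30-50 tok/s range
};

for (const [setup, tps] of Object.entries(rates)) {
  console.log(`${setup}: ~${(responseTokens / tps).toFixed(0)}s per answer`);
}
```

Fifty-ish seconds versus five. That's the difference between staying in flow and alt-tabbing to check your phone.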
Same Prompt, Different Results
I gave both the same prompt: "implement a debounce function in TypeScript." Claude 4 delivered clean code with full generic type support. Llama 70B q4 produced working code plastered with any types. The 8B model gave me code that threw runtime errors. (Does it even know how generics work?)
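For reference, here's the shape of the generically-typed version that prompt is fishing for -- my own sketch, not any model's output verbatim:

```typescript
// A debounce with full generic type support: argument types flow through,
// so callers never need `any`.
function debounce<Args extends unknown[]>(
  fn: (...args: Args) => void,
  waitMs: number,
): (...args: Args) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: Args) => {
    clearTimeout(timer); // reset the pending call on every invocation
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Usage: `msg` is inferred as string; only the last call within 200ms fires.
const log = debounce((msg: string) => console.log(msg), 200);
log("typed");
```

Nothing exotic -- which is exactly why it stings when a 40GB local model hands you `any`-riddled code for it.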
Realistically, it's more of a coding reference than a coding assistant.
There Are Some Legit Use Cases Though
Completely useless? Not quite. I found three scenarios where it works. Working with sensitive internal code you don't want to send to external APIs. Cranking out similar CRUD patterns -- the 8B model handles that fine. And coding on a plane -- surprisingly useful when there's no internet.
But the Electric Bill...
Didn't see this one coming. GPU at full load pulls about 285W. Run it 8 hours a day and your monthly electric bill goes up $15-22. Plus the fan noise is insane. Ran the LLM during a video call and someone asked "are you at a construction site?"
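The math checks out -- assuming a fairly typical US residential rate range. (The $/kWh figures below are my assumption; your utility will vary.)

```typescript
// Sanity-checking the electric bill. Rates of $0.22-0.32/kWh are assumed,
// not taken from any specific utility.
const watts = 285;       // GPU draw at full load
const hoursPerDay = 8;
const daysPerMonth = 30;
const kwhPerMonth = (watts * hoursPerDay * daysPerMonth) / 1000; // 68.4 kWh

for (const rate of [0.22, 0.32]) {
  console.log(`at $${rate}/kWh: $${(kwhPerMonth * rate).toFixed(2)}/month`);
}
```

That 68.4 kWh per month is the span between the $15 and $22 figures, depending on where your rate falls.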
I set out to save on API costs and ended up paying it back in electricity. Honestly, a bit infuriating.
Where I'm At Now
I haven't uninstalled Ollama. But honestly, I open it maybe once a week at best. A $15/month API subscription is overwhelmingly better value. Once 24GB VRAM GPUs become mainstream and model compression improves further, things might change. When that'll be, I have no idea. For now, I just use the API.