Development · 3 min read

A Small Team's LLM Fine-Tuning Adventure

Three months of trying, failing, and trying again to fine-tune an LLM with a 5-person team

We Got Stuck Before We Even Got a GPU

It was early September when we declared we'd build an automated customer inquiry classification system. Five team members, monthly budget of about $340. At first we figured the OpenAI API would be enough, but classification accuracy plateaued at 72% no matter what. We rewrote the prompt endlessly. (47 times. I stopped counting at 47.)

So we decided to fine-tune. But the first problem was where to rent GPUs. AWS p4d instances run about $32/hour, so five hours of training would eat half our monthly budget.

We Started with Colab Pro+

The math said a single A100 could handle a 7B model. Colab Pro+ at $49.99/month was the most realistic option. (Our team lead asked "is this actually going to work?" three separate times.)
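That math, reconstructed as a back-of-envelope sketch (byte counts assume fp16 weights and gradients with fp32 Adam state; the LoRA adapter fraction is a guess, and activations are ignored entirely):

```python
# Back-of-envelope VRAM math for a 7B model, with illustrative
# assumptions: fp16 weights/grads, fp32 Adam state and master weights
# for full fine-tuning; activation memory is not counted.

PARAMS = 7e9
GB = 1024 ** 3

def full_finetune_gb(params=PARAMS):
    # 2 B weights + 2 B grads + 4+4 B Adam m/v + 4 B master = 16 B/param
    return params * 16 / GB

def lora_finetune_gb(params=PARAMS, trainable_frac=0.005):
    # Frozen fp16 base (2 B/param) plus full optimizer state only for
    # the adapter; 0.5% trainable is an assumed, typical-ish figure.
    return (params * 2 + params * trainable_frac * 16) / GB

print(f"full fine-tune : ~{full_finetune_gb():.0f} GB")  # way past one A100
print(f"LoRA fine-tune : ~{lora_finetune_gb():.0f} GB")  # fits with headroom
```

Full fine-tuning blows past a single A100 before you even allocate activations; freezing the base and training a small adapter is what makes the single-GPU plan plausible at all.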

Data prep was the real nightmare. We manually labeled 3,200 customer inquiries. Split across five people, it took two and a half days. Labeling criteria varied between people, so we scrapped everything halfway through and rebuilt the rubric from scratch. That detour cost us four more days.
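One thing that would have caught the drift early: spot-checking agreement between two labelers before committing to all 3,200 tickets. A minimal Cohen's kappa sketch (category names and labels here are invented):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: observed agreement, corrected
    for the agreement you'd expect by chance given each annotator's
    label distribution. 1.0 = perfect, 0.0 = chance-level."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical spot-check: two teammates label the same six tickets.
a = ["billing", "refund", "billing", "shipping", "refund", "billing"]
b = ["billing", "billing", "billing", "shipping", "refund", "refund"]
print(round(cohens_kappa(a, b), 3))
```

A kappa this low on a quick sample is exactly the signal that the rubric needs fixing before anyone labels the other three thousand tickets.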

Without LoRA, We'd Have Quit Way Earlier

Full-parameter tuning was impossible within our memory constraints. We switched to LoRA. Started with rank 8, alpha 16, and the loss wouldn't budge on the first run. Dropped the learning rate from 2e-4 to 5e-5 and it finally started moving. (Finding that took two days. Two days of staring at logs.)
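For context on those knobs: LoRA leaves the frozen weight W alone and learns a low-rank correction B·A that gets added on top, scaled by alpha / r, so rank 8 with alpha 16 means the learned delta is applied at 2x. A toy sketch with hand-written matrices (not a real model):

```python
# Toy illustration of the LoRA update: the frozen weight W is left
# untouched and a low-rank product B @ A, scaled by alpha / r, is
# added on top. Only A and B would be trained.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, r, alpha):
    """W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
A = [[0.1, 0.2]]              # rank-1 "down" projection (1x2)
B = [[0.5], [0.5]]            # rank-1 "up" projection (2x1)

print(lora_effective_weight(W, A, B, r=1, alpha=2))
```

The rank caps how expressive the correction can be; the alpha / r ratio just rescales it, which is why people often move the two together.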

Three epochs took 4 hours 23 minutes. Colab disconnected midway through. We hadn't been saving checkpoints. Started over from scratch. I had three beers that evening.
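The fix is embarrassingly small. A skeleton of the save-and-resume pattern we use now (the path and epoch count are illustrative; with Hugging Face's Trainer, `save_steps` plus `resume_from_checkpoint` and a mounted Drive folder gets you the same protection):

```python
import json
import os

CKPT = "checkpoint.json"  # on Colab, point this at a mounted Drive path

def save_state(path, state):
    # Write to a temp file, then rename: a disconnect mid-write
    # can't leave a half-written, corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_state(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}  # fresh start if no checkpoint exists

state = load_state(CKPT)
for epoch in range(state["epoch"], 3):  # picks up where the last run died
    # ... one epoch of actual training would go here ...
    save_state(CKPT, {"epoch": epoch + 1})
```

Ten lines of insurance against a 4-hour restart.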

Accuracy Went Up, But Not as Much as We'd Hoped

Post-fine-tuning accuracy went from 72% to 83%. An 11 percentage point improvement. Honestly, I'd been hoping for 90%+, but reality had other plans. The culprit was ambiguous inquiries that straddled categories, like tickets that mixed "payment error" with "refund request."

On the second attempt, we refined the labels and expanded the dataset to 4,800 entries. Accuracy hit 87.3%. Getting there took two and a half months.

But How Does This Compare to Just Using the API?

I ran the numbers. Total fine-tuning cost including GPU rental and labor came to roughly $1,600. Just using the OpenAI API would cost about $85/month, but with 72% accuracy. The fine-tuned model has near-zero inference costs, so it eventually pays for itself, though at $85/month in avoided API fees, breakeven lands closer to the 19-month mark than I'd like.

But honestly, I have no idea if this model will still be useful in six months. If customer inquiry patterns shift, we'd need to tune again.

Things Worth Knowing If Your Small Team Tries This

Data labeling consumed 60% of our total time. Model training finishes faster than you'd expect. The bottleneck is always data.

LoRA is genuinely revolutionary. Fine-tuning a 7B model on a single A100 would've been unimaginable two years ago. But hyperparameter tuning is still more art than science; reading three papers helps less than just running three experiments yourself.

My biggest regret was not setting up proper evaluation metrics from the start. We only tracked overall accuracy when we should've been looking at per-category precision and recall. Discovering that one category had only 43% recall, way too late in the process, was a gut punch.
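The breakdown we should have tracked is only a few lines (sklearn's `classification_report` gives you the same thing; this pure-Python sketch uses made-up labels):

```python
from collections import defaultdict

def per_class_metrics(y_true, y_pred):
    """Precision and recall per category: the view that exposes a
    weak class even when overall accuracy looks healthy."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    out = {}
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        out[c] = {"precision": round(prec, 3), "recall": round(rec, 3)}
    return out

# Made-up toy labels: overall accuracy looks healthy (5/6) while
# the "refund" class quietly sits at recall 0.5.
y_true = ["billing", "refund", "billing", "refund", "shipping", "billing"]
y_pred = ["billing", "billing", "billing", "refund", "shipping", "billing"]
print(per_class_metrics(y_true, y_pred))
```

Run that after every training round and a collapsing category shows up in week one instead of month two.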

Anyway, small teams can absolutely do fine-tuning. But I'll never call it "easy."
