Development · 3 min read

Setting Up Monitoring in Production: A Grafana + Prometheus Story

What it was like bringing Grafana + Prometheus into a production service, including all the trial and error.

"The Site Won't Load"

Got a call from a client contact. Checked and found the server had been down for 30 minutes -- and nobody on our side had noticed. Our only way to check server status was SSH-ing in and running htop.

After that incident, setting up monitoring became the top priority.

Why Prometheus + Grafana

Three candidates: Datadog, New Relic, Prometheus + Grafana.

Datadog has great features but costs over $15/host/month, and multiplied by our server count, it added up. Prometheus + Grafana is open source so cost is basically zero. There's a learning curve, but once you get it, the flexibility is unbeatable.

(Honestly, cost was the deciding factor.)

Week One: Faster Than Expected

Spun up Prometheus in Docker and started collecting server metrics with Node Exporter. CPU, memory, disk, network. Basic metrics alone were enough to understand server health.
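The Docker setup described above can be sketched roughly like this -- a minimal docker-compose file, assuming the standard prom/prometheus and prom/node-exporter images and their default ports (the post doesn't specify the exact setup):

```yaml
# docker-compose.yml -- minimal sketch; image tags, ports, and paths are assumptions
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"            # Prometheus UI and API
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"            # host metrics endpoint scraped by Prometheus

volumes:
  prometheus-data:
```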

Application metrics were exposed through Spring Boot's Micrometer via the /actuator/prometheus endpoint -- Prometheus scrapes it automatically. Added the target to prometheus.yml, set scrape_interval to 15 seconds, done. Half a day for the basic setup.
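The scrape config for that step might look like the following -- the job name and target host are placeholders, and the /actuator/prometheus path assumes Spring Boot's Micrometer Prometheus registry is on the classpath and the endpoint is exposed:

```yaml
# prometheus.yml -- scrape config sketch; job name and target are assumptions
global:
  scrape_interval: 15s         # the 15-second interval mentioned in the post

scrape_configs:
  - job_name: "spring-app"
    metrics_path: "/actuator/prometheus"   # Micrometer's Prometheus endpoint
    static_configs:
      - targets: ["app-server:8080"]
```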

Week Two: Building Dashboards Was Actually Fun

Making dashboards in Grafana was honestly kind of fun. Import a community dashboard and you've got basic server monitoring in 5 minutes. The Node Exporter dashboard (ID: 1860) was particularly good.

Custom dashboards took more time. PromQL felt alien at first. Wrapping my head around rate(http_requests_total[5m]) being the average requests per second over 5 minutes took a full day. But once it clicked, it was powerful. Being able to see p99 response times, error rate trends, and per-service traffic comparisons in real time -- that was a completely different world from before.
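The kinds of PromQL queries described here might look like the following -- the exact metric and label names depend on your instrumentation (http_server_requests_seconds_bucket is Micrometer's default histogram metric, but the status label and counter name below are assumptions):

```promql
# Average requests per second over the last 5 minutes
rate(http_requests_total[5m])

# p99 response time, computed from histogram buckets
histogram_quantile(0.99,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le))

# Error rate: share of 5xx responses over all responses
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```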

Week Three: Way Too Many Alerts

Connected Alertmanager and set up Slack webhook notifications. CPU above 80% for 5 minutes, memory above 90%, API error rate above 5%, p99 response time above 3 seconds.
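Two of those thresholds, expressed as Prometheus alerting rules, might look roughly like this -- a sketch using standard Node Exporter metrics; alert names, severities, and the exact expressions are assumptions:

```yaml
# alert-rules.yml -- sketch of two of the thresholds above; names are assumptions
groups:
  - name: server-health
    rules:
      - alert: HighCpuUsage
        # CPU usage = 100 minus the idle percentage, averaged per instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m                 # must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% for 5 minutes on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
```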

Alerts fired constantly at first -- the thresholds were too sensitive. Alert fatigue trains you to ignore alerts, which defeats the whole point. Two weeks of observing and tuning got us down to 0-2 alerts per day on average.

Problems We Never Knew About Started Showing Up

Things we discovered after adding monitoring: CPU was spiking to 90% every night at 3 AM. Turned out a cron batch job was running queries without indexes.

We also spotted a pattern where a specific API slowed down only on weekday afternoons. Root cause: external API call timeouts. No way we would have caught these without monitoring.

Total Cost: About $15/Month

Prometheus and Grafana run on our own server, so it's just the EC2 cost. Both fit on a single t3.small, roughly $15/month. Metric retention set to 30 days. For our scale, plenty.
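The 30-day retention mentioned above is set with a Prometheus startup flag; a sketch of the relevant launch command (the paths are assumptions):

```shell
# Prometheus startup -- retention flag from the post; config and data paths are assumptions
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.path=/prometheus
```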

No Going Back

Now we detect outages, on average, 5 minutes before users notice. Server health is visible at a glance on a single dashboard. Performance issues get analyzed with data instead of guesses.

Running a service without monitoring is like driving without a dashboard. The hardest part is getting started -- once it's set up, you never go back.
