Development · 3 min read

Setting Up Monitoring in Production: A Grafana + Prometheus Story

What it was like bringing Grafana + Prometheus into a production service, including all the trial and error.

"The Site Won't Load"

Got a call from a client contact. Checked and found the server had been down for 30 minutes -- and nobody on our side had noticed. Our only way to check server status was SSH-ing in and running htop.

After that incident, setting up monitoring became the top priority.

Why Prometheus + Grafana

Three candidates: Datadog, New Relic, Prometheus + Grafana.

Datadog has great features but costs over $15/host/month, and multiplied by our server count, it added up. Prometheus + Grafana is open source so cost is basically zero. There's a learning curve, but once you get it, the flexibility is unbeatable.

(Honestly, cost was the deciding factor.)

Week One: Faster Than Expected

Spun up Prometheus in Docker and started collecting server metrics with Node Exporter. CPU, memory, disk, network. Basic metrics alone were enough to understand server health.
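The Docker setup described above can be sketched roughly like this -- a minimal docker-compose file, assuming the standard prom/prometheus and prom/node-exporter images and their default ports (the post doesn't specify the exact setup):

```yaml
# docker-compose.yml -- minimal sketch; image tags, ports, and paths are assumptions
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"            # Prometheus UI and API
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"            # host metrics endpoint scraped by Prometheus

volumes:
  prometheus-data:
```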

Application metrics were exposed through Spring Boot's Micrometer via the /actuator/prometheus endpoint -- Prometheus scrapes it automatically. Added the target to prometheus.yml, set scrape_interval to 15 seconds, done. Half a day for the basic setup.
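The scrape config for that step might look like the following -- the job name and target host are placeholders, and the /actuator/prometheus path assumes Spring Boot's Micrometer Prometheus registry is on the classpath and the endpoint is exposed:

```yaml
# prometheus.yml -- scrape config sketch; job name and target are assumptions
global:
  scrape_interval: 15s         # the 15-second interval mentioned in the post

scrape_configs:
  - job_name: "spring-app"
    metrics_path: "/actuator/prometheus"   # Micrometer's Prometheus endpoint
    static_configs:
      - targets: ["app-server:8080"]
```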

Week Two: Building Dashboards Was Actually Fun

Making dashboards in Grafana was honestly kind of fun. Import a community dashboard and you've got basic server monitoring in 5 minutes. The Node Exporter dashboard (ID: 1860) was particularly good.

Custom dashboards took more time. PromQL felt alien at first. Wrapping my head around rate(http_requests_total[5m]) being the average requests per second over 5 minutes took a full day. But once it clicked, it was powerful. Being able to see p99 response times, error rate trends, and per-service traffic comparisons in real time -- that was a completely different world from before.
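The kinds of PromQL queries described here might look like the following -- the exact metric and label names depend on your instrumentation (http_server_requests_seconds_bucket is Micrometer's default histogram metric, but the status label and counter name below are assumptions):

```promql
# Average requests per second over the last 5 minutes
rate(http_requests_total[5m])

# p99 response time, computed from histogram buckets
histogram_quantile(0.99,
  sum(rate(http_server_requests_seconds_bucket[5m])) by (le))

# Error rate: share of 5xx responses over all responses
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```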

Week Three: Way Too Many Alerts

Connected Alertmanager and set up Slack webhook notifications. CPU above 80% for 5 minutes, memory above 90%, API error rate above 5%, p99 response time above 3 seconds.
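Two of those thresholds, expressed as Prometheus alerting rules, might look roughly like this -- a sketch using standard Node Exporter metrics; alert names, severities, and the exact expressions are assumptions:

```yaml
# alert-rules.yml -- sketch of two of the thresholds above; names are assumptions
groups:
  - name: server-health
    rules:
      - alert: HighCpuUsage
        # CPU usage = 100 minus the idle percentage, averaged per instance
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m                 # must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% for 5 minutes on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
```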

Alerts fired constantly at first -- the thresholds were too sensitive. Alert fatigue trains you to ignore alerts, which defeats the whole point. Two weeks of observing and tuning got us down to 0-2 alerts per day on average.

Problems We Never Knew About Started Showing Up

Things we discovered after adding monitoring: CPU was spiking to 90% every night at 3 AM. Turned out a cron batch job was running queries without indexes.

We also spotted a pattern where a specific API slowed down only on weekday afternoons. Root cause: external API call timeouts. No way we would have caught these without monitoring.

Total Cost: About $15/Month

Prometheus and Grafana run on our own server, so it's just the EC2 cost. Both fit on a single t3.small, roughly $15/month. Metric retention set to 30 days. For our scale, plenty.
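The 30-day retention mentioned above is set with a Prometheus startup flag; a sketch of the relevant launch command (the paths are assumptions):

```shell
# Prometheus startup -- retention flag from the post; config and data paths are assumptions
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.path=/prometheus
```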

No Going Back

Now we detect outages, on average, 5 minutes before users notice. Server health is visible at a glance on a single dashboard. Performance issues get analyzed with data instead of guesses.

Running a service without monitoring is like driving without a dashboard. The hardest part is getting started -- once it's set up, you never go back.
