DAY-4: Custom Metrics Instrumentation, Scrape Config, and ServiceMonitor
Observability is a shared responsibility among teams, where DevOps engineers play a key role in setting up the foundational stack to ensure seamless monitoring, visualization, and debugging capabilities. The commonly used observability stack includes:
Prometheus for monitoring
Grafana for visualization
EFK (Elasticsearch, Fluentd, Kibana) for log aggregation
Jaeger for distributed tracing
Instrumentation
Instrumentation is the process of adding monitoring capabilities (metrics, logs, and traces) to your applications.
Types of Metrics
Counter: A metric whose value only ever increases (it resets only when the application restarts).
ex: number of HTTP requests, number of user logins
Gauge: A metric whose value can go up or down.
ex: CPU utilization, memory utilization
Histogram: Provides a count and sum of observed values, along with configurable buckets.
ex: HTTP request duration - how many requests took more than 5ms, 10ms, ...
Summary: Similar to a histogram, but provides quantile estimates for observed values.
ex: get the 90th percentile request latency
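When scraped, each metric type appears in Prometheus' text exposition format roughly as in the sample below; the metric names and values here are illustrative, not necessarily the ones exposed by the demo application:
# counter: monotonically increasing total
http_requests_total{method="GET",code="200"} 1027
# gauge: current value that can go up or down
process_resident_memory_bytes 45056000
# histogram: cumulative buckets plus _sum and _count
http_request_duration_seconds_bucket{le="0.005"} 240
http_request_duration_seconds_bucket{le="0.01"} 310
http_request_duration_seconds_bucket{le="+Inf"} 500
http_request_duration_seconds_sum 12.7
http_request_duration_seconds_count 500
# summary: precomputed quantiles plus _sum and _count
rpc_duration_seconds{quantile="0.9"} 0.035
rpc_duration_seconds_sum 8.2
rpc_duration_seconds_count 500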
How does Prometheus know which applications to scrape metrics from?
Static Discovery: If you're manually choosing which services to scrape, you define them in the Prometheus configuration file (prometheus.yml), typically stored as a ConfigMap and mounted into the Prometheus pods, as in the sketch below.
Dynamic Discovery: If you're using a ServiceMonitor, Prometheus automatically discovers and scrapes services based on labels and selectors. This is especially useful in dynamic environments where services scale in and out.
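For reference, a minimal static scrape entry in prometheus.yml could look like this sketch; the job name, service DNS name, and port are placeholders for the example:
scrape_configs:
  - job_name: a-service            # label attached to the scraped samples
    metrics_path: /metrics
    static_configs:
      - targets:
          - a-service.dev.svc.cluster.local:3000   # placeholder service address and port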
Implementation
Verify that the EKS cluster with managed nodes is running and that the prometheus-stack is deployed in the cluster.
Create the dev namespace and deploy the k8s manifests using Kustomize as shown below.
Clone this repository to get the manifest files: https://github.com/iam-veeramalla/observability-zero-to-hero.git
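A typical sequence, assuming kubectl is already configured against the EKS cluster (the manifest directory below is a placeholder; use the actual path from the cloned repository):
git clone https://github.com/iam-veeramalla/observability-zero-to-hero.git
cd observability-zero-to-hero
kubectl create namespace dev
kubectl apply -k <path-to-kustomize-manifests>   # replace with the manifests directory in the repo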
Now, access the a-service service using the AWS ELB DNS name in your browser to verify that the service is accessible.
Try accessing the ELB DNS at /metrics to see whether the metrics from the service are exposed. This clearly shows that the application is emitting these metrics.
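You can also confirm this from a terminal (replace <ELB-DNS> with the load balancer hostname assigned to the service):
curl http://<ELB-DNS>/metrics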
Verify that the PromQL query retrieves data in the Prometheus UI.
As we can see, Prometheus is unable to return data for the PromQL query, even though the application has been instrumented to expose metrics.
Reason: We didn't update the scrape config with the new target. Instead of editing it manually, we will use a ServiceMonitor to let Prometheus identify the service endpoint to scrape the metrics from.
Here is the service-monitor.yaml file:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: a-service-service-monitor
    release: monitoring
  name: a-service-service-monitor
  namespace: monitoring
spec:
  jobLabel: job
  endpoints:
    - interval: 2s
      port: a-service-port
      path: /metrics
  selector:
    matchLabels:
      app: a-service
  namespaceSelector:
    matchNames:
      - dev
Once the ServiceMonitor is deployed in the cluster, it looks in the dev namespace and watches for services with the label app: a-service, scraping them at the /metrics endpoint.
Prometheus watches for ServiceMonitor objects and scrapes metrics only from the matching target services.
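Apply the ServiceMonitor and then check the Status -> Targets page in the Prometheus UI for the new a-service endpoint (the file name is assumed to match the manifest saved above):
kubectl apply -f service-monitor.yaml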
Now, execute the PromQL query and see the output graph
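For example, a rate query over a counter exposed by the application (the metric name here is illustrative; use whichever metric your application actually exposes):
rate(http_requests_total[5m])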
Alertmanager Configuration
Create and deploy an alertmanager-config.yaml file that specifies the alerts to be routed and the corresponding receiver details where they should be sent:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: main-rules-alert-config
  namespace: monitoring
  labels:
    release: monitoring
spec:
  route:
    repeatInterval: 30m
    receiver: 'null'
    routes:
      - matchers:
          - name: alertname
            value: HighCpuUsage
        receiver: 'send-email'
      - matchers:
          - name: alertname
            value: PodRestart
        receiver: 'send-email'
        repeatInterval: 5m
  receivers:
    - name: 'send-email'
      emailConfigs:
        - to: YOUR_EMAIL_ID
          from: YOUR_EMAIL_ID
          sendResolved: false
          smarthost: smtp.gmail.com:587
          authUsername: YOUR_EMAIL_ID
          authIdentity: YOUR_EMAIL_ID
          authPassword:
            name: mail-pass
            key: gmail-pass
    - name: 'null'
Here, we used email as the receiver, so the Gmail app password must be stored as a Kubernetes Secret:
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: mail-pass
  namespace: monitoring
  labels:
    release: monitoring
data:
  gmail-pass: <<ENTER_YOUR_APP_PASSWORD_IN_BASE64_ENCODED_FORMAT>>
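The value under data must be base64 encoded; one way to generate it (the -n flag avoids encoding a trailing newline):
echo -n '<YOUR_GMAIL_APP_PASSWORD>' | base64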
Here is the alerts.yaml file:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alert-rules
  namespace: monitoring
  labels:
    release: monitoring   # if you installed the stack through Helm, this must match the Helm release name, otherwise Prometheus will not pick up the rule
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: HighCpuUsage
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on instance {{ $labels.instance }}"
            description: "CPU usage is above 50% (current value: {{ $value }}%)"
        - alert: PodRestart
          expr: kube_pod_container_status_restarts_total > 2
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Pod restart detected in namespace {{ $labels.namespace }}"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times"
Once the AlertmanagerConfig, alert rules, and Secret files are deployed, verify their status.
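A quick way to check, assuming everything was created in the monitoring namespace:
kubectl get alertmanagerconfigs,prometheusrules,secrets -n monitoring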
Now, access the DNS endpoint at /crash; the application has already been set up to exit with an error at this route:
// Simulate a crash by exiting the process
app.get('/crash', (req, res) => {
  console.log('Intentionally crashing the server...');
  process.exit(1);
});
As a result, the pods restart multiple times and Alertmanager fires an alert to the configured email, as shown below.
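To follow the pod restarts while the alert condition builds up (the application is assumed to be running in the dev namespace):
kubectl get pods -n dev -w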