DAY-4: Custom Metrics Instrumentation, Scrape Config, and ServiceMonitor

Observability is a shared responsibility among teams, where DevOps engineers play a key role in setting up the foundational stack to ensure seamless monitoring, visualization, and debugging capabilities. The commonly used observability stack includes:

  • Prometheus for monitoring

  • Grafana for visualization

  • EFK (Elasticsearch, Fluentd, Kibana) for log aggregation

  • Jaeger for distributed tracing

Instrumentation

Instrumentation is the process of adding monitoring capabilities to your applications (metrics, logs, and traces).

Types of Metrics

  1. Counter: A metric whose value only ever increases; it resets only when the application restarts.

    ex: number of HTTP requests, number of user logins

  2. Gauge: A metric whose value can go up or down.
    ex: CPU utilization and memory utilization

  3. Histogram: Provides a count and sum of observed values, along with configurable buckets.
    ex: HTTP request duration - how many requests completed within 5ms, 10ms, …

  4. Summary: Similar to a histogram, but provides quantile estimates over observed values.
    ex: the 90th percentile request latency.
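
To make these metric types concrete, here is a minimal instrumentation sketch using the Node.js prom-client library (the sample app in this repo is Express-based); the metric names, labels, and port are illustrative assumptions, not taken from the repo:

    const express = require('express');
    const client = require('prom-client'); // Prometheus client library for Node.js

    const app = express();
    const register = new client.Registry();
    client.collectDefaultMetrics({ register }); // default process/runtime metrics

    // Counter: value only increases (e.g. total HTTP requests served)
    const httpRequestsTotal = new client.Counter({
      name: 'http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'route', 'status'],
      registers: [register],
    });

    // Gauge: value can go up or down (e.g. active user sessions)
    const activeSessions = new client.Gauge({
      name: 'active_user_sessions',
      help: 'Number of currently active user sessions',
      registers: [register],
    });

    // Histogram: bucketed request durations
    const httpRequestDuration = new client.Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request duration in seconds',
      buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1],
      registers: [register],
    });

    // Summary: quantile estimates, e.g. 90th percentile latency
    const httpRequestSummary = new client.Summary({
      name: 'http_request_duration_summary_seconds',
      help: 'HTTP request duration summary in seconds',
      percentiles: [0.5, 0.9, 0.99],
      registers: [register],
    });

    // Expose all registered metrics at /metrics for Prometheus to scrape
    app.get('/metrics', async (req, res) => {
      res.set('Content-Type', register.contentType);
      res.end(await register.metrics());
    });

    app.listen(3000); // port is an assumption for this sketch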

How does Prometheus know which applications to scrape metrics from?

  • Static Discovery: If you're manually choosing which services to scrape, you add static targets to the Prometheus configuration file (prometheus.yml), typically stored as a ConfigMap and mounted into the Prometheus pod. A sample static scrape config is sketched after this list.

  • Dynamic Discovery: If you're using the Prometheus Operator's ServiceMonitor, Prometheus will automatically discover and scrape services based on labels and selectors. This is especially useful in dynamic environments where services scale in and out.
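
For static discovery, a scrape job in prometheus.yml could look like the sketch below; the job name, Service DNS name, and port are illustrative assumptions:

    scrape_configs:
      - job_name: 'a-service'                                  # illustrative job name
        metrics_path: /metrics
        static_configs:
          - targets: ['a-service.dev.svc.cluster.local:3000']  # assumed Service DNS name and port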

Implementation

  1. Verify that the EKS cluster with managed nodes is running and that the prometheus-stack is deployed in the cluster

  2. Create the dev namespace and deploy the k8s manifests using Kustomize as shown below (a command sketch follows this list)

    Clone this repository to get the manifest files: https://github.com/iam-veeramalla/observability-zero-to-hero.git

  3. Now, access the a-service using its AWS ELB DNS name in your browser to verify the service is reachable

  4. Access the ELB DNS name at /metrics to check whether the service exposes its metrics

    This confirms that the application is emitting these metrics.

  5. Verify that the PromQL query retrieves data in the Prometheus UI
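
A rough command sequence for steps 2-4; the kustomize directory path inside the cloned repo and the ELB DNS name are placeholders, not exact values from the repo:

    git clone https://github.com/iam-veeramalla/observability-zero-to-hero.git
    cd observability-zero-to-hero

    kubectl create namespace dev
    kubectl apply -k <path-to-kustomize-dir>   # directory containing kustomization.yaml (placeholder)

    kubectl get pods,svc -n dev                # confirm the workload and its LoadBalancer Service are up
    curl http://<ELB-DNS>/metrics              # should return metrics in Prometheus exposition format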

However, Prometheus is still unable to return any data for the PromQL query, even though the application has been instrumented to expose metrics.

Reason: we never added this target to the scrape config. Instead of editing prometheus.yml, we will use a ServiceMonitor to let Prometheus discover the service endpoint and scrape its metrics.

Here is the service-monitor.yaml file:


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: a-service-service-monitor
    release: monitoring
  name: a-service-service-monitor
  namespace: monitoring
spec:
  jobLabel: job
  endpoints:
    - interval: 2s
      port: a-service-port
      path: /metrics
  selector:
    matchLabels:
      app: a-service
  namespaceSelector:
    matchNames:
      - dev

Once the ServiceMonitor is deployed in the cluster, it tells Prometheus to look in the dev namespace and watch for Services labeled app: a-service, scraping the /metrics path on the port named a-service-port every 2 seconds.

Prometheus discovers ServiceMonitor objects that match its selector (here, the release: monitoring label) and scrapes metrics only from the corresponding target services.
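
To deploy and verify the ServiceMonitor (the Prometheus service name below is a placeholder, since it depends on how the Helm release was installed):

    kubectl apply -f service-monitor.yaml
    kubectl get servicemonitor -n monitoring     # the new ServiceMonitor should be listed

    # Optionally open the Prometheus UI and check Status -> Targets for the new a-service target
    kubectl port-forward -n monitoring svc/<prometheus-service> 9090:9090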

Now, execute the PromQL query again and view the output graph.
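
For example, if the application exposes a request counter like the one sketched earlier (the metric name is illustrative), a query such as the following should now return data:

    rate(http_requests_total[5m])   # per-second request rate over the last 5 minutes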

Alertmanager configuration

  • Create and deploy an alertmanager-config.yaml file that specifies which alerts should be routed and the receiver they should be sent to.

    
      apiVersion: monitoring.coreos.com/v1alpha1
      kind: AlertmanagerConfig
      metadata:
        name: main-rules-alert-config
        namespace: monitoring
        labels:
          release: monitoring
      spec:
        route:
          repeatInterval: 30m
          receiver: 'null'
          routes:
          - matchers:
            - name: alertname
              value: HighCpuUsage
            receiver: 'send-email'
          - matchers:
            - name: alertname
              value: PodRestart
            receiver: 'send-email'
            repeatInterval: 5m
        receivers:
        - name: 'send-email'
          emailConfigs:
          - to: YOUR_EMAIL_ID
            from: YOUR_EMAIL_ID
            sendResolved: false
            smarthost: smtp.gmail.com:587
            authUsername: YOUR_EMAIL_ID
            authIdentity: YOUR_EMAIL_ID
            authPassword:
              name: mail-pass
              key: gmail-pass
        - name: 'null'
    
  • Here, the receiver is email, so the Gmail app password must be stored as a Kubernetes Secret

    
      apiVersion: v1
      kind: Secret
      type: Opaque
      metadata:
        name: mail-pass
        namespace: monitoring
        labels:
          release: monitoring
      data:
        gmail-pass: <<ENTER_YOUR_APP_PASSWORD_IN_BASE64_ENCODED_FORMAT>>
    
  • Here is the alerts.yaml file:

      apiVersion: monitoring.coreos.com/v1
      kind: PrometheusRule
      metadata:
        name: custom-alert-rules
        namespace: monitoring
        labels:
          release: monitoring # if you installed the stack through Helm, this must match the Helm release name, otherwise Prometheus will not recognize this rule
      spec:
        groups:
        - name: custom.rules
          rules:
          - alert: HighCpuUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on instance {{ $labels.instance }}"
              description: "CPU usage is above 50% (current value: {{ $value }}%)"
          - alert: PodRestart
            expr: kube_pod_container_status_restarts_total > 2
            for: 0m
            labels:
              severity: critical
            annotations:
              summary: "Pod restart detected in namespace {{ $labels.namespace }}"
              description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times"
    
  • Once the AlertmanagerConfig, PrometheusRule, and Secret files are deployed, verify their status as sketched below
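
    A command sketch for this verification step; the Secret manifest file name is a placeholder:

      # Base64-encode the Gmail app password for the Secret manifest
      echo -n '<your-gmail-app-password>' | base64

      kubectl apply -f alertmanager-config.yaml
      kubectl apply -f alerts.yaml
      kubectl apply -f <secret-manifest>.yaml      # file name is a placeholder

      kubectl get alertmanagerconfig,prometheusrule,secret -n monitoring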

  • Now, access the DNS endpoint at /crash; the application has already been set up to crash intentionally at this route

    
      // Simulate a crash by exiting the process
      app.get('/crash', (req, res) => {
          console.log('Intentionally crashing the server...');
          process.exit(1);
      });
    

    As a result, the pod restarts multiple times, the PodRestart rule fires, and Alertmanager sends an alert to the configured email as shown below