DAY-4: Custom Metrics Instrumentation, Scrape Config, and ServiceMonitor

Observability is a shared responsibility among teams, where DevOps engineers play a key role in setting up the foundational stack to ensure seamless monitoring, visualization, and debugging capabilities. The commonly used observability stack includes:

  • Prometheus for monitoring

  • Grafana for visualization

  • EFK (Elasticsearch, Fluentd, Kibana) for log aggregation

  • Jaeger for distributed tracing

Instrumentation

Instrumentation is the process of adding monitoring capabilities to your applications (metrics, logs, and traces).

Types of Metrics

  1. Counter: A metric whose value only ever increases; it resets only when the application restarts.

    ex: number of HTTP requests, number of user logins

  2. Gauge: A metric whose value can go up or down.
    ex: CPU utilization and memory utilization

  3. Histogram: Provides a count and sum of observed values, along with configurable buckets.
    ex: HTTP request duration - how many requests completed within 5ms, 10ms, …

  4. Summary: Similar to a histogram, but provides quantile estimates over observed values.
    ex: the 90th percentile request latency.
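
To make these metric types concrete, here is a minimal instrumentation sketch using the Node.js prom-client library (the sample app in this repo is Express-based); the metric names, labels, and port are illustrative assumptions, not taken from the repo:

    const express = require('express');
    const client = require('prom-client'); // Prometheus client library for Node.js

    const app = express();
    const register = new client.Registry();
    client.collectDefaultMetrics({ register }); // default process/runtime metrics

    // Counter: value only increases (e.g. total HTTP requests served)
    const httpRequestsTotal = new client.Counter({
      name: 'http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'route', 'status'],
      registers: [register],
    });

    // Gauge: value can go up or down (e.g. active user sessions)
    const activeSessions = new client.Gauge({
      name: 'active_user_sessions',
      help: 'Number of currently active user sessions',
      registers: [register],
    });

    // Histogram: bucketed request durations
    const httpRequestDuration = new client.Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request duration in seconds',
      buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1],
      registers: [register],
    });

    // Summary: quantile estimates, e.g. 90th percentile latency
    const httpRequestSummary = new client.Summary({
      name: 'http_request_duration_summary_seconds',
      help: 'HTTP request duration summary in seconds',
      percentiles: [0.5, 0.9, 0.99],
      registers: [register],
    });

    // Expose all registered metrics at /metrics for Prometheus to scrape
    app.get('/metrics', async (req, res) => {
      res.set('Content-Type', register.contentType);
      res.end(await register.metrics());
    });

    app.listen(3000); // port is an assumption for this sketch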

How does Prometheus know which applications to scrape metrics from?

  • Static Discovery: If you're manually choosing which services to scrape, you add static targets to the Prometheus configuration file (prometheus.yml), typically stored as a ConfigMap and mounted into the Prometheus pod. A sample static scrape config is sketched after this list.

  • Dynamic Discovery: If you're using the Prometheus Operator's ServiceMonitor, Prometheus will automatically discover and scrape services based on labels and selectors. This is especially useful in dynamic environments where services scale in and out.
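
For static discovery, a scrape job in prometheus.yml could look like the sketch below; the job name, Service DNS name, and port are illustrative assumptions:

    scrape_configs:
      - job_name: 'a-service'                                  # illustrative job name
        metrics_path: /metrics
        static_configs:
          - targets: ['a-service.dev.svc.cluster.local:3000']  # assumed Service DNS name and port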

Implementation

  1. Verify that the EKS cluster with managed nodes is running and that the prometheus-stack is deployed in the cluster

  2. Create the dev namespace and deploy the k8s manifests using Kustomize as shown below (a command sketch follows this list)

    Clone this repository to get the manifest files: https://github.com/iam-veeramalla/observability-zero-to-hero.git

  3. Now, access the a-service using its AWS ELB DNS name in your browser to verify the service is reachable

  4. Access the ELB DNS name at /metrics to check whether the service exposes its metrics

    This confirms that the application is emitting these metrics.

  5. Verify that the PromQL query retrieves data in the Prometheus UI
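
A rough command sequence for steps 2-4; the kustomize directory path inside the cloned repo and the ELB DNS name are placeholders, not exact values from the repo:

    git clone https://github.com/iam-veeramalla/observability-zero-to-hero.git
    cd observability-zero-to-hero

    kubectl create namespace dev
    kubectl apply -k <path-to-kustomize-dir>   # directory containing kustomization.yaml (placeholder)

    kubectl get pods,svc -n dev                # confirm the workload and its LoadBalancer Service are up
    curl http://<ELB-DNS>/metrics              # should return metrics in Prometheus exposition format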

However, Prometheus is still unable to return any data for the PromQL query, even though the application has been instrumented to expose metrics.

Reason: we never added this target to the scrape config. Instead of editing prometheus.yml, we will use a ServiceMonitor to let Prometheus discover the service endpoint and scrape its metrics.

Here is the service-monitor.yaml file:


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: a-service-service-monitor
    release: monitoring
  name: a-service-service-monitor
  namespace: monitoring
spec:
  jobLabel: job
  endpoints:
    - interval: 2s
      port: a-service-port
      path: /metrics
  selector:
    matchLabels:
      app: a-service
  namespaceSelector:
    matchNames:
      - dev

Once the ServiceMonitor is deployed in the cluster, it tells Prometheus to look in the dev namespace and watch for Services labeled app: a-service, scraping the /metrics path on the port named a-service-port every 2 seconds.

Prometheus discovers ServiceMonitor objects that match its selector (here, the release: monitoring label) and scrapes metrics only from the corresponding target services.
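
To deploy and verify the ServiceMonitor (the Prometheus service name below is a placeholder, since it depends on how the Helm release was installed):

    kubectl apply -f service-monitor.yaml
    kubectl get servicemonitor -n monitoring     # the new ServiceMonitor should be listed

    # Optionally open the Prometheus UI and check Status -> Targets for the new a-service target
    kubectl port-forward -n monitoring svc/<prometheus-service> 9090:9090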

Now, execute the PromQL query again and view the output graph.
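
For example, if the application exposes a request counter like the one sketched earlier (the metric name is illustrative), a query such as the following should now return data:

    rate(http_requests_total[5m])   # per-second request rate over the last 5 minutes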

Alertmanager configuration

  • Create and deploy an alertmanager-config.yaml file that specifies which alerts should be routed and the receiver they should be sent to.

    
      apiVersion: monitoring.coreos.com/v1alpha1
      kind: AlertmanagerConfig
      metadata:
        name: main-rules-alert-config
        namespace: monitoring
        labels:
          release: monitoring
      spec:
        route:
          repeatInterval: 30m
          receiver: 'null'
          routes:
          - matchers:
            - name: alertname
              value: HighCpuUsage
            receiver: 'send-email'
          - matchers:
            - name: alertname
              value: PodRestart
            receiver: 'send-email'
            repeatInterval: 5m
        receivers:
        - name: 'send-email'
          emailConfigs:
          - to: YOUR_EMAIL_ID
            from: YOUR_EMAIL_ID
            sendResolved: false
            smarthost: smtp.gmail.com:587
            authUsername: YOUR_EMAIL_ID
            authIdentity: YOUR_EMAIL_ID
            authPassword:
              name: mail-pass
              key: gmail-pass
        - name: 'null'
    
  • Here, the receiver is email, so the Gmail app password must be stored as a Kubernetes Secret

    
      apiVersion: v1
      kind: Secret
      type: Opaque
      metadata:
        name: mail-pass
        namespace: monitoring
        labels:
          release: monitoring
      data:
        gmail-pass: <<ENTER_YOUR_APP_PASSWORD_IN_BASE64_ENCODED_FORMAT>>
    
  • Here is the alerts.yaml file:

      apiVersion: monitoring.coreos.com/v1
      kind: PrometheusRule
      metadata:
        name: custom-alert-rules
        namespace: monitoring
        labels:
          release: monitoring # if you installed the stack through Helm, this must match the Helm release name, otherwise Prometheus will not recognize this rule
      spec:
        groups:
        - name: custom.rules
          rules:
          - alert: HighCpuUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 50
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on instance {{ $labels.instance }}"
              description: "CPU usage is above 50% (current value: {{ $value }}%)"
          - alert: PodRestart
            expr: kube_pod_container_status_restarts_total > 2
            for: 0m
            labels:
              severity: critical
            annotations:
              summary: "Pod restart detected in namespace {{ $labels.namespace }}"
              description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted {{ $value }} times"
    
  • Once the AlertmanagerConfig, PrometheusRule, and Secret files are deployed, verify their status as sketched below
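
    A command sketch for this verification step; the Secret manifest file name is a placeholder:

      # Base64-encode the Gmail app password for the Secret manifest
      echo -n '<your-gmail-app-password>' | base64

      kubectl apply -f alertmanager-config.yaml
      kubectl apply -f alerts.yaml
      kubectl apply -f <secret-manifest>.yaml      # file name is a placeholder

      kubectl get alertmanagerconfig,prometheusrule,secret -n monitoring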

  • Now, access the DNS endpoint at /crash; the application has already been set up to crash intentionally at this route

    
      // Simulate a crash by exiting the process
      app.get('/crash', (req, res) => {
          console.log('Intentionally crashing the server...');
          process.exit(1);
      });
    

    As a result, the pod restarts multiple times, the PodRestart rule fires, and Alertmanager sends an alert to the configured email as shown below