DAY-3: Prometheus Hands-on Explanation

After installing the Prometheus, Grafana, and Alertmanager stack in the EKS cluster, verify that the services are running. They are all ClusterIP services, so they are reachable only from within the cluster network.
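
One way to run this check is sketched below. The namespace name `monitoring` and the helper function names are assumptions for illustration; substitute the namespace used during installation.

```shell
# Assumption: the stack was installed into a namespace named "monitoring".
# Override with: NAMESPACE=<your-namespace>
NAMESPACE="${NAMESPACE:-monitoring}"

# List the services with their types, cluster IPs and ports.
list_stack_services() {
  kubectl get svc --namespace "$NAMESPACE" -o wide
}

# Print only the name, type, cluster IP and first port of each service.
list_stack_endpoints() {
  kubectl get svc --namespace "$NAMESPACE" \
    -o custom-columns='NAME:.metadata.name,TYPE:.spec.type,CLUSTER-IP:.spec.clusterIP,PORT:.spec.ports[0].port'
}
```

Calling `list_stack_services` should show every service with type ClusterIP and no external IP, which is why the endpoints have to be reached from inside the cluster.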

Metric Endpoint Verification

  • SSH into a node and curl <service IP>:<port>/metrics

    • Node Exporter: gathers node-level metrics

    • kube-state-metrics: collects Kubernetes resource-level metrics
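
The check above might look like the following sketch when run from a node. The helper names and the IP addresses in the usage comments are placeholders, and the ports are the usual defaults (9100 for Node Exporter, 8080 for kube-state-metrics); confirm the real values with `kubectl get svc`.

```shell
# Build the metrics URL for a given service IP and port.
metrics_url() {
  printf 'http://%s:%s/metrics\n' "$1" "$2"
}

# Fetch a /metrics endpoint and show the first N lines (default 20).
peek_metrics() {
  curl --silent --show-error --max-time 5 "$(metrics_url "$1" "$2")" | head -n "${3:-20}"
}

# Example usage from a node inside the cluster (placeholder IPs):
#   peek_metrics 10.100.0.10 9100   # Node Exporter
#   peek_metrics 10.100.0.11 8080   # kube-state-metrics
```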

kube-state-metrics

  • Exposed at the /metrics endpoint

  • Sample metrics:

      kube_pod_container_status_restarts_total
      kube_pod_container_status_restarts_total{namespace="default"}
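
To make the label filter concrete, here is a small sketch that picks the restart counter for one namespace out of raw /metrics text. The sample lines are illustrative, not real output from a cluster, and the function names are hypothetical.

```shell
# Illustrative sample in the Prometheus text exposition format (not real output).
sample_metrics() {
  cat <<'EOF'
# TYPE kube_pod_container_status_restarts_total counter
kube_pod_container_status_restarts_total{namespace="default",pod="busybox",container="busybox"} 4
kube_pod_container_status_restarts_total{namespace="kube-system",pod="coredns-abc",container="coredns"} 0
EOF
}

# Keep only the restart-counter samples for one namespace; comment lines are skipped.
restarts_in_namespace() {
  grep '^kube_pod_container_status_restarts_total{' | grep -F "namespace=\"$1\""
}

sample_metrics | restarts_in_namespace default
```

Against a live endpoint, the same filter can be fed from curl instead of `sample_metrics`.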
    

Practical Demonstration

Creating a Crash-Loop Pod

kubectl run busybox --image=busybox -- /bin/sh -c "exit 1"
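
Because `kubectl run` creates the pod with restart policy Always by default, the container exits with status 1 and is restarted repeatedly. A minimal sketch for observing this follows; the function names are hypothetical.

```shell
# Watch the pod status change as the container keeps exiting (Ctrl-C to stop).
watch_busybox() {
  kubectl get pod busybox --watch
}

# Print the restart count that the API server reports for the first container.
busybox_restart_count() {
  kubectl get pod busybox -o jsonpath='{.status.containerStatuses[0].restartCount}'
}
```

The number printed by `busybox_restart_count` is the value that kube-state-metrics reports as kube_pod_container_status_restarts_total for this pod.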

Metrics Collection Flow

When we run a command such as kubectl run, the request goes to the API server. The scheduler then assigns the pod to a node, and the kubelet on that node starts it. kube-state-metrics continuously watches the API server, derives metrics about the pod, and exposes them at its /metrics endpoint in a format that Prometheus understands. Prometheus scrapes that endpoint and stores the samples. A PromQL query is then sent to the Prometheus HTTP server to retrieve specific metrics data.
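
The last step of this flow can be sketched with the Prometheus HTTP API, which serves instant queries at /api/v1/query. The address `http://localhost:9090` is an assumption (for example, reached through a `kubectl port-forward` to the Prometheus service, whose name depends on the installation), and `prom_query` is a hypothetical helper.

```shell
# Assumption: Prometheus is reachable at localhost:9090, e.g. through a port-forward.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query against the Prometheus HTTP API.
prom_query() {
  curl --silent --show-error --get "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$1"
}

# Restart counter for a pod named busybox in the default namespace.
RESTART_QUERY='kube_pod_container_status_restarts_total{namespace="default",pod="busybox"}'

# Example usage:
#   prom_query "$RESTART_QUERY"
```

The response is JSON with a `status` field and a `data` field that holds the matching series and their current values.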

Grafana

Add Prometheus as a data source in Grafana for better visualization. Authentication and authorization can also be set up by integrating Grafana with IAM for access control.

  1. Sum Up All CPU Usage: This PromQL query aggregates the CPU usage across all nodes.

  2. Average Memory Usage per Namespace: This query provides the average memory usage grouped by namespace.

  3. Pod Container Restarts: This query gives the total number of container restarts.

  4. Memory Utilization of Nodes
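
The notes do not give the queries themselves, so the expressions below are only one plausible way to write them, assuming the standard Node Exporter, cAdvisor, and kube-state-metrics metric names. The panel keys and the `panel_query` helper are hypothetical.

```shell
# Print a candidate PromQL expression for a named panel (panel names are hypothetical).
panel_query() {
  case "$1" in
    cpu-total)        echo 'sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))' ;;
    mem-by-namespace) echo 'avg by (namespace) (container_memory_working_set_bytes{container!=""})' ;;
    restarts)         echo 'sum by (namespace, pod) (kube_pod_container_status_restarts_total)' ;;
    node-mem-percent) echo '100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)' ;;
    *) echo "unknown panel: $1" >&2; return 1 ;;
  esac
}

# Print every expression so it can be pasted into a Grafana panel's query editor.
for panel in cpu-total mem-by-namespace restarts node-mem-percent; do
  printf '%-18s %s\n' "$panel" "$(panel_query "$panel")"
done
```

The CPU expression counts non-idle CPU seconds per second over a 5-minute window, and the memory utilization expression yields a percentage per node; adjust the window and label filters to suit the dashboard.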