DAY-3: Prometheus - Hands-on Explanation

After installing the Prometheus, Grafana, and Alertmanager stack in the EKS cluster, verify that the services are running. They are all ClusterIP services, so they are reachable only from within the cluster network.

Metric Endpoint Verification

SSH into a node and curl <service-IP>:<port>/metrics for each exporter:
- Node Exporter: gathers node-level metrics
- kube-state-metrics: collects Kubernetes resource-level metrics

kube-state-metrics

Its metrics are exposed at the /metrics endpoint.

Sample metrics:

  kube_pod_container_status_restarts_total
  kube_pod_container_status_restarts_total{namespace="default"}
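
To make the two sample selectors concrete, here is a minimal Python sketch that parses a small excerpt of /metrics text and filters it by the namespace label, much as the second expression does. The excerpt, pod names, and values are made up for illustration, and the parser is deliberately simplified (it does not handle escaped quotes inside label values).

```python
import re

# Hypothetical excerpt of kube-state-metrics output; pods and values are invented.
SAMPLE = """\
# TYPE kube_pod_container_status_restarts_total counter
kube_pod_container_status_restarts_total{namespace="default",pod="busybox",container="busybox"} 4
kube_pod_container_status_restarts_total{namespace="kube-system",pod="coredns-abc",container="coredns"} 0
"""

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def select(samples, name, **label_filters):
    """Keep samples with the given metric name whose labels equal every filter."""
    return [
        s for s in samples
        if s[0] == name and all(s[1].get(k) == v for k, v in label_filters.items())
    ]


if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    for sample in select(parsed, "kube_pod_container_status_restarts_total",
                         namespace="default"):
        print(sample)
```

Calling select without label filters corresponds to the bare metric name; passing namespace="default" corresponds to the selector in braces.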

Practical Demonstration

Creating a Crash-Loop Pod

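# The container exits with status 1 right away; with the default restartPolicy (Always)
# the kubelet keeps restarting it, so the pod ends up in CrashLoopBackOff.
kubectl run busybox --image=busybox -- /bin/sh -c "exit 1"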
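
For reference, the same pod can be described declaratively. The sketch below builds a Pod manifest that should be roughly equivalent to the command above and prints it as JSON, which kubectl accepts as well as YAML. The function name crash_loop_pod is made up for this example.

```python
import json


def crash_loop_pod(name="busybox", image="busybox"):
    """Build a Pod manifest roughly equivalent to the kubectl run command."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"run": name}},
        "spec": {
            # Always is the default for kubectl run; it is what makes the
            # failing container restart over and over.
            "restartPolicy": "Always",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Everything after "--" in kubectl run becomes container args.
                    "args": ["/bin/sh", "-c", "exit 1"],
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(crash_loop_pod(), indent=2))
```

Saved under a hypothetical file name such as crash_pod.py, the output can be piped into the cluster with: python crash_pod.py | kubectl apply -f -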

Metrics Collection Flow

When we run a command such as kubectl run, the request goes to the API server. The scheduler then assigns the pod to a node, and the kubelet on that node starts it. kube-state-metrics continuously watches the API server, derives metrics about the pod, and exposes them at its /metrics endpoint in a format Prometheus understands. Prometheus scrapes that endpoint, and PromQL queries sent to the Prometheus HTTP server retrieve specific metric data.
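
The last step of this flow, querying the Prometheus HTTP server, can be sketched in Python using the instant-query endpoint /api/v1/query. This is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (for example through a port-forward); the base URL and the function names are assumptions, not part of the setup described here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_query(base_url, promql, timeout=10):
    """Run the query and return the decoded JSON response (needs network access)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)


def extract_values(response):
    """Turn an instant-vector response into a list of (labels, value) pairs."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    # Each result item carries its labels and a [timestamp, "value"] pair.
    return [
        (item["metric"], float(item["value"][1]))
        for item in response["data"]["result"]
    ]


if __name__ == "__main__":
    # Assumed address; adjust to wherever Prometheus is exposed.
    query = 'kube_pod_container_status_restarts_total{namespace="default"}'
    for labels, value in extract_values(instant_query("http://localhost:9090", query)):
        print(labels.get("pod"), value)
```

extract_values only handles instant-vector results, which is what a plain metric selector returns.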

Grafana

Add Prometheus as a data source in Grafana for better visualization. Grafana can also be integrated with IAM to set up authentication and authorization for access control.
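
Besides the UI, a data source can be registered through Grafana's HTTP API by POSTing a JSON body to /api/datasources. The sketch below only builds that body. The Prometheus URL shown is a placeholder, because the actual service name and namespace depend on how the stack was installed.

```python
import json


def prometheus_datasource(url, name="Prometheus"):
    """Build the JSON body for registering Prometheus as a Grafana data source."""
    return {
        "name": name,
        "type": "prometheus",
        # "proxy" means the Grafana server, not the browser, talks to Prometheus,
        # which suits a ClusterIP service that is only reachable inside the cluster.
        "access": "proxy",
        "url": url,
        "isDefault": True,
    }


if __name__ == "__main__":
    # Placeholder in-cluster address; replace with the real service DNS name.
    body = prometheus_datasource("http://prometheus.monitoring.svc:9090")
    print(json.dumps(body, indent=2))
```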

Sum Up All CPU Usage: This PromQL query, run in Grafana, aggregates CPU usage across all nodes.

Average Memory Usage per Namespace: This query provides the average memory usage grouped by namespace.

Pod Container Restarts: This query gives the total number of container restarts.

Memory Utilization of Nodes: This query shows how much memory each node is using.
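
The query expressions themselves are not reproduced in these notes. As a hedged illustration, the sketch below pairs each description with one plausible PromQL expression built from standard Node Exporter, cAdvisor, and kube-state-metrics metric names. These are assumptions about what such queries could look like, not necessarily the ones used in the session, and the 5m rate window is an arbitrary choice.

```python
# One plausible PromQL expression per query described above (assumed, not original).
QUERIES = {
    # CPU cores in use across all nodes, ignoring idle time.
    "Sum up all CPU usage":
        'sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))',
    # Mean container memory usage, grouped by namespace.
    "Average memory usage per namespace":
        "avg by (namespace) (container_memory_usage_bytes)",
    # Restart count summed over every container in the cluster.
    "Pod container restarts":
        "sum(kube_pod_container_status_restarts_total)",
    # Percentage of memory in use on each node.
    "Memory utilization of nodes (%)":
        "100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)",
}


def describe(queries):
    """Return one aligned 'title  expression' line per query."""
    width = max(len(title) for title in queries)
    return [f"{title.ljust(width)}  {expr}" for title, expr in queries.items()]


if __name__ == "__main__":
    print("\n".join(describe(QUERIES)))
```

Each expression can be pasted into the query field of a Grafana panel that uses the Prometheus data source, or into the Prometheus expression browser.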