Kubernetes Metrics I Check Before Opening Grafana

January 30, 2026

People usually start caring about Kubernetes metrics after something breaks. That is backwards.

Even a small setup can teach you most of what matters in production. I will use a simple Iris ML API running on Kubernetes as the running example, but the point of this post is not the model. The point is the signals that tell you whether a cluster is healthy, whether traffic can flow, and whether scaling will hold under pressure.

Quick reference

If you want a compact checklist, here is what I watch first.

Category      | Key signal                                  | Why it matters
Orchestration | Desired replicas versus available replicas  | Tells you if the workload can converge
Connectivity  | Service endpoints                           | Tells you if traffic has anywhere to go
Stability     | Restart count and OOM kills                 | Tells you if you are silently losing capacity
Efficiency    | CPU throttling                              | Explains slow behavior even when CPU looks low
Memory        | Working set and limits                      | Predicts restarts and evictions
Node health   | Pressure conditions                         | Predicts scheduling failures and evictions
Storage       | Volume stats and mount errors               | Explains slow start and timeouts during loading

The setup I am referencing

This is intentionally small:

  1. An Iris inference API running in a Deployment
  2. Two replicas
  3. A Service that routes traffic to the pods
  4. Tested in Minikube, designed to map cleanly to a real cluster

If you can reason about metrics here, you can reason about them in larger systems.

Where to see these signals

You do not need a full observability stack to get value. Start with kubectl, then layer Prometheus later.

One note before we go deeper: Prometheus metric names can vary slightly based on exporter versions, cluster distribution, and runtime. Treat the metric names below as reliable starting points, not a strict contract.

Fast cluster view

kubectl get deploy
kubectl get pods -o wide
kubectl get svc
kubectl get endpoints
kubectl get nodes

When something does not converge

kubectl get events -A --sort-by=.lastTimestamp
kubectl describe deploy DEPLOYMENT_NAME
kubectl describe pod POD_NAME
kubectl describe node NODE_NAME

If the metrics server is present

kubectl top pods
kubectl top nodes

A short incident story that changed how I read metrics

In one deployment, we saw p95 latency climb while CPU usage looked fine. Endpoints were healthy. Pods were running. It was tempting to blame the application.

The root cause was CPU throttling. The CPU limits were set too low for the real burstiness of the workload. The kernel was enforcing the CFS quota derived from those limits, so inference threads were getting paused in short intervals. Nothing crashed, but the system felt slow and inconsistent.

That experience is why I look for gaps between intent and reality. The gaps tell you where the system is lying.

1. Desired and running pods

The first question I ask is simple: do the desired replicas match the running replicas?

If they do not match, Kubernetes is telling you the system cannot converge. That gap is the signal.

Common causes include:

  1. The scheduler cannot place pods because nodes are out of CPU or memory
  2. Image pulls are failing
  3. Pods are crashing or stuck starting
  4. Probes are failing and the workload never becomes stable

When desired and running do not align, do not start with application logs. Start with the cluster view.

Rule of thumb: If desired replicas and available replicas do not match, do not debug the app first. Debug scheduling and startup.

Prometheus signals to map this

If you have kube-state-metrics, these are the names I use most:

  1. kube_deployment_spec_replicas
  2. kube_deployment_status_replicas_available
  3. kube_deployment_status_replicas_unavailable
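
If you want the gap as a single number instead of two panels, a minimal PromQL sketch (assuming kube-state-metrics is scraped and using iris-api as a placeholder Deployment name) looks like this:

# iris-api is a placeholder deployment name; substitute your own
kube_deployment_spec_replicas{deployment="iris-api"}
  - kube_deployment_status_replicas_available{deployment="iris-api"}

Anything above zero for more than a couple of minutes means the Deployment is not converging.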

2. Availability, not just running

Kubernetes can show a pod as running while it is not actually ready to receive traffic. Availability is the more important signal.

For an ML API, availability often fails for reasons that look harmless in code but are painful in production:

  1. Model warm up takes longer than expected
  2. Dependencies are reachable but not ready
  3. Readiness checks are too strict or too loose

The mindset is simple: a running pod is a process, an available pod is capacity.

Rule of thumb: Running is not readiness. If traffic is failing, look at readiness and endpoints before anything else.

What I check when availability fails

kubectl get pods
kubectl describe pod POD_NAME
kubectl get endpoints SERVICE_NAME -o yaml

Prometheus names that align well:

  1. kube_pod_status_ready
  2. kube_pod_container_status_ready
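
A rough PromQL sketch for the same question, assuming kube-state-metrics and treating the namespace as a placeholder:

# pods currently reporting NotReady in the default namespace (placeholder)
sum(kube_pod_status_ready{condition="false", namespace="default"})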

3. Restart count

Restarts are a silent reliability tax. A pod that restarts is not stable capacity, even if it recovers quickly.

If restarts are above zero, I treat it as an early warning:

  1. Memory limits are too low and the container is being killed
  2. Requests are wrong and the node is under pressure
  3. The app has spikes that were not tested under load

For ML workloads this is common when inference creates short bursts of memory usage that look harmless until concurrency rises.

Rule of thumb: Restarts above zero in steady state mean you are losing capacity. That is already a scaling problem.

How to catch the cause quickly

kubectl get pods
kubectl describe pod POD_NAME
kubectl logs POD_NAME --previous

Prometheus names that align well:

  1. kube_pod_container_status_restarts_total
  2. kube_pod_container_status_last_terminated_reason
  3. container_memory_working_set_bytes
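
To spot restarts without watching kubectl output, a hedged PromQL starting point (the pod regex is a placeholder) is:

# containers that restarted at least once in the last hour
increase(kube_pod_container_status_restarts_total{pod=~"iris-api.*"}[1h]) > 0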

4. CPU requests and throttling

Kubernetes schedules based on requests, not actual usage.

If CPU limits are unrealistically low, you can end up with pods that exist but feel slow because the kernel keeps enforcing the CFS quota against them. This shows up as latency and timeouts long before it shows up as a crash.

If CPU requests are unrealistically high, you reduce bin packing efficiency and limit how many replicas can run at all.

The goal is honesty. Scaling works when requests are close to reality.

One subtle point: CPU throttling can happen even when average CPU usage looks low. CFS enforcement works in periods. You can get short bursts that hit the quota, then a pause, then another burst. That pattern is enough to break tail latency.

Smell test: Low CPU usage plus high latency is often throttling, not efficiency.

What I look for in Prometheus

These signals explain slow behavior even when the pod stays up:

  1. container_cpu_usage_seconds_total
  2. container_cpu_cfs_throttled_seconds_total
  3. kube_pod_container_resource_requests
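
A rough way to see throttling in PromQL, assuming the cAdvisor metrics above are available and treating the pod regex as a placeholder:

# seconds per second spent throttled, per container
rate(container_cpu_cfs_throttled_seconds_total{pod=~"iris-api.*"}[5m])

# compare against actual CPU usage over the same window
rate(container_cpu_usage_seconds_total{pod=~"iris-api.*"}[5m])

If the throttled rate is a meaningful fraction of the usage rate, the quota is doing real damage to tail latency.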

5. Memory usage and limits

Memory is not forgiving. CPU can throttle; memory kills.

If a container crosses its memory limit, it will be terminated and restarted. If the request is too low, it may be evicted when the node is under pressure.

For ML APIs, memory behavior is often shaped by:

  1. Model size and how it is loaded
  2. Libraries that allocate large buffers during inference
  3. Concurrency and request payload size

Treat memory limits as safety rails, not guesses.

If you see exit code 137 (SIGKILL) in the termination details, treat it as a strong signal that the container hit its memory limit and was OOM killed.

Rule of thumb: If you are surprised by memory behavior, you are missing a load pattern or a worst case request.

What I look for in Prometheus

  1. container_memory_working_set_bytes
  2. kube_pod_container_resource_limits
  3. kube_pod_container_status_last_terminated_reason
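
A hedged PromQL sketch for headroom, assuming both cAdvisor and kube-state-metrics are scraped and using iris-api as a placeholder container name:

# working set as a fraction of the configured memory limit
container_memory_working_set_bytes{container="iris-api"}
  / on(namespace, pod, container)
    kube_pod_container_resource_limits{container="iris-api", resource="memory"}

Anything that trends toward 1.0 under load is a restart waiting to happen.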

6. Service endpoints

Service endpoints are one of the most underused signals.

A Service can exist and pods can exist, but if endpoints are missing, traffic will not go anywhere. This catches issues that are easy to miss when you focus only on pod count:

  1. Labels do not match between pods and the Service selector
  2. Readiness never becomes healthy, so endpoints are not added
  3. A rollout is partially complete and only some pods are receiving traffic

If endpoints are zero, Kubernetes is being very direct with you.

Rule of thumb: If endpoints are empty, stop checking logs. Fix selectors or readiness first.

Where to debug it

kubectl get svc SERVICE_NAME -o yaml
kubectl get endpoints SERVICE_NAME -o wide
kubectl get pods --show-labels

Prometheus names that align well:

  1. kube_endpoint_address_available
  2. kube_endpoint_address_not_ready
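
The PromQL version of the same check, with iris-api as a placeholder for the Endpoints object name:

# fires when the Service has no ready addresses behind it
kube_endpoint_address_available{endpoint="iris-api"} == 0

On newer kube-state-metrics versions this pair may be replaced by kube_endpoint_address with a ready label, which is another reason to treat metric names as starting points rather than a contract.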

7. Node health and pressure

Even in a single node environment, node health is a real signal.

In a real cluster, pressure conditions change everything:

  1. Memory pressure can trigger eviction and instability
  2. Disk pressure can break image pulls and logging
  3. PID pressure can stop new workloads from starting

Healthy nodes produce honest scheduling behavior. Unhealthy nodes create confusing symptoms.

Where to see node pressure

kubectl describe node NODE_NAME
kubectl get pods -A --field-selector spec.nodeName=NODE_NAME
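
If kube-state-metrics is present, the same pressure conditions are queryable, which makes them easy to alert on:

# nodes currently reporting a pressure condition
kube_node_status_condition{condition=~"MemoryPressure|DiskPressure|PIDPressure", status="true"} == 1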

8. Control plane symptoms you can actually observe

You will not always have direct visibility into scheduler or API server internals, especially in managed clusters. But you can still observe control plane pain through symptoms:

  1. Pods stuck pending even when resources look fine
  2. Rollouts that take far longer than expected
  3. Commands that hang or time out

When replicas do not converge and node resources look healthy, the control plane becomes the likely suspect.

The internal causes are often one of these:

  1. API server latency spikes
  2. Scheduler throughput drops or queue buildup
  3. etcd saturation and slow writes

You may not see them directly, but naming them helps you recognize the pattern and avoid chasing random application logs.
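
When your cluster does expose API server metrics, a hedged PromQL sketch for the first symptom is request latency by verb; availability of this metric varies by distribution and managed offering:

# p99 API server request latency, per verb
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))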

9. Storage and I/O signals that hurt ML APIs

This is easy to ignore until you deploy a model that takes real time to load.

Even with a small Iris API, the same failure mode exists: startup depends on reading files and initializing libraries. In real systems that can involve container image pulls, volume mounts, and model downloads.

Symptoms usually look like:

  1. Readiness takes far longer than expected
  2. Pods churn because startup hits timeouts
  3. Nodes show disk pressure and image pulls fail

Where I start:

kubectl describe pod POD_NAME
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe node NODE_NAME

Prometheus names that help:

  1. kubelet_volume_stats_used_bytes
  2. kubelet_volume_stats_available_bytes
  3. storage_operation_duration_seconds
  4. storage_operation_errors_total
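
For the volume side, a rough PromQL sketch; kubelet_volume_stats_capacity_bytes is not in the list above but is usually exported alongside the other volume stats:

# how full each PersistentVolumeClaim is, as reported by the kubelet
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes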

A simple way to think about Kubernetes metrics

When I want clarity quickly, I reduce everything to three questions:

  1. Can the cluster place the workload?
  2. Can the workload receive traffic?
  3. Can the workload stay stable under load?

Desired versus running answers the first question.

Availability and endpoints answer the second question.

Restarts, CPU behavior, and memory behavior answer the third question.

A practical workflow when something feels off

This is the flow I use when an API feels slow or unreliable.

  1. Check deployment convergence and pod readiness
  2. Check endpoints to confirm traffic can route
  3. Check restarts and last termination reason
  4. Check CPU throttling and memory working set
  5. Check node pressure and recent events

In most incidents, you will get to the root cause before you ever open a dashboard.

Closing

Metrics are not dashboards. Metrics are questions your system answers under pressure.

If you build the habit of asking the right questions early, Kubernetes becomes predictable. Predictable systems scale.