Kubernetes Metrics I Check Before Opening Grafana
People usually start caring about Kubernetes metrics after something breaks. That is backwards.
Even a small setup can teach you most of what matters in production. I will use a simple Iris ML API running on Kubernetes as the running example, but the point of this post is not the model. The point is the signals that tell you whether a cluster is healthy, whether traffic can flow, and whether scaling will hold under pressure.
Quick reference
If you want a compact checklist, here is what I watch first.
| Category | Key signal | Why it matters |
|---|---|---|
| Orchestration | Desired replicas versus available replicas | Tells you if the workload can converge |
| Connectivity | Service endpoints | Tells you if traffic has anywhere to go |
| Stability | Restart count and OOM kills | Tells you if you are silently losing capacity |
| Efficiency | CPU throttling | Explains slow behavior even when CPU looks low |
| Memory | Working set and limits | Predicts restarts and evictions |
| Node health | Pressure conditions | Predicts scheduling failures and evictions |
| Storage | Volume stats and mount errors | Explains slow start and timeouts during loading |
The setup I am referencing
This is intentionally small:
- An Iris inference API running in a Deployment
- Two replicas
- A Service that routes traffic to the pods
- Tested in Minikube, designed to map cleanly to a real cluster
If you can reason about metrics here, you can reason about them in larger systems.
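For concreteness, here is roughly what that setup looks like as manifests. This is a minimal sketch, not the exact files from the project; the image name, port, and labels are placeholders I am assuming for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: iris-api
  template:
    metadata:
      labels:
        app: iris-api            # must match the Service selector below
    spec:
      containers:
        - name: api
          image: iris-api:latest # placeholder image name
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: iris-api
spec:
  selector:
    app: iris-api                # endpoints only appear when this matches pod labels
  ports:
    - port: 80
      targetPort: 8000
```

The label and selector pairing is worth noticing now, because it comes back when we talk about endpoints.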
Where to see these signals
You do not need a full observability stack to get value. Start with kubectl, then layer Prometheus later.
One note before we go deeper: Prometheus metric names can vary slightly based on exporter versions, cluster distribution, and runtime. Treat the metric names below as reliable starting points, not a strict contract.
Fast cluster view
kubectl get deploy
kubectl get pods -o wide
kubectl get svc
kubectl get endpoints
kubectl get nodes
When something does not converge
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe deploy DEPLOYMENT_NAME
kubectl describe pod POD_NAME
kubectl describe node NODE_NAME
If metrics server is present
kubectl top pods
kubectl top nodes
A short incident story that changed how I read metrics
In one deployment, we saw p95 latency climb while CPU usage looked fine. Endpoints were healthy. Pods were running. It was tempting to blame the application.
The root cause was CPU throttling. The CPU limit was set too low for the real burstiness of the workload. The kernel was enforcing the CFS quota, so inference threads were getting paused in short intervals. Nothing crashed, but the system felt slow and inconsistent.
That experience is why I look for gaps between intent and reality. The gaps tell you where the system is lying.
1. Desired and running pods
The first question I ask is simple: do the desired replicas match the running replicas?
If they do not match, Kubernetes is telling you the system cannot converge. That gap is the signal.
Common causes include:
- The scheduler cannot place pods because nodes are out of CPU or memory
- Image pulls are failing
- Pods are crashing or stuck starting
- Probes are failing and the workload never becomes stable
When desired and running do not align, do not start with application logs. Start with the cluster view.
Rule of thumb: If desired replicas and available replicas do not match, do not debug the app first. Debug scheduling and startup.
Prometheus signals to map this
If you have kube-state-metrics, these are the names I use most:
- kube_deployment_spec_replicas
- kube_deployment_status_replicas_available
- kube_deployment_status_replicas_unavailable
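If you want to turn the gap itself into a signal, a minimal alerting rule built on these names looks like this. It is a sketch; the deployment name and the ten minute window are assumptions, not recommendations.

```yaml
groups:
  - name: deployment-convergence
    rules:
      - alert: DeploymentNotConverged
        # Desired replicas have not matched available replicas for 10 minutes.
        expr: |
          kube_deployment_spec_replicas{deployment="iris-api"}
            != kube_deployment_status_replicas_available{deployment="iris-api"}
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.deployment }} has not converged for 10 minutes"
```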
2. Availability, not just running
Kubernetes can show a pod as running while it is not actually ready to receive traffic. Availability is the more important signal.
For an ML API, availability often fails for reasons that look normal in code but are painful in production:
- Model warm up takes longer than expected
- Dependencies are reachable but not ready
- Readiness checks are too strict or too loose
The mindset is simple: a running pod is a process, an available pod is capacity.
Rule of thumb: Running is not readiness. If traffic is failing, look at readiness and endpoints before anything else.
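In practice this usually comes down to the readiness probe. Here is a hedged sketch of the container from the earlier Deployment with a probe that leaves room for model warm up; the path, port, and timings are assumptions you would tune against your own startup profile.

```yaml
containers:
  - name: api
    image: iris-api:latest     # placeholder image name
    ports:
      - containerPort: 8000
    readinessProbe:
      httpGet:
        path: /healthz         # assumed health endpoint
        port: 8000
      initialDelaySeconds: 15  # give the model time to load before the first check
      periodSeconds: 5
      failureThreshold: 3      # ~15s of failed checks before the pod leaves endpoints
```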
What I check when availability fails
kubectl get pods
kubectl describe pod POD_NAME
kubectl get endpoints SERVICE_NAME -o yaml
Prometheus names that align well:
- kube_pod_status_ready
- kube_pod_container_status_ready
3. Restart count
Restarts are a silent reliability tax. A pod that restarts is not stable capacity, even if it recovers quickly.
If restarts are above zero, I treat it as an early warning:
- Memory limits are too low and the container is being killed
- Requests are wrong and the node is under pressure
- The app has spikes that were not tested under load
For ML workloads this is common when inference creates short bursts of memory usage that look harmless until concurrency rises.
Rule of thumb: A restart count above zero in steady state means you are losing capacity. It is already a scaling problem.
How to catch the cause quickly
kubectl get pods
kubectl describe pod POD_NAME
kubectl logs POD_NAME --previous
Prometheus names that align well:
- kube_pod_container_status_restarts_total
- kube_pod_container_status_last_terminated_reason
- container_memory_working_set_bytes
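As a sketch, the first of those names converts directly into an early warning rule. The one hour window and five minute hold are arbitrary starting points, not a standard.

```yaml
groups:
  - name: pod-stability
    rules:
      - alert: ContainerRestarting
        # Any restart within the last hour in steady state is worth a look.
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted in the last hour"
```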
4. CPU requests, limits, and throttling
Kubernetes schedules based on requests, not actual usage.
If CPU requests are unrealistically low, the scheduler packs pods onto nodes that cannot really sustain them, and a tight CPU limit throttles them on top of that. You end up with pods that exist but feel slow. This shows up as latency and timeouts long before it shows up as a crash.
If CPU requests are unrealistically high, you reduce bin packing efficiency and limit how many replicas can run at all.
The goal is honesty. Scaling works when requests are close to reality.
One subtle point: CPU throttling can happen even when average CPU usage looks low. CFS enforcement works in fixed periods, 100 ms by default. You can get short bursts that hit the quota, then a pause, then another burst. That pattern is enough to break tail latency.
Smell test: Low CPU usage plus high latency is often throttling, not efficiency.
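To keep the vocabulary straight: requests are what the scheduler reserves and what shapes CPU shares under contention, while the CPU limit sets the CFS quota that does the throttling. A hedged sketch of a resources block, with illustrative numbers rather than recommendations:

```yaml
# Container spec fragment. The numbers are illustrative, not recommendations.
resources:
  requests:
    cpu: "250m"      # what the scheduler reserves; keep this close to real usage
    memory: "256Mi"
  limits:
    cpu: "1"         # this sets the CFS quota; bursts above it get throttled
    memory: "512Mi"  # crossing this gets the container OOM killed
```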
What I look for in Prometheus
These signals explain slow behavior even when the pod stays up:
- container_cpu_usage_seconds_total
- container_cpu_cfs_throttled_seconds_total
- kube_pod_container_resource_requests
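One way to combine the first two names into a single signal is a rough throttling ratio. Treat it as a proxy, not an exact measure; the recording rule name is my own convention.

```yaml
groups:
  - name: cpu-throttling
    rules:
      - record: container:cpu_throttled_ratio:rate5m
        # Seconds spent throttled relative to seconds of CPU actually used.
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])
          )
            / sum by (namespace, pod, container) (
                rate(container_cpu_usage_seconds_total{container!=""}[5m])
              )
```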
5. Memory usage and limits
Memory is not forgiving. CPU can throttle; memory kills.
If a container crosses its memory limit, it will be terminated and restarted. If the request is too low, it may be evicted when the node is under pressure.
For ML APIs, memory behavior is often shaped by:
- Model size and how it is loaded
- Libraries that allocate large buffers during inference
- Concurrency and request payload size
Treat memory limits as safety rails, not guesses.
If you see Exit Code 137 in termination details, treat it as a strong signal for a memory limit kill: 137 is 128 plus signal 9 (SIGKILL), which is what the OOM killer sends.
Rule of thumb: If you are surprised by memory behavior, you are missing a load pattern or a worst case request.
What I look for in Prometheus
- container_memory_working_set_bytes
- kube_pod_container_resource_limits
- kube_pod_container_status_last_terminated_reason
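The comparison I actually care about is working set against the configured limit. A sketch, with an arbitrary 90 percent threshold:

```yaml
groups:
  - name: memory-headroom
    rules:
      - alert: ContainerNearMemoryLimit
        # Working set above 90% of the memory limit predicts an OOM kill.
        expr: |
          max by (namespace, pod, container) (
            container_memory_working_set_bytes{container!="", container!="POD"}
          )
            / max by (namespace, pod, container) (
                kube_pod_container_resource_limits{resource="memory"}
              )
            > 0.9
        for: 5m
        labels:
          severity: warning
```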
6. Service endpoints
Service endpoints are one of the most underused signals.
A Service can exist and pods can exist, but if endpoints are missing, traffic will not go anywhere. This catches issues that are easy to miss when you focus only on pod count:
- Labels do not match between pods and the Service selector
- Readiness never becomes healthy, so endpoints are not added
- A rollout is partially complete and only some pods are receiving traffic
If endpoints are zero, Kubernetes is being very direct with you.
Rule of thumb: If endpoints are empty, stop checking logs. Fix selectors or readiness first.
Where to debug it
kubectl get svc SERVICE_NAME -o yaml
kubectl get endpoints SERVICE_NAME -o wide
kubectl get pods --show-labels
Prometheus names that align well:
- kube_endpoint_address_available
- kube_endpoint_address_not_ready
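A minimal sketch of the "traffic has nowhere to go" alert; the endpoint name is an assumption from the example setup:

```yaml
groups:
  - name: service-endpoints
    rules:
      - alert: ServiceHasNoEndpoints
        # Traffic has nowhere to go: check selectors and readiness first.
        expr: kube_endpoint_address_available{endpoint="iris-api"} == 0
        for: 5m
        labels:
          severity: critical
```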
7. Node health and pressure
Even in a single node environment, node health is a real signal.
In a real cluster, pressure conditions change everything:
- Memory pressure can trigger eviction and instability
- Disk pressure can break image pulls and logging
- PID pressure can stop new workloads from starting
Healthy nodes produce honest scheduling behavior. Unhealthy nodes create confusing symptoms.
Where to see node pressure
kubectl describe node NODE_NAME
kubectl get pods -A --field-selector spec.nodeName=NODE_NAME
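If kube-state-metrics is running, the same pressure conditions are exposed as kube_node_status_condition, which makes a simple alert possible. A sketch:

```yaml
groups:
  - name: node-pressure
    rules:
      - alert: NodeUnderPressure
        # Fires when a node reports memory, disk, or PID pressure.
        expr: |
          kube_node_status_condition{condition=~"MemoryPressure|DiskPressure|PIDPressure", status="true"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.node }} reports {{ $labels.condition }}"
```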
8. Control plane symptoms you can actually observe
You will not always have direct visibility into scheduler or API server internals, especially in managed clusters. But you can still observe control plane pain through symptoms:
- Pods stuck pending even when resources look fine
- Rollouts that take far longer than expected
- Commands that hang or time out
When replicas do not converge and node resources look healthy, the control plane becomes the likely suspect.
The internal causes are often one of these:
- API server latency spikes
- Scheduler throughput drops or queue buildup
- etcd saturation and slow writes
You may not see them directly, but naming them helps you recognize the pattern and avoid chasing random application logs.
9. Storage and I/O signals that hurt ML APIs
This is easy to ignore until you deploy a model that takes real time to load.
Even with a small Iris API, the same failure mode exists: startup depends on reading files and initializing libraries. In real systems that can involve container image pulls, volume mounts, and model downloads.
Symptoms usually look like:
- Readiness takes far longer than expected
- Pods churn because startup hits timeouts
- Nodes show disk pressure and image pulls fail
Where I start:
kubectl describe pod POD_NAME
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe node NODE_NAME
Prometheus names that help:
- kubelet_volume_stats_used_bytes
- kubelet_volume_stats_available_bytes
- storage_operation_duration_seconds
- storage_operation_errors_total
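For the free space side, I pair the available bytes with kubelet_volume_stats_capacity_bytes, which is the natural denominator. A sketch with an arbitrary ten percent threshold:

```yaml
groups:
  - name: volume-capacity
    rules:
      - alert: PersistentVolumeAlmostFull
        # Less than 10% of the volume left: model downloads and writes start failing.
        expr: |
          kubelet_volume_stats_available_bytes
            / kubelet_volume_stats_capacity_bytes
            < 0.10
        for: 10m
        labels:
          severity: warning
```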
A simple way to think about Kubernetes metrics
When I want clarity quickly, I reduce everything to three questions:
- Can the cluster place the workload
- Can the workload receive traffic
- Can the workload stay stable under load
Desired versus running answers the first question.
Availability and endpoints answer the second question.
Restarts, CPU behavior, and memory behavior answer the third question.
A practical workflow when something feels off
This is the flow I use when an API feels slow or unreliable.
- Check deployment convergence and pod readiness
- Check endpoints to confirm traffic can route
- Check restarts and last termination reason
- Check CPU throttling and memory working set
- Check node pressure and recent events
In most incidents, you will get to the root cause before you ever open a dashboard.
Closing
Metrics are not dashboards. Metrics are questions your system answers under pressure.
If you build the habit of asking the right questions early, Kubernetes becomes predictable. Predictable systems scale.