Kubernetes Metrics I Check Before Opening Grafana
People usually start caring about Kubernetes metrics after something breaks. That is backwards.
Even a small setup can teach you most of what matters in production. I will use a simple Iris ML API running on Kubernetes as the running example, but the point of this post is not the model. The point is the signals that tell you whether a cluster is healthy, whether traffic can flow, and whether scaling will hold under pressure.
Quick reference
If you want a compact checklist, here is what I watch first.
| Category | Key signal | Why it matters |
|---|---|---|
| Orchestration | Desired replicas versus available replicas | Tells you if the workload can converge |
| Connectivity | Service endpoints | Tells you if traffic has anywhere to go |
| Stability | Restart count and OOM kills | Tells you if you are silently losing capacity |
| Efficiency | CPU throttling | Explains slow behavior even when CPU looks low |
| Memory | Working set and limits | Predicts restarts and evictions |
| Node health | Pressure conditions | Predicts scheduling failures and evictions |
| Storage | Volume stats and mount errors | Explains slow start and timeouts during loading |
The setup I am referencing
This is intentionally small:
- An Iris inference API running in a Deployment
- Two replicas
- A Service that routes traffic to the pods
- Tested in Minikube, designed to map cleanly to a real cluster
If you can reason about metrics here, you can reason about them in larger systems.
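For concreteness, here is roughly what that setup looks like as manifests. This is a minimal sketch, not the exact files from the project; the image name, port, and labels are placeholders I am assuming for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: iris-api
  template:
    metadata:
      labels:
        app: iris-api            # must match the Service selector below
    spec:
      containers:
        - name: api
          image: iris-api:latest # placeholder image name
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: iris-api
spec:
  selector:
    app: iris-api                # endpoints only appear when this matches pod labels
  ports:
    - port: 80
      targetPort: 8000
```

The label and selector pairing is worth noticing now, because it comes back when we talk about endpoints.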
Where to see these signals
You do not need a full observability stack to get value. Start with kubectl, then layer Prometheus later.
One note before we go deeper: Prometheus metric names can vary slightly based on exporter versions, cluster distribution, and runtime. Treat the metric names below as reliable starting points, not a strict contract.
Fast cluster view
kubectl get deploy
kubectl get pods -o wide
kubectl get svc
kubectl get endpoints
kubectl get nodes
When something does not converge
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe deploy DEPLOYMENT_NAME
kubectl describe pod POD_NAME
kubectl describe node NODE_NAME
If metrics server is present
kubectl top pods
kubectl top nodes
A short incident story that changed how I read metrics
In one deployment, we saw p95 latency climb while CPU usage looked fine. Endpoints were healthy. Pods were running. It was tempting to blame the application.
The root cause was CPU throttling. The CPU limit was set too low for the real burstiness of the workload. The kernel was enforcing the CFS quota, so inference threads were getting paused in short intervals. Nothing crashed, but the system felt slow and inconsistent.
That experience is why I look for gaps between intent and reality. The gaps tell you where the system is lying.
1. Desired and running pods
The first question I ask is simple: do the desired replicas match the running replicas?
If they do not match, Kubernetes is telling you the system cannot converge. That gap is the signal.
Common causes include:
- The scheduler cannot place pods because nodes are out of CPU or memory
- Image pulls are failing
- Pods are crashing or stuck starting
- Probes are failing and the workload never becomes stable
When desired and running do not align, do not start with application logs. Start with the cluster view.
Rule of thumb: If desired replicas and available replicas do not match, do not debug the app first. Debug scheduling and startup.
Prometheus signals to map this
If you have kube-state-metrics, these are the names I use most:
- kube_deployment_spec_replicas
- kube_deployment_status_replicas_available
- kube_deployment_status_replicas_unavailable
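If you want to turn the gap itself into a signal, a minimal alerting rule built on these names looks like this. It is a sketch; the deployment name and the ten minute window are assumptions, not recommendations.

```yaml
groups:
  - name: deployment-convergence
    rules:
      - alert: DeploymentNotConverged
        # Desired replicas have not matched available replicas for 10 minutes.
        expr: |
          kube_deployment_spec_replicas{deployment="iris-api"}
            != kube_deployment_status_replicas_available{deployment="iris-api"}
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.deployment }} has not converged for 10 minutes"
```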
2. Availability, not just running
Kubernetes can show a pod as running while it is not actually ready to receive traffic. Availability is the more important signal.
For an ML API, availability often fails for reasons that look normal in code but are painful in production:
- Model warm up takes longer than expected
- Dependencies are reachable but not ready
- Readiness checks are too strict or too loose
The mindset is simple: a running pod is a process, an available pod is capacity.
Rule of thumb: Running is not readiness. If traffic is failing, look at readiness and endpoints before anything else.
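In practice this usually comes down to the readiness probe. Here is a hedged sketch of the container from the earlier Deployment with a probe that leaves room for model warm up; the path, port, and timings are assumptions you would tune against your own startup profile.

```yaml
containers:
  - name: api
    image: iris-api:latest     # placeholder image name
    ports:
      - containerPort: 8000
    readinessProbe:
      httpGet:
        path: /healthz         # assumed health endpoint
        port: 8000
      initialDelaySeconds: 15  # give the model time to load before the first check
      periodSeconds: 5
      failureThreshold: 3      # ~15s of failed checks before the pod leaves endpoints
```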
What I check when availability fails
kubectl get pods
kubectl describe pod POD_NAME
kubectl get endpoints SERVICE_NAME -o yaml
Prometheus names that align well:
- kube_pod_status_ready
- kube_pod_container_status_ready
3. Restart count
Restarts are a silent reliability tax. A pod that restarts is not stable capacity, even if it recovers quickly.
If restarts are above zero, I treat it as an early warning:
- Memory limits are too low and the container is being killed
- Requests are wrong and the node is under pressure
- The app has spikes that were not tested under load
For ML workloads this is common when inference creates short bursts of memory usage that look harmless until concurrency rises.
Rule of thumb: A restart count above zero in steady state means you are losing capacity. It is already a scaling problem.
How to catch the cause quickly
kubectl get pods
kubectl describe pod POD_NAME
kubectl logs POD_NAME --previous
Prometheus names that align well:
- kube_pod_container_status_restarts_total
- kube_pod_container_status_last_terminated_reason
- container_memory_working_set_bytes
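As a sketch, the first of those names converts directly into an early warning rule. The one hour window and five minute hold are arbitrary starting points, not a standard.

```yaml
groups:
  - name: pod-stability
    rules:
      - alert: ContainerRestarting
        # Any restart within the last hour in steady state is worth a look.
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted in the last hour"
```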
4. CPU requests, limits, and throttling
Kubernetes schedules based on requests, not actual usage.
If CPU requests are unrealistically low, the scheduler packs pods onto nodes that cannot really sustain them, and a tight CPU limit throttles them on top of that. You end up with pods that exist but feel slow. This shows up as latency and timeouts long before it shows up as a crash.
If CPU requests are unrealistically high, you reduce bin packing efficiency and limit how many replicas can run at all.
The goal is honesty. Scaling works when requests are close to reality.
One subtle point: CPU throttling can happen even when average CPU usage looks low. CFS enforcement works in fixed periods, 100 ms by default. You can get short bursts that hit the quota, then a pause, then another burst. That pattern is enough to break tail latency.
Smell test: Low CPU usage plus high latency is often throttling, not efficiency.
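To keep the vocabulary straight: requests are what the scheduler reserves and what shapes CPU shares under contention, while the CPU limit sets the CFS quota that does the throttling. A hedged sketch of a resources block, with illustrative numbers rather than recommendations:

```yaml
# Container spec fragment. The numbers are illustrative, not recommendations.
resources:
  requests:
    cpu: "250m"      # what the scheduler reserves; keep this close to real usage
    memory: "256Mi"
  limits:
    cpu: "1"         # this sets the CFS quota; bursts above it get throttled
    memory: "512Mi"  # crossing this gets the container OOM killed
```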
What I look for in Prometheus
These signals explain slow behavior even when the pod stays up:
- container_cpu_usage_seconds_total
- container_cpu_cfs_throttled_seconds_total
- kube_pod_container_resource_requests
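One way to combine the first two names into a single signal is a rough throttling ratio. Treat it as a proxy, not an exact measure; the recording rule name is my own convention.

```yaml
groups:
  - name: cpu-throttling
    rules:
      - record: container:cpu_throttled_ratio:rate5m
        # Seconds spent throttled relative to seconds of CPU actually used.
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])
          )
            / sum by (namespace, pod, container) (
                rate(container_cpu_usage_seconds_total{container!=""}[5m])
              )
```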
5. Memory usage and limits
Memory is not forgiving. CPU can throttle; memory kills.
If a container crosses its memory limit, it will be terminated and restarted. If the request is too low, it may be evicted when the node is under pressure.
For ML APIs, memory behavior is often shaped by:
- Model size and how it is loaded
- Libraries that allocate large buffers during inference
- Concurrency and request payload size
Treat memory limits as safety rails, not guesses.
If you see Exit Code 137 in termination details, treat it as a strong signal for a memory limit kill: 137 is 128 plus signal 9 (SIGKILL), which is what the OOM killer sends.
Rule of thumb: If you are surprised by memory behavior, you are missing a load pattern or a worst case request.
What I look for in Prometheus
- container_memory_working_set_bytes
- kube_pod_container_resource_limits
- kube_pod_container_status_last_terminated_reason
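The comparison I actually care about is working set against the configured limit. A sketch, with an arbitrary 90 percent threshold:

```yaml
groups:
  - name: memory-headroom
    rules:
      - alert: ContainerNearMemoryLimit
        # Working set above 90% of the memory limit predicts an OOM kill.
        expr: |
          max by (namespace, pod, container) (
            container_memory_working_set_bytes{container!="", container!="POD"}
          )
            / max by (namespace, pod, container) (
                kube_pod_container_resource_limits{resource="memory"}
              )
            > 0.9
        for: 5m
        labels:
          severity: warning
```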
6. Service endpoints
Service endpoints are one of the most underused signals.
A Service can exist and pods can exist, but if endpoints are missing, traffic will not go anywhere. This catches issues that are easy to miss when you focus only on pod count:
- Labels do not match between pods and the Service selector
- Readiness never becomes healthy, so endpoints are not added
- A rollout is partially complete and only some pods are receiving traffic
If endpoints are zero, Kubernetes is being very direct with you.
Rule of thumb: If endpoints are empty, stop checking logs. Fix selectors or readiness first.
Where to debug it
kubectl get svc SERVICE_NAME -o yaml
kubectl get endpoints SERVICE_NAME -o wide
kubectl get pods --show-labels
Prometheus names that align well:
- kube_endpoint_address_available
- kube_endpoint_address_not_ready
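A minimal sketch of the "traffic has nowhere to go" alert; the endpoint name is an assumption from the example setup:

```yaml
groups:
  - name: service-endpoints
    rules:
      - alert: ServiceHasNoEndpoints
        # Traffic has nowhere to go: check selectors and readiness first.
        expr: kube_endpoint_address_available{endpoint="iris-api"} == 0
        for: 5m
        labels:
          severity: critical
```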
7. Node health and pressure
Even in a single node environment, node health is a real signal.
In a real cluster, pressure conditions change everything:
- Memory pressure can trigger eviction and instability
- Disk pressure can break image pulls and logging
- PID pressure can stop new workloads from starting
Healthy nodes produce honest scheduling behavior. Unhealthy nodes create confusing symptoms.
Where to see node pressure
kubectl describe node NODE_NAME
kubectl get pods -A --field-selector spec.nodeName=NODE_NAME
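If kube-state-metrics is running, the same pressure conditions are exposed as kube_node_status_condition, which makes a simple alert possible. A sketch:

```yaml
groups:
  - name: node-pressure
    rules:
      - alert: NodeUnderPressure
        # Fires when a node reports memory, disk, or PID pressure.
        expr: |
          kube_node_status_condition{condition=~"MemoryPressure|DiskPressure|PIDPressure", status="true"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.node }} reports {{ $labels.condition }}"
```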
8. Control plane symptoms you can actually observe
You will not always have direct visibility into scheduler or API server internals, especially in managed clusters. But you can still observe control plane pain through symptoms:
- Pods stuck pending even when resources look fine
- Rollouts that take far longer than expected
- Commands that hang or time out
When replicas do not converge and node resources look healthy, the control plane becomes the likely suspect.
The internal causes are often one of these:
- API server latency spikes
- Scheduler throughput drops or queue buildup
- etcd saturation and slow writes
You may not see them directly, but naming them helps you recognize the pattern and avoid chasing random application logs.
9. Storage and I/O signals that hurt ML APIs
This is easy to ignore until you deploy a model that takes real time to load.
Even with a small Iris API, the same failure mode exists: startup depends on reading files and initializing libraries. In real systems that can involve container image pulls, volume mounts, and model downloads.
Symptoms usually look like:
- Readiness takes far longer than expected
- Pods churn because startup hits timeouts
- Nodes show disk pressure and image pulls fail
Where I start:
kubectl describe pod POD_NAME
kubectl get events -A --sort-by=.lastTimestamp
kubectl describe node NODE_NAME
Prometheus names that help:
- kubelet_volume_stats_used_bytes
- kubelet_volume_stats_available_bytes
- storage_operation_duration_seconds
- storage_operation_errors_total
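For the free space side, I pair the available bytes with kubelet_volume_stats_capacity_bytes, which is the natural denominator. A sketch with an arbitrary ten percent threshold:

```yaml
groups:
  - name: volume-capacity
    rules:
      - alert: PersistentVolumeAlmostFull
        # Less than 10% of the volume left: model downloads and writes start failing.
        expr: |
          kubelet_volume_stats_available_bytes
            / kubelet_volume_stats_capacity_bytes
            < 0.10
        for: 10m
        labels:
          severity: warning
```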
A simple way to think about Kubernetes metrics
When I want clarity quickly, I reduce everything to three questions:
- Can the cluster place the workload
- Can the workload receive traffic
- Can the workload stay stable under load
Desired versus running answers the first question.
Availability and endpoints answer the second question.
Restarts, CPU behavior, and memory behavior answer the third question.
A practical workflow when something feels off
This is the flow I use when an API feels slow or unreliable.
- Check deployment convergence and pod readiness
- Check endpoints to confirm traffic can route
- Check restarts and last termination reason
- Check CPU throttling and memory working set
- Check node pressure and recent events
In most incidents, you will get to the root cause before you ever open a dashboard.
Closing
Metrics are not dashboards. Metrics are questions your system answers under pressure.
If you build the habit of asking the right questions early, Kubernetes becomes predictable. Predictable systems scale.