Production Kubernetes Checklist: Resource Limits, RBAC, and Observability

Kubernetes will happily run a workload that is wildly misconfigured. It will not warn you that you forgot to set memory limits until a node is on fire at 3am. The checklist below is the bare minimum I run through before a cluster carries anything important.

1. Set requests and limits on every container

This is the single highest-leverage thing you can do. Without memory limits, a leaking container can exhaust the node's memory and get unrelated pods OOM-killed or evicted along with it. Without requests, the scheduler cannot make good placement decisions and you end up with packed-then-OOM nodes.

resources:
  requests:
    cpu: 100m        # used for scheduling and for CPU share under contention
    memory: 128Mi
  limits:
    memory: 256Mi    # no cpu limit on purpose

Notice that memory has a limit but CPU has only a request. This is deliberate. A container that exceeds its memory limit is OOM-killed; a container that hits its CPU limit is only throttled. Throttling caused by aggressive CPU limits has been a frequent cause of mysterious latency spikes in services I have debugged. Leave CPU uncapped and let requests do the work: they drive scheduling and set each container's share of CPU when the node is contended.
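
One way to keep this from depending on everyone remembering is a LimitRange that fills in defaults for containers that omit them. This is a minimal sketch, assuming a namespace called my-app; the object name and the numbers are placeholders, not recommendations.

apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults   # hypothetical name
  namespace: my-app          # assumed namespace
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container sets no limits
        memory: 256Mi        # memory only, so CPU stays uncapped

The defaults are applied at admission, so pods that already exist are not changed until they are recreated.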

2. RBAC: least privilege, no exceptions

Every service account that ships with a workload should be bound to a Role (or ClusterRole) that lists only the verbs and resources it actually uses. The default ServiceAccount has no permissions beyond the API discovery access that every authenticated identity gets, which is what you want. Do not bind it to anything broader.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: my-app
  name: config-reader
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]

Resist the urge to grant "*" on resources or verbs to make a CI job easier. The convenience now becomes an incident report later.
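
A Role grants nothing until it is bound to a subject. Here is a minimal sketch of the binding for the config-reader Role, assuming a dedicated ServiceAccount named my-app (a name made up for illustration) rather than the namespace default:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app                 # hypothetical; one per workload
  namespace: my-app
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-config-reader   # hypothetical name
  namespace: my-app
subjects:
  - kind: ServiceAccount
    name: my-app
    namespace: my-app
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: config-reader          # must match the Role's metadata.name

The workload opts in with serviceAccountName: my-app in its pod spec. Running kubectl auth can-i list configmaps --as=system:serviceaccount:my-app:my-app -n my-app is a quick way to check what the binding actually allows.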

3. Network policies: default deny

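By default, every pod can talk to every other pod in the cluster. For anything past a sandbox, deploy a default-deny network policy in each namespace and then allow the specific traffic you need. Policies are only enforced if the cluster's network plugin implements NetworkPolicy; on one that does not, they are accepted and silently ignored.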

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: my-app
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]

Once the default-deny is in place, you write small allow policies for the legitimate flows. The first time you do this, expect to spend a day discovering that you have services talking to each other through paths nobody documented.
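
Because the default-deny above covers Egress, it also blocks DNS lookups, so the first allow policy is usually the one that lets pods reach the cluster DNS service. This is a sketch under assumptions: it presumes DNS runs in kube-system with the conventional k8s-app: kube-dns label, and that namespaces carry the automatic kubernetes.io/metadata.name label. Check both on your cluster before relying on it.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress       # hypothetical name
  namespace: my-app
spec:
  podSelector: {}              # every pod in the namespace
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:         # same list item as namespaceSelector: both must match
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

Network policies are additive, so this sits alongside default-deny rather than replacing it. Each further legitimate flow gets its own small policy in the same shape.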

4. Probes that mean something

A liveness probe should detect a process that needs to be killed. A readiness probe should detect a process that is up but not ready to handle traffic. Most production incidents I have seen with bad probes were either (a) liveness too aggressive, killing pods during legitimate slow operations, or (b) readiness conflated with liveness, causing slow-startup pods to be killed before they finish warming up.

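readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5        # failureThreshold defaults to 3: unready after ~15s of failures
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3     # restarted after ~30s of consecutive failures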

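If the same endpoint backs both probes, tune them differently. Readiness should react quickly, because failing it only takes the pod out of Service endpoints until it recovers. Liveness has the harsh consequence, a container restart, so give it the longer period and enough failureThreshold that a slow request or a GC pause does not trigger it.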
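
For slow-starting pods, a startupProbe is usually a better tool than a large initialDelaySeconds: liveness and readiness checks are held off until the startup probe succeeds once. A minimal sketch, assuming the same /healthz endpoint and port as above; the 300-second budget is an arbitrary example, not a recommendation.

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30    # up to 30 x 10s = 300s to start before the container is restarted

Once the startup probe has passed, it is not run again for that container, and the liveness probe can stay strict about hangs without having to account for warm-up time.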

5. PodDisruptionBudgets

Without a PDB, a routine node drain (during a cluster upgrade or autoscaler scale-down) can evict every replica of your deployment at once if they sit on the nodes being drained. With one, evictions that would take you below your minimum are refused until replacements are ready.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
  namespace: my-app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app

For two-replica deployments, minAvailable: 1 is right. For larger deployments, prefer maxUnavailable as a percentage.
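
For the larger-deployment case, the percentage form looks like this. It is a sketch that reuses the app: my-app label from the example above and assumes the pods live in a my-app namespace; 25% is an illustrative value.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
  namespace: my-app          # assumed; must be the namespace the pods run in
spec:
  maxUnavailable: "25%"      # illustrative; scales with replica count
  selector:
    matchLabels:
      app: my-app

A PDB takes either minAvailable or maxUnavailable, not both, so this replaces the minAvailable version rather than adding to it.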

6. Observability: three signals, not just metrics

Prometheus and Grafana cover metrics, which is one signal out of three. They do not solve logs or traces, and lacking those two leaves you debugging blind for an entire class of issues.

Metrics: Prometheus, scraped from /metrics on each pod. Alert on rate-of-change, not absolute thresholds.
Logs: a centralized collector (Loki, Elastic, CloudWatch agent). Structured JSON logs, not free-text.
Traces: OpenTelemetry SDK in the application, sent to Tempo or Jaeger. Most useful for cross-service latency questions.

The fastest payoff is logs. The hardest to retrofit is traces.
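
To make the metrics advice concrete, here is a sketch of a Prometheus alerting rule that fires on an error ratio computed from rates instead of on a raw counter value. The metric name http_requests_total and its app and status labels are assumptions about how the application is instrumented, and the 5% threshold and 10-minute window are placeholders.

groups:
  - name: my-app-alerts            # hypothetical group name
    rules:
      - alert: HighErrorRatio      # hypothetical alert name
        # share of requests answered with 5xx over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{app="my-app", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{app="my-app"}[5m])) > 0.05
        for: 10m                   # must hold for 10 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "my-app 5xx ratio above 5% for 10 minutes"

The for clause keeps a single bad scrape from paging anyone. The ratio form means the alert behaves the same at 10 requests per second as at 10,000.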

7. Image policy

Pin images by digest rather than by floating tags like latest. Scan images for CVEs in CI (Trivy is cheap and good). Run as a non-root user. Use a read-only root filesystem if the app tolerates it.

securityContext:
  runAsNonRoot: true
  runAsUser: 1001
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
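
Whether the app "tolerates" a read-only root filesystem often comes down to whether it needs a scratch directory. A sketch of the usual workaround, assuming the app only writes to /tmp: mount an emptyDir there and keep everything else read-only. The container name, registry host, and the <digest> placeholder are all illustrative.

containers:
  - name: my-app                                            # hypothetical
    image: registry.example.com/my-app@sha256:<digest>      # placeholder; pin to the real digest
    securityContext:
      runAsNonRoot: true
      runAsUser: 1001
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
    volumeMounts:
      - name: tmp
        mountPath: /tmp                                      # the only writable path
volumes:
  - name: tmp
    emptyDir: {}

This is a pod spec fragment, so it sits under spec.template.spec in a Deployment.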

None of these are exotic. They are the boring controls that prevent the boring incidents.