Cloud · Kubernetes · DevOps · Infrastructure

Kubernetes: The Hard Parts Nobody Warns You About

2024-09-05

Beyond the hello-world tutorials — a practical look at the Kubernetes concepts that trip up even experienced engineers: networking, resource limits, and debugging crashed pods.

Every Kubernetes tutorial starts the same way: install kubectl, run kubectl apply -f deployment.yaml, watch your pod come up. It looks simple. Then you join a real project and a pod crashes in production at 2 AM and you have no idea why.

This article covers the parts that the getting-started guides skip.

Resource Requests and Limits (and Why Getting Them Wrong Destroys You)

The most common source of production incidents I have seen is misconfigured resource requests and limits. Here is what they actually mean:

  • Request — the amount of CPU/memory the scheduler reserves for your pod on a node. This affects where the pod lands.
  • Limit — the hard ceiling. Exceed the memory limit and your container is OOMKilled. Exceed the CPU limit and it is throttled.
resources:
  requests:
    memory: "128Mi"
    cpu: "250m"
  limits:
    memory: "256Mi"
    cpu: "500m"

The trap: if you set no requests, the scheduler treats your pod as needing zero resources, so it can land on nodes that are already full. Under load, everything fights for the same CPU and latency spikes. If you set no memory limit, one misbehaving pod can eat all of a node's memory and take down unrelated workloads.
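One guardrail worth knowing: a LimitRange applies default requests and limits to any container in a namespace that does not declare its own, so an unconfigured pod can never land with nothing set. A minimal sketch (names and values here are illustrative, not prescriptive):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources    # illustrative name
  namespace: backend         # assumed namespace
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container declares no requests
        memory: "128Mi"
        cpu: "250m"
      default:               # applied when a container declares no limits
        memory: "256Mi"
        cpu: "500m"
```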

A practical rule: set requests to the 50th percentile usage, limits to the 95th. Use kubectl top pods and Vertical Pod Autoscaler recommendations to calibrate.
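If you run the Vertical Pod Autoscaler, a recommendation-only object gives you percentile-based suggestions without ever touching the pods. A sketch, assuming the VPA CRDs are installed and a Deployment named api exists:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa              # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                # assumed workload
  updatePolicy:
    updateMode: "Off"        # recommend only; never evict or mutate pods
```

kubectl describe vpa api-vpa then shows lower-bound, target, and upper-bound recommendations you can feed back into your requests and limits.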

How Kubernetes Networking Actually Works

Pods get their own IP addresses, and the pod network is flat: by default any pod can reach any other pod's IP, regardless of namespace. What trips people up is that service discovery is namespace-scoped — the DNS name api resolves in one namespace and fails in another.

The key concepts:

Services are not load balancers — they are stable virtual IPs backed by kube-proxy rules (iptables or IPVS). When you create a ClusterIP service, kube-proxy on every node adds rules that intercept traffic to that IP and redirect it to one of the backing pods.
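For reference, a minimal ClusterIP service selecting the api pods from the other examples in this article might look like this (names assumed):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: backend
spec:
  type: ClusterIP            # the default; shown for clarity
  selector:
    app: api                 # matches pods carrying this label
  ports:
    - port: 80               # the port on the stable virtual IP
      targetPort: 8080       # the port the container actually listens on
```

kube-proxy on each node rewrites traffic to the virtual IP's port 80 so it lands on port 8080 of one of the backing pods.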

DNS is how services find each other. A service named api in namespace backend is reachable at:

  • api (from within the same namespace)
  • api.backend (short form)
  • api.backend.svc.cluster.local (fully qualified)

Network policies are the firewall. By default, pods accept traffic from anywhere. To lock it down:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-only-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: frontend

Without a CNI plugin that enforces network policies (Calico, Cilium, Weave), these YAML files do nothing. Many clusters run with no enforcement, and engineers are left wondering why their carefully written policies have no effect.
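Network policies are also additive: once any policy selects a pod, only explicitly allowed traffic passes. A common companion to an allow rule like the one above is a namespace-wide default deny (sketch, same assumed namespace):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: backend
spec:
  podSelector: {}            # empty selector = every pod in the namespace
  policyTypes:
    - Ingress                # deny all ingress not allowed by another policy
```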

Debugging Crashed Pods

A pod in CrashLoopBackOff is telling you it started, failed, and Kubernetes is retrying with exponential backoff. The debug workflow:

# 1. See what state the pod is in
kubectl describe pod <pod-name>

# 2. Check the current logs
kubectl logs <pod-name>

# 3. Check logs from the PREVIOUS crash (this is the one people forget)
kubectl logs <pod-name> --previous

# 4. If the container exits too fast to read logs, make a copy of the pod
#    with the container's command overridden so you can poke around
kubectl debug <pod-name> -it --copy-to=debug-pod --container=<container-name> -- sh

The most common causes:

  • OOMKilled — memory limit too low. Check kubectl describe pod, look for OOMKilled in the last state.
  • Liveness probe failing — your health check endpoint is not ready fast enough. Add initialDelaySeconds.
  • Missing environment variable or secret — the app panics on startup. Always check --previous logs.
  • Image pull error — wrong tag, private registry credentials missing, or rate limiting from Docker Hub.
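For the image-pull case, private registry credentials are wired in via imagePullSecrets. A sketch, assuming a docker-registry secret named regcred has already been created in the namespace (registry and image names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                  # illustrative name
spec:
  imagePullSecrets:
    - name: regcred          # assumed: created via `kubectl create secret docker-registry`
  containers:
    - name: app
      image: registry.example.com/team/app:1.2.3   # hypothetical private image
```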

Liveness vs. Readiness vs. Startup Probes

These three probes are frequently confused:

  • Startup — Is the container done initialising? On failure: kills and restarts the container.
  • Liveness — Is the container still healthy? On failure: kills and restarts the container.
  • Readiness — Is the container ready to serve traffic? On failure: removes the pod from Service endpoints (no restart).

The critical insight: a readiness failure takes a pod out of rotation without restarting it, which is exactly what graceful traffic drain needs. During a rollout, endpoint removal propagates asynchronously, so a terminating pod can still receive requests for a moment after it gets SIGTERM. Have the app start failing its readiness probe on SIGTERM (or add a short preStop sleep) so traffic stops arriving before the process actually exits.

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
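For slow-starting apps, a startup probe is the cleaner tool: it holds off the liveness probe until the first success, instead of inflating initialDelaySeconds. A sketch for an app that may take up to 150 seconds to boot (the numbers are illustrative):

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30       # up to 30 attempts...
  periodSeconds: 5           # ...5s apart = 150s of grace before liveness takes over
```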

ConfigMaps and Secrets: The Right Pattern

Do not bake environment-specific configuration into your container images. Use ConfigMaps for non-sensitive config and Secrets for credentials.

apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:
  password: "supersecret"
---
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: app
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password

One thing most tutorials skip: Kubernetes Secrets are base64-encoded, not encrypted, at rest by default. For production, enable encryption at rest in the API server config, or use an external secrets manager like Vault, AWS Secrets Manager, or External Secrets Operator.
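Enabling encryption at rest means pointing the API server at an EncryptionConfiguration file via its --encryption-provider-config flag. A minimal sketch using the aescbc provider (the key value is a placeholder, never commit a real one):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>   # placeholder value
      - identity: {}         # fallback so existing plaintext secrets stay readable
```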

The Mental Model That Helps Everything Click

Think of the control plane as a reconciliation loop. You declare desired state (via YAML), and controllers endlessly compare that desired state against the actual state, making changes to close the gap.

This is why kubectl apply is idempotent. It is also why manual changes made via kubectl exec inside a running container never stick: they vanish the moment the pod is restarted or rescheduled, and anything expressed in the pod spec is reconciled straight back to the declared state. Infrastructure in Kubernetes should always be managed through manifests, not manual intervention.
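The reconciliation pattern itself fits in a few lines. A toy sketch in shell (illustrative only, nothing like the real controller code): the desired state asks for three replicas, and the loop closes the gap one step at a time.

```shell
#!/bin/sh
# Toy reconciliation loop: compare desired vs. actual state and converge.
desired=3   # declared state (what the YAML asks for)
actual=1    # observed state (what exists right now)

while [ "$actual" -ne "$desired" ]; do
  if [ "$actual" -lt "$desired" ]; then
    actual=$((actual + 1))   # "create a replica"
  else
    actual=$((actual - 1))   # "delete a replica"
  fi
done

echo "replicas: $actual"
```

Running it from any starting state converges on the same answer, which is the whole point: the loop cares about the gap, not the history.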

Once this model clicks, Kubernetes stops feeling like magic and starts feeling like a very opinionated state machine. Which it is.