Kubernetes (EKS/GKE)

What I’ve done (production patterns)

  • Operated EKS/GKE workloads with real on-call ownership: rollouts, incidents, brownouts, and noisy alerts.
  • Built safer delivery patterns: readiness gates, PDBs, canaries, and rollback playbooks.
  • Designed for scale: HPA/VPA, Cluster Autoscaler, node group isolation (system vs workload), multi-AZ posture.
  • Security + access: RBAC, IRSA, namespace boundaries, least-privilege service accounts, secrets strategy.
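
A minimal sketch of the arithmetic behind a PDB, to show why it makes drains and rollouts safer. It assumes the documented rounding behavior (percentages round up against the expected pod count). The function name and parameters are hypothetical and are not a Kubernetes API.

```python
def pdb_allowed_disruptions(expected_pods, healthy_pods,
                            min_available=None, max_unavailable=None):
    """Estimate how many voluntary evictions a PDB would permit right now.

    Hypothetical helper for illustration; values may be ints or
    percent strings such as "50%".
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")

    def resolve(value):
        # Percentages are scaled against the expected pod count and rounded up.
        if isinstance(value, str) and value.endswith("%"):
            percent = int(value[:-1])
            return (expected_pods * percent + 99) // 100
        return int(value)

    if min_available is not None:
        desired_healthy = resolve(min_available)
    else:
        desired_healthy = max(0, expected_pods - resolve(max_unavailable))

    # Evictions are allowed only while healthy pods exceed the required floor.
    return max(0, healthy_pods - desired_healthy)
```

With 7 expected and healthy pods, minAvailable "50%" requires 4 healthy pods, so 3 evictions are allowed. If only 2 of 3 pods are healthy and minAvailable is 2, a node drain blocks until the third pod recovers.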

Things I care about (what breaks in real life)

  • HPA thrashing → fix with realistic requests/limits, scale-down stabilization windows (cooldowns), and queue-aware metrics.
  • Node pressure / evictions → right-size, set PDBs, separate noisy workloads, tune eviction thresholds.
  • DNS & CNI weirdness → correlate CoreDNS latency, conntrack pressure, and CNI errors; keep runbooks.
  • Upgrade blast radius → staged upgrades, test add-ons, and gate critical workloads.
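
A simplified sketch of why a cooldown dampens HPA thrashing. It uses the documented HPA ratio formula with the default 10% tolerance, plus a scale-down stabilization window (300 seconds by default) that acts on the highest recent recommendation. The class and function names are hypothetical, and the real controller handles more cases (missing metrics, unready pods, scaling policies).

```python
import math
from collections import deque


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """desired = ceil(current * currentMetric / targetMetric), skipped within tolerance."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


class ScaleDownStabilizer:
    """Keeps recent recommendations and returns the highest one in the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommendation)

    def recommend(self, now, raw_recommendation):
        self._history.append((now, raw_recommendation))
        # Drop recommendations that have aged out of the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        # A brief dip cannot shrink the workload until the window agrees.
        return max(rec for _, rec in self._history)
```

A spike that recommends 8 replicas followed one minute later by a dip that recommends 4 still yields 8 until the spike ages out of the window. Scale-ups take effect immediately, and scale-downs wait.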

Artifacts (public)

Interview-ready examples

  • Safe deploys: readiness + canary + rollback
  • Reduce on-call noise: SLO-based paging + ownership routing
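
A minimal sketch of SLO-based paging using a multi-window burn-rate check. The 14.4 threshold is an assumption taken from the commonly cited 1-hour/5-minute pairing for a 30-day SLO (2% of the error budget spent in one hour). The function names are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only when both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, so recovered incidents stop paging.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

For a 99.9% SLO, a sustained 2% error ratio in both windows pages. If the short window has already dropped back to 0.1%, it does not page.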