What I’ve done (production patterns)
- Operated EKS/GKE workloads with real on-call ownership: rollouts, incidents, brownouts, and noisy alerts.
- Built safer delivery patterns: readiness gates, PDBs, canaries, and rollback playbooks.
- Designed for scale: HPA/VPA, Cluster Autoscaler, node group isolation (system vs workload), multi-AZ posture.
- Security + access: RBAC, IRSA, namespace boundaries, least-privilege service accounts, secrets strategy.
Things I care about (what breaks in real life)
- HPA thrashing → fix with sane requests/limits, scale-down stabilization windows, and queue-aware metrics.
- Node pressure / evictions → right-size requests, set PDBs for drain-driven evictions, separate noisy workloads, tune kubelet eviction thresholds.
- DNS & CNI weirdness → correlate CoreDNS latency, conntrack pressure, and CNI errors; keep runbooks.
- Upgrade blast radius → stage upgrades, test add-ons, and gate critical workloads.
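
To make the HPA thrashing item concrete, here is a minimal Python sketch of the replica arithmetic involved. It assumes the documented HPA formula (ceil of current replicas times the metric ratio, skipped inside a tolerance band) and a simplified scale-down stabilization window that holds the highest recent recommendation. The names `desired_replicas` and `ScaleDownStabilizer` are hypothetical, and the defaults (0.1 tolerance, 300 s window) are illustrative rather than taken from any specific cluster.

```python
import math
from collections import deque


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """HPA-style recommendation: ceil(current * metric / target).

    If the ratio is within the tolerance band around 1.0, keep the current
    replica count; this is the first guard against flapping.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


class ScaleDownStabilizer:
    """Simplified scale-down stabilization (an assumption-laden sketch).

    Scale-ups pass through immediately. Scale-downs are limited to the
    highest recommendation seen inside the window, so a brief dip in load
    does not remove replicas that are needed again a minute later.
    """

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp_seconds, recommendation)

    def stabilize(self, now, current_replicas, recommendation):
        self.history.append((now, recommendation))
        # Drop recommendations that have aged out of the window.
        while self.history and self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        if recommendation >= current_replicas:
            return recommendation
        highest_recent = max(rec for _, rec in self.history)
        return min(current_replicas, highest_recent)
```

For a queue-aware metric, `current_metric` would be queue depth per replica and `target_metric` the backlog each replica is expected to absorb, so the recommendation tracks pending work rather than CPU alone.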
Artifacts (public)
Interview-ready examples
- Safe deploys: readiness + canary + rollback
- Reduce on-call noise: SLO-based paging + ownership routing
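
As an illustration of SLO-based paging, the sketch below computes an error-budget burn rate and pages only when both a long and a short window exceed a threshold. It assumes a request-based SLO; the `Window` type, function names, and the 14.4 threshold (roughly 2% of a 30-day budget consumed in one hour) are illustrative choices, not a description of any particular alerting stack.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    errors: int
    total: int


def burn_rate(window, slo_target):
    """Observed error ratio divided by the error budget (1 - SLO target).

    A burn rate of 1.0 means the budget is being spent exactly as fast as
    the SLO allows; higher values mean it will run out early.
    """
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    error_ratio = window.errors / window.total
    return error_ratio / (1.0 - slo_target)


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows are burning fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, which suppresses pages for incidents
    that have already recovered.
    """
    return (
        burn_rate(long_window, slo_target) >= threshold
        and burn_rate(short_window, slo_target) >= threshold
    )
```

Ownership routing would sit on top of this decision: once `should_page` is true, the alert goes to whichever team owns the service, which is left out of the sketch.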