# 2020-08-28 - Memory overload on WER cluster
## Summary

On 2020-08-28, WER reported stuck pages for students: a total outage, with nothing usable.
After investigation, we determined that the core pods didn't have appropriate resource guarantees set. There was also no separate user pool, so the WER students' pods overloaded the CPU & RAM of the nodes the core pods ran on. This starved everything of resources, causing the outage.
This was resolved by:

- removing the memory overcommit on user pods, and
- bumping resource guarantees for the core pods so they can operate even on full nodes.
## Timeline

All times in IST.
An activity bump is noticed, but the regular fixes (incognito mode, restarting servers, etc.) don't seem to fix things.
A look at resource utilization on the nodes makes the exhaustion clear:
```
$ kubectl top node
NAME                                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-low-touch-hubs-cluster-core-pool-b7edea69-00sc   220m         11%    6151Mi          58%
gke-low-touch-hubs-cluster-core-pool-b7edea69-gwrg   1944m        100%   10432Mi         98%
```
There were only core nodes, with no separate user nodes. The suspicion is that the user pods are using just enough resources to starve the core pods.
Based on tests of how much RAM WER needs, we had set a limit of 2G but a guarantee of only 512M: a 4x overcommit, as we often do. However, the tests revealed that users almost always use just under 1G of RAM, so our overcommit should have been only 2x. We just remove the overcommit for now (guarantee = limit). This will also probably spawn another node, easing pressure on the other existing nodes.
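For illustration, here is a minimal sketch of what removing the overcommit looks like, assuming the hub is deployed with the zero-to-jupyterhub helm chart's `singleuser.memory` settings; the exact values and file layout are assumptions, not the actual change:

```yaml
# Sketch of a zero-to-jupyterhub values snippet (assumed deployment method).
# Setting guarantee equal to limit removes overcommit: the scheduler now
# reserves the full 2G for every user pod instead of only 512M.
singleuser:
  memory:
    limit: 2G
    guarantee: 2G   # previously 512M, i.e. a 4x overcommit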
We bump the resource guarantees for all the core pods as well, so they have enough to operate even when the nodes fill up. This restarts the pods and moves some to a new node, which also helps. Things seem to return to normal.
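In Kubernetes terms, a guarantee is a resource request: the scheduler only places a pod on a node with that much unreserved capacity, which is why bumping the guarantees forced some core pods onto a new node. A hedged sketch of such a block on a core pod's container (the values here are illustrative, not the ones we actually set):

```yaml
# Illustrative resource block for a core pod container.
# 'requests' is what the scheduler guarantees; 'limits' is the hard cap.
resources:
  requests:
    cpu: 200m      # assumed values, for illustration only
    memory: 1Gi
  limits:
    cpu: "1"
    memory: 2Gi
```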
## Action items

- Make sure user pods are in a separate node pool, so they do not create pressure on the core pods (a sketch of one way to do this follows this list): https://github.com/2i2c-org/infrastructure/issues/89
- Set limits on the support infrastructure (prometheus, grafana, ingress) as well: https://github.com/2i2c-org/infrastructure/issues/90
- Document and think about overcommit ratios for memory usage: https://github.com/2i2c-org/infrastructure/issues/91
- Set up better Grafana dashboards to monitor resource usage: https://github.com/2i2c-org/infrastructure/issues/92
- Document how folks can get `kubectl` access to the cluster, so others can look into issues too: https://github.com/2i2c-org/infrastructure/issues/87
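For the first action item, one common pattern (sketched here under assumed label names, not necessarily what ended up being implemented) is to label the node pools and pin user pods to the user pool via the chart's nodeSelector:

```yaml
# Hypothetical sketch: keep user pods off the core pool.
# Assumes the GKE node pools carry the z2jh-conventional label
# hub.jupyter.org/node-purpose, with values 'core' and 'user'.
singleuser:
  nodeSelector:
    hub.jupyter.org/node-purpose: user
```

Combined with a taint on the user pool so core pods cannot schedule there, this keeps student workloads from competing with the hub, proxy, and support infrastructure.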