Alerting - 2i2c Infrastructure Guide

We have a few alerts configured to notify us when things go wrong and we use PagerDuty to manage them.

How to manage alerts¶

Severity levels¶

When an alert threshold is crossed, an automatic notification is sent to PagerDuty and the #pagerduty-notifications channel on the 2i2c Slack.

Each alert set up with Jsonnet has a severity level, which is set in its *.jsonnet configuration file. The severity levels are:

- take immediate action
- same day action needed
- action needed this week
- to be planned in sprint planning

The severity level determines how quickly you should respond to the alert and translates into the priority of the incident created in PagerDuty. After an incident is created, PagerDuty runs an Event Orchestration that sets the incident's priority based on the alert's severity label.

Priority levels¶

PagerDuty alerts can have a priority between P1 and P4, or no priority set at all.
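As a rough illustration of how a severity translates into a priority, here is a minimal Python sketch of the kind of lookup the Event Orchestration performs. It is not PagerDuty's API and not our actual orchestration rules: the dictionary keys reuse the severity descriptions from this page rather than the literal label values in the *.jsonnet files, and `priority_for` and its `service_is_p1` parameter are hypothetical names.

```python
from typing import Optional

# Assumption: keys are the severity descriptions used on this page,
# not necessarily the literal label values in the Jsonnet files.
SEVERITY_TO_PRIORITY = {
    "take immediate action": "P1",
    "same day action needed": "P2",
    "action needed this week": "P3",
    "to be planned in sprint planning": "P4",
}


def priority_for(severity: Optional[str], service_is_p1: bool = False) -> Optional[str]:
    """Return the priority for an alert, or None when no priority is set."""
    # Some Services are always P1 regardless of severity
    # (this page gives JupyterHub health checks as an example).
    if service_is_p1:
        return "P1"
    if severity is None:
        return None
    # Unknown severities fall through to "no priority".
    return SEVERITY_TO_PRIORITY.get(severity)


print(priority_for("same day action needed"))  # P2
```

An alert with no severity, or with a severity the lookup does not recognise, ends up with no priority, which matches the "no priority set at all" case.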

P1 alerts¶

These alerts signal an ongoing community outage! An outage is a period of time when a hub is unavailable, or when its critical services are not functioning as expected and this affects the activity of two or more hub users.
The priority is set either:
- By PagerDuty’s Event Orchestration, if the alert has a take immediate action severity, or based on the Service it pertains to (e.g. all JupyterHub health checks are P1s)
- Manually by the engineer

P2 alerts¶

These alerts signal that the community is about to be affected if we don’t act as soon as possible. E.g. increasing the size of a hub’s home directory disk when it has less than 10% of its space available.
The priority is set by PagerDuty’s Event Orchestration if the alert has a same day action needed severity, or based on the Service it pertains to (in the same way that all JupyterHub health checks are set to P1).

P3 alerts¶

These alerts correlate with the action needed this week severity level.
The community is about to be affected if we don’t do something soon, but not immediately.

P4 alerts¶

These alerts correlate with the to be planned in sprint planning severity level.
The community is not necessarily affected on a specific timeline, but we must move some action into the committed column of the next sprint.

Alerts configured with Jsonnet¶

Certain alerts are configured in the support deployments of our infrastructure using Jsonnet.

Configuration¶

We use the Prometheus Alertmanager to set up alerts, which are defined in the helm-charts/support/values.jsonnet file.

At the time of writing, we have the following alerting rule groups, each containing one or more alerts:

PVC available capacity: For when a persistent volume claim (PVC) is approaching full capacity, with the following alerts:
- Home Directory Disk 90% full
- Home Directory Disk 100% full (outage)
- Hub Database Disk 90% full
- Prometheus Disk 90% full
Important Pod Restart: For when a pod has restarted, with the following alerts:
- jupyterhub-groups-exporter restart
- jupyterhub-home-nfs restart
- jupyterhub-cost-monitoring restart
- support-grafana restart
- support-prometheus-server restart
- proxy restart
Server Startup Failure: For when a user server has failed to start.
DiskIO saturation: For when a disk is approaching IO saturation.
Pods stuck in an undesirable state for too long: For when there’s a pod that’s stuck in Pending for more than 15m or a pod stuck in Terminating for more than 10m.
Possible application outage: For when an application is not working as expected.
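The thresholds quoted for two of these groups can be sketched in Python as follows. This is only an illustration of the conditions as they are described on this page, not the Prometheus rules in values.jsonnet: the function names are hypothetical, and whether the real rules use strict or non-strict comparisons is an assumption.

```python
from typing import Optional

# Limits taken from the "Pods stuck in an undesirable state" description.
STUCK_LIMIT_MINUTES = {"Pending": 15, "Terminating": 10}


def pvc_state(used_bytes: int, capacity_bytes: int) -> Optional[str]:
    """Classify PVC usage against the 90% and 100% thresholds."""
    fraction = used_bytes / capacity_bytes
    if fraction >= 1.0:
        return "100% full (outage)"
    if fraction >= 0.9:
        return "90% full"
    return None  # below both thresholds, nothing to report


def is_stuck(state: str, minutes_in_state: float) -> bool:
    """True when a pod has been in `state` for more than its limit."""
    limit = STUCK_LIMIT_MINUTES.get(state)
    return limit is not None and minutes_in_state > limit


print(pvc_state(95, 100))         # 90% full
print(is_stuck("Pending", 20))    # True
```

A pod that has been Terminating for exactly 10 minutes is not reported here, because the description says "more than 10m".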

Each of these alerts is integrated with a PagerDuty Service. These services can then be grouped under PagerDuty Business Services, which can be presented on the status page.

Alerts configured with Terraform¶

Some alerts are configured at the infrastructure level using Terraform.

Configuration¶

AWS NFS Home Directory IOPS & Throughput: When the EBS volume for NFS home directory storage is saturated at its provisioned IOPS or throughput limits for three out of five one-minute collection periods, a CloudWatch Alarm is triggered, which propagates through to PagerDuty.
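The "three in five" condition is an M-out-of-N evaluation: the alarm fires when at least three of the last five one-minute datapoints are at the limit. The Python sketch below illustrates that logic only; it is not the Terraform configuration or the CloudWatch API. The function name and the sample values are hypothetical, and the parameter names simply echo CloudWatch's "datapoints to alarm" and "evaluation periods" settings.

```python
from typing import Sequence


def alarm_triggered(
    samples: Sequence[float],
    limit: float,
    datapoints_to_alarm: int = 3,
    evaluation_periods: int = 5,
) -> bool:
    """Return True when enough recent datapoints reach the provisioned limit."""
    # Only the most recent `evaluation_periods` datapoints are considered.
    window = samples[-evaluation_periods:]
    # Assumption: "saturated" means the metric is at or above the limit.
    breaching = sum(1 for value in window if value >= limit)
    return breaching >= datapoints_to_alarm


# Hypothetical one-minute IOPS readings against a hypothetical limit of 3000.
print(alarm_triggered([3000, 2100, 3000, 2500, 3000], limit=3000))  # True
print(alarm_triggered([3000, 2100, 2900, 2500, 3000], limit=3000))  # False
```

Requiring three breaching datapoints rather than one means a single short burst at the limit does not page anyone.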

Important PagerDuty pages to know about¶

All of the alerts we have configured are managed by PagerDuty. PagerDuty provides some important web pages that are worth knowing about:

2i2c’s PagerDuty page
List of incidents: This is where all incidents can be found.
Internal status page: This is where outages will show up, per business service. Clicking on an incident from this page will link you to the alert.
External status page: This is where outages will show up, per business service, to the outside world. This is where people can:
- subscribe for updates about outages
- subscribe to get info about maintenance windows that we might post
- find out about the uptime of each Business Service