Alerting

We have a few alerts configured to notify us when things go wrong, and we use PagerDuty to manage them.

How to manage alerts

Severity levels

When an alert threshold is crossed, an automatic notification is sent to PagerDuty and the #pagerduty-notifications channel on the 2i2c Slack.

Each alert set up with Jsonnet has a severity level, which is set in the *.jsonnet configuration file. The severity levels are:

The severity level determines how quickly you should respond to the alert, and it translates into the priority of the incident created in PagerDuty. After an incident is created, PagerDuty runs an Event Orchestration that sets the incident's priority based on the severity label.
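The label-to-priority step can be pictured as a simple lookup. The sketch below is only an illustration: the severity label names (`severity-1` to `severity-4`) and their pairing with priorities are hypothetical placeholders, not the values used in the real configuration, and the actual mapping is done by the Event Orchestration in PagerDuty rather than by Python code.

```python
from typing import Dict, Optional

# Hypothetical table: these label names and pairings are placeholders for
# illustration only. The real values live in the *.jsonnet files and in the
# PagerDuty Event Orchestration rules.
SEVERITY_TO_PRIORITY: Dict[str, str] = {
    "severity-1": "P1",
    "severity-2": "P2",
    "severity-3": "P3",
    "severity-4": "P4",
}


def priority_for(labels: Dict[str, str]) -> Optional[str]:
    """Return the priority for an alert's labels, or None if nothing matches.

    None stands for an incident that has no priority set at all.
    """
    return SEVERITY_TO_PRIORITY.get(labels.get("severity", ""))
```

An alert without a recognised severity label falls through to `None`, which corresponds to an incident with no priority.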

Priority levels

The PagerDuty alerts can have a priority between P1 and P4 or have no priority set at all.

P1 alerts

P2 alerts

P3 alerts

P4 alerts

Alerts configured with Jsonnet

Certain alerts are configured in support deployments using Jsonnet in our Infrastructure.

Configuration

We use the Prometheus Alertmanager to set up alerts that are defined in the helm-charts/support/values.jsonnet file.

At the time of writing, we have the following alerting rule groups, each containing one or more alerts:

  1. PVC available capacity. For when a persistent volume claim (PVC) is approaching full capacity, with the following alerts:

    • Home Directory Disk 90% full

    • Home Directory Disk 100% full (outage)

    • Hub Database Disk 90% full

    • Prometheus Disk 90% full

  2. Important Pod Restart. For when a pod has restarted, with the following alerts:

    • jupyterhub-groups-exporter restart

    • jupyterhub-home-nfs restart

    • jupyterhub-cost-monitoring restart

    • support-grafana restart

    • support-prometheus-server restart

    • proxy restart

  3. Server Startup Failure. For when a user server has failed to start.

  4. DiskIO saturation. For when a disk is approaching IO saturation.

  5. Pods stuck in an undesirable state for too long. For when a pod is stuck in Pending for more than 15 minutes or in Terminating for more than 10 minutes.

  6. Possible application outage. For when an application is not working as expected.
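The thresholds named in the list above can be sketched as plain checks. This is a minimal illustration of the logic only: the real rules are Prometheus alerting rules generated from Jsonnet, the function names here are hypothetical, and the exact comparison operators are assumptions.

```python
from datetime import timedelta
from typing import List, Optional

# Durations taken from the "Pods stuck in an undesirable state" description.
STUCK_LIMITS = {
    "Pending": timedelta(minutes=15),
    "Terminating": timedelta(minutes=10),
}


def stuck_pod_state(state: str, time_in_state: timedelta) -> Optional[str]:
    """Return the state name if the pod has been in it for too long, else None."""
    limit = STUCK_LIMITS.get(state)
    if limit is not None and time_in_state > limit:
        return state
    return None


def home_directory_disk_alerts(used_bytes: int, capacity_bytes: int) -> List[str]:
    """Return the home directory disk alerts that would apply at this usage.

    Assumption: "90% full" means used/capacity >= 0.9, and "100% full"
    means the volume has no free space left.
    """
    used_fraction = used_bytes / capacity_bytes
    alerts = []
    if used_fraction >= 0.9:
        alerts.append("Home Directory Disk 90% full")
    if used_fraction >= 1.0:
        alerts.append("Home Directory Disk 100% full (outage)")
    return alerts
```

For example, a pod that has been Pending for 16 minutes is reported as stuck, while a pod that has been Running for any length of time is not.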

Each of these alerts is integrated with a PagerDuty Service. These services can then be grouped under PagerDuty Business Services, which can be presented on the status page.

Alerts configured with Terraform

Some alerts are configured at the infrastructure level using Terraform.

Configuration

  1. AWS NFS Home Directory IOPS & Throughput. When the EBS volume for NFS home directory storage is saturated at its provisioned IOPS or throughput limits for three out of five one-minute collection periods, a CloudWatch Alarm is triggered, which propagates through to PagerDuty.
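The "three in five" condition is an M-out-of-N evaluation, which can be sketched as follows. The parameter names mirror CloudWatch's "datapoints to alarm" and "evaluation periods" settings, but the function itself is a hypothetical illustration, and the assumption that the Terraform configuration sets those two values to 3 and 5 is inferred from the description rather than read from the configuration.

```python
from typing import Sequence


def alarm_fires(
    breaching: Sequence[bool],
    datapoints_to_alarm: int = 3,
    evaluation_periods: int = 5,
) -> bool:
    """Decide whether an M-out-of-N alarm is in the alarm state.

    `breaching` holds one flag per one-minute collection period, oldest
    first; True means the volume was at its IOPS or throughput limit.
    Only the most recent `evaluation_periods` flags are considered.
    """
    window = breaching[-evaluation_periods:]
    return sum(window) >= datapoints_to_alarm
```

The breaching periods do not need to be consecutive: three saturated minutes anywhere within the last five are enough to trigger the alarm in this sketch.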

Important PagerDuty pages to know about

All of the alerts we have configured are managed by PagerDuty. PagerDuty provides some important web pages that are worth knowing about:

  1. 2i2c’s PagerDuty page

  2. List of incidents. This is where all incidents can be found.

  3. Internal status page. This is where outages will show up, per business service. Clicking on an incident from this page will link you to the alert.

  4. External status page. This is where outages will show up, per business service, to the outside world. This is where people can:

    • subscribe for updates about outages

    • subscribe to get info about maintenance windows that we might post

    • find out about the uptime of each Business Service