New Kubernetes cluster on GCP, Azure or AWS#
This guide will walk through the process of adding a new cluster to our terraform configuration.
You can find out more about terraform in our Terraform documentation and in Terraform's own documentation.
Attention
Currently, we do not deploy clusters to AWS solely using terraform. We use eksctl to provision our k8s clusters on AWS and terraform to provision supporting infrastructure, such as storage buckets.
Cluster Design#
This guide will assume you have already followed the guidance in Cluster design considerations to select the appropriate infrastructure.
Prerequisites#
For all providers:

Install kubectl, helm, sops, etc. In Setting up your local environment to work on this repo you will find instructions on how to set up sops to encrypt and decrypt files.

Additionally, for AWS:

Install aws. Verify the install and version with aws --version. You should have at least version 2.

Install or upgrade eksctl. Mac users with homebrew can run brew install eksctl. Verify the install and version with eksctl version. You should have the latest version of this CLI.

Important
Without the latest version, you may install an outdated version of aws-node, because it is hardcoded.

Install jsonnet. Mac users with homebrew can run brew install jsonnet. Verify the install and version with jsonnet --version.

Additionally, for Jetstream2:

pip install python-openstackclient and python-magnumclient.
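To confirm the tools are installed and on your PATH, you can run a quick check like the following (these are the standard version flags for each CLI; skip the tools that don't apply to your provider):

kubectl version --client
helm version
sops --version
aws --version        # AWS only
eksctl version       # AWS only
jsonnet --version    # AWS only
openstack --version  # Jetstream2 only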
Create a new cluster#
Setup credentials#
AWS: Depending on whether this project is using AWS SSO or not, you can use the following links to figure out how to authenticate to this project from your terminal.

GCP: N/A

Azure: N/A

Jetstream2: You will need to generate Jetstream2 application credentials that the CLI client will use to authenticate against the desired Jetstream2 allocation.
There is a comprehensive guide on how to generate the credentials, and export them as environment variables through sourcing an openrc.sh
file. It is important to note that when creating the application credentials you must give them UNRESTRICTED
access by ticking the corresponding box and also select all roles available to you in the ROLES
box.
Go to https://js2.jetstream-cloud.org/ and follow the guide at https://cvw.cac.cornell.edu/jetstreamapi/cli/openrc, but keep in mind the UNRESTRICTED part, as that's not covered in the guide.

After exporting the variables in the openrc.sh file, make sure you have access by running:
openstack coe cluster list
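Putting it together, the flow looks roughly like this (the openrc filename is illustrative; use whatever name the Jetstream2 dashboard gave your downloaded credentials file):

# filename is illustrative
source app-cred-<your-credential-name>-openrc.sh
# a (possibly empty) cluster list, rather than an authentication error, means the credentials work
openstack coe cluster list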
Generate cluster files#
We automatically generate the files required to set up a new cluster.

For AWS, this includes:

A .jsonnet file for use with eksctl
A sops encrypted ssh key that can be used to ssh into the kubernetes nodes
An ssh public key used by eksctl to grant access to the private key
A .tfvars terraform variables file that will set up most of the non-EKS infrastructure
The cluster config directory in ./config/cluster/<new-cluster>
The cluster.yaml config file
The support values file support.values.yaml
The support credentials encrypted file enc-support.values.yaml

For GCP, this includes:

A .tfvars file for use with terraform
The cluster config directory in ./config/cluster/<new-cluster>
A sample cluster.yaml config file
The support values file support.values.yaml
The support credentials encrypted file enc-support.values.yaml
For AWS, you can generate these with:
export CLUSTER_NAME=<cluster-name>
export CLUSTER_REGION=<cluster-region, like ca-central-1>
export ACCOUNT_ID=<declare 2i2c for clusters under 2i2c SSO, otherwise an account id or alias>
deployer generate dedicated-cluster aws --cluster-name=$CLUSTER_NAME --cluster-region=$CLUSTER_REGION --account-id=$ACCOUNT_ID
Create and render an eksctl config file
We use an eksctl config file in YAML to specify
how our cluster should be built. Since it can get repetitive, we use
jsonnet to declaratively specify this config. You can
find the .jsonnet
files for the current clusters in the eksctl/
directory.
The previous step should’ve created a baseline .jsonnet
file you can modify as
you like. The eksctl docs have a reference
for all the possible options. You’d want to make sure to change at least the following:
Region / Zone - make sure you are creating your cluster in the correct region and verify the suggested zones 1a, 1b, and 1c actually are available in that region.
# a command to list availability zones, for example
# ca-central-1 doesn't have 1c, but 1d instead
aws ec2 describe-availability-zones --region=$CLUSTER_REGION
Size of nodes in instancegroups, for both notebook nodes and dask nodes. In particular, make sure you have enough quota to launch these instances in your selected regions.
Kubernetes version - older
.jsonnet
files might be on older versions, but you should pick a newer version when you create a new cluster.
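Related to the zone and node size items above, a quick way to check that a given instance type is actually offered in your chosen region is sketched below (the instance type is just an example; quota itself is checked separately, e.g. in the AWS Service Quotas console):

aws ec2 describe-instance-type-offerings \
  --region=$CLUSTER_REGION \
  --filters "Name=instance-type,Values=r5.xlarge"
# add --location-type availability-zone to see which zones offer it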
Once you have a .jsonnet
file, you can render it into a config file that eksctl
can read.
Tip
Make sure to run this command inside the eksctl
directory.
jsonnet $CLUSTER_NAME.jsonnet > $CLUSTER_NAME.eksctl.yaml
Tip
The *.eksctl.yaml files are git ignored as we can regenerate them, so work
against the *.jsonnet file and regenerate the YAML file when needed by an
eksctl command.
Create the cluster
Now you’re ready to create the cluster!
Tip
Make sure to run this command inside the eksctl
directory, otherwise it cannot discover the ssh-keys
subfolder.
eksctl create cluster --config-file=$CLUSTER_NAME.eksctl.yaml
This might take a few minutes.
If any errors are reported in the config (there is a schema validation step),
fix it in the .jsonnet
file, re-render the config, and try again.
Once it is done, you can test access to the new cluster with kubectl
, after
getting credentials via:
aws eks update-kubeconfig --name=$CLUSTER_NAME --region=$CLUSTER_REGION
kubectl
should be able to find your cluster now! kubectl get node
should show
you at least one core node running.
For GCP, you can generate these with:
export CLUSTER_NAME=<cluster-name>
export CLUSTER_REGION=<cluster-region, like us-central1>
export PROJECT_ID=<gcp-project-id>
deployer generate dedicated-cluster gcp --cluster-name=$CLUSTER_NAME --project-id=$PROJECT_ID --cluster-region=$CLUSTER_REGION
For Azure, an automated deployer command doesn't exist yet, so these files need to be manually generated. The minimum inputs the .tfvars file requires are:

subscription_id: Azure subscription ID to create resources in. Should be the id, rather than the display name, of the project.
resourcegroup_name: The name of the Resource Group to be created by terraform, where the cluster and other resources will be deployed into.
global_container_registry_name: The name of an Azure Container Registry to be created by terraform to use for our image. This must be unique across all of Azure. You can use the following Azure CLI command to check your desired name is available:
az acr check-name --name ACR_NAME --output table
global_storage_account_name: The name of a storage account to be created by terraform to use for Azure File Storage. This must be unique across all of Azure. You can use the following Azure CLI command to check your desired name is available:
az storage account check-name --name STORAGE_ACCOUNT_NAME --output table
ssh_pub_key: The public half of an SSH key that will be authorised to login to nodes.
See the variables file for other inputs this file can take and their descriptions.
Naming Convention Guidelines for Container Registries and Storage Accounts
Names for Azure container registries and storage accounts must conform to the following guidelines:
alphanumeric strings between 5 and 50 characters for container registries, e.g. myContainerRegistry007
strings of lowercase letters and numbers between 2 and 24 characters for storage accounts, e.g. mystorageaccount314
Note
A failure will occur if you try to create a storage account whose name is not entirely lowercase.
We recommend the following conventions, using lowercase:
{CLUSTER_NAME}hubregistry for container registries
{CLUSTER_NAME}hubstorage for storage accounts
Note
Changes in Azure’s own requirements might break our recommended convention. If any such failure occurs, please signal it.
This increases the probability that we won’t take up a namespace that may be required by the Hub Community, for example, in cases where we are deploying to Azure subscriptions not owned/managed by 2i2c.
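For example, applying the recommended convention and re-using the availability checks from above (the value of CLUSTER_NAME is a placeholder):

az acr check-name --name "${CLUSTER_NAME}hubregistry" --output table
az storage account check-name --name "${CLUSTER_NAME}hubstorage" --output table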
Example .tfvars
file:
subscription_id = "my-awesome-subscription-id"
resourcegroup_name = "my-awesome-resource-group"
global_container_registry_name = "myawesomehubregistry"
global_storage_account_name = "myawesomestorageaccount"
ssh_pub_key = "ssh-rsa my-public-ssh-key"
For Jetstream2, an automated deployer command doesn't exist yet, so these files need to be manually generated. The minimum inputs the .tfvars file requires are:

prefix: A prefix that will be added to all the cluster-specific resources. Changing this will force the recreation of all resources.
notebook_nodes: A list of notebook nodes that will be created in the cluster. The machine_type should be one of the Jetstream2 flavors.

Warning
The value of min cannot be zero, as the Magnum API driver currently doesn't support having a nodepool with zero nodes.
cannot be zero as currently the Magnum API driver doesn’t support having any nodepool with zero nodes.notebook_nodes = { "m3.medium" : { min : 1, max : 100, # 8 CPU, 30 RAM # https://docs.jetstream-cloud.org/general/instance-flavors/#jetstream2-cpu machine_type : "m3.medium", labels = { "hub.jupyter.org/node-purpose" = "user", "k8s.dask.org/node-purpose" = "scheduler", } }, }
Add GPU nodegroup if needed#
If this cluster is going to have GPUs, you should edit the generated jsonnet file to include GPU nodegroups.
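If the cluster has already been created, a rough sketch of the follow-up is to re-render the config and ask eksctl to create just the new nodegroups (check eksctl's docs for the exact behaviour when some nodegroups already exist; --include can be used to target specific ones):

jsonnet $CLUSTER_NAME.jsonnet > $CLUSTER_NAME.eksctl.yaml
eksctl create nodegroup --config-file=$CLUSTER_NAME.eksctl.yaml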
Initialising Terraform#
Our default terraform state is located centrally in our two-eye-two-see-org
GCP project, therefore you must authenticate gcloud
to your @2i2c.org
account before initialising terraform.
The terraform state includes all cloud providers, not just GCP.
gcloud auth application-default login
Then you can change into the terraform subdirectory for the appropriate cloud provider and initialise terraform.
Our AWS terraform code is now used to deploy supporting infrastructure for the EKS cluster, including:
An IAM identity account for use with our CI/CD system
Appropriately networked EFS storage to serve as an NFS server for hub home directories
Optionally, set up a shared database
Optionally, set up user buckets
The steps above will have created a default .tfvars
file. This file can either be used as-is or edited to enable the optional features listed above.
Initialise terraform for your chosen cloud provider:

For AWS:
cd terraform/aws
terraform init

For GCP:
cd terraform/gcp
terraform init

For Azure:
cd terraform/azure
terraform init

For Jetstream2:
cd terraform/openstack
terraform init
Creating a new terraform workspace#
We use terraform workspaces so that the state of one .tfvars
file does not influence another.
Create a new workspace with the below command, and again give it the same name as the .tfvars
filename, $CLUSTER_NAME.
terraform workspace new $CLUSTER_NAME
Note
Workspaces are defined per backend. If you can’t find the workspace you’re looking for, double check you’ve enabled the correct backend.
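A quick way to sanity-check which backend you are pointed at:

# run from the terraform/<provider> directory you initialised
terraform workspace list
# if the expected workspace is missing here, re-run terraform init in the right directory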
Setting up Budget Alerts#
Follow the instructions in Setting up Budget Alerts to determine if and how you should setup budget alerts.
You can learn more about our budget alerts in Cloud Billing Budget Alerts.
Plan and Apply Changes#
Important
When deploying to Google Cloud, make sure the Compute Engine, Kubernetes Engine, Artifact Registry, Cloud Filestore, and Cloud Logging APIs are enabled on the project before deploying!
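If any of these APIs are missing, they can be enabled in the console or, as a sketch, with gcloud (the service names below are our best mapping of the APIs listed above):

gcloud services enable \
  compute.googleapis.com \
  container.googleapis.com \
  artifactregistry.googleapis.com \
  file.googleapis.com \
  logging.googleapis.com \
  --project=$PROJECT_ID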
First, make sure you are in the new workspace that you just created.
terraform workspace show
Plan your changes with the terraform plan
command, passing the .tfvars
file as a variable file.
terraform plan -var-file=projects/$CLUSTER_NAME.tfvars
Check over the output of this command to ensure nothing is being created/deleted that you didn’t expect. Copy-paste the plan into your open Pull Request so a fellow 2i2c engineer can double check it too.
If you’re both satisfied with the plan, merge the Pull Request and apply the changes to deploy the cluster.
terraform apply -var-file=projects/$CLUSTER_NAME.tfvars
Congratulations, you’ve just deployed a new cluster!
Exporting and Encrypting the Cluster Access Credentials#
In the previous step, we will have created an IAM user with just enough permissions for automatic deployment of hubs from CI/CD. Since these credentials are checked in to our git repository and made public, they should have the least amount of permissions possible.
To begin deploying and operating hubs on your new cluster, we need to export these credentials, encrypt them using sops
, and store them in the secrets
directory of the infrastructure
repo.
First, make sure you are in the right terraform directory:
cd terraform/aws
cd terraform/gcp
cd terraform/azure
cd terraform/openstack
Check you are still in the correct terraform workspace
terraform workspace show
If you need to change, you can do so as follows
terraform workspace list    # List all available workspaces
terraform workspace select WORKSPACE_NAME
Fetch credentials for automatic deployment. Create the directory if it doesn't exist already:

mkdir -p ../../config/clusters/$CLUSTER_NAME

For AWS:
terraform output -raw continuous_deployer_creds > ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json

For GCP:
terraform output -raw ci_deployer_key > ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json

For Azure:
terraform output -raw kubeconfig > ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.yaml

For Jetstream2, to access the cluster using kubectl we need to get the kubeconfig with:
openstack coe cluster config <cluster-name> --force > ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json

This command will generate a file named config in the cwd with the configuration. The --force flag will overwrite this file if it already exists.
For Jetstream2, encrypt the kubeconfig file using sops:

sops --output ./config --encrypt ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json

Note
You must be logged into Google with your @2i2c.org account at this point so sops can read the encryption key from the two-eye-two-see project.

Then delete the config file to avoid committing it by mistake:

rm ./config

Then encrypt the key using sops.

Important
This step can be skipped for Jetstream2 because the kubeconfig file is already encrypted from step 1.

Note
You must be logged into Google with your @2i2c.org account at this point so sops can read the encryption key from the two-eye-two-see project.

sops --output ../../config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json --encrypt ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json

This key can now be committed to the infrastructure repo and used to deploy and manage hubs hosted on that cluster.

Double check that the config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json file is actually encrypted by sops before checking it in to the git repo. Otherwise this can be a serious security leak!

cat ../../config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json
Create a cluster.yaml file#
See also
We use cluster.yaml
files to describe a specific cluster and all the hubs deployed onto it.
See Configuration structure for more information.
Create a cluster.yaml file under the config/clusters/$CLUSTER_NAME folder and populate it with the following info:
For AWS and GCP, a cluster.yaml file should already have been generated as part of Generate cluster files.
Billing information
For projects where we are paying the cloud bill & then passing costs through, you need to fill
in information under gcp.billing.bigquery
and set gcp.billing.paid_by_us
to true
. Partnerships
should be able to tell you if we are doing cloud costs pass through or not.
To find these values:

Go to the Billing tab on the Google Cloud Console.
Make sure the correct project is selected in the top bar. You might have to select the 'All' tab in the project chooser if you do not see the project right away.
Click 'Go to billing account'.
In the default view (Overview) that opens, you can find the value for billing_id in the right sidebar, under "Billing Account". It should be of the form XXXXXX-XXXXXX-XXXXXX.
Select "Billing export" on the left navigation bar, and you will find the values for project and dataset under "Detailed cost usage". If "Detailed cost usage" is not set up, you should enable it.
Warning
For Azure, we use the kubeconfig-based config below only when we do not have permissions on the Azure subscription to create a Service Principal with terraform.
name: <cluster-name> # This should also match the name of the folder: config/clusters/$CLUSTER_NAME
provider: kubeconfig
kubeconfig:
# The location of the *encrypted* key we exported from terraform
file: enc-deployer-credentials.secret.yaml
Otherwise, for Azure:
name: <cluster-name> # This should also match the name of the folder: config/clusters/$CLUSTER_NAME
provider: azure
azure:
# The location of the *encrypted* key we exported from terraform
key: enc-deployer-credentials.secret.json
# The name of the cluster *as it appears in the Azure Portal*! Sometimes our
# terraform code adjusts the contents of the 'name' field, so double check this.
cluster: <cluster-name>
# The name of the resource group the cluster has been deployed into. This is
# the same as the resourcegroup_name variable in the .tfvars file.
resource_group: <resource-group-name>
For Jetstream2:
name: <cluster-name> # This should also match the name of the folder: config/clusters/$CLUSTER_NAME
provider: kubeconfig
kubeconfig:
# The location of the *encrypted* key we exported from terraform
file: enc-deployer-credentials.secret.yaml
Commit this file to the repo.
Access#
Grant the deployer’s IAM user cluster access
Note
This still works, but makes use of a deprecated system (iamidentitymapping
and
aws-auth
ConfigMap in kube-system namespace) instead of the new system called
EKS access entries. Migrating to the new system is tracked by this github issue.
We need to grant the freshly created deployer IAM user access to the kubernetes cluster.
As this requires passing in some parameters that match the created cluster, we have a
terraform output
that can give you the exact command to run.
terraform output -raw eksctl_iam_command
Run the
eksctl create iamidentitymapping
command returned byterraform output
. That should give the continuous deployer user access.
The command should look like this:
eksctl create iamidentitymapping \
--cluster $CLUSTER_NAME \
--region $CLUSTER_REGION \
--arn arn:aws:iam::<aws-account-id>:user/hub-continuous-deployer \
--username hub-continuous-deployer \
--group system:masters
Test the access by running:
deployer use-cluster-credentials $CLUSTER_NAME
and running:
kubectl get node
It should show you the provisioned node on the cluster if everything works out ok.
Grant cluster access to other users
Note
This step is only needed within AWS accounts outside 2i2c’s AWS organization where we haven’t logged in using 2i2c SSO.
This is because new EKS clusters come with an access entry for the user or role that created the cluster. When we work against an AWS account within the 2i2c AWS organization, we all assume the same role, so an access entry for that role grants us all access. However, when we work against AWS accounts outside the 2i2c AWS organization, we typically use an IAM user directly, and that will be different for each of us, so we then need to add access entries for the other engineers as well.
Find the usernames of the 2i2c engineers on this particular AWS account, and run the following command to give them access. This uses the deprecated iamidentitymapping system, which is still active in parallel with the newer access entries system:
Note
You can adapt the command output by terraform output -raw eksctl_iam_command, as described in Exporting and Encrypting the Cluster Access Credentials, modifying the --arn and --username values for each engineer.
eksctl create iamidentitymapping \
--cluster $CLUSTER_NAME \
--region $CLUSTER_REGION \
--arn arn:aws:iam::<aws-account-id>:user/<iam-user-name> \
--username <iam-user-name> \
--group system:masters
This gives all the users full access to the entire kubernetes cluster. After this step is done, they can fetch local config with:
aws eks update-kubeconfig --name=$CLUSTER_NAME --region=$CLUSTER_REGION
This should eventually be converted to use an IAM Role instead, so we need not give each individual user access, but just grant access to the role - and users can modify them as they wish. It should also eventually be converted to use access entries instead of the legacy system active in parallel.
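For reference, a rough sketch of the equivalent under the newer access entries system is below; treat it as an assumption to verify against the EKS docs rather than an established part of our workflow:

# create an access entry for the engineer's IAM user
aws eks create-access-entry \
  --cluster-name $CLUSTER_NAME \
  --region $CLUSTER_REGION \
  --principal-arn arn:aws:iam::<aws-account-id>:user/<iam-user-name>
# attach the cluster admin access policy to that entry
aws eks associate-access-policy \
  --cluster-name $CLUSTER_NAME \
  --region $CLUSTER_REGION \
  --principal-arn arn:aws:iam::<aws-account-id>:user/<iam-user-name> \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster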
Test deployer access by running:
deployer use-cluster-credentials $CLUSTER_NAME
and running:
kubectl get node
It should show you the provisioned node on the cluster if everything works out ok.