New Kubernetes cluster on GCP, Azure or AWS#
This guide will walk through the process of adding a new cluster to our terraform configuration.
You can find out more about terraform in our Terraform documentation and in the official terraform documentation.
Attention
Currently, we do not deploy clusters to AWS solely using terraform. We use eksctl to provision our k8s clusters on AWS and terraform to provision supporting infrastructure, such as storage buckets.
Cluster Design#
This guide will assume you have already followed the guidance in Cluster design considerations to select the appropriate infrastructure.
Prerequisites#
- Install kubectl, helm, sops, etc. In Setting up your local environment to work on this repo you will find instructions on how to set up sops to encrypt and decrypt files.
- Install aws. Verify the install and version with aws --version. You should have at least version 2.
- Install or upgrade eksctl. Mac users with homebrew can run brew install eksctl. Verify the install and version with eksctl version. You should have the latest version of this CLI.
  Important
  Without the latest version, you may install an outdated version of aws-node because it is hardcoded.
- Install jsonnet. Mac users with homebrew can run brew install jsonnet. Verify the install and version with jsonnet --version.
The aws, eksctl, and jsonnet tools are only needed when working with AWS clusters; for GCP and Azure clusters, kubectl, helm, and sops are enough.
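As an optional sanity check, you can confirm the tools are installed and on your PATH before continuing (these are the standard version commands for each CLI):
kubectl version --client
helm version
sops --version
aws --version
eksctl version
jsonnet --version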
Create a new cluster#
Setup credentials#
Depending on whether this project is using AWS SSO or not, follow the relevant guidance to authenticate to this project from your terminal.
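For example, on a project using AWS SSO, authenticating from the terminal typically looks something like the sketch below; the profile name is a placeholder, not a real profile in our configuration:
# Log in via AWS SSO (assumes a profile has already been configured with `aws configure sso`)
aws sso login --profile my-2i2c-sso-profile
# Confirm which account and identity you are authenticated as
aws sts get-caller-identity --profile my-2i2c-sso-profile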
Generate cluster files#
We automatically generate the files required to set up a new cluster:
For AWS:
- A .jsonnet file for use with eksctl
- A sops encrypted ssh key that can be used to ssh into the kubernetes nodes
- A ssh public key used by eksctl to grant access to the private key
- A .tfvars terraform variables file that will set up most of the non-EKS infrastructure
- The cluster config directory in ./config/clusters/<new-cluster>, containing:
  - The cluster.yaml config file
  - The support values file support.values.yaml
  - The support credentials encrypted file enc-support.values.yaml
For GCP:
- A .tfvars file for use with terraform
- The cluster config directory in ./config/clusters/<new-cluster>, containing:
  - A sample cluster.yaml config file
  - The support values file support.values.yaml
  - The support credentials encrypted file enc-support.values.yaml
Warning
For Azure, an automated deployer command doesn't exist yet, so those files need to be generated manually (see the Azure-specific instructions below).
For AWS, you can generate these files with:
export CLUSTER_NAME=<cluster-name>
export CLUSTER_REGION=<cluster-region-like ca-central-1>
export ACCOUNT_ID=<declare 2i2c for clusters under 2i2c SSO, otherwise an account id or alias>
deployer generate dedicated-cluster aws --cluster-name=$CLUSTER_NAME --cluster-region=$CLUSTER_REGION --account-id=$ACCOUNT_ID
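As a quick, illustrative check that the generation worked, the files listed above should now exist; the exact paths below are assumptions based on that list and on where this repo keeps its eksctl files, so adjust them if your layout differs:
# Run from the repository root
ls eksctl/$CLUSTER_NAME.jsonnet
ls eksctl/ssh-keys/
ls config/clusters/$CLUSTER_NAME/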
Create and render an eksctl config file
We use an eksctl config file in YAML to specify how our cluster should be built. Since it can get repetitive, we use jsonnet to declaratively specify this config. You can find the .jsonnet files for the current clusters in the eksctl/ directory.
The previous step should've created a baseline .jsonnet file you can modify as you like. The eksctl docs have a reference for all the possible options. You'd want to make sure to change at least the following:
Region / Zone - make sure you are creating your cluster in the correct region and verify the suggested zones 1a, 1b, and 1c actually are available in that region.
# a command to list availability zones, for example
# ca-central-1 doesn't have 1c, but 1d instead
aws ec2 describe-availability-zones --region=$CLUSTER_REGION
Size of nodes in instancegroups, for both notebook nodes and dask nodes. In particular, make sure you have enough quota to launch these instances in your selected regions (see the quota-check sketch after this list).
Kubernetes version - older .jsonnet files might be on older versions, but you should pick a newer version when you create a new cluster.
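A hedged sketch of the quota check mentioned above, using the AWS CLI's Service Quotas API; the quota code below is assumed to be the one for Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances, so verify it in the Service Quotas console if in doubt:
# vCPU quota for on-demand standard (non-GPU) instances in the target region
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --region $CLUSTER_REGION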
Once you have a .jsonnet file, you can render it into a config file that eksctl can read.
Tip
Make sure to run this command inside the eksctl directory.
jsonnet $CLUSTER_NAME.jsonnet > $CLUSTER_NAME.eksctl.yaml
Tip
The *.eksctl.yaml files are git ignored as we can regenerate them, so work against the *.jsonnet file and regenerate the YAML file when needed by an eksctl command.
Create the cluster
Now you’re ready to create the cluster!
Tip
Make sure to run this command inside the eksctl directory, otherwise it cannot discover the ssh-keys subfolder.
eksctl create cluster --config-file=$CLUSTER_NAME.eksctl.yaml
This might take a few minutes.
If any errors are reported in the config (there is a schema validation step), fix them in the .jsonnet file, re-render the config, and try again.
Once it is done, you can test access to the new cluster with kubectl, after getting credentials via:
aws eks update-kubeconfig --name=$CLUSTER_NAME --region=$CLUSTER_REGION
kubectl should be able to find your cluster now! kubectl get node should show you at least one core node running.
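As an optional extra check, you can list the nodes together with their node-purpose label; the label name below follows the zero-to-jupyterhub convention and is an assumption about how the nodegroups in your .jsonnet file are labelled:
kubectl get node --label-columns=hub.jupyter.org/node-purpose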
For GCP, generate the cluster files with:
export CLUSTER_NAME=<cluster-name>
export CLUSTER_REGION=<cluster-region-like ca-central-1>
export PROJECT_ID=<gcp-project-id>
deployer generate dedicated-cluster gcp --cluster-name=$CLUSTER_NAME --project-id=$PROJECT_ID --cluster-region=$CLUSTER_REGION
For Azure, an automated deployer command doesn't exist yet, so these files need to be generated manually. The minimum inputs the .tfvars file requires are:
- subscription_id: Azure subscription ID to create resources in. Should be the ID, rather than the display name, of the project.
- resourcegroup_name: The name of the Resource Group to be created by terraform, where the cluster and other resources will be deployed into.
- global_container_registry_name: The name of an Azure Container Registry to be created by terraform to use for our image. This must be unique across all of Azure. You can use the following Azure CLI command to check your desired name is available:
  az acr check-name --name ACR_NAME --output table
- global_storage_account_name: The name of a storage account to be created by terraform to use for Azure File Storage. This must be unique across all of Azure. You can use the following Azure CLI command to check your desired name is available:
  az storage account check-name --name STORAGE_ACCOUNT_NAME --output table
- ssh_pub_key: The public half of an SSH key that will be authorised to log in to nodes.
See the variables file for other inputs this file can take and their descriptions.
Naming Convention Guidelines for Container Registries and Storage Accounts
Names for Azure container registries and storage accounts must conform to the following guidelines:
- alphanumeric strings between 5 and 50 characters for container registries, e.g., myContainerRegistry007
- strings of lowercase letters and numbers between 2 and 24 characters for storage accounts, e.g., mystorageaccount314
Note
A failure will occur if you try to create a storage account whose name is not entirely lowercase.
We recommend the following conventions, using lowercase:
- {CLUSTER_NAME}hubregistry for container registries
- {CLUSTER_NAME}hubstorage for storage accounts
Note
Changes in Azure’s own requirements might break our recommended convention. If any such failure occurs, please signal it.
This increases the probability that we won’t take up a namespace that may be required by the Hub Community, for example, in cases where we are deploying to Azure subscriptions not owned/managed by 2i2c.
Example .tfvars file:
subscription_id = "my-awesome-subscription-id"
resourcegroup_name = "my-awesome-resource-group"
global_container_registry_name = "myawesomehubregistry"
global_storage_account_name = "myawesomestorageaccount"
ssh_pub_key = "ssh-rsa my-public-ssh-key"
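Before running terraform, it can be worth re-checking that the names you settled on (for example, ones following the convention above) are still available; this simply reuses the az commands mentioned earlier and assumes $CLUSTER_NAME is lowercase alphanumeric:
az acr check-name --name ${CLUSTER_NAME}hubregistry --output table
az storage account check-name --name ${CLUSTER_NAME}hubstorage --output table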
Add GPU nodegroup if needed#
If this cluster is going to have GPUs, you should edit the generated jsonnet file to include GPU nodegroups.
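Before adding GPU nodegroups, it can help to confirm the project has GPU instance quota in the target region. A rough sketch using the AWS CLI is below; the quota name filter is an assumption (AWS currently calls this quota something like "Running On-Demand G and VT instances"), so an empty result may just mean the name differs:
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --region $CLUSTER_REGION \
  --query "Quotas[?contains(QuotaName, 'G and VT')].[QuotaName,Value]" \
  --output table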
Initialising Terraform#
Our default terraform state is located centrally in our two-eye-two-see-org GCP project, so you must authenticate gcloud to your @2i2c.org account before initialising terraform. The terraform state includes all cloud providers, not just GCP.
gcloud auth application-default login
Then you can change into the terraform subdirectory for the appropriate cloud provider and initialise terraform.
Our AWS terraform code is now used to deploy supporting infrastructure for the EKS cluster, including:
- An IAM identity account for use with our CI/CD system
- Appropriately networked EFS storage to serve as an NFS server for hub home directories
- Optionally, setup a shared database
- Optionally, setup user buckets
The steps above will have created a default .tfvars file. This file can either be used as-is or edited to enable the optional features listed above.
Initialise terraform for use with AWS:
cd terraform/aws
terraform init
cd terraform/gcp
terraform init -backend-config=backends/default-backend.hcl -reconfigure
cd terraform/azure
terraform init
Note
There are other backend config files stored in terraform/backends that will configure a different storage bucket to read/write the remote terraform state for projects which we cannot access from GCP with our @2i2c.org email accounts. This saves us the pain of having to handle multiple authentications as these storage buckets are within the project we are trying to deploy to.
For example, to work with Pangeo you would initialise terraform like so:
terraform init -backend-config=pangeo-backend.hcl -reconfigure
Creating a new terraform workspace#
We use terraform workspaces so that the state of one .tfvars file does not influence another. Create a new workspace with the below command, and again give it the same name as the .tfvars filename, $CLUSTER_NAME.
terraform workspace new $CLUSTER_NAME
Note
Workspaces are defined per backend. If you can’t find the workspace you’re looking for, double check you’ve enabled the correct backend.
Setting up Budget Alerts#
Follow the instructions in Setting up Budget Alerts to determine if and how you should setup budget alerts.
You can learn more about our budget alerts in Cloud Billing Budget Alerts.
Plan and Apply Changes#
Important
When deploying to Google Cloud, make sure the Compute Engine, Kubernetes Engine, Artifact Registry, Cloud Filestore, and Cloud Logging APIs are enabled on the project before deploying!
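One way to do this from the CLI is sketched below; the service identifiers are the standard API names for these products, but double check against the console if anything fails to enable:
gcloud services enable \
  compute.googleapis.com \
  container.googleapis.com \
  artifactregistry.googleapis.com \
  file.googleapis.com \
  logging.googleapis.com \
  --project=$PROJECT_ID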
First, make sure you are in the new workspace that you just created.
terraform workspace show
Plan your changes with the terraform plan command, passing the .tfvars file as a variable file.
terraform plan -var-file=projects/$CLUSTER_NAME.tfvars
Check over the output of this command to ensure nothing is being created/deleted that you didn’t expect. Copy-paste the plan into your open Pull Request so a fellow 2i2c engineer can double check it too.
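One convenient, optional way to capture the plan for the Pull Request is to tee it to a file, using terraform's standard -no-color flag so the output pastes cleanly:
terraform plan -var-file=projects/$CLUSTER_NAME.tfvars -no-color | tee terraform-plan-$CLUSTER_NAME.txt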
If you’re both satisfied with the plan, merge the Pull Request and apply the changes to deploy the cluster.
terraform apply -var-file=projects/$CLUSTER_NAME.tfvars
Congratulations, you’ve just deployed a new cluster!
Exporting and Encrypting the Cluster Access Credentials#
In the previous step, we will have created an IAM user with just enough permissions for automatic deployment of hubs from CI/CD. Since these credentials are checked in to our git repository and made public, they should have the least amount of permissions possible.
To begin deploying and operating hubs on your new cluster, we need to export these credentials, encrypt them using sops, and store them in the secrets directory of the infrastructure repo.
First, make sure you are in the right terraform directory:
cd terraform/aws
cd terraform/gcp
cd terraform/azure
Check you are still in the correct terraform workspace
terraform workspace show
If you need to change, you can do so as follows
terraform workspace list  # List all available workspaces
terraform workspace select WORKSPACE_NAME
Fetch credentials for automatic deployment
Create the directory if it doesn’t exist already:
mkdir -p ../../config/clusters/$CLUSTER_NAME
Then fetch the credentials, using the terraform output that matches your cloud provider:
# For AWS:
terraform output -raw continuous_deployer_creds > ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json
# For GCP:
terraform output -raw ci_deployer_key > ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json
# For Azure:
terraform output -raw kubeconfig > ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.yaml
Then encrypt the key using sops.
Note
You must be logged into Google with your @2i2c.org account at this point so sops can read the encryption key from the two-eye-two-see project.
sops --output ../../config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json --encrypt ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json
This key can now be committed to the infrastructure repo and used to deploy and manage hubs hosted on that cluster.
Double check that the config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json file is actually encrypted by sops before checking it in to the git repo. Otherwise this can be a serious security leak!
cat ../../config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json
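Besides eyeballing the output of cat, a rough automated check is possible because files encrypted by sops contain a top-level "sops" metadata key; this is only a sketch, not a substitute for reviewing the file:
grep -q '"sops"' ../../config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json \
  && echo "looks sops-encrypted" \
  || echo "WARNING: no sops metadata found, do not commit this file"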
Create a cluster.yaml file#
See also
We use cluster.yaml
files to describe a specific cluster and all the hubs deployed onto it.
See Configuration structure for more information.
Create a cluster.yaml file under the config/clusters/$CLUSTER_NAME folder and populate it with the following info:
For AWS and GCP, a cluster.yaml file should already have been generated as part of Generate cluster files.
Billing information
For projects where we are paying the cloud bill & then passing costs through, you need to fill in information under gcp.billing.bigquery and set gcp.billing.paid_by_us to true. Partnerships should be able to tell you if we are doing cloud costs pass through or not.
To find these values:
1. Go to the Billing tab on the Google Cloud Console. Make sure the correct project is selected in the top bar; you might have to select the 'All' tab in the project chooser if you do not see the project right away.
2. Click 'Go to billing account'.
3. In the default view (Overview) that opens, you can find the value for billing_id in the right sidebar, under "Billing Account". It should be of the form XXXXXX-XXXXXX-XXXXXX.
4. Select "Billing export" on the left navigation bar, and you will find the values for project and dataset under "Detailed cost usage". If "Detailed cost usage" is not set up, you should enable it.
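If you prefer the CLI, the billing account linked to a project can also be looked up with gcloud; on older gcloud versions this may live under the beta component:
gcloud billing projects describe $PROJECT_ID --format='value(billingAccountName)'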
For Azure, the cluster.yaml file will use either the kubeconfig provider or the azure provider, as in the two examples below.
Warning
We use the kubeconfig provider config only when we do not have permissions on the Azure subscription to create a Service Principal with terraform.
name: <cluster-name> # This should also match the name of the folder: config/clusters/$CLUSTER_NAME
provider: kubeconfig
kubeconfig:
# The location of the *encrypted* key we exported from terraform
file: enc-deployer-credentials.secret.yaml
name: <cluster-name> # This should also match the name of the folder: config/clusters/$CLUSTER_NAME
provider: azure
azure:
# The location of the *encrypted* key we exported from terraform
key: enc-deployer-credentials.secret.json
# The name of the cluster *as it appears in the Azure Portal*! Sometimes our
# terraform code adjusts the contents of the 'name' field, so double check this.
cluster: <cluster-name>
# The name of the resource group the cluster has been deployed into. This is
# the same as the resourcegroup_name variable in the .tfvars file.
resource_group: <resource-group-name>
Commit this file to the repo.
Access#
Grant the deployer’s IAM user access
Note
This still works, but makes use of a deprecated system (iamidentitymapping and the aws-auth ConfigMap in the kube-system namespace) instead of the new system called EKS access entries. Migrating to the new system is tracked by this github issue.
We need to grant the freshly created deployer IAM user access to the kubernetes cluster. As this requires passing in some parameters that match the created cluster, we have a terraform output that can give you the exact command to run.
terraform output -raw eksctl_iam_command
Run the eksctl create iamidentitymapping command returned by terraform output. That should give the continuous deployer user access.
The command should look like this:
eksctl create iamidentitymapping \
--cluster $CLUSTER_NAME \
--region $CLUSTER_REGION \
--arn arn:aws:iam::<aws-account-id>:user/hub-continuous-deployer \
--username hub-continuous-deployer \
--group system:masters
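Optionally, you can verify the mapping was created before testing access:
eksctl get iamidentitymapping --cluster $CLUSTER_NAME --region $CLUSTER_REGION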
Test the access by running:
deployer use-cluster-credentials $CLUSTER_NAME
and running:
kubectl get node
It should show you the provisioned node on the cluster if everything works out ok.
(no longer needed) Grant eksctl access to other users
Use of eksctl create iamidentitymapping was previously a required step to grant access to other engineers, but after AWS introduced a new system (EKS access entries) in parallel to the now deprecated iamidentitymapping system, it seems AWS account admin users are no longer required to be granted access like this.
To conclude, any authenticated AWS account admin should be able to acquire k8s cluster credentials like below without use of eksctl create iamidentitymapping:
aws eks update-kubeconfig --name=$CLUSTER_NAME --region=$CLUSTER_REGION
Test deployer access by running:
deployer use-cluster-credentials $CLUSTER_NAME
and running:
kubectl get node
It should show you the provisioned node on the cluster if everything works out ok.
AWS only: Expandable storage class#
The default storage class that is created when we deploy a cluster to AWS does not permit auto-expansion of persistent volumes. This can cause problems when we want to expand the size of a disk, say one used by Prometheus to store metrics data. We will therefore patch the default storage class to permit auto-expansion.
# Gain k8s access to the cluster
deployer use-cluster-credentials $CLUSTER_NAME
# Patch the storage class
kubectl patch storageclass gp2 --patch '{"allowVolumeExpansion": true}'
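To verify the patch took effect, the field should now read true:
# Should print "true"
kubectl get storageclass gp2 -o jsonpath='{.allowVolumeExpansion}'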