Set up object storage buckets#

See the relevant topic page for more information on why users want this!

  1. In the .tfvars file for the project this hub is based in, create (or modify) the user_buckets variable. The config looks like this:

    user_buckets = {
       "bucket1": {
          "delete_after": 7
       },
       "bucket2": {
          "delete_after": null
       },
       "bucket3": {
          "archival_storageclass_after": 3
       }
    }
    

    Since storage bucket names need to be globally unique, the actual created names are <prefix>-<bucket-name>, where <prefix> is set by the prefix variable in the .tfvars file.

    delete_after specifies the number of days after object creation time that the object will be automatically cleaned up - this is very helpful for temporary ‘scratch’ buckets. Set it to null to disable this cleanup, e.g., if users want a persistent bucket.

    archival_storageclass_after (available only for AWS currently) transitions objects created in this bucket to a cheaper, slower archival class after the number of days specified in this variable. This is helpful for archiving user home directories or similar use cases, where data needs to be kept for a long time but rarely accessed. This should not be used for frequently accessed or publicly accessible data.

  2. Enable access to these buckets from the hub or make them publicly accessible from outside by editing hub_cloud_permissions in the same .tfvars file. Follow all the steps listed there - this should create the storage buckets and provide all users access to them!

  3. You can set the SCRATCH_BUCKET (and the deprecated PANGEO_SCRATCH) env vars on all user pods so users can use the created bucket without having to hard-code the bucket name in their code. In the hub-specific .values.yaml file in config/clusters/<cluster-name>, set:

     jupyterhub:
       singleuser:
          extraEnv:
             SCRATCH_BUCKET: <s3 or gs>://<bucket-full-name>/$(JUPYTERHUB_USER)
             PANGEO_SCRATCH: <s3 or gs>://<bucket-full-name>/$(JUPYTERHUB_USER)
             # If we have a bucket that does not have a `delete_after`
             PERSISTENT_BUCKET: <s3 or gs>://<bucket-full-name>/$(JUPYTERHUB_USER)
             # If we have a bucket defined in user_buckets that should be granted public read access.
             PUBLIC_PERSISTENT_BUCKET: <s3 or gs>://<bucket-full-name>/$(JUPYTERHUB_USER)
    

    Note

    Use s3 on AWS and gs on GCP for the protocol part

    Note

    If the hub is a daskhub, nest the config under a basehub key

    The $(JUPYTERHUB_USER) expands to the name of the current user for each user, so everyone gets a little prefix inside the bucket to store their own stuff without stepping on other people’s objects. But this is not a security mechanism - everyone can access everyone else’s objects!

    <bucket-full-name> is the full name of the bucket, which is formed by <prefix>-<bucket-name>, where <prefix> is also set in the .tfvars file. You can see the full names of created buckets with terraform output buckets too.

    You can also add other env vars pointing to other buckets users requested.

  4. Get this change deployed, and users should now be able to use the buckets! Currently running users might have to restart their pods for the change to take effect. A typical deployment sequence is sketched below.
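
    The sketch below shows what that sequence might look like. The directory you run terraform from and the deployer invocation are assumptions here - follow your usual deployment workflow for this cluster:

    # From the terraform directory for this cluster's cloud provider,
    # after `terraform apply` has created the buckets:
    terraform output buckets    # shows the full names of the created buckets

    # Redeploy the hub so the new environment variables take effect
    # (hypothetical deployer invocation - substitute your cluster and hub names)
    deployer deploy <cluster-name> <hub-name>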

Testing access to buckets#

Once bucket access has been set up, we should test to make sure users can write to and read from it.

AWS#

  1. Log in to the hub, and open a Terminal in JupyterLab.

  2. Look for the environment variables we just set (SCRATCH_BUCKET and/or PERSISTENT_BUCKET) and make sure they are showing up correctly:

    env | grep _BUCKET
    

    They should end with the name of your JupyterHub user. For example, here is the output on the openscapes hub, when my JupyterHub username is yuvipanda:

    PERSISTENT_BUCKET=s3://openscapeshub-persistent/yuvipanda
    SCRATCH_BUCKET=s3://openscapeshub-scratch/yuvipanda
    
  3. Check if the AWS CLI is installed by running the aws command - many base images already include this package. If not, you can do a local installation with:

    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    export PATH=$(pwd)/aws/dist/:$PATH
    

    Note

    This could have been as simple as a pip install, but AWS does not support it

  4. Create a temporary file, which we will then copy over to our scratch bucket.

    echo 'hi' > temp-test-file
    
  5. Copy the file over to S3, under $SCRATCH_BUCKET or $PERSISTENT_BUCKET (depending on which one we are testing).

    aws s3 cp temp-test-file $SCRATCH_BUCKET/temp-test-file
    

    This should succeed with a message like upload: ./temp-test-file to s3://openscapeshub-scratch/yuvipanda/temp-test-file

  6. Let’s list our bucket to make sure the file is there.

    $ aws s3 ls $SCRATCH_BUCKET/
    2024-03-26 01:38:53          3 temp-test-file
    

    Note

    The trailing / is important.

    Note

    If testing $PERSISTENT_BUCKET, use that environment variable instead

  7. Copy the file back from s3, to make sure we can read.

    $ aws s3 cp  $SCRATCH_BUCKET/temp-test-file back-here
    download: s3://openscapeshub-scratch/yuvipanda/temp-test-file to ./back-here
    $ cat back-here
    hi
    

    We have verified this all works!

  8. Clean up our files so we don’t cost the community money in the long run.

    aws s3 rm $SCRATCH_BUCKET/temp-test-file
    rm temp-test-file back-here
    

Allowing public, read-only access to buckets from outside the JupyterHub#

GCP#

Some hubs want to expose a particular bucket to the public internet. This can have catastrophic cost consequences, so we only allow this on clusters where 2i2c is not paying the cloud bill.

This can be enabled by setting the public_access parameter in user_buckets for the appropriate bucket, and running terraform apply.

Example:

user_buckets = {
   "persistent": {
      "delete_after": null,
      "public_access": true
   }
}
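
To spot-check public access after running terraform apply, anyone without credentials should be able to fetch an object over plain HTTPS. A minimal sketch, assuming an object named some-file.txt already exists in the bucket:

# Public GCS objects are readable at a predictable HTTPS URL.
# <bucket-full-name> is the <prefix>-<bucket-name> form described above;
# some-file.txt is a hypothetical object that already exists in the bucket.
curl -I https://storage.googleapis.com/<bucket-full-name>/some-file.txt
# An HTTP 200 response means anonymous reads work.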

Enable access logs for objects in a bucket#

GCP#

We may want to know which objects in a bucket are actually being accessed, and when. While there is no systematic way to ask ‘when was this object last accessed?’, we can instead enable usage logs that give hub administrators access to some raw data.

Note that we currently cannot actually help hub admins process these logs - that is their responsibility. We can only enable this logging.

This can be enabled by setting the usage_logs parameter in user_buckets for the appropriate bucket, and running terraform apply.

Example:

user_buckets = {
   "persistent": {
      "usage_logs": true
   }
}

Once enabled, you can find out which bucket the access logs will be sent to with terraform output usage_log_bucket. By default, the access logs are deleted after 30 days, to keep their storage costs down.

The logs are in CSV format, with the fields documented in Google Cloud's usage log documentation. We suggest that interested hub admins download the logs and parse them as they wish - this is not something we can currently help much with.
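
As a starting point, the sketch below shows one way a hub admin (or an engineer with access to the log bucket) could pull the logs down for local analysis. The wildcard pattern is illustrative - check the log bucket to see the exact object naming Google uses:

# Find the bucket the usage logs are delivered to
terraform output usage_log_bucket

# Download the log objects locally with gsutil (requires credentials with
# read access to the log bucket); the `*_usage_*` pattern is illustrative
mkdir -p usage-logs
gsutil -m cp 'gs://<usage-log-bucket>/*_usage_*' ./usage-logs/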

Allowing authenticated access to buckets from outside the JupyterHub#

GCP#

Some hub users want to be able to write to the bucket from outside the hub, primarily for large data transfers from on-premise systems. Since Google Groups can be used to control access to GCS buckets, a group lets us grant arbitrary users write access to the bucket!

  1. With your 2i2c.org google account, go to Google Groups and create a new Google Group with the name “<bucket-name>-writers”, where “<bucket-name>” is the name of the bucket we are going to grant write access to.

  2. Grant “Group Owner” access to the community champion requesting this feature. They will be able to add / remove users from the group as necessary, and thus manage access without needing to involve 2i2c engineers.

  3. In the user_buckets definition for the bucket in question, add the group name as an extra_admin_members:

    user_buckets = {
      "persistent": {
        "delete_after": null,
        "extra_admin_members": [
          "group:<name-of-group>@googlegroups.com"
        ]
      }
    }
    

    Apply this terraform change to create the appropriate permissions for members of the group to have full read/write access to that GCS bucket.

  4. We want the community champions to handle granting / revoking access to this Google Group, as well as produce community-specific documentation on how to actually upload data here. We currently do not have a template for how end users can use this, but something can be adapted from the documentation for LEAP users. The sketch below shows the kind of commands an end user might run.
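
    As a starting point for that documentation, the commands below sketch what a member of the Google Group might run from their own machine to upload data. The bucket and path names are placeholders:

    # Authenticate as the Google account that was added to the group
    gcloud auth login

    # Copy a local directory into the bucket (placeholder names)
    gsutil -m cp -r ./my-local-data gs://<bucket-full-name>/my-local-data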

Granting access to cloud buckets in other cloud accounts / projects#

Sometimes, users on a hub we manage need access to a storage bucket managed by an external third party - often a different research group. This can help with access to raw data, collaboration, etc.

This section outlines how to grant this access. Currently, this functionality is implemented only on AWS - but we can add it for other cloud providers when needed.

AWS#

On AWS, we would need to set up cross-account S3 access.

  1. Find the ARN of the service account used by the users on the hub. You can find this under userServiceAccount.annotations.eks.amazonaws.com/role-arn in the values.yaml file for your hub. It should look something like arn:aws:iam::<account-id>:role/<hub-name>.

  2. In the AWS account with the S3 bucket, create a bucket policy that grants appropriate access to the S3 bucket from the hub. For example, the following policy grants read-only access to the bucket for users of the hub:

    {
     "Version": "2012-10-17",
     "Statement": [
         {
             "Effect": "Allow",
             "Principal": {
                 "AWS": "<arn-of-service-account-from-step-1>"
             },
             "Action": [
                 "s3:GetObject",
                 "s3:GetObjectVersion",
                 "s3:ListBucket"
             ],
             "Resource": [
                 "arn:aws:s3:::<name-of-bucket>",
                 "arn:aws:s3:::<name-of-bucket>/*"
             ]
         }
     ]
    }
    

    You can add additional permissions to the bucket if needed here.

    Note

    You can list as many buckets as you want, but each bucket needs two entries - one with the /* and one without - so that both listing the bucket and fetching objects from it work.

  3. In the .tfvars file for the cluster hosting the hub, add extra_iam_policy as a key to the hub under hub_cloud_permissions. This is used to set any additional IAM permissions granted to the users of the hub. In this case, you should copy the exact policy that was applied to the bucket in step 2, but remove the “Principal” key. So it would look something like:

    {
     "Version": "2012-10-17",
     "Statement": [
         {
             "Effect": "Allow",
             "Action": [
                 "s3:GetObject",
                 "s3:GetObjectVersion",
                 "s3:ListBucket"
             ],
             "Resource": [
                 "arn:aws:s3:::<name-of-bucket>",
                 "arn:aws:s3:::<name-of-bucket>/*"
             ]
         }
     ]
    }
    
  4. Apply the terraform config, and test that S3 bucket access works on the hub - for example, with the commands sketched below.
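
    A quick way to verify from a terminal on the hub (bucket name is the same placeholder as above):

    # Listing should work thanks to s3:ListBucket
    aws s3 ls s3://<name-of-bucket>/

    # Reading an existing object should work thanks to s3:GetObject
    # (replace <some-object> with a real key from the listing above)
    aws s3 cp s3://<name-of-bucket>/<some-object> .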