# Setup object storage buckets
See the relevant topic page for more information on why users want this!
1. In the `.tfvars` file for the project that this hub is based in, create (or modify) the `user_buckets` variable. The config looks like:

   ```
   user_buckets = {
     "bucket1": {
       "delete_after": 7
     },
     "bucket2": {
       "delete_after": null
     },
     "bucket3": {
       "archival_storageclass_after": 3
     }
   }
   ```

   Since storage buckets need to be globally unique across all of Google Cloud, the actual created names are `<prefix>-<bucket-name>`, where `<prefix>` is set by the `prefix` variable in the `.tfvars` file.

   `delete_after` specifies the number of days after object creation time that the object will be automatically cleaned up - this is very helpful for temporary 'scratch' buckets. Set it to `null` to prevent this clean-up, e.g. if users want a persistent bucket.

   `archival_storageclass_after` (currently available only for AWS) transitions objects created in this bucket to a cheaper, slower archival storage class after the number of days specified in this variable. This is helpful for archiving user home directories or similar use cases, where data needs to be kept for a long time but is rarely accessed. It should not be used for frequently accessed or publicly accessible data.

2. Enable access to these buckets from the hub, or make them publicly accessible from outside, by editing `hub_cloud_permissions` in the same `.tfvars` file. Follow all the steps listed there - this should create the storage buckets and give all users access to them! (A sketch of a typical Terraform invocation is shown after this list.)
3. You can set the `SCRATCH_BUCKET` (and the deprecated `PANGEO_SCRATCH`) environment variables on all user pods so users can use the created bucket without having to hard-code the bucket name in their code. In the hub-specific `.values.yaml` file in `config/clusters/<cluster-name>`, set:

   ```yaml
   jupyterhub:
     singleuser:
       extraEnv:
         SCRATCH_BUCKET: <s3 or gs>://<bucket-full-name>/$(JUPYTERHUB_USER)
         PANGEO_SCRATCH: <s3 or gs>://<bucket-full-name>/$(JUPYTERHUB_USER)
         # If we have a bucket that does not have a `delete_after`
         PERSISTENT_BUCKET: <s3 or gs>://<bucket-full-name>/$(JUPYTERHUB_USER)
         # If we have a bucket defined in user_buckets that should be granted public read access.
         PUBLIC_PERSISTENT_BUCKET: <s3 or gs>://<bucket-full-name>/$(JUPYTERHUB_USER)
   ```
   > **Note:** Use `s3` on AWS and `gs` on GCP for the protocol part.

   > **Note:** If the hub is a `daskhub`, nest the config under a `basehub` key.
   The `$(JUPYTERHUB_USER)` expands to the name of the current user for each user, so everyone gets a little prefix inside the bucket to store their own stuff without stepping on other people's objects. But this is not a security mechanism - everyone can access everyone else's objects!

   `<bucket-full-name>` is the full name of the bucket, formed as `<prefix>-<bucket-name>`, where `<prefix>` is also set in the `.tfvars` file. You can also see the full names of created buckets with `terraform output buckets`.

4. You can also add other env vars pointing to other buckets users requested.
5. Get this change deployed, and users should now be able to use the buckets! Currently running users might have to restart their pods for the change to take effect.
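The Terraform part of this change is applied with the standard workflow. The sketch below is only an approximation - the working directory, workspace name, and var-file path are assumptions about the repository layout, so defer to the repository's own Terraform documentation:

```bash
# Rough sketch - adjust the directory, workspace and var-file path to match the repo.
cd terraform/gcp                                   # or terraform/aws, depending on the cloud provider
terraform workspace select <cluster-name>          # assumes one workspace per cluster
terraform plan -var-file=projects/<cluster-name>.tfvars
terraform apply -var-file=projects/<cluster-name>.tfvars

# The full names of the created buckets are then available as an output:
terraform output buckets
```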
## Testing access to buckets
Once bucket access has been set up, we should test to make sure users can write to and read from it.
### AWS
1. Log in to the hub, and open a Terminal in JupyterLab.

2. Look for the environment variables we just set (`SCRATCH_BUCKET` and/or `PERSISTENT_BUCKET`) and make sure they are showing up correctly:

   ```bash
   env | grep _BUCKET
   ```

   They should end with the name of your JupyterHub user. For example, here is the output on the openscapes hub, when my JupyterHub username is `yuvipanda`:

   ```
   PERSISTENT_BUCKET=s3://openscapeshub-persistent/yuvipanda
   SCRATCH_BUCKET=s3://openscapeshub-scratch/yuvipanda
   ```

3. Check if the AWS CLI is installed by running the `aws` command - many base images already include this package. If not, you can do a local installation with:

   ```bash
   curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
   unzip awscliv2.zip
   export PATH=$(pwd)/aws/dist/:$PATH
   ```

   > **Note:** This could have been as simple as a `pip install`, but AWS does not support it.

4. Create a temporary file, which we will then copy over to our scratch bucket:

   ```bash
   echo 'hi' > temp-test-file
   ```

5. Copy the file over to S3, under `$SCRATCH_BUCKET` or `$PERSISTENT_BUCKET` (based on which one we are testing):

   ```bash
   aws s3 cp temp-test-file $SCRATCH_BUCKET/temp-test-file
   ```

   This should succeed with a message like:

   ```
   upload: ./temp-test-file to s3://openscapeshub-scratch/yuvipanda/temp-test-file
   ```

6. Let's list our bucket to make sure the file is there:

   ```bash
   $ aws s3 ls $SCRATCH_BUCKET/
   2024-03-26 01:38:53          3 temp-test-file
   ```

   > **Note:** The trailing `/` is important.

   > **Note:** If testing `$PERSISTENT_BUCKET`, use that environment variable instead.

7. Copy the file back from S3, to make sure we can read:

   ```bash
   $ aws s3 cp $SCRATCH_BUCKET/temp-test-file back-here
   download: s3://openscapeshub-scratch/yuvipanda/temp-test-file to ./back-here
   $ cat back-here
   hi
   ```

   We have verified this all works!

8. Clean up our files so we don't cost the community money in the long run:

   ```bash
   aws s3 rm $SCRATCH_BUCKET/temp-test-file
   rm temp-test-file back-here
   ```
## Allowing public, read-only access to buckets from outside the JupyterHub
### GCP
Some hubs want to expose a particular bucket to the broader internet. This can have catastrophic cost consequences, so we only allow it on clusters where 2i2c is not paying the bill.
This can be enabled by setting the `public_access` parameter in `user_buckets` for the appropriate bucket, and running `terraform apply`.

Example:

```
user_buckets = {
  "persistent": {
    "delete_after": null,
    "public_access": true
  }
}
```
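Once applied, you can spot-check public read access from any machine with no GCP credentials configured. A minimal sketch, where `<bucket-full-name>` and `<some-object>` are placeholders for the real bucket name and an object known to exist in it:

```bash
# Anonymous HTTP read of a public GCS object - no credentials involved.
curl -sf "https://storage.googleapis.com/<bucket-full-name>/<some-object>" -o /dev/null \
  && echo "public read works" \
  || echo "object is not publicly readable"
```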
## Enable access logs for objects in a bucket
### GCP
We may want to know which objects in a bucket are actually being accessed, and when. While there is no systematic way to ask 'when was this object last accessed?', we can instead enable usage logs that give hub administrators access to some raw data.

Note that we currently cannot help hub admins process these logs - that is their responsibility. We can only enable this logging.
This can be enabled by setting the `usage_logs` parameter in `user_buckets` for the appropriate bucket, and running `terraform apply`.

Example:

```
user_buckets = {
  "persistent": {
    "usage_logs": true
  }
}
```
Once enabled, you can find out which bucket the access logs will be sent to with `terraform output usage_log_bucket`. The access logs will by default be deleted after 30 days, to avoid them costing too much money.
The logs are in CSV format, with the fields documented here. We suggest that interested hub admins download the logs and parse them as they wish - this is not something we can currently help much with.
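As a starting point, downloading the raw log objects for local analysis might look roughly like the sketch below. The wildcard is an assumption about GCS usage-log object naming, so list the bucket first to see what is actually there:

```bash
# Find the bucket the usage logs are written to.
terraform output usage_log_bucket

# Inspect and download the log objects for local analysis.
gsutil ls "gs://<usage-log-bucket>/"
mkdir -p usage-logs
gsutil -m cp "gs://<usage-log-bucket>/*_usage_*" usage-logs/
```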
## Allowing authenticated access to buckets from outside the JupyterHub
### GCP
Some hub users want to be able to write to the bucket from outside the hub, primarily for large data transfers from on-premise systems. Since a Google Group can be used to control access to GCS buckets, we can use one to allow arbitrary users to write to the bucket!
1. With your `2i2c.org` Google account, go to Google Groups and create a new Google Group named "{bucket-name}-writers", with the email "{bucket-name}-writers@googlegroups.com", where "{bucket-name}" is the name of the bucket we are going to grant write access to.

   We use `@googlegroups.com` rather than `@2i2c.org` deliberately: this group will include non-2i2c members, and giving them control of a `2i2c.org` email address would be an unnecessary security compromise.

2. Grant "Group Owner" access to the community champion requesting this feature. They will be able to add / remove users from the group as necessary, and thus manage access without needing to involve 2i2c engineers.

3. In the `user_buckets` definition for the bucket in question, add the group name as an `extra_admin_members` entry:

   ```
   user_buckets = {
     "persistent": {
       "delete_after": null,
       "extra_admin_members": [
         "group:<name-of-group>@googlegroups.com"
       ]
     }
   }
   ```
4. Apply this Terraform change to create the appropriate permissions for members of the group to have full read/write access to that GCS bucket.
We want the community champions to handle granting / revoking access to this Google Group, as well as to produce community-specific documentation on how to actually upload data here. We currently do not have a template for how end users can use this, but something can be adapted from the documentation for LEAP users.
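For reference, once a collaborator's Google account has been added to the group, an upload from an on-premise machine might look roughly like this (assuming the Google Cloud SDK is installed; `<bucket-full-name>` is a placeholder):

```bash
# Authenticate as the Google account that was added to the Google Group.
gcloud auth login

# Upload data into the bucket and verify it arrived.
gsutil -m cp -r ./local-data-directory "gs://<bucket-full-name>/some-prefix/"
gsutil ls "gs://<bucket-full-name>/some-prefix/"
```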
## Granting access to cloud buckets in other cloud accounts / projects
Sometimes, users on a hub we manage need access to a storage bucket managed by an external third party - often a different research group. This can help with access to raw data, collaboration, etc.
This section outlines how to grant this access. Currently, this functionality is implemented only on AWS - but we can add it for other cloud providers when needed.
### AWS
On AWS, we need to set up cross-account S3 access.
1. Find the ARN of the service account used by the users on the hub. You can find this under `userServiceAccount.annotations.eks.amazonaws.com/role-arn` in the `values.yaml` file for your hub. It should look something like `arn:aws:iam::<account-id>:role/<hub-name>`.

2. In the AWS account that owns the S3 bucket, add a bucket policy that grants the hub appropriate access to the bucket. For example, the following policy grants read-only access to the bucket for users of the hub:

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "<arn-of-service-account-from-step-1>"
         },
         "Action": [
           "s3:GetObject",
           "s3:GetObjectVersion",
           "s3:ListBucket"
         ],
         "Resource": [
           "arn:aws:s3:::<name-of-bucket>",
           "arn:aws:s3:::<name-of-bucket>/*"
         ]
       }
     ]
   }
   ```

   You can add additional permissions to the bucket here if needed.
   > **Note:** You can list as many buckets as you want, but each bucket needs two `Resource` entries - one with the `/*` suffix and one without - so that both listing the bucket and fetching objects from it work.
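   If you prefer the command line over the AWS console, attaching the policy might look roughly like this sketch, assuming the JSON above is saved as `cross-account-policy.json` and your AWS CLI credentials belong to the account that owns the bucket:

   ```bash
   # Run with credentials for the AWS account that owns the bucket.
   # cross-account-policy.json is the policy document shown above.
   aws s3api put-bucket-policy \
       --bucket <name-of-bucket> \
       --policy file://cross-account-policy.json

   # Confirm the policy is attached.
   aws s3api get-bucket-policy --bucket <name-of-bucket>
   ```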
3. In the `.tfvars` file for the cluster hosting the hub, add `extra_iam_policy` as a key to the hub under `hub_cloud_permissions`. This is used to set any additional IAM permissions granted to the users of the hub. In this case, you should copy the exact policy that was applied to the bucket in step 2, but remove the `Principal` key. So it would look something like:

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "s3:GetObject",
           "s3:GetObjectVersion",
           "s3:ListBucket"
         ],
         "Resource": [
           "arn:aws:s3:::<name-of-bucket>",
           "arn:aws:s3:::<name-of-bucket>/*"
         ]
       }
     ]
   }
   ```
4. Apply the Terraform config, and test that S3 bucket access works on the hub!
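   A quick check from a terminal on the hub, once the change is deployed (the object key is a placeholder - use one that actually exists in the external bucket):

   ```bash
   # Run from a terminal on the hub.
   aws s3 ls s3://<name-of-bucket>/
   aws s3 cp s3://<name-of-bucket>/<some-object> ./   # verify read access works
   ```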