Model training guide #1406

Open
wants to merge 33 commits into main from data-load-strategy
Changes from 11 commits
Commits (33)
c6c2a73
initial commit + tf code
BRV158 Jul 8, 2024
636c0e8
document folder renamed
BRV158 Jul 9, 2024
1dd678a
GPU node pool added
BRV158 Jul 11, 2024
9b73def
Merge branch 'GoogleCloudPlatform:main' into data-load-strategy
ganochenkodg Aug 6, 2024
6b6d726
some updates
ganochenkodg Aug 8, 2024
aa79450
quickfix
ganochenkodg Aug 8, 2024
54c1f31
updates
ganochenkodg Aug 8, 2024
b8a0279
add sa
ganochenkodg Aug 8, 2024
5cddc1f
update
ganochenkodg Aug 11, 2024
7a45966
update notebook
ganochenkodg Aug 14, 2024
33d6d89
updates
ganochenkodg Aug 15, 2024
bbac023
update the code
ganochenkodg Aug 20, 2024
6a637e6
updates
ganochenkodg Aug 20, 2024
7c81a9e
quickfix
ganochenkodg Aug 20, 2024
be00cb6
quickfix
ganochenkodg Aug 20, 2024
6d4e556
updates
ganochenkodg Aug 21, 2024
bc0c262
regional tags added
BRV158 Aug 21, 2024
1166228
notebook cells explanation
BRV158 Aug 21, 2024
f9fff75
update the notebook
ganochenkodg Aug 22, 2024
c151919
fix
BRV158 Aug 22, 2024
33b3c2d
update the job
ganochenkodg Aug 22, 2024
595a01a
update the notebook
ganochenkodg Aug 22, 2024
2e0d880
new volume
ganochenkodg Aug 23, 2024
191a46c
downloading logic added
BRV158 Aug 27, 2024
a52a22a
separate download jobs
BRV158 Aug 29, 2024
c9dc544
ram-job fix
BRV158 Sep 3, 2024
59206d8
update the notebook
ganochenkodg Sep 12, 2024
0053b0b
Merge branch 'main' into data-load-strategy
ganochenkodg Sep 12, 2024
e9e1caf
model training sequence edited
BRV158 Sep 13, 2024
d0043d2
dataset jobs renaming
BRV158 Sep 13, 2024
92b93fd
update headers
ganochenkodg Sep 16, 2024
f9520d1
Merge branch 'main' into data-load-strategy
ganochenkodg Sep 16, 2024
b9ecfdc
updates
ganochenkodg Sep 17, 2024
32 changes: 32 additions & 0 deletions ai-ml/model-train/manifests/01-volumes/bucket.yaml
Collaborator

Suggest: Judging from the requirements listed here, we'll need a README.md file for the ai-ml/model-train folder. Could you please add a README.md file with a link to the cloud.google.com tutorial where these samples will be used?

Contributor Author

Added, but the link is empty; we won't know the URL until the guide is published.
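For context, a minimal README.md for ai-ml/model-train might look like the sketch below; the title wording is an assumption, and the tutorial URL stays a placeholder until the guide is published:

# Model training on GKE

This folder contains the Terraform and Kubernetes manifests used in the
model training data-loading tutorial on cloud.google.com:
<TUTORIAL_URL> (to be added once the guide is published)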

@@ -0,0 +1,32 @@
---
Collaborator

Issue: Looks like we'll need Copyright 2024 Google LLC Apache license headers for most of these files.

Contributor Author

updated
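For reference, the Apache 2.0 header that these samples typically carry looks like the block below in YAML files; this is only a sketch of the standard license text, and the year and formatting should match the repository's other samples rather than this example:

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.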

apiVersion: v1
kind: PersistentVolume
metadata:
  name: gcs-fuse-pv
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 16Gi
  storageClassName: example-storage-class
  mountOptions:
  - implicit-dirs
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: <PROJECT_ID>-<CLUSTER_PREFIX>-model-train
    volumeAttributes:
      fileCacheCapacity: 5Gi
      fileCacheForRangeRead: "true"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gcs-fuse-claim
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 16Gi
  volumeName: gcs-fuse-pv
  storageClassName: example-storage-class
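As an illustrative sketch only (not part of this diff), a Pod could consume the claim above roughly as follows; the gke-gcsfuse/volumes annotation and the reuse of the bucket-access service account are assumptions about how the Cloud Storage FUSE CSI driver is wired up for this guide:

apiVersion: v1
kind: Pod
metadata:
  name: gcs-fuse-smoke-test
  annotations:
    gke-gcsfuse/volumes: "true"   # asks GKE to inject the gcsfuse sidecar
spec:
  serviceAccountName: bucket-access   # assumed to be bound to a GSA with bucket access
  containers:
  - name: reader
    image: busybox
    command: ["sh", "-c", "ls -la /bucket && sleep 3600"]
    volumeMounts:
    - name: bucket
      mountPath: /bucket
  volumes:
  - name: bucket
    persistentVolumeClaim:
      claimName: gcs-fuse-claim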
15 changes: 15 additions & 0 deletions ai-ml/model-train/manifests/01-volumes/cloudbuild.yaml
@@ -0,0 +1,15 @@
steps:
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  entrypoint: /bin/bash
  args:
  - '-c'
  - |
    gcloud compute ssh --tunnel-through-iap --quiet cloudbuild@${_INSTANCE_NAME} --zone=${_ZONE} --command="\
    sudo mkdir -p /mnt/disks/ram-disk && \
    sudo mount -t tmpfs -o size=16g tmpfs /mnt/disks/ram-disk && \
    sudo mkfs.ext4 -F /dev/disk/by-id/google-local-ssd-block0 && \
    sudo mkdir -p /mnt/disks/ssd0 && \
    sudo mount /dev/disk/by-id/google-local-ssd-block0 /mnt/disks/ssd0"
substitutions:
  _ZONE: us-central1-a
  _INSTANCE_NAME: model-train-vm
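Assuming the command is run from the ai-ml/model-train directory, the build that prepares the RAM disk and Local SSD on the VM could be submitted with something like the following; --no-source is used because the build only runs gcloud over SSH and needs no uploaded sources:

gcloud builds submit --no-source \
  --config=manifests/01-volumes/cloudbuild.yaml \
  --substitutions=_ZONE=us-central1-a,_INSTANCE_NAME=model-train-vm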
69 changes: 69 additions & 0 deletions ai-ml/model-train/manifests/01-volumes/volumes.yaml
@@ -0,0 +1,69 @@
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ssd-pv
spec:
  capacity:
    storage: 16Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd0
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "node_pool"
          operator: "In"
          values:
          - "model-train-pool"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: local-ssd-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  volumeName: local-ssd-pv
  resources:
    requests:
      storage: 16Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ram-disk-pv
spec:
  capacity:
    storage: 16Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ram-disk
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "node_pool"
          operator: "In"
          values:
          - "model-train-pool"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ram-disk-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  volumeName: ram-disk-pv
  resources:
    requests:
      storage: 16Gi
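A possible way to apply the volume manifests and confirm that the pre-bound PVs and PVCs pair up (paths assumed relative to ai-ml/model-train):

kubectl apply -f manifests/01-volumes/volumes.yaml
kubectl apply -f manifests/01-volumes/bucket.yaml
# Both claims should report STATUS=Bound once the matching PVs are picked up
kubectl get pv,pvc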

93 changes: 93 additions & 0 deletions ai-ml/model-train/manifests/02-dataset/job.yaml
@@ -0,0 +1,93 @@
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: bucket-access
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: download-script
data:
  download.sh: |-
    #!/usr/bin/bash -x
    apt-get update -y && \
    apt-get install -y --no-install-recommends \
      git git-lfs rsync
    git lfs install
    cd /local-ssd
    echo "Saving dataset into Local SSD..."
    time git clone --depth=1 "$DATASET_REPO"
    echo "Saving dataset into RAM disk..."
    time rsync --info=progress2 -a /local-ssd/dataset/dataset/ /ram-disk/dataset/
    echo "Saving dataset into Bucket..."
    time gsutil -q -m cp -r /local-ssd/dataset/dataset/ gs://$BUCKET_NAME/dataset/
    echo "Dataset was successfully saved to all storage locations!"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: dataset-downloader
  labels:
    app: dataset-downloader
spec:
  ttlSecondsAfterFinished: 120
  template:
    metadata:
      labels:
        app: dataset-downloader
    spec:
      restartPolicy: OnFailure
      serviceAccountName: bucket-access
      containers:
      - name: gcloud
        image: gcr.io/google.com/cloudsdktool/google-cloud-cli:slim
        resources:
          requests:
            cpu: "2"
            memory: "12Gi"
          limits:
            cpu: "2"
            memory: "12Gi"
        command:
        - /scripts/download.sh
        env:
        - name: BUCKET_NAME
          value: <PROJECT_ID>-<CLUSTER_PREFIX>-model-train
        - name: DATASET_REPO
          value: "https://huggingface.co/datasets/dganochenko/dataset"
        - name: TIMEFORMAT
          value: "%0lR"
        volumeMounts:
        - name: local-ssd-storage
          mountPath: /local-ssd
        - name: ram-disk-storage
          mountPath: /ram-disk
        - name: scripts-volume
          mountPath: "/scripts/"
          readOnly: true
      volumes:
      - name: scripts-volume
        configMap:
          defaultMode: 0700
          name: download-script
      - name: local-ssd-storage
        persistentVolumeClaim:
          claimName: local-ssd-claim
      - name: ram-disk-storage
        persistentVolumeClaim:
          claimName: ram-disk-claim
      - name: gcs-fuse-storage
        persistentVolumeClaim:
          claimName: gcs-fuse-claim
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "present"
        effect: NoSchedule
      - key: "app.stateful/component"
        operator: "Equal"
        value: "model-train"
        effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
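One way to run and observe the downloader once the volumes exist (a minimal sketch; the path is assumed relative to ai-ml/model-train). Note that ttlSecondsAfterFinished: 120 removes the Job about two minutes after it completes, so read the logs promptly:

kubectl apply -f manifests/02-dataset/job.yaml
# Follow the timings that download.sh prints for each storage target
kubectl logs -f job/dataset-downloader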