* init
* remove ray
* update config
* update
* update
* update
* complete bootstrapping
* add start instance
* fix
* fix
* fix
* update
* wait stopping instances
* support normal gcp tpus first
* fix gcp
* support get cluster info
* fix
* update
* wait for instance starting
* rename
* hide gcp package import
* fix
* fix
* update constants
* fix comments
* remove unused methods
* fix comments
* sync 'config' & 'constants' with upstream, Nov 16
* sync 'instace_utils' with the upstream, Nov 16
* fix typing
* parallelize provisioning
* Fix TPU node
* Fix TPU NAME env for tpu node
* implement bulk provision
* refactor selflink
* format
* reduce the sleep time for autostop
* provisioner version refactoring
* refactor
* Add logging
* avoid saving the provisioner version
* format
* format
* Fix scheduling field in config
* format
* fix public key content
* Fix provisioner version for azure
* Use ray port from head node for workers
* format
* fix ray_port
* fix smoke tests
* shorter sleep time
* refactor status refresh version
* [Provisioner] Support reserved instances in GCP (#2824)
* Support reserved instances
* remove min max count
* remove unecessary fields
* Add todo
* Add todo
* remove unused reseravation config
* Fix config.yaml tests
* format
* sync with the upstream (Dec 05, 23)
* set timeout and retries
* handle GCP creation errors
* Fix provisioning errors and improve error handling
* update blocklist for GCP
* refactor code for linting issues
* fix
* show instance status during assertion error
* Refactor error handling for failover
* adopt changes in #2854
* format
* retry for wait operation
* format
* fix typo
* fix interface
* more robust zone to region
* Fix tpu vm external IP setup
* Fix get node
* format
* revert for TPU VM pod
* Fix get_cluster_info call
* fix tab
* Fix timeout case
* remvoe \t
* GCP query statuses with new provisioner
* format
* fix import
* refactor query status
* fix stopped status
* Fix stopped status
* Add head ray start command
* Add back keys
* add workers
* Fix non stopped states
* Add more logs for autostop
* format
* increase job_docker job time
* better logging
* shorter time for recovering
* fix conflicting var
* change to V1
* fix comments
* refactor constants
* refactoring
* typo
* Fix max retry
* longer sleep time for job
* add detach setup
* revert --detach-setup
* shorter time for recovering
* more retries
* Update sky/provision/instance_setup.py
  Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* Update sky/backends/cloud_vm_ray_backend.py
  Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* format
* [Provisioner] New provisioner for GCP TPU VM (#2898)
* init
* test
* test ins_type
* fix
* format..
* wip
* remove TPU config
* fix node ips
* Fix TPU VM pod
* format
* use TPU VM as default
* Fix example for TPU VM
* format
* fix optimizer random dag
* set TPU-VM
* accelerator_args False
* backward compatibility
* add tpu filter for tests
* fix
* Fix
* fix status refresh for tpu VM pod
* Support autodown for TPU VM pod
* Allow multi-node TPU VM pod
* Allow multi-node TPU VM pod
* fix
* add execute for operation
* avoid from
* Wait for pending before set_labels
* format
* refactor constants
* Fix for API changes
* remove GCP failover handler v1
* format
* remove TPU VM pod specific codes as they have been moved to new provisioner
* Add error handling for TPU pod case
* fix
* fix multiple node calculation
* refactor tpu_utils to gcp_utils
* shorter time for recovering
* format

---------

Co-authored-by: Wei-Lin Chiang <weichiang@berkeley.edu>

* better error logging
* Fix logging for TPU VM
* Fix logging
* Add insufficientCapacity to error handler
* Avoid adding duplicated resources to blocked_resources
* Fix blocked resources
* address comment
* add comment
* Add comments
* format
* Fix num_node_ips
* format
* fix smoke test for preinstalled package
* shorter wait time for recovering
* Fix TPU VM pod stop
* format
* Update sky/provision/gcp/instance_utils.py
  Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* update
* format
* Add debug message
* revert version for handle
* disable tpu name set

---------

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Wei-Lin Chiang <weichiang@berkeley.edu>
1 parent 394ec4a, commit 318553b
Showing 33 changed files with 2,937 additions and 1,163 deletions.