[OCI] Re-Implementation with the new provision API framework. #4119

Open
wants to merge 5 commits into
base: master

Contributor

@HysunHe HysunHe commented Oct 18, 2024

This PR implements OCI support in SkyPilot under its new provision API framework. The old implementation is deleted.

By leveraging SkyPilot's new provision API framework, this PR is expected to solve the TIMEOUT issue that the "old" implementation hit on some OCI VMs that need a long time to complete setup/init.

It also includes some minor enhancements and fixes.

This PR should be reviewed and merged after #4080.

All functions in the "old" implementation have now been tested against the new implementation.

All tests passed.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@HysunHe HysunHe marked this pull request as draft October 21, 2024 05:22
@HysunHe HysunHe closed this Oct 21, 2024
Contributor Author

HysunHe commented Oct 21, 2024

Resolved conflict.

@HysunHe HysunHe reopened this Oct 21, 2024
@HysunHe HysunHe marked this pull request as ready for review October 21, 2024 05:58
Contributor Author

HysunHe commented Oct 28, 2024

Hi @cblmemo @Michaelvll, thanks for supporting this OCI provisioning API migration PR. Could you please review it when you have time? Thanks in advance.

Collaborator

@cblmemo cblmemo left a comment

Thanks for contributing this, @HysunHe! This looks awesome. I left some nits and will test it out tomorrow!

sky/backends/cloud_vm_ray_backend.py (outdated, resolved)
sky/clouds/oci.py (outdated, resolved)
sky/clouds/utils/oci_utils.py (resolved)
sky/provision/oci/query_utils.py (outdated, resolved)
sky/provision/oci/query_utils.py (outdated, resolved)
sky/provision/oci/instance.py (resolved)
sky/provision/oci/instance.py (outdated, resolved)
sky/provision/oci/instance.py (outdated, resolved)
sky/provision/oci/instance.py (outdated, resolved)
sky/provision/oci/instance.py (outdated, resolved)
Contributor Author

HysunHe commented Oct 29, 2024

Hi @cblmemo, thanks for your detailed review and kind comments. I've addressed (or replied to) all of them and run thorough functionality tests.

@HysunHe HysunHe requested a review from cblmemo October 29, 2024 09:34
Contributor Author

HysunHe commented Nov 2, 2024

Hi @cblmemo, could you please review the latest changes, which address the previous comments? Thanks a lot :)

Collaborator

cblmemo commented Nov 2, 2024

Hi @HysunHe, thanks for the reminder! I tested this PR with sky launch --cloud oci --num-nodes 2 whoami and got the following error. Could you help take a look at what is happening here?

D 11-02 13:06:14 provisioner.py:151]     raise exceptions.ServiceError(
D 11-02 13:06:14 provisioner.py:151] oci.exceptions.ServiceError: {'target_service': 'compute', 'status': 400, 'code': 'InvalidParameter', 'opc-request-id': '2085EBD06D624CE9A952258F3B61A3FB/5C41B95C78B9AFBC18E5A15CF96F98DA/E27434EFC36C28FF3644A68FC7C96A21', 'message': 'Invalid ratio of memory in GB to OCPUs. Current ratio: 16.0. Valid ratio range: 0 - 0', 'operation_name': 'launch_instance', 'timestamp': '2024-11-02T20:06:14.589716+00:00', 'client_version': 'Oracle-PythonSDK/2.110.0', 'request_endpoint': 'POST https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances', 'logging_tips': 'To get more info on the failing request, refer to https://docs.oracle.com/en-us/iaas/tools/python/latest/logging.html for ways to log the request/response details.', 'troubleshooting_tips': "See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_400__400_invalidparameter for more information about resolving this error. Also see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/LaunchInstance for details on this operation's requirements. If you are unable to resolve this compute issue, please contact Oracle support and provide them this full error message."}

Also, I noticed there are several unresolved comments - could you take a look at them as well? ;)

Co-authored-by: Tian Xia <cblmemo@gmail.com>
Contributor Author

HysunHe commented Nov 3, 2024

> Also, I noticed there are several unresolved comments - could you take a look at them as well? ;)

Ahh, I missed those "hidden" ones. Now all nits are addressed.

Contributor Author

HysunHe commented Nov 3, 2024

> Hi @HysunHe, thanks for the reminder! I tested this PR with sky launch --cloud oci --num-nodes 2 whoami and got the following error. Could you help take a look at what is happening here?

Hmm, it looks like I cannot reproduce this in my environment.

Could you please paste the output showing the chosen resource (as below)?

(sky) hysunhe@HYHE-PF1ZGYCQ:~$ sky launch --cloud oci --num-nodes 2 whoami --dryrun
Task from command: whoami
Considered resources (2 nodes):
CLOUD   INSTANCE                    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE         COST ($)   CHOSEN
OCI     VM.Standard.E4.Flex$_8_32   8       32        -              af-johannesburg-1   0.30       ✔

Contributor Author

HysunHe commented Nov 5, 2024

> Hi @HysunHe, thanks for the reminder! I tested this PR with sky launch --cloud oci --num-nodes 2 whoami and got the following error. Could you help take a look at what is happening here?
>
> D 11-02 13:06:14 provisioner.py:151]     raise exceptions.ServiceError(
> D 11-02 13:06:14 provisioner.py:151] oci.exceptions.ServiceError: {'target_service': 'compute', 'status': 400, 'code': 'InvalidParameter', 'opc-request-id': '2085EBD06D624CE9A952258F3B61A3FB/5C41B95C78B9AFBC18E5A15CF96F98DA/E27434EFC36C28FF3644A68FC7C96A21', 'message': 'Invalid ratio of memory in GB to OCPUs. Current ratio: 16.0. Valid ratio range: 0 - 0', 'operation_name': 'launch_instance', 'timestamp': '2024-11-02T20:06:14.589716+00:00', 'client_version': 'Oracle-PythonSDK/2.110.0', 'request_endpoint': 'POST https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances', 'logging_tips': 'To get more info on the failing request, refer to https://docs.oracle.com/en-us/iaas/tools/python/latest/logging.html for ways to log the request/response details.', 'troubleshooting_tips': "See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_400__400_invalidparameter for more information about resolving this error. Also see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/LaunchInstance for details on this operation's requirements. If you are unable to resolve this compute issue, please contact Oracle support and provide them this full error message."}
>
> Also, I noticed there are several unresolved comments - could you take a look at them as well? ;)

Hi @cblmemo, could you please double-check whether you're using a trial account and have reached the resource limit? You can also go to https://cloud.oracle.com/limits?region=us-sanjose-1 and check the limit for the resource type "Cores for Standard.E4.Flex ...".
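
For anyone triaging a similar report, here is a minimal sketch of pulling the ratio numbers out of the ServiceError message quoted above. The helper names (parse_ratio_error, looks_like_zero_limit) are hypothetical and are not part of SkyPilot or the OCI SDK. Treating a "0 - 0" valid range as a hint that the account's limit for the shape is zero is only an assumption drawn from the question above, not something the thread confirms.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper for illustration only; the pattern mirrors the wording
# of the 'message' field in the error pasted earlier in this thread.
_RATIO_RE = re.compile(
    r"Current ratio: (?P<current>[\d.]+)\. "
    r"Valid ratio range: (?P<low>[\d.]+) - (?P<high>[\d.]+)"
)


def parse_ratio_error(message: str) -> Optional[Tuple[float, float, float]]:
    """Return (current, low, high) from the error message, or None if absent."""
    match = _RATIO_RE.search(message)
    if match is None:
        return None
    return (
        float(match["current"]),
        float(match["low"]),
        float(match["high"]),
    )


def looks_like_zero_limit(message: str) -> bool:
    """True when the reported valid range is 0 - 0.

    Assumption: an all-zero range suggests no capacity is available to the
    account for this shape, so the service limits page is worth checking.
    """
    parsed = parse_ratio_error(message)
    if parsed is None:
        return False
    _, low, high = parsed
    return low == 0 and high == 0


if __name__ == "__main__":
    msg = ("Invalid ratio of memory in GB to OCPUs. "
           "Current ratio: 16.0. Valid ratio range: 0 - 0")
    print(parse_ratio_error(msg))
    print(looks_like_zero_limit(msg))
```

If looks_like_zero_limit returns True, checking the limits page linked above before debugging the provisioner code may save time.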
