Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to latest Azure Go SDK from out-of-support go-autorest #91

Closed
21 of 22 tasks
himanshu-kun opened this issue Apr 11, 2023 · 2 comments · Fixed by #105
Closed
21 of 22 tasks

Upgrade to latest Azure Go SDK from out-of-support go-autorest #91

himanshu-kun opened this issue Apr 11, 2023 · 2 comments · Fixed by #105
Assignees
Labels
kind/enhancement Enhancement, improvement, extension priority/critical Needs to be resolved soon, because it impacts users negatively status/closed Issue is closed (either delivered or triaged)
Milestone

Comments

@himanshu-kun
Copy link
Contributor

himanshu-kun commented Apr 11, 2023

What would you like to be added:

Currently the go-autorest version we are using is out of support as of March 31 2023. Also there is no way to re-generate mocks using Makefile, which makes it tough for external contributors to contribute.

Tasks:
API Changes

CRD Changes

  • Change the CRD and mark MachineSet as deprecated and introduce VirtualMachineScaleSet

Implementation Changes

  • Refactor validation code and utils code
  • Upgrade go.mod to use the latest Azure GO SDK.
  • new go-sdk provides fake package use that instead of using mocks. See issue
  • Refactor SPI package
  • When creating VirtualMachine set DeletionOption as Delete for both NIC and Disks. This will ensure that when the VM gets deleted the associated disks and nics will also be deleted.
  • Adapt to all new APIs (there has been a significant change in the APIs)
  • In the existing code SDK defaults the VM creation timeout to 15 mins. Check if there is a default timeout in the new SDK, if there is none then create a child context with timeout set to 15 mins to ensure that it is also in line what we have today. Check for operations - delete VM, NIC, Disk and create NIC what are the timeouts (either default or explicitly specified) and for now ensure that in the new code they are the same. We can revise the timeouts once we have metrics.

VMSS Support
During refactoring an issue w.r.t VMSS support was identified. It was collectively decided that we will not handle this issue as part of this issue so a separate issue has been created to fix VMSS support.

Metrics

  • Introduce metrics to capture:
    • API request duration - captured for all APIs made to azure.
    • Time taken to complete driver operations (e.g total time take to complete CreateMachine, DeleteMachine)
    • Status of an operation - capture failures/success

Misc

  • Convert the new API test code into OQ tests project and schedule it for regular run in CI/CD.
  • Take a look at Orphaned Azure NICs block subnet deletion afterwards #41
  • GetMachineStatus - the only usage of this is in triggerCreationFlow (MCM). We should just get machine names from VMs and not from NICs and Disks. This will resolve the issue of create machine not called if only a NIC is present. Currently GetMachineStatus leverages ListMachines which should no longer be the case.
  • ListMachines - we collect VM names from NICs and VMs (no need for disks as these are created and deleted along with the VM) since it impacts orphan collection and any change should be handled in a separate PR.
  • Implement error code mapping for Resource exhaustion.
  • There might be a bug which leads to ClusterAutoScaler fails to handle the ResourceExhausted situation #111

Tests

  • Remove ginkgo based unit tests and use native golang tests instead for better readability and maintainability.
  • Update integration tests to use the new API. (There is currently no stand-alone integration tests for a MCM-provider)

Out of Scope

  • AzureNetworkProfile.NetworkInterfaces is plural and should have had a type []AzureNetworkInterfaceReference but instead it only has a single AzureNetworkInterfaceReference. The provider API also provides an ability to set more than one NICs for a VM. So either the name of the API property needs to be change and reflect that it only allows one NIC per VM or it should make it a slice. This will not be handled as part of this issue.
  • DataDisk#Lun is currently a pointer which indicates that this field is optional and can accept a nil value. This is not true. It is a mandatory field and there is a explicit validation to ensure that it is always set. This is a API design mistake and will not be corrected as part of this PR. A separate Issue/PR should be created if there is a need to change this.

Why is this needed:
Current go-autorest is no longer supported by Azure/MS. It is therefore recommended by Azure to move to their new set of APIs.

@himanshu-kun himanshu-kun added kind/enhancement Enhancement, improvement, extension priority/2 Priority (lower number equals higher priority) priority/critical Needs to be resolved soon, because it impacts users negatively and removed priority/2 Priority (lower number equals higher priority) labels Apr 11, 2023
@unmarshall unmarshall self-assigned this Apr 13, 2023
@unmarshall unmarshall changed the title Upgrade go-autorest to latest and regenerate mocks Upgrade to latest Azure Go SDK from out-of-support go-autorest Jun 27, 2023
@unmarshall
Copy link
Contributor

A VMImage used today is a marketplace image. This requires a purchase plan to exist and the customer is expected to create an agreement. If an agreement GET returns an error then we just return the error. In a conversation with @dkistner and @kon-angelo we agreed on the following:

  • Keep the creation of agreement as a prerequisite for the customer.
  • Handle 404 error and have a log explicitly stating that agreement needs to be created. Still return the error and do not continue VM creation.
  • Keep the existing behavior of checking if agreement is not accepted then update the agreement to be Accepted= true

@elankath elankath removed their assignment Oct 12, 2023
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement Enhancement, improvement, extension priority/critical Needs to be resolved soon, because it impacts users negatively status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants