Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to set property to null on AgentPool #2462

Closed
jaxxstorm opened this issue May 18, 2023 · 7 comments · Fixed by #3655
Closed

Unable to set property to null on AgentPool #2462

jaxxstorm opened this issue May 18, 2023 · 7 comments · Fixed by #3655
Assignees
Labels
area/providers impact/quality impact/usability Something that impacts users' ability to use the product easily and intuitively kind/bug Some behavior is incorrect or out of spec language/python resolution/fixed This issue was fixed

Comments

@jaxxstorm
Copy link

What happened?

When creating a container service agent pool, modifying the gpu_instance_profile doesn't work for null values

Expected Behavior

If you set the property to None or any other nullable value, it should modify the resource

Steps to reproduce

Create an agent pool with a gpu_instance_profile

containerservice.AgentPool(
  "example",
  name=pool_config.name,
  gpu_instance_profile="MIG1g"
)

Then modify this to remove the GPU instance profile:

containerservice.AgentPool(
  "example",
  name=pool_config.name,
  gpu_instance_profile="None"
)

Output of pulumi about

N/A

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@jaxxstorm jaxxstorm added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels May 18, 2023
@kpitzen
Copy link
Contributor

kpitzen commented May 19, 2023

Hey @jaxxstorm ! Thanks for opening this - is this reproducible in other languages as well, or is it only a Python issue?

@kpitzen kpitzen added impact/usability Something that impacts users' ability to use the product easily and intuitively awaiting-feedback Blocked on input from the author area/providers impact/quality and removed needs-triage Needs attention from the triage team labels May 19, 2023
@mikhailshilkov mikhailshilkov added language/python and removed awaiting-feedback Blocked on input from the author labels Sep 14, 2023
@thomas11
Copy link
Contributor

thomas11 commented Jan 5, 2024

In the line gpu_instance_profile="None", the value is simply the string "None" because it's quoted. I'm surprised Azure didn't respond with a 400 here since gpu_instance_profile is an enum with defined values.

Unless the example snippet was manually edited and None was quoted erroneously then.

@thomas11
Copy link
Contributor

thomas11 commented Jan 6, 2024

I tried to repro but there's no auto-approvable quota for multi-GPU VM SKUs. sigh Submitted a support request.

import * as resources from "@pulumi/azure-native/resources";
import * as containerservice from "@pulumi/azure-native/containerservice";

const resourceGroup = new resources.ResourceGroup("resourceGroup");

const managedCluster = new containerservice.ManagedCluster("managedCluster", {
    addonProfiles: {},
    agentPoolProfiles: [{
        count: 3,
        enableNodePublicIP: true,
        mode: "System",
        name: "nodepool1",
        osType: "Linux",
        type: "VirtualMachineScaleSets",
        vmSize: "Standard_DS2_v2",
    }],
    dnsPrefix: "dnsprefix1",
    enableRBAC: true,
    kubernetesVersion: "",
    networkProfile: {
        loadBalancerProfile: {
            managedOutboundIPs: {
                count: 2,
            },
        },
        loadBalancerSku: "standard",
        outboundType: "loadBalancer",
    },
    resourceGroupName: resourceGroup.name,
    resourceName: "clustername1",
    servicePrincipalProfile: {
        clientId: REDACTED
        secret: REDACTED
    },
    sku: {
        name: "Base",
        tier: "Free",
    },
});

const pool = new containerservice.AgentPool("sa", {
    agentPoolName: "agentpool1",
    count: 1,
    gpuInstanceProfile: "MIG1g",
    orchestratorVersion: "",
    osType: "Linux",
    resourceGroupName: resourceGroup.name,
    resourceName: managedCluster.name,
    vmSize: "Standard_ND96asr_v4",
});

@danielrbradley
Copy link
Member

danielrbradley commented Oct 16, 2024

Starting to familiarise myself with this resource. Here's the facts I think are true.

It looks like the AgentPool is a nested resource of the managed cluster. You can create AgentPoolProfiles inline with the parent, but the AgentPool itself is a standalone resource, though shares all of the same properties as the profiles - with the addition of the agentPoolName which forms part of the URL.

ARM resource docs: https://learn.microsoft.com/en-us/azure/templates/microsoft.containerservice/managedclusters/agentpools

Example VM Size supporting GPUs: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndasra100v4-series

Our default version is using 2023-04-01 but the latest ARM docs are using 2024-06-02-preview.

The specs don't indicate how to unset this field. The empty string is not valid and as far as I can remember ARM update operations don't require all properties, so a missing field is not treated like a removal.

I expect this field might be set to a default value server-side for VMs with GPU support, but is removed if the VM doesn't support GPUs. There's no indication of this behaviour in the spec.

It's also possible that this is a field which Azure treats as immutable once created, but hasn't documented via the spec. The API might just discard the field it can't update as we've observed this behaviour in other resources.

Next step will be retrying the program above.

@danielrbradley danielrbradley self-assigned this Oct 17, 2024
@danielrbradley
Copy link
Member

danielrbradley commented Oct 17, 2024

Test Plan

  1. Deploy the AgentPool with the smallest possible GPU count, and without specifying the gpuInstanceProfile.
    • Does the create succeed without this value set explicitly?
    • Does the created resource have this property set in its outputs? This would indicate that a default value is always set for GPU VMs and therefore attempting to clear the value might be impossible.
  2. Attempt to update gpuInstanceProfile to another value (e.g. MIG1g or MIG2g)
    • Does this update succeed?
    • What is contained in the logs for the request being sent for the update?

Notes:

  • Using the standard_nc24ads_a100_v4 vmSize in EastUS results in the error: Insufficient regional vcpu quota left for location eastus. left regional vcpu quota 23, requested quota 24.
  • GPU availability really matters. The UK does not seem to have this feature. US East only has 23 gpus currently available which is thwarting our testing.
  • Use the command PULUMI_DEBUG_GRPC=grpc.log pulumi up --logtostderr --logflow -v=9 2> out.txt for capturing very verbose logs plus gRPC logs.

After 24 failed attempts to deploy an agent pool with GPUs by varying the vmSize I'm going to need to put this back on hold until we can find a provisionable VM Size.

@danielrbradley danielrbradley removed their assignment Oct 17, 2024
@danielrbradley
Copy link
Member

danielrbradley commented Oct 18, 2024

Docs on what multi-instance GPUs actually are: https://learn.microsoft.com/en-us/azure/aks/gpu-multi-instance

From the page above:

You can't change the applied GPU instance profile after node pool creation.

Therefore, it's not possible to update this value to either a new value or to remove it. Therefore, we'll mark this property as causing replacement in order to resolve this issue.

Also related in giving a little more insight on what can be updated: https://learn.microsoft.com/en-us/cli/azure/aks/nodepool?view=azure-cli-latest#az-aks-nodepool-update

danielrbradley added a commit that referenced this issue Oct 18, 2024
As noted on https://learn.microsoft.com/en-us/azure/aks/gpu-multi-instance?tabs=azure-cli#gpu-instance-profiles

> You can't change the applied GPU instance profile after node pool creation.

As noted on https://learn.microsoft.com/en-us/azure/aks/manage-node-pools#limitations

> You can't change the VM size of a node pool after you create it.

Fixes #2462
danielrbradley added a commit that referenced this issue Oct 18, 2024
As noted on
https://learn.microsoft.com/en-us/azure/aks/gpu-multi-instance?tabs=azure-cli#gpu-instance-profiles

> You can't change the applied GPU instance profile after node pool
creation.

As noted on
https://learn.microsoft.com/en-us/azure/aks/manage-node-pools#limitations

> You can't change the VM size of a node pool after you create it.

Fixes #2462
@pulumi-bot pulumi-bot added the resolution/fixed This issue was fixed label Oct 18, 2024
@pulumi-bot
Copy link
Contributor

This issue has been addressed in PR #3655 and shipped in release v2.68.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/providers impact/quality impact/usability Something that impacts users' ability to use the product easily and intuitively kind/bug Some behavior is incorrect or out of spec language/python resolution/fixed This issue was fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants