[Sandbox] Kubernetes AI Toolchain Operator (KAITO) #106

Open
2 tasks done
sdesai345 opened this issue Jun 20, 2024 · 17 comments

Comments

@sdesai345

sdesai345 commented Jun 20, 2024

Application contact emails

sachidesai@microsoft.com, guofei@microsoft.com, ishaansehgal@microsoft.com, jpalma@microsoft.com, qike@microsoft.com

Project Summary

KAITO automates the deployment of AI models and the associated infrastructure provisioning on a Kubernetes cluster.

Project Description

The Kubernetes AI Toolchain Operator (KAITO) is a cloud-native Kubernetes operator that automates the deployment of language models in a cluster across available CPU and GPU resources. For inferencing and fine-tuning scenarios, KAITO selects optimally sized infrastructure for the model and gives users the flexibility to switch to other available resource types. KAITO makes it easy to split inferencing for a range of preset models across multiple VMs with lower GPU counts, significantly reducing maintenance costs and overall inference service setup time.

Org repo URL (provide if all repos under the org are in scope of the application)

N/A

Project repo URL in scope of application

https://github.com/Azure/kaito

Additional repos in scope of the application

No response

Website URL

https://github.com/Azure/kaito

Roadmap

https://github.com/orgs/Azure/projects/669

Roadmap context

No response

Contributing Guide

https://github.com/Azure/kaito/blob/main/docs/contributing/readme.md

Code of Conduct (CoC)

https://github.com/Azure/kaito/tree/main?tab=coc-ov-file

Adopters

No response

Contributing or Sponsoring Org

Microsoft Azure

Maintainers file

https://github.com/Azure/kaito/blob/main/CODEOWNERS

IP Policy

  • If the project is accepted, I agree the project will follow the CNCF IP Policy

Trademark and accounts

  • If the project is accepted, I agree to donate all project trademarks and accounts to the CNCF

Why CNCF?

The CNCF can give KAITO the ability to grow as a project and as a community of contributors. Across the CNAI and related working groups, members can extend KAITO to support more large language models for inferencing, improve fine-tuning capabilities, and connect to a wider range of GPU infrastructure. The CNCF encourages and cultivates a strong overlap between focus areas and working groups, particularly around scheduling, networking, and infrastructure management. Enhancements in cloud-native technologies can benefit AI/ML workloads. Given this interconnected nature of the CNCF, several working groups can collaborate and grow KAITO to match the pace of AI growth in today's world.

Benefit to the Landscape

This project seeks to bridge the gap between AI application development and cloud-native technologies. KAITO serves as a tool to onboard and streamline containerized AI/ML workloads for cloud-native users. It is built upon open-source CNCF projects and is extensible to pair with future CNCF projects. As interest in ML inferencing and fine-tuning grows exponentially with the frequent release of high-performance open-source models, KAITO will help the CNCF community keep up, regardless of expertise in container orchestration or AI.

Cloud Native 'Fit'

KAITO fits into the Automation and Configuration (Provisioning) category of the Cloud Native landscape, as a Kubernetes operator that automates the deployment of containerized LLMs. KAITO has two main open-source components: a workspace controller, which triggers node auto-provisioning and uses model preset configurations to create the inference workload, and a gpu-provisioner controller, which the workspace controller interacts with to add GPUs to a cluster from a given cloud provider.

Cloud Native 'Integration'

A major component of KAITO, the node provisioner controller, is built upon the machine custom resource definition (CRD) of the Karpenter project. It interacts with the workspace controller component to trigger the auto-provisioning of GPU nodes in a Kubernetes cluster.

Cloud Native Overlap

KAITO overlaps with Kubernetes, as the project is a Kubernetes operator following the established Kubernetes custom resource definition (CRD) and controller design pattern.

Similar projects

N/A

Landscape

No, this project is not yet listed in the CNCF landscape.

Business Product or Service to Project separation

Azure Kubernetes Service (AKS) has developed a managed add-on based on the KAITO project for AKS customers. This add-on, called the AI toolchain operator add-on, automatically provisions Azure-managed GPU nodes when deploying AI workloads on AKS clusters. The Azure-managed add-on will follow a separate release cadence and will be compatible with AKS features. The KAITO open-source project will be developed in collaboration with the AKS team and members of the upstream Kubernetes community, so that it remains extensible across cloud providers and empowers developers to leverage various GPU types for AI workloads.

Project presentations

We recently presented KAITO and our roadmap to TAG App Delivery on June 12 and to TAG Runtime on June 20. We also plan to present to WG-artificial-intelligence on June 27.

Project champions

@lachie83

Additional information

No response

@sdesai345 sdesai345 added the New New Application label Jun 20, 2024
@raravena80

TAG-Runtime

@TheFoxAtWork
Contributor

@srust @raravena80 @miao0miao Does the TAG have a recommendation?
@lianmakesthings @thschue @roberthstrand Does the TAG have a recommendation?

@TheFoxAtWork
Contributor

Questions for the project

  • Reviewing the content in the installation guide and the comment under business product separation, it appears the project is currently limited to AKS. Is this correct?
  • If this is the case, is the project seeking inclusion in the CNCF for vendor neutrality, extending to other CSPs as well as on-prem? I didn't see any of this in the roadmap.

@sdesai345
Author

  • KAITO works on AKS and is now compatible with Amazon EKS. We are updating the repository and documentation this coming week. Testing with GKE is currently in progress.
  • Yes, the project is seeking inclusion in the CNCF for vendor neutrality extending to other CSPs (and it currently supports self-hosted Kubernetes). This is an active workstream in the roadmap here: Onboard Katio to Kubernetes services hosted by other cloud vendors Azure/kaito#452.

Please follow up with any further questions. Thank you, @TheFoxAtWork.

@raravena80

TAG-Runtime's review/assessment

@cathyhongzhang

This project customizes the deployment of AI models on a Kubernetes cluster. It provides one way to specify an AI model's GPU resource requirements and fine-tuning parameters. I would expect more alternative ways down the road.

@jberkus

jberkus commented Oct 4, 2024

TAG Contributor Strategy has reviewed this project and found the following:

  • The contributor guide is currently a stub.
  • The project appears to have no written governance yet.
  • The roadmap is a simple GH project board; it appears to be a few months old and in use.
  • There are three maintainers, all from Microsoft.
  • The project uses the Microsoft CoC, so it would need to switch.

This review is for the TOC’s information only. Sandbox projects are not required to have full governance or contributor documentation.

@mrbobbytables
Member

The project has been given the okay to move to a vote in today's sandbox review.
/vote


git-vote bot commented Oct 8, 2024

Vote created

@mrbobbytables has called for a vote on [Sandbox] Kubernetes AI Toolchain Operator (KAITO) (#106).

The members of the following teams have binding votes:

Team
@cncf/cncf-toc

Non-binding votes are also appreciated as a sign of support!

How to vote

You can cast your vote by reacting to this comment. The following reactions are supported:

| In favor | Against | Abstain |
| --- | --- | --- |
| 👍 | 👎 | 👀 |

Please note that voting for multiple options is not allowed and those votes won't be counted.

The vote will be open for 2 months 30 days 2h 52m 48s. It will pass if at least 66% of the users with binding votes vote In favor 👍. Once it's closed, results will be published here as a new comment.

@sdesai345
Author

/check-vote


git-vote bot commented Oct 9, 2024

Vote status

So far 54.55% of the users with binding vote are in favor (passing threshold: 66%).

Summary

| In favor | Against | Abstain | Not voted |
| --- | --- | --- | --- |
| 6 | 0 | 0 | 5 |

Binding votes (6)

| User | Vote | Timestamp |
| --- | --- | --- |
| angellk | In favor | 2024-10-08 17:20:08.0 +00:00:00 |
| kgamanji | In favor | 2024-10-09 12:51:22.0 +00:00:00 |
| rochaporto | In favor | 2024-10-09 1:02:11.0 +00:00:00 |
| nikhita | In favor | 2024-10-09 6:07:27.0 +00:00:00 |
| TheFoxAtWork | In favor | 2024-10-08 17:36:12.0 +00:00:00 |
| dims | In favor | 2024-10-08 18:06:47.0 +00:00:00 |
| @mauilion | Pending | |
| @linsun | Pending | |
| @dzolotusky | Pending | |
| @kevin-wangzefeng | Pending | |
| @cathyhongzhang | Pending | |

@sdesai345
Author

/check-vote


git-vote bot commented Oct 10, 2024

Votes can only be checked once a day.

@sdesai345
Author

/check-vote


git-vote bot commented Oct 10, 2024

Vote status

So far 63.64% of the users with binding vote are in favor (passing threshold: 66%).

Summary

| In favor | Against | Abstain | Not voted |
| --- | --- | --- | --- |
| 7 | 0 | 0 | 4 |

Binding votes (7)

| User | Vote | Timestamp |
| --- | --- | --- |
| angellk | In favor | 2024-10-08 17:20:08.0 +00:00:00 |
| dims | In favor | 2024-10-08 18:06:47.0 +00:00:00 |
| kevin-wangzefeng | In favor | 2024-10-10 6:55:57.0 +00:00:00 |
| nikhita | In favor | 2024-10-09 6:07:27.0 +00:00:00 |
| kgamanji | In favor | 2024-10-09 12:51:22.0 +00:00:00 |
| rochaporto | In favor | 2024-10-09 1:02:11.0 +00:00:00 |
| TheFoxAtWork | In favor | 2024-10-08 17:36:12.0 +00:00:00 |
| @mauilion | Pending | |
| @linsun | Pending | |
| @dzolotusky | Pending | |
| @cathyhongzhang | Pending | |

Non-binding votes (1)

| User | Vote | Timestamp |
| --- | --- | --- |
| raravena80 | In favor | 2024-10-09 19:32:44.0 +00:00:00 |


git-vote bot commented Oct 11, 2024

Vote closed

The vote passed! 🎉

72.73% of the users with binding vote were in favor (passing threshold: 66%).

Summary

| In favor | Against | Abstain | Not voted |
| --- | --- | --- | --- |
| 8 | 0 | 0 | 3 |

Binding votes (8)

| User | Vote | Timestamp |
| --- | --- | --- |
| @kevin-wangzefeng | In favor | 2024-10-10 6:55:57.0 +00:00:00 |
| @nikhita | In favor | 2024-10-09 6:07:27.0 +00:00:00 |
| @TheFoxAtWork | In favor | 2024-10-08 17:36:12.0 +00:00:00 |
| @rochaporto | In favor | 2024-10-09 1:02:11.0 +00:00:00 |
| @dims | In favor | 2024-10-08 18:06:47.0 +00:00:00 |
| @kgamanji | In favor | 2024-10-09 12:51:22.0 +00:00:00 |
| @angellk | In favor | 2024-10-08 17:20:08.0 +00:00:00 |
| @dzolotusky | In favor | 2024-10-10 23:02:35.0 +00:00:00 |

Non-binding votes (1)

| User | Vote | Timestamp |
| --- | --- | --- |
| @raravena80 | In favor | 2024-10-09 19:32:44.0 +00:00:00 |
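
As a rough check on the three tallies reported in this thread, the percentages appear to be the in-favor count divided by the 11 users with binding votes (in favor plus not voted, since against and abstain were both 0). The sketch below assumes that is how the share is computed; the function names are hypothetical and this is not git-vote's actual implementation.

```python
# Hypothetical helpers for illustration only; not taken from git-vote.

def binding_share(in_favor: int, eligible: int) -> float:
    """Percentage of eligible binding voters in favor, rounded to 2 decimals."""
    if eligible <= 0:
        raise ValueError("eligible must be positive")
    return round(100 * in_favor / eligible, 2)


def passes(in_favor: int, eligible: int, threshold: float = 66.0) -> bool:
    """True when the in-favor share reaches the passing threshold (66% here)."""
    return 100 * in_favor / eligible >= threshold


# Tallies reported in this thread as (in favor, not voted).
for in_favor, not_voted in [(6, 5), (7, 4), (8, 3)]:
    eligible = in_favor + not_voted  # against and abstain were both 0
    share = binding_share(in_favor, eligible)
    print(f"{in_favor}/{eligible}: {share}% passes={passes(in_favor, eligible)}")
```

Under this assumption, 6 and 7 of 11 stay below the 66% threshold, and the eighth binding vote is the first tally to cross it.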

@git-vote git-vote bot removed the vote open label Oct 11, 2024
@Cmierly

Cmierly commented Oct 17, 2024

Congrats on being accepted into the CNCF Sandbox!
Here's a link to your onboarding checklist:
#298

If you have any questions or concerns, please don't hesitate to reach out!
