diff --git a/index.html b/index.html
index 674668dd..42bc22a0 100644
--- a/index.html
+++ b/index.html
@@ -730,8 +730,8 @@

Supported Models

Guides

OpenAI API Compatibility

# Implemented #
diff --git a/search/search_index.json b/search/search_index.json
index 8531fa0d..6529b3ed 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"KubeAI: Private Open AI on Kubernetes","text":"

The simple AI platform that runs on Kubernetes.

\"KubeAI is highly scalable, yet compact enough to fit on my old laptop.\" - Some Google Engineer

\u2705\ufe0f Drop-in replacement for OpenAI with API compatibility \ud83d\ude80 Serve OSS LLMs on CPUs or GPUs \u2696\ufe0f Scale from zero, autoscale based on load \ud83d\udee0\ufe0f Zero dependencies (no Istio, Knative, etc.) \ud83e\udd16 Operates OSS model servers (vLLM and Ollama) \ud83d\udd0b Additional OSS addons included (OpenWebUI i.e. ChatGPT UI) \u2709\ufe0f Plug-n-play with cloud messaging systems (Kafka, PubSub, etc.)

"},{"location":"#architecture","title":"Architecture","text":"

KubeAI serves an OpenAI-compatible HTTP API. Admins can configure ML models via kind: Model Kubernetes Custom Resources. KubeAI can be thought of as a Model Operator (see Operator Pattern) that manages vLLM and Ollama servers.
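
As a minimal sketch of what that API looks like from a client's perspective (assuming the kubeai Service is port-forwarded to localhost:8000, as done elsewhere in these docs), the configured models can be listed with a plain OpenAI-style request:

# Sketch: forward the kubeai Service to a local port first.\nkubectl port-forward svc/kubeai 8000:80\n\n# Then list the configured models via the OpenAI-compatible endpoint:\ncurl http://localhost:8000/openai/v1/models\n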

"},{"location":"#local-quickstart","title":"Local Quickstart","text":"

Create a local cluster using kind or minikube.

TIP: If you are using Podman for kind... Make sure your Podman machine can use up to 6G of memory (by default it is capped at 2G):
# You might need to stop and remove the existing machine:\npodman machine stop\npodman machine rm\n\n# Init and start a new machine:\npodman machine init --memory 6144\npodman machine start\n
kind create cluster # OR: minikube start\n

Add the KubeAI Helm repository.

helm repo add kubeai https://substratusai.github.io/kubeai/\nhelm repo update\n

Install KubeAI and wait for all components to be ready (may take a minute).

cat <<EOF > helm-values.yaml\nmodels:\n  catalog:\n    gemma2-2b-cpu:\n      enabled: true\n      minReplicas: 1\n    qwen2-500m-cpu:\n      enabled: true\n    nomic-embed-text-cpu:\n      enabled: true\nEOF\n\nhelm upgrade --install kubeai kubeai/kubeai \\\n    -f ./helm-values.yaml \\\n    --wait --timeout 10m\n

Before progressing to the next steps, start a watch on Pods in a standalone terminal to see how KubeAI deploys models.

kubectl get pods --watch\n
"},{"location":"#interact-with-gemma2","title":"Interact with Gemma2","text":"

Because we set minReplicas: 1 for the Gemma model, you should see a model Pod already coming up.

Start a local port-forward to the bundled chat UI.

kubectl port-forward svc/openwebui 8000:80\n

Now open your browser to localhost:8000 and select the Gemma model to start chatting with.

"},{"location":"#scale-up-qwen2-from-zero","title":"Scale up Qwen2 from Zero","text":"

If you go back to the browser and start a chat with Qwen2, you will notice that it will take a while to respond at first. This is because we set minReplicas: 0 for this model and KubeAI needs to spin up a new Pod (you can verify with kubectl get models -oyaml qwen2-500m-cpu).
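
You can also trigger the scale-up from the command line instead of the UI. This is a sketch that assumes the kubeai Service is port-forwarded to localhost:8000; the first request will block until the new Pod is ready:

# In one terminal (assumes the default Service name and port):\nkubectl port-forward svc/kubeai 8000:80\n\n# In another terminal, send a request to the scaled-to-zero model:\ncurl http://localhost:8000/openai/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"qwen2-500m-cpu\", \"messages\": [{\"role\": \"user\", \"content\": \"Hi\"}]}'\n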

NOTE: Autoscaling after initial scale-from-zero is not yet supported for the Ollama backend, which we use in this local quickstart. KubeAI relies on backend-specific metrics, and the Ollama project has an open issue: https://github.com/ollama/ollama/issues/3144. To see autoscaling in action, check out one of the cloud install guides, which use the vLLM backend and autoscale across GPU resources.

"},{"location":"#supported-models","title":"Supported Models","text":"

Any vLLM or Ollama model can be served by KubeAI. Some examples of popular models served on KubeAI include:

  • Llama v3.1 (8B, 70B, 405B)
  • Gemma2 (2B, 9B, 27B)
  • Qwen2 (1.5B, 7B, 72B)
"},{"location":"#guides","title":"Guides","text":"
  • Cloud Installation - Deploy on Kubernetes clusters in the cloud
  • Model Management - Manage ML models
"},{"location":"#openai-api-compatibility","title":"OpenAI API Compatibility","text":"
# Implemented #\n/v1/chat/completions\n/v1/completions\n/v1/embeddings\n/v1/models\n\n# Planned #\n# /v1/assistants/*\n# /v1/batches/*\n# /v1/fine_tuning/*\n# /v1/images/*\n# /v1/vector_stores/*\n
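
For example, the embeddings endpoint can be exercised against the nomic-embed-text-cpu model from the local quickstart. This is a sketch that assumes the kubeai Service is port-forwarded to localhost:8000:

# Assumes: kubectl port-forward svc/kubeai 8000:80 is running in another terminal.\ncurl http://localhost:8000/openai/v1/embeddings \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"nomic-embed-text-cpu\", \"input\": \"KubeAI runs open models on Kubernetes\"}'\n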
"},{"location":"#immediate-roadmap","title":"Immediate Roadmap","text":"
  • Model caching
  • LoRA finetuning (compatible with OpenAI finetuning API)
  • Image generation (compatible with OpenAI images API)
"},{"location":"#contact","title":"Contact","text":"

Let us know about features you are interested in seeing or reach out with questions. Visit our Discord channel to join the discussion!

Or just reach out on LinkedIn if you want to connect:

  • Nick Stogner
  • Sam Stoelinga
"},{"location":"development/","title":"Development","text":"

This document provides instructions for setting up a development environment for KubeAI.

"},{"location":"development/#cloud-setup","title":"Cloud Setup","text":"
gcloud pubsub topics create test-kubeai-requests\ngcloud pubsub subscriptions create test-kubeai-requests-sub --topic test-kubeai-requests\ngcloud pubsub topics create test-kubeai-responses\ngcloud pubsub subscriptions create test-kubeai-responses-sub --topic test-kubeai-responses\n
"},{"location":"development/#local-cluster","title":"Local Cluster","text":"
kind create cluster\n# OR\n#./hack/create-dev-gke-cluster.yaml\n\n# When CRDs are changed reapply using kubectl:\nkubectl apply -f ./charts/kubeai/charts/crds/crds\n\n# Model with special address annotations:\nkubectl apply -f ./hack/dev-model.yaml\n\n# For developing in-cluster features:\nhelm upgrade --install kubeai ./charts/kubeai \\\n    --set openwebui.enabled=true \\\n    --set image.tag=latest \\\n    --set image.pullPolicy=Always \\\n    --set image.repository=us-central1-docker.pkg.dev/substratus-dev/default/kubeai \\\n    --set replicaCount=1 # 0 if running out-of-cluster (using \"go run\")\n\n# -f ./helm-values.yaml \\\n\n# Run in development mode.\nCONFIG_PATH=./hack/dev-config.yaml POD_NAMESPACE=default go run ./cmd/main.go --allow-pod-address-override\n\n# In another terminal:\nwhile true; do kubectl port-forward service/dev-model 7000:7000; done\n
"},{"location":"development/#running","title":"Running","text":""},{"location":"development/#completions-api","title":"Completions API","text":"
# If you are running kubeai in-cluster:\n# kubectl port-forward svc/kubeai 8000:80\n\ncurl http://localhost:8000/openai/v1/completions -H \"Content-Type: application/json\" -d '{\"prompt\": \"Hi\", \"model\": \"dev\"}' -v\n
"},{"location":"development/#messaging-integration","title":"Messaging Integration","text":"
gcloud pubsub topics publish test-kubeai-requests \\\n  --message='{\"path\":\"/v1/completions\", \"metadata\":{\"a\":\"b\"}, \"body\": {\"model\": \"dev\", \"prompt\": \"hi\"}}'\n\ngcloud pubsub subscriptions pull test-kubeai-responses-sub --auto-ack\n
"},{"location":"model-management/","title":"Model Management","text":"

KubeAI uses Model Custom Resources to configure what ML models are available in the system.

Example:

apiVersion: kubeai.org/v1\nkind: Model\nmetadata:\n  name: llama-3.1-8b-instruct-fp8-l4\nspec:\n  features: [\"TextGeneration\"]\n  owner: neuralmagic\n  url: hf://neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8\n  engine: VLLM\n  args:\n    - --max-model-len=16384\n    - --max-num-batched-tokens=16384\n    - --gpu-memory-utilization=0.9\n  minReplicas: 0\n  maxReplicas: 3\n  resourceProfile: L4:1\n
"},{"location":"model-management/#listing-models","title":"Listing Models","text":"

You can view all installed models through the Kubernetes API using kubectl get models (use the -o yaml flag for more details).

You can also list all models via the OpenAI-compatible /v1/models endpoint:

curl http://your-deployed-kubeai-endpoint/openai/v1/models\n
"},{"location":"model-management/#installing-a-predefined-model-using-helm","title":"Installing a predefined Model using Helm","text":"

When you are defining your Helm values, you can install a predefined Model by setting enabled: true:

models:\n  catalog:\n    llama-3.1-8b-instruct-fp8-l4:\n      enabled: true\n

You can also optionally override settings for a given model:

models:\n  catalog:\n    llama-3.1-8b-instruct-fp8-l4:\n      enabled: true\n      env:\n        MY_CUSTOM_ENV_VAR: \"some-value\"\n
"},{"location":"model-management/#adding-custom-models-with-helm","title":"Adding Custom Models with Helm","text":"

If you prefer to add a custom model via the same Helm chart you used to install KubeAI, you can add a custom model entry to the .models.catalog map of your existing Helm values file:

# ...\nmodels:\n  catalog:\n    my-custom-model-name:\n      enabled: true\n      features: [\"TextEmbedding\"]\n      owner: me\n      url: \"hf://me/my-custom-model\"\n      resourceProfile: CPU:1\n

Then you can re-run helm upgrade with the same flags you used to install KubeAI.
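
For example, if you installed from the public chart as in the quickstart, the re-run might look like this (adjust the release name and values file to whatever you used originally):

helm upgrade --install kubeai kubeai/kubeai \\\n    -f ./helm-values.yaml \\\n    --wait\n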

"},{"location":"model-management/#adding-custom-models-directly","title":"Adding Custom Models Directly","text":"

You can add your own model by defining a Model yaml file and applying it using kubectl apply -f model.yaml.

If you have a running cluster with KubeAI installed you can inspect the schema for a Model using kubectl explain:

kubectl explain models\nkubectl explain models.spec\nkubectl explain models.spec.engine\n
"},{"location":"model-management/#model-management-ui","title":"Model Management UI","text":"

We are considering adding a UI for managing models in a running KubeAI instance. Give the GitHub Issue a thumbs up if you would be interested in this feature.

"},{"location":"installation/gke/","title":"Install KubeAI on GKE","text":"TIP: Make sure you have enough quota in your GCP project.

Open the cloud console quotas page: https://console.cloud.google.com/iam-admin/quotas. Make sure your project is selected in the top left.

There are 3 critical quotas you will need to verify for this guide. The minimum values below assume that you have nothing else running in your project.

  • Preemptible NVIDIA L4 GPUs (location: <your-region>): minimum 2
  • GPUs (all regions): minimum 2
  • CPUs (all regions): minimum 24

See the following screenshot examples of how these quotas appear in the console:

"},{"location":"installation/gke/#gke-autopilot","title":"GKE Autopilot","text":"

Create an Autopilot cluster (replace us-central1 with a region in which you have quota).

gcloud container clusters create-auto cluster-1 \\\n    --location=us-central1\n

Define the installation values for GKE.

cat <<EOF > helm-values.yaml\nmodels:\n  catalog:\n    llama-3.1-8b-instruct-fp8-l4:\n      enabled: true\n\nresourceProfiles:\n  L4:\n    nodeSelector:\n      cloud.google.com/gke-accelerator: \"nvidia-l4\"\n      cloud.google.com/gke-spot: \"true\"\nEOF\n

Make sure you have a HuggingFace Hub token set in your environment (HUGGING_FACE_HUB_TOKEN).
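
If it is not already set, export it first (the value below is a placeholder for your own token):

export HUGGING_FACE_HUB_TOKEN=<your-hf-token>\n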

Install KubeAI with Helm.

helm repo add kubeai https://substratusai.github.io/kubeai/\nhelm repo update\n\nhelm upgrade --install kubeai kubeai/kubeai \\\n    -f ./helm-values.yaml \\\n    --set secrets.huggingface.token=$HUGGING_FACE_HUB_TOKEN \\\n    --wait\n
"},{"location":"tutorials/langtrace/","title":"Deploying KubeAI with Langtrace","text":"

Langtrace is an open source tool that helps you trace and monitor your AI calls. It includes a self-hosted UI that, for example, shows you the estimated cost of your LLM calls.

KubeAI is used for deploying LLMs with an OpenAI compatible endpoint.

In this tutorial you will learn how to deploy KubeAI and Langtrace end-to-end. Both KubeAI and Langtrace are installed in your Kubernetes cluster. No cloud services or external dependencies are required.

If you don't have a K8s cluster yet, you can create one using kind or minikube.

kind create cluster # OR: minikube start\n

Install Langtrace:

helm repo add langtrace https://Scale3-Labs.github.io/langtrace-helm-chart\nhelm repo update\nhelm install langtrace langtrace/langtrace\n

Install KubeAI:

helm repo add kubeai https://substratusai.github.io/kubeai/\nhelm repo update\ncat <<EOF > helm-values.yaml\nmodels:\n  catalog:\n    gemma2-2b-cpu:\n      enabled: true\n      minReplicas: 1\nEOF\n\nhelm upgrade --install kubeai kubeai/kubeai \\\n    --wait --timeout 10m \\\n    -f ./helm-values.yaml\n

Create a local Python environment and install dependencies:

python3 -m venv .venv\nsource .venv/bin/activate\npip install langtrace-python-sdk openai\n

Expose the KubeAI service to your local port:

kubectl port-forward service/kubeai 8000:80\n

Expose the Langtrace service to your local port:

kubectl port-forward service/langtrace-app 3000:3000\n

A Langtrace API key is required to use the Langtrace SDK, so let's get one by visiting your self-hosted Langtrace UI.

Open your browser to http://localhost:3000, create a project, and get the API key for your Langtrace project.

In the Python script below, replace langtrace_api_key with your API key.

Create a file named langtrace-example.py with the following content:

# Replace this with your Langtrace API key (get it from http://localhost:3000)\nlangtrace_api_key = \"f7e003de19b9a628258531c17c264002e985604ca9fa561debcc85c41f357b09\"\n\nfrom langtrace_python_sdk import langtrace\nfrom langtrace_python_sdk.utils.with_root_span import with_langtrace_root_span\nfrom openai import OpenAI\n\n# Initialize Langtrace before making any OpenAI calls so they get traced.\nlangtrace.init(\n    api_key=langtrace_api_key,\n    api_host=\"http://localhost:3000/api/trace\",\n)\n\n# KubeAI's OpenAI-compatible endpoint (port-forwarded earlier) and model.\nbase_url = \"http://localhost:8000/openai/v1\"\nmodel = \"gemma2-2b-cpu\"\n\n@with_langtrace_root_span()\ndef example():\n    client = OpenAI(base_url=base_url, api_key=\"ignored-by-kubeai\")\n    response = client.chat.completions.create(\n        model=model,\n        messages=[\n            {\n                \"role\": \"system\",\n                \"content\": \"How many states of matter are there?\"\n            }\n        ],\n    )\n    print(response.choices[0].message.content)\n\nexample()\n

Run the Python script:

python3 langtrace-example.py\n

Now you should see the trace in your Langtrace UI. Take a look by visiting http://localhost:3000.

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"KubeAI: Private Open AI on Kubernetes","text":"

The simple AI platform that runs on Kubernetes.

\"KubeAI is highly scalable, yet compact enough to fit on my old laptop.\" - Some Google Engineer

\u2705\ufe0f Drop-in replacement for OpenAI with API compatibility \ud83d\ude80 Serve OSS LLMs on CPUs or GPUs \u2696\ufe0f Scale from zero, autoscale based on load \ud83d\udee0\ufe0f Zero dependencies (no Istio, Knative, etc.) \ud83e\udd16 Operates OSS model servers (vLLM and Ollama) \ud83d\udd0b Additional OSS addons included (OpenWebUI i.e. ChatGPT UI) \u2709\ufe0f Plug-n-play with cloud messaging systems (Kafka, PubSub, etc.)

"},{"location":"#architecture","title":"Architecture","text":"

KubeAI serves an OpenAI-compatible HTTP API. Admins can configure ML models via kind: Model Kubernetes Custom Resources. KubeAI can be thought of as a Model Operator (see Operator Pattern) that manages vLLM and Ollama servers.
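
As a minimal sketch of what that API looks like from a client's perspective (assuming the kubeai Service is port-forwarded to localhost:8000, as done elsewhere in these docs), the configured models can be listed with a plain OpenAI-style request:

# Sketch: forward the kubeai Service to a local port first.\nkubectl port-forward svc/kubeai 8000:80\n\n# Then list the configured models via the OpenAI-compatible endpoint:\ncurl http://localhost:8000/openai/v1/models\n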

"},{"location":"#local-quickstart","title":"Local Quickstart","text":"

Create a local cluster using kind or minikube.

TIP: If you are using Podman for kind... Make sure your Podman machine can use up to 6G of memory (by default it is capped at 2G):
# You might need to stop and remove the existing machine:\npodman machine stop\npodman machine rm\n\n# Init and start a new machine:\npodman machine init --memory 6144\npodman machine start\n
kind create cluster # OR: minikube start\n

Add the KubeAI Helm repository.

helm repo add kubeai https://substratusai.github.io/kubeai/\nhelm repo update\n

Install KubeAI and wait for all components to be ready (may take a minute).

cat <<EOF > helm-values.yaml\nmodels:\n  catalog:\n    gemma2-2b-cpu:\n      enabled: true\n      minReplicas: 1\n    qwen2-500m-cpu:\n      enabled: true\n    nomic-embed-text-cpu:\n      enabled: true\nEOF\n\nhelm upgrade --install kubeai kubeai/kubeai \\\n    -f ./helm-values.yaml \\\n    --wait --timeout 10m\n

Before progressing to the next steps, start a watch on Pods in a standalone terminal to see how KubeAI deploys models.

kubectl get pods --watch\n
"},{"location":"#interact-with-gemma2","title":"Interact with Gemma2","text":"

Because we set minReplicas: 1 for the Gemma model, you should see a model Pod already coming up.

Start a local port-forward to the bundled chat UI.

kubectl port-forward svc/openwebui 8000:80\n

Now open your browser to localhost:8000 and select the Gemma model to start chatting with.

"},{"location":"#scale-up-qwen2-from-zero","title":"Scale up Qwen2 from Zero","text":"

If you go back to the browser and start a chat with Qwen2, you will notice that it will take a while to respond at first. This is because we set minReplicas: 0 for this model and KubeAI needs to spin up a new Pod (you can verify with kubectl get models -oyaml qwen2-500m-cpu).
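
You can also trigger the scale-up from the command line instead of the UI. This is a sketch that assumes the kubeai Service is port-forwarded to localhost:8000; the first request will block until the new Pod is ready:

# In one terminal (assumes the default Service name and port):\nkubectl port-forward svc/kubeai 8000:80\n\n# In another terminal, send a request to the scaled-to-zero model:\ncurl http://localhost:8000/openai/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"qwen2-500m-cpu\", \"messages\": [{\"role\": \"user\", \"content\": \"Hi\"}]}'\n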

NOTE: Autoscaling after initial scale-from-zero is not yet supported for the Ollama backend, which we use in this local quickstart. KubeAI relies on backend-specific metrics, and the Ollama project has an open issue: https://github.com/ollama/ollama/issues/3144. To see autoscaling in action, check out one of the cloud install guides, which use the vLLM backend and autoscale across GPU resources.

"},{"location":"#supported-models","title":"Supported Models","text":"

Any vLLM or Ollama model can be served by KubeAI. Some examples of popular models served on KubeAI include:

  • Llama v3.1 (8B, 70B, 405B)
  • Gemma2 (2B, 9B, 27B)
  • Qwen2 (1.5B, 7B, 72B)
"},{"location":"#guides","title":"Guides","text":"
  • Installation on GKE - Deploy on Kubernetes clusters in the cloud
  • Model Management - Manage ML models
"},{"location":"#openai-api-compatibility","title":"OpenAI API Compatibility","text":"
# Implemented #\n/v1/chat/completions\n/v1/completions\n/v1/embeddings\n/v1/models\n\n# Planned #\n# /v1/assistants/*\n# /v1/batches/*\n# /v1/fine_tuning/*\n# /v1/images/*\n# /v1/vector_stores/*\n
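
For example, the embeddings endpoint can be exercised against the nomic-embed-text-cpu model from the local quickstart. This is a sketch that assumes the kubeai Service is port-forwarded to localhost:8000:

# Assumes: kubectl port-forward svc/kubeai 8000:80 is running in another terminal.\ncurl http://localhost:8000/openai/v1/embeddings \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"nomic-embed-text-cpu\", \"input\": \"KubeAI runs open models on Kubernetes\"}'\n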
"},{"location":"#immediate-roadmap","title":"Immediate Roadmap","text":"
  • Model caching
  • LoRA finetuning (compatible with OpenAI finetuning API)
  • Image generation (compatible with OpenAI images API)
"},{"location":"#contact","title":"Contact","text":"

Let us know about features you are interested in seeing or reach out with questions. Visit our Discord channel to join the discussion!

Or just reach out on LinkedIn if you want to connect:

  • Nick Stogner
  • Sam Stoelinga
"},{"location":"development/","title":"Development","text":"

This document provides instructions for setting up a development environment for KubeAI.

"},{"location":"development/#cloud-setup","title":"Cloud Setup","text":"
gcloud pubsub topics create test-kubeai-requests\ngcloud pubsub subscriptions create test-kubeai-requests-sub --topic test-kubeai-requests\ngcloud pubsub topics create test-kubeai-responses\ngcloud pubsub subscriptions create test-kubeai-responses-sub --topic test-kubeai-responses\n
"},{"location":"development/#local-cluster","title":"Local Cluster","text":"
kind create cluster\n# OR\n#./hack/create-dev-gke-cluster.yaml\n\n# When CRDs are changed reapply using kubectl:\nkubectl apply -f ./charts/kubeai/charts/crds/crds\n\n# Model with special address annotations:\nkubectl apply -f ./hack/dev-model.yaml\n\n# For developing in-cluster features:\nhelm upgrade --install kubeai ./charts/kubeai \\\n    --set openwebui.enabled=true \\\n    --set image.tag=latest \\\n    --set image.pullPolicy=Always \\\n    --set image.repository=us-central1-docker.pkg.dev/substratus-dev/default/kubeai \\\n    --set replicaCount=1 # 0 if running out-of-cluster (using \"go run\")\n\n# -f ./helm-values.yaml \\\n\n# Run in development mode.\nCONFIG_PATH=./hack/dev-config.yaml POD_NAMESPACE=default go run ./cmd/main.go --allow-pod-address-override\n\n# In another terminal:\nwhile true; do kubectl port-forward service/dev-model 7000:7000; done\n
"},{"location":"development/#running","title":"Running","text":""},{"location":"development/#completions-api","title":"Completions API","text":"
# If you are running kubeai in-cluster:\n# kubectl port-forward svc/kubeai 8000:80\n\ncurl http://localhost:8000/openai/v1/completions -H \"Content-Type: application/json\" -d '{\"prompt\": \"Hi\", \"model\": \"dev\"}' -v\n
"},{"location":"development/#messaging-integration","title":"Messaging Integration","text":"
gcloud pubsub topics publish test-kubeai-requests \\\n  --message='{\"path\":\"/v1/completions\", \"metadata\":{\"a\":\"b\"}, \"body\": {\"model\": \"dev\", \"prompt\": \"hi\"}}'\n\ngcloud pubsub subscriptions pull test-kubeai-responses-sub --auto-ack\n
"},{"location":"model-management/","title":"Model Management","text":"

KubeAI uses Model Custom Resources to configure what ML models are available in the system.

Example:

apiVersion: kubeai.org/v1\nkind: Model\nmetadata:\n  name: llama-3.1-8b-instruct-fp8-l4\nspec:\n  features: [\"TextGeneration\"]\n  owner: neuralmagic\n  url: hf://neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8\n  engine: VLLM\n  args:\n    - --max-model-len=16384\n    - --max-num-batched-tokens=16384\n    - --gpu-memory-utilization=0.9\n  minReplicas: 0\n  maxReplicas: 3\n  resourceProfile: L4:1\n
"},{"location":"model-management/#listing-models","title":"Listing Models","text":"

You can view all installed models through the Kubernetes API using kubectl get models (use the -o yaml flag for more details).

You can also list all models via the OpenAI-compatible /v1/models endpoint:

curl http://your-deployed-kubeai-endpoint/openai/v1/models\n
"},{"location":"model-management/#installing-a-predefined-model-using-helm","title":"Installing a predefined Model using Helm","text":"

When you are defining your Helm values, you can install a predefined Model by setting enabled: true:

models:\n  catalog:\n    llama-3.1-8b-instruct-fp8-l4:\n      enabled: true\n

You can also optionally override settings for a given model:

models:\n  catalog:\n    llama-3.1-8b-instruct-fp8-l4:\n      enabled: true\n      env:\n        MY_CUSTOM_ENV_VAR: \"some-value\"\n
"},{"location":"model-management/#adding-custom-models-with-helm","title":"Adding Custom Models with Helm","text":"

If you prefer to add a custom model via the same Helm chart you used to install KubeAI, you can add a custom model entry to the .models.catalog map of your existing Helm values file:

# ...\nmodels:\n  catalog:\n    my-custom-model-name:\n      enabled: true\n      features: [\"TextEmbedding\"]\n      owner: me\n      url: \"hf://me/my-custom-model\"\n      resourceProfile: CPU:1\n

Then you can re-run helm upgrade with the same flags you used to install KubeAI.
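
For example, if you installed from the public chart as in the quickstart, the re-run might look like this (adjust the release name and values file to whatever you used originally):

helm upgrade --install kubeai kubeai/kubeai \\\n    -f ./helm-values.yaml \\\n    --wait\n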

"},{"location":"model-management/#adding-custom-models-directly","title":"Adding Custom Models Directly","text":"

You can add your own model by defining a Model yaml file and applying it using kubectl apply -f model.yaml.

If you have a running cluster with KubeAI installed you can inspect the schema for a Model using kubectl explain:

kubectl explain models\nkubectl explain models.spec\nkubectl explain models.spec.engine\n
"},{"location":"model-management/#model-management-ui","title":"Model Management UI","text":"

We are considering adding a UI for managing models in a running KubeAI instance. Give the GitHub Issue a thumbs up if you would be interested in this feature.

"},{"location":"installation/gke/","title":"Install KubeAI on GKE","text":"TIP: Make sure you have enough quota in your GCP project.

Open the cloud console quotas page: https://console.cloud.google.com/iam-admin/quotas. Make sure your project is selected in the top left.

There are 3 critical quotas you will need to verify for this guide. The minimum values below assume that you have nothing else running in your project.

  • Preemptible NVIDIA L4 GPUs (location: <your-region>): minimum 2
  • GPUs (all regions): minimum 2
  • CPUs (all regions): minimum 24

See the following screenshot examples of how these quotas appear in the console:

"},{"location":"installation/gke/#gke-autopilot","title":"GKE Autopilot","text":"

Create an Autopilot cluster (replace us-central1 with a region in which you have quota).

gcloud container clusters create-auto cluster-1 \\\n    --location=us-central1\n

Define the installation values for GKE.

cat <<EOF > helm-values.yaml\nmodels:\n  catalog:\n    llama-3.1-8b-instruct-fp8-l4:\n      enabled: true\n\nresourceProfiles:\n  L4:\n    nodeSelector:\n      cloud.google.com/gke-accelerator: \"nvidia-l4\"\n      cloud.google.com/gke-spot: \"true\"\nEOF\n

Make sure you have a HuggingFace Hub token set in your environment (HUGGING_FACE_HUB_TOKEN).
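
If it is not already set, export it first (the value below is a placeholder for your own token):

export HUGGING_FACE_HUB_TOKEN=<your-hf-token>\n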

Install KubeAI with Helm.

helm repo add kubeai https://substratusai.github.io/kubeai/\nhelm repo update\n\nhelm upgrade --install kubeai kubeai/kubeai \\\n    -f ./helm-values.yaml \\\n    --set secrets.huggingface.token=$HUGGING_FACE_HUB_TOKEN \\\n    --wait\n
"},{"location":"tutorials/langtrace/","title":"Deploying KubeAI with Langtrace","text":"

Langtrace is an open source tool that helps you trace and monitor your AI calls. It includes a self-hosted UI that, for example, shows you the estimated cost of your LLM calls.

KubeAI is used for deploying LLMs with an OpenAI compatible endpoint.

In this tutorial you will learn how to deploy KubeAI and Langtrace end-to-end. Both KubeAI and Langtrace are installed in your Kubernetes cluster. No cloud services or external dependencies are required.

If you don't have a K8s cluster yet, you can create one using kind or minikube.

kind create cluster # OR: minikube start\n

Install Langtrace:

helm repo add langtrace https://Scale3-Labs.github.io/langtrace-helm-chart\nhelm repo update\nhelm install langtrace langtrace/langtrace\n

Install KubeAI:

helm repo add kubeai https://substratusai.github.io/kubeai/\nhelm repo update\ncat <<EOF > helm-values.yaml\nmodels:\n  catalog:\n    gemma2-2b-cpu:\n      enabled: true\n      minReplicas: 1\nEOF\n\nhelm upgrade --install kubeai kubeai/kubeai \\\n    --wait --timeout 10m \\\n    -f ./helm-values.yaml\n

Create a local Python environment and install dependencies:

python3 -m venv .venv\nsource .venv/bin/activate\npip install langtrace-python-sdk openai\n

Expose the KubeAI service to your local port:

kubectl port-forward service/kubeai 8000:80\n

Expose the Langtrace service to your local port:

kubectl port-forward service/langtrace-app 3000:3000\n

A Langtrace API key is required to use the Langtrace SDK, so let's get one by visiting your self-hosted Langtrace UI.

Open your browser to http://localhost:3000, create a project, and get the API key for your Langtrace project.

In the Python script below, replace langtrace_api_key with your API key.

Create a file named langtrace-example.py with the following content:

# Replace this with your Langtrace API key (get it from http://localhost:3000)\nlangtrace_api_key = \"f7e003de19b9a628258531c17c264002e985604ca9fa561debcc85c41f357b09\"\n\nfrom langtrace_python_sdk import langtrace\nfrom langtrace_python_sdk.utils.with_root_span import with_langtrace_root_span\nfrom openai import OpenAI\n\n# Initialize Langtrace before making any OpenAI calls so they get traced.\nlangtrace.init(\n    api_key=langtrace_api_key,\n    api_host=\"http://localhost:3000/api/trace\",\n)\n\n# KubeAI's OpenAI-compatible endpoint (port-forwarded earlier) and model.\nbase_url = \"http://localhost:8000/openai/v1\"\nmodel = \"gemma2-2b-cpu\"\n\n@with_langtrace_root_span()\ndef example():\n    client = OpenAI(base_url=base_url, api_key=\"ignored-by-kubeai\")\n    response = client.chat.completions.create(\n        model=model,\n        messages=[\n            {\n                \"role\": \"system\",\n                \"content\": \"How many states of matter are there?\"\n            }\n        ],\n    )\n    print(response.choices[0].message.content)\n\nexample()\n

Run the Python script:

python3 langtrace-example.py\n

Now you should see the trace in your Langtrace UI. Take a look by visiting http://localhost:3000.

"}]} \ No newline at end of file