better documentation around scaling strategy #1463

Open
jason-berk-k1x opened this issue Aug 31, 2024 · 0 comments

I'm using ScaledJob and I'm having trouble understanding the scaling strategies and how they differ.

My ScaledJob is triggered from an Azure Service Bus queue and is configured like so:

job:
  paused: "false"
  activeDeadlineSeconds: 600
  pollingInterval: 30
  minReplicaCount: 0
  maxReplicaCount: 3
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 10
  scalingStrategy: "eager"
  trigger:
    queueName: some-queue-name
    messageCount: "1"
    auth: my-cluster-trigger-auth

My goal is a ScaledJob that runs when messages land on the queue, with up to three Jobs running in parallel. Each job:

  1. gets a message from the queue and locks it (at least, that's what my engineers are telling me)
  2. processes the message to completion
  3. "completes" the message (so it's no longer in the queue)
  4. exits cleanly

On the off chance that processing fails or the pod dies, the lock will eventually expire and a different job will be started to process the message again. If no job can process the message, we'll eventually hit the max delivery count and the message will be dead-lettered.
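To make the lifecycle above concrete, here is a minimal in-memory sketch of peek-lock behavior. It is not the Azure SDK: `ToyQueue`, `Message`, `lock_seconds`, and `max_delivery_count` are hypothetical names chosen for illustration, and the sketch assumes that a locked message stays stored in the queue until it is completed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    body: str
    delivery_count: int = 0
    locked_until: Optional[float] = None  # None means never locked


@dataclass
class ToyQueue:
    """Hypothetical in-memory stand-in for a peek-lock queue (not the Azure SDK)."""
    lock_seconds: float = 60.0
    max_delivery_count: int = 3
    messages: List[Message] = field(default_factory=list)
    dead_letter: List[Message] = field(default_factory=list)

    def send(self, body: str) -> None:
        self.messages.append(Message(body))

    def receive(self, now: float) -> Optional[Message]:
        """Lock and return the first available message, or None if none is available."""
        for msg in list(self.messages):
            if msg.locked_until is not None and msg.locked_until > now:
                continue  # still locked by another consumer
            if msg.delivery_count >= self.max_delivery_count:
                # Too many delivery attempts: move it to the dead-letter list.
                self.messages.remove(msg)
                self.dead_letter.append(msg)
                continue
            msg.delivery_count += 1
            msg.locked_until = now + self.lock_seconds
            return msg
        return None

    def complete(self, msg: Message) -> None:
        """Step 3 of the job: remove the message for good."""
        self.messages.remove(msg)


queue = ToyQueue(lock_seconds=60, max_delivery_count=3)
queue.send("work item")

first = queue.receive(now=0)           # job A locks the message
second = queue.receive(now=30)         # job B arrives one poll later and gets nothing
still_in_queue = len(queue.messages)   # the locked message is still stored
```

In this toy model the second consumer receives nothing while the message remains in the queue, which is the situation the idle jobs described below would be in if the real queue behaves the same way.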

With both the accurate and eager strategies, when I drop a message on the queue, I see a job start within 30 seconds (as expected). Again, my understanding is that the message is locked, but:

  • thirty seconds later, after the next poll, another job starts up, tries to pull a message from the queue, and sits idle, blocked waiting for a message
  • another thirty seconds later, a third job starts up and likewise sits idle, blocked waiting for a message

Meanwhile, the only job actually doing any work is the first one, but now I'm at three running jobs: one processing a message and the other two just sitting around waiting. Eventually, either a message comes in and one of the two idle jobs grabs it, or no messages come in and the idle job hits activeDeadlineSeconds and shows up as a Failed job.
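One way to reason about the pile-up described above is a toy model of the per-poll calculation. The formulas below are my reading of the KEDA ScaledJob documentation for the default, accurate, and eager strategies, so treat them as an assumption rather than KEDA's actual code. The function names are hypothetical, and the model further assumes that the scaler keeps reporting a queue length of 1 while the single message is locked and that no job exits during the simulation.

```python
import math


def max_scale(queue_length: int, target_per_job: int, max_replicas: int) -> int:
    """Jobs the reported queue length asks for, capped at maxReplicaCount."""
    return min(math.ceil(queue_length / target_per_job), max_replicas)


def new_jobs(strategy: str, queue_length: int, running: int, pending: int,
             target_per_job: int = 1, max_replicas: int = 3) -> int:
    """Jobs to create on one poll (assumed formulas, see the note above)."""
    scale = max_scale(queue_length, target_per_job, max_replicas)
    if strategy == "default":
        jobs = scale - running
    elif strategy == "accurate":
        if scale + running > max_replicas:
            jobs = max_replicas - running
        else:
            jobs = scale - pending
    elif strategy == "eager":
        jobs = min(max_replicas - running - pending, scale)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return max(jobs, 0)  # never a negative number of jobs


def simulate(strategy: str, polls: int = 4) -> list:
    """Assume one locked message stays counted (queue_length=1) and no job exits."""
    running, history = 0, []
    for _ in range(polls):
        started = new_jobs(strategy, queue_length=1, running=running, pending=0)
        running += started
        history.append(started)
    return history


# Jobs started on each of four consecutive polls, per strategy.
results = {name: simulate(name) for name in ("default", "accurate", "eager")}
```

Under these assumptions, the accurate and eager branches each start one job per poll until three are running, while the default branch starts only the first job. Whether the Azure Service Bus scaler really counts locked messages is exactly the open question here, so this is a model to test against, not an explanation.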

I see the same behavior when using accurate, except that after the idle jobs time out, more jobs are started. It appears there are always three running jobs, even overnight while nothing is in the queue: every ten minutes one job "Fails" and another starts. With eager, once the idle jobs time out, new ones are not created while the queue is empty.

Also, in the docs for scaling strategy, I see:

accurate If the scaler returns queueLength (number of items in the queue) that does not include the number of locked messages, this strategy is recommended. Azure Storage Queue is one example. You can use this strategy if you delete a message once your app consumes it.

So my questions are:

  1. How exactly does one confirm whether a scaler behaves this way, that is, whether the queueLength it returns excludes locked messages?
  2. Why do those extra jobs get started long after the first job has pulled the message and started processing it?