Hey!
"It depends"!
Some strategies:
- Create a server (or several, depending on your queue volume) that isn't in the load balancer rotation, and whose only job is to churn through queue jobs
- Choose one server in the load balancer rotation that also runs a queue worker (if that server fails, you'll have a manual step of enabling queues on another server)
- Run queue workers on each server (safest)
The "depends" part depends on your application:
- Do queue jobs take up a lot of memory (image processing?) that might compete with the web-server/app for resources? (If so, a separate server might be best)
- Is there a large volume of jobs? (Do you need more than one worker churning through jobs?) This affects how many workers you enable and how many servers you spread them across.
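As a rough way to put numbers on the volume question, here is a back-of-the-envelope sketch. The function name `workers_needed` and the 70% target utilization are my own assumptions, not anything from a framework. It only estimates how many workers it takes to keep up with the average load and ignores bursts and retries.

```python
import math


def workers_needed(jobs_per_minute, avg_job_seconds, target_utilization=0.7):
    """Estimate the worker count needed to keep up with the average load.

    target_utilization is an assumed safety margin: 0.7 means each worker
    should be busy at most about 70% of the time.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Total seconds of work arriving every minute.
    busy_seconds_per_minute = jobs_per_minute * avg_job_seconds
    # Each worker can supply 60 * target_utilization seconds per minute.
    return max(1, math.ceil(busy_seconds_per_minute / (60 * target_utilization)))
```

For example, 120 jobs a minute at about 2 seconds each is 240 seconds of work per minute, which works out to 6 workers at 70% utilization. If that number is bigger than what one box can comfortably run next to the web server, that's a hint toward dedicated queue servers.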
Last notes:
I usually use SQS so I can avoid managing yet another service (such as supervisord or RabbitMQ). However, a few things make SQS different:
- A standard SQS queue is neither LIFO nor FIFO: there's NO guaranteed order in which you'll receive jobs. If ordering matters, you'll need to write code to orchestrate it, or use a queue that does guarantee order (SQS also offers a separate FIFO queue type)
- SQS has a "Visibility Timeout" on each job. If a job is NOT completed (and deleted) within that time, it gets reset to "available" and another worker may pick it up. This means you need to manage it carefully:
  - You'll need an idea of how long a job might take (so a job doesn't accidentally become available again and run twice)
  - You might need to set the visibility timeout per job type, or have multiple SQS queues, each with a different default timeout. The AWS SDK lets you change visibility per job, which is something I've done before (I know Job A takes 10 minutes, but Job B only takes up to 10 seconds, so I set the visibility timeout accordingly within each worker as the job gets processed).
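Here's a minimal sketch of the per-job visibility idea in Python. The job type names, the timeout values, and the assumption that each message body is JSON with a `type` field are all made up for illustration. The `sqs` argument is expected to be a boto3 SQS client (`boto3.client("sqs")`); `change_message_visibility` and `delete_message` are the real client calls.

```python
import json

# Hypothetical job types mapped to visibility timeouts in seconds,
# each with some headroom over the expected run time.
VISIBILITY_BY_JOB_TYPE = {
    "job_a": 15 * 60,  # takes about 10 minutes
    "job_b": 30,       # takes up to 10 seconds
}
DEFAULT_VISIBILITY = 60  # assumed fallback for unknown job types


def visibility_for(job_type):
    """Return the visibility timeout (seconds) to use for a job type."""
    return VISIBILITY_BY_JOB_TYPE.get(job_type, DEFAULT_VISIBILITY)


def process_message(sqs, queue_url, message, handlers):
    """Set the visibility timeout for this job, run it, then delete it."""
    body = json.loads(message["Body"])  # assumes a JSON body with a "type" key
    job_type = body["type"]

    # Tell SQS how long to keep this message hidden from other workers.
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
        VisibilityTimeout=visibility_for(job_type),
    )

    handlers[job_type](body)

    # Delete only after the handler succeeds. If it raises, the message
    # becomes visible again once the timeout expires and can be retried.
    sqs.delete_message(
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
    )
```

In a real worker you'd call this for each entry in the `Messages` list returned by `sqs.receive_message(QueueUrl=...)`. Keep in mind SQS caps the visibility timeout at 12 hours, so anything longer than that needs a different approach.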