liveinroot

Job batches stuck in pending status

Hello, what could you suggest I check? The situation is this: I have a batch of jobs, on average up to 10 jobs per batch, and each job processes 100 models in the DB. When the batch runs, sometimes it completes without any issues, but quite often it executes only 1-2 jobs from the batch and the rest stay in pending status. SQS shows 0 messages in flight and 0 waiting for execution. I have a serverless setup with Vapor: the cache is DynamoDB, queues are on SQS, and the database is PostgreSQL on AWS RDS. On local and staging with the same setup there are no issues at all; it works perfectly.

What could be wrong?

0 likes
3 replies
LaryAI
Level 58

When dealing with job batches that get stuck in a pending status, there are several things you can check. Here's a step-by-step guide to troubleshooting the problem:

  1. Check Queue Connection Settings: Ensure that your production environment is correctly configured to use the SQS queue and that the credentials and region are set correctly.

  2. Inspect Vapor Logs: Since you're using Laravel Vapor, check the Vapor dashboard for any logs that might indicate errors or issues with the job processing.

  3. Job Timeouts and Retries: Verify that the jobs are not timing out or being retried excessively. You can adjust the timeout and tries settings in your job classes or in the queue configuration.

  4. DynamoDB Throughput: Check the read/write capacity of your DynamoDB tables, especially the cache and job storage tables, to ensure they are not being throttled.

  5. Database Connections: Make sure that your database connections are not being exhausted. RDS has a limit on the number of concurrent connections it can handle.

  6. Job Size: If your jobs are processing a large number of models, consider breaking them down into smaller chunks to prevent memory issues or timeouts (a chunking sketch follows this list).

  7. Error Handling: Ensure that your jobs have proper error handling to catch any exceptions that may occur and log them for review.

  8. Queue Worker Settings: If you're using queue workers, check their settings in config/queue.php and ensure they are not dying or being killed prematurely.

  9. Supervisor Configuration: If you're using Supervisor to manage your queue workers, review its configuration to ensure it's correctly restarting any failed workers.

  10. Job Dependencies: Make sure there are no unmet dependencies or issues within the jobs that could cause them to fail silently.

  11. Environment Consistency: Double-check that your local and staging environments are as close to production as possible to ensure consistency in behavior.

  12. AWS Service Limits: Review any relevant AWS service limits to ensure you're not hitting any caps that could affect job processing.

  13. Monitoring and Alerts: Set up monitoring and alerts for your queue to get real-time information on job processing and potential issues. A batch-callback sketch that logs progress appears after the code snippet below.
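For point 6, here's a minimal sketch of chunking the work at dispatch time; Report and ProcessModelJob are hypothetical stand-ins for your own model and job classes:

use App\Jobs\ProcessModelJob;
use App\Models\Report; // hypothetical model; substitute your own
use Illuminate\Support\Facades\Bus;

// Split 100-model jobs into 25-model jobs so each one stays well
// under the Lambda/queue timeout and memory limits.
$jobs = Report::query()
    ->pluck('id')
    ->chunk(25)
    ->map(fn ($ids) => new ProcessModelJob($ids->all()))
    ->all();

Bus::batch($jobs)->dispatch();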

Here's a sample code snippet to adjust the timeout and tries settings in a job class:

use Illuminate\Contracts\Queue\ShouldQueue;

class ProcessModelJob implements ShouldQueue
{
    public $timeout = 120; // The number of seconds the job can run before timing out.
    public $tries = 3;     // The number of times the job may be attempted.

    // ...
}
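And for point 13, a minimal sketch of attaching callbacks when the batch is dispatched, so a stalled or partially failed batch at least leaves a trace in the logs (assumes $jobs holds your array of batch jobs):

use Illuminate\Bus\Batch;
use Illuminate\Support\Facades\Bus;
use Illuminate\Support\Facades\Log;
use Throwable;

Bus::batch($jobs)
    ->then(fn (Batch $batch) => Log::info("Batch {$batch->id} completed"))
    ->catch(fn (Batch $batch, Throwable $e) => Log::error("Batch {$batch->id} first failure: {$e->getMessage()}"))
    ->finally(fn (Batch $batch) => Log::info("Batch {$batch->id} done: {$batch->failedJobs} failed of {$batch->totalJobs}"))
    ->dispatch();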

If after checking all these points the issue persists, you may need to add additional logging to your job classes to capture more detailed information about where they are getting stuck. This can help you pinpoint the exact cause of the issue.
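For example, a minimal start/finish logging sketch inside a batched job's handle() method; $this->ids is a hypothetical payload property, and $this->batchId is set by the Batchable trait:

use Illuminate\Support\Facades\Log; // at the top of the job class

public function handle(): void
{
    Log::debug('ProcessModelJob started', [
        'batch'  => $this->batchId,    // set by the Batchable trait
        'models' => count($this->ids), // hypothetical payload property
    ]);

    // ... business logic ...

    Log::debug('ProcessModelJob finished', ['batch' => $this->batchId]);
}

If a job logs "started" but never "finished", you know it is dying mid-process rather than never being picked up.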

lukeska

Hey @liveinroot, did you manage to find a solution in the end?

I think I'm facing the same issue as you did and cannot figure out how to solve it. Like you, I have a batch of fewer than 10 jobs that process some data from the DB, and from time to time a few of them just won't be executed. Initially I thought the messages somehow got lost on the way to the SQS queue, or that SQS was not able to reach Lambda. But then I removed all the business logic from the jobs and had them throw an exception each time they were executed. With this I could see all the jobs throwing the exception consistently. So now I'm thinking the business logic is somehow causing the Lambdas to die mid-process, but the weird thing is that they don't throw any exception or log anything to CloudWatch.

The only thing that seemed to mitigate the issue was increasing the memory allocated to the Lambdas, but I have no idea why, or whether it's actually a solution.

liveinroot

@lukeska Hi, I got it solved. This issue happened to me because I used the same queue names for prod and staging. When a batch of jobs was created for the default queue on production, it was registered in the prod DB. The issue started when the staging and prod environments' workers began pulling jobs from the same default queue simultaneously. Obviously, the staging jobs failed, since the models don't exist in the staging DB, but the prod workers never received all of the batch's jobs because some had already been stolen by the staging workers. I ended up adding prefixes to all my queues: prod-{queue_name}, staging-{queue_name}. This solved my mysterious issue :) I hope this helps you as well!
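For anyone who lands here later, a minimal sketch of what per-environment queue names can look like in config/queue.php; it assumes SQS_QUEUE and APP_ENV env vars, and on Vapor the queue names declared in vapor.yml must match as well:

// config/queue.php
'sqs' => [
    'driver' => 'sqs',
    'prefix' => env('SQS_PREFIX'), // e.g. https://sqs.eu-west-1.amazonaws.com/<account-id>
    'queue'  => env('APP_ENV', 'local') . '-' . env('SQS_QUEUE', 'default'),
    'region' => env('AWS_DEFAULT_REGION', 'eu-west-1'),
],

With distinct names per environment, staging workers can no longer consume production messages.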
