ahoi's avatar

Chaining Jobs with Horizon: Running chain in parallel causes one job never to finish

Hi everybody,

I got a strange issue:

I am going to do some work on a video using long running jobs.

Some tasks require multiple steps, so in this case, I am using a chain of jobs like this:

  1. UploadVideoToStorageJob::class
  2. NormalizeVideo::class
  3. OptimizeVideo::class
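For reference, a chain like the one listed above could be dispatched roughly as follows. This is only a sketch: it assumes Laravel 8 or newer (for `Bus::chain`), that the three job classes live in `App\Jobs`, and that each takes a video ID in its constructor. The `$videoId` value and the constructor signature are hypothetical and not taken from the original post.

```php
<?php

use App\Jobs\NormalizeVideo;
use App\Jobs\OptimizeVideo;
use App\Jobs\UploadVideoToStorageJob;
use Illuminate\Support\Facades\Bus;
use Illuminate\Support\Facades\Log;

// Hypothetical identifier of the video to process.
$videoId = 42;

// Each job only starts after the previous one has finished successfully.
Bus::chain([
    new UploadVideoToStorageJob($videoId),
    new NormalizeVideo($videoId),
    new OptimizeVideo($videoId),
])->catch(function (\Throwable $e) use ($videoId) {
    // Called when a job in the chain fails; the remaining jobs are not run.
    Log::error('Video chain failed', [
        'video' => $videoId,
        'error' => $e->getMessage(),
    ]);
})->dispatch();
```

Dispatching this once per video creates independent chains, so two videos can only be processed concurrently if at least two workers are available.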

This works fine if I run the chain for a single video.

When I run it for two videos in parallel, Laravel Horizon shows something like:

UploadVideoToStorageJob RUNNING
UploadVideoToStorageJob RUNNING
UploadVideoToStorageJob DONE
UploadVideoToStorageJob DONE
NormalizeVideo RUNNING
NormalizeVideo RUNNING
NormalizeVideo DONE
NormalizeVideo DONE
OptimizeVideo RUNNING
OptimizeVideo RUNNING
OptimizeVideo FAILED
OptimizeVideo FAILED
OptimizeVideo RUNNING
OptimizeVideo RUNNING
OptimizeVideo DONE

Well. That's too bad. There's only one OptimizeVideo DONE in the list.

In the Horizon dashboard, I can see that the other job is pending and that no work is actually happening. How do I know? The job spawns a Docker container, and for the whole time the job is pending, no container is running at all.

After waiting for 20 minutes, the job will succeed.

I was wondering about that:

In the queue.php config, there is a retry_after value set, which is just a little higher than 20 minutes:

'retry_after'  => 1300,
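For context, this is roughly where that value sits in `config/queue.php`. Only the `retry_after` value comes from my setup as quoted above; the other keys are the framework defaults and are shown as an assumption. As far as I understand the docs, `retry_after` is the number of seconds after which a reserved job that has not been released or deleted becomes available again, and the worker timeout should be a few seconds shorter than it.

```php
// config/queue.php (excerpt, keys other than retry_after are assumed defaults)
'connections' => [
    'redis' => [
        'driver'      => 'redis',
        'connection'  => 'default',
        'queue'       => env('REDIS_QUEUE', 'default'),
        // A reserved job is released back onto the queue after this many seconds.
        'retry_after' => 1300,
        'block_for'   => null,
    ],
],
```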

This should mean that the job was picked up again once retry_after had elapsed. But I cannot explain why it got stuck in the first place while doing nothing. After failing, the job just sits there waiting.

Edit

What I found out is that the Docker job itself doesn't really have anything to do with this.

If two jobs are being retried due to an error, they are not being run concurrently after the first failure.

kiwi0134's avatar

Please check whether your job is running into its timeout. Jobs time out after 60 seconds by default. If your job takes longer, it'll exit.

At this stage, I highly doubt that there's something wrong with Laravel's job system. What does your failing job class look like? Might there be an issue with spawning the Docker containers? You could also add some debug logging to your job, recording what it is doing at each step, to find out where it fails.
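Something along these lines is what I mean by raising the timeout and adding logs. It's just a sketch, not your actual class: the `$videoId` property, the log messages, and the 1200-second timeout are assumptions on my part. The `$timeout` property and the `failed()` method are standard Laravel job features, and a job-level `$timeout` takes precedence over the worker's timeout setting.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\Log;
use Throwable;

class OptimizeVideo implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    // Seconds this job may run before the worker kills it.
    // Assumed value; it should stay below retry_after.
    public $timeout = 1200;

    // Hypothetical identifier of the video being processed.
    public $videoId;

    public function __construct(int $videoId)
    {
        $this->videoId = $videoId;
    }

    public function handle(): void
    {
        Log::info('OptimizeVideo started', [
            'video'   => $this->videoId,
            'attempt' => $this->attempts(),
        ]);

        // ... spawn the Docker container and wait for it here ...

        Log::info('OptimizeVideo finished', ['video' => $this->videoId]);
    }

    // Called by the framework once the job has failed for good.
    public function failed(Throwable $e): void
    {
        Log::error('OptimizeVideo failed', [
            'video' => $this->videoId,
            'error' => $e->getMessage(),
        ]);
    }
}
```

With the attempt number in the log, you can tell whether the second run is a genuine retry and how long the job sat idle in between.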

// edit: Didn't see your edit. Whoops. Concurrency of jobs depends on how many queue workers are running. If you run multiple queue workers and they fail without restarting automatically, it might be possible that only one worker is running and therefore no other worker is there to do that other job at the same time.

ahoi's avatar

I found out that this happens when the balance method is set to "auto".

Setting it to "simple" solves the problem (for now):

        'supervisor-1' => [
            'connection'   => 'redis',
            'queue'        => ['default'],
            'balance'      => 'simple',
            'maxProcesses' => 20,
            'maxTime'      => 0,
            'maxJobs'      => 0,
            'memory'       => 128,
            'tries'        => 1,
            'timeout'      => 60,
            'nice'         => 0,
        ],
    ],
