Aug 1, 2015

Level 50

Thread safe long running jobs from queue

What is the best way to handle long-running jobs?

We are using supervisor to run "queue:work" with "--daemon". When a thread gets a job from the queue, the job is reserved for a ttr of 60 seconds, so if the process takes longer than 60 seconds, other threads can be given the same job.

I know that a job that takes longer than 60 seconds should be reviewed to see if it really needs to take that long, but a good use case is generating a very complicated PDF report that the user requests, so the request gets queued up. The job gets picked up by one of the worker threads, which starts crunching all of the data. If the process takes longer than the ttr, then a second process (assuming that you are running more than one queue worker thread) will be given that same job and will start the same process. This could essentially monopolize all of the threads and never get "done".

I know that I can try to break the job into smaller chunks to do a little work & then requeue the next piece or increase the ttr, but both of those feel like bandaids.

Is there a way for the job to tell the queue that it is still working away, and so renew the reservation?

LucasFecko

10 years ago

Level 2

Why don't you increase the timeout?

php artisan queue:listen --timeout=300

ohffs

10 years ago

Level 50

Off the top of my head and without giving it much thought (yay) - could you have something like a Redis key that each worker sets when it grabs a job and then periodically 'pings' to update the timestamp? Then any other worker that tried to grab the job could see whether it was already being processed and only 'steal' it if the original worker hadn't pinged for some reasonable time.

I'm not sure if that's just moving the problem along one level - but that's what popped into my head ;-)
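The heartbeat idea above could be sketched roughly like this. It is only an illustration: a plain Python dict stands in for the Redis key, and the `HeartbeatLeases` class, its method names, and the 30-second staleness window are all hypothetical, not part of Laravel or any Redis client.

```python
class HeartbeatLeases:
    """In-memory stand-in for a Redis key per job: job_id -> (worker, last_ping)."""

    def __init__(self, stale_after, clock):
        self.stale_after = stale_after  # seconds without a ping before a job may be stolen
        self.clock = clock              # callable returning the current time in seconds
        self.leases = {}

    def try_claim(self, job_id, worker):
        """Claim the job if nobody holds it or its holder stopped pinging."""
        now = self.clock()
        lease = self.leases.get(job_id)
        if lease is not None and now - lease[1] < self.stale_after:
            # A live worker holds the job: only that worker may carry on.
            return lease[0] == worker
        self.leases[job_id] = (worker, now)
        return True

    def ping(self, job_id, worker):
        """Refresh the timestamp; returns False if the lease was lost."""
        lease = self.leases.get(job_id)
        if lease is None or lease[0] != worker:
            return False
        self.leases[job_id] = (worker, self.clock())
        return True

    def release(self, job_id, worker):
        """Drop the lease once the job is finished."""
        lease = self.leases.get(job_id)
        if lease is not None and lease[0] == worker:
            del self.leases[job_id]


# A fake clock keeps the walk-through deterministic.
now = [0]
leases = HeartbeatLeases(stale_after=30, clock=lambda: now[0])

print(leases.try_claim("report-1", "worker-a"))  # True: first claim
now[0] = 20
leases.ping("report-1", "worker-a")              # worker-a is still alive
now[0] = 45
print(leases.try_claim("report-1", "worker-b"))  # False: last ping was 25s ago
now[0] = 60
print(leases.try_claim("report-1", "worker-b"))  # True: 40s of silence, job is stolen
```

As the reply to this post points out, the staleness window is still a ttr-like value, so this moves the timeout from the queue into the application rather than removing it. Against a real Redis server the claim step would also have to be atomic, which this sketch does not attempt.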

jimmy.puckett

10 years ago

Level 50

@LucasFecko Thanks for the suggestion, but I should have been clearer about the "ttr": it applies when you are using supervisor to run "queue:work" with "--daemon", so there is no "timeout". I know that we can adjust the "ttr" so that the reservation lasts longer, but that does not solve the problem; it just increases the time before the problem occurs.

I will update my question to make it more clear that we are using queue:work.

Thanks!

jimmy.puckett

10 years ago

Level 50

@ohffs I guess that we could look into that, but the concept behind the reservation is that in case the job dies, the queue should give it back out to the next worker. You would still need some ttr-type value in the Redis key and would then have to clean it up, so I was hoping that there was some way to tell the queue at various stages in the method that the process is still alive.

Thanks for your time.

LucasFecko

10 years ago

Level 2

@jimmy.puckett what exactly are you referring to when speaking about "ttr"? I just assume that when you have a long-running process, your only option is to increase some sort of timeout if you don't want to break it into smaller parts. Can your PDF generators be stacked in the queue, or do you need to generate them ASAP?

I opted for queue:listen, which is also run under supervisor. I have three queues (high, medium, low), and each one is used for a different type of job.

jimmy.puckett

10 years ago

Level 50

@LucasFecko It is "Time To Run: seconds a job can be reserved for". You can see where Laravel uses it here...

https://github.com/laravel/framework/blob/master/src/Illuminate/Queue/Connectors/BeanstalkdConnector.php#L23

so you can set a ttr config value, or it defaults to "Pheanstalk::DEFAULT_TTR", which you can see here is 60 seconds...

https://github.com/pda/pheanstalk/blob/e677fe978ab568801de42a599f19b398b6d7a31b/src/PheanstalkInterface.php#L10

so when jobs get put in a tube, they are put in there with the ttr value...

https://github.com/pda/pheanstalk/blob/e677fe978ab568801de42a599f19b398b6d7a31b/src/PheanstalkInterface.php#L183

As you can see from the protocol docs, this is how long the job is allowed to be "reserved", which translates to how long a worker is allowed to get the job done before beanstalk assumes that the worker died off and gives the job to the next worker. This in turn is the issue for long-running processes that take longer than the ttr.

I am pretty sure that the other queues have something similar, but SQS is the only other one that we have had the need to use, and that project does not have this issue as the jobs are not that long running, at least currently ;-).

Maybe the only solution is to increase the ttr to a value above the expected run time, but that just feels hacky.

I was looking for some way to "renew" the reservation at specific places in the method.
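For reference, the beanstalkd protocol does define a `touch` command that lets the worker holding a job ask for the ttr countdown to start over. Whether and how the Laravel job wrapper exposes it is not established in this thread, so the following is only a toy in-memory model of the reserve/touch behaviour; the `Tube` class and the timings are made up for illustration.

```python
class Tube:
    """Toy model of a beanstalkd-style tube: reserve, touch and delete with a ttr."""

    def __init__(self, ttr, clock):
        self.ttr = ttr          # seconds a reservation lasts
        self.clock = clock      # callable returning the current time in seconds
        self.ready = []         # job ids waiting for a worker
        self.reserved = {}      # job id -> (worker, deadline)

    def put(self, job_id):
        self.ready.append(job_id)

    def _expire(self):
        """Hand jobs whose reservation ran out back to the ready list."""
        now = self.clock()
        for job_id, (_, deadline) in list(self.reserved.items()):
            if now >= deadline:
                del self.reserved[job_id]
                self.ready.append(job_id)

    def reserve(self, worker):
        """Give the next ready job to this worker, or None if there is none."""
        self._expire()
        if not self.ready:
            return None
        job_id = self.ready.pop(0)
        self.reserved[job_id] = (worker, self.clock() + self.ttr)
        return job_id

    def touch(self, job_id, worker):
        """Restart the ttr countdown for a job this worker still holds."""
        self._expire()
        holder = self.reserved.get(job_id)
        if holder is None or holder[0] != worker:
            return False
        self.reserved[job_id] = (worker, self.clock() + self.ttr)
        return True

    def delete(self, job_id, worker):
        """Remove a finished job; fails if the reservation was lost."""
        self._expire()
        holder = self.reserved.get(job_id)
        if holder is None or holder[0] != worker:
            return False
        del self.reserved[job_id]
        return True


def second_worker_gets_job(renew):
    """worker-a reserves at t=0; does worker-b receive the same job at t=61?"""
    now = [0]
    tube = Tube(ttr=60, clock=lambda: now[0])
    tube.put("pdf-report")
    tube.reserve("worker-a")
    if renew:
        now[0] = 50
        tube.touch("pdf-report", "worker-a")  # new deadline: t=110
    now[0] = 61
    return tube.reserve("worker-b") == "pdf-report"


print(second_worker_gets_job(renew=False))  # True: the job is handed out twice
print(second_worker_gets_job(renew=True))   # False: the renewed reservation holds
```

In this model a long job would call `touch` at checkpoints spaced well inside the ttr, so a worker that dies simply stops touching and the job is handed out again after at most one ttr.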


jimmy.puckett

10 years ago

Level 50

Also, @LucasFecko, when using queue:listen under supervisor, how are you restarting your queue? We have a process that is similar to the way that Forge deploys code, where we have a symlink to the currently running version, so when the symlink gets updated, the handle to the files gets lost and the queued jobs start failing.

By moving to queue:work with the daemon flag, you can run queue:restart, which gracefully terminates the worker threads after they are done. Then supervisor restarts the workers with the correct handle to the files and with the new version of the Job.

LucasFecko

10 years ago

Level 2

@jimmy.puckett thank you for those references about Beanstalkd queues, I learned something new, I have never used Beanstalk for queue management.

I am using Redis for queues, and I never had to specifically "restart" the queue process. When something goes very wrong (happened once or twice :)) I log in to the dedicated Redis server and empty the queues (redis-cli del queues:medium). When deploying to servers, the queue workers aren't restarted.

I wouldn't choose customizing the ttr if I were you. We have that 300 seconds timeout just in case; the main reason for having a bigger timeout is that once a day some very heavy financial data is computed for admin dashboard stats. That data is not needed for real-time usage; it is just there so our management can keep track of how the project stands overall. So thousands of transactions are being crunched to offer some graphs/data. This happens at 2AM.

If I were in your situation I would try to break the process into smaller queue tasks. I assume that PDF creation is just the last step for that command, so I would divide it into smaller steps, each with its own tube, something like:

user requests PDF report, put job to tube "first"
job is picked up in tube "first", gather data, make some small computations, put results in tube "second"
job is picked up in "second", make more computations from results, put in "third"
job is picked in "third", make PDF
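The staged hand-off listed above could look roughly like this. It is a sketch only: in-memory deques stand in for the tubes, and the three handler functions and their payloads are invented placeholders for the real data gathering, computation, and PDF rendering.

```python
from collections import deque


def gather_data(request):
    # Stage "first": collect the raw rows for the report (placeholder data).
    return {"user": request["user"], "rows": [1, 2, 3, 4, 5]}


def compute_totals(data):
    # Stage "second": reduce the rows to the figures the report needs.
    return {"user": data["user"], "total": sum(data["rows"])}


def make_pdf(summary):
    # Stage "third": render the final document (a string stands in for the PDF).
    return f"report for {summary['user']}: total={summary['total']}"


# (tube to read from, handler, tube to write to; None means finished)
STAGES = [
    ("first", gather_data, "second"),
    ("second", compute_totals, "third"),
    ("third", make_pdf, None),
]


def drain(tubes, finished):
    """Run one short job at a time until every tube is empty."""
    progressed = True
    while progressed:
        progressed = False
        for name, handler, next_tube in STAGES:
            if tubes[name]:
                result = handler(tubes[name].popleft())
                target = tubes[next_tube] if next_tube else finished
                target.append(result)
                progressed = True


tubes = {name: deque() for name in ("first", "second", "third")}
finished = []
tubes["first"].append({"user": "alice"})  # user requests a PDF report
drain(tubes, finished)
print(finished)  # ['report for alice: total=15']
```

Each stage only needs to finish inside the ttr on its own, which is the point of the split; the cost is that intermediate results have to be small enough, or stored elsewhere, to travel as job payloads.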

I have never encountered a problem similar to yours and have little knowledge of Beanstalkd queues, so pardon me if that sounds silly.

jimmy.puckett

10 years ago

Best Answer

Level 50

@LucasFecko I appreciate your suggestion.

In case you are curious, Redis appears to do the same thing, but calls it "expire" instead of "ttr"...

https://github.com/laravel/framework/blob/master/src/Illuminate/Queue/RedisQueue.php#L139

We are not restarting the queue because things are going wrong. We do continuous deployment using Jenkins to push the code to the servers once all of the tests pass. After the code is deployed to a date-time folder, we update the symbolic link to get zero-downtime deployments. This is very similar to the way that Taylor does it with Envoyer...

https://laracasts.com/series/envoyer/episodes/2

so by using queue:listen, redis is getting a pointer to some files, which is broken after the symbolic link is updated to the new location. This will cause your jobs to stop working. It also will not allow redis to pick up any changes that may have been made, as the artisan script is now in memory. There is no safe way to restart the queue when it was run as queue:listen, but by using queue:work with daemon, you can shut it down safely.

Here is another thread about this issue...

https://laracasts.com/discuss/channels/general-discussion/supervisor-and-artisan-queues

Anyhow, I am hoping that someone has some way to renew the reservation on a long running job?
