Our queued listeners (triggered by events) and a few regularly dispatched jobs randomly start failing. When this happens, we end up with a lot of failed queued listeners. The (unhandled) exception that we are getting:
Unhandled exception within Laravel Queue job: 'Illuminate\Database\Eloquent\ModelNotFoundException'
Notice the message: "Unhandled exception within Laravel Queue job". I haven't found any posts online that mention this message. When people discuss the ModelNotFoundException, they refer to the message "No query results for model [App\SomeModel]". Also notice that the exception is 'unhandled', which could explain why the failed job doesn't reach the failed_jobs table (we keep failed_jobs in MariaDB).
In worker.log we see that the jobs failed, but they don't appear in the failed_jobs table. That is odd, since we didn't set the deleteWhenMissingModels flag. As I read the documentation, the default behaviour is that jobs with missing models end up in the failed_jobs table. We are still researching, but it looks like they either never reach the failed_jobs table or are automatically purged/deleted. As mentioned, the unhandled exception could be a cause of that. Under what circumstances would Laravel be unable to catch/handle an exception thrown by or within a job?
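To narrow down where these failures end up, one option would be a failure hook along the lines of the sketch below. It assumes registration in a service provider's boot method via Queue::failing; the log message and context keys are made up for illustration, and we have not verified that the hook fires for the "unhandled" case described above.

```php
<?php

namespace App\Providers;

use Illuminate\Queue\Events\JobFailed;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Queue;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function boot(): void
    {
        // Called when the worker marks a job (or queued listener) as failed.
        Queue::failing(function (JobFailed $event) {
            // Illustrative log entry; the message and keys are assumptions.
            Log::warning('Queue job failed', [
                'connection' => $event->connectionName,
                'job'        => $event->job->resolveName(),
                'exception'  => get_class($event->exception),
                'message'    => $event->exception->getMessage(),
            ]);
        });
    }
}
```

If entries show up in this log but not in failed_jobs, that would suggest the job is being marked as failed and the problem lies in writing the record rather than in detecting the failure.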
But the main question is: why are queued events (listeners) failing? Normally they work fine, but when this issue occurs, around 1% of the listeners fail. It's not one particular listener or one particular event. We have set the after_commit flag (https://laravel.com/docs/10.x/queues#jobs-and-database-transactions), which should guarantee that a listener only runs once there are no open database transactions.
Our config looks like this:
'redis' => [
    'driver' => 'redis',
    'connection' => 'queue',
    'queue' => 'default',
    'retry_after' => 9000,
    'after_commit' => true
],
We do have multiple queues. I'm not sure whether that has to be specified somewhere, but it would still be strange for the jobs to land in Redis without honoring the after_commit flag. We don't see problems with normal jobs that are dispatched directly from the code. Only the event-based jobs (the listeners) have this problem.
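For reference, the events documentation also describes a per-listener way to request dispatch after commit, independent of the connection-level flag. A minimal sketch, assuming the $afterCommit property on a queued listener; the event and listener class names are hypothetical and this is not our actual code:

```php
<?php

namespace App\Listeners;

use App\Events\OrderShipped; // hypothetical event class
use Illuminate\Contracts\Queue\ShouldQueue;

class SendShipmentNotification implements ShouldQueue
{
    // Ask Laravel to push this listener onto the queue only after
    // all open database transactions have been committed.
    public $afterCommit = true;

    public function handle(OrderShipped $event): void
    {
        // Listener logic goes here; models attached to the event are
        // re-fetched from the database when the queued listener runs.
    }
}
```

Setting this explicitly on one of the affected listeners might help tell apart "the connection-level flag is not applied to listeners" from "the model is missing for some other reason".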
Some additional context: a couple of weeks ago we faced problems where job queries (INSERT INTO jobs, etc.) came to a halt, apparently waiting for a transaction to be committed. At that time we could find the source of the problem, and decided to move from a MariaDB queue to a Redis queue. Pretty quickly after that we started to see the problems mentioned in this ticket.
In the last 2 days we haven't faced any of these issues, whereas 3 days ago we faced hundreds of these incidents.
Would appreciate any help on this topic.