Queue jobs time out; fixed by clearing the cache and restarting Horizon

My queue jobs all run fairly seamlessly on our production server, but about every 2-3 months I start getting a lot of "timeout exceeded" / "too many attempts" exceptions.

Our app uses event sourcing and many events are queued, so needless to say we have a lot of jobs passing through the system (generally 100-200k per day).

I have not found the root cause yet, but a simple re-deploy through Envoyer fixes the issue. This is most likely due to the cache clear command being run as part of the deploy.

Currently the cache is handled by Redis, which runs on the same server as the app. I was considering moving the cache to its own server/instance, but that still would not help me find the root cause.

Does anyone have any ideas what might be going on here and how I can diagnose/fix it? I am guessing the cache is getting overloaded, running out of space, or leaking over time, but I am not really sure where to go from here.
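
One way to test the "running out of space" guess might be to capture the output of the Redis `INFO` command the next time the timeouts start, and compare memory use against the configured limit. Below is a minimal Python sketch of that check. The function names and the 90% threshold are hypothetical and chosen for illustration only; the field names (`used_memory`, `maxmemory`, `maxmemory_policy`, `evicted_keys`) are the ones Redis reports in `INFO`. It only flags memory pressure and does not prove that this is the cause of the job timeouts.

```python
def parse_redis_info(raw: str) -> dict:
    """Parse the plain-text reply of the Redis INFO command into a dict."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def memory_warnings(info: dict, usage_threshold: float = 0.9) -> list:
    """Return human-readable warnings about memory pressure.

    usage_threshold is an arbitrary illustrative cut-off, not a Redis setting.
    """
    warnings = []
    used = int(info.get("used_memory", 0))
    limit = int(info.get("maxmemory", 0))
    policy = info.get("maxmemory_policy", "unknown")
    evicted = int(info.get("evicted_keys", 0))

    if limit == 0:
        # maxmemory 0 means no limit is configured.
        warnings.append("maxmemory is 0: Redis can grow until the host runs out of RAM")
    elif used / limit >= usage_threshold:
        warnings.append(f"used_memory is at {used / limit:.0%} of maxmemory")

    if evicted > 0:
        warnings.append(f"{evicted} keys have been evicted since the last restart")

    if limit > 0 and policy == "noeviction":
        warnings.append("policy is noeviction: writes will fail once maxmemory is reached")

    return warnings
```

The text passed to `parse_redis_info` would be whatever `redis-cli INFO` prints on the server. Since Horizon also keeps its queues in Redis, running this both before and after a deploy would show whether the cache clear is what actually frees the memory.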
