Redis host name resolution failure
I have an app that uses Horizon to process about 30k jobs per hour on average. Horizon runs on multiple job-processing servers that share an ElastiCache Redis instance. Over the last two days, for an unknown reason and oddly around the same time of day (0:00-2:00 UTC, which does not correlate with the maintenance window), the CPU has spiked to 60-100% on some of the servers. After a lot of digging, I found that the cause is intermittent Redis host name resolution failures when processing jobs via Horizon. An example error looks like this:
[2021-12-03 01:50:37][a88fcc17-70dd-4931-a119-87335e42f543] Processing: {JobName}
In UdpSocket.php line 65: socket_sendto(): Host lookup failed [-10002]: Host name lookup failure
In UdpSocket.php line 65: socket_sendto(): Host lookup failed [-10002]: Host name lookup failure
In PhpRedisConnector.php line 141: php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution
In PhpRedisConnector.php line 141: Redis::connect(): php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution
I'm no Redis expert, but resource usage on all Redis nodes (1 primary, 1 replica, cache.t3.medium) looks like it's within appropriate ranges. CPU, Engine CPU, and DB Memory Usage percentages are all within 1-3%, and Freeable Memory is near 3GB. Because this error is intermittent, I don't believe that it's a configuration issue; it likely has more to do with resource utilization/thresholds or something along those lines.
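Since the UdpSocket.php lines also report a host lookup failure, the problem may be name resolution on the worker itself rather than anything on the Redis nodes. One way to check that would be a small probe that resolves the Redis host name in a loop and records each failure with a timestamp, so the failures can be lined up against the 0:00-2:00 UTC window. This is only a minimal sketch: the function name `probe_dns`, the `ProbeResult` type, and the default port 6379 are my own assumptions, not part of Horizon or phpredis.

```python
import socket
import time
from typing import Callable, List, NamedTuple


class ProbeResult(NamedTuple):
    timestamp: str   # UTC time the lookup started
    ok: bool         # True if the name resolved
    detail: str      # resolved address, or the resolver error text
    seconds: float   # how long the lookup took


def probe_dns(
    host: str,
    attempts: int = 5,
    delay: float = 1.0,
    port: int = 6379,  # assumed default Redis port
    resolver: Callable = socket.getaddrinfo,
) -> List[ProbeResult]:
    """Resolve `host` repeatedly and record the outcome of every attempt."""
    results = []
    for _ in range(attempts):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime())
        started = time.monotonic()
        try:
            infos = resolver(host, port)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the resolved IP address.
            ok, detail = True, infos[0][4][0]
        except socket.gaierror as exc:
            # Same class of failure as "Temporary failure in name resolution".
            ok, detail = False, str(exc)
        results.append(ProbeResult(stamp, ok, detail, time.monotonic() - started))
        if delay > 0:
            time.sleep(delay)
    return results
```

Running something like `probe_dns("<your ElastiCache endpoint>", attempts=3600, delay=1.0)` on an affected worker during the spike window and printing the entries where `ok` is false should show whether lookups fail on their own, independent of Horizon. If they do, the next place to look is probably the worker's resolver setup and DNS query volume rather than the Redis cluster.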
Any thoughts or ideas? I'm totally stumped. Thanks in advance for any help or pointers in the right direction!