
PDO hangs on executing query on heavy queue system, timeout does not work

nhoxtwi:

  • Laravel Version: 7.6.2
  • PHP Version: 7.4.5-fpm-alpine
  • Database Driver & Version: MySQL 8.0
  • Queue driver: Redis:alpine3.11

I'm running a heavy queue system that processes about 300k jobs and inserts/updates about 20 million rows per day. The workers run under docker-compose and are scaled with the docker-compose scale command. Sometimes a job gets stuck and is not released even after the timeout has been exceeded. I'm running 6 different queues: the first is a crawler and the others are updaters (they write crawled data into the database). When I run multiple crawler processes, the updaters can get stuck randomly, at any time and on any job.

This is an example updater:

        $start = microtime(true);

        // Swap in the temp table that the crawler has been filling.
        $this->tmpTable = $this->getTmpTableName();
        $this->swapTmpTable();

        $totalRows = DB::table("$this->tmpTable")->count();
        $this->logInfo("Total rows: $totalRows");

        // Merge the temp table into the live tables, then clear it.
        $this->updateProducts();
        $this->updatePriceHistoryTmp();
        $this->updateSoldHistoryTmp();
        $this->deleteUpdatedRows();
        $this->insertProducts();
        $this->insertPriceHistoryTmp();
        $this->insertSoldHistoryTmp();
        $this->truncateTmpTable();

        $end = microtime(true);
        $processTime = $end - $start;
        $this->logInfo("Merge tmp table time: $processTime");

All of the functions in this job just send raw queries to the MySQL server, for example:

    protected function deleteUpdatedRows()
    {
        // Remove temp rows whose ids already exist in products; by this
        // point the update steps have applied them, so only new rows
        // remain for the insert steps.
        $sql = "DELETE $this->tmpTable
            FROM $this->tmpTable
            INNER JOIN products
            ON $this->tmpTable.id = products.id
        ";

        return DB::statement($sql);
    }
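
If these DML statements are blocking each other (crawler inserts versus updater merges), one server-side bound worth knowing about is InnoDB's per-session lock wait timeout, which defaults to 50 seconds. A minimal sketch, not from the original post, that could run at the start of the job:

    // Hedged sketch: cap how long a DML statement in this session may
    // wait on InnoDB row locks before failing (the server default is 50 s).
    // This only bounds lock waits; it cannot unstick a statement that is
    // blocked on a dead TCP connection.
    DB::statement('SET SESSION innodb_lock_wait_timeout = 60');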

The workers are started up by docker-compose:

worker_merge_products_tmp:
        image: sanhangre:worker_dev
        volumes: 
            - ./webroot:/home/projects
        working_dir: /home/projects
        restart: always
        networks: 
            - default
        command: "/usr/local/bin/php /home/projects/artisan queue:work --queue=mergeProductsTmpTable --tries=3 --timeout=60"
        depends_on:
            - redis_db

As you can see, the timeout is set to 60s, but take a look at this picture:

[screenshot: queue monitor showing the stuck job's runtime]

The stuck job took 7949 s to complete, far beyond the timeout, which is so weird. This job normally takes only about 10–20 s to complete; I have no idea what is causing this.
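
One setting that interacts with the --timeout flag above and is easy to miss: for the Redis queue driver, retry_after in config/queue.php controls when an in-flight job is assumed dead and re-dispatched, and the Laravel docs say it should always be greater than --timeout. A hedged sketch of the relevant connection entry (the values are illustrative, not taken from this post):

    'redis' => [
        'driver' => 'redis',
        'connection' => 'default',
        'queue' => env('REDIS_QUEUE', 'default'),
        // Seconds before an in-flight job is considered dead and retried.
        // Keep this greater than the worker's --timeout so the same job
        // is never handed to two workers at once.
        'retry_after' => 90,
        'block_for' => null,
    ],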

Currently I'm running 6 workers for the crawler and one worker per updater. If I scale the crawler to more than 6 workers, the updaters start getting stuck, RANDOMLY, at ANY JOB. A stuck job hangs for about 2 hours, no more, no less. While one job is stuck, the rest keep running normally without any error.

I also note that a job only ever gets stuck during a DB action (insert/update/delete), and always for those 2 hours.

Why does Laravel not detect that something went wrong with the MySQL connection and try to reconnect? And why does the timeout parameter not work here? The job should be released after the timeout is exceeded (60 seconds) instead of hanging for 2 hours. And why is it always 2 hours?
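
For context on both questions: Laravel enforces --timeout through the pcntl extension, scheduling a SIGALRM that kills the worker when a job runs too long. Official PHP Alpine images do not ship pcntl, so it is worth verifying the extension is actually installed; without it the flag is silently ignored. A simplified sketch of the mechanism, not Laravel's actual source:

    // Simplified sketch of how the queue worker enforces --timeout
    // (see Illuminate\Queue\Worker for the real implementation).
    pcntl_async_signals(true);  // deliver signals between PHP instructions

    pcntl_signal(SIGALRM, function () {
        // Laravel kills the worker here and relies on a supervisor
        // (e.g. docker-compose's restart: always) to start a new one.
        exit(1);
    });

    pcntl_alarm(60);   // fire SIGALRM once the timeout elapses

    $job->fire();      // hypothetical stand-in for running the job

Because the userland handler only runs between PHP instructions, a worker parked inside one long blocking native call (such as a PDO read on a connection that silently died) may not process the alarm until that call returns. As a hedged aside, two hours matches the Linux default TCP keepalive interval (net.ipv4.tcp_keepalive_time = 7200), a common signature of a connection dropped silently by a firewall or NAT between the worker and MySQL.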

Please help me solve this issue. I have been stuck on it for 2 months, searching through tons of posts on Google without finding a solution. Thank you very much.

fernandods:

Hi @nhoxtwi, I know this was a long time ago, but did you manage to solve this problem? I have the exact same issue with a job that hangs on a query for 15 minutes and does not trigger the job timeout.

jocoonopa@gmail.com:

I am also encountering the same issue, and it also hangs for 15 minutes (Laravel 9.52.15, Horizon 5.7).

Tray2:

@jocoonopa@gmail.com It is most likely deadlocks caused by the indexes being updated too frequently. The queued jobs table has two indexes, and if you run several thousand jobs each minute, updating the indexes slows everything down, since an index is locked while it's updated.

jocoonopa@gmail.com:

@Tray2 Hi Tray2,

Thank you for your response.

I apologize for not explaining clearly earlier. Our queue driver is not MySQL but Redis, so we suspect it is not an issue with table deadlocks.

We later discovered that when jobs establish a database connection, there is occasionally an issue with SQLSTATE[HY000]: General error: 2006 MySQL server has gone away. This causes our jobs to hang while attempting to connect, often for around 15 minutes.

The reason we didn't identify this issue sooner is Laravel's connection-handling mechanism: it swallows the first connection error and only reports an error when the second connection attempt also fails.
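
For reference, the mechanism described above lives in Illuminate\Database\Connection, which retries a failed query exactly once when the error matches a known lost-connection message. A simplified sketch, paraphrased rather than verbatim framework source:

    // If the exception looks like a lost connection, reconnect and run
    // the query once more; only the second failure propagates to the
    // caller, which can hide how long the first attempt hung.
    protected function tryAgainIfCausedByLostConnection(QueryException $e, $query, $bindings, Closure $callback)
    {
        if ($this->causedByLostConnection($e->getPrevious())) {
            $this->reconnect();

            return $this->runQueryCallback($query, $bindings, $callback);
        }

        throw $e;
    }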

We have already asked our database administrators to assist in investigating the problem. Once again, thank you.

