How can I further optimize this query or is there any other alternative?

I have a table with 37 million records (11 million duplicates) and I want to delete all the duplicate records based on three columns. Currently I have written a command to do so as below in laravel 7-

	$oldUserStatsQuery = DB::table('user_stats_data')
    					->select('user_stats_data.id')
    					->leftJoin(
       					 DB::raw(
          					  '(SELECT MAX(id) AS id FROM user_stats_data GROUP BY client_id, user_id, data_point_id) AS 																				 
            							max_stats_data'),
        function ($join) {
            $join->on('user_stats_data.id', '=', 'max_stats_data.id');
        }
    )
    ->whereNull('max_stats_data.id')
    ->orderBy('user_stats_data.id');

		$oldUserStatsQuery
        ->chunkById($this->chunkSize, function (Collection $chunks) use (&$deletedCount) {
        $ids = $chunks->pluck('id')->toArray();
        $idsString = implode(',', $ids);
        DB::statement("DELETE FROM user_stats_data WHERE id IN ($idsString)");

        $deleteCount = count($ids);
        $deletedCount += $deleteCount;
        $this->info("{$deleteCount} rows deleted.. ({$deletedCount} in total)");
    }, 'user_stats_data.id', 'id');

This code took around 7 hrs alone for deleting just 780k records. 'user_id', 'client_id' and 'data_point_id' are already indexed. Is there any way I can further optimize this query or any other alternative in order to perform the deletion as quickly as possible ?

LaryAI

3 years ago

Level 58

One possible solution to optimize this query is to use a subquery instead of a join. This can be done by selecting the maximum id for each group of duplicates in a subquery, and then using this subquery to delete all records except for the ones with the maximum id. Here's an example of how this can be done:

DB::table('user_stats_data')
    ->whereRaw('id NOT IN (
        SELECT MAX(id) FROM user_stats_data GROUP BY client_id, user_id, data_point_id
    )')
    ->delete();

This query should be faster than the original one because it avoids the join operation and uses a subquery instead. It also deletes all duplicates in a single query, which can be more efficient than chunking the deletion into smaller batches. However, it's important to test this query on a smaller dataset before running it on the entire table to make sure it works correctly and doesn't delete any non-duplicate records.

Please or to participate in this conversation.