How can I further optimize this query or is there any other alternative?
I have a table with 37 million records (11 million duplicates) and I want to delete all the duplicate records based on three columns. Currently I have written a command to do so as below in laravel 7-
$oldUserStatsQuery = DB::table('user_stats_data')
->select('user_stats_data.id')
->leftJoin(
DB::raw(
'(SELECT MAX(id) AS id FROM user_stats_data GROUP BY client_id, user_id, data_point_id) AS
max_stats_data'),
function ($join) {
$join->on('user_stats_data.id', '=', 'max_stats_data.id');
}
)
->whereNull('max_stats_data.id')
->orderBy('user_stats_data.id');
$oldUserStatsQuery
->chunkById($this->chunkSize, function (Collection $chunks) use (&$deletedCount) {
$ids = $chunks->pluck('id')->toArray();
$idsString = implode(',', $ids);
DB::statement("DELETE FROM user_stats_data WHERE id IN ($idsString)");
$deleteCount = count($ids);
$deletedCount += $deleteCount;
$this->info("{$deleteCount} rows deleted.. ({$deletedCount} in total)");
}, 'user_stats_data.id', 'id');
This code took around 7 hrs alone for deleting just 780k records. 'user_id', 'client_id' and 'data_point_id' are already indexed. Is there any way I can further optimize this query or any other alternative in order to perform the deletion as quickly as possible ?
One possible solution to optimize this query is to use a subquery instead of a join. This can be done by selecting the maximum id for each group of duplicates in a subquery, and then using this subquery to delete all records except for the ones with the maximum id. Here's an example of how this can be done:
DB::table('user_stats_data')
->whereRaw('id NOT IN (
SELECT MAX(id) FROM user_stats_data GROUP BY client_id, user_id, data_point_id
)')
->delete();
This query should be faster than the original one because it avoids the join operation and uses a subquery instead. It also deletes all duplicates in a single query, which can be more efficient than chunking the deletion into smaller batches. However, it's important to test this query on a smaller dataset before running it on the entire table to make sure it works correctly and doesn't delete any non-duplicate records.