@pilat Hey.
TL;DR I've written a package that utilizes MySQL's on duplicate key update feature (https://kfirba.me/blog/performant-mass-update-or-create-strategy-for-data-imports).
I've actually had a case where I needed to load massive Excel files into my database (The excel file was exported from another CRM program). The thing is that the customer required that if a record that already exists in my database also appears in the Excel file, he wants to update my local database with the new data from the Excel file.
So I've gone ahead and implemented that very naively: loop through the Excel rows and for each row if it exists update it, otherwise create it.
Well, it worked but it was really slow when the Excel file had like 2k rows or even more (Don't forget the constraints that the database has to check every time we insert or update a record).
So I started thinking what can I do in order to improve that? I ended up using MySQL's on duplicate key update feature.
So to get you familiar with this feature, just as the name suggests, it attempts to insert the rows into the database. When it encounters a duplicate UNIQUE constraint or duplicate PRIMARY KEY, it will automatically run an update statement for that row. Sounds good right? However, you might wonder: "Oh well, MySQL still internally does all that checking behind the scenes, why would it be significantly faster"?
There is a short answer for that. Usually, the time required for inserting a row is determined by the following factors, where the numbers indicate approximate proportions:
- Connecting: 3
- Sending query to server: 2
- Parsing query: 2
- Inserting row: 1 × size of row
- Inserting indexes: 1 × number of indexes
- Closing: 1
(The information above was taken from MySQL's website)
So basically, there is a cost we can't do anything about which is connecting, sending the query, parsing it and terminating the connection. This cost is as high as 8. it takes 80% of the time (Given we are inserting 1 row with 1 index only)!
So the code is pretty straightforward (an updated code is available in the package: https://github.com/kfirba/import-query-generator):
namespace App\Support;
use Illuminate\Support\Facades\DB;
class ResourceImporter
{
/**
* Inserts or updates the given resource.
*
* @param $table
* @param $rows
* @param array $exclude The attributes to exclude in case of update.
*/
public function insertOrUpdate($table, $rows, array $exclude = [])
{
// We assume all rows have the same keys so we arbitrarily pick one of them.
$columns = array_keys($rows[0]);
$columnsString = implode('`,`', $columns);
$values = $this->buildSQLValuesFrom($rows);
$updates = $this->buildSQLUpdatesFrom($columns, $exclude);
$params = array_flatten($rows);
$query = "insert into {$table} (`{$columnsString}`) values {$values} on duplicate key update {$updates}";
DB::statement($query, $params);
return $query;
}
/**
* Build proper SQL string for the values.
*
* @param array $rows
* @return string
*/
protected function buildSQLValuesFrom(array $rows)
{
$values = collect($rows)->reduce(function ($valuesString, $row) {
return $valuesString .= '(' . rtrim(str_repeat("?,", count($row)), ',') . '),';
}, '');
return rtrim($values, ',');
}
/**
* Build proper SQL string for the on duplicate update scenario.
*
* @param $columns
* @param array $exclude
* @return string
*/
protected function buildSQLUpdatesFrom($columns, array $exclude)
{
$updateString = collect($columns)->reject(function ($column) use ($exclude) {
return in_array($column, $exclude);
})->reduce(function ($updates, $column) {
return $updates .= "`{$column}`=VALUES(`{$column}`),";
}, '');
return trim($updateString, ',');
}
}
The usage is as follows:
$resourceImporter->insertOrUpdate(
'users', // table name
$rows, // array represenation of the rows to insert (assoc array where the key is the same as the DB column name)
$excludeFromUpdate // Fields you want to exclude (excluded from the $rows parameter above) when a row is being updated. For example, if each row has a default `created_at` set to now, you may want to exclude the `created_at` update field when you update a row since it was created somewhen else in the past.
);
Note that the class returns the generated query right after it executes it.
As you can see, this class isn't limited to a specific resource type. You can import users, products, articles or anything else you want with it.
Hope it helps.