Sep 13, 2017

Level 41

Performant way to mass "update or create"

Hi, I'm looking for a way to do something like updateOrCreateMany (imaginary method).Currently, the code looks like this:

class ActivityRepositoryEloquent extends BaseRepository implements ActivityRepository
{
    // ... before this method, that was something like:
    //  $this->model
           ->whereIn('element_id', array_column($records, 'element_id')
           ->update([ 'delete' => true ]);

    public function insertMany($records)
    {
        $searchFields = [
            'subdomain',
            'element_type',
            'element_id',
            'slug',
        ];

        foreach ($records as $record) {
            $record['deleted'] = false;

            $this->model->updateOrCreate(
                array_only($record, $searchFields),
                array_except($record, $searchFields)
            );
        }
    }
}

What it does, in fact, is the following: it makes SELECT * …nd then `UPDATE ?or each of the records!

Is there a way to minimize the number of SQL queries?

A comment about that 'deleted' field: I'm syncing data with a 3rd-part service. For the original data, let's call is "leads", there is id field defined by 3rd-party system. So, I simply store it in my database with that ID and have no issues. There is related data, however, like "custom fields", that I (re)generate in my site, basing on the leads data. Here I have auto-incremental id and I've noticed that my genius schema (deleting all related and re-generating it after leads sync) causes ids to grow higher and higher after each sync.

So, the revised schema is the following:

Set deleted = false for all the custom fields, with element_id IN ([ array of lead ids ])
updateOrCreate custom fields, basing on leads data + hardcoded deleted = 0
Delete all the records that have deleted = 1 remaining;

kfirba

8 years ago

Best Answer

Level 50

@pilat Hey.

TL;DR I've written a package that utilizes MySQL's on duplicate key update feature (https://kfirba.me/blog/performant-mass-update-or-create-strategy-for-data-imports).

I've actually had a case where I needed to load massive Excel files into my database (The excel file was exported from another CRM program). The thing is that the customer required that if a record that already exists in my database also appears in the Excel file, he wants to update my local database with the new data from the Excel file.

So I've gone ahead and implemented that very naively: loop through the Excel rows and for each row if it exists update it, otherwise create it.

Well, it worked but it was really slow when the Excel file had like 2k rows or even more (Don't forget the constraints that the database has to check every time we insert or update a record).

So I started thinking what can I do in order to improve that? I ended up using MySQL's on duplicate key update feature.

So to get you familiar with this feature, just as the name suggests, it attempts to insert the rows into the database. When it encounters a duplicate UNIQUE constraint or duplicate PRIMARY KEY, it will automatically run an update statement for that row. Sounds good right? However, you might wonder: "Oh well, MySQL still internally does all that checking behind the scenes, why would it be significantly faster"?

There is a short answer for that. Usually, the time required for inserting a row is determined by the following factors, where the numbers indicate approximate proportions:

Connecting: 3
Sending query to server: 2
Parsing query: 2
Inserting row: 1 × size of row
Inserting indexes: 1 × number of indexes
Closing: 1

(The information above was taken from MySQL's website)

So basically, there is a cost we can't do anything about which is connecting, sending the query, parsing it and terminating the connection. This cost is as high as 8. it takes 80% of the time (Given we are inserting 1 row with 1 index only)!

So the code is pretty straightforward (an updated code is available in the package: https://github.com/kfirba/import-query-generator):

namespace App\Support;

use Illuminate\Support\Facades\DB;

class ResourceImporter
{
    /**
     * Inserts or updates the given resource.
     *
     * @param       $table
     * @param       $rows
     * @param array $exclude The attributes to exclude in case of update.
     */
    public function insertOrUpdate($table, $rows, array $exclude = [])
    {
        // We assume all rows have the same keys so we arbitrarily pick one of them.
        $columns = array_keys($rows[0]);

        $columnsString = implode('`,`', $columns);
        $values = $this->buildSQLValuesFrom($rows);
        $updates = $this->buildSQLUpdatesFrom($columns, $exclude);
        $params = array_flatten($rows);

        $query = "insert into {$table} (`{$columnsString}`) values {$values} on duplicate key update {$updates}";

        DB::statement($query, $params);

        return $query;
    }

    /**
     * Build proper SQL string for the values.
     *
     * @param array $rows
     * @return string
     */
    protected function buildSQLValuesFrom(array $rows)
    {
        $values = collect($rows)->reduce(function ($valuesString, $row) {
            return $valuesString .= '(' . rtrim(str_repeat("?,", count($row)), ',') . '),';
        }, '');

        return rtrim($values, ',');
    }

    /**
     * Build proper SQL string for the on duplicate update scenario.
     *
     * @param       $columns
     * @param array $exclude
     * @return string
     */
    protected function buildSQLUpdatesFrom($columns, array $exclude)
    {
        $updateString = collect($columns)->reject(function ($column) use ($exclude) {
            return in_array($column, $exclude);
        })->reduce(function ($updates, $column) {
            return $updates .= "`{$column}`=VALUES(`{$column}`),";
        }, '');

        return trim($updateString, ',');
    }
}

The usage is as follows:

$resourceImporter->insertOrUpdate(
                'users', // table name
        $rows, // array represenation of the rows to insert (assoc array where the key is the same as the DB column name)
                $excludeFromUpdate // Fields you want to exclude (excluded from the $rows parameter above) when a row is being updated. For example, if each row has a default `created_at` set to now, you may want to exclude the `created_at` update field when you update a row since it was created somewhen else in the past.
            );

Note that the class returns the generated query right after it executes it.

As you can see, this class isn't limited to a specific resource type. You can import users, products, articles or anything else you want with it.

Hope it helps.

8 likes

nai

4 years ago

Level 1

@kfirba does this package it work for composite keys ? unique(['column1,column2,colum3'],unique_name). I tried the laravel8 built in upsert methods but its not updating, it keeps adding. Thanks

pilat

8 years ago

Level 41

@kfirba Thank you, this is something I've thought of, actually, just wanted to avoid slipping off the ORM-way. Also, there is a strange error when I'm trying to add "unique index" to this table (looks that I have some records violating this).

But I should definitely give it a try and measure how fast if would become.

Btw, I import 500 records at a time, are there any limitation on the SQL query length?

P.S.: there was a point in past, when I simply used REPLACE INTO …but it kept increasing autoincremental ` and that was also "no an ORM" :-)

kfirba

8 years ago

Level 50

@pilat ORM is great but sometimes it's not the most performant solution.

Don't be afraid to "stray off" the ORM way. I LOVE ORM but many times I find myself writing custom queries in order to squeeze more performance from my application.

As for the max query length, you can just run: show variables like 'max_allowed_packet' to see what the max query length is set on your server (the result is in bytes). Of course, you can change this setting. To increase the max_allowed_packet, open the my.ini/my.cnf under [mysqld] section and alter the setting. Once the change is done you would have to restart the server.

1 like

pilat

8 years ago

Level 41

@kfirba

Finally implemented this approach and had some time to test it out. Looks good so far! At least, I can be sure that relations records IDs stay unchanged (once created) and they're not going to overflow INTEGER type eventually :-)

Thank you very much!

kfirba

8 years ago

Level 50

@pilat I'm glad you liked it!

I've actually written a small library to generate the appropriate query along with its bindings: https://github.com/kfirba/import-query-generator

I've also published an in-depth blog about that feature that I think can come handy: https://kfirba.me/blog/performant-mass-update-or-create-strategy-for-data-imports

1 like

pilat

7 years ago

Level 41

Well, this worked for some time, but now it hit its top as well…

After some good chunk of "thinking time", I've realised that I'm willingly making the system overly complicated on low [!!and, what's most important — not well-learned!!] level (Database) for the sake of simplicity on high-level (ORM). This is just wrong…

What I do: I regularly sync (and less often, but still regularly, re-sync from scratch) data from an external API into my application's database, so that I could query this data as I need. And for "closer" access, for sure.

Thus, here's my basic conditions:

Lots of writes; Often, in large chunks (could easily be like a half a million records within one sync session);
A need to keep IDs for existing records.
At the same time, I need to remove the records not available now.

Here's the pseudo-code of how it's done now:

// 1. updating a number of leads;

$leadIds = $this->updateLeadsSince($timestamp);

// 2. now, need to update some linked data

$this->syncCustomFields( $leadIds );

// ...
protected function syncCustomFields( $leadIds )
{
    // Step 1: mark all related fields as deleted, without actually deleting them:
    
    CustomField::whereIn('lead_id', $leadIds)->update([ 'deleted' => true ]);

    // Step 2: update or insert new data, utilising "INSERT … ON DUPLICATE UPDATE";
    // all new records are written with "deleted = false";
    
    $this->generateCustomFieldsForLeads($leadIds);

    // Step 3: delete the records remaining with "deleted = true" for real:
    
    CustomField::where('deleted', true)->delete();
}

With this schema, I often get wait lock errors, especially on update([ 'deleted' => true ]) queries. The insertion itself (Step 2) is also slower than I would like… An I have no idea on how to optimise it, without giving up one any of the 'basic conditions"…

However.

If to look "from a bird's sight height", what I actually need in this case: is a composite primary key + REPLACE INTO statements. That's all and it's definitely simpler.

The Eloquent ORM, however, does not provide any good support for composite primary keys, and this is why I added "AUTO INCREMENTAL" surrogate key in the very beginning. But this key is causing so many troubles in unknown area now, that I'm considering to convert my tables in a way to have it simpler on the database level and tackle with Eloquent instead; This might work, for example: https://blog.maqe.com/solved-eloquent-doesnt-support-composite-primary-keys-62b740120f

So, I'll hopefully keep you informed. ;-)

Hansz

7 years ago

Level 1

@kfirba that class looks good, but what about relations? how you can implement it? in a many-to-many for example

kfirba

7 years ago

Level 50

@hansz Well, you can't, at least not in one query.

You will need some code to split your Excel file into separate models (tables) and run them one after another, probably in a logical way where you'd get ids for foreign keys.

Do you have a specific excel file you want to import?

pilat

7 years ago

Level 41

@KFIRBA - Recently, I've got huuuge (naturally, 3 levels of magnitude) speedup with this "fix":

        - DB::statement($query, $params); // this one took around 30 seconds
        
        + $stmt = DB::getPdo()->prepare($query);
        + $success = $stmt->execute($params); // this one takes 31.45ms (magnitude of three!)

There are downsides, of cause:

query is not registered in Debugbar
no automatic type conversion (you have to check for types to match in the client code, when preparing "params" array)

But, maybe, has a sense to enable this as an option ($unsafeButFast = false) =)

1 like

staudenmeir

6 years ago

Level 24

I've created an INSERT & UPDATE (UPSERT) package for all databases: https://github.com/staudenmeir/laravel-upsert

1 like

mathewparet

6 years ago

Level 22

@staudenmeir The package looks promising.

I have a requirement, which I am not sure this package would help.

I am loading data from an external source and the data is up to 7GB Gzipped! So I have a laravel job that processes 100,000 records per thread.

Right now this is what I am doing:

Each thread processes 100,000 records
Call DB::beginTransaction()
Call Model::updateOrCreate()
Every 10,000 records, call DB::commit() and then DB:beginTransaction()

With this I can see that data is committed only every 10,000 records and it does show improvement compared to running Model::updateOrCreate() without transaction. This is good. But my data still takes a long time to process.

Your laravel-upsert package seems promising, but I need to check something with you before I start using it:

Considering I am using transactions in my current logic to avoid frequent committing, will upsert be still faster?
If I update my current logic to collect all data into a collection of 10,000 records and use upsert, and at that point if 1 of these 10,000 records fail to insert into db due to any reason (say data too long for a field), won't all 10,000 records fail? I mean how do I know which record failed?

staudenmeir

6 years ago

Level 24

@mathewparet The package will be way faster than updateOrCreate(). It allows you to insert/update thousands of records in a single query, while updateOrCreate() executes two queries per record.

What database are you using?

1 like

nai

4 years ago

Level 1

@staudenmeir does this package works composite keys ? $table->unique([column1, column2,column3],uniquename). I tried laravel 8 built in upsert but it doesnt update, it keeps adding/inserting duplicates

Please or to participate in this conversation.