Aug 19, 2019

Level 1

Mass update DB based on an array from an CSV file

Hello Laravel friends, I've been struggeling with a problem and i hope someone will be able to help me.

I'm working on a project in Laravel 5.8. I have CSV export fles from GSC and aHrefs with pages, links, domain raitings etc. form my tests I'm using a file with 80k entries, I will have bigger files 200-500k or even more.

First of all I convert the CSV file in to an array.

    function csv_to_array($filename = '', $delimiter = ',')
    {

        ini_set('auto_detect_line_endings', TRUE);
        if (!file_exists($filename) || !is_readable($filename))
            return FALSE;

        $header = NULL;
        $data = array();
        if (($handle = fopen($filename, 'r')) !== FALSE) {
            while (($row = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
                if (!$header) {
                    $header = $row;
                } else {
                    if (count($header) > count($row)) {
                        $difference = count($header) - count($row);
                        for ($i = 1; $i <= $difference; $i++) {
                            $row[count($row) + 1] = $delimiter;
                        }
                    }
                    $data[] = array_combine($header, $row);
                }
            }
            fclose($handle);
        }
        return $data;
    }

Next I use a for loop to get only the fields I want to store in the DB and then I insert them like this

    DB::beginTransaction();

        \DB::disableQueryLog();

        $file = csv_to_array(request()->file('csv'));

        for ($i=0; $i <= count($file)-1 ; $i++) { 

            $ar1 =[
                'url' => $file[$i]["Linking page"],
                'last_crawl'  => (array_key_exists('Last crawled', $file[$i])) ? ($file[$i]['Last crawled'] === "N/A" ?  '' : date('Y-m-d', strtotime($file[$i]['Last crawled'])) . " 00:00:00") : "",
                'added' => Carbon::now()
             ];

            Link::create($ar1);

} DB::commit();

And this works fine even with the 80k file.

Butt the problem comes when i have to update the DB. I get the max execution time. First I tried to do it like the insertion in a loop which i know is a bad idea.

      for ($i=0; $i <= count($file)-1 ; $i++) { 

            Link::updateOrCreate(
                ['url' => $file[$i]["Linking page"]],
                [
                    'url' => $file[$i]["Linking page"],
                    'last_crawl'  => (array_key_exists('Last crawled',$file[$i])) ? ($file[$i]['Last crawled'] === "N/A" ?  '' : date('Y-m-d', strtotime($file[$i]['Last crawled'])) . " 00:00:00" ) : "",
                    'added' => Carbon::now()
                ]);
        }

I think I've red all the related topick entries i could find. I was recommended to use this package https://github.com/mavinoo/laravelBatch

But It works with files arround 18k entries and with bigger it also dies :/

And now I just don't have any more ideas on how to do this propperly. I would appreciate on some kind of a hint or maby a code snippet on how to do this the right way, because as I wrote I will have much bigger files than 80k.

I hope I described the problem properly. Thanks for the replies.

andresayej

6 years ago

Level 13

A question:

When performing the mass update is it expected to update all of the rows with the same values?

If yes you can structure your updates like this:

For you every column you would like to update for a given set of rows (think ID's) with the same value

Model::whereIn('id', $arrayOfIds)->update([
    'firstColumnName' => $firstColumnValue,
    'secondColumnName' => $secondColumnValue,
]);

and so on...

andresayej

6 years ago

Level 13

Another approach would be if for some reason you have to definitely traverse then update the values for each row/id is to use Queues:

https://laravel.com/docs/5.8/queues

This approach would bypass the max execution time because you will offload the expensive job operation of updating a massive set of records to a background job.

cierek1

6 years ago

Level 1

Hi, thank you for the reply.

No, every time I will have different values based on the data exported from aHrefs and Google Search Console. That's why I tried to use "updateOrCreate". The database will be only empty before the first insertion, after that I always will have entries and which I will have to check to update the DB.

I have posted the shorter version ( adding data from GSC ) here, where I will have only the "url" extracted from the CSV from the "Linking page" field.

In the aHrefs files I will have to compare and insert something like this:

         Link::updateOrCreate(
             ['domain'=> mb_substr(plainDomainHelper($file[$i]["Referring Page URL"]), 0, strlen($file[0]["Referring Page URL"]) - 1,), 'url' => mb_substr($file[$i]["Link URL"], 0, strlen($file[0]["Link URL"])-1, )],
             [
                'dr' => $file[$i][ "Domain Rating"],
                'ur'  => $file[$i][ "URL Rating (desc)"],
                'url' => mb_substr($file[$i]["Link URL"], 0, strlen($file[0]["Link URL"])-1, ),
                'page_title' => mb_substr($file[$i]["Referring Page Title"], 0, strlen($file[$i]["Referring Page Title"])-1, ),
                'target_url' => mb_substr($file[$i]["Link URL"], 0, strlen($file[0]["Link URL"])-1, ),
                'first_crawl' => Carbon::createFromFormat('Y-m-d H:i:s', $file[$i]["First Seen"], 'Europe/London'),
                'last_crawl'  => Carbon::createFromFormat('Y-m-d H:i:s', $file[$i]["Last Check"], 'Europe/London'),
                'anchor_text' => $file[$i][ "Link Anchor"],
                'type' => $file[$i][ "Type"],
                'domain' => mb_substr(plainDomainHelper($file[$i][ "Referring Page URL"]), 0, strlen($file[0]["Referring Page URL"])-1, ),
                'added' => Carbon::now()
             ]
         );

I won't be able to provide a set of id's. Like I wrote I will have files of 80-200k of fields and it will be for one of the pages/client and I will have several of those so there will be milions of rows in the DB.

andresayej

6 years ago

Level 13

Also you can definitely use the package:

https://laravel-excel.com/

It would simplify your flow for dealing with imports/exports for files like XLS,CSV and you can more easily leverage Laravel's Queue system because the package has built-in support for them. The package also includes Events which can help you leverage Laravel's built-in Notifications system which could be helpful for notifying your users of a completed import/export for example.

Consider the following example (In my application i am doing exactly that):

Logged in users in my application can import files by visiting a page with a standard form and file upload input.

When and if the validation for the forms has passed the application dispatches a job (for background execution at a later time /when and if the job is executed depends on the items in my job(s) queue at the dispatched moment/). The user is then redirected to a page with a successful "import process has begun message" just so the user has a feedback that the import has been successful in terms of uploading to the application but now has to be processed. The user is never interrupted because of waiting for the request to finish(think max execution time) because the costly part of the execution is now offloaded to a background job and all the user has to do now is wait for a notification from the application when the imported file has finished processing. Woala.

cierek1

6 years ago

Level 1

Thank you for your next reply.

I started with laravel-excel but I left it to the function I posted in the first post but I really don't remember why.

I will definitely check it again and in case of a problem If I can I would ask you a question. I will also check out the queues.

Thank you for your time and tips.

andresayej

6 years ago

Level 13

@cierek1

Glad it helps.

Happy coding :)

andresayej

6 years ago

Level 13

@cierek1 did you resolve your question for this thread?

cierek1

6 years ago

Level 1

@andresayej Hello,

yes I did it just a minute ago :) I refactored my code to use Events, Listeners and the Queues. Now the 80k file is being updated in less than 90s. It's not the best result but at least it gets updated :)

Pciranda

6 years ago

Level 9

Look this. https://laracasts.com/discuss/channels/laravel/too-many-placeholders

in my case it's insertion, but it's similar.

I believe slowness is the number of queries sent to the database.

if you are doing something like this:

$data = [
     ['id' => 1, 'name' => 'Jhon', 'age' => 30],
     ['id' => 2, 'name' => 'Alex', 'age' => 32],
     ['id' => 3, 'name' => 'Ane', 'age' => 20],
     ...
];

foreach ($ data as $ user) {
     \App\User::where('id', $user ['id'])->update($user);
}

This will generate 1 query for each line of your file, and make it extremely slow.

One option is to use "ON DUPLICATE KEY UPDATE" for mysql or "ON CONFLICT UPDATE" for postgresql.

and Generate a large query like this.

INSERT INTO users (id, name, age)
VALUES
(1, 'Jhon', 30),
(2, 'Alex', 32),
(3, 'Ane', 20),
...
ON DUPLICATE KEY UPDATE
    name = VALUES (name),
    email = VALUES (email)
;

Test and then tell me the result.

cierek1

6 years ago

Level 1

@andresayej Unfortunetly I was to fast with my response :/ I didn;t check it properly and the updateOrCreate function just created the extra rows and when I tried to updated them after 30min I gave up. I bet its the issue Pciranda is writing about :(

@pciranda I will defenetly try it. Thank you for ur response and the tip. I hope it will solve my problems

Pciranda

6 years ago

Level 9

@cierek1 One more option.

You can create a temporary table and update using join.

I will use this database as an example. https://github.com/datacharmer/test_db

Let's take a look at the salaries table.

mysql> desc salaries;
+-----------+---------+------+-----+---------+-------+
| Field     | Type    | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| emp_no    | int(11) | NO   | PRI | NULL    |       |
| salary    | int(11) | NO   |     | NULL    |       |
| from_date | date    | NO   | PRI | NULL    |       |
| to_date   | date    | NO   |     | NULL    |       |
+-----------+---------+------+-----+---------+-------+
4 rows in set (0.01 sec)

mysql> select count (*) from salaries;
+----------+
| count(*) |
+----------+
|  2844047 |
+----------+
1 row in set (0.10 sec)

Now let's create a temporary table in the same format as our salaries table.

CREATE TEMPORARY TABLE temp_salaries LIKE salaries;
Query OK, 0 rows affected (0.00 sec)

In this part you will insert the entire contents of your CSV file into the temporary table. I will simulate this by copying the salaries table data to temp_salaries, I will only change the salary value using the RAND () function. // FLOOR (RAND () * (MAX - MIN + 1)) + MIN

mysql> INSERT INTO temp_salaries (emp_no, salary, from_date, to_date) 
SELECT 
    emp_no,
    FLOOR (RAND () * (1000-10 + 1)) + 10,
    from_date,
    to_date 
FROM salaries;
Query OK, 2844047 rows affected (8.18 sec)
Records: 2844047 Duplicates: 0 Warnings: 0

Let's look at the first 3 of each table.

mysql> select * from temp_salaries limit 3;
+--------+--------+------------+------------+
| emp_no | salary | from_date  | to_date    |
+--------+--------+------------+------------+
|  10001 |    203 | 1986-06-26 | 1987-06-26 |
|  10001 |    922 | 1987-06-26 | 1988-06-25 |
|  10001 |     27 | 1988-06-25 | 1989-06-25 |
+--------+--------+------------+------------+
3 rows in set (0,00 sec)

mysql> select * from salaries limit 3;
+--------+--------+------------+------------+
| emp_no | salary | from_date  | to_date    |
+--------+--------+------------+------------+
|  10001 |  60117 | 1986-06-26 | 1987-06-26 |
|  10001 |  62102 | 1987-06-26 | 1988-06-25 |
|  10001 |  66074 | 1988-06-25 | 1989-06-25 |
+--------+--------+------------+------------+
3 rows in set (0,00 sec)

Now we will update the salaries table according to the contents of the temporary table.

mysql> UPDATE 
    salaries,
    temp_salaries
    SET 
        salaries.salary = temp_salaries.salary
    WHERE 
        salaries.emp_no = temp_salaries.emp_no AND
        salaries.from_date = temp_salaries.from_date;
Query OK, 2844047 rows affected (53.71 sec)
Rows matched: 2844047 Changed: 2844047 Warnings: 0

Let's see...

mysql> select * from salaries limit 10;
+--------+--------+------------+------------+
| emp_no | salary | from_date  | to_date    |
+--------+--------+------------+------------+
|  10001 |    203 | 1986-06-26 | 1987-06-26 |
|  10001 |    922 | 1987-06-26 | 1988-06-25 |
|  10001 |     27 | 1988-06-25 | 1989-06-25 |
|  10001 |    332 | 1989-06-25 | 1990-06-25 |
|  10001 |    579 | 1990-06-25 | 1991-06-25 |
|  10001 |    898 | 1991-06-25 | 1992-06-24 |
|  10001 |    761 | 1992-06-24 | 1993-06-24 |
|  10001 |    110 | 1993-06-24 | 1994-06-24 |
|  10001 |    242 | 1994-06-24 | 1995-06-24 |
|  10001 |    868 | 1995-06-24 | 1996-06-23 |
+--------+--------+------------+------------+
10 rows in set (0,00 sec)

We updated 2.8 million in 53 seconds!

Using a Dell Inpirion 5000 Dual Core notebook Intel® Core ™ i7-7500U CPU @ 2.70GHz - 8.0 GB memory.

imagine on a server.

I believe your 80,000 line csv will come easy.

The only challenge will be to transcribe this to Laravel's Eloquent.

1 like

Please or to participate in this conversation.