Jul 15, 2015

22

OP

Level 5

Extracting and saving "large" binary file

Hi guys!

Here is the thing, I have some binary files (logs) wich I upload to my webapp. The webapp run Laravel, it extracts all informations from the binary file and store it to a MySQL database.

Problem is, the binary files can be up to 5 or 8MB. I can extract data at 300kB/s and save those at 30kB/s to the DB. Which means that it takes up to 294s, almost 5 minutes! And I don't even count the time to upload an 8MB file to the server...

Question is, is there a way to not get an active connection for more than 5 minutes to extract datas and store them to the DB? Like running a PHP command in background? My first thought was using a cron for that but it looks crappy from the outside...

Any thoughts?

Level 65

I don't understand the extraction/saving at KB/s? You mean CPU time or?

OP

Level 5

Hey, thanks for the prompt reply!

I just divided the file size by the time measured to extract files and the one to save them. With that I have two constants, whatever the file size is I've got 300kB/s to extract and 30kB/s to save them. I just use this to calculate the least time it'll take to extract/save a file regarding its size.

In fact it doesn't really matter, what I can say is that it takes 5 minutes to handle a file of 8MB.

OP

Level 5

Well, nobody?

Level 56

Like running a PHP command in background?

Of course, it is queued jobs.

Level 7

Maybe some extra information is needed ... I think your application can be divided into several pieces , where is the bottleneck ?

is the upload fast ?
is the extraction of data fast , your transformation from logfile to query or models
is the dump to the database fast ?

What is your input ? xml , csv , text input ?

OP

Level 5

Thanks guys!

@pmall I was thinking about that but seemed overwhelmed, don't really know why though.

@deltasolutions It is indeed divide, first step, upload (seems normal), second step, extraction and then registering to database. It seems that the registering step is pretty slow...

I have a service that handle my file then in my controller I call the extract method on that service, and then a save method on that same service.

Any advice to handle this with queue?

Level 7

so it will be the "translation" step that is slow , 'the computing' ... Which format do you transform ? And to what are you transforming ? objects , queries ?

Level 56

I have a service that handle my file then in my controller I call the extract method on that service, and then a save method on that same service.

Create a Job which takes all the parameters it needs, then in the handler inject this service and use it in the very same way. Implements ShouldBeQueued so the job is queued. Finally dispatch the job when you want to perform this task.

If you set another queue driver than sync in you queue config, it will be run in background.

Level 26

As @pmall said Queues are the way to go when you need to process something in the background.

I can extract data at 300kB/s and save those at 30kB/s to the DB

That seems exceeding slow, I would suggest posting some code in Code Review to see if anyone can spot the bottleneck. 5 minutes to process 8MB seems insane to stick with.

Level 65

I also want to know how DB saving is measured by KB/s... where and how? :D

Level 26

So do I but I'm guessing he means it takes ~300s to process ~8000KB, so 1s to upload ~27KB which is where he's getting 30KB/s from.

Without seeing any code it's hard to see what's taking the time, I mean the actual saving to the DB should be almost instant.

OP

Level 5

Ok thanks everybody!

@bashy You're totally right @phildawson, I only use this speed to calculate an approximative processing/saving time knowing the file size. Nothing fancy here but 30kB/s seemed stable so I used it to calculate my things...

Anyway, I'll go with the Queue solution. Thanks to @pmall I'll check on that!

Unfortunately, I can't post my entire code here. All I can say is that I have a file and the data inside are binary coded. I have a documentation where I know exactly the size of the datas I want to extract (1Byte, 2B, or 4B and signed or unsigned), this is the mess! So I parse my file knowing at which address I should read which data. Then I have some checksum and all to verify the data.

Since I HAVE to parse the entire file with jump of less than 50Bytes, with a file of 8MB, I have no less than 160,000 iterations.. AT LEAST.

So I think we could optimize a bit but I think that even if we divide by 2 the total time, we'll still be around a minute, and I don't like it.

That's why I'll go with the Queue solution.

If anyone have a better idea than parse the entire file like I do, you're welcome to contact me :)

Thanks again everybody!

OP

Level 5

Last thought, I didn't precise it but with a 600kB file, I have approximatively 7000 lines to insert into my database.

Level 26

Is that 7000 rows on the table? That should be fairly quick, depending on the data size you could do say 1000 at a time and have 7 inserts. If the server is fairly decent it should be seconds.

I guess passing it off to c++ to deal with could be option though using the standard fread and unpack its easy to deal with binary files GBs in size so I'm not sure why a few MBs would take minutes¯(°_o)/¯

OP

Level 5

@phildawson Yes 7000 rows. Problem is I can't do has many row at a time as I want.. I have to insert one type to get its ID (because I have relationship to handle), then insert around 50 rows then iterate like that over the whole file.

I actually use unpack, but I can't use it on the whole file because there is different type of data at different places all over the file. I also have to check data integrity with each type of data so I can't extract the whole thing in one block don't I?

What takes the most of the time is the inserts, literally. Like 10x the extraction.

What I do is something like this, I store the whole extraction into an array then iterate over it like this

foreach ($this->extracts as $extract) {
    $parent = $this->parent->create($extract['fields']);
    DB::table('children')->insert(array_merge($extract['children'], ['parent_id' => $parent->id]));
}

Not exactly what I use but something similar...

Level 26

@peterpan666 Well based on soley above you are just saying all these children belong to the parent.

Foreach ing over an insert is always going to suck.

If: your script is ran once at a time you could insert all in one go. Before inserting get the highest primary key if its 7000 then the next lot with be ids will 7001-7051 if you added 50 being sequential. Easy for a select where in if you wanted them back. You could also lock if needed but see below.

Else:

If: the children table just had one lot of data relating to the parent_id then you could after simply do a select id where parent_id=123

Else: the children table had multiple lots of data relating to the same parent over different import times then i would simply add a batch column to distinguish the import. select id where parent_id=123 and batch=32 You would have a table called batches to insert before to give you the next batch id.

1 like

OP

Level 5

@phildawson Damn! Not a bad idea at all!! Don't even know why I've never thought about that...

Any pros and cons on that? Like isn't that bad to store about 5MB into the database with a single insert?

Level 13

@peterpan666 Why are you storing binary data in a database? You mentioned that the data is on various formats throughout the upload. I assume those formats can occur at variable spots in the stream? Mass inserts into a database are not a magic "It Just Happens" process. @peterpan666 Are you using indexes? Are you using database transactions? These will contribute to your insert time. Particularly when inserting a large number of rows. Which is why i ask once more, Why are you storing binary data in a database? What formats does your binary stream expect? How do you want to catalog those formats?

OP

Level 5

Why are you storing binary data in a database?

I don't store binary data. My file contains binary data and my extract method returns readable data (I use unpack).

I assume those formats can occur at variable spots in the stream?

Yes, exactly.

Are you using indexes?

For now, yes. But it might change because of others considerations.

Are you using database transactions?

No, never had to use them in fact

How do you want to catalog those formats?

I store multiple type of data in a row like, integer, number, string or even timestamps. Each type have its own column.

Don't really know if it's the kind of answer you were waiting for?

Level 26

@phildawson Damn! Not a bad idea at all!! Don't even know why I've never thought about that... Any pros and cons on that? Like isn't that bad to store about 5MB into the database with a single insert?

@peterpan666 it's perfectly fine :) I've dealt with millions of rows of data per import around 500MB being chucked in at a time per data set, the only thing to do is up the max_allowed_packet The limit is 1GB at a time.

An average server with SSD should be able to chuck in 5MB in less than a second easily.

http://dev.mysql.com/doc/refman/5.7/en/packet-too-large.html

I would do a few timestamps just using simple $start = microtime(true); and echo microtime(true) - $start; around sections of the code to spot where its really taking the time. I would be tempted to C++ deal with the extraction and return the result.

5MB is pretty tiny, but I would also look at this for huge inserts. https://dev.mysql.com/doc/refman/5.7/en/load-data.html

Speed tips here too:

https://dev.mysql.com/doc/refman/5.7/en/optimizing-innodb-bulk-data-loading.html

https://dev.mysql.com/doc/refman/5.7/en/optimizing-myisam-bulk-data-loading.html

OP

Level 5

@phildawson Ok great! I've always the feeling, when I do this kind of stuff, that I use "big things" then someone tell me "hey, you know you can do like 100x more than that!" :D I'll do some research and performance tests on that and I'll let you know how much time I've saved ;) Thanks!

OP

Best Answer

Level 5

Ok guys!

I just did some little optimization and I manage to go from 19s to less than 5s with a 600kB file (extraction + saving)!!

What I've done:

Don't know why but for the extraction I read the file each time I needed it.. So I've made a method getContent that checks if the $content attribute is set or not. If not it puts the whole file in it then return $content. Divide by almost 2 the execution time of extraction!
Thanks to @phildawson and this awesome link: https://dev.mysql.com/doc/refman/5.7/en/optimizing-innodb-bulk-data-loading.html with just the first points, the saving time (for a 600kB file) went from 15s down to 1.33s!!
Other thing I did, and this one is pretty simple... Just disable the db collector of the Laravel Debugbar and gain like 1 or 2s.

The results are pretty good for what I do so I think I'll stick with that for now and use Queues after.

Thanks to everyone!

Please or to participate in this conversation.