Batches vs. a single long-running job

I have to process several million database rows, which could take up to 20 minutes. What would be the disadvantage of processing everything in a single job? If I split the work into batches of 1,000 records, for example, I either have to create all the batches up front and then use the slow SQL OFFSET in the SELECT statement to jump to the respective block of 1,000, or I add a new batch with $this->batch()->add at the end of each batch if there is more data to process.

It all sounds rather cumbersome. So why not process everything in a single job? I don't need parallel processing, because it is very important that the order of the auto-increment IDs in the database is correct. If the single job fails, I would not have to repeat everything, because the processed records are marked. I can therefore restart the job with a job retry, and only the unprocessed records would be processed.
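A minimal sketch of the single-job idea described above, assuming the rows can be read in chunks by "id greater than the last one seen" instead of OFFSET. Python with an in-memory SQLite database is used purely for illustration; the table `input_rows`, its `processed` column, and the function name are hypothetical and not taken from the post. Laravel's `chunkById` query builder method follows the same idea.

```python
import sqlite3


def process_in_id_order(conn, chunk_size=1000):
    """Process unprocessed rows in ascending id order, one chunk at a time."""
    last_id = 0
    handled = []
    while True:
        # Keyset pagination: seek past the last id instead of using OFFSET.
        rows = conn.execute(
            "SELECT id, payload FROM input_rows "
            "WHERE processed = 0 AND id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        for row_id, payload in rows:
            # Placeholder for the real work; here the row is only marked as done.
            conn.execute(
                "UPDATE input_rows SET processed = 1 WHERE id = ?", (row_id,)
            )
            handled.append(row_id)
        conn.commit()  # progress survives a crash in a later chunk
        last_id = rows[-1][0]
    return handled


# Hypothetical demo data: 25 rows, the first 7 already marked by an earlier run.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE input_rows ("
    "id INTEGER PRIMARY KEY, payload TEXT, processed INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO input_rows (payload) VALUES (?)",
    [(f"row-{i}",) for i in range(1, 26)],
)
conn.execute("UPDATE input_rows SET processed = 1 WHERE id <= 7")
conn.commit()

handled = process_in_id_order(conn, chunk_size=10)
remaining = conn.execute(
    "SELECT COUNT(*) FROM input_rows WHERE processed = 0"
).fetchone()[0]
```

Because each query seeks by primary key, the cost of fetching a chunk should not grow with the position in the table the way an OFFSET scan does, and a retried job simply continues with the first unmarked row.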

tisuchi

2 years ago

@drhouse Based on my understanding, you may use chunked queries to efficiently process large datasets without loading them all into memory at once.

drhouse

2 years ago

@tisuchi Correct, chunked queries will help, but that doesn't answer the question of whether I need batches or whether a single long-running job is fine.

tisuchi

2 years ago

@potentdevelopment I would go with batches, because you mentioned that you have millions of records.

drhouse

2 years ago

@tisuchi But why, what are the advantages?

tisuchi

2 years ago

@drhouse Yes

Based on my understanding, there are some advantages, for example parallel processing, optimized resource usage, ...

drhouse

2 years ago

@tisuchi I can't use parallel processing, because the processing inserts new database records and the auto-increment IDs must be in the same order as the input data.

And as for "optimized resource usage", I can't see how it helps if, for example, I split 10,000,000 rows of data into packages of 1,000 in advance and then create 10,000 batches, each of which has to be started individually. The startup time of a batch process alone costs unnecessary time.

Tray2

2 years ago

It depends on what you need to process. If you change the status of a record once it has been processed, then you can use batches; if you don't, I suggest that you use a single job and use chunk instead.

If you are using database transactions as well, you need to think about where you put the commit, or you could run into deadlocks in your database.
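To illustrate the point about commit placement, here is a hedged sketch that wraps each chunk in its own short transaction, so locks are held only for one chunk and a failure rolls back only the chunk in progress. It uses Python with SQLite for illustration; the tables `input_rows` and `output_rows`, the `run` function, and the `fail_on_id` switch are hypothetical names invented for this example.

```python
import sqlite3


def run(conn, chunk_size, fail_on_id=None):
    """Insert one output row per input row, committing once per chunk."""
    # Resume after the highest input id that already has an output row.
    last_id = conn.execute(
        "SELECT COALESCE(MAX(input_id), 0) FROM output_rows"
    ).fetchone()[0]
    while True:
        rows = conn.execute(
            "SELECT id FROM input_rows WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        with conn:  # commits if the block succeeds, rolls back if it raises
            for (row_id,) in rows:
                if row_id == fail_on_id:
                    raise RuntimeError(f"simulated failure at input {row_id}")
                conn.execute(
                    "INSERT INTO output_rows (input_id) VALUES (?)", (row_id,)
                )
        last_id = rows[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input_rows (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE output_rows ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, input_id INTEGER)"
)
conn.executemany(
    "INSERT INTO input_rows (id) VALUES (?)", [(i,) for i in range(1, 13)]
)
conn.commit()

# First attempt fails in the middle of the second chunk (ids 6 to 10).
try:
    run(conn, chunk_size=5, fail_on_id=8)
except RuntimeError:
    pass
after_failure = [
    r[0] for r in conn.execute("SELECT input_id FROM output_rows ORDER BY id")
]

# The retry picks up after the last committed chunk.
run(conn, chunk_size=5)
final = [
    r[0] for r in conn.execute("SELECT input_id FROM output_rows ORDER BY id")
]
```

After the simulated failure only the first, fully committed chunk is present; the retry then continues from there, and the output rows end up in the same order as the input IDs.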

drhouse

2 years ago

@Tray2 I create new data that is linked to the input data via a relationship. This allows me to check whether there is still input data that has not yet been assigned to output data in another model. I just don't see how batches help me yet, except that with such amounts of data they put a lot of load on the queue (in my example above, 10,000 batches of 1,000 rows each) and add unnecessary startup time for every batch run. Even with a startup time of 0.1 seconds per batch, that would already be 1,000 seconds, or over 16 minutes, spent just on starting all the batches. I just don't see the advantage of batches yet.
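The overhead estimate above can be checked with a few lines of arithmetic. The 0.1 second startup time is the assumption from the post, not a measured value, and the function name is made up for this sketch.

```python
def startup_overhead_seconds(total_rows, rows_per_batch, startup_seconds):
    """Total time spent only on starting batches, given a fixed per-batch cost."""
    batch_count = -(-total_rows // rows_per_batch)  # ceiling division
    return batch_count, batch_count * startup_seconds


# Figures from the post: 10,000,000 rows, 1,000 rows per batch, 0.1 s startup.
batch_count, overhead_seconds = startup_overhead_seconds(10_000_000, 1_000, 0.1)
overhead_minutes = overhead_seconds / 60
print(batch_count, round(overhead_seconds), round(overhead_minutes, 1))
```

With these numbers the batch count is 10,000 and the startup overhead comes to roughly 1,000 seconds, which matches the "over 16 minutes" figure.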
