ConradJvR's avatar

What is the best way to create a new non-null column in an existing table and populate that column with data?

I want to learn a bit more about best practices when it comes to migrations. Usually, when we create a new column in our tables, we use a migration to create the column and then populate the table inside the up method when the migration executes. But this feels risky: if something goes wrong during the execution of up, you are unable to use the down method to revert any changes the migration has already made.

So for my latest feature I opted to use an Artisan command to populate the new column with data. The command only needs to be run once; from there, the system will handle that column's data on create and update actions. But now there is a manual step on the initial release: running the command via Forge. Is this approach better, or what would your recommendations be?
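For context, a schema-only migration for this pattern might be sketched as follows (the evaluations table and code column come from the command shown later in the thread; the string type and nullability are assumptions for illustration):

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    // Add the column as nullable first; the Artisan command backfills it.
    // A follow-up migration could make it NOT NULL once backfill is done.
    public function up(): void
    {
        Schema::table('evaluations', function (Blueprint $table) {
            $table->string('code')->nullable();
        });
    }

    public function down(): void
    {
        Schema::table('evaluations', function (Blueprint $table) {
            $table->dropColumn('code');
        });
    }
};
```

Keeping the migration schema-only means down can always cleanly revert it, while the data backfill lives in the rerunnable command.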

Here is an example of my migration and command logic if that gives a clearer picture of what I'm doing

0 likes
11 replies
Snapey's avatar
Snapey
Best Answer
Level 122

This approach seems fine provided your generateCode function is non-destructive and deterministic.

What I mean is that running the command twice should have no negative effects. In particular, it should generate the same code each time for a given set of data. Suppose this code is some form of validation code that you send to users. If you send those out and then accidentally run the command again, you could change all the codes, and now all the links you sent out previously are broken.

One option is to check if the code is already set, and skip the record if so.

Imagine a scenario where you introduce some new code that bypasses the model observer. You end up with a load of records with no code. It would be nice to be able to run the command again.

I would put a progress indicator in your command.

1 like
ConradJvR's avatar

@Snapey cool, thanks for the response. I'll definitely add a check to only create codes for records that don't already have one; then we can safely rerun the command in the future without risking any overwrites.

And I'll have a look and do some research regarding the progress indicator; giving us a bit more information on the status of the command won't hurt.

mileswebhosting's avatar

Based on your current scenario:

  1. You can stick with your current strategy.
  2. Double-check that your generate:evaluation-codes command can be run multiple times without causing issues.
  3. Do try a progress bar for long-running commands.
1 like
ConradJvR's avatar

Thank you for your feedback. I had a look at the documentation and added the progress bar. I tried modifying the query to only read evaluations where the code field was null, but that does not seem to work well with the chunking, as a bunch of records were skipped.

Here is the latest version of my generate:evaluation-codes command based on your feedback

    public function handle()
    {
        $this->info('Starting evaluation code generation...');

        $evals = Evaluation::withTrashed()->get();

        $bar = $this->output->createProgressBar(count($evals));

        $bar->start();

        Evaluation::withTrashed()->chunk(1000, function($evaluations) use ($bar) {
            foreach ($evaluations as $evaluation) {
                if (!is_null($evaluation->code)) {
                    // Still advance the bar for skipped records,
                    // otherwise it never reaches 100%.
                    $bar->advance();
                    continue;
                }

                $evaluation->code = $evaluation->generateCode();
                $evaluation->save();

                $bar->advance();
            }
        });

        $bar->finish();
        $this->info("\nFinished generating evaluation codes");
    }
1 like
Glukinho's avatar

@ConradJvR here you retrieve all rows from the DB just to count them for the progress bar:

$evals = Evaluation::withTrashed()->get();
$bar = $this->output->createProgressBar(count($evals));

So your ->chunk(...) has no meaning and gives no benefit.

If you really have a lot of rows and want to process them in chunks, then do the counting at the DB level:

$evals_count = Evaluation::withTrashed()->count();
$bar = $this->output->createProgressBar($evals_count);

Also, filled() is a bit nicer and more understandable than !is_null(...), but this is a matter of taste:

if (!is_null($evaluation->code)) continue;  // :-(

if (filled($evaluation->code)) continue;  // :-)
1 like
krisi_gjika's avatar

@ConradJvR

Evaluation::withTrashed()
  ->whereNull('code')
  // since the inner closure modifies the condition we chunk by, we need to chunk by PK/ID
  ->eachById(function (Evaluation $evaluation) use ($bar) {

    $evaluation->code = $evaluation->generateCode();
    $evaluation->save();

    $bar->advance();
  });
2 likes
Glukinho's avatar

@krisi_gjika can you please elaborate a bit on this: "since the inner closure modifies the condition we chunk by, we need to chunk by PK/ID"? I don't see where anything is changed except $evaluation->code, which shouldn't affect chunking, as I understand it. Thanks!

krisi_gjika's avatar

@Glukinho exactly. By making code non-null we modify the whereNull('code') constraint the query depends on: we remove records from the result set.

Think of it like this: the query runs with LIMIT 10, we update those 10 records and set their code, then the second chunk runs with LIMIT 10, 10, skipping the first 10.

But the 10 rows that get skipped are not the same ones we processed before, because we removed what we processed from the result set. The rows that were skipped are valid records that we never process.

So you need a chunking method that does not rely on limits and offsets.
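The skipping effect above can be reproduced in a few lines of plain PHP. This is a hypothetical in-memory simulation, not Laravel: each "chunk" re-runs the whereNull-style filter and then pages into it with an offset, just like chunk() issues fresh LIMIT/OFFSET queries.

```php
<?php

// Ten records, all without a code yet.
$records = [];
for ($i = 1; $i <= 10; $i++) {
    $records[] = ['id' => $i, 'code' => null];
}

$chunkSize = 5;
$processed = [];

// Mimic chunk(): page N re-runs the filtered query with OFFSET N * size.
for ($page = 0; ; $page++) {
    // The whereNull('code') filter is re-evaluated on every page,
    // exactly like a fresh SQL query against the shrinking result set.
    $matching = array_values(array_filter($records, fn ($r) => $r['code'] === null));
    $chunk = array_slice($matching, $page * $chunkSize, $chunkSize);
    if ($chunk === []) {
        break;
    }
    foreach ($chunk as $row) {
        // Filling in the code removes the row from the next page's result set.
        $records[$row['id'] - 1]['code'] = 'CODE-' . $row['id'];
        $processed[] = $row['id'];
    }
}

// Page 2 started at OFFSET 5, but the 5 unprocessed rows had already
// shifted down into positions 0-4, so they were never seen.
echo implode(',', $processed), "\n"; // 1,2,3,4,5
$skipped = array_filter($records, fn ($r) => $r['code'] === null);
echo count($skipped), " rows skipped\n"; // 5 rows skipped
```

Half the rows are silently skipped. Chunking by primary key (chunkById / eachById) avoids this because each page is bounded by "id greater than the last seen id" rather than by an offset into the filtered set.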

1 like
Glukinho's avatar

@krisi_gjika Thanks a lot. Now I get it. But personally I don't like it; this is a complicated and unobvious place in the code. I would say retrieving all rows and skipping the ones that already have a code is much clearer, in spite of being more verbose:

Evaluation::withTrashed()->chunk(1000, function($evaluations) {
	foreach ($evaluations as $evaluation) {
		if (filled($evaluation->code)) continue;
		// ...
	}
});

There should be no performance issues unless OP has billions of rows, so it's not a problem to retrieve all rows. After all, this code is supposed to be executed only once.

ian_h's avatar

I'd strongly (personally) avoid putting data processing in migrations. IMO, migrations are for schema definitions, not data processing, despite the naming convention.

I use the OneTimeOperations package for this, alongside schema migrations. Run the migration to update the table, then run the operations to populate it. This can all be tied into your CI/CD pipeline too.

The operations work like migrations and populate an operations table so that they only run once, with the added benefit of being able to tag them (if needed) and to run them in sync or async mode (I use sync for local testing and async, the default, in prod).
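From memory, a one-time operation with that package looks roughly like the following. Treat this as a sketch rather than the package's exact API: the class name, Artisan command names, and property names are recalled from its documentation and may differ between versions.

```php
<?php
// Sketch only, based on the timokoerber/laravel-one-time-operations package.
// Typically created via `php artisan operations:make PopulateEvaluationCodes`
// and executed via `php artisan operations:process`; exact names may vary.

use TimoKoerber\LaravelOneTimeOperations\OneTimeOperation;

return new class extends OneTimeOperation
{
    // false = run synchronously; the package's default dispatches to the queue.
    protected bool $async = false;

    public function process(): void
    {
        // Backfill logic goes here, e.g. the whereNull + eachById loop
        // shown earlier in the thread.
    }
};
```

The operations table the package maintains plays the same role as the migrations table: it records which operations have run, so redeploys don't repeat the backfill.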

1 like
ConradJvR's avatar

Thanks again everyone for your advice, I learned a bunch of new things and added a few notes for me to read up on. Posting the final version of the generate:evaluation-codes command if it helps someone in the future.

    public function handle()
    {
        $this->info('Starting evaluation code generation...');

        $evals = Evaluation::withTrashed()->whereNull('code');

        $bar = $this->output->createProgressBar($evals->count());

        $bar->start();

        $evals->eachById(function (Evaluation $evaluation) use ($bar) {
            $evaluation->code = $evaluation->generateCode();
            $evaluation->save();

            $bar->advance();
        });

        $bar->finish();
        $this->info("\nFinished generating evaluation codes");
    }
