Be part of JetBrains PHPverse 2026 on June 9 – a free online event bringing PHP devs worldwide together.

kalvinmizzi's avatar

Laravel slow collection filtering

I've discovered that filtering large Laravel collections is inherently slow. You would think that because the operation occurs in memory that it would be faster than querying the database, but that is not the case. I have around 60,000 items in a collection, loaded from the database before a foreach loop. Each iteration of the foreach loop needs to filter by a "product id". I've found discussions on how you can use keyBy() and get() instead of where() to speed up performance

Reference:

https://marcus-obst.de/blog/slow-where-query-in-laravel-collections

https://laracasts.com/discuss/channels/laravel/working-on-collections-is-so-slow

However, by 60,000 items has duplicate keys for the filter condition I want. According to the Laravel documentation, "If multiple items have the same key, only the last one will appear in the new collection".

So I'm lost as to how to efficiently return a smaller collection of filtered results from a bigger collection in an efficient way. Right now it takes about 2 seconds (And there are 20,000 different keys to iterate through, so you can how this becomes a very slow operation)

Pseudo code:

Standard way

$productIds = $this->repository->getProductIds(); // 20,000 results
$productInventory = $this->repository->getInventoryForProducts($productIds->toArray()); // 60,000 results
$productIds->each(function (int $productId) use ($productInventory) {
	$productInventoryForProduct = $productInventory->where('product_id', $productId); // ~2 seconds
}

Using keyId()

$productIds = $this->repository->getProductIds(); // 20,000 results
$productInventory = $this->repository->getInventoryForProducts($productIds->toArray())->keyBy('product_id'); // 60,000 results
$products->each(function (int $productId) use ($productInventory) {
	$productInventoryForProduct = $productInventory->get($productId); // fast... but only returns one record
  // Processing of $productInventoryForProduct
}
0 likes
22 replies
Tray2's avatar

Filter them in the database instead, it is much faster, and that is what the database is for.

Tray2's avatar

@kmizzi No, you are wrong, what you are doing now is fetching all the product ids (20000), then you get 60000 records, then you loop over those 20000 records and fetch another huge amount of data.

That is insane, show us what your data looks like and the desired result, and we can help your write a single query that does it for you.

kalvinmizzi's avatar

I just found a way that is fast, but maybe not the most elegant. Curious if there is a better way (although this could work):

$productIds = $this->repository->getProductIds(); // 20,000 results
$productInventory = $this->repository->getInventoryForProducts($productIds->toArray()); // 60,000 results

$productsInventoryData = [];

$productsInventory->each(function(ProductInventory $productInventory) use (&$productsInventoryData) {
  $productsInventoryData[$productInventory->product_id][] = $productInventory
});

$productIds->each(function (int $productId) use ($productInventoryData) {
  $productInventoryForProduct = $productInventoryData[$productId];
  // Processing of $productInventoryForProduct
}
rodrigo.pedra's avatar

The pseudocode does not help, as we can't tell what you really want to achieve to propose an alternative.

  • Where the $products variable come from?
  • Why $productsIds is retrieved if never used? (this might be the same as the variable above, but it is not clear)
  • How would you later use $productInventoryForProduct?

Please either provide your actual code, or describe the intention, so we can better assess how to help you out.

kalvinmizzi's avatar

@rodrigo.pedra My bad, I had an error with my pseudocode (updated now) $productInventoryForProduct usage I don't think is relevant other than to say it is important to the next step within the each loop (just in order to isolate the real issue).

Snapey's avatar

@kmizzi never mind the speed, have you checked out how much memory you are using?

Doing such large filters in memory is an insane idea

rodrigo.pedra's avatar
Level 56

@kmizzi

As @snapey said, it is not a good idea to load 20,000 records in memory to process it. But locally, it didn't take two seconds iterating over all the collection using Collection@groupBy:

<?php

use App\Models\Product;
use App\Models\ProductInventory;
use Illuminate\Support\Facades\Artisan;

Artisan::command('test', function () {
    $this->info(now()->toDateTimeString());

    $productIds = Product::query()->pluck('id');

    $productInventory = ProductInventory::query()
        ->whereIn('product_id', $productIds)
        ->get()
        ->groupBy('product_id');

    foreach ($productIds as $productId) {
        $productInventoryForProduct = $productInventory[$productId] ?? collect([]);

        $this->info(
            vsprintf('[%s] %05d: [%s]', [
                now()->toDateTimeString(),
                $productId,
                $productInventoryForProduct->pluck('id')->implode(','),
            ]),
        );
    }

    $this->info(now()->toDateTimeString());
});

Output:

$ php artisan test
2022-12-03 18:52:46
[2022-12-03 18:52:46] 00001: [1,2,3]
[2022-12-03 18:52:46] 00002: [4,5,6]
[2022-12-03 18:52:46] 00003: [7,8,9]
[2022-12-03 18:52:46] 00004: [10,11,12]
[2022-12-03 18:52:46] 00005: [13,14,15]
[2022-12-03 18:52:46] 00006: [16,17,18]
[2022-12-03 18:52:46] 00007: [19,20,21]
[2022-12-03 18:52:46] 00008: [22,23,24]
[2022-12-03 18:52:46] 00009: [25,26,27]
[2022-12-03 18:52:46] 00010: [28,29,30]
... truncated for brevity
[2022-12-03 18:52:47] 19991: [59971,59972,59973]
[2022-12-03 18:52:47] 19992: [59974,59975,59976]
[2022-12-03 18:52:47] 19993: [59977,59978,59979]
[2022-12-03 18:52:47] 19994: [59980,59981,59982]
[2022-12-03 18:52:47] 19995: [59983,59984,59985]
[2022-12-03 18:52:47] 19996: [59986,59987,59988]
[2022-12-03 18:52:47] 19997: [59989,59990,59991]
[2022-12-03 18:52:47] 19998: [59992,59993,59994]
[2022-12-03 18:52:47] 19999: [59995,59996,59997]
[2022-12-03 18:52:47] 20000: [59998,59999,60000]
2022-12-03 18:52:47

For reference, I created a dummy project with a products and a product_inventories tables and seeded them with:

<?php

namespace Database\Seeders;

use App\Models\Product;
use App\Models\ProductInventory;
use Illuminate\Database\Seeder;

class DatabaseSeeder extends Seeder
{
    public function run()
    {
        Product::factory(20_000)
            ->has(ProductInventory::factory()->count(3), 'inventory')
            ->create();
    }
}
1 like
rodrigo.pedra's avatar

Also, I used a SQLite database for easiness on the example above.

I also just tested with MySQL and got the same results (2 seconds on everything).

I added

$this->info(memory_get_peak_usage(true) / 1024 / 1024);

To the end, and got around 150 MB on MySQL., and 130 MB on SQLite.

But note my ProductInventory model only has a name and quantity columns, besides product_id, id and the defaults timestamps fields.

Also, my Product model has only name column (besides id and timestamps) but that is irrelevant as I only fetch their IDs.

kalvinmizzi's avatar

@rodrigo.pedra Thank you! This is the much more elegant solution to the speed solution that I posted. I'll post the updated pseudocode using this improvement

1 like
rodrigo.pedra's avatar

@kmizzi

If you have a memory constraint, consider using Builder@lazyById():

<?php

use App\Models\Product;
use Illuminate\Support\Facades\Artisan;

Artisan::command('test', function () {
    $this->info(now()->toDateTimeString());

    $products = Product::query()
        ->with(['inventory'])
        ->lazyById();

    foreach ($products as $product) {
        $this->info(
            vsprintf('[%s] %05d: [%s]', [
                now()->toDateTimeString(),
                $product->getKey(),
                $product->inventory->pluck('id')->implode(','),
            ]),
        );
    }

    $this->info(now()->toDateTimeString());

    $this->info(memory_get_peak_usage(true) / 1024 / 1024);
});

It will fetch 1,000 products a time and eager load their inventory. So you would have much fewer models on the memory at a time.

  • On SQLite, total execution was 2 seconds and memory went down to 36 MB.
  • On MySQL, total execution was 3 seconds and memory went down to 38 MB.

reference: https://laravel.com/docs/9.x/eloquent#chunking-using-lazy-collections

kalvinmizzi's avatar

Here is the entire context... after I did the preprocessing before the chunk each, it is now very fast, but running out of memory :/:

<?php

namespace App\Actions;

use App\Helpers;
use App\Models\InventoryMovement;
use App\Models\InventorySnapshot;
use App\Repositories\InventoryMovementRepository;
use App\Repositories\InventorySnapshotRepository;
use Illuminate\Support\Facades\DB;

class TakeInventorySnapshot
{
    /**
     * @var array
     */
    protected array $productIds;

    protected InventoryMovementRepository $inventoryMovementRepository;
    protected InventorySnapshotRepository $inventorySnapshotRepository;

    /**
     *
     * @param array $productIds
     */
    public function __construct(array $productIds = [])
    {
        $this->inventoryMovementRepository = app(InventoryMovementRepository::class);
        $this->inventorySnapshotRepository = app(InventorySnapshotRepository::class);
        $this->productIds = $productIds;
    }

    public function handle()
    {
        $this->inventorySnapshotRepository->seed();

        foreach ($this->inventorySnapshotRepository->getProductIdsWithInvalidCache($this->productIds)->chunk(10000) as $chunk)
        {
            DB::transaction(function () use ($chunk) {
                /*
                 * The following datasets get preloaded from the database 10,000 products at a time so that we don't
                 * overwhelm the database with reads for each product
                 */
                $validSnapshots = $this->inventorySnapshotRepository->getValidInventorySnapshots($chunk->toArray());
                $validSnapshotsData = [];
                $validSnapshots->each(function (InventorySnapshot $inventorySnapshot) use (&$validSnapshotsData) {
                    $validSnapshotsData[$inventorySnapshot->product_id][] = $inventorySnapshot;
                });
                unset($validSnapshots);

                $invalidSnapshots = $this->inventorySnapshotRepository->getInvalidProductSnapshots($chunk->toArray());
                $invalidSnapshotsData = [];
                $invalidSnapshots->each(function(InventorySnapshot $inventorySnapshot) use(&$invalidSnapshotsData) {
                    $invalidSnapshotsData[$inventorySnapshot->product_id][] = $inventorySnapshot;
                });

                $startDate = Helpers::utcStartOfLocalDate($invalidSnapshots->min('date'));
                $endDate = Helpers::utcEndOfLocalDate($invalidSnapshots->max('date'));
                unset($invalidSnapshots);

                //Return inventory movements starting at start date and ending at end date
                $inventoryMovements = $this->inventoryMovementRepository->getInventoryMovementsDailySummaryWithinRange($startDate, $endDate, $chunk->toArray());

                $inventoryMovementsData = [];
                $inventoryMovements->each(function (InventoryMovement $inventoryMovement) use (&$inventoryMovementsData) {
                    $inventoryMovementsData[$inventoryMovement->product_id][] = $inventoryMovement;
                });
                unset($inventoryMovements);

                $data = [];
                $chunk->each(function (int $productId) use (&$bulkData, $validSnapshotsData, $inventoryMovementsData, &$data, $invalidSnapshotsData) {
                    $invalidSnapshotsForProduct = collect($invalidSnapshotsData[$productId]);
                    $startDate = Helpers::utcStartOfLocalDate($invalidSnapshotsForProduct->min('date'));
                    $endDate = Helpers::utcEndOfLocalDate($invalidSnapshotsForProduct->max('date'));

                    $inventoryMovementsForProduct = collect($inventoryMovementsData[$productId])
                        ->whereBetween('date', [$startDate, $endDate]);

                    $productSnapshot = isset($validSnapshotsData[$productId]) ? collect($validSnapshotsData[$productId])
                        ->where('date', '<=', $startDate->clone()->subDay())
                        ->sortBy('date', SORT_DESC)
                        ->last() : null;

                    $tally = [
                        'inventory_available' => $productSnapshot?->inventory_available ?? 0,
                        'inventory_reserved' => $productSnapshot?->inventory_reserved ?? 0,
                        'inventory_in_transit' => $productSnapshot?->inventory_in_transit ?? 0,
                        'inventory_stock_value' => $productSnapshot?->inventory_stock_value ?? 0,
                    ];

                    $inventoryMovementsForProduct->each(function (InventoryMovement $inventoryMovement) use (&$data, $productSnapshot, $productId, &$tally)
                    {
                        switch ($inventoryMovement->inventory_status)
                        {
                            case InventoryMovement::INVENTORY_STATUS_ACTIVE:
                                $tally['inventory_available'] += $inventoryMovement->total_quantity;
                                break;
                            case InventoryMovement::INVENTORY_STATUS_RESERVED:
                                $tally['inventory_reserved'] += $inventoryMovement->total_quantity;
                                break;
                            case InventoryMovement::INVENTORY_STATUS_IN_TRANSIT:
                                $tally['inventory_in_transit'] += $inventoryMovement->total_quantity;
                                break;
                        }

                        $data[$inventoryMovement->date][$productId]['inventory_available'] = $tally['inventory_available'];
                        $data[$inventoryMovement->date][$productId]['inventory_reserved'] = $tally['inventory_reserved'];
                        $data[$inventoryMovement->date][$productId]['inventory_in_transit'] = $tally['inventory_in_transit'];


                        if (!isset($data[$inventoryMovement->date][$productId]['inventory_stock_value']))
                        {
                            $data[$inventoryMovement->date][$productId]['inventory_stock_value'] = $tally['inventory_stock_value'];
                        }

                        $data[$inventoryMovement->date][$productId]['inventory_stock_value'] += $inventoryMovement->inventory_value;
                    });
                });

                foreach ($data as $date => $products)
                {
                    foreach ($products as $product_id => $inventory_data)
                    {
                        $bulkData[] = [
                            'is_cache_valid' => 1,
                            'date' => $date,
                            'product_id' => $product_id,
                            'inventory_available' => $inventory_data['inventory_available'] ?? 0,
                            'inventory_reserved' => $inventory_data['inventory_reserved'] ?? 0,
                            'inventory_in_transit' => $inventory_data['inventory_in_transit'] ?? 0,
                            'inventory_stock_value' => $inventory_data['inventory_stock_value'],
                        ];
                    }
                }

                if (empty($bulkData))
                {
                    return;
                }

                $this->inventorySnapshotRepository->saveBulk($bulkData);
            });
        }
    }
}
justinjerez's avatar

I don't know if this is what you're looking for, but earlier I was reading an article from James Bannister about using generator for pagination, in his example he uses generators and LazyCollection, hope it can help.

kalvinmizzi's avatar

I got it to run successfully by unsetting variables to clear memory throughout the script and upping the memory_limit.

The scale of the number of records would only be high on the first run of this script. Subsequent runs will only process newly invalidated cache records. So for the scale of data, the whole thing executes in 3.5 minutes and 1.43 GB of memory which I don't think is bad considering everything.

Subsequent run after initial ran in just 2 seconds!

kalvinmizzi's avatar

Here is the updated pseudocode that reads nicely and operates fast! Thank to @rodrigo.pedra for the groupBy suggestion:

$productIds = $this->repository->getProductIds();
$productInventory = $this->repository
	->getInventoryForProducts($productIds->toArray())
	->groupBy('product_id');

$productIds->each(function (int $productId) use ($productInventory) {
  $productInventoryForProduct = $productInventory[$productId];
  // Processing of $productInventoryForProduct
}

1 like
rodrigo.pedra's avatar

@kmizzi

See my latest response on using Builder@lazyById() to get memory usage down.

As you are using the repository pattern, you might need to add a new method to your product's repository.

1 like
kalvinmizzi's avatar

@rodrigo.pedra Thank you! I'm trying to grasp it within the more complex context that I'm working in. Will update with the results!

1 like
kalvinmizzi's avatar

@rodrigo.pedra The challenge I'm seeing with loading the inventory movements with the products (if you look at the full context) is that there are a LOT of inventory movements for each product. So that is why I'm doing something like this:

//Return inventory movements starting at start date and ending at end date
$inventoryMovements = $this->inventoryMovementRepository
  ->getInventoryMovementsDailySummaryWithinRange($startDate, $endDate, $chunk->toArray())
  ->groupBy('product_id');

Where the start date and end date is just the range of the invalidated caches needing updates (as opposed to all inventory movments).

1 like
kalvinmizzi's avatar

I did try using cursors and lazyById and it slows things down significantly. I think because only the first time around has large data needs, I'm happy to sacrifice the memory usage for performance. Subsequent runs would not use a lot of memory (or time)

1 like
rodrigo.pedra's avatar

@kmizzi no problem.

That is precisely why I asked for more contexts.

For a specific use case, we can make a trade-off on resources and performance.

The best is that you could get a performance boost for your use-case. Have a nice day =)

1 like
kalvinmizzi's avatar

Note I was able to improve things even further by using mysql CASE statements to return less "inventoryMovement" records.

public function getInventoryMovementsDailySummaryWithinRange(Carbon $startDate, Carbon $endDate, array $productIds): EloquentCollection
    {
        $inventoryMovements = InventoryMovement::query()
            ->select([
                'inventory_movements.product_id',
            ])
            ->selectRaw("
            CONVERT_TZ(
                DATE_FORMAT(
                    CONVERT_TZ(
                        inventory_movement_date, 'UTC', '" . Helpers::setting(Setting::KEY_DEFAULT_TIMEZONE) . "'
                    ), '%Y-%m-%d'
                ), '" . Helpers::setting(Setting::KEY_DEFAULT_TIMEZONE) . "', 'UTC'
            ) as date")
            ->selectRaw('SUM(CASE WHEN inventory_status = "' . InventoryMovement::INVENTORY_STATUS_ACTIVE . '" THEN quantity ELSE 0 END) as inventory_available')
            ->selectRaw('SUM(CASE WHEN inventory_status = "' . InventoryMovement::INVENTORY_STATUS_RESERVED . '" THEN quantity ELSE 0 END) as inventory_reserved')
            ->selectRaw('SUM(CASE WHEN inventory_status = "' . InventoryMovement::INVENTORY_STATUS_IN_TRANSIT . '" THEN quantity ELSE 0 END) as inventory_in_transit')
            ->selectRaw('IFNULL(SUM(quantity * (fifo_layers.total_cost / fifo_layers.original_quantity)), 0) as inventory_stock_value')
            ->leftJoin('fifo_layers', function(JoinClause $join) {
                $join->on('inventory_movements.layer_id', 'fifo_layers.id')
                    ->where('layer_type', FifoLayer::class);
            })
            ->whereBetween('inventory_movement_date', [$startDate, $endDate])
            ->groupBy([
                DB::raw("DATE_FORMAT(CONVERT_TZ(inventory_movement_date, 'UTC', '" . Helpers::setting(Setting::KEY_DEFAULT_TIMEZONE) . "'), '%Y-%m-%d')"),
                'inventory_movements.product_id',
            ]);

        if (!empty($productIds))
        {
            $inventoryMovements->whereIn('inventory_movements.product_id', $productIds);
        }

        return $inventoryMovements->get();
    }

The next challenge of optimization is that I'm still using a loop in memory to build the tally:

$inventoryMovementsForProduct->each(function (InventoryMovement $inventoryMovement) use (&$data, $productSnapshot, $productId, &$tally)
                {
                    $tally['inventory_available'] += $inventoryMovement->inventory_available;
                    $tally['inventory_reserved'] += $inventoryMovement->inventory_reserved;
                    $tally['inventory_in_transit'] += $inventoryMovement->inventory_in_transit;
                    $tally['inventory_stock_value'] += $inventoryMovement->inventory_stock_value;

                    $data[] = [
                        'date' => $inventoryMovement->date,
                        'product_id' => $inventoryMovement->product_id,
                        'is_cache_valid' => 1,
                        'inventory_available' => $tally['inventory_available'],
                        'inventory_reserved' => $tally['inventory_reserved'],
                        'inventory_in_transit' => $tally['inventory_in_transit'],
                        'inventory_stock_value' => $tally['inventory_stock_value'],
                    ];
                });

I've been reading up on the use of SUM over in MySQL with Common Table Expressions: https://stackoverflow.com/questions/2563918/create-a-cumulative-sum-column-in-mysql

I could see how that could work in a simpler case but the complication in my context is that we do not want to load ALL inventory movements to tally, but rather we want to utilize any prior valid snapshots as a starting point to make things faster, which is what I'm trying to do here:

$productSnapshot = isset($validSnapshots[$productId]) ? collect($validSnapshots[$productId])
                    ->where('date', '<=', $startDate->clone()->subDay())
                    ->sortBy('date', SORT_DESC)
                    ->last() : null;

If I can figure out how to do more of this in MySQL it would save even more memory and time.

Please or to participate in this conversation.