Mar 15, 2025

Level 3

It takes a long time to perform count()

I have this in laravel:

// Getting the request filters
$cnaes = $request->input('cnae', []); // Selected CNAEs
$location = $request->input('location', []); // Location
$state = $request->input('state', ''); // State
$startDate = $request->input('startDate', ''); // Start date
$endDate = $request->input('endDate', ''); // End date
$status = $request->input('status', ''); // Status
$size = $request->input('size', ''); // Company size
$legalNature = $request->input('legalNature', []); // Legal nature
$capital = $request->input('capital', ''); // Share Capital
$mei = $request->input('mei', ''); // MEI option

// Start the query
$query = Establishment::query();

// Filtering by CNAE
if (!empty($cnaes)) {
$query->whereIn('cnae_principal', $cnaes);
}

// Filtering by location
if (!empty($location)) {
$query->whereIn('logradouro_municipio', $location);
}

// Filtering by state
if (!empty($state)) {
$query->where('logradouro_uf', $state);
}

$query->whereBetween('data_abertura_corrigida', [$startDate, $endDate]);

// Filtering by status
if (!empty($status)) {
$query->where('registration_status', $status);
}

// Filtering by size
if (!empty($size)) {
$query->whereHas('empresa', function($query) use ($size) {
$query->where('company_size', $size);
});
}

// Filtering by legal nature

if (!empty($legalNature)) {
$query->whereHas('empresa', function($query) use ($legalNature) {
$query->whereIn('legal_nature', $legalNature);
}); }

// Filtering by share capital
if (!empty($capital)) {
// Example: Filtering by capital ranges
if ($capital == '1') {
$query->whereHas('empresa', function($query) use ($capital) {
$query->where('capital_social_num', '<=', 10000);
});

} elseif ($capital == '2') {
$query->whereHas('empresa', function($query) use ($capital) {
$query->whereBetween('capital_social_num', [10001, 100000]);
});

} elseif ($capital == '3') {

$query->whereHas('empresa', function($query) use ($capital) {
$query->where('capital_social_num', '>', 100000);
});

}
}

if (!empty($mei)) {
$query->whereHas('empresa.simples', function ($query) use ($mei) {
$query->where('simples.opcao_pelo_mei', $mei);
});
}

// Returning the quantity or other data as needed
return response()->json(['quantidade' => $query->count()]);

the idea is to query the data according to the filters selected by the user. However, the database contains over 64 million records, and the way I'm doing it is extremely slow, since my intention is just to count the numbers. And in some cases, it seems that the more filters, the slower it gets (but not always), I don't understand this behavior.

Could anyone give me any suggestions for improvement?

note: All columns are properly indexed

jlrdw

1 year ago

Level 75

Do you have indexes set up on your search fields. Perhaps you need to drill down more.

1 like

mvnobrega

1 year ago

Level 3

@jlrdw

I have all the indexes yes

jdc1898

1 year ago

Level 21

You have a where between in there on dates. My initial thought is a partition. As the size of the data grows, this problem will increase. You’re going to need to find a way to access the data faster through index’s partitions or other means.

1 like

mvnobrega

1 year ago

Level 3

@jdc1898

The problem doesn't seem to be that, since by default all searches contain a data range. But if I include the company's status as "Active" for example, it takes longer than just the data range. I don't understand this, since it's a more restricted query, and it should be faster.

JussiMannisto

1 year ago

Level 50

@mvnobrega This is completely expected if you have separate indexes on each column. Think about it this way:

select count(*) from sometable where created_at between '2024-01-01 00:00:00' and '2024-12-31 23:59:59';

Here you're performing a range scan on the indexed created_at column. Most databases store indexes as B+tree structures. The storage engine traverses the tree to find the first matching node, then scans through the (ordered) leaf nodes to count the rows until it reaches the last matching node. That's very efficient.

select count(*) from sometable where status = 'Active' and created_at between '2024-01-01 00:00:00' and '2024-12-31 23:59:59';

Now things get more complicated. The range scan is still needed, but so are the status values, and the information is in separate indexes. The query optimizer has several options, each of which is sub-optimal:

Do a range scan on the created_at index like in the first query, then look up every matching row from the table and filter them by status for counting. This is inefficient if the result set is large.
Do a scan on the status index, fetch the matching rows and filter them by the created_at value. This only makes sense if status has better selectivity than created_at, which seems unlikely. This is inefficient for the same reason as 1.
Some databases might scan both indexes and merge the results using index intersection, but that can be inefficient. One index is usually picked instead.

The way to "fix" this particular query would be to create a composite index such as (created_at, status). The order of columns is important. That would still allow an efficient range scan while filtering out non-matching status values at the index level, without having to fetch any rows from the table.

jdc1898

1 year ago

Level 21

Is there an index on registration status ?

If you run the query on the DB not in code is it slow?

Can that column be an enum?

Tray2

1 year ago

Level 73

@mvnobrega I would suggest that you instead of the count, add ->toSql() and then run explain on the SQL you got back.

Replace the SELECT with the result of the ->toSql() call.

EXPLAIN SELECT COUNT(*) FROM table1;

jlrdw

1 year ago

Level 75

Also look at it like this.

You have a database with 5000 customers and say 25000 records, you do a monthly accounts receivable report, it will be fast.

Now doing the same on a large (millions of records) of course is longer.

I have never had to, but look up partitioning a mysql database, might help. But also learn how to use two databasess, archive completed stuff and only have what is needed in an active database.

Also only query what is needed, and have good indexes.

Another technique is combining results.

Like in accounts receivables, I could query A through C, D through F, etc.

It takes some trial and error sometimes to see what works the best.

For that many records, and just my opinion, I would write regular queries and not use eloquent.

You could also consider moving to a larger scale database like big data, or lookup what companies like Google, fedex, UPS, etc use.

1 like

JussiMannisto

1 year ago

Level 50

It's crucial to analyze the queries as @tray2 suggested. One immediate issue I see is those repeated whereHas queries on the empresa table. Each of those results in a separate subquery, and they could be combined in a single whereHas call.

note: All columns are properly indexed

Does this mean each column has its own separate index? That might not be enough with a table this big. The query optimizer usually picks a single index to use in the query per table, based on statistics and heuristics. You could get a significant performance boost by adding composite indexes tailored to your queries.

Some databases support index merging to utilize multiple separate indexes, but that comes with some limitations. If individual conditions select large numbers of rows ("low selectivity"), then merging those can be computationally expensive and take a lot of memory. Usually the optimizer just picks a single index.

One minor note about EXPLAIN: When you call ->toSql() on the query builder, it returns the SQL query with placeholder parameters. You can call ->getBindings() on the builder to get the values to use in the EXPLAIN statement. You can place the values directly in the SQL, but it might produce an unrealistic result since Eloquent uses prepared statements, and their execution plans are generated without any parameters present. If parameters are given in the SQL straight up, the optimizer might make further optimizations based on the values, resulting in a different execution plan.

Here's an example on how to run EXPLAIN with a prepared statement on MySQL/MariaDB:

PREPARE stmt FROM 'EXPLAIN SELECT COUNT(*) FROM establishments WHERE status = ? AND created_at > ?';
EXECUTE stmt USING 'active', '2025-01-01 00:00:00';

By the way, which database are you using? With databases as big as this, configuration is important.

Ps. Don't forget the most important optimization of all: adding indents to your PHP code 🙂

Please or to participate in this conversation.