Napo7's avatar
Level 15

Optimizing a search by "partial words"

Hi! I've got a Products table that contains, of course, product names.

In my app, I want to allow the user to search by partial name words, in whatever order they type them. Let's imagine I've got a product named "cookie with chocolate chips" in my table. The user can perform a search by typing "choc cook" into the search form, which returns this exact product, but the MySQL performance is not so great:

Behind the scenes, the query is split into "tokens", and each token is put in the WHERE clause: name LIKE '%cook%' AND name LIKE '%choc%'. The main concern is that MySQL can't use the index on name because of the wildcard at the beginning of the LIKE pattern. It would use the index if the WHERE clause was LIKE 'choc%', but then the search wouldn't return the cookie anymore!
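What the current code does, roughly (a sketch — the real query-building code isn't shown in the post, so the helper name is mine):

```python
def build_where(query):
    # Each token becomes a "name LIKE '%token%'" condition. The leading
    # '%' is what prevents MySQL from using the index on name.
    tokens = query.split()
    clause = " AND ".join("name LIKE ?" for _ in tokens)
    params = ["%{}%".format(t) for t in tokens]
    return clause, params

clause, params = build_where("choc cook")
# clause == 'name LIKE ? AND name LIKE ?', params == ['%choc%', '%cook%']
```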

One way to resolve this seems to be a fulltext index, but I'm annoyed since fulltext indexes are not supported on SQLite (the DB I use for the tests I write...)

I thought of another way and I'd like to discuss it before testing it: splitting each word (token) of a product's name into a separate table, like so:

product_id | word
1 | cookie
1 | with
1 | chocolate
1 | chips
2 | ....

which would allow using LIKE 'cook%', but each update on a product whose name has changed would require updating many records in this table... Perhaps not so performance-proof on the update side, but the search query runs many times more often than updates!
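A minimal runnable sketch of this word-table idea, using Python's built-in sqlite3 (table, index, and function names are my own, for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_words (product_id INTEGER, word TEXT);
    CREATE INDEX idx_product_words_word ON product_words (word);
""")

def save_product(conn, pid, name):
    # Keep product_words in sync: drop the old tokens, insert the new ones.
    conn.execute("INSERT OR REPLACE INTO products (id, name) VALUES (?, ?)",
                 (pid, name))
    conn.execute("DELETE FROM product_words WHERE product_id = ?", (pid,))
    conn.executemany("INSERT INTO product_words (product_id, word) VALUES (?, ?)",
                     [(pid, w) for w in name.lower().split()])

def search(conn, query):
    tokens = query.lower().split()
    # Prefix-only LIKE ('cook%') can use the index on word. A product must
    # match at least as many distinct words as there are tokens (a cheap
    # approximation of "every token matched").
    sql = ("SELECT p.id, p.name FROM products p "
           "JOIN product_words pw ON pw.product_id = p.id "
           "WHERE " + " OR ".join("pw.word LIKE ?" for _ in tokens) +
           " GROUP BY p.id, p.name HAVING COUNT(DISTINCT pw.word) >= ?")
    return conn.execute(sql, [t + "%" for t in tokens] + [len(tokens)]).fetchall()

save_product(conn, 1, "cookie with chocolate chips")
results = search(conn, "choc cook")  # [(1, 'cookie with chocolate chips')]
```

Note the HAVING count check is an approximation: a product with two words matching one token could slip through, so a final exact filter on the short result set may still be useful.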

What are your thoughts ?

LaryAI's avatar
Level 58

Creating a separate table to store individual words from product names can indeed improve search performance, as you can then use an index on the word column to speed up searches. However, as you mentioned, this approach can lead to increased complexity when updating product names.

Here's a potential solution that balances search performance with ease of maintenance:

  1. Create a new table to store individual words, as you suggested:
CREATE TABLE product_words (
    product_id INT,
    word VARCHAR(255),
    FULLTEXT(word)
);
  2. Whenever a product name is inserted or updated, split the name into words and insert/update the corresponding entries in the product_words table. You can do this in your application code or use a database trigger.

  3. Use a full-text search when performing the search query:

SELECT p.*
FROM products p
JOIN product_words pw ON p.id = pw.product_id
WHERE MATCH(pw.word) AGAINST('+cook* +choc*' IN BOOLEAN MODE)
GROUP BY p.id;

This query uses the MATCH() ... AGAINST() syntax for full-text search, which is much faster than LIKE with a leading wildcard. The + sign indicates that the word must be present, and the * is a wildcard that allows for partial matches.
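On the query side, the boolean-mode string can be built from the raw user input; a tiny sketch (the function name is mine):

```python
def to_boolean_mode(query):
    # "+" makes each token required, "*" allows prefix matches,
    # e.g. "choc cook" -> "+choc* +cook*".
    return " ".join("+{}*".format(token) for token in query.split())

expr = to_boolean_mode("choc cook")  # '+choc* +cook*'
```

One caveat to check in MySQL: the full-text minimum token length (3 for InnoDB by default) means very short search tokens may be ignored.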

Regarding the issue with SQLite not supporting full-text indexes, you could consider using conditional logic in your application to switch between search strategies based on the database driver being used. For instance, use full-text search with MySQL and fall back to a less efficient method with SQLite.

Here's an example in pseudo-code:

if (usingMySQL()) {
    // Perform full-text search
} else if (usingSQLite()) {
    // Perform a less efficient search, perhaps using LIKE with wildcards
}

This way, you can still run your tests with SQLite, accepting that the search will be less efficient, but in production with MySQL, you'll benefit from the full-text search performance.
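Fleshing out that pseudo-code as a sketch (here in Python with sqlite3 so the fallback branch is runnable; the MySQL branch and the driver argument are assumptions about how your app detects its connection):

```python
import sqlite3

def search_products(conn, query, driver):
    # Dispatch on the database driver: full-text on MySQL,
    # slower LIKE-with-wildcards fallback on SQLite.
    tokens = query.split()
    if driver == "mysql":
        sql = ("SELECT p.* FROM products p "
               "JOIN product_words pw ON p.id = pw.product_id "
               "WHERE MATCH(pw.word) AGAINST(%s IN BOOLEAN MODE) "
               "GROUP BY p.id")
        args = (" ".join("+{}*".format(t) for t in tokens),)
    else:
        sql = ("SELECT * FROM products WHERE " +
               " AND ".join("name LIKE ?" for _ in tokens))
        args = tuple("%{}%".format(t) for t in tokens)
    return conn.execute(sql, args).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO products VALUES (1, 'cookie with chocolate chips')")
rows = search_products(conn, "choc cook", driver="sqlite")
# rows == [(1, 'cookie with chocolate chips')]
```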

Remember to handle the synchronization between the products table and the product_words table carefully to ensure data consistency. This includes handling deletions and updates appropriately.

Napo7's avatar
Level 15

@Snapey Thanks for your reply. I've heard of Scout, but I want to avoid external services as much as possible (each external service SLA added to your product reduces your own product's SLA...). So I'd like to avoid Meilisearch and Algolia... Currently, Scout ships with Algolia, Meilisearch, and MySQL / PostgreSQL (database) drivers. That leaves us MySQL and PostgreSQL, which could be a possibility, but I already imagine that unit testing against the SQLite DB will not be supported!

And to add a small drop to the problem, I forgot to mention that I have, in fact, two products tables:

  • The products that "belong" to the user: his own product catalog.
  • An "external catalog" that contains all the products that many users can purchase from different suppliers. When a user searches for a product, the search is in fact performed against both catalogs, with some logic that UNIONs both products tables, excluding products from the external catalog that have been imported into the user's own catalog ;) That seems a bit overcomplicated, but it allows managing 200K+ external products shared by all users without having to copy them into each user's catalog (and the same goes for external catalog updates: name, prices, etc...)
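The two-catalog search described above could be sketched like this (all table and column names are my guesses — e.g. an external_id column on the user's products marking which external product was imported):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_products (id INTEGER PRIMARY KEY, user_id INTEGER,
                                name TEXT, external_id INTEGER);  -- set when imported
    CREATE TABLE external_products (id INTEGER PRIMARY KEY, name TEXT);
""")
conn.execute("INSERT INTO user_products VALUES (1, 42, 'my own cookie', NULL)")
conn.execute("INSERT INTO user_products VALUES (2, 42, 'imported chocolate bar', 10)")
conn.executemany("INSERT INTO external_products VALUES (?, ?)",
                 [(10, "chocolate bar"), (11, "vanilla wafer")])

# User's own catalog, UNIONed with external products this user
# has not already imported.
sql = """
    SELECT name FROM user_products WHERE user_id = ?
    UNION
    SELECT name FROM external_products
    WHERE id NOT IN (SELECT external_id FROM user_products
                     WHERE user_id = ? AND external_id IS NOT NULL)
"""
rows = [r[0] for r in conn.execute(sql, (42, 42))]
```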

Weighing pros and cons, it seems that my custom solution of "splitting words" into a table could probably be the least bad solution to my problem!

Snapey's avatar

@Napo7 don't forget that when you find matching words, you still need to check whether the words belong to the same product, probably grouping by product_id and counting the number in each group

Napo7's avatar
Level 15

@Snapey That might be a new feature, allowing searches by OR or AND keywords ;) ...like search engines did in the beginning of the Internet era ;)

Remember the way it worked before... +this +that -thisone :)

And since the query returns only a small subset of both tables, it might be cheaper to apply a final filter on that subset, whether on the MySQL or the PHP side!?

Snapey's avatar

@Napo7 I wasn't thinking about 'or'.

Consider a search for "cookies with chocolate chips".

You might get a count of three from a word search finding "chocolate chip cookies", and put this at the top of your results.

But you could get a count of two from "choccy chip cookie" or from "chocolate chip cookie", and these could be relevant matches, ranked lower.

Napo7's avatar
Level 15

@Snapey Yeah, that's a cool feature... Perhaps also use SOUNDEX instead of a LIKE, to match similar-sounding words!

But I love your recommendation: group by product id, count the number of matching words, and sort by that count descending ;)
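That ranking idea — group by product, count matching words, order by the count — in a runnable SQLite sketch (schema names are mine; the tokens are roughly stemmed to reproduce Snapey's example counts):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_words (product_id INTEGER, word TEXT);
""")
catalog = {1: "chocolate chip cookies", 2: "choccy chip cookie", 3: "vanilla wafer"}
for pid, name in catalog.items():
    conn.execute("INSERT INTO products VALUES (?, ?)", (pid, name))
    conn.executemany("INSERT INTO product_words VALUES (?, ?)",
                     [(pid, w) for w in name.split()])

tokens = ["chocolate", "chip", "cookie"]
# Count how many distinct words of each product match a token prefix,
# and rank the best-matching products first.
sql = ("SELECT p.id, p.name, COUNT(DISTINCT pw.word) AS hits "
       "FROM products p JOIN product_words pw ON pw.product_id = p.id "
       "WHERE " + " OR ".join("pw.word LIKE ?" for _ in tokens) +
       " GROUP BY p.id, p.name ORDER BY hits DESC")
rows = conn.execute(sql, [t + "%" for t in tokens]).fetchall()
# rows == [(1, 'chocolate chip cookies', 3), (2, 'choccy chip cookie', 2)]
```

As in Snapey's example, "chocolate chip cookies" ranks first with three hits, "choccy chip cookie" second with two, and unrelated products drop out entirely.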

Tray2's avatar

I would suggest moving your test database to something other than SQLite.

You really should have done that from the beginning; that way you run your tests on the same kind of system that your production will run on.

Napo7's avatar
Level 15

@Tray2 I know this; using SQLite instead of MySQL can lead to unexpected behavior in production. I've tried to run my tests against a MySQL DB, but:

  • I have more than 800 tests
  • a big bunch of tables: 55+ tables, 155 migration files
  • a powerful CPU, 24 threads, that can run 24 tests in parallel

Those figures lead to some awful but real numbers:

  • running the 800+ tests in parallel against SQLite takes an average of 20 seconds.
  • running them against MySQL takes an average of... 230 seconds... because testing against MySQL in parallel requires creating a test DB for each thread, so it creates 24 databases, and this takes a lot of time! And testing without parallelism is still not faster than SQLite: 18s until the first test fires (still because of DB setup), and 65s in total... I don't want to wait a minute, or even 20s, before each of my test runs!

If I want to limit my tests to a subset of them, I still have to wait for the setup, taking 18 seconds, versus less than 0.2s on SQLite!

And since you will ask: no, I don't want to re-migrate the tables manually only when needed. This is not part of my workflow because of TDD... Whether to TDD or not is not the point of this discussion, but I'm open to talking about it in another post, and about how much faster it allows me to work!

But if you know a solution that lets tests run as fast on MySQL as on SQLite (even when refreshing the database), I'm willing to test against a MySQL DB!

Napo7's avatar
Level 15

@Tray2 I think I've already tested this one. It's better than the "native refresh" database trait, but it still requires 20s each time I change a migration file. If testing against a real MySQL DB were truly necessary, I might consider it, but for now I still see more pros than cons in using SQLite ;)

Thanks anyway !
