Sep 13, 2025

Level 63

str_word_count doesn't return the right number of words

Hello,

I have this code.

$count = str_word_count("L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.", 0);

str_word_count() returns 16 words, whereas there are only 14 words.

1 - L'

2 - Auvergne

3 - coeur

4 - volcanique

5 - de

6 - la

7 - France

8 - se

9 - distingue

10 - par

11 - ses

12 - chaînes

13 - de

14 - puys

Any ideas?

Thanks a lot.

LaryAI

10 months ago

Level 58

The issue comes from how str_word_count() defines a "word". It works byte by byte: a word is a run of letters as defined by the current locale, which may also contain ' and - characters. It does not support multibyte encodings, so in a UTF-8 string the bytes of characters such as ’, œ, and î are typically not recognized as letters and split words unexpectedly, which is common with French text.

For example, in your string:

L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.

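The character ’ (Unicode RIGHT SINGLE QUOTATION MARK, U+2019) is not treated as part of a word, so L’Auvergne is split into L and Auvergne.
Multibyte letters (like œ and î) are not recognized as letters either, so cœur is typically split into c and ur, and chaînes into cha and nes. Those two extra splits explain why you get 16 instead of 14.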

Solution:

To handle this properly, you can use a regular expression with preg_match_all() that supports Unicode and includes apostrophes and accented characters.

Here's how you can do it:

$text = "L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.";

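// This regex matches runs of Unicode letters and allows hyphens inside words.
// Apostrophes (' and ’) act as separators, so "L’Auvergne" counts as two words.
preg_match_all('/\p{L}+(?:-\p{L}+)*/u', $text, $matches);

$count = count($matches[0]);

echo $count; // Outputs 14

Explanation:

\p{L}+ matches one or more Unicode letters.
(?:-\p{L}+)* allows hyphens inside words, so a hyphenated compound counts as one word.
Apostrophes are left out of the pattern on purpose, so an elided article such as L’ is counted on its own. If you add them to the group, as in (?:[’\'-]\p{L}+)*, L’Auvergne is matched as a single word and the count drops to 13.
The u modifier makes the pattern and the subject string be treated as UTF-8.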

Summary:
str_word_count() is limited for non-English or accented text. For accurate word counts in languages like French, use preg_match_all() with a Unicode-aware regular expression as shown above.

JussiMannisto

10 months ago

Best Answer

Level 50

Old PHP string functions often don't support multibyte strings. That's the case with str_word_count:

Note that multibyte locales are not supported.

That means it won't work correctly with Unicode strings containing multibyte characters like "œ" or "î".
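
To see where the extra words come from, here is a minimal sketch that prints the pieces str_word_count() actually produces. It assumes the file is saved as UTF-8; the exact split can depend on the active locale, so no expected list is shown.

```php
<?php
// Assumption: this file is saved as UTF-8.
$text = "L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.";

// Format 1 returns the pieces that str_word_count() treats as words,
// which makes any unexpected splits visible.
$pieces = str_word_count($text, 1);
print_r($pieces);

// "œ" is a single character but occupies two bytes in UTF-8,
// and str_word_count() looks at bytes, not characters.
$bytes = strlen('œ');
echo $bytes, PHP_EOL; // prints 2
```

If the list contains fragments such as "c" and "ur" instead of "cœur", that is the multibyte limitation at work.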

Glukinho

10 months ago

Level 33

People say you should use IntlBreakIterator: https://www.php.net/manual/en/class.intlbreakiterator.php

You could create a Str::intlWords() macro that uses it internally.
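
A rough sketch of the IntlBreakIterator idea, assuming the intl extension is installed. The function name intl_word_count() is hypothetical, and counting only segments that contain a letter is my own choice, not something the manual prescribes.

```php
<?php
// Hypothetical helper; requires the intl extension.
function intl_word_count(string $text, string $locale = 'fr'): int
{
    $iterator = IntlBreakIterator::createWordInstance($locale);
    $iterator->setText($text);

    $count = 0;
    // getPartsIterator() yields every segment between two boundaries,
    // including spaces and punctuation.
    foreach ($iterator->getPartsIterator() as $part) {
        // Count only segments that contain at least one Unicode letter.
        if (preg_match('/\p{L}/u', $part) === 1) {
            $count++;
        }
    }

    return $count;
}

echo intl_word_count("se distingue par ses chaînes de puys"), PHP_EOL;
```

How the iterator treats the apostrophe in L’Auvergne depends on ICU's word-break rules, so the result for the full sentence may not match a manual count that separates L’ from Auvergne. In a Laravel app the same body could probably be registered through Str::macro(), since Str is macroable.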

Merklin

10 months ago

Level 7

Small hint: if the function starts with mb_, that means it works correctly (is intended to be used) with multibyte characters. ;)

Full list here: https://www.php.net/manual/en/ref.mbstring.php
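
As a small illustration of the difference, assuming the mbstring extension is loaded and the file is saved as UTF-8 (note that, as far as I know, mbstring has no direct counterpart to str_word_count()):

```php
<?php
// Assumptions: the mbstring extension is loaded and this file is saved as UTF-8.
$word = 'cœur';

$byteLength = strlen($word);                 // counts bytes
$charLength = mb_strlen($word, 'UTF-8');     // counts characters
$upper      = mb_strtoupper($word, 'UTF-8'); // case mapping that understands "œ"

echo $byteLength, PHP_EOL; // prints 5
echo $charLength, PHP_EOL; // prints 4
echo $upper, PHP_EOL;      // prints CŒUR
```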