vincent15000's avatar

str_word_count doesn't return the right number of words

Hello,

I have this code.

$count = str_word_count("L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.", 0);

str_word_count() returns 16 words, whereas there are only 14 words.

1 - L'

2 - Auvergne

3 - coeur

4 - volcanique

5 - de

6 - la

7 - France

8 - se

9 - distingue

10 - par

11 - ses

12 - chaînes

13 - de

14 - puys

Any ideas?

Thanks a lot.

V

LaryAI's avatar

The issue comes from how str_word_count() defines a "word". It works byte by byte and only treats letters of the current single-byte locale (plain ASCII in the default locale), plus ' and -, as word characters. It does not understand UTF-8, so the typographic apostrophe (’) and multibyte letters such as œ or î act as word boundaries and split words unexpectedly, which is common with French text.

For example, in your string:

L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.
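  • The character ’ (Unicode RIGHT SINGLE QUOTATION MARK) is not treated as part of a word, so L’Auvergne is split into L and Auvergne.
  • Multibyte letters such as œ and î are not recognised as word characters, so cœur and chaînes are each split in two. These two splits account for the two extra words.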
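To see where the extra words come from, it can help to ask str_word_count() for the words themselves (format 1) instead of the count. This is only a minimal sketch, assuming the file is saved as UTF-8 and PHP runs with the default "C" locale; the setlocale() call is there just to make that assumption explicit.

```php
<?php
// Assumption: this file is saved as UTF-8 and PHP uses the "C" locale.
setlocale(LC_CTYPE, 'C');

$text = "L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.";

// Format 1 returns the list of detected "words" instead of their count.
$words = str_word_count($text, 1);

// The list contains fragments such as "c", "ur", "cha" and "nes".
print_r($words);

echo count($words), PHP_EOL; // 16
```

With another locale the fragments may differ, since the function relies on the locale's notion of a letter.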

Solution:

To handle this properly, you can use a regular expression with preg_match_all() that supports Unicode and includes apostrophes and accented characters.

Here's how you can do it:

$text = "L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.";

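// This regex matches runs of Unicode letters and keeps hyphenated compounds together.
// Apostrophes are deliberately left out so that the elided article "L’" counts as its own word, as in your list.
preg_match_all('/\p{L}+(?:-\p{L}+)*/u', $text, $matches);

$count = count($matches[0]);

echo $count; // Outputs 14

Explanation:

  • \p{L}+ matches one or more Unicode letters.
  • (?:-\p{L}+)* allows hyphens inside words, so a compound such as arc-en-ciel counts once.
  • Apostrophes (' and ’) are not part of the pattern, so L’Auvergne is counted as L and Auvergne. Adding them to the group, as in (?:[’'-]\p{L}+)*, would treat L’Auvergne as a single word and give 13 instead of 14.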
  • The u modifier enables Unicode support.

Summary:
str_word_count() is limited for non-English or accented text. For accurate word counts in languages like French, use preg_match_all() with a Unicode-aware regular expression as shown above.
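If the count is needed in several places, the regex approach can be wrapped in a small function. The sketch below is one possible shape, not a built-in: count_words_unicode() is a hypothetical name, and it treats apostrophes as separators so that it reproduces the 14 words listed in the question.

```php
<?php
/**
 * Hypothetical helper (not a PHP built-in): counts words in a UTF-8 string.
 * A word is a run of Unicode letters; hyphens may join several runs.
 */
function count_words_unicode(string $text): int
{
    // Without a third argument, preg_match_all() just returns the match count.
    $result = preg_match_all('/\p{L}+(?:-\p{L}+)*/u', $text);

    // It returns false on failure, for example when the input is not valid UTF-8.
    if ($result === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    return $result;
}

echo count_words_unicode("L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys."), PHP_EOL; // 14
```

Whether an elided article such as L’ should count as a separate word is a convention; the pattern is the place to change it.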

JussiMannisto's avatar

Old PHP string functions often don't support multibyte strings. That's the case with str_word_count:

Note that multibyte locales are not supported.

That means it won't work correctly with Unicode strings containing multibyte characters like "œ" or "î".
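A quick way to check that these characters really are multibyte is to look at their byte length and raw bytes. A minimal sketch, assuming the file is saved as UTF-8:

```php
<?php
// Assumption: this file is saved as UTF-8.
foreach (['œ', 'î', '’', 'e'] as $char) {
    // strlen() counts bytes, not characters; bin2hex() shows the raw bytes.
    printf("%s: %d byte(s), hex %s\n", $char, strlen($char), bin2hex($char));
}
```

Byte-oriented functions see "œ" and "î" as two bytes each and "’" as three, none of which is an ASCII letter.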
