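The issue comes from how str_word_count() defines a "word" and which characters it treats as word boundaries. It works byte by byte and only accepts locale-dependent alphabetic characters, plus ' and -, as word characters; multibyte encodings such as UTF-8 are not supported. As a result, it does not recognize the Unicode apostrophe (’) or accented characters as part of a word, and it can split words unexpectedly, especially in French text.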
For example, in your string:
L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.
- The character ’ (Unicode RIGHT SINGLE QUOTATION MARK) is not treated as part of a word.
- Accented characters (like é, è, ç) may not be included as word characters unless specified.
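One way to see where the splits happen is to ask str_word_count() for the words themselves rather than just the count. The snippet below is a minimal diagnostic sketch; it assumes the file is saved as UTF-8, and the exact pieces you get depend on your PHP version and locale, so no expected output is shown.

```php
<?php
// Assumption: this source file is saved as UTF-8.
$text = "L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.";

// Format 1 makes str_word_count() return the array of words it found
// instead of only the number of words.
$words = str_word_count($text, 1);

// Inspect the pieces to spot words broken at ’ or at accented letters.
print_r($words);
echo count($words), "\n";
```

Any entry that is only a fragment of a word (for example a piece cut off at ’, œ or î) shows where the count is being inflated.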
Solution:
To handle this properly, you can use a regular expression with preg_match_all() that supports Unicode and includes apostrophes and accented characters.
Here's how you can do it:
$text = "L’Auvergne, cœur volcanique de la France, se distingue par ses chaînes de puys.";
// This regex matches words with Unicode letters and allows for apostrophes inside words.
preg_match_all('/\p{L}+(?:[’\'-]\p{L}+)*/u', $text, $matches);
$count = count($matches[0]);
echo $count; // Outputs 13 (L’Auvergne is matched as a single word)
Explanation:
- \p{L}+ matches one or more Unicode letters.
- (?:[’\'-]\p{L}+)* allows for apostrophes (both ' and ’) or hyphens inside words.
- The u modifier enables Unicode support.
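If you need this in more than one place, the same regex can be wrapped in a small function. This is only a sketch: the name count_words_unicode() is made up for illustration and is not part of PHP, and it assumes the input is valid UTF-8 and the file is saved as UTF-8.

```php
<?php
// Hypothetical helper; the name count_words_unicode is an assumption, not a PHP built-in.
function count_words_unicode(string $text): int
{
    // Unicode letters, optionally joined by an apostrophe (' or ’) or a hyphen.
    $pattern = '/\p{L}+(?:[’\'-]\p{L}+)*/u';

    // preg_match_all() returns the number of full matches, or false on failure
    // (for example when the subject is not valid UTF-8 under the u modifier).
    $result = preg_match_all($pattern, $text, $matches);
    if ($result === false) {
        throw new InvalidArgumentException('Regex failed, PCRE error code ' . preg_last_error());
    }

    return $result;
}

echo count_words_unicode("L’Auvergne, cœur volcanique"), "\n"; // 3
echo count_words_unicode("arc-en-ciel"), "\n";                 // 1
```

Note that this treats an elided form such as L’Auvergne as one word. If you want the article counted separately, the apostrophes would have to be removed from the character class.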
Summary:
str_word_count() is limited for non-English or accented text. For accurate word counts in languages like French, use preg_match_all() with a Unicode-aware regular expression as shown above.