Stripping Invisible Characters before passing to DB
I recently upgraded a project from Laravel 3 to Laravel 11 (I know). I also upgraded the database and my tables use InnoDB with utf8mb4_unicode_ci.
Everything was running pretty great for a while, but I just had this issue pop up and I'm not entirely sure of the best way to handle it.
I have invisible characters that are sometimes posted by users (they don't know - they just copy and paste from Word docs) such as a Zero Width Non-Joiner (<0x200c> or U+200C) and they throw this exception:
SQLSTATE[HY000]: General error: 3988 Conversion from collation utf8mb4_unicode_ci into latin1_swedish_ci impossible for parameter
I wasn't sure how to handle this yet, so I temporarily wrapped the input in str()->ascii() and that stripped the offending characters. I want a more appropriate solution though since this seems like a very hack-ish way to fix this. Is there some way to globally sanitize these things, or is there something I missed in Laravel docs?
To stop invisible characters posted by users from causing database errors, you can create a middleware in Laravel that sanitizes every request's input before it reaches your database.
Here's a step-by-step solution:
Create a Middleware:
First, create a new middleware using the Artisan command:
php artisan make:middleware SanitizeInput
Implement the Middleware:
Open the newly created middleware file located at app/Http/Middleware/SanitizeInput.php and implement the logic to remove invisible characters.
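<?php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;

class SanitizeInput
{
    /**
     * Handle an incoming request.
     *
     * @param \Illuminate\Http\Request $request
     * @param \Closure $next
     * @return mixed
     */
    public function handle(Request $request, Closure $next)
    {
        // input() returns request data without uploaded files, which all() would include.
        $input = $request->input();

        array_walk_recursive($input, function (&$value) {
            // Only strings can carry invisible characters; leave null, int, and bool values untouched.
            if (is_string($value)) {
                // Remove zero-width characters (U+200B to U+200D) and the BOM (U+FEFF).
                // preg_replace returns null on failure, so fall back to the original value.
                $value = preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', $value) ?? $value;
            }
        });

        $request->merge($input);

        return $next($request);
    }
}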
Register the Middleware:
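Register the middleware. Laravel 11 no longer ships app/Http/Kernel.php, so registration happens in bootstrap/app.php inside the withMiddleware callback. You can append it to the global middleware stack or to a specific group (e.g., web or api).
->withMiddleware(function (Middleware $middleware) {
    // Append to the global middleware stack
    $middleware->append(\App\Http\Middleware\SanitizeInput::class);
})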
Test the Middleware:
Ensure that the middleware is working correctly by testing your application. Any input containing invisible characters should now be sanitized before being processed by your application and stored in the database.
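One way to check the pattern outside the framework is a small standalone script. This is only a sketch: the function name stripInvisible is made up for illustration, and the pattern is the same one used in the middleware. Unlike str()->ascii(), which transliterates accented letters, this should leave ordinary non-ASCII text alone.

```php
<?php

// Hypothetical helper for illustration; the pattern matches the middleware's.
function stripInvisible(string $value): string
{
    // Removes U+200B to U+200D (zero width space, non-joiner, joiner) and U+FEFF (BOM).
    // preg_replace returns null on failure (e.g. invalid UTF-8), so fall back to the input.
    return preg_replace('/[\x{200B}-\x{200D}\x{FEFF}]/u', '', $value) ?? $value;
}

// A zero width non-joiner sits between "foo" and "bar".
$pasted = "foo\u{200C}bar";

var_dump(strlen($pasted));          // int(9): U+200C takes three bytes in UTF-8
var_dump(stripInvisible($pasted));  // string(6) "foobar"

// Accented characters are not in the pattern, so this string comes back unchanged.
var_dump(stripInvisible('naïve café') === 'naïve café'); // bool(true)
```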
This approach ensures that all incoming requests are sanitized globally, preventing the issue of invisible characters causing database errors.
As the forum's AI suggested, I think you should use middleware. With it you will be able to handle any and all inputs globally.
Since you are creating middleware to handle inputs, you can also use the ezyang/htmlpurifier package there to prevent script injection and related attacks.
Thanks for the reply. I feel like I’m missing something here entirely though. While it might be a solution to the problem, it doesn’t feel like I am addressing the correct problem.
This issue can’t be incredibly uncommon so creating a middleware makes me feel like there is a different approach entirely that isn’t being considered. Or is everyone creating a middleware as part of their initial install of Laravel?
In config/database.php, set the charset and collation values of the connection you're using to match your database, e.g. utf8mb4 and utf8mb4_unicode_ci.
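As a rough illustration, the relevant part of config/database.php might look like the fragment below. It assumes the default mysql connection; the other keys are omitted, and your file may read these values from env() instead of hard-coding them.

```php
// config/database.php (fragment; other keys omitted)
'connections' => [
    'mysql' => [
        'driver' => 'mysql',
        // ...
        // Should match the charset and collation the database actually uses.
        'charset' => 'utf8mb4',
        'collation' => 'utf8mb4_unicode_ci',
        // ...
    ],
],
```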
@MohamedTammam I know it's been a while, but I just wanted to pop in and say that this is correct. The columns also needed to be updated with the right charset and collation.
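For anyone landing here later, a migration along these lines is one possible way to convert a table's columns. It is a sketch under assumptions: comments is a placeholder table name, and the down() method guesses the old charset from the latin1_swedish_ci in the error message. Back up the table first, since CONVERT TO rewrites the data and can change column types or hit index length limits.

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Support\Facades\DB;

return new class extends Migration
{
    public function up(): void
    {
        // `comments` is a placeholder; repeat for each affected table.
        // CONVERT TO changes the table default and its character columns.
        DB::statement(
            'ALTER TABLE comments CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci'
        );
    }

    public function down(): void
    {
        // Assumption: the columns were latin1 before. Converting back can fail
        // or lose data once rows contain characters latin1 cannot represent.
        DB::statement(
            'ALTER TABLE comments CONVERT TO CHARACTER SET latin1 COLLATE latin1_swedish_ci'
        );
    }
};
```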