There’s little to no information online about how to properly handle multi-byte characters through the life cycle of a PHP script.
If we need to be certain that our $_POST data is valid UTF-8 and safe from invalid byte sequences, which function(s) should be used to guarantee this?
HTML 5
On the modern internet, HTML 5 is the standard and UTF-8 is the default (or only) encoding used.
What do we need to do as developers to tell the browser that we are using UTF-8 for both input and output, everywhere?
Role of the Browser
Is the browser supposed to do input character translation for us?
For example, if a user submits a textarea element in an HTML 5 form, and it contains Windows-1252 encoded characters pasted in from MS Word (curly quotes, say), is it the browser’s job to convert the Windows-1252 to UTF-8 on paste (without any JavaScript) and send only UTF-8 to the server?
PHP Settings
Which settings need to be set in PHP, in general, to tell PHP that incoming POST and GET data should be UTF-8, and that output needs to be UTF-8?
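For context, the settings usually reached for here can be sketched as follows. This is a minimal sketch, assuming the application uses UTF-8 end to end and the mbstring extension is loaded. These calls only declare the encoding; on their own they do not validate or convert incoming request data.

```php
<?php
// Assumption: the whole application is meant to be UTF-8.

// Default charset PHP uses for its Content-Type header and as the
// fallback for functions such as htmlspecialchars().
ini_set('default_charset', 'UTF-8');

// Default encoding for mbstring functions when none is passed explicitly.
mb_internal_encoding('UTF-8');

// Tell the client that the response body is UTF-8.
// Guarded so it is skipped if output has already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
```

The same `default_charset` value can also be set in php.ini rather than at runtime.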
PHP Default Behavior
Does PHP do any character encoding translation automatically when it sets up the internal $_GET and $_POST arrays?
Assume that a malicious person does not use a browser, but sends a deliberately malformed character string to the PHP endpoint directly.
Will PHP automatically replace malformed byte sequences with a substitution character, or will $_POST contain the raw bytes, which could include dangerous sequences and be in any encoding?
To put it another way, does PHP automatically strip malformed characters or is this the developer’s job?
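One way to find out what actually arrives is to test the data rather than assume. The sketch below uses `mb_check_encoding()` to check every key and string value of an array. The helper name `all_valid_utf8()` and the sample arrays are hypothetical; in a real script the argument would be `$_POST` or `$_GET`.

```php
<?php
/**
 * Hypothetical helper: returns true only if every string key and every
 * string value in the (possibly nested) array is valid UTF-8.
 */
function all_valid_utf8(array $data): bool
{
    foreach ($data as $key => $value) {
        // Array keys come from the request too, so check them as well.
        if (is_string($key) && !mb_check_encoding($key, 'UTF-8')) {
            return false;
        }
        if (is_array($value)) {
            if (!all_valid_utf8($value)) {
                return false;
            }
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            return false;
        }
    }
    return true;
}

// Sample data standing in for $_POST.
$sample_ok  = ['name' => "caf\u{E9}", 'tags' => ['a', 'b']]; // "é" encoded as UTF-8
$sample_bad = ['name' => "caf\xE9"];                         // lone Windows-1252 byte for "é"

var_dump(all_valid_utf8($sample_ok));
var_dump(all_valid_utf8($sample_bad));
```

A check like this only reports whether the bytes are well-formed UTF-8. It does not repair anything.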
User Input Sanitization
If the developer is responsible for ensuring that the incoming user input is actually valid UTF-8, and not malformed UTF-8 or some other encoding, what tool(s) should be used?
PHP has an mb_scrub() function, but it seems that this function replaces invalid byte sequences with a simple question mark ? and not with the U+FFFD Unicode replacement character.
It seems you should be able to set the replacement character with
mb_substitute_character(0xFFFD);
But the man page says: “This setting affects mb_convert_encoding(), mb_convert_variables(), mb_output_handler(), and mb_send_mail().”
It doesn’t mention mb_scrub().
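Since mb_convert_encoding() is on the documented list, one possible approach is to convert the string from UTF-8 to UTF-8 with the substitute character set to U+FFFD. The following is a sketch under that assumption, not a confirmed best practice; the helper name `scrub_utf8()` is hypothetical, and whether mb_scrub() honors the same setting should be verified on the PHP version in use.

```php
<?php
/**
 * Hypothetical helper: returns $input with invalid UTF-8 byte sequences
 * replaced by U+FFFD, relying on the documented effect of
 * mb_substitute_character() on mb_convert_encoding().
 */
function scrub_utf8(string $input): string
{
    // Remember the current setting so it can be restored afterwards.
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);
    try {
        // Converting UTF-8 to UTF-8 passes valid sequences through and
        // substitutes the invalid ones.
        return mb_convert_encoding($input, 'UTF-8', 'UTF-8');
    } finally {
        mb_substitute_character($previous);
    }
}

// 0xFF can never appear in well-formed UTF-8.
$clean = scrub_utf8("a\xFFb");

var_dump(mb_check_encoding($clean, 'UTF-8'));
var_dump(bin2hex($clean));
```

Restoring the previous substitute character keeps the helper from changing behavior elsewhere in the script.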
Question
So the question all of this is working towards is:
If we want to take our $_POST data, scrub it of invalid byte sequences, and replace the invalid bytes with U+FFFD (0xFFFD), which function(s) should we use so that we are guaranteed to be working with a safe UTF-8 string, no matter what a user may throw at us?