Character Encoding
Published in PHP Architect on 28 Feb 2006
I want to give thanks to Ilia Alshanetsky, who has agreed to take over Security Corner. It has been my pleasure to be the author of this column for the past few years. I think it’s valuable to hear from different sources of security expertise. Ilia is a well-known PHP expert and educator, and I’m confident that you’ll learn a lot from what he has to say.
Character encoding is a vast topic that I don’t plan to cover in much detail. In fact, the purpose of this month’s Security Corner is to illustrate why character encoding matters, not to explain character encoding mechanics. I highly encourage you to learn as much as you can about character encoding, because I think it will not only make you a better developer, but also lead to web apps that are more accessible.
Escaping
I use the term escaping to describe any technique that represents data in such a way that it is preserved in a different context. From a PHP developer’s perspective, there are three primary contexts that involve escaping:
- URLs
- HTML
- SQL
The format of URLs does not support other character encodings, so urlencode() sufficiently preserves data within a URL (e.g., as the value of a query string parameter). However, properly escaping data to be used in SQL queries or HTML requires more attention.
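As a minimal sketch of the URL case, the snippet below builds a query string with urlencode(). The search term and the example.org address are made up for illustration.

```php
<?php
// Hypothetical search term containing characters that are special in a URL.
$term = 'rock & roll=fun';

// urlencode() turns each space into "+" and percent-encodes "&" and "=",
// so the value cannot be mistaken for extra query string parameters.
$url = 'http://example.org/search?q=' . urlencode($term);

echo $url, "\n";
```

The receiving script gets the original term back in $_GET['q'], because PHP decodes query string values automatically.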
HTML
Escaping HTML is typically performed with htmlentities() or htmlspecialchars(), but the simplest use of these functions does not indicate which character encoding to use:
<?php
$html = array();
$html['username'] = htmlentities($clean['username']);
echo "<p>Welcome back, {$html['username']}.</p>";
?>
This example suggests that the username has already been filtered (hence $clean['username']), so it’s unlikely to cause problems when used in the context of HTML. If $_POST['username'] were used instead, however, this would be vulnerable to XSS, despite the use of htmlentities().
The best way to illustrate this is to recreate Google’s recent XSS vulnerability. I used the following example when I blogged about this:
<?php
header('Content-Type: text/html; charset=UTF-7');
$string = "<script>alert('XSS');</script>";
$string = mb_convert_encoding($string, 'UTF-7');
echo htmlentities($string);
?>
The Content-Type header indicates a character encoding of UTF-7. Browsers such as Internet Explorer automatically detect the encoding, in which case this line can be removed, but I wanted to make sure the example works in any browser.
The next two lines create $string, which represents the attack: a typical XSS attack encoded with UTF-7. By default, htmlentities() assumes the character encoding is ISO-8859-1, so it misinterprets the characters used in the attack and fails to escape them properly. Thus, if you try this example yourself, the script should run and you should see a JavaScript alert box containing XSS.
In order to avoid this type of vulnerability, it's best to always be explicit about the character encoding:
<?php
header('Content-Type: text/html; charset=UTF-8');
$html = array();
$html['username'] = htmlentities($clean['username'], ENT_QUOTES, 'UTF-8');
echo "<p>Welcome back, {$html['username']}.</p>";
?>
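The sketch below suggests why naming the encoding in the htmlentities() call and in the Content-Type header go together. The helper name escape_html() is hypothetical, and the UTF-7 payload is written out by hand so the example does not depend on mb_convert_encoding().

```php
<?php
// Hypothetical helper: always pass the flags and the encoding explicitly.
function escape_html(string $value, string $encoding = 'UTF-8'): string
{
    return htmlentities($value, ENT_QUOTES, $encoding);
}

// Plain markup is escaped as expected.
$plain = escape_html("<script>alert('XSS')</script>");

// Hand-written UTF-7 form of <script>alert(1)</script>.
// In UTF-7, "+ADw-" stands for "<" and "+AD4-" stands for ">".
$utf7 = '+ADw-script+AD4-alert(1)+ADw-/script+AD4-';

// The payload contains no <, >, &, or quote characters,
// so escaping leaves it exactly as it was.
$escaped = escape_html($utf7);

echo $plain, "\n";
echo $escaped, "\n";
```

Because the escaped payload is unchanged, it is the charset in the Content-Type header that keeps a browser from decoding it as UTF-7.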
SQL
Character encoding is more important when escaping HTML than when escaping SQL. The HTML you output is interpreted by many different browsers. When you're communicating with a database, you're communicating with a particular database, and you control how it interprets characters.
It is both interesting and educational to see how character encoding inconsistencies can be problematic in the context of SQL. In order to provide an example, I'll demonstrate an SQL injection attack that is immune to addslashes(), because this function also assumes ISO-8859-1. For this demonstration, I'll use GBK, a multi-byte character encoding.
In GBK, 0xbf27 is not a valid multi-byte character, but 0xbf5c is. Interpreted as single-byte characters, 0xbf27 is 0xbf (¿) followed by 0x27 ('), and 0xbf5c is 0xbf (¿) followed by 0x5c (\).
The goal of many SQL injection attacks is to inject a single quote without it being escaped. If addslashes() is being used, this can seem impossible, because it inserts a backslash before every single quote. However, all an attacker must do is inject something like 0xbf27, because addslashes() modifies this to become 0xbf5c27, a valid multi-byte character followed by a single quote. In other words, a single quote can be injected, despite the escaping. This is because 0xbf5c is considered to be a single character.
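The byte-level effect is easy to check without a database. This sketch only inspects what addslashes() does to the two bytes described above; it does not validate GBK itself.

```php
<?php
// The attacker's input: 0xbf followed by a single quote (0x27).
$input = chr(0xbf) . chr(0x27);

// addslashes() works byte by byte and inserts a backslash (0x5c)
// in front of the quote.
$escaped = addslashes($input);

// Show the bytes before and after escaping in hexadecimal.
echo bin2hex($input), ' -> ', bin2hex($escaped), "\n";
```

A GBK-aware parser reads the first two bytes of the result (0xbf5c) as one character, which leaves the trailing 0x27 as an unescaped single quote.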
In order to illustrate this further, I provided a concrete example in my blog that I want to share. If you want to try this yourself, make sure you're using GBK. You can do this in /etc/my.cnf:
[client]
default-character-set=GBK
You'll need a table called users:
CREATE TABLE users (
    username VARCHAR(32) CHARACTER SET GBK,
    password VARCHAR(32) CHARACTER SET GBK,
    PRIMARY KEY (username)
);
The following script mimics a situation where only addslashes() is used to escape the data being used in a query:
<?php
$mysql = array();
$db = mysqli_init();
$db->real_connect('localhost', 'myuser', 'mypass', 'mydb');

/* SQL Injection Example */
$_POST['username'] = chr(0xbf) .
                     chr(0x27) .
                     ' OR username = username /*';
$_POST['password'] = 'guess';

/* Escaping with addslashes() only, which is the flaw being demonstrated */
$mysql['username'] = addslashes($_POST['username']);
$mysql['password'] = addslashes($_POST['password']);

$sql = "SELECT *
        FROM users
        WHERE username = '{$mysql['username']}'
        AND password = '{$mysql['password']}'";

$result = $db->query($sql);

if ($result->num_rows) {
    /* Success */
} else {
    /* Failure */
}
?>
Despite the use of addslashes(), an attacker can log in successfully without knowing a valid username or password.
To avoid this type of vulnerability, use mysql_real_escape_string() (mysqli_real_escape_string() if you're using mysqli, as in the script above), bound parameters, or any of the major database abstraction libraries.
This type of attack is possible with any character encoding where there is a valid multi-byte character that ends in 0x5c, because addslashes() can be tricked into creating a valid multi-byte character instead of escaping the single quote that follows.
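For comparison, here is a rough sketch of the same login check written with bound parameters in mysqli. The function name find_user() is hypothetical, and the query simply mirrors the one in the script above.

```php
<?php
// Hypothetical helper: returns true when a row matches both values.
// The values are sent separately from the SQL text, so a quote inside
// them cannot change the structure of the query.
function find_user(mysqli $db, string $username, string $password): bool
{
    $stmt = $db->prepare(
        'SELECT username FROM users WHERE username = ? AND password = ?'
    );
    if ($stmt === false) {
        // The statement could not be prepared, so treat it as no match.
        return false;
    }

    // 'ss' declares both parameters as strings.
    $stmt->bind_param('ss', $username, $password);
    $stmt->execute();

    // Buffer the result so that num_rows is available.
    $stmt->store_result();
    $found = $stmt->num_rows > 0;
    $stmt->close();

    return $found;
}
```

Calling it would look like find_user($db, $_POST['username'], $_POST['password']), where $db is the connection created in the script above. If you also escape strings by hand on the same connection, set its encoding with $db->set_charset() so that the client library knows which encoding is in use.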
Until Next Time…
Hopefully you appreciate the importance of character encoding consistency and will always indicate the character encoding in your htmlentities() calls, your Content-Type headers, and the like. If you're using MySQL, use mysql_real_escape_string() instead of addslashes(), or if at all possible, use bound parameters.
Until next time, be safe.