Sanitizing HTML input value

Solution 1

There are really two questions here (or at least two ways your question can be read):

  1. Can the quoted value attribute of input[type="text"] be injected if quotes are disallowed?

  2. Can an arbitrary quoted attribute of an element be injected if quotes are disallowed?

The second is trivially demonstrated by the following:

<a href="javascript:alert(1234);">Foo</a>

Or

<div onmousemove="alert(123);">...

The first is a bit more complicated.

HTML5

According to the HTML5 spec:

Attribute values are a mixture of text and character references, except with the additional restriction that the text cannot contain an ambiguous ampersand.

Which is further refined in quoted attributes to:

The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by a single """ (U+0022) character, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal U+0022 QUOTATION MARK characters ("), and finally followed by a second single """ (U+0022) character.

So in short, any character except an "ambiguous ampersand" (&[a-zA-Z0-9]+; when the result is not a valid character reference) and a quote character is valid inside of an attribute.

HTML 4.01

HTML 4.01 is less descriptive than HTML5 about the syntax (one of the reasons HTML5 was created in the first place). However, it does say this:

When script or style data is the value of an attribute (either style or the intrinsic event attributes), authors should escape occurrences of the delimiting single or double quotation mark within the value according to the script or style language convention. Authors should also escape occurrences of "&" if the "&" is not meant to be the beginning of a character reference.

Note, this is saying what an author should do, not what a parser should do. So a parser could technically accept or reject invalid input (or mangle it to be valid).

XML 1.0

The XML 1.0 Spec defines an attribute as:

Attribute ::= Name Eq AttValue

where AttValue is defined as:

AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"

The & is similar to the concept of an "ambiguous ampersand" from HTML5, however it's basically saying "any unencoded ampersand".

Note though that it explicitly denies < from attribute values.

So while HTML5 allows < inside an attribute value, XML 1.0 explicitly forbids it.

What Does It Mean

It means that a compliant and bug-free HTML5 parser will treat < characters inside a quoted attribute value as literal text, and a compliant XML parser will raise an error.

It also means that for a compliant and bug free parser, HTML 4.01 will behave in unspecified and potentially odd ways (since the specification doesn't detail the behavior).

And this gets to the crux of the issue. In the past, HTML was such a loose spec that every browser had slightly different rules for dealing with malformed HTML. Each would try to "fix" it, or "interpret" what you meant. So while an HTML5-compliant browser wouldn't execute the JS in <input type="text" value="<script>alert(0)</script>">, there's nothing to say that an HTML 4.01-compliant browser wouldn't. And there's nothing to say that a bug couldn't exist in an XML or HTML5 parser that causes it to be executed (though that would be a pretty significant problem).

THAT is why OWASP (and most security experts) recommend you encode either all non-alpha-numeric characters or &<" inside of an attribute value. There's no cost in doing so, only the added security of knowing how the browser's parser will interpret the value.

Do you have to? No. But defense in depth suggests that, since there's no cost to doing so, the potential benefit is worth it.
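As a rough illustration of the "encode all non-alphanumeric characters" approach mentioned above, here is a minimal sketch. The function name encode_attribute_value is hypothetical (it is not part of PHP or of any OWASP library), and the sketch assumes the input is valid UTF-8, so bytes at or above 0x80 are passed through unchanged.

```php
<?php
// Hypothetical helper, not part of PHP or any OWASP library.
// Encodes every ASCII character that is not a letter or digit as a
// hexadecimal character reference (&#xHH;). Bytes >= 0x80 are left alone,
// on the assumption that the input is valid UTF-8 text.
function encode_attribute_value(string $value): string
{
    return preg_replace_callback(
        '/[^a-zA-Z0-9\x80-\xFF]/',
        function (array $m): string {
            return sprintf('&#x%02X;', ord($m[0]));
        },
        $value
    );
}

// The quote, angle brackets, parentheses and slash all become references,
// so the value cannot terminate the attribute or open a tag.
$var = '"><script>alert(0)</script>';
echo '<input type="text" value="' . encode_attribute_value($var) . '">', "\n";
```

In most applications htmlspecialchars with ENT_QUOTES is likely enough; the stricter encoding only removes any dependence on how a particular parser treats the remaining characters.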

Solution 2

When users submit data, you need to make sure that they've provided something you expect.

For example, if you expect a number, make sure the submitted data is a number. You can also cast user data into other types. Everything submitted is initially treated like a string, so forcing known-numeric data into being an integer or float makes sanitization fast and painless.
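A minimal sketch of the difference between casting and validating follows. The function name parse_quantity and the 1 to 100 range are assumptions made for illustration only.

```php
<?php
// Hypothetical helper: returns the submitted value as an int when it is a
// whole number between 1 and 100 (an assumed range), or null otherwise.
function parse_quantity($raw): ?int
{
    $value = filter_var($raw, FILTER_VALIDATE_INT, [
        'options' => ['min_range' => 1, 'max_range' => 100],
    ]);

    return $value === false ? null : $value;
}

// A plain cast silently keeps the leading digits and drops the rest.
$cast = (int) '42<script>';

// Validation rejects the whole value instead.
$validated = parse_quantity('42<script>');

var_dump($cast, $validated, parse_quantity('42'));
```

A cast is the quicker option when any integer will do; validation is the safer choice when a malformed value should be rejected rather than silently repaired.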

You need to make sure that fields that should not have any HTML content do not actually contain HTML. There are several ways you can deal with this problem.

You can try escaping HTML input with htmlspecialchars. You should not use htmlentities to neutralize HTML, as it will also perform encoding of accented and other characters that it thinks also need to be encoded.

You can try removing any possible HTML. strip_tags is quick and easy, but also sloppy. HTML Purifier does a much more thorough job of both stripping out all HTML and also allowing a selective whitelist of tags and attributes through.

You can use the OWASP PHP Filters. They're really simple to use and effective.

You can use the filter extension, which provides a comprehensive way to sanitize user input.
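To make the differences between these options concrete, here is a small sketch comparing htmlspecialchars, strip_tags and htmlentities on the same input. The sample strings are made up for illustration.

```php
<?php
$input = '<b>Tom & "Jerry"</b>';

// Escaping keeps the markup, but as visible text rather than HTML.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
// $escaped is '&lt;b&gt;Tom &amp; &quot;Jerry&quot;&lt;/b&gt;'

// Stripping removes the tags but leaves & and " untouched, so the result
// still needs escaping before it is placed inside an attribute.
$stripped = strip_tags($input);
// $stripped is 'Tom & "Jerry"'

// htmlentities additionally rewrites accented characters.
$accented = "caf\u{e9}";
$special  = htmlspecialchars($accented, ENT_QUOTES, 'UTF-8'); // unchanged
$entities = htmlentities($accented, ENT_QUOTES, 'UTF-8');     // 'caf&eacute;'

echo $escaped, "\n", $stripped, "\n", $special, "\n", $entities, "\n";
```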

Examples

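The code below removes all HTML tags from a string:

$string = "<h1>Hello, World!</h1>";
// Note: FILTER_SANITIZE_STRING is deprecated as of PHP 8.1; strip_tags() gives the same result for this input.
$new_string = filter_var($string, FILTER_SANITIZE_STRING);
// $new_string is now "Hello, World!"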

The code below checks whether the value of the variable is a valid IP address:

$ip = "127.0.0.1";
$valid_ip = filter_var($ip, FILTER_VALIDATE_IP);
// $valid_ip is "127.0.0.1" (the validated string, which is truthy)

$ip = "127.0.1.1.1.1";
$valid_ip = filter_var($ip, FILTER_VALIDATE_IP);
// $valid_ip is FALSE

Sanitizing and validating email addresses:

<?php
$a = 'joe@example.org';
$b = 'bogus - at - example dot org';
$c = '(bogus@example.org)';

$sanitized_a = filter_var($a, FILTER_SANITIZE_EMAIL);
if (filter_var($sanitized_a, FILTER_VALIDATE_EMAIL)) {
    echo "This (a) sanitized email address is considered valid.\n";
}

$sanitized_b = filter_var($b, FILTER_SANITIZE_EMAIL);
if (filter_var($sanitized_b, FILTER_VALIDATE_EMAIL)) {
    echo "This sanitized email address is considered valid.";
} else {
    echo "This (b) sanitized email address is considered invalid.\n";
}

$sanitized_c = filter_var($c, FILTER_SANITIZE_EMAIL);
if (filter_var($sanitized_c, FILTER_VALIDATE_EMAIL)) {
    echo "This (c) sanitized email address is considered valid.\n";
    echo "Before: $c\n";
    echo "After:  $sanitized_c\n";    
}
?>

Reference:

What are the best PHP input sanitizing functions?

http://code.tutsplus.com/tutorials/sanitize-and-validate-data-with-php-filters--net-2595

https://security.stackexchange.com/q/42498/71827

http://php.net/manual/en/filter.examples.sanitization.php

Solution 3

If your question is "what types of XSS attacks are possible", then you had better search for it. I'll just leave some examples of why you should sanitize your inputs:

  • If the input is generated by echo "<input type='text' value='$var'>", then a simple ' breaks it.

  • If input is plain HTML in PHP page then value=<?php deadly_php_script ?> breaks it

  • If this is a plain HTML input in an HTML file, then converting double quotes should be enough.

That said, converting other special symbols (like <, > and so on) is good practice. Inputs exist to collect data that will be stored on the server or passed on to another page or script, so you need to check what could break those files. Let's say we have this setup:

index.html:

<form method=post action=getinput.php> <input type="text" name="xss"> <input type="submit"></form>

getinput.php:

echo $_POST['xss'];

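Because getinput.php echoes the value unescaped, an input value such as <script>alert(0)</script> breaks it completely: the submitted text is not run as PHP, but it is rendered as HTML and executed by the browser. In this case you should sanitize server-side before output.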
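A minimal sketch of a safer getinput.php, assuming the field is still named xss as in the form above. The function name render_submitted is hypothetical and exists only to keep the escaping in one place.

```php
<?php
// Hypothetical helper for getinput.php: returns the submitted "xss" field
// escaped for output as HTML text, or an empty string when the field is
// missing or is not a plain string (for example an array).
function render_submitted(array $post): string
{
    $raw = $post['xss'] ?? '';
    if (!is_string($raw)) {
        return '';
    }

    return htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo render_submitted($_POST);
```

With this in place, a submitted script tag is displayed as text instead of being interpreted by the browser.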

If that's not enough - provide more info on your question, add more examples of your code.

Solution 4

I believe the person is referring to cross-site scripting attacks. They tagged this as php, security, and xss.

Take, for example:

<input type="text" value=""><script>alert(0)</script><"">

The above code will execute the alert box code.

<?php $var= "\"><script>alert(0)</script><\""; ?>
<input type="text" value="<?php echo $var ?>">

This will also execute the alert box. To solve this you need to escape ", <, >, and a few more characters to be safe. PHP has a couple of functions worth looking into, and each has its ups and downs:

htmlentities() - Convert all applicable characters to HTML entities
htmlspecialchars() - Convert special characters to HTML entities
get_html_translation_table() - Returns the translation table used by  htmlspecialchars and htmlentities
urldecode() - Decodes URL-encoded string

What you have to be careful of is that you are passing in a variable, and there are ways to trigger errors and similar conditions that let it break out. Your best bet is to make sure the data is never formatted in an executable manner, even when errors occur. You are right that without quotes the value cannot break out of the attribute in a compliant parser, but there may be ways that neither you nor I know of at this point that allow it to happen.
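As a sketch of the escaping described above, the breakout payload from this answer can be neutralized with htmlspecialchars. The function name render_text_input is hypothetical; the ENT_QUOTES flag is used so that single quotes are encoded as well as double quotes.

```php
<?php
// Hypothetical helper: builds the input element with the value escaped for
// use inside a double-quoted attribute.
function render_text_input(string $value): string
{
    $safe = htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    return '<input type="text" value="' . $safe . '">';
}

// The same payload as in the example above.
$var = "\"><script>alert(0)</script><\"";
echo render_text_input($var), "\n";
```

The browser decodes the character references back to the original characters when displaying the field, so the user still sees exactly what was entered.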

Author: KaekeaSchmear

Updated on July 18, 2022

Comments

  • KaekeaSchmear (almost 2 years)

    Do you have to convert anything besides the quotes (") to (&quot;) inside of:

    <input type="text" value="$var">

    I personally do not see how you can possibly break out of that without using " on*=....

    Is this correct?

    Edit: Apparently some people think my question is too vague;

    <input type="text" value="<script>alert(0)</script>"> does not execute, which makes it impossible to break out of the attribute without using ".

    Is this correct?