What does FILTER_SANITIZE_STRING do?

50,035

Solution 1

According to PHP Manual:

Strip tags, optionally strip or encode special characters.

According to W3Schools:

The FILTER_SANITIZE_STRING filter strips or encodes unwanted characters.

This filter removes data that is potentially harmful for your application. It is used to strip tags and remove or encode unwanted characters.

Now, that doesn't tell us much. Let's go see some PHP sources.

ext/filter/filter.c:

static const filter_list_entry filter_list[] = {                                       
    /*...*/
    { "string",          FILTER_SANITIZE_STRING,        php_filter_string          },  
    { "stripped",        FILTER_SANITIZE_STRING,        php_filter_string          },  
    { "encoded",         FILTER_SANITIZE_ENCODED,       php_filter_encoded         },  
    /*...*/

Now, let's go see how php_filter_string is defined.
ext/filter/sanitizing_filters.c:

/* {{{ php_filter_string */
void php_filter_string(PHP_INPUT_FILTER_PARAM_DECL)
{
    size_t new_len;
    unsigned char enc[256] = {0};

    /* strip high/strip low ( see flags )*/
    php_filter_strip(value, flags);

    if (!(flags & FILTER_FLAG_NO_ENCODE_QUOTES)) {
        enc['\''] = enc['"'] = 1;
    }
    if (flags & FILTER_FLAG_ENCODE_AMP) {
        enc['&'] = 1;
    }
    if (flags & FILTER_FLAG_ENCODE_LOW) {
        memset(enc, 1, 32);
    }
    if (flags & FILTER_FLAG_ENCODE_HIGH) {
        memset(enc + 127, 1, sizeof(enc) - 127);
    }

    php_filter_encode_html(value, enc);

    /* strip tags, implicitly also removes \0 chars */
    new_len = php_strip_tags_ex(Z_STRVAL_P(value), Z_STRLEN_P(value), NULL, NULL, 0, 1);
    Z_STRLEN_P(value) = new_len;

    if (new_len == 0) {
        zval_dtor(value);
        if (flags & FILTER_FLAG_EMPTY_STRING_NULL) {
            ZVAL_NULL(value);
        } else {
            ZVAL_EMPTY_STRING(value);
        }
        return;
    }
}

I'll skip commenting flags since they're already explained on the Internet, like you said, and focus on what is always performed instead, which is not so well documented.

First - php_filter_strip. It doesn't do much, just takes the flags you pass to the function and processes them accordingly. It does the well-documented stuff.

Then we construct some kind of map and call php_filter_encode_html. It's more interesting: it converts stuff like ", ', & and chars with their ASCII codes lower than 32 and higher than 127 to HTML entities, so & in your string becomes &. Again, it uses flags for this.

Then we get call to php_strip_tags_ex, which just strips HTML, XML and PHP tags (according to its definition in /ext/standard/string.c) and removes NULL bytes, like the comment says.

The code that follows it is used for internal string management and doesn't really do any sanitization. Well, not exactly - passing undocumented flag FILTER_FLAG_EMPTY_STRING_NULL will return NULL if the sanitized string is empty, instead of returning just an empty string, but it's not really that much useful. An example:

var_dump(filter_var("yo", FILTER_SANITIZE_STRING, FILTER_FLAG_EMPTY_STRING_NULL));
var_dump(filter_var("\0", FILTER_SANITIZE_STRING, FILTER_FLAG_EMPTY_STRING_NULL));
var_dump(filter_var("yo", FILTER_SANITIZE_STRING));
var_dump(filter_var("\0", FILTER_SANITIZE_STRING));

string(2) "yo"
NULL
string(2) "yo"
string(0) ""

There isn't much more going on, so the manual was fairly correct - to sum it up:

  • Always: strip HTML, XML and PHP tags, strip NULL bytes.
  • FILTER_FLAG_NO_ENCODE_QUOTES - This flag does not encode quotes.
  • FILTER_FLAG_STRIP_LOW - Strip characters with ASCII value below 32.
  • FILTER_FLAG_STRIP_HIGH - Strip characters with ASCII value above 127.
  • FILTER_FLAG_ENCODE_LOW - Encode characters with ASCII value below 32.
  • FILTER_FLAG_ENCODE_HIGH - Encode characters with ASCII value above 127.
  • FILTER_FLAG_ENCODE_AMP - Encode the & character to & (not &).
  • FILTER_FLAG_EMPTY_STRING_NULL - Return NULL instead of empty strings.

Solution 2

I wasn't sure if "stripping tags" means just the < > characters, and if it preserves content between tags, e.g. the string "Hello!" from <b>Hello!</b>, so I decided to check. Here are the results, using PHP 7.1.5 (and Bash for the command line):

curl --data-urlencode 'my-input='\
'1. ASCII b/n 32 and 127: ABC abc 012 '\
'2. ASCII higher than 127: Çüé '\
'3. PHP tag: <?php $i = 0; ?> '\
'4. HTML tag: <script type="text/javascript">var i = 0;</script> '\
'5. Ampersand: & '\
'6. Backtick: ` '\
'7. Double quote: " '\
'8. Single quote: '"'" \
http://localhost/sanitize.php
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_NO_ENCODE_QUOTES);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: " 8. Single quote: '
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_BACKTICK);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: 7. Double quote: &#34; 8. Single quote: &#39;
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_HIGH);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: &#195;&#135;&#195;&#188;&#195;&#169; 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: & 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;
    • sanitize.php: <?php echo filter_input(INPUT_POST,'my-input', FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_AMP);
    • output: 1. ASCII b/n 32 and 127: ABC abc 012 2. ASCII higher than 127: Çüé 3. PHP tag: 4. HTML tag: var i = 0; 5. Ampersand: &#38; 6. Backtick: ` 7. Double quote: &#34; 8. Single quote: &#39;

Also, for the flags FILTER_FLAG_STRIP_LOW & FILTER_FLAG_ENCODE_LOW, since my Bash doesn't display these characters, I checked using the bell character (, ASCII 007) and Restman Chrome extension that:

  • without either of these flags, the character is preserved
  • with FILTER_FLAG_STRIP_LOW, it is removed
  • with FILTER_FLAG_ENCODE_LOW, it is encoded to &#7;
Share:
50,035
Admin
Author by

Admin

Updated on July 09, 2022

Comments

  • Admin
    Admin almost 2 years

    There's like a million Q&A that explain the options like FILTER_FLAG_STRIP_LOW, but what does FILTER_SANITIZE_STRING do on its own, without any options? Does it just filter tags?

  • Octopus
    Octopus almost 9 years
    as usual, the W3School's definition is completely useless. without defining what "unwanted characters" are, this could do practically anything according to them.
  • guido
    guido about 6 years
    What about backslash?
  • William Entriken
    William Entriken almost 4 years
    as even more usual, the official PHP.net definition is nonexistent