How to allow specific characters with OWASP HTML Sanitizer?


Solution 1

The danger in XSS is that one user may insert HTML code into their input data, which you later insert into a web page that is sent to another user.

In principle, there are two strategies you can follow to protect against this. You can either remove all dangerous characters from user input when it enters your system, or you can HTML-encode the dangerous characters later, when you write them back to the browser.

Example of the first strategy:

  1. User enters data (with HTML code)
  2. Server removes all dangerous characters
  3. Modified data is stored in the database
  4. Some time later, the server reads the modified data from the database
  5. Server inserts the modified data into a web page sent to another user

Example of the second strategy:

  1. User enters data (with HTML code)
  2. Unmodified data, with dangerous characters, is stored in the database
  3. Some time later, the server reads the unmodified data from the database
  4. Server HTML-encodes the dangerous data and inserts it into a web page sent to another user

The first strategy is simpler, since you usually store data less often than you use it. However, it is also more difficult because it potentially destroys the data. It is particularly difficult if you need the data for something other than sending it back to the browser later on (like using an email address to actually send an email). It makes it harder to, for example, search the database, include the data in a PDF report, or insert the data into an email.

The second strategy has the advantage of not destroying the input data, so you have greater freedom in how you use the data later on. However, it may be more difficult to verify that you HTML-encode all user-submitted data that is sent to the browser. A solution to your particular problem would be to HTML-encode the email address when (or if) you ever put that email address on a web page.
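As a rough illustration of the second strategy, the sketch below stores nothing in modified form and encodes only at the moment the value is written into HTML. The class name `HtmlEncodeSketch` and the method `escapeHtml` are hypothetical names made up for this example, and the hand-rolled encoder only covers HTML text and quoted attribute values; in a real application a maintained encoder library would be the safer choice.

```java
// Minimal sketch of output-time HTML encoding (the second strategy).
// HtmlEncodeSketch and escapeHtml are hypothetical names, not a library API.
final class HtmlEncodeSketch {

    // Encodes the characters that are significant in HTML text
    // and in quoted attribute values; everything else is copied as-is.
    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The stored values are unmodified; encoding happens only when the page is built.
        String storedEmail = "foo+bar@example.com";
        String storedComment = "<script>alert(1)</script>";
        System.out.println("<td>" + escapeHtml(storedEmail) + "</td>");
        System.out.println("<td>" + escapeHtml(storedComment) + "</td>");
    }
}
```

With this approach the email address keeps its `+` and `@` in the database, and only the markup characters in the comment are neutralised when the page is rendered.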

The XSS problem is an example of a more general problem that arises when you mix user-submitted data and control code. SQL injection is another example of the same problem. The problem is that the user-submitted data is interpreted as instructions and not as data. A third, less well-known example is mixing user-submitted data into an email. The user-submitted data may contain strings that the email server interprets as instructions. The "dangerous character" in this scenario is a line break followed by "From:".
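The email case can be sketched in a few lines. The example below assumes a hypothetical helper that builds a `Subject:` header line from user input and refuses any value containing a carriage return or line feed, since those are what would let the input start a new header. The names `HeaderValueCheck`, `isSingleLine` and `subjectHeader` are invented for illustration; a real mail library may already perform this kind of check.

```java
// Minimal sketch: refuse line breaks in a value that will become one email header line.
// HeaderValueCheck, isSingleLine and subjectHeader are hypothetical names.
final class HeaderValueCheck {

    // True if the value contains neither CR nor LF, so it cannot start a new header line.
    static boolean isSingleLine(String value) {
        return value.indexOf('\r') < 0 && value.indexOf('\n') < 0;
    }

    // Builds a Subject header line, rejecting input that contains a line break.
    static String subjectHeader(String userSubject) {
        if (!isSingleLine(userSubject)) {
            throw new IllegalArgumentException("line break in header value");
        }
        return "Subject: " + userSubject;
    }

    public static void main(String[] args) {
        System.out.println(subjectHeader("Hello there"));
        // A value that tries to smuggle in an extra header is detected.
        System.out.println(isSingleLine("Hi\r\nFrom: attacker@example.com"));
    }
}
```

Note that the check is applied where the data is used as a header, not when the data first enters the system, which matches the point made in the surrounding text.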

It would be impossible to validate all input data against every control character or character sequence that may be interpreted as instructions by some potential future application. The only permanent solution is to sanitize all potentially unsafe data at the point where you actually use it.

Solution 2

You may want to use the ESAPI API to filter specific characters. However, if you want to allow specific HTML elements or attributes, you can use allowElements and allowAttributes as follows.

// Define the policy.
Function<HtmlStreamEventReceiver, HtmlSanitizer.Policy> policy
    = new HtmlPolicyBuilder()
        .allowElements("a", "p")
        .allowAttributes("href").onElements("a")
        .toFactory();

// Sanitize your output.
HtmlSanitizer.sanitize(myHtml, policy.apply(myHtmlStreamRenderer));

Solution 3

I know I am answering this question after 7 years, but maybe it will be useful for someone. Basically, I agree with you guys: we should not allow specific characters, for security reasons (you covered this topic, thanks). However, I was working on a legacy internal project which required escaping all HTML characters except "@", for a reason I cannot tell (but it does not matter). My workaround for this was simple:

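import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

public final class HTMLSanitizerUtils {

    // A policy that allows no elements: all tags are stripped and the text is HTML-encoded.
    private static final PolicyFactory PLAIN_TEXT_SANITIZER_POLICY = new HtmlPolicyBuilder().toFactory();

    // Sanitizes a String value, then restores "@", which the sanitizer encodes as "&#64;".
    // Returns null for null or non-String input.
    public static String toString(Object stringValue) {
        if (stringValue instanceof String) {
            return PLAIN_TEXT_SANITIZER_POLICY.sanitize((String) stringValue).replace("&#64;", "@");
        }
        return null;
    }
}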

I know it is not clean and it creates an additional String, but we badly needed this. So, if you need to allow specific characters, you can use this workaround. But if you need to do this, your application is probably incorrectly designed.
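If more than one character has to be restored, the same workaround can be generalised. The following is only a sketch under assumptions: it presumes the sanitizer has already produced decimal numeric entities such as `&#64;` and `&#43;` (as shown in the question), and it uses hypothetical names (`EntityRestoreSketch`, `restoreAllowed`) that are not part of the OWASP library. Characters with meaning in HTML, such as `<`, `>`, `&` and quotes, should never be put in the allowed set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: turn selected decimal entities back into their characters
// after sanitizing. EntityRestoreSketch and restoreAllowed are hypothetical names.
final class EntityRestoreSketch {

    // Matches decimal numeric character references such as "&#64;".
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    // Decodes an entity only if its character appears in allowedChars;
    // every other entity is left exactly as the sanitizer produced it.
    static String restoreAllowed(String sanitized, String allowedChars) {
        Matcher m = DECIMAL_ENTITY.matcher(sanitized);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = m.group();
            if (Character.isValidCodePoint(codePoint)) {
                String decoded = new String(Character.toChars(codePoint));
                if (allowedChars.contains(decoded)) {
                    replacement = decoded;
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Input as the sanitizer in the question would return it.
        System.out.println(restoreAllowed("foo&#43;bar&#64;example.com", "+@"));
    }
}
```

The design caveat from the answer still applies: needing this step usually suggests the value should have been stored unmodified and encoded only on output.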

Solution 4

To be honest, you should really be validating all user-supplied input against a whitelist. If it's an email address, just use the OWASP ESAPI or something similar to validate the input against its Validator and email regular expressions.

If the input passes the whitelist, you should go ahead and store it in the DB. When displaying the text back to a user, you should always HTML-encode it.

Your blacklist approach is not recommended by OWASP and could be bypassed by someone who is committed to attacking your users.
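A whitelist check for an email field might look roughly like the sketch below. The regular expression is a deliberately simplified assumption written for this example; it is not the pattern ESAPI ships, and the names `EmailAllowListSketch` and `isAcceptableEmail` are hypothetical. A value that passes would be stored unchanged and HTML-encoded only when displayed.

```java
import java.util.regex.Pattern;

// Minimal sketch of whitelist validation for an email field.
// The pattern is a simplified assumption, not ESAPI's email expression.
final class EmailAllowListSketch {

    // Allows letters, digits and a few punctuation characters (including '+')
    // in the local part, then '@', a domain, and a top-level domain of 2+ letters.
    private static final Pattern SIMPLE_EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // True only if the whole input matches the allowed shape and a length limit.
    static boolean isAcceptableEmail(String input) {
        return input != null
                && input.length() <= 254
                && SIMPLE_EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptableEmail("foo+bar@example.com"));
        System.out.println(isAcceptableEmail("<script>@example.com"));
    }
}
```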

Author by

ams


Updated on June 12, 2022

Comments

  • ams almost 2 years

    I am using the OWASP HTML Sanitizer to prevent XSS attacks on my web app. For many fields that should be plain text, the sanitizer is doing more than I expect.

    For example:

    HtmlPolicyBuilder htmlPolicyBuilder = new HtmlPolicyBuilder();
    PolicyFactory stripAllTagsPolicy = htmlPolicyBuilder.toFactory();
    stripAllTagsPolicy.sanitize("a+b"); // returns a&#43;b
    stripAllTagsPolicy.sanitize("foo@example.com"); // returns foo&#64;example.com
    

    When I have fields such as an email address with a + in it, such as [email protected], I end up with the wrong data in the database. So, two questions:

    1. Are characters such as + - @ dangerous on their own, and do they really need to be encoded?
    2. How do I configure the OWASP HTML Sanitizer to allow specific characters such as + - @?

    Question 2 is the more important one for me to get an answer to.