Alok Menghrajani

Previously: security engineer at Square, co-author of HackLang, put the 's' in https at Facebook. Maker of CTFs.

A simple yet robust approach to sanitizing user supplied HTML and CSS

Oct 11, 2015

tags: html | css | parser | xss

Web applications sometimes need to render HTML supplied by users. This happens, for example, with contenteditable/rich text editors or third-party integrations (ads, games, etc.).

The security risk of placing user-supplied HTML in a page is obvious: if the page fails to strip every script, a malicious user can run arbitrary JavaScript and hijack the user experience (i.e. the page is vulnerable to XSS).

This page presents a robust way to sanitize user-supplied HTML and CSS in ~100 lines of JavaScript.

Traditional approaches

A perfectly safe way to isolate user-supplied HTML is to enable a strict CSP ruleset, render the content in an iframe, or host the entire page on a sandbox subdomain.
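As an illustration of the iframe option, here is a minimal sketch that renders untrusted markup in a sandboxed iframe. The function name renderIsolated and its parameters are hypothetical, and the sketch assumes a browser that supports the sandbox and srcdoc attributes.

```javascript
// Hypothetical helper: render untrusted HTML inside a sandboxed iframe.
// `doc` defaults to the page's document; it is a parameter only so the
// function can be exercised outside a browser.
function renderIsolated(untrustedHtml, container, doc = document) {
  const frame = doc.createElement("iframe");
  // An empty sandbox value applies every restriction: scripts are disabled
  // and the content is treated as coming from a unique origin.
  frame.setAttribute("sandbox", "");
  // srcdoc supplies the markup inline instead of loading it from a URL.
  frame.setAttribute("srcdoc", untrustedHtml);
  container.appendChild(frame);
  return frame;
}
```

The trade-off is the one mentioned next: the content lives in its own frame, so it cannot flow with or be styled by the surrounding page.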

In some cases these isolation methods aren't flexible enough, and web developers need a way to sanitize the user-supplied HTML. Writing a parser works but can be tricky: you need to handle the complexity of the HTML specification while allowing only a whitelist of tags.

Leveraging the browser

We can, however, let the browser parse the user-supplied string (something browsers are really good at) and then recursively sanitize the DOM tree before attaching the content to the page.

I believe this approach is very robust for two reasons: we are manipulating DOM nodes instead of strings, and there is no risk of "time-of-check to time-of-use" bugs because the same browser parses the HTML and renders the sanitized string.

Given that strings like <sc%00ript> can end up triggering an XSS on a subset of browsers, any method that parses user-supplied strings into a tree and then returns a serialized version of a subset of the tree is going to be more bullet-proof than code that tries to sanitize a string directly. Failing to write a proper parser is one of the reasons FBML was riddled with security bugs.

Finally, I like the approach of walking the DOM tree because it's simple and can be implemented in a small amount of JavaScript.
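The walk can be sketched roughly as follows. This is an illustration, not the code behind the demo: the whitelist contents, the helper names (isSafeUrl, sanitizeNode, sanitizeHtml), the http(s)-only rule for URL attributes, and the choice to drop a disallowed element together with its subtree are all assumptions.

```javascript
// Hypothetical whitelist: tag name -> attributes allowed on that tag.
const ALLOWED = new Map([
  ["A", ["href"]],
  ["IMG", ["src"]],
  ["DIV", []],
]);
const URL_ATTRIBUTES = new Set(["href", "src"]);
const SAFE_SCHEMES = new Set(["http:", "https:"]);

// True when `value` resolves to an http(s) URL; rejects javascript:, data:, etc.
function isSafeUrl(value) {
  try {
    // The base only exists so that relative URLs can be parsed.
    return SAFE_SCHEMES.has(new URL(value, "https://example.invalid/").protocol);
  } catch (e) {
    return false; // unparseable URLs are dropped
  }
}

// Returns a sanitized copy of `node` created by `doc`, or null to drop the node.
function sanitizeNode(node, doc) {
  if (node.nodeType === 3) { // Node.TEXT_NODE
    return doc.createTextNode(node.nodeValue);
  }
  if (node.nodeType !== 1 || !ALLOWED.has(node.tagName)) { // 1 = Node.ELEMENT_NODE
    return null; // comments, <script> and unlisted tags vanish with their subtree
  }
  const copy = doc.createElement(node.tagName);
  // Only whitelisted attributes are ever read, so onclick & co. never get copied.
  for (const name of ALLOWED.get(node.tagName)) {
    const value = node.getAttribute(name);
    if (value === null) continue;
    if (URL_ATTRIBUTES.has(name) && !isSafeUrl(value)) continue;
    copy.setAttribute(name, value);
  }
  for (const child of node.childNodes) {
    const cleanChild = sanitizeNode(child, doc);
    if (cleanChild !== null) copy.appendChild(cleanChild);
  }
  return copy;
}

// Browser entry point: parse in a separate document, walk, then serialize.
function sanitizeHtml(input) {
  const doc = document.implementation.createHTMLDocument("");
  doc.body.innerHTML = input; // the browser's own parser builds the tree
  const out = doc.createElement("div");
  for (const child of doc.body.childNodes) {
    const clean = sanitizeNode(child, doc);
    if (clean !== null) out.appendChild(clean);
  }
  return out.innerHTML;
}
```

In a page, the result of sanitizeHtml(userInput) would be the string that gets attached to the DOM. Building fresh nodes from a whitelist, rather than deleting bad parts from the parsed tree, means anything the sketch does not know about is left out by default.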

Demo

Below is a demonstration of this method. The input field lets you enter arbitrary HTML but keeps only a small whitelist of tags, including <a>, <img> and <div>. In addition, the sanitizer allows setting the border, margin and padding CSS properties.
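The CSS side can follow the same whitelist idea. The sketch below copies only the allowed declarations from one style object to another; the helper names are hypothetical, and excluding the border-image family (which can reference external images) is an added assumption rather than something the demo is known to do. It relies on the standard CSSStyleDeclaration members length, item(), getPropertyValue() and setProperty().

```javascript
// The three property families named above; everything else is ignored.
const ALLOWED_CSS = ["border", "margin", "padding"];

// True for "margin" and for longhands such as "margin-top".
// border-image-* is excluded because its values can point at external images.
function isAllowedProperty(name) {
  if (name.startsWith("border-image")) return false;
  return ALLOWED_CSS.some((p) => name === p || name.startsWith(p + "-"));
}

// Copies whitelisted declarations between CSSStyleDeclaration-like objects,
// e.g. copyAllowedStyles(originalNode.style, sanitizedNode.style).
function copyAllowedStyles(source, target) {
  for (let i = 0; i < source.length; i++) {
    const name = source.item(i);
    if (isAllowedProperty(name)) {
      target.setProperty(name, source.getPropertyValue(name));
    }
  }
}
```

Because the browser has already parsed the style attribute, the code only ever sees property names and values as the browser understood them, never the raw style string.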

Can you get the page to run alert(1)?

Notes, credits and links

If you plan to post-process the sanitized DOM, keep in mind that some attributes have side effects that may already have taken effect. For example, setting the src attribute on an image fires an HTTP request right away (even before the image is actually added to the DOM). You are probably better off performing all your processing as the new nodes get created.
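To make "processing as the new nodes get created" concrete, here is a small sketch in which an image's src is transformed before it is first assigned to the new node. The helper copyImage, the viaProxy transform and the proxy.example endpoint are all hypothetical.

```javascript
// Hypothetical helper: build an <img> copy whose src is rewritten
// before it is ever assigned to the new element.
function copyImage(original, doc, rewriteSrc) {
  const copy = doc.createElement("img");
  const src = original.getAttribute("src");
  if (src !== null) {
    // `copy` only ever receives the rewritten URL, so it never triggers
    // a request for the untransformed value.
    copy.setAttribute("src", rewriteSrc(src));
  }
  return copy;
}

// Example transform: route images through a (hypothetical) proxy endpoint.
const viaProxy = (src) =>
  "https://proxy.example/fetch?u=" + encodeURIComponent(src);
```

Doing the same rewrite after the node already carries its original src would be too late, since the request for the original URL may already be in flight.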

Credits to Erling for pointing out the need for document.implementation.createHTMLDocument()!

Credits to Ben Gotow for suggesting some improvements.