Web applications sometimes need to render HTML supplied by their users. This happens, for example, with contenteditable/rich text editors or third-party integrations (ads, games, etc.).
A safe way to isolate user-supplied HTML is to enable a strict CSP ruleset, render the content in an iframe, or host the entire page on a sandbox subdomain.
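As a rough illustration of the iframe option, here is a minimal sketch. It assumes the `sandbox` attribute is used with an empty value (which applies every restriction, including disabling scripts) and that the content is passed through `srcdoc`. The function name `renderInSandbox` and the `container` argument are hypothetical.

```javascript
// Render untrusted HTML inside a fully sandboxed iframe.
// `doc` defaults to the page's document; it is a parameter only so the
// sketch can be exercised outside a browser.
function renderInSandbox(html, container, doc = document) {
  const frame = doc.createElement('iframe');
  // An empty sandbox value turns on all restrictions (no scripts, unique origin).
  frame.setAttribute('sandbox', '');
  // srcdoc supplies the frame's content inline instead of loading a URL.
  frame.srcdoc = html;
  container.appendChild(frame);
  return frame;
}
```

In a page this would be called as `renderInSandbox(userHtml, document.body)`.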
In some cases, these isolation methods aren't flexible enough, and web developers need a way to sanitize the user-supplied HTML. Writing a parser works but can be tricky: you need to handle the complexity of the HTML specification yet allow only a whitelist of tags.
Leveraging the browser
We can, however, let the browser parse the user-supplied string (something browsers are really good at) and then recursively sanitize the DOM tree before attaching the content to the page.
I believe this approach is very robust for two reasons: we are manipulating DOM nodes instead of strings, and there is no risk of "time of check to time of use" bugs because the same browser is used both to parse the HTML and to render the sanitized string.
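To make the idea concrete, here is a minimal sketch of the parse-then-walk approach. It is not the code behind this post: the names `ALLOWED_TAGS`, `sanitizeNode`, and `sanitizeHtml` are hypothetical, and the choices to use `DOMParser` and to replace disallowed elements with their sanitized children are assumptions of this sketch.

```javascript
// Hypothetical allowlist; tagName is uppercase for HTML elements.
const ALLOWED_TAGS = new Set(['U']);
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE

// Build a clean copy of `node`, creating the new nodes with `doc`.
// Allowlisted elements are recreated without any attributes, other
// elements are replaced by their sanitized children, and comments and
// every other node type are dropped.
function sanitizeNode(node, doc) {
  if (node.nodeType === TEXT_NODE) {
    return doc.createTextNode(node.nodeValue);
  }
  if (node.nodeType !== ELEMENT_NODE) {
    return null;
  }
  const clean = ALLOWED_TAGS.has(node.tagName)
    ? doc.createElement(node.tagName)
    : doc.createDocumentFragment();
  for (const child of Array.from(node.childNodes)) {
    const cleanChild = sanitizeNode(child, doc);
    if (cleanChild !== null) {
      clean.appendChild(cleanChild);
    }
  }
  return clean;
}

// Browser-only entry point: the browser parses the string, then the
// resulting tree is walked. Returns a DocumentFragment ready to attach.
function sanitizeHtml(html) {
  const parsed = new DOMParser().parseFromString(html, 'text/html');
  return sanitizeNode(parsed.body, document);
}
```

A page would attach the result with something like `container.appendChild(sanitizeHtml(userInput))`, where `container` is whichever element should hold the content. Because the clean tree is built from freshly created nodes, nothing from the untrusted input reaches the page unless the walker explicitly copied it.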
Given that strings like <sc%00ript> can end up triggering an XSS on a subset of browsers, any method that parses user-supplied strings into a tree and then returns a serialized version of a subset of that tree is going to be more bulletproof than code that tries to sanitize a string directly. Failing to write a proper parser is one of the reasons FBML was riddled with security bugs.
Below is a demonstration of this method. The input field lets you enter arbitrary HTML but only keeps <u> tags. In addition, the sanitizer allows setting padding CSS properties.
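One way the padding rule could be implemented is sketched below. It assumes styles are read from the parsed element's `style` object (so the browser, not the sanitizer, parses the CSS) and copied property by property onto the clean element. The names `ALLOWED_STYLE_PROPS` and `copyAllowedStyles` are hypothetical and not taken from the demo's code.

```javascript
// Hypothetical allowlist of CSS properties. The `padding` shorthand is
// covered because the browser expands it into these four longhands.
const ALLOWED_STYLE_PROPS = [
  'padding-top',
  'padding-right',
  'padding-bottom',
  'padding-left',
];

// Copy only allowlisted declarations from `source` to `target`.
// Both arguments are elements; everything else in the source's inline
// style is simply never copied.
function copyAllowedStyles(source, target) {
  for (const prop of ALLOWED_STYLE_PROPS) {
    const value = source.style.getPropertyValue(prop);
    if (value) {
      target.style.setProperty(prop, value);
    }
  }
}
```

Reading from `style` instead of the raw `style` attribute keeps the sanitizer away from string handling, in line with the approach described above.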
Can you get the page to run
Next steps and links
If people find this sanitizer useful, I'm going to turn it into "production" code by adding some tests, getting it peer reviewed, and packaging it as an npm package.
Credits to Erling for pointing out the need for