Thoughts on Software Engineering: Sanitizing User Input, Part I

Many years ago I took a secure coding class. I mainly remember one thing from the course: “Assume all user input is evil.” This is fine because the instructors did say “If you only remember one thing from the course, remember this!”

What can a user do with input alone? Let’s say you are a malicious user of a web forum. You could create an account on the forum and set your display name to “<script>alert(‘surprise!’);</script>”. After registration, there is a script tag with javascript being stored in the database where your username should be.

With that in place, any time another user of the forum loads a page where your username is displayed (say on any of your comments) your custom javascript will execute on their browser! This is a Bad Thing because usually a malicious user would not just pop up “surprise” but instead use that snippet of javascript to, say, grab all the cookies in your browser and send them to said user for nefarious purposes.

This is called Cross Site Scripting (XSS) and works best on pages that are rendered on the server because the script is always loaded by the browser that way. The script might not be run if it’s added to the page after the page is loaded by an AJAX call. But an AJAX application can still be susceptible, for example if the username is added to the DOM like this:

// bad code that is susceptible to XSS

// JSON object returnedUser.displayName is

// “<script>alert(‘surprise!’);</script>”

var newdiv = document.createElement('div');

newdiv.innerHTML = returnedUser.displayName;

$('user_listing').append(newdiv);

Sometimes developers in an overzealous commitment to security decide to html encode all input, or strip all special characters such as “<” and “>”. There are worse things to be overzealously committed to, but we can do better. Instead of sanitizing every field every time, we can say that the check depends on what the content is supposed to represent and how it will be displayed. For example, in a user display name, a link tag is probably not valid to have as part of the name, but a link tag or style tag may be very helpful and relevant in a displayed product description. For this reason, most HTML sanitizers are very flexible about how they can be configured, so that we can easily allow different kinds of HTML in different places.

Now that we’ve agreed we need to sanitize user-supplied text, the next question is when to do the sanitizing. You could make an argument to sanitize user-supplied text on input or on output. A reason for sanitizing input is that philosophically it makes sense to catch potential security issues as early as possible, and you would avoid storing malicious input on your system at all. This takes malicious input and turns it into a validation issue, just like storing numbers for a zip code is a validation issue. A reason for sanitizing output is that the site then has the option to change sanitization policy dynamically. At some point html may be considered unsafe (say, allowing links in a product description) and later it could be considered safe (due to a business decision to allow it). If you sanitize only output, you are free to decide the policy whenever you want.

Some libraries are designed to operate on input. Lacking a strong driver to be able to change the sanitization policy dynamically, I would favor sanitizing on input as well.

Stay tuned for Part II where we’ll look at a slick way to sanitize input as an input validation problem!

Thoughts on Software Engineering

Sunday, May 19, 2013

Sanitizing User Input, Part I

No comments:

Post a Comment