I keep running into conflicting information on this subject. Should you strip HTML and JavaScript before storing input in a database, or allow it and escape it on output? I’m not talking about validating data to ensure it’s of the proper type.

For example, let’s say you allow users to post comments on a news article on your site. Do you take the input, strip any HTML/JavaScript, then store it, or do you trim it, store it, and escape it on output? Or both? What are the pros and cons here?
Always at the point of use.
If you’re using data in a SQL query, you should always escape it immediately before adding it into the query string.
If you’re escaping output to send to the browser, always do so immediately before you write it to the output stream (or whatever mechanism eventually ends up in the output stream).
If data that is “trusted” gets passed off to another module in your application, and then you receive it back, it’s no longer trusted and must be encoded. Even if you wrote both modules yourself.
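As a minimal sketch of "escape at the point of use", here is the comment example in Python with the standard `sqlite3` and `html` modules. The table layout and the function names `save_comment` and `render_comment` are made up for illustration. Note that it uses parameter binding for the SQL step instead of manual escaping, which is a substitution on my part but serves the same purpose of handling the SQL context right where the query is built.

```python
import html
import sqlite3

# Hypothetical in-memory schema, just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")


def save_comment(body: str) -> int:
    # The SQL context is handled here, at the point of use, via parameter
    # binding. The stored value is the user's raw text, unmodified.
    cur = conn.execute("INSERT INTO comments (body) VALUES (?)", (body,))
    conn.commit()
    return cur.lastrowid


def render_comment(comment_id: int) -> str:
    row = conn.execute(
        "SELECT body FROM comments WHERE id = ?", (comment_id,)
    ).fetchone()
    # The HTML context is handled here, immediately before output.
    return "<p>" + html.escape(row[0]) + "</p>"


raw = "<script>alert('hi')</script> Robert'); DROP TABLE comments;--"
comment_id = save_comment(raw)
print(render_comment(comment_id))
```

The database keeps exactly what the user typed, and each dangerous context (SQL, HTML) gets its own encoding step right where the data enters that context.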
For the sake of argument, imagine you are filtering user-provided HTML at input. Your application is live for months or years, and then you discover that your input filtering missed something: maybe an event handler attribute slipped through, or you now need to restrict where images can be linked to.
Now you have to go back and re-process every stored field that went through that filter, and hope you don’t mangle the data in a way that breaks something else and forces a database rollback. If you don’t go back and re-process everything, you could be allowing existing threats to persist in your database.
Alternatively, if you do it immediately before use, your original data is immutable: you create a temporary copy of it and modify that however you wish before it gets used in a potentially dangerous operation. Your original data remains untouched.
That being the case, once you identify a hole, you plug it in a central location (your output encoding / escaping), and that protection now applies to everything going through it, regardless of whether that data is entered in the future or was entered long ago.
Yes, this means by default you’re going to be running that filtering maybe 100,000 times instead of just once. For complex sanitization with a measurable performance overhead, that’s where caching comes in: it turns the CPU overhead into a much easier to manage memory/storage overhead.
For things like htmlspecialchars / htmlentities, just swallow the tiny bit of extra CPU and do it each time. In almost all cases you would spend more CPU cycles accessing the cache than it would take to run the function.
Edit for addendum: None of this is to suggest that you can’t look at a user’s data on input and reject it at that point, giving them a suitable error message, but that’s just for the benefit of the user, and is NOT where your security comes from.


tl;dr:
Validate on input (check it meets your functional requirements, length, characters, format etc).
Encode and escape on output (including outputting a string ready to send to a database, or web browser).
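The "validate on input" half could look something like the following Python sketch. The length limit and the specific rules are assumptions chosen for illustration. The point is that validation accepts or rejects the input, but does not rewrite it to make it "safe".

```python
MAX_COMMENT_LENGTH = 2000  # hypothetical functional requirement


def validate_comment(body: str) -> str:
    """Check functional requirements; raise ValueError with a user-facing message."""
    text = body.strip()
    if not text:
        raise ValueError("Comment must not be empty.")
    if len(text) > MAX_COMMENT_LENGTH:
        raise ValueError(
            f"Comment must be at most {MAX_COMMENT_LENGTH} characters."
        )
    # Reject control characters other than common whitespace.
    if any(ord(ch) < 32 and ch not in "\r\n\t" for ch in text):
        raise ValueError("Comment contains control characters.")
    # Markup is deliberately left alone: encoding happens at output time.
    return text
```

A comment such as `<b>hi</b>` passes validation unchanged and is stored as typed. Whether it is shown as literal text or as bold is decided later by the output encoding.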
If you sanitize on write, you’re stuck with that forever. If you sanitize on read, you can retroactively apply changes to the sanitization process, and thus patch up potential threats you missed at first, or allow things that are now safe thanks to new browser functionality.
After. It’s a one-way operation, and in case you need to do any sort of re-parsing of the content, you always want the raw data.
There are two schools of thought here:
Sanitize on output. This gives you ultimate flexibility. If you change the output (e.g. from HTML to email) you can just change the escaping mechanism to match. HOWEVER, this is also a potential performance nightmare. Repeating the same operation unnecessarily is a waste of resources, which takes us to two:
Escape once on the way in for the most performant option. This is less flexible down the line, but let’s be honest, how often does your output change?
There is an alternative: both.
Store the original, but also store a sanitized version as essentially a long-lived cache which is what you actually use. If you change the output format, use the original to regenerate all the cached versions. Yes, this takes up ~twice the space but text compresses well.
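A possible shape for the "store both" approach, sketched in Python with `sqlite3`. The column names, the `RENDER_VERSION` constant and the `render` function are all hypothetical. The idea is that the rendered column is only a cache that can be rebuilt from the raw column at any time.

```python
import html
import sqlite3

RENDER_VERSION = 1  # hypothetical: bump when the output rules change


def render(raw: str) -> str:
    # Whatever the current output rules are; here: escape, keep line breaks.
    return html.escape(raw).replace("\n", "<br>")


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE comments ("
    " id INTEGER PRIMARY KEY,"
    " raw TEXT NOT NULL,"             # original input, never modified
    " rendered TEXT NOT NULL,"        # long-lived cache of render(raw)
    " render_version INTEGER NOT NULL)"
)


def add_comment(raw: str) -> int:
    cur = db.execute(
        "INSERT INTO comments (raw, rendered, render_version) VALUES (?, ?, ?)",
        (raw, render(raw), RENDER_VERSION),
    )
    db.commit()
    return cur.lastrowid


def regenerate_stale() -> int:
    """Rebuild cached output for rows rendered under an older version."""
    rows = db.execute(
        "SELECT id, raw FROM comments WHERE render_version != ?",
        (RENDER_VERSION,),
    ).fetchall()
    for comment_id, raw in rows:
        db.execute(
            "UPDATE comments SET rendered = ?, render_version = ? WHERE id = ?",
            (render(raw), RENDER_VERSION, comment_id),
        )
    db.commit()
    return len(rows)
```

After changing `render` and bumping `RENDER_VERSION`, a single call to `regenerate_stale()` refreshes every old row from its untouched original.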
The good thing is that you can add the caching at any point in time, which means you can stay with the generic solution and only make it more performant when (if ever) that becomes necessary.
Validate input before you persist it to the DB. You don’t want to save invalid inputs. However, sanitize and/or filter those values when they are used, based on the context in which they are being used.
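To make "based on the context" concrete, here is a small Python sketch showing the same raw value encoded three different ways with standard library calls. The wrapper function names are invented for the example, and the three contexts are just a sample. Others (SQL, shell, email headers) each need their own encoder.

```python
import html
import json
from urllib.parse import quote


def for_html_text(value: str) -> str:
    # Inside HTML element content or a quoted attribute value.
    return html.escape(value)


def for_url_component(value: str) -> str:
    # As a single query-string value or path segment.
    return quote(value, safe="")


def for_json_body(value: str) -> str:
    # As a string literal in a JSON API response body.
    return json.dumps(value)


raw = 'Tom & "Jerry" <3'
print(for_html_text(raw))      # Tom &amp; &quot;Jerry&quot; &lt;3
print(for_url_component(raw))  # Tom%20%26%20%22Jerry%22%20%3C3
print(for_json_body(raw))      # "Tom & \"Jerry\" <3"
```

No single stored form would be correct for all three, which is the practical argument for keeping the raw value and encoding per context.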
At any point in time, whenever a user inputs something, it gets sanitized, even before it gets used in any way. Especially today, when there is such a high risk of your data being stolen or misused, it’s an absolute must.
Easier said than done. The problem is, nobody knows what concrete “sanitization” should be applied “at any point in time”.