5.1. Data Integrity Policies
Data integrity is key to a successfully engineered application. The data you receive, process, and store is what your application is all about. Regardless of what transformations you apply to your data, or the novel way in which you present it and allow it to be interrogated, it's worthless unless the data you're working with is valuable.
Application data in web applications is traditionally the most protected element: it is written to multiple machines in multiple datacenters. Multiple disks in mirrored or parity RAID configurations help avoid data loss. Backup tapes and offsite backup cycles aid in disaster recovery. A lot of money goes into protecting the data you store, but it's all for nothing if your application allows garbage into the system.
A data integrity policy (if it were a common enough topic to have an acronym it would be DIP) is a set of rules and regulations regarding how your application ensures that it stores and outputs expected data. It should cover everything from checking that data is valid within your character set and encoding, through filtering undesirable characters, to contextual integrity, such as ensuring the well-formedness of stored markup.
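The three layers of checking described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the application is assumed to use UTF-8, "undesirable characters" is taken to mean control characters, and the stored markup is assumed to be XML-style, so a fragment counts as well-formed if it parses inside a wrapper element.

```python
import unicodedata
import xml.etree.ElementTree as ET


def decode_utf8(raw: bytes) -> str:
    """Encoding check: raises UnicodeDecodeError if raw is not valid UTF-8."""
    return raw.decode("utf-8", errors="strict")


def strip_control_chars(text: str) -> str:
    """Character filter: drop control characters except tab, LF, and CR."""
    return "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )


def is_well_formed(markup: str) -> bool:
    """Contextual check: does the fragment parse as XML inside a root element?"""
    try:
        ET.fromstring(f"<root>{markup}</root>")
        return True
    except ET.ParseError:
        return False


def filter_at_border(raw: bytes) -> str:
    """Hypothetical border filter: apply all three checks before storing."""
    text = strip_control_chars(decode_utf8(raw))
    if not is_well_formed(text):
        raise ValueError("markup is not well-formed")
    return text
```

A real policy would pick its own rules at each layer (for example, an HTML-aware parser rather than a strict XML one); the point is only that each rule is applied once, before the data is stored.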
A workable data integrity policy is based on the founding principle that data inside the application is pristine. That is to say, incoming data is filtered at the border and stored filtered. Outputting of data can then happen with no processing required.
This approach is sensible for two important reasons. First, filtering is not a trivial effort, especially for encoding conversions and syntax checking. Web applications typically output a stored chunk of data far more often than they accept it as input (which happens only once). By performing conversions at input time, the filtering needs to be applied only once instead of on every view.
Second, a typical chunk of application data will have fewer input vectors than output vectors. If your application allows you to title an item of content, the titling might happen in only one place, while the display of that title can happen in many more contexts. By filtering at the input border, you can reduce the code complexity when outputting, which will reduce your overall code size. Outputting can then become as easy as reading straight from the database and echoing the content, safe in the knowledge that the content is valid.
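The title example can be made concrete with a small sketch. Everything here is hypothetical: the `TitleStore` class stands in for a database table, the filter (Unicode NFC normalization plus whitespace collapsing) is just one plausible choice, and the three output methods represent different display contexts. A counter shows that the filter runs once per write, no matter how many times the title is read.

```python
import unicodedata


class TitleStore:
    """In-memory stand-in for a table of item titles (illustrative only)."""

    def __init__(self) -> None:
        self.titles = {}        # item_id -> filtered title
        self.filter_calls = 0   # how many times the filter has run

    def _filter(self, raw: str) -> str:
        self.filter_calls += 1
        # Normalize to NFC and collapse runs of whitespace to single spaces.
        return " ".join(unicodedata.normalize("NFC", raw).split())

    # The single input vector: data is filtered here and stored filtered.
    def save_title(self, item_id: int, raw: str) -> None:
        self.titles[item_id] = self._filter(raw)

    # Several output vectors: each one reads the stored value as-is.
    def page_heading(self, item_id: int) -> str:
        return self.titles[item_id]

    def feed_entry(self, item_id: int) -> str:
        return self.titles[item_id]

    def search_result(self, item_id: int) -> str:
        return self.titles[item_id]
```

With this layout, adding a fourth or fifth display context costs one more read method and no additional filtering code.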
Should you store the data escaped and unescape it for the edge case, or store it unescaped and escape it for the typical case? This is very much a point of style. I prefer to store all data unescaped and escape it during display, but the reverse can easily be argued. Set a standard for your application and stick to it; mixing the two methods is a recipe for disaster, leading to unescaped vulnerabilities or double-escaped ugliness.
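As a minimal sketch of the store-unescaped, escape-on-display convention, assuming HTML output and using Python's standard `html.escape` (the `render_title` helper is hypothetical):

```python
import html


def render_title(stored_title: str) -> str:
    """Escape at display time; the stored value itself is left unescaped."""
    return f"<h1>{html.escape(stored_title)}</h1>"


stored = "Fish & Chips"                # stored unescaped
print(render_title(stored))            # <h1>Fish &amp; Chips</h1>

# Mixing conventions: escaping a value that was already escaped on the
# way in produces the double-escaped output a user would see literally.
already_escaped = html.escape(stored)  # "Fish &amp; Chips"
print(html.escape(already_escaped))    # Fish &amp;amp; Chips
```

The opposite mistake, echoing a value without escaping because it was assumed to be stored escaped, is the one that opens injection holes, which is why a single application-wide convention matters more than which convention is chosen.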
Depending on the needs of your application, you may not want or be able to process data in this way. If you receive signed data from users that you need to store and pass on, making any changes to the data will break the signature. In instances like this, you may decide to store only unfiltered data and filter it at display time if needed, or store two copies of the data: one raw with its signature intact and one processed for display purposes. The policy you come up with should reflect the way you use the data within your system.
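The two-copy option might look something like the sketch below. The signature scheme is an assumption made purely for illustration: an HMAC-SHA256 over the raw bytes with a shared key stands in for whatever signing mechanism your users actually employ, and `StoredMessage`, `store_signed`, and `signature_valid` are hypothetical names.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredMessage:
    raw: bytes        # exactly as received, so the signature still verifies
    signature: bytes  # signature supplied with the raw data
    display: str      # filtered copy, used only for display


def store_signed(raw: bytes, signature: bytes) -> StoredMessage:
    """Keep the raw bytes untouched and derive a separate display copy."""
    text = raw.decode("utf-8", errors="replace")
    # Display filter: drop non-printable characters, keep tab/LF/CR.
    display = "".join(ch for ch in text if ch.isprintable() or ch in "\t\n\r")
    return StoredMessage(raw=raw, signature=signature, display=display)


def signature_valid(msg: StoredMessage, key: bytes) -> bool:
    """Verify the signature against the raw copy, never the display copy."""
    expected = hmac.new(key, msg.raw, hashlib.sha256).digest()
    return hmac.compare_digest(expected, msg.signature)
```

Verification always runs against `raw`; if the filtered text were signed or passed on in its place, any character the filter removed would invalidate the signature.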