5.2. Good, Valid, and Invalid
When filtering incoming data, the best approach is to group it into three categories: good, valid, and invalid. Good data is the kind of data you expect and want. Valid data is the kind of data that your application can process, store, and manipulate, but that might not make contextual sense. Invalid data, finally, is data that breaks some element of your application's storage, processing, or output.
Bringing this into the context of a UTF-8-encoded Unicode application, we can put these labels on specific groups of data. Invalid data represents incoming byte streams that are not valid UTF-8 sequences. Outputting this data will result in invalid pages, and processing it will result in undefined outcomes, possibly including the corruption of other data or potential security vulnerabilities. A good example would be someone sending the byte stream 0b11000000 0b11000000 as a username; each of those bytes is a lead byte that must be followed by a continuation byte, so the pair can never form a valid sequence. You can store it, but any manipulation of it is suspect.
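As a minimal sketch of this first check (shown here in Python; the function name is my own, not from the text), an application can reject any byte stream that fails strict UTF-8 decoding before the data ever reaches storage:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte stream is a valid UTF-8 sequence."""
    try:
        # Python's strict decoder rejects malformed sequences,
        # including lead bytes with no continuation byte.
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"foo"))       # True: plain ASCII is valid UTF-8
print(is_valid_utf8(b"\xc0\xc0"))  # False: 0b11000000 0b11000000
```

Running a check like this at the edge of the application means everything stored downstream is at least valid, even if not yet good.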
Valid data covers valid UTF-8 byte sequences that contain contextually invalid data. A good example is somebody choosing a username with a carriage return in the middle: a sequence you can store, manipulate, and output, but which might ultimately break your application's behavior. Valid data also covers data that is fine to process and present, but makes no contextual sense, such as a string of digits being entered as a user's occupation in her profile.
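A sketch of the carriage-return case (again in Python, with a hypothetical helper name): once a string has passed UTF-8 validation, a second pass can flag control characters that are valid Unicode but contextually unwanted in a field like a username.

```python
import unicodedata

def has_control_chars(s: str) -> bool:
    """Return True if the string contains any control characters.

    Unicode general category "Cc" covers the ASCII control
    characters, including carriage return and line feed.
    """
    return any(unicodedata.category(ch) == "Cc" for ch in s)

print(has_control_chars("foo\rbar"))  # True: embedded carriage return
print(has_control_chars("foo"))       # False
```

Note that this check can only be mechanical; whether a string of digits is a sensible occupation is a contextual question no character-level filter can answer.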
Good data covers the data you expect: you can store, process, and output it with no problems, and its manipulation follows your business rules. Following from our previous examples, good input would include the username "foo." It doesn't contain any invalid byte sequences or unexpected characters, and it is indeed a username.
Your ideal goal is to always process and store good data. Your priority is that all of the data you store is at least valid. Tracking down the bugs related to invalid data wastes a lot of time, and the fix may involve not only correcting the code that checks input validity, but also going back and reprocessing all the data you've already stored to check for validity. As your data set grows, this task becomes more and more daunting.
We'll next take a look at some techniques for ensuring valid data (including filtering UTF-8 and control characters) and turning valid data into good data (by filtering HTML and avoiding XSS issues). Most of the work that goes into making sure valid data is good data revolves around the context in which the data is used (there's nothing intrinsic about a stream of bytes that makes it good), so we can't make any hard and fast rules about this second stage of data processing.