Sunday, April 03, 2005

Solving the programming world's problems (one character at a time)

I just spent many minutes correcting a bug in some software that has 'reasonably' complex interactions - i.e. it passes filename strings from Flash to JavaScript to ASP.NET to SQL Server and back.

It will come as little surprise to any programmer that there were, during development and indeed in this instance, problems with string quoting. Someone somewhere (of course) decides to put a ' in their filename. So Flash, JavaScript, .NET and SQL all need to quote, unquote, encode, decode, replace, escape and what have you, just to let one character pass through the system - a character that meant nothing much to the person who put it in their filename.

It struck me that there is a very simple solution, in two parts (the first solving most of the problem and the second ameliorating the rest), so simple indeed that whoever first decided how to delimit a string should have their knuckles broken:

  • Insight: Why on earth are we using the same damn character for 'the start of a string' and 'the end of a string'?
  • Solution: If we were simply to change to two characters, e.g. ` and ´ or « and » or whatever, there would be no more confusion created by nested quoting schemes.
  • Because: Whereas "this "is a "nested" quote" "" but who knows..." has several possible interpretations, «this «is a «nested» quote» «» but who knows...» has only one.
  • Insight: Why are we using, for a quoting scheme (open and close, like brackets, braces etc), the same character that is used for another sort of mark entirely - the apostrophe?
  • Solution: Obviously - choose another character code. Say ' for apostrophe and (as above) ` ´ for a string (or character if you must) delimiter.
  • Because: No longer will a lone character that happens to be inserted toggle open or closed a quoted string.
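To make the 'only one interpretation' claim concrete, here is a minimal sketch in Python. It assumes « and » as the delimiter pair, and `parse_nested` is a made-up name for illustration, not anything from a real library. Because the opening and closing characters differ, a simple stack is enough to recover the nesting, and there is never a choice to make about whether a delimiter opens or closes.

```python
OPEN, CLOSE = "«", "»"  # assumed delimiter pair; any two distinct characters would do


def parse_nested(text):
    """Parse text with distinct open/close delimiters into nested lists.

    Each quoted string becomes a list whose items are either plain text
    fragments or further lists (the nested quoted strings).
    """
    root = []
    stack = [root]  # stack[-1] is the quoted string currently being filled
    buf = []        # characters of the current plain-text fragment

    def flush():
        # Move any pending plain text into the current quoted string.
        if buf:
            stack[-1].append("".join(buf))
            buf.clear()

    for ch in text:
        if ch == OPEN:
            flush()
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif ch == CLOSE:
            flush()
            if len(stack) == 1:
                raise ValueError("unmatched closing delimiter")
            stack.pop()
        else:
            buf.append(ch)

    flush()
    if len(stack) != 1:
        raise ValueError("unmatched opening delimiter")
    return root


tree = parse_nested("«this «is a «nested» quote» «» but who knows...»")
print(tree)
```

An apostrophe inside the text is just another character to this parser, so `parse_nested("«P. O'Neil.doc»")` needs no escaping at all.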

Thus the string that I found SQL returning to JavaScript, and which JavaScript choked on, "this is the 'c.v' of P. O'Neil.doc", could be rendered as «this is the `c.v´ of P. O'Neil.doc» and be universally (close to, anyway) understood.

This way, one is only responsible for the quoted strings one generates, as any valid quoted string can be inserted into another valid quoted string and produce a valid quoted string. Which is not the case today. If you are given a string (i.e. the string is valid) then you could use it where and how you like. Most of the problems that you see with string quoting and escaping come from the receiver of a 'valid' string having to take responsibility for the actual characters in the string... Has someone escaped it, are there embedded characters that I feel are string delimiters, how will the next guy along cope if I escape the string by, say, replacing " with "", or is that """, or \" ... 'oh fudge'.
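The composition property can be checked mechanically. The sketch below is only an illustration under assumptions: it again uses « and » as the delimiters, and `is_valid_quoted` and `insert_into` are hypothetical helper names. It takes 'valid quoted string' to mean one balanced outer pair wrapping balanced contents, then inserts one valid string at every interior position of another and confirms the result is still valid, with no escaping step anywhere.

```python
OPEN, CLOSE = "«", "»"  # assumed delimiter pair


def is_valid_quoted(s):
    """True if s is one quoted string: balanced, with a single outer pair."""
    if not (s.startswith(OPEN) and s.endswith(CLOSE)):
        return False
    depth = 0
    for i, ch in enumerate(s):
        if ch == OPEN:
            depth += 1
        elif ch == CLOSE:
            depth -= 1
            if depth == 0 and i != len(s) - 1:
                return False  # the outer quote closed before the end
    return depth == 0


def insert_into(outer, inner, pos):
    """Insert inner into outer at an interior position, with no escaping."""
    if not 1 <= pos <= len(outer) - 1:
        raise ValueError("pos must lie strictly inside the outer delimiters")
    return outer[:pos] + inner + outer[pos:]


inner = "«this is the `c.v´ of P. O'Neil.doc»"
outer = "«please open «» soon»"

results = [
    is_valid_quoted(insert_into(outer, inner, pos))
    for pos in range(1, len(outer))
]
print(all(results))
```

The receiver never has to inspect the characters of `inner`; it only has to trust that the sender produced a valid quoted string, which is the division of responsibility argued for above.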

Thank you for listening