This post contains some tips on how to set up your web development process to use UTF-8 end to end. What happened was, I saw a pair of posts by Sam Ruby (Unicode and weblogs, Aggregator i18n tests). I can be a bit of a careful (read: slow) thinker at times, so I had to let this percolate through my brain for a while. As it happened, those posts came out while I was in the middle of two projects.
The first project was a site in English and French, built on a ColdFusion-based content management system, and we were trying to deal with accented characters. My co-workers and I figured out how to reliably cough out entities in the right places.
The other project was an XML/XSLT-based one that was getting some fancy characters like bullets and em dashes from form submissions, and dealing badly with them. I think we did the entity replacement trick here, too.
What we found remarkable was that we had been building web sites for years, and yet all of a sudden this had become a problem for us.
Well, finally I grokked what Sam was talking about, and I went ahead and modified my web development toolchain to work in UTF-8. Let me tell you: it was far easier than I had expected. This blog hasn’t been updated yet, but my other website is running as UTF-8.
See, Unicode is a lot more than just accented characters, or Asian characters. It’s also got all the finer typography controls: you want curly quotes? Em dashes? It’s all there. With a little wiki magic, you could even set up your content management system to automatically convert standard ASCII quotes to curly quotes, and dashes to em dashes, so you’d never have to learn how to input them!
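That automatic conversion can be sketched in a few lines of Python. This is only an illustration, not what any particular wiki or CMS does: the function name `smarten` is made up, and the rules are assumptions (a double hyphen stands for an em dash, and a quote opens only at the start of the text or after whitespace, an opening bracket, or a dash).

```python
import re

# Characters after which a straight quote is treated as an opening quote
# (assumed rule: whitespace, opening brackets, or an em dash).
_OPENERS = r"(^|[\s(\[{\u2014])"


def smarten(text: str) -> str:
    """Convert straight ASCII quotes and double hyphens to typographic forms."""
    # Assumed convention: two hyphens in a row mean an em dash (U+2014).
    text = text.replace("--", "\u2014")
    # Double quotes: opening (U+201C) after an opener, closing (U+201D) otherwise.
    text = re.sub(_OPENERS + '"', "\\1\u201c", text)
    text = text.replace('"', "\u201d")
    # Single quotes: opening (U+2018) after an opener; everything else is a
    # closing quote or an apostrophe, which share U+2019.
    text = re.sub(_OPENERS + "'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text
```

A rule set this small will get some cases wrong (a leading apostrophe as in '90s, for instance), and it must not be run over HTML markup, where the straight quotes around attribute values have to stay as they are.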
Seriously, you don’t need to use HTML entities anymore, barring angle brackets and the ampersand. That’s a big win — it makes your source code readable if the HTML is stripped out. Everything still looks good in plain text.
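To illustrate, here is a minimal Python sketch that escapes only the ampersand and the angle brackets and leaves every other character as it is. The function name `escape_minimal` is hypothetical; `html.escape` is from the standard library.

```python
import html


def escape_minimal(text: str) -> str:
    """Escape only &, < and >; accented letters and curly quotes pass through."""
    # quote=False leaves " and ' untouched, so only three characters change.
    return html.escape(text, quote=False)


# Curly quotes and the accented letter survive; only & < > become entities.
sample = "Caf\u00e9 \u201cR&D\u201d <menu>"
escaped = escape_minimal(sample)

# The page is then written out as UTF-8 bytes.
page_bytes = escaped.encode("utf-8")
```

This is only safe for text between tags. Inside an attribute value the quote characters need escaping as well, which is what `html.escape` does by default.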
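To get there, declare UTF-8 in four places. In the head of each HTML page:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
On each form tag:
accept-charset="utf-8"
In the HTTP headers your scripts send:
Content-type: text/html; charset=utf-8
And in your Apache configuration, for static files:
AddCharset utf-8 .html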
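For pages generated by a script, the header and the body have to agree. Here is a rough Python sketch of a CGI-style response that declares UTF-8 in the header and also encodes the body as UTF-8. The function name `build_response` and the sample page are invented for the example.

```python
import sys


def build_response(body: str) -> bytes:
    """Return a CGI-style response whose header and body both use UTF-8."""
    # Encode first so Content-Length counts bytes, not characters.
    payload = body.encode("utf-8")
    headers = (
        "Content-Type: text/html; charset=utf-8\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + payload


if __name__ == "__main__":
    page = (
        "<!DOCTYPE html><html><head>"
        '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
        "<title>Test</title></head>"
        "<body><p>Caf\u00e9 \u201cquotes\u201d \u2014 dashes</p></body></html>"
    )
    # Write bytes directly so the platform's default encoding is never involved.
    sys.stdout.buffer.write(build_response(page))
```

Writing to `sys.stdout.buffer` rather than using `print` is deliberate: it keeps the bytes on the wire identical to what the header promises, whatever the server's locale happens to be.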
Even with all this, I still don’t feel like I’m an expert. For example, the accept-charset attribute mentioned above isn’t supported in version 4 browsers. What’s the encoding of text submitted by a form on those browsers? If it’s not UTF-8, then you’d need to patch your CGI scripts to check for these browsers, and convert the contents to UTF-8. As I learn more, I’ll try to keep you updated.
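One possible shape for that patch, sketched in Python: rather than sniffing the browser, try to decode the submitted bytes as UTF-8 and fall back to a legacy encoding only if that fails. The fallback of windows-1252 (`cp1252`) is purely an assumption for illustration, since the question of what those browsers actually send is still open. The function name `to_utf8` is hypothetical.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Normalize submitted form bytes to UTF-8.

    Valid UTF-8 is passed through unchanged. Anything else is assumed
    (this is a guess, not a known fact) to be in the fallback encoding.
    """
    try:
        # Strict decoding: malformed UTF-8 raises instead of being guessed at.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes the fallback cannot map become U+FFFD rather than raising.
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")
```

Trying UTF-8 first works reasonably well because text in a legacy 8-bit encoding that contains accented characters is rarely also valid UTF-8, so a successful strict decode is a fairly strong signal. Pure ASCII input is identical in both encodings, so it passes through untouched either way.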