UTF-8: Global Character Support

Sep 26 2011 Published by under Best Practices

Content is king. It’s still true, and if you want people to be able to witness its majesty you need technology that supports every character. Character encoding tells the browser (or whatever application displays the content) how to interpret the bytes of the text so that the characters render as expected. You’ve undoubtedly seen examples of this going wrong: empty boxes, black diamonds, or capitalized, strangely accented letters instead of legible text.
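As a rough illustration of how those symptoms arise, here is a minimal Python sketch. The sample word and the choice of Latin-1 as the "wrong" encoding are assumptions made for the demo, not something taken from a real page.

```python
# "año" (Spanish for "year"), written with an escape so the source stays ASCII.
original = "a\u00f1o"

# Case 1: text saved as UTF-8 but read as Latin-1.
utf8_bytes = original.encode("utf-8")        # the ñ becomes two bytes: C3 B1
accented_garbage = utf8_bytes.decode("latin-1")
print(accented_garbage)                      # each byte is shown as its own letter

# Case 2: text saved as Latin-1 but read as UTF-8.
latin1_bytes = original.encode("latin-1")    # the ñ is the single byte F1
# errors="replace" swaps undecodable bytes for U+FFFD, which many browsers
# draw as a black diamond with a question mark.
black_diamond = latin1_bytes.decode("utf-8", errors="replace")
print(black_diamond)
```

The first case gives the capitalized, strangely accented letters; the second gives the replacement character. In both cases the bytes are intact and only the interpretation is wrong.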

While fonts play into this, the base issue is typically encoding, especially when we’re talking about browser-based applications. Every character in a file is stored as a sequence of bytes, and those bytes have to be decoded by the rendering application (e.g. the browser) using the correct encoding. Once decoded, each character is identified by a code point, and the properties defined for that code point act like a small packet of instructions that describes how to put the character on screen: what the base character is, whether it’s capital or lower case, what accents are associated with it, and how it interacts with the characters around it. Interaction with adjacent characters is especially important for complex scripts like Thai and Arabic. In Arabic, for example, the same letter appears differently depending on its position in the word.
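To make that character information a little more tangible, the sketch below uses Python's standard `unicodedata` module to look up a few properties. The sample letters are arbitrary choices for illustration. The Arabic part relies on Unicode's compatibility "presentation form" code points, which is only one way to see positional shapes; in practice the font and text-shaping engine pick the shape.

```python
import unicodedata

# Properties of two related Latin letters (E with acute, upper and lower case).
for ch in ("\u00c9", "\u00e9"):
    print(
        ch,
        unicodedata.name(ch),           # official character name
        unicodedata.category(ch),       # "Lu" = uppercase letter, "Ll" = lowercase
        unicodedata.decomposition(ch),  # base letter + combining accent (hex)
        ch.encode("utf-8"),             # the bytes stored in a UTF-8 file
    )

# The Arabic letter beh (U+0628) has four positional shapes. Unicode keeps
# them as separate compatibility code points.
beh = "\u0628"
presentation_forms = {
    "isolated": "\ufe8f",
    "final": "\ufe90",
    "initial": "\ufe91",
    "medial": "\ufe92",
}
for position, form in presentation_forms.items():
    # NFKC normalization folds each positional shape back to the base letter.
    same_letter = unicodedata.normalize("NFKC", form) == beh
    print(position, unicodedata.name(form), same_letter)
```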

Encoding is set at the system or document level. We’ll talk in terms of XML. How you set the encoding depends on the text editor: there is either a preference option, or the choice is made at the time of the save, as with Notepad. It’s important to set it properly the first time; otherwise, converting the encoding later can be a struggle without a script.
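A conversion script does not have to be elaborate. The following Python sketch assumes you already know the file's current encoding (Latin-1 here, purely as an example). The helper name `convert_encoding` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def convert_encoding(path, source_encoding, target_encoding="utf-8"):
    """Re-save a text file in a different encoding (hypothetical helper)."""
    text = path.read_text(encoding=source_encoding)  # decode with the old encoding
    path.write_text(text, encoding=target_encoding)  # encode with the new one


# Demo in a temporary directory so nothing on disk is touched.
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "greeting.xml"
    # Simulate a file that was originally saved as Latin-1.
    doc.write_text("<greeting>\u00a1Hola, se\u00f1or!</greeting>", encoding="latin-1")

    convert_encoding(doc, "latin-1")
    converted = doc.read_bytes()

print(converted.decode("utf-8"))
```

If the source encoding is guessed wrong, the script will faithfully convert garbage, which is one more reason to set the encoding properly at the start.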

Encoding is declared at the top of the document: in the XML declaration for XML, or in the head for HTML. The developer adds this declaration in conjunction with setting the encoding, and it has to match the encoding the file is actually saved in. The declaration tells the rendering application what encoding the file uses. Sometimes a browser can figure that out without a declaration, but why leave it to chance?
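Here is a small sketch of what the declaration buys you, using Python's standard `xml.etree.ElementTree` parser as a stand-in for the rendering application. The sample document is made up for the demo.

```python
import xml.etree.ElementTree as ET

body = "<title>A\u00f1o nuevo</title>"  # "Año nuevo"

# Saved as Latin-1 AND declared as Latin-1: the parser decodes it correctly.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body.encode("latin-1")
declared_title = ET.fromstring(declared).text
print(declared_title)

# Saved as Latin-1 with no declaration: the parser assumes UTF-8 and the
# lone F1 byte is not valid UTF-8, so parsing fails.
undeclared = body.encode("latin-1")
try:
    ET.fromstring(undeclared)
    undeclared_failed = False
except ET.ParseError as err:
    undeclared_failed = True
    print("parse error:", err)

# Saved as UTF-8 and declared as UTF-8: works, and matches the XML default.
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?>' + body.encode("utf-8")
utf8_title = ET.fromstring(utf8_doc).text
print(utf8_title)
```

A browser is usually more forgiving than an XML parser and will guess rather than stop, which is how garbled text ends up on screen.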

It’s a very good idea to always use UTF-8 as your encoding. This is because UTF-8 is a Unicode encoding that has become the de facto global standard for XML and many other markup languages. Why worry about researching which individual encoding supports Chinese, Japanese, German and Canadian French when UTF-8 will support them all? Not only does it make things easier from a development standpoint, but it makes multilingual pages (i.e. one page with multiple languages on it) much easier to develop and maintain. Additionally, your entire global suite will be more cohesive because the browser (or whatever rendering application you use) will not have to switch between encodings to support different target languages.
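The multilingual-page argument can be checked with a few lines of Python. The sample greetings and the list of legacy encodings below are arbitrary picks for illustration; the point is only to compare which encodings can hold the whole page.

```python
# One "page" containing the four languages mentioned above.
samples = {
    "Chinese": "你好",
    "Japanese": "こんにちは",
    "German": "Grüß Gott",
    "Canadian French": "Bonjour, Québec",
}
page = " / ".join(samples.values())


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Compare UTF-8 with a few single-region encodings.
support = {
    encoding: can_encode(page, encoding)
    for encoding in ("utf-8", "latin-1", "shift_jis", "gb2312")
}
for encoding, ok in support.items():
    print(f"{encoding:10} {'ok' if ok else 'cannot represent the whole page'}")
```

Each regional encoding covers its own language but not the full mix, whereas UTF-8 round-trips the whole string without loss.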
