What is Unicode? If you have ever tried to incorporate foreign text in a non-Latin script, such as Arabic, Chinese or Bengali, into your translated documents or web pages, you may well have encountered a few problems. The most likely cause is text that has been written and stored in something other than Unicode. What’s Unicode, you say? Read on to discover a brief history of text’s relationship with computers….
Do you know your ASCII? This was one of the first standards used by computers to display characters on the screen or on an output device like a printer. Since ASCII is an American system, it was based on the English alphabet. ASCII defines 128 characters (7 bits), which was all that early computers could handle. These included letters, punctuation, numbers, symbols and various control codes.
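A quick sketch in Python shows what this means in practice: every ASCII character corresponds to a number between 0 and 127, and anything outside that range, such as an accented letter, simply has no ASCII code.

```python
# Every ASCII character fits in 7 bits: code points 0-127.
for ch in "A", "a", "0", "!":
    print(ch, ord(ch))          # A 65, a 97, 0 48, ! 33

# Characters outside ASCII, like 'é', fall beyond the 7-bit range.
print(ord("é") < 128)           # False
```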
The Problem with ASCII
The limited number of characters in ASCII meant that the code was not able to display other characters or non-Latin scripts. French and German, for instance, use a number of accented characters, and there just wasn’t enough space left in ASCII to fit these in.
To solve this problem, the computing industry extended ASCII by increasing the number of characters to 256 (8 bits). A large number of extensions were created to add accented characters, or to display scripts using the Greek or Cyrillic alphabets. These extensions to ASCII are also known as code pages.
The Problem with Code Pages
There were two main problems with code pages. The first was the sheer number of incompatible ones: Microsoft, IBM, ISO and Apple all produced differing standards, so text produced on one system would not display correctly on another without conversion. The second was that the extensions were never designed to be mixed. Greek, Central European, Turkish and Cyrillic text could not be used in the same document, because their code pages assigned different characters to the same underlying numbers.
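You can see the clash directly in Python, which still ships codecs for these legacy code pages. The very same byte value decodes to a completely different character depending on which code page is assumed:

```python
raw = bytes([0xE1])              # one and the same stored byte value

print(raw.decode("latin-1"))     # á  (Western European)
print(raw.decode("cp1251"))      # б  (Cyrillic)
print(raw.decode("cp1253"))      # α  (Greek)
```

Since a single document could only declare one code page, there was no way to put the Cyrillic and Greek readings side by side.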
Chinese, Japanese and Korean
These writing systems contain so many characters that the only way to represent them was to use two 8-bit numbers per character, allowing for over 60,000 characters. A number of code pages were developed to handle these scripts, such as Chinese GBK, Chinese Big5, Korean EUC-KR and Japanese Shift-JIS.
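A small Python sketch illustrates the double-byte idea, and also how these code pages disagreed with one another: the same Chinese character takes two bytes in each encoding, but the byte values differ.

```python
ch = "中"                   # a common Chinese character

print(ch.encode("gbk"))     # b'\xd6\xd0' — two bytes in mainland-Chinese GBK
print(ch.encode("big5"))    # b'\xa4\xa4' — two different bytes in Big5
```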
Unicode and ISO/IEC 10646
Two major initiatives, Unicode and ISO/IEC 10646, which came to fruition around 1990, attempted to solve the problem by providing a single code page. Since 1991, the two standards have effectively defined the same code page. Unicode extends this further by providing standards for collation and bidirectional text (e.g. Arabic).
Why It’s Important
Finally, there is now one common code page capable of representing a large number of languages in a single document. Unicode is now used for most web pages, most often in the form of UTF-8, one of several methods for encoding and storing Unicode characters.
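UTF-8 is a variable-length encoding: plain ASCII characters still take a single byte, while characters from other scripts take two, three or four. A quick Python sketch:

```python
# UTF-8 stores different characters in different numbers of bytes.
for ch in "A", "é", "中", "🎉":
    print(ch, len(ch.encode("utf-8")))   # 1, 2, 3 and 4 bytes respectively
```

This is a large part of why UTF-8 won out on the web: existing ASCII text is already valid UTF-8.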
Unicode is not without its problems. Since the standard grew out of the many standards that existed at the time, it has inherited a number of issues. Many are a result of Unicode’s philosophy of encoding characters rather than glyphs. A glyph is the visual representation of a character and may take on different forms depending on context. Ligatures are combinations of characters displayed with a single glyph.
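Python’s standard library gives a glimpse of the character-versus-glyph distinction. For compatibility with older standards, Unicode does include a few ligature glyphs as single “presentation form” characters, such as the fi ligature, and normalization turns them back into the underlying characters:

```python
import unicodedata

lig = "\ufb01"                              # the 'fi' ligature, U+FB01
print(len(lig))                             # 1 — stored as a single character
print(unicodedata.normalize("NFKC", lig))   # 'fi' — two ordinary characters
```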
Indian scripts such as Bengali and Hindi make use of a large number of ligatures, each composed of a series of characters. Since Unicode only encodes the characters, additional processing is required to turn the characters into glyphs. This is also the case for many complex scripts, such as Arabic, Tamil, Urdu and Vietnamese.
Chinese, Japanese and Korean all use Chinese Han characters, which Unicode has unified into a single code block. However, these characters have evolved into quite different visual representations across the three languages, so displaying them correctly requires a font designed for the language in question.
Legacy software that relies upon the older code pages is not compatible with Unicode text. Pasting Unicode text into a legacy application often produces a series of question marks.
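Python can reproduce this familiar failure: forcing text through a code page that lacks its characters replaces each one with a question mark, just as many legacy applications do.

```python
greek = "Γειά"

# ASCII has no Greek letters, so every character is replaced by '?'.
print(greek.encode("ascii", errors="replace"))   # b'????'
```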
The last piece of the puzzle is having a suitable font to display the text. To display text, the computer needs to convert it into an image for the screen or for printing. Even if the text is encoded in Unicode, you may still have trouble viewing it if you do not have the right font installed. Fortunately, most modern operating systems now provide basic fonts covering most languages.
So the reason you might be experiencing difficulty inserting your Bengali text into your web page, for example, is that it may have been written using an encoding other than Unicode. Web pages and many pieces of software rely on Unicode to display text accurately; if the text you are using isn’t Unicode, it will likely appear as a jumbled mess of characters or even just placeholder icons.
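The jumbled mess has a name, mojibake, and it is easy to reproduce in Python: take correctly stored Bengali text and read it back under the wrong assumption about its encoding.

```python
raw = "বাংলা".encode("utf-8")   # how the Bengali text is really stored
wrong = raw.decode("latin-1")    # a legacy app assuming a Western code page
print(wrong)                     # a jumble of accented Latin characters
```

Converting the text to Unicode before use, rather than guessing at display time, is the reliable fix.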
This post was written by K International’s Studio Manager. We supply quality designed translation artwork in any format. To find out more about this specialist service, please visit our dedicated multilingual studio page.
© Image Copyright wikipedia.org