Unicode: What all is involved?

The current version of Unicode covers nearly every writing system in the world – in theory. But no font supports all of its characters. The question, therefore, is how companies should handle the Unicode issue.

 

Unicode encoding

When it comes to digital character encoding, several shortcomings quickly come to light. Most companies do use Unicode or multiple code pages that cover the basic letters of the Latin alphabet. However, special characters, other alphabets, and seldom-used diacritical marks quickly exceed the usual limits. That is problematic, of course, because personal and product names, brands, addresses, and the like that contain special characters are sometimes entered differently or even displayed incorrectly, depending on the character set in use.

That is a sensitive issue, especially in public administration, where spelling can have legal implications. Furthermore, in many EU countries, the US, and Canada, citizens are legally entitled to the accurate spelling of their names, and transcription occasionally causes difficulties.

Take the name Møller, for example. In companies whose character set does not include the ø, the name may be entered as Möller or Moller. Because the name is written in different forms, a search of the customer database or civil register could fail.
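
To make this failure mode concrete, here is a minimal Python sketch using only the standard library; the fold_name helper is a hypothetical illustration of folding spelling variants to a common search key, not a production matching strategy:

    import unicodedata

    def fold_name(name: str) -> str:
        # Hypothetical helper: reduce a name to a crude ASCII search key.
        # NFD decomposition splits letters like "ö" into a base letter plus
        # a combining mark, which is then dropped. Note that "ø" does NOT
        # decompose this way and needs an explicit rule -- one reason naive
        # transliteration breaks name searches in the first place.
        name = name.replace("ø", "o").replace("Ø", "O")
        decomposed = unicodedata.normalize("NFD", name)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    # The three spellings of the surname compare as different strings ...
    print("Møller" == "Möller")                                    # False
    # ... but fold to the same search key:
    print({fold_name(n) for n in ("Møller", "Möller", "Moller")})  # {'Moller'}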

 

Unicode: Concentrate on what’s important

At first glance the problem appears solvable with Unicode; after all, the current version of this character set standard covers nearly every writing system in the world and contains well over 100,000 characters. But what good are the more than one million theoretically available Unicode code points if the fonts in use do not support them? It is not enough to encode the letters or characters; they also need to be displayable (see Box 1).
Many conventional fonts are quite limited, supporting only 400 to 500 characters. The limits come into focus when you consider that the authorities in Germany, for example, have already agreed on the uniform use of 700 letters and symbols.
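
Whether a given font actually covers a required repertoire can be checked programmatically. Below is a sketch, assuming the open-source fontTools library (pip install fonttools); the font file name and the sample repertoire are placeholders:

    from fontTools.ttLib import TTFont

    REQUIRED = "Møller Şişli Česká Łódź"    # sample repertoire to verify

    font = TTFont("SomeCorporateFont.ttf")  # placeholder font file
    cmap = font["cmap"].getBestCmap()       # maps code point -> glyph name
    missing = {ch for ch in set(REQUIRED) if ord(ch) not in cmap}
    print("Characters without a glyph:", missing or "none")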

Companies and organizations are therefore faced with the question of which Unicode characters they actually need and how to display them. The fact is that no font supports all Unicode characters, and many conventional fonts do not even cover the 700 mentioned above. Meanwhile, the topic is gaining momentum as the internationalization of our society reaches business communication. Market pressure is waking companies up to the fact that the customer’s language is an increasingly important competitive factor, beginning with the correct spelling of names.
The problem, however, is that many firms have outdated code page structures and rely on code-page-based processing. Consequently, they are incapable of mapping the more than 100 additional letters and symbols this requires. Old IT structures need to be made Unicode-capable.

Define the rules for using Unicode

There is no getting around the Unicode standard; this is indisputable. Its implementation, on the other hand, is another story. How can companies and organizations efficiently convert their existing IT structures to Unicode? Perplexity and confusion often reign. Some want to play it safe and include every character. Others follow their gut feeling, blind to the consequences of omitting Unicode characters.

One thing is certain: with Unicode, you need to limit yourself to what is essential. The public sector in Germany is a pioneer in this regard, with clearly established rules on the Unicode characters to be covered. In its April 2014 decision, for example, the IT Planning Council of the German federal and state governments defined a uniform Unicode character set for registry-keeping and data transmission. It specifies that the names of individuals must be stored in identical form in all public electronic registers.

Seek support from OM specialists

Other sectors, such as banking and insurance, are lagging behind. Some have no Unicode support whatsoever; others have converted their applications to the standard but don’t really know how to work with it. What is missing are precise handling rules – the “guardrails,” so to speak. Industry associations and institutions will soon have no choice but to give this matter some thought and publish their recommendations.

Meanwhile, companies need to take action and define their own guidelines. Years will pass before corporate document creation and processing systems can support the specified character repertoire with a high level of quality.
Latin code pages alone no longer suffice. On the other hand, the greater the targeted Unicode coverage, the more complicated things get. Ultimately, it affects every document processing system – from generation, formatting, and conversion to delivery via different communication channels. The best advice is to seek the support of a document and output management specialist who is also well-versed in the Unicode specifications.

How it all began: a brief history of Unicode

Conventional computer code pages cover only a limited number of characters. In Western character encodings, this limit is usually 128 (7-bit) code points, as in the familiar ASCII standard, or 256 (8-bit) characters, as in ISO 8859-1 (also known as Latin-1) or variants of EBCDIC. After subtracting the control characters, only 95 elements remain for displaying letters and special characters in ASCII, and 191 in the 8-bit ISO character sets.
The problem with these character encodings is that the display of characters in different languages in one and the same text is difficult, if not impossible. This fact considerably handicapped international data exchange in the 1980s and 1990s.
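
The effect is easy to reproduce; the following lines are a small Python sketch using only the standard library:

    name = "Møller"
    print(name.encode("ascii", errors="replace"))  # b'M?ller'   (ø is lost)
    print(name.encode("latin-1"))                  # b'M\xf8ller' (fits in 8 bits)
    print(name.encode("utf-8"))                    # b'M\xc3\xb8ller'
    # name.encode("iso8859_7")  # Greek code page: raises UnicodeEncodeError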

So Unicode was developed a quarter of a century ago, largely driven by companies such as Microsoft and Apple. The goal was, and remains, to overcome the incompatibility of different encodings. First, the character repertoire of the conventional code pages was expanded from the original 256 to 65,536 (256 × 256) code points.

The first version of Unicode, Version 1.0 (published in 1991), already covered more than 50,000 different characters, including the Latin, Arabic, Cyrillic, Hebrew, and Greek alphabets as well as several “exotic” scripts such as Thai, Lao, Tamil, Malayalam, and Telugu. The so-called CJK scripts (Chinese, Japanese, Korean) followed with Release 1.0.1 (June 1992).

But limitations were encountered again and again, resulting in Unicode’s continual expansion to this day. The latest iteration, Version 9.0, encodes 135 different writing systems. And that’s far from the end of the story: characters from further writing systems are continually added to Unicode, which is maintained in parallel as ISO/IEC 10646, the Universal Coded Character Set (UCS) of the International Organization for Standardization (ISO).

Unicode’s development potential is far from exhausted. Current work is devoted to the support of emoticons – which may seem silly to some, but in certain industries, such as telecommunications, the topic is understandably arousing great interest.

Unicode, ASCII, code pages – a concise glossary of digital character encoding

ASCII (American Standard Code for Information Interchange)
  • The standard defined in the US in the late 1960s for the 7-bit encoding of 128 characters (95 printable and 33 nonprintable).
  • Basis for subsequent encodings that use more bits.
  • Printable characters: the Latin alphabet (uppercase and lowercase letters), the ten Arabic digits, punctuation marks, and various special characters.
  • Nonprintable characters: control characters such as line feed and tab, plus protocol characters (end of transmission, acknowledgment, separators).
  • Drawback: covers only the English character set; does not include diacritical marks (dots, hooks, curves, strokes, rings) and letters that occur only in certain Latin alphabets (incl. French, Spanish, Portuguese, Turkish) as well as other scripts (incl. Cyrillic, Greek, Hebrew, Arabic, and various Indian languages).
Code page
  • Character set table for 8-bit coding of 256 characters maximum (of those, 128 are covered by the ASCII standard).
  • Problem: The 256-character limit prevents mapping all the world’s alphabets and scripts in a single table, hence the existence of different code pages in corresponding standards, such as ISO Latin 1 for most Western European characters. Currently there are a total of 15 defined 8-bit character sets, which are combined in the ISO 8859 standard.
  • There are also so-called multibyte code pages, but they are rather complicated to use in practice, which is why they are rarely found in Europe.
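
The ambiguity is easy to demonstrate in Python: one and the same byte decodes to entirely different characters depending on which code page is assumed.

    raw = b"\xf8"                  # a single 8-bit value
    print(raw.decode("latin-1"))   # 'ø' (ISO 8859-1, Western European)
    print(raw.decode("iso8859_2")) # 'ř' (ISO 8859-2, Central European)
    print(raw.decode("iso8859_7")) # 'ψ' (ISO 8859-7, Latin/Greek)
    print(raw.decode("cp1251"))    # 'ш' (Windows-1251, Cyrillic)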
Diacritics
  • Marks such as dots, strokes, hooks, curves or rings that are added to letters to denote a different pronunciation or word stress from the original; diacritical marks are positioned above or below, and in some cases through, the letters. The modified letter may be considered the same letter or a separate letter. The diacritical marks expand the alphabet without having to create new letters.
  • Diacritical marks are found in many languages. The Latin alphabet alone has 1,338 letters that result from diacritical marks. They are also found in the Arabic, Hebrew, and Indian languages, where they are used primarily to indicate vocalization. Certain diacritical marks are unique to individual or related languages and can be used to identify them.
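
Unicode can represent many such letters either precomposed or as a base letter plus a combining mark, which is why normalization matters when comparing text. A short Python sketch (standard library only):

    import unicodedata

    precomposed = "é"        # U+00E9, a single code point
    decomposed = "e\u0301"   # 'e' followed by a combining acute accent

    print(precomposed == decomposed)   # False: different code point sequences
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True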
Unicode
  • A standard, now 25 years old, for the digital encoding of characters from different writing systems. In the first stage, the number of characters covered by the original code pages was expanded from 256 to 65,536. Unicode Version 1.0 already covered more than 50,000 characters.
  • In theory, Unicode today can be used to digitally encode all the world’s alphabets and scripts; it provides more than 1.1 million code points. In fact, only somewhat more than 100,000 characters are assigned in the current version, Version 9.0.
  • Unicode defines not only languages but also mathematical and special characters (incl. Braille, emoticons, currency signs).
  • Unicode originated from a corporate initiative whose members included Microsoft and Apple.
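
Code points beyond the original 65,536 (the Basic Multilingual Plane) take more storage units per character, which is worth knowing when sizing databases and interfaces. A brief Python illustration:

    ch = "\U0001F600"                        # the 😀 emoji, code point U+1F600
    print(hex(ord(ch)))                      # 0x1f600 -- well above 0xFFFF
    print(len(ch.encode("utf-8")))           # 4 bytes in UTF-8
    print(len(ch.encode("utf-16-le")) // 2)  # 2 UTF-16 units (a surrogate pair)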

Compart Unicode database

One of the most comprehensive indexes of its kind offers extensive research options

Since June of last year, the Compart website (www.compart.com) has featured a database for Unicode, the international standard for the digital encoding of characters. It contains the most common codes from different writing systems and is continually updated and expanded. Character sets from China, Japan, and Korea will be added soon. The purpose is to offer programmers a reliable and comprehensive reference to support them in their work.

The Unicode index, available in German and English, follows the principles of responsive design for optimal viewing and speed. All characters are logically classified for easy research. Detailed information is provided for each character, including source citations and related links.

The comprehensive index’s greatest benefit is its advanced research options. You can find answers to basic questions (Which characters are contained in a specific code page? How many letters have a dieresis?) or very specific ones (What is the AFP character for a particular letter or symbol?).


To access the Compart Unicode database, go to: www.compart.com/en/unicode
