How does Exodus handle Unicode

From NEOSYS Dev Wiki

All text in Exodus is treated as Unicode. This results in culturally correct sorting in all languages and scripts: the ordering is a,A,b,B and not A,B,a,b. Exodus' case conversion also works in all scripts that have the concept of case. Exodus doesn't have any case-insensitive versions of its functions at present, since these are not traditional Pick. They might be added at a later date.

To compare strings and convert case in a culturally correct way, Exodus uses whatever Unicode-aware operating system functions are available. Qt uses the same strategy, so Exodus is in good company. Postgres is probably the same. This is probably more portable in the long run than depending on a library such as ICU, which may not be available.

Exodus uses standard C++ STL wchar_t/std::wstring strings internally to achieve maximal portability. This means that it is UTF-16 on Windows and UTF-32 on Unix/Linux platforms, including OS X. If necessary, Exodus could be converted to use UTF-16 on those *nix platforms which have a C++ compiler (e.g. gcc4) that supports a 16-bit wchar_t.
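As a rough illustration of the point above, the width of wchar_t can be inspected at compile time to see which UTF form a std::wstring would hold on a given platform. This is a minimal sketch and not Exodus code: the function names wchar_bits, wstring_encoding and units_for are hypothetical, and it assumes that a 16-bit wchar_t means UTF-16 and a 32-bit wchar_t means UTF-32.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Number of bits in one wchar_t code unit on this platform.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Name of the UTF form a std::wstring most plausibly holds here.
// Assumption: 16-bit wchar_t means UTF-16, 32-bit means UTF-32.
inline const char* wstring_encoding() {
    return wchar_bits() == 16 ? "UTF-16"
         : wchar_bits() == 32 ? "UTF-32"
                              : "unknown";
}

// Number of wchar_t units needed to store one code point.
// Only a 16-bit wchar_t ever needs two units (a surrogate pair).
inline std::size_t units_for(char32_t cp) {
    assert(cp <= 0x10FFFF);  // outside the Unicode range otherwise
    return (wchar_bits() == 16 && cp > 0xFFFF) ? 2 : 1;
}
```

Calling wstring_encoding() would be expected to report "UTF-16" on Windows and "UTF-32" on typical Linux builds, matching the description above.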

Why the focus on UTF-16? From the programming language point of view, Windows, Java, ICU and Qt all use UTF-16 for Unicode, while only Unix remains firmly UTF-8 at heart. Even Unix is increasingly supporting UTF-16, e.g. in gcc4. In the end, Exodus is trying to ease application-level programming, which UTF-16 does much better than UTF-8, while minimising inter-UTF type conversions at program interfaces, even though such conversions are very fast and mechanical. UTF-8 looks like remaining the encoding of choice for inter-system data transfer (despite its excessive storage requirements for indo-oriental scripts), probably due to its backward compatibility with ASCII but also since, unlike UTF-16 and UTF-32, it has no "endian" issue.
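The "endian" issue can be seen by serialising a single character to bytes. The following is a small sketch, not taken from Exodus; the helper names utf16_bytes and utf8_bytes are hypothetical, and it handles only code points in the Basic Multilingual Plane to stay short.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Serialise one 16-bit UTF-16 unit in the requested byte order.
Bytes utf16_bytes(char16_t unit, bool big_endian) {
    std::uint8_t hi = static_cast<std::uint8_t>(unit >> 8);
    std::uint8_t lo = static_cast<std::uint8_t>(unit & 0xFF);
    return big_endian ? Bytes{hi, lo} : Bytes{lo, hi};
}

// Serialise one BMP code point as UTF-8. The byte sequence is the
// same on every machine because UTF-8 is defined byte by byte.
Bytes utf8_bytes(char16_t cp) {
    assert(cp < 0xD800 || cp > 0xDFFF);  // lone surrogates are not characters
    auto b = [](unsigned v) { return static_cast<std::uint8_t>(v); };
    if (cp < 0x80)  return Bytes{b(cp)};
    if (cp < 0x800) return Bytes{b(0xC0 | (cp >> 6)), b(0x80 | (cp & 0x3F))};
    return Bytes{b(0xE0 | (cp >> 12)),
                 b(0x80 | ((cp >> 6) & 0x3F)),
                 b(0x80 | (cp & 0x3F))};
}
```

For the euro sign U+20AC, utf16_bytes gives 20 AC or AC 20 depending on byte order, so the receiver has to be told which one was used, whereas utf8_bytes always gives E2 82 AC.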

Most databases, including Postgres, are UTF-8, with the exception of MSSQL, which is UTF-16.

Character indexing: Jbase's JBASIC language's in-memory representation of strings is UTF-8, which means that indexing characters requires a slow count from the beginning of the string unless it has some very clever programming. Exodus, and I think Qt, have a dirty rotten cheat regarding accessing characters in a string by numerical position, in that we both ignore the rare Unicode characters that UTF-16 encodes as "surrogate pairs" occupying four bytes. Many people don't realise that UTF-16 characters are either 2 or 4 bytes long. Both Exodus and, I think, Qt treat these four-byte "surrogate pairs" as if they were two separate 2-byte characters, although this is not allowed by the definition of UTF-16. The massive benefit, though, is the ability to continue to treat text strings as randomly accessible arrays of characters using [] or .at() and substr() and their ilk. Theoretically, text strings should never be accessed randomly, but converting all the old software to use iterators that treat strings as streams would be a pain.
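To make the "cheat" concrete, the sketch below encodes a character from outside the 16-bit range into UTF-16 and then indexes the string by unit, as [] and substr() do. It uses std::u16string so that it behaves identically on every platform; it is an illustration under that assumption, not the Exodus implementation, and the helper names (append_utf16, is_high_surrogate, is_low_surrogate, sample) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one Unicode code point to a UTF-16 string.
// Code points above U+FFFF become a high/low surrogate pair.
void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // top 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low 10 bits
    }
}

bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Builds the three characters 'a', U+1F600, 'b'.
std::u16string sample() {
    std::u16string s;
    append_utf16(s, U'a');
    append_utf16(s, 0x1F600);
    append_utf16(s, U'b');
    return s;
}
```

The sample string holds three characters but four 16-bit units, so s[1] and s[2] are the two halves of one character and s.substr(1, 1) yields a lone surrogate. That is the price of keeping O(1) positional access.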

SAP got a lot of flak for standardising on UTF-32 for all in-memory structures at one point. SAP clients were complaining about having to upgrade servers everywhere due to memory issues. It might have been an Oracle "negative PR" campaign, but there is generally a lot of resistance to UTF-32 despite the fact that it resolves the character indexing issue perfectly. UTF-32 is said to typically double the memory throughput required compared to UTF-16 and triple it compared to UTF-8. It is speculated that this defeats CPU memory caching and causes a noticeably serious impact on CPU speed.

Some interesting information from the ICU project

How do I index into a UTF-16 string?

Typically, indexes and offsets in strings count string units, not characters (although in C and Java they have a char type).

For example, in old-fashioned MBCS strings, you would count indexes and offsets by bytes, not by the variable-width character count. In UTF-16, you do the same, just count 16-bit units (in ICU: UChar).
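The difference between counting 16-bit units and counting characters can be sketched as follows. This is plain standard C++ rather than ICU code; the function names count_code_points and unit_offset are hypothetical, and unpaired surrogates are simply counted as one character each.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

bool is_lead(char16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
bool is_trail(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// True if a well-formed surrogate pair starts at unit offset i.
bool pair_at(const std::u16string& s, std::size_t i) {
    return is_lead(s[i]) && i + 1 < s.size() && is_trail(s[i + 1]);
}

// Number of characters (code points); needs a full scan of the string.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n)
        i += pair_at(s, i) ? 2 : 1;
    return n;
}

// Unit offset of the k-th character (0-based); scans from the start.
std::size_t unit_offset(const std::u16string& s, std::size_t k) {
    std::size_t i = 0;
    for (; k > 0 && i < s.size(); --k)
        i += pair_at(s, i) ? 2 : 1;
    return i;
}
```

Here s.size() is the O(1) unit count that indexes and offsets normally use, while count_code_points and unit_offset show the linear scan needed whenever true character positions are required.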

What is the performance difference between UTF-8 and UTF-16?

Most of the time, the memory throughput of the hard drive and RAM is the main performance constraint. UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8 is 50% larger than UTF-16 for East and South Asian scripts. There is no memory difference for Latin extensions, Greek, Cyrillic, Hebrew, and Arabic.
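The size figures quoted above follow directly from the encoding rules, and can be checked with a short calculation. The sketch below computes the encoded size of a text given as code points; the names utf8_len, utf16_len, Sizes and encoded_sizes are hypothetical and the sample strings are only illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to encode one code point in UTF-8.
std::size_t utf8_len(char32_t cp) {
    if (cp < 0x80)    return 1;  // US-ASCII
    if (cp < 0x800)   return 2;  // Latin extensions, Greek, Cyrillic, Hebrew, Arabic
    if (cp < 0x10000) return 3;  // rest of the BMP, including most CJK
    return 4;                    // supplementary planes
}

// Bytes needed to encode one code point in UTF-16.
std::size_t utf16_len(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;  // one unit, or a surrogate pair
}

struct Sizes {
    std::size_t utf8;
    std::size_t utf16;
};

// Total encoded size in bytes of a text in both encodings.
Sizes encoded_sizes(const std::u32string& text) {
    Sizes total{0, 0};
    for (char32_t cp : text) {
        total.utf8  += utf8_len(cp);
        total.utf16 += utf16_len(cp);
    }
    return total;
}
```

For example, the five ASCII letters of "hello" take 5 bytes in UTF-8 and 10 in UTF-16, three Greek letters take 6 bytes in both, and three CJK ideographs take 9 bytes in UTF-8 against 6 in UTF-16.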

For processing Unicode data, UTF-16 is much easier to handle. You get a choice between either one or two units per character, not a choice among four lengths. UTF-16 also does not have illegal 16-bit unit values, while you might want to check for illegal bytes in UTF-8. Incomplete character sequences in UTF-16 are less important and more benign. If you want to quickly convert small strings between the different UTF encodings or get a UChar32 value, you can use the macros provided in utf.h and its siblings utf8.h and utf16.h. For larger or partial strings, please use the conversion API.