How does Exodus handle Unicode

All text in Exodus is treated as Unicode, resulting in culturally correct sorting in all languages/scripts: ordering is a, A, b, B and not A, B, a, b. Exodus case conversion also works in all scripts that have the concept of case. Exodus doesn't have any case-insensitive versions of its functions at present, since this is not traditional Pick, but these might be added at a later date.
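
For illustration only (this is not Exodus's actual code), culturally correct ordering can be had in standard C++ by sorting through the locale's collate facet instead of by raw code point. The locale name "en_US.UTF-8" is an assumption and varies by platform:

 #include <algorithm>
 #include <iostream>
 #include <locale>
 #include <string>
 #include <vector>
 
 int main() {
     // A Unicode-aware locale; the exact name is platform dependent
     // and the constructor throws if the locale is not installed.
     std::locale loc("en_US.UTF-8");
     const auto& coll = std::use_facet<std::collate<wchar_t>>(loc);
 
     std::vector<std::wstring> words = {L"B", L"a", L"A", L"b"};
 
     // Collation-based compare: yields the culturally correct a, A, b, B
     // rather than the raw code-point order A, B, a, b.
     std::sort(words.begin(), words.end(),
               [&](const std::wstring& x, const std::wstring& y) {
                   return coll.compare(x.data(), x.data() + x.size(),
                                       y.data(), y.data() + y.size()) < 0;
               });
 
     for (const auto& w : words)
         std::wcout << w << L' ';
     std::wcout << L'\n';  // expected: a A b B
 }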

Exodus uses whatever Unicode-aware functions the operating system makes available to compare strings and convert case. Qt uses the same strategy, so Exodus is in good company; Postgres is probably the same. This might be non-portable, but it is easier in the short term, and perhaps even in the long term, than depending on the Unicode library ICU.
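
As a rough sketch of that strategy (not Exodus source), the C library exposes the operating system's Unicode tables through wcscoll for comparison and towupper for case conversion. The locale name is again an assumption:

 #include <clocale>
 #include <cwchar>
 #include <cwctype>
 #include <iostream>
 
 int main() {
     // Rely on the OS Unicode tables via the C locale machinery;
     // the locale name "en_US.UTF-8" varies by platform.
     std::setlocale(LC_ALL, "en_US.UTF-8");
 
     // Comparison: wcscoll consults the OS collation tables, so "apple"
     // and "Apple" compare culturally rather than by raw code point.
     int cmp = std::wcscoll(L"apple", L"Apple");
 
     // Case conversion: towupper upcases any character the OS tables
     // know about, not just ASCII.
     wchar_t up = std::towupper(L'\u00E9');  // e-acute -> E-acute
 
     std::wcout << cmp << L' ' << static_cast<long>(up) << L'\n';
 }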

Exodus uses standard C++ STL wchar_t/wstring strings internally to achieve maximal portability. This means that text is UTF16 on Windows and UTF32 on Unix/Linux platforms, including OSX. If necessary, Exodus could be converted to use UTF16 on those *nix platforms whose C++ compiler (eg g++ 4) supports a 16-bit wchar_t, together with ICU.
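
The difference is easy to demonstrate; sizeof(wchar_t) reports 2 on Windows and 4 on Unix/Linux/OSX:

 #include <iostream>
 
 int main() {
     // 2 on Windows (UTF16 code units), 4 on Unix/Linux/OSX (UTF32).
     std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
 }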

Why the focus on UTF16? Well, Windows, Java, ICU, and Qt all use UTF16 for Unicode, and only the Unix world is firmly UTF8. In the end you are merely trying to minimise inter-UTF conversions at program interfaces, even though they are very mechanical and fast. UTF8 has natural advantages when it comes to data storage and transfer, where there is no endianness issue.
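
A minimal sketch of such a boundary conversion, turning UTF8 bytes arriving at an interface into the platform's wide string via the C library's mbstowcs. It assumes the process locale is UTF8, and real code would want proper error handling:

 #include <clocale>
 #include <cstdlib>
 #include <string>
 
 // Convert UTF8 bytes into the platform's wide string (UTF16 on
 // Windows, UTF32 on *nix). Assumes a UTF8 process locale.
 std::wstring from_utf8(const std::string& utf8) {
     // The wide length never exceeds the byte length, since every
     // wide character consumes at least one input byte.
     std::wstring out(utf8.size(), L'\0');
     std::size_t n = std::mbstowcs(&out[0], utf8.c_str(), out.size());
     if (n == static_cast<std::size_t>(-1))
         return std::wstring();  // invalid sequence; real code would report it
     out.resize(n);
     return out;
 }
 
 int main() {
     std::setlocale(LC_ALL, "en_US.UTF-8");  // locale name is an assumption
     std::wstring w = from_utf8("caf\xC3\xA9");  // "cafe" with e-acute
 }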

Most databases, including Postgres, use UTF8; the exception is MSSQL, which is UTF16.

Character indexing: jBASE's JBASIC language represents strings in memory as UTF8, which means that indexing a character requires a slow count from the beginning of the string, unless it does some very clever programming. Exodus, and I think Qt, have a dirty rotten cheat when accessing characters in a string by numerical position: both ignore the rare Unicode characters outside the Basic Multilingual Plane, which occupy four bytes in UTF16. Many people don't realise that UTF16 characters are either 2 or 4 bytes long. Both Exodus and, I think, Qt treat these four-byte "surrogate pairs" as if they were two separate 2-byte characters, although this is not allowed by the definition of UTF16. The massive benefit, though, is the ability to continue treating text strings as randomly accessible arrays of characters using [], .at(), substr() and their ilk. Theoretically text strings should never be accessed randomly, but converting all the old software to use iterators that treat strings as streams would be a pain.
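
A sketch of the cheat and its cost: indexing by 16-bit unit is instant, but a character outside the Basic Multilingual Plane occupies two units, and naive indexing lands in the middle of it. Spotting that case is a one-line range check:

 #include <iostream>
 #include <string>
 
 // True if a UTF16 code unit is the first half of a surrogate pair,
 // i.e. the character continues into the next 16-bit unit.
 bool is_high_surrogate(char16_t u) {
     return u >= 0xD800 && u <= 0xDBFF;
 }
 
 int main() {
     // "A" followed by U+1D11E (musical G clef), which needs a surrogate
     // pair in UTF16: 0xD834 0xDD1E. The string holds 3 units but only
     // 2 characters.
     std::u16string s = u"A\U0001D11E";
 
     std::cout << "units: " << s.size() << '\n';  // 3, not 2
 
     // Naive O(1) indexing by unit: s[1] is half a character.
     std::cout << "s[1] is high surrogate: "
               << is_high_surrogate(s[1]) << '\n';  // 1
 }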

SAP got a lot of flak at one point for standardising on UTF32 for all in-memory structures; SAP clients complained about having to upgrade servers everywhere due to memory issues. It might have been an Oracle "negative PR" campaign, but there is generally a lot of resistance to UTF32 despite the fact that it resolves the character indexing issue perfectly. UTF32 is said to typically double memory throughput compared to UTF16 and triple it compared to UTF8. It is speculated that this defeats CPU memory caching and has a noticeably serious impact on CPU speed.
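
The arithmetic behind the throughput claim is easy to check for plain ASCII text, where each character costs 1 byte in UTF8, 2 in UTF16 and 4 in UTF32:

 #include <iostream>
 #include <string>
 
 int main() {
     // The same ten ASCII characters in each encoding; the byte counts
     // illustrate the doubling and tripling claimed above.
     std::string    s8  =  "0123456789";  // 1 byte per character
     std::u16string s16 = u"0123456789";  // 2 bytes per character
     std::u32string s32 = U"0123456789";  // 4 bytes per character
 
     std::cout << s8.size()  * sizeof(char)     << ' '   // 10
               << s16.size() * sizeof(char16_t) << ' '   // 20
               << s32.size() * sizeof(char32_t) << '\n'; // 40
 }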

Some interesting information from the ICU project

http://userguide.icu-project.org/icufaq

How do I index into a UTF-16 string?

Typically, indexes and offsets in strings count string units, not characters (although in C and Java they have a char type).

For example, in old-fashioned MBCS strings, you would count indexes and offsets by bytes, not by the variable-width character count. In UTF-16, you do the same, just count 16-bit units (in ICU: UChar).

What is the performance difference between UTF-8 and UTF-16?

Most of the time, the memory throughput of the hard drive and RAM is the main performance constraint. UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8 is 50% larger than UTF-16 for East and South Asian scripts. There is no memory difference for Latin extensions, Greek, Cyrillic, Hebrew, and Arabic.

For processing Unicode data, UTF-16 is much easier to handle. You get a choice between either one or two units per character, not a choice among four lengths. UTF-16 also does not have illegal 16-bit unit values, while you might want to check for illegal bytes in UTF-8. Incomplete character sequences in UTF-16 are less important and more benign. If you want to quickly convert small strings between the different UTF encodings or get a UChar32 value, you can use the macros provided in utf.h and its siblings utf8.h and utf16.h. For larger or partial strings, please use the conversion API.
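
For illustration, a small sketch of the macro approach described above: U16_NEXT from ICU's utf16.h steps through a UTF-16 buffer one code point at a time, pairing up surrogates correctly. It assumes the ICU headers are available and the program is linked against ICU:

 #include <cstdio>
 #include <unicode/utf16.h>  // ICU's U16_NEXT macro
 
 int main() {
     // "A" then U+1D11E as a surrogate pair: 3 UChars, 2 code points.
     const UChar s[] = {0x0041, 0xD834, 0xDD1E};
     int32_t length = 3;
 
     int32_t i = 0;
     while (i < length) {
         UChar32 c;
         U16_NEXT(s, i, length, c);  // advances i by 1 or 2 units
         std::printf("U+%04X\n", static_cast<unsigned>(c));  // U+0041 then U+1D11E
     }
 }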