How does Exodus handle Unicode

From NEOSYS Dev Wiki
Revision as of 19:53, 26 June 2009 by 85.17.154.66 (talk)

All text in Exodus is treated as Unicode, resulting in culturally correct sorting in all languages and scripts. Ordering is a,A,b,B and not A,B,a,b. Exodus case conversion also works in all scripts that have the concept of case. Exodus does not have any case-insensitive versions of its functions at present, since these are not traditional Pick, but they might be added at a later date.

Exodus uses whatever Unicode-aware operating system functions are available to compare strings and convert case. Qt uses the same strategy, so Exodus is in good company. Postgres is probably the same. This might be non-portable, but it is easier in the short term, and perhaps even in the long term, than depending on the Unicode library ICU.
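As a rough illustration of leaning on the platform rather than ICU, the sketch below compares and upper-cases wide strings through the standard C++ locale facets, which in turn use whatever the operating system provides. The helper names (user_locale, collate_compare, to_upper) are hypothetical and are not Exodus functions; the actual ordering you get depends on which locales are installed on the machine.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Exodus): the user's preferred locale,
// falling back to the classic "C" locale if the OS cannot supply one.
std::locale user_locale() {
    try {
        return std::locale("");
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Hypothetical helper: compare two wide strings using the collation rules
// of the given locale. Returns <0, 0 or >0, like strcmp.
int collate_compare(const std::wstring& a, const std::wstring& b,
                    const std::locale& loc) {
    const std::collate<wchar_t>& coll =
        std::use_facet<std::collate<wchar_t>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}

// Hypothetical helper: upper-case every wchar_t using the locale's ctype
// facet. Characters without a case mapping are returned unchanged.
std::wstring to_upper(std::wstring text, const std::locale& loc) {
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t>>(loc);
    for (wchar_t& ch : text) {
        ch = ct.toupper(ch);
    }
    return text;
}
```

With a cultural locale such as the one returned by user_locale(), collate_compare is what would typically yield the a,A,b,B style of ordering; with the classic locale it degrades to plain code-unit order.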

Exodus uses standard C++ STL wchar_t/wstring strings internally to achieve maximal portability. This means that it is UTF16 on Windows and UTF32 on Unix/Linux platforms including OSX. If necessary, Exodus could be converted to use UTF16 on those *nix platforms which have a C++ compiler (eg g++ 4) that supports a 16 bit wchar_t, and which have ICU.
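The platform difference comes down to the size of wchar_t, which can be checked at compile time. The following is a minimal sketch; the function names are made up for illustration, and it assumes the usual convention that a 2 byte wchar_t holds UTF16 code units and a 4 byte wchar_t holds UTF32 code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: name the encoding implied by the width of wchar_t.
// Assumes 2 bytes means UTF16 (Windows) and anything else means UTF32.
std::string wide_encoding_name() {
    return sizeof(wchar_t) == 2 ? "UTF16" : "UTF32";
}

// Hypothetical helper: bytes used by the characters of a wide string,
// ignoring the terminator and any allocator overhead.
std::size_t wide_storage_bytes(const std::wstring& text) {
    return text.size() * sizeof(wchar_t);
}
```

The same three-character string therefore occupies 6 bytes of character data on Windows and 12 bytes on a typical Linux build.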

Why the focus on UTF16? Well, Windows, Java, ICU, and Qt all use UTF16 for Unicode and only Unix is firmly UTF8. In the end you are merely trying to pick an encoding that minimises inter-UTF conversions at program interfaces, even though those conversions are very mechanical and fast. UTF8 has natural advantages when it comes to data storage and transfer, where there is no endian issue.
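To show how mechanical such a conversion is, here is a minimal sketch of a UTF32 to UTF8 encoder of the kind needed when an in-memory string crosses an interface to UTF8 storage. The function name utf32_to_utf8 is hypothetical, and replacing invalid code points with U+FFFD is an assumption of this example rather than documented Exodus behaviour.

```cpp
#include <string>

// Hypothetical helper: encode a sequence of Unicode code points as UTF8.
// Invalid values (surrogates, or anything above U+10FFFF) are replaced
// with U+FFFD, the Unicode replacement character.
std::string utf32_to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The output is a plain byte sequence with no byte-order variants, which is the "no endian issue" advantage for storage and transfer.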

Most databases, including Postgres, are UTF8, with the exception of MSSQL which is UTF16.

Character indexing: Jbase's JBASIC language's in-memory representation of strings is UTF8, which means that indexing characters requires a slow count from the beginning of the string unless it has some very clever programming. Exodus, and I think Qt, have a dirty rotten cheat regarding accessing characters in a string by numerical position, in that we both ignore the rare Unicode "surrogate pairs" which occupy four bytes. Many people don’t realise that UTF16 characters are either 2 or 4 bytes long. Both Exodus and, I think, Qt treat these four byte "surrogate pairs" as if they were two separate 2 byte characters, although this is not allowed by the definition of UTF16. The massive benefit though is the ability to continue to treat text strings as randomly accessible arrays of characters using [] or .at() and substr() and their ilk. Theoretically text strings should never be accessed randomly, but converting all the old software to use iterators that treat strings as streams would be a pain.
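The effect of the cheat can be seen with a character outside the Basic Multilingual Plane. The sketch below uses std::u16string so that it behaves the same on every platform; the helper names are hypothetical and this is not Exodus code. Indexing with [] counts 16 bit code units, so a surrogate pair shows up as two "characters", whereas a strictly correct count has to walk the string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers: classify a UTF16 code unit.
bool is_high_surrogate(char16_t unit) { return unit >= 0xD800 && unit <= 0xDBFF; }
bool is_low_surrogate(char16_t unit) { return unit >= 0xDC00 && unit <= 0xDFFF; }

// Hypothetical helper: count real Unicode code points in a UTF16 string by
// scanning from the start and treating each valid surrogate pair as one.
// This is the slow, strictly correct alternative to using size().
std::size_t count_code_points(const std::u16string& text) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (is_high_surrogate(text[i]) && i + 1 < text.size() &&
            is_low_surrogate(text[i + 1])) {
            ++i;  // skip the low half of the pair
        }
        ++count;
    }
    return count;
}
```

For example, the string u"a\U0001D11Eb" (a, MUSICAL SYMBOL G CLEF, b) has size() 4 but only 3 code points, and text[1] and text[2] are the two halves of the pair (0xD834 and 0xDD1E) rather than characters in their own right. Calling substr(0, 2) on it would cut the pair in half.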

SAP got a lot of flak for standardising on UTF32 for all in-memory structures at one point. SAP clients were complaining about having to upgrade servers everywhere due to memory issues. It might have been an Oracle "negative PR" campaign, but there is generally a lot of resistance to UTF32 despite the fact that it resolves the character indexing issue perfectly.