Welcome to the second part of our mini-series explaining the mechanics of how string comparisons are handled in OpenInsight. In our previous post we looked at the inner workings when running in ANSI mode - this time we'll look at UTF8 mode instead.
(Note that we've included some Basic+ pseudo-code examples in this post to illustrate more clearly how some parts of the comparison routines work. These are simplifications of the actual C++ internal functions and not actual code from the system itself.)
In UTF8 mode characters can be multi-byte and can therefore have a value greater than 255 (normally referred to as their "code point", or in Basic+ terms, the Seq() value of a character), so the standard ANSI-mode method described previously cannot be used. Instead, a slightly different approach is taken to allow higher code points to be included in custom sorting.
When the system is loaded the UTF8 library creates an internal character-map (called the "ANSI-map") which is a 256-element array (0-255) of code-point values. This is initialized to the same values as the standard ANSI character set, i.e. position 65 will have the code point for the ANSI character with the value of 65, position 230 will have the code point for the ANSI character with the value of 230 and so on.
The ANSI-map can be changed at runtime so that code points higher than 255 can be included. Code points that appear in the ANSI-map are always sorted lower than those that don't, regardless of their actual value. The following functions (exported from RevUTF8.dll) are used to query and update the ANSI-map:
GetAnsiToUnicode - returns the code point for a specified map element.
    // MapIndex - must be an integer between 0 and 255
    CodePoint = GetAnsiToUnicode( MapIndex )
SetAnsiToUnicode - updates the code point for a specified map element.
    // MapIndex     - must be an integer between 0 and 255
    // NewCodePoint - integer value of the code point to set
    Call SetAnsiToUnicode( MapIndex, NewCodePoint )
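To make the behaviour of the map easier to picture, here is a small Python model of it. This is only a sketch: the function names mirror the exported ones but are hypothetical stand-ins, not calls into RevUTF8.dll, and the map is assumed to start as a simple identity mapping (the real map follows the ANSI character set, which may differ from this for some positions above 127).

```python
# Hypothetical in-memory model of the ANSI-map; NOT the RevUTF8.dll API.
# Assumption: the map starts as an identity mapping (index == code point).
ansi_map = list(range(256))


def _check_index(map_index):
    """Reject indexes outside the 0-255 range of the map."""
    if not 0 <= map_index <= 255:
        raise ValueError("MapIndex must be an integer between 0 and 255")


def get_ansi_to_unicode(map_index):
    """Return the code point stored at the given map element."""
    _check_index(map_index)
    return ansi_map[map_index]


def set_ansi_to_unicode(map_index, new_code_point):
    """Store a new code point at the given map element."""
    _check_index(map_index)
    ansi_map[map_index] = new_code_point


# Example: place U+0152 (Latin capital ligature OE) into element 140
# so that it takes part in ANSI-mapped sorting.
set_ansi_to_unicode(140, 0x0152)
```

After the final call, element 140 holds 0x0152, while every other element still holds its own index.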
When comparing two characters we first need to find a "sort index" for each character, which is determined as follows:
    // Pseudo-code
    dim ansiMap( 255 )
    sortIndex = -1 ; // Not found
    codePoint = seq( ch )
    // Try the low byte of the code point as a direct index first
    testIndex = bitAnd( codePoint, 0xFF )
    if ( ansiMap( testIndex ) == codePoint ) then
       // Found
       sortIndex = testIndex
    end else
       // Not found - scan the whole map
       for testIndex = 255 to 0 step -1
          if ( ansiMap( testIndex ) == codePoint ) then
             // Found and exit loop
             sortIndex = testIndex
          end
       next
    end
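The lookup can be modelled in Python as follows. This is a sketch of the pseudo-code only, not of the C++ internals; the name sort_index and the identity starting map are assumptions made for illustration.

```python
def sort_index(code_point, ansi_map):
    """Return the ANSI-map index holding code_point, or -1 if it is unmapped."""
    # Fast path: try the low byte of the code point as a direct index.
    test_index = code_point & 0xFF
    if ansi_map[test_index] == code_point:
        return test_index
    # Slow path: scan the map from 255 down to 0, stopping at the first match.
    for test_index in range(255, -1, -1):
        if ansi_map[test_index] == code_point:
            return test_index
    return -1


# Assumption: identity starting map, with element 140 remapped to U+0152.
ansi_map = list(range(256))
ansi_map[140] = 0x0152

index_a = sort_index(ord("A"), ansi_map)    # found by the fast path
index_oe = sort_index(0x0152, ansi_map)     # found by the scan
index_cjk = sort_index(0x4E2D, ansi_map)    # not in the map at all
index_old = sort_index(140, ansi_map)       # displaced by the remapping
```

Note the last case: once element 140 has been given a new code point, the code point that used to live there (140) is no longer in the map, so it is treated as unmapped.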
Once this has been done for both characters we use the following comparison procedure:
    // Pseudo-code
    begin case
       case ( sortIndex1 == -1 ) and ( sortIndex2 == -1 )
          // Both Non-ANSI-mapped - use a simple code point compare
          cmpVal = codePoint1 - codePoint2
       case ( sortIndex1 == -1 )
          // sortIndex2 was found in the ANSI map so it's sorted lower
          cmpVal = 1
       case ( sortIndex2 == -1 )
          // sortIndex1 was found in the ANSI map so it's sorted lower
          cmpVal = -1
       case OTHERWISE$
          // Both are ANSI mapped
          begin case
             case hasCollationSequence
                sortVal1 = seq( collationSequence[sortIndex1+1,1] )
                sortVal2 = seq( collationSequence[sortIndex2+1,1] )
             case isCaseInsensitive
                sortVal1 = bitAnd( sortIndex1, 0xDF )
                sortVal2 = bitAnd( sortIndex2, 0xDF )
             case OTHERWISE$
                sortVal1 = codePoint1
                sortVal2 = codePoint2
          end case
          cmpVal = sortVal1 - sortVal2
    end case
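Putting the two steps together, the whole comparison can be sketched in Python. As before this is a model of the pseudo-code rather than the real implementation: compare_chars, its parameters and the identity starting map are all hypothetical, and the collation sequence is represented as a plain list of 256 integer sort values.

```python
def sort_index(code_point, ansi_map):
    """Return the ANSI-map index holding code_point, or -1 if it is unmapped."""
    test_index = code_point & 0xFF
    if ansi_map[test_index] == code_point:
        return test_index
    for test_index in range(255, -1, -1):
        if ansi_map[test_index] == code_point:
            return test_index
    return -1


def compare_chars(cp1, cp2, ansi_map, collation=None, case_insensitive=False):
    """Return <0, 0 or >0 as cp1 sorts before, equal to, or after cp2."""
    idx1 = sort_index(cp1, ansi_map)
    idx2 = sort_index(cp2, ansi_map)
    if idx1 == -1 and idx2 == -1:
        return cp1 - cp2            # both unmapped: plain code point compare
    if idx1 == -1:
        return 1                    # only cp2 is mapped, so cp2 sorts lower
    if idx2 == -1:
        return -1                   # only cp1 is mapped, so cp1 sorts lower
    if collation is not None:
        return collation[idx1] - collation[idx2]
    if case_insensitive:
        return (idx1 & 0xDF) - (idx2 & 0xDF)
    return cp1 - cp2


# Assumption: identity starting map, with element 140 remapped to U+0152.
ansi_map = list(range(256))
ansi_map[140] = 0x0152

mapped_vs_unmapped = compare_chars(0x0152, 0x0100, ansi_map)
both_unmapped = compare_chars(0x4E2D, 0x0100, ansi_map)
case_sensitive = compare_chars(ord("a"), ord("A"), ansi_map)
ignore_case = compare_chars(ord("a"), ord("A"), ansi_map, case_insensitive=True)
```

Here mapped_vs_unmapped is negative even though 0x0152 is numerically larger than 0x0100, because only the former appears in the map. With case_insensitive set, "a" and "A" compare as equal because masking with 0xDF clears the bit that separates them.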
So, this system works pretty well out of the box for languages that can be expressed using the ANSI character set, but for other languages much of the burden falls on the application developer to maintain and tune the language settings and collation sequences to their requirements. This could require considerable effort and ignores much of the functionality provided by the OS itself, so in the next post we'll take a look at how this is being addressed in the next release.