String comparison in OpenInsight - Part 1 - ANSI Mode
Published 18 MAY 2021 at 11:56:00AM
Developers with systems that require Unicode processing will be pleased to know that the next release adds some new and much-needed string comparison functionality, to which we have given the super-catchy title of "Linguistic String Comparison Mode". However, before we get into the details of that, it's worth looking at how string comparison is currently handled in the system, because it has a huge effect on how your data is sorted. It also seems to be one of those murky and little-understood areas, full of arcane terms like "collation sequences" and "ANSI Maps".
(Note that we've included some Basic+ pseudo-code examples in this post to illustrate more clearly how some parts of the comparison routines work. These are simplifications of the actual C++ internal functions and not actual code from the system itself.)
String comparison in ANSI mode
String comparison in ANSI mode (i.e. where every character has a value between 0 and 255) is usually a straightforward exercise of comparing the byte value of characters against each other. However, it is possible to customize this using the classic ARev technique of "collation sequences", which allow a developer to assign a custom weighting, or "sort value", to a specific character when it is compared against others.
Collation sequences are contained in records in the SYSENV table with a prefix of "CM_". By default OpenInsight includes the following "CM_" records:
- CM_ANSI
- CM_ISO
- CM_US
A collation sequence may be attached to a language definition ("LND") record by specifying the "CM_" key in field <10>, and when you load the LND record the collation sequence is loaded too. For example, in the German LND record (LND_GERMAN_D) you will see that CM_ISO has been specified in field <10>.
Each CM record may contain one or two fields, each field containing a block of 256 bytes that form a collation sequence:
<1> Case-sensitive sequence (required)
<2> Case-insensitive sequence (optional)
To give a character a weighting, simply find its 1-based index in the sequence and enter the byte value you want to give it instead. For example, if we wanted to give the "3" character a sort value of 87, we would do the following:
- Find the ANSI byte value of the character "3", which is 51 (0-based value).
- In field<1> we replace the character at index 52 (1-based value) with a character that has the ANSI byte value of 87, which is "W".
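The two steps above can be modelled in a few lines of Python. This is only an illustration of the indexing, not Basic+ and not code from the system: the helper names `identity_sequence` and `set_sort_value` are hypothetical, and the starting point (a sequence in which every character is weighted by its own byte value) is an assumption rather than the content of any shipped "CM_" record.

```python
def identity_sequence() -> bytearray:
    """Return a 256-byte sequence in which every character sorts by its own byte value."""
    return bytearray(range(256))


def set_sort_value(sequence: bytearray, char: str, sort_value: int) -> None:
    """Give `char` the weighting `sort_value` within a 256-byte collation sequence."""
    ansi_value = ord(char)               # 0-based byte value, e.g. 51 for "3"
    if not 0 <= ansi_value <= 255 or not 0 <= sort_value <= 255:
        raise ValueError("ANSI mode only covers byte values 0 to 255")
    sequence[ansi_value] = sort_value    # Python index 51 is position 52 when counting from 1


seq_field_1 = identity_sequence()        # stands in for field <1> of a CM record
set_sort_value(seq_field_1, "3", 87)

print(seq_field_1[51], chr(seq_field_1[51]))   # 87 W
```

Python indexes from 0, so the ANSI byte value can be used directly as the index. In Basic+ the same slot is position 52, because string positions there start at 1.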
Points to note:
- The first field (sequence) in a CM record is required, while the case-insensitive sequence is not - the system will use a default method for case-insensitive comparison as discussed below.
- A collation sequence must be exactly 256 characters in length. If it is not, the system will not use it.
- The last two characters in the sequence (255 and 256, 1-based) are always set to the byte values of 254 and 255 (@fm and @rm).
- Some LND records have data in field<11> - this is for historical reasons and may be ignored.
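As a rough model of the length and delimiter rules in the list above, the following Python sketch rejects a sequence that is not 256 bytes long and pins the last two slots to the @fm and @rm byte values. It reads "always set" as the system forcing those two slots, which is an interpretation of the text; `prepare_sequence` is a hypothetical name and does not correspond to a real OpenInsight function.

```python
FM = 254  # byte value of @fm
RM = 255  # byte value of @rm


def prepare_sequence(raw: bytes):
    """Return a usable copy of `raw`, or None if it is not exactly 256 bytes long."""
    if len(raw) != 256:
        return None                  # wrong length: the sequence is not used
    sequence = bytearray(raw)
    sequence[254] = FM               # position 255 (1-based)
    sequence[255] = RM               # position 256 (1-based)
    return sequence


reversed_seq = bytes(range(255, -1, -1))          # a deliberately odd 256-byte sequence
prepared = prepare_sequence(reversed_seq)
too_short = prepare_sequence(bytes(range(200)))   # only 200 bytes, so it is rejected

print(prepared[254], prepared[255], too_short)    # 254 255 None
```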
So, now that we know what collation sequences are, we can see how the system uses them at runtime when it needs to compare string values.
ANSI comparison when using a collation sequence
If a collation sequence has been specified (for either a case-sensitive operation, or a case-insensitive operation) the system uses it to extract the sorting value of the character being processed:
- Get the ANSI byte value of the string character being compared.
- Use the ANSI byte value as an index into the appropriate collation sequence.
- The sorting value for the string character is the byte value at that index.
- E.g. The character "3" has an ANSI byte value of 51 - using this as an index we get byte 52 (1-based) from the collation sequence - the value of that byte is the sorting value.
// Pseudo-code
ansiVal1 = seq( ch1 )
sortVal1 = seq( collationSequence[ansiVal1+1,1] )
ansiVal2 = seq( ch2 )
sortVal2 = seq( collationSequence[ansiVal2+1,1] )
cmpVal = ( sortVal1 - sortVal2 )
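For readers who want to experiment with the lookup outside OpenInsight, here is a minimal Python model of the same comparison. It is a sketch only: `compare_chars` is a hypothetical name, and the sequence used is an assumed identity sequence with the "3" character re-weighted to 87, not one of the shipped "CM_" records.

```python
def compare_chars(ch1: int, ch2: int, collation_sequence: bytes) -> int:
    """Compare two ANSI byte values through a 256-byte collation sequence.

    Negative: ch1 sorts first. Zero: equal weight. Positive: ch2 sorts first.
    """
    sort_val1 = collation_sequence[ch1]   # 0-based index, same slot as [ansiVal+1,1] in Basic+
    sort_val2 = collation_sequence[ch2]
    return sort_val1 - sort_val2


# Hypothetical sequence: identity, except "3" is weighted 87 (the same weight as "W").
sequence = bytearray(range(256))
sequence[ord("3")] = 87

print(compare_chars(ord("3"), ord("A"), sequence))   # 22
print(compare_chars(ord("3"), ord("W"), sequence))   # 0
print(compare_chars(ord("3"), ord("Z"), sequence))   # -3
```

With this sequence "3" no longer sorts before the letters: it sorts after "A", ties with "W", and sorts before "Z".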
ANSI case-insensitive comparison without a collation sequence
Case-insensitive comparison of two characters without a collation sequence uses the classic ASCII-style technique of masking the ANSI byte value of each character with 0xDF, which clears the bit (0x20) that separates upper-case from lower-case ASCII letters, and then comparing the results.
// Pseudo-code
sortVal1 = bitAnd( seq( ch1 ), 0xDF )
sortVal2 = bitAnd( seq( ch2 ), 0xDF )
cmpVal = ( sortVal1 - sortVal2 )
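The effect of the 0xDF mask is easy to try out in Python. The sketch below is an illustration of the arithmetic only, with a hypothetical function name, and is not the system's implementation.

```python
def compare_chars_ignore_case(ch1: str, ch2: str) -> int:
    """Compare two single ANSI characters after masking each byte value with 0xDF."""
    sort_val1 = ord(ch1) & 0xDF   # clears bit 5 (0x20), the ASCII upper/lower case bit
    sort_val2 = ord(ch2) & 0xDF
    return sort_val1 - sort_val2


print(compare_chars_ignore_case("a", "A"))   # 0
print(compare_chars_ignore_case("b", "A"))   # 1
print(compare_chars_ignore_case("{", "["))   # 0
```

The last line shows a side effect of the arithmetic: the mask is applied to every byte, not just to letters, so any two byte values that differ only in bit 5 (such as "{" at 0x7B and "[" at 0x5B) come out equal.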
ANSI case-sensitive comparison without a collation sequence
Case-sensitive comparison of two characters without a collation sequence simply compares the ANSI byte value of the characters against each other.
// Pseudo-code
cmpVal = ( seq( ch1 ) - seq( ch2 ) )
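To see what plain byte-value comparison means for sort order, here is a small Python sketch. Two things in it are assumptions rather than statements from this post: that whole strings are compared character by character from left to right, and that Latin-1 is an acceptable stand-in for the ANSI code page. The function name is hypothetical.

```python
def compare_chars_exact(ch1: str, ch2: str) -> int:
    """Compare two single ANSI characters by raw byte value."""
    return ord(ch1) - ord(ch2)


print(compare_chars_exact("Z", "a"))   # -7

# Assumption: strings are ordered by comparing their bytes left to right.
words = ["apple", "Zebra", "3rd", "Apple"]
ordered = sorted(words, key=lambda w: w.encode("latin-1"))
print(ordered)   # ['3rd', 'Apple', 'Zebra', 'apple']
```

Digits come first, then all upper-case letters, then all lower-case letters, which is why "Zebra" lands ahead of "apple".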
That concludes this look at ANSI string comparison - in the next post we'll take a look at how string comparison is handled in UTF8 mode, which is, as you might expect, a little more complex.