String comparison in OpenInsight - Part 3 - Linguistic Mode

Published 19 MAY 2021 at 12:29:09PM

Welcome to the final part of this mini-series on the string comparison mechanics in OpenInsight. In the first two parts we reviewed how this task is currently handled in both ANSI and UTF8 modes, but this time we'll take a look at a new capability introduced for the next release, called the "Linguistic String Comparison Mode".

As we've seen previously, there is certainly room for improvement when dealing with string comparisons and sorting in non-English languages, mainly due to the burden placed on the developer to maintain the sorting parameters, especially once the requirements extend beyond the basic ANSI character set. There is also no advantage taken of the capabilities of Windows itself, which provides a comprehensive National Language Support (NLS) API for testing strings for linguistic equality.

What is "linguistic equality"?

If you're unfamiliar with the term "linguistic equality", it essentially means comparing strings according to the language rules of a specific locale, thereby providing appropriate results for a user of that locale. For example, consider the following cases that illustrate how comparisons differ for the same characters in different locales:

Many locales equate the ae ligature (æ) with the letters ae. However, Icelandic (Iceland) considers it a separate letter and places it after Z in the sorting sequence.
The A Ring (Å) normally sorts with merely a diacritic difference from A. However, Swedish (Sweden) places the A Ring after Z in the sorting sequence.

In a standard OI system, these sorts of rules would require the developer to define the collation sequence records that represent them, which simply duplicates effort when Windows itself is easily capable of handling this for us.

Using Linguistic Mode

In order to utilize this API without impacting current systems we have introduced a new "mode" into OpenInsight that allows you to determine exactly when you wish to enable linguistic support. This mode comprises three elements:

Mode ID - this is the mode itself, which can be one of the following values:
- (0) Normal, non-linguistic mode.
- (1) Linguistic mode.
Mode Flags - A set of bit-wise flags for use with the Linguistic mode.
Mode Locale - A locale identifier for use with the Linguistic mode (defaults to the current user's locale).

It's simply a case of setting the mode when you want it to apply to sorting and case-insensitive operations, and turning it off when you don't. Just as with Extended Precision Mode, you can set a default mode for your application and then adjust this at runtime as desired.
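
As a rough sketch of that set-and-restore pattern, the fragment below captures the current settings, switches to Linguistic mode for one block of work, and then puts the previous settings back. It uses only the functions listed in this post and the <1>/<2>/<3> layout returned by GetStrCmpMode; the "sv-SE" locale and the variable names are illustrative choices, and it is written as a fragment in the same style as the other examples here.

```basic
$Insert RTI_StrCmpMode_Equates

// Remember whatever is currently active ( <1> Mode, <2> Flags, <3> Locale )
PrevSCM = GetStrCmpMode()

// Switch to Linguistic mode, default flags, Swedish (Sweden) locale
Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, 0, "sv-SE" )

// ... sorting or comparisons that need linguistic rules go here ...

// Put the previous settings back
Call SetStrCmpMode( PrevSCM<1>, PrevSCM<2>, PrevSCM<3> )
```

Restoring the saved values, rather than hard-coding Normal mode, keeps the fragment safe to use in an application whose default mode is already Linguistic.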

(Note that using the Linguistic mode is not affected by OpenInsight's ANSI or UTF8 mode, as the string comparisons are processed "outside" in Windows itself.)

The following five functions are used to control the Linguistic Mode:

GetDefaultStrCmpMode - returns the default application mode settings.
SetDefaultStrCmpMode - sets the default application mode.
GetStrCmpMode - returns the current mode settings.
SetStrCmpMode - sets the current mode.
GetStrCmpStatus - returns the status of a string comparison operation.

Along with this set of equates:

RTI_STRCMPMODE_EQUATES
MSWIN_COMPARESTRING_EQUATES

Example:

$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates
 
// Set the mode to Linguistic, sorting digits as numbers, case-insensitive, 
// and with linguistic casing, using the "en-GB" (English - United Kingdom) locale
SCFlags = BitOr( LINGUISTIC_IGNORECASE$, NORM_LINGUISTIC_CASING$ )
SCFlags = BitOr( SCFlags, SORT_DIGITSASNUMBERS$ )
 
Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "en-GB" ) 
 
// Now do some sorting ...
Call V119( "S", "", "A", "L", data, "" )

Full details on each of these functions can be found at the end of this post, but let's take a more detailed look at each of the mode settings:

Mode ID

This is an integer value that controls how string comparisons are made:

When set to "0" then the application will run in "normal" mode, which means that string comparisons will use the methods described in parts 1 and 2 of this series. The Mode Flags and Mode Locale settings are ignored.

When set to "1" the application uses the Windows CompareStringEx function for string comparisons instead. The Mode Flags and the Mode Locale settings will also be used with this.

Mode Flags

This setting is an integer comprising one or more optional bit-flags that are passed to the Windows CompareStringEx function when running in Linguistic Mode (it may be set to 0 to apply the default behavior). A full description of their use can be found in the Microsoft documentation for the CompareStringEx function, but briefly these are:

Flag	Description
LINGUISTIC_IGNORECASE$	Ignore case, as linguistically appropriate.
LINGUISTIC_IGNOREDIACRITIC$	Ignore nonspacing characters, as linguistically appropriate.
NORM_IGNORECASE$	Ignore case.
NORM_IGNOREKANATYPE$	Do not differentiate between hiragana and katakana characters.
NORM_IGNORENONSPACE$	Ignore nonspacing characters.
NORM_IGNORESYMBOLS$	Ignore symbols and punctuation.
NORM_IGNOREWIDTH$	Ignore the difference between half-width and full-width characters.
NORM_LINGUISTIC_CASING$	Use the default linguistic rules for casing, instead of file system rules.
SORT_DIGITSASNUMBERS$	Treat digits as numbers during sorting.
SORT_STRINGSORT$	Treat punctuation the same as symbols.
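
To make the effect of a flag more concrete, here is a small sketch using SORT_DIGITSASNUMBERS$ on its own. The two file names are made-up sample values; the expectation in the comments follows the Microsoft description of this flag (sort "2" before "10") rather than anything we have measured here.

```basic
$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates

// Linguistic mode, digits treated as numbers, current user locale
Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SORT_DIGITSASNUMBERS$, "" )

NameA = "file2"
NameB = "file10"

If NameA LT NameB Then
   // Expected branch with this flag: 2 is numerically less than 10
   Order = NameA : @fm : NameB
End Else
   // A character-by-character comparison would end up here instead,
   // because "2" sorts after "1"
   Order = NameB : @fm : NameA
End
```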

Mode Locale

This can be the name of the locale to use (like "en-US", "de-CH" etc.), or one of the following special values:

"0" or null - Use the current user locale (LOCALE_NAME_USER_DEFAULT).
"1" - Use the current OS locale (LOCALE_NAME_SYSTEM_DEFAULT).
"2" - Use an invariant locale that provides stable locale and calendar data (LOCALE_NAME_INVARIANT).

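The following sketch shows a named locale and one of the special values side by side, using the A Ring case mentioned earlier. The variable names are illustrative, the comments describe what the locale rules quoted above would lead us to expect rather than verified output, and how the "Å" literal is stored in your source depends on your own system settings.

```basic
$Insert RTI_StrCmpMode_Equates

CharA = "Å"
CharZ = "Z"

// Named locale: Swedish (Sweden) places the A Ring after Z,
// so this would be expected to be true
Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, 0, "sv-SE" )
SwedishResult = ( CharA GT CharZ )

// Special value "2": the invariant locale, where the A Ring normally
// sorts with only a diacritic difference from A, so this would be
// expected to be false
Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, 0, "2" )
InvariantResult = ( CharA GT CharZ )
```
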
The Linguistic Mode and Basic+

The following Basic+ operators and functions are affected by the Linguistic Mode:

LT operator
LE operator
EQ operator
NE operator
GE operator
GT operator
_LTC operator
_LEC operator
_EQC operator
_NEC operator
_GEC operator
_GTC operator
IndexC function
V119 function
Locate By statement
LocateC statement

Note that when used with the case-insensitive operators and functions (such as _eqc, IndexC() etc.) the LINGUISTIC_IGNORECASE$ flag is always applied if the NORM_IGNORECASE$ flag has not been specified.
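
As an illustration, the fragment below uses two of the affected constructs once Linguistic mode is active: the _eqc operator and the Locate By statement. The names and list contents are made-up sample data, and no particular result is claimed, since that depends on the locale in force.

```basic
$Insert RTI_StrCmpMode_Equates

// Linguistic mode, no explicit flags, current user locale
Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, 0, "" )

NameA = "Ångström"
NameB = "ångström"

// NORM_IGNORECASE$ was not specified, so LINGUISTIC_IGNORECASE$ is
// applied to this case-insensitive comparison
If NameA _eqc NameB Then
   Matched = 1
End Else
   Matched = 0
End

// Locate By on an ascending, left-justified list: if the name is not
// already present, Pos is the position where it should be inserted
NameList = "Adams" : @fm : "Baker" : @fm : "Young"
NewName  = "Åberg"

Locate NewName In NameList By "AL" Using @fm Setting Pos Else
   NameList = Insert( NameList, Pos, 0, 0, NewName )
End
```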

Performance considerations

Using the Linguistic Mode can impact performance for two reasons:

There is just more work to do - comparison of strings using more complex rules will always be slower than a simple comparison of ordinal byte values or code points.
The strings must be copied and transformed into UTF16 (wide) strings before passing to the Windows CompareStringEx function. While this is not a slow operation in and of itself it will add some overhead.

Because of this, Linguistic Mode is not enabled by default - you are free to choose when to apply it yourself.

String Comparison Mode functions

GetDefaultStrCmpMode function

This function returns an @fm-delimited dynamic array containing the current default string comparison mode settings for the application in the format:

<1> Mode
<2> Flags
<3> Locale

Example:

$Insert RTI_StrCmpMode_Equates
 
DefSCM  = GetDefaultStrCmpMode()
DefMode = DefSCM<GETSTRCMPMODE_MODE$>
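
Building on this, a short sketch of how the returned Flags value might be tested for a single bit with BitAnd. The field positions come from the format described above; the DigitsAsNumbers variable is just an illustrative name.

```basic
$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates

DefSCM   = GetDefaultStrCmpMode()
DefMode  = DefSCM<1>
DefFlags = DefSCM<2>

// Is the application default Linguistic, with digits sorted as numbers?
DigitsAsNumbers = 0
If DefMode EQ STRCMPMODE_LINGUISTIC$ Then
   If BitAnd( DefFlags, SORT_DIGITSASNUMBERS$ ) Then
      DigitsAsNumbers = 1
   End
End
```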

SetDefaultStrCmpMode function

This function sets the default string comparison mode for an application. The mode is set to these default values for each new request made to the engine (i.e. each event or web-request). This is to protect against situations where an error condition could force the engine to abort processing before the mode could be reset, thereby leaving it in an unknown state.

This function takes three arguments:

Name	Description
Mode	Specifies the default mode to set: "0" for Normal mode, or "1" for Linguistic Mode.
Flags	Bitmask integer that specifies the default flags to use when in Linguistic Mode.
Locale	Specifies the name of the default locale to use.

Example:

$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates
 
// Set the default mode to Linguistic, sorting digits as numbers, using the
// user's locale
SCFlags = SORT_DIGITSASNUMBERS$
 
Call SetDefaultStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "" )

GetStrCmpMode function

This function returns an @fm-delimited dynamic array containing the current string comparison mode settings for the application in the format:

<1> Mode
<2> Flags
<3> Locale

Example:

$Insert RTI_StrCmpMode_Equates
 
CurrSCMode = GetStrCmpMode()<GETSTRCMPMODE_MODE$>

SetStrCmpMode function

This function sets the current string comparison mode for an application. Note that the mode is set to the default values for each new request made to the engine (i.e. each event or web-request). This is to protect against situations where an error condition could force the engine to abort processing before the mode could be reset, thereby leaving it in an unknown state.

This function takes three arguments:

Name	Description
Mode	Specifies the mode to set: "0" for Normal mode, or "1" for Linguistic Mode.
Flags	Bitmask integer that specifies the flags to use when in Linguistic Mode.
Locale	Specifies the name of the locale to use.

Example:

$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates
 
// Set the mode to Linguistic, sorting digits as numbers, case-insensitive, 
// and with linguistic casing, using the "en-GB" (English - United Kingdom) locale
SCFlags = BitOr( LINGUISTIC_IGNORECASE$, NORM_LINGUISTIC_CASING$ )
SCFlags = BitOr( SCFlags, SORT_DIGITSASNUMBERS$ )
 
Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "en-GB" ) 
 
// Now do some sorting ...
Call V119( "S", "", "A", "L", data, "" )

GetStrCmpStatus function

While it is unlikely that the CompareStringEx function will raise any errors, it can do so if incompatible flags or parameters are used. In this case Windows returns an error code, which may be accessed in Basic+ via this function (see the CompareStringEx documentation for more details on error values).

Example:

$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates
 
// Set the mode to Linguistic, sorting digits as numbers, case-insensitive, 
// and with linguistic casing, using the "en-GB" (English - United Kingdom) locale
SCFlags = BitOr( LINGUISTIC_IGNORECASE$, NORM_LINGUISTIC_CASING$ )
SCFlags = BitOr( SCFlags, SORT_DIGITSASNUMBERS$ )
 
Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "en-GB" ) 
 
// Now do some sorting ...
Call V119( "S", "", "A", "L", data, "" )
 
SCError = GetStrCmpStatus()
If SCError Then
   ErrorText = RTI_ErrorText( "WIN", SCError )
End
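
One possible use of this is to validate a flags and locale combination before relying on it. The sketch below runs a single throwaway comparison and falls back to Normal mode if Windows reports an error. It assumes that GetStrCmpStatus reflects the most recent comparison and returns a false value on success (as the test above implies); the SCLocale, TestA, TestB and Ignored names are illustrative only.

```basic
$Insert RTI_StrCmpMode_Equates
$Insert MSWin_CompareString_Equates

// Hypothetical settings to validate before relying on them
SCFlags  = SORT_DIGITSASNUMBERS$
SCLocale = "de-CH"

Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, SCLocale )

// Run one throwaway comparison so that CompareStringEx is actually called
TestA   = "a"
TestB   = "b"
Ignored = ( TestA LT TestB )

SCError = GetStrCmpStatus()
If SCError Then
   // Assumption: a non-zero status means Windows rejected the settings.
   // Note why, then fall back to Normal mode ( "0" ), where the flags
   // and locale are ignored
   ErrorText = RTI_ErrorText( "WIN", SCError )
   Call SetStrCmpMode( 0, 0, "" )
End
```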

Conclusion

This concludes this mini-series on OpenInsight string comparison processing. Hopefully you'll find the new Linguistic Mode useful in your own applications, bearing in mind that some of the custom sorting options, such as "Treat Digits As Numbers", can have a use in any application beyond simply dealing with non-English language sets.

Some of the more astute readers among you may have noticed that no mention of indexing has been made so far with respect to Linguistic Mode. This is because work is currently ongoing in this part of the system, and we'll give you more details regarding this at a later date.


Original ID: revdevx.com/?p=3312