Welcome to the final part of this mini-series on the string comparison mechanics in OpenInsight. In the first two parts we reviewed how this task is currently handled in both ANSI and UTF8 modes. This time we'll look at a new capability introduced for the next release, called the "Linguistic String Comparison Mode".
As we've seen previously, there is certainly room for improvement when dealing with string comparisons and sorting in non-English languages, mainly because the developer carries the burden of maintaining the sorting parameters, especially once the requirements extend beyond the basic ANSI character set. OpenInsight also takes no advantage of the capabilities of Windows itself, which provides a comprehensive National Language Support (NLS) API for testing strings for linguistic equality.
What is "linguistic equality"?
If you're unfamiliar with the term "linguistic equality" it essentially means comparing strings according to the language rules of a specific locale thereby providing appropriate results for a user of that locale. For example, consider the following cases that illustrate how comparisons differ for the same characters in different locales:
In a standard OI system the developer would have to define collation sequence records to represent rules of this sort, which simply duplicates effort when Windows itself can easily handle this for us.
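To make that maintenance burden concrete, here is a small illustrative sketch. It is written in Python rather than Basic+, purely as a neutral way to show the idea, and it is not OpenInsight code. It hand-codes two simplified collation rules: German dictionary order, where umlauts sort alongside their base vowel, and Swedish, where å, ä and ö are separate letters placed after z. The key functions and the word list are invented for illustration and are far cruder than real NLS rules.

```python
# Toy, hand-maintained collation rules for two locales (illustrative only).
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
SWEDISH_ORDER = "abcdefghijklmnopqrstuvwxyzåäö"


def german_key(word):
    # German dictionary order: umlauts sort with their base vowel.
    # The original word is kept as a tie-breaker.
    lowered = word.lower()
    return (lowered.translate(GERMAN_FOLD), lowered)


def swedish_key(word):
    # Swedish: å, ä and ö are distinct letters that follow z.
    # (Only handles the letters listed in SWEDISH_ORDER.)
    return [SWEDISH_ORDER.index(ch) for ch in word.lower()]


words = ["zebra", "äpple", "apple"]
german_sorted = sorted(words, key=german_key)
swedish_sorted = sorted(words, key=swedish_key)

print(german_sorted)   # "äpple" sorts next to "apple"
print(swedish_sorted)  # "äpple" sorts after "zebra"
```

The same three words end up in a different order for each locale, and every additional locale would need another hand-written rule set. That is the duplication the Windows NLS API removes.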
To use this API without impacting current systems we have introduced a new "mode" into OpenInsight that allows you to determine exactly when you wish to enable linguistic support. This mode comprises three elements: the Mode itself, the Mode Flags and the Mode Locale.
It's simply a case of setting the mode when you want it to apply to sorting and case-insensitive operations, and turning it off when you don't. Just like with Extended Precision Mode you can set a default mode for your application and then adjust this at runtime as desired.
(Note that using the Linguistic mode is not affected by OpenInsight's ANSI or UTF8 mode, as the string comparisons are processed "outside" in Windows itself.)
The following five functions are used to control the Linguistic Mode: GetDefaultStrCmpMode, SetDefaultStrCmpMode, GetStrCmpMode, SetStrCmpMode and GetStrCmpStatus.
Along with this set of equates:
Example:
    $Insert RTI_StrCmpMode_Equates
    $Insert MSWin_CompareString_Equates

    // Set the mode to Linguistic, sorting digits as numbers, case-insensitive,
    // and with linguistic casing, using the "en-GB" locale
    SCFlags = BitOr( LINGUISTIC_IGNORECASE$, NORM_LINGUISTIC_CASING$ )
    SCFlags = BitOr( SCFlags, SORT_DIGITSASNUMBERS$ )

    Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "en-GB" )

    // Now do some sorting ...
    Call V119( "S", "", "A", "L", data, "" )
Full details on each of these functions can be found at the end of this post, but let's take a more detailed look at each of the mode settings:
The Mode is an integer value that controls how string comparisons are made:
When set to "0" the application runs in "normal" mode, which means that string comparisons use the methods described in parts 1 and 2 of this series. The Mode Flags and Mode Locale settings are ignored.
When set to "1" the application uses the Windows CompareStringEx function for string comparisons instead. The Mode Flags and Mode Locale settings are passed along with each comparison.
The Mode Flags setting is an integer comprising one or more optional bit-flags that are passed to the Windows CompareStringEx function when running in Linguistic Mode (it may be set to 0 to apply the default behavior). A full description of their use can be found in the Microsoft documentation for the CompareStringEx function, but briefly these are:
Flag | Description |
LINGUISTIC_IGNORECASE$ | Ignore case, as linguistically appropriate. |
LINGUISTIC_IGNOREDIACRITIC$ | Ignore nonspacing characters, as linguistically appropriate. |
NORM_IGNORECASE$ | Ignore case. |
NORM_IGNOREKANATYPE$ | Do not differentiate between hiragana and katakana characters. |
NORM_IGNORENONSPACE$ | Ignore nonspacing characters. |
NORM_IGNORESYMBOLS$ | Ignore symbols and punctuation. |
NORM_IGNOREWIDTH$ | Ignore the difference between half-width and full-width characters. |
NORM_LINGUISTIC_CASING$ | Use the default linguistic rules for casing, instead of file system rules. |
SORT_DIGITSASNUMBERS$ | Treat digits as numbers during sorting. |
SORT_STRINGSORT$ | Treat punctuation the same as symbols. |
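Because the Mode Flags value is a plain bitmask, flags are combined with a bitwise OR and tested with a bitwise AND. The following Python sketch (not OpenInsight code) shows the arithmetic using the numeric values that the Windows SDK header winnls.h defines for these constants. It is an assumption that the OpenInsight `$` equates carry the same values, and the `combine` and `has_flag` helpers are hypothetical names used only for this example.

```python
# Values as defined in the Windows SDK header winnls.h.
# Assumption: the OpenInsight "$" equates map to these same numbers.
NORM_IGNORECASE = 0x00000001
SORT_DIGITSASNUMBERS = 0x00000008
LINGUISTIC_IGNORECASE = 0x00000010
NORM_LINGUISTIC_CASING = 0x08000000


def combine(*flags):
    # OR the individual bit-flags together into one mask.
    mask = 0
    for flag in flags:
        mask |= flag
    return mask


def has_flag(mask, flag):
    # True when every bit of `flag` is set in `mask`.
    return (mask & flag) == flag


sc_flags = combine(LINGUISTIC_IGNORECASE, NORM_LINGUISTIC_CASING, SORT_DIGITSASNUMBERS)

print(hex(sc_flags))
print(has_flag(sc_flags, SORT_DIGITSASNUMBERS))
print(has_flag(sc_flags, NORM_IGNORECASE))
```

This mirrors what the nested BitOr() calls do in the Basic+ examples: each flag occupies its own bit, so the order in which they are combined does not matter.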
The Mode Locale setting can be the name of the locale to use (like "en-US", "de-CH" etc.), or one of the following special values:
The following Basic+ operators and functions are affected by the Linguistic Mode:
Note that when used with the case-insensitive operators and functions (such as _eqc, IndexC() etc.) the LINGUISTIC_IGNORECASE$ flag is always applied if NORM_IGNORECASE$ has not been specified.
Using the Linguistic Mode can impact performance for two reasons:
Because of this, Linguistic Mode is not enabled by default. You are free to choose when to apply it yourself.
GetDefaultStrCmpMode function
This function returns an @fm-delimited dynamic array containing the current default string comparison mode settings for the application in the format:
    <1> Mode
    <2> Flags
    <3> Locale
Example:
    $Insert RTI_StrCmpMode_Equates

    DefSCM  = GetDefaultStrCmpMode()
    DefMode = DefSCM<GETSTRCMPMODE_MODE$>
SetDefaultStrCmpMode function
This function sets the default string comparison mode for an application. The mode is reset to these default values for each new request made to the engine (i.e. each event or web-request). This protects against situations where an error condition forces the engine to abort processing before the mode can be reset, which would otherwise leave it in an unknown state.
This function takes three arguments:
Name | Description |
Mode | Specifies the default mode to set: "0" for Normal mode, or "1" for Linguistic Mode. |
Flags | Bitmask integer that specifies the default flags to use when in Linguistic Mode |
Locale | Specifies the name of the default locale to use. |
Example:
    $Insert RTI_StrCmpMode_Equates
    $Insert MSWin_CompareString_Equates

    // Set the default mode to Linguistic, sorting digits as numbers, using the
    // user's locale
    SCFlags = SORT_DIGITSASNUMBERS$

    Call SetDefaultStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "" )
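The relationship between the default mode and the current mode can be pictured with a small model. This Python sketch is not how the engine is implemented; the `Engine` class and its method names are hypothetical, and it assumes that setting the default does not itself change the current mode until the next request. It only shows the documented behavior: every new request starts from the defaults, so a mode change that was never undone cannot leak into later requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrCmpMode:
    mode: int = 0      # 0 = Normal, 1 = Linguistic
    flags: int = 0     # bitmask passed on in Linguistic Mode
    locale: str = ""   # locale name


class Engine:
    """Hypothetical stand-in for the engine's mode bookkeeping."""

    def __init__(self):
        self.default = StrCmpMode()
        self.current = self.default

    def set_default(self, mode, flags, locale):
        self.default = StrCmpMode(mode, flags, locale)

    def set_current(self, mode, flags, locale):
        self.current = StrCmpMode(mode, flags, locale)

    def begin_request(self):
        # Each event or web-request starts from the default settings.
        self.current = self.default


engine = Engine()
engine.set_default(1, 0x8, "")   # Linguistic, digits as numbers, user's locale

engine.begin_request()
engine.set_current(0, 0, "")     # handler switches to Normal mode ...
mode_during_request = engine.current.mode
# ... and aborts here without restoring the previous mode.

engine.begin_request()           # the next request is unaffected
mode_after_reset = engine.current.mode
print(mode_during_request, mode_after_reset)
```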
GetStrCmpMode function
This function returns an @fm-delimited dynamic array containing the current string comparison mode settings for the application in the format:
    <1> Mode
    <2> Flags
    <3> Locale
Example:
    $Insert RTI_StrCmpMode_Equates

    CurrSCMode = GetStrCmpMode()<GETSTRCMPMODE_MODE$>
SetStrCmpMode function
This function sets the current string comparison mode for an application. Note that the mode is reset to the default values for each new request made to the engine (i.e. each event or web-request). This protects against situations where an error condition forces the engine to abort processing before the mode can be reset, which would otherwise leave it in an unknown state.
This function takes three arguments:
Name | Description |
Mode | Specifies the mode to set: "0" for Normal mode, or "1" for Linguistic Mode. |
Flags | Bitmask integer that specifies the flags to use when in Linguistic Mode |
Locale | Specifies the name of the locale to use. |
Example:
    $Insert RTI_StrCmpMode_Equates
    $Insert MSWin_CompareString_Equates

    // Set the mode to Linguistic, sorting digits as numbers, case-insensitive,
    // and with linguistic casing, using the "en-GB" locale
    SCFlags = BitOr( LINGUISTIC_IGNORECASE$, NORM_LINGUISTIC_CASING$ )
    SCFlags = BitOr( SCFlags, SORT_DIGITSASNUMBERS$ )

    Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "en-GB" )

    // Now do some sorting ...
    Call V119( "S", "", "A", "L", data, "" )
GetStrCmpStatus function
While it is unlikely that the CompareStringEx function will raise any errors, it can do so if incompatible flags or parameters are used. In this case Windows returns an error code, which may be accessed in Basic+ via this function (see the CompareStringEx documentation for more details on error values).
Example:
    $Insert RTI_StrCmpMode_Equates
    $Insert MSWin_CompareString_Equates

    // Set the mode to Linguistic, sorting digits as numbers, case-insensitive,
    // and with linguistic casing, using the "en-GB" locale
    SCFlags = BitOr( LINGUISTIC_IGNORECASE$, NORM_LINGUISTIC_CASING$ )
    SCFlags = BitOr( SCFlags, SORT_DIGITSASNUMBERS$ )

    Call SetStrCmpMode( STRCMPMODE_LINGUISTIC$, SCFlags, "en-GB" )

    // Now do some sorting ...
    Call V119( "S", "", "A", "L", data, "" )

    // Check whether Windows reported an error during the comparisons
    SCError = GetStrCmpStatus()
    If SCError Then
        ErrorText = RTI_ErrorText( "WIN", SCError )
    End
This concludes this mini-series on OpenInsight string comparison processing. Hopefully you'll find the new Linguistic Mode useful in your own applications, bearing in mind that some of the custom sorting options, such as "Treat Digits As Numbers", can have a use in any application beyond simply dealing with non-English language sets.
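To illustrate why "Treat Digits As Numbers" is useful even for plain English data, here is a rough Python approximation of the idea (not OpenInsight code, and not the algorithm Windows uses). The `digits_as_numbers_key` function is a hypothetical helper: it splits each string into digit and non-digit runs and compares the digit runs by numeric value, ignoring case for the text runs.

```python
import re


def digits_as_numbers_key(text):
    # Split into alternating text and digit runs, dropping empty pieces.
    parts = [p for p in re.split(r"(\d+)", text) if p]
    key = []
    for part in parts:
        if part.isdigit():
            # Digit runs compare by numeric value.
            key.append((0, int(part), ""))
        else:
            # Text runs compare case-insensitively.
            key.append((1, 0, part.casefold()))
    return key


names = ["item10", "item2", "Item1"]

plain_sorted = sorted(names)
numeric_sorted = sorted(names, key=digits_as_numbers_key)

print(plain_sorted)    # character-by-character: "item10" before "item2"
print(numeric_sorted)  # digits as numbers: "item2" before "item10"
```

A character-by-character sort places "item10" before "item2", which is rarely what a user expects. Treating the digits as numbers gives the natural order.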
Some of the more astute readers among you may have noticed that no mention of indexing has been made so far with respect to Linguistic Mode. This is because work is currently ongoing in this part of the system, and we'll give you more details regarding this at a later date.
More information on this subject may be found here: