
Appendix I - UTF8 processing

The Presentation Server (PS) is a Unicode (UTF16-encoded) application, which means that when a user enters a character such as "ü" (code 252, or 0xFC in hex notation) into a control, the information is actually stored in the PS as its full UTF16 encoding: in this case the "ü" is stored using two bytes, as 0x00FC.
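As a quick illustration, Python's UTF-16 codec can stand in for the PS's internal storage (this is illustrative only, not PS code):

```python
# "ü" is Unicode code point U+00FC; a UTF16 application stores it
# as the two-byte code unit 0x00FC rather than the single byte 0xFC.
ch = "\u00FC"                       # "ü"
utf16_be = ch.encode("utf-16-be")   # big-endian so the bytes read 0x00 0xFC
print(utf16_be.hex())               # 00fc
```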

OpenEngine (OE) is not a Unicode application in the same way: rather than using UTF16 to encode strings, it runs in one of two modes, ANSI or UTF8, interpreting strings as either single-byte or multi-byte characters respectively. This both preserves backwards compatibility with existing data sets and gives better memory performance at runtime.

These different encodings therefore require some form of conversion when the PS and the engine communicate (which they do when you use Get_Property, Set_Property, and Exec_Method), leading to the following translation processes, which are detailed below:

Using Get_Property:

• UTF16 → ANSI (when the engine is in ANSI mode)

• UTF16 → UTF8 (when the engine is in UTF8 mode)

Using Set_Property:

• ANSI → UTF16 (when the engine is in ANSI mode)

• UTF8 → UTF16 (when the engine is in UTF8 mode)

Using Exec_Method (for passed parameters – same as Set_Property):

• ANSI → UTF16 (when the engine is in ANSI mode)

• UTF8 → UTF16 (when the engine is in UTF8 mode)

Using Exec_Method (for returned values – same as Get_Property):

• UTF16 → ANSI (when the engine is in ANSI mode)

• UTF16 → UTF8 (when the engine is in UTF8 mode)

Translating UTF16 to ANSI (as per Get_Property)

This is a very simple translation - when translating a UTF16 character to ANSI the most significant byte is always removed. So, if the "ü" character (0x00FC) is entered in a control it is converted into a single byte 0xFC character when passed back to Basic+.
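The truncation can be modelled in a few lines of Python (an illustrative sketch, not the PS's actual code):

```python
def utf16_to_ansi(text: str) -> bytes:
    """Model of the PS -> engine translation in ANSI mode:
    the most significant byte of each UTF16 code unit is simply dropped."""
    return bytes(ord(c) & 0xFF for c in text)

SVM = 0xFC  # the @svm delimiter byte (code 252)

result = utf16_to_ansi("\u00FC")    # "ü" entered in a control
print(result)                       # b'\xfc' -- indistinguishable from @svm
print(result[0] == SVM)             # True
```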

Of course this would then be interpreted as the @svm delimiter, which can lead to other problems such as corrupting the structure of dynamic arrays or records. Solving this issue was the primary reason UTF8 mode was implemented, and ANSI mode should not be used where you expect user input to contain characters that may collide with the System Delimiters.

(Another early attempt to deal with this issue resulted in the SYSTEM CHARMAP property, but this has since been deprecated in favor of UTF8 mode applications).

Translating ANSI to UTF16 (as per Set_Property)

This translation is complicated by the need to deal with Revelation System Delimiters, such as @fm, @vm and so forth. When used in Basic+ programs and stored in records these delimiters are represented by single bytes like so:

@rm - 0xFF (code 255)

@fm - 0xFE (code 254)

@vm - 0xFD (code 253)

@svm - 0xFC (code 252)

@tm - 0xFB (code 251)

@stm - 0xFA (code 250)

However, when these need to be translated to UTF16 (two bytes) they should become:

@rm - 0xF0FF

@fm - 0xF0FE

@vm - 0xF0FD

@svm - 0xF0FC

@tm - 0xF0FB

@stm - 0xF0FA

which places them inside the Unicode Private Use Area. Characters within this range have no assigned meaning in any language, so they may be considered "safe" when passed outside of the engine. For example, @svm in UTF16 is represented as 0xF0FC, and is therefore not the same as an actual "ü" character (0x00FC).

However, the complication arises because we need to decide if a character such as "ü" (0xFC), which has the same value as @svm, should be converted to the UTF16 version of "ü" (0x00FC), or the UTF16 version of @svm (0xF0FC).

E.g. If we wish to use Set_Property to set the TEXT property of a control, and the text string contains a character that could be a system delimiter like code 252, we need to tell the system how to translate it: is the code 252 character an @svm or a "ü"?

I.e. should the translation be this?

0xFC → 0x00FC ("ü" character in UTF16)

Or this?

0xFC → 0xF0FC (@svm delimiter in UTF16)

So that the system knows which translation to use we need to set the "Delimiter Count". This is a simple integer between 0 and 8 that specifies which single byte characters (counting backwards from 256) should be treated as system delimiters when performing the ANSI to UTF16 translation.

E.g. if the delimiter count is 6 (which is the default when the application is loaded), any single-byte characters from code 250 (256 - 6) upwards are translated to UTF16 System Delimiters like so:

@rm (code 255) → 0xF0FF

@fm (code 254) → 0xF0FE

@vm (code 253) → 0xF0FD

@svm (code 252) → 0xF0FC

@tm (code 251) → 0xF0FB

@stm (code 250) → 0xF0FA

If the delimiter count is 3, any single-byte characters from code 253 (256 - 3) upwards are translated to UTF16 System Delimiters like so:

@rm (code 255) → 0xF0FF

@fm (code 254) → 0xF0FE

@vm (code 253) → 0xF0FD

while any characters below code 253 are translated as normal characters:

@svm (code 252) → 0x00FC

@tm (code 251) → 0x00FB

@stm (code 250) → 0x00FA

If the delimiter count is 0 then no single byte characters are translated to UTF16 System Delimiters:

@rm (code 255) → 0x00FF

@fm (code 254) → 0x00FE

@vm (code 253) → 0x00FD

@svm (code 252) → 0x00FC

@tm (code 251) → 0x00FB

@stm (code 250) → 0x00FA
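The delimiter-count rule above can be sketched in Python (an illustrative model, not the actual engine code):

```python
def ansi_to_utf16(data: bytes, delim_count: int = 6) -> str:
    """Model of the engine -> PS translation in ANSI mode.

    Bytes from (256 - delim_count) upwards are treated as system
    delimiters and mapped into the Private Use Area (0xF000 | byte);
    everything else is widened to its normal UTF16 code unit.
    """
    threshold = 256 - delim_count
    return "".join(
        chr(0xF000 | b) if b >= threshold else chr(b)
        for b in data
    )

# Delimiter count 6 (the default): code 252 is treated as @svm ...
print(hex(ord(ansi_to_utf16(b"\xFC", 6))))   # 0xf0fc
# ... but with delimiter count 3 it becomes a real "ü"
print(hex(ord(ansi_to_utf16(b"\xFC", 3))))   # 0xfc
# With delimiter count 0 even code 255 is just the "ÿ" character
print(hex(ord(ansi_to_utf16(b"\xFF", 0))))   # 0xff
```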

Translating UTF16 to UTF8 (as per Get_Property)

When in UTF8 mode the engine uses multi-byte characters for any data character above code 127. It still uses the familiar single-byte characters for System Delimiters, but these are not valid UTF8 byte sequences, which makes it easy to separate the data from the delimiters:

E.g.

If the user enters a "ü" character (0x00FC) in a control it will be translated to 0xC3BC, not 0xFC.

This means that it is preserved correctly and does not collide with any System Delimiters. When running in UTF8 mode the dynamic arrays and records in Basic+ are essentially made up of "fragments" of valid UTF8 strings delimited by invalid UTF8 single chars.
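Both points can be demonstrated with Python's UTF-8 codec (illustrative only; the helper `is_valid_utf8` is not part of the engine):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8 = "\u00FC".encode("utf-8")     # "ü" as the engine stores it in UTF8 mode
print(utf8.hex())                   # c3bc -- no 0xFC byte in sight

# The raw delimiter bytes 0xFA-0xFF can never appear in well-formed UTF8,
# so delimiters and data can always be told apart:
print([is_valid_utf8(bytes([b])) for b in range(0xFA, 0x100)])  # all False
```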

Translating UTF8 to UTF16 (as per Set_Property)

Again this is an easy translation: valid UTF8 data is translated to UTF16 as per normal and the "invalid" UTF8 single-byte system delimiters are always translated to their valid UTF16 equivalents.

i.e.

0xC3BC becomes 0x00FC

0xFC becomes 0xF0FC

And so on.
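The whole translation can be modelled as splitting the byte stream into valid UTF8 fragments separated by delimiter bytes (again an illustrative sketch, not the engine's actual code):

```python
def utf8_to_utf16(data: bytes) -> str:
    """Model of the engine -> PS translation in UTF8 mode: valid UTF8
    fragments decode normally, while the invalid single-byte delimiters
    (0xFA-0xFF) map into the Private Use Area (0xF000 | byte)."""
    out = []
    fragment = bytearray()
    for b in data:
        if b >= 0xFA:                       # a system delimiter byte
            out.append(fragment.decode("utf-8"))
            fragment.clear()
            out.append(chr(0xF000 | b))
        else:
            fragment.append(b)
    out.append(fragment.decode("utf-8"))
    return "".join(out)

# A "ü" (0xC3 0xBC) on each side of an @svm (0xFC):
s = utf8_to_utf16(b"\xC3\xBC\xFC\xC3\xBC")
print([hex(ord(c)) for c in s])     # ['0xfc', '0xf0fc', '0xfc']
```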

Setting the UTF8 Mode and the Delimiter Count in the IDE

Both the UTF8 Mode and Delimiter Count for an application can be set at design time in the Application Settings dialog (accessed from the Settings menu in the IDE). These settings are loaded when the application is initialized.

<Application Settings dialog image needed>

Setting the UTF8 Mode and the Delimiter Count at runtime

At runtime the application UTF8 Mode and Delimiter Count may be changed by using the UTF8 and DELIMCOUNT SYSTEM properties respectively. Setting these properties changes the settings for the rest of the session, or until they are updated again.

Note that the UTF8 Mode and Delimiter Count can also be adjusted by using the SetUTF8 and SetNoOfDelimiters functions. The difference in this case is that the change only applies for the duration of the currently executing event: when the next event fires, the UTF8 Mode and Delimiter Count are reset to the values of the SYSTEM properties mentioned above. This preserves data integrity in the event that the currently executing program is forced to abort.

  • Last modified: 2023/10/25 10:49