
UTF8 and the extended string operators

Published 12 MAR 2010 at 01:21:00PM by Captain C

Updated on 20 MAR 2010 at 01:21:00PM

As was pointed out in a recent post, the performance of the "[]" string operators in UTF8 mode is pretty poor. In fact it's downright painful: if you've not seen the effects before, go and create yourself a UTF8 application and then try compiling a program. The speed drop you see is due to the system pre-compiler (REV_COMPILER_EXPAND) making heavy use of the "[]" operators during the compilation process, in a manner similar to this:

    /*
       This is the usual way of implementing fast string parsing in Basic+.
       We scan along the string looking for a delimiter, and remember
       where we found it via the Col2() function. For the next iteration
       we increment that position and scan from that point.
    */
       src = xlate( "SYSPROCS", "MSG", "", "X" )

       pos = 1
       eof = len( src )

       loop
          token = src[pos," "]
          pos   = col2() + 1

          * // Do stuff…

          * // Use "<=" so that a one-character final token is not skipped
       while ( pos <= eof )
       repeat

The problem is caused by the need to find the correct starting character position before any string processing can take place. Because UTF8 is a multi-byte encoding scheme, a character may be encoded in more than one byte, so the only way to find the byte offset of a given character position is to count characters from the beginning of the string. Parsing a large string over many iterations therefore wastes a lot of time: every pass through the loop rescans everything before the current position, so the total work grows much faster than the length of the string. We could get much better performance if we had some way to access the actual byte offset and pass that to the "[]" operators instead.
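To make the cost concrete, here is a minimal sketch in Python (not Basic+, and not a description of how OpenInsight does it internally) of what any UTF8 string library has to do to turn a character index into a byte offset. The function name byte_offset_of_char is made up for this illustration, and it uses 0-based indexes in the Python style rather than the 1-based positions Basic+ uses.

```python
def byte_offset_of_char(data: bytes, char_index: int) -> int:
    """Return the 0-based byte offset of the char_index-th (0-based)
    character in UTF-8 encoded data, by walking from the start."""
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte is the first byte of a character.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:
        # One past the last character: the end of the data.
        return len(data)
    raise IndexError(char_index)


data = "naïve café".encode("utf-8")
v_offset = byte_offset_of_char(data, 3)  # the "v", which follows the 2-byte "ï"
```

There is no shortcut here: the walk always starts at byte 0, so asking for a position near the end of a long string touches nearly every byte, and a parsing loop asks that question once per token.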

Well, the good news is that with the upcoming release of OpenInsight 9.2 Revelation have addressed this problem by extending the "[]" operators and adding two new functions to allow access to the byte offset: BCol1() and BCol2().

BCol1() and BCol2()

The usual way to access the position of delimiters found with the "[]" operators or the Field() function is to use the Col1() and Col2() functions, which return the character position. The new BCol1() and BCol2() functions work in a similar fashion but return the byte offset instead, so you know how many bytes from the beginning of the string a particular character was found.
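The difference between the two kinds of position is easy to see with a string that contains a multi-byte character. The following Python sketch (again an illustration, not Basic+) reports both numbers for the first delimiter in a string. The function name delimiter_positions is hypothetical, and the 1-based numbering is an assumption chosen to mirror the examples in this post.

```python
def delimiter_positions(text: str, delim: str) -> tuple[int, int]:
    """Return (character position, byte offset) of the first delim in
    text, both 1-based. Raises ValueError if delim is not present."""
    char_index = text.index(delim)  # 0-based character index
    # Number of UTF-8 bytes that come before the delimiter.
    bytes_before = len(text[:char_index].encode("utf-8"))
    return char_index + 1, bytes_before + 1


ascii_case = delimiter_positions("hello world", " ")
utf8_case = delimiter_positions("héllo wörld", " ")
```

For pure ASCII text the two numbers are identical; as soon as a multi-byte character appears before the delimiter (the "é" here takes 2 bytes) the byte offset runs ahead of the character position.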

The extended "[]" operators

Although BCol1() and BCol2() allow access to the byte offsets, they can't be used with a normal "[]" operator because it expects the character index as the first argument, not the byte offset. The extended "[]" operators take an extra argument (a simple "1" as a flag) to indicate that the first argument is a byte offset, and can be used like so:

    /*
       This example shows a UTF8-friendly way of parsing a string using
       byte offsets with the extended "[]" operators.
    */
       src = xlate( "SYSPROCS", "MSG", "", "X" )

       pos       = 1
       delim     = " "
       delimSize = GetByteSize( delim )
       eof       = GetByteSize( src )

       loop
          token = src[pos,delim,1]    ; * // Extended - note the last "1" argument
          pos   = BCol2() + delimSize ; * // Get the byte offset and increment by
                                      ; * // the delimiter _byte_ size

          * // Do stuff…

          * // Use "<=" so that a one-byte final token is not skipped
       while ( pos <= eof )
       repeat

(Note also that we check the byte size of the delimiter we are using. Although we *know* that a space is 1 byte in both ANSI and UTF8 modes, it's good practice to check this at runtime in case you ever end up using a delimiter that is multi-byte encoded instead.)
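As a language-neutral illustration of the same idea, here is a small Python sketch of a tokeniser that works purely in byte offsets and steps over the delimiter by its byte size. It is not Basic+ and makes no claim about how the extended "[]" operators are implemented; the function name tokens_by_byte_offset is made up for the example.

```python
def tokens_by_byte_offset(text: str, delim: str) -> list[str]:
    """Split text on delim by scanning the UTF-8 bytes, remembering a
    byte offset between iterations instead of a character index."""
    data = text.encode("utf-8")
    delim_bytes = delim.encode("utf-8")
    delim_size = len(delim_bytes)  # measured at runtime, not assumed to be 1
    if delim_size == 0:
        raise ValueError("empty delimiter")

    tokens = []
    pos = 0  # 0-based byte offset of the start of the next token
    while pos <= len(data):
        end = data.find(delim_bytes, pos)  # search resumes at pos, not at 0
        if end == -1:
            end = len(data)
        tokens.append(data[pos:end].decode("utf-8"))
        pos = end + delim_size  # step over the whole delimiter
    return tokens


words = tokens_by_byte_offset("naïve café au lait", " ")
greek = tokens_by_byte_offset("α→β→γ", "→")  # "→" is 3 bytes in UTF-8
```

Searching the raw bytes is safe because a valid UTF-8 encoded delimiter can never match part-way through another character. The second call shows why the delimiter size matters: stepping on by 1 instead of 3 would land in the middle of the arrow.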

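Both Field() and the normal "[]" operators update the BCol1 and BCol2 offsets, as well as the normal Col1 and Col2 positions. The extended "[]" operators only update the BCol1 and BCol2 offsets, presumably because working out the character positions would mean the very scan from the start of the string that they are designed to avoid.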

[EDIT: 20 March 2010] To maintain naming consistency with other UTF8-related enhancements the Col1B and Col2B functions have been renamed to BCol1 and BCol2 - this has been changed in the post above.

Comments

Comment by Matt Crozier on 17 MAR 2010 at 12:18:28AM:

This is great!! Could the same be done for the REMOVE command? A 'remove' loop would perform inordinately faster if it kept track of byte position rather than character position, especially for the circumstances in which it is most commonly used.

Cheers, M@

Comment by Captain C on 17 MAR 2010 at 10:00:17AM:

Hi M@,

I'm sure it *could* be extended or something like a RemoveB statement added - have to say it's not a function I use much - I find parsing with the "[]" operators much more convenient as they don't break on each system delimiter.

You ought to come to Vegas to lobby the Powers That Be at Revelation for changes like this :) As you guys do use UTF8 I think your input on this topic would be very useful!

Comment by Matt Crozier on 18 MAR 2010 at 07:10:40AM:

Yeah, I'm sorry to be missing the conference this time around. There's heaps of stuff there we could learn. A lot of our legacy code uses Remove, but changing that to use [] instead isn't a big change. Although adding a RemoveB would be consistent, I guess.

Cheers, M@

Comment by Captain C on 20 MAR 2010 at 06:01:54PM:

Hi M@,

Take a look here:

http://www.sprezzatura.com/blog/2010/03/utf8-bremove-statement.html

:)
