UTF8 and the extended string operators
Published 12 MAR 2010 at 01:21:00PM by Captain C
Updated on 20 MAR 2010 at 01:21:00PM
As was pointed out in a recent post, the performance of the "[]" string operators in UTF8 mode is pretty poor. In fact it's downright painful - if you've not seen the effects before, go and create yourself a UTF8 application and then try compiling a program. The speed drop you see is due to the system pre-compiler (REV_COMPILER_EXPAND) making heavy use of the "[]" operators during the compilation process, in a manner similar to this:
    /*
       This is the usual way of implementing fast string parsing in Basic+.
       We scan along the string looking for a delimiter, and remember
       where we found it via the Col2() function. For the next iteration
       we increment that position and scan from that point.
    */
    src = xlate( "SYSPROCS", "MSG", "", "X" )

    pos = 1
    eof = len( src )

    loop
       token = src[pos," "]
       pos   = col2() + 1

       * // Do stuff…

    while ( pos <= eof )
    repeat
The problem is caused by the need to find the correct starting character position before any string processing can take place. Because UTF8 is a multi-byte encoding scheme it is necessary to start looking from the beginning of the string to find the byte offset of the specified character, as it's possible for a character to be encoded in more than one byte. As you can appreciate, parsing a large string over many iterations wastes a lot of time looking for the character at the specified position - we could get much better performance if we had some way to access the actual byte offset and pass that to the "[]" operators instead.
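To make the cost concrete, here is a minimal Python sketch (not Basic+, and not how OpenInsight is implemented internally) of what any UTF8 engine has to do to turn a character index into a byte offset. The function name `byte_offset_of_char` is made up for illustration. The point is simply that the scan always starts at byte 0, so asking for character N costs work proportional to N, and doing that on every loop iteration makes the whole parse quadratic.

```python
def byte_offset_of_char(data: bytes, char_index: int) -> int:
    """Return the 0-based byte offset of the character at 0-based char_index.

    Hypothetical helper for illustration only. UTF-8 has no fixed character
    width, so the bytes must be walked from the start, counting lead bytes.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the form 0b10xxxxxx; any other byte
        # starts a new character.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:
        return len(data)  # one past the last character
    raise IndexError("character index out of range")


text = "na\u00efve caf\u00e9"      # 10 characters
data = text.encode("utf-8")        # 12 bytes: two characters take 2 bytes each

v_offset = byte_offset_of_char(data, 3)   # 'v' is character 3 but byte 4
```

Note that Python uses 0-based indexes here, whereas Basic+ string positions are 1-based.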
Well, the good news is that with the upcoming release of OpenInsight 9.2 Revelation have addressed this problem by extending the "[]" operators and adding two new functions to allow access to the byte offset: BCol1() and BCol2().
BCol1() and BCol2()
The usual way to access the position of delimiters found with the "[]" operators or the Field() function is to use the Col1() and Col2() functions, which return the character position. The new BCol1() and BCol2() functions work in a similar fashion but return the byte offset instead, so you know how many bytes from the beginning of the string a particular delimiter was found.
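The difference between the two kinds of position is easy to see outside of OpenInsight. The following Python sketch is only a model: `delimiter_positions` is a hypothetical helper, and the 1-based numbering is an assumption chosen to mirror the way Basic+ counts string positions. It returns the character position of the first delimiter (the kind of value Col2() reports) alongside its byte offset (the kind of value BCol2() reports).

```python
def delimiter_positions(text: str, delim: str) -> tuple:
    """Return (character position, byte offset) of the first delimiter.

    Both values are 1-based to mirror Basic+ conventions (an assumption of
    this sketch). Raises ValueError if the delimiter is not present.
    """
    char_pos = text.index(delim) + 1
    byte_pos = text.encode("utf-8").index(delim.encode("utf-8")) + 1
    return char_pos, byte_pos


# Pure ASCII: every character is one byte, so both numbers agree.
ascii_result = delimiter_positions("cafe au lait", " ")        # (5, 5)

# With one 2-byte character before the space, the byte offset runs ahead.
utf8_result = delimiter_positions("caf\u00e9 au lait", " ")    # (5, 6)
```

In ANSI mode the two values are always identical, which is why the distinction only matters for UTF8 applications.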
The extended "[]" operators
Although BCol1() and BCol2() allow access to the byte offsets, they can't be used with a normal "[]" operator because it expects the character index as the first argument, not the byte offset. The extended "[]" operators take an extra argument (a simple "1" as a flag) to indicate that the first argument is a byte offset, and can be used like so:
    /*
       This example shows a UTF8-friendly way of parsing a string using
       byte offsets with the extended "[]" operators.
    */
    src = xlate( "SYSPROCS", "MSG", "", "X" )

    pos       = 1
    delim     = " "
    delimSize = GetByteSize( delim )
    eof       = GetByteSize( src )

    loop
       token = src[pos,delim,1]     ; * // Extended - note the last "1" argument
       pos   = BCol2() + delimSize  ; * // Get the byte offset and increment by
                                    ; * // the delimiter _byte_ size

       * // Do stuff…

    while ( pos <= eof )
    repeat
(Note also that we check the byte size of the delimiter we are using - although we *know* that a space is 1 byte in both ANSI and UTF8 modes, it's good practice to check this at runtime in case you ever end up using a delimiter that is multi-byte encoded instead.)
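For readers who want to experiment with the idea away from OpenInsight, the same byte-offset loop can be modelled in Python. This is a sketch under stated assumptions rather than a translation of the Basic+ runtime: `tokens_by_byte_offset` is a hypothetical name, offsets are 0-based as is usual in Python, and `len()` of the encoded bytes stands in for GetByteSize(). Each iteration resumes from a remembered byte offset, so the string is never rescanned from the beginning, and the position is advanced by the delimiter's byte size rather than by 1.

```python
def tokens_by_byte_offset(text: str, delim: str) -> list:
    """Split text on delim by tracking byte offsets instead of character
    positions. Hypothetical helper for illustration only."""
    if not delim:
        raise ValueError("delimiter must not be empty")

    data = text.encode("utf-8")
    delim_bytes = delim.encode("utf-8")
    delim_size = len(delim_bytes)   # stands in for GetByteSize( delim )
    eof = len(data)                 # stands in for GetByteSize( src )

    tokens = []
    pos = 0                         # 0-based byte offset (Basic+ is 1-based)
    while pos <= eof:
        end = data.find(delim_bytes, pos)   # search resumes at pos, not at 0
        if end == -1:
            end = eof                       # no more delimiters: take the rest
        tokens.append(data[pos:end].decode("utf-8"))
        pos = end + delim_size              # step over the whole delimiter
    return tokens


# A 2-byte delimiter (U+00A7) between 2-byte Greek letters.
greek = tokens_by_byte_offset("\u03b1\u00a7\u03b2\u00a7\u03b3", "\u00a7")
```

Searching for the encoded delimiter directly in the bytes is safe because a valid UTF-8 sequence can never match part-way through another character. If the position were advanced by 1 instead of by the delimiter's byte size, the 2-byte delimiter in the example would leave the next scan starting in the middle of a character.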
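Both Field() and the normal "[]" operators update the BCol1 and BCol2 offsets, as well as the normal Col1 and Col2 positions. The extended "[]" operators only update the BCol1 and BCol2 offsets: they work purely in bytes, and working out the matching character positions would bring back the very scan they are there to avoid.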
[EDIT: 20 March 2010] To maintain naming consistency with other UTF8-related enhancements the Col1B and Col2B functions have been renamed to BCol1 and BCol2 - this has been changed in the post above.
Comments
Comment by Matt Crozier on 17 MAR 2010 at 12:18:28AM:
This is great!!
Could the same be done for the REMOVE command? A 'remove' loop would perform inordinately faster if it kept track of byte position rather than character position, especially for the circumstances in which it is most commonly used.
Cheers, M@
Comment by Captain C on 17 MAR 2010 at 10:00:17AM:
Hi M@,
I'm sure it *could* be extended or something like a RemoveB statement added - have to say it's not a function I use much - I find parsing with the "[]" operators much more convenient as they don't break on each system delimiter.
You ought to come to Vegas to lobby the Powers That Be at Revelation for changes like this :) As you guys do use UTF8 I think your input on this topic would be very useful!
Comment by Matt Crozier on 18 MAR 2010 at 07:10:40AM:
Yeah, I'm sorry to be missing the conference this time around. There's heaps of stuff there we could learn.
A lot of our legacy code uses Remove, but changing that to use [] instead isn't a big change. Although adding a RemoveB would be consistent, I guess.
Cheers, M@
Comment by Captain C on 20 MAR 2010 at 06:01:54PM:
Hi M@,
Take a look here:
http://www.sprezzatura.com/blog/2010/03/utf8-bremove-statement.html
:)