====== UTF8 and the extended string operators ======

=== Published 12 MAR 2010 at 01:21:00PM by Captain C ===

== Updated on 20 MAR 2010 at 01:21:00PM ==

As was pointed out in a [[http://www.sprezzatura.com/blog/2010/03/utf8-to-malloc-or-not-to-malloc-that-is.html|recent post]], the performance of the "[]" string operators in UTF8 mode is pretty poor. In fact it's downright painful - if you've not seen the effects before, go and create yourself a UTF8 application and then try compiling a program. The speed drop you see is due to the system pre-compiler (REV_COMPILER_EXPAND) making heavy use of the "[]" operators during the compilation process in a manner similar to this:

<code>
/*
   This is the usual way of implementing fast string parsing in Basic+.
   We scan along the string looking for a delimiter, and remember
   where we found it via the Col2() function. For the next iteration
   we increment that position and scan from that point.
*/
   src = xlate( "SYSPROCS", "MSG", "", "X" )

   pos = 1
   eof = len( src )

   loop
      token = src[pos," "]
      pos   = col2() + 1

      * // Do stuff...

   while ( pos <= eof )
   repeat
</code>

The problem is caused by the need to find the correct starting character position before any string processing can take place. Because UTF8 is a multi-byte encoding scheme, a character may be encoded in more than one byte, so the only way to find the byte offset of a given character is to start counting from the beginning of the string. As you can appreciate, parsing a large string over many iterations wastes a lot of time looking for the character at the specified position - we could get much better performance if we had some way to access the actual byte offset and pass that to the "[]" operators instead.
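
To make the cost a little more concrete, here is a rough sketch of the two approaches in Python rather than Basic+. It is only an illustration of the idea, not how OpenInsight implements anything: the function names are made up for this example, and positions are 1-based to mirror the Basic+ code above.

```python
def char_to_byte_offset(data, char_pos):
    """Return the 0-based byte offset of the 1-based character position.

    UTF-8 continuation bytes look like 10xxxxxx, so every other byte
    starts a new character. The scan always begins at byte 0.
    """
    chars_seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation byte: a new character starts here
            chars_seen += 1
            if chars_seen == char_pos:
                return offset
    return len(data)


def parse_by_char_position(data, delim=b" "):
    """The slow pattern: each iteration re-scans from the start of the string."""
    tokens = []
    pos = 1  # 1-based character position
    total_chars = len(data.decode("utf-8"))
    while pos <= total_chars:
        start = char_to_byte_offset(data, pos)  # linear scan on every pass
        end = data.find(delim, start)
        if end == -1:
            end = len(data)
        token = data[start:end]
        tokens.append(token)
        # Advance by the token and delimiter lengths measured in characters
        pos += len(token.decode("utf-8")) + len(delim.decode("utf-8"))
    return tokens


def parse_by_byte_offset(data, delim=b" "):
    """The fast pattern: remember the byte offset and resume directly from it."""
    tokens = []
    offset = 0  # 0-based byte offset
    while offset < len(data):
        end = data.find(delim, offset)
        if end == -1:
            end = len(data)
        tokens.append(data[offset:end])
        # Advance by the delimiter size measured in bytes
        offset = end + len(delim)
    return tokens
```

Both functions return the same tokens, but the first one walks the string from byte 0 on every iteration, so the total work grows roughly with the square of the string length, while the second only ever moves forward.
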
Well, the good news is that with the upcoming release of OpenInsight 9.2 Revelation have addressed this problem by extending the "[]" operators and adding two new functions to allow access to the byte offset: BCol1() and BCol2().

=== BCol1() and BCol2() ===

The usual way to access the position of delimiters found with the "[]" operators or the Field() function is to use the Col1() and Col2() functions, which return the character position. The new BCol1() and BCol2() functions work in a similar fashion but return the byte offset instead, so you know how many bytes from the beginning of the string a particular character was found.

=== The extended "[]" operators ===

Although BCol1() and BCol2() give access to the byte offsets, they can't be used with a normal "[]" operator because it expects a character index as its first argument, not a byte offset. The extended "[]" operators take an extra argument (a simple "1" as a flag) to indicate that the first argument is a byte offset, and can be used like so:

<code>
/*
   This example shows a UTF8-friendly way of parsing a string using
   byte offsets with the extended "[]" operators.
*/
   src = xlate( "SYSPROCS", "MSG", "", "X" )

   pos       = 1
   delim     = " "
   delimSize = GetByteSize( delim )
   eof       = GetByteSize( src )

   loop
      token = src[pos,delim,1]    ; * // Extended - note the last "1" argument
      pos   = BCol2() + delimSize ; * // Get the byte offset and increment by
                                  ; * // the delimiter _byte_ size

      * // Do stuff...

   while ( pos <= eof )
   repeat
</code>

(Note also that we check the byte size of the delimiter we are using - although we *know* that a space is 1 byte in both ANSI and UTF8 modes, it's good practice to check this at runtime in case you ever end up using a delimiter that is multi-byte encoded instead.)

Both Field() and the normal "[]" operators update the BCol1 and BCol2 offsets, as well as the normal Col1 and Col2 positions. The extended "[]" operators only update the BCol1 and BCol2 offsets, since calculating character positions would bring back the very scan they exist to avoid.

[EDIT: 20 March 2010] To maintain naming consistency with other UTF8-related enhancements the Col1B and Col2B functions have been renamed to BCol1 and BCol2 - this has been changed in the post above.

=== Comments ===

== Comment by Matt Crozier on 17 MAR 2010 at 12:18:28AM: ==

''%% This is great!!%%''

''%%Could the same be done for the REMOVE command? A 'remove' loop would perform inordinately faster if it kept track of byte position rather than character position, especially for the circumstances in which it is most commonly used.%%''

''%%Cheers, M@ %%''

== Comment by Captain C on 17 MAR 2010 at 10:00:17AM: ==

''%% Hi M@,%%''

''%%I'm sure it *could* be extended or something like a RemoveB statement added - have to say it's not a function I use much - I find parsing with the "[]" operators much more convenient as they don't break on each system delimiter.%%''

''%%You ought to come to Vegas to lobby the Powers That Be at Revelation for changes like this :) As you guys do use UTF8 I think your input on this topic would be very useful! %%''

== Comment by Matt Crozier on 18 MAR 2010 at 07:10:40AM: ==

''%% Yeah, I'm sorry to be missing the conference this time around. There's heaps of stuff there we could learn.%%''

''%%A lot of our legacy code uses Remove, but changing that to use [] instead isn't a big change. 
Although adding a RemoveB would be consistent, I guess.%%''

''%%Cheers, M@ %%''

== Comment by Captain C on 20 MAR 2010 at 06:01:54PM: ==

''%% Hi M@,%%''

''%%Take a look here:%%''

''%%http://www.sprezzatura.com/blog/2010/03/utf8-bremove-statement.html%%''

''%%:) %%''

== Original ID: post-4500062265536946813 ==