utf8 and seq (OpenInsight 32-Bit)
At 20 AUG 2004 04:57:56PM Jim Vaughan wrote:
I am trying to read in some data that is in Unicode UTF-16 format, and it needs to be read one character at a time.
I will not know in advance that the file is non-ASCII.
The first two bytes of a UTF-16 file are either FE FF or FF FE.
What I wanted to do was OSBREAD the first two characters and check whether they are either of the above. If they are, I would then read two bytes at a time, converting them with the unicode_str function to create UTF-8 characters that OI can handle.
As you might guess I have hit a problem.
If I do this:
    OSBREAD CHARACTERS FROM FILENAME AT 0 LENGTH 2
    TEST = SEQ(CHARACTERS[1,1])

TEST is equal to -1 and not 255 (FF).
If I turn off UTF8 mode in application properties then SEQ works as expected and TEST is equal to 255.
CHAR() has a similar problem to SEQ, so I cannot simply use that.
Any ideas? Is this a bug or is it meant to work like this?
At 20 AUG 2004 05:12PM Pat McNerthney wrote:
Jim,
I assume you are in OI 7.0.
As you found out, in UTF8 mode, SEQ and CHAR operate on whole UTF8 characters, which can range from 1 to 4 bytes of the string. Also, in UTF8 mode CHAR(255) now represents the UTF8 character that has the value 255 (which uses a 2-byte sequence) and is not equivalent to a Record Mark (which still uses the \FF\ byte value).
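As an illustration outside OpenInsight, the byte-level point can be checked in Python (used here only as a neutral sketch, not as Basic+ code). It shows that code point 255 encodes to two bytes in UTF-8, and that a lone FF byte is not a valid UTF-8 sequence, which is consistent with SEQ returning -1 for it. The helper name is_valid_utf8 is hypothetical.

```python
# Code point 255 (U+00FF) encoded as UTF-8 takes two bytes, not one.
encoded = chr(255).encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A single raw FF byte is never valid on its own in UTF-8.
lone_ff_is_valid = is_valid_utf8(b"\xff")

print(encoded.hex(), len(encoded), lone_ff_is_valid)
```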
There is now a series of intrinsic functions for manipulating binary data in a manner independent of UTF8 mode. However, for your problem, I think the simplest thing to do would be to just compare the first two bytes to the hex strings \FEFF\ and \FFFE\.
Pat
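The approach discussed above (compare the first two bytes as raw bytes, then convert the rest to UTF-8) can be sketched in Python. This is not Basic+ and makes no claims about OpenInsight's intrinsic functions. The function names detect_utf16_bom and read_as_utf8 are hypothetical, and the sketch assumes the whole file content is already in memory as bytes.

```python
import codecs


def detect_utf16_bom(first_two: bytes):
    """Return the UTF-16 byte order implied by a 2-byte BOM, or None."""
    if first_two == codecs.BOM_UTF16_BE:  # bytes FE FF: big-endian
        return "utf-16-be"
    if first_two == codecs.BOM_UTF16_LE:  # bytes FF FE: little-endian
        return "utf-16-le"
    return None


def read_as_utf8(raw: bytes) -> bytes:
    """Convert BOM-prefixed UTF-16 data to UTF-8; pass other data through."""
    encoding = detect_utf16_bom(raw[:2])
    if encoding is None:
        return raw  # no BOM found: assume the data is not UTF-16
    # Skip the 2-byte BOM, decode the UTF-16 payload, re-encode as UTF-8.
    return raw[2:].decode(encoding).encode("utf-8")
```

Comparing the raw bytes directly avoids any character-level function, which is the same reason the hex string comparison sidesteps the SEQ problem.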
At 20 AUG 2004 05:26PM Jim Vaughan wrote:
Perfect, an explanation and a solution.
I didn't know (or had forgotten) about the hex string format; many thanks, Pat.