Handling foreign characters (OpenInsight 32-bit) [Revelation On-Line Wiki]

At 10 DEC 2012 04:25:58PM Paxton Scott wrote:

OpenInsight 32-bit

Greetings!

I am using OI 9.1

I use UTF8 character mode.

But, I still get tangled when handling ANSI 252 (ASCII 129) as it is the sub value mark

I work with names and song titles which I want to store in the database and index and retrieve.

I am working on expanding my understanding of character encoding but wonder if someone could point me in the right direction for more understanding and for handling strings containing ü ( I see this forum is ASCII coded)

Thanks for guidance…

Have fun!

Paxton

At 12 DEC 2012 05:01AM Carl Pates wrote:

Hi Paxton,

Not a small topic :) I'll try and explain how this works with a small example - hope I don't add to your confusion…

If you're in UTF8 mode here's what happens when you enter the "ANSI 252" character (ü) in an OI edit line and save it.

1) You enter the "ü" character in the editline. The OpenInsight Presentation Server (OINSIGHT.EXE) is actually a Windows "Unicode application" so it holds each character in a string using two bytes (aka UTF16) and your character is held as \00FC\ in the editline's buffer at that point.

2) You save the record - the save logic in Basic+ pulls the data from the editline using Get_Property( "TEXT") and because you're running in UTF8 mode the system translates the UTF16 character to UTF8. UTF8 is a multibyte encoding scheme so the UTF16 \00FC\ is actually translated to \C3BC\ in UTF8.

3) \C3BC\ is now stored in your record. Notice that although it's a single character, it's actually two bytes!
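The byte values in the steps above can be checked outside OpenInsight. The following is a minimal sketch in Python, used purely for illustration: it is not Basic+ and calls nothing in OpenInsight. It assumes the \00FC\ notation refers to the 16-bit code unit value, so it uses a big-endian UTF16 encoding to print the bytes in the same order.

```python
# Illustration only: plain Python, not Basic+ or any OpenInsight API.
ch = "\u00fc"  # the character ü, code point U+00FC

# Big-endian UTF-16 is chosen here only so the hex matches the \00FC\ notation.
utf16 = ch.encode("utf-16-be")
# UTF-8 form, i.e. the bytes that end up stored in the record in UTF8 mode.
utf8 = ch.encode("utf-8")

print(utf16.hex().upper())  # 00FC
print(utf8.hex().upper())   # C3BC
print(len(ch), len(utf8))   # 1 2  (one character, two bytes)
```

The last line shows the point made in step 3: one character, but two bytes once encoded as UTF8.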

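This does not conflict with @svm because @svm is a single byte (\FC\), which as you can see is totally different from \C3BC\. Also, bytes above char(127) only ever appear in UTF8 as part of a multibyte sequence, and the byte values \F5\ to \FF\, which include @svm and the other system delimiters, are not legal in UTF8 encoding at all, so you will never find them in a UTF8 string.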

This is how OpenInsight can tell the difference between a "ü" character in UTF8 encoded data and an @svm. Your UTF8 encoded data is basically held between "illegal" UTF8 bytes - When OI parses a record or dynamic array to extract data using the <> operators it looks for single byte system delimiters (which it knows are not UTF8 characters) and returns the legal UTF8 bytes that represent characters between them.
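The idea of UTF8 data held between single-byte delimiters can be sketched as follows. This is Python standing in for the behaviour described above, not how OpenInsight implements the <> operators; the helper name `split_subvalues` and the two sample titles are hypothetical.

```python
# Illustration only: plain Python, not Basic+ or any OpenInsight API.
SVM = b"\xfc"  # @svm is always this single byte


def split_subvalues(raw):
    """Split raw record bytes on the single-byte delimiter, then decode each piece as UTF-8."""
    return [part.decode("utf-8") for part in raw.split(SVM)]


# Two hypothetical song titles, each UTF-8 encoded, joined by a sub-value mark.
raw = "Für Elise".encode("utf-8") + SVM + "Glück".encode("utf-8")

titles = split_subvalues(raw)
print(titles)  # ['Für Elise', 'Glück']

# The encoded ü (\C3BC\) does not contain the delimiter byte.
print(SVM in "ü".encode("utf-8"))  # False
```

Splitting on the raw byte before decoding is the important ordering here: the string as a whole is not valid UTF8 because of the delimiter, but each piece between delimiters is.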

Does that help any? If not can you be more specific as to what you're looking for?

Carl Pates

Sprezzatura

The Sprezzatura Blog

World leaders in all things RevSoft

At 12 DEC 2012 07:49AM Dave Harmacek wrote:

Carl, is it true that @svm is altered during UTF8 mode to be \C3BC\?

Thus, if we instead use char(127) or svm = \FC\ or equ xvm$ to \FC\ then OI cannot properly parse that as a Revelation delimiter?

Dave Harmacek

Harmacek Database Systems

At 12 DEC 2012 09:02AM Carl Pates wrote:

Hi Dave,

Carl, is it true that @svm is altered during UTF8 mode to be \C3BC\?

The Basic+ variable @svm is NEVER altered at any point - it is ALWAYS single byte: \FC\. It doesn't change, and neither do any of the others like @rm, @fm, @vm and so on.

Thus, if we instead use char(127) or svm = \FC\ or equ xvm$ to \FC\ then OI cannot properly parse that as a Revelation delimiter?

To stop me writing an essay here (which alas, this might require) can you give me a more specific example of the scenario you're thinking about? I'm not clear what you mean by parse - under what conditions?

Regards,

Carl Pates

Sprezzatura

At 12 DEC 2012 10:27AM Dave Harmacek wrote:

This is how OpenInsight can tell the difference between a "ü" character in UTF8 encoded data and an @svm. Your UTF8 encoded data is basically held between "illegal" UTF8 bytes - When OI parses a record or dynamic array to extract data using the <> operators it looks for single byte system delimiters (which it knows are not UTF8 characters) and returns the legal UTF8 bytes that represent characters between them.

In re-reading I see you have already answered my question.

The string has both UTF8-encoded characters and single-byte delimiters, just so that I don't have to worry.

Thanks

Dave Harmacek

Harmacek Database Systems

At 12 DEC 2012 02:16PM paxton scott wrote:

Thanks, Carl.

I really appreciate your explanation, so now I can get my issues sorted out!

With much appreciation,

Have fun,

Paxton

At 12 DEC 2012 02:20PM paxton scott wrote:

I forgot to comment that I do nothing in presentation server, all in BASIC+. But, with your explanation, I believe I can do the proper manipulations to get my web application to work properly.

Cheers

Paxton

At 13 DEC 2012 07:45AM Dave Harmacek wrote:

Thanks Carl. Now, if the system is in UTF8 mode, how well does Arev32 fare?

I ask because I have a web-based system that handles orders, and in some of them a foreign character has been entered into the web form. I can't figure out how to store this and reproduce it properly back at the English-speaking server!

Dave Harmacek

Harmacek Database Systems

At 13 DEC 2012 08:17AM Carl Pates wrote:

Hi Dave,

I'll have to leave that one for Bryan to answer I'm afraid as I'm not really privy to the internal presentation aspects of Arev32 alas :)

I assume that the foreign character is stored correctly in its multibyte UTF8 form in the record?

Carl Pates

Sprezzatura
