Sign up on the Revelation Software website to have access to the most current content, and to be able to ask questions and get answers from the Revelation community

V119 sorting problem (OpenInsight 32-bit Specific)

At 08 JUN 2003 06:46:08PM Paul Rowe wrote:

We've just found that sorting using V119 is coming back with different results in OI 4.1.2 than in OI 3.7.

We have been using right justified sorts to sort complex alphanumberic data (and V119 was doing something quite clever with it). As an example, A 1, B 1, B 2, B 10, A 10, A 2 would become A 1, A 2, A 10, B 1, B 2, B 10.

In 4.1.2 the results are inconsistent. When sorting on a single column, the sorted data is in the same order as in 3.7. However, when sorting on two columns, where the first column is Right justified, the second column comes out in a different order than expected (on larger lists we were not able to determine any kind of sorting!).

The following bit of code demonstrates the problem. The data passed has nothing in the first column so it should then sort on the details in the second column. However, the sort results are different depending whether the first column is left or right justified. (if the first column contains the same value for every row then the results are the same).

justs=RR'; bys=AA'

data =':@fm: 'B 2': @rm

data := :@fm: 'A 10': @rm data := :@fm: 'B 1': @rm

data := :@fm: 'B 10': @rm data := :@fm: 'A 1': @rm

data := :@fm: 'A 2': @rm sort.file=drive():'\DATAVOL\TEST1.TMP' Call V119( 'D', sort.file, , , , ) Call V119( 'I', sort.file, , , , flag)

If flag then

temp.data=data
' temp.data will sort in an unexpected order
Call V119( 'S', sort.file, bys, justs, temp.data, flag)
temp.data=data
justs=LR'
' temp.data will sort in the expected order
Call V119( 'S', sort.file, bys, justs, temp.data, flag)

end

Any idea what's going on here? We're not using UTF8.

Cheers,

Paul


At 09 JUN 2003 06:25AM Revelation Technical Support wrote:

Paul-

We just tested this here using the 4.1.3 build and it looked fine:

a1, a2, a10, b1, b2, b10

We used the same code you posted.

Thanks


At 09 JUN 2003 04:29PM Oystein Reigem wrote:

I tried Paul's code in 4.1.2. I got different results -

B 1 , A 1 , B 2 , A 2 , A 10 , B 10

versus

A 1 , A 2 , A 10 , B 1 , B 2 , B 10

I assume these are the results he gets.

I haven't changed to 4.1.3 yet. Should I worry? I'd like my sorts to work correctly.

- Oystein -


At 09 JUN 2003 05:38PM Paul Rowe wrote:

I've tested this today in 4.1.3 with the sample code and the problem still occurs.

In the sample code there were two v119 sorts. The first one (two Right justified columns) returns the strange results:

B 1 , A 1 , B 2 , A 2 , A 10 , B 10

versus the second one (Left justified for column 1, Right just for column 2)

A 1 , A 2 , A 10 , B 1 , B 2 , B 10

I've also tried it on a NT 4.0 and XP and both had the problem, so it doesn't seem to be operating system specific.

Thanks,

Paul


At 09 JUN 2003 06:23PM Warren wrote:

I'm surprised this sorts as expected in v3.7. Seems to me the bug is in 3.7.

Since the length of the numeric portion varies, the position of the alpha and following space character shifts. Thus the alpha character will compare against null, another alpha character or space.

I know Mac OSes will collate as you expect, but I've never seen a PC system collate like that unless you use a fixed length on numeric portion.


At 09 JUN 2003 09:08PM Paul Rowe wrote:

I'm not sure I follow you Warren.

In the short results that I believe are incorrect B 1 comes BEFORE A 1 in an ascending sort. How can that make sense? These two strings are the same length so it shouldn't matter whether it is right of left justified. In the same set of results, B 10 comes AFTER A 10, so there doesn't seem to be any pattern.

There's definitely a problem here because adding a field containing null before this field has resulted in a completely different sort order (sorting on one field containing the values listed will give the same result as we used to get in 3.7).

You are right that the results in 3.7 aren't a simple right justified sort. The results were:

A 1 , A 2 , A 10 , B 1 , B 2 , B 10 - this seems to be a clever combination of left and right sorting (not sure how it is done) that achieves the most natural looking order.

A straight right justified order would be:

A 1, A 2, B 1, B 2, A 10, B 10.

While this is correct in simple terms it would be a great pain for us as we would no longer be able to sort on these alphanumberic codes (neither right or left justified would sort them how we could in 3.7).

Not an easy problem!

Thanks for the input.

Paul


At 10 JUN 2003 03:31AM Oystein Reigem wrote:

Warren, Revelation,

You must agree there's something funny here no matter how "clever" V119 is when it sorts the second column.

Paul's code sorts the same data twice. In both cases he first sorts on the first column, which is empty. The only difference between his two sorts is the justification of that first column.

Paul,

Without having studied the details I agree with Warren when he thinks you shouldn't expect a good sort on mixed data like that.

What I would do in your case is either

(1) split the values into non-numeric and numeric parts, and put the parts in their own sort fields, e.g,

value: 'B 10' -] 'B ' : @RM : '10'

and then sort the two fields with alternate justifications 'LR',

or

(2) convert your values so the numeric pieces become fixed length, right justified and zero filled. Also put a very low character as a delimiter between your pieces. E.g,

value: 'B 10' -] 'B ' : char(1) : '000010'

Then do a left justified sort. Before you convert you must of course decide how many digits your numbers might contain.

Both these methods can be generalized to handle more complex values, e.g,

A34-XY192

B123-Z5

- Oystein -


At 10 JUN 2003 06:52PM Paul Rowe wrote:

Today I checked in Arev 2.12, and even back then the Right justified sorts give the order:

A 1, A 2, A 10, B 1 etc.

All of the sort fields in our application are right justified because that has always given us the best results regardless of the data - text appears in alphabetical order, while numbers within the text are sorted numerically.

We use this right justified sorts on text fields and it would not be practical to split this into two fields as data may not always fit a specific pattern. If we can't get the right justified sorts to work how they have worked in the past then we'd need to change them all to left justified for the text fields. This would be a step backwards for us.

We've used the logic of swapping 1 with 00001, 2 with 00002 etc before, but this becomes difficult to implement on free text fields where the format of the data can be quite different from site to site.

A common example in our application is the name of a location within a building. Each museum has its own format for this, which can be consistent within that museum but completely different to everyone else.

Thanks for the suggestions. I'll be interested to see what Revelation can find out.

Paul


At 11 JUN 2003 05:59AM Oystein Reigem wrote:

Paul,

All of the sort fields in our application are right justified because that has always given us the best results regardless of the data - text appears in alphabetical order, while numbers within the text are sorted numerically. We use this right justified sorts on text fields and it would not be practical to split this into two fields as data may not always fit a specific pattern.

You don't have to split into two fields. See my second alternative, where I keep the value in one field but use an "internal" delimiter char(1) between the non-numeric and numeric parts. (The delimiter must be a value lower than any character naturally occurring in the field to guarantee a correct sort.)

You can't sort on the original field though. You must have a symbolic or something for the converted, char(1)-delimited version, and sort on that.

We've used the logic of swapping 1 with 00001, 2 with 00002 etc before, but this becomes difficult to implement on free text fields where the format of the data can be quite different from site to site. A common example in our application is the name of a location within a building. Each museum has its own format for this, which can be consistent within that museum but completely different to everyone else.

Sure. That's the kind of data I've been using my technique for. In my case it's catalog numbers of artefacts or photographs. Every museum has their own system, or even several different systems at the same time. E.g, the number of numeric parts may vary.

In general such data follow the syntax:

where the leading and trailing non-numeric parts might be empty, and the number of numeric parts can be zero or more.

My code converts such data to:

When the converted data are sorted left justified they normally produce a sensible result.

I can think of two cases where the system breaks down:

- Unintentional variations in the use of special characters, like XYZ-1990-123A versus XYZ.1990-123A

- Non-numeric numbering, like Roman numerals (IX will sort before V) and letters (AZ will sort OK, but not the next "numbers" AAZZ).

The programming is not difficult. I haven't got a version you can use right away, but can prepare one for you if you like.

Or challenge me with a set of real data.

- Oystein -


At 11 JUN 2003 07:01PM Paul Rowe wrote:

Thanks Oystein. I agree that your system is workable and we use exactly the same system for managing catalogue numbers for artefacts.

A more complex example would be:

A1A,B1A,B2A,A1BB,A10A,A2A,B10A

Prior to OI 4 this was sorted in a right justified sort as:

A1A,A1BB,A2A,A10A,B1A,B2A,B10A

This is exactly what we want. While we could right a symbolic to process this as you suggest, it would require looping through the data character by character to separate the alphabetic and number elements first.

We're relunctant to change it for a few reasons.

1. We'd need to create new symbolic fields for some of the existing right justified alphanumeric fields (there are 100's of these, but some are already symbolics). The new symbolic code would also add an additional layer of processing which would slow the sort down (possibly not significantly).

2. We be changing something that has worked for the last 10 years.

3. There's definitely a problem in V119 at present, and until it is corrected we don't really know what the default right justified sort order V119 will end up being, so we'd rather hold off making any major changing until Revelation respond.

Thanks for all of your input. It's great to have responses from someone dealing with similar data.

Cheers,

Paul


At 12 JUN 2003 04:17AM Oystein Reigem wrote:

Paul,

Here's to summarize and clarify some of the matter we discuss:

In a database app sorting happens in two different places. Some sorting is done by the system when it makes indexes and result lists. Let's call that "system sorting". Some sorting is done on data in variables or OS files by the developer's own code. Let's call that "programmatic sorting". In OI programmatic sorting normally is done by calling V119. Presumably also all system sorting uses V119.

Now what if the developer (user) needs a non-standard sort order on some data? Today's solution in both cases is to run data through some conversion before it is sorted. The trick is to reshape the data so it can be sorted with a standard sort. In the first case (indexes and result lists) one must use symbolic fields that do that conversion. This was the method that I suggested.

Here are some new ideas (but they need Revelation to be realized):

A possible different approach would be to change what's at the core of a sort function - the code that compares one value with another value. What if a new, improved V119 could accept a user-defined (developer-defined) comparison function? That could make non-standard programmatic sorting easier. For non-standard system sorting to work each field's dictionary item must have an additional entry with information about which comparison to use, and system sorting must use that information when making indexes and sorting result lists.

You mention some disadvantages with using symbolics. Among them is the possible need for a lot of symbolics - one for each field that needs a different sort. Yet another approach would be to have kind of "promoted" or "global" symbolics - formulas that could easily be applied to any field. E.g, let's say you have fields FIELD1, FIELD2, FIELD3, etc, that all need the same kind of symbolic. Today one has to make symbolics for each of them: FIELD1_SYMB, FIELD2_SYMB, etc. Wouldn't it be nice to be able to re-use a formula (conversion function) MYSORT by saying

FIELD1.MYSORT, FIELD2.MYSORT, FIELD3.MYSORT, etc,

or

/i]MYSORT(FIELD1), MYSORT(FIELD2), MYSORT(FIELD3)[/i], etc.

Wouldn't it be nice to be able to do

SELECT SOMETABLE BY SOMEFIELD.MYSORT

or

SELECT SOMETABLE BY MYSORT(SOMEFIELD) ?

- Oystein -


At 12 JUN 2003 09:42AM Warren wrote:

My mistake, I was thinking Left justified sort.


At 12 JUN 2003 10:30AM Warren wrote:

Isn't a user conversion (ICONV/OCONV) essentially a global symbolic?

Sort is done on the ICONVed data, user sees the OCONVed data. Simple enough and no need for Revelation's intervention.


At 12 JUN 2003 10:56AM [url=http://www.sprezzatura.com" onMouseOver=window.status= Click here to visit our web site?';return(true)]The Sprezzatura Group[/url] wrote:

Respect.

The Sprezzatura Group

World Leaders in all things RevSoft


At 12 JUN 2003 07:31PM Paul Rowe wrote:

True - conversion routines can be a standard way of turning display data into sortable data, with date fields being a common example.

However, the sort requirements for data are not always the same as the validation and display requirements for the data.

e.g. We have a text title field that may start with 'The ' or 'A '. A symbolic field allows us to convert the data to upper case and gives the user the option of ignoring certain starting words during sorting (our telephone directory has the same sort order, ignoring 'The ' at the start of a company name).

I like the idea of hooking in a custom sort option into V119. I guess this is similar to how we use the Btree.extract option to hook in custom search comparisons. Technically, it would be more difficult at present with V119 being an external assembler routine.

We currently use some global symbolics which we have implemented in SYSDICT and then pass parameters to in our own @vars.

Any word from Revelation on the original problem in V119 that started this discussion?

Paul


At 13 JUN 2003 03:35AM Oystein Reigem wrote:

Warren,

Hmm. I never thought of that.

Paul has some reservations, though. I have some second thoughts too:

- To implement the scheme all relevant dictionary items must be changed, all relevant forms must be recompiled (mustn't they?), and all relevant fields must be run through a conversion to the new internal format. So it doesn't come free.

- The numeric parts might already be fixed length, right justified, zero-filled, e.g XY.00007, XY.00123, XY.11109. Or they might be variable length: Z-1, Z-345, Z-1001. Both cases might occur in the same field of the same database. But unless the code has specific knowledge about all of each client's numbering schemes it needs to handle all numeric parts the same way. So in the internal version all numeric parts are right justified, zero-filled with a fixed length of say 8 digits: XY.#00000007, XY.#00000123, XY.#00011109, Z-#00000001, Z-#00000345, Z-#00001001. (For # read char(1), or some other suitable delimiter.) And in the external version all the numeric parts get stripped of their leading zeros: XY.7, XY.123, XY.11109, etc. This will not go down well with users who've had a long-standing relationship with their fixed-length numbers. They occur on labels on artefacts, in old manual catalogs, etc.

- The internal version of such mixed numeric/non-numeric data might grow to be very large. Say client A has values that are typically 16 characters long in their external version and contain three smallish numeric parts, with 8 digits in all. Say client B has values that are typically 12 characters long with one numeric part 8 characters long. To cover both these cases (which might not even be the extreme cases) with the same scheme the internal values must have numeric parts that are 8 digits long, so client A's internal value become about 37 characters long…

- …Now say I use these values for keys. (Is it at all wise to have conversion on keys?)

I will keep your reminder about user conversions in mind, though. They can be used for more than I thought.

- Oystein -


At 16 JUN 2003 03:50PM Revelation Technical Support wrote:

An update to this thread has been posted to the Works discussion group.

View this thread on the forum...