Sorting out Collation Sequences by Mike Pope
Published By | Date | Version | Knowledge Level | Keywords |
---|---|---|---|---|
Sprezzatura Ltd | 01 MAR 1992 | 2.10+ | EXPERT | COLLATION, MAP, SEQUENCE, LANGUAGE, SET, ASCII, CM, CM_US, CM_ISO, %LOCAL_GROUP%, @LOWER.CASE, @UPPER.CASE |
The introduction of language sets in Version 2.1 of Advanced Revelation was a great step forward in allowing the product to be 'localised' for specific languages and countries. Along with allowing language-specific attributes such as date format, time format, etc., the system now enables you to create language-specific alphabetic (or 'collation') sequences.
In earlier versions of Advanced Revelation, sorting was done strictly according to the ASCII sequence of the characters involved. For instance, the character 'A' (ASCII 65) would be followed by 'B' (ASCII 66). However, '' is ASCII 142, so it would sort incorrectly for German speakers, who expect '' to fall immediately after 'A'.
As of Version 2.10, Advanced Revelation allows resequencing using 'collation maps' (CMs). A CM is a table that stores the desired sequence of a set of characters, regardless of the actual ASCII-sequence value of the characters.
Collation Maps in Advanced Revelation
A CM 256 bytes long. Each byte of the map tells Advanced Revelation what the relative (ordinal) sort value is for that position in the ASCII chart. For example, the 65th position of the map tells the system what the sort order is for the character 'A' relative to any other character in the map.
When sorting using a CM, Advanced Revelation follows these steps:
- extract the next character to be sorted
- look into the CM at that character's ASCII sequence value (+1 to account for ASCII 0)
- extract the sort sequence of that character
For example, here is an extract from a system CM:
(... 65 chars here)FILO
which translates into this:
ÚÄÄÄÄÄÄÄÂÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ ³ Char ³ Pos ³ Sort Sequence ³ ÃÄÄÄÄÄÄÄÅÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´ ³ A ³ 66 ³ F (70) ³ ³ B ³ 67 ³ I (73) ³ ³ C ³ 68 ³ L (76) ³ ³ D ³ 69 ³ O (79) ³ ³ (etc.)³ ³ ³ ÀÄÄÄÄÄÄÄÁÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ
Why doesn't the sort sequence for 'B' (73) immediately follow that for 'A' (70)? Because the characters '' and 'b' come between them. If you look into the CM, indeed, you would find that the position for '' (143) is the character 'G' (71) and that at position for 'b' (99) is the character 'H' (72).
Advanced Revelation actually implements two types of CM, one for case sensitive sorting (in which 'a' and 'A' sort differently) and another for case insensitive sorting (in which 'a' and 'A' sort the same). The example above is of a case-sensitive CM, in which each character has a unique sort sequence value. A case insensitive map looks identical, but repeats the relative sort value for as many characters as necessary.
Storing Collation Maps
Advanced Revelation stores CMs as two fields. In Version 2.10, these were the 10th (case sensitive) and 11th (case insensitive) fields of the LND_ records in the SYSTEM file. This meant that the CMs were bound to individual language sets, which led to difficulties. In Version 2.11 CMs have been isolated into stand-alone records in the SYSTEM file named with a CM_ prefix, again with two fields (1/2 = sensitive/insensitive). The language set records, instead of containing the CMs themselves, just store in <10> the CM record key for that language.
Version 2.11 ships with two CMs: CM_US and CM_ISO. The former produces sort values that match the ASCII table; the latter treats alternates of a character (e.g. a ) as having sequential sort values.
Using Collation Maps
At login or level-1 reset, the system looks at the current language set (from the environment) and attempts to load the CMs associated with it. If there is any error (e.g. LEN(map) NE 256) Advanced Revelation reverts to using the default (ASCII) CM.
In R/BASIC, CMs are used for all character comparisons (e.g. IF 'B' LT ''). Because of this, specific sort operations such as LOCATE BY and V119 will use the current CMs. Obviously, various system routines use LOCATE BY and V119, and therefore also implicitly take advantage of CMs. For example, R/LIST uses V119 for BY clauses, and indexes, of course, use LOCATE BY.
To ensure that all users use the same CM to update indexes, the system stamps CM information on the index when you create it, and checks it against the current CM when you attempt an update. The key to the CM is stored in the 8th field of the field record in the !file. It is this latter that caused problems in 2.10. In 2.10, this CM key was the language set name. If you changed language sets, even if the collation sequence remained the same, the indexing system prohibited modifications. In 2.11 the problem was resolved by storing only the actual CM name, which remains the same across many language sets. Of course, if you load up a new CM altogether, indexing will insist on a rebuild in order to resequence the index according to the new CM. For Quick/Rightdexes, the CM name is stored in the dict of the file in %LOCAL.GROUP% (value1 = dict CM, value2 = data CM).
So when are case-insensitive CMs used? The only place is when you use the special case-insensitive comparison operators (e.g. _EQC). You might think that they'd be used in indexes. However, indexing simply converts @LOWER.CASE to @UPPER.CASE for case-insensitive indexes (which means that you should check and adjust those two variables in the LND record). (Editors note: This can especially be a problem when using the B catalyst lookup logic. If @LOWER.CASE and @UPPER.CASE do not contain the foreign language set characters, then lookups involving them may fail if entered in lowercase - AMcA). Creating Your Own Collation Maps It is not difficult (just confusing) to create your own CMs. Following is a utility to do this. To use it, create two records in the LISTS file called NEW_CM_CASE and NEW_CM_NOCASE. For the case-sensitive list, put the characters, one per field, in the order that you want to re-sequence (e.g. A,,,a,,B,b – ','=@FM). In the second (case insensitive) list, each field can contain multiple characters per field to indicate that these have the same sort sequence (e.g. Aa,Bb). Include just the characters to be resequenced. Any missing characters are added into the CMs, in ASCII order, where you might have left any holes.
The program works as a reverse of how Advanced Revelation uses the CMs. For each character in the desired sequence, it goes to that ASCII value of the character in the new map, and inserts its sort sequence (relative position) there. Additional features of the utility are that it assumes standard sort order for ASCII values 1- 32, and that it checks for duplicate assignments.
The programs creates a new sort order record called CM_NEW. To create one of a different name, change the line which reads WRITEV NEW_CM ON SYSTEM_FILE, 'CM_NEW', FCTR.
/* program: reset_cm author: mike pope, RT(UK)Ltd date: 11 Feb 1992 notes: SOURCE user's desired sort order NEW_CM new CM created by this program ASCII_TABLE ascii-chart values, used to backfill */ OPEN 'SYSTEM' TO SYSTEM_FILE ELSE CALL FSMSG() ; STOP SOURCE_CASE = XLATE('LISTS','NEW_CM_CASE', '','X') SOURCE_NOCASE = XLATE('LISTS','NEW_CM_NOCASE','','X') /* REMEMBER! The records from the LISTS file should be @FM delimited */ SOURCE = SOURCE_CASE ; FCTR = 1 ; GOSUB MAKE_MAP SOURCE = SOURCE_NOCASE ; FCTR = 2 ; GOSUB MAKE_MAP STOP /*--------------------------------------------------------*/ MAKE_MAP: IF SOURCE ELSE RETURN NEW_CM = STR( @FM, 256 ) ; * initialize to all high values ASCII_TABLE = NEW_CM * Now build table of ASCII values > space FOR CTR = 33 TO 256 ASCII_TABLE[CTR,1] = CHAR(CTR) NEXT /* initialize NEW_CMs to have the 1st 32 characters already in place */ FOR CTR = 1 TO 32 NEW_CM[CTR,1] = CHAR(CTR) NEXT FIELD_CNT = COUNT( SOURCE, @FM ) + 1 ; * We know Source is not null FOR FIELD_CTR = 1 TO FIELD_CNT NEXT_FIELD = SOURCE< FIELD_CTR > CHAR_CNT = LEN(NEXT_FIELD) ;* loop thru each char (req. for case-insens) FOR PTR = 1 TO CHAR_CNT NEXT_CHAR = NEXT_FIELD[ PTR, 1 ] VALUE = SEQ( NEXT_CHAR ) SORT_SEQ = CHAR(FIELD_CTR+32) ;*offset to skip 1st 32 char /* see if this value has already been put into NEW_CM - if there is not an @FM at this position, it has already been used */ IF NEW_CM[ VALUE, 1 ] NE @FM THEN CALL MSG('Char %1% is duplicated!', '', '', NEXT_CHAR ) STOP END ELSE NEW_CM[ VALUE, 1 ] = SORT_SEQ ASCII_TABLE[ VALUE,1] = @FM ; * @fm means 'already used' END NEXT PTR NEXT FIELD_CTR /* backfill characters not in new CM */ NEW_CM_PTR = 1 FOR CTR = 33 TO 256 NEXT_CHAR = ASCII_TABLE[ CTR, 1 ] IF NEXT_CHAR NE @FM THEN * scan NEW_CM (from NEW_CM_PTR) for holes (marked by @FM) HOLE_FOUND = 0 LOOP NEXT_CM_CHAR = NEW_CM[ NEW_CM_PTR, 1 ] IF NEXT_CM_CHAR EQ @FM THEN HOLE_FOUND = 1 END ELSE NEW_CM_PTR += 1 END ; * if next_cm_char eq @fm UNTIL HOLE_FOUND REPEAT NEW_CM[ NEW_CM_PTR, 1 ] = CHAR( CTR ) NEW_CM_PTR += 1 END ; * if next_char ne @fm NEXT CTR * shift CM 1 char right to add in ascii 0, convert hi- end chars to @VM CONVERT @FM:@RM TO '' IN NEW_CM NEW_CM[ 254, 2 ] = @VM:@VM ; * positions 254/255 are taken by @VMs in CMs * note that this means that @FMs and @RMs cannot be remapped NEW_CM = CHAR(0) : NEW_CM IF LEN(NEW_CM) = 256 THEN WRITEV NEW_CM ON SYSTEM_FILE, 'CM_NEW', FCTR END ELSE * Call appropriate (900 or 901) SYS.MESSAGES error message CALL MSG( 899+FCTR, '', '', 'CM_NEW' ) ; STOP END RETURN
(Volume 3, Issue 9, Pages 7-10)