Sorting out Collation Sequences by Mike Pope

Published ByDateVersionKnowledge LevelKeywords
Sprezzatura Ltd01 MAR 19922.10+EXPERTCOLLATION, MAP, SEQUENCE, LANGUAGE, SET, ASCII, CM, CM_US, CM_ISO, %LOCAL_GROUP%, @LOWER.CASE, @UPPER.CASE

The introduction of language sets in Version 2.1 of Advanced Revelation was a great step forward in allowing the product to be 'localised' for specific languages and countries. Along with allowing language-specific attributes such as date format, time format, etc., the system now enables you to create language-specific alphabetic (or 'collation') sequences.

In earlier versions of Advanced Revelation, sorting was done strictly according to the ASCII sequence of the characters involved. For instance, the character 'A' (ASCII 65) would be followed by 'B' (ASCII 66). However, 'Ž' is ASCII 142, so it would sort incorrectly for German speakers, who expect 'Ž' to fall immediately after 'A'.

As of Version 2.10, Advanced Revelation allows resequencing using 'collation maps' (CMs). A CM is a table that stores the desired sequence of a set of characters, regardless of the actual ASCII-sequence value of the characters.

A CM 256 bytes long. Each byte of the map tells Advanced Revelation what the relative (ordinal) sort value is for that position in the ASCII chart. For example, the 65th position of the map tells the system what the sort order is for the character 'A' relative to any other character in the map.

When sorting using a CM, Advanced Revelation follows these steps:

  • extract the next character to be sorted
  • look into the CM at that character's ASCII sequence value (+1 to account for ASCII 0)
  • extract the sort sequence of that character

For example, here is an extract from a system CM:

(... 65 chars here)FILO

which translates into this:

         ÚÄÄÄÄÄÄÄÂÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿
         ³  Char ³ Pos ³ Sort Sequence ³
         ÃÄÄÄÄÄÄÄÅÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ´
         ³   A   ³ 66  ³  F  (70)      ³
         ³   B   ³ 67  ³  I  (73)      ³
         ³   C   ³ 68  ³  L  (76)      ³
         ³   D   ³ 69  ³  O  (79)      ³
         ³ (etc.)³     ³               ³
         ÀÄÄÄÄÄÄÄÁÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÙ

Why doesn't the sort sequence for 'B' (73) immediately follow that for 'A' (70)? Because the characters 'Ž' and 'b' come between them. If you look into the CM, indeed, you would find that the position for 'Ž' (143) is the character 'G' (71) and that at position for 'b' (99) is the character 'H' (72).

Advanced Revelation actually implements two types of CM, one for case sensitive sorting (in which 'a' and 'A' sort differently) and another for case insensitive sorting (in which 'a' and 'A' sort the same). The example above is of a case-sensitive CM, in which each character has a unique sort sequence value. A case insensitive map looks identical, but repeats the relative sort value for as many characters as necessary.

Advanced Revelation stores CMs as two fields. In Version 2.10, these were the 10th (case sensitive) and 11th (case insensitive) fields of the LND_ records in the SYSTEM file. This meant that the CMs were bound to individual language sets, which led to difficulties. In Version 2.11 CMs have been isolated into stand-alone records in the SYSTEM file named with a CM_ prefix, again with two fields (1/2 = sensitive/insensitive). The language set records, instead of containing the CMs themselves, just store in <10> the CM record key for that language.

Version 2.11 ships with two CMs: CM_US and CM_ISO. The former produces sort values that match the ASCII table; the latter treats alternates of a character (e.g. a …ƒ„) as having sequential sort values.

At login or level-1 reset, the system looks at the current language set (from the environment) and attempts to load the CMs associated with it. If there is any error (e.g. LEN(map) NE 256) Advanced Revelation reverts to using the default (ASCII) CM.

In R/BASIC, CMs are used for all character comparisons (e.g. IF 'B' LT '„'). Because of this, specific sort operations such as LOCATE BY and V119 will use the current CMs. Obviously, various system routines use LOCATE BY and V119, and therefore also implicitly take advantage of CMs. For example, R/LIST uses V119 for BY clauses, and indexes, of course, use LOCATE BY.

To ensure that all users use the same CM to update indexes, the system stamps CM information on the index when you create it, and checks it against the current CM when you attempt an update. The key to the CM is stored in the 8th field of the field record in the !file. It is this latter that caused problems in 2.10. In 2.10, this CM key was the language set name. If you changed language sets, even if the collation sequence remained the same, the indexing system prohibited modifications. In 2.11 the problem was resolved by storing only the actual CM name, which remains the same across many language sets. Of course, if you load up a new CM altogether, indexing will insist on a rebuild in order to resequence the index according to the new CM. For Quick/Rightdexes, the CM name is stored in the dict of the file in %LOCAL.GROUP% (value1 = dict CM, value2 = data CM).

So when are case-insensitive CMs used? The only place is when you use the special case-insensitive comparison operators (e.g. _EQC). You might think that they'd be used in indexes. However, indexing simply converts @LOWER.CASE to @UPPER.CASE for case-insensitive indexes (which means that you should check and adjust those two variables in the LND record). (Editors note: This can especially be a problem when using the B catalyst lookup logic. If @LOWER.CASE and @UPPER.CASE do not contain the foreign language set characters, then lookups involving them may fail if entered in lowercase - AMcA). Creating Your Own Collation Maps It is not difficult (just confusing) to create your own CMs. Following is a utility to do this. To use it, create two records in the LISTS file called NEW_CM_CASE and NEW_CM_NOCASE. For the case-sensitive list, put the characters, one per field, in the order that you want to re-sequence (e.g. A,Ž,,a,„,B,b – ','=@FM). In the second (case insensitive) list, each field can contain multiple characters per field to indicate that these have the same sort sequence (e.g. AŽa„,Bb). Include just the characters to be resequenced. Any missing characters are added into the CMs, in ASCII order, where you might have left any holes.

The program works as a reverse of how Advanced Revelation uses the CMs. For each character in the desired sequence, it goes to that ASCII value of the character in the new map, and inserts its sort sequence (relative position) there. Additional features of the utility are that it assumes standard sort order for ASCII values 1- 32, and that it checks for duplicate assignments.

The programs creates a new sort order record called CM_NEW. To create one of a different name, change the line which reads WRITEV NEW_CM ON SYSTEM_FILE, 'CM_NEW', FCTR.

 /*
   program: reset_cm
   author: mike pope, RT(UK)Ltd
   date:   11 Feb 1992
   notes: SOURCE      user's desired sort order
          NEW_CM      new CM created by this program
          ASCII_TABLE    ascii-chart values, used to backfill
 */

 OPEN 'SYSTEM' TO SYSTEM_FILE ELSE CALL FSMSG() ; STOP
 SOURCE_CASE   = XLATE('LISTS','NEW_CM_CASE',  '','X')
 SOURCE_NOCASE = XLATE('LISTS','NEW_CM_NOCASE','','X')

 /*
  REMEMBER! The records from the LISTS file should be @FM delimited
 */
 SOURCE = SOURCE_CASE   ; FCTR = 1  ; GOSUB MAKE_MAP
 SOURCE = SOURCE_NOCASE ; FCTR = 2  ; GOSUB MAKE_MAP
 STOP
 /*--------------------------------------------------------*/

MAKE_MAP:
 IF SOURCE ELSE RETURN
 NEW_CM = STR( @FM, 256 ) ; * initialize to all high values
 ASCII_TABLE = NEW_CM
 * Now build table of ASCII values > space
 FOR CTR = 33 TO 256
  ASCII_TABLE[CTR,1] = CHAR(CTR)
 NEXT
 /*
  initialize NEW_CMs to have the 1st 32 characters already in place
 */
 FOR CTR = 1 TO 32
  NEW_CM[CTR,1] = CHAR(CTR)
 NEXT
 FIELD_CNT = COUNT( SOURCE, @FM ) + 1 ; * We know Source is not null
 FOR FIELD_CTR = 1 TO FIELD_CNT
  NEXT_FIELD = SOURCE< FIELD_CTR >
  CHAR_CNT = LEN(NEXT_FIELD) ;* loop thru each char (req. for case-insens)
  FOR PTR = 1 TO CHAR_CNT
   NEXT_CHAR = NEXT_FIELD[ PTR, 1 ]
   VALUE     = SEQ( NEXT_CHAR )
   SORT_SEQ = CHAR(FIELD_CTR+32) ;*offset to skip 1st 32 char
   /*
     see if this value has already been put into NEW_CM - if there is not an
     @FM at this position, it has already been used
   */
   IF NEW_CM[ VALUE, 1 ] NE @FM THEN
     CALL MSG('Char %1% is duplicated!', '', '', NEXT_CHAR )
     STOP
   END ELSE
     NEW_CM[ VALUE, 1 ] = SORT_SEQ
     ASCII_TABLE[ VALUE,1] = @FM ; * @fm means 'already used'
   END
  NEXT PTR
 NEXT FIELD_CTR
 /* backfill characters not in new CM */
 NEW_CM_PTR = 1
 FOR CTR = 33 TO 256
  NEXT_CHAR = ASCII_TABLE[ CTR, 1 ]
  IF NEXT_CHAR NE @FM THEN
   * scan NEW_CM (from NEW_CM_PTR) for holes (marked by @FM)
   HOLE_FOUND = 0
   LOOP
     NEXT_CM_CHAR = NEW_CM[ NEW_CM_PTR, 1 ]
     IF NEXT_CM_CHAR EQ @FM THEN
      HOLE_FOUND = 1
     END ELSE
      NEW_CM_PTR += 1
     END ; * if next_cm_char eq @fm
   UNTIL HOLE_FOUND REPEAT
   NEW_CM[ NEW_CM_PTR, 1 ] = CHAR( CTR )
   NEW_CM_PTR += 1
  END ; * if next_char ne @fm
 NEXT CTR

 * shift CM 1 char right to add in ascii 0, convert hi- end chars to @VM
 CONVERT @FM:@RM TO '' IN NEW_CM
 NEW_CM[ 254, 2 ] = @VM:@VM ; * positions 254/255 are taken by @VMs in CMs
 * note that this means that @FMs and @RMs cannot be remapped
 NEW_CM = CHAR(0) : NEW_CM
 IF LEN(NEW_CM) = 256 THEN
  WRITEV NEW_CM ON SYSTEM_FILE, 'CM_NEW', FCTR
 END ELSE
  * Call appropriate (900 or 901) SYS.MESSAGES error message
  CALL MSG( 899+FCTR, '', '', 'CM_NEW' ) ; STOP
 END
RETURN

(Volume 3, Issue 9, Pages 7-10)

  • tips/revmedia/v3i9a6.txt
  • Last modified: 2024/06/19 20:20
  • by 127.0.0.1