BTREE.EXTRACT Using Metaphone Lookups

Published ByDateVersionKnowledge LevelKeywords
Revelation Technologies17 JAN 19912.XEXPERTPARSING, MODULE, METAPHONE

The December 1990 issue of Computer Language presents an article by Lawrence Philips which describes a new method of conducting "sounds like" searches. This Technical Bulletin shows you how to apply this technique to Advanced Revelation Btree lookups.

Traditionally, "sounds like" lookups have been implemented using an algorithm called Soundex. Soundex implementations typically have two drawbacks. First, they require that you know the first letter of the word. For example, to get a correct lookup on the word "phone" you would have to know that it begins with the letter "P", not the letter "F".

Second, they can miss similar words and report words that do not sound at all Metaphone eliminates the first problem and reduces the second.

Metaphone works like Soundex in that it takes as input a word and returns an encoded version of that word. Unlike Soundex, Metaphone returns a phonetic version of the word.

The phonetic version of the word is obtained by applying a series of transformation rules to the word. The end result is a word containing one of 16 consonant sounds.

The Metaphone rules are simple and follow standard English pronunciation rules. The goal is to reduce the word to one of these consonant values:

B X S K J T F H L M N R P R 0 W Y

Of these, two have special meaning. "X" should be read as "SH". "0" should be read as "TH".

A vowel is kept only when it appears at the beginning of a word.

The rules:

  1. If the word begins with "KN", "GN", "PN", "AE", "WR", or "MN", drop the first letter.
  2. If the word begins with "X", change it to "S".
  3. If the word begins with "WH", drop the "H".
  4. The letters "F", "J", "L", "M", "N", "R" are retained.
  5. "B" is retained unless it is the last letter of the word and follows an "M", in which case it is silent. For example: "numb".
  6. "C" is changed to "K" except if followed by "H" or "IA", then it is changed to "X" if followed by "E", "I", or "Y", then it is changed to "S" if preceded by "S" and followed by "E", "I", or "Y", it is silent.
  7. "D" is changed to "T" unless it is followed by "GE", "GI", or "GY", then it is changed to "J".
  8. "G" is changed to "K" except
    - in the construction "GH" and not at the end of the word or before a vowel, then silent
    - in the construction "GN", "GNED", or "GNER" at the end of a word, then silent
    - if before "I", "E", or "Y" and not a double "G", then it is changed to "J".
  9. "H" is retained unless it follows a vowel and is not followed by a vowel, then it is silent.
  10. "K" is retained unless it follows "C", then it is silent.
  11. "P" is retained unless it is followed by "H", then it is changed to "F".
  12. "Q" is changed to "K".
  13. "S" is retained except if it is followed by "H", "IO" or "IA", then it is changed to "0".
  14. "T" is retained except if it is followed by "IA", "IO", or "CH", then it is changed to "X" if it is followed by "H", then it is changed to "0"
  15. "V" is changed to "F".
  16. "X" is changed to "KS".
  17. "Y" is retained if it is followed by a vowel, silent otherwise.
  18. "Z" is changed to "S".

Figure 1 provides an example of how these rules can be implemented in an R/BASIC program.

This implementation of Metaphone differs in several respects from the implementation given by Philips.

First, Philips processes each letter of the input word, even if previous rules have made it unnecessary. For example, in the word "DODGER" the "G" and the "E" following the second "D" will be evaluated, even though the rule for "D" has already determined that "DGE" should be replaced with "J".

By maintaining a pointer to the current letter being processed and a separate variable to increment the pointer, we can move through the word much faster. Using the example above, when the second "D" is processed, the program determines that the two letters following the D have been accounted for and adjusts the increment counter so the pointer will point to the "R" at the top of the next loop. By not reprocessing the "GE" the program finishes sooner.

This change affects the implementation of some of the rules. It does not affect the results.

Second, Philips uses a large CASE statement to control his program. In R/BASIC, a lengthy CASE statement can be quite expensive in terms of processing speed, particularly for the conditions toward the end of the statement.

The program given here overcomes this problem by controlling the branches to specific subroutines through an ON…GOSUB statement.

Philips skips all double letter combinations except for "CC". Although he does not explain why, this is probably to account for words like "ACCESS", in which each "C" has a different sound. However, he does not deal with the possibility elsewhere in the program.

The R/BASIC program does not check for double "C" combinations. The "CC" in "ACCOUNT" will be treated the same way as the double "C" in "ACCESS", even though in the latter case it is "incorrect".

Finally, Philips says:

    For practical purposes, usually only the first four (if there are that many) letters of the phonetic spelling are used...

You may find, however, that four characters yields too many potential matches in your application. If so, experiment with higher values for METAPHMAX$ to see how that affects the number of matches.

Both the Philips implementation and the one given here suffer from the vagaries of the English language. The "ACCOUNT" vs "ACCESS" example given above is just one case in which exceptions to the rules make the programmer's task difficult. Other examples include words like "TECHNICAL" and "MECHANICAL", where the "CH" is sounded as a "K" but the program treats it as a "SH".

Because Metaphone is rooted in the rules of English pronunciation, it does not deal well with words and phrases borrowed from other languages. This becomes particularly troublesome for proper names. The President of Revelation Technologies is Jim Acquaviva. Metaphone returns "AKKF" for the correct spelling of his name. "Aquaviva" returns "AKFF", and "Akwaviva" returns "AKWF".

By modifying the rules you can tune Metaphone to suit the needs of your application.

Once you have a Metaphone routine working to your satisfaction, it is a straightforward matter to hook it into BTREE.EXTRACT.

Note: Technical Bulletin #34, User Index in BTREE.EXTRACT fully explains the technique given here. This bulletin concentrates on the specifics of implementing Metaphone as a user index.

The user index requires two programs. The first is responsible for identifying search criterion and is referred to as the parsing module. The second, the compare module, determines whether values from the index match the criterion given.

The parsing module is identified to BTREE.EXTRACT by a record in the VOC called %USER.INDEX%. It has the layout:

FieldValue
1USER.INDEX (literal text)
2name of the parsing module
3name of the default compare routine

Using the program given in Figure 2 as an example, %USER.INDEX% would look like this:

FieldValue
1USER.INDEX
2PARSER

Note that field 3 is optional and, in this case, empty.

While the parsing module can use any means to determine whether the criterion passed to it are its concern, some sort of trigger character is typically used.

For the search routine illustrated in Figure 2, the trigger is a dollar sign. If the search routine sees a dollar sign as the first character of a search criterion, the routine assumes that a Metaphone search is required. If any character but a dollar sign appears as the first character, the search is passed on to BTREE.EXTRACT for normal processing.

Note: Any character can be used. The dollar sign was selected because it has an "S" shape which can be used to remind users of its "sounds like" role.

Once found, the trigger is stripped from the search criterion, the criterion is converted to the Metaphone equivalent, and the search starting point in the index is determined.

In most cases the search will begin at the first value in the index that begins with the same letter as the search criterion. If no starting value is given, the search begins at the start of the index, possibly causing unnecessary comparisons and thus delay.

For example, if the search criterion is "NORBERT" searching index values that begin with "A", "B", "C", etc is a waste of time. The search should begin with the first value that starts with "N".

There is a complication here, however. Recall that the Metaphone rules can change the first letter of the search word. In fact, the starting point of the search cannot be based just on the first letter of the word returned from Metaphone. "NORBERT", which returns "NRBR" will work. But what about "GNOME", which returns "NM"? Starting the search for GNOME in the Ns will miss GNOME.

To make sure the index is searched thoroughly, but not excessively, PARSE maintains a lookup table of the earliest entry points necessary for special characters. For example, in order to ensure that all candidates for words whose Metaphone value begins with "N" are found, the index search must being with the "G"s. Similarly if the Metaphone value begins with "E", the search must being with the "A"s.

Note that in some cases, "X" and "V", for example, most of the index still must be traversed. However, this is preferable to having to traverse the entire index for all cases.

The compare module is responsible for taking the search value (as modified by the parsing module) and comparing it to a candidate value. The candidate value is provided by BTREE.EXTRACT and comes from the Btree index. Figure 3 is an example of a compare module.

In addition to simply identifying a candidate as a match, the compare routine can also instruct BTREE.EXTRACT to stop searching. This is useful when it is known that all further comparisons will not return hits. For example, once index values starting with "B" have been encountered, there is no reason to continue the search if the search value is "APPLE".

Just as the parse module has the problem of determining where to being the search, the compare module has the problem of ensuring that the search is not ended prematurely. And, as with the parse module, the compare module uses a lookup table to determine how far is far enough.

So, for example, if the search value starts with "K" all index values through "Q" must be searched.

Once enough index values have been searched, the compare module tells BTREE.EXTRACT that no further searching needs to be done. BTREE.EXTRACT then takes care of all necessary ANDing and ORing logic.

Figure 1

SUBROUTINE METAPHONE(METAPH)

/*
This program is based on the program presented in "Hanging on the
Metaphone", Computer Language, December 1990, pp 39-43
*/

EQU TRUE$      TO 1
EQU FALSE$     TO 0
EQU OTHERWISE$ TO 1
EQU SPACE$     TO \20\
EQU NULL$      TO ""

EQU METAPHMAX$ TO 4 ;* Max len of the resulting code.
EQU VOWELS$    TO "AEIOU"
EQU FRONTV$    TO "EIY"
EQU NOCHANGE$  TO "FJLMNR" ;* These letters are not changed.
EQU TRANSFORM$ TO "BCDGHKPQSTVWXYZ" ;* These letters indicate a change.
EQU NONALPHA$  TO ".,/`';{}-=<>?:~[]+_()*^%$#@!\| " : \22\ : ï"0123456789"

/*

FIRST_TWO$ is a list of letter pairs that, when found at the beginning of a
word, the first letter is stripped off and just the second is used.

PN - as in PNEUMATIC
AE - as in AEROSOL
KN - as in KNOW
GN - as in GNAW
WR - as in WRITE
MN - as in MNEMONIC

*/

EQU FIRST_TWO$ TO "AEKNGNWRMNPN"

/* Make sure word has only letters and is upper case. */
CONVERT NONALPHA$   TO NULL$ IN TEXT
CONVERT @LOWER.CASE TO @UPPER.CASE IN TEXT

/* Check the beginning of the word for special cases. */
FIRST = TEXT[1,2]
IF INDEX(FIRST_TWO$, FIRST, 1) THEN
  TEXT[1,1] = NULL$
END

/* If the first letter is an "X", change it to "S". */
IF TEXT[1,1] = "X" THEN
  TEXT[1,1] = "S"
END

/* If the initial letters are "WH", strip out the "H" */
IF TEXT[1,2] = "WH" THEN
  TEXT[2,1] = NULL$
END

WORD_LEN = LEN(TEXT)
METAPH   = NULL$
PTR      = 1 ;* Pointer to the current letter.

LOOP
   INCR = 1 ;* Default pointer increment
   CURR_LET = TEXT[PTR, 1] ;* Letter to process
   IF CURR_LET = TEXT[PTR + 1, 1] THEN
     /*
     There is a pair of letters, set incr to jump over the second in the
     pair.
     */
     INCR += 1
   END
   BEGIN CASE
      CASE INDEX(VOWELS$, CURR_LET, 1)
         IF PTR = 1 THEN
            * This is the first letter, keep the vowel, otherwise skip.
            METAPH = CURR_LET
         END

      CASE INDEX(NOCHANGE$, CURR_LET, 1)
         /* These letters aren't changed, add them to the metaph */
         METAPH := CURR_LET

      CASE INDEX(TRANSFORM$, CURR_LET, 1)
         /*
         Some sort of transformation is indicated. Find out which one and
         branch to the appropriate subroutine.
         */
         POS = INDEX(TRANSFORM$, CURR_LET, 1)
         ON POS GOSUB B,C,D,G,H,K,P,Q,S,T,V,W,X,Y,Z
   END CASE

   PTR += INCR ;* Increment the character pointer
WHILE (PTR <= WORD_LEN) AND (LEN(METAPH) < METAPHMAX$)
REPEAT
RETURN
STOP

/***********************************************************************
INTERNAL SUBROUTINES
***********************************************************************/

B:
/*
B stays a B unless followed by an M at the end of a word, then silent -
numb, dumb, crumb
*/

IF PTR = WORD_LEN AND TEXT[PTR - 1, 1] = "M" ELSE
  METAPH := "B"
END
RETURN

C:
/*
C changes to K unless
followed by H or IA, then X (sh) - march, acacia followed by I, E, or
Y, then S - mice, lecithin, lacy
preceded by S and followed by I, E, or Y, then silent - science,
scene, scythe preceded by S and followed by H, then K - school
*/

BEGIN CASE
  CASE TEXT[PTR + 1, 1] = "H"
    IF TEXT [PTR - 1, 1] = "S" THEN
      METAPH := "K"
    END ELSE
      METAPH := "X"
    END
    INCR += 1

  CASE TEXT[PTR + 1, 2] = "IA"
    METAPH := "X"
    INCR += 2

  CASE INDEX(FRONTV$, TEXT[PTR + 1, 1], 1)
    IF TEXT[PTR - 1, 1] = "S" ELSE
      /* Only add S to METAPH if *not* following an S */
      METAPH := "S"
    END
    INCR += 1

  CASE OTHERWISE$
     METAPH := "K"

  END CASE
RETURN

D:
/*
D changes to T unless followed by ge, gy, gi then J - edge, edginess
*/

IF TEXT[PTR + 1, 1] = "G" AND INDEX(FRONTV$, TEXT[PTR + 2, 1], 1) THEN
  METAPH := "J"
  INCR += 2
END ELSE
  METAPH := "T"
END
RETURN


G:
/*
G changes to K unless followed by H and not followed by vowel then silent
sight - silent
ghost - not silent
followed by N at the end of the word, then silent - sign, benign
followed by I, E, or Y and not double G, then J - giant, legion
*/

BEGIN CASE
  CASE TEXT[PTR + 1, 1] = "H"
    IF INDEX(VOWELS$, TEXT[PTR + 2, 1], 1) THEN
      METAPH := "K"
      INCR += 2
    END

  CASE TEXT[PTR + 1, 1] = "N"
    IF PTR + 1 = WORD_LEN ELSE
      /* Not at the end of the word yet*/
      IF TEXT[PTR + 1, 3] = "NED" OR TEXT[PTR + 1, 3] = "NER" THEN
        /* Allow for words like 'signed', 'signer' */
        IF PTR + 3 = WORD_LEN ELSE
          /* Still not at the end, so the G is not silent*/
          METAPH := "K"
          INCR += 3
        END
      END ELSE
        METAPH := "K"
        INCR += 1
      END
    END

  CASE TEXT[PTR + 1, 1] = "G"
    /* There is a double G */
    METAPH := "K"
    INCR += 1

  CASE INDEX(FRONTV$, TEXT[PTR + 1, 1] , 1)
    METAPH := "J"
    INCR += 1

  CASE OTHERWISE$
    METAPH := "K"

END CASE
RETURN


H:
/*
H stays H unless not followed by a vowel, then silent -
mohair - keep the H
ohm - H is silent
*/

IF INDEX(VOWELS$, TEXT[PTR + 1, 1], 1) THEN
  METAPH := "H"
  INCR += 1
END
RETURN


K:
/*
K stays K unless it comes after C, then silent - back, tacky
*/
IF TEXT[PTR - 1, 1] = "C" ELSE
  METAPH := "K"
END
RETURN

P:
/*
P stays P unless it is followed by an H, then F - phone, graph
*/
IF TEXT[PTR + 1, 1] = "H" THEN
  METAPH := "F"
  INCR += 1
END ELSE
  METAPH := "P"
END
RETURN


Q:
/*
Q is changed to K
*/
METAPH := "K"
RETURN

S:
/*
S stays S unless
followed by H, IO, IA, then X (sh) - ship, pension, ambrosia
*/

BEGIN CASE
  CASE TEXT[PTR + 1, 1] = "H"
    METAPH := "X"
    INCR += 1

  CASE TEXT[PTR + 1, 2] = "IO" OR TEXT[PTR + 1, 2] = "IA"
    METAPH := "X"
    INCR += 2

  CASE OTHERWISE$
    METAPH := "S"

END CASE
RETURN


T:
/*
T stays a T unless:
followed by IA or IO, then X (sh) - attention
followed by H, then 0 (th) - there, then, those
followed by CH, then X (sh) - catch, batch
*/

BEGIN CASE
  CASE TEXT[PTR + 1, 2] = "IO" OR TEXT[PTR + 1, 2] = "IA"
    METAPH := "X"
    INCR += 2

  CASE TEXT[PTR + 1, 1] = "H"
    METAPH := "0"
    INCR += 1

  CASE TEXT[PTR + 1, 2] = "CH"
    METAPH := "X"
    INCR += 2

  CASE OTHERWISE$
    METAPH := "T"
END CASE
RETURN


V:
/*
V changes to F
*/
METAPH := "F"
RETURN


W:
/*
W stays W unless not followed by a vowel, then silent.
*/
IF INDEX(VOWELS$, TEXT[PTR + 1, 1], 1) THEN
  METAPH := "W"
END
RETURN

X:
/*
X changes to KS
Note that it is possible to exceed the max len of METAPH if the X is
encountered after the first three positions of METAPH have been filled. For
example, REFLEX --> RFLKS. Since the extra character does not have an impact
on this application, let it slide.
*/
METAPH := "KS"
RETURN


Y:
/*
Y stays Y unless not followed by a vowel, then silent.
*/
IF INDEX(VOWELS$, TEXT[PTR + 1, 1], 1) THEN
  METAPH := "Y"
END
RETURN


Z:
/*
Z changes to S
*/
METAPH := "S"
RETURN

Figure 2

SUBROUTINE PARSER(SEARCH_FOR, START_VALUE, USER_INDEX_FLAG, COMPARE_MOD)

DECLARE SUBROUTINE METAPHONE

EQU TRUE$    TO 1
EQU TRIGGER$ TO "$"
EQU COMPARE$ TO "COMPARE"
EQU NULL$    TO ""

EQU START$     TO "ENX0VJKT" ; * Special case first Metaphone characters
EQU START_VAL$ TO "AGCTFGCD" ; * ...and their corresponding start positions.

IF SEARCH_FOR[1,1] = TRIGGER$ THEN
  SEARCH_FOR[1,1] = NULL$
  METAPHONE(SEARCH_FOR)
  START_VALUE = SEARCH_FOR[1,1]
  POS = INDEX(START$, START_VALUE, 1)
  IF POS THEN
    START_VALUE = START_VAL$[POS,1]
  END
  USER_INDEX_FLAG = TRUE$
  COMPARE_MOD = COMPARE$
END
RETURN

Figure 3

SUBROUTINE COMPARE(CANDIDATE, SEARCH_FOR, FLAG)

DECLARE SUBROUTINE METAPHONE

EQU HIT_TRUE$    TO 1
EQU HIT_FALSE$   TO 0
EQU QUIT_SEARCH$ TO 2

EQU START$   TO "K0FDFS" ;* Special case first Metaphone characters...
EQU END_VAL$ TO "QTVTPX" ;* ...and their corresponding end positions.

QUIT_CHAR = SEARCH_FOR[1,1]
POS       = INDEX(START$, QUIT_CHAR, 1)
IF POS THEN
  QUIT_CHAR = END_VAL$[POS,1]
END
IF CANDIDATE[1,1] > QUIT_CHAR THEN
  FLAG = QUIT_SEARCH$
END ELSE
  METAPHONE(CANDIDATE)
  IF SEARCH_FOR = CANDIDATE THEN
    FLAG = HIT_TRUE$
  END ELSE
    FLAG = HIT_FALSE$
  END
END
RETURN

Lawrence Philips, "Hanging on the Metaphone," Computer Language, December 1990, pp. 39-43.

  • tips/revmedia/r71.txt
  • Last modified: 2024/06/19 20:20
  • by 127.0.0.1