BTREE.EXTRACT Using Metaphone Lookups
Published By | Date | Version | Knowledge Level | Keywords |
---|---|---|---|---|
Revelation Technologies | 17 JAN 1991 | 2.X | EXPERT | PARSING, MODULE, METAPHONE |
The December 1990 issue of Computer Language presents an article by Lawrence Philips which describes a new method of conducting "sounds like" searches. This Technical Bulletin shows you how to apply this technique to Advanced Revelation Btree lookups.
Traditionally, "sounds like" lookups have been implemented using an algorithm called Soundex. Soundex implementations typically have two drawbacks. First, they require that you know the first letter of the word. For example, to get a correct lookup on the word "phone" you would have to know that it begins with the letter "P", not the letter "F".
Second, they can miss similar words and report words that do not sound at all Metaphone eliminates the first problem and reduces the second.
Metaphone
Metaphone works like Soundex in that it takes as input a word and returns an encoded version of that word. Unlike Soundex, Metaphone returns a phonetic version of the word.
The phonetic version of the word is obtained by applying a series of transformation rules to the word. The end result is a word containing one of 16 consonant sounds.
Transformation Rules
The Metaphone rules are simple and follow standard English pronunciation rules. The goal is to reduce the word to one of these consonant values:
B X S K J T F H L M N R P R 0 W Y
Of these, two have special meaning. "X" should be read as "SH". "0" should be read as "TH".
A vowel is kept only when it appears at the beginning of a word.
The rules:
- If the word begins with "KN", "GN", "PN", "AE", "WR", or "MN", drop the first letter.
- If the word begins with "X", change it to "S".
- If the word begins with "WH", drop the "H".
- The letters "F", "J", "L", "M", "N", "R" are retained.
- "B" is retained unless it is the last letter of the word and follows an "M", in which case it is silent. For example: "numb".
- "C" is changed to "K" except if followed by "H" or "IA", then it is changed to "X" if followed by "E", "I", or "Y", then it is changed to "S" if preceded by "S" and followed by "E", "I", or "Y", it is silent.
- "D" is changed to "T" unless it is followed by "GE", "GI", or "GY", then it is changed to "J".
- "G" is changed to "K" except
- in the construction "GH" and not at the end of the word or before a vowel, then silent
- in the construction "GN", "GNED", or "GNER" at the end of a word, then silent
- if before "I", "E", or "Y" and not a double "G", then it is changed to "J". - "H" is retained unless it follows a vowel and is not followed by a vowel, then it is silent.
- "K" is retained unless it follows "C", then it is silent.
- "P" is retained unless it is followed by "H", then it is changed to "F".
- "Q" is changed to "K".
- "S" is retained except if it is followed by "H", "IO" or "IA", then it is changed to "0".
- "T" is retained except if it is followed by "IA", "IO", or "CH", then it is changed to "X" if it is followed by "H", then it is changed to "0"
- "V" is changed to "F".
- "X" is changed to "KS".
- "Y" is retained if it is followed by a vowel, silent otherwise.
- "Z" is changed to "S".
Figure 1 provides an example of how these rules can be implemented in an R/BASIC program.
Notes About This Implementation
This implementation of Metaphone differs in several respects from the implementation given by Philips.
First, Philips processes each letter of the input word, even if previous rules have made it unnecessary. For example, in the word "DODGER" the "G" and the "E" following the second "D" will be evaluated, even though the rule for "D" has already determined that "DGE" should be replaced with "J".
By maintaining a pointer to the current letter being processed and a separate variable to increment the pointer, we can move through the word much faster. Using the example above, when the second "D" is processed, the program determines that the two letters following the D have been accounted for and adjusts the increment counter so the pointer will point to the "R" at the top of the next loop. By not reprocessing the "GE" the program finishes sooner.
This change affects the implementation of some of the rules. It does not affect the results.
Second, Philips uses a large CASE statement to control his program. In R/BASIC, a lengthy CASE statement can be quite expensive in terms of processing speed, particularly for the conditions toward the end of the statement.
The program given here overcomes this problem by controlling the branches to specific subroutines through an ON…GOSUB statement.
Philips skips all double letter combinations except for "CC". Although he does not explain why, this is probably to account for words like "ACCESS", in which each "C" has a different sound. However, he does not deal with the possibility elsewhere in the program.
The R/BASIC program does not check for double "C" combinations. The "CC" in "ACCOUNT" will be treated the same way as the double "C" in "ACCESS", even though in the latter case it is "incorrect".
Finally, Philips says:
For practical purposes, usually only the first four (if there are that many) letters of the phonetic spelling are used...
You may find, however, that four characters yields too many potential matches in your application. If so, experiment with higher values for METAPHMAX$ to see how that affects the number of matches.
Both the Philips implementation and the one given here suffer from the vagaries of the English language. The "ACCOUNT" vs "ACCESS" example given above is just one case in which exceptions to the rules make the programmer's task difficult. Other examples include words like "TECHNICAL" and "MECHANICAL", where the "CH" is sounded as a "K" but the program treats it as a "SH".
Because Metaphone is rooted in the rules of English pronunciation, it does not deal well with words and phrases borrowed from other languages. This becomes particularly troublesome for proper names. The President of Revelation Technologies is Jim Acquaviva. Metaphone returns "AKKF" for the correct spelling of his name. "Aquaviva" returns "AKFF", and "Akwaviva" returns "AKWF".
By modifying the rules you can tune Metaphone to suit the needs of your application.
Using Metaphone in BTREE.EXTRACT
Once you have a Metaphone routine working to your satisfaction, it is a straightforward matter to hook it into BTREE.EXTRACT.
Note: Technical Bulletin #34, User Index in BTREE.EXTRACT fully explains the technique given here. This bulletin concentrates on the specifics of implementing Metaphone as a user index.
The user index requires two programs. The first is responsible for identifying search criterion and is referred to as the parsing module. The second, the compare module, determines whether values from the index match the criterion given.
The Parsing Module
The parsing module is identified to BTREE.EXTRACT by a record in the VOC called %USER.INDEX%. It has the layout:
Field | Value |
---|---|
1 | USER.INDEX (literal text) |
2 | name of the parsing module |
3 | name of the default compare routine |
Using the program given in Figure 2 as an example, %USER.INDEX% would look like this:
Field | Value |
---|---|
1 | USER.INDEX |
2 | PARSER |
Note that field 3 is optional and, in this case, empty.
While the parsing module can use any means to determine whether the criterion passed to it are its concern, some sort of trigger character is typically used.
For the search routine illustrated in Figure 2, the trigger is a dollar sign. If the search routine sees a dollar sign as the first character of a search criterion, the routine assumes that a Metaphone search is required. If any character but a dollar sign appears as the first character, the search is passed on to BTREE.EXTRACT for normal processing.
Note: Any character can be used. The dollar sign was selected because it has an "S" shape which can be used to remind users of its "sounds like" role.
Once found, the trigger is stripped from the search criterion, the criterion is converted to the Metaphone equivalent, and the search starting point in the index is determined.
In most cases the search will begin at the first value in the index that begins with the same letter as the search criterion. If no starting value is given, the search begins at the start of the index, possibly causing unnecessary comparisons and thus delay.
For example, if the search criterion is "NORBERT" searching index values that begin with "A", "B", "C", etc is a waste of time. The search should begin with the first value that starts with "N".
There is a complication here, however. Recall that the Metaphone rules can change the first letter of the search word. In fact, the starting point of the search cannot be based just on the first letter of the word returned from Metaphone. "NORBERT", which returns "NRBR" will work. But what about "GNOME", which returns "NM"? Starting the search for GNOME in the Ns will miss GNOME.
To make sure the index is searched thoroughly, but not excessively, PARSE maintains a lookup table of the earliest entry points necessary for special characters. For example, in order to ensure that all candidates for words whose Metaphone value begins with "N" are found, the index search must being with the "G"s. Similarly if the Metaphone value begins with "E", the search must being with the "A"s.
Note that in some cases, "X" and "V", for example, most of the index still must be traversed. However, this is preferable to having to traverse the entire index for all cases.
The Compare Module
The compare module is responsible for taking the search value (as modified by the parsing module) and comparing it to a candidate value. The candidate value is provided by BTREE.EXTRACT and comes from the Btree index. Figure 3 is an example of a compare module.
In addition to simply identifying a candidate as a match, the compare routine can also instruct BTREE.EXTRACT to stop searching. This is useful when it is known that all further comparisons will not return hits. For example, once index values starting with "B" have been encountered, there is no reason to continue the search if the search value is "APPLE".
Just as the parse module has the problem of determining where to being the search, the compare module has the problem of ensuring that the search is not ended prematurely. And, as with the parse module, the compare module uses a lookup table to determine how far is far enough.
So, for example, if the search value starts with "K" all index values through "Q" must be searched.
Once enough index values have been searched, the compare module tells BTREE.EXTRACT that no further searching needs to be done. BTREE.EXTRACT then takes care of all necessary ANDing and ORing logic.
Examples
Figure 1
SUBROUTINE METAPHONE(METAPH) /* This program is based on the program presented in "Hanging on the Metaphone", Computer Language, December 1990, pp 39-43 */ EQU TRUE$ TO 1 EQU FALSE$ TO 0 EQU OTHERWISE$ TO 1 EQU SPACE$ TO \20\ EQU NULL$ TO "" EQU METAPHMAX$ TO 4 ;* Max len of the resulting code. EQU VOWELS$ TO "AEIOU" EQU FRONTV$ TO "EIY" EQU NOCHANGE$ TO "FJLMNR" ;* These letters are not changed. EQU TRANSFORM$ TO "BCDGHKPQSTVWXYZ" ;* These letters indicate a change. EQU NONALPHA$ TO ".,/`';{}-=<>?:~[]+_()*^%$#@!\| " : \22\ : ï"0123456789" /* FIRST_TWO$ is a list of letter pairs that, when found at the beginning of a word, the first letter is stripped off and just the second is used. PN - as in PNEUMATIC AE - as in AEROSOL KN - as in KNOW GN - as in GNAW WR - as in WRITE MN - as in MNEMONIC */ EQU FIRST_TWO$ TO "AEKNGNWRMNPN" /* Make sure word has only letters and is upper case. */ CONVERT NONALPHA$ TO NULL$ IN TEXT CONVERT @LOWER.CASE TO @UPPER.CASE IN TEXT /* Check the beginning of the word for special cases. */ FIRST = TEXT[1,2] IF INDEX(FIRST_TWO$, FIRST, 1) THEN TEXT[1,1] = NULL$ END /* If the first letter is an "X", change it to "S". */ IF TEXT[1,1] = "X" THEN TEXT[1,1] = "S" END /* If the initial letters are "WH", strip out the "H" */ IF TEXT[1,2] = "WH" THEN TEXT[2,1] = NULL$ END WORD_LEN = LEN(TEXT) METAPH = NULL$ PTR = 1 ;* Pointer to the current letter. LOOP INCR = 1 ;* Default pointer increment CURR_LET = TEXT[PTR, 1] ;* Letter to process IF CURR_LET = TEXT[PTR + 1, 1] THEN /* There is a pair of letters, set incr to jump over the second in the pair. */ INCR += 1 END BEGIN CASE CASE INDEX(VOWELS$, CURR_LET, 1) IF PTR = 1 THEN * This is the first letter, keep the vowel, otherwise skip. METAPH = CURR_LET END CASE INDEX(NOCHANGE$, CURR_LET, 1) /* These letters aren't changed, add them to the metaph */ METAPH := CURR_LET CASE INDEX(TRANSFORM$, CURR_LET, 1) /* Some sort of transformation is indicated. Find out which one and branch to the appropriate subroutine. */ POS = INDEX(TRANSFORM$, CURR_LET, 1) ON POS GOSUB B,C,D,G,H,K,P,Q,S,T,V,W,X,Y,Z END CASE PTR += INCR ;* Increment the character pointer WHILE (PTR <= WORD_LEN) AND (LEN(METAPH) < METAPHMAX$) REPEAT RETURN STOP /*********************************************************************** INTERNAL SUBROUTINES ***********************************************************************/ B: /* B stays a B unless followed by an M at the end of a word, then silent - numb, dumb, crumb */ IF PTR = WORD_LEN AND TEXT[PTR - 1, 1] = "M" ELSE METAPH := "B" END RETURN C: /* C changes to K unless followed by H or IA, then X (sh) - march, acacia followed by I, E, or Y, then S - mice, lecithin, lacy preceded by S and followed by I, E, or Y, then silent - science, scene, scythe preceded by S and followed by H, then K - school */ BEGIN CASE CASE TEXT[PTR + 1, 1] = "H" IF TEXT [PTR - 1, 1] = "S" THEN METAPH := "K" END ELSE METAPH := "X" END INCR += 1 CASE TEXT[PTR + 1, 2] = "IA" METAPH := "X" INCR += 2 CASE INDEX(FRONTV$, TEXT[PTR + 1, 1], 1) IF TEXT[PTR - 1, 1] = "S" ELSE /* Only add S to METAPH if *not* following an S */ METAPH := "S" END INCR += 1 CASE OTHERWISE$ METAPH := "K" END CASE RETURN D: /* D changes to T unless followed by ge, gy, gi then J - edge, edginess */ IF TEXT[PTR + 1, 1] = "G" AND INDEX(FRONTV$, TEXT[PTR + 2, 1], 1) THEN METAPH := "J" INCR += 2 END ELSE METAPH := "T" END RETURN G: /* G changes to K unless followed by H and not followed by vowel then silent sight - silent ghost - not silent followed by N at the end of the word, then silent - sign, benign followed by I, E, or Y and not double G, then J - giant, legion */ BEGIN CASE CASE TEXT[PTR + 1, 1] = "H" IF INDEX(VOWELS$, TEXT[PTR + 2, 1], 1) THEN METAPH := "K" INCR += 2 END CASE TEXT[PTR + 1, 1] = "N" IF PTR + 1 = WORD_LEN ELSE /* Not at the end of the word yet*/ IF TEXT[PTR + 1, 3] = "NED" OR TEXT[PTR + 1, 3] = "NER" THEN /* Allow for words like 'signed', 'signer' */ IF PTR + 3 = WORD_LEN ELSE /* Still not at the end, so the G is not silent*/ METAPH := "K" INCR += 3 END END ELSE METAPH := "K" INCR += 1 END END CASE TEXT[PTR + 1, 1] = "G" /* There is a double G */ METAPH := "K" INCR += 1 CASE INDEX(FRONTV$, TEXT[PTR + 1, 1] , 1) METAPH := "J" INCR += 1 CASE OTHERWISE$ METAPH := "K" END CASE RETURN H: /* H stays H unless not followed by a vowel, then silent - mohair - keep the H ohm - H is silent */ IF INDEX(VOWELS$, TEXT[PTR + 1, 1], 1) THEN METAPH := "H" INCR += 1 END RETURN K: /* K stays K unless it comes after C, then silent - back, tacky */ IF TEXT[PTR - 1, 1] = "C" ELSE METAPH := "K" END RETURN P: /* P stays P unless it is followed by an H, then F - phone, graph */ IF TEXT[PTR + 1, 1] = "H" THEN METAPH := "F" INCR += 1 END ELSE METAPH := "P" END RETURN Q: /* Q is changed to K */ METAPH := "K" RETURN S: /* S stays S unless followed by H, IO, IA, then X (sh) - ship, pension, ambrosia */ BEGIN CASE CASE TEXT[PTR + 1, 1] = "H" METAPH := "X" INCR += 1 CASE TEXT[PTR + 1, 2] = "IO" OR TEXT[PTR + 1, 2] = "IA" METAPH := "X" INCR += 2 CASE OTHERWISE$ METAPH := "S" END CASE RETURN T: /* T stays a T unless: followed by IA or IO, then X (sh) - attention followed by H, then 0 (th) - there, then, those followed by CH, then X (sh) - catch, batch */ BEGIN CASE CASE TEXT[PTR + 1, 2] = "IO" OR TEXT[PTR + 1, 2] = "IA" METAPH := "X" INCR += 2 CASE TEXT[PTR + 1, 1] = "H" METAPH := "0" INCR += 1 CASE TEXT[PTR + 1, 2] = "CH" METAPH := "X" INCR += 2 CASE OTHERWISE$ METAPH := "T" END CASE RETURN V: /* V changes to F */ METAPH := "F" RETURN W: /* W stays W unless not followed by a vowel, then silent. */ IF INDEX(VOWELS$, TEXT[PTR + 1, 1], 1) THEN METAPH := "W" END RETURN X: /* X changes to KS Note that it is possible to exceed the max len of METAPH if the X is encountered after the first three positions of METAPH have been filled. For example, REFLEX --> RFLKS. Since the extra character does not have an impact on this application, let it slide. */ METAPH := "KS" RETURN Y: /* Y stays Y unless not followed by a vowel, then silent. */ IF INDEX(VOWELS$, TEXT[PTR + 1, 1], 1) THEN METAPH := "Y" END RETURN Z: /* Z changes to S */ METAPH := "S" RETURN
Figure 2
SUBROUTINE PARSER(SEARCH_FOR, START_VALUE, USER_INDEX_FLAG, COMPARE_MOD) DECLARE SUBROUTINE METAPHONE EQU TRUE$ TO 1 EQU TRIGGER$ TO "$" EQU COMPARE$ TO "COMPARE" EQU NULL$ TO "" EQU START$ TO "ENX0VJKT" ; * Special case first Metaphone characters EQU START_VAL$ TO "AGCTFGCD" ; * ...and their corresponding start positions. IF SEARCH_FOR[1,1] = TRIGGER$ THEN SEARCH_FOR[1,1] = NULL$ METAPHONE(SEARCH_FOR) START_VALUE = SEARCH_FOR[1,1] POS = INDEX(START$, START_VALUE, 1) IF POS THEN START_VALUE = START_VAL$[POS,1] END USER_INDEX_FLAG = TRUE$ COMPARE_MOD = COMPARE$ END RETURN
Figure 3
SUBROUTINE COMPARE(CANDIDATE, SEARCH_FOR, FLAG) DECLARE SUBROUTINE METAPHONE EQU HIT_TRUE$ TO 1 EQU HIT_FALSE$ TO 0 EQU QUIT_SEARCH$ TO 2 EQU START$ TO "K0FDFS" ;* Special case first Metaphone characters... EQU END_VAL$ TO "QTVTPX" ;* ...and their corresponding end positions. QUIT_CHAR = SEARCH_FOR[1,1] POS = INDEX(START$, QUIT_CHAR, 1) IF POS THEN QUIT_CHAR = END_VAL$[POS,1] END IF CANDIDATE[1,1] > QUIT_CHAR THEN FLAG = QUIT_SEARCH$ END ELSE METAPHONE(CANDIDATE) IF SEARCH_FOR = CANDIDATE THEN FLAG = HIT_TRUE$ END ELSE FLAG = HIT_FALSE$ END END RETURN
Lawrence Philips, "Hanging on the Metaphone," Computer Language, December 1990, pp. 39-43.