Index types (OpenInsight 16-Bit Specific) [Revelation On-Line Wiki]


At 20 JAN 2003 04:08:03PM b cameron wrote:

OpenInsight 16-Bit Specific

Hello, Happy New year.

I need to create an index on a table for the email field.

What are developers recommending, Xref or Btree?

If I strip the @ and . characters, is an Xref index better?

Suggestions, experience?

TIA

At 21 JAN 2003 07:53AM Don Miller - C3 Inc. wrote:

If you're trying to index e-mail addresses then XREF would probably be better. Make sure the @-sign and periods are stripped. Then you'll be able to find Someone with the address Someone@somewhare.com. You might also consider ignoring the "com", "org" and "edu" suffixes, the same way stop words like "A", "An" and "The" are ignored.

Don M.

At 21 JAN 2003 08:04AM Oystein Reigem wrote:

Bruce,

Hard to say without knowing more.

Here are some ideas and comments:

Having an index might just be part of the solution. Some programming might improve search capabilities.

E-mail addresses consist of "words" delimited by periods and at-signs. (There might be a proper technical term for these "words" but I'm certain you understand what I mean.) If the user typically searches for whole words (e.g., find Mike, find Revelation, find Mike and Revelation) that's an argument for going Xref. An Xref can also support searches for truncated words (e.g., find Oy], find Sprezz]). But if there is a need for searching substrings of words in general (e.g., find stein, find zz) there might not be any point in having an Xref index. A Btree index might support such searches just as well. I.e., there might not be much difference between []-searching the index and []-searching the data table itself, but at least an index can't hurt performance.

Note that the "words" of some e-mail addresses might not be the words we think of when we search. E.g., my hotmail address is reigemoystein@hotmail.com, so an exact Xref search on Oystein or Reigem wouldn't find that address, and neither would a truncated search on Oystein]. Don Bakke's address is dbakke@srcps.com, with Don or Donald absent and Bakke only retrievable with a [] (contains) kind of search.
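To make the difference between the three kinds of search concrete, here is a rough sketch in Python (not OpenInsight code). It imitates an Xref-style word index by splitting addresses at @ and period, then compares exact, truncated and substring searches on the two addresses mentioned above. All function names are made up for the illustration, and a real Xref or Btree index will of course behave according to its own rules.

```python
import re
from collections import defaultdict

def words(address):
    """Split an e-mail address into lowercase 'words' at '@' and '.'."""
    return [w for w in re.split(r"[@.]", address.lower()) if w]

def build_word_index(addresses):
    """Map each word to the set of addresses containing it (Xref-like)."""
    index = defaultdict(set)
    for address in addresses:
        for word in words(address):
            index[word].add(address)
    return index

def find_exact(index, term):
    """Whole-word search: the term must equal an indexed word."""
    return set(index.get(term.lower(), set()))

def find_prefix(index, term):
    """Truncated search (like 'Oy]'): an indexed word must start with the term."""
    term = term.lower()
    hits = set()
    for word, addrs in index.items():
        if word.startswith(term):
            hits |= addrs
    return hits

def find_substring(addresses, term):
    """Contains search: scans every address, so the word index gives no help."""
    term = term.lower()
    return {a for a in addresses if term in a.lower()}

# The two addresses quoted in this post.
ADDRESSES = ["reigemoystein@hotmail.com", "dbakke@srcps.com"]
INDEX = build_word_index(ADDRESSES)
```

With these two addresses, find_exact(INDEX, "oystein") and find_prefix(INDEX, "oystein") both come back empty, find_prefix(INDEX, "reigem") finds the hotmail address, and only find_substring(ADDRESSES, "bakke") finds dbakke@srcps.com.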

Try to think of the different kinds of searches a user will do. Try to find out how well the two kinds of indexes support those searches. Consider the two standard measures of search quality - recall and precision. Recall is the share of all relevant records that your search actually finds. Precision is the share of the hits you get that actually are relevant. High recall is probably very important in your case. How important is high precision? E.g., doing a [] (contains) kind of search on the string 'se' to find all people with Swedish addresses will probably return a lot of noise.
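A small worked example of the two measures, again as a Python sketch. The five addresses are invented for the illustration (they are not real data), and treating an empty denominator as 0.0 is just one possible convention.

```python
def recall(hits, relevant):
    """Share of the relevant records that the search actually returned."""
    return len(hits & relevant) / len(relevant) if relevant else 0.0

def precision(hits, relevant):
    """Share of the returned hits that are actually relevant."""
    return len(hits & relevant) / len(hits) if hits else 0.0

# Hypothetical sample data, made up for this example.
ADDRESSES = {
    "anna@firma.se",
    "lars@bolag.se",
    "jose@example.com",
    "rose@example.org",
    "bob@example.net",
}

# What the user wants: people with Swedish (.se) addresses.
RELEVANT = {a for a in ADDRESSES if a.endswith(".se")}

# What a plain substring search on 'se' returns.
HITS = {a for a in ADDRESSES if "se" in a}

RECALL = recall(HITS, RELEVANT)
PRECISION = precision(HITS, RELEVANT)
```

Here the substring search finds both Swedish addresses (recall 1.0) but also returns jose@ and rose@, so only two of the four hits are relevant (precision 0.5).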

To improve precision you might offer the user the capability of searching each main part of the address separately (the name or similar on the left side of @ versus company or similar on the right side). You must then have symbolics for the two main parts of the address, a suitable index on each of them, and a search interface where the user can fill out one or both criteria. When the user does a search he probably knows if the string he's searching for belongs on the left or right side, so the feature won't make things harder for the user.
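The idea of two separate search criteria could look roughly like this. It is a Python sketch rather than OpenInsight code: split_address stands in for the two symbolics, the search function stands in for the search interface, and both names are hypothetical. It assumes each address contains a single @ and uses plain substring matching on each side.

```python
def split_address(address):
    """Return (left, right): the parts before and after the '@'."""
    left, _, right = address.lower().partition("@")
    return left, right

def search(addresses, left=None, right=None):
    """Return addresses matching the given criteria.

    A criterion of None is ignored; otherwise it must occur as a
    substring of the corresponding side of the address.
    """
    result = []
    for address in addresses:
        local, domain = split_address(address)
        if left is not None and left.lower() not in local:
            continue
        if right is not None and right.lower() not in domain:
            continue
        result.append(address)
    return result

# Addresses mentioned in this thread.
ADDRESSES = [
    "reigemoystein@hotmail.com",
    "dbakke@srcps.com",
    "amca@sprezzatura.com",
]
```

For example, search(ADDRESSES, right="sprezz") returns only amca@sprezzatura.com, and search(ADDRESSES, left="bakke", right="srcps") returns only dbakke@srcps.com. A search on the left side never matches text that only occurs in the company part, which is where the gain in precision comes from.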

A final thought: Is it really the e-mail address field you want to search? I assume you want to retrieve e-mail addresses, but perhaps there are other fields present more suitable for the search itself? If your table also includes the names of the people having the addresses, the name and company field(s) might be better candidates for search. E.g, searching for Andrew in an e-mail address field won't find Mr McAuley, because his address is amca@sprezzatura.com.

- Oystein -

At 21 JAN 2003 08:11AM Oystein Reigem wrote:

Don,

I'd consider having 'com', 'org' and 'edu' stripped (or treated as stop words) only if they reduce precision. I'd think they sometimes are useful. E.g, I might want to search for that University guy David whatshisname (find David and edu).
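A quick Python sketch of why the suffixes can matter. The two addresses and the function names are invented for the illustration; the point is only that a "find David and edu" search stops working once 'edu' is on the stop list.

```python
import re

def index_words(address, stop_words=frozenset()):
    """Words of an address (split at '@' and '.'), minus any stop words."""
    return {
        w for w in re.split(r"[@.]", address.lower())
        if w and w not in stop_words
    }

def find_all(addresses, terms, stop_words=frozenset()):
    """AND search: every term must be an indexed word of the address."""
    wanted = {t.lower() for t in terms}
    return [a for a in addresses if wanted <= index_words(a, stop_words)]

# Hypothetical addresses, made up for this example.
ADDRESSES = ["david@example.edu", "david@example.com"]
SUFFIXES = frozenset({"com", "org", "edu"})

WITH_SUFFIXES = find_all(ADDRESSES, ["David", "edu"])
WITHOUT_SUFFIXES = find_all(ADDRESSES, ["David", "edu"], stop_words=SUFFIXES)
```

With the suffixes indexed, the search returns only david@example.edu. With them treated as stop words, 'edu' is never indexed and the same search returns nothing.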

- Oystein -

At 21 JAN 2003 10:17AM Don Miller - C3 Inc. wrote:

Oystein ..

I read your previous post too and it's good food for thought. Perhaps leaving the suffix MIGHT be useful (universities, etc.). The idea of using in-string searching of B-Tree indexes is also promising. I think it depends on how large the universe is, too. Sometimes it's just as fast to search the entire record as it is to search through relatively long B-Trees.

As usual, you're very perceptive.

Don M.

At 21 JAN 2003 11:42AM b cameron wrote:

Don and Oystein,

Thanks for the responses.

I decided to go ahead and try the btree. So far so good.

Basically I have a table with a Seq# unique id. Within the record is the email address.

I am importing data from a text file that has its own unique seq#. It is coming off a web site and the only required info is the email address.

To bring this into our app, I first want to check for possible dups. At this point it only makes sense to check off the email address and name (if they even input any).

I used the btree.extract after building btree indexes on the email field of the native table. Seems to be working so far.
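For anyone following along, the duplicate check amounts to something like the following. This is a Python sketch of the logic only, not the actual import routine and not how btree.extract is called; the dictionary stands in for the Btree index on the email field, and the record layout, sample values and function names are all hypothetical. Trimming and lowercasing before comparing is an assumption about what should count as the same address.

```python
def normalize(email):
    """Trim and lowercase so trivially different spellings compare equal."""
    return email.strip().lower()

def build_email_index(table):
    """Map normalized e-mail address -> list of record keys (Seq#)."""
    index = {}
    for key, record in table.items():
        index.setdefault(normalize(record["email"]), []).append(key)
    return index

def split_import(rows, index):
    """Separate import rows into new ones and possible duplicates.

    Each duplicate is returned with the keys of the existing records
    that share its e-mail address, so they can be reviewed.
    """
    new_rows, possible_dups = [], []
    for row in rows:
        matches = index.get(normalize(row["email"]), [])
        if matches:
            possible_dups.append((row, matches))
        else:
            new_rows.append(row)
    return new_rows, possible_dups

# Hypothetical data: one existing record and two rows from the text file.
TABLE = {1: {"email": "Someone@Example.com", "name": "Someone"}}
ROWS = [
    {"seq": 101, "email": " someone@example.com ", "name": ""},
    {"seq": 102, "email": "other@example.com", "name": "Other"},
]

NEW_ROWS, POSSIBLE_DUPS = split_import(ROWS, build_email_index(TABLE))
```

Row 101 is flagged as a possible duplicate of record 1 and row 102 comes through as new. Note that the sketch does not catch two rows in the same import file that share an address; the index would have to be updated as rows are accepted to cover that.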

Bruce
