Cross Reference Index Questions (None Specified) [Revelation On-Line Wiki]

None Specified, David McGehee, Thomas Dunlop, Oystein Reigem



At 15 SEP 1999 09:21:01AM David McGehee wrote:

These questions apply to BOTH OI and AREV, as I need answers to both environments.

1. Where is the application level default exclude list kept and where are the individual index custom lists kept? If query words excluded from the index are not automatically ignored, then I will need to determine whether or not an index has default/custom excludes and access the list(s).

2. Is an excluded word ignored in a BTREE.EXTRACT level query? At any level? If not then we have the problem, for example, of excluding CORP & CORPORATION then having an inquiry on MOBIL CORP. We probably have the same problem, even if excludes are automatically handled, if we try MOBIL CORP].

3. Is there any way to make the BTREE.EXTRACT use an active select list as the beginning point for an extract comparison? For example, if I am searching a name xref for DAVID E, I don't really want the extract to gather all of the DAVIDs and all of the Es then match them. I want to be able to at least get it down to DAVIDs and match them to the E entries?

At 15 SEP 1999 10:20AM David McGehee wrote:

To partially answer my own questions: in AREV the default list is stored in the SYSENV file, in the record keyed by application_ENVIRONMENT. Also, the custom list for each index is stored in the call in the index_XREF dictionary item. Therefore, I would like to assume that even at the BTREE.EXTRACT level the excludes of the requested words are automatic, but then you know what assuming gets you.

At 15 SEP 1999 01:03PM Thomas Dunlop wrote:

Hi Dave:

Here's what I believe is the case for OI. The default list is also in SYSENV, under APPNAME_ENV. The custom list is found in !TABLENAME under row !TABLENAME.

This is about all I have been able to come up with so far. As I learn more, I'll post here.

Hope this helps,

Tom Dunlop

Revelation Software

At 15 SEP 1999 02:13PM David McGehee wrote:

I appreciate the info so far. I have found, with AREV at least, that SELECT doesn't exclude the excludes. That is, SELECT xxx WITH yyy_XREF="BUDGET" AND WITH yyy_XREF="CORP" yields NO records, even though CORP is a stop word and BUDGET CORP is a valid record.

At 16 SEP 1999 12:02AM David McGehee wrote:

An additional consideration on xref indexing. It IS NOT viable for large databases! Even though the indexing process may relatively quickly complete the sort phase - producing even 20 million entries (much less 100 million entries) - the build process will stall at somewhere around 40% (using a 1024 byte frame) to 80% (using an 8192 byte frame). It will start off doing - say - 5 million entries per hour, but when approaching the stall point it will be down to 30 thousand per DAY! I have successfully done 6 million base records with few words in the indexed field, but street addresses, for example, often contain several words, including purely numeric words.

I would have thought that, because the entries are sorted (phase 1), the build would be a linear process: so many entries per hour for n hours. It even seems to work that way for a while, but because the OV soon gets to be 5 - 10 times the size of the LK, the process eventually grinds to a virtual halt.

This really seems a shame, as the cross reference indexing capability is one of the primary attractions for AREV/OI in data brokering / data mining. If any solutions are available, I desperately need to find them.

At 16 SEP 1999 03:51AM Oystein Reigem wrote:

David,

1. Where is the application level default exclude list kept and where are the individual index custom lists kept? If query words excluded from the index are not automatically ignored, then I will need to determine whether or not an index has default/custom excludes and access the list(s).

Let's say you add a cross reference index to a field F in a table T. The system then creates a new multivalued symbolic field F_XREF, with a formula to extract the individual words from F. The cross reference index on F is really a Btree index on F_XREF. What you can find with a query are the words that got indexed, which are the words that got extracted to F_XREF. And which words got extracted depend on your choice of stop list settings.
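To make the mechanism concrete, here is a minimal Python sketch of the idea only. It is not the actual XREF function: the function name, the delimiter set, and the upper-casing are assumptions made for illustration.

```python
import re

def extract_xref_words(field_value, stop_words, delimiters=" ,.-"):
    """Hypothetical stand-in for what an F_XREF formula produces:
    the words of field_value that are not on the stop list."""
    # Split on runs of any delimiter character and drop empty pieces.
    pattern = "[" + re.escape(delimiters) + "]+"
    words = [w.upper() for w in re.split(pattern, field_value) if w]
    # Compare case-insensitively against the stop list (an assumption).
    stops = {s.upper() for s in stop_words}
    return [w for w in words if w not in stops]

# Only the surviving words would end up in the Btree index on F_XREF.
indexed = extract_xref_words("Mobil Corp", ["CORP", "CORPORATION"])
```

With CORP on the stop list, only MOBIL is indexed for that row, which is why a query on CORP cannot find it.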

Try looking at the formula with the System Editor. Open row F_XREF in table DICT.T. To better understand the content of a dictionary row, open the DICT_EQUATES inserts row of table SYSPROCS.

The XREF field formula is a simple call to the XREF function. To determine whether or not a cross ref index has default/custom stop lists your program must take a look at the third and fourth parameters. The online help on the XREF function will help you further understand what goes on.

(You might also want to program your own utilities to modify the formula (change stop list settings and word delimiters).)

2. Is an excluded word ignored in a BTREE.EXTRACT level query? At any level? If not then we have the problem, for example, of excluding CORP & CORPORATION then having an inquiry on MOBIL CORP. We probably have the same problem, even if excludes are automatically handled, if we try MOBIL CORP].

Stop words are tricky. A well-written app should warn the user if he tries to search on a stop word. Or at least exclude the stop word from the query.

If you have users who want stop words (e.g. to keep the size of the database down) but still insist they occasionally want to search for them, you could perhaps do such queries with a (slow) non-indexed search on the source field (F).

Or perhaps you could add your own symbolic inverse F_STOP of F_XREF. F_XREF contains all words from F that are not stop words. F_STOP could contain all the stop words. Put a Btree index on F_STOP and you could programmatically change the query (F_XREF="MOBIL") AND (F_XREF="CORP") to (F_XREF="MOBIL") AND (F_STOP="CORP").
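The programmatic rewrite could be sketched roughly as follows. This is Python used as pseudocode for the logic, not Basic+; F_STOP is the hypothetical field proposed above, and the function name and clause format are assumptions.

```python
def route_query_terms(terms, stop_words, xref_field="F_XREF", stop_field="F_STOP"):
    """Build an AND-ed query, sending stop words to the stop-word index
    and all other words to the ordinary cross reference index."""
    stops = {s.upper() for s in stop_words}
    clauses = []
    for term in terms:
        word = term.upper()
        # A stop word never reached F_XREF, so look for it in F_STOP instead.
        field = stop_field if word in stops else xref_field
        clauses.append('({}="{}")'.format(field, word))
    return " AND ".join(clauses)

query = route_query_terms(["MOBIL", "CORP"], ["CORP", "CORPORATION"])
```

The same stop-list test could instead be used to warn the user or to drop the word from the query, as suggested earlier.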

(I think I should quit now before I get too many far-fetched ideas.)

3. Is there any way to make the BTREE.EXTRACT use an active select list as the beginning point for an extract comparison?

I don't think so. I think you must do the filtering (AND-ing) yourself then.

For example, if I am searching a name xref for DAVID E, I don't really want the extract to gather all of the DAVIDs and all of the Es then match them. I want to be able to at least get it down to DAVIDs and match them to the E entries?

I think you must switch from Btree.Extract to e.g. Rlist then.

Since Btree.Extract cannot handle large result lists, that might be a good idea…
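Doing the AND-ing yourself amounts to intersecting the key lists returned for each word. A minimal Python sketch of that step, assuming each word's keys have already been fetched into a list (the function name is hypothetical, and this does not avoid fetching the list for a short word like E):

```python
def intersect_key_lists(key_lists):
    """Return the keys present in every list, in the order of the shortest list."""
    if not key_lists:
        return []
    # Start from the shortest list so the working set is as small as possible.
    ordered = sorted(key_lists, key=len)
    survivors = ordered[0]
    for other in ordered[1:]:
        lookup = set(other)  # set membership keeps each pass linear
        survivors = [k for k in survivors if k in lookup]
        if not survivors:
            break  # nothing can match any more
    return survivors

davids = ["101", "205", "317", "422"]
es = ["205", "422", "900", "901", "902"]
matches = intersect_key_lists([davids, es])
```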

An additional consideration on xref indexing. It IS NOT viable for large databases! Even though the indexing process may relatively quickly complete the sort phase - producing even 20 million entries (much less 100 million entries) - the build process will stall at somewhere around 40% (using a 1024 byte frame) to 80% (using an 8192 byte frame). It will start off doing - say - 5 million entries per hour, but when approaching the stall point it will be down to 30 thousand per DAY! I have successfully done 6 million base records with few words in the indexed field, but street addresses, for example, often contain several words, including purely numeric words.

I've never tried with that much data myself.

When the data to be sorted exceeds a certain amount, you must expect some drop in performance, because the data can no longer be sorted in memory only, or can no longer be sorted in one chunk but must be sorted in several chunks and merged afterwards. But I think there must be a different problem in your case.

I would have thought that, because the entries are sorted (phase 1), the build would be a linear process: so many entries per hour for n hours. It even seems to work that way for a while, but because the OV soon gets to be 5 - 10 times the size of the LK, the process eventually grinds to a virtual halt.

Sizelock??? Check with Database Manager.

This really seems a shame, as the cross reference indexing capability is one of the primary attractions for AREV/OI in data brokering / data mining. If any solutions are available, I desperately need to find them.

(I don't think there is a problem with cross reference indexing as such - just that a cross reference index typically will contain much more data than a plain index.)

- Oystein -
