How does btree.extract work on various fields and large lists (AREV Specific)
At 22 NOV 2001 10:51:35AM Wilhelm Schmitt wrote:
On a large file, we do btree.extracts on several indexed fields within one call and using the 64K flag to extend the selection. The different fields are OR related.
If we do selections on just 1 field we get the results in a very short time, with 100,000 hits in 1-2 minutes. But if we do "simultaneous" extracts on 3 or 4 indexes, it gets really slow (> 30 minutes and counting …).
1) Is there a way to get some feedback during the extract process, so that we may post status messages?
2) If we did the extracting on an individual basis and combined the resulting hit lists, is there a function - hidden somewhere in the toolbox - which allows this combination? The excessive time consumption seems to occur when creating the final key list.
3) The end-user doesn't know in advance whether he has activated a reasonable selection or not. If we were able to tell him that he is about to start a 100,000-record list, he certainly would like to abort, or avoid executing the process at all. Is there a way to give a quick estimate?
Any ideas are welcome.
Wilhelm
At 22 NOV 2001 05:01PM Curt Putnam wrote:
With very large lists, you may be better off writing some of your own specific routines rather than depending on the generalized toolset provided by Arev.
If you are simply concatenating Btree hit lists (isn't that what OR does?), why not make multiple btree calls and build your own list in the lists file? If you have to order the list, can you do it yourself?
As to warning users, the 1st call will produce some number of keys. Given the context of the query, that number may be usable as a predictor of overall list size.
Perhaps a more specific example of your problem would allow this group to help a little bit more.
At 26 NOV 2001 07:44AM Dan Reese wrote:
Not sure if this will help your situation, but we use this technique a lot… If we have a complex query that takes a long time to complete, and we do it on a regular basis, we will move the logic to a symbolic dictionary item and index that item. For example,
if (A or B or C) and not(D) then
   @ans = 1
end else
   @ans = 0
end
Put a btree index on this column, then select on the column. For example, SELECT MYTABLE WITH MYCOLUMN. You'll have your results in seconds.
There is one limitation to this technique. When you index symbolic columns you cannot index a column that relies on an xlate to another table, since the index will not be updated when the data in the other table is changed. However, if your symbolic looks at hard data in the current table, or symbolic data in the current table that does not rely on xlates, it will work.
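Dan's precomputed-flag approach can be sketched outside ARev as well. The following Python sketch is illustrative only (the field names A, B, C, D and the record keys are hypothetical): evaluate the complex condition once per record at write time, "index" the resulting flag, and the query becomes a cheap lookup.

```python
# A dict of records stands in for MYTABLE, and a flag -> keys dict
# stands in for the btree index on the symbolic column.
records = {
    "R1": {"A": 1, "B": 0, "C": 0, "D": 0},
    "R2": {"A": 0, "B": 0, "C": 0, "D": 0},
    "R3": {"A": 0, "B": 1, "C": 1, "D": 1},
}

def my_column(rec):
    """The 'symbolic' column: (A or B or C) and not D."""
    return 1 if (rec["A"] or rec["B"] or rec["C"]) and not rec["D"] else 0

# Build the index on the derived column once, when records are written.
index = {}
for key, rec in records.items():
    index.setdefault(my_column(rec), []).append(key)

# SELECT MYTABLE WITH MYCOLUMN then amounts to a direct lookup:
hits = index.get(1, [])
print(hits)  # ['R1']
```

The expensive boolean logic is paid once per record at update time instead of once per record at query time, which is why the select comes back in seconds.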
At 26 NOV 2001 01:04PM Victor Engel wrote:
What version of Arev are you using? The way selects on multiple indexed fields work was changed (at version 3.1, I think – someone with release notes feel free to correct me) to be more efficient.
Additionally, you should check that the client's environment setting matches the index with regard to case sensitivity. If the environment does not match the index, the hit list must be examined row by row to resolve any potential differences in case.
Is there sorting going on as part of the select? If so, then the indexes will be used to accomplish the sort and then a second pass will be done to reduce the hit list. If your hit list is small, then it may be faster to do the reduce first. In this case, you'd probably want to change the single select into a 2-pass select.
Finally, your data may be such that certain key fields narrow down the result set much more significantly than others. You may wish to add code to select on these first, thus significantly reducing the result set for a subsequent select call. Along the same lines, maybe you have too many fields indexed. Fewer indexes can be faster if that results in less set manipulation. The idea here is to index only those fields that substantially reduce the result set in a select.
Of course none of this is black and white, and it will likely have to be evaluated on a case by case basis. Indexing symbolics works if the same combination of fields is used frequently. However, for an ad hoc query, this doesn't seem practical.
At 03 DEC 2001 03:40PM Wilhelm Schmitt wrote:
Curt,
Thanks for your reply. Thanks also to Dan and Victor.
If you are simply concatenating Btree hit lists (isn't that what OR does?), why not make multiple btree calls and build your own list in the lists file?
With multiple btree calls you get, for example, 3 lists: LIST1 has keys A,B,C, LIST2 has B,D and LIST3 has A,D,E,G. The resulting list should be A,B,C,D,E,G. Throwing the individual lists together gives a selection of A,A,B,B,C,D,D,E,G instead of simply A,B,C,D,E,G.
Perhaps a more specific example of your problem would allow this group to help a little bit more.
The application is a real-estate database for the public records (more than 1,500,000 records in the main file). There are xref indexes on actual_owner, former_owner and applicant. Many of them have 2 first names and 2 last names, and there may be co-proprietors (sometimes up to 100).
A frequent search pattern is: person1 or person2 or person3, either as actual owner or former owner, in city X. The city is included in the key but has no individual index on it (example: "A12*34567", which means property type A in city 12, number 34567).
We first extract from the indexed name fields (multiple lookup) and write the temporary list to a file. (Btree.extract doesn't offer any feedback before this step.) Then we activate the temporary list and narrow it down by city with a SELECT. (ARev 3.12 is working as the search engine for OICGI.) On less common names there is a fast response, but on common names and small cities we often get a temporary list of many thousands and a final list of just a few hits.
In order to have more control over the process, the steps would be:
1. extract & write individual list for each indexed field,
2. eliminate redundant keys & write temporary list
3. select on temporary list (by city or other non-indexed criteria)
Our basic problem lies in step 2: eliminating redundant keys quickly in several lists.
Hope this gives a clearer picture.
Wilhelm
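Step 2 of the plan above - OR-merging the individual hit lists while eliminating redundant keys - can be sketched in Python as follows (the key values are the A…G examples from this thread; this is a sketch of the set logic, not of ARev's list handling):

```python
# Merge several btree hit lists into one duplicate-free list,
# preserving first-seen order.
def merge_hit_lists(*lists):
    seen = set()
    merged = []
    for hit_list in lists:
        for key in hit_list:
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged

list1 = ["A", "B", "C"]
list2 = ["B", "D"]
list3 = ["A", "D", "E", "G"]
print(merge_hit_lists(list1, list2, list3))  # ['A', 'B', 'C', 'D', 'E', 'G']
```

A hash set makes each membership test O(1), so the merge is linear in the total number of keys - the property needed when the individual lists run to tens of thousands of entries.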
At 04 DEC 2001 11:16AM Bill Titus wrote:
Our basic problem lies in step 2: eliminating redundant keys quickly in several lists.
As a first step to optimizing your selection process, could you create a symbolic field which concatenates all the names, and build your XREF index on it? This should eliminate the appearance of duplicate IDs on multiple lists and save some time, too.
Bill
At 04 DEC 2001 03:54PM Wilhelm Schmitt wrote:
Bill,
We have the APPLICANT field and the OWNER field. If I want to look for either APPLICANT="DONALD DUCK" OR OWNER="MICKEY MOUSE", a symbolic field on APPLICANT and OWNER would also return keys for "MICKEY DUCK" and "DONALD MOUSE".
With several names in each field, you get a lot of useless combinations. I think that this would only complicate things, apart from the additional index.
Wilhelm
At 04 DEC 2001 04:22PM Bill Titus wrote:
Wilhelm,
True enough. However, you could eliminate spaces between first and last names in APPLICANT and OWNER, so the symbolic becomes "MICKEYMOUSE DONALDDUCK".
If that works, can spaces in such "fullname" fields also be eliminated from user entries prior to the select processes?
Bill
At 04 DEC 2001 06:52PM Wilhelm Schmitt wrote:
Bill,
Eliminating spaces creates other problems:
a) MOUSE, MICKEY would not show up. End users just have a browser window in front of them and expect something similar to a web search engine.
b) Owners usually have 4 names, 2 first and 2 last; sometimes only 1 first name and 1 last name is given.
c) Co-owners definitely could not be traced, because the number of possible name combinations can become impressive.
Wilhelm
At 04 DEC 2001 08:36PM Bill Titus wrote:
Wilhelm,
Thanks for your patience. Perhaps a step backward is in order.
With multiple btree calls you get, for example, 3 lists: LIST1 has keys A,B,C, LIST2 has B,D and LIST3 has A,D,E,G. The resulting list should be A,B,C,D,E,G. Throwing the individual lists together gives a selection of A,A,B,B,C,D,D,E,G instead of simply A,B,C,D,E,G.
If you were to perform each SELECT, and then LOOP/READNEXT through it writing null,id records to a TEMPFILE, you would end up with no duplicate ids and could then SELECT TEMPFILE for further processing. You just need to create a TEMPFILE that has a unique, user-specific name. You might even want to include a little data in the new TEMPFILE records that could be used for additional segmentation.
Hope this is more helpful.
Bill
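Bill's null-record trick can be sketched in Python, with a dict standing in for the user-specific TEMPFILE: writing a null record under each key means duplicate keys simply overwrite each other, so the tempfile's key list comes out duplicate-free by construction.

```python
# The three example hit lists from this thread; each key becomes the
# record id of an empty record in the "tempfile".
tempfile = {}  # stands in for the user-specific TEMPFILE

for hit_list in (["A", "B", "C"], ["B", "D"], ["A", "D", "E", "G"]):
    for key in hit_list:
        tempfile[key] = ""  # null record; rewriting the same key is harmless

# "SELECT TEMPFILE" then yields each key exactly once.
print(sorted(tempfile))  # ['A', 'B', 'C', 'D', 'E', 'G']
```

The deduplication cost is hidden in the keyed write: storing under an existing id replaces rather than appends, which is exactly why no explicit duplicate check is needed.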
At 04 DEC 2001 11:05PM Wilhelm Schmitt wrote:
Bill,
I like the idea of writing null,id records to eliminate duplicate keys. Sounds quite simple.
On the other hand, with large selections (for example, two lists with 50,000 keys each) there seems to be too much overhead involved, so I would be looking for something more direct - in memory.
Wilhelm
At 05 DEC 2001 08:26AM Bill Titus wrote:
Wilhelm,
If we do selections on just 1 field we get the results in a very short time, with 100,000 hits in 1-2 minutes. But if we do "simultaneous" extracts on 3 or 4 indexes, it gets really slow (> 30 minutes and counting …).
If it's any help, I test-wrote 50,000 null,key records to a tempfile in 27 seconds. 2 min. per select + 1 min. per tempfile write = 3 min. Using this method it could take me about 12 minutes to complete the initial selection process with no dupes.
I have no experience in working with large lists in memory, so I'll leave that part of the discussion to others.
I hope you are able to resolve things successfully!
Best wishes,
Bill
At 05 DEC 2001 10:05AM Oystein Reigem wrote:
Wilhelm,
One of your problems seems to be that the user doesn't know exactly which values are in the database. E.g., one and the same person could be registered in the database with one or more variants, like
MOUSE, MICKEY
MOUSE MICKEY
MOUSE, MICKEY DEE
MOUSE, MICKEY D.
MOUSE, MICKEY D
MOUSE, MICKEY DEE DAVE
MOUSE, M
MOUSE, M.
MOUSE, M D
MOUSE, M. D.
MICKEY MOUSE
So what about an index lookup? Most of the values above would appear together and easily be spotted by the user. But I assume it isn't the names and persons themselves you want to find, so this would have to be the first step in a two-step procedure.
Another idea is something I implemented to help my own clients in a similar situation - a variation of Xref indexing combined with ] search. E.g., with the name value MOUSE, MICKEY DEE DAVE I did not index the individual names (words)
MOUSE
MICKEY
DEE
DAVE
but rather the incremental and special-character-stripped values
MOUSE
MOUSE MICKEY
MOUSE MICKEY DEE
MOUSE MICKEY DEE DAVE
Now a search on e.g. MOUSE MICKEY D] would find many of the possible variations of this person's name. A search on e.g. MOUSE M] would find even more of the possible variations (but also more irrelevant hits, of course).
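Oystein's incremental-value indexing can be sketched in Python (the character-stripping rule here is an assumption for illustration; a real implementation would mirror whatever normalization the index applies):

```python
import re

def incremental_values(name):
    """Return the special-character-stripped, cumulative prefixes of a name."""
    words = re.sub(r"[^A-Z ]", "", name.upper()).split()
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]

values = incremental_values("MOUSE, MICKEY DEE DAVE")
print(values)
# ['MOUSE', 'MOUSE MICKEY', 'MOUSE MICKEY DEE', 'MOUSE MICKEY DEE DAVE']

# An AREV "MOUSE MICKEY D]" search then corresponds to a prefix test
# against the indexed values:
print(any(v.startswith("MOUSE MICKEY D") for v in values))  # True
```

Each record contributes as many index entries as it has words, so the index grows linearly with name length rather than combinatorially.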
You could even let the user start with his most optimistic version
MOUSE MICKEY DEE DAVE
of the search criterion, and have the search program generate a suite of progressively broader criteria, e.g.
MOUSE MICKEY DEE D]
MOUSE MICKEY D]
MOUSE M]
MOUSE]
do a separate search on each, and present the result as a ranked list.
Of course I might be totally off target regarding your needs.
- Oystein -
At 05 DEC 2001 10:53AM Victor Engel wrote:
Perhaps this is a job for Ros Man. Ros files are fine for local use, which would apply in this situation. The thing about ROS files is that RAM is used automatically to cache the data. I just ran a test to compare performance of a REV file vs. a ROS file. On this slow machine I got 819 seconds for the Rev file vs. 263 seconds for the ROS file.
If your keys are sequential numbers, you can also use OSBWRITE, which is even faster (13.5 seconds). But if your keys are numbers, why not simply use them to address bytes (or even bits) in a string (0.7 seconds)?
At 06 DEC 2001 03:58PM Wilhelm Schmitt wrote:
Victor,
Thanks for your hint on ROS. I didn't get the message on the following:
If your keys are sequential numbers, you can also use OSBWRITE, which is even faster (13.5 seconds). But if your keys are numbers, why not simply use them to address bytes (or even bits) in a string (0.7 seconds)?
This sounds very interesting, but could you be a little more specific? We use 2-part keys of the type prefix*number, like 12*345678 (or also A12*345678). Do you mean just to put a mark at position 345678 in DOS file TEST12, for example?
Wilhelm
At 06 DEC 2001 04:07PM Wilhelm Schmitt wrote:
Oystein,
The xref on names is fine. In our specific case the user doesn't always know the exact order of names, so that MOUSE, MICKEY DEE DAVE could also be looked up as DEE MICKEY or DEE MICKEY MOUSE. With all the possible combinations, the index would become explosive.
Thanks
Wilhelm
At 06 DEC 2001 04:21PM Victor Engel wrote:
What I meant was that if your keys are sequential numbers, you could use the key as the pointer in the OSBWRITE. For example,
OSBWRITE 1 ON DOS_HANDLE AT REV_KEY
Of course, you need to make sure to initialize the DOS file first (OSBWRITE up to max size with 0s or something).
When finished, you can OSBREAD through the file checking for these flags to create a select list, thusly:

* maxpos = maximum key encountered
listrec = ''
for pos = 0 to maxpos step blocksize
   osbread block from dos_handle at pos length blocksize
   for i = 1 to len(block)
      * byte i of this block sits at file offset pos + i - 1,
      * which is the key that was flagged by the OSBWRITE
      if block[i,1] = '1' then listrec := (pos + i - 1) : @fm
      * Here you would add code to flush the list to the LISTS file
      * (creating a saved list) when it gets over some threshold
      * (about 30K if you want to match existing saved lists)
   next
next

I hope you get the idea from this.
At 06 DEC 2001 05:01PM Wilhelm Schmitt wrote:
Victor,
Just tested… 1,000,000 marks written in 18 seconds. Very impressive. Thanks a lot. :)
Wilhelm
At 07 DEC 2001 03:42AM Oystein Reigem wrote:
Wilhelm,
I realize I'm a bit confused about your situation. Is the variability in the database? E.g., could the person with the full name of Donald Duck exist in the database as both
DUCK, DONALD
DONALD DUCK
and
DUCK, D
Or is the variability strictly in the range of queries presented by the users?
- Oystein -
At 10 DEC 2001 10:43AM Wilhelm Schmitt wrote:
Oystein,
The variability in the database comes from the legal system, i.e. names have to be stated exactly as they appear in the contracts to be recorded.
Wilhelm
At 10 DEC 2001 01:10PM Oystein Reigem wrote:
Wilhelm,
I see. (Finally.)
- Oystein -
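For reference, Victor's flag-file technique from this thread can be sketched in Python, with an in-memory bytearray standing in for the DOS file (the key values are illustrative):

```python
# Mark position <key> in a zero-initialized byte array, then scan once
# to rebuild a duplicate-free, sorted key list - the same idea as
# OSBWRITE-ing a "1" at offset REV_KEY and OSBREAD-ing the flags back.
MAXKEY = 1_000_000
flags = bytearray(MAXKEY + 1)  # "initialize the DOS file with 0s"

for key in (345678, 12, 345678, 99):  # duplicate writes are harmless
    flags[key] = 1                    # the OSBWRITE of a mark at offset key

selected = [pos for pos, mark in enumerate(flags) if mark]
print(selected)  # [12, 99, 345678]
```

Writing a mark is O(1) regardless of how many times a key appears, and the single scan at the end yields the keys already sorted, which matches the timings reported above.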