Large systems (AREV Specific)
At 27 SEP 2000 07:16:57PM David Craig wrote:
We're planning our next system development effort and were wondering if there are any large databases (hundreds of tables, hundreds of thousands to millions of records each with 1 to 200 fields, heavy indexing and many multivalued and/or fairly large text fields) running on Arev, OI or JRev.
My experience so far is that you run into performance limitations at around 500,000 records with tables like that; at that point searches and sorts start to become slow even for indexed tables. On the other hand, when we moved from Novell to NT we also did a substantial hardware upgrade, and performance improved noticeably on all fronts (response times, stability), which might lead one to believe that the limitations can be overcome by jacking up the hardware side of the equation.
If anyone out there has experience with larger systems on these platforms I'd be interested in hearing about it.
David Craig
ABC-CLIO
At 28 SEP 2000 08:53AM Cameron Christie wrote:
We're running AREV 3.12 on Novell Server/NLM/NT Workstation. Our biggest file (in terms of number of rows) now has 3.7 million entries, and our biggest physical file last year hit an OV size of 2Gb. (We have since archived data out of it, but the biggest size the whole system has reached is around 7.5Gb.)
Both files mentioned have several BTree indexes which are vital to the performance of the system; said performance has actually improved over the last couple of years despite constantly growing business volumes as we continue to keep the network and workstations "up to date."
HOWEVER: in various tests we have run into severe index degradation when tables get this big, especially if the keys are unnecessarily long (DON'T use Date*Time*Station!). In the last few years we have been forced to convert keys to a smaller format, and I would strongly recommend designing for a partitioned structure up front (e.g. annual archive tables).
FWIW, it's not the number of BTrees that seems critical, or the physical framesize/threshold, but rather the number of nodes in the upper "branches" of the tree, which we have precious little control over. Even with the smallest possible numeric keys, we still hit fatal degradation round about 9-10 million rows.
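A minimal sketch of the "design for annual archive tables up front" idea, in Python purely for illustration (not AREV Basic); the table naming convention and the `write_record` callback are assumptions, not anything from the original system:

```python
import datetime

def archive_table(base_name: str, when: datetime.date) -> str:
    """Derive the per-year table a record belongs in, e.g. "INVOICES.2000"."""
    return f"{base_name}.{when.year}"

def store(write_record, base_name, key, record, when):
    """Route the write to that year's table instead of one ever-growing file."""
    write_record(archive_table(base_name, when), key, record)

# Toy usage: a dict of dicts stands in for the real tables.
tables = {}
store(lambda t, k, r: tables.setdefault(t, {}).__setitem__(k, r),
      "INVOICES", "00012345", {"amount": "100.00"}, datetime.date(2000, 9, 28))
```

The point is simply that no single table ever creeps toward the multi-million-row range where the index degradation described above set in.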
We were also forced to archive the 2Gb table when it began reporting spurious GFEs (I suspect an inbuilt 2^32 limit somewhere in the linear hash filing system.)
That said, given current projected volumes we have no plans to move away from the platform at the moment, but it IS admittedly becoming a bit more of a chore to keep the housekeeping up to date.
Cameron
PS And if I've missed any obvious solutions to the aforementioned problems, would someone please let me know!
At 28 SEP 2000 09:31AM Don Miller - C3 Inc. wrote:
You are essentially correct in your observations. Be sure to avoid the dreaded sequential numeric key trap since this causes "clustering" in the top-level nodes. This situation can be alleviated to some extent by using any alpha character in the key (even if it's the same character appended to a numeric sequential key).
There is no inherent 2^32 issue in RTP57 (Linear Hash), although some of the maintenance utilities do bump against it internally. The algorithms for access in AREV are substantially weaker than in Pick, which has a boatload of optimized choices for many possibilities - but that's another issue entirely. In general, larger frame sizes (which can help the dreaded overflow situation) are not particularly efficient for memory utilization. I've seen situations where only 1 table required a >4K frame size, but because the app had a lot of tables there were significant memory utilization issues (particularly in string space).
HTH
Don Miller
C3 Inc.
At 28 SEP 2000 11:38AM Cameron Christie wrote:
Thanks Don,
Sequential counters are essentially inefficient as they have to be locked, read, updated and unlocked (and heaven help you on a busy system if a station crashes while still holding said lock!). For those reasons we use a "drunkard's walk" algorithm: pick a fixed-length random number for the key, then look to see if it already exists - if it DOES, we pick a new random number and repeat. No extra locks are needed, and there are fewer file accesses than with a counter whenever the number of records in the table is less than half the maximum key value (e.g. with an 8-digit key this method is more "efficient" up to a total of 50,000,000 rows!).
I must confess, though, I didn't put an alpha character on it - you can bet it'll be happening soon, after your tip!
Cameron
PS You should have heard our accountants scream when we told them we were using random numbers to control all their posting files.
PPS If there's no 2^32 in LH, then what do you reckon was causing the phantom GFEs over 2Gb? There's no documented limit anywhere near this value on Novell or NT that I'm aware of???
We could sequentially readnext all records, but DUMPLH reported definite errors, and write operations fell over if they hashed into frames beyond the end of the file.
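A rough sketch of the "drunkard's walk" key scheme Cameron describes above, written in Python rather than AREV Basic purely for illustration. The `exists` callback stands in for whatever record lookup the platform actually performs, and the optional alpha suffix reflects Don's anti-clustering tip; none of these names come from the production code:

```python
import random

def drunkards_walk_key(exists, digits=8, suffix="", max_tries=1000):
    """Pick a fixed-length random numeric key and retry until it is unused.

    `exists(key)` should return True when the key is already taken (i.e. a
    record read succeeds). `suffix` optionally appends an alpha character.
    No counter record is ever locked, so a crashed station cannot strand a lock.
    """
    for _ in range(max_tries):
        key = f"{random.randrange(10 ** digits):0{digits}d}{suffix}"
        if not exists(key):
            return key
    raise RuntimeError("key space nearly full - widen the key or archive data")

# Toy usage: an in-memory set stands in for the real table.
taken = set()
new_key = drunkards_walk_key(lambda k: k in taken, digits=8, suffix="A")
taken.add(new_key)
```

As Cameron notes, while the table holds fewer rows than half the key space, the expected number of reads per new key stays below two, which is cheaper than the lock/read/write/unlock cycle of a shared counter.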
At 28 SEP 2000 11:45AM David Craig wrote:
Cameron;
Thanks very much - this is excellent information.
David Craig.
At 28 SEP 2000 01:07PM Wilhelm Schmitt wrote:
David,
Our internal system is based on ARev 3.12, NT Service, NT4, 50 LAN users and 40 additional WAN users (through Citrix Winframe). For WEB publishing we use OICGI and SAMBAR web server.
(By the way, most of our web programming, like processes, selections, HTML/JScript windows etc. is done in AREV, copying the routines to OI and recompiling there.)
In the LAN/WAN environment, each of the branch offices has a separate volume with its own files and indexes, many in the range of 100,000 records. The biggest file in the main office has around 700,000 records. The record size of that file is variable: it averages 1K, but ranges from 0.5K up to 25K.
On the LAN/WAN we switch between offices by simply attaching the corresponding volume (because file names and dictionaries within the volumes are identical).
On the WEB we give access to a duplicate of the database. Features are limited to record display, index lookups and rlist reports. Because of the restrictions in OICGI (every user is practically in the same session, so you cannot have user1 attaching office1 and concurrent user2 attaching office2 etc.) we had to throw all the data together. This means that the big datafile went up to 1,300,000 records.
On both sides we do lots of index lookups on the variable fields, mostly XREFs on data fields and symbolics. Some lookups are done through btree.extract, and for many reports we make direct reference to fieldname_xref (both from within programs and directly from TCL).
Response time is VERY GOOD in AREV!
Direct record access through the web (OICGI) is also good, but the index lookup in OI has several drawbacks:
1. On indexed words with many hits (10,000+), the web server times out after x minutes, while OI (RLIST) keeps running until the very last record key has been delivered, and you cannot stop the process from within a program :o(
2. While comparing execution of commands (LIST FILENAME FIELD1 WITH FIELD2_XREF = "ANYTHING") between AREV and OI, we found that the same lookup takes up to 10 times longer in OI than in AREV!
3. Creating and rebuilding indexes - we definitely do that in AREV.
We also miss a secondary index utility - in both AREV and OI - for large files (some sort of table indicating each indexed word and the number of corresponding hits), so that the user may pre-select before extracting from the main index.
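A loose sketch of what such a summary could look like, in Python for illustration only; the `index_entries` iterable of (word, key-list) pairs is a hypothetical stand-in for reading the real index, not an existing AREV or OI facility:

```python
from collections import Counter

def build_word_summary(index_entries):
    """Count hits per indexed word so a user can pre-select before extracting."""
    summary = Counter()
    for word, key_list in index_entries:
        summary[word] = len(key_list)
    return summary

# Toy usage: two indexed words with their key lists.
print(build_word_summary([("SMITH", ["1", "2", "3"]), ("JONES", ["4"])]))
```

Built offline, a table like this would let a user see that a word has 10,000+ hits before committing to the full extract that currently times out the web server.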
We are planning to move index lookups on the WEB to one or more dedicated AREV stations, which would create the key lists and have the user send a separate request to OICGI to show the data from the key list.
I hope this information helps you.
Best regards
Wilhelm
At 28 SEP 2000 01:35PM Wilhelm Schmitt wrote:
Cameron,
Your comment about random numbers for the key sounds interesting,
but do you control the sequence of documents in a (separate) data field, or don't you use any sequential control?
If, for example, you issue cheques #1000, 1001 and 1002, you assume that there is no #1001.5 in between. On the other hand, with random numbers - let's take #2985, 9002 and 3741 - how do you control sequence there?
At 28 SEP 2000 02:01PM Don Miller - C3 Inc. wrote:
Wilhelm ..
Here's what I have done:
1. I don't use any keys that involve sequentially adjacent numeric values due to the "clustering" problem. Instead, I store a sequential number in another table that only contains the "real" key in field 1. When a new number is needed, the "other" table is checked to see if the value is in use and the counter is bumped. The form that contains this value specifies that the "new" value cannot exist as a key to the "other" table.
2. What I do is B-Tree this field, sorted right-justified, which allows sorting, searching and Btree Extract to work properly.
Given reasonably sized "Pseudo-Keys" and numeric values, you can store an awful lot of data in a small space. The "clustering" effect doesn't seem to matter with a large number of values since each frame can hold a bunch. It also seems to help if the threshold is set to a lower value (maybe start at 50%) to keep the data sparsely packed (at the expense of some disk space, but it can guarantee less overflow traversing within a hash value). Disk space is relatively cheap.
In this way, my accounting types see the data in what they perceive to be an ordered set (sequential transactions by a posting batch, for example). It does get a bit tedious to use ID-SUPP in reports, though, and you have to teach Ad-Hoc report users to remember it.
Don Miller
C3 Inc.
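A loose sketch of the pseudo-key arrangement Don outlines above, as I read it, in Python for illustration only; the `NUMBERS` table name, the `NEXT.NUMBER` counter key and the `read_record`/`write_record` callbacks are invented stand-ins for the real AREV Basic and form validation:

```python
def next_pseudo_number(read_record, write_record,
                       numbers_table="NUMBERS", counter_key="NEXT.NUMBER"):
    """Bump the stored counter past any values already in use, reserve the
    winner in the numbers table, and return it for storage in a data field."""
    candidate = int(read_record(numbers_table, counter_key) or 1)
    while read_record(numbers_table, str(candidate)) is not None:
        candidate += 1                                        # value already taken
    write_record(numbers_table, str(candidate), "")           # reserve it
    write_record(numbers_table, counter_key, str(candidate + 1))
    return candidate

# Toy usage: a dict stands in for the numbers table.
table = {}
n = next_pseudo_number(lambda t, k: table.get(k),
                       lambda t, k, v: table.__setitem__(k, v))
```

The sequential value then lives in an ordinary data field (B-treed right-justified so it sorts numerically), while the actual record key stays free of the clustering problem.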
At 28 SEP 2000 02:10PM Don Miller - C3 Inc. wrote:
Was the 2^32 in the NLM or NPP? I've definitely seen problems with DUMPLH (or its equivalents) due to numeric value calculations on "next" frame or "forward" pointers. I've seen a lot of tables in excess of 2.2gb (which would be in excess of that calculation) which can be read / written to without errors (phantom, mostly); but I have seen phantom GFE's in the NLM. I assumed that it was not in RTP57, but in the server piece, but you may be right. Frankly, I'm disinclined to write a program to test this since a lot of the errors I am seeing involve not just AREV, but various flavors of WINDOZE, Client Software, Service Packs, NLM weirdness (see the Sprezz post about COPYTABLE), NT service and the like.
Don Miller
C3 Inc.
At 29 SEP 2000 03:56AM C CHRISTIE wrote:
Don,
It may well have been an NLM issue, but there are so many variables with NT patches and Novell fixes these days that it's hard to be sure of anything without doing 10 years worth of testing!
As we've "converted" to an annualised archive now anyway I wouldn't waste any time worrying about it…
Thanks,
Cameron
At 29 SEP 2000 04:03AM Cameron Christie wrote:
Wilhelm,
We have the luxury of not needing to track any sequences on these accounting files - the keys are purely arbitrary and completely meaningless. As in Don's case, there are sufficient "real" batch counter, transaction number, and date fields in the body of the record to satisfy the auditors! (Since the old keys were Date*Time*Station (*Counter) based, ANYTHING would have been an improvement!)