Moving from a large, multi-part key to a single-part key (AREV Specific)
At 15 NOV 1998 05:04:31PM akaplan@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura, Inc.[/url] wrote:
We've recently proposed moving a customer's file structure from a 5-part key to a single-part key. We've done some tests to try to determine the speed increase this will give us; however, it's difficult to measure the exact effect on a live system.
If anyone has done this before and kept benchmarks, could you kindly let us know the before-and-after differences?
The customer is curious as to the amount of bang for the buck they will receive.
Thanks much.
akaplan@sprezzatura.com
At 16 NOV 1998 02:41AM Steve Smith wrote:
Aaron, if I didn't know better, I'd say this question & response will be viewed here by the client. I doubt you need this feedback, with your unquestionable technical expertise in all matters AREV.
It depends on the customer's usage, and on whether, after the change, they still need to access the key components: either symbolically as substrings of the old key, or via key elements moved into the record.
Indexes may also have an effect. Would it mean you were indexing symbolics? This can be a pain sometimes.
The physical change in key length will affect scan speed to a minor extent, as AREV no longer has to traverse as much data each time it accesses, say, the fifth element of @ID.
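For example, the substring access usually looks something like this symbolic formula (a sketch only; the "*" delimiter and the record field position used after the change are assumptions about your particular file):
[code]
* Before the change: pull the fifth part out of the multi-part @ID.
* Assumes the key parts are delimited with "*".
@ANS = FIELD(@ID, "*", 5)

* After the change: the same value carried in the record instead,
* here assumed to live in field 5 of @RECORD.
*@ANS = @RECORD<5>
[/code]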
I suspect there are speed-related advantages, such as in record maintenance, template key entry times, lock token generation times, and reporting columns from the key.
Having used similar keys in an AREV timetabling package (student-room-teacher-period-subject), I found the 5-part key was an advantage, as access was always by different key parts (sometimes incomplete, looking for multiple hits). The same was true of reindexing keys in a payroll package with employee-payperiod-paydate-payslip as a 4-part key: a definite advantage.
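In code, that kind of access by key parts mostly boiled down to something like this (a sketch from memory; the TIMETABLE file, the TEACHER and PERIOD dictionary items defined over the key parts, and the RLIST argument layout are all assumptions):
[code]
* Sketch: select by individual key parts, assuming symbolic dictionary
* items (TEACHER, PERIOD) defined over parts of the multi-part key.
STMT = 'SELECT TIMETABLE WITH TEACHER "SMITH" AND WITH PERIOD "3"'
CALL RLIST(STMT, 5, "", "", "")  ;* 5 = make the result the active list (from memory)

EOF = 0
LOOP
   READNEXT ID ELSE EOF = 1
UNTIL EOF DO
   * ID is the full STUDENT*ROOM*TEACHER*PERIOD*SUBJECT key.
REPEAT
[/code]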
Depends on your application, and the context of the file access.
Overall, I'd be surprised if there's much difference in performance one way or the other, unless it entails a significant reduction in key size.
Steve
stsm@ozemail.com.au
At 16 NOV 1998 05:23PM K Gilfilen wrote:
Aaron,
I am having a hard time thinking of performance-related reasons for going to single-part keys. Any performance benefit you get will be very minor compared with the benefit your data schema gets from having meaningful, concatenated-data record keys. I say that based on my expectation that most queries are done against virtual data (indexed data), not the actual data files. But even if you have no indexes, running through a file and having to parse record keys will not slow things down much; a conscientious database administrator will put indexes on the important columns before changing the data schema/record keys anyway.
If, on the other hand, the purpose of your modifications is primarily to change the data schema, and secondarily to get performance improvements by reducing the load on the dictionary calls that parse record keys, that seems like good reasoning. But then I certainly don't have the benchmarks you're looking for.
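For what it's worth, the kind of scan I have in mind is just this (a sketch; the CUSTFILE name and the "*" key delimiter are made up for illustration):
[code]
* Sketch: run through a file and parse the multi-part keys as we go.
OPEN "CUSTFILE" TO F.CUST ELSE STOP

SELECT F.CUST
EOF = 0
LOOP
   READNEXT ID ELSE EOF = 1
UNTIL EOF DO
   PART1 = FIELD(ID, "*", 1)   ;* first key part
   PART5 = FIELD(ID, "*", 5)   ;* fifth key part
   READ REC FROM F.CUST, ID THEN
      * ... do something with PART1, PART5 and REC ...
   END
REPEAT
[/code]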
At 16 NOV 1998 07:59PM Victor Engel wrote:
I don't have benchmarks. However, I did have a run-in with this issue. In HR-1, the changes file is saved using up to a five-part key, which includes change date, time, data key, sequential counter, etc. We have a Btree index on the data key and the change date fields, both of which are part of the key. These would be just as well served as data fields as they are as key parts. I once repopulated the file with a sequential counter as the key and all the aforementioned parts as data items instead of key parts. Building indexes was markedly faster: several times faster. Sorry, but I don't have numbers, and this never was implemented in production anyway. Reads/saves were not appreciably faster or slower. If you think about it, the same volume of data (key plus record) must be saved to the file in either case, so the amount of data transfer is roughly equivalent. I don't think that answers your question, but it's all I've got.
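The repopulation was essentially this shape (a sketch from memory; the CHANGES2 file name and the target field positions are made up, and the counter handling is simplified, since in practice you would lock or otherwise protect the counter):
[code]
* Sketch: repopulate CHANGES into CHANGES2, replacing the multi-part key
* with a sequential counter and carrying the old key parts as data fields.
OPEN "CHANGES"  TO F.OLD ELSE STOP
OPEN "CHANGES2" TO F.NEW ELSE STOP

CTR = 0
SELECT F.OLD
EOF = 0
LOOP
   READNEXT OLD.ID ELSE EOF = 1
UNTIL EOF DO
   READ REC FROM F.OLD, OLD.ID THEN
      CTR = CTR + 1
      NEW.REC = REC
      NEW.REC<10> = FIELD(OLD.ID, "*", 1)  ;* change date (field 10 assumed)
      NEW.REC<11> = FIELD(OLD.ID, "*", 2)  ;* change time (field 11 assumed)
      NEW.REC<12> = FIELD(OLD.ID, "*", 3)  ;* data key    (field 12 assumed)
      WRITE NEW.REC TO F.NEW, CTR
   END
REPEAT
[/code]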
At 17 NOV 1998 11:30AM akaplan@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura, Inc.[/url] wrote:
I don't know if the client will actually see this; I was asking more for information to save myself all the work.
Actually, the question was based more on index build time and index updating in general. I know of two sites that are essentially running the same system; it was an old RevG system that took two different paths to ARev 3.12. They have large files with very large keys, and building and maintaining indexes on these things was a pig. As the files got larger, the indexer was taking 30 to 40 seconds to update each transaction. Since the system would sometimes get a couple of hundred transactions in a process, and there were 12 indexes on the file, this was really bogging down the indexer.
This particular customer is concerned about index updating and rebuild times. The updating I can understand, but the customer said they have not had to rebuild indexes in over a year.
Having said that, the advantage of being able to build keys on the fly, as they do, would probably offset any gain. Otherwise, instead of being able to loop through the parent records to get the child keys, they'd have to BTREE.EXTRACT, unless they did some relational or pseudo-relational indexing.
But it helps to have some concrete examples so the customer can make an informed decision. They'd prefer not to do any serious maintenance work on the system, but they want to do everything possible to speed it up some.
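To make the trade-off concrete, the two access patterns look roughly like this (a sketch; the ORDERS file, the field holding the child suffixes, and the exact BTREE.EXTRACT argument layout are assumptions from memory):
[code]
* (a) Keys built on the fly: the parent record carries the child suffixes,
*     so each child key is just parent key + "*" + suffix.
OPEN "ORDERS" TO F.ORDERS ELSE STOP
PARENT.ID = "A100"                      ;* example parent key (made up)
READ PARENT FROM F.ORDERS, PARENT.ID THEN
   CHILD.LIST = PARENT<3>               ;* child suffixes, @VM delimited (assumed)
   NUM.KIDS = COUNT(CHILD.LIST, @VM) + (CHILD.LIST # "")
   FOR I = 1 TO NUM.KIDS
      CHILD.ID = PARENT.ID : "*" : CHILD.LIST<1,I>
      READ CHILD FROM F.ORDERS, CHILD.ID THEN
         * ... process the child ...
      END
   NEXT I
END

* (b) The single-part-key alternative: look the children up via an index,
*     e.g. a Btree index on a PARENT_ID field in the child records.
OPEN "DICT.ORDERS" TO F.DICT ELSE STOP
SEARCH = "PARENT_ID" : @VM : PARENT.ID : @FM
CALL BTREE.EXTRACT(SEARCH, "ORDERS", F.DICT, CHILD.KEYS, "", FLAG)
* CHILD.KEYS comes back as a delimited list of keys (details from memory).
[/code]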
Thanks for the advice.
akaplan@sprezzatura.com
At 17 NOV 1998 11:34AM akaplan@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura, Inc.[/url] wrote:
Thanks for the info. As I mentioned in my response to Steve, the customer has a series of hierarchical-type records, and embedded in each parent is the info needed to add on to the current key to get all the children. It's very fast, saves all sorts of indexing time, and eliminates the potential headaches that relationals cause.
The question was whether, since updating index transactions on such a large key can be very slow, the increase in speed would make much of a difference. The more I thought about it, the more I thought not, but it couldn't hurt to get a reality check from other developers.
Thanks again.
akaplan@sprezzatura.com
At 17 NOV 1998 11:36AM akaplan@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura, Inc.[/url] wrote:
That's exactly the sort of information I wanted. I was curious about the index speeds and if it would offset the hits needed in other places.
Based on all the info presented so far, I do not think it is worth it.
Thanks Victor.
akaplan@sprezzatura.com
At 17 NOV 1998 02:14PM Victor Engel wrote:
I have an update, which may or may not help. I periodically move records from our CHANGES file to a CHANGES_ARCHIVE file. Each has exactly the same dictionary structure, including indexes. Normally, I archive these records fairly frequently during off-peak hours, so the pending index transactions can be processed without a needless hit on real productivity.
This time, however, it had been a long time since I did this, and almost 100,000 records were archived. This created a tremendous number of transactions for the indexer. My program has a call to INDEX.FLUSH, and this proceeded on the CHANGES file without any significant delay beyond what was expected. However, the flush to CHANGES_ARCHIVE took so long (over 24 hours) that I eventually aborted the process and rebuilt the indexes. We have over 600,000 records on file. The index rebuild took about an hour, I think; I didn't time it. I really haven't figured out exactly where the bottleneck was, but I suspect that redistributing the btree was taking up the most time. With larger keys, redistributions must be more frequent, right?
On a day-to-day basis, however, the indexer machine (I think it's a 486) is able to keep up with all the index updates to the data files and to this CHANGES file. That was over 5,000 changes last week, and about 11,000 changes just yesterday.
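For reference, the archive pass is basically this shape (a sketch; any selection criteria, such as a date cut-off, are omitted, and the INDEX.FLUSH argument details are from memory, so treat them as assumptions):
[code]
* Sketch: move CHANGES records to CHANGES_ARCHIVE, then flush the
* pending index transactions on the archive file.
OPEN "CHANGES"         TO F.CHANGES ELSE STOP
OPEN "CHANGES_ARCHIVE" TO F.ARCHIVE ELSE STOP

SELECT F.CHANGES
EOF = 0
LOOP
   READNEXT ID ELSE EOF = 1
UNTIL EOF DO
   READ REC FROM F.CHANGES, ID THEN
      WRITE REC TO F.ARCHIVE, ID
      DELETE F.CHANGES, ID
   END
REPEAT

* Flush the pending index transactions for the archive file.
* Passing "" for the index name is assumed to mean "all indexes".
CALL INDEX.FLUSH("CHANGES_ARCHIVE", "")
[/code]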
At 17 NOV 1998 04:08PM akaplan@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura, Inc.[/url] wrote:
I don't have benchmarks on the site in question here, but I did have some on another site, and there we found it was definitely in the balancing of the trees. We gave them a debug f.distributor and did timing tests through the debugger.
At another site, I mailed them a custom-cut indexing routine which changed the btree node size to about 10K. This made his standard updates almost instantaneous.
So, from everything I've ever seen (plus the added advantage of being able to trace through the debugger, with source!), the big speed hit on large keys is the time taken to balance the nodes.
There shouldn't be any hit on a simple read, since the key length is stored (along with the record length), so it's a memory jump, not a disk read, and the added read of a few bytes from memory shouldn't make a difference in the real world.
All interesting theoretical stuff. Now, all I have to do is start implementing.
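If anyone does want hard numbers on that, a crude timing harness is enough to check it, run once against the multi-part-key file and once against the single-part version (a sketch; the CUSTFILE name is made up, and TIME() is assumed to return whole seconds):
[code]
* Sketch: time reads of every record in a file, to see whether read
* speed really moves at all when the key shape changes.
OPEN "CUSTFILE" TO F.CUST ELSE STOP

* Build the list of keys first, so selection time is kept out of the timing.
SELECT F.CUST
KEYS = ""
EOF = 0
LOOP
   READNEXT ID ELSE EOF = 1
UNTIL EOF DO
   KEYS<-1> = ID
REPEAT

N = COUNT(KEYS, @FM) + (KEYS # "")
START = TIME()
FOR I = 1 TO N
   READ REC FROM F.CUST, KEYS<I> ELSE NULL
NEXT I
ELAPSED = TIME() - START
PRINT "Read " : N : " records in " : ELAPSED : " seconds"
[/code]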
akaplan@sprezzatura.com
At 18 NOV 1998 08:21PM Chris Vaughan wrote:
Our experience with large payroll histories (4-part key: year, period, employee number, pay component; average length 25 characters) was that indexing becomes impossible as the number of records increases. Systems started collapsing at around one million records. The time to update an index node grew exponentially, to the point where the indexer appeared to thrash in a node-splitting loop. All this happened about six years ago; there was extensive dialogue with RevTech (or was it Cosmos?).
Trials using a simple transaction number as the key (max length 7) allowed us to index the required 2 million records. We also tried increasing the frame size for the !file (from 512 to 2048) in the hope of reducing the number of nodes and the incidence of node splitting. We found that the node split threshold was hard-coded at somewhere around 500 regardless of the index file frame size. We tried patching some of the indexing code, but fell into some really big holes.
I don't see the problem as being 'multi-part keys'. It is more a question of the number of index nodes and the amount of key data carried in each node. A lot of short keys in one node will eventually cause the same problem as a smaller number of large keys.
I don't see Btree as being a true balanced binary tree; any casual investigation will show that the nodes are anything but 'balanced'. The limit to the depth of the node hierarchy is finite.