Presizing a table (AREV Specific)
At 25 MAR 1999 12:05:04PM Bill Claybourgh wrote:
I've read in some of the posts that presizing an AREV table can eliminate some problems when importing a lot of records, especially in the speed department.
How exactly do I presize a table for this purpose? Do I change anything after I'm done importing as well?
Thanks in advance - Bill
At 25 MAR 1999 03:31PM akaplan@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura Group[/url] wrote:
When you create the table, you can define the options there, depending on the version.
All versions will allow you to recreate an existing table based on expected number of records and average size.
Personally, I got in the habit of using the recreate method because some versions (might have just been G) used to make the dictionary the same size as the data, and I really don't need a 10000 group dictionary.
Syntax is
REMAKETABLE tableName ExpectedNbrOfRows AverageRecLength VolumeForNewTable
If REMAKETABLE is not in your VOC, try RECREATE-FILE or REDEFINEFILE or REMAKEFILE.
Just entering the command should give you options you can fill out. If you leave the location blank, it will use the current location. Be warned that the DOS name of the file might change.
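For illustration only (the table name here is made up), presizing for roughly 150,000 rows averaging 160 bytes, and letting the location default to the current one, would look something like:
REMAKETABLE CUSTOMERS 150000 160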
akaplan@sprezzatura.com
At 26 MAR 1999 09:41PM Warren wrote:
And just to be doubly sure use the DUMP utility at TCL and check to see that the SIZELOCK is set to 1 (or greater). If the SIZELOCK is 0 the file will resize the modulo as you write records to it.
With it set to 1, the file will not resize until a SELECT statement is run on the file (which sets it to zero). After that the file will resize normally. If you set the SIZELOCK to greater than 1, the file will not resize until it is set manually to 1 or zero.
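For reference, the TCL command is just DUMP followed by the table name - for example, with a made-up name:
DUMP CUSTOMERS
The current SIZELOCK is shown there, and it can normally be changed from within the utility.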
At 30 MAR 1999 01:20PM Bill Claybourgh wrote:
Aaron, thanks for your response. I pre-sized the table using the REMAKETABLE command and saw some improvement in the overall import time, but not what I expected. Here are the stats:
Approximately 150,000 rows, average size 160 bytes.
Using the default resize threshold, the LK file was pre-sized to around 30MB. However, the import started out running as slowly as it normally did at the end, so I knew this wasn't going to work. Then I changed the resize threshold to 50 (just guessing) and the file was pre-sized to almost 50MB. The import began faster than before (but still not quite as fast as before pre-sizing), but it seemed to average out better. We eventually went from a 6.5 hour import to a 4.5 hour import. Better, but I guess I was hoping for more.
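(As a side note, those numbers are consistent with the threshold being a target fill percentage: 150,000 rows x 160 bytes is roughly 24 MB of data, 24 MB / 0.80 is about 30 MB, and 24 MB / 0.50 is about 48 MB - assuming the default threshold is 80%.)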
Warren suggested setting the sizelock. We did not do this. Would this help as well?
Thanks again for any ideas. Bill
At 30 MAR 1999 02:18PM Victor Engel wrote:
Some comments about presizing a table.
I have found that sometimes it is helpful to presize a table, and sometimes it is not. Part of it relates to cache size. If you presize a table to a size that exceeds the size of the cache, then, unless you update the table in group order, the random access nature of linear hash will defeat the cache to a large degree. By presizing to a size smaller than the cache, full advantage can be taken of it. Subsequent resizing, though, will take more time than updating the file.
The optimum setting will be different depending upon where the file is located and how you are accessing it. If you are using the NLM, for example, all resizing happens on the server, so that part does not enter into the equation as much as if you are accessing the file locally. We use the NLM here, and I have found that I can populate a table on the server many times faster than I can on my own hard drive. The exception to this is if I can presize a file to a size smaller than a write-enabled cache on my hard drive. Then I am effectively writing to RAM. Doing this sometimes, but not always, causes problems, so I don't recommend it for a critical process.
If I were in your shoes, I would try several scenarios in a testing environment to see which is the most efficient. My bet is that you will wind up presizing a file to somewhat smaller than the ultimate optimum size with a sizelock of 2. Then remove the sizelock when finished. Why do I say this? Because with a smaller modulo, you will have a smaller total number of frames (LK + OV) so there will be less total data written. Also, there will be less thrashing in the LK portion, and the OV portion will basically be updated in sequential order. Don't presize it too small with a sizelock of 2, though, or you could max out your groups. By setting the sizelock to 2 you eliminate ALL resizing overhead during the population phase. Turn it off right after you've populated the file, and normal activities will even out the file. You will notice some degradation of performance during this activity until the file achieves a balance.
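As a rough sketch of that sequence (table name and figures made up, and assuming the sizelock is checked and changed through DUMP as Warren suggests above):
1. REMAKETABLE CUSTOMERS 100000 160 - presize somewhat below the final size.
2. Set the SIZELOCK to 2 (DUMP shows the current value).
3. Run the import.
4. Set the SIZELOCK back to 0 and let normal activity even the file out.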
Just my two cents. Again, your results will depend upon your environment.
At 30 MAR 1999 02:22PM akaplan@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura Group[/url] wrote:
You should have set the sizelock to 1. A 1 setting allows the file to increase, but not decrease. All the disk thrashing you're hearing is the file shrinking back to empty.
akaplan@sprezzatura.com
At 30 MAR 1999 02:26PM akaplan@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura Group[/url] wrote:
I like to keep it at 1 in this situation. It has all the advantages of 2, but you don't have to worry about maxing out the groups, since the file will start to expand beforehand.
Even with a very large table, provided you're not working with sequential keys (which seem to have poor hashing distribution), a very large empty file should be just as fast to update as a just-big-enough empty file.
There was an article way, way back in Revelation's magazine, like 1987 or so with a graph showing hashing distributions on file size and key structure, or something like that. Perhaps someone that still has it can dig it out. I have no idea what happened to my copy. Maybe RTI can scan it in and upload it to the site here.
akaplan@sprezzatura.com
At 30 MAR 1999 06:49PM Victor Engel wrote:
I remember that article. Could it have been as far back as 1987? Was Arev even in existence then?
In any case, performance varies a lot by the type of environment as well as data. My suggestion was based upon some recent testing I did (I no longer have the data to support my claim).
One thing I have wanted to try but have not, so far, is to preselect the rows to be copied in group order. I think this can be done with a symbolic field making a direct call to the BFS to get the group number. The theory here is that caching is much more efficient for reads than for writes, particularly if write caching is disabled, in most instances. Actually, the main reason for wanting to do this is that the overflow file will naturally be created in an unfragmented state, with all frames in a group being contiguous. This will make subsequent access faster. Of course this is only relevant if no further resizing will happen.
At 31 MAR 1999 03:29AM amcauley@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura Group[/url] wrote:
Sept/Oct 1988 to be precise - the author was one Mike Pope. Nice guy. Pat McN always recommended the write-in-hash-order trick, a simple BFS call as you say.
amcauley@sprezzatura.com
World Leaders in all things RevSoft
At 31 MAR 1999 10:10AM akaplan@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura Group[/url] wrote:
The BFS trick is a good one, if you can make some arrangements beforehand. The way I see it, if you're importing data from an external source and the key is predetermined, you'll have to make a multi-pass import: get the keys, hash them, sort into group order, then import. The added speed from dealing with a presized file might make it moot.
If you can generate the keys yourself, it could be possible to make the BFS call (or use the GROUP_NUMBER function shipped on DUDS 4 or 5) and generate keys based on groups. This has some advantage, since random alphanumeric keys hash much more evenly than sequential keys.
Either way though, you're really ending up with a multi-pass import.
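As a very rough sketch of what the first pass might look like (the table name is made up, the argument list for GROUP_NUMBER is an assumption - check the DUDS documentation for the real one - and all error handling is omitted):

DECLARE FUNCTION GROUP_NUMBER
OPEN 'CUSTOMERS' TO F.SOURCE ELSE STOP     ;* made-up source table
SELECT F.SOURCE
GROUPLIST = '' ; KEYLIST = ''              ;* parallel @FM-delimited lists
DONE = 0
LOOP
   READNEXT ID ELSE DONE = 1
UNTIL DONE
   GRP = GROUP_NUMBER(ID)                  ;* assumed arguments - see DUDS 4 or 5
   LOCATE GRP IN GROUPLIST USING @FM SETTING POS BY 'AR' ELSE NULL
   GROUPLIST = INSERT(GROUPLIST, POS, 0, 0, GRP)
   KEYLIST = INSERT(KEYLIST, POS, 0, 0, ID)
REPEAT
;* second pass: walk KEYLIST front to back and write the rows in group order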
However, the NLM and NT Service (probably the NPP as well) have fixed that pesky frame buffer size issue, so you can resize your frames accordingly without having to take the extra read hit. This in itself might make up for any advantage in doing a hash ordered import.
akaplan@sprezzatura.com
At 31 MAR 1999 10:57AM Victor Engel wrote:
]However, the NLM and NT Service (probably the NPP as well) have fixed that pesky frame buffer size issue, so you can resize your frames accordingly without having to take the extra read hit. This in itself might make up for any advantage in doing a hash ordered import.
I was not aware of this. Good information. So now we can goose the frame size on LISTS, SYSTEMP, etc. without wasting I/O.
At 31 MAR 1999 11:22AM akaplan@sprezzatura.com - [url=http://www.sprezzatura.com]Sprezzatura Group[/url] wrote:
So now we can goose the frame size on LISTS, SYSTEMP, etc. without wasting I/O.
You bet! I don't know why RTI doesn't announce this information in all the literature and marketing stuff. This is a huge enhancement to the product, probably much more so than most realise. Optimize your SYSOBJ file and SYSREPOSEVENTEXES in OI, add the network frame size stuff on the server products, and you can really get some heavy speed increases.
akaplan@sprezzatura.com
At 02 APR 1999 06:48PM Victor Engel wrote:
I just finished some benchmarks comparing various frame sizes on a couple of files we have. On our EMP file, record size generally ranges from 2k to 4k, although there are a few records as large as 30k. I created files with frame sizes of 2K, 4K, 6K, 8K, and 10K and populated each with 22,000 records. For this load process, the 6K frame size was marginally the fastest. I don't have the statistics handy but will post them as a reply to this message. Following that, I performed a select using an indexed field to get a hit list that was not in the natural order of any of the files. After reading through the records, the gross read times were:
Frame size       Read time   File size
original file    22:30       105 MB
2K frames        16:16       112 MB
4K frames        13:47       120 MB
6K frames        14:04       126 MB
8K frames        14:29       131 MB
10K frames       14:27       133 MB
Bear in mind that the original file is fragmented compared to the others because it has had day-to-day I/O over a long period of time, whereas the others were created from scratch. I then performed the same test on the LISTS file, which has average record sizes of about 30K. I found no significant difference in read or write speed with any of the frame sizes. The packet size as defined by the NLM is 1458, which is optimal for our network, according to the NLM documentation.
At 03 APR 1999 05:08AM Victor Engel wrote:
As promised, here are the write times:
2k: 18:29
4k: 20:14
6k: 13:55
8k: 15:24
10k: 15:56
At 17 APR 1999 12:57AM Bill Claybourgh wrote:
Aaron and Victor,
Just wanted to give both of you an update. Setting the sizelock was the key. We went down from 6.5 hours to just over 1 hour.
Thanks very much again.
- Bill