LINEAR HASH FILE STRUCTURES - Part 1
Published By | Date | Version | Knowledge Level | Keywords |
---|---|---|---|---|
Sprezzatura Ltd | 01 MAY 1991 | 2.03+ | EXPERT | LINEAR, HASH, FILE, STRUCTURE, LK, OV, FRAMES |
Dave Harmacek of Harmacek Database Systems had a recent problem where his client corrupted his Linear Hash File. To recover the remaining information it was necessary to deal with the files at the DOS level. The file structure was not self evident so we had to undertake some research. The starting point was Technical Bulletin # 29 by Ron Phillips. In this he sets out most of the salient facts relating to the way in which Linear Hashing works. Unfortunately it was not written with a view to providing all the information required to produce an LH fixing utility, so some elements of it lack detail whilst others are a little misleading. Over the next two issues the physical structure of Linear Hash files will be addressed. In this, the first article, attention will be given primarily to the LK portion of the file and the structure of pointers. The second article will consider the OV portion and how frames are linked.
Header Information
Every LH file is split physically into two files, the LK file and the OV file. The LK file contains the primary data frames and the OV contains the overflow. Each of these files is split up into a number of "frames". Normally these frames are 1024 bytes in length, although this size can be changed when the file is created or recreated. Each frame is made up of some header information followed by the actual data in the frame. The header information can be one of four types -
- LK frame - First Group Header
- LK frame - Subsequent Group Headers
- OV frame - Free Frames Group Header
- OV frame - Subsequent Group Headers.
A frame can be identified by looking at the first byte in the frame, which will have one of the following values
- ASCII(26) First LK Group Header
- ASCII(13) Subsequent LK Group Headers
- ASCII(7) Free Frames OV Header
- ASCII(14) Subsequent OV Group Headers
Header Structure
The structure of the LK header is described in the following table (Where (x) is length)
ÚÄÄÄÄÄÂÄÄÄÄÂÄÄÄÄÄÄÄÂÄÄÄÄÂÄÄÄÄÄÄÂÄÄÄÄÄÂÄÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄÂÄÄÄÄÄÂÄÄÄÄÄÄÄÄÄ¿ ³Byte ³ 1 ³ 2-5 ³6-9 ³10-13 ³14-15³16-19 ³ 20 ³21-22³23-26 ³ ÃÄÄÄÄÄÅÄÄÄÄÅÄÄÄÄÄÄÄÅÄÄÄÄÅÄÄÄÄÄÄÅÄÄÄÄÄÅÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄ´ ³Group³Type³Forward³Skip³Modulo³Frame³In Use³Threshold³Size ³Rec Count³ ³1 LK ³ (1)³ (4) ³(4) ³ (4) ³Size ³ (4) ³ (1) ³Lock ³ (4) ³ ³ ³ ³ ³ ³ ³(2) ³ ³ ³ (2) ³ ³ ÃÄÄÄÄÄÅÄÄÄÄÅÄÄÄÄÄÄÄÅÄÄÄÄÅÄÄÄÄÄÄÅÄÄÄÄÄÅÄÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄÅÄÄÄÄÄÅÄÄÄÄÄÄÄÄÄ´ ³LK ³Type³Forward³Skip³Modulo³ - ³ - ³ - ³ - ³ - ³ ³Rest ³ ³ ³ ³ ³ ³ ³ ³ ³ ³ ÀÄÄÄÄÄÁÄÄÄÄÁÄÄÄÄÄÄÄÁÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÁÄÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÁÄÄÄÄÄÁÄÄÄÄÄÄÄÄÄÙ
Type | Is a single byte identifying the frame type as above |
Forward | Is a pointer to the next overflow frame |
Skip | Is a pointer to the overflow frame where the next record actually begins. Hence if the current record is very large, the system can skip out the intervening overflow frames and go straight to the overflow frame containing the next record. |
Modulo | The modulo (number of primary frames) in the file. Note that this value may change between Group 1 and other Groups. Group 1 is always the most current modulo, but other groups have a modulo which reflects the modulo of the file when the particular group was relocated (as the file grows or shrinks). |
Framesize | The size of each frame. This must be the same for each frame in the file, but can change from file to file. All utilities written to access LH files at the DOS level ought to take account of this fact and not assume a standard 1024 byte frame size. In practice, frame sizes seem to have little effect on performance. |
In Use | Records how much space is currently taken up by the file.This figure is used to work out when the file ought to be resized. |
Threshold | This figure is the threshold figure at which resizing will commence. If the percentage utilisation drops below this figure, the file will have its modulo decreased, if it goes above this figure the file will have its modulo increased. |
Size Lock | Set by system processes to indicate whether the file may be permitted to expand/contract. |
Rec Count | Number of records in the file. |
Converting From Base 256 to Base 10
The information is stored in a Base 256 format with the least significant byte first. Thus to convert the information from Base 256 to Base 10 one could use the following code segment.
CONVERT: RET_VALUE = seq(VALUE[1,1]) RET_VALUE += 256 * seq(VALUE[2,1]) RET_VALUE += 65536 * seq(VALUE[3,1]) RET_VALUE += 16777216 * seq(VALUE[4,1]) VALUE = RET_VALUE RETURN
The above code segment works for all lengths of number up to 4 digits. However, remembering that V118 is used for exactly these sort of conversions, but only to three figures, it is actually approximately three times quicker to use the following code segment (on a machine without an 80x87) (for four digit numbers)
CONVERT: CALL V118(VALUE[1,3], LOW_BITS) HIGH_BIT = 16777216 * seq(VALUE[4,1]) RET_VALUE = HIGH_BIT + LOW_BITS RETURN
Note that if the number being converted is less than three digits in length (as with frame size) and size lock, the value to convert must be padded to 3 digits with CHAR(0)s (i.e. replace the V118 call with V118(FMT(VALUE[1,3] , "L(" : CHAR(0) : ")#3" , LOW_BITS))
Records Within The Frame
After the header block within the frame, the data records begin. These are stored sequentially and at the end of each record is a record marker (CHAR(255). At the end of all the records within a group is an "End of Group" marker, a CHAR(128). To enable the system to move from record to record rapidly within a frame (and also to be able to detect frame format errors), each record is preceded by a string of between 2 and 6 bytes, combining the length of the record with the length of the record id.
The way in which these are stored is a little confusing. The maximum record size is 64K and the maximum record id size is theoretically the same. 64K can be represented in Base 256 by two bytes. It might therefore seem reasonable to assume that each record would be preceded by four bytes, two for the record length and two for the id length. This is not the case. In storing the record and ID lengths, AREV tries to optimise the information and only stores as much information as it needs to. If a record id or a record length is short, only one byte is needed to represent it, so AREV only stores one byte.
To ensure that the system knows that only one byte is being used (or two or three) AREV flags the last byte in the length chain. To do this, the MSB (Most Significant Bit, the eighth bit of the number byte) is set to 1. If the MSB is not used, the largest size the byte can represent is CHAR(127). Thus if a byte is greater than CHAR(127), the MSB must be set and this must be the last byte in the length chain. Finding out the Record length and Id length is then achieved by starting at the first byte in the data frame and reading one byte at a time until the byte is greater than CHAR(127). The bytes read up to this point represent the record length. The Id length is found by continuing to take a byte at a time until the next byte greater than CHAR(127) is found. The bytes read up to this point represent the Id length. It is traditional to test for bits being set by masking with other bytes. Thus to check for the eighth bit being set one would BITAND with 128. This will only return true if the eighth bit is set.
The approach of only storing the minimum required has one problem. Since the MSB is used as a flag, one bit per byte cannot be used to store information. There are thus only seven data bits per byte, meaning that each byte can only store up to 127. The maximum that can be stored in two bytes is therefore 16,256 (127 + (127 * 127)). To store a record length over 16,256 then three bytes are required, giving
127 + (127 * 127) + (127 * 127 * 127) = 2,064,639.
It is thus possible that in some circumstances this approach may use more bytes per record than two eight bit bytes. Normally this will not be the case.
Unlike the information stored in the header block, this information is stored with the most significant byte first. When the bytes making up the lengths have been identified they can be converted into decimal using the following code fragment
CONVERT_7_BITS: BEGIN CASE CASE LEN(V) = 3 V = (SEQ(V[1,1]*16384) + (SEQ(V[2,1]*128) V += SEQ(V[3,1]) - 128 ; * Strip flag bit CASE LEN(V) = 2 V = SEQ(V[1,1]*128 + SEQ(V[2,1] -128 CASE LEN(V) = 1 V = SEQ(V) - 128 END CASE RETURN
(Volume 3, Issue 1, Pages 4-7)