Repeating (Multivalued) Data in Advanced Revelation SQL: Features and Characteristics
Published By | Date | Version | Knowledge Level | Keywords |
---|---|---|---|---|
Revelation Technologies | 06 AUG 1989 | 2.X | NOVICE | ASSOCIATED, MULTIVALUES, KEYQUAL, MULTIVALUE, NORMALIZED, REPEATING, SQL, TABLES |
Orthodox texts on relational database theory and practice state flatly that 'relational database theory does not allow for repeating groups'. A repeating group is defined as 'a column that contains sets of data values instead of just one value in each row'.
No rationale is given for the exclusion of this concept from the relational model. On the other hand, users know from experience that 'repeating groups' referred to in Advanced Revelation as multivalued fields are extremely useful and natural in modeling real world situations. Experience has also shown that dealing with multivalued data can be extremely difficult when applied to the realities of queries.
Advanced Revelation SQL provides an alternate SQL query and data manipulation interface for "native" Revelation files (such as Linear Hash files) without restriction or modification of the file structures as defined in the data dictionary. The SQL language interface imposes a new view of the data that adheres to the standard definition of SQL while providing a consistent solution to entering and querying data from a file that contains multivalued fields.
Repeating Data, Auto-normalization, and First Normal Form
The Advanced Revelation SQL solution to handling repeating data is to automatically normalize the data, making it appear as if every value has its own row. This has always been done within Advanced Revelation for sorting by multivalues and is usually called 'exploding'. The result of the normalization process appears as a 'first-normal form' representation of the data.
(The distinction between an Advanced Revelation record and an SQL row can now be defined more precisely. A record denotes the conventional unit of data storage and may contain multivalued data. If a record contains no multivalues, it is equivalent to a row. If a record contains multivalued data, that record may represent n rows where n is the largest number of values held by any field in the record.)
First normal form is almost never used in database design because it typically involves redundant storage of data. First normal form is generally superior to higher normal forms in terms of performance, but the problems of storage space and data consistency usually outweigh the advantages. The application developer is almost always forced to write complex and slow routines to guarantee that non-repeating data is consistent between associated rows.
As an example of first normal form, assume a simple employee table. The employee has a number, name, address, job number, and list of dependents, including age and sex. Each employee may have multiple dependents. The non-repeating data is (should be) exactly the same for each row. Only the columns for dependent information dependent name, age, sex, and relationship should vary. The structure for such a table represented in its normalized format is illustrated in Figure 1.
Creating Repeating Data Tables
Data tables with repeating columns are usually created within Advanced Revelation by setting the multivalued attribute on a column in the dictionary. The Advanced Revelation implementation of the SQL CREATE TABLE syntax also allows for creation of data tables with repeating columns.
The last attribute that may be applied to a column in the CREATE TABLE syntax is the keyword REPEATING. This will set the multivalued flag in the dictionary entry for that column. The term "repeating" was chosen as being more commonly used in relational literature.
For example, a table that encompasses the example mentioned above can be created in Advanced Revelation using this SQL CREATE TABLE statement:
CREATE TABLE EMPLOYEES (EMP_NO INTEGER PRIMARY KEY, EMP_NAME VARCHAR NOT NULL, ADDRESS VARCHAR, JOB_NO INTEGER NOT NULL REFERENCES JOB_TABLE, DEP_NAME VARCHAR REPEATING, DEP_AGE INTEGER REPEATING, DEP_SEX CHAR(1) REPEATING)
Note the use of the Advanced Revelation SQL keywords 'REPEATING' to indicate the creation of multivalued columns.
Note: New development targeted specifically for SQL should probably avoid using this feature inasmuch as it is neither portable to other implementations of SQL nor compatible with the ANSI standard for SQL.
Querying and Manipulating Repeating Data Tables
Whenever a table has one or more repeating columns, an additional key part column 'KEYQUAL' is 'created'. The pseudo-column KEYQUAL may appear in a Selection clause or WHERE clause. It may be used to represent the key-qualifier (value number) of the row in question. In a sense, the explicit key specifies the record, and the key-qualifier KEYQUAL identifies the row. Refer again to Figure 1.
This column is of the data type INTEGER and may be queried and assigned like any other column. Currently, however, it is not projected in '*' selects.
Inserting Data into Repeating Data Tables
Data can be inserted into repeating columns using the KEYQUAL keyword to indicate the correct placement of the data within the group of values. To keep the use of repeating data in SQL from being overly cumbersome, a few convenience features appear in the system. When inserting new rows into a table:
- if the table contains repeating rows, and
- the data being inserted includes any of those repeating rows, and
- if the key-qualifier is not specified in the VALUES clause,
then repeating data will be appended to the data currently in that/those columns. This obviates the need to constantly keep track or query to find the next available value in that record. If the record currently exists, then all non-repeating data that appears in the VALUES clause must match the data currently in the record. If values for non-repeating columns in the INSERT statement do not match the values already in the record, an error results and the INSERT statement fails.
Automatic Data 'Forwarding'
When using the INSERT statement, values for any non-repeating columns not specified in the VALUES clause will be 'brought forward' into the new row. Only the repeating columns need to be inserted when adding a new row to an existing 'group'.
For instance, if the employee in the example table added a new dependent, it would be necessary only to use a statement such as this one:
INSERT INTO EMPLOYEES (EMP_NO, DEP_NAME, DEP_AGE, DEP_SEX ) VALUES (100, Mark Jr., 0, M)
This example demonstrates both automatic key qualifier assignment and data forwarding.
Automatic Data Consistency Validation
An advantage of this technique is system-enforced data consistency validation for INSERT statements. It would normally be up to either operator diligence or application programs to assure that non-repeating data entered into a normalized 'group' was consistent. This is probably one of the greatest reasons that this format of data structuring is rarely used in most systems, it is cumbersome and unreliable. However, Advanced Revelation guarantees that data inserted into non-repeating groups always matches the data already in the record.
Automatic Data Compression
This apparent first-form normalization gives the illusion of squandering a great deal of storage space on redundant data. However, Advanced Revelation 'factors out' repeating data and stores it only once, only expanding or 'normalizing' that data when required for extraction.
Updating Data in Repeating Data Tables
To update repeating columns, you must specify the KEYQUAL keyword, or the UPDATE operation will alter all rows in that record. This is how one would understand that UPDATE statements should perform if the table were truly normalized.
Deleting Data from a Repeating Data Table
You cannot delete data from non-associated repeating rows using the DELETE statement. DELETE (unlike SELECT, INSERT, and UPDATE) does not include any concept of projection. UPDATE could be used to achieve the same effect as deleting a row in a repeating record while sparing non-associated columns, but with some loss of elegance.
For example, suppose you wish to remove the second phone number for customer 100 in Figure 2. This statement would also delete the reference to invoice 1005 in the INV_NO column.
DELETE FROM EMP WHERE EMP_NO = 100 AND KEYQUAL = 2
However, this statement would eliminate the logical row from the view of the table (CUST_NO, CUST_NAME, PHONE):
UPDATE EMP SET PHONE = NULL WHERE CUST_NO = 100 AND KEYQUAL = 2
Thereafter, a query such as this one:
SELECT CUST_NO, CUST_NAME, PHONE FROM CUSTOMERS
would not even show a NULL row for the second item. In that view of the table, the row associated with keyqual = 2 would not exist.
If multiple associated repeating columns are involved (such as an additional PHONE_DESCRIPTION field), the values of all such associated repeating columns would have to be set to NULL for the row to be logically deleted.
Associated vs. Non-associated Multivalues
With the record taken as a whole, all multivalued columns are assumed to be associated. This representation problem can in most cases be avoided by following the policy of projecting only associated repeating columns into a view of the table. Non-associated columns may be introduced by self-joining another view of the table including only the other associated repeating columns.
For example, assume a customer table with a repeating invoice column and a non-associated repeating phone column. The structure of this file is shown in Figure 2.
You want a report of each invoice with some detail information from the invoices file, and with each invoice report you also want a listing of all phone numbers for that customer:
SELECT A.CUST_NO, A.CUST_NAME, B.PHONE, A.INV_NO, C.SHIP_DATE FROM CUST A, CUST B, INV C WHERE A.CUST_NO = B.CUST_NO AND A.CUST_NO = C.CUST_NO
The resulting report is shown in Figure 3. For presentation of the report, duplicate data would desirably be eliminated. The point is that repeating data even non-associated repeating data can be presented as normalized rows.
Accuracy of Query Results
When querying tables with repeating (multivalued) columns, SQL can produce more intuitive and reliable results than R/LIST! For the following example refer to the table in Figure 1.
In the R/LIST view of this file, it contains two records with a set of logically associated multivalued fields. If you wish to find all of the employee dependents who are male and over 10 years old you might use this R/LIST query:
LIST EMP EMP_NO EMP_NAME DEP_NAME DEP_AGE WITH DEP_SEX = M AND WITH DEP_AGE > 10
The result would be a listing of all of the dependents for all of the employees who have a male dependent and who have a dependent who is older than 10. In this case you would get a listing that includes all of Mark Halverson's dependents, none of whom meet the criteria and all of Susan Santi's dependents, only one of whom meets the criteria.
However, the SQL query:
SELECT EMP_NO, EMP_NAME, DEP_NAME, DEP_AGE FROM EMP WHERE DEP_SEX = 'M' AND DEP_AGE > 10
returns the row where EMP_NO = 1 and KEYQUAL = 2. SQL provides the necessary 'horizontal association' that R/LIST does not. This is due to SQL's requirement of first normalizing the table before performing any query operations.
Conclusions
The Advanced Revelation implementation of a repeating data construct, while in apparent violation of a formal definition of the relational model, remains a feasible and useful construction, even under SQL. Repeating data can be represented as if it had been normalized to a first-normal form. Extensions to the syntax of SQL statements allow Advanced Revelation full schema, query, and update access to repeating data.
In addition, Advanced Revelation SQL access to repeating data offers a variety of advantages overall in the maintenance and manipulation of data. Among these are:
- access to existing multivalued data
- comparisons against each element of a multivalued field
- automatic data consistency validation during SQL INSERT operations
- automatic data forwarding during SQL INSERT operations
- automatic data 'compression', insofar as data is actually stored in its most compact form while being represented in the redundant, normalized form
It is worth noting that databases with lower forms of normalization are generally superior in performance to higher normalized structures. The more wasteful storage requirements and consistency requirements that must typically be hand-programmed deter the use of first-normal form. However, Advanced Revelation's use of repeating data in columns allows developers to access the features of first-normal form structure while avoiding its shortcomings, which include disk accesses and join overhead.
Examples
Figure 1
EMP_NO KEYQUAL EMP_NAME ADDRESS JOB_NO DEPND Age Sex ------ ------- ----------------- --------------- ----- ------ --- --- 100 1 Mark Halverson 102 N. Pine 1023 Sarah 2 F 100 2 Mark Halverson 102 N. Pine 1023 Mark 7 M 100 3 Mark Halverson 102 N. Pine 1023 Pam 11 F 101 1 Susan Santi 33 Elm St. 1055 Jim 8 M 101 2 Susan Santi 33 Elm St. 1055 Jared 12 M
Figure 2
CUST_NO CUST_NAME INV_NO PHONE ------- ----------- ------ ---------- 100 XYZ WELDING 1000 682-7654 1005 682-7777 1007 1009 101 ABC CARS 1001 778-1007 1002 778-0999 1003
Figure 3
A.CUST_NO A.CUST_NAME B.PHONE A.INV_NO C.SHIP_DATE --------- ---------------- --------- -------- ----- 100 XYZ WELDING 682-7654 1000 JAN 1, 1989 100 XYZ WELDING 682-7777 1000 JAN 1, 1989 100 XYZ WELDING 682-7654 1005 JAN 12, 1989 100 XYZ WELDING 682-7777 1005 JAN 12, 1989 100 XYZ WELDING 682-7654 1007 JAN 15, 1989 100 XYZ WELDING 682-7777 1007 JAN 15, 1989 100 XYZ WELDING 682-7654 1009 FEB 2, 1989 100 XYZ WELDING 682-7777 1009 FEB 2, 1989 101 ABC CARS 778-1007 1001 JAN 2, 1989 101 ABC CARS 778-0999 1001 JAN 2, 1989 101 ABC CARS 778-1007 1002 JAN 5, 1989 101 ABC CARS 778-0999 1002 JAN 5, 1989 101 ABC CARS 778-1007 1003 JAN 6, 1989 101 ABC CARS 778-0999 1003 JAN 6, 1989