Repeating (Multivalued) Data in Advanced Revelation SQL: Features and Characteristics

Published By	Date	Version	Knowledge Level	Keywords
Revelation Technologies	06 AUG 1989	2.X	NOVICE	ASSOCIATED, MULTIVALUES, KEYQUAL, MULTIVALUE, NORMALIZED, REPEATING, SQL, TABLES

Orthodox texts on relational database theory and practice state flatly that 'relational database theory does not allow for repeating groups'. A repeating group is defined as 'a column that contains sets of data values instead of just one value in each row'.

No rationale is given for the exclusion of this concept from the relational model. On the other hand, users know from experience that 'repeating groups' referred to in Advanced Revelation as multivalued fields are extremely useful and natural in modeling real world situations. Experience has also shown that dealing with multivalued data can be extremely difficult when applied to the realities of queries.

Advanced Revelation SQL provides an alternate SQL query and data manipulation interface for "native" Revelation files (such as Linear Hash files) without restriction or modification of the file structures as defined in the data dictionary. The SQL language interface imposes a new view of the data that adheres to the standard definition of SQL while providing a consistent solution to entering and querying data from a file that contains multivalued fields.

The Advanced Revelation SQL solution to handling repeating data is to automatically normalize the data, making it appear as if every value has its own row. This has always been done within Advanced Revelation for sorting by multivalues and is usually called 'exploding'. The result of the normalization process appears as a 'first-normal form' representation of the data.

(The distinction between an Advanced Revelation record and an SQL row can now be defined more precisely. A record denotes the conventional unit of data storage and may contain multivalued data. If a record contains no multivalues, it is equivalent to a row. If a record contains multivalued data, that record may represent n rows where n is the largest number of values held by any field in the record.)

First normal form is almost never used in database design because it typically involves redundant storage of data. First normal form is generally superior to higher normal forms in terms of performance, but the problems of storage space and data consistency usually outweigh the advantages. The application developer is almost always forced to write complex and slow routines to guarantee that non-repeating data is consistent between associated rows.

As an example of first normal form, assume a simple employee table. The employee has a number, name, address, job number, and list of dependents, including age and sex. Each employee may have multiple dependents. The non-repeating data is (should be) exactly the same for each row. Only the columns for dependent information dependent name, age, sex, and relationship should vary. The structure for such a table represented in its normalized format is illustrated in Figure 1.

Data tables with repeating columns are usually created within Advanced Revelation by setting the multivalued attribute on a column in the dictionary. The Advanced Revelation implementation of the SQL CREATE TABLE syntax also allows for creation of data tables with repeating columns.

The last attribute that may be applied to a column in the CREATE TABLE syntax is the keyword REPEATING. This will set the multivalued flag in the dictionary entry for that column. The term "repeating" was chosen as being more commonly used in relational literature.

For example, a table that encompasses the example mentioned above can be created in Advanced Revelation using this SQL CREATE TABLE statement:

CREATE TABLE EMPLOYEES
  (EMP_NO INTEGER PRIMARY KEY,
   EMP_NAME VARCHAR NOT NULL,
   ADDRESS VARCHAR,
   JOB_NO INTEGER NOT NULL REFERENCES JOB_TABLE,
   DEP_NAME VARCHAR REPEATING,
   DEP_AGE INTEGER REPEATING,
   DEP_SEX CHAR(1) REPEATING)

Note the use of the Advanced Revelation SQL keywords 'REPEATING' to indicate the creation of multivalued columns.

Note: New development targeted specifically for SQL should probably avoid using this feature inasmuch as it is neither portable to other implementations of SQL nor compatible with the ANSI standard for SQL.

Whenever a table has one or more repeating columns, an additional key part column 'KEYQUAL' is 'created'. The pseudo-column KEYQUAL may appear in a Selection clause or WHERE clause. It may be used to represent the key-qualifier (value number) of the row in question. In a sense, the explicit key specifies the record, and the key-qualifier KEYQUAL identifies the row. Refer again to Figure 1.

This column is of the data type INTEGER and may be queried and assigned like any other column. Currently, however, it is not projected in '*' selects.

Data can be inserted into repeating columns using the KEYQUAL keyword to indicate the correct placement of the data within the group of values. To keep the use of repeating data in SQL from being overly cumbersome, a few convenience features appear in the system. When inserting new rows into a table:

if the table contains repeating rows, and
the data being inserted includes any of those repeating rows, and
if the key-qualifier is not specified in the VALUES clause,

then repeating data will be appended to the data currently in that/those columns. This obviates the need to constantly keep track or query to find the next available value in that record. If the record currently exists, then all non-repeating data that appears in the VALUES clause must match the data currently in the record. If values for non-repeating columns in the INSERT statement do not match the values already in the record, an error results and the INSERT statement fails.

When using the INSERT statement, values for any non-repeating columns not specified in the VALUES clause will be 'brought forward' into the new row. Only the repeating columns need to be inserted when adding a new row to an existing 'group'.

For instance, if the employee in the example table added a new dependent, it would be necessary only to use a statement such as this one:

INSERT INTO EMPLOYEES
  (EMP_NO, DEP_NAME, DEP_AGE, DEP_SEX )
VALUES (100, Mark Jr., 0, M)

This example demonstrates both automatic key qualifier assignment and data forwarding.

An advantage of this technique is system-enforced data consistency validation for INSERT statements. It would normally be up to either operator diligence or application programs to assure that non-repeating data entered into a normalized 'group' was consistent. This is probably one of the greatest reasons that this format of data structuring is rarely used in most systems, it is cumbersome and unreliable. However, Advanced Revelation guarantees that data inserted into non-repeating groups always matches the data already in the record.

This apparent first-form normalization gives the illusion of squandering a great deal of storage space on redundant data. However, Advanced Revelation 'factors out' repeating data and stores it only once, only expanding or 'normalizing' that data when required for extraction.

To update repeating columns, you must specify the KEYQUAL keyword, or the UPDATE operation will alter all rows in that record. This is how one would understand that UPDATE statements should perform if the table were truly normalized.

You cannot delete data from non-associated repeating rows using the DELETE statement. DELETE (unlike SELECT, INSERT, and UPDATE) does not include any concept of projection. UPDATE could be used to achieve the same effect as deleting a row in a repeating record while sparing non-associated columns, but with some loss of elegance.

For example, suppose you wish to remove the second phone number for customer 100 in Figure 2. This statement would also delete the reference to invoice 1005 in the INV_NO column.

DELETE FROM EMP
  WHERE EMP_NO = 100
  AND KEYQUAL = 2

However, this statement would eliminate the logical row from the view of the table (CUST_NO, CUST_NAME, PHONE):

UPDATE EMP
  SET PHONE = NULL
  WHERE CUST_NO = 100 AND KEYQUAL = 2

Thereafter, a query such as this one:

SELECT CUST_NO, CUST_NAME, PHONE
  FROM CUSTOMERS

would not even show a NULL row for the second item. In that view of the table, the row associated with keyqual = 2 would not exist.

If multiple associated repeating columns are involved (such as an additional PHONE_DESCRIPTION field), the values of all such associated repeating columns would have to be set to NULL for the row to be logically deleted.

With the record taken as a whole, all multivalued columns are assumed to be associated. This representation problem can in most cases be avoided by following the policy of projecting only associated repeating columns into a view of the table. Non-associated columns may be introduced by self-joining another view of the table including only the other associated repeating columns.

For example, assume a customer table with a repeating invoice column and a non-associated repeating phone column. The structure of this file is shown in Figure 2.

You want a report of each invoice with some detail information from the invoices file, and with each invoice report you also want a listing of all phone numbers for that customer:

SELECT A.CUST_NO, A.CUST_NAME, B.PHONE, A.INV_NO, C.SHIP_DATE
  FROM CUST A, CUST B, INV C
  WHERE A.CUST_NO = B.CUST_NO
  AND A.CUST_NO = C.CUST_NO

The resulting report is shown in Figure 3. For presentation of the report, duplicate data would desirably be eliminated. The point is that repeating data even non-associated repeating data can be presented as normalized rows.

When querying tables with repeating (multivalued) columns, SQL can produce more intuitive and reliable results than R/LIST! For the following example refer to the table in Figure 1.

In the R/LIST view of this file, it contains two records with a set of logically associated multivalued fields. If you wish to find all of the employee dependents who are male and over 10 years old you might use this R/LIST query:

LIST EMP EMP_NO EMP_NAME DEP_NAME DEP_AGE WITH DEP_SEX = M AND WITH
  DEP_AGE > 10

The result would be a listing of all of the dependents for all of the employees who have a male dependent and who have a dependent who is older than 10. In this case you would get a listing that includes all of Mark Halverson's dependents, none of whom meet the criteria and all of Susan Santi's dependents, only one of whom meets the criteria.

However, the SQL query:

SELECT EMP_NO, EMP_NAME, DEP_NAME, DEP_AGE
  FROM EMP
  WHERE DEP_SEX = 'M' AND DEP_AGE > 10

returns the row where EMP_NO = 1 and KEYQUAL = 2. SQL provides the necessary 'horizontal association' that R/LIST does not. This is due to SQL's requirement of first normalizing the table before performing any query operations.

The Advanced Revelation implementation of a repeating data construct, while in apparent violation of a formal definition of the relational model, remains a feasible and useful construction, even under SQL. Repeating data can be represented as if it had been normalized to a first-normal form. Extensions to the syntax of SQL statements allow Advanced Revelation full schema, query, and update access to repeating data.

In addition, Advanced Revelation SQL access to repeating data offers a variety of advantages overall in the maintenance and manipulation of data. Among these are:

access to existing multivalued data
comparisons against each element of a multivalued field
automatic data consistency validation during SQL INSERT operations
automatic data forwarding during SQL INSERT operations
automatic data 'compression', insofar as data is actually stored in its most compact form while being represented in the redundant, normalized form

It is worth noting that databases with lower forms of normalization are generally superior in performance to higher normalized structures. The more wasteful storage requirements and consistency requirements that must typically be hand-programmed deter the use of first-normal form. However, Advanced Revelation's use of repeating data in columns allows developers to access the features of first-normal form structure while avoiding its shortcomings, which include disk accesses and join overhead.

Figure 1

EMP_NO KEYQUAL EMP_NAME         ADDRESS         JOB_NO  DEPND Age Sex
------ ------- ----------------- --------------- ----- ------ --- ---
100    1       Mark Halverson   102 N. Pine      1023  Sarah  2   F
100    2       Mark Halverson   102 N. Pine      1023  Mark   7   M
100    3       Mark Halverson   102 N. Pine      1023  Pam   11   F
101    1       Susan Santi      33 Elm St.       1055  Jim    8   M
101    2       Susan Santi      33 Elm St.       1055  Jared 12   M

Figure 2

CUST_NO CUST_NAME     INV_NO PHONE
------- -----------   ------ ----------
100     XYZ WELDING   1000   682-7654
                      1005   682-7777
                      1007
                      1009
101     ABC CARS      1001   778-1007
                      1002   778-0999
                      1003

Figure 3

A.CUST_NO A.CUST_NAME      B.PHONE    A.INV_NO C.SHIP_DATE
--------- ---------------- ---------  -------- -----
100       XYZ WELDING      682-7654   1000     JAN 1, 1989
100       XYZ WELDING      682-7777   1000     JAN 1, 1989
100       XYZ WELDING      682-7654   1005     JAN 12, 1989
100       XYZ WELDING      682-7777   1005     JAN 12, 1989
100       XYZ WELDING      682-7654   1007     JAN 15, 1989
100       XYZ WELDING      682-7777   1007     JAN 15, 1989
100       XYZ WELDING      682-7654   1009     FEB 2, 1989
100       XYZ WELDING      682-7777   1009     FEB 2, 1989
101       ABC CARS         778-1007   1001     JAN 2, 1989
101       ABC CARS         778-0999   1001     JAN 2, 1989
101       ABC CARS         778-1007   1002     JAN 5, 1989
101       ABC CARS         778-0999   1002     JAN 5, 1989
101       ABC CARS         778-1007   1003     JAN 6, 1989
101       ABC CARS         778-0999   1003     JAN 6, 1989

Repeating (Multivalued) Data in Advanced Revelation SQL: Features and Characteristics

Repeating Data, Auto-normalization, and First Normal Form

Creating Repeating Data Tables

Querying and Manipulating Repeating Data Tables

Inserting Data into Repeating Data Tables

Automatic Data 'Forwarding'

Automatic Data Consistency Validation

Automatic Data Compression

Updating Data in Repeating Data Tables

Deleting Data from a Repeating Data Table

Associated vs. Non-associated Multivalues

Accuracy of Query Results

Conclusions

Examples

Figure 1

Figure 2

Figure 3

Revelation On-Line Wiki