Indexing in OpenInsight - Part 1 - What is indexing?
Published 05 NOV 2009 at 12:04:00PM by Sprezz
A discussion on the Revelation Forum reminded us that we haven’t reviewed our public documentation of how Revelation indexing works in over 20 years. Whilst the RevMedia article on our website remains largely accurate we thought that this provided an excellent opportunity for us to revisit the subject with our improved understanding.
The first question to ask is “what is an index in OpenInsight”?
Well to answer this strictly we’d have to distinguish between Btree/Xref indexes and Relational indexes as they both are very different in structure. A Relational index by its very nature is just an inverted index (http://en.wikipedia.org/wiki/Inverted_index) and so it easy to maintain - a read of the inverted row, an insertion and a write of the inverted row. A Btree/Xref on the other hand (the two being identical in index structure, the difference being that an Xref is a Btree on sub strings of a column as opposed to a Btree on the entire column) requires a lot more maintenance. For details of how BTree indexes work see http://en.wikipedia.org/wiki/B-tree. The insertion of a new or revised entry into a Btree index could generate a large amount of disk i/o as the tree is rebalanced and the operation is a costly one in resource terms.
When Revelation’s indexing was first introduced, hardware was a lot slower than it is today. When saving a single row it used to be possible to see the disk latency. You’d hit the save key then wait whilst the disk was updated. This was made acceptable in a data entry environment by adding a keyboard type ahead buffer (one of the reasons that Revelation used to ship with a utility to extend the keyboard typeahead buffer from the default 16 to the humungous 256 bytes required by touch typers). However forcing the user to wait whilst the BTree indexes associated with the table were updated would have introduced a completely unacceptable lag.
So the architects of Revelation came up with the idea of not updating the indexes at the point of saving. Rather the fact that the indexes needed to be updated was recorded in a series of transactions rows in what came to be known as the “Bang table” (the separate table named after the source table but prefixed with an exclamation mark) and these were either flushed by idle workstations or by any operations which required the indexes to be up to date. This way the resource intensive balancing operation was pushed off onto non data entry personnel and throughput was maintained.
In our next article we'll look at the way that index transactions get to be in the bang table in the first place.