Duplicate records and digital signatures - All Platforms Specific (OpenInsight 32-bit Specific)
At 21 OCT 2002 03:25:47PM Oystein Reigem wrote:
A problem that many of my clients have is duplicate records in their tables. What I mean is two or more records that represent the same entity in the real world, and really should be one single record. A typical example in my kind of apps is a table of people, containing records about all the various people related to the artefacts and historical photos in the museums' collections. Often one and the same person is entered into the table more than once. That can happen because of incompetence on the part of the person doing data entry, but sometimes there simply isn't enough available information to settle the question of identity, at least at the time of data entry. One special case producing duplicate records is the merging of two databases from the same museum, or from neighbouring museums.
During the time I have spent with museum apps I have made some tools and features to help avoid getting duplicate records into the database, and to help merge duplicates that have made their way into the database. For my new app I'd like to make some better tools.
Let's consider identical records first.
Duplicate records may not always have identical content, but often they do. When the content of two records is identical it's usually safe to assume the records represent the same entity.
In my new set of tools I need some functionality that can take a new record, e.g. a record to be imported from a different database, and search the table for identical records. But searching for identical records can be slow. The search might involve many fields and/or indexes, and/or much sequential processing. So I thought of the idea of having a symbolic field with a digital signature of the record. I hope that a properly designed, not too long signature could distinguish between different content with high probability. An index on the symbolic would speed up the search. A final check on the content of the records found would weed out any false hits.
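Editor's sketch of the scheme described above, in Python rather than OpenInsight code: a digest over the whole record, an index on that digest, and a final content comparison to weed out false hits. The names `record_digest`, `build_index`, `find_identical`, the separator character, and the sample data are all hypothetical, and SHA-1 is only one possible choice of hash function.

```python
import hashlib

# Assumed separator: any character that cannot occur inside field data.
FIELD_SEP = "\x1f"

def record_digest(fields):
    """Return a hex digest computed over all field values, in order."""
    joined = FIELD_SEP.join(fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

def build_index(table):
    """Map digest -> list of record keys, standing in for an index on the symbolic field."""
    index = {}
    for key, fields in table.items():
        index.setdefault(record_digest(fields), []).append(key)
    return index

def find_identical(new_fields, table, index):
    """Look up candidates by digest, then compare content to discard false hits."""
    candidates = index.get(record_digest(new_fields), [])
    return [key for key in candidates if table[key] == new_fields]

# Made-up sample table of people: key -> [surname, first name, birth year]
people = {
    "P1": ["Hansen", "Ole", "1871"],
    "P2": ["Berg", "Anna", "1902"],
    "P3": ["Hansen", "Ole", "1871"],
}
index = build_index(people)
matches = find_identical(["Hansen", "Ole", "1871"], people, index)
```

The separator matters: without it, the field lists `["ab", "c"]` and `["a", "bc"]` would produce the same digest.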
Then there's the case of duplicates with differing content.
When two records are different some fields might be more important than others for settling the question of identity. E.g., if the records of two persons have identical values in their name and birth year fields, the chances are good they are the same person. Such assumptions might not be good enough to base an automatic process on, but could be used in an interactive tool where the final decision is left to the user. Anyway - having a digital signature on a subset of fields might help speed up the search for potential duplicates, although I admit that if the subset is small there's not as much to gain as when comparing the whole record.
Of course when comparing field values for equality one should sometimes allow for variation in punctuation or spelling. This would normally spell (ugh) doom for the digital signature strategy. But such variation might be possible to "normalize" in a symbolic field, and then one could base the signature on the symbolic instead of the source field.
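Editor's sketch of the normalize-then-hash idea, again in Python as a stand-in. The normalization rules (fold case, drop diacritics and ASCII punctuation, collapse whitespace) and the field names `name` and `birth_year` are assumptions chosen for illustration. A real application would need rules suited to its own data.

```python
import hashlib
import string
import unicodedata

def normalize(value):
    """Fold case, strip combining diacritics and ASCII punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = no_marks.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.casefold().split())

def identity_digest(record, fields=("name", "birth_year")):
    """Digest over the normalized values of a chosen subset of fields."""
    parts = [normalize(record.get(f, "")) for f in fields]
    return hashlib.sha1("\x1f".join(parts).encode("utf-8")).hexdigest()

# Made-up records: a and b differ only in spelling details, c in birth year.
a = {"name": "Müller, J.", "birth_year": "1850"}
b = {"name": "MULLER  J", "birth_year": "1850"}
c = {"name": "Müller, J.", "birth_year": "1851"}

same_person_candidate = identity_digest(a) == identity_digest(b)
different_year = identity_digest(a) != identity_digest(c)
```

Equal digests here only flag potential duplicates. As the post says, the final decision would still be left to the user.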
Now I'd like to have your comment on my idea of using digital signatures, some guidance on how to make them if you think the idea is good, or other ideas for solving the problem with duplicate records.
- Oystein -
At 21 OCT 2002 04:37PM Donald Bakke wrote:
Oystein,
It sounds like you've thought a lot of this out already. I think we've done something similar (but not nearly as complex) by writing the digital signature as a key to a look up table. That way we eliminate index searches. However, index searches can make the lookup more flexible I suppose.
At 21 OCT 2002 05:43PM The Sprezzatura Group (http://www.sprezzatura.com) wrote:
Oystein,
Herein lies the focus of the debate whether to use meaningful or meaningless keys in a table. Meaningful keys are sometimes awkward to alter or handle, but prevent data duplication. Meaningless keys are easier to handle, but add the risk of duplication of data attached to the key.
The digital signature idea is only really necessary for apps where the data design is flawed, or the data is attached to meaningless keys or deficient keys. As you observe, sometimes this is unavoidable.
A hash number (= digital signature) is sometimes used across the numeric data, or for text, the consonant data (after perhaps passing through a SEQ() function to return a numeric which is then factored). Records with high common denominators are then likely to correlate highly. Be prepared to work hard to avoid scientific notation conversion for very large numeric factors.
Sometimes statistical correlation co-efficients can be applied from a batch process to identify likely duplicates. Symbolics are very useful in this setting.
The other area to consider is the output stage or the usage of the data. A classic case is the one of the internet search engine. If a search engine returns multiple duplicate hits for similar content it may not really matter, as long as we get something of relevance and the number of returned items for the search is small enough. It may be the same with your own system. Duplication might be able to be ignored.
Of course, this problem is not new. Barcoding attempts to handle this problem. ISBN book numbers attempt to handle this problem. Car licence plates attempt to handle this problem. ID numbers on plant and equipment for purposes of depreciation are common to most stock control systems.
Clear unambiguous identification. Same problem, time after time.
As comedian Peter Cook once said, "They should *(%^%$ label things"
Steve
World Leaders in all things RevSoft
At 23 OCT 2002 08:32AM Oystein Reigem wrote:
When I talked about a "digital signature" I might have used the wrong term. To quote from a page at : "The digital signature of a document is a piece of information based on both the document and the signer's private key. It is typically created through the use of a hash function and a private signing function (encrypting with the signer's private key)…". But in my case there's no signer - or other entity - whose identity needs to be encrypted in a "signature". I simply want to run the record content through a hash function. The same page calls the result a "message digest" ("apply a hash function to the message, creating what is called a message digest").
The hash functions that I have read about produce a binary result - a string of bits. The string might e.g. be 128 or 160 bits long.
In my case I need the final result to be valid characters, not any bit pattern. The byte values I must keep away from are ASCII NUL, perhaps the other control characters too, and the delimiters. Also I think I should stay away from high-bit byte values in general, so as not to produce character strings that are not valid UTF-8. So I'm left with the 96 byte values from Char(32) to Char(127). For simplicity I might settle for a range of 64 byte values - Char(64) to Char(127). Then I can simply chop my bit string into pieces 6 bits long and make each piece into a valid character by adding 64.
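Editor's sketch of the 6-bit encoding described above, using Python's `hashlib` SHA-1 (160 bits) as a stand-in hash. The function name `printable_digest` and the `offset` parameter are hypothetical. One caveat: Char(127) is the DEL control character, so the 64 to 127 range is not entirely free of control characters. A different offset, for example 48 (giving Char(48) to Char(111)), would avoid it.

```python
import hashlib

def printable_digest(data, offset=64):
    """Hash bytes with SHA-1 and encode the 160 bits as 6-bit characters."""
    digest = hashlib.sha1(data).digest()
    total_bits = len(digest) * 8            # 160 bits for SHA-1
    n_chars = (total_bits + 5) // 6         # round up: 27 characters
    bits = int.from_bytes(digest, "big")
    bits <<= n_chars * 6 - total_bits       # pad on the right with zero bits
    chars = []
    for i in range(n_chars):
        shift = (n_chars - 1 - i) * 6
        piece = (bits >> shift) & 0x3F      # one 6-bit piece, value 0..63
        chars.append(chr(piece + offset))   # shift into the chosen character range
    return "".join(chars)

sig = printable_digest("Hansen\x1fOle\x1f1871".encode("utf-8"))
sig_no_del = printable_digest("Hansen\x1fOle\x1f1871".encode("utf-8"), offset=48)
```

The result is a fixed-length string of single-byte characters, so it would be valid UTF-8 and could serve as an indexed value or a key, assuming none of the resulting characters are delimiters in the target system.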
But what about the hash function itself? Can somebody supply me with one? Or can I use a function in the Windows API?
- Oystein -
PS: To quote from somewhere else (): "The W3C Digital Signature Working Group ("DSig") proposes a standard format for making digitally-signed, machine-readable assertions about a particular information resource." Now isn't this somebody we know?
At 23 OCT 2002 11:49AM David Tod Sigafoos wrote:
Although you could come up with some sort of bit-rotation-inversion-nuclear smudging ..
Why not use a tool, like you were speaking of, and simply store the document on disk and reference the location?
and as for your other comment :
I refuse to take the bait
David Tod Sigafoos ~ SigSolutions
Phone: 971-570-2005
OS: Win2k sp2 (5.00.2195)
OI: 4.1
At 23 OCT 2002 02:46PM Oystein Reigem wrote:
DSig,
"Although you could come up with some sort of bit-rotation-inversion-nuclear smudging .."
That's exactly what I hoped you could come up with.
"Why not use a tool, like you were speaking of, and simply store the document on disk and reference the location?"
I suspect you didn't read my postings properly. Can I hire you as a hash function?
"and as for your other comment : I refuse to take the bait"
But they stole your name!!
- Oystein -