Truthiness in Data Modeling
Steve Hoberman wrote an interesting article in the June 2006 DMReview. He runs the design challenge, an email-based community of data professionals that tries to come up with answers to difficult ETL and database-related problems. Design challenge #14 was somewhat interesting, and worth a look if you are so inclined. It deals with the difficulty of generating a unique identifier for an individual in a diverse data environment. For example, in an academic environment, the same person might be both an instructor and a student (think about the army of graduate students who will be TAs for Psych 101 at any large institution).
To be honest, I was not particularly impressed with the solutions that were suggested. Working with ETL and data modeling, I see this situation on a regular basis. Many of the suggestions were overly complicated and more work than they were worth. However, there was a real gem in the discussion about the question itself:
As analysts and modelers, we continuously find ourselves asking the business, “Why?” For example, “Why do you need this report?” or “Why do you enter this information in three places?” Emma Fortnum, application architect, rightly asked why for this challenge as well. “What I would do is question if the university really needs a unique Person ID across all the roles. Do they really need to know that Bob the student is the same as Bob the instructor?” A holistic view of a person should only be created if there is significant business value. An organization-wide program such as a BI initiative or an enterprise resource planning implementation is usually the driver of the need for holistic view.
Hoberman (and Fortnum) are right on the mark here. My guess is that the requirement for a single ID per person came out of an ivory tower or a management meeting, and from someone who hasn’t had to deal with the hands-on design and implementation of a process in many years, if ever.
All too often, the highly touted, objective “Single Version of the Truth” is a concept that is simple, probably too simple for our subjective reality. Much as technologists abhor it (and consultants are willing to bill hours to fight it), we live in a world where truthiness reigns, and we are the better for it. The “truth” for a given situation will be derived from subjective elements of data by subjective people with subjective interests from subjective data in a subjective model. Using an objective view of truth in the data warehouse is like using the movie Highlander as a model for your book club discussion. In the end it may work for whoever is left standing, but the process is long and bloody and ultimately isn’t worth the price.

March 21st, 2007 at 9:49 am
Concerning the use of a single ID per person:
This is not so much an ivory tower concept as something born of experience: if you omit the single ID per person, you will almost inevitably, and unfortunately, find later that there _is_ a business requirement for such an ID. And while solving the business requirement is trivial if the single ID is included in the initial design, it is often difficult to add the single ID later. Here’s a real-world example: a police records management system containing categories such as employees, civilians, police officers, suspects, felons, and victims. Unless you have the single ID, you can’t easily answer a question such as “How many police officers were convicted of a felony?” or “Has this potential employee ever been a suspect in a crime?”
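To make the police records example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema is purely an assumption for illustration: the table names (person, police_officer, felony_conviction), the column names, and the sample rows are all hypothetical and not taken from any real system. It only shows that once each role table carries the shared person_id, the first question reduces to a single join.

```python
import sqlite3

# Hypothetical schema: one shared person table plus one table per role.
# Every table name, column name, and row below is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        full_name TEXT NOT NULL
    );
    CREATE TABLE police_officer (
        badge_no  TEXT PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person(person_id)
    );
    CREATE TABLE felony_conviction (
        case_no   TEXT PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person(person_id)
    );
""")

conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(1, "Alice Example"), (2, "Bob Example"), (3, "Carol Example")],
)
conn.executemany(
    "INSERT INTO police_officer VALUES (?, ?)",
    [("B-100", 1), ("B-200", 2)],
)
# Person 2 has two convictions; person 3 has one but is not an officer.
conn.executemany(
    "INSERT INTO felony_conviction VALUES (?, ?)",
    [("C-1", 2), ("C-2", 2), ("C-3", 3)],
)


def count_officers_with_felony(db):
    """Count distinct people who are both officers and convicted felons."""
    row = db.execute("""
        SELECT COUNT(DISTINCT o.person_id)
        FROM police_officer AS o
        JOIN felony_conviction AS f ON f.person_id = o.person_id
    """).fetchone()
    return row[0]


officers_convicted = count_officers_with_felony(conn)
print(officers_convicted)  # 1: only person 2 appears in both role tables
```

Without the shared person_id, the same question would need some form of name or attribute matching between the two role tables, which is where the difficulty described above tends to show up.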
While it may seem counterintuitive, experience has shown that using such IDs usually makes solutions _easier_ rather than harder. So I will usually include such an ID if there is any reason it might prove useful in the future. The cost of the ID is low and the potential savings in future development costs are very high. And as you state, the need for such IDs usually arises in systems with a very large scope of application. But these are valid applications.
As for there being no single version of the truth, you are spot on. And the Highlander metaphor is extremely apt: I think I’ll borrow it if you don’t mind.