June 13th, 2006 by morgan
Steve Hoberman wrote an interesting article in the June 2006 DMReview. He runs the Design Challenge, an email-based community of data professionals that tries to come up with answers to difficult ETL and database-related problems. Design Challenge #14 was somewhat interesting, and worth a look if you are so inclined. It deals with the difficulty of generating a unique identifier for an individual in a diverse data environment. For example, in an academic environment, the same person might be both an instructor and a student (think about the army of graduate students who will be TAs for Psych 101 at any large institution).
To be honest, I was not particularly impressed with the solutions that were suggested. Working with ETL and data modeling, I see this situation on a regular basis. A lot of the suggestions seemed overly complicated and more work than they were worth. However, there was a real gem in the discussion about the question itself:
As analysts and modelers, we continuously find ourselves asking the business, “Why?” For example, “Why do you need this report?” or “Why do you enter this information in three places?” Emma Fortnum, application architect, rightly asked why for this challenge as well. “What I would do is question if the university really needs a unique Person ID across all the roles. Do they really need to know that Bob the student is the same as Bob the instructor?” A holistic view of a person should only be created if there is significant business value. An organization-wide program such as a BI initiative or an enterprise resource planning implementation is usually the driver of the need for holistic view.
Hoberman (and Fortnum) are right on the mark here. My guess is that the requirement for a single ID per person came out of an ivory tower or a management meeting, and from someone who hasn’t had to deal with the hands-on design and implementation of a process in many years, if ever.
All too often, the highly touted, objective “Single Version of the Truth” is a concept that is simple, probably too simple for our subjective reality. Much as technologists abhor it (and consultants are willing to bill hours to fight it), we live in a world where truthiness reigns, and we are the better for it. The “truth” for a given situation will be derived from subjective elements of data by subjective people with subjective interests from subjective data in a subjective model. Using an objective view of truth in the data warehouse is like using the movie Highlander as a model for your book club discussion. In the end it may work for whoever is left standing, but the process is long and bloody and ultimately isn’t worth the price.
technorati tags:information quality, data quality, etl, automation
Posted in Databases, ETL, Information Architecture, Systems Integration, Information Quality, Automation, People, Practices, Transformation, Relationships | 1 Comment »
June 12th, 2006 by morgan
Just spotted this article on a $45M processing error made by ANZ Bank. Evidently, an ETL process was accidentally run twice. According to the bank’s spokesman:
I think what’s happened is the tape containing all of these transactions was run twice. We are just trying to get the bottom of why that’s happened but we suspect there’s some sort of human error involved.
This absolutely floors me, for several reasons. I find it hard to fathom that a mission-critical process:
- Was being run manually.
- Did not prevent human error from affecting the overall system.
Now, I realize that bank transactions are inherently more complicated, as they move things closer to real-time processing. Also, I do give credit to ANZ for addressing the problems quickly. I am sure they are much chagrined and are looking at things very closely to see how this happened. Still, this is a big deal.
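To illustrate the second point, here is a minimal sketch of one way a load step can refuse to apply the same input twice. Nothing here reflects how ANZ's systems actually work: the table names, the `account,amount_cents` file layout, and the functions `init_schema` and `load_once` are all hypothetical, and SQLite simply stands in for whatever database is in use. The idea is to record a fingerprint of each batch in the same transaction that applies it, so a rerun fails on the primary key instead of posting everything again.

```python
import hashlib
import sqlite3


def init_schema(conn):
    """Create a hypothetical ledger and a table that remembers loaded batches."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ledger (account TEXT, amount_cents INTEGER)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS processed_batches "
            "(fingerprint TEXT PRIMARY KEY)"
        )


def load_once(conn, payload):
    """Apply a batch of 'account,amount_cents' lines at most once.

    Returns True if the batch was applied, False if it was loaded before.
    """
    # Identify the batch by its content, not by its file name or run time.
    fingerprint = hashlib.sha256(payload).hexdigest()
    rows = [
        (account, int(cents))
        for account, cents in (
            line.split(",") for line in payload.decode().splitlines() if line
        )
    ]
    with conn:  # one transaction: the marker and the rows commit together
        try:
            conn.execute(
                "INSERT INTO processed_batches (fingerprint) VALUES (?)",
                (fingerprint,),
            )
        except sqlite3.IntegrityError:
            # Primary key violation: this exact batch was already loaded.
            return False
        conn.executemany("INSERT INTO ledger VALUES (?, ?)", rows)
    return True


conn = sqlite3.connect(":memory:")
init_schema(conn)
tape = b"ACC-1,4500\nACC-2,-4500\n"
print(load_once(conn, tape))  # first run applies the batch
print(load_once(conn, tape))  # accidental second run is skipped
```

A content hash only catches an identical rerun. A real process would probably also want batch sequence numbers or control totals, but even a guard this small turns a double run into a non-event.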
The next time you are thinking about the costs of a project to improve ETL, automation, or information quality, think about how much it is worth to your organization to avoid embarrassment like this.
BTW, it might seem like I am picking on non-US organizations lately, but please let me assure you that this isn’t the case. It might just be that this gets written about more internationally.
technorati tags:information quality, data quality, etl, automation
Posted in ETL, Information Architecture, Information Quality, In the News, Automation | No Comments »
June 9th, 2006 by morgan
A lot of my recent work has been in real-time (actually near real-time) data warehousing. There are some real challenges for ETL and information quality when moving towards a real-time environment. Everything seems to become more difficult, and at times the constraints become almost unbearable to work with. You really, really, really need a real-time system in order to justify building one, especially from a data-centric point of view.
What got me writing about this was reading an article reporting that “some Cingular subscribers endure 4-hour-plus outage” (and the fact that this isn’t the first time this has happened). I knew exactly what the Cingular representative was talking about when I read this quote …
“There’s a database that has all the customer numbers and somehow, we don’t know why at this point, about 10 percent (of customers in the area) were prohibited from making or receiving calls,” Merriman said.
The big issues around real-time systems are in dealing with emergence within the system. Things get into an unexpected state, and it is very difficult to figure out why, especially after the fact. This is because when you are running in real time:
- Resources are at a premium, and often this means that only enough data is kept in order to process what is available right now.
- Data handling is set up to ensure that the system doesn’t break, not to ensure optimal quality.
- Downtime usually means that no information is coming in, so it is very hard to know what you don’t know.
- Breakage is normally catastrophic, and the priority is on getting things going again, not performing detailed analysis of what happened.
Because you have a lot less information than in a batch-processing system, it is a lot harder to figure out what is going on. Good luck to the Cingular engineers in preventing this type of thing in the future.
technorati tags:data quality, information quality, real time, etl
Posted in ETL, Systems Integration, Information Quality, Automation | No Comments »
May 29th, 2006 by morgan
There is a heartbreaking story that really demonstrates the sometimes all-too-high cost of poor information quality in the real world. The article IT Integration: The Army’s Pay Misstep discusses the problems that the US Army Reserve has had in paying its people properly and the impact that it has on real people, especially wounded soldiers and their families.
Like so many IQ-related stories, this one has a bit of everything:
- Organizations Outpacing Systems
- Legacy Applications
- Manual Data Entry
- Complex Business Logic
- Regulatory Compliance
- Technical and Process Wizards Keeping Everything Running
The sad thing is, often it is the individual service members who end up paying the price. Something worth considering around Memorial Day.
technorati tags: information quality, data quality, integration, case studies, automation, people, practices, data
Posted in Systems Integration, Information Quality, Case Studies, In the News, Automation, People, Practices | 1 Comment »
May 25th, 2006 by morgan
I was thinking about my previous article on metadata and would like to expand on some of those ideas. I think that for ETL we can generally break metadata down into two types:
- Referential Metadata is a maintained repository that describes the data or process that we are interested in.
- Inferential Metadata is derived from the environment from which the data or process was created and/or lives.
For example, imagine a dataset that has a full description of how it is created, its contents, formatting, use, and history. This information is stored in a central location (hopefully with the metadata for other files). This would be referential metadata.
Now, imagine the exact same dataset, created by an undocumented shell script that writes to a certain directory on a certain server that only the operations staff knows about. There is no referential metadata, so we can only describe it with inferential metadata. Unfortunately, in the real world (and especially with legacy applications) all too often the only metadata available is inferential metadata.
Now, it may sound like referential metadata is the only way to go if you are building a system, but I would disagree. If this is what someone is telling you, then they are most likely a salesperson for an ETL tool company or a consultant who is paid by the hour.
I would argue that any efforts around metadata should be evaluated on a cost/benefit basis, and that on that criterion you get the most bang for your buck with a combination of inferential metadata and standards-based programming practices.
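To show how cheap inferential metadata can be to collect, here is a rough sketch that describes a delimited file using only what the file system and the file itself reveal. The function name `infer_metadata` and the keys it returns are my own invention for illustration, and it assumes a comma-delimited file whose first row holds column names, which an undocumented legacy file may well not have.

```python
import csv
import time
from pathlib import Path


def infer_metadata(path):
    """Collect what the environment reveals about a delimited file.

    Assumes comma-delimited text whose first row is a header; both are
    guesses that a real legacy file may not satisfy.
    """
    p = Path(path)
    stat = p.stat()
    with p.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        row_count = sum(1 for _ in reader)  # data rows after the header
    return {
        "directory": str(p.parent.resolve()),
        "name": p.name,
        "size_bytes": stat.st_size,
        "modified": time.strftime(
            "%Y-%m-%dT%H:%M:%S", time.localtime(stat.st_mtime)
        ),
        "columns": header,
        "row_count": row_count,
    }
```

Pointing this at the directory that the shell script writes to, on a schedule, gives you a running history of file names, sizes, arrival times, and layouts without anyone having to maintain a repository by hand.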
More to come …
technorati tags: metadata, etl
Posted in ETL, Information Architecture, Systems Integration, Automation, Metadata | No Comments »