July 6th, 2006 by morgan
Something I have been thinking about a lot lately is what I like to call process meta-usability. I define this as:
Process meta-usability: The degree to which a process is able to be operated and understood by those who design, develop, execute, and troubleshoot it.
Process meta-usability is probably a subset of the field of usability. However, with most data-oriented processes (such as ETL and databases), the proof is in the pudding. That is, the people who are actually consuming the data have no idea what went into making it happen, and the old quote about sausage and politics probably applies.
This brings up some subtleties that we need to consider:
- Most people who design data-oriented processes either don’t know or don’t care about usability. They aren’t normally impacted by it at all.
- Most people who operate data-oriented processes are too busy to care about usability, but are greatly impacted by it.
- Most people who sponsor data-oriented processes don’t see the need to improve usability, and are impacted by it unknowingly.
To design for meta-usability, a data-oriented process needs to consider:
- Automation – CPU cycles, disk sectors and memory chips are cheap and reliable. Humans are not. Consider the cost of Moore’s Law vs. the cost of employee turnover for the life cycle of your process.
- Human intervention – Trap and deal with every possible situation in the design phase. Also, when a human does have to be involved, make it easy for them to understand what is going on at a glance. This probably means documentation, standards, naming conventions, log files, metadata, the works. This isn’t as hard as it sounds; there are some easy steps that you can take to make things run more smoothly for your friendly neighborhood operator.
- Understandability – A simple process is always better than a complicated one, and readable source is always better than faster code. Again, consider standards, naming conventions, and all that stuff that developers hate to consider.
- Communication – If everyone knows what is going on, it is a lot easier to make things happen. If a process that is supposed to be automated is actually taking someone 10 hours a week to keep running, then it is easier to justify spending 40 hours to fix the issue.
- Downtime – During the design phase, try to calculate the cost per incident of downtime for your data process. If you don’t have any hard numbers, consider the business opportunity cost plus the cost per hour of having an analyst, developer, DBA, and operator all on a conference call trying to figure out what is going on. Compare this with the cost of developing processes to avoid or mitigate downtime and see what side of the curve you want to be on.
- Staffing – Consider that for any data-oriented system (like feeding a shadow system, data mart, or data warehouse) the bottleneck is most often keeping the overall system running effectively. If it takes longer to resolve issues with individual processes, then fewer of them can be run at the same staffing level. This isn’t because of a technological limit, but because there literally aren’t enough hours in a day.
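The downtime and communication points above come down to simple arithmetic. Here is a minimal sketch in Python. The hourly rates, the opportunity cost, and the function names are all made-up placeholders for illustration, so substitute your own numbers.

```python
# All figures below are hypothetical placeholders, not numbers from a real system.
HOURLY_RATES = {"analyst": 60.0, "developer": 75.0, "dba": 80.0, "operator": 45.0}


def incident_cost(hours_on_call, opportunity_cost, hourly_rates=HOURLY_RATES):
    """Cost of one downtime incident: lost business plus everyone's time on the call."""
    staff_cost_per_hour = sum(hourly_rates.values())
    return opportunity_cost + hours_on_call * staff_cost_per_hour


def breakeven_weeks(fix_hours, weekly_babysitting_hours):
    """Weeks until a one-time fix pays for itself in saved operator hours."""
    return fix_hours / weekly_babysitting_hours


if __name__ == "__main__":
    # A three-hour conference call plus an assumed 5,000 in lost opportunity.
    print(incident_cost(hours_on_call=3, opportunity_cost=5000.0))
    # The 40-hour fix versus 10 hours a week of manual care and feeding.
    print(breakeven_weeks(fix_hours=40, weekly_babysitting_hours=10))
```

With the 10-hours-a-week example from the list, a 40-hour fix pays for itself in four weeks, before counting any avoided downtime.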
Posted in Databases, ETL, Information Architecture, Systems Integration, Information Quality, Automation, People, Metadata | No Comments »
May 25th, 2006 by morgan
I was thinking about my previous article on metadata and would like to expand on some of those ideas. I think that for ETL we can generally break metadata down into two types:
- Referential Metadata is a maintained repository that describes the data or process that we are interested in.
- Inferential Metadata is derived from the environment from which the data or process was created and/or lives.
For example, imagine a dataset that has a full description of how it is created, contents, formatting, use, and history. This information is stored in a central location (hopefully with the metadata for other files). This would be referential metadata.
Now, imagine the exact same dataset that is created by an undocumented shell script that writes to a certain directory on a certain server that only the operations staff knows about. There is no referential metadata, so we can only describe it with inferential metadata. Unfortunately, in the real world (and especially with legacy applications) all too often the only metadata available is inferential metadata.
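To make the idea of inferential metadata a little more concrete, here is a rough Python sketch of what can be gathered from the environment alone, with no repository involved. The function name and the particular fields are my own assumptions for illustration; a real process would pick whatever clues its environment actually offers.

```python
import socket
import time
from pathlib import Path


def inferential_metadata(path):
    """Collect what the environment alone can tell us about a dataset file.

    The set of fields is a hypothetical choice for illustration.
    """
    p = Path(path)
    st = p.stat()
    # Count lines in binary mode so odd encodings do not break the count.
    with p.open("rb") as fh:
        line_count = sum(1 for _ in fh)
    return {
        "host": socket.gethostname(),             # which server it lives on
        "directory": str(p.resolve().parent),     # where it was written
        "file_name": p.name,
        "extension": p.suffix,
        "size_bytes": st.st_size,
        "modified": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime)),
        "line_count": line_count,
    }
```

None of this says what the data means, but it does say where it lives, how big it is, and when it last changed, which is often all we have to go on.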
Now, it may sound like referential metadata is the only way to go if you are building a system, but I would disagree. If this is what someone is telling you, then they are most likely a salesperson for an ETL tool company or a consultant who is paid by the hour.
I would argue that any efforts around metadata should be evaluated on a cost/benefit basis, and that by that criterion you get the most bang for your buck with a combination of inferential metadata and standards-based programming practices.
More to come …
technorati tags: metadata, etl
Posted in ETL, Information Architecture, Systems Integration, Automation, Metadata | No Comments »
May 23rd, 2006 by morgan
I am currently working on a process to add some instrumentation to an existing legacy system. Not physical instrumentation, but conceptually similar. Consider the relationship of the speedometer or the tachometer to the engine of an automobile; I am doing the same thing for an ETL process.
Basically, we are trying to track the progress of data provided by manufacturing systems through the entire information architecture, from delivery to publishing. A textbook example of metadata creation. To be honest, it isn’t the most exciting work in the world, but it is at least interesting to dissect an existing process and come up with something useful. Most importantly, it is very useful to our customers, and this is the measurement I really care about.
Anyway, one of the big stumbling blocks with tracking metadata is that it is expensive to make it useful. It is easy to build controls into a process that track every potential error that occurs. It is really useful to have an overall view of a process (or of all processes across an organization) to see how it is doing, especially over time. Unfortunately, it is often very challenging to bridge the gap between these two.
For this project, I think I found a way to do it fairly easily. There were three important steps …
- We decided that all potential errors (invalid records, bad assignments, data that does not join properly) would be written to individual error files, one error per line (separated by a ‘\n’).
- We decided to give our error files names that would describe what was inside at a glance. In our case we used a standard of <process name>.<program id>.<useful error description>.err.
- We wrote a simple process that would parse these files and put the results into a table on a database. The table had fields for:
- process name
- program id
- useful error description
- error count
- parse date
For a file named xfer.999.invalid-file-names.err this would generate a record that looks like:
- process name = xfer
- program id = 999
- useful error description = “invalid file names”
- error count = a simple count of the number of lines in the file
- parse date = the date the operation happened
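The three steps above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the actual process we built: the table name `error_summary`, the column names, the use of SQLite, and the rule that hyphens in the description become spaces are all choices made for this example.

```python
import sqlite3
from datetime import date
from pathlib import Path


def parse_error_file(path):
    """Turn <process name>.<program id>.<useful error description>.err into one row.

    Assumes the description itself contains no dots.
    """
    p = Path(path)
    process_name, program_id, description, suffix = p.name.split(".", 3)
    if suffix != "err":
        raise ValueError(f"not an error file: {p.name}")
    with p.open("rb") as fh:
        error_count = sum(1 for _ in fh)  # one error per line
    return (
        process_name,
        program_id,                        # kept as text to preserve leading zeros
        description.replace("-", " "),     # assumed convention: hyphens stand for spaces
        error_count,
        date.today().isoformat(),          # parse date
    )


def load_error_files(conn, directory):
    """Parse every .err file in a directory into the (hypothetical) error_summary table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS error_summary (
               process_name TEXT,
               program_id TEXT,
               error_description TEXT,
               error_count INTEGER,
               parse_date TEXT)"""
    )
    rows = [parse_error_file(f) for f in sorted(Path(directory).glob("*.err"))]
    conn.executemany("INSERT INTO error_summary VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Calling `load_error_files(sqlite3.connect("errors.db"), "/some/error/dir")` (both names are placeholders) would add one row per error file found, and the function returns how many files it loaded.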
Now we can process any error file (from any process) into the same table, which gives us a generic method for capturing error data. On the development side, the only real cost is that of adhering to the file naming conventions, which is relatively low. On the operations side, we have the ability to track the results of our processes historically with simple SQL queries. A win-win at very low cost!
I am very pleased with this solution; hopefully it will help you as well.
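As an example of the kind of simple historical query this makes possible, here is a small self-contained sketch using SQLite from Python. The table name `error_summary`, the column names, and the sample rows are invented for illustration and are not real results.

```python
import sqlite3

# In-memory database with a hypothetical table holding the five fields described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE error_summary (
           process_name TEXT,
           program_id TEXT,
           error_description TEXT,
           error_count INTEGER,
           parse_date TEXT)"""
)

# Invented sample rows, for illustration only.
sample_rows = [
    ("xfer", "999", "invalid file names", 12, "2006-05-21"),
    ("xfer", "999", "invalid file names", 3, "2006-05-22"),
    ("xfer", "101", "bad assignments", 5, "2006-05-22"),
]
conn.executemany("INSERT INTO error_summary VALUES (?, ?, ?, ?, ?)", sample_rows)


def daily_totals(conn, process_name):
    """Total errors per parse date for one process, oldest first."""
    return conn.execute(
        """SELECT parse_date, SUM(error_count)
           FROM error_summary
           WHERE process_name = ?
           GROUP BY parse_date
           ORDER BY parse_date""",
        (process_name,),
    ).fetchall()


print(daily_totals(conn, "xfer"))
```

Because every process writes to the same table, the same query works for any process name without new development.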
technorati tags: etl, metadata, data, quality, information, architecture
Posted in ETL, Information Architecture, Information Quality, Case Studies, Automation, Understanding, Metadata | 1 Comment »
May 11th, 2006 by morgan
ETLGuru has posted a couple of articles on some of the same subjects we have been touching on.
I would recommend taking a look at:
(Ironically, I have a posting that is half done that was titled “What is ETL”, he beat me to the punch!)
The first article is a pretty good rundown of ETL for the data warehouse, although I would probably expand the scope of what exactly ETL covers. I don’t think that ETL is a data warehouse specific activity, although it is often focused around warehousing. Personally, I think a lot of people are doing ETL development, either on their own or with commercial tools, but call it something different, like “report writing”, “scripting”, or “database maintenance.” ETL is far more than populating a warehouse with a commercially produced tool like Ab Initio, Ascential, or Informatica.
One thing I take issue with is the assertion (in the latter article) that an ETL process never creates new data. I would argue that a well-architected process should create metadata, telling unambiguously what was done, how it was accomplished, and what was affected. Not only is it critically important data in its own right, metadata becomes disproportionately more valuable over time.
Why? Because the true cost of operating any system is calculated over the time of operation, not just the time of development. With many systems of scale, it isn’t possible to see a problem and follow it back to where it came from (sometimes called traceability). This is because the longer a process runs, the less an organization understands what it really does. The people who created it move on to other tasks and memories fade. Instead, we have to use clues and the breadcrumbs that were left for us to follow. Often (especially with legacy sources) metadata is the first and only clue we have on how to fix things.
Without proper metadata, maintaining a “black box” system is inefficient, expensive, and leaves customers frustrated. For example, if it takes two weeks for a problem to be resolved (say from a customer noticing a problem on a report all the way to a developer fixing the problem and an operator re-running the process) it is expensive. We are talking real money, costing an organization salaries (for all the people fixing the problem) and lost opportunities (for delayed or incorrect decisions due to the problem).
technorati tags: etl, metadata
Posted in ETL, Information Architecture, Practices, Understanding, Metadata | No Comments »