January 3rd, 2007 by morgan
Steve Tuck from Datanomic has a post about data quality on dq:view, where he discusses (and tries to dismantle) the use of a government-produced master data file for mailing addresses in the UK. While the posting is very specific to a single application, it speaks to a situation that drives a lot of data management issues.
He writes:
Authorative sources of data are indeed useful - just don’t count on them to tell the truth, the whole truth and nothing but the truth.
I believe that one of the biggest problems that we have in dealing with data is the false belief that for every organization and situation, there is a single view of information that can satisfy everyone’s needs. Now, this isn’t a technology problem and it isn’t a data problem, it’s an organizational problem.
The Myth of the Single View
In any organization, we end up with different groups with different needs, normally based around:
- Speed
- Reliability
- Accuracy
- Cost
Each group has specific needs based on their own situation. For example, when looking at customer data, the people in HQ might not care if every customer account has the most up-to-date address available, but the people in the warehouse certainly do. At the same time, the people in the warehouse don’t care about how much it costs to keep those addresses current, while the people in HQ are much more focused on the bottom line.
Get these folks together in a room and you will have a terrific argument about what the organization needs and how it is going to be done (BTW, there is a related post to this on the wonderful Creating Passionate Users).
While this sounds like a problem for human resources or general management, this phenomenon is usually expressed as a function of IT, because that is where the rubber hits the road. Since IT is often a shared resource and has a vested interest in interoperability, the issues of culture and organization come out as a function of architecture development.
An Honest Assessment
The honest truth is that there isn’t a single view of the business, its data, or its processes that is going to meet the needs of the entire organization. A lot of vendors and consultants for CRM and MDM solutions are going to try to tell you otherwise; realize that they are selling something as they do this. The answer is that this is a complicated world, and things aren’t getting any easier.
If your IT is going to represent the entire organization, you must embrace complexity and understand the fact that there is going to be a cacophony of voices and a host of diverse world views that all exist simultaneously and are all using and competing for the same resources.
Posted in Databases, ETL, Information Architecture, Information Quality, Relationships, Understanding, Culture | No Comments »
December 8th, 2006 by morgan
I have been working with Amazon Web Services (and EC2) a lot lately, and have made some observations that really fly in the face of conventional wisdom.
I work in ETL, which means I need to get hold of big iron to crunch on big data. Machines are expensive, licenses are expensive, storage and networks are cheap. Scalability is important, but measured. AWS would be perfect for sourcing ETL jobs that are one-offs or are particularly large or complex. However, the major vendors are very particular about making sure that their products are only installed on authorized machines. They make it pretty difficult for you to cluster easily, especially if you are a little guy just starting out.
This is antithetical to the AWS approach to problems. Here, machines and storage are dirt cheap, networks are pretty cheap, and scalability is paramount. The most difficult thing is arranging a problem so that it can be worked on by your infinite monkeys, in the form of EC2 instances. The biggest problem then becomes licensing.
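As a rough illustration of what "arranging a problem" for many workers can look like, here is a minimal sketch in Python. It assumes the job can be cut into independent chunks whose partial results are combined at the end. The function names (`partition`, `crunch`, `run`) are hypothetical, and a local thread pool stands in for a fleet of EC2 instances; nothing here is AWS-specific.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_workers):
    """Split records into n_workers roughly equal, independent chunks."""
    return [records[i::n_workers] for i in range(n_workers)]


def crunch(chunk):
    """Hypothetical per-worker job: here, just total the values in a chunk."""
    return sum(chunk)


def run(records, n_workers=4):
    # Each chunk is self-contained, so a worker needs no shared state.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(crunch, partition(records, n_workers)))
    # Combine the partial results once every worker has finished.
    return sum(partials)


print(run(list(range(1, 101))))
```

The hard part in practice is the `partition` step: the work only spreads across machines if each chunk can be processed without knowing about the others.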
In a highly scalable environment, it becomes incredibly compelling to use easily licensed software. Compelling to the point where it becomes worth it to build your own tools instead of purchasing off the shelf. For example, for a web server I could use Apache or WebSphere. Apache is free, and I can install it on my instance with absolutely no problems (as a matter of fact, it is pre-installed). With WebSphere I am going to have to purchase a license (or more), then monkey with the fact that it will be installed on a new machine with a new hostname each time. You can make the same argument for MySQL vs. Oracle, or Python vs. .NET.
Now this isn’t an anti-corporate rant, not by a long shot. But I think it is a valid way to look at how licensing will be a competitive advantage in the future. Software vendors should start looking at their products in terms of AWS and other compute farms, especially at the enterprise level. Those who don’t get out in front of this are going to find their lunches eaten, and quickly. There is quite a hype around Web 2.0 companies these days, and this could be a great way for someone to get their foot in the door of the Fortune 1000.
Perhaps Richard Stallman should send a Christmas card from the bazaar to Jeff Bezos over at the cathedral this year …
Posted in ETL, Information Architecture, Systems Integration, Over the Horizon, Appliances, AWS | No Comments »
September 18th, 2006 by morgan
James Taylor (no, not that James Taylor, the other one) had an interesting article about SOAs, agility, and architecture. While the article is a riff on another article (which makes this a meta-riff, I suppose), it got me to thinking about the development lifecycle.
I think it is very ironic that in ETL and data-oriented programming we run into the same contradictions all the time:
- Development time is the smallest cost in the entire process in terms of time, resources, and money.
- Software development is scrutinized to death.
- On-time delivery is significantly more important than long-term cost savings, even if it impacts long-term functionality.
Now, I don’t think this is done out of malice or spite for IT. A lot of it may simply be because development is the one part of the development lifecycle that can be influenced by the project sponsor. However, as practitioners we need to make sure that information architecture is focused on consistently delivering tangible value to our organization. This means effectively communicating the true overall cost for systems development and making sure that the organization as a whole understands what we are doing.
Posted in ETL, Information Architecture, People, Practices | No Comments »
September 15th, 2006 by morgan
Classifying ETL
It will help to take a bit of time to discuss how software development is classified. Historically, classification of software development was done around methodology and/or representations. Some common ways to look at development are:
Looking at things through the lens of methodology is a more academic view of things, and more prevalent in the early days of computing.
Another Way to Look at Things
Practitioners often look at things a bit differently, usually through the functionality of what is being created. Some ways to look at development this way are:
- Web Programming (like PHP, AJAX, DHTML, etc.)
- Glue Programming (Perl, Python, Tcl, and too many scripting languages to list)
- UI Development (Tk, XUL, UIML)
- Mathematics (MATLAB, SAS, R, many others)
This is a more practical view of things, more prevalent today, especially in the IT world.
Where We Fall
ETL is a function, so it is most easily classified in a functional way. However, Data Oriented Programming is more of a methodology (although more of a hybrid than anything else). So, it is tough to encompass this in just one category. It probably makes most sense to say that ETL should be viewed from the functional point of view, while the things that are used to build ETL processes should be viewed from a methodological point of view.
Next in the “Focus on ETL” series we will be looking at what goes into an ETL process.
Posted in ETL, Information Architecture, Transformation | 1 Comment »
September 11th, 2006 by morgan
As a consultant, most of my time is spent working in, on, and around ETL projects and systems. It is a growing niche that is very useful and makes a lot of data warehousing and analysis possible. I enjoy the work and it pays pretty well.
As time has gone on, I have been on the lookout for some type of “first principles” for ETL, some method behind the madness. At first, I just figured I didn’t have the right website or book and just needed to dig further. However, I am at the point now where I think there just isn’t a consistent definition of exactly what ETL is.
Some of this is probably because ETL is dominated by consultants, and when you are paid by the hour there is no need to speed things up with total consistency. However, I think that there is no common definition for ETL because it is a unique discipline. So, as the first part of my “focus on ETL”, I want to try and pin down some things about the discipline and how I see it.
A Very Visible Definition
Wikipedia defines ETL as:
… a process in data warehousing that involves
- extracting data from outside sources,
- transforming it to fit business needs, and ultimately
- loading it into the data warehouse.
This isn’t a terrible start, although I believe that it is too narrow and only reflects the current state of the industry from the point of view of tool vendors and consultants. This is really limiting, and doesn’t fully describe everything that ETL seems to cover.
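To make the three steps in the quoted definition concrete, here is a minimal sketch in Python using only the standard library. Everything specific in it is an assumption for illustration: the sample CSV, the `customer` table, the in-memory SQLite database standing in for a warehouse, and the tidy-up rules in `transform`. Real ETL jobs are rarely this tidy.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a CSV export from some outside system.
SOURCE_CSV = """customer_id,name,postcode
1, alice smith ,sw1a 1aa
2,Bob Jones,M1 1AE
"""


def extract(text):
    """Extract: read raw rows from the outside source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names and postcodes to fit an assumed business rule."""
    return [
        (int(r["customer_id"]), r["name"].strip().title(), r["postcode"].strip().upper())
        for r in rows
    ]


def load(rows, conn):
    """Load: write the cleaned rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer "
        "(id INTEGER PRIMARY KEY, name TEXT, postcode TEXT)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
print(conn.execute("SELECT * FROM customer ORDER BY id").fetchall())
```

Even in a toy like this, most of the decisions live in `transform`, which is where knowledge of the source format and the business need meet.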
My Definition
After a lot of thought, I have come up with a definition of my own:
ETL is the art, science, and magic of building coherent, useful information from disparate data sources. It encompasses everything from:
- The understanding and use of source systems and formats.
- The code and logic needed to manipulate and transform the data.
- The medium of transformation.
In other words, ETL is data-oriented programming.
I don’t consider ETL to be an activity. Instead, I consider it to be a technical discipline that requires a lot of training, effort, experience, flexibility, and creativity. It spans across multiple platforms, languages, skillsets, and disciplines and delivers something unique and not well understood. While ETL is at the nexus of several different ideas, it stands on its own as something that is very useful.
Well, enough for now. In my next few posts, I will discuss the implications of data-oriented programming and look at how ETL compares to other areas of computer science and information technology.
Posted in ETL, Practices | No Comments »