September 11th, 2006 by morgan
As a consultant, most of my time is spent working in, on, and around ETL projects and systems. It is a growing niche that is very useful and makes a lot of data warehousing and analysis possible. I enjoy the work and it pays pretty well.
As time has gone on, I have been on the lookout for some type of “first principles” for ETL, some method behind the madness. At first, I just figured I didn’t have the right website or book and just needed to dig further. However, I am at the point now where I think there just isn’t a consistent definition of exactly what ETL is.
Some of this is probably because ETL is dominated by consultants, and when you are paid by the hour there is no need to speed things up with total consistency. However, I think that there is no common definition for ETL because it is a unique discipline. So, as the first part of my “focus on ETL”, I want to try and pin down some things about the discipline and how I see it.
A Very Visible Definition
Wikipedia defines ETL as:
… a process in data warehousing that involves
- extracting data from outside sources,
- transforming it to fit business needs, and ultimately
- loading it into the data warehouse.
This isn’t a terrible start, although I believe that it is too narrow and only reflects the current state of the industry from the point of view of tool vendors and consultants. This is really limiting, and doesn’t fully describe everything that ETL seems to cover.
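For reference, the three steps in the Wikipedia definition can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the CSV content, the `customer_totals` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a CSV export from some outside system.
SOURCE_CSV = """customer,amount,currency
alice,10.50,usd
bob,7.25,usd
alice,4.00,usd
"""

def extract(text):
    """Extract: read raw rows from the outside source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: total the amounts per customer, stored as integer cents."""
    totals = {}
    for row in rows:
        cents = round(float(row["amount"]) * 100)
        totals[row["customer"]] = totals.get(row["customer"], 0) + cents
    return sorted(totals.items())

def load(conn, totals):
    """Load: write the transformed rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals "
        "(customer TEXT PRIMARY KEY, total_cents INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer_totals VALUES (?, ?)", totals)
    conn.commit()

# Assumption: an in-memory SQLite database plays the role of the warehouse.
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
result = warehouse.execute(
    "SELECT customer, total_cents FROM customer_totals ORDER BY customer"
).fetchall()
```

Even in a toy like this, most of the real decisions (what the source format means, how amounts should be represented, what the target table looks like) sit outside the three function names.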
My Definition
After a lot of thought, I have come up with a definition of my own:
ETL is the art, science, and magic of building coherent, useful information from disparate data sources. It encompasses everything from:
- The understanding and use of source systems and formats.
- The code and logic needed to manipulate and transform the data.
- The medium of transformation.
In other words, ETL is data-oriented programming.
I don’t consider ETL to be an activity. Instead, I consider it to be a technical discipline that requires a lot of training, effort, experience, flexibility, and creativity. It spans across multiple platforms, languages, skillsets, and disciplines and delivers something unique and not well understood. While ETL is at the nexus of several different ideas, it stands on its own as something that is very useful.
Well, enough for now. In my next few posts, I will discuss the implications of data-oriented programming and look at how ETL compares to other areas of computer science and information technology.
Posted in ETL, Practices | No Comments »
September 8th, 2006 by morgan
Project X Discussions has an interesting article about project staffing and the bloat that often occurs with data related projects.
In a number of client environments I have often been amazed by the number of people that can be assigned to a project. Project Managers, Business Users, Business Analysts, Architects, Technical Analysts, Developers, Database Analysts, Database Administrators, Data Modelers, Subject Matter Experts, Testing specialists, Data Assurance people, Production and Operations People and of course a couple of people like me, the consultants.
While it is impossible for one person to do everything I often wonder how many people on such a project team could be removed from the project without impacting and perhaps improving the outcome.
They make some good points. In a structured environment it is very easy to get project bloat due to specialization and role playing. In my experience, a few smart, determined generalists deliver more than a pack of highly qualified specialists most of the time.
The Twist
The article then goes on:
One person I worked with recently proved that a single person acting alone can accomplish amazing things if they have access to all the right tools and know how to use them. In this scenario the end user need some reports - in excel format.
My friend, used UNIX scripts against flat files and SQL queries against the database to create a number of SAS data sets. He ran SAS functions against the data to aggregate it and exported it into Excel for the end user. It took him a couple of days.
It is easy to confuse project bloat with the need for high-quality business processes, and we have to be careful in this regard. I believe that this person was working with what I call “borrowed productivity”. He was able to deliver his part quickly, in large part by doing things that will make things more difficult (and expensive) in the future. He avoided all the processes that will save money over time in favor of an immediate quick fix.
Where the organization will pay down the line will be:
- Quality.
- Sustainability.
- Reusability.
- Consistency.
My Experience
I once worked for a large (Fortune 500) company that had a division that did almost all of its reporting from Microsoft Access databases. They were able to crank out report after report, get them into Excel format, and deliver them via email very quickly. At one point, this group was generating millions of emails.
At the same time, there were a lot of things that didn’t work on this model:
- The reporting group was really never able to get their reports to match up internally, not to mention reports generated through the centralized data warehouse.
- When business processes changed, it was very difficult to find all the places where code needed to be updated.
- There were a huge number of reports that were very similar, with only slightly different parameters.
- When more reports were needed the only recourse was to hire more Analysts.
- The users soon grew tired of having a large number of Excel spreadsheets in their Inbox each morning and stopped reading many reports.
Eventually, this reporting system became unsustainable and the group went through several major crises and ended up being mostly disbanded. This was a huge waste of resources and a bitter loss of business expertise and technical talent. A lot of good people lost their jobs because of a bad system.
Conclusion
Small teams are good, very good. They are efficient and often much easier to run. Small thinking is bad, very bad. It is often unintentionally deceptive and very expensive in the long run. There is no causation between big groups and small thinking or small groups and big thinking, but often there is a correlation.
Regardless of your environment, in developing your information architecture make sure that you understand that:
- Data delivery has to be fast, at least as fast as the business that drives it.
- Initial development is usually the smallest cost in the data lifecycle.
Normally, a discussion around this subject will neglect one side or the other, but this really is just a disservice to the efficiency of your organization.
Other Reading
Rick Sherman has discussed shadow systems; this is one of the few works I have seen that views the discussion holistically. I would highly recommend taking a look at his work.
Posted in Databases, ETL, Information Quality, Practices, Transformation, Reporting | No Comments »
August 20th, 2006 by morgan
The ongoing debate about the planetary status of Pluto is a really great example of how the standards-making process really works. Like most debates, everyone involved is supposed to be rational, thinking adults. The most rational, thinking people on earth: scientists.
The debate is whether to keep Pluto classified as a planet. And, to be honest, I can’t think of a single good reason why we should or should not change its status. On one hand, logically it looks like it probably shouldn’t be a planet, as the criteria used to classify it as one are no longer unique and are causing some inconsistencies. On the other hand, this would contradict everything that most adults have been taught, probably causing an uproar among some key constituency of some political party (personally, I am waiting for someone to ask, “Won’t somebody please think of the children?”).
This planetary debate is highly charged, personal, emotional, contradictory at times, and ended with a solution that defies logic and mollifies more than it satisfies. Of course, this reminds me of the times I have been sitting in a room trying to decide on standard ways to do or classify things. For the most part, this has been as a technologist (around data, ETL and architecture), but also as a member of a business and as a leader in a non-profit organization.
There are several lessons I have learned about setting (and breaking) standards over the years. While I wrote about this in an earlier article about standardization and conformity, I thought I would try to distill things down into a few truths about standards. Here they are:
- Standards are good if they save time, effort, money, or increase safety or happiness. Any other reason is just a justification for the exercise of power.
- Standards are set by people, not by logic, reason, money, or faith. Anyone who says differently is in denial or trying to pull the wool over your eyes.
- If there is more than one standard way of doing things, there is no standard way of doing things.
- Standard is not inherently better than non-standard. However, things may be more comfortable for some people to understand if they believe there is a standard.
- Not making the decision to have a standard is still an active decision on standardization, with very real personal, organizational, and financial implications.
- Standards are not good if they don’t work for the people who have to follow them every day. Over the long term, people won’t follow standards that don’t work for them.
- Mistakes will be made and good standards will take this into consideration.
- Someone will find an exception and want to do it differently, usually for a good reason. Handling this creatively and gracefully will be your greatest challenge.
At some point, every organization gets to a point where they can see if they just did things in a standard way. Normally, this is just after everything has completely changed or gone to hell, a group has burned themselves out one too many times, or the ball has been dropped in a large, preventable, but hard to predict way. This is a very interesting time, but also a very dangerous one. There is momentum for change, but just doing things differently isn’t the same as doing things better.
Just remember, take your time and think about things carefully. It will all work itself out.
Posted in ETL, Information Architecture, People, Practices | 2 Comments »
August 1st, 2006 by morgan
One of the real challenges about information quality is that the field is still very abstract. In the academic world, the theories (like PSP/IQ) are still being written and discussed. In practice, this means that there isn’t a standard way of doing things. Or, to be more precise, everyone has a “standard” way of doing things, it is just that they are all different. So, let me add my $0.02, perhaps this will help someone in some general way.
Previously, we talked about the semantic and statistical approaches to information quality. Two distinctly different ways of trying to do the same thing. How can we reconcile these two different ideas and actually accomplish something in the real world? The best way I know is to try and fall back to some well established practices and try to adapt them to our needs. While we are working with data instead of applications, I think that these approaches correspond directly to principles from software engineering. For most applications, there are two types of testing:
- White box testing uses an intimate knowledge of the internals of an application and tests to make sure everything works as expected.
- Black box testing uses an expectation of the behavior of an application and tests to make sure that it does what it should. A black box test will not test anything internal to the application, just the inputs and outputs.
Both of these types of tests do different types of things and work in different ways. As this terribly obscure but well-written and concise tutorial points out:
- White box testing does quality control, while black box testing does quality assurance.
- Black box testing finds sins of omission, white box testing finds sins of commission.
- Black box testing can be started as soon as the specifications are available, while white box testing must wait until the code is written.
- Black box testing is a lot cheaper than white box testing.
- Both types of testing are needed in order to truly verify that things are working properly.
In my opinion, a semantic approach to information quality is the equivalent of a white-box test. Conversely, the statistical approach is the equivalent of a black-box test.
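As a rough illustration of this analogy, here is one way the two kinds of check might look when applied to data instead of an application. This is my own sketch, not a definition of either approach: the order records, the `quantity * unit_price == total` rule, the row-count history, and the three-standard-deviation tolerance are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical order records; the field names are illustrative only.
orders = [
    {"id": 1, "quantity": 2, "unit_price": 500, "total": 1000},
    {"id": 2, "quantity": 1, "unit_price": 250, "total": 250},
    {"id": 3, "quantity": 3, "unit_price": 100, "total": 350},  # breaks the rule
]

def semantic_check(rows):
    """White-box style: apply a known business rule to every record.

    Returns the ids of records where quantity * unit_price != total.
    """
    return [r["id"] for r in rows if r["quantity"] * r["unit_price"] != r["total"]]

def statistical_check(history, todays_count, tolerance=3.0):
    """Black-box style: compare today's row count with past loads.

    Returns True when the count is within `tolerance` population standard
    deviations of the historical mean. No knowledge of the records is used.
    """
    avg = mean(history)
    spread = pstdev(history)
    return abs(todays_count - avg) <= tolerance * spread

# Hypothetical row counts from five previous loads.
past_counts = [100, 104, 98, 102, 96]

bad_ids = semantic_check(orders)
normal_day = statistical_check(past_counts, 101)
short_day = statistical_check(past_counts, 40)
```

The semantic check needs someone who knows what an order is supposed to mean, while the statistical check only needs a history of outputs, which mirrors the cost difference between the two kinds of testing noted above.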
Next in our series, we will be exploring the semantic approach to information quality with practical examples.
Posted in Databases, ETL, Systems Integration, Information Quality | 2 Comments »
July 20th, 2006 by morgan
Yesterday, Informatica and Salesforce.com announced an interesting deal that will allow the two tools to interact. After looking at a general overview of the technology, it looks like something that will be relatively useful for users of these products, and help to cement sales for both companies. I wouldn’t call this quite as appealing as the combination of Nike+iPod, but then again I am both a distance runner and an iPod owner.
I think that this is solid recognition that using data as a service in the enterprise is not only possible, but probable. Also, a nice move by Informatica to keep its products fresh and leaning towards the leading edge. At the same time, I wonder if we are trying to teach an old dog new tricks. After all, Cast Iron Systems sells EAI appliances that are also integrated with Salesforce.com. Probably not, as the really profitable customers of ETL tool vendors are probably not the same ones looking at appliances.
At least, not yet …
Posted in ETL, Information Architecture, Systems Integration, In the News | 3 Comments »