Data Quality and the Single View

January 3rd, 2007 by morgan

Steve Tuck from Datanomic has a post about data quality on dq:view, where he discusses (and tries to dismantle) the use of a government-produced master data file for mailing addresses in the UK. While the posting is very specific to a single application, it speaks to a situation that drives a lot of data management issues.

He writes:

Authorative sources of data are indeed useful - just don’t count on them to tell the truth, the whole truth and nothing but the truth.

I believe that one of the biggest problems that we have in dealing with data is the false belief that for every organization and situation, there is a single view of information that can satisfy everyone’s needs. Now, this isn’t a technology problem and it isn’t a data problem, it’s an organizational problem.

The Myth of the Single View

In any organization, we end up with different groups with different needs, normally based around:

  • Speed
  • Reliability
  • Accuracy
  • Cost

Each group has specific needs based on its own situation. For example, when looking at customer data, the people in HQ might not care if every customer account has the most up-to-date address available, but the people in the warehouse certainly do. At the same time, the people in the warehouse don’t care how much it costs to keep those addresses current, while the people in HQ are much more focused on the bottom line.

Get these folks together in a room and you will have a terrific argument about what the organization needs and how it is going to be done (BTW, there is a related post to this on the wonderful Creating Passionate Users).

While this sounds like a problem for human resources or general management, this phenomenon is usually expressed as a function of IT, because that is where the rubber hits the road. Since IT is often a shared resource and has a vested interest in interoperability, the issues of culture and organization come out as a function of architecture development.

An Honest Assessment

The honest truth is that there isn’t a single view of the business, its data, or its processes that is going to meet the needs of the entire organization. A lot of vendors and consultants for CRM and MDM solutions are going to try to tell you otherwise. Realize that they are selling something as they do this. The answer is that this is a complicated world, and things aren’t getting any easier.

If your IT is going to represent the entire organization, you must embrace complexity and accept that there is going to be a cacophony of voices and a host of diverse world views that all exist simultaneously and all compete for the same resources.


Interesting Features of Google Docs & Sheets

December 18th, 2006 by morgan

I have been working with Google Docs and Sheets lately, in order to avoid the portability problem when working at different machines and locations. While it isn’t as fully featured as Excel, it does just about everything I need it to do, and then some. Plus, it adds in the collaboration features that are almost more useful to an internet-oriented business.

It would be incredibly boring for Google simply to replicate Excel and Word in a web format, unless you are an HTML groupie. However, there are some very, very interesting features that I think really turn the traditional office application on its ear. The first thing that caught my eye was the Google Lookup function, which allows one to incorporate search information dynamically into documents. The second thing was the Google Finance function, which allows financial information to be leveraged as well. The third thing was the ability to embed portions of a spreadsheet, or an entire spreadsheet, into a blog or web page.

Very cool stuff, and very interesting results. One could imagine this type of thing being leveraged with Froogle, Maps, or another service, within a document, presentation, or spreadsheet. Low cost, high reward stuff. However, there are some ramifications to using this type of information. For example, the spiffy new spreadsheet you put together for your boss could be modified by outside influences (like a Google Bomb).

Worth a look, at least.


December 12th, 2006 by morgan

Mitch Ratcliffe writes about Swivel, a Web 2.0 site that combines YouTube with Microsoft Excel. Well, not exactly, but sort of.

It is an interesting idea, where you …

  1. Upload your dataset to Swivel.
  2. Share your data with other users.
  3. Chart your data, making it available on the web.
  4. Compare your data with the datasets from other users.

Interested? Here is a reasonable example and a not-so-reasonable example of what Swivel can do.

It was a little bit odd to me, at first. Why would I want to provide my data to a web site so that I could look at it the same way that I can on my own computer? Well, I can think of a couple of reasons …

Some Analysis

First, you can share your data with anyone who wants it. This sharing has two parts: display and analysis. You can display your data on the web for everyone to see, and you can combine your data with the data that other people have uploaded and perhaps learn something that you didn’t already know.

Now, sharing might be good, and it might not be, depending on your point of view. An enterprise might want to keep its information secret, and that makes sense. However, an interesting thing about most data is that it is subject to the network effect, big time. Two unconnected data points might mean something, but you can’t really be sure. However, a hundred data points indicate a trend or a correlation.
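To make the point about two data points concrete, here is a minimal sketch of a Pearson correlation in Python. The function name and the sample numbers are made up for illustration and have nothing to do with Swivel itself. With only two points the coefficient always comes out as a perfect +1 or -1, which is exactly why it tells you nothing.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return sxy / math.sqrt(sxx * syy)


# Any two (non-degenerate) points lie on a line, so the result is always +/-1.
two_points_up = pearson([1, 3], [2, 7])
two_points_down = pearson([1, 3], [7, 2])
```

Only once there are many points does a coefficient near +1 or -1 start to say something about a trend.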

Second, you can use a community to learn more about your own data, as well as your correlations. At the bottom of each page, there is a rating that gives a group evaluation of how related the datasets are, along with comments about how useful the comparison actually is.

Overall

A site like this could be useful, especially if there could be a concerted effort to provide good information, such as the census, demographic, and geographic information from the US government. It is still in its very rough stages, and it is hard to say how much value will come out of combining disparate data sets. Also, security, information quality, and analysis will be huge issues here. Still, an interesting idea.


EC2, Licensing, and Competitive Advantage

December 8th, 2006 by morgan

I have been working with Amazon Web Services (and EC2) a lot lately, and have made some observations that really fly in the face of conventional wisdom.

I work in ETL, which means I need to get a hold of big iron to crunch on big data. Machines are expensive, licenses are expensive, storage and networks are cheap. Scalability is important, but measured.

AWS would be perfect for sourcing ETL jobs that are one-offs or are particularly large or complex. However, the major vendors are very particular about making sure that their products are only installed on authorized machines. They make it pretty difficult for you to cluster easily, especially if you are a little guy just starting out.

This is antithetical to the AWS approach to problems. Here, machines and storage are dirt cheap, networks are pretty cheap, and scalability is paramount. The most difficult thing is arranging a problem so that it can be worked on by your infinite monkeys, in the form of EC2 instances. The biggest problem then becomes licensing.

In a highly scalable environment, it becomes incredibly compelling to use easily licensed software. Compelling to the point where it becomes worth it to build your own tools instead of purchasing off the shelf. For example, for a web server I could use Apache or WebSphere. Apache is free, and I can install it on my instance with absolutely no problems (as a matter of fact, it is pre-installed). With WebSphere I am going to have to purchase a license (or more), then monkey with the fact that it will be installed on a new machine with a new hostname each time. You can make the same argument for MySQL vs. Oracle, or Python vs. .NET.

Now this isn’t an anti-corporate rant, not by a long shot. But I think it is a valid way to look at how licensing will be a competitive advantage in the future. Software vendors should start looking at their products in terms of AWS and other compute farms, especially at the enterprise level. Those who don’t get out in front of this are going to find their lunches eaten, and quickly. There is quite a lot of hype around Web 2.0 companies these days, and this could be a great way for someone to get their foot in the door of the Fortune 1000.

Perhaps Richard Stallman should send a Christmas card from the bazaar to Jeff Bezos over at the cathedral this year …


EC2 — Dynamic or Static (or Both)?

December 1st, 2006 by morgan

A Problem

I have been working more and more with AWS and EC2. One of the challenges in working with EC2 is that each instance gets a dynamic IP address upon creation. This makes it easy to crank out a large number of instances, which is a key feature of the system. At the same time, it makes it difficult to find and manage those instances in an automated, systematic way. So, there is a disconnect here.

A Solution

A decent solution is Dynamic DNS. That is, have your instance be assigned an easily recognized hostname as it starts, while it still keeps the dynamic IP address it was given at creation. To me, this seems like the best solution, as it lets the AWS folks be good at what they are good at (providing cool technology infrastructures) and allows their users to be good at what they are good at (making cool applications that use the infrastructure).

How can this be done? Well, it takes a couple of steps:

  1. Establish an account with one of the myriad DDNS providers.
  2. Configure your instance to use the DDNS software upon boot-up.
  3. Profit!
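Step 2 can be sketched roughly as follows, in Python. This is only an illustration: the endpoint URL, the `hostname` and `myip` parameter names, and the use of HTTP basic authentication are assumptions modeled on a common style of DDNS update request, not any particular provider's documented API. Check your provider's documentation before relying on any of it.

```python
import base64
import urllib.request
from urllib.parse import urlencode

# Hypothetical endpoint; substitute whatever your DDNS provider documents.
UPDATE_ENDPOINT = "https://ddns.example.com/nic/update"


def build_update_url(hostname, ip, endpoint=UPDATE_ENDPOINT):
    """Build the URL an instance would request at boot to claim a hostname."""
    return endpoint + "?" + urlencode({"hostname": hostname, "myip": ip})


def send_update(hostname, ip, user, password, endpoint=UPDATE_ENDPOINT):
    """Send the update request (assumes the provider accepts basic auth)."""
    request = urllib.request.Request(build_update_url(hostname, ip, endpoint))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode()


# The hostname and address here are placeholders for illustration only.
example_url = build_update_url("node-1.example.com", "10.0.0.5")
```

A boot script on the instance would call something like `send_update` once the network is up, passing the hostname it is supposed to claim and its current address.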

I have gotten some feedback about possibly using SQS to do something similar to this. I actually thought of this, but there are issues here around cost (cost per message) and configuration (duplicates and ordering). Because I would like to maximize the system’s reliability and scalability, I would probably rule these out.

Which Brings Another Problem …

There is a caveat here, and it isn’t a small one if you don’t want to write code. The big problem is that every instance you create is going to behave identically. So, if you open up multiple instances each thinking they are ‘dynamic-name-1..com’ then you will not get the results you are looking for. Instead, most likely that name will be assigned to the last instance that started and the others will run around headless.

Which Requires a Hack …

This leaves a couple of options. First, you can write a script to go through your pool of potential host names, pick one that isn’t being used, and then request that name from the DDNS provider. Better yet is to pass the Dynamic DNS value you want to the instance through the keypair.
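The first option, walking a pool of names and taking the first free one, might look something like this minimal Python sketch. The pool names are hypothetical, and treating "the name currently resolves in DNS" as "the name is in use" is my own simplifying assumption. A name left behind by a dead instance would still look taken, and two instances booting at the same moment could both grab the same name.

```python
import socket


def resolves(hostname):
    """Return True if the hostname currently resolves to an address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def pick_free_hostname(pool, in_use=resolves):
    """Return the first name in the pool that is not in use, or None."""
    for name in pool:
        if not in_use(name):
            return name
    return None


# Hypothetical pool of names this group of instances is allowed to claim.
POOL = ["etl-node-%d.example.com" % i for i in range(1, 6)]
```

The `in_use` check is passed in as a parameter so that it can be swapped for something more reliable than a DNS lookup, such as asking the DDNS provider directly, if the provider offers a way to do that.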

Of course, the best solution would be for someone to write a script that would make each node configure itself to get a dynamic hostname. I would think this would be attractive to most of the DDNS providers; perhaps one of them will read this and get cracking. If I don’t spot anything in the near future I will probably write a simple KSH script to do this.

Conclusion

I think that at some point the folks at AWS are going to allow for some type of host identification, either through passing parameters to instances or by renting out static IP addresses or subdomain ranges. However, for now it will take some DIY in order to make this happen.

To be honest, probably the best solution here is to approach the problem as if you will never have a static IP address or host name and go from there. It will probably force you to think about your solution differently and challenge you to come up with a more flexible, scalable solution. It isn’t going to fit for every type of problem, but I think it works for the types of things that AWS is inherently good at.
