Data Quality and the Single View

January 3rd, 2007 by morgan

Steve Tuck from Datanomic has a post about data quality on dq:view, where he discusses (and tries to dismantle) the use of a government-produced master data file for mailing addresses in the UK. While the posting is very specific to a single application, it speaks to a situation that drives a lot of data management issues.

He writes:

Authorative sources of data are indeed useful - just don’t count on them to tell the truth, the whole truth and nothing but the truth.

I believe that one of the biggest problems that we have in dealing with data is the false belief that for every organization and situation, there is a single view of information that can satisfy everyone’s needs. Now, this isn’t a technology problem and it isn’t a data problem; it’s an organizational problem.

The Myth of the Single View

In any organization, we end up with different groups with different needs, normally based around:

  • Speed
  • Reliability
  • Accuracy
  • Cost

Each group has specific needs based on its own situation. For example, when looking at customer data, the people in HQ might not care if every customer account has the most up-to-date address available, but the people in the warehouse certainly do. At the same time, the people in the warehouse don’t care how much that costs, while the people in HQ are much more focused on the bottom line.

Get these folks together in a room and you will have a terrific argument about what the organization needs and how it is going to be done (BTW, there is a related post to this on the wonderful Creating Passionate Users).

While this sounds like a problem for human resources or general management, this phenomenon is usually expressed as a function of IT, because that is where the rubber hits the road. Since IT is often a shared resource and has a vested interest in interoperability, the issues of culture and organization come out as a function of architecture development.

An Honest Assessment

The honest truth is that there isn’t a single view of the business, its data, or its processes that is going to meet the needs of the entire organization. A lot of vendors and consultants for CRM and MDM solutions are going to try to tell you otherwise; realize that they are selling something when they do. The answer is that this is a complicated world, and things aren’t getting any easier.

If your IT is going to represent the entire organization, you must embrace complexity and accept that there is going to be a cacophony of voices and a host of diverse world views that all exist simultaneously and all use and compete for the same resources.


Working on Borrowed Productivity

September 8th, 2006 by morgan

Project X Discussions has an interesting article about project staffing and the bloat that often occurs with data related projects.

In a number of client environments I have often been amazed by the number of people that can be assigned to a project. Project Managers, Business Users, Business Analysts, Architects, Technical Analysts, Developers, Database Analysts, Database Administrators, Data Modelers, Subject Matter Experts, Testing specialists, Data Assurance people, Production and Operations People and of course a couple of people like me, the consultants.

While it is impossible for one person to do everything I often wonder how many people on such a project team could be removed from the project without impacting and perhaps improving the outcome.

They make some good points. In a structured environment it is very easy to get project bloat due to specialization and role playing. In my experience, a few smart, determined generalists deliver more than a pack of highly qualified specialists most of the time.

The Twist

The article then goes on:

One person I worked with recently proved that a single person acting alone can accomplish amazing things if they have access to all the right tools and know how to use them. In this scenario the end user need some reports - in excel format.

My friend, used UNIX scripts against flat files and SQL queries against the database to create a number of SAS data sets. He ran SAS functions against the data to aggregate it and exported it into Excel for the end user. It took him a couple of days.

It is easy to confuse project bloat with the need for high-quality business processes, and we have to be careful in this regard. I believe that this person was working with what I call “borrowed productivity”. He was able to deliver his part quickly, in large part by doing things that will make life more difficult (and expensive) in the future. He traded all the processes that save money over time for an immediate quick fix.

Where the organization will pay down the line will be:

  1. Quality.
  2. Sustainability.
  3. Reusability.
  4. Consistency.

My Experience

I once worked for a large (Fortune 500) company that had a division that did almost all of its reporting from Microsoft Access databases. They were able to crank out report after report, get them into Excel format, and deliver them via email very quickly. At one point, this group was generating millions of emails.

At the same time, there were a lot of things that didn’t work on this model:

  • The reporting group was really never able to get their reports to match up internally, not to mention reports generated through the centralized data warehouse.
  • When business processes changed, it was very difficult to find all the places where code needed to be updated.
  • There were a huge number of reports that were very similar, with only slightly different parameters.
  • When more reports were needed the only recourse was to hire more Analysts.
  • The users soon grew tired of having a large number of Excel spreadsheets in their Inbox each morning and stopped reading many reports.

Eventually, this reporting system became unsustainable and the group went through several major crises and ended up being mostly disbanded. This was a huge waste of resources and a bitter loss of business expertise and technical talent. A lot of good people lost their jobs because of a bad system.

Conclusion

Small teams are good, very good. They are efficient and often much easier to run. Small thinking is bad, very bad. It is often unintentionally deceptive and very expensive in the long run. There is no causation between big groups and small thinking or small groups and big thinking, but often there is a correlation.

Regardless of your environment, in developing your information architecture make sure that you understand that:

  1. Data delivery has to be fast, at least as fast as the business that drives it.
  2. Initial development is usually the smallest cost in the data lifecycle.

Normally, a discussion around this subject will neglect one side or the other, but this really is just a disservice to the efficiency of your organization.

Other Reading

Rick Sherman has discussed shadow systems; this is one of the few works I have seen that views the discussion holistically. I would highly recommend taking a look at his work.


Enterprise Web 2.0, Linux, and Ecclesiastes

August 20th, 2006 by morgan

Dion Hinchcliffe has been writing some interesting stuff about Web 2.0 in the enterprise. His latest post is a bit of a rant against Wikipedia, but push on and it is worth the read. Lately I have been pondering the impact of things like SOA and mashups in the enterprise context, blending in the web dialtone discussion that is happening on the O’Reilly Radar.

Putting on my prediction hat, I would say that on the back-end of the information architecture Web 2.0 will have an impact similar to that of Linux. That is, it will displace some really expensive, customized solutions, free up resources for real innovation, and push everyone forward about 10 years at no cost. You see, the really interesting thing that Web 2.0 applications do for the enterprise is to dramatically reduce costs for existing processes. For next-generation tools like Basecamp you don’t need hardware, software, drivers, or an administrator. You need an intern and a scripting language. As a long-time ETL guy, I have to say that is huge. It strips away all the barnacles of the information architecture, leaving only the actual work that needs to be done.

I understand that there are new methods and processes that are waiting to be born using AJAX and mashups and the like. I don’t doubt that many of these can have a dramatic impact on the enterprise. However, when someone is going to sit down with the CFO and try to arrange funding for a big project, this isn’t going to be all that impressive. I take an Ecclesiastical view of these things and think that there truly is nothing new under the sun.

That being said, Web 2.0 has a great upside with very little risk or up-front cost. If your organization isn’t exploring this phenomenon it should be.


Six Ways to Secure your Architecture

August 16th, 2006 by morgan

Builder.au has a useful article titled, “Six Steps to Secure Sensitive Data in MySQL.” While I am a PostgreSQL fan myself, this article gives a good checklist of no-nonsense steps that can be taken in order to secure your data. Most of these steps can be applied across databases and even generalized to other systems; it is just good advice.

The thing I like most about the article is that five of the six steps have no visible impact on the users whatsoever. I would argue that the value of an architecture improvement is inversely proportional to the amount of extra effort it will take for a user to get their work done.

On a side rant, something I don’t hear enough about these days is the responsibility of the organization for its IT architecture and policies. Instead, most of the time I hear scapegoating and complaints about users. If it is your architecture, you are in charge of keeping things running safely and securely. It is your users’ responsibility to use your systems efficiently, not to make your life easy.


Case Study — Statistical Information Quality

August 14th, 2006 by morgan

Introduction

When an organization begins a concerted effort to improve its information quality, often it gets stuck in trying to figure out exactly where to start. This case study takes this to heart and gives a specific example of an approach to improving information quality.

Previously, we had discussed the semantic and statistical approaches to information quality and linked them to black box and white box testing. In addition, there is a case study on semantic information quality which is used to contrast this case study (you may want to take a look at these if you aren’t familiar with the subjects).

The example that we have been using is …

dealing with call data for a customer contact center. For simplicity, we can assume that all the call data we need is delivered nightly and is loaded into a single table that looks exactly like the files as they have arrived. This table has the following attributes:

  • employee_login_number
  • site_name
  • department_name
  • call_local_start_time
  • call_local_end_time

… from this data the business analysts are going to figure out how much to pay and to whom. Also, we need to figure out who is handling the highest call volume (vendors, locations, and employees) on a daily basis so that we can resolve issues and negotiate contracts. Our job is to make sure that the data is accurate enough to do this with confidence.

Also, before we get started, realize that with the semantic and statistical approaches we are trying to do the same thing in different ways. So, while we are doing things differently, there is bound to be some overlap.

The Statistical Approach

With a statistical approach, there are several things to consider:

  1. From a statistical point of view, there is nothing special about this dataset. It has very similar characteristics to all the ones that came before it and will come after it. We should try to create an architecture that can be re-used where appropriate.
  2. There is a lot that we can infer from the dataset itself. We can learn a great deal of information about the dataset very cheaply through black box testing. Focusing on these areas will maximize re-use as well.
  3. We can probably assume that the data we receive was of reasonably good quality when the process was first designed. Therefore, we can focus on events where the nature of the data changes substantially.

With these in mind, we can start to design a solution.

The place to start is to ask, “what can go wrong in our data?”. I can think of several situations that might impact the quality of this data:

  • The employee_login_number is invalid or NULL.
  • The site_name is invalid or NULL.
  • The department_name is invalid or NULL.
  • The call_local_start_time is invalid or NULL.
  • The call_local_end_time is invalid, NULL, or falls before the call_local_start_time.
  • Due to errors outside of our control, the process that created the data malfunctioned. Often, this will show up as duplicate values or as an irregular frequency or distribution of values.
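As a rough illustration of the row-level items in this list, here is a minimal Python sketch. The reference lists, the timestamp layout, and the rule that a login number must be all digits are assumptions made for the example; none of them come from the real feed.

```python
from datetime import datetime

# Hypothetical reference data; in practice the business would supply these.
VALID_SITES = {"Austin", "Denver"}
VALID_DEPARTMENTS = {"Billing", "Support"}
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def parse_time(value):
    """Return a datetime, or None when the value is missing or malformed."""
    if not value:
        return None
    try:
        return datetime.strptime(value, TIME_FORMAT)
    except (ValueError, TypeError):
        return None


def row_problems(row):
    """List the row-level problems found in one call record (a dict)."""
    problems = []
    login = row.get("employee_login_number")
    # Assumption: a valid login number is a non-empty string of digits.
    if not login or not str(login).isdigit():
        problems.append("employee_login_number is invalid or NULL")
    if row.get("site_name") not in VALID_SITES:
        problems.append("site_name is invalid or NULL")
    if row.get("department_name") not in VALID_DEPARTMENTS:
        problems.append("department_name is invalid or NULL")
    start = parse_time(row.get("call_local_start_time"))
    end = parse_time(row.get("call_local_end_time"))
    if start is None:
        problems.append("call_local_start_time is invalid or NULL")
    if end is None:
        problems.append("call_local_end_time is invalid or NULL")
    elif start is not None and end < start:
        problems.append("call_local_end_time is before call_local_start_time")
    return problems
```

Checks like these only cover individual rows. A malfunctioning upstream process shows up in the shape of the whole dataset, which is where the statistical measurements come in.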

Off the top of my head, I have a number of questions about the data that we will see day to day:

  • For each column, is there a distinct list of values (call this the domain) that are valid?
  • For each column, is there a distinct pattern of values that are valid?
  • For each column, can the values be NULL?
  • Is there a distinct key? If so, is it unique?
  • For column values and keys, should the frequency for particular values be fairly normal?
  • Is there a certain number of rows that should be expected (by key or for the entire dataset)?
  • Is there a certain number of keys that should be expected?
  • For numeric values, can we do descriptive statistics to tell us if things are off-kilter?

Based on these, I think that we can establish a data model that would allow this metadata to be recorded for multiple processes, which would allow it to be used for reporting and decision-making.

For example, consider a table having the following attributes:

  • process_id
  • process_run_dt
  • distinct_value
  • distinct_value_type
  • distinct_value_count

This would allow the user to keep track of how many distinct values there were generated by a given process. Over time, this could be very useful in tracking down some sticky problems, and perhaps prevent bad data from ever getting into a data store in the first place.
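To make this a little more concrete, here is a minimal sketch of that table using Python and SQLite. The table name distinct_value_metric, the column types, and the reading of distinct_value_type as "the column the value came from" are my assumptions for the example, not a finished reference model.

```python
import sqlite3
from collections import Counter


def create_metric_table(conn):
    """Create the metadata table (the name and column types are assumptions)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS distinct_value_metric (
            process_id           TEXT NOT NULL,
            process_run_dt       TEXT NOT NULL,
            distinct_value       TEXT,
            distinct_value_type  TEXT NOT NULL,
            distinct_value_count INTEGER NOT NULL
        )
        """
    )


def record_distinct_values(conn, process_id, process_run_dt, column_name, values):
    """Store one row per distinct value seen in a column during one run."""
    counts = Counter(values)
    conn.executemany(
        "INSERT INTO distinct_value_metric VALUES (?, ?, ?, ?, ?)",
        [
            (process_id, process_run_dt, value, column_name, count)
            for value, count in counts.items()
        ],
    )
    conn.commit()


def distinct_counts_by_run(conn, process_id, column_name):
    """Return (run date, number of distinct values) pairs, oldest run first."""
    return conn.execute(
        """
        SELECT process_run_dt, COUNT(*)
        FROM distinct_value_metric
        WHERE process_id = ? AND distinct_value_type = ?
        GROUP BY process_run_dt
        ORDER BY process_run_dt
        """,
        (process_id, column_name),
    ).fetchall()
```

Because nothing in the table is specific to call data, the same three functions could record metrics for any nightly load, which is the re-use we are after.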

Each of the measurement processes we mentioned can probably be integrated into the overall data model in a process-agnostic way. I apologize for not having more details at this point; I plan to move this to the wiki (at some point) and put in a reference model for doing some of these operations.

Comparisons With Data Profiling

For people with some experience with data management this may sound a lot like data profiling. In fact, a lot of the operations inherent in the statistical approach would probably be considered a part of data profiling as well.

However, there are some key differences between Statistical IQ and Data Profiling that need mentioning:

  1. Statistical IQ has an operational focus and needs to be as lightweight as possible. We want to use this to make day-to-day operational decisions about our data without slowing anything down.
  2. Statistical IQ does not include data discovery, while data profiling often does.
  3. One of the core functions of data profiling is establishing relationships between datasets. Statistical IQ has a very limited view of relationships in order to maximize functionality and reusability.

Similar base concepts, focusing on different areas.

Statistical IQ and Mad Libs

One thing that often gets lost in the re-use discussion is the price of user configuration. All too often, programmers push too much decision making out of their code and on to the operator, making it difficult to use.

The trick with Statistical IQ is that you have to be able to tie a generic statement (“there are 15 distinct values in this dataset”) back to something useful (“there is probably missing data, don’t continue the process”). While this might seem like a challenge, it can be done without a lot of heartburn.
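One simple way to make that connection is to compare today's measurement with what earlier runs produced and let the size of the gap pick the action. The sketch below is only an illustration: the three-standard-deviation tolerance, the action names, and the wording of the messages are all assumptions.

```python
import statistics


def assess_distinct_count(history, today, tolerance=3.0):
    """Compare today's distinct-value count with earlier runs.

    Returns an (action, message) pair. The action names and the tolerance
    are illustrative assumptions, not a standard.
    """
    if len(history) < 2:
        return ("continue", "not enough history to judge this run")
    usual = statistics.mean(history)
    spread = statistics.stdev(history)
    # Allow at least a difference of one so a perfectly flat history
    # does not flag every tiny change.
    allowed = max(tolerance * spread, 1.0)
    if today < usual - allowed:
        return (
            "halt",
            f"only {today} distinct values against a usual {usual:.0f}; "
            "there is probably missing data, don't continue the process",
        )
    if today > usual + allowed:
        return (
            "review",
            f"{today} distinct values against a usual {usual:.0f}; "
            "check for new or mistyped values",
        )
    return ("continue", f"{today} distinct values is in line with earlier runs")
```

With a history of around 40 distinct values per night, a run that shows only 15 comes back as a halt with a message an operator can act on, and nobody had to configure anything for this particular dataset.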

In a recent engagement, I designed a solution where we tied every possible error back to an English description of the problem that was stored in an SQL database. This was done in a very generic way, so that new errors could be added or removed without any configuration required by the developer or operator.
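I can't reproduce that engagement here, but the general idea can be sketched in a few lines of Python and SQLite. Everything in this example (the error_catalog table, the error codes, the message templates) is hypothetical and only shows the shape of the approach.

```python
import sqlite3


def build_catalog(conn):
    """Create a small catalog of error codes and plain-English templates."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS error_catalog (
            error_code  TEXT PRIMARY KEY,
            description TEXT NOT NULL
        )
        """
    )
    # Hypothetical entries; adding a new error is one more row, not new code.
    conn.executemany(
        "INSERT OR REPLACE INTO error_catalog VALUES (?, ?)",
        [
            ("NULL_VALUES", "Column {column} has {count} NULL values."),
            ("LOW_DISTINCT", "Column {column} has only {count} distinct values; data is probably missing."),
        ],
    )
    conn.commit()


def describe_error(conn, error_code, **details):
    """Look up an error code and fill its template with the given details."""
    row = conn.execute(
        "SELECT description FROM error_catalog WHERE error_code = ?",
        (error_code,),
    ).fetchone()
    if row is None:
        return f"Unrecognised error code {error_code}."
    return row[0].format(**details)
```

The measurement code only ever emits a code and a few details; the wording lives in the table, where it can be changed without touching the code.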

Conclusion

There are different approaches to information quality, each with its own strengths, weaknesses, and costs. The statistical approach is cheaper (especially when you factor in Moore’s Law), but gives a less detailed picture of overall quality. The semantic approach is more expensive, but can be as comprehensive as the situation requires. A balanced approach will use both to deliver the solution that is needed.

