Case Study — Statistical Information Quality

August 14th, 2006 by morgan

Introduction

When an organization begins a concerted effort to improve its information quality, often it gets stuck in trying to figure out exactly where to start. This case study takes this to heart and gives a specific example of an approach to improving information quality.

Previously, we had discussed the semantic and statistical approaches to information quality and linked them to black box and white box testing. In addition, there is a case study on semantic information quality which is used to contrast this case study (you may want to take a look at these if you aren’t familiar with the subjects).

The example that we have been using is …

dealing with call data for a customer contact center. For simplicity, we can assume that all the call data we need is delivered nightly and is loaded into a single table that looks exactly like the files as they have arrived. This table has the following attributes:

  • employee_login_number
  • site_name
  • department_name
  • call_local_start_time
  • call_local_end_time

… from this data the business analysts are going to figure out how much to pay and to whom. Also, we need to figure out who is handling the highest call volume (vendors, locations, and employees) on a daily basis so that we can resolve issues and negotiate contracts. Our job is to make sure that the data is accurate enough to do this with confidence.

Also, before we get started, realize that with the semantic and statistical approaches we are trying to do the same thing in different ways. So, while we are doing things differently, there is bound to be some overlap.

The Statistical Approach

With a statistical approach, there are several things to consider:

  1. From a statistical point of view, there is nothing special about this dataset. It has very similar characteristics to all the ones that came before it and will come after it. We should try to create an architecture that can be re-used where appropriate.
  2. There is a lot that we can infer from the dataset itself. We can learn a great deal of information about the dataset very cheaply through black box testing. Focusing on these areas will maximize re-use as well.
  3. We can probably assume that any data that we recieve is of reasonably good quality when the process was first designed. Therefore, we can focus on events where the nature of the data changes substantially.

With these in mind, we can start to design a solution.

The place to start is to ask, “what can go wrong in our data?”. I can think of several situations that might impact the quality of this data:

  • The employee_login_number is invalid or NULL.
  • The site_name is invalid or NULL.
  • The department is invalid or NULL.
  • The call_local_start_time is invalid or NULL.
  • The call_local_end_time is invalid, NULL, or starts before the call_local_start_time.
  • Due to errors outside of our control, the process that created the data malfunctioned. Often, this will show up as duplicate values, irregular frequency or distribution of values

Off the top of my head, I have a number of questions about the data that we will see day to day:

  • For each column, is there a distinct list of values (call this the domain) that are valid?
  • For each column, is there a distinct pattern of values that are valid?
  • For each column, can the values be NULL?
  • Is there a distinct key? If so, is it unique?
  • For column values and keys, should the frequency for particular values be fairly normal?
  • Is there a certain number of rows that should be expected (by key or for the entire dataset)?
  • Is there a certain number of keys that should be expected?
  • For numeric values, can we do descriptive statistics to tell us if things are off-kilter?

Based on these, I think that we can establish a data model that would allow this metadata to be recorded for multiple processes, which would allow it to be used for reporting and decision-making.

For example, consider a table having the following attributes:

  • process_id
  • process_run_dt
  • distinct_value
  • distinct_value_type
  • distinct_value_count

This would allow the user to keep track of how many distinct values there were generated by a given process. Over time, this could be very useful in tracking down some sticky problems, and perhaps prevent bad data from ever getting into a data store in the first place.

For each of the meausurement processes we mentioned, they can probably be integrated into the overall data model in a process agnostic way. I apologize for not having more details at this point, I plan to move this to the wiki (at some point) and put in a reference model for doing some of these operations.

Comparisons With Data Profiling

For people with some experience with data management this may sound a lot like data profiling. In fact, a lot of the operations inherent in the statistical approach would probably be considered a part of data profiling as well.

However, there are some key differences between Statistical IQ and Data Profiling that need mentioning:

  1. Statistical IQ has an operational focus and needs to be as lightweight as possible. We want to use this to make day-to-day operational decisions about our data without slowing anything down.
  2. Statistical IQ does not include data discovery, while data profiling often does.
  3. One of the core functions of data profiling is establishing relationships between datasets. Statistical IQ has a very limited view of relationships in order to maximize functionality and reusability.

Similar base concepts, focusing on different areas.

Statistical IQ and Mad Libs

One thing that often gets lost in the re-use discussion is the price of user configuration. All too often, programmers push too much decision making out of their code and on to the operator, making it difficult to use.

The trick with Statistical IQ is that you have to be able to tie a generic statement (”there are 15 distinct values in this dataset”) back to something useful (”there is probably missing data, don’t continue the process”). While this might seem like a challenge, it can be done without a lot of heartburn.

In a recent engagement, I designed a solution where we tied every possible error back to an english description of the problem that was stored in an SQL database. This was done in a very generic way, so that new errors could be added or removed without any configuration required by the developer or operator.

Conclusion

There are different approaches to information quality, each with their own strengths, weaknesses, and costs. The statistical approach is cheaper (especially when you factor in Moore’s Law), but gives a less detailed picture of overall quality. The semantic approach is more expensive, but can be as comprehensive as the situation requires. A balanced approach will use both approaches to deliver the solution that is needed.

Share and earn some karma ...These icons link to social bookmarking sites where readers can share and discover new web pages.
  • del.icio.us
  • digg
  • Furl
  • NewsVine
  • Reddit
  • Spurl

Case Study — Semantic Information Quality

August 4th, 2006 by morgan

When an organization begins a concerted effort to improve its information quality, often it gets stuck in trying to figure out exactly where to start. This case study takes this to heart and gives a specific example of an approach to improving information quality.

Previously, we had discussed the semantic and statistical approaches to information quality and linked them to black box and white box testing (you may want to take a look at these if you aren’t familiar with the subjects, as these are the basis for this article).

The Semantic Approach

A more semantic approach would involve defining exactly what your data represents, and from there determine what it should look like and how it should behave. This sounds pretty easy, right? The problem is that things are often more complicated than they seem.

Let’s look at an example I ran into on a client engagement, dealing with call data for a customer contact center. For simplicity, we can assume that all the call data we need is delivered nightly and is loaded into a single table that looks exactly like the files as they have arrived. This table has the following attributes:

  • employee_login_number
  • site_name
  • department_name
  • call_local_start_time
  • call_local_end_time

OK, now from this data the business analysts are going to figure out how much to pay and to whom. Also, we need to figure out who is handling the highest call volume (vendors, locations, and employees) on a daily basis so that we can resolve issues and negotiate contracts. Our job is to make sure that the data is accurate enough to do this with confidence.

The Semantic Challenge

The first thing we would need to do is to find out exactly what is going on in the system. Talking with various people in technology and business units, we can define some basic terms. In our case, let’s say that we discover:

  • There are multiple contact center locations worldwide and each one has its own “switch” with data in its own local time. All of the locations are owned and operated by vendors.
  • All reporting for management is done in Eastern Time (US), but location and employee reporting should be done in local time.
  • An agent is signified by a login number in the “switch” (a piece of telephony equipment).
  • An agent works in a department, which handles a specific type of call.
  • An agent can have multiple logins on the same “switch” for different departments that they work in.
  • A “call” will be defined by a valid call record, including a start time and end time
  • Each time a call comes in a record will be created with the login number, start time, and stop time (in local time).
  • Calls to different departments are paid different rates.

Realize that this is the tip of the iceberg when it comes to business rules. There could easily be 100 more concepts and constraints involved in a decent sized business. Also, understand that this was very rapidly growing (over 100% per year) worldwide business that was intensely focused on customer service. We couldn’t ask the business to slow down. But, we still needed to provide data that was of high quality.

From the problem description, we know that there must be mappings between:

  • Logins and agents.
  • Locations and vendors.
  • Locations and time zones.
  • Departments and pay rates.

The Semantic Solution

Off the top of my head, there are a number of things that we can do to test this data. It shouldn’t be too hard to write SQL that would test the referential integrity of the system. For example:

  1. Join the call data with each of the mappings, noting what records have no matches.
  2. Join the call data with each of the mappings, noting what records have multiple matches.
  3. Look for duplicates in the mapping tables.
  4. Look for newly added or removed values in the mapping tables.

Next, I would look at some basic validation tests:

  1. Each agent should not have more than 3 logins (or some appropriate number) per day.
  2. Each agent should only be listed at one facility per day.
  3. Each agent should only be listed at one vendor per day.
  4. Locations should not disapear or change time zones from day to day.
  5. Vendors should not disapear from day to day.
  6. The Call Start Time should be earlier than the Call Stop Time.

Last, it would be good to write some sanity checks:

  1. After daily processing is complete, the total number of calls should exactly match the sum of the number of calls to each site.
  2. After daily processing is complete, the total number of calls to a site should exactly match the sum of the number of calls to each agent at that site.
  3. At all levels, the total amount billed should not be more than the total (number of calls) x (highest billing rate).
  4. The total number of call time for an agent should not be more than 12 hours in a given day.

Now, this is by no means a complete list of tests that should be run, but it gives you a good idea of what can be looked at.

Conclusion

As you can see, ensuring that the information coming out of this process is accurate is sometimes simple, sometimes complex, and sometimes downright daunting. Most of the solution here requires that custom tests be created, maintained, understood, and reported on (something that we haven’t even discussed). This is a lot of work, and customized work that can’t be easily reused. This is why in most cases I believe this type of testing is only added after an issue had occured.

In part 2, we will discuss a statistical approach to the same dataset.

Share and earn some karma ...These icons link to social bookmarking sites where readers can share and discover new web pages.
  • del.icio.us
  • digg
  • Furl
  • NewsVine
  • Reddit
  • Spurl

Bad Information Quality Can Be Highly Offensive!

July 5th, 2006 by morgan

I just got several letters in the mail that were very troubling. My 3-year old just got an offer to purchase a timeshare through Marriot and my 6-year old just got an offer for a credit card through and Chase. This is highly offensive to me as a parent, even though as a professional I can understand why it might happen.

My guess is that these letters are a result of:

  1. Bad householding processes.
  2. Poor information quality.
  3. Sloppy decision making.

However, all of this could have been prevented with some minor filtering. Marriot and Chase got the address information through United Airlines frequent flyer miles program data.  This data includes an age, so clearly these offers should not be sent out to anyone under the age of 18.  I would think that anyone under the age of 18 would automatically have an opt-out for marketing, but clearly this isn’t the case.
This will probably have the net effect of making me sever my business relationship with United, as they are clearly not a company that I trust with my families personal data. I also am building up a great distaste for Marriot and Chase, as they are not managing their businesses well.

Share and earn some karma ...These icons link to social bookmarking sites where readers can share and discover new web pages.
  • del.icio.us
  • digg
  • Furl
  • NewsVine
  • Reddit
  • Spurl

Real-Life Example of the Cost of Information Quality

May 29th, 2006 by morgan

There is a heartbreaking story that really demonstrates the sometimes all-too-high cost of poor information quality in the real world. The article IT Integration: The Army’s Pay Misstep discusses the problems that the US Army Reserve has had in paying its people properly and the impact that it has on real people, especially wounded soldiers and their families.

Like so many IQ related stories, this one has a bit of everything:

  • Organizations Outpacing Systems
  • Legacy Applications
  • Manual Data Entry
  • Complex Business Logic
  • Regulatory Compliance
  • Technical and Process Wizards Keeping Everything Running

The sad thing is, often it is the individual service members who end up paying the price. Something worth considering around Memorial Day.

technorati tags: , , , , , , , ,

Share and earn some karma ...These icons link to social bookmarking sites where readers can share and discover new web pages.
  • del.icio.us
  • digg
  • Furl
  • NewsVine
  • Reddit
  • Spurl

Making Metadata Pay

May 23rd, 2006 by morgan

I am currently working on a process to add some instrumentation to an existing legacy system. Not physical instrumentation, but conceptually similar. Consider the relationship of the speedometer or the tachometer to the engine of an automobile; I am doing the same thing for an ETL process.

Basically, we are trying to track the process of data provided by manufacturing systems through the entire information architecture, from delivery to publishing. A textbook example of metadata creation. To be honest, it isn’t the most exciting work in the world, but it is at least interesting to dissect an existing process and come up with something useful. Most importantly, it is very useful to our customers, and this is the measurement I really care about.

Anyway, one of the big stumbling blocks with tracking metadata is that it is expensive to make it useful. It is easy to build controls into a process that tracks every potential error that occurs. It is really useful to have an overall view of a process (or of all processes across an organization) to see how it is doing, especially over time. Unfortunately, it is often very challenging to bridge the gap between these two.

For this project, I think I found a way to do it fairly easily. There were three important steps …

  1. We decided that all potential errors (invalid records, bad assignments, data that does not join properly) would be written to individual error files, one error per line (separated by a ‘\n’).
  2. We decided to give our error files names that would describe what was inside at a glance. In our case we used a standard of <process name>.<program id>.<useful error description>.err.
  3. We wrote a simple process that would parse these files and put the results into a table on a database. The table had fields for:
  • process name
  • program id
  • useful error description
  • error count
  • parse date

For a file named xfer.999.invalid-file-names.err this would generate a record that looks like:

  • process name = xfer
  • program id = 9999
  • useful error description = “invalid file names”
  • error count = a simple count of the number of lines in the file
  • parse date = the date the operation happened

Now, we can process any error file (from any process) into the same table and now we have a generic method for capturing error data. On the development side, the only real cost is that of adhering to the file naming conventions, which is relatively low. On the operations side, we have the ability to track the results of our processes historically with simple SQL queries. A win-win at very low cost!

I am very pleased with this solution, hopefully it will help you as well.

technorati tags: , , , , ,

Share and earn some karma ...These icons link to social bookmarking sites where readers can share and discover new web pages.
  • del.icio.us
  • digg
  • Furl
  • NewsVine
  • Reddit
  • Spurl

about


This is the about me section, you will prob. want to edit this. If you want to change the image you may do so by changing the avatar.jpg located in the NewZen images directory.

search

navigation

archives

categories