Case Study — Semantic Information Quality

When an organization begins a concerted effort to improve its information quality, it often gets stuck trying to figure out exactly where to start. This case study addresses that problem with a specific example of one approach to improving information quality.

Previously, we discussed the semantic and statistical approaches to information quality and linked them to white box and black box testing, respectively (you may want to take a look at these if you aren’t familiar with the subjects, as they are the basis for this article).

The Semantic Approach

A more semantic approach would involve defining exactly what your data represents, and from there determining what it should look like and how it should behave. This sounds pretty easy, right? The problem is that things are often more complicated than they seem.

Let’s look at an example I ran into on a client engagement, dealing with call data for a customer contact center. For simplicity, we can assume that all the call data we need is delivered nightly and is loaded into a single table that looks exactly like the files as they have arrived. This table has the following attributes:

  • employee_login_number
  • site_name
  • department_name
  • call_local_start_time
  • call_local_end_time
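
To make the later tests concrete, here is a minimal sketch of one row of that table in Python. The field names come from the attribute list above; the types (string login numbers, naive datetimes for the local times), the `CallRecord` name, and the sample values are my assumptions, not something the source files guarantee.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallRecord:
    """One row of the nightly call table, mirroring the delivered file."""
    employee_login_number: str
    site_name: str
    department_name: str
    call_local_start_time: datetime  # naive: local time of the site's switch
    call_local_end_time: datetime    # naive: local time of the site's switch

    def duration_seconds(self) -> float:
        """Length of the call in seconds (negative if the times are reversed)."""
        delta = self.call_local_end_time - self.call_local_start_time
        return delta.total_seconds()


# Hypothetical sample row, for illustration only.
example = CallRecord(
    employee_login_number="1001",
    site_name="Site A",
    department_name="Billing",
    call_local_start_time=datetime(2024, 1, 15, 9, 0, 0),
    call_local_end_time=datetime(2024, 1, 15, 9, 7, 30),
)
print(example.duration_seconds())  # 450.0
```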

OK, now from this data the business analysts are going to figure out how much to pay and to whom. Also, we need to figure out who is handling the highest call volume (vendors, locations, and employees) on a daily basis so that we can resolve issues and negotiate contracts. Our job is to make sure that the data is accurate enough to do this with confidence.

The Semantic Challenge

The first thing we would need to do is to find out exactly what is going on in the system. Talking with various people in technology and business units, we can define some basic terms. In our case, let’s say that we discover:

  • There are multiple contact center locations worldwide and each one has its own “switch” with data in its own local time. All of the locations are owned and operated by vendors.
  • All reporting for management is done in Eastern Time (US), but location and employee reporting should be done in local time.
  • An agent is signified by a login number in the “switch” (a piece of telephony equipment).
  • An agent works in a department, which handles a specific type of call.
  • An agent can have multiple logins on the same “switch” for different departments that they work in.
  • A “call” will be defined by a valid call record, including a start time and end time.
  • Each time a call comes in, a record will be created with the login number, start time, and end time (in local time).
  • Calls to different departments are paid different rates.

Realize that this is the tip of the iceberg when it comes to business rules. There could easily be 100 more concepts and constraints involved in a decent sized business. Also, understand that this was a very rapidly growing (over 100% per year) worldwide business that was intensely focused on customer service. We couldn’t ask the business to slow down, but we still needed to provide high-quality data.

From the problem description, we know that there must be mappings between:

  • Logins and agents.
  • Locations and vendors.
  • Locations and time zones.
  • Departments and pay rates.
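
The location-to-time-zone mapping is what lets us satisfy the rule that management reporting is in Eastern Time while the switches record local time. Here is a rough sketch of that conversion using Python's standard `zoneinfo` module. The site names, the `SITE_TIME_ZONES` dictionary, and the `to_eastern` helper are hypothetical; in practice the mapping would live in a maintained table.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical mapping of locations to IANA time zones.
SITE_TIME_ZONES = {
    "Manila": ZoneInfo("Asia/Manila"),
    "Dublin": ZoneInfo("Europe/Dublin"),
}
EASTERN = ZoneInfo("America/New_York")


def to_eastern(site_name: str, local_naive: datetime) -> datetime:
    """Interpret a switch's naive local timestamp and convert it to Eastern Time.

    A KeyError here means the site has no time zone mapping, which is
    itself a data quality finding.
    """
    site_zone = SITE_TIME_ZONES[site_name]
    return local_naive.replace(tzinfo=site_zone).astimezone(EASTERN)


eastern = to_eastern("Manila", datetime(2024, 1, 15, 21, 0))
print(eastern.isoformat())  # 2024-01-15T08:00:00-05:00
```

Using zone names rather than fixed offsets means daylight saving changes are handled by the time zone database instead of by hand.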

The Semantic Solution

Off the top of my head, there are a number of things that we can do to test this data. It shouldn’t be too hard to write SQL that would test the referential integrity of the system. For example:

  1. Join the call data with each of the mappings, noting what records have no matches.
  2. Join the call data with each of the mappings, noting what records have multiple matches.
  3. Look for duplicates in the mapping tables.
  4. Look for newly added or removed values in the mapping tables.
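
As a sketch of what those four tests might look like, the following runs the SQL through Python's built-in `sqlite3` module against one mapping (locations to vendors). The `calls` and `site_vendor` table names, the sample rows, and the `yesterday_sites` snapshot are all made up for illustration; the same queries would be repeated for each mapping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calls (employee_login_number TEXT, site_name TEXT, department_name TEXT);
CREATE TABLE site_vendor (site_name TEXT, vendor_name TEXT);
INSERT INTO calls VALUES
  ('1001', 'Site A', 'Billing'),
  ('1002', 'Site B', 'Support'),
  ('1003', 'Site C', 'Billing');
INSERT INTO site_vendor VALUES
  ('Site A', 'Vendor X'),
  ('Site B', 'Vendor Y'),
  ('Site B', 'Vendor Z');
""")

# Test 1: call records with no match in the mapping.
unmatched = conn.execute("""
    SELECT c.employee_login_number, c.site_name
    FROM calls c
    LEFT JOIN site_vendor m ON m.site_name = c.site_name
    WHERE m.site_name IS NULL
""").fetchall()

# Test 2: call records that match more than one mapping row.
multi_matched = conn.execute("""
    SELECT c.employee_login_number, c.site_name
    FROM calls c
    WHERE (SELECT COUNT(*) FROM site_vendor m
           WHERE m.site_name = c.site_name) > 1
""").fetchall()

# Test 3: duplicate keys in the mapping table itself.
duplicates = conn.execute("""
    SELECT site_name, COUNT(*)
    FROM site_vendor
    GROUP BY site_name
    HAVING COUNT(*) > 1
""").fetchall()

# Test 4: values added or removed since the previous load
# (yesterday's keys are assumed to have been saved somewhere).
yesterday_sites = {"Site A", "Site B", "Site D"}
today_sites = {row[0] for row in conn.execute("SELECT site_name FROM site_vendor")}
added = today_sites - yesterday_sites
removed = yesterday_sites - today_sites

print(unmatched)      # [('1003', 'Site C')]
print(multi_matched)  # [('1002', 'Site B')]
print(duplicates)     # [('Site B', 2)]
print(added, removed)
```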

Next, I would look at some basic validation tests:

  1. Each agent should not have more than 3 logins (or some appropriate number) per day.
  2. Each agent should only be listed at one facility per day.
  3. Each agent should only be listed at one vendor per day.
  4. Locations should not disappear or change time zones from day to day.
  5. Vendors should not disappear from day to day.
  6. The call start time should be earlier than the call end time.
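
A few of these validation tests can be sketched in plain Python as follows. The `LOGIN_TO_AGENT` mapping, the agent identifiers, and the sample calls are hypothetical, and the limit of 3 logins is just the placeholder threshold from the list above. The sketch covers the login limit, the one-facility rule, and the start/end ordering; the vendor and time zone rules would follow the same pattern.

```python
from collections import defaultdict
from datetime import datetime

MAX_LOGINS_PER_DAY = 3  # placeholder threshold; pick an appropriate number

# Hypothetical login-to-agent mapping.
LOGIN_TO_AGENT = {"1001": "agent-1", "1002": "agent-1", "2001": "agent-2"}

# (login, site, local start, local end) -- made-up sample rows.
calls = [
    ("1001", "Site A", datetime(2024, 1, 15, 9, 0), datetime(2024, 1, 15, 9, 5)),
    ("1002", "Site B", datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 10, 8)),
    ("2001", "Site A", datetime(2024, 1, 15, 11, 0), datetime(2024, 1, 15, 10, 55)),
]


def validate(call_rows):
    """Return a list of human-readable validation problems."""
    problems = []
    logins_used = defaultdict(set)  # (agent, day) -> logins seen
    sites_seen = defaultdict(set)   # (agent, day) -> sites seen

    for login, site, start, end in call_rows:
        if start >= end:
            problems.append(f"login {login}: start {start} is not before end {end}")
        # Unmapped logins raise KeyError; run the referential checks first.
        key = (LOGIN_TO_AGENT[login], start.date())
        logins_used[key].add(login)
        sites_seen[key].add(site)

    for (agent, day), logins in sorted(logins_used.items()):
        if len(logins) > MAX_LOGINS_PER_DAY:
            problems.append(f"{agent} used {len(logins)} logins on {day}")
    for (agent, day), sites in sorted(sites_seen.items()):
        if len(sites) > 1:
            problems.append(f"{agent} listed at {sorted(sites)} on {day}")
    return problems


problems = validate(calls)
for problem in problems:
    print(problem)
```

With the sample rows, this reports the reversed timestamps on login 2001 and the fact that agent-1 appears at two sites on the same day.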

Last, it would be good to write some sanity checks:

  1. After daily processing is complete, the total number of calls should exactly match the sum of the number of calls to each site.
  2. After daily processing is complete, the total number of calls to a site should exactly match the sum of the number of calls to each agent at that site.
  3. At all levels, the total amount billed should not be more than the total (number of calls) x (highest billing rate).
  4. The total call time for an agent should not be more than 12 hours in a given day.
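
Sanity checks like these only mean something when the totals being compared come from separate outputs of the daily processing, so the sketch below takes each aggregate as its own input. The function name, the input shapes, and all of the figures are assumptions for illustration.

```python
from datetime import timedelta
from decimal import Decimal

MAX_TALK_TIME = timedelta(hours=12)


def sanity_problems(total_calls, calls_by_site, calls_by_site_agent,
                    billed_total, rates, talk_time_by_agent):
    """Compare independently produced daily aggregates and list mismatches."""
    problems = []

    # Check 1: grand total equals the sum of the site totals.
    if total_calls != sum(calls_by_site.values()):
        problems.append("total calls does not match sum of site totals")

    # Check 2: each site total equals the sum of its agents' totals.
    for site, site_total in sorted(calls_by_site.items()):
        agent_sum = sum(count for (agent_site, _agent), count
                        in calls_by_site_agent.items() if agent_site == site)
        if site_total != agent_sum:
            problems.append(f"{site}: site total {site_total} != agent sum {agent_sum}")

    # Check 3: billing cannot exceed (number of calls) x (highest rate).
    ceiling = total_calls * max(rates.values())
    if billed_total > ceiling:
        problems.append(f"billed {billed_total} exceeds ceiling {ceiling}")

    # Check 4: no agent talks for more than 12 hours in a day.
    for agent, talk_time in sorted(talk_time_by_agent.items()):
        if talk_time > MAX_TALK_TIME:
            problems.append(f"{agent}: talk time {talk_time} exceeds 12 hours")
    return problems


# Made-up daily figures.
problems = sanity_problems(
    total_calls=10,
    calls_by_site={"Site A": 6, "Site B": 4},
    calls_by_site_agent={("Site A", "agent-1"): 6, ("Site B", "agent-2"): 3},
    billed_total=Decimal("20.00"),
    rates={"Billing": Decimal("2.50"), "Support": Decimal("1.75")},
    talk_time_by_agent={"agent-1": timedelta(hours=3), "agent-2": timedelta(hours=13)},
)
for problem in problems:
    print(problem)
```

Here the checks flag Site B (site total 4 against an agent sum of 3) and agent-2 (13 hours of talk time), while the grand total and the billing ceiling pass.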

Now, this is by no means a complete list of tests that should be run, but it gives you a good idea of what can be looked at.

Conclusion

As you can see, ensuring that the information coming out of this process is accurate is sometimes simple, sometimes complex, and sometimes downright daunting. Most of the solution here requires that custom tests be created, maintained, understood, and reported on (something that we haven’t even discussed). This is a lot of work, and it is customized work that can’t be easily reused. This is why I believe that, in most cases, this type of testing is only added after an issue has occurred.

In part 2, we will discuss a statistical approach to the same dataset.
