Two Methods for Defining Information Quality
In Information Science today there are two competing methods for indexing information: semantics and statistics. While this may not seem to have a lot to do with information quality, bear with me and I promise I will link them up (eventually). Both methods do approximately the same job: allowing information to be read and manipulated by machines on a grand scale. The difference is in how this is done.
- A semantic approach would have the author define concepts and relationships ahead of time. You can see some examples in this tutorial, as they are long and would be difficult to reproduce here. The Semantic Web would be a good example of this methodology.
- A statistical approach would simply look at the text that is available and try to determine what is there and how it relates to other things through textual analysis and aggregation. Google is a good example of the use of this approach.
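To make the contrast concrete, here is a minimal Python sketch of the two approaches. Everything in it is a hypothetical illustration: the triples, the sample sentences, the stopword list, and the helper names `related_semantic` and `related_statistical` are my own assumptions, and neither function resembles what the Semantic Web stack or Google actually runs.

```python
from collections import Counter
from itertools import combinations

# Semantic approach: the author declares concepts and relationships up front.
# These triples are invented purely for illustration.
TRIPLES = [
    ("invoice", "is_a", "document"),
    ("invoice", "issued_to", "customer"),
    ("customer", "is_a", "party"),
]

def related_semantic(concept):
    """Return the declared (relation, object) pairs for a concept."""
    return [(rel, obj) for subj, rel, obj in TRIPLES if subj == concept]

# Statistical approach: nothing is defined ahead of time; relatedness is
# inferred from which words show up together in the available text.
DOCS = [
    "the customer paid the invoice",
    "the invoice was sent to the customer",
    "the report was sent to the manager",
]
STOPWORDS = {"the", "was", "to"}  # assumed, tiny stopword list

def cooccurrence(docs):
    """Count how many documents each unordered word pair shares."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.split()) - STOPWORDS)
        counts.update(combinations(words, 2))
    return counts

def related_statistical(term, docs):
    """Rank other terms by how often they share a document with `term`."""
    scores = Counter()
    for (a, b), n in cooccurrence(docs).items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return scores.most_common()
```

Asking the semantic side about "report" returns nothing, because nobody defined it. The statistical side will say something about any term that appears in the text, but what it returns is a ranked guess rather than a declared fact.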
The semantic way of looking at things is very abstract and much more rigorous. It says that there is a truth to be represented, it designs a way of doing it, and expects everyone to follow along. The statistical way of looking at things is much more flexible. It says that there are things to be gleaned regardless of form, and that we should accept this fact and try to make the best of things. Not surprisingly, the semantic approach is the favorite of academia and has been under development for many years, while the statistical approach is already in real-world use.
What got me thinking about this in the first place was the latest issue of Baseline. Specifically, it was an article from Paul A. Strassmann titled, “How Clean Data Can Transform Your Business”. Normally Strassmann’s stuff is pretty good, but it is helpful to note that Strassmann is a senior consultant to the Department of Defense and has been in the business for a long, long, long time.
The crux of his argument was that:
The first step in business transformation: enterprisewide standardization of data. That calls for the declaration of a metadata directory as the template for defining data that can circulate within a firm’s information systems. The policy and implementation of an enforceable metadata directory likely will be resisted by bureaucrats, who see this as a threat to their indispensability. It will not be welcomed by systems developers, contractors and vendors, who prefer to concentrate on upgrading software as a technologically more interesting—and profitable—task.
A classic argument for a semantic model of truth. We just need to get everything defined and then it will be smooth sailing from there. For most vendors and consultants, the semantic view is the accepted one, probably because it is so structured and logical, although at least partially because all those hours spent defining concepts are billable. Even Strassmann acknowledges this reality …
To reach agreement on the representation, semantics and taxonomy of data, you will likely go through a painful political process that must be adjudicated by line management. This can get messy because it will reveal that a large percentage of installed software perpetuates incompatible, unreliable, insufficiently secure and delayed information.
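For a sense of what an enforceable metadata directory might look like at its very simplest, here is a hedged Python sketch. The field names, types, allowed values, and the `validate` helper are all hypothetical; the article quoted above does not specify any particular format.

```python
# Hypothetical metadata directory: field name -> (expected type, allowed values or None).
METADATA_DIRECTORY = {
    "customer_id": (int, None),
    "country": (str, {"US", "CA", "MX"}),
    "balance": (float, None),
}

def validate(record):
    """Return a list of problems found when checking a record against the directory."""
    problems = []
    for field, (expected_type, allowed) in METADATA_DIRECTORY.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
        elif allowed is not None and value not in allowed:
            problems.append(f"{field}: value {value!r} not in allowed set")
    # Fields the directory never declared are also flagged.
    for field in record:
        if field not in METADATA_DIRECTORY:
            problems.append(f"undeclared field: {field}")
    return problems
```

The code is the easy part. The painful political process is getting every system owner to agree on what goes into `METADATA_DIRECTORY` in the first place.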
With this in mind, is semantic definition the most efficient way to improve information quality? Is a statistical definition the most descriptive way to understand information quality? We will explore the basis for both of these methods in the next part of this series.

August 1st, 2006 at 8:21 pm
[…] Previously, we talked about the semantic and statistical approaches to information quality. Two distinctly different ways of trying to do the same thing. How can we reconcile these two different ideas and actually accomplish something in the real world? The best way I know is to try and fall back to some well established practices and try to adapt them to our needs. While we are working with data instead of applications, I think that these approaches correspond directly to principles from software engineering. For most applications, there are two types of testing: […]
August 4th, 2006 at 12:27 pm
[…] When an organization begins a concerted effort to improve its information quality, often it gets stuck in trying to figure out exactly where to start. Previously, we had discussed the semantic and statistical approaches to information quality and linked them to black box and white box testing (you may want to take a look at these if you aren’t familiar with the subjects, as these are the basis for this article). […]
August 14th, 2006 at 5:51 am
[…] Previously, we had discussed the semantic and statistical approaches to information quality and linked them to black box and white box testing. In addition, there is a case study on semantic information quality which is used to contrast this case study (you may want to take a look at these if you aren’t familiar with the subjects). […]
August 24th, 2006 at 5:54 pm
[…] I have talked a lot about the differences between statistical and semantic information (especially around quality) in the recent past. I have also been interested in ways to bridge the gaps between these approaches, as they both have their own strengths and weaknesses. A project that is aiming to do something like this is the Semantic Media Wiki. They take an interesting approach to bridging the gap between human understanding and machine-processable truth by including it in an easy-to-use repository, Media Wiki. […]
July 21st, 2008 at 9:48 am
[…] One of the most highly valued features of information architecture is accuracy. Everyone wants everything to be perfect: every answer should be as factually accurate as possible and available immediately to whomever needs it. This was the promise of the internet as a whole, and of the web specifically (especially the “semantic web”, which I have ranted about before). […]