Two Methods for Defining Information Quality
July 31st, 2006 by morgan
In Information Science today there are two competing methods for indexing information: semantics and statistics. While this may not seem to have a lot to do with information quality, bear with me and I promise I will link them up (eventually). Both methods do approximately the same job, that is, to allow information to be read and manipulated by machines on a grand scale. The difference is in how this is done.
- A semantic approach would have the author define concepts and relationships ahead of time. The examples are long and would be difficult to reproduce here, but you can see some in this tutorial. The Semantic Web would be a good example of this methodology.
- A statistical approach would simply look at the text that was available and try to determine what is there and how it relates to other things through textual analysis and aggregation. Google is a good example of the use of this approach.
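To make the contrast between the two bullets concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions: the triples, the sample documents, the stopword list and the function names are all invented for this post, and neither half resembles what the Semantic Web or Google actually runs. The semantic half answers "what is related to X?" by looking up relationships someone declared ahead of time; the statistical half answers the same question by counting which words tend to show up together in whatever text is available.

```python
from collections import Counter
from itertools import combinations

# Semantic approach: concepts and relationships are declared up front.
# These triples are made up for illustration.
TRIPLES = {
    ("Invoice", "is_a", "Document"),
    ("Invoice", "has_field", "customer_id"),
    ("Customer", "identified_by", "customer_id"),
}

def related_semantic(concept):
    """Return everything linked to `concept` by a declared relationship."""
    related = set()
    for subject, _predicate, obj in TRIPLES:
        if subject == concept:
            related.add(obj)
        elif obj == concept:
            related.add(subject)
    return related

# Statistical approach: no definitions, just whatever text is available.
# The documents and stopword list are also made up for illustration.
DOCS = [
    "the invoice lists the customer and the amount due",
    "each customer receives an invoice every month",
    "the report covers the month",
]
STOPWORDS = {"the", "and", "an", "each", "every"}

def cooccurrence(docs):
    """Count how many documents each pair of words appears in together."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.split()) - STOPWORDS)
        counts.update(combinations(words, 2))
    return counts

def related_statistical(term, docs, min_count=2):
    """Return words that co-occur with `term` in at least `min_count` documents."""
    related = set()
    for (a, b), n in cooccurrence(docs).items():
        if n >= min_count and term in (a, b):
            related.add(b if a == term else a)
    return related

print(related_semantic("Invoice"))
print(related_statistical("invoice", DOCS))
```

Notice that the semantic lookup only knows what it was told, while the statistical one only knows what happened to be written down. That trade-off is the heart of the comparison.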
The semantic way of looking at things is very abstract and much more rigorous. It says that there is a truth to be represented, it designs a way of doing it, and expects everyone to follow along. The statistical way of looking at things is much more flexible. It says that there are things to be gleaned regardless of form, and that we should accept this fact and try to make the best of things. Not surprisingly, the semantic approach is the favorite of academia and has been under development for many years, while the statistical approach is already in real-world use.
What got me thinking about this in the first place was the latest issue of Baseline. Specifically, it was an article from Paul A. Strassman titled, “How Clean Data Can Transform Your Business”. Normally Strassman’s stuff is pretty good, but it is helpful to note that Strassman is a senior consultant to the Department of Defense and has been in the business for a long, long, long time.
The crux of his argument was that:
The first step in business transformation: enterprisewide standardization of data. That calls for the declaration of a metadata directory as the template for defining data that can circulate within a firm’s information systems. The policy and implementation of an enforceable metadata directory likely will be resisted by bureaucrats, who see this as a threat to their indispensability. It will not be welcomed by systems developers, contractors and vendors, who prefer to concentrate on upgrading software as a technologically more interesting—and profitable—task.
A classic argument for a semantic model of truth. We just need to get everything defined and then it will be smooth sailing from there. For most vendors and consultants, the semantic view is the accepted one, probably because it is so structured and logical, although at least partially because all those hours spent defining concepts are billable. Even Strassman acknowledges this reality …
To reach agreement on the representation, semantics and taxonomy of data, you will likely go through a painful political process that must be adjudicated by line management. This can get messy because it will reveal that a large percentage of installed software perpetuates incompatible, unreliable, insufficiently secure and delayed information.
With this in mind, is semantic definition the most efficient way to improve information quality? Is a statistical definition the most descriptive way to understand information quality? We will explore the basis for both of these methods in the next part of this series.