<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.0.6" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Architected Information</title>
	<link>http://www.architected.info/blog</link>
	<description>How people, practices, and information are transformed into relationships and understanding.</description>
	<pubDate>Fri, 24 Apr 2009 15:19:15 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.0.6</generator>
	<language>en</language>
			<item>
		<title>Musings on Metadata and Compliance</title>
		<link>http://www.architected.info/blog/musings-on-metadata-and-compliance</link>
		<comments>http://www.architected.info/blog/musings-on-metadata-and-compliance#comments</comments>
		<pubDate>Fri, 29 Sep 2006 12:10:22 +0000</pubDate>
		<dc:creator>morgan</dc:creator>
		
		<category>Information Quality</category>

		<category>Relationships</category>

		<category>Business Intelligence</category>

		<category>Reporting</category>

		<category>Metadata</category>

		<guid isPermaLink="false">http://www.architected.info/blog/musings-on-metadata-and-compliance</guid>
		<description><![CDATA[Frank Dravis has a new post on metadata and its growing importance in the marketplace.  Dravis wonders how metadata became a subject of interest for non-technical folk &#8230;
In years past, metadata was the domain of data architects. It helped them understand what data they had and how it related to the sources and operations [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://eimblog.businessobjects.com/dravis/" onclick="javascript:urchinTracker ('/outbound/article/eimblog.businessobjects.com');">Frank Dravis</a> has a <a href="http://eimblog.businessobjects.com/dravis/2006/9/26/two-sides-of-the-metadata-coin.html" onclick="javascript:urchinTracker ('/outbound/article/eimblog.businessobjects.com');">new post</a> on <a href="http://www.architected.info/blog/category/relationships/metadata/" >metadata</a> and its <a href="http://eimblog.businessobjects.com/dravis/2006/9/26/two-sides-of-the-metadata-coin.html" onclick="javascript:urchinTracker ('/outbound/article/eimblog.businessobjects.com');">growing importance in the marketplace</a>.  Dravis wonders how metadata became a subject of interest for non-technical folk &#8230;</p>
<blockquote><p>In years past, metadata was the domain of data architects. It helped them understand what data they had and how it related to the sources and operations from which it came and to which it went. At the first mention of metadata business users would roll their eyes and head for the conference room door. Surely metadata was the stuff of arcane IT discussions best had out of earshot of people driving and running the business.</p>
<p>Then metadata management progressed and someone had the silly idea of articulating the business value, the value to the business side of the house, for metadata. The value came from the resolution of an age old problem. A corporate manager is sitting in a conference room looking at their regular monthly sales report and it is different from what they expected based on anecdotal evidence from the field: the numbers are too low.</p></blockquote>
<p>Personally, I think that this recent interest is driven by a few things:</p>
<ol>
<li><a href="http://www.architected.info/blog/market-based-information-architecture" >Regulation</a> and the threat of real penalties for inaccuracies in reporting.  People got interested enough to protect their own hides.</li>
<li>The rise of ERP and BPM in the marketplace.  If everything is in one place then metadata suddenly becomes a lot easier to manage.</li>
</ol>
<p>Truthfully, I wonder how all of this is going to turn out.   I know there are lots of people who want to sell metadata software, but in my experience it takes a lot of resource (time, effort, and expertise) to maintain a comprehensive metadata environment.  The threat of jail time helps to keep people motivated enough to save their necks, but not enough to make something useful.  Being locked into an ERP package can mean the same thing, only it is your data that is locked.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.architected.info/blog/musings-on-metadata-and-compliance/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Case Study &#8212; Statistical Information Quality</title>
		<link>http://www.architected.info/blog/case-study-statistical-information-quality</link>
		<comments>http://www.architected.info/blog/case-study-statistical-information-quality#comments</comments>
		<pubDate>Mon, 14 Aug 2006 13:51:32 +0000</pubDate>
		<dc:creator>morgan</dc:creator>
		
		<category>Databases</category>

		<category>Information Architecture</category>

		<category>Systems Integration</category>

		<category>Information Quality</category>

		<category>Case Studies</category>

		<category>Reporting</category>

		<category>Metadata</category>

		<guid isPermaLink="false">http://www.architected.info/blog/case-study-statistical-information-quality</guid>
		<description><![CDATA[Introduction 
When an organization begins a concerted effort to improve its information quality, often it gets stuck in trying to figure out exactly where to start. This case study takes this to heart and gives a specific example of an approach to improving information quality.
Previously, we had discussed the semantic and statistical approaches to information [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Introduction </strong></p>
<p>When an organization begins a concerted effort to improve its information quality, often it gets stuck in trying to figure out exactly where to start. This case study takes this to heart and gives a specific example of an approach to improving information quality.</p>
<p>Previously, we had discussed the <a href="http://www.architected.info/blog/two-methods-for-defining-information-quality" >semantic and statistical approaches to information quality</a> and linked them to <a href="http://www.architected.info/blog/testing-information-quality" >black box and white box testing</a>.  In addition, there is <a href="http://www.architected.info/blog/getting-started-with-information-quality-1-of-2" >a case study on semantic information quality</a>, which serves as a contrast to this one (you may want to take a look at these if you aren&#8217;t familiar with the subjects).</p>
<p>The example that we have been using is &#8230;</p>
<blockquote><p>dealing with call data for a customer contact center. For simplicity, we can assume that all the call data we need is delivered nightly and is loaded into a single table that looks exactly like the files as they have arrived. This table has the following attributes:</p>
<ul>
<li>employee_login_number</li>
<li>site_name</li>
<li>department_name</li>
<li>call_local_start_time</li>
<li>call_local_end_time</li>
</ul>
<p>&#8230; from this data the business analysts are going to figure out how much to pay and to whom. Also, we need to figure out who is handling the highest call volume (vendors, locations, and employees) on a daily basis so that we can resolve issues and negotiate contracts. Our job is to make sure that the data is accurate enough to do this with confidence.</p></blockquote>
<p>Also, before we get started, realize that with the <a href="http://www.architected.info/blog/getting-started-with-information-quality-1-of-2" >semantic</a> and statistical approaches we are trying to do the same thing in different ways.  So, while we are doing things differently, there is bound to be some overlap.</p>
<p><strong>The Statistical Approach</strong></p>
<p>With a statistical approach, there are several things to consider:</p>
<ol>
<li>From a statistical point of view, there is nothing special about this dataset.  It has very similar characteristics to all the ones that came before it and will come after it. We should try to create an architecture that can be re-used where appropriate.</li>
<li>There is a lot that we can infer from the dataset itself.  We can learn a great deal of information about the dataset very cheaply through <a href="http://www.stickyminds.com/sitewide.asp?Function=edetail&#038;ObjectType=COL&#038;ObjectId=2968" onclick="javascript:urchinTracker ('/outbound/article/www.stickyminds.com');">black box testing</a>.  Focusing on these areas will maximize re-use as well.</li>
<li>We can probably assume that any data that we receive was of reasonably good quality when the process was first designed.  Therefore, we can focus on events where the nature of the data changes substantially.</li>
</ol>
<p>With these in mind, we can start to design a solution.</p>
<p>The place to start is to ask, &#8220;what can go wrong in our data?&#8221;.  I can think of several situations that might impact the quality of this data:</p>
<ul>
<li>The employee_login_number is invalid or NULL.</li>
<li>The site_name is invalid or NULL.</li>
<li>The department_name is invalid or NULL.</li>
<li>The call_local_start_time is invalid or NULL.</li>
<li>The call_local_end_time is invalid, NULL, or earlier than the call_local_start_time.</li>
<li>Due to errors outside of our control, the process that created the data malfunctioned.  Often, this will show up as duplicate values or an irregular frequency or distribution of values.</li>
</ul>
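<p>As a rough illustration, the row-level situations listed above could be checked with something like the following. This is only a sketch: the timestamp format and the function names are assumptions made for the example, and what counts as an &#8220;invalid&#8221; login, site, or department would depend on reference data that is not shown here.</p>

```python
from datetime import datetime

# Column names come from the example table; the timestamp format is an
# assumption made for this sketch.
REQUIRED_COLUMNS = (
    "employee_login_number",
    "site_name",
    "department_name",
    "call_local_start_time",
    "call_local_end_time",
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_time(value):
    """Return a datetime, or None when the value is missing or malformed."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, TIME_FORMAT)
    except (TypeError, ValueError):
        return None


def find_row_problems(row):
    """Return a list of human-readable problems found in one call record."""
    problems = []
    for column in REQUIRED_COLUMNS:
        value = row.get(column)
        if value is None or str(value).strip() == "":
            problems.append(f"{column} is NULL or empty")
    start = parse_time(row.get("call_local_start_time"))
    end = parse_time(row.get("call_local_end_time"))
    if row.get("call_local_start_time") and start is None:
        problems.append("call_local_start_time is not a valid timestamp")
    if row.get("call_local_end_time") and end is None:
        problems.append("call_local_end_time is not a valid timestamp")
    if start and end and end < start:
        problems.append("call_local_end_time is earlier than call_local_start_time")
    return problems


def find_duplicates(rows):
    """Return the records that repeat an earlier record in the same load."""
    seen, duplicates = set(), []
    for row in rows:
        key = tuple(row.get(column) for column in REQUIRED_COLUMNS)
        if key in seen:
            duplicates.append(row)
        seen.add(key)
    return duplicates
```

<p>Returning a list of descriptions, rather than stopping at the first failure, keeps every problem in a record visible in a single pass.</p>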
<p>Off the top of my head, I have a number of questions about the data that we will see day to day:</p>
<ul>
<li>For each column, is there a distinct list of values (call this the domain) that are valid?</li>
<li>For each column, is there a distinct pattern of values that are valid?</li>
<li>For each column, can the values be NULL?</li>
<li>Is there a distinct key?  If so, is it unique?</li>
<li>For column values and keys, should the frequency for particular values be fairly normal?</li>
<li>Is there a certain number of rows that should be expected (by key or for the entire dataset)?</li>
<li>Is there a certain number of keys that should be expected?</li>
<li>For numeric values, can we do descriptive statistics to tell us if things are off-kilter?</li>
</ul>
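<p>Several of these questions can be answered with a small, generic profile of each column. The sketch below is one possible shape for it; the function names and the three-standard-deviation threshold are assumptions for illustration, not a recommendation for any particular dataset.</p>

```python
from collections import Counter
from statistics import mean, pstdev


def profile_column(values):
    """Summarize one column: row count, NULL count, and distinct-value frequencies."""
    present = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        "frequencies": Counter(present),
    }


def is_unusual(todays_value, history, threshold=3.0):
    """Flag a measurement far from the historical mean.

    "Far" means more than `threshold` population standard deviations.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    spread = pstdev(history)
    if spread == 0:
        # Every earlier run produced the same number; any change is notable.
        return todays_value != history[0]
    return abs(todays_value - mean(history)) / spread > threshold
```

<p>For example, <code>is_unusual</code> could be fed the daily row count of the call table together with the row counts from earlier runs, so that a sudden drop is flagged before the data is used.</p>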
<p>Based on these, I think that we can establish a data model that would allow this metadata to be recorded for multiple processes, which would allow it to be used for reporting and decision-making.</p>
<p>For example, consider a table having the following attributes:</p>
<ul>
<li>process_id</li>
<li>process_run_dt</li>
<li>distinct_value</li>
<li>distinct_value_type</li>
<li>distinct_value_count</li>
</ul>
<p>This would allow the user to keep track of how many distinct values there were generated by a given process.  Over time, this could be very useful in tracking down some sticky problems, and perhaps prevent bad data from ever getting into a data store in the first place.</p>
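<p>A minimal sketch of such a table, using SQLite purely for illustration, might look like the following. The table name, the column types, and the reading of distinct_value_type as &#8220;the name of the column the value came from&#8221; are all assumptions on my part.</p>

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE distinct_value_stats (
        process_id           TEXT    NOT NULL,
        process_run_dt       TEXT    NOT NULL,
        distinct_value       TEXT,
        distinct_value_type  TEXT    NOT NULL,  -- assumed: source column name
        distinct_value_count INTEGER NOT NULL
    )
    """
)


def record_distinct_values(conn, process_id, run_dt, value_type, values):
    """Store one row per distinct value seen in a column during a process run."""
    counts = Counter(values)
    conn.executemany(
        "INSERT INTO distinct_value_stats VALUES (?, ?, ?, ?, ?)",
        [(process_id, run_dt, value, value_type, n) for value, n in counts.items()],
    )


def distinct_count_by_run(conn, process_id, value_type):
    """Return (run date, number of distinct values) pairs, oldest run first."""
    return conn.execute(
        """
        SELECT process_run_dt, COUNT(*)
        FROM distinct_value_stats
        WHERE process_id = ? AND distinct_value_type = ?
        GROUP BY process_run_dt
        ORDER BY process_run_dt
        """,
        (process_id, value_type),
    ).fetchall()
```

<p>Because nothing in the table is specific to call data, the same two functions could record site names for one process and product codes for another.</p>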
<p>Each of the measurement processes we mentioned can probably be integrated into the overall data model in a process-agnostic way.  I apologize for not having more details at this point; I plan to move this to the wiki (at some point) and put in a reference model for doing some of these operations.</p>
<p><em>Comparisons With Data Profiling</em></p>
<p>For people with some experience with data management this may sound a lot like <a href="http://en.wikipedia.org/wiki/Data_profiling" onclick="javascript:urchinTracker ('/outbound/article/en.wikipedia.org');">data profiling</a>.  In fact, a lot of the operations inherent in the statistical approach would probably be considered a part of data profiling as well.</p>
<p>However, there are some key differences between Statistical IQ and Data Profiling that need mentioning:</p>
<ol>
<li>Statistical IQ has an operational focus and needs to be as lightweight as possible.  We want to use this to make day-to-day operational decisions about our data without slowing anything down.</li>
<li>Statistical IQ does not include data discovery, while data profiling often does.</li>
<li>One of the core functions of data profiling is establishing relationships between datasets.  Statistical IQ has a very limited view of relationships in order to maximize functionality and reusability.</li>
</ol>
<p>Similar base concepts, focusing on different areas.</p>
<p><em>Statistical IQ and Mad Libs</em></p>
<p>One thing that often gets lost in the re-use discussion is the price of user configuration.  All too often, programmers push too much decision making out of their code and on to the operator, making it difficult to use.</p>
<p>The trick with Statistical IQ is that you have to be able to tie a generic statement (&#8220;there are 15 distinct values in this dataset&#8221;) back to something useful (&#8220;there is probably missing data, don&#8217;t continue the process&#8221;).  While this might seem like a challenge, it can be done without a lot of heartburn.</p>
<p>In a recent engagement, I <a href="http://www.architected.info/blog/making-metadata-pay" >designed a solution</a> where we tied every possible error back to an English description of the problem that was stored in an SQL database.   This was done in a very generic way, so that new errors could be added or removed without any configuration required by the developer or operator.</p>
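<p>To make the idea concrete, here is a hypothetical sketch of a catalog of error descriptions kept in a database table. It is not the design from that engagement; the table, the error codes, and the fill-in-the-blank placeholders are invented for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE error_catalog (
        error_code   TEXT PRIMARY KEY,
        description  TEXT NOT NULL,     -- English text with {placeholders}
        halt_process INTEGER NOT NULL   -- 1 = do not continue the process
    )
    """
)
conn.executemany(
    "INSERT INTO error_catalog VALUES (?, ?, ?)",
    [
        (
            "LOW_DISTINCT_COUNT",
            "There is probably missing data for {column}: expected about "
            "{expected} distinct values but found {actual}.",
            1,
        ),
        ("NULL_VALUES", "{actual} rows have no value for {column}.", 0),
    ],
)


def describe_error(conn, error_code, **details):
    """Look up an error code and return (English message, should_halt)."""
    row = conn.execute(
        "SELECT description, halt_process FROM error_catalog WHERE error_code = ?",
        (error_code,),
    ).fetchone()
    if row is None:
        # An unknown code is treated as serious, since nobody has described it.
        return f"Unknown error code: {error_code}", True
    description, halt_process = row
    return description.format(**details), bool(halt_process)
```

<p>Adding a new kind of error then means inserting a row into the table; the code that runs the checks does not change.</p>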
<p><strong>Conclusion</strong></p>
<p>There are different approaches to information quality, each with their own strengths, weaknesses, and costs.  The statistical approach is cheaper (especially when you factor in Moore&#8217;s Law), but gives a less detailed picture of overall quality.  The <a href="http://www.architected.info/blog/getting-started-with-information-quality-1-of-2" >semantic approach</a> is more expensive, but can be as comprehensive as the situation requires.  A balanced approach will use both approaches to deliver the solution that is needed.
</p>
]]></content:encoded>
			<wfw:commentRss>http://www.architected.info/blog/case-study-statistical-information-quality/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Case Study &#8212; Semantic Information Quality</title>
		<link>http://www.architected.info/blog/getting-started-with-information-quality-1-of-2</link>
		<comments>http://www.architected.info/blog/getting-started-with-information-quality-1-of-2#comments</comments>
		<pubDate>Fri, 04 Aug 2006 20:27:08 +0000</pubDate>
		<dc:creator>morgan</dc:creator>
		
		<category>Information Quality</category>

		<category>Case Studies</category>

		<category>Performance Measurement</category>

		<category>Reporting</category>

		<category>Metadata</category>

		<guid isPermaLink="false">http://www.architected.info/blog/getting-started-with-information-quality-1-of-2</guid>
		<description><![CDATA[When an organization begins a concerted effort to improve its information quality, often it gets stuck in trying to figure out exactly where to start. This case study takes this to heart and gives a specific example of an approach to improving information quality.
Previously, we had discussed the semantic and statistical approaches to information quality [...]]]></description>
			<content:encoded><![CDATA[<p>When an organization begins a concerted effort to improve its information quality, often it gets stuck in trying to figure out exactly where to start. This case study takes this to heart and gives a specific example of an approach to improving information quality.</p>
<p>Previously, we had discussed the <a href="http://www.architected.info/blog/two-methods-for-defining-information-quality" >semantic and statistical approaches to information quality</a> and linked them to <a href="http://www.architected.info/blog/testing-information-quality" >black box and white box testing</a> (you may want to take a look at these if you aren&#8217;t familiar with the subjects, as these are the basis for this article).</p>
<p><strong>The Semantic Approach</strong></p>
<p>A more semantic approach would involve defining exactly what your data represents, and from there determine what it should look like and how it should behave. This sounds pretty easy, right? The problem is that things are often more complicated than they seem.</p>
<p>Let&#8217;s look at an example I ran into on a client engagement, dealing with call data for a customer contact center. For simplicity, we can assume that all the call data we need is delivered nightly and is loaded into a single table that looks exactly like the files as they have arrived.  This table has the following attributes:</p>
<ul>
<li>employee_login_number</li>
<li>site_name</li>
<li>department_name</li>
<li>call_local_start_time</li>
<li>call_local_end_time</li>
</ul>
<p>OK, now from this data the business analysts are going to figure out how much to pay and to whom.  Also, we need to figure out who is handling the highest call volume (vendors, locations, and employees) on a daily basis so that we can resolve issues and negotiate contracts.  Our job is to make sure that the data is accurate enough to do this with confidence.</p>
<p><strong>The Semantic Challenge</strong></p>
<p>The first thing we would need to do is to find out exactly what is going on in the system.  Talking with various people in technology and business units, we can define some basic terms.  In our case, let’s say that we discover:</p>
<ul type="disc">
<li class="MsoNormal">There are multiple contact center locations worldwide and each one has its own &#8220;switch&#8221; with data in its own local time.  All of the locations are owned and operated by vendors.</li>
<li class="MsoNormal">All reporting for management is done in Eastern Time (US), but location and employee reporting should be done in local time.</li>
<li class="MsoNormal">An agent is signified by a login number in the &#8220;switch&#8221; (a piece of telephony equipment).</li>
<li class="MsoNormal">An agent works in a department, which handles a specific type of call.</li>
<li class="MsoNormal">An agent can have multiple logins on the same &#8220;switch&#8221; for different departments that they work in.</li>
<li class="MsoNormal">A &#8220;call&#8221; will be defined by a valid call record, including a start time and end time.</li>
<li class="MsoNormal">Each time a call comes in a record will be created with the login number, start time, and stop time (in local time).</li>
<li class="MsoNormal">Calls to different departments are paid different rates.</li>
</ul>
<p>Realize that this is the tip of the iceberg when it comes to business rules.  There could easily be 100 more concepts and constraints involved in a decent sized business. Also, understand that this was a very rapidly growing (over 100% per year) worldwide business that was intensely focused on customer service. We couldn&#8217;t <a href="http://www.slowleadership.org/2006/07/time-decisions-and-action.html" onclick="javascript:urchinTracker ('/outbound/article/www.slowleadership.org');">ask the business to slow down</a>. But, we still needed to provide data that was of high quality.</p>
<p>From the problem description, we know that there must be mappings between:</p>
<ul>
<li>Logins and agents.</li>
<li>Locations and vendors.</li>
<li>Locations and time zones.</li>
<li>Departments and pay rates.</li>
</ul>
<p><strong>The Semantic Solution</strong></p>
<p>Off the top of my head, there are a number of things that we can do to test this data.  It shouldn&#8217;t be too hard to write SQL that would test the referential integrity of the system.  For example:</p>
<ol>
<li>Join the call data with each of the mappings, noting what records have no matches.</li>
<li>Join the call data with each of the mappings, noting what records have multiple matches.</li>
<li>Look for duplicates in the mapping tables.</li>
<li>Look for newly added or removed values in the mapping tables.</li>
</ol>
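<p>The first three of these tests could be sketched roughly as follows, using SQLite through Python so that the example is self-contained. The table names, the sample rows, and the choice of the login-to-agent mapping are assumptions for illustration; the fourth test needs a snapshot of the previous day&#8217;s mapping and is not shown.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE call_data (employee_login_number TEXT, site_name TEXT);
    CREATE TABLE login_agent_map (employee_login_number TEXT, agent_name TEXT);

    INSERT INTO call_data VALUES
        ('1001', 'Austin'), ('1002', 'Austin'), ('1003', 'Manila');
    INSERT INTO login_agent_map VALUES
        ('1001', 'Pat'), ('1002', 'Lee'), ('1002', 'Sam'),
        ('1004', 'Kim'), ('1004', 'Kim');
    """
)

# Test 1: call records whose login has no match in the mapping.
UNMATCHED_SQL = """
    SELECT c.employee_login_number
    FROM call_data AS c
    LEFT JOIN login_agent_map AS m
           ON m.employee_login_number = c.employee_login_number
    WHERE m.employee_login_number IS NULL
"""

# Test 2: call records whose login matches more than one mapping row.
MULTIPLE_MATCH_SQL = """
    SELECT c.employee_login_number, COUNT(*) AS match_count
    FROM call_data AS c
    JOIN login_agent_map AS m
      ON m.employee_login_number = c.employee_login_number
    GROUP BY c.rowid, c.employee_login_number
    HAVING COUNT(*) > 1
"""

# Test 3: exact duplicate rows inside the mapping table itself.
DUPLICATE_MAPPING_SQL = """
    SELECT employee_login_number, agent_name, COUNT(*) AS copies
    FROM login_agent_map
    GROUP BY employee_login_number, agent_name
    HAVING COUNT(*) > 1
"""


def run_check(conn, sql):
    """Run one integrity query and return the offending rows."""
    return conn.execute(sql).fetchall()
```

<p>With the sample rows above, login 1003 has no mapping, login 1002 maps to two agents, and the mapping for login 1004 is stored twice.</p>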
<p>Next, I would look at some basic validation tests:</p>
<ol>
<li>Each agent should not have more than 3 logins (or some appropriate number) per day.</li>
<li>Each agent should only be listed at one facility per day.</li>
<li>Each agent should only be listed at one vendor per day.</li>
<li>Locations should not disappear or change time zones from day to day.</li>
<li>Vendors should not disappear from day to day.</li>
<li>The Call Start Time should be earlier than the Call Stop Time.</li>
</ol>
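<p>The first two validation tests might be sketched like this. The record layout (agent, date, login, site) assumes that the login-to-agent mapping has already been applied, and the function names are invented for the example.</p>

```python
from collections import defaultdict

MAX_LOGINS_PER_DAY = 3  # "or some appropriate number", as noted in the list


def agents_with_too_many_logins(records, limit=MAX_LOGINS_PER_DAY):
    """Return (agent, date) pairs that used more than `limit` distinct logins.

    `records` is an iterable of (agent, call_date, login_number, site_name).
    """
    logins = defaultdict(set)
    for agent, call_date, login_number, _site in records:
        logins[(agent, call_date)].add(login_number)
    return sorted(key for key, seen in logins.items() if len(seen) > limit)


def agents_at_multiple_sites(records):
    """Return (agent, date) pairs that appear at more than one facility."""
    sites = defaultdict(set)
    for agent, call_date, _login, site in records:
        sites[(agent, call_date)].add(site)
    return sorted(key for key, seen in sites.items() if len(seen) > 1)
```

<p>The vendor test would follow the same pattern once each site has been mapped to its vendor.</p>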
<p>Last, it would be good to write some sanity checks:</p>
<ol>
<li>After daily processing is complete, the total number of calls should exactly match the sum of the number of calls to each site.</li>
<li>After daily processing is complete, the total number of calls to a site should exactly match the sum of the number of calls to each agent at that site.</li>
<li>At all levels, the total amount billed should not be more than the total (number of calls) x (highest billing rate).</li>
<li>The total amount of call time for an agent should not be more than 12 hours in a given day.</li>
</ol>
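<p>These sanity checks are simple enough to sketch directly. The example below assumes that the grand totals and the per-site totals are produced by separate steps of the daily processing (otherwise the comparison proves nothing), and that timestamps arrive as text in the format shown; both are assumptions for illustration.</p>

```python
from collections import defaultdict
from datetime import datetime, timedelta

MAX_CALL_TIME = timedelta(hours=12)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def totals_reconcile(total_calls, calls_per_site):
    """True when the grand total exactly matches the sum of the per-site totals."""
    return total_calls == sum(calls_per_site.values())


def billing_within_ceiling(total_billed, call_count, highest_rate):
    """True when the amount billed does not exceed (number of calls) x (highest rate)."""
    return total_billed <= call_count * highest_rate


def agents_over_daily_limit(calls, limit=MAX_CALL_TIME):
    """Return agents whose summed call time for the day exceeds `limit`.

    `calls` is an iterable of (agent, start, end) for a single day.
    """
    talk_time = defaultdict(timedelta)
    for agent, start, end in calls:
        started = datetime.strptime(start, TIME_FORMAT)
        ended = datetime.strptime(end, TIME_FORMAT)
        talk_time[agent] += ended - started
    return sorted(agent for agent, total in talk_time.items() if total > limit)
```

<p>The same reconciliation function works one level down, comparing the total for a site with the sum of the totals for each agent at that site.</p>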
<p>Now, this is by no means a complete list of tests that should be run, but it gives you a good idea of what can be looked at.</p>
<p><strong>Conclusion</strong></p>
<p>As you can see, ensuring that the information coming out of this process is accurate is sometimes simple, sometimes complex, and sometimes downright daunting.  Most of the solution here requires that custom tests be created, maintained, understood, and reported on (something that we haven&#8217;t even discussed).  This is a lot of work, and customized work that can&#8217;t be easily reused.  This is why in most cases I believe this type of testing is only added after an issue has occurred.</p>
<p>In part 2, we will discuss <a href="http://www.architected.info/blog/case-study-statistical-information-quality" >a statistical approach</a> to the same dataset.
</p>
]]></content:encoded>
			<wfw:commentRss>http://www.architected.info/blog/getting-started-with-information-quality-1-of-2/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Two Methods for Defining Information Quality</title>
		<link>http://www.architected.info/blog/two-methods-for-defining-information-quality</link>
		<comments>http://www.architected.info/blog/two-methods-for-defining-information-quality#comments</comments>
		<pubDate>Tue, 01 Aug 2006 03:03:57 +0000</pubDate>
		<dc:creator>morgan</dc:creator>
		
		<category>Databases</category>

		<category>Information Architecture</category>

		<category>Systems Integration</category>

		<category>Information Quality</category>

		<category>Metadata</category>

		<guid isPermaLink="false">http://www.architected.info/blog/two-methods-for-defining-information-quality</guid>
		<description><![CDATA[In Information Science today there are two competing methods for indexing information: semantics and statistics.  While this may not seem to have a lot to do with information quality, bear with me and I promise I will link them up (eventually).  Both methods do approximately the same job, that is to allow information to be read [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://en.wikipedia.org/wiki/Information_science" onclick="javascript:urchinTracker ('/outbound/article/en.wikipedia.org');">Information Science</a> today there are two competing methods for indexing information: <strong>semantics</strong> and <strong>statistics</strong>.  While this may not seem to have a lot to do with information quality, bear with me and I promise I will link them up (eventually).  Both methods do approximately the same job, that is to allow information to be read and manipulated by machines on a grand scale.  The difference is in how this is done.</p>
<ul>
<li>A <a href="http://www.architected.info/blog/getting-started-with-information-quality-1-of-2" ><strong>semantic approach</strong></a> would have the author define concepts and relationships ahead of time.  You can see some examples in <a href="http://infomesh.net/2001/swintro/" onclick="javascript:urchinTracker ('/outbound/article/infomesh.net');">this tutorial</a>, as they are long and would be difficult to reproduce here.   The <a href="http://en.wikipedia.org/wiki/Semantic_Web" onclick="javascript:urchinTracker ('/outbound/article/en.wikipedia.org');">Semantic Web</a> would be a good example of this methodology.</li>
</ul>
<ul>
<li>A <a href="http://www.architected.info/blog/case-study-statistical-information-quality" ><strong>statistical approach</strong></a> would simply look at the text that was available and try to determine what is there and how it relates to other things through textual analysis and aggregation.  <a href="http://www.google.com" onclick="javascript:urchinTracker ('/outbound/article/www.google.com');">Google</a> is a good example of the use of this approach.</li>
</ul>
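<p>To make the contrast concrete, here is a minimal Python sketch of the two styles.  It is an illustration under my own assumptions, not how the Semantic Web or Google actually work: the triples, the <code>related</code> and <code>term_counts</code> helpers, and the sample sentence are all hypothetical.</p>

```python
import re
from collections import Counter

# Semantic style: the author declares concepts and relationships up front.
# These triples are hypothetical and not taken from any real vocabulary.
TRIPLES = [
    ("Invoice", "is_a", "Document"),
    ("Invoice", "has_field", "customer_id"),
    ("Customer", "has_field", "customer_id"),
]


def related(subject):
    """Return every (predicate, object) pair declared for a subject."""
    return [(p, o) for s, p, o in TRIPLES if s == subject]


# Statistical style: nothing is declared; we only count what the text contains.
def term_counts(text):
    """Count occurrences of each lowercase word in free text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical sample text, for illustration only.
SAMPLE_TEXT = "The invoice lists the customer. The customer pays the invoice."

if __name__ == "__main__":
    print(related("Invoice"))
    print(term_counts(SAMPLE_TEXT).most_common(3))
```

<p>The first half only knows what someone took the time to define, while the second half will happily report on any text you hand it, whether or not the result means anything.</p>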
<p>The semantic way of looking at things is very abstract and much more rigorous.  It says that there is a truth to be represented, it designs a way of doing it, and expects everyone to follow along.  The statistical way of looking at things is much more flexible.  It says that there are things to be gleaned regardless of form, and that we should accept this fact and try to make the best of things.  Not surprisingly, the semantic approach is the favorite of academia and has been under development for many years, while the statistical approach is already in real-world use.</p>
<p>What got me thinking about this in the first place was the latest issue of <a href="http://www.baselinemag.com/" onclick="javascript:urchinTracker ('/outbound/article/www.baselinemag.com');">Baseline</a>.  Specifically, it was <a href="http://www.baselinemag.com/article2/0,1397,1985493,00.asp" onclick="javascript:urchinTracker ('/outbound/article/www.baselinemag.com');">an article</a> from <a href="http://www.strassmann.com/bio.php" onclick="javascript:urchinTracker ('/outbound/article/www.strassmann.com');">Paul A. Strassmann</a> titled &#8220;How Clean Data Can Transform Your Business&#8221;.  Normally Strassmann&#8217;s stuff is pretty good, but it is helpful to note that Strassmann is a senior consultant to the Department of Defense and has been in the business for a long, long, long time.</p>
<p>The crux of his argument was that:</p>
<blockquote><p>The first step in business transformation: enterprisewide standardization of data. That calls for the declaration of a metadata directory as the template for defining data that can circulate within a firm&#8217;s information systems. The policy and implementation of an enforceable metadata directory likely will be resisted by bureaucrats, who see this as a threat to their indispensability. It will not be welcomed by systems developers, contractors and vendors, who prefer to concentrate on upgrading software as a technologically more interesting—and profitable—task.</p></blockquote>
<p>A classic argument for a semantic model of truth.  We just need to get everything defined and then it will be smooth sailing from there.  For most vendors and consultants, the semantic view is the accepted one, probably because it is so structured and logical, although at least partly because all those hours spent defining concepts are billable.  Even Strassmann acknowledges this reality &#8230;</p>
<blockquote><p>To reach agreement on the representation, semantics and taxonomy of data, you will likely go through a painful political process that must be adjudicated by line management. This can get messy because it will reveal that a large percentage of installed software perpetuates incompatible, unreliable, insufficiently secure and delayed information.</p></blockquote>
<p>With this in mind, is <a href="http://www.architected.info/blog/getting-started-with-information-quality-1-of-2" >semantic definition</a> the most efficient way to improve information quality?  Is a <a href="http://www.architected.info/blog/case-study-statistical-information-quality" >statistical definition</a> the most descriptive way to understand information quality?  We will explore the basis for both of these methods in the <a href="http://www.architected.info/blog/testing-information-quality" >next part of this series</a>.
</p>
]]></content:encoded>
			<wfw:commentRss>http://www.architected.info/blog/two-methods-for-defining-information-quality/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Information Quality Pyramid</title>
		<link>http://www.architected.info/blog/106</link>
		<comments>http://www.architected.info/blog/106#comments</comments>
		<pubDate>Tue, 25 Jul 2006 03:18:11 +0000</pubDate>
		<dc:creator>morgan</dc:creator>
		
		<category>Information Architecture</category>

		<category>Information Quality</category>

		<category>Practices</category>

		<category>Metadata</category>

		<guid isPermaLink="false">http://www.architected.info/blog/106</guid>
		<description><![CDATA[I have been working on some more detailed articles for the wiki to help illustrate some ideas about information quality.  While I don&#8217;t want to just duplicate that article here, I thought I would post some things on the blog and get some feedback.
I am currently working on the Information Quality Pyramid, which discusses [...]]]></description>
			<content:encoded><![CDATA[<p>I have been working on some more detailed articles for <a href="http://www.architected.info/wiki/" >the wiki</a> to help illustrate some ideas about <a href="http://www.architected.info/blog/category/practices/quality/" >information quality</a>.  While I don&#8217;t want to just duplicate that article here, I thought I would post some things on the blog and get some feedback.</p>
<p>I am currently working on the <a href="http://www.architected.info/wiki/index.php?title=IQ_Pyramid" ><strong>Information Quality Pyramid</strong></a>, which discusses the various components that go into improving <a href="http://www.architected.info/blog/category/practices/quality/" >information quality</a> across an organization:</p>
<div align="center"><img align="middle" alt="IQ Pyramid" title="IQ Pyramid" src="http://architected.info/wiki/images/9/99/Iq_pyramid.jpg" /></div>
<p>The pyramid is made up of several parts, each of which is important in its own right.  However, the base components (in blue and green) have the interesting combination of being very important, terribly inexpensive, and totally unglamorous.</p>
<p><strong>Understanding Your Organization</strong> – The single most important thing that you can do to ensure success in any information quality effort.  Without a solid understanding of how your organization works it is virtually guaranteed that you will not be able to deliver the solution your customers need.  This (coupled with the need for extreme customization) is one of the reasons that it is very difficult to outsource this type of work.</p>
<p><strong>Architecture and Design Practices –</strong> To put it bluntly, if you build your information architecture in an inconsistent manner then you have to expect inconsistencies in its output.  These inconsistencies become quality-related issues very quickly.  If you can proactively address (or at least mitigate issues around) consistency through your architecture then you can dramatically improve the quality of information that you produce.</p>
<p><strong>Automation – </strong>The key to high-value, high-quality information architecture is automating everything possible.  This is because:</p>
<ol>
<li><a href="http://en.wikipedia.org/wiki/Moore%27s_law" onclick="javascript:urchinTracker ('/outbound/article/en.wikipedia.org');">Moore&#8217;s Law</a> suggests that computing power will roughly double every two years.  It is pretty tough for humans to keep up.</li>
<li>Humans make mistakes.</li>
<li>In order to automate a process, it has to be understood by more than just the designer or the developer.</li>
</ol>
<p><strong><a href="http://www.architected.info/blog/checked-your-sanity-lately" >Sanity Checks</a> – </strong>The easiest and most cost-effective way to catch issues before they become problems.</p>
<p><strong><a href="http://en.wikipedia.org/wiki/Data_profiling" onclick="javascript:urchinTracker ('/outbound/article/en.wikipedia.org');">Data Profiling</a> –</strong> The only way to understand your information is to know your data.  Intimately.  Regularly.  Historically.  Profiling takes a generic look at an arbitrary dataset and discovers important statistical information about it.  Profiling is by far the cheapest and most reliable way of examining data (think of it as an expanded <strong><a href="http://www.architected.info/blog/checked-your-sanity-lately" >sanity check</a></strong>).</p>
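<p>As a rough illustration of what a generic profile might collect, here is a minimal Python sketch.  The <code>profile</code> function, the statistics it gathers, and the sample records are my own assumptions for the example, not a description of any particular profiling tool.</p>

```python
def profile(rows):
    """Summarize each column of a list of dict records.

    Assumes the values within a column are mutually comparable
    (for example all numbers or all strings).
    """
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "rows": len(values),
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return report


# Hypothetical sample data, for illustration only.
ORDERS = [
    {"order_id": 1, "amount": 25.0, "region": "east"},
    {"order_id": 2, "amount": None, "region": "west"},
    {"order_id": 3, "amount": 40.0, "region": "east"},
]

if __name__ == "__main__":
    for column, stats in profile(ORDERS).items():
        print(column, stats)
```

<p>Because nothing in the function knows what an order is, the same code can be pointed at any dataset, and saving each run gives you the historical view as well.</p>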
<p><strong>Process Testing –</strong> Instead of looking at a dataset in a generic way, process testing looks at things in a very specific way.  These should be customized, automated tests that deliver results unique to the process.  Because of the level of customization and effort, this is significantly more expensive than <a href="http://en.wikipedia.org/wiki/Data_profiling" onclick="javascript:urchinTracker ('/outbound/article/en.wikipedia.org');">profiling</a> or <a href="http://www.architected.info/blog/checked-your-sanity-lately" >sanity checks</a>.</p>
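<p>For contrast with a generic profile, here is a minimal Python sketch of a process-specific test.  The business rule (an order total must equal the sum of its line items), the <code>check_order_totals</code> name, the tolerance, and the sample records are all hypothetical; a real test would encode whatever your own process guarantees.</p>

```python
def check_order_totals(orders, tolerance=0.01):
    """Return the ids of orders whose total does not match their line items."""
    failures = []
    for order in orders:
        expected = sum(item["qty"] * item["price"] for item in order["items"])
        if abs(order["total"] - expected) > tolerance:
            failures.append(order["id"])
    return failures


# Hypothetical sample data, for illustration only.
SAMPLE_ORDERS = [
    {
        "id": "A1",
        "total": 30.0,
        "items": [{"qty": 2, "price": 10.0}, {"qty": 1, "price": 10.0}],
    },
    {
        "id": "A2",
        "total": 99.0,
        "items": [{"qty": 3, "price": 5.0}],
    },
]

if __name__ == "__main__":
    print(check_order_totals(SAMPLE_ORDERS))
```

<p>A test like this is useless outside the process it was written for, which is exactly why it costs more to build and maintain than a profile.</p>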
<p><strong>Human Intervention –</strong>  Anything that involves humans, from adjusting processes already in production to performing manual analysis to resolve concerns to creating new code.  Think of it as if all information quality work were outsourced to a third-party company and all personnel costs came directly out of your budget.  This is the true cost of IQ; people just tend to see it in a more abstract sense.</p>
<p>The one category that people might think is missing here is <a href="http://www.architected.info/blog/?s=metadata&#038;submit=Go" >metadata</a>.  I think <a href="http://www.architected.info/blog/?s=metadata&#038;submit=Go" >metadata</a> is an incredibly important part of information quality, but I tend to value it in its most concrete form instead of in the abstract.  I will get into this more in the wiki article.<br />
Any feedback would be most appreciated!</p>
<p><!-- technorati tags begin --></p>
<p style="font-size: 10px; text-align: right"><strong>technorati tags</strong>: <a href="http://technorati.com/tag/information%20architecture" rel="tag"  onclick="javascript:urchinTracker ('/outbound/article/technorati.com');">information architecture</a>, <a href="http://technorati.com/tag/information%20quality" rel="tag"  onclick="javascript:urchinTracker ('/outbound/article/technorati.com');">information quality</a>, <a href="http://technorati.com/tag/data%20quality" rel="tag"  onclick="javascript:urchinTracker ('/outbound/article/technorati.com');">data quality</a>, <a href="http://technorati.com/tag/automation" rel="tag"  onclick="javascript:urchinTracker ('/outbound/article/technorati.com');">automation</a>, <a href="http://technorati.com/tag/metadata" rel="tag"  onclick="javascript:urchinTracker ('/outbound/article/technorati.com');">metadata</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.architected.info/blog/106/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
