Quote for the Week of 2006-08-19

August 16th, 2006 by morgan

“We came out of the meeting with a strong consensus of what direction we needed to go in. The problem is that I don’t think that everyone was agreeing to the same thing.”

– A client who wishes to remain anonymous.

Six Ways to Secure your Architecture

August 16th, 2006 by morgan

Builder.au has a useful article titled, “Six Steps to Secure Sensitive Data in MySQL.” While I am a PostgreSQL fan myself, this article gives a good checklist of no-nonsense steps that can be taken to secure your data. Most of these steps can be applied across databases and even generalized to other systems; it is just good advice.
The thing I like most about the article is that five of the six steps have no visible impact on the users whatsoever. I would argue that the value of an architecture improvement is inversely proportional to the amount of extra effort it takes for a user to get their work done.

On a side rant, something I don’t hear enough about these days is the responsibility of the organization for its IT architecture and policies. Instead, most of the time I hear scapegoating and complaints about users. If it is your architecture, you are in charge of keeping things running safely and securely. It is your users’ responsibility to use your systems efficiently, not to make your life easy.

WWHD (What Would A Hacker Do)?

August 15th, 2006 by morgan

Slashdot had an interesting thread on creative responses to security threats. While the article itself was about wireless networking, the conversation that followed was very thought-provoking (perhaps even inspiring).

The problem was that outsiders were trying to piggy-back on people’s private wireless networks. The solutions ranged from scary to annoying to hilarious. Instead of taking the enterprise route (find a vendor, spend some money, integrate the system), it is interesting to see what a web hacker would do in the same situation.

The solution I liked the most was very subtle:

  1. Figure out who was an unauthorized user.
  2. Analyze their web traffic and figure out what images are being requested.
  3. Modify the incoming images so that they are blurry or inverted.
  4. Watch them pull their hair out when they think that their graphics driver is corrupted or their hardware is broken.

This is a great response, and one that a lot of “professionals” wouldn’t even consider. This is mostly because the way we deal with threats is to take a lot of time and effort and spend a lot of money to ensure that nothing bad ever happens. Instead, the hacker approach just makes it very annoying for someone to do something they shouldn’t be doing anyway.

This just goes to show that addressing architecture issues by focusing on the people issues takes less time, less money, less effort, and is a lot more fun. Is there anything in your information architecture that could benefit from a more people-centered focus?

Case Study — Statistical Information Quality

August 14th, 2006 by morgan

Introduction

When an organization begins a concerted effort to improve its information quality, it often gets stuck trying to figure out exactly where to start. This case study takes that problem to heart and gives a specific example of an approach to improving information quality.

Previously, we discussed the semantic and statistical approaches to information quality and linked them to black box and white box testing. In addition, there is a case study on semantic information quality that serves as a contrast to this one (you may want to take a look at these if you aren’t familiar with the subjects).

The example that we have been using deals with call data for a customer contact center. For simplicity, we can assume that all the call data we need is delivered nightly and is loaded into a single table that looks exactly like the files as they arrived. This table has the following attributes:

  • employee_login_number
  • site_name
  • department_name
  • call_local_start_time
  • call_local_end_time

… from this data the business analysts are going to figure out how much to pay and to whom. Also, we need to figure out who is handling the highest call volume (vendors, locations, and employees) on a daily basis so that we can resolve issues and negotiate contracts. Our job is to make sure that the data is accurate enough to do this with confidence.

Also, before we get started, realize that with the semantic and statistical approaches we are trying to do the same thing in different ways. So, while we are doing things differently, there is bound to be some overlap.

The Statistical Approach

With a statistical approach, there are several things to consider:

  1. From a statistical point of view, there is nothing special about this dataset. It has very similar characteristics to all the ones that came before it and will come after it. We should try to create an architecture that can be re-used where appropriate.
  2. There is a lot that we can infer from the dataset itself. We can learn a great deal of information about the dataset very cheaply through black box testing. Focusing on these areas will maximize re-use as well.
  3. We can probably assume that any data that we receive was of reasonably good quality when the process was first designed. Therefore, we can focus on events where the nature of the data changes substantially.

With these in mind, we can start to design a solution.

The place to start is to ask, “What can go wrong in our data?” I can think of several situations that might impact the quality of this data:

  • The employee_login_number is invalid or NULL.
  • The site_name is invalid or NULL.
  • The department_name is invalid or NULL.
  • The call_local_start_time is invalid or NULL.
  • The call_local_end_time is invalid, NULL, or earlier than the call_local_start_time.
  • Due to errors outside of our control, the process that created the data malfunctioned. Often, this will show up as duplicate values or as an irregular frequency or distribution of values.
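The row-level situations in this list can be sketched as a small Python check. This is only an illustration under stated assumptions: the timestamp layout, the function names, and the idea of passing in sets of valid sites and departments are mine, not part of the original design. Since the post does not say what makes a login number valid, the sketch only tests that one is present.

```python
from datetime import datetime

# Assumed timestamp layout; the post does not specify one.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_time(value):
    """Return a datetime, or None when the value is missing or malformed."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, TIME_FORMAT)
    except (TypeError, ValueError):
        return None


def check_row(row, valid_sites, valid_departments):
    """Return a list of problem descriptions for one call record (a dict)."""
    problems = []
    # Only presence is checked here; validity rules for logins are unknown.
    if row.get("employee_login_number") in (None, ""):
        problems.append("employee_login_number is NULL")
    if row.get("site_name") not in valid_sites:
        problems.append("site_name is invalid or NULL")
    if row.get("department_name") not in valid_departments:
        problems.append("department_name is invalid or NULL")

    start = parse_time(row.get("call_local_start_time"))
    end = parse_time(row.get("call_local_end_time"))
    if start is None:
        problems.append("call_local_start_time is invalid or NULL")
    if end is None:
        problems.append("call_local_end_time is invalid or NULL")
    if start is not None and end is not None and end < start:
        problems.append("call_local_end_time is earlier than call_local_start_time")
    return problems
```

The last situation in the list (a malfunctioning upstream process) cannot be seen one row at a time; it only shows up when the dataset is examined as a whole.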

Off the top of my head, I have a number of questions about the data that we will see day to day:

  • For each column, is there a distinct list of values (call this the domain) that are valid?
  • For each column, is there a distinct pattern of values that are valid?
  • For each column, can the values be NULL?
  • Is there a distinct key? If so, is it unique?
  • For column values and keys, should the frequency for particular values be fairly normal?
  • Is there a certain number of rows that should be expected (by key or for the entire dataset)?
  • Is there a certain number of keys that should be expected?
  • For numeric values, can we do descriptive statistics to tell us if things are off-kilter?
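Several of these questions can be answered with a few lines of Python and no knowledge of what the columns mean. The sketch below is a rough illustration, not a reference model: the function names and the three-standard-deviation threshold are assumptions chosen for the example.

```python
from collections import Counter
from statistics import mean, pstdev


def profile_column(values):
    """Summarize one column: row count, NULL count, and distinct values."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "most_common": counts.most_common(3),
    }


def key_is_unique(rows, key_columns):
    """Return True when no two rows (dicts) share the same key values."""
    keys = [tuple(row[c] for c in key_columns) for row in rows]
    return len(keys) == len(set(keys))


def is_off_kilter(history, todays_value, threshold=3.0):
    """Flag a value that sits more than `threshold` standard deviations
    from the mean of earlier runs. With fewer than two earlier runs there
    is nothing to compare against, so nothing is flagged."""
    if len(history) < 2:
        return False
    spread = pstdev(history)
    if spread == 0:
        return todays_value != history[0]
    return abs(todays_value - mean(history)) / spread > threshold
```

A nightly row count could be checked with something like `is_off_kilter(previous_row_counts, tonight_row_count)`, where both names are placeholders for whatever history is kept.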

Based on these, I think that we can establish a data model that would allow this metadata to be recorded for multiple processes, which would allow it to be used for reporting and decision-making.

For example, consider a table having the following attributes:

  • process_id
  • process_run_dt
  • distinct_value
  • distinct_value_type
  • distinct_value_count

This would allow the user to keep track of how many distinct values were generated by a given process. Over time, this could be very useful in tracking down some sticky problems, and perhaps prevent bad data from ever getting into a data store in the first place.
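As a minimal sketch of how such a table might be filled and queried, here is a version using Python's built-in sqlite3 module. The table name `distinct_value_stats`, the column types, and both function names are assumptions made for the example; only the five attribute names come from the list above.

```python
import sqlite3
from collections import Counter


def record_distinct_values(conn, process_id, run_dt, value_type, values):
    """Store one row per distinct value seen in this run of the process."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS distinct_value_stats (
               process_id           TEXT,
               process_run_dt       TEXT,
               distinct_value       TEXT,
               distinct_value_type  TEXT,
               distinct_value_count INTEGER
           )"""
    )
    counts = Counter(values)
    conn.executemany(
        "INSERT INTO distinct_value_stats VALUES (?, ?, ?, ?, ?)",
        [(process_id, run_dt, value, value_type, n) for value, n in counts.items()],
    )
    conn.commit()


def distinct_counts_by_run(conn, process_id, value_type):
    """Return (run date, number of distinct values) pairs, oldest run first."""
    cursor = conn.execute(
        """SELECT process_run_dt, COUNT(*)
             FROM distinct_value_stats
            WHERE process_id = ? AND distinct_value_type = ?
            GROUP BY process_run_dt
            ORDER BY process_run_dt""",
        (process_id, value_type),
    )
    return cursor.fetchall()
```

Here `value_type` would hold a column name such as `site_name`. A sudden drop in the number of distinct sites from one night to the next is the kind of change this table makes easy to spot.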

Each of the measurement processes we mentioned can probably be integrated into the overall data model in a process-agnostic way. I apologize for not having more details at this point; I plan to move this to the wiki (at some point) and put in a reference model for doing some of these operations.

Comparisons With Data Profiling

For people with some experience in data management, this may sound a lot like data profiling. In fact, a lot of the operations inherent in the statistical approach would probably be considered a part of data profiling as well.

However, there are some key differences between Statistical IQ and Data Profiling that need mentioning:

  1. Statistical IQ has an operational focus and needs to be as lightweight as possible. We want to use this to make day-to-day operational decisions about our data without slowing anything down.
  2. Statistical IQ does not include data discovery, while data profiling often does.
  3. One of the core functions of data profiling is establishing relationships between datasets. Statistical IQ has a very limited view of relationships in order to maximize functionality and reusability.

Similar base concepts, focusing on different areas.

Statistical IQ and Mad Libs

One thing that often gets lost in the re-use discussion is the price of user configuration. All too often, programmers push too much decision making out of their code and onto the operator, making the software difficult to use.

The trick with Statistical IQ is that you have to be able to tie a generic statement (“there are 15 distinct values in this dataset”) back to something useful (“there is probably missing data, don’t continue the process”). While this might seem like a challenge, it can be done without a lot of heartburn.

In a recent engagement, I designed a solution where we tied every possible error back to an English description of the problem that was stored in a SQL database. This was done in a very generic way, so that new errors could be added or removed without any configuration required by the developer or operator.
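The details of that solution are not given here, so the following is only a guess at the general shape of the idea: a rule table that pairs a measurement and its acceptable range with an action and a plain-English description. The table name `iq_rule`, its columns, and the function names are all hypothetical.

```python
import sqlite3


def create_rule_table(conn):
    """Create the (hypothetical) rule table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS iq_rule (
               metric      TEXT,
               min_value   REAL,   -- NULL means no lower bound
               max_value   REAL,   -- NULL means no upper bound
               action      TEXT,
               description TEXT
           )"""
    )


def add_rule(conn, metric, min_value, max_value, action, description):
    """Register a rule; new rules are rows of data, not code changes."""
    conn.execute(
        "INSERT INTO iq_rule VALUES (?, ?, ?, ?, ?)",
        (metric, min_value, max_value, action, description),
    )


def evaluate(conn, metric, value):
    """Return (action, description) for every rule the measurement violates.

    A comparison against a NULL bound is never true in SQL, so a NULL
    bound simply never triggers.
    """
    cursor = conn.execute(
        """SELECT action, description
             FROM iq_rule
            WHERE metric = ? AND (? < min_value OR ? > max_value)""",
        (metric, value, value),
    )
    return cursor.fetchall()
```

With a rule such as "fewer than 40 distinct site names means stop", a measurement of 15 comes back with the action and the English explanation attached, and the operator never has to interpret the raw number.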

Conclusion

There are different approaches to information quality, each with its own strengths, weaknesses, and costs. The statistical approach is cheaper (especially when you factor in Moore’s Law), but gives a less detailed picture of overall quality. The semantic approach is more expensive, but can be as comprehensive as the situation requires. A balanced solution uses both approaches to deliver what is needed.

The Value of Versatility

August 13th, 2006 by morgan

Computerworld has a great article about job skills and the marketplace. While I normally don’t like the “what’s hot/what’s not” angle, Stacy Collett was right on the money when she wrote that:

The most sought-after corporate IT workers in 2010 may be those with no deep-seated technical skills at all. The nuts-and-bolts programming and easy-to-document support jobs will have all gone to third-party providers in the U.S. or abroad. Instead, IT departments will be populated with “versatilists” — those with a technology background who also know the business sector inside and out, can architect and carry out IT plans that will add business value, and can cultivate relationships both inside and outside the company.

eWeek also chimed in with “Building the Perfect IT Person.”  Deborah Rothberg opines that …

“The old model of IT doesn’t work anymore,” said Steve Novak, CIO at Kirkland & Ellis, a Chicago-based law firm.

While that model is still being sorted out, Novak, along with other CIOs interviewed by eWEEK, is on the lookout for the holy grail—a designer IT person who can adapt and thrive in changing environments and still remain valuable.

These articles really confirmed my own thoughts on the workplace, based on experience both as a consultant and as an employee. However, the argument carries a lot more weight when written in a trade magazine and backed up with statistics from Gartner.

If I had one piece of advice to give to a person graduating college and looking to pursue a career in IT, it would be to always seek to increase your value to the organization that you are working for (or with). Hard-core technical skills are important, and useful, and valuable, but only in the short term. Becoming a solution builder means so much more.
