Spreadsheets, Architecture, and Compliance

January 9th, 2007 by morgan

The Sarbanes-Oxley Compliance Journal has a very detailed article discussing the implications of spreadsheets on compliance and financial regulation. It is a well thought out, well written piece that gives very specific, very conservative steps to ensure that an organization doesn’t run into regulatory issues from day-to-day business practices.

One thing I appreciated in the article was its realistic tone, which recognized the existence of shadow systems and the role that they play in the real world. While the article focused on spreadsheets, this doesn’t mean that they are bad and that centralized CRM is good (as a matter of fact I might argue the opposite in many cases). The real issue being discussed is the risk that occurs when information architecture doesn’t match the needs of the organization.

Looking Globally

For many low-importance, one-off operations, a spreadsheet is fine. It can be shared easily, and is usually owned by one person. However, what normally happens is that someone sharp (and usually not someone in IT) decides to try to build some infrastructure around a spreadsheet without considering the consequences. It is quick, cheap, and easy to do, to a point. The problem is that the point when a spreadsheet becomes unmanageable is often well after the point where an organization depends on its output to function effectively.

For any system, an information architect needs to consider:

  • Cost
  • Effectiveness
  • Visibility
  • Traceability
  • Manageability
  • Quality
  • Auditability

Furthermore, the situation needs to be considered from a forward-thinking perspective. That is, we need to try to understand how the landscape is going to look a few years down the road, and to make our systems flexible enough so that we don’t put our organization into a bind if we are wrong.

Spreadsheets are normally built with only cost and effectiveness in mind, and the gaps in the other areas are only discovered after time has gone on. It is often the case that a system would be much more effective living in a database, BI tool, or custom software application when considered over time.

EC2 — Dynamic or Static (or Both)?

December 1st, 2006 by morgan

A Problem

I have been working more and more with AWS and EC2, and one of the challenges in working with EC2 is dealing with the fact that each instance gets a dynamic IP address upon creation. This makes it easy to crank out a large number of instances, which is a key feature of the system. At the same time, it makes it difficult to find and manage those instances in an automated, systematic way. So, there is a disconnect here.

A Solution

A decent solution is Dynamic DNS. That is, your instance is assigned an easily recognized hostname as it is being started, but still keeps the dynamic IP address it got at creation. To me, this seems like the best solution, as it lets the AWS folks be good at what they are good at (providing cool technology infrastructures) and allows their users to be good at what they are good at (making cool applications that use the infrastructure).

How can this be done? Well, it takes a couple of steps:

  1. Establish an account with one of the myriad DDNS providers.
  2. Configure your instance to use the DDNS software upon boot-up.
  3. Profit!

I have gotten some feedback about possibly using SQS to do something similar to this. I actually thought of this, but there are issues here around cost (cost per message) and configuration (duplicates and ordering). Because I would like to maximize the system’s reliability and scalability, I would probably rule it out.

Which Brings Another Problem …

There is a caveat here, and it isn’t a small one if you don’t want to write code. The big problem is that every instance you create is going to behave identically. So, if you open up multiple instances each thinking they are ‘dynamic-name-1..com’ then you will not get the results you are looking for. Instead, most likely that name will be assigned to the last instance that started and the others will run around headless.

Which Requires a Hack …

This leaves a couple of options. First, you can write a script to go through your pool of potential host names, pick one that isn’t being used, and then request that name from the DDNS provider. Better yet is to pass the Dynamic DNS value you want to the instance through the keypair.
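The first option can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the function name pick_free_hostname and the host names are made up, and the list of names already in use is passed in as a plain string, where a real script would have to get it from the DDNS provider or from DNS lookups.

```shell
#!/bin/sh
# Sketch only: pick the first host name from a pool that is not already taken.
# The pool and the "in use" list are hypothetical inputs; a real script would
# get the in-use names from the DDNS provider or from DNS lookups.

# pick_free_hostname "<pool names>" "<names in use>"
# Prints the first pool name that is not in use; returns 1 if none is free.
pick_free_hostname() {
    pool=$1
    in_use=$2
    for name in $pool; do
        taken=no
        for used in $in_use; do
            if [ "$name" = "$used" ]; then
                taken=yes
                break
            fi
        done
        if [ "$taken" = no ]; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1
}

# Example call (names are placeholders):
#   pick_free_hostname "node-1 node-2 node-3" "node-1"
```

Note that two instances booting at the same moment could still pick the same name, since nothing here locks the pool. That race is one reason passing the name in from outside is the better option.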

Of course, the best solution would be for someone to write a script that makes each node configure itself to get a dynamic hostname. I would think this would be attractive to most of the DDNS providers; perhaps one of them will read this and get cracking. If I don’t spot anything in the near future I will probably write a simple KSH script to do this.

Conclusion

I think that at some point the folks at AWS are going to allow for some type of host identification, either through passing parameters to instances or by renting out static IP addresses or subdomain ranges. However, for now it will take some DIY in order to make this happen.

To be honest, probably the best solution here is to approach the problem as if you will never have a static IP address or host name and go from there. It will probably force you to think about your solution differently and challenge you to come up with a more flexible, scalable solution. It isn’t going to fit for every type of problem, but I think it works for the types of things that AWS is inherently good at.

Scripting EC2

October 30th, 2006 by morgan

I wrote a shell script that will automate the creation of EC2 instances. Really, all it does is glue together the existing command line tools that Amazon provides in a fairly crude manner. However, it works and it makes my life easier, so I like it. Hopefully it can do the same for you ….

Things You MUST Understand

  • Every time you run this script you will be charged for at least one hour’s worth of time by Amazon, even if you shut things down immediately. These aren’t my rules, I have complained about them previously.
  • Make sure you use the ec2-terminate-instances script after running this script. If you don’t, you will be charged by Amazon until you shut the instance down.
  • You have to have the EC2 API Tools installed for this script to work.
  • You have to have an active EC2 account for this script to work.

Caveats

  • This script hasn’t been tested by anyone other than me. It works just fine for me, and I am able to use it on a regular basis. However, if you are looking for a polished, documented, or commercial product then this is the wrong place for you (unless you are willing to pay, of course).
  • The script is for UNIX, and was developed under Mac OS X. It may run under Windows with Cygwin, although I haven’t tried (and don’t really want to).
  • The script is in Korn. It doesn’t use any particularly odd syntax, so my guess is that it would run under other shells. To be honest I haven’t tried, simply because I use KSH so much with my current work that I am far more efficient with it than any other shell. If anyone wants to test it and/or clean it up to run more universally it would be most appreciated.

Licensing

This script is provided under the MIT license. To summarize, it is provided as-is and can be used free of charge. I am not liable for your screw-ups.

OK, with all that being said, if you still want to use it you can download the script here.

Parameterized vs. Generic

July 17th, 2006 by morgan

Have you ever come across a process that is so parameterized that it has become generic? A programmer takes all the things that help their code to make logical decisions and pushes them out to the user. The process then becomes a shell that is all based around user-supplied information, such as command line arguments or a configuration file. At best, it is minimalism run amok.

For example, I came across a shell script that looked like this:

#!/bin/ksh

$1/$2 $(echo $* | sed "s/$1//g" | sed "s/$2//g")

For those of you who aren’t UNIX people, this script takes a number of arguments, the first two being a path and a program name. The script then pastes them together and executes them along with any extra arguments that might have been provided.

This example would be run like:

test /bin ls $HOME

Of course, you could just run the command

ls $HOME

And get the same result.

This example is extremely silly, as it really doesn’t have any value to the person who is calling it. As a matter of fact, it is less than valuable, as it makes the person calling the program do something they wouldn’t normally do (split a command line into two pieces). What has been done here is that all the effort that it takes to execute the program has been pushed out to the person calling the program.

A parameterized process will allow certain parts of its execution to change based on well-defined, well-controlled input from the user. It provides value by allowing the user to do things faster or accomplish things they couldn’t otherwise do easily. A generic process is one that has taken parameterization too far. It is merely a container for executing user logic that could be better done elsewhere.
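For contrast with the pass-through script above, here is a minimal sketch of what a parameterized wrapper might look like. The name count_by_ext and its behavior are made up for illustration. The caller supplies two well-defined values, and the validation and the combination of find and wc stay inside the function.

```shell
#!/bin/sh
# Sketch only: a small, parameterized wrapper (hypothetical name count_by_ext).
# The caller supplies two well-defined parameters; the logic for validating
# them and for combining find and wc stays inside the function.

# count_by_ext <directory> <extension>
# Prints the number of regular files under <directory> ending in .<extension>.
count_by_ext() {
    if [ $# -ne 2 ]; then
        echo "usage: count_by_ext <directory> <extension>" >&2
        return 2
    fi
    dir=$1
    ext=$2
    if [ ! -d "$dir" ]; then
        echo "count_by_ext: not a directory: $dir" >&2
        return 1
    fi
    # Accept only letters and digits so the extension cannot smuggle in
    # wildcards or extra find expressions.
    case $ext in
        ''|*[!A-Za-z0-9]*)
            echo "count_by_ext: bad extension: $ext" >&2
            return 1
            ;;
    esac
    find "$dir" -type f -name "*.$ext" | wc -l | tr -d ' '
}

# Example use in a loop (directory names are placeholders):
#   for d in src docs; do count_by_ext "$d" txt; done
```

The user gives up the ability to pass arbitrary find expressions, and in exchange gets something that is easy to remember, easy to run in a loop, and hard to misuse.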

Often, people new to data-centric programming misguidedly try to apply the principles of object-oriented programming to their work. The problem is, you end up with programs that are generic instead of parameterized. Writing good code is a matter of making tools that allow your users to be as productive and flexible as possible. Normally, this involves a combination of user parameters and internal logic to build something coherent and truly useful.

Here are some rules to see if your processes are in the sweet spot.

A process is well parameterized if …

  1. It simplifies the use and understanding of another tool or combination of tools.
  2. It can easily be run in a loop from the command line (in whatever operating system you use).
  3. It works well with the environment specific features of your operating system, such as pipes and redirection in UNIX.
  4. It is designed to run on a variety of machines (but not any possible one) without much effort.

A process is probably generic if …

  1. It absolutely requires a GUI in order to execute.
  2. There are more command line options than you can easily remember.
  3. The man page (for UNIX tools) is more than the user can comfortably read in one sitting.
  4. The primary logic of the process is contained outside the process itself.
  5. There is no reason to contact you if there is a problem with the process, as it can all be attributed to either the user configuration or the underlying system.
  6. It takes more effort to use the process than it did to develop it.

Remember, there is no free ride in the data life cycle. The logic and effort have to be done somewhere. The only reason to develop tools is to make the users or systems more productive in the long term.

Don’t put a white label and bar-code on your processes. Add value!

technorati tags:information architecture, parameterization, parameters

Process Meta-Usability

July 6th, 2006 by morgan

Something I have been thinking about a lot lately is something I like to call process meta-usability. I define this as:

Process meta-usability: The degree to which a process is able to be operated and understood by those who design, develop, execute, and troubleshoot it.

Process meta-usability is probably a subset of the field of usability. However, with most data-oriented processes (such as ETL and databases), the proof is in the pudding. That is, the people who are actually consuming the data have no idea what went to making it happen, and the old quote about sausage and politics probably applies.

This brings up some subtleties that we need to consider:

  • Most people who design data-oriented processes either don’t know or don’t care about usability. They aren’t normally impacted by it at all.
  • Most people who operate data-oriented processes are too busy to care about usability, but are greatly impacted by it.
  • Most people who sponsor data-oriented processes don’t see the need to improve usability, and are impacted by it unknowingly.

To design for meta-usability, a data-oriented process needs to consider:

  1. Automation – CPU cycles, disk sectors and memory chips are cheap and reliable. Humans are not. Consider the cost of Moore’s Law vs. the cost of employee turnover for the life cycle of your process.
  2. Human intervention – Trap and deal with every possible situation in the design phase. Also, when a human does have to be involved, make it easy for them to understand what is going on at a glance. This probably means documentation, standards, naming conventions, log files, metadata, the works. This isn’t as hard as it sounds, there are some easy steps that you can take to make things run more smoothly for your friendly neighborhood operator.
  3. Understandability – A simple process is always better than a complicated one, readable source is always better than faster code. Again, consider standards, naming conventions, and all that stuff that developers hate to consider.
  4. Communication – If everyone knows what is going on, it is a lot easier to make things happen. If a process that is supposed to be automated is actually taking someone 10 hours a week to keep running then it is easier to justify spending 40 hours to fix the issue.
  5. Downtime – During the design phase, try to calculate the cost per incident of downtime for your data process. If you don’t have any hard numbers, consider the business opportunity cost plus the cost per hour of having an analyst, developer, DBA, and operator all on a conference call trying to figure out what is going on. Compare this with the cost of developing processes to avoid or mitigate downtime and see what side of the curve you want to be on.
  6. Staffing – Consider that for any data-oriented system (like feeding a shadow system, data mart, or data warehouse) the bottleneck is most often keeping the overall system running effectively. If it takes longer to resolve issues with individual processes then fewer of them can be run at the same staffing level. This isn’t because of a technological limit, but because there literally aren’t enough hours in a day.
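
One of the easy steps mentioned under human intervention is a consistent log file. Here is a minimal sketch of a step wrapper that records what ran, when, and whether it worked. The names run_step, log_line, and LOG_FILE, and the step names in the example, are all made up for illustration.

```shell
#!/bin/sh
# Sketch only: wrap each step of a data process so that the log shows, at a
# glance, what ran, when, and whether it worked. LOG_FILE and the step names
# are hypothetical.

LOG_FILE=${LOG_FILE:-./process.log}

# log_line <status> <text>: append one timestamped line to the log.
log_line() {
    printf '%s %-6s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$2" >> "$LOG_FILE"
}

# run_step <step-name> <command> [args...]
# Runs the command, records START and then OK or FAILED (with the exit code),
# and returns the command's exit status to the caller.
run_step() {
    step=$1
    shift
    log_line START "$step"
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        log_line OK "$step"
    else
        log_line FAILED "$step (exit code $status)"
    fi
    return "$status"
}

# Example use (step names and commands are placeholders):
#   run_step load_customers ./load_customers.sh || exit 1
```

Because every step goes through the same wrapper, an operator can search the log for FAILED and see the step name and exit code without reading the code that produced it.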