A while back I thought I would sit down and spend a bit of time writing about how easy it was to roll your own cluster. I would do it myself, take a few hours to get things up and running, then record my thoughts for posterity. Well, it took a lot longer than I thought it would, but here we are …
Now, don’t get me wrong, it isn’t that anything about EC2 is particularly difficult, if you are a techie and are comfortable with the UNIX side of things. And, once you get things up and running it is very easy to use the infrastructure reliably and repeatably, just like the marketing hype says. However, getting to that point is harder than you might think.
How Do I Get Started?
If you are thinking about establishing a presence on EC2, the first thing you need to understand is persistence. Or, more accurately, the lack of it. An EC2 server forgets (or at least should forget) everything about itself every time it is rebooted, and has to be told what to load as it starts. This is very different from the standard desktop computing paradigm, where a reboot costs you only whatever you had not saved. Not so with EC2.
Personally, I like to think of an EC2 server as the main character from Memento, who forgets everything that happened the day before whenever he goes to sleep. If you haven’t seen the movie, a good analogy would be carrying a CD-ROM around with you, plugging it into any PC you see, and doing your work from there.
This means that:
- You have to figure out what you need before the machine is rebooted and make an “image” of what your machine looks like ahead of time.
- You can’t change your configuration very easily, as you will have to update your “image” in order to keep the changes for the future.
- You can’t really store recent data in your “image”, as it will all go away the next time you restart.
Most importantly, your entire system (OS, programs, and all) needs to be stored in a way that lets it be loaded quickly and easily. In the world of EC2, this is called an image.
Image Is Everything
Because the memory of a machine is wiped each time it is rebooted, it can be configured any way it needs to be, but it needs to be told exactly how to configure itself. The actual snapshot of the system that is loaded when the machine starts is called an image. It consists of all of the software and configuration instructions needed to run a Linux server in the AWS environment.
So, if you are planning to use EC2 to do real work, the thing you should be most concerned with to get started is creating an image that has all the software that you need to get things done. Amazon makes a number of images readily available, including ones with MySQL and Apache. However, if you need more than this (and the most basic UNIX tools) you need to create your own image, often called an AMI.
Provisioning
I consider provisioning to encompass everything that needs to be done in order to create a re-usable AMI for use with EC2. The easiest way to create an AMI is to take one that already works and modify it. The cheapest way to do this is to create an instance on an EC2 machine, modify it, then save the results. This article will focus on provisioning in the cheapest and easiest ways that I can find.
This takes several steps:
- Creating an instance
- Adding and configuring users
- Installing and configuring software
- Setting up the environment
- Bundling the volume
Before we get started, realize that this can be a long and time-consuming process. Because you will have an EC2 instance running while you go through these steps and will be transferring data to that instance, it will cost you money to do this. Caveat emptor!
Also, I would most strongly recommend that as you go through the steps you have your terminal application record your keystrokes to a file, to ensure that you can go back through and repeat yourself easily in case you have to start over.
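If your terminal application cannot log a session itself, the standard script utility is one way to do it. The sketch below is only an illustration: the function name and the log file name are made up for this example, and script captures everything shown in the terminal (the commands you type plus their output), not raw keystrokes.

```shell
# record_session LOGFILE
# Starts a subshell whose terminal output is appended to LOGFILE.
# Type "exit" (or press Ctrl-D) to stop recording.
record_session() {
    script -a "$1"
}
```

You would start it with something like record_session ~/ami-build.log before you begin, then read the log back later to see which commands you ran.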
Creating an Instance
You can use my shell script or follow some other instructions to create your instance. Once it is created you can use Telnet or SSH to log in as the root user.
Adding and Configuring Users and Groups
When an instance is first created, the first thing to do is create UNIX users to do the actual work, as running everything as the root user is a security risk. I would suggest that you create a superuser and give them the rights to do whatever it is you need to have done for administration.
You would do this by connecting as the root user and running the commands:
adduser superuser-name
passwd superuser-name
At this point the user exists. We now need to give them the rights to manipulate the system as a superuser.
To do this we need to give them the ability to use the sudo command. Run the command visudo (this allows you to use vi to edit the /etc/sudoers file safely). Once visudo is running, search for the line for the root user, which looks like:
root ALL = (ALL) ALL
Then, add a line that looks like:
superuser-name ALL = (ALL) ALL
At this point you may wish to configure this or any other users and groups that you know you will want in advance, setting things such as the default shell, user profiles, etc. This will totally depend on what you are doing with the machine, so think carefully.
I would recommend that you make as few users with as little access as possible, for security’s sake.
Installing and Configuring Software
An EC2 volume is based on Fedora Core, for better or for worse. This means that we have some pretty standard tools available to us. However, we will have to download and compile some software from source, and the downloads will actually cost a bit of money. Not a ton, mind you, but enough to be concerned about.
I followed these instructions and upgraded my box to Fedora Core 5 first thing. This gives a pretty broad swath of tools to use, and is a good start. Everything isn’t cutting edge, but it is stable and very standard, which is what I am looking for in an image. This probably involves several hundred MB of file transfer, which will be charged at the going rate ($0.20/GB as of this writing).
I also installed GCC, as I knew I might need it later. Yum is available, so I will use that when I can. The command:
yum install gcc*
will get the ball rolling.
The real things that I needed to get on the image were Python 2.5, the SciPy package, and SQLite 3. I went ahead and downloaded and compiled each of these from source because they weren’t available in yum and also to make sure they didn’t interfere with anything else. As soon as all three are available using yum I will get them that way.
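For anyone who has not built software from source before, the usual configure / make / make install cycle can be wrapped up roughly like this. It is a sketch, not the exact procedure used for this image: the function name is invented for the example, it assumes a gzipped tarball that unpacks into a single top-level directory, and it assumes the package uses a standard configure script. Installing under a separate prefix is one way to keep a hand-built package from interfering with the system copies.

```shell
# build_from_source TARBALL PREFIX
# Unpacks TARBALL into a scratch directory, then configures, builds,
# and installs it under PREFIX.
build_from_source() {
    tarball=$1
    prefix=$2
    workdir=$(mktemp -d) || return 1
    tar -xzf "$tarball" -C "$workdir" || return 1

    # Assumption: the tarball contains exactly one top-level directory.
    srcdir=
    for dir in "$workdir"/*/; do
        srcdir=$dir
        break
    done
    [ -d "$srcdir" ] || return 1

    # Run the build in a subshell so the caller's directory is unchanged.
    ( cd "$srcdir" && ./configure --prefix="$prefix" && make && make install )
}
```

A call might look like build_from_source Python-2.5.tgz /opt/python2.5, where both the file name and the prefix are placeholders. Installing into a system directory generally has to be done as root or through sudo.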
In order to run the tools that we are going to use to make our volume, we need to install Java. There are detailed instructions on how to do this, and it is a relatively painless install. When you do this, you will need to make sure that the JAVA_HOME variable is set up correctly.
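Setting JAVA_HOME usually comes down to a couple of lines in the shell profile of whichever user runs the tools. The path below is purely a placeholder; use the directory where Java actually landed on your instance.

```shell
# Assumption: Java was unpacked under /usr/local/java (placeholder path).
export JAVA_HOME=/usr/local/java
export PATH="$JAVA_HOME/bin:$PATH"

# Quick sanity check that the variable points at a real Java install.
if [ -x "$JAVA_HOME/bin/java" ]; then
    "$JAVA_HOME/bin/java" -version
else
    echo "No java executable found under $JAVA_HOME" >&2
fi
```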
Lastly, I would recommend setting up SSH for the superuser at a minimum. If you have other users set up, I would do the same for them, as you will then have secure, passwordless access to any of your instances in the future.
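One common way to get passwordless access is to generate a key pair on your local machine with ssh-keygen, copy the public half to the instance, and add it to the user's authorized_keys file. The helper below sketches the last step; its name and arguments are made up for this example.

```shell
# install_pubkey HOME_DIR PUBKEY_FILE
# Appends PUBKEY_FILE to HOME_DIR/.ssh/authorized_keys and applies the
# restrictive permissions that sshd normally expects.
install_pubkey() {
    home_dir=$1
    pubkey=$2
    mkdir -p "$home_dir/.ssh" || return 1
    chmod 700 "$home_dir/.ssh"
    cat "$pubkey" >> "$home_dir/.ssh/authorized_keys" || return 1
    chmod 600 "$home_dir/.ssh/authorized_keys"
}
```

For example, install_pubkey /home/superuser-name /tmp/id_rsa.pub, run as that user. If you run it as root instead, remember to chown the .ssh directory and its contents to the user afterwards.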
My guess is that this entire process will entail less than 1 GB of file transfers, and it could be a lot less if you don’t want to use Fedora Core. Because you are downloading to your image, it will only need to be done once.
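To put a number on that, here is a back-of-the-envelope calculation using the $0.20/GB rate quoted above. The 500 MB figure is just an assumed size for the upgrade, and the function name is made up for the example.

```shell
# transfer_cost MEGABYTES RATE_PER_GB
# Prints the estimated transfer charge in dollars, to two decimal places.
transfer_cost() {
    awk -v mb="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", mb / 1024 * rate }'
}

transfer_cost 500 0.20     # assumed 500 MB upgrade: prints 0.10
transfer_cost 1024 0.20    # a full 1 GB: prints 0.20
```

In other words, even the worst case here is about twenty cents of transfer, on top of the hourly charge for the running instance.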
Setting Up the Environment
First, you will need to set up the EC2 tools, as specified in the documentation. You should have done this once already in order to log into an EC2 instance, so I won’t dwell on it. I found it easiest to create a ~/.ec2 directory on my local machine, then transfer it to the remote instance with scp. I put this in my superuser account and made it readable only by the superuser and its group. This makes setup a lot easier.
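As a rough sketch, the environment for the EC2 command line tools might look like the following. EC2_HOME, EC2_PRIVATE_KEY, and EC2_CERT are the variables those tools read; every path and file name here is a placeholder, and secure_ec2_dir is a helper invented for this example.

```shell
# Placeholder locations; substitute your own tool directory and key files.
export EC2_HOME="$HOME/ec2-api-tools"
export EC2_PRIVATE_KEY="$HOME/.ec2/pk-EXAMPLE.pem"
export EC2_CERT="$HOME/.ec2/cert-EXAMPLE.pem"
export PATH="$EC2_HOME/bin:$PATH"

# secure_ec2_dir DIR
# Owner gets read/write, group gets read-only, everyone else gets nothing.
# The capital X grants directory traversal without making files executable.
secure_ec2_dir() {
    chmod -R u=rwX,g=rX,o= "$1"
}

if [ -d "$HOME/.ec2" ]; then
    secure_ec2_dir "$HOME/.ec2"
fi
```

The copy itself would be something like scp -r ~/.ec2 superuser-name@your-instance-hostname: run from the local machine, with the host name replaced by your instance's public DNS name.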
The next thing you need to do is set up the system so that it boots up with your programs already running. If you aren’t familiar with how to do this, you can check out the documentation for the boot process and running programs at boot time. Beginners might have to modify the rc.local file, but normally most configuration is taken care of during the application installation process.
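If you do end up editing rc.local, the addition is typically only a few lines. The fragment below is a sketch of what could go at the end of /etc/rc.local; the program path and log file are hypothetical.

```shell
# Hypothetical program to launch at boot; replace with your own.
SERVICE=/usr/local/bin/my-service

# Start it in the background only if it is actually installed,
# so a missing program does not produce errors during boot.
if [ -x "$SERVICE" ]; then
    "$SERVICE" >> /var/log/my-service.log 2>&1 &
fi
```

Services installed through yum usually ship their own init scripts, which on Fedora can be switched on with chkconfig (for example, chkconfig httpd on) rather than through rc.local.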
Last, you will need to set up the file and directory permissions on the machine. Again, this is beyond the scope of this post, but there is plenty of documentation available. Sometimes this can be a bit more art than science, so consider your decisions here carefully.
Bundling the Volume
Once you have gotten through each of these steps, you should be ready to bundle a volume. Remember, what we are doing is taking a snapshot of the machine as it is in its current state and storing it for later use. The next time we run, we are going to have a machine that looks just the way it does now, so make sure everything is taken care of.
Now, I had planned on writing something simple on how to actually do the bundling, but AWS has already made very good instructions available. It takes a little while to finish, but overall it works pretty well.
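For orientation, the AWS instructions boil down to three commands: ec2-bundle-vol, ec2-upload-bundle, and ec2-register. The sketch below only prints them so you can review the arguments before running anything. The account ID, bucket name, key file names, and the bundle_commands helper are all placeholders made up for this example; follow the official instructions for the exact options on your version of the tools.

```shell
# Placeholder values; substitute your own.
AWS_ACCOUNT_ID="123456789012"
BUCKET="my-ami-bucket"
KEY="${EC2_PRIVATE_KEY:-$HOME/.ec2/pk-EXAMPLE.pem}"
CERT="${EC2_CERT:-$HOME/.ec2/cert-EXAMPLE.pem}"

# Prints the three steps in order: bundle, upload, register.
bundle_commands() {
    echo "ec2-bundle-vol -d /mnt -k $KEY -c $CERT -u $AWS_ACCOUNT_ID"
    echo "ec2-upload-bundle -b $BUCKET -m /mnt/image.manifest.xml -a <access-key-id> -s <secret-access-key>"
    echo "ec2-register $BUCKET/image.manifest.xml"
}

bundle_commands
```

The first command runs as root on the instance and writes the bundle to /mnt, the second pushes it to S3, and the third makes the uploaded image known to EC2.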
If you are having trouble, it is probably that:
- The EC2 software isn’t installed.
- The environment variables aren’t set up properly.
- You are providing incorrect AWS account information.
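A small preflight check can rule out the first two causes before you start. This is a sketch under a few assumptions: the tool and variable names are the ones used by the EC2 command line tools, the function name is invented for this example, and the third check only confirms that the key files are readable, since it has no way of verifying that the account details are correct.

```shell
# check_ec2_setup
# Prints one line per problem found; returns non-zero if there are any.
check_ec2_setup() {
    problems=0

    # 1. Are the bundling tools installed and on the PATH?
    for tool in ec2-bundle-vol ec2-upload-bundle; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool"
            problems=$((problems + 1))
        fi
    done

    # 2. Are the environment variables set?
    for var in JAVA_HOME EC2_HOME EC2_PRIVATE_KEY EC2_CERT; do
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "unset variable: $var"
            problems=$((problems + 1))
        fi
    done

    # 3. Can the credential files be read at all?
    for file in "${EC2_PRIVATE_KEY:-}" "${EC2_CERT:-}"; do
        if [ -n "$file" ] && [ ! -r "$file" ]; then
            echo "unreadable credential file: $file"
            problems=$((problems + 1))
        fi
    done

    [ "$problems" -eq 0 ]
}

check_ec2_setup || echo "Fix the problems listed above before bundling."
```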
Understand that once the image is uploaded to S3 you will have to pay for monthly storage for it, although it should not be particularly expensive.
Conclusion
Once you have created an AMI you should see it the next time you run the ec2-describe-images command. It is now available for your private use. Enjoy!