
3.2 Environment

You are going to need some place to put your computers. If you are lucky enough to have a dedicated machine room, then you probably have everything you need. Otherwise, select or prepare a location that provides physical security, adequate power, and adequate heating and cooling. While these might not be issues with a small cluster, proper planning and preparation are essential for large clusters. Keep in mind, you are probably going to be so happy with your cluster that you'll want to expand it. Since small clusters have ways of becoming large clusters, plan for growth from the start.

3.2.1 Cluster Layout

The more computers you have, the more space they will need, so plan your layout with wiring, cooling, and physical access in mind. Ignore any of these at your peril. While it may be tempting to stack computers or pack them into large shelves, this can create a lot of problems if not handled with care. First, you may find it difficult to physically access individual computers to make repairs. If the computers are packed too tightly, you'll create heat dissipation problems. And while this may appear to make wiring easier, in practice it can lead to a rat's nest of cables, making it difficult to divide your computers among different power circuits.

From the perspective of maintenance, you'll want to have physical access to individual computers without having to move other computers and with a minimum of physical labor. Ideally, you should have easy access to both the front and back of your computers. If your nodes are headless (no monitor, mouse, or keyboard), it is a good idea to assemble a crash cart (a wheeled cart with a monitor, keyboard, and mouse that you can roll up to whichever machine needs attention). Be sure to leave enough space to both wheel and park your crash cart (and a chair) among your machines.

To prevent overheating, leave a small gap between computers and take care not to obstruct any ventilation openings. (These are occasionally seen on the sides of older computers!) An inch or two usually provides enough space between computers, but watch for signs of overheating.

Cable management is also a concern. For the well-heeled, there are a number of cable management systems on the market. Ideally, you want to keep power cables and data cables separated. The traditional rule of thumb was that there should be at least a foot of separation between parallel runs of data and power cables, and that data cables and power cables should cross at right angles. In practice, the 60 Hz analog power signal doesn't affect high-speed digital signals. Still, separating cables can make your cluster more manageable.

Standard equipment racks are very nice if you can afford them. Cabling is greatly simplified. But keep in mind that equipment racks pack things very closely and heat can be a problem. One rule of thumb is to stay under 100 W per square foot. That is about 1000 W for a 6-foot, 19-inch rack.

Otherwise, you'll probably be using standard shelving. My personal preference is metal shelves that are open on all sides. When buying shelves, take into consideration both the size and the weight of all the equipment you will have. Don't forget any displays, keyboards, mice, KVM switches, network switches, or uninterruptible power supplies that you plan to use. And leave yourself some working room.

3.2.2 Power and Air Conditioning

You'll need to make sure you have adequate power for your cluster, and to remove all the heat generated by that power, you'll need adequate air conditioning. For small clusters, power and air conditioning may not be immediate concerns (for now!), but it doesn't hurt to estimate your needs. If you are building a large cluster, take these needs into account from the beginning. Your best bet is to seek professional advice if it is readily available. Most large organizations have heating, ventilation, and air conditioning (HVAC) personnel and electricians on staff. While you can certainly estimate your needs yourself, if you have any problems you will need to turn to these folks for help, so you might want to include them from the beginning. Also, a second set of eyes can help prevent a costly mistake.

3.2.2.1 Power

In an ideal universe, you would simply know the power requirements of your cluster. But if you haven't built it yet, this knowledge can be a little hard to come by. The only alternative is to estimate your needs. A rough estimate is fairly straightforward: just inventory all your equipment and then add up all the wattages. Divide the total wattage by the voltage to get the amperage for the circuit, and then figure in an additional 50 percent or so as a safety factor.
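
As a concrete sketch of that back-of-the-envelope arithmetic, the short Python snippet below totals a hypothetical inventory, applies the 50 percent safety factor, and converts the result to amps. The wattages and the 120 V circuit are assumptions for illustration only, not figures from any particular cluster.

    # Rough power estimate: sum nameplate wattages, add a safety factor,
    # and convert to amps. All figures here are hypothetical placeholders.
    inventory_watts = {
        "4 compute nodes @ 250 W each": 4 * 250,
        "head node": 300,
        "network switch": 50,
        "monitor, keyboard, KVM": 100,
    }

    total_watts = sum(inventory_watts.values())
    safety_factor = 1.5        # the "additional 50 percent or so"
    circuit_voltage = 120      # assumed line voltage

    required_amps = total_watts * safety_factor / circuit_voltage
    print(f"Estimated load: {total_watts} W")
    print(f"Circuit capacity needed: {required_amps:.1f} A")

With these made-up numbers, the estimate comes in just under 20 amps, in line with the small-cluster guideline discussed below.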

For a more careful analysis, you should take into account the power factor. A switching power supply can draw more current than its wattage rating suggests. For example, a fully loaded 350 W power supply may draw 500 W for 70 percent of the time and be off the other 30 percent of the time. And since a power supply may be only 70 percent efficient, delivering those 500 W may require around 715 W. In practice, your equipment will rarely operate at maximum-rated capacity. Some power supplies are power-factor corrected (PFC). These power supplies will have power factors closer to 95 percent than 70 percent.
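
The arithmetic behind that example works out as follows. This is only a sketch of the simplified duty-cycle model described above, with the 0.7 power factor and 70 percent efficiency taken from the example rather than from any real supply.

    # Simplified power-factor arithmetic from the example above.
    # Real supplies vary; PFC supplies have power factors near 0.95.
    rated_load_watts = 350     # fully loaded supply in the example
    power_factor = 0.7         # assumed non-PFC switching supply
    efficiency = 0.7           # assumed supply efficiency

    peak_draw_watts = rated_load_watts / power_factor    # about 500 W
    wall_watts = peak_draw_watts / efficiency            # about 715 W

    print(f"Peak draw: {peak_draw_watts:.0f} W")
    print(f"Power required at the wall: {wall_watts:.0f} W")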

As you can see, this can get complicated very quickly. Hopefully, you won't be working with fully loaded systems. On the other hand, if you expect your cluster to grow, plan for more. Having said all this, for small clusters a 20-amp circuit should be adequate, but there are no guarantees.

When doing your inventory, the trick is remembering to include everything that enters the environment. It is not just the computers, network equipment, monitors, etc., that make up a cluster. It includes everything: equipment that is only used occasionally, such as vacuum cleaners; personal items, such as the refrigerator under your desk; and fixtures, such as lights. (Ideally, you should keep the items that potentially draw a lot of current, such as vacuum cleaners, floor polishers, refrigerators, and laser printers, off the circuits your cluster is on.) Also, be careful to ensure you aren't sharing a circuit unknowingly, a potential problem in an older building, particularly if you have remodeled and added partitions.

The quality of your power can be an issue. If in doubt, put a line monitor on your circuit to see how it behaves. You might consider an uninterruptible power supply (UPS), particularly for your servers or head nodes. However, the cost can be daunting when trying to provide UPSs for an entire cluster. Moreover, UPSs should not be seen as an alternative to adequate wiring. If you are interested in learning more about or sizing a UPS, see the UPS FAQ at the site of the Linux Documentation Project (http://www.tldp.org/).

While you are buying UPSs, you may also want to consider buying other power management equipment. There are several vendors that supply managed power distribution systems. These often allow management over the Internet, through a serial connection, or via SNMP. With this equipment, you'll be able to monitor your cluster and remotely power down or reboot equipment.

And one last question to the wise:

Do you know how to kill the power to your system?


This is more than idle curiosity. There may come a time when you don't want power to your cluster. And you may be in a big hurry when the time comes.

Knowing where the breakers are is a good start. Unfortunately, these may not be close at hand. They may even be locked away in a utility closet. One alternative is a scram switch, an emergency cutoff that lets you kill power to your equipment in a single motion. A scram switch should be installed between the UPS and your equipment. You should take care to ensure the switch is accessible but will not inadvertently be thrown.

You should also ensure that your maintenance staff knows what a UPS is. I once had a server/UPS setup in an office that flooded. When I came in, the UPS had been unplugged from the wall, but the computer was still plugged into the UPS. Both the computer and the UPS were drenched, a potentially deadly situation. Make sure your maintenance staff knows what they are dealing with.

3.2.2.2 HVAC

When it comes to electronics, as with most everything else, heat kills. There is no magical temperature range such that, as long as you keep your computers and other equipment within it, everything will be OK. Unfortunately, it just isn't that simple.

Failure rate is usually a nonlinear function of temperature. As the temperature rises, the probability of failure also increases. For small changes in temperature, a rough rule of thumb is that you can expect the failure rate to double with an 18F (10C) increase in temperature. For larger changes, the rate of failure typically increases more rapidly than the rise in temperature. Basically, you are playing the odds. If you operate your machine room at a higher than average temperature, you'll probably see more failures. It is up to you to decide if the failure rate is unacceptable.
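
If you want to put a number on that rule of thumb, the tiny sketch below does so. Keep in mind it is only the rough doubling-per-18F rule from the paragraph above; it says nothing about any particular piece of hardware.

    # Rough rule of thumb only: failure rate roughly doubles for each
    # 18F (10C) rise. For large increases, real failure rates typically
    # climb even faster than this.
    def relative_failure_rate(rise_f: float) -> float:
        """Multiplier on the baseline failure rate for a rise of rise_f degrees F."""
        return 2 ** (rise_f / 18.0)

    print(relative_failure_rate(9))     # about 1.4x for a 9F rise
    print(relative_failure_rate(18))    # 2x for an 18F rise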

Microenvironments also matter. It doesn't help that it is nice and cool where you sit if your equipment rack is in a corner in direct sunlight where the temperature is 15F (8C) warmer. If the individual pieces of equipment don't have adequate cooling, you'll have problems. This means that computers spread out in a room with good ventilation may be better off at a higher room temperature than tightly packed, poorly ventilated computers in a cooler room.

Finally, the failure rate will also depend on the actual equipment you are using. Some equipment is designed and constructed to be more heat tolerant, e.g., military grade equipment. Consult the specifications if in doubt.

While occasionally you'll see recommended temperature ranges for equipment or equipment rooms, these should be taken with a grain of salt. Usually, recommended temperatures are a little below 70F (21C). So if you are a little chilly, your machines are probably comfortable.

Maintaining a consistent temperature can be a problem, particularly if you leave your cluster up and running at night, over the weekend, and over holidays. Heating and air conditioning are often turned off or scaled back when people aren't around. Ordinarily, this makes good economic sense. But when the air conditioning is cut off for a long Fourth of July weekend, equipment can suffer. Make sure you discuss this with your HVAC folks before it becomes a problem. Again, occasional warm spells probably won't be a problem, but you are pushing your luck.

Humidity is also an issue. At a high humidity, condensation can become a problem; at a low humidity, static electricity is a problem. The optimal range is somewhere in between. Recommended ranges are typically around 40 percent to 60 percent.

Estimating your air conditioning needs is straightforward but may require information you don't have. Among other things, proper cooling depends on the number and area of external walls, the number of windows and their exposure to the sun, the external temperature, and insulation. Your maintenance folks may have already calculated all this or may be able to estimate some of it.

What you are adding is heat contributed by your equipment and staff, something that your maintenance folks may not have been able to accurately predict. Once again, you'll start with an inventory of your equipment. You'll want the total wattage. You can convert this to British Thermal Units per hour by multiplying the wattage by 3.412. Add in another 300 BTU/H for each person working in the area. Add in the load from the lights, walls, windows, etc., and then figure in another 50 percent as a safety factor. Since air conditioning is usually expressed in tonnage, you may need to divide the BTU/H total by 12,000 to get the tonnage you need. (Or, just let the HVAC folks do all this for you.)
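
Here is a sketch of that calculation with made-up numbers. The equipment wattage, head count, and building load below are placeholders you would replace with your own figures; the building load in particular is something your HVAC folks would have to supply.

    # Cooling estimate: convert equipment watts to BTU/H, add people and
    # building load, apply the 50 percent safety factor, convert to tons.
    # All inputs are hypothetical placeholders.
    equipment_watts = 3500           # total from your equipment inventory
    people = 2                       # staff working in the area
    building_btu_per_hour = 4000     # lights, walls, windows (from HVAC staff)

    equipment_btu = equipment_watts * 3.412    # 1 W is about 3.412 BTU/H
    people_btu = people * 300                  # about 300 BTU/H per person
    total_btu = (equipment_btu + people_btu + building_btu_per_hour) * 1.5

    tons = total_btu / 12000                   # 1 ton of cooling = 12,000 BTU/H
    print(f"Cooling load: {total_btu:.0f} BTU/H (about {tons:.1f} tons)")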

3.2.3 Physical Security

Physical security includes both controlling access to computers and protecting computers from physical threats such as flooding. If you are concerned about someone trying to break into your computers, the best solution is to take whatever steps you can to ensure that they don't have physical access to the computers. If you can't limit access to the individual computers, then you should password-protect the CMOS, set the boot order so the system only boots from the hard drive, and put a lock on each case. Otherwise, someone can open the case and remove the battery for a short while (roughly 15 to 20 minutes) to erase the information in CMOS, including the password.[3] With the password erased, the boot order can be changed. Once this is done, it is a simple matter to boot from a floppy or CD-ROM, mount the hard drive, and edit the password files, etc. (Even if you've removed both floppy and CD-ROM drives, an intruder could bring one along.) Obviously, this solution is only as good as the locks you can put on the computers, and it does very little to protect you from vandals.

[3] Also, there is usually a jumper that will immediately discharge the CMOS.

Broken pipes and similar disasters can be devastating. Unfortunately, it can be difficult to assess these potential threats. Computers can be damaged when a pipe breaks on another floor. Just because there is no pipe immediately overhead doesn't mean that you won't be rained on as water from higher floors makes its way to the basement. Keeping equipment off the floor and off the top of shelves can provide some protection. It is also a good idea to keep equipment away from windows.

There are several web sites and books that deal with disaster preparedness. As the importance of your cluster grows, disaster preparedness will become more important.
