Posted on February 25, 2004
The cluster consists of a controlling node, with a large capacity hard drive, and several computational nodes, each with its own hard disk drive (these hard drives can be smaller).
The software which performs the parallelization (MPI) is installed on the controlling node, and the computational nodes mount a shared directory on the controlling node via NFS.
Communication between the nodes is established by MPI via rsh, and shared files are found via the mounted NFS file system.
The networking is fast ethernet (100 Mbit) and makes use of a fast ethernet switch. Gigabit ethernet is faster (and better for fast file I/O), but 100 Mbit ethernet is quite adequate for number crunching.
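To put rough numbers on that difference, here is a minimal sketch that estimates the best-case time to move a file over each link. It assumes the full line rate is available and ignores all protocol overhead, so real transfers will be slower. The 1 GB file size is only an illustration and does not come from my cluster.

```python
def transfer_seconds(size_bytes, link_mbit_per_s):
    """Best-case transfer time: full line rate, no protocol overhead."""
    bits = size_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)


if __name__ == "__main__":
    size = 1_000_000_000  # illustrative 1 GB file (10**9 bytes)
    for name, rate in (("fast ethernet", 100), ("gigabit ethernet", 1000)):
        print(f"{name}: {transfer_seconds(size, rate):.0f} s")
```

For number crunching the nodes mostly exchange small messages, which is why the slower link is usually tolerable there.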
The version of MPI used is mpich-188.8.131.52
The operating system for the controlling node and all the computational nodes is FreeBSD MINI 4.8 RELEASE.
FreeBSD has moved forward a bit since I began building my cluster, so check with freebsd.org to see what is currently available. Whatever distribution you use, you should be using RELEASE or STABLE versions.
Install and configure the controlling node
Keep it simple. Resist the temptation to add a lot of options. JUST MAKE IT WORK.
Keep all the nodes as identical as possible; they will be running code that is generated on the controlling node.
Set up a firewall between the cluster and the outside world. The cluster needs a high degree of connectivity and has rather poor security.
Assemble the nodes and test them one at a time.
Install Mini-FBSD on the controlling node first (I'm using the mini 4.8 distribution).
Use the same root password on the controlling node and on all the computational nodes.
Configure the controlling node as an NFS server and export /usr to be accessed with root privileges.
Enable inetd, and edit /etc/inetd.conf to allow rlogin.
Set up rsh and ssh so that the controlling node and computational nodes can access each other.
Be sure to edit /etc/ssh/sshd_config to allow root login.
DO NOT allow the controlling node to rsh/ssh to itself. Doing this not only causes security issues, but can also saturate the controlling node with rsh connections during a program run, leading to slowness and program crashes.
Allow only essential external computers to access the controlling node by ssh. Do not allow any external computers to use rsh to access any node. Use ssh instead.
Edit /etc/rc.conf for the appropriate hostname and IP address.
Edit /etc/hosts to include the hostnames and IP addresses of the controlling node, computational nodes, and any external computers which need to access the controlling node.
Download and install MPI. Be sure to read the documentation on the MPI web site. Install MPI in /usr/local/mpi. I built MPI to run in P4 mode (MPICH's ch_p4 device) to keep things simple.
In '/root/.cshrc' add '/usr/local/mpi/bin' to the path. You might also wish to edit '/etc/skel/.cshrc' with the same value so that new users get a working MPI.
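Typing the /etc/hosts entries by hand gets tedious as the node count grows, so here is a small sketch that generates them. The hostnames ("master", "node01", ...), the 192.168.1.x network, and the starting host number are all hypothetical placeholders, not the values from my cluster. Entries for any external computers would still need to be added by hand.

```python
def hosts_lines(controller, nodes, network="192.168.1.", first_host=10):
    """Return /etc/hosts lines: the controlling node first, then the
    computational nodes, numbered consecutively.

    All names and addresses are hypothetical; substitute your own.
    """
    names = [controller] + list(nodes)
    return [f"{network}{first_host + i}\t{name}" for i, name in enumerate(names)]


if __name__ == "__main__":
    # Hypothetical layout: one controlling node and four computational nodes.
    compute = [f"node{n:02d}" for n in range(1, 5)]
    print("\n".join(hosts_lines("master", compute)))
```

Because every node needs the same list, the output can be appended to /etc/hosts on the controlling node and on the first computational node before it is cloned.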
Install FBSD on one computation node
Configure it as an NFS client.
Enable inetd and edit /etc/inetd.conf to allow rlogin.
Edit '/etc/fstab' to add the NFS mount for /usr and set the mount point as /mnt/usr. Create a symbolic link at /usr/local/mpi that points to /mnt/usr/local/mpi.
Add the hostnames and IP addresses for the controlling node and all the computational nodes to /etc/hosts.
Edit /etc/rc.conf for the appropriate hostname and IP address for the node.
Edit /etc/ssh/ssh_config to configure the node as an ssh client.
Use rcp/scp to copy the /etc/ssh/sshd_config file from the controlling node to the computational node.
Create an empty file named '.hushlogin' and put it in '/root'. You may wish to also put .hushlogin in /etc/skel so new users automatically get a copy of it. This inhibits motd and limits the login text to a prompt. It keeps MPI from complaining about getting an unexpected response when it uses rsh to connect to a node.
You may need to have .rhosts in /root. If you use it, be sure to include all nodes in it. You might wish to put a copy of .rhosts in /etc/skel so that new users can use ssh/rsh without being root.
You will need to add each node to '/usr/local/mpi/share/machines.freebsd'. This file is the list of nodes usable by MPI.
Run the test script /usr/local/mpi/sbin/tstmachines with the -v option: 'sh /usr/local/mpi/sbin/tstmachines -v'. It may complain that it cannot access the controlling node (this is normal), but it should talk to all the nodes in the nodelist and run some test software to confirm that all is working. The script uses rsh to talk to all the nodes, and if the controlling node cannot rsh to itself, the script will complain. Resist the temptation to allow the controlling node to rsh to itself. MPI will run a process on localhost in addition to any nodes listed in '/usr/local/mpi/share/machines.freebsd', so even if the script complains that it can't find the controlling node, MPI will still work.
Compile and run some of the sample programs that come with MPI to confirm that all is working properly.
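A node that is listed in the machines file but missing from /etc/hosts is an easy mistake to make, so a quick cross-check can save some head scratching before running tstmachines. The sketch below is my own helper, not part of MPI. It assumes one hostname per line in the machines file and treats '#' as the start of a comment in both files. The sample names and addresses are hypothetical.

```python
def parse_hosts(text):
    """Return every hostname (and alias) named in /etc/hosts-style text."""
    names = set()
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        names.update(fields[1:])                # field 0 is the IP address
    return names


def parse_machines(text):
    """Return node names from a machines file (assumed: one name per line)."""
    nodes = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            nodes.append(entry)
    return nodes


def missing_from_hosts(machines_text, hosts_text):
    """List the nodes in the machines file that have no /etc/hosts entry."""
    known = parse_hosts(hosts_text)
    return [node for node in parse_machines(machines_text) if node not in known]


if __name__ == "__main__":
    # Hypothetical file contents, standing in for the real files.
    hosts = "192.168.1.10 master\n192.168.1.11 node01\n"
    machines = "node01\nnode02\n"
    for node in missing_from_hosts(machines, hosts):
        print("not in /etc/hosts:", node)
```

On a real cluster the two strings would be read from /etc/hosts and '/usr/local/mpi/share/machines.freebsd' instead.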
Copy the newly configured node to an "empty" hard drive.
If all is well, connect an empty hard drive for the next node to the secondary controller and use dd to copy the configured hard drive to the empty one. Be sure the "empty" drive is configured as slave and does not contain a primary partition, or FBSD might not know what to do with two hard drives at the same time.
Shut down the computer, remove the copied drive, and install it in the second node. Don't forget to move the jumper from slave to master.
Configure the new node by booting it, logging in from a keyboard, and editing /etc/rc.conf for the appropriate hostname and IP address.
Add the new node to '/usr/local/mpi/share/machines.freebsd' on the controlling node.
Reboot the new node and rsh to it from the controlling node to confirm communications.
Run /usr/local/mpi/sbin/tstmachines in verbose mode to confirm that the new node works properly.
If the new node is working properly, use dd to install copies of the computational node on all the drives for the remaining cluster nodes.
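Since every cloned drive starts out as an exact copy, the only per-node changes are the hostname and IP address lines in /etc/rc.conf. The sketch below prints those two lines for a given node. The interface name fxp0, the netmask, and the example hostname and address are assumptions for illustration; use whatever your network card and addressing scheme require.

```python
def rc_conf_lines(hostname, ip, interface="fxp0", netmask="255.255.255.0"):
    """Return the two /etc/rc.conf lines that differ from node to node.

    The interface name and netmask defaults are assumptions, not values
    taken from the article.
    """
    return [
        f'hostname="{hostname}"',
        f'ifconfig_{interface}="inet {ip} netmask {netmask}"',
    ]


if __name__ == "__main__":
    # Hypothetical second computational node.
    print("\n".join(rc_conf_lines("node02", "192.168.1.12")))
```

Keeping a printed list of these lines next to the keyboard makes it harder to give two nodes the same address after a long cloning session.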
Plan for some odd things to happen. Clustering has a way of exposing "flaky" hardware and software. If a node crashes frequently for no apparent reason, suspect a hardware problem.
Power up the new cluster and let it idle for a day or two, and check the nodes to see if they spontaneously crash, disconnect, or otherwise misbehave. If the cluster seems stable, begin writing programs designed to stress the machine so that you can expose software bugs and latent hardware issues. Work through these issues one at a time. Depending on the hardware, the size of the cluster, and its complexity, it could take from a few weeks to several months to weed out the worst of the quirks and bugs. Replace flaky hardware. One bad node in the nodelist can render a cluster useless, so don't waste your time and money trying to limp along with wounded hardware.
Power it up and leave it up. Cycle the power on a node only when you absolutely must. This reduces failures from inrush currents at power-up as well as reducing thermomechanical stresses that lead to component failures.
You might wish to set aside a node for development, so you can test new kernels or software. Once you are sure your new code is stable, you can migrate it to the other nodes. Exclude this node from the nodelist so the users don't get unhappy surprises when they run their software.
Plan on having about ten percent of the cluster failed or failing at any given time. If you need a machine with 10 nodes operational, you had best plan on having 12 nodes, and some spare parts. The larger the cluster is, the more failed hardware you can expect. Really large clusters have hardware failures on a more or less continuous basis. Alternatively, you can just build a lot of extra nodes and take bad nodes offline as the cluster "burns in" (this seems expensive and wasteful to me). Run the cluster on a good UPS. This is not optional. You need clean power to get good hardware life, and with this many computers the investment in a UPS will pay off in terms of longer hardware life.
Consumer-grade electronics are designed with an operational life of two years. Lower quality components have an even shorter design life. This means that once you get all the bugs worked out, and everything is "burned in", you can expect a year or two of fairly trouble-free service. After that, the components age sufficiently that you will begin to see hardware failures rising to the point that you will probably want to consider just building a new machine.
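The ten percent rule of thumb is easy to turn into a node count. The sketch below finds the smallest number of nodes to build so that the required number is still working when the given percentage is down. The ten percent default is the rough figure from the paragraph above, not a measured failure rate, and the function name is my own.

```python
def nodes_to_build(required, failed_percent=10):
    """Smallest node count that leaves `required` nodes working when
    `failed_percent` of the cluster is down."""
    healthy_percent = 100 - failed_percent
    # Integer ceiling of required * 100 / healthy_percent (no float rounding).
    return -(-required * 100 // healthy_percent)


if __name__ == "__main__":
    print(nodes_to_build(10))  # 12, matching the 10-node example above
```

Spare parts come on top of this count; the calculation only covers whole nodes.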
Building a parallel computing machine is a big investment in time and money. Take your time and plan your project carefully. Make sure all of the components you plan to use are available, and will continue to be available over the several months it is likely to take you to build and test your creation. A little thought will save you a lot in terms of time, money and disappointment, and will pay big dividends in satisfaction.
The MPI Home Page. You can download the latest distribution of MPI as well as useful documentation.
The FreeBSD Home Page. Download your favorite distribution of FreeBSD and browse online documentation.