Date: Fri, 1 May 1998 15:34:32 -0600
From: "Michael S. Warren" <msw@t6-serv.lanl.gov>
To: beowulf@cesdis.gsfc.nasa.gov
Subject: Introducing Avalon

I see that word has leaked out already about our new cluster. I've been fairly busy over the past three weeks, but I finally got a chance to put together some information. Enjoy.

Avalon Frequently Asked Questions

Author: Michael S. Warren, msw@lanl.gov
Version: 0.02 May 1, 1998

What is Avalon made of?
Each node of the machine is a DEC Alpha workstation in an ATX case. The processors are Alpha 21164A microprocessors running at 533 MHz. The motherboards are the DEC 164LX. Memory consists of two 64 MB ECC SDRAM DIMMs per node. The disks are Quantum Fireball ST3.2A, 3079 MB EIDE U-ATA drives. The ethernet cards are made by Kingston and use the DEC Tulip chipset. We use a 3Com fast ethernet switch, which is described below. We also have a serial network which utilizes 3 Cyclades Cyclom 32-YeP multiport serial cards.

What kind of fast ethernet switch do you have?
We have four 3Com SuperStack II 3900 36-port fast ethernet switches, with 2 Gigabit uplink modules added to each one. This provides 3 Gigabit links on each switch, which are trunked together and attached to a 3Com SuperStack II 9300 12-port Gigabit Ethernet switch. Overall, this provides a switched network of 144 fast ethernet ports, at a cost of about $300 per port. This provides us plenty of room to grow from our current 70 nodes, and lets us experiment with "channel bonding" where one attempts to double the network throughput by using two ethernet cards on each node.

How much did it all cost?
152 thousand American dollars.

How long did it take to get the machine running?
The nodes arrived the morning of Friday, April 10. The machine was running parallel code at over 10 Gflops on Monday, April 13.

Which operating system do you use?
Linux.

How well does the machine perform?
In the past three weeks we have run parallel Linpack at 19.7 Gflops (improved from 19.3 last week), a molecular dynamics code (SPaSM) at 12.8 Gflops, a gravitational treecode at 10.0 Gflops, and the NAS Class B version 2.3 benchmarks (BT: 2.2 Gflops, SP: 1.0 Gflops, LU: 3.5 Gflops, MG: 2.1 Gflops).

Those were benchmarks, what about real production code?
SPaSM sustained 10 Gflops for 44 hours on a 60 million particle simulation of shock-induced plasticity, and wrote 68 Gbytes of raw data. The treecode sustained 6.8 Gflops for 26 hours on a 10 million particle simulation of galaxy formation.

How does this compare in performance to "real" parallel machines?
SPaSM, the treecode and Linpack run at about the same speed on a 64 processor 195 MHz SGI Origin 2000.

How much does an Origin 2000 cost?
The May 1998 list price for a 64-processor Origin 2000 with 250 MHz processors and 8 Gbytes of memory is around 1.8 million dollars. I think you can get a 180 MHz version under the Varsity program for about a million.

Which distribution of Linux do you use?
We use RedHat 5.0.

Which kernel are you currently running?
As of May 1, we are running Linux-2.1.99-pre3. The 2.1 series provides improved network performance over 2.0.

Which compilers did you use?
We used gcc and g77 from egcs-1.0.2. The Linpack result was due in large part to the unbelievably fast Alpha/Linux DGEMM written by goto@statabo.rim.or.jp. Thanks Goto, we owe you lots of beer.

What message passing library did you use?
We have used MPICH, and our own basic set of MPI routines written on top of TCP sockets. Our own stuff tends to run faster. We will release this code under GPL as soon as we can decouple it from some other unrelated garbage.

Who did you buy the nodes from?
We bought them from Carrera. They did all the node assembly, and installed and configured Linux using a disk and "clone" script that we provided.

Where did you buy the 3Com Ethernet switch?
We bought it through an established government contract, but they are offered at the same price through various Internet vendors. Try searching at http://www.uvision.com or http://www.pricewatch.com

Linux is obviously cheaper than a commercial operating system like Solaris, HP-UX, AIX or Windows NT, but wouldn't one of those operating systems offer better performance?
We don't use Linux because it is free. We use it because it has open source code, superior networking performance, and is being developed in an open and accessible manner. Linux mailing lists and newsgroups on the Internet have made it vastly easier to identify and fix the problems that will always occur. We can't afford to waste our time asking customer support to hold our hand and tell us they will certainly fix the problem, but it may take a few months. Given the limited amount of human resources that we have available, this project would not have been possible without Linux. If you have the resources of a major Computer Science department at your disposal, a commercial operating system with a source code license might be a viable option.

Isn't fast ethernet too slow for such a machine? Why didn't you use Myrinet or some other faster technology?
Ethernet is a commodity product, with all the benefits that entails. If you look at the performance of the codes we have run, you can see that fast ethernet performs admirably well. A faster network would clearly improve performance, but the key question is how much, and for how much money? Myrinet would roughly double the cost of the machine, and none of the codes we have run thus far would improve in performance by a factor of two with a faster network. Maybe you have a code that would justify buying Myrinet for your cluster, but you'll have to do the math to figure it out.

Why didn't you rackmount the nodes?
Rackmounting would cost several hundred dollars per node. Saving 30 square feet of floor space did not seem to be worth $20,000 to us.
Increased reliability from redundant cooling and power may justify more complex packaging, but we have no experience with such a setup.

What is the Cyclades serial network for?
A "feature" of DEC's AlphaBIOS is that a headless node will not automatically boot without a carriage return over the serial port. The Cyclades network was the best short-term solution to this problem, and also offered a diagnostic and control network which is independent of the ethernet.

Wouldn't a different type of ethernet card provide better performance?
That is a very good question. Because of alignment restrictions, the DEC Tulip chipset (as found in the Kingston cards we are using) performs sub-optimally on Alpha hardware by forcing an unnecessary buffer copy. We originally intended to use the SMC EtherPower II card, but found that this card does not work in the current C0 revision of the 164LX motherboard (it works great in the B3 revision). This problem is currently under investigation.

Where did the name Avalon come from?
It's an oblique Beowulf reference. Read "The Legacy of Heorot" and "Beowulf's Children".

Why are you doing this?
We wanted a supercomputer, and this solution seemed to be the best balance of cost and utility. Please realize that this is not a research project, and all the money we had to spend was spent on hardware. All of the system management and code development is being provided on a volunteer basis from scientists and system managers who are paid to do other things. Don't expect to see a lot of "value added" like fancy new parallel languages and "global shared memory". We are not Computer Scientists; we just want to run our physics simulations using the best hardware and software that we can find.

Where can I learn more?
http://cnls.lanl.gov/avalon

--
Michael S. Warren                 Email: msw@lanl.gov
Theoretical Astrophysics, T-6     URL:   http://qso.lanl.gov/~msw/
Mail Stop B288                    Phone: (505) 665-5023
Los Alamos National Laboratory    FAX:   (505) 665-3003
Los Alamos, NM 87545
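
A rough sketch of the "do the math" suggestion in the Myrinet answer above, for anyone who wants to plug in their own numbers. The cost and Gflops figures are the ones quoted in this FAQ; the function names and the example speedups (1.3 and 2.5) are hypothetical and only illustrate the comparison. The assumption is that price/performance (dollars per Gflops) is the only thing being optimized, in which case an upgrade pays off only when the speedup exceeds the cost ratio.

```python
# Figures quoted in the FAQ above.
AVALON_COST_USD = 152_000
BENCHMARK_GFLOPS = {
    "Linpack": 19.7,
    "SPaSM": 12.8,
    "treecode": 10.0,
}

# From the FAQ's rough estimate: Myrinet would about double the machine cost.
MYRINET_COST_RATIO = 2.0


def dollars_per_gflops(cost_usd, gflops):
    """Price/performance of a machine; lower is better."""
    return cost_usd / gflops


def upgrade_pays_off(speedup, cost_ratio):
    """Return True if an upgrade lowers dollars per Gflops.

    New price/performance = (cost * cost_ratio) / (gflops * speedup),
    which is better than the old value only when speedup > cost_ratio.
    """
    return speedup > cost_ratio


if __name__ == "__main__":
    for code, gflops in BENCHMARK_GFLOPS.items():
        price = dollars_per_gflops(AVALON_COST_USD, gflops)
        print(f"{code}: ${price:,.0f} per Gflops")

    # Hypothetical speedups from a faster network, for illustration only.
    for speedup in (1.3, 2.5):
        verdict = upgrade_pays_off(speedup, MYRINET_COST_RATIO)
        print(f"speedup {speedup}: worth doubling the cost? {verdict}")
```

Measure the speedup of your own code on a small test configuration before trusting either answer; the break-even point moves if the network is a smaller or larger share of your total cost.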