Date: Fri, 1 May 1998 15:34:32 -0600
From: "Michael S. Warren" <msw@t6-serv.lanl.gov>
To: beowulf@cesdis.gsfc.nasa.gov
Subject: Introducing Avalon

I see that word has leaked out already about our new cluster.  I've been
fairly busy over the past three weeks, but I finally got a chance to put
together some information.  Enjoy.


Avalon Frequently Asked Questions

Author: Michael S. Warren, msw@lanl.gov
Version: 0.02 May 1, 1998

What is Avalon made of?

   Each node of the machine is a DEC Alpha workstation in an ATX case.
   The processors are Alpha 21164A microprocessors running at 533 MHz.
   The motherboards are the DEC 164LX.  Memory consists of two 64 MB ECC
   SDRAM DIMMs per node.  The disks are Quantum Fireball ST3.2A, 3079 MB
   EIDE U-ATA drives.  The ethernet cards are made by Kingston and use
   the DEC Tulip chipset.  We use a 3Com fast ethernet switch, which is
   described below.  We also have a serial network which utilizes 3
   Cyclades Cyclom 32-YeP multiport serial cards.

What kind of fast ethernet switch do you have?

   We have four 3Com SuperStack II 3900 36-port fast ethernet switches,
   with 2 Gigabit uplink modules added to each one.  This provides 3
   Gigabit links on each switch, which are trunked together and attached
   to a 3Com SuperStack II 9300 12-port Gigabit Ethernet switch.
   Overall, this provides a switched network of 144 fast ethernet ports,
   at a cost of about $300 per port.  This provides us plenty of room to
   grow from our current 70 nodes, and lets us experiment with "channel
   bonding" where one attempts to double the network throughput by using
   two ethernet cards on each node.
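
   The port arithmetic above can be checked with a short sketch.  The
   figures come from the answer itself; the variable names are
   illustrative, and the total network cost is only an estimate derived
   from the "about $300 per port" figure, not a quoted price.

```python
# Figures taken from the FAQ text; names are illustrative only.
SWITCHES = 4
PORTS_PER_SWITCH = 36
COST_PER_PORT_USD = 300   # "about $300 per port" (approximate)
NODES = 70

total_ports = SWITCHES * PORTS_PER_SWITCH
network_cost_usd = total_ports * COST_PER_PORT_USD  # rough estimate
spare_ports = total_ports - NODES
ports_for_bonding = 2 * NODES  # two ethernet cards on each node

print(total_ports)                       # 144
print(network_cost_usd)                  # 43200
print(spare_ports)                       # 74
print(ports_for_bonding <= total_ports)  # True
```

   With 70 nodes, two cards per node would need 140 ports, so the
   channel bonding experiment fits within the 144 available.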

How much did it all cost?

   152 thousand American dollars.

How long did it take to get the machine running?

   The nodes arrived the morning of Friday, April 10.  The machine was
   running parallel code at over 10 Gflops on Monday, April 13.

Which operating system do you use?

   Linux.

How well does the machine perform?

   In the past three weeks we have run parallel Linpack at 19.7 Gflops
   (improved from 19.3 last week), a molecular dynamics code
   (SPaSM) at 12.8 Gflops, a gravitational treecode at 10.0 Gflops, and
   the NAS Class B version 2.3 benchmarks (BT: 2.2 Gflops, SP: 1.0 Gflops,
   LU: 3.5 Gflops, MG: 2.1 Gflops).

Those were benchmarks.  What about real production code?

   SPaSM sustained 10 Gflops for 44 hours on a 60 million particle
   simulation of shock-induced plasticity, and wrote 68 Gbytes of 
   raw data.  The treecode sustained 6.8 Gflops for 26 hours on a
   10 million particle simulation of galaxy formation.

How does this compare in performance to "real" parallel machines?

   SPaSM, the treecode and Linpack run at about the same speed on a 64
   processor 195 MHz SGI Origin 2000.

How much does an Origin 2000 cost?

   The May 1998 list price for a 64-processor Origin 2000 with 250 MHz
   processors and 8 Gbytes of memory is around 1.8 million dollars.  I
   think you can get a 180 MHz version under the Varsity program for
   about a million.
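
   For a rough sense of scale, the sketch below combines the prices
   and the Linpack figure quoted in this FAQ.  It is only a
   back-of-the-envelope comparison: the Origin prices are for 250 MHz
   and 180 MHz configurations, not the 195 MHz machine the codes were
   compared on, and the function name is illustrative.

```python
# All figures are quoted from the FAQ; the Origin prices are approximate.
AVALON_COST_USD = 152_000
AVALON_LINPACK_GFLOPS = 19.7
ORIGIN_LIST_USD = 1_800_000     # 64 processors, 250 MHz, May 1998 list
ORIGIN_VARSITY_USD = 1_000_000  # 180 MHz, "about a million"

def price_ratio(other_usd, base_usd=AVALON_COST_USD):
    """How many times more expensive `other_usd` is than `base_usd`."""
    return other_usd / base_usd

usd_per_gflops = AVALON_COST_USD / AVALON_LINPACK_GFLOPS

print(round(usd_per_gflops))                      # 7716
print(f"{price_ratio(ORIGIN_LIST_USD):.1f}")      # 11.8
print(f"{price_ratio(ORIGIN_VARSITY_USD):.1f}")   # 6.6
```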

Which distribution of Linux do you use?

   We use RedHat 5.0.

Which kernel are you currently running?

   As of May 1, we are running Linux-2.1.99-pre3.  The 2.1 series
   provides improved network performance over 2.0.

Which compilers did you use?

   We used gcc and g77 from egcs-1.0.2.  The Linpack result was due 
   in large part to the unbelievably fast Alpha/Linux DGEMM written 
   by goto@statabo.rim.or.jp.  Thanks Goto, we owe you lots of beer.

What message passing library did you use?

   We have used MPICH, and our own basic set of MPI routines written
   on top of TCP sockets.  Our own stuff tends to run faster.  We will
   release this code under GPL as soon as we can decouple it from some
   other unrelated garbage.
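
   To illustrate what message passing on top of a stream socket
   involves, here is a minimal sketch.  It is not the Avalon code,
   which had not been released when this was written; the header
   layout and the names send_msg and recv_msg are assumptions made for
   the example.  A byte stream has no message boundaries, so each
   message carries a small header with a tag and a length.

```python
import socket
import struct

# Hypothetical wire format: 4-byte tag, 4-byte payload length,
# both unsigned integers in network byte order.
HEADER = struct.Struct("!II")

def send_msg(sock, tag, payload):
    """Send one tagged message; sendall loops until every byte is written."""
    sock.sendall(HEADER.pack(tag, len(payload)) + payload)

def _recv_exact(sock, nbytes):
    """Read exactly nbytes; recv may return fewer bytes than requested."""
    chunks = []
    while nbytes > 0:
        chunk = sock.recv(nbytes)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        nbytes -= len(chunk)
    return b"".join(chunks)

def recv_msg(sock):
    """Receive one message and return (tag, payload)."""
    tag, length = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return tag, _recv_exact(sock, length)

def demo():
    # socketpair gives two connected stream sockets in one process;
    # a TCP connection between two nodes would be used the same way.
    left, right = socket.socketpair()
    try:
        send_msg(left, 7, b"hello")
        return recv_msg(right)
    finally:
        left.close()
        right.close()
```

   Calling demo() sends a single message with tag 7 and returns the
   tag and payload as read from the other end.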

Who did you buy the nodes from?

   We bought them from Carrera.  They did all the node assembly, and
   installed and configured Linux using a disk and "clone" script that we
   provided.

Where did you buy the 3Com Ethernet switches?

   We bought them through an established government contract, but they
   are offered at the same price through various Internet vendors.  Try
   searching at http://www.uvision.com or http://www.pricewatch.com

Linux is obviously cheaper than a commercial operating system like
Solaris, HP-UX, AIX or Windows NT, but wouldn't one of those operating
systems offer better performance?

   We don't use Linux because it is free.  We use it because it has open
   source code, superior networking performance, and is being developed
   in an open and accessible manner.  Linux mailing lists and newsgroups 
   on the Internet have made it vastly easier to identify and fix the 
   problems that will always occur.  We can't afford to waste our time
   asking customer support to hold our hand and tell us they will
   certainly fix the problem, but it may take a few months. Given the
   limited amount of human resources that we have available, this
   project would not have been possible without Linux.  If you have the
   resources of a major Computer Science department at your disposal, a
   commercial operating system with a source code license might be a
   viable option.

Isn't fast ethernet too slow for such a machine?  Why didn't you use
Myrinet or some other faster technology?

   Ethernet is a commodity product, with all the benefits that entails.
   If you look at the performance of the codes we have run, you can see
   that fast ethernet performs admirably well.  A faster network would
   clearly improve performance, but the key question is how much, and
   for how much money?  Myrinet would roughly double the cost of the
   machine, and none of the codes we have run thus far would improve in
   performance by a factor of two with a faster network.  Maybe you have
   a code that would justify buying Myrinet for your cluster, but you'll
   have to do the math to figure it out.
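
   One way to frame "the math" is performance per dollar: a faster
   network pays off only if the speedup it buys exceeds the factor by
   which it raises the cost of the machine.  The sketch below assumes
   that simple model; the cost factor of 2 is the "roughly double"
   estimate above, and the example speedups are hypothetical, not
   measurements.

```python
def perf_per_dollar_gain(speedup, cost_factor):
    """Relative change in performance per dollar after an upgrade.

    speedup:     new runtime performance / old performance
    cost_factor: new machine cost / old machine cost
    A result above 1.0 means the upgrade improves performance per dollar.
    """
    return speedup / cost_factor

def upgrade_pays_off(speedup, cost_factor):
    return perf_per_dollar_gain(speedup, cost_factor) > 1.0

MYRINET_COST_FACTOR = 2.0  # "roughly double the cost of the machine"

# Hypothetical speedups, for illustration only.
print(perf_per_dollar_gain(1.5, MYRINET_COST_FACTOR))  # 0.75
print(upgrade_pays_off(1.5, MYRINET_COST_FACTOR))      # False
print(upgrade_pays_off(2.5, MYRINET_COST_FACTOR))      # True
```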

Why didn't you rackmount the nodes?

   Rackmounting would cost several hundred dollars per node.  Saving 30
   square feet of floor space did not seem to be worth $20,000 to us.
   Increased reliability from redundant cooling and power may justify
   more complex packaging, but we have no experience with such a setup.

What is the Cyclades serial network for?

   A "feature" of DEC's AlphaBIOS is that a headless node will not
   automatically boot without a carriage return over the serial port.
   The Cyclades network was the best short-term solution to this problem,
   and also offered a diagnostic and control network which is independent
   of the ethernet.
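
   Sending the carriage return can be sketched as below.  This is an
   illustration under assumptions, not the script used on Avalon: the
   device paths passed in are whatever the serial ports are called on
   the controlling host, and the sketch assumes the line settings
   (baud rate and so on) have already been configured elsewhere.

```python
def send_carriage_return(port_path):
    """Write a single carriage return to one serial device."""
    # buffering=0 makes the write go straight to the device.
    with open(port_path, "wb", buffering=0) as port:
        port.write(b"\r")

def wake_nodes(port_paths):
    """Send a carriage return to each port; return {path: error message}."""
    failures = {}
    for path in port_paths:
        try:
            send_carriage_return(path)
        except OSError as exc:
            failures[path] = str(exc)
    return failures
```

   The function returns a dictionary of ports that could not be
   written, so an empty result means every node was sent its
   carriage return.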

Wouldn't a different type of ethernet card provide better performance?

   That is a very good question.  Because of alignment restrictions,
   the DEC Tulip chipset (as found in the Kingston cards we are using)
   performs sub-optimally on Alpha hardware by forcing an unnecessary
   buffer copy.  We originally intended to use the SMC EtherPower II 
   card, but found that this card does not work in the current C0 revision 
   of the 164LX motherboard (it works great in the B3 revision).  This
   problem is currently under investigation.
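
   The alignment issue can be illustrated with a little arithmetic.
   The explanation here is an assumption about the usual cause, not a
   statement from the driver source: an ethernet header is 14 bytes
   long, so if the chip requires the receive buffer to start on a
   4-byte boundary, the IP header that follows it starts 2 bytes off
   that boundary, and the packet is copied to realign it.

```python
ETHERNET_HEADER_BYTES = 14  # 6 destination + 6 source + 2 type
WORD_BYTES = 4              # alignment wanted for the IP header

def ip_header_misalignment(buffer_start):
    """Bytes by which the IP header misses a 4-byte boundary.

    buffer_start is the offset of the receive buffer from an aligned
    address; 0 means the buffer itself is 4-byte aligned.
    """
    return (buffer_start + ETHERNET_HEADER_BYTES) % WORD_BYTES

print(ip_header_misalignment(0))  # 2: aligned buffer, misaligned IP header
print(ip_header_misalignment(2))  # 0: buffer shifted by 2, IP header aligned
```

   A card that accepts a receive buffer starting 2 bytes past an
   aligned address can avoid the copy, which is the second case shown.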
   
Where did the name Avalon come from?

   It's an oblique Beowulf reference.  Read "The Legacy of Heorot"
   and "Beowulf's Children".

Why are you doing this?

   We wanted a supercomputer, and this solution seemed to be the best
   balance of cost and utility.   Please realize that this is not a 
   research project, and all the money we had to spend was spent on 
   hardware.  All of the system management and code development is being 
   provided on a volunteer basis from scientists and system managers who 
   are paid to do other things.  Don't expect to see a lot of "value
   added" like fancy new parallel languages and "global shared memory".  
   We are not Computer Scientists, we just want to run our physics 
   simulations using the best hardware and software that we can find.

Where can I learn more?

   http://cnls.lanl.gov/avalon

--
Michael S. Warren			Email:	msw@lanl.gov
Theoretical Astrophysics, T-6		URL:	http://qso.lanl.gov/~msw/
Mail Stop B288				Phone:	(505) 665-5023
Los Alamos National Laboratory		FAX:	(505) 665-3003
Los Alamos, NM 87545