Contemplating Clusters
Posted 2021-03-30 13:41 PDT | Tags: software hardware
As far as I can tell, there's not a lot of good documentation out there for making a computer cluster out of open-source software.
Back in 2003, I suggested to my boss that we publish a How-To document, since we were supposed to be all open-sourcey and stuff, but he didn't think it was a good idea. He said everyone was doing what we were doing, and there was no point in documenting what would soon become common knowledge.
That sounded totally reasonable at the time, and the broad strokes had already been written out in the Beowulf Cluster How-To, but the industry has taken a quirky turn since then. The FAANG companies have hoovered up nearly everyone with distributed systems experience, and most companies that would have built their own clusters fifteen years ago are renting VMs in "The Cloud" instead.
"The Cloud" is provided by those same FAANG companies, who run customers' workloads on their big-ass in-house clusters and jealously guard their operational art.  This bodes ill for open documentation.
On one hand the economies of scale are hard to argue against, but on the other hand there are still niche uses for clustering up a big pile of hardware, and guides for doing it well haven't improved much in the last twenty years.  If anything they have grown dated or fallen off the web entirely.  I had to dig this practical guide out of the Wayback Machine:
Someone in r/HPC asked for advice making a thirty-node cluster, and after looking at the current How-To docs, I pointed them at those and offered some advice on power and heat management, since that aspect of cluster operations was totally absent from all of the documents I could find.
It's been eating at me, because a lot more is absent from those documents too -- scaling factors, effective use of ssh multiplexing, job control frameworks (like Gearman), tiered master nodes, monitoring.  It makes me think that either there's documentation I haven't found or more needs to be written.
Rather than leaping right in and brain-dumping to this blog, I've joined #clusterlabs-dev and #linux-cluster on Freenode, to get a feel for how the community thinks about these things nowadays.  I haven't built a nontrivial cluster in nearly ten years, and some of my skills are bound to be a bit stale and perhaps irrelevant.
There are a few more modern guides, but they too are narrow of scope, like
The #clusterlabs-dev channel folks have thus far pointed me at and!_and_why_do_I_care%3F which is nice and modern, but also quite narrow.
When I asked them about power/heat management best practices, they responded with a resounding "meh!", so perhaps I'll write about that first.  Doubtless my FAANG friends will rip it to shreds, but that will just make for a better second draft.