Contemplating Clusters
Posted 2021-03-30 13:41 PDT | Tags: software hardware
 
As far as I can tell, there's not a lot of good documentation out there for making a computer cluster out of open-source software.
 
Back in 2003, I suggested to my boss at archive.org that we publish a How-To document, since we were supposed to be all open-sourcey and stuff, but he didn't think it was a good idea. He said everyone was doing what we were doing, and there was no point in documenting what would soon become common knowledge.
 
That sounded totally reasonable at the time, and the broad strokes had already been written out in the Beowulf Cluster How-To http://www.faqs.org/docs/Linux-HOWTO/Beowulf-HOWTO.html but the industry has taken a quirky turn since then. The FAANG companies have hoovered up nearly everyone with distributed systems experience, and most companies which would have built their own clusters fifteen years ago are renting VMs in "The Cloud" instead.
 
"The Cloud" is provided by those same FAANG companies. who run customers' work loads on their big-ass in-house clusters and jealously guard their operational art.  This bodes ill for open docmentation.
 
On the one hand, the economies of scale are hard to argue against, but on the other hand there are still niche uses for clustering up a big pile of hardware, and guides for doing it well haven't improved much in the last twenty years.  If anything, they have grown dated or fallen off the web entirely.  I had to dig this practical guide out of the Wayback Machine: https://web.archive.org/web/20090329024138/http://www.cacr.caltech.edu/beowulf/tutorial/building.html
 
Someone in r/HPC asked for advice on building a thirty-node cluster, and after looking at the current How-To docs, I pointed them at those and offered some advice on power and heat management, since that aspect of cluster operations was totally absent from all of the documents I could find.
 
It's been eating at me, because there's a lot more missing from those documents -- scaling factors, effective use of ssh multiplexing, job control frameworks (like Gearman), tiered master nodes, monitoring .. it makes me think that either there's documentation I haven't found, or more needs to be written.
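
To give a taste of what I mean: ssh multiplexing alone can make a real difference when a master node is orchestrating dozens of workers, and it only takes a few lines of client-side configuration, yet none of the guides mention it.  A minimal sketch (the host pattern and timeout are just illustrative):

  # ~/.ssh/config on the master node
  Host node-*
      ControlMaster auto                # first connection becomes the master
      ControlPath ~/.ssh/mux-%r@%h:%p   # socket shared by later connections
      ControlPersist 10m                # keep master open 10 min after last use

With that in place, repeated ssh/scp invocations to the same node reuse one TCP connection instead of renegotiating a new one every time, which matters when a job control framework is shelling out to hundreds of nodes in a loop.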
 
Rather than leaping right in and brain-dumping to this blog, I've joined #clusterlabs-dev and #linux-cluster on Freenode, to get a feel for how the community thinks about these things nowadays.  I haven't built a nontrivial cluster in nearly ten years, and some of my skills are bound to be a bit stale and perhaps irrelevant.
 
There are a few more modern guides, but they too are narrow in scope, like https://github.com/darshanmandge/Cluster
 
The #clusterlabs-dev channel folks have thus far pointed me at https://github.com/ClusterLabs/anvil and https://www.alteeve.com/w/What_is_an_Anvil!_and_why_do_I_care%3F which is nice and modern, but also quite narrow.
 
When I asked them about power/heat management best practices, they responded with a resounding "meh!" so perhaps I'll write about that first.  Doubtless my FAANG friends will rip it to shreds, but that will just make for a better second draft.

 
Building an Alternative to RHEL6: Saving the Repositories
Posted 2020-07-06 14:04 PDT | Tags: software rhel6fork
 
It has seemed to me for some years now that there is quite a bit of demand from the enterprise sector for a RHEL-like distribution which is just a forked RHEL6 https://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux with an updated kernel and packages, omitting the "features" Red Hat introduced with RHEL7.
 
The desire to avoid these features is motivating a lot of businesses to keep using RHEL6 (or its derivatives) beyond its EOL, which is a suboptimal solution for everyone involved.  Red Hat has chosen to accommodate those customers to a degree by offering "Extended Lifecycle Support" for RHEL6 through the end of June 2024, so they are at least getting security patches for these systems.
 
The longer RHEL6 users lack alternatives to RHEL7/8 (and their derivatives), the more those users will feel pressured to make the painful transition to an operating system which is less reliable and harder to maintain.  The best time to develop an alternative was five years ago, when most RHEL6-using businesses were just starting to weigh their options, but I don't think the window of opportunity has closed quite yet.  There are still many foot-dragging stragglers who would welcome the chance to perpetuate their accustomed infrastructure.  If enough of these stragglers adopted the alternative and found it good, it might entice other companies which have already made the transition (and have been suffering the consequences) to transition back.
 
I've talked to some people who voiced interest in such a project, but I couldn't get them to talk to each other, and AFAIK they didn't follow through.  I'm interested in contributing to the project, but have resisted taking the lead because I'm already drowning in unfinished projects and would rather invest my time and energy in making Slackware more enterprise-ready.  Existing RHEL6 users would likely find an RHEL-based alternative more appealing, though.
 
Time is not on our side.  One of the consequences of delaying the fork is that the RHEL6 package repositories are being neglected and falling into disrepair.  Information that an RHEL6 fork would need as a reference is being lost.
 
I've been downloading Scientific Linux 6 https://en.wikipedia.org/wiki/Scientific_Linux and the related RHEL6 repositories -- sl, sl6x, sl-other, epel, adobe-linux, and rpmforge.  SL6 seems like the best RHEL6 clone on which to base a fork, because its community is very fork-friendly.  They have invested resources in making "spins" (shallow forks) easy for their users, and would gladly give advice to a fork project.  In contrast, when I brought up the notion of a fork in the CentOS6 forums, it was received with uniform hostility.  The difference was like night and day.
 
The downloads are ongoing.  I've got about 400GB so far, with I think about 200GB more to go.  It's quite a bit of information, but not unmanageable.
 
I'd still appreciate it if someone else took point, but while I'm waiting for that champion, I'll put in some work to make sure the repository data is complete and available as a working SL6 mirror.  If a project fully materializes, I'd be happy to hand over the mirror or manage it on behalf of the project.
 
I tried using yum and createrepo to construct a local mirror in the prescribed manner https://www.unixmen.com/setup-local-yum-repository-on-centos-rhel-scientific-linux-6-4/ but the repo metadata was in such disrepair that some repos wouldn't work.  Some mirror lists are stale, some domains have disappeared, some mirrors are -empty- (the directories are there, but no files), some are redirecting to nonexistent locations, and some have slightly-wrong packages which make yum unhappy.  I've switched to just wget'ing the entire contents of good mirrors (once I found them) and will reorganize them into proper repos and fix broken dependencies later.
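
For the curious, the brute-force approach looks roughly like this (the mirror URL and local paths are placeholders, not a recommendation of any particular mirror):

  # Pull down everything under a known-good mirror's SL6 tree:
  wget --mirror --no-parent --no-host-directories --cut-dirs=1 \
       http://mirror.example.org/scientific/6.10/x86_64/

  # Later, regenerate clean metadata so yum can treat the tree as a repo:
  createrepo --update /srv/mirrors/sl6/6.10/x86_64/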
 
That dysfunction will only deepen with time, which makes it all the more important to obtain copies of these repositories -now-.
 
I'll also see about getting it onto redundant storage.  I'd hate to lose it all to a disk crash.
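
In the meantime, even a dumb nightly rsync to a second machine beats nothing.  A minimal sketch, assuming a second host (the name and paths are hypothetical):

  # nightly cron job on the mirror box
  rsync -a --delete /srv/mirrors/sl6/ backuphost:/srv/mirrors/sl6/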

 
"The Tragedy of systemd", a rebuttal
Posted 2018-11-17 14:46 PST | Tags: software systemd
 
Benno Rice gave an interesting talk at BSDCan 2018, titled "The Tragedy of systemd" -- https://www.youtube.com/watch?v=6AeWu1fZ7bY
 
He presents a very sympathetic narrative about systemd from the BSD perspective, and he does it fairly well, but he also makes some assertions of dubious validity.
 
Early in the talk he proposes there is a "confusion" between system configuration and service bootstrap in traditional UNIXy models. He talks about mounting filesystems and bringing up the network, and how these are "slightly disparate things" which should be treated differently and managed with different tools.
 
From my perspective, one of the strengths of the UNIX approach is that it abstracts a great many "slightly disparate things" as though they were the same thing. Making (almost) everything a file is a good example of this. When I was new to UNIX, it seemed strange to me, but over time I have grown appreciative of how powerful this abstraction can be.
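
To make that concrete, the same handful of tools works on devices, kernel state, and ordinary files alike:

  cat /proc/loadavg                        # kernel state, read like a file
  head -c 16 /dev/urandom | xxd            # a device, read like a file
  echo web01 > /proc/sys/kernel/hostname   # kernel tunable, written like a file (as root)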
 
When things are sufficiently different that they need to be treated as something other than a file, we have system calls like ioctl() specifically for that, and in my experience that has worked well.
 
Later he makes a curious assertion without any accompanying reasoning to support it: that "other things manage services well [..] Windows always had a strong notion of services". He proceeds on the assumption that this is true to discuss Windows' contract-oriented (declarative) approach, which he says "has been kind of neat".
 
I've known people who were used to Windows service management, but I've never known anyone familiar with both Windows and UNIX who preferred the Windows approach. Mostly I've heard them complain that Windows was inflexible and opaque. I've not spent any time configuring Windows services myself, so perhaps some of you with such experience could weigh in on this. Is there anything to be said about Windows' declarative service management approach which warrants admiration?
 
Benno also praises the ideas behind launchd -- "if you need to have services running all the time, and you can't call the system booted until services have started, that's such a pain in the ass" and "We know we'll need this at some point but we won't start it until we actually need it." This is an approach systemd borrowed from launchd: not starting services until something requests them.
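
For concreteness, in systemd that idea looks something like the following pair of units (the names and port are hypothetical). Systemd itself listens on the socket, and the service isn't started until the first connection arrives:

  # exampled.socket -- systemd owns the listening socket from early boot
  [Socket]
  ListenStream=8080

  [Install]
  WantedBy=sockets.target

  # exampled.service -- launched on demand when the socket first sees traffic
  [Service]
  ExecStart=/usr/local/bin/exampled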
 
The problem I have with this approach (which is also applicable to inetd) is that when the system comes up, I want to know that its services will work. The litmus test for this is bringing the service up so that it processes its configuration and finds any errors or missing dependencies. This can be improved upon by monitoring services with things like Nagios, which submits requests and validates the results.
 
If the service doesn't come up until it's needed, then we don't know whether it will work until the moment something actually needs it, which seems like a really bad idea. Also, this approach presupposes a lack of monitoring, and production systems absolutely should be monitored. If Nagios probes a service immediately (or soon) after system start, triggering its deferred startup anyway, then why did we bother deferring it in the first place?
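
To put it another way: a bog-standard Nagios service check like this one (the host name and interval are placeholders) will poke the "deferred" service within minutes of boot anyway, forcing it to start:

  define service {
      host_name            web01
      service_description  HTTP
      check_command        check_http
      check_interval       5      ; probe every five minutes
  }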
 
He also talks about how modern systems need to be more reactive to changes around them, which is absolutely true, but IMO the traditional UNIXy toolbox has adapted fairly well in this regard. It's not done yet -- things like changing wireless networks without interrupting TCP connections still need some work -- but pitching the toolbox entirely is unwarranted.
 
He is dismissive of criticism of systemd as buggy, saying sardonically "it's software" and "we've all had bugs in our code". He says that if we hold init to a higher standard of quality, that implies we can never write another pid 1.
 
That is simplistic to the point of dishonesty. There are certainly ways we can reduce or mitigate bugs in our software. We can make software simpler, so that there are fewer things to go wrong. Less code means fewer bugs. We can also limit critical dependencies so that a failure in one component doesn't cause the system to fail more broadly (as is the case with systemd when dbus fails). We can also hold on to known-good, already-debugged software until such time as new software has absorbed enough debug/release cycles on ancillary systems that we can trust it on mission-critical systems.
 
Systemd as a project does none of these things. It is by no means simple. It contains a great deal of code, and thus a great number of bugs, which its developers have shown themselves reluctant to fix. It is tightly integrated with a variety of far-flung components, and is vulnerable to any of them failing. It has been forced into adoption on mission-critical systems before it was ready (as RHEL6/CentOS6 fall out of support and there are no sufficiently good alternatives to RHEL-derived distributions in the enterprise).
 
Benno also says "UNIX as a concept is dead" in that we no longer have a diversity of UNIX systems across which software must be ported, and we don't have to be beholden to POSIX when it is inconvenient. This seemed curious, coming from a BSD developer in a Linux-dominated world. Projects still have reason to be portable. They can get away with being Linux-specific because Linux is dominant, but that's no substitute for being portable. How many projects of the past targeted the dominant platform of their time, only to be left behind when the dominant platform changed? There is a vast ocean of excellent software written for MS-DOS or classic MacOS which was never ported forward to those platforms' successors.
 
He then talks about how "change can be scary" and "change threatens what we find familiar". These are true things, but he makes it sound like anyone who opposes systemd is a narrow-minded neanderthal. He also does not recognize that some change can be genuinely bad, or that there might be other ways to change which are less bad. He assumes that because people are reacting badly to the changes represented by systemd, those people will always react badly to change, and challenges them to overcome their "kneejerk reactions" and accept systemd rather than oppose all change in general.
 
As someone who dislikes systemd on the basis of its design and implementation, that felt a little disingenuous.
 
He said a lot more which I'm still mulling over, but these points stood out to me as rather uncompelling, and they make me view all of his points with a sharply critical eye.
 
Blog comments are not working at the moment, but please feel free to reply via these alternative channels:
 
   * My LQ blog -- https://www.linuxquestions.org/questions/blog/ttk-652585/
 
   * IRC channel ##slackware-help on freenode
 
   * Facebook -- https://www.facebook.com/ttkciar
 
   * Twitter -- https://twitter.com/ttk_ciar
 
   * Email (please let me know if I may share your message via this blog) -- ttk (at) ciar (dot) org

 
New Blog Seems To Be Working
Posted 2017-06-25 10:56 PDT | Tags: blogging software
 
Welcome to my new blog, "Entropy Pump"!  I finally gave up on Blosxom and wrote my own blogging software from scratch.  It ended up taking less time than all the hours I sank into trying to make Blosxom not suck.
 
That having been said, it is not complete.  The search feature doesn't work yet, there is still no way for visitors to leave comments, there's no way to link to a specific post, and pagination is broken.  If I don't fix pagination before I've made twenty posts, only the twenty newest will be displayed, with no way of accessing the older ones.  So I'd better fix pagination!
 
Also the side-panels to the right of the screen, over there --> are not very interesting.  I'll come up with more interesting content and replace them.
 
I'm fairly pleased with how the html and css turned out.  This page has a much spiffier look-and-feel than my old blogs hosted by LinuxQuestions and Slashdot, and it absolutely puts my main website to shame.  It's not great (I kind of suck at css) but I finally feel like I can give people links to my blog without being embarrassed about it.
 
Thinking the top priority will be permalinks for posts .. the title of each post should be a permalink.  Second priority will be pagination.  Then I can think about full-text search (Lucy or dezi?).  Also, maybe improving the markup a bit.  It would be nice if "Lucy" and "dezi" were links to https://metacpan.org/pod/distribution/Lucy/lib/Lucy.pod and https://dezi.org/ respectively, but right now the blog only knows how to expand full URLs and Wikipedia references.
 
As introductions go, this one's pretty boring.  Perhaps another entry is in order, ruminating on the significance of "Entropy Pump".  Will do that soon.
 
UPDATE 2017-06-25 12:25 -- adding permalinks was really, really trivial.