Linus Torvalds, Linux, and the Issue of Software Quality

Friend and Maemo/MeeGo bugjar master Stephen Gadsby alerted twitterites yesterday to a Fedora bugzilla flamefest, and at first blush it made for interesting comic relief.  Who doesn’t enjoy a good Internet argument?

But a second read sobered me up quickly.  The bug turned out to be an issue introduced into the crucial (and occasionally controversial) glibc code library, in a change that doesn’t appear to have been sufficiently regression-tested.  The stated reason for the code change is an execution speed improvement, but it appears to have come at the expense of pre-emptive error-checking.

Most people aren’t going to care about the technical reasons underlying the discovered bug.  Most will, instead, be concerned with its impact.  And that gets us to my reason for writing today.

The first known manifestation appears to have shown up on a Flash-powered website (save that information for later; we’ll get back to it).  Distorted audio was noted.  Long story short, impressive detective work on the part of the Red Hat Linux community narrowed the cause down to an efficiency improvement in glibc that had the unfortunate side-effect of corrupting system memory.  Further investigation showed that the problem does not exist in Fedora 13 but is blatantly apparent in the Fedora 14 release.

What’s particularly noteworthy in the case of this bug is the participation of Linus Torvalds, famous father of Linux.  Linus’ involvement in this bug report escalated when certain community members defended the rationale behind the recent glibc code change, and hyperfocused on details that are either irrelevant (e.g., the proprietary nature of Adobe’s Flash technology) or lacking in critical context.  What followed was the typical aggressive exchange common on the internet when two sides are each right in their own way, but one fails to recognize the bigger picture.

The big picture in this sense has users in it.

Linus very appropriately identified the problem:

I’d personally suggest that glibc just alias memcpy() to memmove().

Yes, clearly overlapping memcpy’s are invalid code, but equally clearly they do happen. And from a user perspective, what’s the advantage of doing the wrong thing when you could just do the right thing? Optimize the hell out of memmove(), by all means.

Of course, it would be even better if the distro also made it easy for developers to see the bad memcpy’s, so that they can fix their apps. Even if they’d _work_ fine (due to the memcpy just doing the RightThing(tm)), fixing the app is also the right thing to do, and this would just make Fedora and glibc look good.

Rather than make it look bad in the eyes of users who really don’t care _why_ flash doesn’t work, they just see it not working right.

There is no advantage to being just difficult and saying “that app does something that it shouldn’t do, so who cares?”. That’s not going to help the _user_, is it?

And what was the point of making a distro again? Was it to teach everybody a lesson, or was it to give the user a nice experience?

That last rhetorical question is key here.  Purists defend the recent glibc changes, regardless of detrimental impact, on the basis of the ostensible speed improvements, and claim that it is up to application developers, such as Adobe’s, to exercise the due diligence necessary to prevent memory corruption.  But such defenses blithely ignore the responsibility of upstream developers to implement reasonable safeguards, and even more importantly, the entire raison d’être of software in the first place:

To solve a problem for users.

Along with cohorts Dan Leinir Turthra Jensen and Timo Härkönen, I covered this topic tongue-in-cheek in a presentation at the inaugural MeeGo Conference in 2010.  But the lighthearted approach doesn’t take away the seriousness of the subject.  When I ask “who are we coding for?”, I believe I’m in the same ballpark as Linus Torvalds.  After all, if execution speed at any cost is the goal, let’s strip out all error checking from core code libraries and let downstream developers worry about the consequences.  Right?  Think of the “wasted” clock cycles we could get back!

But in all seriousness, we don’t code in a vacuum.  Our work has consequences.  As developers, upstream, downstream or at any point along the solution continuum, we need to exercise a practical, reasonable responsibility to protect users from software mischief.  Of course, we will still disagree on what is reasonable at times, but as Linus points out, simply adding users into the equation should resolve that dilemma in most cases.  How useful is it to code from ivory towers?

It’s difficult to do full justice to the discussion that inspired this post.  There’s a great deal of history and bias involved, which threatens to pull the discussion into old, festering tangents.  And I’m certainly not trying to demonize one side in the debate, or trivialize the validity of any fact-based points.  But I believe the quick and detail-focused defense of the change and its risks is disingenuous, and exposes a flawed process.  Even worse, I believe that embedded in and underlying those defenses is the idealist thinking that marginalizes Linux as a “geeks only” operating system.

Who are we coding for?

The needs and expectations of users must be a key part of any solution development process, and indeed I highly recommend that average users be involved to some extent in regression testing.  I have found that my best testers fit into one or both of two categories: people willfully trying to break things, and/or those who are unfamiliar with the application or its ecosystem.  Satisfy those two classes of users, and odds are you’re putting out a fault-tolerant product… at the very least.


12 responses to “Linus Torvalds, Linux, and the Issue of Software Quality”

  1. Thanks for the insight Tex. Respect for Linus T. dutifully updated as a result…

    But please, if you’re going to use hip French expressions, be a good Norman Spinrad and look ’em up first : it’s “raison d’être” and nothing else 🙂

  2. gbeddingfield

    I dunno. memcpy() is well known to only work with non-overlapping buffers. Doing stuff like aliasing memcpy() to memmove() because Adobe can’t be bothered to do it right is how we end up with crap like IE6.

    I would favor (a) some kind of transition-period warning for F14 and then (b) for >F14 a quick check for overlapping buffers that results in a SIGSEGV when the assumption is violated.

    • Well, I don’t mean to focus as much on the first statements in that quote as I do on the last. But the point lost context if I edited anything out, and got too busy if I added further quotes. The summation, though, is what’s key.

      • gbeddingfield

        Yes, I agree with the summation. However, it’s implying that the NOTABUG crowd isn’t considering the needs of the user. That’s not true. They consider it better for the user in the long-run to call a spade a spade (NOTABUG) and force the program that /actually has the bug/ to fix it. In the long-run, this does indeed serve the user better. Putting a band-aid on it now usually means it will never be fixed properly and we end up with dumb API rules.

      • I agree with *your* assessment. What I don’t agree with are many, many of the comments defending the cause of the bug, and dismissing some of the “NOTABUG” responses without considering the actual points.

        I’m all about addressing root cause, but sometimes that involves a comprehensive process, and some participate very well while others cling a bit too tightly to semantics that detract from the issue.

      • Also, just curious: what would be the negative consequences of regressing back to the previous version of glibc with an interim release of F14 pending a permanent fix in the library?

      • gbeddingfield

        OK, Linus convinced me with this comment: 🙂

        As for the negative consequence… it depends on how impending F14’s release is. If it’s very near, I say work around it. If it’s farther off… then the negative is that the bug will be put off and possibly forgotten. But as I said, Linus convinced me about the implementation details.

      • Yeah, I was tempted to at least link that comment and probably should have followed that instinct. Maybe an edit…

  3. You forgot to mention one key point; the other side has absolutely no argument in favor.

    There is no performance improvement, the code is not more maintainable, nor simpler, there’s _no_ reason whatsoever to break people’s software. Yet they do it. Why? Because they can, and POSIX says they can.
