A Higgs-Bugson in the Linux Kernel

blog.janestreet.com

167 points by Ne02ptzero 20 hours ago

gnfargbl 3 hours ago

Calling this a "Higgs-Bugson" doesn't make a lot of sense. There's nothing uncertain or difficult to reproduce about the Higgs.

The reason that it took so long to find was that the cross-section of production is very low, the decay signatures are hard to separate from the background, the specific energy scale it existed at was not well-defined, and building the LHC was (to put it mildly) difficult and expensive.

Roughly, if you'll forgive a bad analogy from a long-lapsed physicist, it was the equivalent of trying to find a very weak glow from a specific type of bug hiding at an unknown location in a huge field of corn. Except that your vision was very bad, so you had to invent a new type of previously-unimaginably excellent eyeglasses to see the thing. Also before you could even start looking you had to expend a painful amount of time and money building a flashlight so incredibly huge that it needed new types of cryogenic cooling inventing, just to stop it from melting when you switched it on.

If you had a software bug that you were almost certain was there, but you needed half of the world's GPU clusters for three years to locate and prove it, then that would be a Higgs-Bugson.

alexpotato 2 hours ago

Regarding NFS, I've always loved this quote from the CTO at a hedge fund I once worked at:

"NFS is lot like heroin:

at first, it seems amazing.

But then it ruins your life"

(This is a place that did almost EVERYTHING via NFS including different applications communicating via shared files on NFS mounts. It also had the weird setup of using BOTH Linux AND Windows permissions on NFS mounts shared between user desktops [windows] an servers [linux])

stavros an hour ago

The problem I have with reviews like these is that they're expressed in absolute terms. Yes, NFS might ruin my life, but if it ruins my life less than every other alternative, it's still a win.
- eqvinox an hour ago
  
  I'd go as far as saying most networked concurrent file access will ruin your life one way or another, because it's just a hard problem, and it's trying to solve it at a very odd layer; a "classic" fs can't really take advantage of higher layer transactional or other known constraints in order to make things work better…

vrnvu 4 hours ago

I'd like to highlight this:

>NFS with Kerberos

secure, simple, battle tested. no crazy architecture

works so well a bug showed up in the kernel :-)

eqvinox an hour ago

> works so well a bug showed up in the kernel :-)
What exactly are you trying to highlight here? Most code has bugs. This one is someone forgetting to stick to actual behavior described in 1997, it's a mistake, mistakes happen. Which one of "secure", "simple", "battle tested" and "no crazy architecture" do you think this disproves?
Or do you think CIFS or Ceph have no bugs?
- gyesxnuibh 41 minutes ago
  
  I think they're saying typically the kernel one of the last places you'd expect the bug, so it shows that it is battle tested?
  I don't think they're being snarky.

hglee 11 hours ago

https://lists.openwall.net/linux-kernel/2025/03/19/1374

penguin_booze 8 hours ago

I wish developers--new and old alike--pay attention to the commit messages that goes into the kernel. Granted, it takes a subject matter expert to really understand what's being said, but the general format and layout of commit messages is instructive. Commit messages helps the reader/reviewer get their bearings; they also help to build the case from the bottom up.
The fact that the development team is globally distributed both necessitates this kind of knowledge serialization and preserves it for posterity. It's completely different from tapping a colleague sitting next you on the shoulder, and saying "psst, can you approve this quick? It's just a bunch of fixes".

protocolture 11 hours ago

I love the term "Higgs Bugson". Its much better than what I usually do which is just call a system haunted.

GTP 5 hours ago

I was used to the more common Heisenbug, but I find Higgs-Bugson more funny.
EarlKing 7 hours ago

Haunted? Hell, it's positively possessed.

anonymousiam 10 hours ago

"The normal timeout logic can take care of retransmission in the unlikely case that one is needed."

NFS can be run over TCP or UDP. Does the retransmission occur when using UDP?

ninjha 8 hours ago

Yes! The retransmission logic in Linux NFS is independent of transport (see the `retrans` option in `mount.nfs`).
Weirdly enough this also means that if you’re running with TCP you can have retransmits at the NFS/SunRPC level and at the TCP level depending on your configuration.

jxjnskkzxxhx 31 minutes ago

Content marketing for Jane street.

sedatk 11 hours ago

> A higgs-bugson is a bug that is reported in practice but difficult to reproduce

This was the first time I heard of "higgs-bugson". The term sounded so forced that I had to know how it differed from Heisenbug. In short, it doesn't[1].

Then why did it even exist?

The term somehow made it to the "Heisenbug"'s Wikipedia page[1], so I checked the sources. There were two and both end up at the same site: Jeff Atwood's blog post[2] quoting some StackOverflow answers to a poll-like question ("what's a programming term you coined?") because he wanted to remove lighthearted content from the site as he thought it clashed with SO's mission of educating people and advancing their skills[3].

There was a proposal on Meta StackExchange about undeleting that question with the answers, but it was refused by Jeff Atwood again because it invited "made up stuff"[4] among other reasons.

So, Wikipedia in the end, has this term in Heisenbug page because someone just blurted out something in 2010, it was copy-pasted to a blog, and then got scooped up by some news outlet. There are no other sources. Kagi doesn't find any instances of the term before it was coined on StackOverflow in 2010. For all we know, "gingerbreadboy" from England invented it.

The irony is that the term somehow made it to the literature -hence the blog post here- because someone was just having fun at StackOverflow. It obviously either sounded good, or just clicked that others started using it. StackOverflow deleted the content that actually made a small part of computer science history because it wasn't "serious".

In other words, StackOverflow cut off one of its history-making parts because it had an incomplete and simplistic view of useful. I think it might be possible to draw a line from their understanding of communities and societal dynamics to the downfall of StackOverflow after the emergence of AI[5].

[1] https://en.wikipedia.org/wiki/Heisenbug

[2] https://blog.codinghorror.com/new-programming-jargon/

[3] https://stackoverflow.blog/2010/01/04/stack-overflow-where-w...

[4] https://meta.stackexchange.com/questions/122164/can-we-un-de...

[5] https://blog.pragmaticengineer.com/stack-overflow-is-almost-...

dh2022 9 hours ago

I think Heisenbug refers to a bug that stops repro’ing during debugging (the act of observing the system changes the system behavior). This bug was different: it was very rare and debugging it didn’t make it go away.
zahlman 9 hours ago

> because he wanted to remove lighthearted content from the site as he thought it clashed with SO's mission of educating people and advancing their skills[3].
No; he wanted to remove discussion and socialization, because it clashed with SO's mission of presenting useful information without parsing through others' discussion.
https://meta.stackexchange.com/questions/2950
https://meta.stackexchange.com/questions/19665
https://meta.stackexchange.com/questions/92107
https://meta.stackexchange.com/questions/131009
> In other words, StackOverflow cut off one of its history-making parts because it had an incomplete and simplistic view of useful.
How does this in any way demonstrate that the view of usefulness was "incomplete" or "simplistic"?
How is the deleted content "useful"?
> I think it might be possible to draw a line from their understanding of communities and societal dynamics to the downfall of StackOverflow after the emergence of AI[5].
What downfall?
Before you point at any of the incoming-question-rate statistics: why should they be interpreted as representing a "downfall"? That is, why is it actually bad if fewer questions are asked?
Before you answer that, keep in mind that Stack Overflow already has more than three times as many publicly visible questions about programming as Wikipedia has articles about literally anything notable.
- robertlagrant 7 hours ago
  
  > why should they be interpreted as representing a "downfall"?
  I agree, but also SO has certainly gone through ups and downs. It does feel as though it's now in a terminal "down" having invested its limited resources in things lots of the dedicated members didn't seem to want, instead of basic improvements to moderation and to chat features.
chris_wot 10 hours ago

Yeah, stack overflow is dying, we all know it.

Havoc 4 hours ago

Didn't know jane street did tech writeups

Balinares 2 hours ago

It's a great recruitment device. It takes a certain kind of nerd to salivate over the glorious technical depths that such a write-up goes into, and for the kind of company who values this flavor of nerd, this is a great way to attract their attention.

konsalexee 7 hours ago

TIL higgs-bugson and Heisenbug

ribcage 6 hours ago

[dead]

snvzz 6 hours ago

With millions of LoCs, it is no surprise there are bugs.

Worse yet, the kernel runs in supervisor mode.

This kernel design is bankrupt. There's much better available, such as seL4+Genode.

eqvinox 4 hours ago

Please try keeping your snide comments to issues they actually apply to. This is a logic bug, with the kernel missing a piece of abnormality handling. You can get the exact same bug in a microkernel (or, FWIW, a memory safe, e.g. Rust) implementation; neither of those concepts help here.
lotharcable an hour ago

> This kernel design is bankrupt. There's much better available, such as seL4+Genode.
I am sure that the tech community would love to read the details of your great success in deploying microkernels for large variety of production workloads.
eddythompson80 5 hours ago

seL4+Genode is equally as bankrupt. I run my code in the SMM anyway.