dmd 5 hours ago

Not the 10000, but I admin'd a 4500 back in 1999 at Bristol-Myers Squibb at the ripe old age of 21. It was running Sun's mail server, which required constant care and feeding to even remotely reliably serve our 30,000+ users.

One time it just stopped responding, and my boss said "now, pay attention" and body-checked the machine as hard as he could.

It immediately started pinging again, and he refused to say anything else about it.

  • defaultcompany 5 hours ago

    This reminds me of the “drop fix” for the SPARCstation, where people would pick up the box and drop it to reseat the PROMs.

    • linsomniac 3 hours ago

      Amiga had a similar issue. One of the chips (Fat Agnus, IIRC?) didn't quite fit in the socket correctly, and a common fix was to pull out the drive mechanisms and drop the chassis something like a foot onto a carpeted floor.

      Somewhat related: one morning I was in the office early and an accounting person came in and asked me for help; her computer wouldn't turn on, and I was the only other one in the office. I went over, poked the power button, and nothing happened. This was on a PC clone. She had a picture of her daughter on top of the computer, so I picked it up, gave the computer a good solid whack on the side, set the picture back down, and poked the power button, and it came to life.

      We call this: Percussive Engineering

    • badc0ffee an hour ago

      Apparently you also had to do this with the Apple ///.

  • theideaofcoffee 5 hours ago

    Ah, percussive maintenance! Also good for reseating disks that just don’t quite reliably get enumerated: slam the thing back in. I had to do something similar on a power supply for a V440; thankfully it was a month or so away from retirement, so I didn’t feel too bad giving it some encouragement like that. Great machines.

eugenekay 4 hours ago

Throughout the late 90s, “Mail.com” provided white-label SMTP services for a lot of businesses, and was one of the early major “free email” providers. Each free user had a storage limit of something like 10MB, which was plenty in an era before HTML email and attachments were commonplace. There were racks upon racks of SCSI disks from various vendors for the backend - but the front end was all standard Sendmail, running on Solaris servers.

Anyway, here are the front-end SMTP servers in 1999, then in service at 25 Broadway, NYC. I am not sure exactly which model these were, but they were BIG Iron! https://kashpureff.org/album/1999/1999-08-07/M0000002.jpg

  • packetslave 33 minutes ago

    Those look like E5500 or E6500 cabinets (hard to tell from the angle).

  • jeffbee 3 hours ago

    I worked at a competing white-label email provider in the 90s, and even then it seemed obvious that running SMTP on a Sun Enterprise was a mistake. You're not gaining anything from its multiuser single-system scalability. I guess it stands as an early example of the pets/cattle debate. My company was firmly on the cattle side.

    • eugenekay 3 hours ago

      I was just the teenage intern responsible for doing the PDU cabling every time a new rack was added, since nobody on the network or software engineering teams could fit into the crawl spaces without disassembling the entire raised floor.

      I do know that scale-out and scale-up were used for different parts of the stack. The web services were all handled by standard x86 machines running Linux - and were all netbooted in some early orchestration magic, until the day the netboot server died. I think the rationale for the large Sun systems was the amount of memory they could hold - so the user name and spammer databases could be held in memory on each front end, allowing for a quick ACCEPT or DENY on each incoming message - before saving it out to a mailbox via NFS.
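
      For flavor, here's a minimal Python sketch of that in-memory accept/deny idea (everything in it - the table contents, the names, the whole shape - is hypothetical; the real front ends were Sendmail on Solaris with site-specific checks):

        # Hypothetical sketch of an in-memory accept/deny lookup on an SMTP front end.
        # Both tables are loaded into RAM at startup, so each verdict is a couple of
        # O(1) set lookups with no disk or database round trip.

        VALID_USERS = {"alice", "bob"}            # stand-in for the user name table
        BLOCKED_SENDERS = {"spam.example.net"}    # stand-in for the spammer table

        def smtp_verdict(rcpt_local_part: str, sender_domain: str) -> str:
            """Return an SMTP-style verdict before the message body is accepted."""
            if sender_domain in BLOCKED_SENDERS:
                return "550 DENY (listed sender)"
            if rcpt_local_part not in VALID_USERS:
                return "550 DENY (no such user)"
            return "250 ACCEPT"  # only now would the message be written out over NFS

        if __name__ == "__main__":
            print(smtp_verdict("alice", "good.example.com"))  # 250 ACCEPT
            print(smtp_verdict("carol", "good.example.com"))  # 550 DENY (no such user)
            print(smtp_verdict("bob", "spam.example.net"))    # 550 DENY (listed sender)

      The point being that each decision only touches RAM, so a front end could reject unknown recipients and known spammers before any NFS traffic happened.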

trollied 5 hours ago

I used to love working with E10k/E15k boxes. I was a performance engineer for a telco software provider, and it was so much fun squeezing every single thing out of the big iron.

It’s a bit sad that nobody gives a shit about performance any more. They just provision more cloud hardware. I saved telcos millions upon millions in my early career. I’d jump straight into it again if a job came up, so much fun.

  • amiga386 2 hours ago

    I used to work for a telco equipment provider around the time everyone was replacing PDH with SONET. Telcos were gagging to buy our stuff, the main reason being basic hardware advances.

    Telephone Exchanges / Central Offices have to be in the centre of the lines they serve, meaning some very expensive real estate, and datacenter-level HVAC in the middle of cities is very, very expensive.

    They loved nothing more than to replace old 1980s switches with ones that took up a quarter to a tenth of the floorspace, used less than half the electricity, and had fabrics that could switch fibre optics directly.

  • kstrauser 3 hours ago

    My experience was a bit different. I first saw a Starfire when we were deploying a bunch of Linux servers in the DC. The Sun machine was brilliant, fast, enormous, and far more expensive per unit of work than these little x86 boxes we were carting in.

    The Starfire started at around $800K. Our Linux servers started at around $1K. The Sun box was not 800x faster at anything than a single x86 box.

    It was an impressive example of what I considered the wrong road. I think history backs me on this one.

    > It’s a bit sad that nobody gives a shit about performance any more.

    Everyone gives a shit about performance at some point, but the answer is horizontal scaling. You can’t vertically scale a single machine to run a FAANG. At a certain vertical scale, it starts to look a helluva lot like horizontal scaling (“how many CPUs for this container? How many drives?”), except in a single box with finite and small limits.

    • axiolite 2 hours ago

      > The Sun box was not 800x faster at anything than a single x86 box.

      You don't buy enterprise gear because it's economical for bulk number-crunching... You buy enterprise gear when you have a critical SPOF application (typically the database) that has to be super-reliable, or that requires greater resources than you can get in commodity boxes.

      RAS (reliability, availability, serviceability) is an expensive proposition. Commodity servers often don't have it, or have much less of it than enterprise gear. Proprietary Unix systems offered RAS as a major selling point. IBM mainframes still have a strong market today.

      It wasn't until the mid-2000s that x86 went 64-bit, so before that, if your application wanted to gobble more than 2GB/4GB of RAM, you had to go with something proprietary.

      It was even more recently that the world collectively put in a huge amount of effort and figured out how to parallelize a large share of the number-crunching problems that were previously limited to single-threaded execution.

      There have been many situations like these through the history of computing... Going commodity is always cheaper, but if you have needs commodity systems don't meet, you pay the premium for proprietary systems that do.

      • kstrauser an hour ago

        First, yes, everything you said is true. And especially when you’re supporting an older application designed around such SPOFs, you need those to be bulletproof. That’s completely reasonable. That said, a fair chunk of my work since the 90s has been in building systems that try to avoid SPOFs in the first place. Can we use sharded databases such that upgrading one doesn’t take the others down? Shared-nothing backend servers? M-to-N meshes so we’re not shoving everything through a single load balancer or switch? Redundant data centers? The list goes on.
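
        To make the sharding point concrete, here's a toy Python sketch (the shard names and the hash-based routing rule are made up for illustration, not from any real deployment): each user maps deterministically to one shard, so taking one shard down for an upgrade only touches the users who live on it.

          # Toy sketch: hash-based routing of users to database shards.
          # Requests for users on the other shards never touch the one being upgraded,
          # and there is no single component that every request has to pass through.
          import hashlib

          SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]  # hypothetical names

          def shard_for(user_id: str) -> str:
              """Deterministically map a user id to one shard."""
              digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()
              return SHARDS[int(digest, 16) % len(SHARDS)]

          if __name__ == "__main__":
              for uid in ("alice", "bob", "carol"):
                  print(uid, "->", shard_for(uid))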

        I don’t think that approach is inherently better than what you described. Each has its own tradeoffs and there’s a time and place for both of them. I absolutely did see a lot of Big Iron companies marketing their giant boxes as the “real, proven” alternative to a small cluster of LAMP servers, though. I don’t blame them for wanting to be big players in that market, too, but that wasn’t a good reason to use them (unless you already had their stuff installed and wanted to add a web service next to your existing programs).

        I wouldn’t run a bank on an EC2 instance, but neither would I ever buy a mainframe to host Wordpress at any scale.

    • trollied 3 hours ago

      I don’t disagree. But most also don’t give a shit, and then scale horizontally endlessly and spend too much money to deal with their crappy code.

      As a dev it isn’t your problem if the company you work for just happily provisions and sucks it up.

      • kstrauser 3 hours ago

        That’s a thing, to be sure. The calculus gets a little complicated when that developer’s pay is far more than the EC2 bill. There’s a spectrum between a small shop wasting $1000 a year hosting inefficient code and Google scale, where SRE teams would love to put “saved 0.3% on our cloud bill!” on their annual review.

      • rjsw 2 hours ago

        > ... to deal with their crappy code

        written in an interpreted language.

    • znpy 2 hours ago

      > Everyone gives a shit about performance at some point, but the answer is horizontal scaling. You can’t vertically scale a single machine to run a FAANG.

      You might be surprised about how many companies think they're FAANG (but aren't) though.

      • kstrauser an hour ago

        That’s a whole other story, to be sure! “We absolutely must have multi-region simultaneous writes capable of supporting 300,000,000 simultaneous users!” “Your company sells door knobs and has 47 customers. Throw it on PostgreSQL and call it solved.”

  • mlyle 4 hours ago

    > It’s a bit sad that nobody gives a shit about performance any more. They just provision more cloud hardware.

    It's hard to get as excited about performance when the typical family sedan has >250HP. Or when a Raspberry Pi 5 can outrun a maxed-out E10k on almost everything.

    ...(yah, less RAM, but you need fewer client connections when you can get rid of them quickly enough).

  • lokar 3 hours ago

    In the end that approach to very high scale and reliability was a dead end. It’s much better and cheaper to solve these problems in software using cheap computers and fast networks.

    • chasil an hour ago

      If you have applications that run on (and rely on) z/OS, this kind of machine makes sense.

      The e10k didn't have applications like that. Just about everything you could do on it could be made to work on commodity x86 with Linux (after some years, for 64-bit).

    • trollied 3 hours ago

      Less-cheap computers are still a thing. You're entirely missing the point.

      • lokar 3 hours ago

        A lot of the examples here are things like running a large email service. Doing that with this kind of hardware makes no sense.

        • Henchman21 3 hours ago

          It might make no sense today, but it made loads of sense back then. One cannot apply modern circumstances backwards in time.

neilv 3 hours ago

> They were also joined with several engineers in Beaverton, Oregon through these mergers.

They might mean from Floating Point Systems (FPS):

https://en.wikipedia.org/wiki/Cray#Cray_Research_Inc._and_Cr...

> In December 1991, Cray purchased some of the assets of Floating Point Systems, another minisuper vendor that had moved into the file server market with its SPARC-based Model 500 line.[15] These symmetric multiprocessing machines scaled up to 64 processors and ran a modified version of the Solaris operating system from Sun Microsystems. Cray set up Cray Research Superservers, Inc. (later the Cray Business Systems Division) to sell this system as the Cray S-MP, later replacing it with the Cray CS6400. In spite of these machines being some of the most powerful available when applied to appropriate workloads, Cray was never very successful in this market, possibly due to it being so foreign to its existing market niche.

Some other candidates for server and HPC expertise there (just outside of Portland proper):

https://en.wikipedia.org/wiki/Sequent_Computer_Systems

https://en.wikipedia.org/wiki/Intel#Supercomputers

(I was very lucky to have mentors and teachers from those places and others in the Silicon Forest, and also got to use the S-MP.)

JSR_FDED 4 hours ago

This was one of the all-time biggest strategic mistakes SGI made - for a mere $50 million they enabled their largest competitor to rack up huge wins against them almost overnight. A friend at Sun at the time was telling me how much glee they took in sticking it to SGI with its own machines.

  • cf100clunk 2 hours ago

    > one of the all time biggest strategic mistakes SGI made

    SGI in the Ewald years tripped itself up, then in the Rick Belluzzo years made a cavalcade of avoidable mistakes.

nocoiner 5 hours ago

To this day, “Sun E10000 Starfire” is basically synonymous in my head with “top-of-the-line, bad-ass computer system.” What a damn cool name. It made a big impression on an impressionable youth, I guess!

  • beng-nl 4 hours ago

    I agree on all counts, but the installation I had at my job at the time regularly needed repairs! Hopefully this was an exceptional case, but it gave me the impression that the redundancy added too much complexity for the whole to be reliable.

    ETA: particularly because the redundancy was supposed to make it super reliable

    • somat 3 hours ago

      I worry about this sometimes. There is this long tail of "reliability" you can chase: redundant systems, processes, voting, failover, "shoot the other node in the head" scripts, etc. But everything adds complexity; now it has more moving parts, more things that can go wrong in weird ways. I wonder if the system would be more reliable if it were a lot simpler and stupider: a single box that can be rebooted if needed.

      It reminds me of the lesson of the Apollo computers. The AGC was the more famous computer, probably rightfully so, but there were actually two computers. The other was the LVDC, made by IBM for controlling the Saturn V during launch. That one was a proper aerospace computer: redundant everything, a cannot-fail architecture, etc. In contrast, the AGC was a toy. However, this let the AGC be much faster and smaller; instead of reliability they made it reboot well, and instead of automatic redundancy they just put two of them.

      https://en.wikipedia.org/wiki/Launch_Vehicle_Digital_Compute...

      There is something to be learned here; I am not exactly sure what it is. Worse is better?

    • jeffbee 3 hours ago

      No, I think that was typical. Nostalgia tends to gloss over the reality of how dodgy the old unix systems were. The Sun guy had to show up at my site with system boards for the SPARCcenter pretty regularly.

jasongill 5 hours ago

This is one of my dream machines to own. The Sun E10k was like the Gibson: it was so mythically powerful. It was a Cray inside of your own server closet, and being able to be the admin of an E10k and have root on a machine with so much power was a real status symbol at the time.

tverbeure 3 hours ago

I worked for a company that bought one of these. It was delivered, lifted through the window of the server room with a crane and worked fine.

A few days later, our admin noticed over the weekend that he couldn’t remote log in. He checked it out and… the machine was gone. Stolen.

Somebody within Sun must have tipped off where these things were delivered and rented a crane to undeliver them.

  • pavlov 3 hours ago

    Isn’t it more likely it was someone within the company you worked for?

    They would have access to site-specific info like how easy it is to get access to that server room to open the windows.

    The old saying is “opportunity makes the thief.” Somebody at Sun had much less visibility into the opportunity.

    • tverbeure an hour ago

      I was told that it had happened before with a delivery at a different company.

  • cf100clunk 2 hours ago

    Hmmm... I wondered why the official E10K demo machine in the lobby of Sun's HQ back then had been enclosed in glass. It also might very well have just been a mockup, I suppose.

  • AStonesThrow an hour ago

    Where do you fence such a thing? That is more than stealing a car. Do you take it to a SPARC Chop Shop and strip it for parts to sell on eBay?

    Did they recover this monstrous thing or have any witnesses/leads on who just rocked up with an unauthorized crane to your machine room?

    That is sort of a crown-jewels level heist. They pulled it off more than once??

bobmcnamara 4 hours ago

Cray-cyber.org used to have free shell accounts on one in Germany.

hpcjoe 2 hours ago

I recall that from while I was at SGI. Many of us within SGI were strongly against the move to sell this off to Sun. We blamed Bo Ewald for the disaster this was for SGI, and for the lack of strategic vision on his part. We also blamed the idiots in SGI management for thinking that MIPS and Irix would be the only things we would be delivering.

Years later, Ewald and others had a hand in destroying the Beast and Alien CPUs in favor of the good ship Itanic (for reasons).

IMO, Ewald went from company to company, leaving behind a strategic ruin or failure. Cray to SGI to Linux Networx to ...

znpy 4 hours ago

According to https://www.filibeto.org/aduritz/truetrue/e10000/e10000.pdf, "Its online storage capacity can exceed 60 Tbytes" ... and it could host 64 CPUs and 64GB of memory ... crazy considering it's from 1997 :)

  • kstrauser 3 hours ago

    It was only a couple of years after that when I owned my first computer faster than a Cray X-MP. I love being on the receiving end of Moore’s Law.