Way back in Jan, I mentioned a problem we were seeing on our Sun JES messaging server machines with corrupt disks. After some fscking and much swearing (more downtime, *sigh*), Sun couldn’t really come up with a definitive cause, but given they had pretty much ruled everything else out, they decided it was worth upgrading the firmware in the 3510 arrays that we use as message stores. Given we were running v3.x firmware and the upgrade to v4.x looked terrifying, we were pretty pleased that they were going to do this for us.
They came, they swore a bit, they finally managed to upgrade us to v4.15F. Great. Problem solved. Or so we thought. A few days later we see:
Mar 2 02:45:02 mailstore2 ufs: [ID 913664 kern.warning] WARNING: /data/mailstore2/msg/students4: unexpected allocated inode 2050084, run fsck(1M) -o f
Mar 2 02:45:02 mailstore2 ufs: [ID 913664 kern.warning] WARNING: /data/mailstore2/msg/students4: unexpected allocated inode 2050095, run fsck(1M) -o f
Then we get:
Mar 6 10:08:51 mailstore2 ufs: [ID 913664 kern.warning] WARNING: /data/mailstore2/msg/staff2: unexpected allocated inode 6310183, run fsck(1M) -o f
And so it continues. We are still seeing some form of filesystem rot…and if anything, it’s now worse than it was 🙁 Due to the amount of downtime we have already had on this so-called resilient HA system (Sun, this is getting harder and harder to sell to management btw, particularly as it seems impossible to even get a renewal quote for the bloody thing. Exchange is creeping up…), it was decided by some ~~experts~~ managers that we should shut the system down and fsck a filesystem each week during our normal at-risk period. A dangerous idea given we have already had one kernel panic, but understandable given the bad image the system already has.
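For anyone keeping score at home, working out which filesystems the kernel has flagged is easy enough to script rather than eyeballing the logs. A rough sketch, assuming the standard Solaris /var/adm/messages location and the warning format shown above (the path is the third colon-separated field of the line):

```shell
#!/bin/sh
# Print the filesystems ufs has flagged for a forced fsck, one per line.
# Assumes syslog lines like:
#   ... ufs: [ID ... kern.warning] WARNING: /some/fs: unexpected allocated inode ...
flagged_filesystems() {
  grep 'unexpected allocated inode' "$1" \
    | awk -F': ' '{ print $3 }' \
    | sort -u
}

# usage: flagged_filesystems /var/adm/messages
```

Each filesystem on the list then needs unmounting and a forced check (`fsck -F ufs -o f` against its raw device) during the at-risk window — hence the weekly rota.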
A couple of weekly fscks pass and we are still seeing the occasional error, and then we realise that one of the filesystems reported as needing an fsck is one we did last week. Shit. Disks are now corrupting faster than we are fixing them. Around this time, Sun send us info on bugid 102815. Bingo. That sounds like the thing. Except it seems to suggest the bug is in the very firmware we paid Sun to move us to a month earlier :-/ “Annoying” is one word for it (and it doesn’t completely explain our earlier problems).
Rather rapidly, a whole day’s downtime is arranged for the following Sunday. We plan to come in, fsck all the filesystems on the arrays and then do the firmware upgrade, which this time is a simple upgrade that can even be done live (ha!). Excellent.
Sunday comes, the fscks take around 5 hours to complete cleanly (after swearing a bit at bugid 4775289, grrrr) and we start the firmware upgrade. First bad sign: it fails over the controller as it is meant to, but the secondary then refuses to start up and do anything, so all the disks go offline (and then report they need an fsck – argggghh). Still, the prompt reappears, so we continue with the doc, which says:
Verify that the firmware upgrade succeeded by examining the firmware
revision again. Type the following command at the sccli> prompt:
and verify that the firmware revision is now reported as 415G.
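For reference, checking the revision from sccli looks roughly like this — the subcommand and the output layout here are from memory, so treat this as a sketch rather than gospel for your sccli version:

```
sccli> show inquiry
Vendor:   SUN
Product:  StorEdge 3510
Revision: 415G
```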
Ours still shows 415F. And to top it off, we can’t retry the upgrade as one of the controllers is now flagged as “detected” instead of “enabled” and appears to have no serial number reported in the GUI…this is not a documented state 🙁 A quick phonecall to Sun (thank god for our elevated support at the moment following my email) and a rather unfazed support guy suggests we ignore the instructions and reset the controllers. This half worked…in that it did indeed make the controllers go offline. Fifteen minutes later they haven’t come back online (this should take “a couple of minutes”). Arse. We ring again. “Try pulling the power from the array?” Sigh. We do this and finally, it wakes up, realises it has new firmware and boots. Phew….
Service restored with 8 mins of our booked outage left. A full 8 hours of downtime *again*. It remains to be seen if we have finally nailed the problem or if we are still slowly dropping bits of filesystem at random 🙁