I’ve been emailed by three people now asking what happened with my email to Sun and whether they have actually made any progress (it seems we are not the only ones a bit pissed off at the endless stream of issues with Communications Express and JES…).
It turned out that the run-up to Christmas was a bad time for both us and Sun, so a meeting has been arranged for next week. Sun will be bringing a few people down to chat through some of our concerns and issues and look for a way forward. Hopefully, they won’t be coming armed with baseball bats and broken bottles in an attempt to sort us out for good 🙂
This, it seems, has turned out to be good timing. Before Christmas, UWC had been relatively stable with few issues (apart from the long-term “it’s shit” complaints) and even the Outlook Connector was almost behaving itself. It was beginning to look as if Sun would be coming down and we would have a fairly short list of outstanding issues.
Since then, however, things have gone a bit pap again. Firstly, Ian has made some progress on working out why the Outlook Connector causes us so much grief with shared calendars – it *looks* like either the connector or Comms Express is breaking the ACLs every time an ACI is added or removed. This seems to have caused a bit of confusion at Sun, who can’t seem to agree whether the ACLs are processed first-match or cumulatively (they say one thing, tests and their own docs say another). This one has gone quiet now while (I presume) they go and play with their own system in an attempt to work out what it does.
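For what it’s worth, the two interpretations really can give different answers on the same list, which is probably why the tests and the docs disagree. A toy sketch (purely illustrative shell – this is *not* Sun’s actual ACL format or evaluation code, just the two strategies applied to the same made-up entries):

```shell
# Made-up two-entry ACL: a broad grant followed by a specific deny
ACL="everyone:grant
jane:deny"

# First-match: stop at the first entry that applies to the user
first_match() {
    echo "$ACL" | awk -F: -v u="$1" '$1==u || $1=="everyone" {print $2; exit}'
}

# Cumulative: look at every applicable entry; any deny wins
cumulative() {
    echo "$ACL" | awk -F: -v u="$1" '
        $1==u || $1=="everyone" { if ($2=="deny") deny=1; else grant=1 }
        END { print (deny ? "deny" : (grant ? "grant" : "deny")) }'
}

first_match jane    # → grant  (the "everyone" entry matches first)
cumulative jane     # → deny   (the later per-user deny still applies)
```

Same list, opposite answers for the same user – so if the server does one and the connector assumes the other, breakage every time an ACI is added or removed would be about what you’d expect.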
Then, last Thursday, we had a more serious problem. Around 11pm one of the cluster nodes (we are running a dual-node cluster, active-active, with half the users on each node in normal operation) suffered a kernel panic:
panic[cpu2]/thread=2a102c65d40: free: freeing free frag, dev:0x550000600f, blk:37553, cg:4, ino:47378, fs:/data/mailstore1/msg/students1
Oh bugger. While this node rebooted, the cluster managed to bring calendar services up correctly on the second node but failed to get messaging online, complaining that:
Jan 18 23:08:37 mailstore1 /dev/md/msga-ds/rdsk/d15: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
Jan 18 23:08:37 mailstore1 THE FOLLOWING FILE SYSTEM(S) HAD AN UNEXPECTED INCONSISTENCY: /dev/md/msga-ds/rdsk/d15 (/data/mailstore1/msg/students1)
Next morning we had little choice but to fsck the disk, which trashed a few files and produced plenty of “FRAG BITMAP WRONG” messages (a new one on me, that).
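For the record, the recovery boiled down to something like this – a sketch only, run on the node holding the diskset with the messaging resource offline; the device and mount point are the ones named in the syslog message above:

```shell
DEV=/dev/md/msga-ds/rdsk/d15            # raw metadevice from the syslog message
MNT=/data/mailstore1/msg/students1      # the mailstore filesystem it backs

if [ -e "$DEV" ]; then
    # Belt and braces: make sure nothing still has the filesystem mounted
    umount "$MNT" 2>/dev/null
    # Full check/repair on the raw device; -y answers yes to every prompt,
    # which is what throws away the unrecoverable frags
    fsck -F ufs -y "$DEV"
else
    echo "device $DEV not present on this node"
fi
```

Running with `-y` is what costs you the files: anything fsck can’t reconcile gets truncated or cleared, hence the handful of trashed messages.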
Messaging was happy to start up after that, but what caused the corruption is still a bit of a mystery. It appears that (for once ;-)) JES was innocent in this, as was cluster. It looks like a Solaris bug, but so far nothing conclusive has been found. The latest theory is that it may be related to losing a disk in the array earlier in the week. It seems we actually had two disks go bad in short succession – one died like a deaded thing and rebuilt onto a hot spare, while the other just went offline and is marked “used”. Not quite sure what that is all about yet, but it seems rather a coincidence.

The other thing that is a little worrying (but maybe nothing – I’ve only just noticed it) is that although the array noticed a disk had died at 6pm on the Friday, the rebuild onto the hot spare (the hot spare that then threw errors *sigh*) only started after 10pm on the *Saturday*. I might be missing something here, but it seems “odd”…
Fri Jan 12 18:01:54 2007  #148: LD-ID 70A11CBD on StorEdge Array SN#8025854: ALERT: SCSI drive failure (CH2 ID11)
Sat Jan 13 22:22:16 2007  #79: LD-ID 70A11CBD on StorEdge Array SN#8025854: NOTICE: starting logical drive rebuild
Sat Jan 13 22:24:01 2007 [113f] #80: StorEdge Array SN#8025854 CH2 ID9: ALERT: redundant path failure detected (CH3 ID9)
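The gap is easy to put a number on from those first two log entries (a quick sketch using GNU date; the timestamps are just the log’s, rewritten in ISO form):

```shell
# Sanity check on the event-log timeline: how long between the drive
# failing and the rebuild onto the hot spare actually starting?
fail="2007-01-12 18:01:54"      # ALERT: SCSI drive failure (CH2 ID11)
rebuild="2007-01-13 22:22:16"   # NOTICE: starting logical drive rebuild

# GNU date's -d converts each timestamp to epoch seconds
gap=$(( $(date -d "$rebuild" +%s) - $(date -d "$fail" +%s) ))
echo "rebuild started $((gap / 3600)) hours after the drive failed"
# → rebuild started 28 hours after the drive failed
```

Over 28 hours with a degraded logical drive and a hot spare sitting idle – whatever the explanation turns out to be, that is a long time to be one failure away from losing the lot.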
Oh well, we wouldn’t have had this much fun had we been running Exchange. That would have been a whole different sort of pain – one with a nice shiny GUI that lets you do almost-but-not-quite what you want, no doubt. Still, maybe we’ll find out one day 🙂