Interesting day today.
We have a virtualised vSphere 5.1 environment mixed with Solaris 10 and Sun Ray thin clients. We have load-balanced terminal servers, and the VM running the load-balancing software was non-responsive, as was the DC. In fact, 4 of our 10 hosts were non-responsive. Nobody could log on, so we set about trying to ascertain what the issue was. First and foremost was getting the hosts back into the datacenter. After much troubleshooting and some reboots we managed to get all the hosts back in and things up and running, except we found a number of servers were greyed out and 'inaccessible'. Thankfully we had VMDK backups, so those were restored and more troubleshooting took place.
It would appear that the datastore containing the load balancer and the DC had gone offline at 03:30 and some time later re-presented itself as an empty disk. I rescanned the storage and HBAs, then tried to re-add the datastore, and this was the screen I was met with:
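For what it's worth, the rescan was done roughly like this from the host's SSH shell (a sketch using the standard ESXi 5.x esxcli namespaces, not the exact sequence I ran):

    # Rescan every HBA for new or changed LUNs
    esxcli storage core adapter rescan --all

    # List the LUNs the host can now see, to confirm the device is still presented
    esxcli storage core device list

    # List mounted VMFS volumes to see whether the datastore came back
    esxcli storage filesystem list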
...and after checking the events I found this:
Our SAN is a Fujitsu DX90 and I have a ticket open with Fujitsu, but am I right in assuming this is an issue with the RAID on the SAN? I'm pretty sure it's not FC-related, as we'd be seeing a wealth of other connection issues. I've been following this KB about identifying disks and trying to access the volume via SSH: http://goo.gl/QHJV1 - but so far I've been unsuccessful in accessing the volume, which leads me to believe that it is in fact a RAID issue.
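In case it helps, this is the sort of thing I've been trying over SSH to identify the disk and check whether there's still a VMFS partition on it (a rough sketch; the naa.* device ID below is just a placeholder, not our real LUN):

    # Map VMFS datastores to their backing naa.* devices
    esxcfg-scsidevs -m

    # Check the partition table on the suspect LUN (device ID is an example only)
    partedUtil getptbl /vmfs/devices/disks/naa.600000e00d11000000001234

    # See if the host treats the volume as an unresolved snapshot/replica rather than gone
    esxcli storage vmfs snapshot list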
My other question is: should the SAN be intelligent enough to do something about this if an entire RAID group goes down? Shouldn't there be some kind of redundancy? And if not, is this something you might find on a newer SAN?
Or has all of this been caused by vCenter, with something erroring at 03:30 and thereby wiping the RAID?
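For the record, this is roughly how I'm checking the host logs around 03:30 to see whether the device dropped at the storage layer before vCenter got involved (a sketch; the timestamp pattern is just an example matching the ESXi ISO-format log lines):

    # vmkernel log: look for path/SCSI errors around 03:30
    grep "T03:3" /var/log/vmkernel.log | grep -i -e scsi -e naa -e "lost access"

    # hostd log: datastore and VM level events in the same window
    grep "T03:3" /var/log/hostd.log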