This has been quite an eventful week with not much sleep.
At the moment we are in a situation where no one knows what else we can do. Let me first explain what happened.
We introduced an additional blade to our infrastructure. It was load-tested for 10 days, all stable and nice. Then on Monday that host disappeared from vCenter.
The host itself was still up, it just could not connect to vCenter / the client. The VMs were up too, so that was a bonus. After hours with VMware support they basically gave up and we had no choice but to bounce the host - well, to add insult to injury, HA didn't work and did not fail the VMs over.
The problem in a scenario like that is that while the (disconnected) host is still in vCenter, its VMs are too - disconnected but showing as powered on, which they are not. So you cannot even migrate them (like you can with powered-off VMs).
Next "solution" was to remove the host from vCenter. At this stage we were finally able to add the VMs back to the inventory using other hosts.
Of course there was some corruption / broken VMs / fricked-up VMDK descriptor files, and the list (and the hours) go on.
We initially thought that was it - far from it ... we continued to see latencies of 250,000-700,000 ms on all datastores / hosts ... yep ... 700,000 ms ...
A power-on operation (or even adding VMs back into the inventory) took up to 30 minutes per VM.
Anyway ... we obviously opened tickets with the storage vendor as well, and they of course blamed VMware. I actually managed to get both on a conference call - VMware and the storage vendor - with VMware once again confirming a storage issue. Three days later, still no result.
At some point we had a hunch: all of the affected VMs had also been migrated by DRS (when you least need it), and those migrations bombed out when the host crashed the second time (before we finally pulled the blade).
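If anyone wants to retrace that kind of hunch in their own environment: the DRS migrations that were in flight around a host crash can be pulled from the vCenter event log. Another rough pyVmomi sketch, reusing the connection from the earlier snippet - the crash timestamp is a placeholder:

from datetime import datetime, timedelta
from pyVmomi import vim

# Assumes the 'content' connection from the registration sketch above.
crash_time = datetime(2024, 1, 1, 9, 0)        # placeholder: when the host went down
window = vim.event.EventFilterSpec.ByTime(
    beginTime=crash_time - timedelta(hours=1),
    endTime=crash_time + timedelta(hours=1))

# DRS / vMotion related events in that window.
spec = vim.event.EventFilterSpec(
    eventTypeId=["DrsVmMigratedEvent", "VmMigratedEvent", "VmFailedMigrateEvent"],
    time=window)

for ev in content.eventManager.QueryEvents(spec):
    vm_name = ev.vm.name if ev.vm else "?"
    print(ev.createdTime, vm_name, ev.fullFormattedMessage)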
Stale locks were our guess ... so some of the VMs we suspected to be the culprits were rebooted ... and voila ... latency gone.
No one can explain what happened or why that "fixed" some of the issues, but hey - we were happy ...
Well, now the weirdest thing ... and to finally get to the point: we have two hosts ... EMPTY hosts ... no VMs ... showing the same sort of device latency on ONE particular datastore. As soon as you put the hosts back into maintenance mode, the latency drops to nothing.
The attachment shows where the host was taken out of maintenance mode and put back in again.
Now, the VMkernel logs show some SCSI aborts, and yes, that likely points to storage issues we may still have - but how can the only hosts showing latency be empty ones with no VMs on them, and only while they are out of maintenance mode (they look fine in maintenance mode), while all the other hosts actually running VMs are fine?
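To put numbers behind the attachment, the same per-datastore latency can be pulled for any host through the performance API, which makes the in/out-of-maintenance-mode comparison easy to capture over time. A rough pyVmomi sketch (the host name is a placeholder, and it reuses the connection and find_obj() helper from the first snippet):

from pyVmomi import vim

perf = content.perfManager

# Map "group.name.rollup" -> counter id so the latency counters can be requested by name.
counters = {"%s.%s.%s" % (c.groupInfo.key, c.nameInfo.key, c.rollupType): c.key
            for c in perf.perfCounter}

host = find_obj(vim.HostSystem, "esx05.example.com")   # placeholder: one of the empty hosts

metrics = [vim.PerformanceManager.MetricId(counterId=counters[name], instance="*")
           for name in ("datastore.totalReadLatency.average",
                        "datastore.totalWriteLatency.average")]

# Realtime stats: 20-second interval, last 15 samples (~5 minutes).
spec = vim.PerformanceManager.QuerySpec(entity=host, metricId=metrics,
                                        intervalId=20, maxSample=15)

for entity_stats in perf.QueryPerf(querySpec=[spec]):
    for series in entity_stats.value:
        # 'instance' is the datastore's internal ID; values are in milliseconds.
        print(series.id.instance, "peak", max(series.value or [0]), "ms")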
Now we are in a blame loop - the storage vendor blames VMware, VMware blames the storage vendor.
VMware Support also just shrugs when I try to get an explanation of how rebooting a VM can make the latency calm down - it surely shouldn't make a difference if the storage back end is to blame ...
So I hope someone here can give me some pointers, because right now we are out of ideas (and clearly so are the vendors).